
## COMP 341: Practical Machine Learning
## Homework Assignment 5: Who Said It?
### Due: Thursday, November 6 at 11:59pm on Gradescope

In this assignment, we will explore how machine learning can be used to analyze text documents. To this end, we have assembled collections of text from different speakers. When possible, this text was derived from unique talks, papers, or books. Text passages are categorized based on the characteristics of the author / speaker (e.g., an actor, writer, academic, etc).

We handled the initial text preprocessing (including lemmatization, which reduces words to a single representative form) to provide you with two clean matrices: word counts per document (bag_of_words) and normalized word counts in the form of tf-idf (you can read more about tf-idf [here](https://en.wikipedia.org/wiki/Tf–idf)).
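
As a rough illustration of how the two matrices relate, the sketch below converts a toy count matrix to tf-idf with scikit-learn's `TfidfTransformer`. The counts and the names `toy_counts` and `toy_tfidf` are made up for this example, and the settings used to build the provided `tfidf.csv` are not stated here, so the real file may not match these defaults exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy bag-of-words matrix: 3 documents (rows) x 3 words (columns).
# These counts are invented for illustration only.
toy_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 2, 0],
])

# scikit-learn defaults: smoothed idf and L2-normalized rows
# (an assumption; the course files may use different settings).
transformer = TfidfTransformer()
toy_tfidf = transformer.fit_transform(toy_counts).toarray()

# Word 0 appears in every document, so it gets the lowest idf weight.
print(transformer.idf_)
print(toy_tfidf.round(3))
```

Words that occur in every document are down-weighted, while words concentrated in a few documents keep more of their weight.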

The speaker / category labels are provided in `document_labels.csv`, and they correspond to the rows in the matrices. Using these data, we will attempt three tasks: identifying topics, inferring potential authors for the "other" category, and exploring more in-depth classification tasks that predict the category of a passage of text.

As always, fill in missing code following `# TODO:` comments or `####### YOUR CODE HERE ########` blocks and be sure to answer the short answer questions marked with `[WRITE YOUR ANSWER HERE]` in the text.

All code in this notebook will be run sequentially so make sure things work in order! Be sure to also use good coding practices (e.g., logical variable names, comments as needed, etc), and make plots that are clear and legible.

For this assignment, **15 points** will be allocated for general coding and formatting:
* **5 points** for coding style
* **5 points** for code flow (accurate results when everything is run sequentially)
* **5 points** for additional style guidelines listed below

Additional style guidelines:
* **The ipynb files are not rendering properly on Gradescope due to size limits, so for the convenience of your TAs, please export a PDF of your Colab notebook (and include a Rice-accessible private link to the notebook at the end of the assignment). Your file should be named: `netid-hw5.pdf`**
* For any TODO cell, make sure to include that cell's output in the .ipynb file that you submit. Many text editors have an option to clear cell outputs which is useful for getting a blank slate and running everything beginning-to-end, but always be sure to run the notebook before submitting and ensure that every cell has an output.
* When displaying DataFrames, please do not include `.head()` or `.tail()` calls unless asked to. Displaying the DataFrame without these calls lets us see both its beginning and end, which helps us ensure the data is processed properly. Notebooks show only the beginning and end by default, so you don't have to worry about long outputs here.
* If column names are specified in the question, please use the specified name, and please avoid any sorting not specified in the instructions.
* For plots, please ensure you have included axis labels, legends, and titles.
* To format your short answer responses nicely, we recommend either **bolding** or *italicizing* your answer, or formatting it ```as a code block```.
* Generally, please keep your notebook cells to one solution per cell, and preserve the order of the questions asked.
* Finally, this can be harder to check/control and depends on which plotting libraries you prefer, but it would be helpful to limit the size/resolution of plot images in the notebook. Our grading platform has an upper limit on submission sizes it can display, and high-res plots are the usual culprit when submissions are hidden or truncated.

### Setup
First, we need to import some libraries that are necessary to complete the assignment.

In [None]:
import pandas as pd

Add additional modules/libraries to import here (rather than wherever you first use them below):

In [None]:
# additional modules/libraries to import


We provide some code to get the data files for this assignment into your workspace below. You only need to do the following 3 steps once:
1. Go to 'My Drive' in your own Google Drive
2. Make a new folder named `comp341`
3. From the [Google Drive link](https://drive.google.com/drive/folders/17jIHXbrNHa6tN9UQ_geDJobVNp3f7LKt?usp=sharing), you can right click the `comp341-hw5` title, and select `Add shortcut to Drive`, and add a link to the whole folder to your `comp341` folder. This is a convenient alternative to having to download and re-upload the files to your own drive.

If you run into trouble with accessing the files from the shortcut, then:

4. Download the following files: `bag_of_words.csv`, `tfidf.csv`, `document_labels.csv`, `test-bag_of_words.csv`, `test-tfidf.csv` to your computer.
5. In the `comp341` folder you created in step 2, click `New -> File Upload` and upload the downloaded files from your computer.

Now, we will mount your local Google Drive in colab so that you can read the file in (you will need to do this each time your runtime restarts).

In [None]:
# note that this command will trigger a request from google to allow colab
# to access your files: you will need to accept the terms in order to access
# the files this way
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# if you followed the instructions above exactly, each file should be
# in comp341/; if your files are in a different directory
# on your Google Drive, you will need to change the path below accordingly
DATADIR = '/content/drive/My Drive/comp341/'

In [None]:
bag_of_words = pd.read_csv(DATADIR + 'comp341-hw5/bag_of_words.csv')
doc_labels = pd.read_csv(DATADIR + 'comp341-hw5/document_labels.csv')
tfidf = pd.read_csv(DATADIR + 'comp341-hw5/tfidf.csv')

### Part 1: Data Exploration [35 pts]
In this section we will explore our data using unsupervised learning to infer general topics, as well as try to infer the number of unique authors/speakers in the 'other' category.
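
For orientation, here is a minimal sketch of how topic words can be read off a fitted Latent Dirichlet Allocation model in scikit-learn. The vocabulary, the counts, and the choice of 2 topics are all hypothetical toy values, not the assignment data or its required settings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy vocabulary and count matrix (4 documents x 6 words).
vocab = np.array(["vote", "senate", "law", "film", "actor", "scene"])
toy_counts = np.array([
    [5, 3, 4, 0, 0, 0],
    [4, 5, 2, 0, 1, 0],
    [0, 0, 0, 6, 4, 3],
    [0, 1, 0, 3, 5, 4],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(toy_counts)  # each row sums to 1

# lda.components_ has shape (n_topics, n_words); a larger value means the
# word carries more weight in that topic.
n_top = 3
for topic_idx, weights in enumerate(lda.components_):
    top_words = vocab[np.argsort(weights)[::-1][:n_top]]
    print(f"topic {topic_idx}: {', '.join(top_words)}")
```

With a real DataFrame, the column names play the role of `vocab` here, provided any non-word columns (such as an identifier) are dropped first.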

In [None]:
# TODO: how many categories are there in this dataset? [1 pt]


In [None]:
# TODO: What about authors? [1 pt]


In [None]:
# TODO: using the bag_of_words, run Latent Dirichlet Allocation to find 10 topics.
# List the top 10-15 words that characterize each of these topics. [6 pts]


In [None]:
# TODO: use PCA to reduce the data (either bag_of_words or tfidf)
# down to 2 dimensions, then visualize the documents (i.e., each point is a document),
# colored by their category [3 pts]


In [None]:
# TODO: use t-SNE (with the PCA initialization) to reduce the data (either bag_of_words or tfidf)
# down to 2 dimensions, then visualize the documents (i.e., each point is a document),
# colored by their category [3 pts]


**Short Answer Question**: Are there any differences between your PCA and t-SNE plots? If there are differences, explain what you think might be driving them. It may be helpful to support your answer with simple data analyses and/or additional plots. [3 pts]
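
One simple supporting analysis is to check how much of the total variance the first two principal components retain, since PCA is a linear projection while t-SNE is a nonlinear method that emphasizes local neighborhoods. The sketch below does this on synthetic data; `toy_X`, the group shifts, and the perplexity value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for a document-by-word matrix: 60 rows, 20 features,
# built from three shifted Gaussian groups.
toy_X = np.vstack(
    [rng.normal(loc=shift, size=(20, 20)) for shift in (0.0, 2.0, 4.0)]
)

pca = PCA(n_components=2, random_state=0)
toy_pcs = pca.fit_transform(toy_X)
# Fraction of the total variance retained by the two linear components.
print(pca.explained_variance_ratio_.sum())

# t-SNE with PCA initialization; perplexity must be smaller than the
# number of samples.
tsne = TSNE(n_components=2, init="pca", perplexity=15, random_state=0)
toy_embedding = tsne.fit_transform(toy_X)
print(toy_embedding.shape)
```

A low retained-variance fraction suggests that a 2-D PCA plot discards much of the structure that t-SNE may still pick up.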

`[WRITE YOUR ANSWER HERE]`


The `president` category of documents consists of speeches from several different presidents. Let's explore whether we can find clustered observations in the normalized bag of words (tf-idf).
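
The cells that follow ask for an elbow plot and an average silhouette score plot. As a hedged sketch of the quantities behind those plots, the example below runs k-means for k from 2 to 10 on synthetic 2-D points from `make_blobs` (not the assignment data) and records the inertia and the average silhouette score for each k. Plotting is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points drawn around 4 centers (illustration only).
toy_points, _ = make_blobs(
    n_samples=200, centers=4, cluster_std=0.6, random_state=0
)

k_values = list(range(2, 11))
inertias = []      # within-cluster sum of squares, used for the elbow plot
silhouettes = []   # average silhouette score per k
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(toy_points)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(toy_points, labels))

# The k with the highest average silhouette is one candidate answer.
best_k = k_values[int(np.argmax(silhouettes))]
print(best_k)
```

Inertia keeps shrinking as k grows, so the elbow plot is read for the point where the improvement levels off, whereas the silhouette score can be compared directly across values of k.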

In [None]:
# TODO: Using tf-idf, take only data from the `president` category and
# use PCA to reduce the number of dimensions to 2 and plot these 2 PCs [2 pts]


In [None]:
# TODO: Cluster the resulting 2 dimensions from PCA above using k-means with k from 2-10 clusters, and
# store any values you will need to make an elbow plot and average silhouette score plot (below) [5 pts]


In [None]:
# TODO: make an elbow plot for k=2 to 10 [1 pt]


In [None]:
# TODO: make a plot of the average silhouette scores for k=2 to 10 [1 pt]


**Short Answer Question**: Using the elbow plot and the silhouette plots above, how many presidents would you expect the data to have? Explain. [2 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: Repeat the above analysis (PCA, clustering, elbow, and silhouette analyses), except now, with PC2 and PC3. [5 pts]


**Short Answer Question**: How many clusters does clustering on PC2 and PC3 suggest? Why might using different PCs yield the same or different number of clusters? [2 pts]

`[WRITE YOUR ANSWER HERE]`

### Part 2: Text Classification [20 pts]
In Part 1, we examined how the documents may differ by category. Can we make a multiclass classifier that predicts the author of a particular passage of text?

As mentioned in class, you will have more freedom to use a classification algorithm of your choice. For the sections below, you can choose your favorite classification method and evaluate its performance by looking at ROC curves, precision-recall curves, and a confusion matrix.
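
For a multiclass problem, ROC and precision-recall curves are usually computed one class at a time by treating each class as "this class vs. the rest". The sketch below shows that pattern on synthetic data from `make_classification`; logistic regression is used purely as a stand-in, and whether it matches the methods covered in class is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class problem (illustration only).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)  # shape (n_samples, n_classes)
y_onehot = label_binarize(y_test, classes=clf.classes_)

# One-vs-rest curves: column i of y_onehot marks membership in class i.
curves = {}
for i, label in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_onehot[:, i], probs[:, i])
    precision, recall, _ = precision_recall_curve(y_onehot[:, i], probs[:, i])
    curves[label] = {"fpr": fpr, "tpr": tpr, "precision": precision, "recall": recall}

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=clf.classes_)
print(cm)
```

Each entry of `curves` holds the arrays needed to draw one ROC curve and one precision-recall curve for that label.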

In [None]:
# TODO: Using any of the classification methods we have discussed in class, make a multiclass
# classifier to predict the author label given the tf-idf data [8 pts]


In [None]:
# TODO: Show the performance of your classifier by making a ROC curve and
# precision-recall curve for each of the author labels [6 pts]
# Note: If you use cross-validation, you are free to either plot the performance
# per fold or the average performance across folds.


In [None]:
# TODO: Plot a confusion matrix for your multiclass classifier [2 pts]
# Hint: Calculating the matrix by hand isn't difficult, but you are also welcome
# to take advantage of any convenient functions sklearn provides for evaluation metrics


**Short Answer Question**: Looking at the confusion matrix, we do not know who the authors are, but tracing back to the categories they fall into, are there any categories that tend to get mixed up more often than others? Explain whether you think this makes sense. [4 pts]

`[WRITE YOUR ANSWER HERE]`

### Part 3: Text Classification II [30 pts]
Now, we test our classification skills by looking at data we have not yet touched (`test-bag_of_words.csv` and `test-tfidf.csv`).

Again, we will be using [Kaggle](https://www.kaggle.com/t/91f7c687cfc34d8e9a76145544f865d6).

Unlike last time, for this homework, *you will be graded on your ability to pass performance benchmarks*. Specifically, you will receive 5 points for each of the benchmarks (baseline, easy, moderate) you pass on each of the public and private leaderboards, for a total of 30 points. Remember that since part of your grade is based on the private leaderboard, you want to refrain from overfitting to the public leaderboard!


The top three leaders on the private leaderboard will receive extra credit (if there are ties, everyone tied will receive the same number of points):
* 5 points for first place
* 3 points for second place
* 2 points for third place



The following Kaggle notes from the previous assignment still apply:
* You can use any team name (the name that will show up on the Kaggle leaderboard) as long as it is not inappropriate or offensive; however, in order to receive credit, you **must** specify your `team name` in your notebook here. If you do not, there is no way for us to assign you credit!
* Kaggle lists the close date as several days after the homework's due date. This is because Kaggle does not support late submissions. The homework and your submission on Kaggle are due by the due date listed here, but you may use late days and turn it in late (i.e., if you submit Kaggle predictions after the due date, it will automatically count towards your late days even if you have turned in your notebook already).
* This portion of the assignment **must** be completed independently. You cannot share prediction code or predictions with each other. In fact, you must put the exact code you use for your final predictions below. Violations will result in point deductions.
* Relatedly, you cannot modify your prediction files manually. Violations will result in point deductions.
* You can only use classification models that we have discussed in class (though you can feel free to preprocess your data / tune any of the parameters in the models however you like)!



**Kaggle team name:** `[fill in here]`

Now, we will finally read in the test datasets.

In [None]:
test_bag = pd.read_csv(DATADIR + "comp341-hw5/test-bag_of_words.csv")
test_tfidf = pd.read_csv(DATADIR + "comp341-hw5/test-tfidf.csv")

In [None]:
# TODO: put all code needed (including preprocessing steps) to make your
# final kaggle submission; note that this code must match the predictions
# that you provide on kaggle


You can see details about the file format for submission on Kaggle (`sample_submission.csv`, essentially a 2-column file with `textid`, the unique identifier in your test set, and `author`, your predictions). To make things easier, we provide some sample code here that you can modify to make your own submission file, assuming your predictions are in a variable called `y_pred_kagg`.

In [None]:
results = pd.Series(y_pred_kagg.flatten(), name="author")
results = pd.concat([test_bag['textid'], results], axis=1)
results.to_csv('my_submission.csv', index=False)

Once you output your csv file, you need to download the file from Colab to your local computer (you can click the file folder icon on the left panel to see the files in your workspace) and upload that file to the Kaggle site as your submission. Note that you can submit multiple times (up to 10 times a day)!

## To Submit
Please provide a Google Colab link with `Viewer` permissions enabled for `Rice University` only: click the `Share` button, toggle the permissions accordingly, and copy the link here: [Colab notebook](https://)

Now export the notebook as a PDF (`File > Print > Save as PDF`), make sure it is named `netid-hw5.pdf`, and upload it to the corresponding Gradescope assignment.

Also, double check that your Kaggle submission shows up on the public leaderboard.