# Assignment 4
## Machine learning and sentiment analysis

In this assignment you'll use the techniques we discussed in class for applying machine learning to language data.  Be sure you've gone through and understood the "Scikit tour" notebook to get the background for this assignment.

For this assignment we'll be using the file `reviews_peoria.csv`.  As in the Scikit tour, this file contains Yelp reviews of various businesses, but these businesses are in Peoria, AZ rather than Champaign, IL.  The format of the file is the same, but it is significantly larger.

Previous assignments were generally divided into two sections: a "practice" section where you did small tasks with some Python library (e.g., pandas, spacy), and a "task" section where you used that library to perform some task (Markov text generation, summarization, etc.).  In this assignment we go straight to the task.  This is because it isn't really useful to explore scikit on tiny example datasets; the power of scikit lies in its ability to handle larger chunks of data.

In each code section, be sure to display a brief sample of whatever data was generated.  Typically you'll do this using `.head()` when working with Pandas structures.  In other words, if the instructions say "Create a table like this...", don't just create the table, but create it and then call `.head()` on it and be sure the output is displaying the first few rows of your table.

### Classification

**Exercises**

Load the data file into a Pandas DataFrame.  If you like, you can strip out the "ReviewID", "UserID", and "BusinessID" columns, as we won't be using them here.

```python
import pandas as pd

# Load the Peoria review data into a DataFrame.
reviews = pd.read_csv('reviews_peoria.csv')
```

Our first task will be to predict the sentiment of reviews.  In the Scikit tour we did this by predicting the star rating of a review directly.  Now, we will instead classify reviews as either "positive", "negative" or "neutral".  We will define "negative" reviews as those which are 1 or 2 stars, "neutral" as 3 stars, and "positive" as 4 or 5 stars.

To begin, add a new column called "Polarity" to your DataFrame to represent the category of the review.  Because scikit requires categories to be numbers, we will use the number -1 for negative reviews, 0 for neutral, and 1 for positive.  So in your new column each value should be either -1, 0 or 1, depending on the star rating.  This will be our "Y" value that we will try to predict.

(Note that even though the values are numerical, we are treating this as a classification problem.  That means the model will not "know" that positive is more similar to neutral than it is to negative.  We are treating the three categories as separate and independent.)
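The star-to-polarity mapping described above can be sketched on a toy table like this.  This is only one possible approach, and the column name `Stars` is an assumption; check what the star-rating column is actually called in the CSV.

```python
import pandas as pd

# Toy data; the column name "Stars" is an assumption about the CSV layout.
toy = pd.DataFrame({"Stars": [1, 2, 3, 4, 5]})

# 1-2 stars -> -1 (negative), 3 stars -> 0 (neutral), 4-5 stars -> 1 (positive).
star_to_polarity = {1: -1, 2: -1, 3: 0, 4: 1, 5: 1}
toy["Polarity"] = toy["Stars"].map(star_to_polarity)

print(toy)
```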

Next, use the appropriate scikit function to split the data into train and test sets.

Unless you want to confuse yourself, you should call these variables `x_train`, `x_test`, `y_train`, and `y_test`.

As in the Scikit tour, we will be using text features created by a vectorizer.  In this case, however, we will use the TfidfVectorizer rather than CountVectorizer.  As discussed in class, Tf-idf weights the word frequencies by their inverse document frequency, so that words which occur in many reviews are given less weight.
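To see the Tf-idf weighting in action, here is a small sketch on a made-up three-document corpus (the documents are invented for illustration).  The word "good" appears in every document, so it gets a lower inverse-document-frequency weight than "pizza", which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus: "good" occurs in all three documents.
docs = ["good pizza", "good service", "good burger"]

tfidf = TfidfVectorizer().fit(docs)
vocab = tfidf.vocabulary_  # maps each term to its column index

# idf_ holds one inverse-document-frequency weight per column.
idf = {term: tfidf.idf_[col] for term, col in vocab.items()}
print(idf)

# Within a single document, the rarer word ends up with the larger weight.
row = tfidf.transform(["good pizza"]).toarray()[0]
print("good:", row[vocab["good"]], "pizza:", row[vocab["pizza"]])
```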

In the following sections of the assignment, we will be creating slightly different vectorizers and using them to create different feature sets.  Be sure to save your features into distinct variables at each stage.  For instance, if you call your features `features_train` and `features_test`, and then later we create a new set of features, don't overwrite the same variable names.  The reason is that we're going to use these same features later for a different task, so you'll want to keep the features around so you can re-use them later without having to rewrite the code that creates them.

Create a `TfidfVectorizer` and fit it on the `Text` column of the training data.  Be sure to include `stop_words='english'` so the vectorizer will remove stopwords.

Now use the vectorizer to transform the `Text` column and obtain sparse matrices containing the word features.  You should wind up with two new variables, called `features_train` and `features_test`, for training and test sets, respectively.

How many features did your vectorizer create?  (Obviously, show the code you used to find this.)
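As a rough sketch of the fit-then-transform pattern (on invented sentences, not the review data): the vectorizer is fit on the training text only, then used to transform both sets, and the second dimension of the resulting matrix is the number of features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for the training and test text.
train_texts = ["the food was great", "the service was slow"]
test_texts = ["great service and great food"]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(train_texts)  # learn the vocabulary from training text only

features_train = vectorizer.transform(train_texts)
features_test = vectorizer.transform(test_texts)

# Shape is (number of documents, number of features).
print(features_train.shape)
print(features_test.shape)
```

Note that the test matrix has the same number of columns as the training matrix, because both were produced by the same fitted vectorizer.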

Now it's time to do our classification.  This time we will use a multinomial Naive Bayes classifier, available as `MultinomialNB` in scikit.  You may have to google to find out exactly where to find this model in scikit.  Once you've found it, create the model object.

Now fit your model object on the transformed features you got from the vectorizer.

Our model is now ready to make predictions.  Before we actually do that, though, let's prepare a place to store those predictions.  Make a Pandas DataFrame that (for now) has just one column, called `Actual`, which contains the true `Polarity` values for the test set (that is, the values we are trying to predict).  This DataFrame should have a nice name like `output` or `predictions` or `results` or something like that.

As before, it's a good idea to create a "dumb" model as a baseline of comparison for our (hopefully) more sophisticated model.  Since this is a classification problem, the simplest thing is to find which of our three categories is the most common, and always predict that category.

So, find out what the most common Polarity category is *in the training set*, and add a column to your predictions DataFrame, so it will now have two columns, `Actual` and `Baseline`.  (Okay, you can call it "Dumb" if you want.  Or "Simple".  You get the idea.)  This "Baseline" column will just have the same value in every row, namely whichever category is most common in the training set.
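One way the baseline column might look, sketched with tiny invented label series standing in for `y_train` and `y_test`:

```python
import pandas as pd

# Invented labels standing in for the real train/test Polarity values.
y_train = pd.Series([1, 1, -1, 0, 1])
y_test = pd.Series([1, -1, 0])

output = pd.DataFrame({"Actual": y_test})

# The most common category is taken from the training set, not the test set.
most_common = y_train.mode()[0]
output["Baseline"] = most_common  # the same value is repeated in every row

print(output)
```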

Now we can actually put our model through its paces.  Use your model to generate predictions for the test set and add them as a new column (perhaps called `NB`) to your predictions table.
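The whole fit-and-predict cycle can be sketched end to end on a toy example.  The texts and labels here are invented, and the real assignment data will of course be much messier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = positive, -1 = negative.
train_texts = ["great food", "terrible service", "great service", "terrible food"]
train_labels = [1, -1, 1, -1]
test_texts = ["great", "terrible"]

vectorizer = TfidfVectorizer()
features_train = vectorizer.fit_transform(train_texts)
features_test = vectorizer.transform(test_texts)

model = MultinomialNB()
model.fit(features_train, train_labels)

# One predicted label per test document.
predictions = model.predict(features_test)
print(predictions)
```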

Now it's time to evaluate our model.  Write code that loops over all the columns of your predictions DataFrame and, for each column, computes and displays the following classification metrics: accuracy score, Matthews coefficient, and F1 score.  Be sure each is labeled in the display so that the output clearly shows how each model scored on each metric.  (You may want to skip the "Actual" column since that will just show up as having perfect accuracy.)

When computing the F1 score, you should pass the additional argument `average="micro"`.

We're going to have to re-run this code many times, so if you want to be extra cool, you can write it as a function, so that later when we have to call it again you can just call that function instead of copying and pasting your code.
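Here is a minimal sketch of what such a function could look like.  The function name `show_metrics` and the toy table are made up for illustration; any layout that clearly labels each score is fine.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def show_metrics(table, actual_col="Actual"):
    """Return one row of classification metrics per prediction column."""
    rows = {}
    for col in table.columns:
        if col == actual_col:
            continue  # skip the true values themselves
        rows[col] = {
            "Accuracy": accuracy_score(table[actual_col], table[col]),
            "Matthews": matthews_corrcoef(table[actual_col], table[col]),
            "F1 (micro)": f1_score(table[actual_col], table[col], average="micro"),
        }
    return pd.DataFrame(rows).T

# Invented predictions table for demonstration.
toy = pd.DataFrame({
    "Actual":   [1, 1, -1, 0, 1, -1],
    "Baseline": [1, 1, 1, 1, 1, 1],
    "Model":    [1, 1, -1, 1, 1, 0],
})
scores = show_metrics(toy)
print(scores)
```

Returning a DataFrame (rather than printing inside the loop) makes it easy to compare models side by side each time a new column is added.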

To get another look at the model's performance, display the confusion matrix, showing how predictions compared to true values (for our Naive Bayes model).  We would expect the confusion matrix to be a 3x3 table, since we have three categories, but it may wind up being smaller than 3x3 if the model never predicted certain categories, because then columns/rows for those categories will be missing.
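Whether the table shrinks depends on how you build it.  The sketch below (with invented labels, where the "model" never predicts 0) compares scikit's `confusion_matrix`, which keeps all three categories when you pass them via `labels`, with `pd.crosstab`, which only shows categories that actually occur.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels; the predictions never include the neutral category (0).
actual = pd.Series([1, 1, -1, 0, 1, -1], name="Actual")
predicted = pd.Series([1, 1, -1, 1, 1, 1], name="Predicted")

# Rows are true categories, columns are predicted categories.
cm = confusion_matrix(actual, predicted, labels=[-1, 0, 1])
print(cm)

# crosstab drops the column for the category that was never predicted.
table = pd.crosstab(actual, predicted)
print(table)
```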

Let's try a new twist on this.  Look up the scikit documentation for `TfidfVectorizer`.  You'll find that, when creating the model, you can pass an argument to have the model create features not just for the individual words in the text, but for the n-grams as well.  Use that argument to create a new `TfidfVectorizer` that will create features for bigrams and trigrams as well as individual words.  (Don't forget to include the `stop_words` argument as well, to exclude stop words!)
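To get a feel for how quickly n-grams inflate the feature count, here is a sketch on a single invented three-word document, assuming the relevant argument is `ngram_range` (do confirm this against the documentation for your scikit version).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A single invented document with three distinct words.
docs = ["tasty fresh pizza"]

unigrams = TfidfVectorizer().fit(docs)
ngrams = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)  # words, bigrams, trigrams

print(sorted(unigrams.vocabulary_))
print(sorted(ngrams.vocabulary_))
```

Even on three words, the vocabulary doubles once bigrams and trigrams are included; on real reviews the growth is far larger.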

Train your new vectorizer on the training set, transform the features to create new feature matrices, and use the new features to fit a new Naive Bayes model.  (In other words, repeat the model-creation process we did above.)  We don't need to redo the train/test split since we didn't make any modifications to the data.

Beware, this vectorizer and model may take a bit longer to run!

How many features did your new vectorizer create?

Now use your new model to generate new predictions for the test set, and add them as a new column (called `NB_ngram` or something) to your predictions table.  Be sure that your new model really is using the new feature matrices you created!  (I mention this because it's easy to copy and paste code and forget to change variable names, etc., so that you may wind up creating a new model but fitting it on the old data.  Watch out for that.)

Now use your code from earlier to compute the model evaluation metrics again.  (You should copy and paste the code here.  Or, if you wrote a function above, you could just call that function here.  But just make sure you have output here that shows the new metrics; don't just re-run the code above and leave the output up there in that earlier section.)

Now for one more modification.  If you go back to the scikit documentation for `TfidfVectorizer`, you'll see there is also a way to limit the maximum number of features that the model creates, so that it only considers the most frequent words (and/or ngrams).  Use this to repeat the process above, this time creating a vectorizer that still handles words, bigrams, and trigrams, but limits the total number of features to 100,000.  Create another `MultinomialNB` model and fit it on your newly transformed data.

I'm not going to ask you how many features this vectorizer created because, guess what, it's probably 100,000.
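A tiny sketch of the feature cap, assuming the argument in question is `max_features` and using an invented corpus with a cap of 2 instead of 100,000:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus; "pizza" is by far the most frequent term.
docs = ["pizza pizza pizza", "pizza burger", "burger salad"]

uncapped = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)
capped = TfidfVectorizer(ngram_range=(1, 3), max_features=2).fit(docs)

print(len(uncapped.vocabulary_), "features without a cap")
print(sorted(capped.vocabulary_), "kept with max_features=2")
```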

Once again, generate predictions with your new model and add them as a new column to your predictions table.

Finally, compute the classification metrics again.

**Questions**

1. We created three models here: the original Naive Bayes model, a second one using n-grams, and a third one using n-grams limited to 100,000 features.
  1. Which model performed best?  How do you know?  Why do you think this model outperformed the others?
  2. Overall, how did the three models compare to the "dumb" baseline model and to each other?  Was there a large difference in performance between any of them?
  3. If you had to use such a model to make real predictions (for instance, for commercial purposes), which model would you choose?  Why?  Are there other factors to consider besides model performance?
2. How did adding the ngram features affect the model?  Why do you think this was?
3. How did limiting the number of features affect the model?  Why do you think this was?

### Regression

Now we will use the same data to perform a regression task.  Instead of predicting the star rating, though, we will try to predict the number of "funny" votes received by each review.  This differs from the star rating in one important way: the number of stars must be between 1 and 5, but there is no upper limit on how many "funny" votes a review can receive.  (The lower limit, of course, is zero.)

To do this, we will re-use the feature vectors you created above.  As mentioned earlier, be sure you've stored each set of feature vectors (train and test features for the plain vectorizer, ngram vectorizer, and limited-ngram vectorizer) into separate variables so you can make use of them as needed.

In order to re-use these feature vectors, we have to keep the same train/test split.  But our old train/test split was made using our `Polarity` column as the Y value.  How can we re-use the same split when our Y value has now changed?

The answer is that we can just use the columns of our `x_train` and `x_test` tables.  Because we did our train/test split on a Pandas DataFrame, `x_train` and `x_test` actually contain all the columns in the original dataset.  So, instead of using `y_train` as our Y value, we can just use `x_train.Funny`; and likewise we can use `x_test.Funny` as our "true" Y values (instead of `y_test`) to compare against when making predictions.
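The following sketch illustrates why this works, using an invented eight-row table.  It assumes the split was made by passing the whole DataFrame as X, as described above; the `test_size` and `random_state` values are arbitrary choices for the demo.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for the review table.
toy = pd.DataFrame({
    "Text": ["great", "awful", "fine", "loved it",
             "hated it", "okay", "superb", "terrible"],
    "Funny": [0, 2, 0, 1, 0, 0, 3, 0],
    "Polarity": [1, -1, 0, 1, -1, 0, 1, -1],
})

# Split the whole DataFrame, so x_train/x_test keep every column.
x_train, x_test, y_train, y_test = train_test_split(
    toy, toy["Polarity"], test_size=0.25, random_state=0
)

# The Funny column comes along with the same rows, in the same order.
funny_train = x_train["Funny"]
funny_test = x_test["Funny"]

print(list(x_train.columns))
print(funny_train.index.equals(y_train.index))
```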

**Exercises**

For this task we'll use the LinearSVR model.  Go ahead and create the model object now, and fit it using the `features_train` vectors created above (by your first, "plain" vectorizer).  Use your `x_train.Funny` from above as your Y value.

We'll need to create a new predictions table since the nature of our predictions is different now.  Create a new output table (called `reg_output` or `reg_predictions` or `funny_predictions` or something like that) for our predictions of funniness.  To begin with, it should just have one column, `Actual`, which contains the number of funny votes given to each review in the test set.  (This is the `Funny` column in `x_test`.)

As our baseline prediction, this time we can use the mean number of funny votes as our simple guess.  Find the mean number of funny votes in the *training* set and add a "Baseline" column to your predictions table which just repeats this mean value for every row.

Now use your LinearSVR model to predict the values on the test set (`features_test` which you created earlier) and add them to your table.

To evaluate this regression model, we'll use the mean absolute error and mean squared error.  Write a loop similar to the one you did above, which computes and displays these metrics for each column in our predictions table.
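A regression version of the metrics loop might look something like this sketch.  The function name `show_reg_metrics` and the numbers in the toy table are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

def show_reg_metrics(table, actual_col="Actual"):
    """Return one row of regression metrics per prediction column."""
    rows = {}
    for col in table.columns:
        if col == actual_col:
            continue  # skip the true values themselves
        rows[col] = {
            "MAE": mean_absolute_error(table[actual_col], table[col]),
            "MSE": mean_squared_error(table[actual_col], table[col]),
        }
    return pd.DataFrame(rows).T

# Invented predictions table for demonstration.
toy = pd.DataFrame({
    "Actual":   [0, 2, 4],
    "Baseline": [2.0, 2.0, 2.0],
    "Model":    [0.0, 1.0, 4.0],
})
reg_scores = show_reg_metrics(toy)
print(reg_scores)
```

For both metrics, lower is better; squared error penalizes a few large misses more heavily than absolute error does.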

Is your model predicting any impossible values?  Show whatever code you used to check this.

If your model predicts any impossible values, write some code here to modify the results to fix this, and re-insert them into the output table.  Then show the new evaluation metrics.

This time, instead of proceeding by using different features, we'll try a different model.  Repeat your model creation process above, but use an SGDRegressor model instead of LinearSVR.  Add its predictions to an "SGD" column in your predictions table, and output the evaluation metrics.

Let's peek inside the model to see which words are having the largest effect on its predictions.  Make a DataFrame with two columns.  The `Word` column should contain the various words represented by your vectorized features; the `Coef` column should contain the coefficients for each word, taken from your fitted model object.

Sort your table by `Coef` and display the top few and bottom few values (i.e., those words with the lowest and highest values of `Coef`).
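As a rough sketch of how words and coefficients can be lined up (on invented texts and vote counts): the vectorizer's `vocabulary_` maps each word to its column index, and the fitted model's `coef_` has one entry per column, in the same order.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor

# Invented texts and "funny" vote counts.
texts = ["hilarious joke", "hilarious story", "plain review", "plain meal"]
funny_votes = [3, 4, 0, 0]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

model = SGDRegressor(random_state=0)
model.fit(features, funny_votes)

# Order the words by column index so they line up with model.coef_.
vocab = vectorizer.vocabulary_
words = sorted(vocab, key=vocab.get)

coefs = pd.DataFrame({"Word": words, "Coef": model.coef_})
coefs = coefs.sort_values("Coef")

print(coefs.head(3))  # lowest coefficients
print(coefs.tail(3))  # highest coefficients
```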

We'll have some questions about these values at the end.

Now, just for kicks, instead of changing either the data or the type of model, we'll try tweaking some of the parameters of the model.  Most model types have different kinds of parameters that can be changed when creating the model in order to alter its behavior somehow.

Go to the scikit documentation for SGDRegressor and look at the various parameters described there.  You'll find there is one called `loss`, whose default value is "squared_error" (named "squared_loss" in scikit-learn versions before 1.0).  Some other possible values are described.  Create a new SGDRegressor model where you pass one of these alternative values for `loss`.  Fit your new model again, generate the predictions, and show the new evaluation metrics.

Finally, go back to the SGDRegressor documentation, choose some other model parameter there, and make yet another SGDRegressor where you set this parameter to some value.  The documentation contains some information about what values might be appropriate, but if you're not sure, just try something.  As long as your model successfully runs, you'll get credit, even if its predictions are way off; our goal here is just to experiment with the model parameters to see how they affect results.

Do not choose the `verbose`, `random_state`, or `warm_start` parameters, since these don't really affect the workings of the model, but just tweak superficial details (such as what is displayed).  Also, you'll see that some parameters say something like "parameter X is only used if parameter Y is set to such-and-such value", so obviously don't choose one of those if it's not going to have any effect.

Use your new regressor to add new predictions to your output table and run the evaluation metrics again.

**Questions**

1. Which of your models performed best?  How can you tell?
2. How did your models compare to the baseline model and to each other?
3. If you were doing a regression task to predict some number (star rating, funny votes, cool votes, item price, etc.) from textual data, and you had to choose between using the LinearSVR or SGDRegressor, which would you start with?  Why?
4. At a certain point, we created a DataFrame showing the model coefficients of each word.
  1. When we sorted by Coef, do you feel like the words towards the top and bottom of the list "make sense"?  That is, do they match your intuition of what words would make a review funny or not?  Explain your impression of these top and bottom words.
  2. In the Scikit tour, we were able to add up the frequencies of all words in our feature vectors, and then use that to filter out high-frequency words from the list of coefficients.  Why can we not do that here?
5. Toward the end of this section, we modified the model parameter `loss`.
  1. Explain what you think this parameter does, and what the different values mean, as best as you can.  Don't just copy from the documentation; use your own words, even if you're not sure if what you're saying is correct.
  2. What effect did changing the value of this parameter have on the model's performance?  Give your best guess as to why it had that effect.
6. After that, you modified a parameter of your choosing.
  1. Which parameter did you choose?
  2. What kinds of possible values could this parameter take?  Why did you choose the one you did?
  3. What effect did changing the value of this parameter have on the model's performance?  Give your best guess as to why it had that effect.
7. As mentioned, predicting funny votes is a bit different from predicting star ratings because funny votes are unbounded, while star ratings are constrained between 1 and 5.
  1. Are there any other important features of the funny votes that could influence how we decide to deal with them?  What do you notice about the distribution of numbers of funny votes per review?
  2. Based on your answer to the previous question, how might you change your approach to the prediction of funny votes?

Be sure to submit your notebook on GauchoSpace for credit!