# Introduction

In this workbook, we will train a linear classifier model to label movie reviews as either generally positive or negative. This task, called sentiment analysis, is widely used by companies to understand the general feeling toward a product through automated text analysis. Instead of having to read reviews one by one, we can get a rough snapshot of public opinion very quickly.

To train our model, we will use the Sentiment Polarity Data Set v2.0 from [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/) by Pang, Lee and Vaithyanathan. This data set contains 1000 positive and 1000 negative movie reviews, and each has been labeled as such. By supplying this labeled data to our machine learning algorithm, we can train a model to predict whether any movie review is negative or positive.

This Jupyter Notebook will walk you through the code that explores the data, pre-processes the reviews into a format that the machine learning algorithm can understand, and trains and tests the model. We will also demonstrate how the model can predict the sentiment of movie reviews that you write!

# Machine Learning Workflow

## Loading the data into our workbook

First we will load in our data. We will use the popular data manipulation and analysis package "pandas" to read in our CSV files and explore the data. The data has already been separated into a training set and a testing set on Kaggle, where we are downloading it from, so we will load each of these separately.


In [1]:
import pandas as pd

reviews_training = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_train.csv")
reviews_test = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_test.csv")


Now that our data is loaded, we should take some time to explore it and become familiar with it.

## Exploring the data

Let's explore our data and see how it's formatted. We will use the `.head()` method from pandas to look at the first 5 rows of our training data.

In [2]:
reviews_training.head()

Unnamed: 0,Content,Label
0,every once in a while you see a film that is s...,pos
1,the love for family is one of the strongest dr...,pos
2,after the terminally bleak reservoir dogs and ...,pos
3,( warning to those who have not seen seven : ...,pos
4,"having not seen , "" who framed roger rabbit "" ...",pos


We see the index column on the left and two labeled columns in our dataset. The "Content" column contains the movie reviews. The other column, "Label", contains "pos", which I am assuming means the review has been labeled as positive. Let's check out a random row and read the movie review to see if we agree with the existing label. The pandas `.loc` indexer lets us pick out values by row index label and column name. I randomly chose row 42 and printed out both the 'Content' and 'Label'.

In [3]:
print(reviews_training.loc[42,'Content'])
print("The sentiment has been labeled as", reviews_training.loc[42,'Label'])

sometimes a movie comes along that falls somewhat askew of the rest . 
some people call it " original " or " artsy " or " abstract " . 
some people simply call it " trash " . 
a life less ordinary is sure to bring about mixed feelings . 
definitely a generation-x aimed movie , a life less ordinary has everything from claymation to profane angels to a karaoke-based musical dream sequence . 
whew ! 
anyone in their 30's or above is probably not going to grasp what can be enjoyed about this film . 
it's somewhat silly , it's somewhat outrageous , and it's definitely not your typical romance story , but for the right audience , it works . 
a lot of hype has been surrounding this film due to the fact that it comes to us from the same team that brought us trainspotting . 
well sorry folks , but i haven't seen trainspotting so i can't really compare . 
whether that works in this film's favor or not is beyond me . 
but i do know this : ewan mcgregor , whom i had never had the pleasure of watch

I would agree with the label! Now that we've looked at some examples, let's make sure the testing set we loaded earlier is in a similar format. We will use the `.head()` method again.

In [4]:
reviews_test.head()

Unnamed: 0,Content,Label
0,hedwig ( john cameron mitchell ) was born a bo...,pos
1,one of the more unusual and suggestively viole...,pos
2,what do you get when you combine clueless and ...,pos
3,>from the man who presented us with henry : th...,pos
4,tibet has entered the american consciousness s...,pos


Looks the same as the training set, so that is great news. Let's see how the website has split the training and testing data. Usually, we hold back around 20% of the data for testing, and use the other 80% to actually train the model. Here we print the `.shape` of both datasets.

In [5]:
print(reviews_training.shape)
print(reviews_test.shape)

(1800, 2)
(200, 2)


Looks like there are 1800 training objects, and 200 testing objects. We did read earlier that there were 2000 total objects, so this checks out. 1800 is 90% of 2000, so the training set contains 90% of the data and the testing set contains the other 10%. This is alright, so we will proceed.

The metadata said there were 1000 positive and 1000 negative reviews. It is important that there is a fairly equal distribution of classes in our training set so we do not create a biased model, and we would like to see a good distribution in our testing set as well so we know how well the model performs on each type. To check this, we will use the `.value_counts()` method from pandas to print out all unique values in a column and how many times they occur.

In [6]:
print(reviews_training['Label'].value_counts())
print(reviews_test['Label'].value_counts())

pos    900
neg    900
Name: Label, dtype: int64
pos    100
neg    100
Name: Label, dtype: int64


Great! Both training and testing sets are split exactly evenly with positive and negative examples. Now that we have a good grasp on the data, we should move on to preprocessing the data.
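
Our data arrived already split, but if it had come as a single file, a balanced split like the one we just checked could be produced with scikit-learn's `train_test_split`. The sketch below is only an illustration: the `reviews` DataFrame is a made-up stand-in for the real data, and the 20% test size and `random_state` are assumptions, not the settings used to create the Kaggle files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a single, unsplit file of labeled reviews.
reviews = pd.DataFrame({
    "Content": [f"review number {i}" for i in range(20)],
    "Label": ["pos"] * 10 + ["neg"] * 10,
})

# stratify keeps the pos/neg ratio the same in both splits.
train, test = train_test_split(
    reviews,
    test_size=0.2,
    stratify=reviews["Label"],
    random_state=0,
)

print(train.shape, test.shape)
print(test["Label"].value_counts())
```

Passing the label column to `stratify` is what keeps the classes evenly distributed in both pieces, which is the property we verified with `.value_counts()` above.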

## Preprocessing the data

Preprocessing of natural language data is extremely important. How we transform the words into data arrays that a machine learning model can learn from will directly determine the success or failure of a model. For this workflow, we will focus on the 'bag-of-words' model for transforming the movie reviews into data arrays.

The 'bag-of-words' model represents each review as a collection (or bag) of individual words, without regard to their order. Each review has a large array associated with it, where each column represents one word from the set of all words seen in the reviews (called the vocabulary). If a review contains that word once, the column is assigned a 1. If a review contains that word twice, the column will have a 2, and so on. Words from the vocabulary that do not appear in the review get a 0. In this way, you can represent every review by simply tallying up how often each word appears in it (called 'term frequency', or TF).
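
To make the idea concrete, here is a minimal sketch of bag-of-words counting using only the Python standard library. The two toy reviews and the `bag_of_words` helper are made up for illustration, and splitting on whitespace is a simplifying assumption; it is not how `TfidfVectorizer` tokenizes text.

```python
from collections import Counter

# Two toy "reviews" (hypothetical, not taken from the dataset).
reviews = [
    "a great film with a great cast",
    "a dull film",
]

# The vocabulary is every distinct word seen across all reviews, sorted.
vocabulary = sorted({word for review in reviews for word in review.split()})

def bag_of_words(review, vocabulary):
    """Count how often each vocabulary word occurs in one review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(review, vocabulary) for review in reviews]

print(vocabulary)  # ['a', 'cast', 'dull', 'film', 'great', 'with']
print(vectors[0])  # [2, 1, 0, 1, 2, 1]
print(vectors[1])  # [1, 0, 1, 1, 0, 0]
```

Each review becomes one row with one column per vocabulary word, and word order is lost entirely.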

There is a problem associated with this strategy. There are a lot of words in English that appear quite often and don't really determine the positive or negative sentiment of a review, such as "the", "of", or "and". To make sure these words don't confuse and overcomplicate the model, we do an inverse document frequency (IDF) calculation. This calculation gives a large weight (or importance) to words that appear in few of the reviews, while words that appear in most reviews are given a low weight. By combining the term frequency within a document (TF) and the IDF score, you can assign a high weight to words that are frequent within a review but rare across the whole collection. This method will allow us to focus on the words that make a much bigger difference in the overall sentiment.
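
The weighting can be worked through by hand on a tiny corpus. The sketch below uses the smoothed IDF formula that scikit-learn documents for its default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of reviews and `df(t)` is the number of reviews containing the word. The toy corpus and helper names are assumptions for illustration, and the final length normalization that the vectorizer applies is left out.

```python
import math

# Toy corpus (hypothetical): four already-tokenized reviews.
docs = [
    ["the", "film", "was", "brilliant"],
    ["the", "film", "was", "dull"],
    ["the", "plot", "was", "brilliant"],
    ["the", "acting", "was", "fine"],
]
n_docs = len(docs)

def document_frequency(term):
    """Number of reviews that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1

def tf_idf(term, doc):
    """Raw term count in one review multiplied by the term's IDF."""
    return doc.count(term) * idf(term)

for term in ["the", "brilliant", "dull"]:
    print(term, document_frequency(term), round(idf(term), 3))
```

Here "the" appears in every review and gets the smallest possible IDF of 1.0, while "dull" appears in only one review and gets the largest. Note that with this formula common words are down-weighted relative to rare ones rather than being zeroed out.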

There is a tool that does all of these calculations for us! It is the `TfidfVectorizer` from scikit-learn. With a large vocabulary, the arrays can be rather large and mostly filled with zeros, so the `TfidfVectorizer` outputs the high-dimensional vectors in a format called a "compressed sparse row matrix". You can learn more about that [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).
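
As a quick illustration of what "compressed sparse row" means, the sketch below converts a small dense array into a `csr_matrix` and prints its three internal arrays. The numbers are made up; only the storage layout matters here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero array standing in for TF-IDF weights (made-up values).
dense = np.array([
    [0.0, 0.0, 0.3],
    [0.4, 0.0, 0.0],
    [0.0, 0.5, 0.6],
])

sparse = csr_matrix(dense)

print(sparse.data)     # the non-zero values, stored row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # where each row starts within data/indices
print(sparse.nnz)      # number of stored (non-zero) entries
```

Only the four non-zero entries are stored, which is why this format suits review vectors where most vocabulary words never appear in a given review.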

Here we import `TfidfVectorizer` and initialize it with some hyperparameters.


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df = 1, max_df = 1.0, sublinear_tf = True, use_idf = True)



`min_df` and `max_df` set the minimum and maximum document frequency a word may have to be included in the vocabulary. An integer is read as a number of reviews, and a float as a proportion of reviews. With a `min_df` of 1, a word must appear in at least one review to be included. With `max_df` set to 1.0 (meaning 100%), a word is kept even if it appears in every review. When `sublinear_tf` is set to True, the raw term frequency `tf` is replaced with `1 + log(tf)`. This greatly reduces the impact of terms that occur very often within a review but might not carry significant information. Lastly, the `use_idf` parameter determines whether inverse document frequency weighting is used at all, as discussed above. There are more hyperparameters that can be customized, and they are documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
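
To see how the two document frequency thresholds shape the vocabulary, here is a small sketch on a made-up four-review corpus. The corpus and the threshold values are assumptions chosen so the effect is easy to follow; they are not settings recommended for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical). "the" is in all 4 reviews, "was" in 3,
# "movie" and "great" in 2, and the remaining words in only 1.
docs = [
    "the movie was great",
    "the movie was awful",
    "the soundtrack was great",
    "the ending",
]

# Keep everything: at least 1 review, at most 100% of reviews.
full = TfidfVectorizer(min_df=1, max_df=1.0).fit(docs)

# Drop words found in fewer than 2 reviews or in more than 80% of reviews.
filtered = TfidfVectorizer(min_df=2, max_df=0.8).fit(docs)

print(sorted(full.vocabulary_))
# ['awful', 'ending', 'great', 'movie', 'soundtrack', 'the', 'was']
print(sorted(filtered.vocabulary_))
# ['great', 'movie', 'was']
```

The rare words fall below `min_df`, and "the" exceeds `max_df` because it appears in 100% of the reviews.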

Now with the vectorizer initialized, we are ready to transform our data. We first use the `fit_transform` method on our training data to create the vocabulary based on the hyperparameters we set, and then call the `transform` method to use that same vocabulary to transform the testing data. At the end we print out `training_vectors` to take a look at the compressed sparse row matrix.


In [8]:
training_vectors = vectorizer.fit_transform(reviews_training['Content'])
testing_vectors = vectorizer.transform(reviews_test['Content'])
print(training_vectors)

  (0, 37728)	0.025100981093704258
  (0, 3591)	0.023889006401408175
  (0, 12297)	0.04736892526046346
  (0, 17656)	0.04679663303327439
  (0, 12346)	0.0623848771840738
  (0, 27323)	0.045859217752151536
  (0, 15767)	0.043967339462022845
  (0, 29509)	0.03965913358706709
  (0, 22330)	0.046251603379063845
  (0, 29250)	0.0892927449327645
  (0, 10870)	0.0623848771840738
  (0, 8393)	0.0499960511075312
  (0, 28411)	0.025312511323553593
  (0, 9454)	0.03455505730553403
  (0, 35929)	0.053920525510464334
  (0, 6599)	0.04152331287466379
  (0, 7739)	0.058303281874462265
  (0, 10361)	0.07672081959651322
  (0, 9320)	0.039218946256633525
  (0, 29172)	0.06787285235103453
  (0, 36072)	0.0447568565992461
  (0, 14950)	0.06549673966980514
  (0, 14344)	0.024648936483590747
  (0, 33376)	0.039436923506143674
  (0, 37056)	0.026312347552146977
  :	:
  (1799, 27189)	0.027477323513756653
  (1799, 15413)	0.01749213768440475
  (1799, 2440)	0.029040213738001894
  (1799, 26764)	0.055525776455428374
  (1799, 25356)	0.0762

We see parts of the first and last rows. In parentheses you can see the review number and the index of the word in the vocabulary, and on the right is the weight that the TF-IDF calculations assigned to that word in that review. We are now ready to pass these vectors into our linear classifier and train our model!
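
The column indices in that printout can be mapped back to words through the vectorizer's `vocabulary_` dictionary. The sketch below does this on a made-up pair of training reviews and one test review. It also shows that words the vectorizer never saw during `fit_transform` are simply dropped by `transform`. The toy texts and the `index_to_word` name are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical).
train_docs = ["a gripping and clever thriller", "a clumsy and boring thriller"]
test_docs = ["a clever but unoriginal thriller"]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_vectors = vectorizer.transform(test_docs)        # reuses that vocabulary

# vocabulary_ maps word -> column index, so invert it to look words up.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

print(train_vectors.shape)  # (number of reviews, vocabulary size)

# test_vectors has a single row, so its indices/data describe that one review.
for column, weight in zip(test_vectors.indices, test_vectors.data):
    print(index_to_word[column], round(float(weight), 3))
```

Only "clever" and "thriller" survive in the test vector. "but" and "unoriginal" were not in the training vocabulary, and the default tokenizer ignores one-letter words such as "a".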

## Training the model

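We can now use our data to train a classifier. Let's use a support vector machine (SVM) classifier. We import one from scikit-learn, initialize it with the default hyperparameters, and call the `.fit` method to train on the newly created vectors and the labels from our original dataset. Note that `svm.SVC()` uses an RBF kernel by default, so with default hyperparameters the decision boundary is not strictly linear; a linear kernel has to be requested explicitly.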
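
For reference, the short sketch below shows how to check which kernel an `SVC` uses and two ways scikit-learn offers to get a linear SVM instead. This is only an illustration of the options; the results in this notebook come from the default `svm.SVC()`, and the variable names are made up.

```python
from sklearn import svm

# The classifier as created in this notebook: no kernel specified.
default_svc = svm.SVC()
print(default_svc.kernel)  # rbf

# Two ways to ask for a linear decision boundary instead.
linear_kernel_svc = svm.SVC(kernel="linear")
linear_svc = svm.LinearSVC()
print(linear_kernel_svc.kernel)  # linear
```

Either linear variant could be swapped into the training cell, though the scores reported below would likely change.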

In [9]:
from sklearn import svm

classifier = svm.SVC()
classifier.fit(training_vectors, reviews_training['Label'])

Our model is now trained! Time to evaluate the model.

## Testing the model

Before we can start using the model confidently, we need to evaluate how well it performs. First, let's use the `.predict` method of our SVM classifier to predict on the testing vectors we created earlier.

In [10]:
predictions = classifier.predict(testing_vectors)

Now that we have our predictions ready, let's use some metrics from scikit-learn. We will try both `classification_report` and the common `accuracy_score`. We import both and give them the true labels for our testing set and the predictions we made. For `classification_report`, we specify the output format as a dictionary. This function returns a report of some common metrics used to evaluate models, including the precision, recall, F1-score, and how many samples were tested for each class (the support). `accuracy_score` will just give us the fraction of predictions that were correct out of all attempts. We print out the report entries for the 'pos' class and the 'neg' class. We then print the overall accuracy.

In [11]:
from sklearn.metrics import classification_report, accuracy_score

report = classification_report(reviews_test["Label"],predictions,output_dict=True)
accuracy = accuracy_score(reviews_test["Label"], predictions)

print('Positives: ', report['pos'])
print('Negatives: ', report['neg'])
print("Overall accuracy: ", accuracy)

Positives:  {'precision': 0.9, 'recall': 0.9, 'f1-score': 0.9, 'support': 100}
Negatives:  {'precision': 0.9, 'recall': 0.9, 'f1-score': 0.9, 'support': 100}
Overall accuracy:  0.9
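
To make the numbers in such a report concrete, here is a minimal sketch that computes precision, recall, and F1-score by hand for one class. The ten labels and predictions are made up for illustration and are unrelated to our test set; `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(true_labels, predicted_labels, target):
    """Precision, recall, F1 and support for one class, computed by hand."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # correct hits
    fp = sum(1 for t, p in pairs if t != target and p == target)  # false alarms
    fn = sum(1 for t, p in pairs if t == target and p != target)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1-score": f1, "support": tp + fn}

# Made-up labels: 5 positive and 5 negative reviews.
true_labels = ["pos"] * 5 + ["neg"] * 5
predicted = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "pos"]

metrics_pos = per_class_metrics(true_labels, predicted, "pos")
metrics_neg = per_class_metrics(true_labels, predicted, "neg")
accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

print("Positives: ", metrics_pos)
print("Negatives: ", metrics_neg)
print("Overall accuracy: ", accuracy)
```

Precision asks "of everything predicted as this class, how much was right?", while recall asks "of everything truly in this class, how much did we find?". The F1-score is the harmonic mean of the two.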


We achieved 90% for the precision, recall, F1-score, and overall accuracy! There was a great balance to our results, meaning our model has the same levels of accuracy for both negative and positive movie reviews. Let's go back and adjust some of those hyperparameters to see how the accuracy of the model changes. We will copy the exact same code as before, but change some hyperparameters.

## Experimenting with hyperparameters

Hyperparameters let us fine-tune the model to our particular application. Adjusting and experimenting is a vital part of creating the best model possible. Let's start by turning off the sublinear log scaling we are applying to our term frequency.

In [12]:
vectorizer = TfidfVectorizer(min_df = 1, max_df = 1.0, sublinear_tf = False, use_idf = True)
training_vectors = vectorizer.fit_transform(reviews_training['Content'])
testing_vectors = vectorizer.transform(reviews_test['Content'])

classifier = svm.SVC()
classifier.fit(training_vectors, reviews_training['Label'])

predictions = classifier.predict(testing_vectors)
report = classification_report(reviews_test["Label"],predictions,output_dict=True)
accuracy = accuracy_score(reviews_test["Label"], predictions)

print('Positives: ', report['pos'])
print('Negatives: ', report['neg'])
print("Overall accuracy: ", accuracy)

Positives:  {'precision': 0.8571428571428571, 'recall': 0.84, 'f1-score': 0.8484848484848485, 'support': 100}
Negatives:  {'precision': 0.8431372549019608, 'recall': 0.86, 'f1-score': 0.8514851485148515, 'support': 100}
Overall accuracy:  0.85


All of our scores decreased! Looks like taking weight away from words that repeat often within a review really was helping the model. Let's change that back to `True`. Next, let's experiment with increasing `min_df` to 10, so a word must appear in at least 10 reviews to be kept. Maybe the very rare words are making the model too complicated, and removing them from the vocabulary would help the model generalize.

In [13]:
vectorizer = TfidfVectorizer(min_df = 10, max_df = 1.0, sublinear_tf = True, use_idf = True)
training_vectors = vectorizer.fit_transform(reviews_training['Content'])
testing_vectors = vectorizer.transform(reviews_test['Content'])

classifier = svm.SVC()
classifier.fit(training_vectors, reviews_training['Label'])

predictions = classifier.predict(testing_vectors)
report = classification_report(reviews_test["Label"],predictions,output_dict=True)
accuracy = accuracy_score(reviews_test["Label"], predictions)

print('Positives: ', report['pos'])
print('Negatives: ', report['neg'])
print("Overall accuracy: ", accuracy)

Positives:  {'precision': 0.9175257731958762, 'recall': 0.89, 'f1-score': 0.9035532994923858, 'support': 100}
Negatives:  {'precision': 0.8932038834951457, 'recall': 0.92, 'f1-score': 0.9064039408866995, 'support': 100}
Overall accuracy:  0.905


There was a very slight increase in accuracy, but it did not seem to help the model much. Increasing `min_df` to more than 10 caused a decrease in the accuracy as well. I will set it back to 1 for now. Let's try lowering the `max_df` parameter to take away the most frequent words from the vocabulary, and see if that helps. Let's set it to 0.8, which means a word that appears in more than 80% of the reviews will not be included.

In [14]:
vectorizer = TfidfVectorizer(min_df = 1, max_df = 0.8, sublinear_tf = True, use_idf = True)
training_vectors = vectorizer.fit_transform(reviews_training['Content'])
testing_vectors = vectorizer.transform(reviews_test['Content'])

classifier = svm.SVC()
classifier.fit(training_vectors, reviews_training['Label'])

predictions = classifier.predict(testing_vectors)
report = classification_report(reviews_test["Label"],predictions,output_dict=True)
accuracy = accuracy_score(reviews_test["Label"], predictions)

print('Positives: ', report['pos'])
print('Negatives: ', report['neg'])
print("Overall accuracy: ", accuracy)

Positives:  {'precision': 0.898989898989899, 'recall': 0.89, 'f1-score': 0.8944723618090452, 'support': 100}
Negatives:  {'precision': 0.8910891089108911, 'recall': 0.9, 'f1-score': 0.8955223880597015, 'support': 100}
Overall accuracy:  0.895


This also does not seem to help! I performed further iterations, trying thresholds closer to 1.0 (100%), and performance improved as the threshold rose. This makes sense, because the most common words are already being down-weighted by the log scaling of the term frequencies and the inverse document frequency calculation, so removing them outright gains little. Lastly, let's try turning off the inverse document frequency calculation to see how important that is.

In [15]:
vectorizer = TfidfVectorizer(min_df = 1, max_df = 1.0, sublinear_tf = True, use_idf = False)
training_vectors = vectorizer.fit_transform(reviews_training['Content'])
testing_vectors = vectorizer.transform(reviews_test['Content'])

classifier = svm.SVC()
classifier.fit(training_vectors, reviews_training['Label'])

predictions = classifier.predict(testing_vectors)
report = classification_report(reviews_test["Label"],predictions,output_dict=True)
accuracy = accuracy_score(reviews_test["Label"], predictions)

print('Positives: ', report['pos'])
print('Negatives: ', report['neg'])
print("Overall accuracy: ", accuracy)

Positives:  {'precision': 0.8979591836734694, 'recall': 0.88, 'f1-score': 0.888888888888889, 'support': 100}
Negatives:  {'precision': 0.8823529411764706, 'recall': 0.9, 'f1-score': 0.8910891089108911, 'support': 100}
Overall accuracy:  0.89


Our accuracy did decrease, but not by too much! With some experimenting done, let's set the model back to the hyperparameters that achieved the best performance.

In [16]:
vectorizer = TfidfVectorizer(min_df = 10, max_df = 1.0, sublinear_tf = True, use_idf = True)
training_vectors = vectorizer.fit_transform(reviews_training['Content'])
testing_vectors = vectorizer.transform(reviews_test['Content'])

classifier = svm.SVC()
classifier.fit(training_vectors, reviews_training['Label'])

predictions = classifier.predict(testing_vectors)
report = classification_report(reviews_test["Label"],predictions,output_dict=True)
accuracy = accuracy_score(reviews_test["Label"], predictions)

print('Positives: ', report['pos'])
print('Negatives: ', report['neg'])
print("Overall accuracy: ", accuracy)

Positives:  {'precision': 0.9175257731958762, 'recall': 0.89, 'f1-score': 0.9035532994923858, 'support': 100}
Negatives:  {'precision': 0.8932038834951457, 'recall': 0.92, 'f1-score': 0.9064039408866995, 'support': 100}
Overall accuracy:  0.905
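
The experiments above repeat the same vectorize, train, predict, and score steps with one setting changed each time. One way to avoid the copy-pasting is to wrap those steps in a helper, sketched below. The `evaluate_vectorizer` function and the tiny toy reviews are assumptions for illustration, so the scores it prints say nothing about the real dataset.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

def evaluate_vectorizer(train_texts, train_labels, test_texts, test_labels,
                        **vectorizer_params):
    """Vectorize, train a default SVC, and return accuracy on the test texts."""
    vectorizer = TfidfVectorizer(**vectorizer_params)
    training_vectors = vectorizer.fit_transform(train_texts)
    testing_vectors = vectorizer.transform(test_texts)
    classifier = svm.SVC()
    classifier.fit(training_vectors, train_labels)
    predictions = classifier.predict(testing_vectors)
    return accuracy_score(test_labels, predictions)

# Toy data (hypothetical), standing in for the real review columns.
train_texts = [
    "a wonderful and moving film",
    "brilliant acting and a wonderful script",
    "a moving story with brilliant direction",
    "a dull and tedious film",
    "tedious plot and awful acting",
    "an awful script with dull direction",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
test_texts = ["a brilliant and moving script", "a dull and awful story"]
test_labels = ["pos", "neg"]

# Each dictionary is one experiment.
settings = [
    {"sublinear_tf": True, "use_idf": True},
    {"sublinear_tf": False, "use_idf": True},
    {"sublinear_tf": True, "use_idf": False},
]

scores = []
for params in settings:
    score = evaluate_vectorizer(train_texts, train_labels,
                                test_texts, test_labels, **params)
    scores.append(score)
    print(params, score)
```

With the real data, the same function could be called with `reviews_training['Content']`, `reviews_training['Label']`, and the matching test columns.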


We are now ready to use our model on some new movie reviews we can write ourselves.

## Using our model on new movie reviews

Now it's time to try the model on some new movie reviews that we will write! I wrote one short positive review, used the same vectorizer to transform it, and then printed out the prediction the model made.

In [17]:
new_review = "Although at times dragging with mundane realism, Sweeney managed to utilize these awkward gaps and conversations to captivate us with emotion"
review_vector = vectorizer.transform([new_review])
print("The sentiment of this review is ", classifier.predict(review_vector))

The sentiment of this review is  ['pos']


Impressive! The model managed to predict the positive sentiment of my own movie review. Let's try a negative one and see how it does.

In [18]:
another_review = 'Although once novel in its inception, it fails to evolve beyond the familiar tropes and pit-falls of its genre'
review_vector = vectorizer.transform([another_review])
print("The sentiment of this review is ", classifier.predict(review_vector))

The sentiment of this review is  ['neg']


The model succeeded once again! These were rather simple reviews, but I am very happy with the model's performance here. Go ahead and modify these reviews and try writing your own. See how it does!
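
When trying many reviews of your own, it is easy to forget the `vectorizer.transform` step. A scikit-learn `Pipeline` bundles the vectorizer and the classifier so raw text can be passed straight to `predict`. The sketch below is one possible arrangement using made-up toy reviews; the `model` name and the hyperparameters shown are assumptions, not the notebook's trained model.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical).
train_texts = [
    "a wonderful and moving film",
    "brilliant acting and a wonderful script",
    "a moving story with brilliant direction",
    "a dull and tedious film",
    "tedious plot and awful acting",
    "an awful script with dull direction",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# The pipeline vectorizes the text and then classifies it in one call.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), svm.SVC())
model.fit(train_texts, train_labels)

new_review = "a wonderful story with brilliant acting"
prediction = model.predict([new_review])
print("The sentiment of this review is ", prediction)
```

Because the fitted vectorizer lives inside the pipeline, the same vocabulary is always applied to new text.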

# Conclusion

In this workbook we used a data set of 2000 movie reviews to train a model to perform sentiment analysis. We wanted to create a model that could automatically determine whether a review was positive or negative. We used the `TfidfVectorizer` to vectorize each review and feed it into a support vector machine classifier. We achieved roughly 90% accuracy, and tested the model on our own reviews.

We discovered that log scaling of the term frequencies helped the model ignore the more common words and achieve better performance. We also saw that inverse document frequency weighting was important for the same reasons, and that combining the two achieved the best performance. For further testing, we should find another labeled movie review data set and test our model with that. This is a very common starting use case in learning natural language processing and sentiment analysis, and I am confident there are many data sets available. With new data, we can further fine-tune the model hyperparameters. We should also experiment with different linear classifiers and their hyperparameters. For this notebook, I wanted to focus exclusively on the `TfidfVectorizer` parameters.