# Week 3: Lab - Natural Language Processing
## Sentiment Analysis of Movie Reviews

Welcome to Week 3! In today's lab, you will learn about Natural Language Processing (*NLP*). We will compare three methods of featurizing text data:
* `CountVectorizer` (Bag of Words)
* `TfidfVectorizer` (TF-IDF)
* `Doc2Vec` 

in order to perform **sentiment analysis** on the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

### Input Format

We can't directly input the raw reviews from the Cornell movie review data repository. Instead, we have to "clean them up" by:
1. Converting everything to lower case
2. Removing punctuation
3. Removing common words (stop words)
4. Stemming

'Cleaning up' text is an important **Data Pre-processing** step in NLP, and is crucial to getting good results. Just as we do with numerical features (e.g., filling NA values with a mean), we need to make sure that the words we are going to use as features are consistently formatted and don't carry information that will end up being unnecessary.

To practise, we are going to perform the above 4 steps on the sample movie review below.

In [1]:
movie_review = """Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!"""
print(movie_review)

Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!


#### The first step is to lowercase it
You can do this using the string's `.lower()` method. Try it out on `movie_review`, and print it to see the result.

In [2]:
movie_review = movie_review.lower()

Next, we need to remove punctuation. From the `string` module, import `punctuation`.
Print `punctuation` to see the punctuation marks it contains.

In [3]:
from string import punctuation

The way we remove punctuation from a string is by creating a `translator` object, and then calling `.translate` on our string using the `translator` object.

Create a `translator` object by calling `str.maketrans('', '', punctuation)`.

In [4]:
tran = str.maketrans('', '', punctuation)

Now, call `.translate` on your `movie_review` and pass it your `translator` object. Then, print your movie review.

In [5]:
movie_review = movie_review.translate(tran)

You should see that all the punctuation has been removed!

If you want to understand why/how this works, check out these posts:
* https://www.tutorialspoint.com/python/string_maketrans.htm
* https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
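As a minimal sketch of the same idea on a single sentence: the third argument to `str.maketrans` lists the characters to delete, and `.translate` applies that table. The variable names `sample`, `translator` and `cleaned` are chosen for this illustration only.

```python
from string import punctuation

# One sentence from the sample review, used only for illustration.
sample = "It's vulgar, provocative, witty and sharp."

# The third argument lists characters that translate() should delete.
translator = str.maketrans('', '', punctuation)
cleaned = sample.translate(translator)

print(cleaned)  # Its vulgar provocative witty and sharp
```

Note that the apostrophe is removed too, which is why `It's` becomes `Its` (and `doesn't` becomes `doesnt` in the full review).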

#### Remove stop words

Next we will remove common words. In NLP we want to find the things that distinguish different sets of texts from one another. We can make that easier by removing words that are common to ALL texts (and, is, are, etc.).

From `sklearn.feature_extraction.text` import `ENGLISH_STOP_WORDS` (the older `sklearn.feature_extraction.stop_words` module has been removed from recent scikit-learn releases). Then, print `ENGLISH_STOP_WORDS` to see a list of common stop words.

In [6]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

We want to remove the above words from `movie_review`. First, convert `movie_review` into a list by calling the `.split()` method. Call your new object `split_review`.

In [7]:
split_review = movie_review.split()
split_review

['bromwell',
 'high',
 'is',
 'nothing',
 'short',
 'of',
 'brilliant',
 'expertly',
 'scripted',
 'and',
 'perfectly',
 'delivered',
 'this',
 'searing',
 'parody',
 'of',
 'a',
 'students',
 'and',
 'teachers',
 'at',
 'a',
 'south',
 'london',
 'public',
 'school',
 'leaves',
 'you',
 'literally',
 'rolling',
 'with',
 'laughter',
 'its',
 'vulgar',
 'provocative',
 'witty',
 'and',
 'sharp',
 'the',
 'characters',
 'are',
 'a',
 'superbly',
 'caricatured',
 'cross',
 'section',
 'of',
 'british',
 'society',
 'or',
 'to',
 'be',
 'more',
 'accurate',
 'of',
 'any',
 'society',
 'following',
 'the',
 'escapades',
 'of',
 'keisha',
 'latrina',
 'and',
 'natella',
 'our',
 'three',
 'protagonists',
 'for',
 'want',
 'of',
 'a',
 'better',
 'term',
 'the',
 'show',
 'doesnt',
 'shy',
 'away',
 'from',
 'parodying',
 'every',
 'imaginable',
 'subject',
 'political',
 'correctness',
 'flies',
 'out',
 'the',
 'window',
 'in',
 'every',
 'episode',
 'if',
 'you',
 'enjoy',
 'shows',
 'tha

Now, use a `for` loop (or a list comprehension) to create a new list called `clean_words`. Go through every word in `split_review`; if the word is not in `ENGLISH_STOP_WORDS`, append it to `clean_words`.

In [8]:
clean_words = [word for word in split_review if word not in ENGLISH_STOP_WORDS]
len(clean_words)

65
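The list comprehension in the cell above is equivalent to an explicit `for` loop. The sketch below shows the loop form on a short phrase. To stay self-contained it uses a tiny, hypothetical stop-word set (`stop_words`) instead of scikit-learn's `ENGLISH_STOP_WORDS`, so its result differs from what the real list would give.

```python
# Hypothetical, tiny stop-word set used only for this illustration;
# the lab itself uses scikit-learn's much larger ENGLISH_STOP_WORDS.
stop_words = {"is", "of", "and", "a", "the"}

split_review = "bromwell high is nothing short of brilliant".split()

clean_words = []
for word in split_review:
    # Keep only the words that are not stop words.
    if word not in stop_words:
        clean_words.append(word)

print(clean_words)
```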

Finally, put the clean words back together to re-create `movie_review` by calling the `.join` method on a single space and passing it `clean_words`.

In [9]:
movie_review = ' '.join(clean_words)
movie_review

'bromwell high short brilliant expertly scripted perfectly delivered searing parody students teachers south london public school leaves literally rolling laughter vulgar provocative witty sharp characters superbly caricatured cross section british society accurate society following escapades keisha latrina natella protagonists want better term doesnt shy away parodying imaginable subject political correctness flies window episode enjoy shows arent afraid poke fun taboo subject imaginable bromwell high disappoint'

Your printed movie review should match the output above: only the non-stop words remain, in their original order.

#### Stem words
Finally, we will "stem" the words so that we take away the differences between words like "expertly" and "expert", since they share the same root. Read more on stemming [here](https://en.wikipedia.org/wiki/Stemming).

We will use the `SnowballStemmer` class. Import it from `nltk.stem.snowball`.

In [10]:
from nltk.stem.snowball import SnowballStemmer

`SnowballStemmer` takes a language as an argument. Since we are working with English, create a `SnowballStemmer` object and pass it `'english'` as the language. Call your object `stemmer`.

In [11]:
stemmer = SnowballStemmer('english')
stemmer

<nltk.stem.snowball.SnowballStemmer at 0x7fa46cd115c0>

To check the stem of a word, call `stemmer.stem()`. Try it out with a word such as `running` and see what it returns! The cell below uses `stemmed_words` as its example.

In [12]:
stemmer.stem('stemmed_words')

'stemmed_word'

Now, similar to how we removed the stop words, we want to now go through our review and stem each word. So:
* Turn your `movie_review` back into a list using `.split()`
* Create an empty list called `stemmed_words`
* Use a `for` loop to go through every word in your `movie_review` and call `stemmer.stem` on it.
* Append the newly stemmed word to your `stemmed_words` list
* Finally, re-create movie_review into a string by calling `.join` using a space as your separator.

In [13]:
# Stem every word in the review, then join the stems back into one string.
stemmed_words = [stemmer.stem(word) for word in movie_review.split()]
movie_review = ' '.join(stemmed_words)

Print your final movie review! It should look like this:

In [14]:
movie_review

'bromwel high short brilliant expert script perfect deliv sear parodi student teacher south london public school leav liter roll laughter vulgar provoc witti sharp charact superbl caricatur cross section british societi accur societi follow escapad keisha latrina natella protagonist want better term doesnt shi away parodi imagin subject polit correct fli window episod enjoy show arent afraid poke fun taboo subject imagin bromwel high disappoint'

#### Put it all together
We can put all the steps above together in a function, like this (pseudo-code given):

In [15]:
# def clean_text(raw_text):
#     initialize empty clean_words list
#     make raw_text lower case
#     remove punctuation from raw_text
#     split_words into a list

#     for word in split_words:
#         if word not in ENGLISH_STOP_WORDS:
#             make stemmed_word
#             append stemmed_word to clean_words
    
#     return ' '.join(clean_words)
def clean_text(raw_text):
    processing_text = raw_text.lower()
    tran = str.maketrans('', '', punctuation)
    processing_text = processing_text.translate(tran)
    stemmer = SnowballStemmer('english')
    clean_words = [ stemmer.stem(word) for word in processing_text.split() if word not in ENGLISH_STOP_WORDS ]
    return ' '.join(clean_words)
movie_review_test = """Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!"""
clean_text(movie_review_test)

'bromwel high short brilliant expert script perfect deliv sear parodi student teacher south london public school leav liter roll laughter vulgar provoc witti sharp charact superbl caricatur cross section british societi accur societi follow escapad keisha latrina natella protagonist want better term doesnt shi away parodi imagin subject polit correct fli window episod enjoy show arent afraid poke fun taboo subject imagin bromwel high disappoint'

Let's see how it works. Here is an unclean review:

In [16]:
unclean_review = """Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!"""
print(unclean_review)

Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!


Now clean it by calling the `clean_text` function above and make sure you notice the difference!

In [17]:
clean_text(unclean_review)

'bromwel high short brilliant expert script perfect deliv sear parodi student teacher south london public school leav liter roll laughter vulgar provoc witti sharp charact superbl caricatur cross section british societi accur societi follow escapad keisha latrina natella protagonist want better term doesnt shi away parodi imagin subject polit correct fli window episod enjoy show arent afraid poke fun taboo subject imagin bromwel high disappoint'

## Let's now clean up all our data!

Our data can be found in the file `all_reviews_small.csv` (some of the cleaning steps have already been done).

Import `pandas` and read `all_reviews_small.csv` into a dataframe. Call it `df_reviews`.

In [18]:
import pandas as pd
df_reviews = pd.read_csv('all_reviews_small.csv')
df_reviews.head()

|   | label | train_test_split | text |
|---|-------|------------------|------|
| 0 | pos | train | bromwell high is a cartoon comedy it ran at th... |
| 1 | pos | train | homelessness or houselessness as george carlin... |
| 2 | pos | train | brilliant over acting by lesley ann warren bes... |
| 3 | pos | train | this is easily the most underrated film inn th... |
| 4 | pos | train | this is not the typical mel brooks film it was... |


Print the `head` and `shape`. You should see 4000 reviews, with 3 columns.

In [19]:
df_reviews.shape

(4000, 3)

In [20]:
df_reviews.columns.values

array(['label', 'train_test_split', 'text'], dtype=object)

Now apply our `clean_text` function to all the reviews! Store the clean reviews in a new column called `clean_text`.

In [21]:
df_reviews['clean_text'] = df_reviews.loc[:, 'text'].apply(clean_text)

Check the `head` again to see your new dataframe's `clean_text` column.

In [22]:
df_reviews.head()
df_reviews.loc[:, 'label'].unique()

array(['pos', 'neg'], dtype=object)

## Bag of Words
In scikit-learn, the `CountVectorizer` class implements the Bag of Words model. Import it from `sklearn.feature_extraction.text`, and create a `CountVectorizer()` object called `count_vect`.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

Now we want to convert all our `clean_text` reviews into a bag of words representation. Call `count_vect.fit_transform` on all our clean reviews (i.e. `df_reviews['clean_text']`) to do this. Save the result in `bag_of_words`.

In [24]:
bag_of_words = count_vect.fit_transform(df_reviews['clean_text'])

Print your `bag_of_words`!

In [25]:
bag_of_words

<4000x20719 sparse matrix of type '<class 'numpy.int64'>'
	with 325930 stored elements in Compressed Sparse Row format>

Because our vocabulary is so large, CountVectorizer creates a sparse matrix for memory efficiency. Check `bag_of_words.shape`. You should see 4000 vectors, each with a dimension of 20719.

In [26]:
bag_of_words.shape

(4000, 20719)

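`bag_of_words` is now a 4000 x 20,719 feature matrix, where every row is a movie review and every column holds the number of times the word represented by that column appears in the review. The words can be found using the `.get_feature_names()` method (renamed `.get_feature_names_out()` in scikit-learn 1.0; the old name was removed in 1.2).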
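To make the row/column layout concrete, here is a minimal pure-Python sketch of a bag of words built with `collections.Counter`. The two reviews in `docs` are hypothetical, and this is only an illustration of the idea, not how `CountVectorizer` is implemented (for instance, `CountVectorizer` returns a sparse matrix and by default ignores one-character tokens).

```python
from collections import Counter

# Two hypothetical, already-cleaned reviews (not from the lab's dataset).
docs = ["good movi good plot", "bad movi"]

# The vocabulary is the alphabetically sorted set of all words;
# each word's position in the list is its column index.
vocabulary = sorted({word for doc in docs for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Each row holds the per-word counts for one review.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)     # ['bad', 'good', 'movi', 'plot']
print(matrix)         # [[0, 2, 1, 1], [1, 0, 1, 0]]
print(word_to_index)  # plays the role of count_vect.vocabulary_
```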

Check the word at index 200 in `count_vect`. It should be `adulthood`!

In [27]:
count_vect.get_feature_names()[200]

'adulthood'

Likewise, we can do the opposite using `.vocabulary_.get()`. Check `'adulthood'`; its index should be 200.

In [28]:
count_vect.vocabulary_.get('adulthood')

200

## TF-IDF

In scikit-learn, the `TfidfVectorizer` class implements the TF-IDF model. Import it from `sklearn.feature_extraction.text`, and create a `TfidfVectorizer()` object called `tf_idf_vect`.

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer()

Again, just like you did with the `CountVectorizer`, fit and transform your clean text reviews and store the result in a variable called `tf_idf`. 

In [30]:
tf_idf = tf_idf_vect.fit_transform(df_reviews['clean_text'])

Print `tf_idf` and notice how the values are different. Print the `shape` to confirm that the dimensions are the same as `bag_of_words`.

In [31]:
tf_idf.shape
print(tf_idf)

  (0, 2278)	0.585588780724
  (0, 8284)	0.265921160639
  (0, 2762)	0.0902999132203
  (0, 3537)	0.0563874590305
  (0, 14695)	0.0945774478549
  (0, 18420)	0.032304584135
  (0, 14289)	0.0916098400766
  (0, 15898)	0.204555821375
  (0, 10470)	0.0473356073499
  (0, 18066)	0.361199652881
  (0, 20542)	0.0443336898392
  (0, 18065)	0.0906180654526
  (0, 14274)	0.109971880498
  (0, 10300)	0.0568227934732
  (0, 1562)	0.0509203455254
  (0, 15781)	0.0949876038184
  (0, 3377)	0.0981740657368
  (0, 14793)	0.0775818392717
  (0, 15956)	0.127698101194
  (0, 17782)	0.0858828642895
  (0, 6429)	0.105583461765
  (0, 9067)	0.0906180654526
  (0, 17536)	0.312759136853
  (0, 15279)	0.0536901910038
  (0, 13360)	0.0876761686291
  :	:
  (3999, 19905)	0.0916471906954
  (3999, 12051)	0.1126913754
  (3999, 18145)	0.220035597446
  (3999, 2181)	0.0854704855585
  (3999, 8399)	0.0939076675673
  (3999, 9101)	0.101129378855
  (3999, 16251)	0.0986796855085
  (3999, 20030)	0.104072028647
  (3999, 14070)	0.0956348466385
  (3999

Again, because our dataset has so many unique words, `TfidfVectorizer` creates a sparse matrix.

This matrix will again be 4000 x 20,719, where each entry is the term frequency (the number of times that word appears in the review) multiplied by the inverse document frequency (roughly, the total number of reviews divided by the number of reviews the word appears in). With its default settings, scikit-learn takes a smoothed logarithm of that ratio and then scales each row to unit length.
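The sketch below works through that calculation by hand on three hypothetical mini-reviews. It assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency of a word is `ln((1 + n) / (1 + df)) + 1`; the helper names `idf` and `tfidf_row` are made up for this illustration.

```python
import math
from collections import Counter

# Three hypothetical, already-cleaned and tokenized reviews.
docs = [["good", "movi"], ["bad", "movi"], ["good", "plot"]]
n_docs = len(docs)

# Document frequency: the number of reviews each word appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_row(doc):
    # Term frequency times idf, then scale the row to unit (L2) length.
    counts = Counter(doc)
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = [tfidf_row(doc) for doc in docs]
for row in weights:
    print(row)
```

In the second review, `bad` (which appears in one review) receives a larger weight than `movi` (which appears in two), even though each occurs once in that review.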

Let's compare the differences in the two feature sets for one of our reviews.

Print the word at index 9084 in either of our vectorizers (`count_vect` or `tf_idf_vect`). It should be `inspir`.

In [32]:
tf_idf_vect.get_feature_names()[9084]

'inspir'

Let's see how often it appears in the review at row index 1: print the value of `(1, 9084)` in your `bag_of_words`.

In [33]:
bag_of_words[(1, 9084)]

1

You should see `1` !

What about its tf-idf value?

In [34]:
tf_idf[(1, 9084)]

0.045896271833222251

You should see `0.04589627183322225`.

Notice how much smaller it is? This suggests the word appears in a good number of other reviews; the scaling of each row to unit length also shrinks the values.

### Classification

Now, let's build a classifier, train it on our features, and test it. We'll use a Logistic Regression classifier.

First, do this for the `CountVectorizer`. Use `train_test_split` with `test_size=0.1` and `random_state=42`. Your features will simply be your `bag_of_words` and your labels will be `df_reviews['label']`. Because our data is balanced, you can use `accuracy_score` if you like to check the accuracy of your classifier.

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(bag_of_words, df_reviews['label'], test_size=0.1, random_state=42)
lr = LogisticRegression().fit(X_train, y_train)
y_predict = lr.predict(X_test)
print(accuracy_score(y_test, y_predict))

0.8875


88.75 %, not bad! 

Now let's try using our `TfidfVectorizer` and see if it performs better. The only difference is that your features are `tf_idf`. Note that the cell below uses `random_state=9`, so its test split differs from the one above; use `random_state=42` for a like-for-like comparison.

In [36]:
X_train, X_test, y_train, y_test = train_test_split(tf_idf, df_reviews['label'], test_size=0.1, random_state=9)
lr_tf = LogisticRegression().fit(X_train, y_train)
y_predict = lr_tf.predict(X_test)
print(accuracy_score(y_test, y_predict))

0.92


92%! So on this split, tf-idf is a little more accurate than bag of words.

## Doc2Vec

The `Doc2Vec` documentation can be found here:<br>
https://radimrehurek.com/gensim/models/doc2vec.html

A readable, easy introduction to `Doc2Vec` is available in this medium article:<br>
https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

You don't need to understand every detail of how `Doc2Vec` works; it's more important that you understand how to use it -- and *that* will be the goal of this lab.

## Setup

First of all, you need to install `gensim`, which is the module that contains `Doc2Vec`. Open up your Terminal (on Mac) or Command Prompt (on Windows) and type in the following:

`pip install -U gensim`

### Modules

We use `gensim`, since it has a much more readable implementation of `Word2Vec` (and `Doc2Vec`). We also use `numpy` for general array manipulation, and `sklearn` for the Logistic Regression classifier.

First, from `gensim.models` import `Doc2Vec`.

In [37]:
from gensim.models import Doc2Vec

Next, import the usual suspects: `numpy`, and `LogisticRegression` from `sklearn.linear_model`

In [45]:
import numpy as np
from sklearn.linear_model import LogisticRegression

### Building a Doc2Vec model

To train Doc2Vec, each 'document' (the lyrics of a song, the words in an email, etc.) needs to be fully 'cleaned' (no punctuation, stemmed, etc.) and placed on its own line in a `.txt` file. In our case, we have 50,000 movie reviews, split into 4 different `.txt` files:

- `test-neg.txt`: 12500 negative movie reviews from the test data
- `test-pos.txt`: 12500 positive movie reviews from the test data
- `train-neg.txt`: 12500 negative movie reviews from the training data
- `train-pos.txt`: 12500 positive movie reviews from the training data

#### Check out the above text files and briefly go through them.

You can look at the `Doc2VecHelperFunctions.ipynb` file if you are curious to see how the text files are converted into our Doc2Vec model, `imdb.d2v`.

If you're curious about the parameters, do read the Doc2Vec/Word2Vec [documentation](https://radimrehurek.com/gensim/models/doc2vec.html).

For this lab, the model is already prepared. It is named `imdb.d2v`. Load it by calling `Doc2Vec.load('./imdb.d2v')` and save it in a variable called `model`.

In [46]:
model = Doc2Vec.load('imdb.d2v')

### Inspecting the Model

Let's see what our model gives. If we want to see what words are most 'similar' to `'good'`, we can call `model.wv.most_similar('good')` on our model. Try it out!

In [47]:
model.wv.most_similar('good')

[('decent', 0.7383522391319275),
 ('great', 0.7068538665771484),
 ('bad', 0.6740086078643799),
 ('fine', 0.6538171768188477),
 ('solid', 0.6522965431213379),
 ('nice', 0.6312328577041626),
 ('excellent', 0.58672696352005),
 ('terrific', 0.5646074414253235),
 ('poor', 0.5573974847793579),
 ('strong', 0.5290022492408752)]

Are some of the words above used in similar ways to the word 'good'? If yes, that means our model has captured something of the *meaning* of the word `good`. (Notice that `bad` and `poor` also appear: words used in similar contexts end up close together, even when they are opposites.) This is really awesome (and important), since we are doing sentiment analysis.
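The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of what that measure does, the code below ranks two words against `good` using hypothetical 3-dimensional vectors; the numbers in `toy_vectors` are invented for illustration and do not come from `imdb.d2v`.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, made up for this example.
toy_vectors = {
    "good": [1.0, 2.0, 0.5],
    "decent": [0.9, 2.1, 0.4],
    "window": [-2.0, 0.5, 1.0],
}

query = toy_vectors["good"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "good"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

A vector pointing in nearly the same direction as the query scores close to 1, while an unrelated or opposing direction scores near 0 or below.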

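We can also look deeper and see what the model actually contains. Each review's feature vector is looked up by its tag; for example, `model['TRAIN_NEG_0']` is the first review in the training set for negative reviews (in gensim 4.x, document vectors are looked up with `model.dv['TRAIN_NEG_0']`). The cell below looks up two reviews; only the last expression in a cell is displayed, so the output is the vector for `TRAIN_POS_1`: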

In [59]:
model['TRAIN_NEG_2']
model['TRAIN_POS_1']


array([ 0.15591073, -1.00769353, -0.29605961,  2.18465304,  0.9532637 ,
       -3.87005091, -2.33878136, -1.34991467,  1.09373116,  2.16895318,
       -0.15583986, -0.20290834,  0.96698886,  2.25928164,  0.38636756,
        1.23074746,  1.31522524,  0.01632519,  2.00626898, -2.68454742,
        0.7009055 ,  1.54163992, -0.50989616, -0.36360747, -0.65092176,
        0.82744628,  1.19471276, -0.33523622, -1.06338334, -0.53705174,
       -0.6888628 ,  1.54129934, -0.77872807, -0.26769787,  0.11388908,
       -2.11309719,  3.09696221,  0.94127595,  1.019858  ,  1.09815776,
       -0.19643727,  2.81080961, -1.01270771, -1.69791448, -0.71819431,
        0.82034248,  1.07061815, -0.72113127,  0.57014245, -2.12918043,
        1.90972507,  3.25470948, -0.14005719,  0.7353341 , -1.21092224,
       -1.59447384,  1.04394937, -2.00083256,  3.26442623,  0.95649093,
       -0.29830626, -1.43709934,  0.14037158,  0.28651956, -0.08995567,
        0.83530617, -0.67329133,  0.7457509 ,  1.15693736, -2.04

In [51]:
type(model)

gensim.models.doc2vec.Doc2Vec

## Classifying Sentiments

### Training Vectors

Now let's use these vectors to train a classifier. First, we must extract the training vectors. Remember that we have a total of 25000 training reviews, with equal numbers of positive and negative ones (12500 positive, 12500 negative). There are two parallel arrays, one containing the vectors (`train_arrays`) and the other containing the labels (`train_labels`). We simply put the positive ones in the first half of the array, and the negative ones in the second half.

We will use a `for` loop to go through all `25000` training reviews, adding the vector for each review to `train_arrays` and its corresponding label (`1` for a positive review, and `0` for a negative review) to `train_labels`.

#### Read the code below and ask your Instructor/TA if you have any questions!

In [50]:
train_arrays = np.zeros((25000, 100))
train_labels = np.zeros(25000)

for i in range(12500):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[i] = model[prefix_train_pos]
    train_arrays[12500 + i] = model[prefix_train_neg]
    train_labels[i] = 1
    train_labels[12500 + i] = 0
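To see the layout this loop produces, here is a minimal sketch of the same indexing with only 3 reviews per class instead of 12500. It stores the tags themselves rather than model vectors, so it runs without `imdb.d2v`; `n_per_class`, `tags` and `labels` are names chosen for this illustration.

```python
n_per_class = 3  # the lab uses 12500; 3 keeps the printout short

tags = [None] * (2 * n_per_class)
labels = [0] * (2 * n_per_class)

for i in range(n_per_class):
    # Positive reviews fill the first half, negative reviews the second.
    tags[i] = 'TRAIN_POS_' + str(i)
    tags[n_per_class + i] = 'TRAIN_NEG_' + str(i)
    labels[i] = 1
    labels[n_per_class + i] = 0

for tag, label in zip(tags, labels):
    print(tag, label)
```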

Print `train_arrays`. You should see rows and rows of vectors, one for each review.

In [52]:
print(train_arrays)

[[-0.09523074  0.10516991 -0.07066526 ..., -1.50765908  0.37817046
   0.45435163]
 [ 0.15591073 -1.00769353 -0.29605961 ..., -1.45913517  1.49660051
   1.72079444]
 [-0.49689472 -0.63923281 -1.31833351 ..., -2.12929225  0.9443326
   0.63289094]
 ..., 
 [-0.2536512  -0.89831948 -0.24197805 ...,  1.50290143  1.01230037
  -0.3398996 ]
 [-2.00854516  0.64646685 -0.45022076 ...,  1.535079    0.13337763
   0.06628666]
 [-0.50857788  0.85919422 -0.78979629 ..., -0.4446539   1.05848455
   0.50058913]]


Print `train_labels`. They are simply category labels for the review vectors -- 1 representing positive and 0 for negative.

In [53]:
print(train_labels)

[ 1.  1.  1. ...,  0.  0.  0.]


### Testing Vectors

We do the same for testing data -- data that we are going to feed to the classifier after we've trained it using the training data. This allows us to evaluate our results. The process is pretty much the same as extracting the results for the training data.

#### Read the code below and ask your Instructor/TA if you have any questions!

In [55]:
test_arrays = np.zeros((25000, 100))
test_labels = np.zeros(25000)

for i in range(12500):
    prefix_test_pos = 'TEST_POS_' + str(i)
    prefix_test_neg = 'TEST_NEG_' + str(i)
    test_arrays[i] = model[prefix_test_pos]
    test_arrays[12500 + i] = model[prefix_test_neg]
    test_labels[i] = 1
    test_labels[12500 + i] = 0

### Classification

Now, train a logistic regression classifier using the training data.

Create a LogisticRegression Classifier, and `fit` it to your `train_arrays` and `train_labels`.

In [56]:
lr_doc2vec = LogisticRegression().fit(train_arrays, train_labels)

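Measure the accuracy of your classifier on the test data: pass `test_labels` and the predictions for `test_arrays` to `accuracy_score` (calling `lr_doc2vec.score(test_arrays, test_labels)` gives the same number).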

In [57]:
score = accuracy_score(test_labels, lr_doc2vec.predict(test_arrays))
score

0.86439999999999995

You should see that we have achieved about 86% accuracy for sentiment analysis.

Finally, if you have time, try running your classifier on a bunch of individual reviews and see if you agree with the predictions! You can do this in the following steps:
* Choose a review from one of the `.txt` files.
* You can grab its corresponding vector by using the correct index in your model.
    * For example, for the 3rd negative test review, the feature vector is `model['TEST_NEG_2']`
* Call `lr_doc2vec.predict` on your feature vector to see the prediction (you may have to use `.reshape(1, -1)` to turn the single vector into a one-row 2D array).
    * A result of 1 means it's a positive review, and 0 means negative.
* Do you agree :) ?

In [40]:
# Try a random review

Again, ask your Instructor or TA if you have any questions. Good luck on the Assignment!

## References

- Doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
- Paper that inspired this: https://arxiv.org/pdf/1405.4053.pdf

---