# To Do 9: Movie Reviews Sentiment Analysis  
### Chris Lagunilla [CJL71]

Following these tutorials:

- http://www.pitt.edu/~naraehan/presentation/Movie+Reviews+sentiment+analysis+with+Scikit-Learn.html
- http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

## Load movie_reviews corpus data using sklearn

In [1]:
import sklearn
from sklearn.datasets import load_files

In [2]:
# define filepath to movie_reviews corpus
moviedir = '/Users/ChrisLagunilla/nltk_data/corpora/movie_reviews'

In [3]:
# load in files as training set
movie_training = load_files(moviedir, shuffle=True)

In [4]:
# check the length of the data
len(movie_training.data)

2000

In [5]:
# check target names
movie_training.target_names

['neg', 'pos']

In [6]:
# peek the first file
movie_training.data[0][:500]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so cal"

In [7]:
movie_training.filenames[0]

'/Users/ChrisLagunilla/nltk_data/corpora/movie_reviews/neg/cv405_21868.txt'

In [8]:
# target stores each document's class as an index into target_names (0 = 'neg', 1 = 'pos')
movie_training.target[0]

0
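To make the relationship between `target`, `target_names`, and the folder layout concrete, here is a minimal sketch. It builds a throwaway two-file corpus in a temporary directory (the file names and review texts are made up for illustration) and checks that `target_names[target[i]]` gives back the subfolder each file was loaded from.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny throwaway corpus with the same layout as movie_reviews:
# one subfolder per class, one text file per document (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [("neg", "dull and sloppy"), ("pos", "a real joy")]:
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "review.txt").write_text(text)

    toy = load_files(tmp, shuffle=True, random_state=0)

    # target[i] is an index into target_names, so this recovers the folder name
    labels = [toy.target_names[t] for t in toy.target]
    folders = [Path(f).parent.name for f in toy.filenames]

for label, folder in zip(labels, folders):
    print(label, folder)
```

So the `0` above means the first loaded review came from the `neg` folder, which agrees with the path shown by `filenames[0]`.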

## Try Out CountVectorizer & TF-IDF

In [11]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
%pprint

Pretty printing has been turned OFF


In [12]:
sents = ['A rose is a rose is a rose is a rose.',
         'Oh, what a fine day it is.',
         "It ain't over till it's over I tell you!!"
        ]

In [14]:
# initialize CountVectorizer
foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)
sents_counts = foovec.fit_transform(sents)

# unique words are mapped to indices
foovec.vocabulary_

{'a': 4, 'rose': 14, 'is': 9, '.': 3, 'oh': 12, ',': 2, 'what': 17, 'fine': 7, 'day': 6, 'it': 10, 'ai': 5, "n't": 11, 'over': 13, 'till': 16, "'s": 1, 'i': 8, 'tell': 15, 'you': 18, '!': 0}

In [15]:
# (doc_count, #_unique_words)
sents_counts.shape

(3, 19)

In [17]:
# view vector in an array form
sents_counts.toarray()

array([[0, 0, 0, 1, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
       [2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 2, 0, 1, 1, 0, 1]], dtype=int64)

In [21]:
# convert freq counts into TF-IDF
# (Term Frequency - Inverse Document Frequency)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [22]:
# TF-IDF Values
sents_tfidf.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.13650997,  0.54603988,
         0.        ,  0.        ,  0.        ,  0.        ,  0.40952991,
         0.        ,  0.        ,  0.        ,  0.        ,  0.71797683,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.36977238,  0.28122142,  0.28122142,
         0.        ,  0.36977238,  0.36977238,  0.        ,  0.28122142,
         0.28122142,  0.        ,  0.36977238,  0.        ,  0.        ,
         0.        ,  0.        ,  0.36977238,  0.        ],
       [ 0.48065817,  0.24032909,  0.        ,  0.        ,  0.        ,
         0.24032909,  0.        ,  0.        ,  0.24032909,  0.        ,
         0.36555293,  0.24032909,  0.        ,  0.48065817,  0.        ,
         0.24032909,  0.24032909,  0.        ,  0.24032909]])
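To see where these numbers come from, the sketch below recomputes the TF-IDF matrix by hand from the count matrix printed earlier and compares it with `TfidfTransformer`. It assumes the transformer's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); the variable names are only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Count matrix copied from the sents_counts.toarray() output
counts = np.array([
    [0, 0, 0, 1, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    [2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 2, 0, 1, 1, 0, 1],
])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency
weighted = counts * idf                     # raw count times IDF
# L2-normalize each row so every document vector has length 1
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual, reference))
print(manual[0].round(8))
```

In the first sentence, 'rose' and 'a' both occur four times, but 'rose' gets the larger weight because it appears in only one of the three sentences, while 'a' appears in two.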

## Transforming Movie Reviews

In [26]:
# initialize movie vector object
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)
movie_counts = movie_vec.fit_transform(movie_training.data)

In [27]:
# look up the feature (column) index that the vocabulary assigns to specific words

In [31]:
print(movie_vec.vocabulary_.get('screen'))
print(movie_vec.vocabulary_.get('seagal'))

19637
19690
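Note that `vocabulary_.get()` returns a word's column index in the count matrix, not its frequency. A minimal sketch of going from that index to actual counts follows; it uses two of the toy sentences and the default tokenizer rather than `nltk.word_tokenize` so that it runs without the corpus, which is an assumption made only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['A rose is a rose is a rose is a rose.',
        'Oh, what a fine day it is.']

# Default tokenizer (not nltk.word_tokenize) to keep the sketch dependency-free
vec = CountVectorizer()
counts = vec.fit_transform(docs)

col = vec.vocabulary_.get('rose')            # column index, or None if the word is unseen
per_doc = counts[:, col].toarray().ravel()   # occurrences of 'rose' in each document
total = counts[:, col].sum()                 # occurrences across the whole corpus
print(col, per_doc, total)
```

The same indexing pattern should apply to `movie_counts` with the indices printed above.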


In [32]:
movie_counts.shape

(2000, 25313)

In [35]:
# converting raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)

In [36]:
# verify dimensions stayed the same
movie_tfidf.shape

(2000, 25313)

In [38]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

In [42]:
# split data into training and test sets
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, 
    movie_training.target, 
    test_size = 0.20, 
    random_state = 12
)

In [43]:
# train the multinomial naive bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

In [44]:
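# import the metric explicitly; `import sklearn` alone does not guarantee sklearn.metrics is loaded
from sklearn.metrics import accuracy_score
y_pred = clf.predict(docs_test)
accuracy_score(y_test, y_pred)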

0.82250000000000001

In [46]:
# build a confusion matrix: rows are true labels, columns are predicted labels (order: neg, pos)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[176,  30],
       [ 41, 153]])
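Reading the matrix: rows are the true classes and columns are the predicted classes, in the order `['neg', 'pos']`. So 176 negative reviews were correctly labeled neg, 30 negative reviews were mislabeled pos, 41 positive reviews were mislabeled neg, and 153 positive reviews were correctly labeled pos. The sketch below derives accuracy, recall, and precision from the matrix printed above.

```python
import numpy as np

# Confusion matrix copied from the output above; label order is ['neg', 'pos']
cm = np.array([[176, 30],
               [41, 153]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions lie on the diagonal
recall = np.diag(cm) / cm.sum(axis=1)      # per true class: fraction predicted correctly
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class: fraction that was right

print(accuracy)
print(recall, precision)
```

The diagonal sums to 329 out of 400 test reviews, which is the 0.8225 accuracy reported earlier.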

## Trying the Classifier on Fake Movie Reviews

In [47]:
reviews_new = ['This movie was excellent', 
               'Absolute joy ride', 
               'Steven Seagal was terrible', 
               'Steven Seagal shined through.', 
               'This was certainly a movie', 
               'Two thumbs up', 
               'I fell asleep halfway through', 
               "We can't wait for the sequel!!", 
               '!', 
               '?', 
               'I cannot recommend this highly enough', 
               'instant classic.', 
               'Steven Seagal was amazing. His performance was Oscar-worthy.'
              ]
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [48]:
# predict on the new reviews
pred = clf.predict(reviews_new_tfidf)

In [50]:
# print results
for review, category in zip(reviews_new, pred):
    print('%r => %s' % (review, movie_training.target_names[category]))

'This movie was excellent' => pos
'Absolute joy ride' => pos
'Steven Seagal was terrible' => neg
'Steven Seagal shined through.' => neg
'This was certainly a movie' => neg
'Two thumbs up' => neg
'I fell asleep halfway through' => neg
"We can't wait for the sequel!!" => neg
'!' => neg
'?' => neg
'I cannot recommend this highly enough' => pos
'instant classic.' => pos
'Steven Seagal was amazing. His performance was Oscar-worthy.' => neg


## Thoughts
I don't yet understand what the sklearn objects (CountVectorizer, the TF-IDF transformer, and transformers in general) are doing well enough to reproduce this on my own.

Compared to the NLTK Naive Bayes classifier, much of the logic that goes into constructing the classifier is abstracted into multiple objects, so it was hard to follow exactly what I was doing during the tutorial.
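One way to keep track of the separate objects is to chain them in a `Pipeline`, which the second tutorial linked at the top also does. The sketch below is an illustration under stated assumptions, not what this notebook ran: it uses made-up toy sentences in place of `movie_training.data` and the default tokenizer instead of `nltk.word_tokenize`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for movie_training.data / .target
train_docs = ['an excellent and joyful film', 'a terrible sloppy mess',
              'excellent acting', 'terrible plot']
train_labels = [1, 0, 1, 0]   # 0 = neg, 1 = pos

text_clf = Pipeline([
    ('vect', CountVectorizer()),     # text -> token count matrix
    ('tfidf', TfidfTransformer()),   # counts -> TF-IDF weights
    ('clf', MultinomialNB()),        # TF-IDF weights -> class prediction
])

# fit() runs fit_transform on each step in order; predict() reuses the fitted steps
text_clf.fit(train_docs, train_labels)
pred = text_clf.predict(['an excellent film', 'a terrible film'])
print(pred)
```

Each step corresponds to one object used above: raw text goes in, the vectorizer and transformer turn it into features, and the classifier sees only the final matrix. A pipeline also makes it straightforward to split the raw documents first and fit the vocabulary and IDF weights on the training portion only.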