Paige Haring

peh40@pitt.edu

10/16/17

This notebook is my progress following along with this tutorial on scikit-learn.org: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

### Movie Reviews

In [1]:
import sklearn
import nltk
from sklearn.datasets import load_files

In [2]:
#setting directory to where my nltk data has the movie reviews corpus
nltk.data.path #gotta find it first...
movie_dir = "/Users/Paige/nltk_data/corpora/movie_reviews/"

#load all of the files in the movie_reviews corpus as training data
movie_train = load_files(movie_dir, shuffle=True)

In [3]:
#What is movie_train?
type(movie_train)

sklearn.datasets.base.Bunch

In [4]:
dir(movie_train)

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [5]:
movie_train.target_names #These are the labels or classes we will want to predict

['neg', 'pos']
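
To see where `target_names` comes from, here is a minimal sketch that runs `load_files` on a tiny throwaway directory. The folder layout and review texts are made up for illustration and are not part of the movie_reviews corpus. Each subfolder name becomes a class label, and because no `encoding` is passed, the documents come back as bytes.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: one subfolder per class, one text file per review
toy_dir = Path(tempfile.mkdtemp())
for label, text in [("neg", "a dull film"), ("pos", "a lovely film")]:
    (toy_dir / label).mkdir()
    (toy_dir / label / "review0.txt").write_text(text)

toy = load_files(str(toy_dir), shuffle=True, random_state=0)

print(toy.target_names)                             # subfolder names, sorted
print([toy.target_names[i] for i in toy.target])    # label of each loaded file
print(type(toy.data[0]))                            # bytes, since no encoding was given
```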

In [6]:
movie_train.target

array([0, 1, 1, ..., 1, 0, 0])

In [7]:
movie_train.filenames

array(['/Users/Paige/nltk_data/corpora/movie_reviews/neg/cv405_21868.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/pos/cv190_27052.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/pos/cv132_5618.txt',
       ...,
       '/Users/Paige/nltk_data/corpora/movie_reviews/pos/cv653_19583.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/neg/cv559_0057.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/neg/cv684_12727.txt'], 
      dtype='<U64')

In [8]:
#How many reviews do we have?
len(movie_train.data)

2000

In [9]:
#Let's look at the last review and the information we have about it.
movie_train.data[-1][:500]

#Seems like it's about A Perfect Murder, a remake of Dial M for Murder, and it doesn't look too good.

b"any remake of an alfred hitchcock film is at best an uncertain project , as a perfect murder illustrates . \nfrankly , dial m for murder is not one of the master director's greatest efforts , so there is ample room for improvement . \nunfortunately , instead of updating the script , ironing out some of the faults , and speeding up the pace a little , a perfect murder has inexplicably managed to eliminate almost everything that was worthwhile about dial m for murder , leaving behind the nearly- unw"

In [10]:
movie_train.target[-1] #Yep. That's negative.

0

### CountVectorizer & TF-IDF

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
%pprint

Pretty printing has been turned OFF


In [13]:
sents = ['Hello, how are you today and how are your dogs?', 'Just fine!', 'How are the wife and kids?']

In [14]:
#We are forcing CountVectorizer to use nltk's word tokenizer because it's better for us.
#Unlike the default token pattern, it keeps punctuation and single-character tokens.
foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)

In [15]:
help(foovec.fit_transform)

Help on method fit_transform in module sklearn.feature_extraction.text:

fit_transform(raw_documents, y=None) method of sklearn.feature_extraction.text.CountVectorizer instance
    Learn the vocabulary dictionary and return term-document matrix.
    
    This is equivalent to fit followed by transform, but more efficiently
    implemented.
    
    Parameters
    ----------
    raw_documents : iterable
        An iterable which yields either str, unicode or file objects.
    
    Returns
    -------
    X : array, [n_samples, n_features]
        Document-term matrix.



In [16]:
#sents_counts is a sparse document-term matrix with the word frequencies for each sentence in sents
sents_counts = foovec.fit_transform(sents)

#Calling fit_transform modified foovec
#foovec now has an attribute called vocabulary_ that is a dictionary. The keys are the unique tokens
#The values are unique numerical ids (the column indexes in sents_counts)
foovec.vocabulary_ 

{'hello': 7, ',': 1, 'how': 8, 'are': 4, 'you': 14, 'today': 12, 'and': 3, 'your': 15, 'dogs': 5, '?': 2, 'just': 9, 'fine': 6, '!': 0, 'the': 11, 'wife': 13, 'kids': 10}

In [17]:
# sents_counts has a dimension of 3 (document count) by 16 (# of unique tokens)
sents_counts.shape

#Note these 16 columns correspond to the ids in foovec.vocabulary_

(3, 16)

In [18]:
sents_counts

<3x16 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

In [19]:
sents_counts.toarray()
#Example:
#The nonzero indexes in the first row are 1, 2, 3, 4, 5, 7, 8, 12, 14, 15
#Those ids correspond to the tokens: ',', '?', 'and', 'are', 'dogs', 'hello', 'how', 'today', 'you', 'your'
#Those are the tokens in the first sentence! 'are' and 'how' have a count of 2 because they appear twice.
#sents_counts is essentially an inventory of how often each token in the vocabulary
#appears in each sentence in sents

array([[0, 1, 1, 1, 2, 1, 0, 1, 2, 0, 0, 0, 1, 0, 1, 1],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]], dtype=int64)
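
Reading a row of the count matrix by hand is error-prone, so here is a small sketch that maps column ids back to tokens. To stay self-contained it does not call `nltk.word_tokenize`; the `token_pattern` below is an assumed stand-in that happens to split these three sentences the same way (words plus individual punctuation marks).

```python
from sklearn.feature_extraction.text import CountVectorizer

sents = ['Hello, how are you today and how are your dogs?',
         'Just fine!',
         'How are the wife and kids?']

# Assumption: this regex stands in for nltk.word_tokenize on these simple sentences
vec = CountVectorizer(token_pattern=r"\w+|[^\w\s]")
counts = vec.fit_transform(sents)

# Invert vocabulary_ (token -> column id) into column id -> token
id_to_token = {idx: tok for tok, idx in vec.vocabulary_.items()}

# Keep only the tokens that actually occur in the first sentence
first_row = counts.toarray()[0]
first_sentence_counts = {id_to_token[i]: int(c)
                         for i, c in enumerate(first_row) if c > 0}
print(first_sentence_counts)
```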

In [20]:
# Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values

#TF_IDF helps us figure out which words are the most important in each document in our corpus

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [21]:
# TF-IDF values
# raw counts have been normalized against document length, 
# terms that are found across many docs are weighted down
# They aren't stopwords, but since they are found across docs, they aren't going to help us classify
# This keeps document specific keywords weighted high!

sents_tfidf.toarray()

# We see in sentence two, id 0 = '!' has one of the highest weights because it is found only in that document

array([[ 0.        ,  0.29130889,  0.22154792,  0.22154792,  0.44309583,
         0.29130889,  0.        ,  0.29130889,  0.44309583,  0.        ,
         0.        ,  0.        ,  0.29130889,  0.        ,  0.29130889,
         0.29130889],
       [ 0.57735027,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.57735027,  0.        ,  0.        ,  0.57735027,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.32992832,  0.32992832,  0.32992832,
         0.        ,  0.        ,  0.        ,  0.32992832,  0.        ,
         0.43381609,  0.43381609,  0.        ,  0.43381609,  0.        ,
         0.        ]])
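
To check where these numbers come from, the sketch below recomputes them by hand from the count matrix printed earlier. It assumes the `TfidfTransformer` defaults (`smooth_idf=True`, `norm='l2'`), where idf = ln((1 + n_docs) / (1 + doc_freq)) + 1 and each row is then scaled to unit length.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# The raw counts from sents_counts.toarray() above
counts = np.array([
    [0, 1, 1, 1, 2, 1, 0, 1, 2, 0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of sentences containing each token
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = counts * idf                           # term frequency times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sklearn_tfidf = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

In the second sentence all three tokens appear once and only in that sentence, so they share the same weight, 1/sqrt(3).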

### Back to Movie Reviews

In [22]:
# initialize movie_vector object, and then turn movie train data into a vector 
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)         # use all 25K words. 82.2% acc.
movie_counts = movie_vec.fit_transform(movie_train.data)

In [23]:
movie_vec.vocabulary_ #Wow that's big
movie_counts.shape
#There are 2000 reviews and a vocabulary of 25,313 unique tokens (those appearing in at least 2 reviews, since min_df=2)

(2000, 25313)

In [24]:
# Convert raw frequency counts into TF-IDF values so the important words are paid attention to
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)

In [25]:
#This has the same dimensions as movie_counts, but it is no longer raw frequency counts
#Now, these are the weights of the words, where a high weight corresponds to a word that isn't frequent across docs
#These words are "important"
movie_tfidf.shape

(2000, 25313)
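
As a side note, scikit-learn also offers `TfidfVectorizer`, which does the counting and the TF-IDF weighting in one object. The sketch below uses made-up toy documents and default settings (not the nltk tokenizer used in this notebook) to check that the one-step and two-step routes agree.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

# Hypothetical toy documents, only for comparing the two routes
docs = ["a great great film", "a dull film", "great acting"]

# Two steps, as in this notebook: counts first, then TF-IDF weights
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One step: TfidfVectorizer combines both
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```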

### Naive Bayes Classifier

In [26]:
# Import Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

In [27]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, movie_train.target, test_size = 0.20, random_state = 12)

#set random state so we can replicate the same randomization next time we run the code
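
A quick sketch of what fixing `random_state` buys us, using a small made-up array instead of the movie data: two calls with the same seed give exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, alternating labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

split_a = train_test_split(X, y, test_size=0.20, random_state=12)
split_b = train_test_split(X, y, test_size=0.20, random_state=12)

# Compare X_train, X_test, y_train, y_test pairwise between the two calls
identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(identical)
print(len(split_a[1]))  # size of the test set: 20% of 10 samples
```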

In [28]:
# Train the classifier with the training documents and their targets
clf = MultinomialNB().fit(docs_train, y_train)

In [29]:
# Predicting the Test set results, find accuracy
y_pred = clf.predict(docs_test)
y_pred

array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0,

In [30]:
#How good is our classifier at classifying?
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.82250000000000001

In [31]:
#A confusion matrix breaks the predictions down by true class (rows) and predicted class (columns)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[176,  30],
       [ 41, 153]])
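
Here is a short sketch for reading this matrix, using the numbers printed above. In scikit-learn's convention, rows are the true classes and columns are the predicted classes, in the order `['neg', 'pos']`. So 176 negative reviews were correctly called negative, 30 negative reviews were called positive, 41 positive reviews were called negative, and 153 positive reviews were correctly called positive.

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, columns = predicted class
cm = np.array([[176, 30],
               [41, 153]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Recall: of the reviews that truly belong to a class, how many did we catch?
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision: of the reviews we assigned to a class, how many were right?
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(recall_per_class)
print(precision_per_class)
```

The accuracy computed this way, 329 correct out of 400, matches the accuracy score from the previous cell.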

### Testing on my roommate's review

In [32]:
reviews_new= ['This movie sucked.', 
          "Tom Cruise's performance is as wooden as the plank of wood from edd ed and eddy.", 
          'The writing was incredibly breathtaking despite the terrible delivery.']

In [33]:
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [34]:
pred = clf.predict(reviews_new_tfidf)

In [35]:
# print out results
for review, category in zip(reviews_new, pred):
    print('%r => %s' % (review, movie_train.target_names[category]))

'This movie sucked.' => neg
"Tom Cruise's performance is as wooden as the plank of wood from edd ed and eddy." => pos
'The writing was incredibly breathtaking despite the terrible delivery.' => neg
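
Note that new reviews go through `transform`, not `fit_transform`, so they are mapped onto the vocabulary learned from the training data. A minimal sketch with made-up sentences (and the default tokenizer, as an assumption) shows that tokens the vectorizer never saw are dropped without any error, which may be part of why unusual words in short reviews have no effect on the prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training sentences
train_docs = ["the film was great", "the film was terrible"]
vec = CountVectorizer()
vec.fit(train_docs)

# 'sequel' never appeared in training, so it has no column
new_counts = vec.transform(["the sequel was great"])

print(new_counts.shape)            # same number of columns as the training vocabulary
print("sequel" in vec.vocabulary_)
print(int(new_counts.sum()))       # only the known tokens are counted
```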


In [36]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 'Absolute joy ride', 
            'Steven Seagal was terrible', 'Steven Seagal shined through.', 
              'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through', 
              "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough', 
              'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.']
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [37]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [38]:
# print out results
for review, category in zip(reviews_new, pred):
    print('%r => %s' % (review, movie_train.target_names[category]))

'This movie was excellent' => pos
'Absolute joy ride' => pos
'Steven Seagal was terrible' => neg
'Steven Seagal shined through.' => neg
'This was certainly a movie' => neg
'Two thumbs up' => neg
'I fell asleep halfway through' => neg
"We can't wait for the sequel!!" => neg
'!' => neg
'?' => neg
'I cannot recommend this highly enough' => pos
'instant classic.' => pos
'Steven Seagal was amazing. His performance was Oscar-worthy.' => neg


It's interesting that the Steven Seagal reviews are never positive. Last semester in LING1330, we used NLTK's Naive Bayes classifier, and the same thing happened then! That makes sense, because both classifiers are built on the same Naive Bayes formula. Our NLTK classifier uses a kitchen-sink strategy. 'Steven Seagal' must signify a negative review because 'Steven' and 'Seagal' are not common across the entire corpus but show up mainly in negative reviews, which gives them a strong weight toward the negative class. The two setups differ in their features, though. In the NLTK classifier, we don't apply the TF-IDF weighting that we do here with sklearn, and we don't use raw counts either. We use true/false values: we only care whether a word is present in a document or not. In the sklearn classifier, the counts seem to give our classifier more information.
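
To make the difference between these feature types concrete, here is a small sketch on two made-up documents. `CountVectorizer(binary=True)` is used as a rough stand-in for the true/false features from the NLTK classifier; that mapping is my assumption, not something taken from the LING1330 code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents; 'great' appears twice in the first one
docs = ["great great film", "dull film"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)                      # raw counts
presence = CountVectorizer(binary=True).fit_transform(docs) # present / not present
tfidf = TfidfTransformer().fit_transform(counts)            # TF-IDF weights

great_id = count_vec.vocabulary_["great"]
print(int(counts[0, great_id]))     # how many times 'great' occurs
print(int(presence[0, great_id]))   # only whether 'great' occurs
print(float(tfidf[0, great_id]))    # weighted, length-normalized value
```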