Paige Haring

peh40@pitt.edu

10/16/17

This notebook records my progress working through the scikit-learn tutorial on text data: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

### Movie Reviews

In [3]:
import sklearn
import nltk
from sklearn.datasets import load_files

In [7]:
# Point movie_dir at the movie_reviews corpus inside my nltk_data folder
nltk.data.path  # lists the folders nltk searches, which is how I found nltk_data
movie_dir = "/Users/Paige/nltk_data/corpora/movie_reviews/"

#load all of the files in the movie_reviews corpus as training data
movie_train = load_files(movie_dir, shuffle=True)
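
To make the folder convention concrete, here is a rough pure-Python sketch of what `load_files` relies on: one subfolder per class, one file per document, and folder names used as labels in sorted order. The helper `load_labeled_folders` and the two toy reviews are made up for illustration; this is not scikit-learn's implementation.

```python
import tempfile
from pathlib import Path

def load_labeled_folders(root):
    """Hypothetical stand-in for load_files: subfolder name = class label."""
    root = Path(root)
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    data, target, filenames = [], [], []
    for label, name in enumerate(target_names):
        for path in sorted((root / name).iterdir()):
            data.append(path.read_bytes())  # raw bytes, as load_files gives when no encoding is set
            target.append(label)
            filenames.append(str(path))
    return {"data": data, "target": target,
            "target_names": target_names, "filenames": filenames}

# Build a tiny, made-up corpus with the same neg/pos layout as movie_reviews
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("neg", "dull and slow"), ("pos", "sharp and funny")]:
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "review0.txt").write_text(text)
    toy = load_labeled_folders(tmp)

print(toy["target_names"])  # ['neg', 'pos']
print(toy["target"])        # [0, 1]
print(toy["data"][0])       # b'dull and slow'
```

Reading the files as bytes is why the review text later in this notebook prints with a `b"..."` prefix.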

In [9]:
#What is movie_train?
type(movie_train)

sklearn.datasets.base.Bunch

In [10]:
dir(movie_train)

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [11]:
movie_train.target_names #These are the labels or classes we will want to predict

['neg', 'pos']

In [12]:
movie_train.target

array([0, 1, 1, ..., 1, 0, 0])

In [13]:
movie_train.filenames

array(['/Users/Paige/nltk_data/corpora/movie_reviews/neg/cv405_21868.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/pos/cv190_27052.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/pos/cv132_5618.txt',
       ...,
       '/Users/Paige/nltk_data/corpora/movie_reviews/pos/cv653_19583.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/neg/cv559_0057.txt',
       '/Users/Paige/nltk_data/corpora/movie_reviews/neg/cv684_12727.txt'], 
      dtype='<U64')

In [16]:
#How many reviews do we have?
len(movie_train.data)

2000

In [19]:
#Let's look at the last review and the information we have about it.
movie_train.data[-1][:500]

#It's a review of A Perfect Murder, a remake of Dial M for Murder, and it doesn't look favorable.

b"any remake of an alfred hitchcock film is at best an uncertain project , as a perfect murder illustrates . \nfrankly , dial m for murder is not one of the master director's greatest efforts , so there is ample room for improvement . \nunfortunately , instead of updating the script , ironing out some of the faults , and speeding up the pace a little , a perfect murder has inexplicably managed to eliminate almost everything that was worthwhile about dial m for murder , leaving behind the nearly- unw"

In [20]:
movie_train.target[-1] #Yep. That's negative.

0

### CountVectorizer & TF-IDF

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
%pprint

Pretty printing has been turned OFF


In [23]:
sents = ['Hello, how are you today?', 'Just fine!', 'How are the wife and kids?']

In [24]:
#We are forcing CountVectorizer to use nltk's word tokenizer because it's better for us.
#Unlike the default token pattern, it keeps punctuation and one-character tokens.
#(Stopwords are kept either way unless stop_words is set.)
foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)
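
To see what the tokenizer swap changes, the sketch below applies two regular expressions to the first example sentence. The first is CountVectorizer's default `token_pattern`; the second is only a simple stand-in for `nltk.word_tokenize` (an assumption, so this runs without nltk data), and the real tokenizer handles more cases such as contractions.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer's default token_pattern
KEEP_PUNCT_PATTERN = r"\w+|[^\w\s]"       # hypothetical stand-in for nltk.word_tokenize

# CountVectorizer lowercases text by default, so do the same here
sentence = "Hello, how are you today?".lower()

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence)
punct_tokens = re.findall(KEEP_PUNCT_PATTERN, sentence)

print(default_tokens)  # ['hello', 'how', 'are', 'you', 'today']
print(punct_tokens)    # ['hello', ',', 'how', 'are', 'you', 'today', '?']
```

Both versions keep common words like "are" and "you"; the difference is the comma and question mark.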

In [27]:
help(foovec.fit_transform)

Help on method fit_transform in module sklearn.feature_extraction.text:

fit_transform(raw_documents, y=None) method of sklearn.feature_extraction.text.CountVectorizer instance
    Learn the vocabulary dictionary and return term-document matrix.
    
    This is equivalent to fit followed by transform, but more efficiently
    implemented.
    
    Parameters
    ----------
    raw_documents : iterable
        An iterable which yields either str, unicode or file objects.
    
    Returns
    -------
    X : array, [n_samples, n_features]
        Document-term matrix.



In [28]:
#sents_counts is a sparse document-term matrix holding the token counts for each sentence in sents
sents_counts = foovec.fit_transform(sents)

#Calling fit_transform modified foovec
#foovec now has an attribute called vocabulary_, which is a dictionary. The keys are the unique tokens
#The values are unique integer ids: each token's column index in the count matrix
foovec.vocabulary_ 

{'hello': 6, ',': 1, 'how': 7, 'are': 4, 'you': 13, 'today': 11, '?': 2, 'just': 8, 'fine': 5, '!': 0, 'the': 10, 'wife': 12, 'and': 3, 'kids': 9}
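
The ids in the output above look like positions in the sorted list of unique tokens. The sketch below rebuilds the same mapping by hand under that assumption. The `tokenize` helper is a regex stand-in for `nltk.word_tokenize`, chosen so the example has no nltk dependency.

```python
import re

sents = ['Hello, how are you today?', 'Just fine!', 'How are the wife and kids?']

def tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Collect the unique tokens, sort them, and number them in that order
unique_tokens = sorted({tok for sent in sents for tok in tokenize(sent)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

print(len(vocabulary))                                          # 14
print(vocabulary['!'], vocabulary['hello'], vocabulary['you'])  # 0 6 13
```

Punctuation sorts before letters, which is why '!', ',' and '?' take ids 0, 1 and 2.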

In [30]:
# sents_counts has a dimension of 3 (document count) by 14 (# of unique words)
sents_counts.shape

#Note these 14 columns correspond to the ids in foovec.vocabulary_

(3, 14)

In [32]:
sents_counts

<3x14 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [36]:
sents_counts.toarray()
#Example:
#The indexes with ones in the first row are 1, 2, 4, 6, 7, 11, 13
#Those ids correspond to the tokens: ',', '?', 'are', 'hello', 'how', 'today', 'you'
#Those are the tokens in the first sentence! Each row of sents_counts records how many times
#each vocabulary token occurs in the corresponding sentence in sents

array([[0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]], dtype=int64)
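
The row-to-token lookup described in the comments can be done in code by inverting the vocabulary. This sketch copies the vocabulary and the first row from the outputs above rather than recomputing them.

```python
# Vocabulary and first row copied from the outputs above
vocabulary = {'hello': 6, ',': 1, 'how': 7, 'are': 4, 'you': 13, 'today': 11,
              '?': 2, 'just': 8, 'fine': 5, '!': 0, 'the': 10, 'wife': 12,
              'and': 3, 'kids': 9}
first_row = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1]

# Invert token -> id into id -> token, then keep the columns with a nonzero count
id_to_token = {idx: tok for tok, idx in vocabulary.items()}
present = [id_to_token[i] for i, count in enumerate(first_row) if count > 0]

print(present)  # [',', '?', 'are', 'hello', 'how', 'today', 'you']
```

Word order is lost: the tokens come back in column order, not sentence order.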

In [37]:
# Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [38]:
# TF-IDF values
# each row is scaled to unit length (L2 norm by default), which removes the effect of document length,
# and terms that are found across many docs are weighted down
sents_tfidf.toarray()

array([[ 0.        ,  0.41756662,  0.31757018,  0.        ,  0.31757018,
         0.        ,  0.41756662,  0.31757018,  0.        ,  0.        ,
         0.        ,  0.41756662,  0.        ,  0.41756662],
       [ 0.57735027,  0.        ,  0.        ,  0.        ,  0.        ,
         0.57735027,  0.        ,  0.        ,  0.57735027,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.31757018,  0.41756662,  0.31757018,
         0.        ,  0.        ,  0.31757018,  0.        ,  0.41756662,
         0.41756662,  0.        ,  0.41756662,  0.        ]])
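
To check where these numbers come from, the sketch below recomputes them by hand from the count matrix above. It assumes TfidfTransformer's default settings: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It uses plain Python rather than scikit-learn.

```python
import math

# Count matrix copied from sents_counts.toarray() above
counts = [
    [0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many sentences does each token appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

print(round(tfidf[0][1], 4))  # 0.4176  (',' appears in one sentence)
print(round(tfidf[0][2], 4))  # 0.3176  ('?' appears in two sentences)
print(round(tfidf[1][0], 4))  # 0.5774  ('!' in the three-token sentence)
```

In the first row, ',' and '?' both have a raw count of 1, but '?' gets the lower weight because it also appears in the third sentence.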