# Contents

- [Regular Expressions](#reg_ex)
- [NLTK](#nltk)  
- [Vectorization](#vec)  
    - [CountVectorizer](#cnt_vec)  
    - [TFIDF Vectorizer](#tf_vec)  
- [Topic Modeling](#top_mod)  
    - [LDA](#lda)  
    - [NMF](#nmf)  

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data

In [2]:
from nltk.corpus import movie_reviews

In [3]:
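# machine-specific location of the NLTK movie_reviews corpus; adjust to your own nltk_data directory
path = '/home/kevcon/nltk_data/corpora/movie_reviews/'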

In [4]:
pos_docs = []

for file in movie_reviews.fileids('pos')[0:250]:
    with open(path + file, 'r') as f:
        pos_docs.append(f.read())

In [5]:
neg_docs = []

for file in movie_reviews.fileids('neg')[0:250]:
    with open(path + file, 'r') as f:
        neg_docs.append(f.read())

In [6]:
pos_df = pd.DataFrame(pos_docs, columns=['review'])
pos_df.head()

| | review |
|---|---|
| 0 | films adapted from comic books have had plenty... |
| 1 | every now and then a movie comes along from a ... |
| 2 | you've got mail works alot better than it dese... |
| 3 | " jaws " is a rare film that grabs your atten... |
| 4 | moviemaking is a lot like being the general ma... |


In [7]:
pos_df['sentiment'] = ['positive'] * len(pos_docs)
pos_df.head()

| | review | sentiment |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | positive |
| 1 | every now and then a movie comes along from a ... | positive |
| 2 | you've got mail works alot better than it dese... | positive |
| 3 | " jaws " is a rare film that grabs your atten... | positive |
| 4 | moviemaking is a lot like being the general ma... | positive |


In [8]:
neg_df = pd.DataFrame(neg_docs, columns=['review'])
neg_df['sentiment'] = ['negative'] * len(neg_docs)

In [9]:
rev_df = pd.concat([pos_df, neg_df], axis=0, ignore_index=True)

# Regular Expressions <a name="reg_ex"></a>

# NLTK <a name="nltk"></a>

In [10]:
import nltk

# Vectorization <a name="vec"></a>

## CountVectorizer <a name="cnt_vec"></a>

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
documents = rev_df['review']

In [13]:
# create instance of CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

In [14]:
# learn vocabulary from input documents
vectorizer.fit(documents)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
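
The `token_pattern` in the output above is a regular expression that decides what counts as a term: two or more word characters between word boundaries. The sketch below applies that same pattern with the standard `re` module to a made-up sentence (the sentence and the helper name `tokenize` are illustrative assumptions, not part of the notebook). It suggests why fragments such as `ve`, `don` and `didn` can appear among the terms: apostrophes split words, and single-character pieces are dropped. Stop-word removal is a separate step and is not applied here.

```python
import re

# Default token pattern copied from the CountVectorizer repr above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

# Hypothetical sample sentence, used only for illustration.
sample = "You've got mail: I didn't see it in 1998!"
tokens = tokenize(sample)
print(tokens)
# ['you', 've', 'got', 'mail', 'didn', 'see', 'it', 'in', '1998']
```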

In [15]:
# transform documents into document-term matrix
X = vectorizer.transform(documents)
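
To make the fit and transform steps concrete, here is a minimal pure-Python sketch of the same idea on a two-document toy corpus. The corpus and all variable names are hypothetical, and the sketch skips stop-word removal and sparse storage, so it illustrates the concept rather than reproducing what scikit-learn does internally.

```python
import re

# Hypothetical toy corpus, not taken from the movie reviews.
docs = ["the film was good good", "the plot was bad"]

def tokenize(text):
    """Tokenize with the same default pattern the vectorizer reports."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# "fit": learn a sorted vocabulary and map each term to a column index.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {term: j for j, term in enumerate(vocab)}

# "transform": one row per document, one column per term, cells hold counts.
matrix = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for tok in tokenize(doc):
        matrix[i][index[tok]] += 1

print(vocab)   # ['bad', 'film', 'good', 'plot', 'the', 'was']
print(matrix)  # [[0, 1, 2, 0, 1, 1], [1, 0, 0, 1, 1, 1]]
```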

In [16]:
# return terms from documents
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
cv_terms = vectorizer.get_feature_names()
cv_terms

['00',
 '000',
 '0009f',
 '007',
 '10',
 '100',
 '1000',
 '101',
 '102',
 '105',
 '107',
 '109',
 '10b',
 '11',
 '110',
 '111',
 '112',
 '117',
 '12',
 '123',
 '125',
 '126',
 '13',
 '137',
 '138',
 '13th',
 '14',
 '14th',
 '15',
 '1521',
 '155',
 '157',
 '16',
 '160',
 '161',
 '1692',
 '16x9',
 '17',
 '175',
 '1799',
 '17th',
 '18',
 '180',
 '1800s',
 '1812',
 '1888',
 '1896',
 '1898',
 '18th',
 '19',
 '1900',
 '1900s',
 '1925',
 '1928',
 '1930',
 '1930s',
 '1933',
 '1937',
 '1938',
 '1939',
 '1940',
 '1942',
 '1944',
 '1947',
 '1948',
 '1949',
 '1950',
 '1950s',
 '1957',
 '1959',
 '1960',
 '1960s',
 '1961',
 '1962',
 '1964',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1998s',
 '1999',
 '19th',
 '1st',
 ...]

In [17]:
# return matrix of term counts
X.toarray()

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [18]:
# create dataframe of document-term matrix
cv_df = pd.DataFrame(X.toarray(), columns=cv_terms)
cv_df.head()

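| | 00 | 000 | 0009f | 007 | 10 | 100 | 1000 | 101 | 102 | 105 | ... | zooming | zooms | zorg | zorro | zucker | zweibel | zwick | zwigoff | zycie | zzzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |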


## TFIDF Vectorizer <a name="tf_vec"></a>

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
documents = rev_df['review']

In [21]:
# create instance of TFIDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

In [22]:
# learn vocabulary from input documents
vectorizer.fit(documents)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [23]:
# transform documents into document-term matrix
X = vectorizer.transform(documents)
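
The vectorizer output above reports `use_idf=True`, `smooth_idf=True`, `sublinear_tf=False` and `norm='l2'`. Under those settings, the scikit-learn documentation gives the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, with each document row then scaled to unit Euclidean length. The sketch below works through that arithmetic in plain Python on a small count matrix. The matrix and variable names are hypothetical, and this is an illustration of the formula, not the library's implementation.

```python
import math

# Hypothetical term counts: 2 documents (rows) by 6 terms (columns).
counts = [
    [0, 1, 2, 0, 1, 1],
    [1, 0, 0, 1, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, as documented for smooth_idf=True.
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

# Weight raw counts by idf, then L2-normalize each document row.
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm if norm else 0.0 for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

Terms that occur in every document get an idf of exactly 1, the minimum under this formula, while rarer terms are weighted up before normalization.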

In [24]:
# return terms from documents
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
tf_terms = vectorizer.get_feature_names()
tf_terms

['00',
 '000',
 '0009f',
 '007',
 '10',
 '100',
 '1000',
 '101',
 '102',
 '105',
 '107',
 '109',
 '10b',
 '11',
 '110',
 '111',
 '112',
 '117',
 '12',
 '123',
 '125',
 '126',
 '13',
 '137',
 '138',
 '13th',
 '14',
 '14th',
 '15',
 '1521',
 '155',
 '157',
 '16',
 '160',
 '161',
 '1692',
 '16x9',
 '17',
 '175',
 '1799',
 '17th',
 '18',
 '180',
 '1800s',
 '1812',
 '1888',
 '1896',
 '1898',
 '18th',
 '19',
 '1900',
 '1900s',
 '1925',
 '1928',
 '1930',
 '1930s',
 '1933',
 '1937',
 '1938',
 '1939',
 '1940',
 '1942',
 '1944',
 '1947',
 '1948',
 '1949',
 '1950',
 '1950s',
 '1957',
 '1959',
 '1960',
 '1960s',
 '1961',
 '1962',
 '1964',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1998s',
 '1999',
 '19th',
 '1st',
 ...]

In [25]:
# return matrix of TF-IDF weights (each document row is L2-normalized)
X.toarray()

array([[0.06717283, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [26]:
# create dataframe of document-term matrix
tf_df = pd.DataFrame(X.toarray(), columns=tf_terms)
tf_df.head()

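| | 00 | 000 | 0009f | 007 | 10 | 100 | 1000 | 101 | 102 | 105 | ... | zooming | zooms | zorg | zorro | zucker | zweibel | zwick | zwigoff | zycie | zzzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.067173 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.060035 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.026797 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |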


# Topic Modeling <a name="top_mod"></a>

## LDA <a name="lda"></a>

In [27]:
from sklearn.decomposition import LatentDirichletAllocation

In [28]:
# create instance of model, input number of topics to output
lda = LatentDirichletAllocation(n_components=10)

In [29]:
# fit model to vectorized data
lda.fit(cv_df)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [30]:
cv_terms

['00',
 '000',
 '0009f',
 '007',
 '10',
 '100',
 '1000',
 '101',
 '102',
 '105',
 '107',
 '109',
 '10b',
 '11',
 '110',
 '111',
 '112',
 '117',
 '12',
 '123',
 '125',
 '126',
 '13',
 '137',
 '138',
 '13th',
 '14',
 '14th',
 '15',
 '1521',
 '155',
 '157',
 '16',
 '160',
 '161',
 '1692',
 '16x9',
 '17',
 '175',
 '1799',
 '17th',
 '18',
 '180',
 '1800s',
 '1812',
 '1888',
 '1896',
 '1898',
 '18th',
 '19',
 '1900',
 '1900s',
 '1925',
 '1928',
 '1930',
 '1930s',
 '1933',
 '1937',
 '1938',
 '1939',
 '1940',
 '1942',
 '1944',
 '1947',
 '1948',
 '1949',
 '1950',
 '1950s',
 '1957',
 '1959',
 '1960',
 '1960s',
 '1961',
 '1962',
 '1964',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1998s',
 '1999',
 '19th',
 '1st',
 ...]

In [31]:
lda.components_

array([[ 0.11203614,  0.11401723,  0.11679096, ...,  0.11245952,
         0.11216085,  0.11161081],
       [ 0.11438535,  0.12416534,  0.11448454, ...,  0.11013957,
         1.87560798,  0.11207704],
       [ 0.11299368,  0.11896678,  0.11421405, ...,  0.11155248,
         0.11402856,  0.11265823],
       ...,
       [ 0.11272754,  0.11279648,  0.11139569, ...,  0.11183007,
         0.11424361,  0.11184821],
       [ 0.97711197, 21.31224159,  0.12692479, ...,  1.10431615,
         0.20947663,  1.10714732],
       [ 0.11329878,  0.11415782,  0.11384415, ...,  0.11203808,
         0.11345057,  0.11436703]])
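
Each row of `lda.components_` holds one topic and each column one term. The scikit-learn documentation describes the values as pseudo-counts that become a per-topic distribution over words once every row is divided by its sum. A minimal sketch of that normalization, using made-up numbers and a made-up vocabulary in place of the real fitted matrix:

```python
# Hypothetical pseudo-counts for 2 topics over 4 terms,
# standing in for lda.components_.
components = [
    [0.1, 0.1, 1.8, 2.0],
    [3.0, 0.5, 0.25, 0.25],
]
terms = ["bad", "film", "good", "plot"]  # hypothetical vocabulary

def normalize_rows(rows):
    """Divide every value by its row total so that each row sums to 1."""
    return [[value / sum(row) for value in row] for row in rows]

topic_word = normalize_rows(components)
for k, dist in enumerate(topic_word):
    print(k, [(term, round(p, 3)) for term, p in zip(terms, dist)])
```

After normalization the values in a row can be compared as probabilities, which makes the near-uniform entries around 0.11 in the output above easier to read as "this term carries almost no weight in this topic".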

In [32]:
# function to print top words of topic model
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        # indices of the n_top_words largest weights, highest first
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        message = "\nTopic #{}:".format(index)
        message += " ".join([feature_names[i] for i in top_indices])
        print(message)
        print("="*70)
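
The `argsort` slice in `print_top_words` is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end to pick the largest `n_top_words`. The sketch below reproduces that selection in plain Python on a four-term toy topic. The weights, the vocabulary and the helper name `top_terms` are illustrative assumptions.

```python
def top_terms(weights, feature_names, n_top_words):
    """Return the n_top_words terms with the largest weights, highest first."""
    # Indices in ascending order of weight, analogous to numpy's argsort.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end to take the n_top_words largest.
    top = order[:-n_top_words - 1:-1]
    return [feature_names[i] for i in top]

# Hypothetical topic weights and vocabulary.
weights = [0.1, 0.9, 0.3, 0.5]
terms = ["bad", "film", "good", "plot"]
print(top_terms(weights, terms, 2))
# ['film', 'plot']
```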

In [33]:
print_top_words(lda, cv_terms, 25)


Topic #0:shanta dil muslim navaz carol india creepers sikh hindu jeepers earth henry fiorentino coburn lenny wade dramas stitched salva ua philips wayne mgm visceral bitter

Topic #1:jedi wars star lumumba obi luke wan vader return darth capone mermaid anakin carry connor lucas effects film stretch skywalker empire gon qui spoon caveman

Topic #2:cole malcolm dinosaurs dogma ku ba sixth osment jurassic trek insurrection memento detectives bethany island park willis doyle bartleby loki shyamalan dinosaur troubled height bedroom

Topic #3:argento kelly wild stephane bacon sam chucky lombardo suzie camille tenebrae rudolph richards murray dillon things stendhal campbell denise neal marianne syndrome tiffany matt duquette

Topic #4:guido jakob benigni dora finn holocaust jews estella son burnham camps carolyn lester growing liar expectations beautiful pip indy life german survivors bentley thora snakes

Topic #5:tibbs negotiator roman brady rea damon kaye francie butcher carlos jewison ra

## NMF <a name="nmf"></a>

In [34]:
from sklearn.decomposition import NMF

In [35]:
# create instance of model, input number of topics to output
nmf = NMF(n_components=10)

In [36]:
# fit model to vectorized data
nmf.fit(tf_df)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=10, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)
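
NMF approximates a non-negative document-term matrix X by a product W·H of two smaller non-negative matrices, where the rows of H play the role of `nmf.components_`. The output above reports `solver='cd'` (coordinate descent). The sketch below swaps in the simpler Lee and Seung multiplicative updates for the Frobenius loss, purely to show the idea in a few lines of plain Python. The toy matrix, the function names and the iteration count are all assumptions, and the results will not match scikit-learn's.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - W.H."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))

def nmf_multiplicative(x, k, n_iter=200, eps=1e-9, seed=0):
    """Factor x (n x m) into w (n x k) and h (k x m) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical 4-document, 4-term matrix with two obvious term groups.
x = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
w0, h0 = nmf_multiplicative(x, k=2, n_iter=0)  # random starting point
w, h = nmf_multiplicative(x, k=2)
error_before = frobenius_error(x, w0, h0)
error_after = frobenius_error(x, w, h)
print(error_before, error_after)
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is what lets each row of H be read as a set of additive term weights for one topic.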

In [37]:
tf_terms

['00',
 '000',
 '0009f',
 '007',
 '10',
 '100',
 '1000',
 '101',
 '102',
 '105',
 '107',
 '109',
 '10b',
 '11',
 '110',
 '111',
 '112',
 '117',
 '12',
 '123',
 '125',
 '126',
 '13',
 '137',
 '138',
 '13th',
 '14',
 '14th',
 '15',
 '1521',
 '155',
 '157',
 '16',
 '160',
 '161',
 '1692',
 '16x9',
 '17',
 '175',
 '1799',
 '17th',
 '18',
 '180',
 '1800s',
 '1812',
 '1888',
 '1896',
 '1898',
 '18th',
 '19',
 '1900',
 '1900s',
 '1925',
 '1928',
 '1930',
 '1930s',
 '1933',
 '1937',
 '1938',
 '1939',
 '1940',
 '1942',
 '1944',
 '1947',
 '1948',
 '1949',
 '1950',
 '1950s',
 '1957',
 '1959',
 '1960',
 '1960s',
 '1961',
 '1962',
 '1964',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1998s',
 '1999',
 '19th',
 '1st',
 ...]

In [38]:
nmf.components_

array([[1.43154823e-03, 1.84524386e-02, 2.62418338e-03, ...,
        1.69348593e-04, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.01805444e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 2.24177424e-02, 0.00000000e+00, ...,
        4.04972621e-05, 0.00000000e+00, 0.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 4.89881559e-02, 0.00000000e+00],
       [5.88421772e-04, 9.39329792e-03, 0.00000000e+00, ...,
        2.45008275e-03, 0.00000000e+00, 0.00000000e+00],
       [2.58663137e-04, 0.00000000e+00, 0.00000000e+00, ...,
        4.70239517e-04, 0.00000000e+00, 0.00000000e+00]])

In [39]:
# function to print top words of topic model
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        # indices of the n_top_words largest weights, highest first
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        message = "\nTopic #{}:".format(index)
        message += " ".join([feature_names[i] for i in top_indices])
        print(message)
        print("="*70)

In [40]:
print_top_words(nmf, tf_terms, 25)


Topic #0:film story like just time characters action good films really movie character plot killer world scene make way does director don people effects scenes know

Topic #1:jackie chan fight mr master fu drunken hong kung action martial li arts films chinese fei kong rush nice guy condor cooking tucker scenes jet

Topic #2:mars alien carpenter ghosts aliens ripley planet henstridge cube red species horror martian precinct mission mining flashback space film zombie science ice movie desolation ship

Topic #3:10 movie just bad film julia really pretty ve guy funny like roberts got time don plot didn thing li seen movies crystal big critique

Topic #4:jedi wars star vader luke obi wan anakin darth gon qui lucas effects skywalker kenobi phantom empire trilogy jinn special jar film menace movie queen

Topic #5:apes burton planet ape 1968 wahlberg tim humans human sleepy hollow leo original davidson carter film bonham astronaut chimp serling battle lead heston ichabod monkey

Topic #6:dam