# Contents

- [Regular Expressions](#reg_ex)
- [NLTK](#nltk)  
    - [Sentence Tokenization](#sent_toke)  
    - [Word Tokenization](#word_toke)  
    - [Stopwords](#stp_wrd)  
    - [N-grams](#ngrm)  
    - [Stemming](#stem)
    - [Lemmatization](#lem)
- [TextBlob](#txt_blb)  
    - [Sentiment Analysis](#sent_anal)
- [Vectorization](#vec)  
    - [CountVectorizer](#cnt_vec)  
    - [TFIDF Vectorizer](#tf_vec)  
- [Topic Modeling](#top_mod)  
    - [LDA](#lda)  
    - [NMF](#nmf)  
- [Word2Vec](#w2v)  

# Imports

In [1]:
import pandas as pd

# Data

In [2]:
from nltk.corpus import movie_reviews

In [3]:
# grab range of filenames from corpus
fileids = movie_reviews.fileids('pos')[:50]
# create list of words for each file
doc_words = [movie_reviews.words(fileid) for fileid in fileids]
# create list of sentences by combining words back into sentences for each file
pos_docs = [' '.join(words) for words in doc_words]

In [4]:
fileids = movie_reviews.fileids('neg')[:50]
doc_words = [movie_reviews.words(fileid) for fileid in fileids]
neg_docs = [' '.join(words) for words in doc_words]

In [5]:
documents = pos_docs + neg_docs

# Regular Expressions <a name="reg_ex"></a>

In [6]:
import re

## match
Searches for match from begginning of the string

In [7]:
# match result
re.match(r'baloney', 'baloney')

<_sre.SRE_Match object; span=(0, 7), match='baloney'>

In [8]:
# no match
re.match(r'baloney', 'caloney')

In [9]:
re.match(r'baloney', 'mebaloneycheese')

In [10]:
re.match(r'baloney', 'baloneycheese')

<_sre.SRE_Match object; span=(0, 7), match='baloney'>

## search
Searches for match anywhere in the string

In [11]:
# match result
re.search(r'spam', 'spam')

<_sre.SRE_Match object; span=(0, 4), match='spam'>

In [12]:
# no match
re.search(r'spam', 'maam')

In [13]:
re.search(r'spam', 'this is some spam over here')

<_sre.SRE_Match object; span=(13, 17), match='spam'>

## findall
Searches for all matches in the string

In [14]:
# match result
re.findall(r'dynasty', 'this is the first dynasty but whose dynasty?')

['dynasty', 'dynasty']

In [15]:
# no match
re.findall(r'dynasty', 'this is the first but whose?')

[]

## General Searching

In [16]:
sample_text = 'purple alice-b@google.com, blah monkey32 bob@abc.com blah dishwasher'

In [17]:
# search for specific character
re.search(r'b', sample_text).group()

'b'

In [18]:
# search for specific character group
re.search(r'bl', sample_text).group()

'bl'

In [19]:
# search for specific charcter group with wildcard
re.search(r'bl.h', sample_text).group()

'blah'

In [20]:
# search for specific character range
re.search(r'[a-c]', sample_text).group()

'a'

In [21]:
# search for general letter
re.search(r'\w', sample_text).group()

'p'

In [22]:
# search for general letter group
re.search(r'\w+', sample_text).group()

'purple'

In [23]:
# search for general number
re.search(r'\d', sample_text).group()

'3'

In [24]:
re.search(r'\d+', sample_text).group()

'32'

In [25]:
# search for general character group followed by general number group
re.search(r'\S+\d+', sample_text).group()

'monkey32'

### Example: email addresses

In [26]:
sample_text

'purple alice-b@google.com, blah monkey32 bob@abc.com blah dishwasher'

In [27]:
# specify groupings in match result
match = re.search(r'(\w+)@(\w+)', sample_text)

In [28]:
# complete match
match.group()

'b@google'

In [29]:
# first group of match result
match.group(1)

'b'

In [30]:
# second group of match result
match.group(2)

'google'

In [31]:
# findall full matches in string
re.findall(r'[\w.-]+@[\w.-]+', sample_text)

['alice-b@google.com', 'bob@abc.com']

In [32]:
# findall with groupings
match = re.findall(r'([\w.-]+)@([\w.-]+)', sample_text)
match

[('alice-b', 'google.com'), ('bob', 'abc.com')]

In [33]:
# first match result
match[0]

('alice-b', 'google.com')

In [34]:
# first group in first match result
match[0][0]

'alice-b'

## Substitution

In [35]:
sample_text

'purple alice-b@google.com, blah monkey32 bob@abc.com blah dishwasher'

In [36]:
# substitute match with specified item
re.sub('com', 'bomb', sample_text)

'purple alice-b@google.bomb, blah monkey32 bob@abc.bomb blah dishwasher'

# NLTK <a name="nltk"></a>

In [37]:
import nltk

## Sentence Tokenization <a name="sent_toke"></a>
Break document into list of sentences

In [38]:
from nltk import tokenize

In [39]:
text_sample = documents[0]
text_sample

'films adapted from comic books have had plenty of success , whether they \' re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there \' s never really been a comic book like from hell before . for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid \' 80s with a 12 - part series called the watchmen . to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . in other words , don \' t dismiss this film because of its source . if you can get past the whole comic book thing , you might find another stumbling block in from hell \' s directors , albert and allen hughes . getting the hughes brothers to direct this seems

In [40]:
sentences = tokenize.sent_tokenize(text_sample)
sentences[0]

"films adapted from comic books have had plenty of success , whether they ' re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there ' s never really been a comic book like from hell before ."

## Word Tokenization <a name="word_toke"></a>
Break sentence into list of words

In [41]:
words = tokenize.word_tokenize(sentences[0])
words

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success',
 ',',
 'whether',
 'they',
 "'",
 're',
 'about',
 'superheroes',
 '(',
 'batman',
 ',',
 'superman',
 ',',
 'spawn',
 ')',
 ',',
 'or',
 'geared',
 'toward',
 'kids',
 '(',
 'casper',
 ')',
 'or',
 'the',
 'arthouse',
 'crowd',
 '(',
 'ghost',
 'world',
 ')',
 ',',
 'but',
 'there',
 "'",
 's',
 'never',
 'really',
 'been',
 'a',
 'comic',
 'book',
 'like',
 'from',
 'hell',
 'before',
 '.']

## Stopwords <a name="stp_wrd"></a>
Words like "the", "and", "of", etc.

In [42]:
from nltk.corpus import stopwords

In [43]:
# create list of words to remove from corpus
stop = stopwords.words('english')
stop += ['.', ',', '(', ')', "'", '"']
stop = set(stop)

In [44]:
# remove stopwords
cln_words = [w for w in words if w not in stop]
cln_words

['films',
 'adapted',
 'comic',
 'books',
 'plenty',
 'success',
 'whether',
 'superheroes',
 'batman',
 'superman',
 'spawn',
 'geared',
 'toward',
 'kids',
 'casper',
 'arthouse',
 'crowd',
 'ghost',
 'world',
 'never',
 'really',
 'comic',
 'book',
 'like',
 'hell']

## N-grams <a name="ngrm"></a>
Adjacent words found in text

In [45]:
from nltk.util import ngrams

In [46]:
# create word pairs
bigrams = ngrams(words, 2)
bigrams

<generator object ngrams at 0x7fcdb27ec468>

In [47]:
for gram in bigrams:
    print(gram)

('films', 'adapted')
('adapted', 'from')
('from', 'comic')
('comic', 'books')
('books', 'have')
('have', 'had')
('had', 'plenty')
('plenty', 'of')
('of', 'success')
('success', ',')
(',', 'whether')
('whether', 'they')
('they', "'")
("'", 're')
('re', 'about')
('about', 'superheroes')
('superheroes', '(')
('(', 'batman')
('batman', ',')
(',', 'superman')
('superman', ',')
(',', 'spawn')
('spawn', ')')
(')', ',')
(',', 'or')
('or', 'geared')
('geared', 'toward')
('toward', 'kids')
('kids', '(')
('(', 'casper')
('casper', ')')
(')', 'or')
('or', 'the')
('the', 'arthouse')
('arthouse', 'crowd')
('crowd', '(')
('(', 'ghost')
('ghost', 'world')
('world', ')')
(')', ',')
(',', 'but')
('but', 'there')
('there', "'")
("'", 's')
('s', 'never')
('never', 'really')
('really', 'been')
('been', 'a')
('a', 'comic')
('comic', 'book')
('book', 'like')
('like', 'from')
('from', 'hell')
('hell', 'before')
('before', '.')


In [48]:
trigrams = ngrams(words, 3)

In [49]:
for gram in trigrams:
    print(gram)

('films', 'adapted', 'from')
('adapted', 'from', 'comic')
('from', 'comic', 'books')
('comic', 'books', 'have')
('books', 'have', 'had')
('have', 'had', 'plenty')
('had', 'plenty', 'of')
('plenty', 'of', 'success')
('of', 'success', ',')
('success', ',', 'whether')
(',', 'whether', 'they')
('whether', 'they', "'")
('they', "'", 're')
("'", 're', 'about')
('re', 'about', 'superheroes')
('about', 'superheroes', '(')
('superheroes', '(', 'batman')
('(', 'batman', ',')
('batman', ',', 'superman')
(',', 'superman', ',')
('superman', ',', 'spawn')
(',', 'spawn', ')')
('spawn', ')', ',')
(')', ',', 'or')
(',', 'or', 'geared')
('or', 'geared', 'toward')
('geared', 'toward', 'kids')
('toward', 'kids', '(')
('kids', '(', 'casper')
('(', 'casper', ')')
('casper', ')', 'or')
(')', 'or', 'the')
('or', 'the', 'arthouse')
('the', 'arthouse', 'crowd')
('arthouse', 'crowd', '(')
('crowd', '(', 'ghost')
('(', 'ghost', 'world')
('ghost', 'world', ')')
('world', ')', ',')
(')', ',', 'but')
(',', 'but', 't

## Stemming <a name="stem"></a>
Returns root of words

In [50]:
# initialize stemmer
stemmer = nltk.stem.porter.PorterStemmer()

In [51]:
# stem each word in words list
for word in words:
    print(stemmer.stem(word))

film
adapt
from
comic
book
have
had
plenti
of
success
,
whether
they
'
re
about
superhero
(
batman
,
superman
,
spawn
)
,
or
gear
toward
kid
(
casper
)
or
the
arthous
crowd
(
ghost
world
)
,
but
there
'
s
never
realli
been
a
comic
book
like
from
hell
befor
.


## Lemmatization <a name="lem"></a>
Similar to stemming but less aggressive

In [52]:
# initialize lemma
lemma=nltk.stem.WordNetLemmatizer()

In [53]:
# lemmatize each word in words list
for word in words:
    print(lemma.lemmatize(word))

film
adapted
from
comic
book
have
had
plenty
of
success
,
whether
they
'
re
about
superheroes
(
batman
,
superman
,
spawn
)
,
or
geared
toward
kid
(
casper
)
or
the
arthouse
crowd
(
ghost
world
)
,
but
there
'
s
never
really
been
a
comic
book
like
from
hell
before
.


# TextBlob <a name="txt_blb"></a>

In [54]:
from textblob import TextBlob

In [55]:
# break input into sentences
TextBlob(text_sample).sentences

[Sentence("films adapted from comic books have had plenty of success , whether they ' re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there ' s never really been a comic book like from hell before ."),
 Sentence("for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid ' 80s with a 12 - part series called the watchmen ."),
 Sentence("to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd ."),
 Sentence("the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes ."),
 Sentence("in other words , don ' t dismiss this film because of its source ."),
 Sentence("if you can get past the whole comic book thing , you might find another stumbling block in from hell ' s directors ,

In [56]:
# break input into words
TextBlob(sentences[0]).words

WordList(['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', 'whether', 'they', 're', 'about', 'superheroes', 'batman', 'superman', 'spawn', 'or', 'geared', 'toward', 'kids', 'casper', 'or', 'the', 'arthouse', 'crowd', 'ghost', 'world', 'but', 'there', 's', 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before'])

In [57]:
# retrieve word counts
TextBlob(sentences[0]).word_counts

defaultdict(int,
            {'films': 1,
             'adapted': 1,
             'from': 2,
             'comic': 2,
             'books': 1,
             'have': 1,
             'had': 1,
             'plenty': 1,
             'of': 1,
             'success': 1,
             'whether': 1,
             'they': 1,
             're': 1,
             'about': 1,
             'superheroes': 1,
             'batman': 1,
             'superman': 1,
             'spawn': 1,
             'or': 2,
             'geared': 1,
             'toward': 1,
             'kids': 1,
             'casper': 1,
             'the': 1,
             'arthouse': 1,
             'crowd': 1,
             'ghost': 1,
             'world': 1,
             'but': 1,
             'there': 1,
             's': 1,
             'never': 1,
             'really': 1,
             'been': 1,
             'a': 1,
             'book': 1,
             'like': 1,
             'hell': 1,
             'before': 1})

## Sentiment Analysis <a name="sent_anal"></a> 

In [58]:
# positive or negative polarity for document
TextBlob(text_sample).sentiment

Sentiment(polarity=0.02095457285330703, subjectivity=0.47927694668201004)

In [59]:
# positive or negative polarity for sentence
TextBlob(sentences[0]).sentiment

Sentiment(polarity=0.17500000000000002, subjectivity=0.3)

# Vectorization <a name="vec"></a>
Bag of Words (BOW)

## CountVectorizer <a name="cnt_vec"></a>
Returns word counts per document

In [60]:
from sklearn.feature_extraction.text import CountVectorizer

In [61]:
# create instance of CountVectorizer
cv_vectorizer = CountVectorizer(stop_words='english', # remove inconsequential words
                             max_df=0.95, # ignore most frequent words (corpus-specific stopwords)
                             min_df=0.05 # ignore least frequent words (irrelevant words)
                            )

In [62]:
# learn vocabulary from input documents
cv_vectorizer.fit(documents)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=None, min_df=0.05,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [63]:
# transform documents into document-term matrix
X = cv_vectorizer.transform(documents)
X

<100x1123 sparse matrix of type '<class 'numpy.int64'>'
	with 12324 stored elements in Compressed Sparse Row format>

In [64]:
# return matrix of term counts per document
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 1, 0]])

In [65]:
# return tokens from documents
cv_vectorizer.get_feature_names()

['10',
 '13',
 '1999',
 '20',
 '30',
 'ability',
 'able',
 'absolutely',
 'accent',
 'act',
 'acting',
 'action',
 'actions',
 'actor',
 'actors',
 'actress',
 'acts',
 'actual',
 'actually',
 'adaptation',
 'add',
 'addition',
 'adds',
 'adult',
 'adults',
 'african',
 'age',
 'agent',
 'ago',
 'al',
 'alan',
 'alien',
 'allen',
 'allowing',
 'amazing',
 'america',
 'american',
 'amusing',
 'angry',
 'anthony',
 'antics',
 'anybody',
 'apparently',
 'appear',
 'appears',
 'approach',
 'appropriate',
 'appropriately',
 'aren',
 'army',
 'art',
 'artist',
 'ask',
 'asked',
 'asks',
 'assured',
 'atmosphere',
 'attempt',
 'attempts',
 'attention',
 'audience',
 'author',
 'award',
 'away',
 'background',
 'bad',
 'band',
 'based',
 'basically',
 'batman',
 'battle',
 'beat',
 'beats',
 'beautiful',
 'begin',
 'beginning',
 'begins',
 'believable',
 'believe',
 'ben',
 'best',
 'better',
 'big',
 'biggest',
 'billy',
 'bit',
 'bizarre',
 'black',
 'bland',
 'block',
 'blood',
 'blowing',


In [66]:
# create dataframe of document-term matrix
cv_df = pd.DataFrame(X.toarray(), columns=[cv_vectorizer.get_feature_names()])
cv_df.head()

Unnamed: 0,10,13,1999,20,30,ability,able,absolutely,accent,act,...,writer,writing,written,wrong,yeah,year,years,york,young,younger
0,0,0,0,0,1,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,2,0,0,0,0
3,0,0,0,0,0,0,0,0,0,5,...,0,0,0,0,0,0,1,1,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### Add lemmatizer

In [67]:
import nltk
from nltk import tokenize

In [68]:
lemma=nltk.stem.WordNetLemmatizer()
# lemmatize function
def lemma_func(document):
    return [lemma.lemmatize(word) for word in tokenize.word_tokenize(document)]

In [69]:
cv_vectorizer = CountVectorizer(stop_words='english',
                             max_df=0.95,
                             min_df=0.05,
                             tokenizer=lemma_func
                            )

In [70]:
X_lem = cv_vectorizer.fit_transform(documents)

In [71]:
cv_df_lem = pd.DataFrame(X_lem.toarray(), columns=[cv_vectorizer.get_feature_names()])
cv_df_lem.head()

Unnamed: 0,!,$,&,*,+,--,/,1,10,13,...,write,writer,writing,written,wrong,yeah,year,york,young,younger
0,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,2,0,0,0
3,0,0,0,0,0,2,0,1,0,0,...,0,0,0,0,0,0,1,1,1,0
4,1,0,0,0,1,3,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


## TFIDF Vectorizer <a name="tf_vec"></a>
Returns weights of words per document normalized against occurence in entire corpus

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [73]:
# create instance of TFIDF vectorizer
tf_vectorizer = TfidfVectorizer(stop_words='english', # remove inconsequential words
                             max_df=0.95, # ignore most frequent words (corpus-specific stopwords)
                             min_df=0.05 # ignore least frequent words (irrelevant words)
                            )

In [74]:
# learn vocabulary from input documents
tf_vectorizer.fit(documents)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=None, min_df=0.05,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [75]:
# transform documents into document-term matrix
X = tf_vectorizer.transform(documents)
X

<100x1123 sparse matrix of type '<class 'numpy.float64'>'
	with 12324 stored elements in Compressed Sparse Row format>

In [76]:
# return matrix of normalized term counts
X.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.08754269, ..., 0.        , 0.05128158,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.10143881, 0.06166595,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.04563867,
        0.        ]])

In [77]:
# return tokens from documents
tf_vectorizer.get_feature_names()

['10',
 '13',
 '1999',
 '20',
 '30',
 'ability',
 'able',
 'absolutely',
 'accent',
 'act',
 'acting',
 'action',
 'actions',
 'actor',
 'actors',
 'actress',
 'acts',
 'actual',
 'actually',
 'adaptation',
 'add',
 'addition',
 'adds',
 'adult',
 'adults',
 'african',
 'age',
 'agent',
 'ago',
 'al',
 'alan',
 'alien',
 'allen',
 'allowing',
 'amazing',
 'america',
 'american',
 'amusing',
 'angry',
 'anthony',
 'antics',
 'anybody',
 'apparently',
 'appear',
 'appears',
 'approach',
 'appropriate',
 'appropriately',
 'aren',
 'army',
 'art',
 'artist',
 'ask',
 'asked',
 'asks',
 'assured',
 'atmosphere',
 'attempt',
 'attempts',
 'attention',
 'audience',
 'author',
 'award',
 'away',
 'background',
 'bad',
 'band',
 'based',
 'basically',
 'batman',
 'battle',
 'beat',
 'beats',
 'beautiful',
 'begin',
 'beginning',
 'begins',
 'believable',
 'believe',
 'ben',
 'best',
 'better',
 'big',
 'biggest',
 'billy',
 'bit',
 'bizarre',
 'black',
 'bland',
 'block',
 'blood',
 'blowing',


In [78]:
# create dataframe of document-term matrix
tf_df = pd.DataFrame(X.toarray(), columns=[tf_vectorizer.get_feature_names()])
tf_df.head()

Unnamed: 0,10,13,1999,20,30,ability,able,absolutely,accent,act,...,writer,writing,written,wrong,yeah,year,years,york,young,younger
0,0.0,0.0,0.0,0.0,0.075444,0.0,0.0,0.0,0.150887,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.041974,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095799,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.119973,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2927,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0385,0.060557,0.036814,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062963,0.0,0.0


### Add lemmatizer

In [79]:
import nltk
from nltk import tokenize

In [80]:
lemma=nltk.stem.WordNetLemmatizer()
# lemmatize function
def lemma_func(document):
    return [lemma.lemmatize(word) for word in tokenize.word_tokenize(document)]

In [81]:
tf_vectorizer = TfidfVectorizer(stop_words='english',
                             max_df=0.95,
                             min_df=0.05,
                             tokenizer=lemma_func
                            )

In [82]:
X_lem = tf_vectorizer.fit_transform(documents)

In [83]:
tf_df_lem = pd.DataFrame(X_lem.toarray(), columns=[tf_vectorizer.get_feature_names()])
tf_df_lem.head()

Unnamed: 0,!,$,&,*,+,--,/,1,10,13,...,write,writer,writing,written,wrong,yeah,year,york,young,younger
0,0.0,0.0,0.0,0.0,0.0,0.0,0.083274,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.041326,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.032651,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.058183,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.09194,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.058141,0.0,0.045453,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.023646,0.048515,0.029493,0.0
4,0.03192,0.0,0.0,0.0,0.058591,0.105325,0.036144,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058591,0.0,0.0


# Topic Modeling <a name="top_mod"></a>

## LDA <a name="lda"></a>

In [84]:
from sklearn.decomposition import LatentDirichletAllocation

In [85]:
# create instance of model, input number of topics to output
lda = LatentDirichletAllocation(n_components=10)

In [86]:
# fit model to vectorized data
lda.fit(cv_df)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [87]:
# documents x token weights
lda.transform(cv_df)

array([[5.46467050e-04, 5.46473458e-04, 5.46469566e-04, 9.95081409e-01,
        5.46569993e-04, 5.46546366e-04, 5.46535340e-04, 5.46560186e-04,
        5.46498324e-04, 5.46470597e-04],
       [5.95260425e-04, 5.95267785e-04, 5.95260437e-04, 9.94642395e-01,
        5.95309701e-04, 5.95345269e-04, 5.95298633e-04, 5.95311257e-04,
        5.95292839e-04, 5.95259018e-04],
       [7.75220330e-04, 7.75224411e-04, 7.75226717e-04, 9.93022411e-01,
        7.75299554e-04, 7.75324306e-04, 7.75466858e-04, 7.75355691e-04,
        7.75249382e-04, 7.75222061e-04],
       [4.13235780e-04, 4.13237342e-04, 4.13241471e-04, 9.96280554e-01,
        4.13322358e-04, 4.13317304e-04, 4.13280759e-04, 4.13304806e-04,
        4.13268942e-04, 4.13237308e-04],
       [6.13509849e-04, 6.13520337e-04, 6.13514889e-04, 6.13679516e-04,
        6.13590598e-04, 6.13599905e-04, 6.13564742e-04, 9.94477961e-01,
        6.13544666e-04, 6.13514066e-04],
       [4.95064209e-04, 4.95067093e-04, 4.95072708e-04, 1.89196960e-01,
   

In [88]:
# topics x token weights
lda.components_

array([[0.26925859, 0.30519133, 0.29208775, ..., 0.28164101, 0.30447978,
        0.30456744],
       [0.62688594, 0.28234955, 0.28343785, ..., 0.30694108, 0.28150166,
        0.27148719],
       [0.29271273, 0.28252142, 0.30316538, ..., 0.26339421, 0.34643057,
        0.28872322],
       ...,
       [1.62106886, 1.58730857, 0.27752652, ..., 1.78772797, 5.48191087,
        0.27428871],
       [0.30982889, 0.30844214, 0.31081463, ..., 0.32141873, 0.79059417,
        0.28660539],
       [0.31799216, 0.32504928, 0.33665251, ..., 0.27875948, 0.26304   ,
        0.27854935]])

In [89]:
# tokens from vectorizer
cv_vectorizer.get_feature_names()

['!',
 '$',
 '&',
 '*',
 '+',
 '--',
 '/',
 '1',
 '10',
 '13',
 '1999',
 '2',
 '20',
 '3',
 '30',
 '4',
 '5',
 '6',
 '60',
 '7',
 '8',
 '9',
 '90',
 ':',
 ';',
 '?',
 '``',
 'ability',
 'able',
 'absolutely',
 'accent',
 'act',
 'acting',
 'action',
 'actor',
 'actress',
 'actual',
 'actually',
 'ad',
 'adaptation',
 'add',
 'addition',
 'adult',
 'adventure',
 'african',
 'age',
 'agent',
 'ago',
 'al',
 'alan',
 'alien',
 'allen',
 'allowing',
 'amazing',
 'america',
 'american',
 'amusing',
 'angle',
 'angry',
 'answer',
 'anthony',
 'antic',
 'anybody',
 'apparently',
 'appear',
 'appearance',
 'appears',
 'approach',
 'appropriate',
 'appropriately',
 'aren',
 'army',
 'art',
 'artist',
 'ask',
 'asked',
 'asks',
 'aspect',
 'associate',
 'assured',
 'atmosphere',
 'attack',
 'attempt',
 'attention',
 'audience',
 'author',
 'award',
 'away',
 'b',
 'background',
 'bad',
 'ball',
 'band',
 'bar',
 'based',
 'basically',
 'batman',
 'battle',
 'beat',
 'beautiful',
 'begin',
 'begi

In [118]:
# return index of top five tokens in first topic
lda.components_[0].argsort()[::-1][:5]

array([359, 659, 421, 143, 572])

In [125]:
# return top five tokens in first topic
[cv_vectorizer.get_feature_names()[i] for i in lda.components_[0].argsort()[::-1][:5]]

['extremely', 'message', 'form', 'c', 'killing']

In [90]:
# function to print top words of topic model
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        message = "\nTopic #{}:".format(index)
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1 :-1]])
        print(message)
        print("="*70)

In [91]:
print_top_words(lda, cv_vectorizer.get_feature_names(), 25)


Topic #0:extremely message form c killing generation 1999 george decade cartoon flick liner foot spent doing friend patience batman killed going opportunity project produced documentary family

Topic #1:extremely killing message dead ! fail chief ha spent flick rest buzz inevitable literally film important forced project killed imagine type filmed interested violence taking

Topic #2:message free spent extremely killed buzz killing violence intelligence literally teen interested fall opening 13 presence jack ha day wonderful thriller forever pay wake c

Topic #3:extremely message killing intelligence teen buzz form violence spent fail c killed day project likely literally dead ride appearance jack decade 2 patience atmosphere michael

Topic #4:message extremely version song taken life word entirely entertainment day male ex killing control dangerous intelligence month jack modern claim violence teen law dress apparently

Topic #5:extremely spent follows c teen killing fail atmosphere 

## NMF <a name="nmf"></a>

In [92]:
from sklearn.decomposition import NMF

In [93]:
# create instance of model, input number of topics to output
nmf = NMF(n_components=10)

In [94]:
# fit model to vectorized data
nmf.fit(tf_df)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=10, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [95]:
# documents x topic weights
nmf.transform(tf_df)

array([[3.04108826e-04, 0.00000000e+00, 4.04948276e-03, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.81748469e-02,
        0.00000000e+00, 5.89968710e-01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        5.94841618e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 6.01319602e-02, 7.70214958e-03, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 1.44874254e-01, 0.00000000e+00,
        1.00847875e-01, 2.37604661e-01],
       [7.94331547e-02, 8.21007876e-02, 0.00000000e+00, 2.33526850e-02,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.50605200e-01,
        7.88512961e-02, 4.78854779e-02],
       [0.00000000e+00, 1.59765569e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 2.15012584e-01, 4.18378249e-02, 0.00000000e+00,
        0.00000000e+00, 1.26673077e-03],
       [2.43599829e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
   

In [96]:
# topics x token weights
nmf.components_

array([[0.        , 0.02070135, 0.00347374, ..., 0.00720451, 0.08153155,
        0.        ],
       [0.        , 0.01890001, 0.01308674, ..., 0.        , 0.04655268,
        0.        ],
       [0.73220665, 0.00485234, 0.        , ..., 0.        , 0.07326216,
        0.        ],
       ...,
       [0.        , 0.        , 0.00518614, ..., 0.04766497, 0.02265739,
        0.01819346],
       [0.        , 0.        , 0.        , ..., 0.        , 0.07882352,
        0.        ],
       [0.        , 0.        , 0.01956792, ..., 0.03206298, 0.02326672,
        0.        ]])

In [97]:
# tokens from vectorizer
tf_vectorizer.get_feature_names()

['!',
 '$',
 '&',
 '*',
 '+',
 '--',
 '/',
 '1',
 '10',
 '13',
 '1999',
 '2',
 '20',
 '3',
 '30',
 '4',
 '5',
 '6',
 '60',
 '7',
 '8',
 '9',
 '90',
 ':',
 ';',
 '?',
 '``',
 'ability',
 'able',
 'absolutely',
 'accent',
 'act',
 'acting',
 'action',
 'actor',
 'actress',
 'actual',
 'actually',
 'ad',
 'adaptation',
 'add',
 'addition',
 'adult',
 'adventure',
 'african',
 'age',
 'agent',
 'ago',
 'al',
 'alan',
 'alien',
 'allen',
 'allowing',
 'amazing',
 'america',
 'american',
 'amusing',
 'angle',
 'angry',
 'answer',
 'anthony',
 'antic',
 'anybody',
 'apparently',
 'appear',
 'appearance',
 'appears',
 'approach',
 'appropriate',
 'appropriately',
 'aren',
 'army',
 'art',
 'artist',
 'ask',
 'asked',
 'asks',
 'aspect',
 'associate',
 'assured',
 'atmosphere',
 'attack',
 'attempt',
 'attention',
 'audience',
 'author',
 'award',
 'away',
 'b',
 'background',
 'bad',
 'ball',
 'band',
 'bar',
 'based',
 'basically',
 'batman',
 'battle',
 'beat',
 'beautiful',
 'begin',
 'begi

In [98]:
# function to print top words of topic model
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        message = "\nTopic #{}:".format(index)
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1 :-1]])
        print(message)
        print("="*70)

In [99]:
print_top_words(nmf, tf_vectorizer.get_feature_names(), 25)


Topic #0:extremely spent okay buzz killing case ground making literally killed seemingly performance line lost asks away friend actual cut earlier combination intelligence lawrence ride atmosphere

Topic #1:message killing jack directing song project secret game day appearance intelligence situation critique version comedic tale wild claim dead decade attempt okay type comparison spent

Topic #2:! extremely mind appears planet message suppose intelligence month computer taken clue designer produced choice likely villain let attention series lee decade budget course recent

Topic #3:break family inevitable happen extremely purpose lead $ act particular expect directing rip thought led hate shot situation dramatic wake sight perfectly clothes weak male

Topic #4:ridiculous chief extremely grant flick desire investigation documentary killing sent filmed odd screenwriter person middle general forever film present viewer beginning coming sure fail project

Topic #5:2 extremely fail experie

# Word2Vec <a name="w2v"></a>
Find similarity between words based on given corpus

In [100]:
import gensim

## Preprocess

In [101]:
from nltk.corpus import stopwords
from nltk import tokenize

In [102]:
stoplist = stopwords.words('english')
stoplist += ['.', ',', '(', ')', "'", '"']
stoplist = set(stop)

In [103]:
doc_words = [[word for word in tokenize.word_tokenize(document.lower()) if word not in stoplist]
         for document in documents]
doc_words[0:5]

[['films',
  'adapted',
  'comic',
  'books',
  'plenty',
  'success',
  'whether',
  'superheroes',
  'batman',
  'superman',
  'spawn',
  'geared',
  'toward',
  'kids',
  'casper',
  'arthouse',
  'crowd',
  'ghost',
  'world',
  'never',
  'really',
  'comic',
  'book',
  'like',
  'hell',
  'starters',
  'created',
  'alan',
  'moore',
  'eddie',
  'campbell',
  'brought',
  'medium',
  'whole',
  'new',
  'level',
  'mid',
  '80s',
  '12',
  '-',
  'part',
  'series',
  'called',
  'watchmen',
  'say',
  'moore',
  'campbell',
  'thoroughly',
  'researched',
  'subject',
  'jack',
  'ripper',
  'would',
  'like',
  'saying',
  'michael',
  'jackson',
  'starting',
  'look',
  'little',
  'odd',
  'book',
  '``',
  'graphic',
  'novel',
  '``',
  '500',
  'pages',
  'long',
  'includes',
  'nearly',
  '30',
  'consist',
  'nothing',
  'footnotes',
  'words',
  'dismiss',
  'film',
  'source',
  'get',
  'past',
  'whole',
  'comic',
  'book',
  'thing',
  'might',
  'find',
  'ano

## Model

### Continuous Bag of Words (CBOW)
Predicts word from context

In [104]:
# initialize model
model = gensim.models.Word2Vec(doc_words, size=100, window=5, min_count=1, workers=2)

In [105]:
# find similar words in corpus
model.wv.most_similar(positive='director')

  if np.issubdtype(vec.dtype, np.int):


[('-', 0.989967942237854),
 ('much', 0.9894535541534424),
 ('one', 0.9889310002326965),
 ('film', 0.9889243841171265),
 ('get', 0.9888666868209839),
 ('way', 0.9888195395469666),
 ('man', 0.9887357354164124),
 ('also', 0.9884523153305054),
 (';', 0.988426685333252),
 ('movie', 0.988409161567688)]

In [106]:
# rate similarity between two words in corpus
model.wv.similarity(w1='director', w2='actor')

  if np.issubdtype(vec.dtype, np.int):


0.98249686

### skip-gram
Predicts context from word

In [107]:
# initialize model
model = gensim.models.Word2Vec(doc_words, size=100, window=10, min_count=1, workers=2, sg=1)

In [108]:
# find similar words in corpus
model.wv.most_similar(positive='director')

  if np.issubdtype(vec.dtype, np.int):


[('writer', 0.9993724822998047),
 ('dragon', 0.9992942810058594),
 ('crouching', 0.9992886781692505),
 ('tiger', 0.9992499351501465),
 ('hidden', 0.9992386698722839),
 ('replacement', 0.9991947412490845),
 ('austin', 0.9991775751113892),
 ('powers', 0.99915611743927),
 ('rock', 0.9991480112075806),
 ('detroit', 0.9991359710693359)]

In [109]:
# rate similarity between two words in corpus
model.wv.similarity(w1='director', w2='actor')

  if np.issubdtype(vec.dtype, np.int):


0.99335575