We'll now turn to machine learning applied to text, a.k.a. Natural Language Processing (NLP) or computational linguistics.  We'll use the Reddit 2015 comments dataset for our investigations.

Resources:

(1) Pandas has some useful, basic support for manipulating text.  
        http://pandas.pydata.org/pandas-docs/stable/text.html
        
(2) The Python Natural Language Toolkit (NLTK), which is installed with Anaconda: 
        http://www.nltk.org/
        
(3) Reference and practice for Regular Expressions (regex): 
        http://regex.learncodethehardway.org/book/
        http://regexr.com/
        
(4) Gensim, a Python library for topic modeling that works alongside NLTK.  Website: https://radimrehurek.com/gensim/index.html
    You can install it with this shell command (run it from your Anaconda directory if you are using Windows):
    `python -m pip install gensim`
    

Scenario: Your company, a car manufacturer, wants to use social media to understand current trends in public interest and opinion.  What does the public think about the company's products?  About competitors' products?  You (a data scientist, software developer, or business analyst) have been picked as a team member because of your practical knowledge of machine-learning techniques. You decide to start with an analysis of Reddit comments.  For example: https://www.reddit.com/r/Toyota/   

WARNING: WE WILL BE WORKING WITH UNCENSORED COMMENTS POSTED IN A PUBLIC DISCUSSION FORUM.  SOME OF THESE COMMENTS MAY INCLUDE OFFENSIVE CONTENT.  NEITHER THE INSTRUCTOR NOR ANYONE AT CODEMENTOR CREATED ANY OF THE COMMENTS.  WE HAVE NOT SELECTED ANY OFFENSIVE COMMENTS TO SHOW IN THE CLASS.  OFFENSIVE COMMENTS WHICH MAY APPEAR ARE THE RESULT OF APPLYING ML-ALGORITHMS FOR THE PURPOSE OF TOPIC MODELING AND OPINION MINING.

In [199]:
%matplotlib inline
# allows plotting in cells, we'll use later on.
import pylab
pylab.rcParams['figure.figsize'] = (10, 6) # set a larger figure size

In [200]:
# Ford vs. Toyota
# Topic modeling (unsupervised learning)

In [201]:
import sqlite3
conn = sqlite3.connect('C:/Users/peter/CM-ML-Class/3/data/database.sqlite')

In [202]:
import pandas

In [203]:
# List the tables in this database!

query = """
SELECT name from sqlite_master WHERE type='table';
"""
c = conn.cursor()
c.execute(query)
data = c.fetchall()  # you can also iterate over the cursor, e.g. for row in c: doStuff(row)
print(data)

[(u'May2015',)]


In [204]:
# List the fields in this table!
query = """
pragma table_info('May2015');
"""
c = conn.cursor()
c.execute(query)
data = c.fetchall()  # each row is (cid, name, type, notnull, dflt_value, pk)
print(data)

[(0, u'created_utc', u'INTEGER', 0, None, 0), (1, u'ups', u'INTEGER', 0, None, 0), (2, u'subreddit_id', u'', 0, None, 0), (3, u'link_id', u'', 0, None, 0), (4, u'name', u'', 0, None, 0), (5, u'score_hidden', u'', 0, None, 0), (6, u'author_flair_css_class', u'', 0, None, 0), (7, u'author_flair_text', u'', 0, None, 0), (8, u'subreddit', u'', 0, None, 0), (9, u'id', u'', 0, None, 0), (10, u'removal_reason', u'', 0, None, 0), (11, u'gilded', u'int', 0, None, 0), (12, u'downs', u'int', 0, None, 0), (13, u'archived', u'', 0, None, 0), (14, u'author', u'', 0, None, 0), (15, u'score', u'int', 0, None, 0), (16, u'retrieved_on', u'int', 0, None, 0), (17, u'body', u'', 0, None, 0), (18, u'distinguished', u'', 0, None, 0), (19, u'edited', u'', 0, None, 0), (20, u'controversiality', u'int', 0, None, 0), (21, u'parent_id', u'', 0, None, 0)]
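
If you want to try the two schema queries above without the 30 GB Reddit file, here is a minimal sketch against a throwaway in-memory database. The `demo` table and its columns are made up for illustration; only `sqlite_master` and `pragma table_info` come from the cells above.

```python
import sqlite3

# In-memory database with one hypothetical table (not the Reddit schema).
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE demo (subreddit TEXT, body TEXT, score INTEGER)")

# Table names live in the sqlite_master catalog.
table_names = [row[0] for row in
               demo_conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]

# Each table_info row is (cid, name, type, notnull, dflt_value, pk),
# so the column name is the second field.
column_names = [row[1] for row in demo_conn.execute("pragma table_info('demo')")]

print(table_names)
print(column_names)
```

Pulling out just the names this way is easier to read than the raw tuples when a table has many fields.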


In [205]:
query = """
SELECT subreddit,
    body,
    score
    FROM May2015
    WHERE subreddit = '{0}'
"""
# On Reddit, topics are called 'subreddits'
# https://www.reddit.com/r/Toyota/
# https://www.reddit.com/r/Ford/

df_ford = pandas.read_sql(query.format('Ford'), conn)
df_toyota =  pandas.read_sql(query.format('Toyota'), conn)
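
Formatting the subreddit name into the SQL string is fine here because we typed the names ourselves. If the value ever comes from outside, a `?` placeholder is the safer habit. The sketch below shows the idea on a small in-memory table; the rows are invented, and the table merely borrows the `May2015` name.

```python
import sqlite3

# Tiny stand-in for the real table, filled with made-up rows.
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE May2015 (subreddit TEXT, body TEXT, score INTEGER)")
demo_conn.executemany(
    "INSERT INTO May2015 VALUES (?, ?, ?)",
    [('Ford', 'made-up comment A', 1),
     ('Toyota', 'made-up comment B', 2),
     ('Ford', 'made-up comment C', 0)])

# The driver substitutes the value for '?', so quoting is handled for us.
safe_query = "SELECT subreddit, body, score FROM May2015 WHERE subreddit = ?"
ford_rows = demo_conn.execute(safe_query, ('Ford',)).fetchall()
print(len(ford_rows))
```

With pandas the same query can be passed as `pandas.read_sql(safe_query, conn, params=('Ford',))`.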

In [206]:
df_ford.head()

Unnamed: 0,subreddit,body,score
0,Ford,Yes a bit more definitely. I guess assume I'll...,1
1,Ford,No it's the tube for the intake from the inter...,1
2,Ford,Looks very similar to the Australian Ford Falc...,2
3,Ford,Very hard to find. I prefer them for their uni...,1
4,Ford,No way... you found a Ford in STL???,0


In [207]:
df_ford.describe()

Unnamed: 0,score
count,1471.0
mean,1.82257
std,2.250426
min,-22.0
25%,1.0
50%,1.0
75%,2.0
max,35.0


In [14]:
df_toyota.head()

Unnamed: 0,subreddit,body,score
0,Toyota,What kind of truck? Is that the wrong question...,1
1,Toyota,"I'm a bit of a traitor here, but it's a Nissan...",1
2,Toyota,Those are a good truck though. I used to sell ...,1
3,Toyota,"Time can do as much damage as miles though, mo...",2
4,Toyota,"Not gonna lie, I hate it lol. It's more of a p...",1


In [15]:
df_toyota.describe()

Unnamed: 0,score
count,1094.0
mean,1.739488
std,1.654171
min,-4.0
25%,1.0
50%,1.0
75%,2.0
max,24.0


In [16]:
# save out data frames so interested parties can skip the 30 GB database
# df_toyota.to_csv("C:/Users/peter/CM-ML-Class/3/data/toyota.csv", index=False, index_label=False, encoding='utf-8')
# df_ford.to_csv("C:/Users/peter/CM-ML-Class/3/data/ford.csv", index=False, index_label=False, encoding='utf-8')

In [17]:
# here's how to read csv's into a DataFrame:
# df_toyota = pandas.read_csv("C:/Users/peter/CM-ML-Class/3/data/toyota.csv")
# df_ford = pandas.read_csv("C:/Users/peter/CM-ML-Class/3/data/ford.csv")

In [18]:
df_toyota.head()

Unnamed: 0,subreddit,body,score
0,Toyota,What kind of truck? Is that the wrong question...,1
1,Toyota,"I'm a bit of a traitor here, but it's a Nissan...",1
2,Toyota,Those are a good truck though. I used to sell ...,1
3,Toyota,"Time can do as much damage as miles though, mo...",2
4,Toyota,"Not gonna lie, I hate it lol. It's more of a p...",1


In [208]:
df = pandas.concat([df_ford, df_toyota])

In [209]:
df.head()

Unnamed: 0,subreddit,body,score
0,Ford,Yes a bit more definitely. I guess assume I'll...,1
1,Ford,No it's the tube for the intake from the inter...,1
2,Ford,Looks very similar to the Australian Ford Falc...,2
3,Ford,Very hard to find. I prefer them for their uni...,1
4,Ford,No way... you found a Ford in STL???,0


In [210]:
df.tail()

Unnamed: 0,subreddit,body,score
1089,Toyota,I'm super excited to finally get it. For a co...,2
1090,Toyota,"If you're going to have to wait for delivery, ...",2
1091,Toyota,"Hey, I'm in a similar situation in looking to ...",1
1092,Toyota,http://www.crutchfield.com,1
1093,Toyota,If you upgrade the door speakers be sure to pu...,1


In [211]:
df.describe()

Unnamed: 0,score
count,2565.0
mean,1.787135
std,2.017839
min,-22.0
25%,1.0
50%,1.0
75%,2.0
max,35.0


Data prep: Tokenize, remove stop words, and stem.

# Tokenize
This means breaking documents into sentences and sentences into words.
What's the right tokenizer for our data?
See www.nltk.org/api/nltk.tokenize.html
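
To make "documents into sentences and sentences into words" concrete, here is a deliberately naive sketch that uses only the standard `re` module. The two helper functions and their patterns are our own stand-ins, not NLTK's algorithms, and the example shows how easily simple rules are fooled.

```python
import re

def naive_sentences(text):
    # Split after '.', '!' or '?' when whitespace follows.
    # Ellipses and abbreviations will fool this rule.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_words(sentence):
    # Runs of letters, digits and apostrophes are words;
    # any other non-space character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)

doc = "No way... you found a Ford in STL? Nice!"
sentences = naive_sentences(doc)
words = [naive_words(s) for s in sentences]
print(sentences)
print(words)
```

The ellipsis after "No way" is treated as a sentence end, which is exactly the kind of mistake that motivates trying the purpose-built tokenizers.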

In [213]:
# old school
x = "A very simple example."
x.split(' ')

['A', 'very', 'simple', 'example.']

In [214]:
x = df.iloc[222]['body']
x


u'Gotta love Ford blue.'

In [215]:
x.split(' ')

[u'Gotta', u'love', u'Ford', u'blue.']

In [216]:
x = df.iloc[2224]['body']
x.split(' ')

[u'What',
 u'did',
 u'you',
 u'have',
 u'to',
 u'do',
 u'repair/maintenance',
 u'wise',
 u'during',
 u'your',
 u'ownership?']

In [239]:
import nltk
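nltk.download()  # opens the interactive downloader; the tokenizers and stop-word list used below need data packages such as 'punkt' and 'stopwords'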
# http://www.nltk.org/data.html

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


KeyboardInterrupt: 

In [217]:
from nltk.tokenize import word_tokenize

In [218]:
word_tokenize(x)

[u'What',
 u'did',
 u'you',
 u'have',
 u'to',
 u'do',
 u'repair/maintenance',
 u'wise',
 u'during',
 u'your',
 u'ownership',
 u'?']

In [219]:
from nltk.tokenize import TreebankWordTokenizer

In [220]:
TreebankWordTokenizer().tokenize(x)

[u'What',
 u'did',
 u'you',
 u'have',
 u'to',
 u'do',
 u'repair/maintenance',
 u'wise',
 u'during',
 u'your',
 u'ownership',
 u'?']

In [221]:
from nltk.tokenize import TweetTokenizer

In [222]:
TweetTokenizer().tokenize(x)

[u'What',
 u'did',
 u'you',
 u'have',
 u'to',
 u'do',
 u'repair',
 u'/',
 u'maintenance',
 u'wise',
 u'during',
 u'your',
 u'ownership',
 u'?']

In [223]:
df.iloc[1011]['body']

u"Gm makes the Le Mans vette... That's super right? ;) "

In [224]:
x=df.iloc[1011]['body']

In [225]:
TweetTokenizer().tokenize(x)

[u'Gm',
 u'makes',
 u'the',
 u'Le',
 u'Mans',
 u'vette',
 u'...',
 u"That's",
 u'super',
 u'right',
 u'?',
 u';)']

In [226]:
TreebankWordTokenizer().tokenize(x)

[u'Gm',
 u'makes',
 u'the',
 u'Le',
 u'Mans',
 u'vette',
 u'...',
 u'That',
 u"'s",
 u'super',
 u'right',
 u'?',
 u';',
 u')']

In [227]:
word_tokenize(x)

[u'Gm',
 u'makes',
 u'the',
 u'Le',
 u'Mans',
 u'vette',
 u'...',
 u'That',
 u"'s",
 u'super',
 u'right',
 u'?',
 u';',
 u')']

Looks like the TweetTokenizer is best for our data!  If none of the tokenizers work well for your data, you can always create a custom tokenizer with regular expressions (regex or regexp).

In [229]:
from nltk.tokenize import RegexpTokenizer

In [233]:
regex_caps = r'[A-Z]\w+' # capitalized words (raw string keeps the backslash literal)
regex_emotes = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)' # emoticons/smileys
tokenizer_caps = RegexpTokenizer(regex_caps)
tokenizer_emotes = RegexpTokenizer(regex_emotes)

In [234]:
tokenizer_caps.tokenize(x)

[u'Gm', u'Le', u'Mans', u'That']

In [235]:
tokenizer_emotes.tokenize(x)

[u';)']

In [236]:
class CapsEmoteTokenizer:
    def __init__(self):
        regex_caps = r'[A-Z]\w+' # capitalized words
        regex_emotes = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)' # emoticons/smileys
        self.tokenizer_caps = RegexpTokenizer(regex_caps)
        self.tokenizer_emotes = RegexpTokenizer(regex_emotes)
    
    def tokenize(self, raw):
        caps = self.tokenizer_caps.tokenize(raw)
        emotes = self.tokenizer_emotes.tokenize(raw)
        answer = caps + emotes
        return answer
    

In [237]:
t = CapsEmoteTokenizer()
t.tokenize(x)

[u'Gm', u'Le', u'Mans', u'That', u';)']
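
One thing to notice: the class above returns all capitalized words first and all emoticons second, so the original word order is lost. If order matters, one option is to combine the patterns into a single alternation. This is a sketch with the standard `re` module; the simplified emoticon pattern and the function name are our own choices.

```python
import re

# Emoticon alternative first, then capitalized words.
EMOTE = r"[:;=]-?[)(DP]"
CAPS = r"[A-Z]\w+"
caps_or_emote = re.compile(EMOTE + "|" + CAPS)

def tokenize_in_order(raw):
    # findall scans left to right, so tokens keep their original order.
    return caps_or_emote.findall(raw)

sample = "Gm makes the Le Mans vette... That's super right? ;) "
tokens_in_order = tokenize_in_order(sample)
smiley_first = tokenize_in_order(":) Nice Focus")
print(tokens_in_order)
print(smiley_first)
```

The second call shows the difference: the smiley stays in front of the words it preceded.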

# "Stop Word" removal
A "stop word" is a low-information word such as "is", "and", or "but".  Such words occur frequently and add little value to the analysis, so it makes sense to remove them.  The exact list of stop words depends on the data and the analysis.  NLTK provides a list of English stop words as a starting point.

In [1]:
from nltk.corpus import stopwords as stopwordfactory

In [2]:
stopwords = stopwordfactory.words('english')

In [7]:
type(stopwordfactory)

nltk.corpus.reader.wordlist.WordListCorpusReader

In [3]:
type(stopwords)

list

In [242]:
len(stopwords)

127

In [243]:
stopwords[0:10]

[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your']

In [244]:
x=df.iloc[1012]['body']
tokens = TweetTokenizer().tokenize(x)
tokens

[u'I',
 u'make',
 u'decent',
 u'money',
 u'for',
 u'just',
 u'coming',
 u'out',
 u'of',
 u'college',
 u'and',
 u"I've",
 u'got',
 u'a',
 u"'",
 u'14',
 u'Focus',
 u'Sport',
 u'...',
 u'now',
 u"I'm",
 u'looking',
 u'at',
 u'the',
 u'ST',
 u'with',
 u'hungry',
 u'eyes',
 u'lol']

In [245]:
# list comprehension quick tutorial
range(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [247]:
mylist = []
for i in range(10):
    mylist.append(2*i)
mylist

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [249]:
[ 2*i for i in range(10) if i > 5]

[12, 14, 16, 18]

In [251]:
names = ['Curly', 'Larry', 'Moe', 'Schemp']
[ name + ' is a stooge' for name in names]

['Curly is a stooge',
 'Larry is a stooge',
 'Moe is a stooge',
 'Schemp is a stooge']

In [254]:
[t for t in tokens if t.lower() not in stopwords] # compare in lower case but keep the original token, since car model names need their capitals

[u'make',
 u'decent',
 u'money',
 u'coming',
 u'college',
 u"I've",
 u'got',
 u"'",
 u'14',
 u'Focus',
 u'Sport',
 u'...',
 u"I'm",
 u'looking',
 u'ST',
 u'hungry',
 u'eyes',
 u'lol']

In [255]:
stopwords += ["i've", "...", "i'm", ".", "'"]

In [256]:
len(stopwords)

132

In [257]:
tokens_stopped = [t for t in tokens if t.lower() not in stopwords]
tokens_stopped

[u'make',
 u'decent',
 u'money',
 u'coming',
 u'college',
 u'got',
 u'14',
 u'Focus',
 u'Sport',
 u'looking',
 u'ST',
 u'hungry',
 u'eyes',
 u'lol']
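
A performance note for when we run this over thousands of comments: `t.lower() not in stopwords` scans a Python list from the start every time. Converting the list to a set gives the same answers with constant-time lookups on average. The sketch below uses a small hand-written stop list as a stand-in for NLTK's.

```python
# Hypothetical mini stop list; NLTK's list plays this role in the cells above.
stop_list = ["i", "for", "just", "out", "of", "and", "a", "at", "the", "with", "now"]
stop_set = frozenset(stop_list)  # hashing makes each membership test O(1) on average

def remove_stopwords(tokens, stops):
    # Compare in lower case but return the token with its original capitals.
    return [t for t in tokens if t.lower() not in stops]

sample_tokens = ["I", "got", "a", "Focus", "Sport", "now", "looking", "at", "the", "ST"]
kept = remove_stopwords(sample_tokens, stop_set)
print(kept)
```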

# Stemming
Often we'll want to use only the root, or stem, of each word in our analysis.  For example, in our present problem we might want 'selling' and 'sells' to both map to 'sell'.  This process is called 'stemming'.  (A related but more thorough mapping is 'lemmatization', which uses vocabulary knowledge and can also map an irregular form such as 'sold' to 'sell'.) NLTK provides us with several stemming options. http://www.nltk.org/api/nltk.stem.html
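
Before trying the real stemmers, a toy contrast may help. A stemmer strips suffixes by rule, while a lemmatizer looks words up, which is why an irregular form like 'sold' is out of a stemmer's reach. Both functions below are invented for illustration and are far cruder than anything in NLTK.

```python
def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# A lemmatizer needs vocabulary knowledge; this three-entry dict is a stand-in.
toy_lemmas = {"sold": "sell", "selling": "sell", "sells": "sell"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)

forms = ["selling", "sells", "sold"]
stems = [toy_stem(w) for w in forms]
lemmas = [toy_lemmatize(w) for w in forms]
print(stems)
print(lemmas)
```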

In [258]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

In [259]:
[porter.stem(t) for t in tokens_stopped] # it messed with our car model name; we'll have to separate out before stemming

[u'make',
 u'decent',
 u'money',
 u'come',
 u'colleg',
 u'got',
 u'14',
 u'Focu',
 u'Sport',
 u'look',
 u'ST',
 u'hungri',
 u'eye',
 u'lol']

In [260]:
from nltk.stem.lancaster import LancasterStemmer
lanc = LancasterStemmer()

In [261]:
[lanc.stem(t) for t in tokens_stopped] # hungri eye or hungry ey lol, literally

[u'mak',
 u'dec',
 u'money',
 u'com',
 u'colleg',
 u'got',
 u'14',
 u'foc',
 u'sport',
 u'look',
 u'st',
 u'hungry',
 u'ey',
 u'lol']

In [262]:
from nltk.stem.snowball import EnglishStemmer
snow = EnglishStemmer()

In [263]:
[snow.stem(t) for t in tokens_stopped] # Snowball's English stemmer refines Porter's rules, so results should be close to the Porter output

# Now let's process all of our Reddit comments for word analysis!
Proposed algorithm:
1. TweetTokenizer
2. Put caps words aside for safekeeping
3. Remove stop words from both
4. Porter stemmer
5. Add caps words back in

This seems to make sense for our current application.  We can do some diagnostics, see what we get, then modify this algorithm if necessary.   

In [270]:
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords as StopwordFactory

class CarsDataPrepper:
    def __init__(self):
        self.porter = PorterStemmer()
        regex_caps = r'[A-Z]\w+' # capitalized words
        self.tokenizer_caps = RegexpTokenizer(regex_caps)
        self.tokenizer_tweet = TweetTokenizer()
        self.stopwords = StopwordFactory.words('english')
        self.stopwords += ["i've", "...", "i'm", ".", "'", "it's", "/", ")", "(", "]",
                           "[", ",", "!", "?", '"', '-', ':', '*', 'deleted', "|", "^", "#"]
        return
    
    def _tokenize(self, raw):
        caps = self.tokenizer_caps.tokenize(raw)
        allwords = self.tokenizer_tweet.tokenize(raw)
        lowers = [w for w in allwords if w not in caps]
        return caps, lowers
    
    def _removeStopwords(self, caps, lowers):
        lowers_stopped = [w for w in lowers if w.lower() not in self.stopwords]
        caps_stopped = [w for w in caps if w.lower() not in self.stopwords]
        return caps_stopped, lowers_stopped
    
    def prep(self, raw):
        caps, lowers = self._tokenize(raw)
        caps_stopped, lowers_stopped = self._removeStopwords(caps, lowers)
        lowers_stemmed = [self.porter.stem(word) for word in lowers_stopped]
        return caps_stopped + lowers_stemmed
    

In [288]:
prepper = CarsDataPrepper()

In [289]:
x

u"I make decent money for just coming out of college and I've got a '14 Focus Sport.....now I'm looking at the ST with hungry eyes lol"

In [290]:
prepper.prep(x)

[u'Focus',
 u'Sport',
 u'ST',
 u'make',
 u'decent',
 u'money',
 u'come',
 u'colleg',
 u'got',
 u'14',
 u'look',
 u'hungri',
 u'eye',
 u'lol']

In [291]:
prepper.prep("Will the real Slim Shady please stand up?")

['Slim', 'Shady', u'real', u'pleas', u'stand']

In [293]:
type(df_ford.body)

pandas.core.series.Series

In [295]:
type(df_ford.body.apply)

instancemethod

In [273]:
# note, prepper.prep is a function (method)
prepper.prep

<bound method CarsDataPrepper.prep of <__main__.CarsDataPrepper instance at 0x0000000011C33088>>

In [297]:
type(df_ford.body.apply(prepper.prep))

pandas.core.series.Series

In [298]:
# thus we may apply it to every comment with the Series apply() method
df_ford['words'] = df_ford.body.apply(prepper.prep)
df_ford.head()

Unnamed: 0,subreddit,body,score,words
0,Ford,Yes a bit more definitely. I guess assume I'll...,1,"[Yes, Thanks, bit, definit, guess, assum, I'll..."
1,Ford,No it's the tube for the intake from the inter...,1,"[tube, intak, intercool, keep, eye, check, eng..."
2,Ford,Looks very similar to the Australian Ford Falc...,2,"[Looks, Australian, Ford, Falcon, similar, 2015]"
3,Ford,Very hard to find. I prefer them for their uni...,1,"[hard, find, prefer, uniqu]"
4,Ford,No way... you found a Ford in STL???,0,"[Ford, STL, way, found]"


In [299]:
df_toyota['words'] = df_toyota.body.apply(prepper.prep)
df_toyota.head()

Unnamed: 0,subreddit,body,score,words
0,Toyota,What kind of truck? Is that the wrong question...,1,"[kind, truck, wrong, question, >, <]"
1,Toyota,"I'm a bit of a traitor here, but it's a Nissan...",1,"[Nissan, Titan, bit, traitor, say, cheap]"
2,Toyota,Those are a good truck though. I used to sell ...,1,"[good, truck, though, use, sell]"
3,Toyota,"Time can do as much damage as miles though, mo...",2,"[Time, much, damag, mile, though, car, isn't, ..."
4,Toyota,"Not gonna lie, I hate it lol. It's more of a p...",1,"[gonna, lie, hate, lol, person, thing, though,..."


# Some basic word analysis

In [300]:
from nltk import FreqDist

In [301]:
listOfordWords = [w for comment in df_ford.words for w in comment]
listOfordWords[0:10]


[u'Yes',
 u'Thanks',
 u'bit',
 u'definit',
 u'guess',
 u'assum',
 u"I'll",
 u'never',
 u'one',
 u"can't"]

In [302]:
fdist_ford = FreqDist(listOfordWords)

In [304]:
fdist_ford.most_common(20) # we can add to the stopwords at this point

[(u'car', 277),
 (u'get', 259),
 (u'like', 235),
 (u'one', 184),
 (u'look', 178),
 (u'would', 141),
 (u'go', 139),
 (u'new', 135),
 (u'truck', 120),
 (u'Ford', 118),
 (u'drive', 108),
 (u'work', 106),
 (u'year', 100),
 (u'think', 97),
 (u"don't", 94),
 (u'time', 91),
 (u'good', 91),
 (u'thing', 89),
 (u'use', 89),
 (u'much', 88)]

In [305]:
listOtoyWords = [w for comment in df_toyota.words for w in comment]
fdist_toyota = FreqDist(listOtoyWords)


In [306]:
fdist_toyota.most_common(20)

[(u'car', 245),
 (u'get', 175),
 (u'like', 166),
 (u'look', 135),
 (u'one', 127),
 (u'Toyota', 114),
 (u'drive', 105),
 (u'go', 104),
 (u'time', 91),
 (u'would', 89),
 (u'got', 73),
 (u'truck', 72),
 (u"don't", 72),
 (u'good', 72),
 (u'Corolla', 72),
 (u'use', 72),
 (u'realli', 67),
 (u'know', 67),
 (u'work', 61),
 (u'make', 61)]
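
Raw counts are hard to compare across the two subreddits because we have more Ford comments (1471) than Toyota comments (1094). One simple remedy is to compare relative frequencies. Here is a sketch with `collections.Counter` from the standard library; the two short word lists are made up and stand in for `listOfordWords` and `listOtoyWords`.

```python
from collections import Counter

# Made-up token lists standing in for listOfordWords and listOtoyWords.
ford_words = ["truck", "car", "Focus", "truck", "car", "new"]
toyota_words = ["car", "Corolla", "car", "belt", "Corolla", "car", "new", "car"]

def relative_freq(words):
    # Share of all tokens taken up by each word.
    counts = Counter(words)
    total = float(sum(counts.values()))
    return {w: n / total for w, n in counts.items()}

ford_rel = relative_freq(ford_words)
toyota_rel = relative_freq(toyota_words)

# Positive = relatively more common in the Ford list, negative = in the Toyota list.
vocab = set(ford_rel) | set(toyota_rel)
diff = {w: ford_rel.get(w, 0.0) - toyota_rel.get(w, 0.0) for w in vocab}
most_ford = max(diff, key=diff.get)
most_toyota = min(diff, key=diff.get)
print(most_ford)
print(most_toyota)
```

Words like 'car' that are common on both sides largely cancel out, which leaves the brand-specific vocabulary easier to spot.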

# Bigrams

In [307]:
from nltk import collocations

In [308]:
c_ford = collocations.BigramCollocationFinder.from_words(listOfordWords)

In [309]:
c_ford.ngram_fd.most_common(20)

[((u'look', u'like'), 24),
 ((u"don't", u'know'), 23),
 ((u'sound', u'like'), 15),
 ((u'new', u'one'), 15),
 ((u'year', u'old'), 13),
 ((u'Focus', u'ST'), 12),
 ((u'brand', u'new'), 11),
 ((u'new', u'truck'), 11),
 ((u'feel', u'like'), 11),
 ((u'pretti', u'much'), 11),
 ((u'someth', u'like'), 10),
 ((u'post', u'r'), 10),
 ((u'get', u'one'), 9),
 ((u'look', u'good'), 9),
 ((u'make', u'sure'), 9),
 ((u"I'd", u'like'), 8),
 ((u'bodi', u'style'), 8),
 ((u'never', u'heard'), 8),
 ((u'r', u'ford'), 7),
 ((u'account', u'less'), 7)]

In [310]:
c_toyota = collocations.BigramCollocationFinder.from_words(listOtoyWords)
c_toyota.ngram_fd.most_common(20)

[((u'time', u'belt'), 22),
 ((u'look', u'like'), 12),
 ((u'sound', u'like'), 10),
 ((u'shift', u'point'), 10),
 ((u"don't", u'know'), 10),
 ((u'floor', u'mat'), 9),
 ((u'look', u'good'), 8),
 ((u'make', u'sure'), 8),
 ((u'Toyota', u'Corolla'), 8),
 ((u'find', u'one'), 8),
 ((u'use', u'car'), 7),
 ((u'head', u'unit'), 7),
 ((u'water', u'pump'), 7),
 ((u'seem', u'like'), 7),
 ((u"don't", u'think'), 6),
 ((u'need', u'replac'), 6),
 ((u"I'd", u'like'), 6),
 ((u'center', u'consol'), 6),
 ((u'fuel', u'economi'), 6),
 ((u'Toyota', u'Toyota'), 6)]
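
A caveat on the counts above: because all comments were flattened into one long list, the last word of one comment and the first word of the next are also counted as a pair. The sketch below counts bigrams with plain `zip` and `Counter`, once over a flat list and once per comment; the token lists are invented.

```python
from collections import Counter

def count_bigrams(words):
    # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
    return Counter(zip(words, words[1:]))

def count_bigrams_per_comment(comments):
    # Counting inside each comment avoids pairs that straddle two comments.
    counts = Counter()
    for words in comments:
        counts.update(zip(words, words[1:]))
    return counts

# Made-up, already-stemmed comments for illustration.
demo_comments = [["time", "belt"], ["time", "belt", "water", "pump"]]
flat_words = [w for comment in demo_comments for w in comment]

flat_counts = count_bigrams(flat_words)
per_comment_counts = count_bigrams_per_comment(demo_comments)
print(flat_counts.most_common(1))
print(('belt', 'time') in per_comment_counts)
```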

# Clustering and TF-IDF
Term frequency-inverse document frequency (TF-IDF) measures how important a word is to a particular document, relative to a collection of documents.  A weight of 0 means the word appears in every document; higher weights mark words that are frequent in this document but rare elsewhere.
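
To see where the numbers come from, here is a small pure-Python sketch. It assumes one common weighting, count × log2(N / document frequency), followed by scaling each document to unit length. As far as we know this is close to gensim's default, but treat the exact formula as an assumption. The three toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for doc in docs for w in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {w: tf[w] * math.log(float(n_docs) / doc_freq[w], 2) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # Unit-length vectors keep long comments from dominating.
        weighted.append({w: (v / norm if norm else 0.0) for w, v in raw.items()})
    return weighted

toy_docs = [["car", "truck", "new"], ["car", "belt"], ["car", "truck"]]
weights = tfidf(toy_docs)
print(weights[0])
```

'car' occurs in all three documents, so its weight is 0; 'new' occurs in only one, so it outweighs 'truck' in the first document.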

In [313]:
from gensim import corpora, models
import gensim
dictionary = corpora.Dictionary(df_ford.words)
dictionary.items()


[(4053, u'Friday'),
 (1760, u'Sadly'),
 (4372, u'pardon'),
 (3087, u'mothbal'),
 (4423, u'cakeday'),
 (1976, u'demand'),
 (3788, u'bear'),
 (384, u'yellow'),
 (1864, u'poorli'),
 (630, u'Continental'),
 (4092, u'Relearn'),
 (4509, u'Power'),
 (2499, u'oper'),
 (2493, u'captain'),
 (120, u'hate'),
 (3266, u'giddi'),
 (3053, u'accur'),
 (4458, u'Everyone'),
 (1377, u'sorri'),
 (3349, u'Honda'),
 (4825, u'Watch'),
 (4726, u'85,000'),
 (3480, u'Mostly'),
 (1792, u'Ranger'),
 (1891, u'bike'),
 (91, u'swap'),
 (4517, u'80-90'),
 (1208, u'aux'),
 (4496, u'Mountune'),
 (3810, u'sway'),
 (1027, u'worth'),
 (3891, u'Euro'),
 (627, u'@'),
 (2101, u'digit'),
 (270, u'GT'),
 (351, u'everi'),
 (2665, u'risk'),
 (4437, u'Isn'),
 (3786, u'void'),
 (3990, u'mouth'),
 (3449, u'voic'),
 (4074, u'relearn'),
 (1705, u'GF'),
 (2201, u'LittleHelperRobot'),
 (277, u'GA'),
 (1966, u'Henry'),
 (111, u'affect'),
 (2928, u'GM'),
 (521, u"isn't"),
 (4665, u'vast'),
 (2489, u'Dont'),
 (1422, u'four'),
 (125, u"we'l

In [314]:
corpus = [dictionary.doc2bow(comment) for comment in df_ford.words]
corpus

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1)],
 [(3, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1)],
 [(39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1)],
 [(45, 1), (46, 1), (47, 1), (48, 1)],
 [(39, 1), (49, 1), (50, 1), (51, 1)],
 [(52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1)],
 [(61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1)],
 [(72, 1), (73, 1), (74, 1)],
 [(51, 1),
  (75, 1),
  (76, 1),
  (77, 2),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1),
  (82, 1),
  (83, 1),
  (84, 1),
  (85, 1)],
 [(86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (9

In [None]:
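tfidf = models.TfidfModel(corpus) 
corpus_tfidf = tfidf[corpus] # apply the model to get TF-IDF weights for every comment
corpus_tfidf[0]

In [188]:
numwords = len(set(listOfordWords))
numwords

5956

In [189]:
tfidf_matrix = gensim.matutils.corpus2dense(corpus_tfidf, numwords) # rows are words, columns are documents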
tfidf_matrix

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)

In [190]:
from sklearn.cluster import KMeans


In [191]:
km = KMeans(n_clusters=100)
km.fit(tfidf_matrix)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=100, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [192]:
len(km.labels_) # one label per matrix row, i.e. per word rather than per comment


5956

In [193]:
set(km.labels_)

{0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99}

In [194]:
dictionary.id2token[42]

u'personal'

In [195]:
km.labels_

array([ 8, 17, 17, ...,  1,  1,  1])

In [196]:
import numpy as np

In [197]:
c0word_ids = np.where(km.labels_ == 0)[0]
c0word_ids

array([ 307,  329,  367,  758,  823, 1095, 1917, 1997, 2681, 2707, 4672], dtype=int64)

In [198]:
[dictionary.id2token[i] for i in c0word_ids]

[u'dash',
 u'light',
 u'switch',
 u'lights',
 u'LEDs',
 u'grill',
 u'pictures',
 u'send',
 u'harness',
 u'panel',
 u'bar']
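
If KMeans feels like a black box, the sketch below runs the basic loop (assign each point to its nearest center, then move each center to the mean of its points) on six made-up 2-D points. It assumes fixed starting centers and a fixed number of iterations, whereas scikit-learn's KMeans uses k-means++ initialization and several restarts, so this is an illustration rather than a replacement.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, n_iter=10):
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda k: squared_distance(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's center where it is
                centers[k] = tuple(sum(dim) / float(len(members))
                                   for dim in zip(*members))
    return labels, centers

toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
toy_labels, toy_centers = kmeans(toy_points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(toy_labels)
print(toy_centers)
```

In the notebook the "points" are rows of the term-by-document matrix, so each cluster is a group of words that tend to show up in the same comments.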

# Latent Dirichlet Allocation (LDA)
"Latent Dirichlet allocation (LDA) is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set."  Very nice information here: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

Note that there is another, completely different LDA: "Linear Discriminant Analysis", which is a classification algorithm.
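
It can help to see the story LDA assumes about how a document is written: draw a topic mixture for the document from a Dirichlet distribution, then for each word pick a topic from that mixture and a word from that topic. Fitting the model works backwards from observed words to plausible topics. The sketch below only simulates the forward story with the standard `random` module; the two topics, the vocabulary, and the `alpha` value are all made up.

```python
import random

def sample_dirichlet(alphas, rng):
    # A Dirichlet draw is a set of independent Gamma draws, normalized to sum to 1.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    # Draw an index with probability proportional to probs.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def generate_document(topics, alpha, n_words, rng):
    vocab = sorted(topics[0])
    mixture = sample_dirichlet([alpha] * len(topics), rng)  # per-document topic mix
    words = []
    for _ in range(n_words):
        topic = topics[sample_index(mixture, rng)]  # pick a topic for this word
        words.append(vocab[sample_index([topic[w] for w in vocab], rng)])  # pick the word
    return mixture, words

# Two made-up topics over a four-word vocabulary (each sums to 1).
toy_topics = [
    {"truck": 0.5, "tow": 0.4, "mpg": 0.05, "hybrid": 0.05},
    {"truck": 0.05, "tow": 0.05, "mpg": 0.4, "hybrid": 0.5},
]
rng = random.Random(0)
mixture, words = generate_document(toy_topics, alpha=0.5, n_words=8, rng=rng)
print(mixture)
print(words)
```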

In [73]:
from gensim import corpora, models
import gensim

In [74]:
dictionary = corpora.Dictionary(df_ford.words)

In [76]:
dictionary.items()

[(3358, u"SOHC's"),
 (418, u'yellow'),
 (1578, u'four'),
 (2061, u'woods'),
 (5743, u'Recaro'),
 (5796, u'hanging'),
 (5239, u'infuriates'),
 (3630, u'lord'),
 (1024, u'shaving'),
 (3105, u'digit'),
 (5444, u'basics'),
 (1397, u'grueling'),
 (4821, u'Tuesday'),
 (4359, u'Paul'),
 (105, u'straight'),
 (2111, u'tired'),
 (4558, u'corvette'),
 (908, u'tires'),
 (451, u'second'),
 (1029, u'275'),
 (4974, u'contributed'),
 (5946, u'increasing'),
 (2204, u'admiral'),
 (5566, u'shocks'),
 (5418, u'8,300'),
 (1574, u'leaning'),
 (1229, u'reported'),
 (1404, u'chassis'),
 (3820, u'kids'),
 (5050, u'k'),
 (3249, u'reports'),
 (706, u"i'd"),
 (1833, u'Market'),
 (1194, u'criticism'),
 (231, u'golden'),
 (241, u'replace'),
 (561, u'brought'),
 (2558, u'unit'),
 (3615, u'cheating'),
 (3100, u'browse'),
 (5431, u'6F35'),
 (3651, u'music'),
 (5167, u'Y7oJ'),
 (2873, u'duals'),
 (3213, u'paperwork'),
 (5778, u'holy'),
 (5075, u'detracts'),
 (2769, u'brings'),
 (1285, u'351W'),
 (4485, u'hurt'),
 (465,

In [79]:
comment = df_ford.words.iloc[42]
comment

[u'Exactly',
 u'reason',
 u'slightly',
 u'restoring',
 u'aging',
 u'like',
 u'old',
 u'gross',
 u'headlights',
 u'faded',
 u'grill',
 u'bad',
 u'driver',
 u'seat',
 u'dirty',
 u'sitting',
 u'dealer',
 u'lot']

In [80]:
# encode a comment
dictionary.doc2bow(comment)

[(54, 1),
 (113, 1),
 (188, 1),
 (191, 1),
 (232, 1),
 (236, 1),
 (383, 1),
 (408, 1),
 (481, 1),
 (482, 1),
 (483, 1),
 (484, 1),
 (485, 1),
 (486, 1),
 (487, 1),
 (488, 1),
 (489, 1),
 (490, 1)]

In [83]:
dictionary.id2token[54]

u'sitting'

In [84]:
# encode all comments = a corpus.  "Bag of Words"
corpus = [dictionary.doc2bow(comment) for comment in df_ford.words]

In [85]:
corpus[42]

[(54, 1),
 (113, 1),
 (188, 1),
 (191, 1),
 (232, 1),
 (236, 1),
 (383, 1),
 (408, 1),
 (481, 1),
 (482, 1),
 (483, 1),
 (484, 1),
 (485, 1),
 (486, 1),
 (487, 1),
 (488, 1),
 (489, 1),
 (490, 1)]
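
The encoding above is simple enough to reproduce by hand. This sketch builds a word-to-id mapping and turns a token list into sorted (id, count) pairs, dropping words that are not in the vocabulary. The function names are our own, and the id numbering here (order of first appearance) is an assumption that need not match gensim's.

```python
from collections import Counter

def build_vocab(token_lists):
    # Assign ids in order of first appearance (gensim's id order may differ).
    token2id = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    # Sorted (token_id, count) pairs; unknown tokens are dropped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_comments = [["new", "truck", "new"], ["truck", "grill"]]
toy_vocab = build_vocab(toy_comments)
toy_corpus = [to_bow(c, toy_vocab) for c in toy_comments]
unseen = to_bow(["unseen", "truck"], toy_vocab)
print(toy_corpus)
print(unseen)
```

The dropped-word behavior is worth remembering: encoding one brand's comments with a dictionary built from the other brand's comments silently discards every word the dictionary has not seen.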

In [93]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word = dictionary, passes=30)

In [94]:
ldamodel.print_topics(num_topics=4, num_words=9)

[(0,
  u'0.012*get + 0.008*car + 0.007*would + 0.006*one + 0.005*new + 0.005*like + 0.005*love + 0.005*good + 0.004*look'),
 (1,
  u"0.008*car + 0.008*like + 0.007*one + 0.006*get + 0.006*don't + 0.005*Ford + 0.005*know + 0.004*time + 0.004*would"),
 (2,
  u'0.008*truck + 0.007*new + 0.007*one + 0.006*like + 0.004*looks + 0.004*get + 0.004*car + 0.004*good + 0.003*never'),
 (3,
  u'0.011*like + 0.007*car + 0.006*think + 0.005*Ford + 0.005*get + 0.005*cars + 0.005*$ + 0.005*would + 0.005*really')]

In [95]:
# compare to Toyota
dictionary = corpora.Dictionary(df_toyota.words)
corpus = [dictionary.doc2bow(comment) for comment in df_toyota.words]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word = dictionary, passes=30)
ldamodel.print_topics(num_topics=4, num_words=9)

[(0,
  u'0.011*car + 0.010*Toyota + 0.005*> + 0.005*new + 0.004*Corolla + 0.004*great + 0.003*got + 0.003*get + 0.003*driving'),
 (1,
  u"0.007*one + 0.005*car + 0.004*get + 0.004*really + 0.004*don't + 0.004*Toyota + 0.004*would + 0.004*could + 0.004*like"),
 (2,
  u'0.010*car + 0.008*get + 0.008*like + 0.005*truck + 0.005*one + 0.005*good + 0.005*would + 0.004*Toyota + 0.004*Corolla'),
 (3,
  u'0.010*like + 0.007*car + 0.007*get + 0.006*got + 0.005*really + 0.005*look + 0.004*one + 0.004*time + 0.004*good')]

# Latent Semantic Indexing

In [111]:
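dictionary = corpora.Dictionary(df_ford.words) # make sure the dictionary matches the Ford comments (it was last built from Toyota's)
corpus = [dictionary.doc2bow(comment) for comment in df_ford.words]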
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
corpus_tfidf[0]

[(11, 0.23898023739560503),
 (21, 0.1354119364219414),
 (58, 0.14447614353507082),
 (190, 0.22094803461897872),
 (282, 0.26471739660869364),
 (396, 0.27148713883672504),
 (461, 0.2926868068283426),
 (500, 0.22581982966556594),
 (520, 0.345972163883959),
 (628, 0.24562664635837642),
 (781, 0.2292987734133461),
 (860, 0.21643745692683378),
 (970, 0.21643745692683378),
 (3124, 0.3254927995129894),
 (3843, 0.3770635376751005)]

In [102]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=4) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(4)

[(0,
  u'-0.251*"like" + -0.217*"get" + -0.193*"one" + -0.188*"car" + -0.160*"new" + -0.150*"would" + -0.129*"love" + -0.125*"look" + -0.123*"don\'t" + -0.117*"Ford"'),
 (1,
  u'-0.935*"Thanks" + -0.181*"man" + 0.103*"Nice" + -0.066*"appreciate" + -0.057*"take" + 0.049*"car" + -0.040*"Yeah" + 0.040*"like" + -0.038*"look" + 0.035*"got"'),
 (2,
  u'-0.910*"Nice" + -0.154*"nice" + -0.151*"love" + -0.104*"Enjoy" + -0.093*"Thanks" + 0.092*"like" + -0.084*"best" + -0.065*"truck" + -0.058*"dig" + -0.056*"parking"'),
 (3,
  u'-0.441*"r" + -0.254*"%" + -0.226*"post" + -0.224*"5" + -0.210*"ford" + -0.205*"comment" + -0.205*"questions" + -0.202*"automatically" + -0.201*"compose" + -0.201*"concerns"')]

In [97]:
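dictionary = corpora.Dictionary(df_toyota.words) # make sure the dictionary matches the Toyota comments
corpus = [dictionary.doc2bow(comment) for comment in df_toyota.words]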
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=4) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(4)

[(0,
  u'-0.214*"car" + -0.168*"like" + -0.166*"one" + -0.160*"get" + -0.132*"don\'t" + -0.127*"Toyota" + -0.126*"would" + -0.122*"good" + -0.119*"really" + -0.116*"truck"'),
 (1,
  u'0.935*"Nice" + 0.115*"pic" + 0.070*"Corollas" + 0.063*"mkiii" + 0.063*"ride" + 0.057*"truck" + 0.055*"choice" + 0.053*"Congrats" + 0.053*"Looks" + 0.053*"mod"'),
 (2,
  u'-0.482*"love" + -0.401*"Thanks" + -0.303*"Looks" + -0.151*"pics" + -0.135*"great" + -0.132*"4runner" + -0.124*"I\'ll" + -0.119*"nice" + 0.107*"Toyota" + -0.103*"agree"'),
 (3,
  u'-0.513*"love" + 0.416*"Thanks" + 0.165*"I\'ll" + -0.161*"Corolla" + 0.152*"that\'s" + 0.148*"truck" + -0.128*"Toyota" + -0.128*"pics" + -0.123*"4runner" + 0.119*"heads"')]

In [100]:
for doc in corpus_lsi[0:4]: 
    print (doc)

[(0, -0.065564302596982288), (1, 0.018365625681761904), (2, -0.0053696441030800235), (3, 0.055195557399478581)]
[(0, -0.039459212347970098), (1, -0.0019199817680356033), (2, 0.013970633425867622), (3, -0.0015775667385075293)]
[(0, -0.18099211850198388), (1, 0.015058364811425492), (2, -0.072614791114328087), (3, 0.17151735259700107)]
[(0, -0.12062792848167209), (1, -0.011089514165082397), (2, 0.036114455195307378), (3, 0.021770981450609253)]
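
One thing these 4-number vectors are good for is comparing comments: two comments about similar things should point in a similar direction in topic space. A common measure is cosine similarity. The sketch below applies it to the first and third vectors printed above, with the weights copied by hand and rounded to four decimals.

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Topic weights of the first and third comments shown above, rounded.
doc_a = [-0.0656, 0.0184, -0.0054, 0.0552]
doc_c = [-0.1810, 0.0151, -0.0726, 0.1715]
similarity = cosine(doc_a, doc_c)
print(similarity)
```

Values near 1 mean the two comments load on the topics in similar proportions; values near 0 mean they have little in common in this 4-topic space.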
