This notebook applies TF-IDF vectorization to topic modeling, using the new_df dataframe built in final_dataframe_cleanup.ipynb.

Importing packages:

In [1]:
import nltk
from nltk.corpus import stopwords
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10

Loading the pickled data:

In [2]:
with open('Data/cleaned_string_df.pickle','rb') as read_file:
    new_df = pickle.load(read_file)

In [3]:
new_df.head()

Unnamed: 0,Debate_Name,Transcript,Speaker,Data_Source,Debate_Type,Year,Speaker_Type,line_length,Election_Result,string
0,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Good evening, and welcome to the first...",lehrer,Commission for Presidential Debates,General-President,1992,Moderator/Other,100,,good evening welcome first debate among major ...
1,The First Clinton-Bush-Perot Presidential Deb...,PEROT: I think the principal that separates me...,perot,Commission for Presidential Debates,General-President,1992,Independent,74,Loser,think principal separate half million people c...
2,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Governor Clinton, a one minute response.",lehrer,Commission for Presidential Debates,General-President,1992,Moderator/Other,3,,one minute response
3,The First Clinton-Bush-Perot Presidential Deb...,CLINTON: The most important distinction in thi...,clinton,Commission for Presidential Debates,General-President,1992,Democrat,45,Winner,important distinction campaign represent real ...
4,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: President Bush, one minute response, sir.",lehrer,Commission for Presidential Debates,General-President,1992,Moderator/Other,4,,one minute response sir


# TF-IDF Vectorizer

For the next round of topic modeling, I will use a TF-IDF vectorizer and compare the results.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
stop = ['presidential', 'vice', 'evening', 'debate', 'candidate', 'campaign', 'minute']

In [6]:
vectorizer = TfidfVectorizer(stop_words=stop)

Note: additional stop words were already removed in final_dataframe_cleanup.ipynb.
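
As a quick check of how a custom list behaves, the sketch below fits a `TfidfVectorizer` on a made-up three-line corpus (`toy_docs` and `toy_vectorizer` are hypothetical names, not taken from the debate data). Passing a list to `stop_words` removes only the listed tokens; scikit-learn's built-in English list is not applied unless `stop_words='english'` is passed instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, standing in for the cleaned 'string' column
toy_docs = [
    "good evening welcome first debate",
    "tax cut plan help the economy",
    "health care plan in this campaign",
]
stop = ['presidential', 'vice', 'evening', 'debate', 'candidate', 'campaign', 'minute']

toy_vectorizer = TfidfVectorizer(stop_words=stop)
toy_vectorizer.fit(toy_docs)
vocab = set(toy_vectorizer.vocabulary_)

print('debate' in vocab)  # custom stop word is dropped
print('the' in vocab)     # common English word is kept
```

This is why the general-purpose stop words have to be handled upstream, as the note above says.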

Since some responses are very short (e.g. a brief statement or quip), I set a minimum length of 40 words (line_length >= 40) for a line to be included in topic modelling.

In [7]:
X = new_df[new_df.line_length >= 40]['string']
tfi_model = vectorizer.fit_transform(X)

In [8]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_term_document_matrix = pd.DataFrame(tfi_model.toarray(), columns=vectorizer.get_feature_names_out())

In [9]:
tf_term_document_matrix.shape

(4439, 13077)
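
For reference, the weighting behind each cell of this matrix can be sketched by hand. The block below assumes scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`, raw term counts as tf) and uses a tiny made-up count matrix; it illustrates the formula and is not the library's implementation.

```python
import numpy as np

# Hypothetical term counts: 3 documents (rows) x 4 terms (columns)
counts = np.array([
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 3],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf                                    # weight counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalise each row

print(np.round(tfidf, 3))
```

A term that appears in every document (the second column here) gets the minimum idf of 1, so rarer terms carry more weight within each row.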

## Topic Modelling

After getting the document set into TF-IDF form, the sections below try topic modelling with a few different tools.

### Topic Modelling via NMF:

In [10]:
nmf_model = NMF(5)

Document-topic weights from the fitted model, one row per line:

In [11]:
tf_doc_topic = nmf_model.fit_transform(tf_term_document_matrix)
tf_doc_topic.shape



(4439, 5)
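
As a sketch of what `fit_transform` returns: NMF approximates the non-negative document-term matrix as a product `W @ H`, where `W` holds the document-topic weights and `H` (`components_`) holds the topic-term weights. The example below uses a small random matrix rather than the debate data, and it sets `random_state` and `max_iter`, which are additions here for repeatable runs and not something the cell above does.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
toy_matrix = rng.random((20, 8))       # hypothetical 20 documents x 8 terms

toy_nmf = NMF(n_components=3, init='nndsvda', random_state=0, max_iter=500)
W = toy_nmf.fit_transform(toy_matrix)  # document-topic weights, shape (20, 3)
H = toy_nmf.components_                # topic-term weights, shape (3, 8)

print(W.shape, H.shape)
print(toy_nmf.reconstruction_err_)     # how far W @ H is from toy_matrix
```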

Pulling the top 12 words for each of the k topics:

In [12]:
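# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_words = vectorizer.get_feature_names_out().tolist()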
tf = nmf_model.components_.argsort(axis=1)[:,-12:]
tf_topic_words = [[tf_words[e] for e in l] for l in tf]
for i, words in enumerate(tf_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')

Topic 1:
['nuclear', 'policy', 'believe', 'right', 'military', 'one', 'world', 'war', 'think', 'state', 'united', 'would']


Topic 2:
['security', 'money', 'billion', 'budget', 'rate', 'social', 'pay', 'plan', 'income', 'percent', 'cut', 'tax']


Topic 3:
['drug', 'universal', 'affordable', 'company', 'system', 'people', 'cost', 'medicare', 'plan', 'insurance', 'care', 'health']


Topic 4:
['family', 'every', 'student', 'need', 'college', 'public', 'parent', 'kid', 'teacher', 'child', 'education', 'school']


Topic 5:
['economy', 'know', 'america', 'need', 'work', 'got', 'think', 'make', 'get', 'job', 'people', 'going']
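
The `argsort` slice in the cell above returns each topic's words in ascending weight, so the strongest word is printed last. A small helper like the hypothetical `top_words` below (not part of the original notebook) returns them strongest-first and makes the number of words an explicit parameter.

```python
import numpy as np

def top_words(components, feature_names, n=12):
    """Return the n highest-weighted words per topic, strongest first."""
    names = np.asarray(feature_names)
    # argsort is ascending, so take the last n indices and reverse them
    top_idx = np.argsort(components, axis=1)[:, -n:][:, ::-1]
    return [names[row].tolist() for row in top_idx]

# Hypothetical 2-topic x 4-term weight matrix
toy_components = np.array([[0.1, 0.9, 0.0, 0.4],
                           [0.7, 0.0, 0.2, 0.3]])
toy_names = ['tax', 'health', 'school', 'job']

print(top_words(toy_components, toy_names, n=2))
# [['health', 'job'], ['tax', 'job']]
```

The same function could be pointed at `components_` from any of the fitted models in this notebook, together with the vectorizer's feature names.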




Based on this, the 5 topics seem to be about the following:
1. War/Foreign Policy
2. Economy/Taxes
3. Healthcare (clear)
4. Education
5. Random Bucket, with hints of jobs/campaign "speak"

Pulling the document-topic matrix:

In [13]:
tf_doc_topic

array([[0.04950754, 0.        , 0.00210397, 0.00154647, 0.00231771],
       [0.00473577, 0.00114885, 0.00267559, 0.        , 0.09758879],
       [0.00831532, 0.02411698, 0.        , 0.        , 0.05854825],
       ...,
       [0.00495852, 0.00372739, 0.01367063, 0.        , 0.06726091],
       [0.        , 0.        , 0.        , 0.        , 0.08531274],
       [0.        , 0.        , 0.        , 0.        , 0.08581064]])

In [25]:
tf_doc_topic[0][2]

0.0021039735995541033

Mapping these out onto the individual documents:

In [21]:
topic_df = new_df[new_df.line_length >= 40].copy()

In [22]:
topic_df['Topic_1'] = 0
topic_df['Topic_2'] = 0
topic_df['Topic_3'] = 0
topic_df['Topic_4'] = 0
topic_df['Topic_5'] = 0

In [26]:
# Rows of tf_doc_topic are in the same order as topic_df, so each topic's
# weights can be assigned as a whole column instead of cell by cell
for k in range(tf_doc_topic.shape[1]):
    topic_df['Topic_{}'.format(k + 1)] = tf_doc_topic[:, k]

In [27]:
topic_df.head()

Unnamed: 0,Debate_Name,Transcript,Speaker,Data_Source,Debate_Type,Year,Speaker_Type,line_length,Election_Result,string,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5
0,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Good evening, and welcome to the first...",lehrer,Commission for Presidential Debates,General-President,1992,Moderator/Other,100,,good evening welcome first debate among major ...,0.049508,0.0,0.002104,0.001546,0.002318
1,The First Clinton-Bush-Perot Presidential Deb...,PEROT: I think the principal that separates me...,perot,Commission for Presidential Debates,General-President,1992,Independent,74,Loser,think principal separate half million people c...,0.004736,0.001149,0.002676,0.0,0.097589
3,The First Clinton-Bush-Perot Presidential Deb...,CLINTON: The most important distinction in thi...,clinton,Commission for Presidential Debates,General-President,1992,Democrat,45,Winner,important distinction campaign represent real ...,0.008315,0.024117,0.0,0.0,0.058548
5,The First Clinton-Bush-Perot Presidential Deb...,"PRESIDENT BUSH: Well, I think one thing that d...",president bush,Commission for Presidential Debates,General-President,1992,Republican,81,Loser,well think one thing distinguishes experience ...,0.032392,0.0,0.0,0.0,0.086368
7,The First Clinton-Bush-Perot Presidential Deb...,"CLINTON: I believe experience counts, but it’s...",clinton,Commission for Presidential Debates,General-President,1992,Democrat,167,Winner,believe experience count ’ everything value ju...,0.015961,0.017202,0.022157,0.008847,0.117196
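
One way to use these columns, sketched here as an assumption rather than a step the notebook takes, is to label each line with its highest-weighted topic. The `Dominant_Topic` column name is hypothetical, and the values are copied from three of the rows shown above.

```python
import pandas as pd

topic_cols = ['Topic_1', 'Topic_2', 'Topic_3', 'Topic_4', 'Topic_5']

# Weights copied from rows 0, 1 and 7 of the topic_df.head() output
toy_topics = pd.DataFrame(
    [[0.049508, 0.0, 0.002104, 0.001546, 0.002318],
     [0.004736, 0.001149, 0.002676, 0.0, 0.097589],
     [0.015961, 0.017202, 0.022157, 0.008847, 0.117196]],
    columns=topic_cols,
)

# idxmax(axis=1) returns the column name holding each row's largest weight
toy_topics['Dominant_Topic'] = toy_topics[topic_cols].idxmax(axis=1)
print(toy_topics['Dominant_Topic'].tolist())
# ['Topic_1', 'Topic_5', 'Topic_5']
```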


### EDA Using Topics:

### Topic Modelling via LDA:

In [14]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [15]:
lda_model = LDA(n_components = 5)

In [16]:
lda_doc_topic = lda_model.fit_transform(tf_term_document_matrix)
lda_doc_topic.shape

(4439, 5)

Pulling the top 12 words for each of the k topics:

In [17]:
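# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_words = vectorizer.get_feature_names_out().tolist()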
tf = lda_model.components_.argsort(axis=1)[:,-12:]
tf_topic_words = [[tf_words[e] for e in l] for l in tf]
for i, words in enumerate(tf_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')

Topic 1:
['make', 'state', 'right', 'united', 'know', 'world', 'war', 'one', 'going', 'people', 'think', 'would']


Topic 2:
['answer', 'league', 'second', 'first', 'court', 'two', 'voter', 'commission', 'thank', 'news', 'tonight', 'question']


Topic 3:
['health', 'care', 'need', 'job', 'make', 'one', 'would', 'get', 'think', 'going', 'tax', 'people']


Topic 4:
['chevy', 'shah', 'rangel', 'serial', 'prophecy', 'curing', 'upwards', 'crippling', 'tha', 'dust', 'panic', 'armageddon']


Topic 5:
['liheap', 'conversion', 'pastor', 'seller', 'xx', 'wallet', 'maker', 'mac', 'mae', 'immunity', 'fannie', 'freddie']




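These topics make less sense than the NMF topics: Topics 1 and 3 are only loosely interpretable, Topic 2 mostly collects moderator language, and Topics 4 and 5 are dominated by rare words.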
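
One possible contributor to the weaker LDA topics, stated here as an assumption and not something tested in this notebook: LDA is formulated as a generative model over word counts, so it is commonly fit on a `CountVectorizer` matrix instead of TF-IDF weights. The sketch below shows that variant on a made-up corpus; `toy_docs`, `n_components=2`, and `random_state=0` are illustrative choices.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus standing in for the cleaned debate lines
toy_docs = [
    "tax cut income tax budget plan",
    "health care insurance plan cost",
    "tax rate budget billion cut",
    "insurance health medicare care",
]

count_vectorizer = CountVectorizer()
toy_counts = count_vectorizer.fit_transform(toy_docs)   # raw term counts

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_counts)

# Each row is a distribution over topics, so it sums to 1
print(toy_doc_topic.shape)
print(toy_doc_topic.sum(axis=1))
```

Whether this would improve the topics on the debate data is an open question; it would need to be run and compared against the NMF output.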

### Topic Modelling via LSA:

For LSA, using TruncatedSVD:

In [18]:
from sklearn.decomposition import TruncatedSVD

In [19]:
lsa = TruncatedSVD(5)
doc_topic = lsa.fit_transform(tf_term_document_matrix)
lsa.explained_variance_ratio_

array([0.00291337, 0.00773774, 0.00568379, 0.00461147, 0.00401439])
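
The ratios above are small and not in descending order. A likely reason, offered as an interpretation: `TruncatedSVD` does not centre the data, so the first component can largely track the overall mean of the TF-IDF matrix rather than the direction of greatest variance. The sketch below, on a random non-negative matrix used purely for illustration, shows how to inspect the per-component ratios and their total.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
toy_matrix = rng.random((50, 10))      # hypothetical 50 documents x 10 terms

toy_lsa = TruncatedSVD(n_components=3, random_state=0)
toy_doc_topic = toy_lsa.fit_transform(toy_matrix)

ratios = toy_lsa.explained_variance_ratio_
print(ratios)          # variance share per component
print(ratios.sum())    # total share captured by the 3 components
```

Unlike NMF, the LSA components can contain negative weights, so the top-word lists below only show the terms with the largest positive loadings.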

Pulling the top 12 words for each of the k topics:

In [20]:
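# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_words = vectorizer.get_feature_names_out().tolist()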
tf = lsa.components_.argsort(axis=1)[:,-12:]
tf_topic_words = [[tf_words[e] for e in l] for l in tf]
for i, words in enumerate(tf_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')

Topic 1:
['america', 'job', 'know', 'need', 'make', 'one', 'get', 'would', 'think', 'going', 'tax', 'people']


Topic 2:
['business', 'billion', 'budget', 'money', 'social', 'rate', 'pay', 'plan', 'income', 'percent', 'cut', 'tax']


Topic 3:
['affordable', 'education', 'people', 'cost', 'system', 'school', 'plan', 'medicare', 'child', 'insurance', 'care', 'health']


Topic 4:
['life', 'gun', 'family', 'student', 'public', 'parent', 'college', 'kid', 'teacher', 'child', 'education', 'school']


Topic 5:
['work', 'sure', 'business', 'need', 'trade', 'world', 'energy', 'people', 'america', 'economy', 'going', 'job']


