This notebook applies TF-IDF vectorization to topic modeling, using the new_df dataframe built in final_dataframe_cleanup.ipynb.

Importing packages:

In [1]:
import nltk
from nltk.corpus import stopwords
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10

Loading the pickled data:

In [2]:
with open('Data/cleaned_string_df.pickle','rb') as read_file:
    new_df = pickle.load(read_file)

In [3]:
new_df.head()

Unnamed: 0,index,stemmed,string,line_length
0,0,"[good, evening, welcome, first, debate, among,...",good evening welcome first debate among major ...,100
1,1,"[think, principal, separate, half, million, pe...",think principal separate half million people c...,74
2,2,"[one, minute, response]",one minute response,3
3,3,"[important, distinction, campaign, represent, ...",important distinction campaign represent real ...,45
4,4,"[one, minute, response, sir]",one minute response sir,4


# TF-IDF Vectorizer

For the next round of topic modeling, I will use the TF-IDF vectorizer and compare the results with the previous round.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
stop = ['presidential', 'vice', 'evening', 'debate', 'candidate', 'campaign', 'minute']

In [6]:
vectorizer = TfidfVectorizer(stop_words=stop)

Note: more stop words were already applied in final_dataframe_cleanup.ipynb.
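To make the weighting concrete, here is a minimal pure-Python sketch of the formula that scikit-learn documents for TfidfVectorizer's default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row. The function name `tfidf_rows` and the toy documents are hypothetical, and the whitespace tokenizer is a simplification of the real one.

```python
import math
from collections import Counter

def tfidf_rows(docs, stop_words=()):
    """Return (vocabulary, rows), where each row holds L2-normalised TF-IDF weights."""
    stop = set(stop_words)
    # Simplified tokenizer: split on whitespace and drop stop words.
    tokenized = [[w for w in doc.split() if w not in stop] for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in raw])
    return vocab, rows

# Hypothetical toy corpus; 'debate' is one of the custom stop words above.
toy_docs = ["tax cut plan", "health care plan", "tax rate debate"]
vocab, rows = tfidf_rows(toy_docs, stop_words=["debate"])
for doc, row in zip(toy_docs, rows):
    print(doc, [round(x, 3) for x in row])
```

In the first toy document, 'cut' appears in only one document, so it ends up with a larger weight than 'tax' or 'plan', which each appear in two.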

Since some responses can be very short (e.g. just a brief statement or quip), I am setting a minimum threshold of 40 words for a line to be included in topic modelling.

In [7]:
X = new_df[new_df.line_length >= 40]['string']
tfi_model = vectorizer.fit_transform(X)

In [8]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_term_document_matrix = pd.DataFrame(tfi_model.toarray(), columns=vectorizer.get_feature_names_out())

In [9]:
tf_term_document_matrix.shape

(4439, 13077)
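Calling `.toarray()` turns the sparse TF-IDF matrix into a dense one. A rough estimate of what that costs in memory, assuming 8-byte float64 values (the default dtype of TfidfVectorizer):

```python
n_docs, n_terms = 4439, 13077   # shape reported above
bytes_per_value = 8             # assumption: float64 entries
dense_bytes = n_docs * n_terms * bytes_per_value
dense_megabytes = dense_bytes / 1e6
print(f"{dense_megabytes:.0f} MB")  # prints "464 MB"
```

That is manageable here, but scikit-learn's NMF, LatentDirichletAllocation and TruncatedSVD also accept sparse input, so passing `tfi_model` directly would be an option if memory became a concern.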

## Topic Modelling

After getting the document set into TF-IDF form, the sections below try topic modelling with a few different tools.

### Topic Modelling via NMF:

In [10]:
nmf_model = NMF(5)

Fitting the model to get the topic weights for each line:

In [11]:
tf_doc_topic = nmf_model.fit_transform(tf_term_document_matrix)
tf_doc_topic.shape



(4439, 5)
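For intuition about what the fit above produces: NMF approximates the non-negative document-term matrix X by a product W H, where W is the document-topic matrix (the 4439 x 5 array here) and H is the topic-word matrix. The sketch below uses the classic multiplicative update rules in plain NumPy. It is an illustration only: scikit-learn's default solver is coordinate descent, and `nmf_multiplicative` and the toy matrix are hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative X (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_topics))
    H = rng.random((n_topics, n_terms))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix with two obvious blocks of co-occurring terms.
toy_X = np.array([[1.0, 1.0, 0.0, 0.0],
                  [2.0, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 3.0],
                  [0.0, 0.0, 2.0, 6.0]])
W, H = nmf_multiplicative(toy_X, n_topics=2)
reconstruction_error = np.linalg.norm(toy_X - W @ H)
print(W.shape, H.shape, reconstruction_error)
```

Because the starting values are random, a different seed can return the topics in a different order, which is worth keeping in mind when comparing topic numbers across runs.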

Pulling the top 12 words for each of the k topics:

In [12]:
tf_words = vectorizer.get_feature_names_out()
tf = nmf_model.components_.argsort(axis=1)[:,-12:]
tf_topic_words = [[tf_words[e] for e in l] for l in tf]
for i, words in enumerate(tf_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')

Topic 1:
['nuclear', 'policy', 'believe', 'right', 'military', 'one', 'world', 'war', 'think', 'state', 'united', 'would']


Topic 2:
['security', 'money', 'billion', 'budget', 'rate', 'social', 'pay', 'plan', 'income', 'percent', 'cut', 'tax']


Topic 3:
['drug', 'universal', 'affordable', 'company', 'system', 'people', 'cost', 'medicare', 'plan', 'insurance', 'care', 'health']


Topic 4:
['family', 'every', 'student', 'need', 'college', 'public', 'parent', 'kid', 'teacher', 'child', 'education', 'school']


Topic 5:
['economy', 'know', 'america', 'need', 'work', 'got', 'think', 'make', 'get', 'job', 'people', 'going']




Based on this, the 5 topics seem to be about the following: 
1. War/Foreign Policy
2. Economy/Taxes
3. Healthcare (clear)
4. Education
5. Random Bucket, with hints of jobs/campaign "speak"
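The top-word extraction is the same few lines for every model in this notebook, so it could be wrapped in a small helper. The sketch below is one way to do that; `top_words_per_topic` and the toy arrays are hypothetical names, and unlike the cell above it lists the strongest word first.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top=12):
    """Return the n_top highest-weighted words for each topic, strongest first."""
    components = np.asarray(components)
    names = np.asarray(feature_names)
    # argsort is ascending: keep the last n_top columns, then reverse them.
    top_idx = np.argsort(components, axis=1)[:, -n_top:][:, ::-1]
    return [names[row].tolist() for row in top_idx]

# Toy topic-word matrix: 2 topics over a 4-word vocabulary.
toy_components = np.array([[0.1, 0.9, 0.0, 0.4],
                           [0.7, 0.0, 0.8, 0.2]])
toy_names = ["care", "tax", "health", "cut"]
toy_topics = top_words_per_topic(toy_components, toy_names, n_top=2)
print(toy_topics)  # [['tax', 'cut'], ['health', 'care']]
```

With the fitted objects from this notebook, a call might look like `top_words_per_topic(nmf_model.components_, tf_words)`.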

Pulling the document-topic matrix:

In [20]:
tf_doc_topic

array([[0.04950632, 0.        , 0.00210316, 0.00154697, 0.00233575],
       [0.0047256 , 0.00114881, 0.00267339, 0.        , 0.09788421],
       [0.00831018, 0.02411788, 0.        , 0.        , 0.05872428],
       ...,
       [0.00496225, 0.00373031, 0.01367251, 0.        , 0.06744766],
       [0.        , 0.        , 0.        , 0.        , 0.08555705],
       [0.        , 0.        , 0.        , 0.        , 0.08605519]])
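Each row of this matrix holds one line's weight on each of the 5 topics, so the largest entry in a row gives that line's dominant topic. A minimal sketch using the first three rows shown above (the variable names are hypothetical):

```python
import numpy as np

# First three rows of the document-topic matrix printed above.
sample_doc_topic = np.array([
    [0.04950632, 0.0,        0.00210316, 0.00154697, 0.00233575],
    [0.0047256,  0.00114881, 0.00267339, 0.0,        0.09788421],
    [0.00831018, 0.02411788, 0.0,        0.0,        0.05872428],
])

# Index of the largest weight in each row, shifted to 1-based topic numbers.
dominant_topic = sample_doc_topic.argmax(axis=1) + 1
print(dominant_topic)  # [1 5 5]

# Optional: rescale each row to sum to 1 so weights read as shares.
# Assumes no row is all zeros.
topic_share = sample_doc_topic / sample_doc_topic.sum(axis=1, keepdims=True)
print(topic_share.round(2))
```

On the full matrix, `tf_doc_topic.argmax(axis=1)` would give the same kind of assignment for every line.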

### Topic Modelling via LDA:

In [13]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [14]:
lda_model = LDA(n_components = 5)

In [15]:
lda_doc_topic = lda_model.fit_transform(tf_term_document_matrix)
lda_doc_topic.shape

(4439, 5)

Pulling the top 12 words for each of the k topics:

In [16]:
tf_words = vectorizer.get_feature_names_out()
tf = lda_model.components_.argsort(axis=1)[:,-12:]
tf_topic_words = [[tf_words[e] for e in l] for l in tf]
for i, words in enumerate(tf_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')

Topic 1:
['hussein', 'isi', 'north', 'china', 'saddam', 'ally', 'korea', 'israel', 'syria', 'weapon', 'iran', 'nuclear']


Topic 2:
['time', 'right', 'need', 'know', 'make', 'one', 'get', 'tax', 'would', 'think', 'going', 'people']


Topic 3:
['april', 'cardiologist', 'mar', 'klan', 'floyd', 'paso', 'conversion', 'shame', 'filter', 'yugoslavia', 'gramm', 'depletion']


Topic 4:
['xx', 'invention', 'freddie', 'clip', 'suburb', 'mac', 'humble', 'shooting', 'occurs', 'investigation', 'uphold', 'contra']


Topic 5:
['la', 'contain', 'matsu', 'flight', 'quemoy', 'lesbian', 'columbine', 'trigger', 'formosa', 'shooting', 'island', 'handgun']




At this point, these topics make much less sense than the NMF topics.
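One possible reason, not verified in this notebook, is that LDA is formulated as a model of raw word counts, while the matrix passed in here holds TF-IDF weights. A minimal sketch of the count-based alternative on a hypothetical toy corpus (all `toy_` names are illustrative, and `random_state=0` is an assumption added only to make reruns repeatable):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the filtered new_df['string'] column.
toy_docs = [
    "tax cut income tax rate",
    "health care insurance plan",
    "tax plan budget cut",
    "health insurance medicare cost",
]

# Raw term counts instead of TF-IDF weights.
count_vectorizer = CountVectorizer()
toy_counts = count_vectorizer.fit_transform(toy_docs)

# random_state pins the random initialisation so reruns give the same topics.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_counts)

# Each row is a topic distribution for one document and sums to 1.
print(toy_doc_topic.round(2))
```

Whether this improves the topics on the debate data would need to be checked by rerunning the cells above with a CountVectorizer in place of the TfidfVectorizer.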

### Topic Modelling via LSA:

For LSA, using TruncatedSVD:

In [17]:
from sklearn.decomposition import TruncatedSVD

In [18]:
lsa = TruncatedSVD(5)
doc_topic = lsa.fit_transform(tf_term_document_matrix)
lsa.explained_variance_ratio_

array([0.00291337, 0.00773775, 0.00568375, 0.00461262, 0.00402213])
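The five ratios above sum to about 0.025, so these components capture roughly 2.5% of the variance, and the first ratio is smaller than the second. A likely explanation, stated here as an assumption rather than something checked in this notebook, is that TruncatedSVD does not center the data, so on non-negative TF-IDF rows the first component tends to follow the overall mean direction. The NumPy sketch below computes the ratio the same way in principle (variance of each projected column divided by the total variance of the original columns); the function name and the random toy data are hypothetical.

```python
import numpy as np

def svd_explained_variance_ratio(X, k):
    """Variance of each of the first k SVD projections over the total variance of X."""
    X = np.asarray(X, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    projected = U[:, :k] * s[:k]  # equivalent to X @ Vt[:k].T
    return projected.var(axis=0) / X.var(axis=0).sum()

# Toy non-negative data, loosely mimicking uncentered TF-IDF rows.
rng = np.random.default_rng(0)
toy = rng.random((50, 8))
ratios = svd_explained_variance_ratio(toy, k=3)
print(ratios, ratios.sum())
```

The ratios are not forced to be in decreasing order, because the components are ranked by singular value rather than by the variance of the projection.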

Pulling the top 12 words for each of the k topics:

In [19]:
tf_words = vectorizer.get_feature_names_out()
tf = lsa.components_.argsort(axis=1)[:,-12:]
tf_topic_words = [[tf_words[e] for e in l] for l in tf]
for i, words in enumerate(tf_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')

Topic 1:
['america', 'job', 'know', 'need', 'make', 'one', 'get', 'would', 'think', 'going', 'tax', 'people']


Topic 2:
['business', 'billion', 'budget', 'money', 'social', 'rate', 'pay', 'plan', 'income', 'percent', 'cut', 'tax']


Topic 3:
['affordable', 'education', 'people', 'system', 'cost', 'school', 'plan', 'medicare', 'child', 'insurance', 'care', 'health']


Topic 4:
['life', 'gun', 'family', 'student', 'public', 'parent', 'college', 'kid', 'teacher', 'child', 'education', 'school']


Topic 5:
['teacher', 'education', 'medicare', 'federal', 'child', 'think', 'cut', 'program', 'school', 'security', 'social', 'would']


