# Topic Modeling with LDA and LSA

The purpose of this notebook is to run topic modeling on my corpus of debate transcripts using LDA and LSA. The results were compared to NMF (run in tf-idf_vectorizer_topic_modeling), which was determined to be the best model for the final product.

Importing packages:

In [1]:
import nltk
from nltk.corpus import stopwords
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10

Loading the pickled data:

In [2]:
with open('Data/cleaned_string_df.pickle','rb') as read_file:
    new_df = pickle.load(read_file)

In [3]:
new_df.head()

| Unnamed: 0 | Debate_Name | Transcript | Speaker | Data_Source | Debate_Type | Year | Speaker_Type | line_length | Election_Result | string |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: Good evening, and welcome to the first... | lehrer | Commission for Presidential Debates | General-President | 1992 | Moderator/Other | 100 | | good evening welcome first debate among major ... |
| 1 | The First Clinton-Bush-Perot Presidential Deb... | PEROT: I think the principal that separates me... | perot | Commission for Presidential Debates | General-President | 1992 | Independent | 74 | Loser | think principal separate half million people c... |
| 2 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: Governor Clinton, a one minute response. | lehrer | Commission for Presidential Debates | General-President | 1992 | Moderator/Other | 3 | | one minute response |
| 3 | The First Clinton-Bush-Perot Presidential Deb... | CLINTON: The most important distinction in thi... | clinton | Commission for Presidential Debates | General-President | 1992 | Democrat | 45 | Winner | important distinction campaign represent real ... |
| 4 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: President Bush, one minute response, sir. | lehrer | Commission for Presidential Debates | General-President | 1992 | Moderator/Other | 4 | | one minute response sir |


# TF-IDF Vectorizer

For the next round of topic modeling, I will be using a TF-IDF vectorizer so that the results can be compared.

In [226]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [227]:
stop = ['presidential', 'vice', 'evening', 'debate', 'candidate', 'campaign', 'minute']

In [228]:
vectorizer = TfidfVectorizer(stop_words=stop, max_df = 0.8)

Note: additional stop words were applied in final_dataframe_cleanup.ipynb.

Since some responses can be very short (e.g., just a brief statement or quip), I am setting a minimum word count for the responses used in topic modeling.

In [229]:
X = new_df[new_df.line_length >= 15]['string']
tfi_model = vectorizer.fit_transform(X)
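
To see how much a cutoff like this trims the corpus, here is a minimal sketch. It assumes a toy DataFrame (`toy_df`, hypothetical) that reuses the five `line_length` values shown in the `head()` output above; the candidate cutoffs are arbitrary choices for illustration, not values taken from the analysis.

```python
import pandas as pd

# Toy stand-in for new_df: only the two columns used by the filter.
# The line_length values mirror the five rows shown by new_df.head().
toy_df = pd.DataFrame({
    'line_length': [100, 74, 3, 45, 4],
    'string': [
        'good evening welcome first debate among major',
        'think principal separate half million people',
        'one minute response',
        'important distinction campaign represent real',
        'one minute response sir',
    ],
})

# Count how many responses survive each candidate minimum word count.
survivors = {
    cutoff: int((toy_df.line_length >= cutoff).sum())
    for cutoff in (5, 10, 15, 50)
}
print(survivors)
```

Running the same dictionary comprehension against `new_df` would show how many responses each threshold keeps before committing to one.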

In [230]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_term_document_matrix = pd.DataFrame(tfi_model.toarray(), columns=vectorizer.get_feature_names_out())

In [231]:
tf_term_document_matrix.shape

(32661, 20116)

### Topic Modeling via LDA:

In [None]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [None]:
lda_model = LDA(n_components = 5)

In [None]:
lda_doc_topic = lda_model.fit_transform(tf_term_document_matrix)
lda_doc_topic.shape

Pulling the top 10 words for each of the k topics:

In [None]:
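lda_words = vectorizer.get_feature_names_out()
# argsort is ascending, so a reversed slice gives the 10 highest-weighted words, heaviest first
lda = lda_model.components_.argsort(axis=1)[:, :-11:-1]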
lda_topic_words = [[lda_words[e] for e in l] for l in lda]
for i, words in enumerate(lda_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')
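
As a sanity check on the slicing, here is a minimal sketch on a tiny weight matrix. The vocabulary and weights are hypothetical stand-ins for `components_`. Because `argsort` sorts ascending, the heaviest words sit at the end of each row, and any slice that stops at `-1` leaves out the single highest-weighted word.

```python
import numpy as np

# Hypothetical 2-topic x 5-word weight matrix standing in for components_
vocab = np.array(['tax', 'health', 'war', 'job', 'school'])
components = np.array([
    [0.1, 0.9, 0.3, 0.7, 0.5],
    [0.8, 0.2, 0.6, 0.1, 0.4],
])

top_n = 3
# Reversed slice over the ascending argsort: the top_n largest weights, heaviest first
top_idx = components.argsort(axis=1)[:, :-top_n - 1:-1]
top_words = [[str(vocab[j]) for j in row] for row in top_idx]

for i, words in enumerate(top_words, 1):
    print('Topic {}: {}'.format(i, words))
```

For the first toy topic this returns `health`, `job`, `school`, in that order, which matches reading the largest weights off the row by eye.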

At this stage, these topics are noticeably less coherent than the ones from NMF.

### Topic Modeling via LSA:

For LSA, using TruncatedSVD:

In [None]:
from sklearn.decomposition import TruncatedSVD

Lowering the minimum word count from 15 to 10:

In [None]:
X = new_df[new_df.line_length >= 10]['string']
tfi_model = vectorizer.fit_transform(X)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_term_document_matrix = pd.DataFrame(tfi_model.toarray(), columns=vectorizer.get_feature_names_out())

In [None]:
tf_term_document_matrix.shape

In [None]:
lsa = TruncatedSVD(10)
doc_topic = lsa.fit_transform(tf_term_document_matrix)
lsa.explained_variance_ratio_
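
The explained variance ratios are easier to judge cumulatively. As a rough sketch of what LSA computes, the truncated SVD can be reproduced with NumPy alone. This assumes a small random matrix in place of the TF-IDF matrix; `X`, `k`, and the data are hypothetical. The ratio below follows my understanding of how scikit-learn defines it (variance of each transformed column over the total variance of the input columns), so treat it as an illustration and not as the library's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))   # hypothetical document-term matrix (6 docs, 4 terms)
k = 2                    # number of latent topics to keep

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

components = Vt[:k]              # topic-term matrix (analogous to components_)
doc_topic = U[:, :k] * s[:k]     # document-topic matrix, equal to X @ components.T

# Share of the total column variance captured by each kept topic
explained = doc_topic.var(axis=0) / X.var(axis=0).sum()
cumulative = np.cumsum(explained)
print(cumulative)
```

With the real model, `np.cumsum(lsa.explained_variance_ratio_)` gives the same kind of running total and shows how much extra variance each additional topic buys.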

Pulling the top 10 words for each of the k topics:

In [None]:
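tf_words = vectorizer.get_feature_names_out()
# argsort is ascending, so a reversed slice gives the 10 highest-weighted words, heaviest first
tf = lsa.components_.argsort(axis=1)[:, :-11:-1]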
tf_topic_words = [[tf_words[e] for e in l] for l in tf]
for i, words in enumerate(tf_topic_words, 1):
    print('Topic {}:'.format(i))
    print(words)
    print('\n')

In [None]:
doc_topic