# Initial Topic Modeling
This notebook begins the topic modeling work on my corpus. The data was cleaned and pre-processed in the final_dataframe_cleanup.ipynb notebook in the Text_Cleaning folder of the repo.

Importing packages:

In [1]:
import nltk
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10

Loading the pickled data:

In [2]:
pwd

'/Users/patrickbovard/Documents/GitHub/presidential_debate_analysis/NLP_Topic_Modeling'

In [3]:
cd ..

/Users/patrickbovard/Documents/GitHub/presidential_debate_analysis


In [4]:
with open('Data/cleaned_string_df.pickle','rb') as read_file:
    corpus_df = pickle.load(read_file)

In [5]:
corpus_df.head()

| | Debate_Name | Transcript | Speaker | Data_Source | Debate_Type | Year | Speaker_Type | line_length | Election_Result | string |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: Good evening, and welcome to the first... | lehrer | Commission for Presidential Debates | General-President | 1992 | Moderator/Other | 100 | | good evening welcome first debate among major ... |
| 1 | The First Clinton-Bush-Perot Presidential Deb... | PEROT: I think the principal that separates me... | perot | Commission for Presidential Debates | General-President | 1992 | Independent | 74 | Loser | think principal separate half million people c... |
| 2 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: Governor Clinton, a one minute response. | lehrer | Commission for Presidential Debates | General-President | 1992 | Moderator/Other | 3 | | one minute response |
| 3 | The First Clinton-Bush-Perot Presidential Deb... | CLINTON: The most important distinction in thi... | clinton | Commission for Presidential Debates | General-President | 1992 | Democrat | 45 | Winner | important distinction campaign represent real ... |
| 4 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: President Bush, one minute response, sir. | lehrer | Commission for Presidential Debates | General-President | 1992 | Moderator/Other | 4 | | one minute response sir |


## Count Vectorizer:

For the first round of topic modeling, I will use a count vectorizer. Initializing the CountVectorizer:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(stop_words='english')

Since some responses can be very short (e.g. just a brief statement or quip), I am setting a minimum word count for topic modeling, starting here at 40 words as measured by the line_length column. This also keeps the focus on topics the candidates cover in detail.

In [7]:
X = corpus_df[corpus_df.line_length >= 40]['string']
cv_model = count_vectorizer.fit_transform(X)
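
The cutoff is a tunable choice, so it can help to see how many lines each candidate threshold keeps before settling on one. The sketch below uses a small made-up frame; `toy_df` and `lines_kept` are illustrative names and values, not part of the project data.

```python
import pandas as pd

# Hypothetical stand-in for corpus_df with only the columns used here.
toy_df = pd.DataFrame({
    "line_length": [3, 45, 100, 4, 74, 39, 40],
    "string": ["one minute response", "a", "b", "c", "d", "e", "f"],
})

def lines_kept(df, min_words):
    """Count the rows whose line_length is at least min_words."""
    return int((df["line_length"] >= min_words).sum())

# Compare a few candidate thresholds side by side.
counts = {k: lines_kept(toy_df, k) for k in (10, 25, 40, 60)}
print(counts)
```

Running the same comparison on the real frame would show how much of the corpus a stricter cutoff discards.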

In [8]:
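# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
term_document_matrix = pd.DataFrame(cv_model.toarray(), columns=count_vectorizer.get_feature_names_out())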

In [9]:
term_document_matrix.shape

(4439, 12835)
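
Calling `.toarray()` materializes every count as a dense array, which is convenient for inspection but not required for modeling: scikit-learn's `NMF` also accepts the sparse matrix returned by `fit_transform`. A minimal sketch on a toy corpus follows; `toy_docs` and the `init`, `random_state`, and `max_iter` settings are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for the cleaned debate lines.
toy_docs = [
    "tax cut income percent plan",
    "tax plan money income",
    "war world policy administration",
    "world war america policy",
    "health care job people",
    "people need health care",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # SciPy sparse matrix

# Fit NMF directly on the sparse counts; no dense copy is created here.
toy_nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
toy_doc_topic = toy_nmf.fit_transform(toy_counts)

print(toy_counts.shape, toy_doc_topic.shape)
```

Fixing `random_state` also makes repeated runs comparable, since NMF results can vary with initialization.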

In [10]:
term_document_matrix.head()

Unnamed: 0,aaa,aah,aapi,aarp,aayuh,abandon,abandoned,abandoning,abandonment,abc,...,zeroing,zimbabwe,zion,zip,zippo,zone,zoning,zoom,zubowski,³who
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Topic Modeling
From here, I'll move into topic modeling using NMF.  I'll start with a k of 5, to represent topics I expect to see:
- Foreign Policy
- Economy
- Domestic Social Issues
- Immigration
- Catch-All (e.g. everything else, possibly guns, election-related words, etc.)
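
The value of k here comes from the topics I expect to see. One common heuristic for sanity-checking that choice is to fit NMF for several values of k and compare the reconstruction error, which scikit-learn exposes as `reconstruction_err_`. The sketch below runs on a random non-negative matrix; `toy_matrix`, the candidate k values, and the `init`, `random_state`, and `max_iter` settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative counts standing in for a small document-term matrix.
rng = np.random.default_rng(0)
toy_matrix = rng.poisson(1.0, size=(30, 12)).astype(float)

# Fit one model per candidate k and record its reconstruction error.
errors = {}
for k in (2, 3, 5):
    model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=1000)
    model.fit(toy_matrix)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"k={k}: reconstruction error {err:.3f}")
```

Lower error alone does not mean more interpretable topics, so reading the top words for each k remains the main check.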

In [11]:
from sklearn.decomposition import NMF

Initializing the NMF Model:

In [12]:
nmf_model = NMF(n_components=5)

Fitting the model and getting the topic weights for each line:

In [13]:
doc_topic = nmf_model.fit_transform(term_document_matrix)
doc_topic.shape



(4439, 5)
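
Each row of `doc_topic` holds one line's weight on each of the k topics. A simple way to read it is to take the highest-weighted topic per row, along with how much of the row's total weight that topic holds. The sketch below uses a made-up array (`toy_doc_topic`) rather than the fitted output.

```python
import numpy as np

# Hypothetical weights shaped like doc_topic: (n_documents, n_topics).
toy_doc_topic = np.array([
    [0.9, 0.0, 0.1],
    [0.0, 0.2, 0.7],
    [0.3, 0.4, 0.0],
])

# Index of the highest-weighted topic for each document.
dominant_topic = toy_doc_topic.argmax(axis=1)

# Share of each row's total weight held by its dominant topic.
# Rows that sum to zero get a share of 0 instead of a division error.
row_sums = toy_doc_topic.sum(axis=1)
dominant_share = np.divide(
    toy_doc_topic.max(axis=1),
    row_sums,
    out=np.zeros_like(row_sums),
    where=row_sums > 0,
)

print(dominant_topic)
print(dominant_share.round(2))
```

A low share suggests the line is spread across several topics rather than belonging clearly to one.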

Pulling the top 10 words for each of the k topics (listed in ascending order of weight, so the strongest word in each topic comes last):

In [14]:
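# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
words = count_vectorizer.get_feature_names_out().tolist()
# argsort is ascending, so the last 10 indices in each row are the 10 highest-weighted words
top_idx = nmf_model.components_.argsort(axis=1)[:, -10:]
topic_words = [[words[i] for i in row] for row in top_idx]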
topic_words

[['policy',
  'time',
  'war',
  'believe',
  'administration',
  'year',
  'america',
  'world',
  'united',
  'state'],
 ['family',
  'dollar',
  'billion',
  'pay',
  'plan',
  'income',
  'money',
  'percent',
  'cut',
  'tax'],
 ['thing',
  'million',
  'job',
  'make',
  'work',
  'need',
  'know',
  'care',
  'health',
  'people'],
 ['thing',
  'way',
  'security',
  'got',
  'job',
  'sure',
  'know',
  'say',
  'make',
  'going'],
 ['ought',
  'child',
  'way',
  'say',
  'know',
  'thing',
  'right',
  'make',
  'school',
  'think']]
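
For reading the topics, it can be easier to list each topic's words from strongest to weakest. The helper below is a sketch of that ordering; `top_words`, `toy_components`, and `toy_vocab` are hypothetical names and values, not objects from this notebook.

```python
import numpy as np

def top_words(components, vocab, n=10):
    """Return the n highest-weighted words per topic, strongest first."""
    vocab = np.asarray(vocab)
    # Sort each row ascending, reverse it to descending, then keep the first n.
    order = np.argsort(components, axis=1)[:, ::-1][:, :n]
    return [vocab[row].tolist() for row in order]

# Made-up topic-word weights: 2 topics over a 4-word vocabulary.
toy_components = np.array([
    [0.1, 0.9, 0.5, 0.0],
    [0.7, 0.0, 0.2, 0.3],
])
toy_vocab = ["cut", "tax", "percent", "war"]

result = top_words(toy_components, toy_vocab, n=2)
print(result)
```

With the fitted model, the equivalent call would pass `nmf_model.components_` and the vectorizer's vocabulary.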

Some topics are reasonably clear (taxes, for example), but most are vague: generic words such as "thing", "know", and "make" recur across several topics.

Topic modeling using TF-IDF vectorization is in the **tf-idf_vectorizer_topic_modeling.ipynb** notebook.
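
For context, TF-IDF down-weights words that appear in most documents, which is one reason it may reduce the generic words in the count-based topics. The sketch below is not the contents of the referenced notebook; it only illustrates, on made-up documents (`toy_docs`), how scikit-learn's `TfidfVectorizer` assigns a lower inverse document frequency to a word that appears everywhere.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: "people" appears in all three, "tax" in only one.
toy_docs = [
    "people tax cut",
    "people health care",
    "people war world",
]

tfidf = TfidfVectorizer()
toy_weights = tfidf.fit_transform(toy_docs)

# idf_ holds one inverse-document-frequency value per vocabulary entry.
idf_people = tfidf.idf_[tfidf.vocabulary_["people"]]
idf_tax = tfidf.idf_[tfidf.vocabulary_["tax"]]

print(f"idf(people) = {idf_people:.3f}, idf(tax) = {idf_tax:.3f}")
```

The resulting matrix can be passed to `NMF` in the same way as the count matrix.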

### Comparison:

After comparing these topics with those generated using TF-IDF vectorization, I will use TF-IDF for the final modeling of this project.

NOTE: sentiment analysis and related EDA for sentiment of speakers, lines, parties, etc. is in sentiment_analysis_work.ipynb.