# Introduction

The goal of the topic modelling is to identify the various topics in our text. Each document in the text can include one or multiple topics. I will apply the steps of **Latent Dirichlet Allocation (LDA)** that is a technique used for topic modeling. To implement topic modeling technique, we need to use a document-term matrix and the number of topics. 
When the topic modeling technique is applied, we will interpret the results and see if the words in each topic make sense.

The steps in the notebook are:
1. Topic Modeling for All Text
2. Topic Modeling for Nouns Only
3. Topic Modeling for Nouns and Adjectives
4. Identify Topics in Each Document

# Importing Libraries

In [1]:
import pandas as pd
import pickle
from gensim import matutils, models
import scipy.sparse
from nltk import word_tokenize, pos_tag
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


## Topic Modeling for All Text

### 1st Presidential Debate

In [3]:
first_dtm = pd.read_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/first_dtm_stop.pkl")
first_dtm = first_dtm.loc[["Donald Trump","Joe Biden"]]
first_dtm.head()

Unnamed: 0,ability,able,abolish,abraham,absolutely,absorb,abuse,academic,accept,accompany,accomplish,accord,accountable,acknowledge,acre,act,actually,add,addition,additional,address,administration,admission,admit,advantage,advisor,affect,affidavit,afford,affordable,afraid,african,africanamerican,africanamericans,agency,ago,agree,ahead,air,airport,...,weekend,welcome,west,western,whatsoever,wherewithal,whichever,whistle,white,wide,wife,willing,win,wing,winner,wipe,wishful,woman,womens,wonder,word,work,worker,workforce,world,worried,worth,wrap,write,wrong,wuhan,xenophobic,xi,yapping,yeah,year,yes,york,young,zero
Donald Trump,0,1,0,0,3,0,0,1,1,0,0,3,0,0,1,1,1,0,1,0,0,4,0,0,0,0,0,0,1,0,1,0,2,0,0,3,7,5,4,2,...,0,0,1,0,1,0,0,0,1,1,1,1,9,4,0,0,0,0,0,1,7,2,0,1,1,0,0,1,0,10,0,2,0,0,2,26,5,4,3,0
Joe Biden,2,17,0,0,3,1,0,0,5,1,1,3,3,2,0,6,1,0,2,1,0,6,0,1,2,0,0,1,0,5,1,2,1,2,0,0,0,0,0,0,...,0,0,0,0,0,1,0,2,4,0,0,0,4,0,1,3,1,4,1,0,2,8,0,0,5,2,1,0,2,6,1,0,2,1,5,9,5,0,1,1


In [4]:
# One of the required inputs is a term-document matrix, preparing for LDA

first_dtm = first_dtm.transpose()
first_dtm.head() 

Unnamed: 0,Donald Trump,Joe Biden
ability,0,2
able,1,17
abolish,0,0
abraham,0,0
absolutely,3,3


In [5]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus

sparse_counts = scipy.sparse.csr_matrix(first_dtm)
corpus1 = matutils.Sparse2Corpus(sparse_counts)

In [6]:
# Gensim also requires dictionary of the all terms and their respective location in the term-document matrix
cv1 = pickle.load(open("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/cv1_stop.pkl", "rb"))
id2word1 = dict((v, k) for k, v in cv1.vocabulary_.items())

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes.

In [7]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda1 = models.LdaModel(corpus=corpus1, id2word=id2word1, num_topics=2, passes=10)
lda1.print_topics()

[(0,
  '0.001*"people" + 0.001*"way" + 0.001*"million" + 0.001*"president" + 0.001*"make" + 0.001*"fact" + 0.001*"happen" + 0.001*"ballot" + 0.001*"year" + 0.001*"talk"'),
 (1,
  '0.028*"people" + 0.010*"way" + 0.010*"make" + 0.009*"million" + 0.008*"president" + 0.008*"fact" + 0.008*"happen" + 0.008*"ballot" + 0.007*"vote" + 0.007*"talk"')]

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.

### 2nd Presidential Debate

In [8]:
second_dtm = pd.read_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/second_dtm_stop.pkl")
second_dtm = second_dtm.loc[["Donald Trump","Joe Biden"]]
second_dtm.head()

Unnamed: 0,abide,ability,able,abraham,abroad,absolutely,abuse,access,accord,account,accountant,accumulate,accurate,accuse,act,action,activity,actually,actuary,addition,address,administration,advance,adversary,advisor,advocate,affect,affordable,afghanistan,africanamerican,agent,ago,agree,ahead,air,alabama,alcohol,allow,ally,amendment,...,whistle,white,wife,willing,wilmington,win,wind,windmill,window,windshield,winter,wiper,witch,withhold,woman,wonder,wonderful,word,work,worker,world,worldwide,worried,worry,worth,wrap,write,wrong,wuhan,xenophobia,xenophobic,xi,yeah,year,yes,yesterday,york,young,zero,zone
Donald Trump,0,0,4,6,0,1,2,0,1,4,1,0,1,0,0,3,0,1,0,0,0,2,1,0,0,0,0,0,0,0,1,14,0,1,4,1,1,2,0,0,...,0,1,1,1,0,3,2,2,4,0,1,0,1,0,1,0,1,2,9,0,10,3,0,2,0,1,2,2,0,0,2,0,1,35,2,0,7,3,0,4
Joe Biden,1,2,16,2,0,1,0,3,2,1,0,1,0,2,5,0,1,2,1,1,0,2,0,0,2,0,1,4,1,1,0,0,2,0,2,0,1,4,1,0,...,1,1,2,0,1,1,3,1,1,1,2,1,0,1,1,1,0,2,4,1,6,0,1,8,1,0,1,3,1,1,1,1,0,16,2,0,2,2,3,2


In [9]:
# One of the required inputs is a term-document matrix, preparing for LDA

second_dtm = second_dtm.transpose()
second_dtm.head() 

Unnamed: 0,Donald Trump,Joe Biden
abide,0,1
ability,0,2
able,4,16
abraham,6,2
abroad,0,0


In [10]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus

sparse_counts = scipy.sparse.csr_matrix(second_dtm)
corpus2 = matutils.Sparse2Corpus(sparse_counts)

In [11]:
# Gensim also requires dictionary of the all terms and their respective location in the term-document matrix
cv2 = pickle.load(open("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/cv2_stop.pkl", "rb"))
id2word2 = dict((v, k) for k, v in cv2.vocabulary_.items())

In [12]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda2 = models.LdaModel(corpus=corpus2, id2word=id2word2, num_topics=2, passes=10)
lda2.print_topics()

[(0,
  '0.017*"make" + 0.016*"people" + 0.010*"fact" + 0.009*"president" + 0.008*"sure" + 0.008*"china" + 0.008*"talk" + 0.007*"pay" + 0.006*"need" + 0.006*"happen"'),
 (1,
  '0.015*"people" + 0.012*"want" + 0.011*"year" + 0.011*"think" + 0.010*"million" + 0.010*"joe" + 0.009*"money" + 0.009*"make" + 0.008*"country" + 0.008*"talk"')]

## Topic Modelling for Nouns Only

### 1st Presidential Debate

In [13]:
# Let's create a function to pull out nouns from a string of text

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [14]:
# Read in the cleaned data, before the CountVectorizer step

first_clean = pd.read_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/first_whole_corpus.pkl")
first_clean = first_clean.loc[["Donald Trump","Joe Biden"]]
first_clean

Unnamed: 0,transcript,speech_time,clean_text
Donald Trump,"How are you doing?, Thank you very much, Chris...",36.0,thank much chris tell simply win election elec...
Joe Biden,"How you doing, man?, I’m well., Well, first of...",28.0,man well well first thank look forward mr pres...


In [15]:
# Apply the nouns function to the transcripts to filter only on nouns
first_nouns = pd.DataFrame(first_clean["clean_text"].apply(nouns))
first_nouns

Unnamed: 0,clean_text
Donald Trump,chris tell election election consequence house...
Joe Biden,man thank look president people court nominee ...


In [16]:
# Create a new document-term matrix using only nouns

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words1 = ['like',  'know', 'just', 'get', 'would', 'well', 'people','thing',
                   'want','way','look','joe','president','chris','biden']
stop_words1 = text.ENGLISH_STOP_WORDS.union(add_stop_words1)

# Recreate a document-term matrix with only nouns
cvn1 = CountVectorizer(stop_words=stop_words1)
first_cvn = cvn1.fit_transform(first_nouns.clean_text)
first_dtmn = pd.DataFrame(first_cvn.toarray(), columns=cvn1.get_feature_names())
first_dtmn.index = first_nouns.index
first_dtmn

Unnamed: 0,ability,absorb,accompany,accord,acknowledge,act,addition,administration,admit,advantage,afford,air,airport,alcohol,america,americans,analysis,andor,announce,answer,antifa,anybody,appeal,apple,approval,area,arm,art,ask,aspect,asset,assistance,automobile,ballot,ban,bank,barisma,basket,bastard,beat,...,ventilator,vet,vice,view,violence,violent,virginia,vote,vulnerable,wade,wage,wait,wall,war,wastepaper,watch,watcher,water,weapon,wear,weather,week,wherewithal,whistle,wife,win,winner,wipe,woman,womens,wonder,word,work,workforce,world,xi,yeah,year,yes,york
Donald Trump,0,0,0,2,0,0,1,4,0,0,1,4,1,1,0,0,0,0,1,1,1,2,1,0,1,1,0,0,1,1,1,0,0,22,1,1,0,2,1,0,...,2,1,5,2,0,0,2,3,1,1,0,1,0,0,1,4,1,3,0,2,0,2,0,0,1,0,0,0,0,0,1,7,2,1,1,0,1,26,4,4
Joe Biden,2,1,1,2,1,5,2,6,1,2,0,0,0,0,4,1,1,1,0,0,1,1,0,1,0,0,1,2,3,0,0,2,2,15,0,1,1,0,0,1,...,0,0,4,1,5,2,0,24,0,0,1,5,1,1,0,0,0,0,1,0,1,0,1,1,0,1,1,1,4,1,0,2,5,0,5,2,1,9,0,0


In [17]:
# Create the gensim corpus
corpusn1 = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(first_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn1 = dict((v, k) for k, v in cvn1.vocabulary_.items())

In [18]:
# Let's create 2 different topic modeling
ldan1 = models.LdaModel(corpus=corpusn1, num_topics=2, id2word=id2wordn1, passes=10)
ldan1.print_topics()

[(0,
  '0.019*"year" + 0.018*"country" + 0.016*"ballot" + 0.014*"dollar" + 0.014*"election" + 0.012*"law" + 0.011*"lot" + 0.010*"job" + 0.010*"place" + 0.010*"car"'),
 (1,
  '0.026*"fact" + 0.017*"vote" + 0.017*"deal" + 0.015*"number" + 0.014*"job" + 0.011*"talk" + 0.011*"ballot" + 0.011*"tax" + 0.009*"man" + 0.008*"election"')]

In [19]:
corpus_transformed = ldan1[corpusn1]

list(zip([a for [(a,b)] in corpus_transformed], first_dtmn.index))

[(0, 'Donald Trump'), (1, 'Joe Biden')]

### 2nd Presidential Debate

In [20]:
# Let's create a function to pull out nouns from a string of text

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [21]:
# Read in the cleaned data, before the CountVectorizer step

second_clean = pd.read_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/second_whole_corpus.pkl")
second_clean = first_clean.loc[["Donald Trump","Joe Biden"]]
second_clean

Unnamed: 0,transcript,speech_time,clean_text
Donald Trump,"How are you doing?, Thank you very much, Chris...",36.0,thank much chris tell simply win election elec...
Joe Biden,"How you doing, man?, I’m well., Well, first of...",28.0,man well well first thank look forward mr pres...


In [22]:
# Apply the nouns function to the transcripts to filter only on nouns
second_nouns = pd.DataFrame(second_clean["clean_text"].apply(nouns))
second_nouns

Unnamed: 0,clean_text
Donald Trump,chris tell election election consequence house...
Joe Biden,man thank look president people court nominee ...


In [23]:
# Create a new document-term matrix using only nouns

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words2 = ['like',  'know', 'just', 'get', 'would', 'well', 'people',
                   'thing', 'want','look','joe','president','chris','biden','way']
stop_words2 = text.ENGLISH_STOP_WORDS.union(add_stop_words2)

# Recreate a document-term matrix with only nouns
cvn2 = CountVectorizer(stop_words=stop_words2)
second_cvn = cvn2.fit_transform(second_nouns.clean_text)
second_dtmn = pd.DataFrame(second_cvn.toarray(), columns=cvn2.get_feature_names())
second_dtmn.index = second_nouns.index
second_dtmn

Unnamed: 0,ability,absorb,accompany,accord,acknowledge,act,addition,administration,admit,advantage,afford,air,airport,alcohol,america,americans,analysis,andor,announce,answer,antifa,anybody,appeal,apple,approval,area,arm,art,ask,aspect,asset,assistance,automobile,ballot,ban,bank,barisma,basket,bastard,beat,...,ventilator,vet,vice,view,violence,violent,virginia,vote,vulnerable,wade,wage,wait,wall,war,wastepaper,watch,watcher,water,weapon,wear,weather,week,wherewithal,whistle,wife,win,winner,wipe,woman,womens,wonder,word,work,workforce,world,xi,yeah,year,yes,york
Donald Trump,0,0,0,2,0,0,1,4,0,0,1,4,1,1,0,0,0,0,1,1,1,2,1,0,1,1,0,0,1,1,1,0,0,22,1,1,0,2,1,0,...,2,1,5,2,0,0,2,3,1,1,0,1,0,0,1,4,1,3,0,2,0,2,0,0,1,0,0,0,0,0,1,7,2,1,1,0,1,26,4,4
Joe Biden,2,1,1,2,1,5,2,6,1,2,0,0,0,0,4,1,1,1,0,0,1,1,0,1,0,0,1,2,3,0,0,2,2,15,0,1,1,0,0,1,...,0,0,4,1,5,2,0,24,0,0,1,5,1,1,0,0,0,0,1,0,1,0,1,1,0,1,1,1,4,1,0,2,5,0,5,2,1,9,0,0


In [24]:
# Create the gensim corpus
corpusn2 = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(second_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn2 = dict((v, k) for k, v in cvn2.vocabulary_.items())

In [25]:
# Let's create 2 different topic modeling
ldan2 = models.LdaModel(corpus=corpusn2, num_topics=2, id2word=id2wordn2, passes=10)
ldan2.print_topics()

[(0,
  '0.027*"fact" + 0.017*"vote" + 0.017*"deal" + 0.016*"number" + 0.014*"job" + 0.011*"talk" + 0.011*"ballot" + 0.011*"tax" + 0.009*"man" + 0.008*"election"'),
 (1,
  '0.019*"year" + 0.018*"country" + 0.016*"ballot" + 0.014*"election" + 0.014*"dollar" + 0.012*"law" + 0.011*"lot" + 0.010*"job" + 0.010*"place" + 0.010*"car"')]

In [31]:
corpus_transformed = ldan2[corpusn2]

list(zip([a for [(a,b)] in corpus_transformed], second_dtmn.index))

[(1, 'Donald Trump'), (0, 'Joe Biden')]

## Topic Modeling for Nouns and Adjectives

### 1st Presidential Debate

In [27]:
# Let's create a function to pull out nouns from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [28]:
# Apply the nouns function to the transcripts to filter only on nouns
first_nouns_adj = pd.DataFrame(first_clean["clean_text"].apply(nouns_adj))
first_nouns_adj

Unnamed: 0,clean_text
Donald Trump,much chris tell win election election conseque...
Joe Biden,man first thank look mr president american peo...


In [29]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna1 = CountVectorizer(stop_words=stop_words1, max_df=.8)
first_cvna = cvna1.fit_transform(first_nouns_adj["clean_text"])
first_dtmna = pd.DataFrame(first_cvna.toarray(), columns=cvna1.get_feature_names())
first_dtmna.index = first_nouns_adj.index
first_dtmna

Unnamed: 0,ability,absorb,academic,accompany,accomplish,accountable,acknowledge,acre,additional,admit,advantage,afford,affordable,african,agree,air,airport,alcohol,america,american,americans,analysis,andor,announce,antisemitic,appeal,apple,appropriate,approval,approve,area,arm,art,aspect,asset,assistance,automobile,aware,ban,barisma,...,vet,violence,virginia,vulnerable,wade,wage,wall,war,warm,wash,wastepaper,watch,watcher,water,wealthy,weapon,weather,week,west,wherewithal,whistle,wide,wife,willing,winner,wipe,wishful,woman,womens,wonder,workforce,worried,worth,wrap,write,wuhan,xenophobic,xi,yes,york
Donald Trump,0,0,1,0,0,0,0,1,0,0,0,1,0,0,4,4,2,1,0,0,0,0,0,1,0,1,0,0,1,0,1,0,0,1,1,0,0,0,1,0,...,1,0,2,1,1,0,0,0,0,1,1,4,1,3,0,0,0,2,1,0,0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,0,1,0,4,4
Joe Biden,2,1,0,1,1,3,1,0,1,1,2,0,5,2,0,0,0,0,4,14,1,1,1,0,1,0,1,3,0,1,0,1,2,0,0,2,2,1,0,1,...,0,5,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,0,0,1,2,0,0,0,1,3,1,4,1,0,0,2,1,0,1,1,0,2,0,0


In [30]:
# Create the gensim corpus
corpusna1 = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(first_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna1 = dict((v, k) for k, v in cvna1.vocabulary_.items())

In [36]:
# Let's create 2 different topic modeling
ldana1 = models.LdaModel(corpus=corpusna1, num_topics=2, id2word=id2wordna1, passes=10)
ldana1.print_topics()

[(0,
  '0.016*"place" + 0.010*"month" + 0.008*"city" + 0.008*"okay" + 0.008*"judge" + 0.008*"close" + 0.007*"period" + 0.007*"november" + 0.007*"management" + 0.006*"excuse"'),
 (1,
  '0.015*"american" + 0.008*"united" + 0.007*"create" + 0.007*"discredit" + 0.006*"home" + 0.006*"affordable" + 0.006*"violence" + 0.006*"director" + 0.005*"blow" + 0.005*"gas"')]

In [33]:
corpus_transformed = ldana1[corpusna1]
list(zip([a for [(a,b)] in corpus_transformed], first_dtmna.index))

[(0, 'Donald Trump'), (1, 'Joe Biden')]

### 2nd Presidential Debate

In [37]:
# Let's create a function to pull out nouns from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [38]:
# Apply the nouns function to the transcripts to filter only on nouns
second_nouns_adj = pd.DataFrame(second_clean["clean_text"].apply(nouns_adj))
second_nouns_adj

Unnamed: 0,clean_text
Donald Trump,much chris tell win election election conseque...
Joe Biden,man first thank look mr president american peo...


In [39]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna2 = CountVectorizer(stop_words=stop_words2, max_df=.8)
second_cvna = cvna2.fit_transform(second_nouns_adj["clean_text"])
second_dtmna = pd.DataFrame(second_cvna.toarray(), columns=cvna2.get_feature_names())
second_dtmna.index = second_nouns_adj.index
second_dtmna

Unnamed: 0,ability,absorb,academic,accompany,accomplish,accountable,acknowledge,acre,additional,admit,advantage,afford,affordable,african,agree,air,airport,alcohol,america,american,americans,analysis,andor,announce,antisemitic,appeal,apple,appropriate,approval,approve,area,arm,art,aspect,asset,assistance,automobile,aware,ban,barisma,...,vet,violence,virginia,vulnerable,wade,wage,wall,war,warm,wash,wastepaper,watch,watcher,water,wealthy,weapon,weather,week,west,wherewithal,whistle,wide,wife,willing,winner,wipe,wishful,woman,womens,wonder,workforce,worried,worth,wrap,write,wuhan,xenophobic,xi,yes,york
Donald Trump,0,0,1,0,0,0,0,1,0,0,0,1,0,0,4,4,2,1,0,0,0,0,0,1,0,1,0,0,1,0,1,0,0,1,1,0,0,0,1,0,...,1,0,2,1,1,0,0,0,0,1,1,4,1,3,0,0,0,2,1,0,0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,0,1,0,4,4
Joe Biden,2,1,0,1,1,3,1,0,1,1,2,0,5,2,0,0,0,0,4,14,1,1,1,0,1,0,1,3,0,1,0,1,2,0,0,2,2,1,0,1,...,0,5,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,0,0,1,2,0,0,0,1,3,1,4,1,0,0,2,1,0,1,1,0,2,0,0


In [40]:
# Create the gensim corpus
corpusna2 = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(second_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna2 = dict((v, k) for k, v in cvna2.vocabulary_.items())

In [44]:
# Let's create 2 different topic modeling
ldana2 = models.LdaModel(corpus=corpusna2, num_topics=2, id2word=id2wordna2, passes=10)
ldana2.print_topics()

[(0,
  '0.016*"place" + 0.010*"month" + 0.008*"close" + 0.008*"city" + 0.008*"okay" + 0.008*"judge" + 0.007*"management" + 0.007*"november" + 0.007*"period" + 0.006*"super"'),
 (1,
  '0.015*"american" + 0.008*"united" + 0.007*"create" + 0.007*"discredit" + 0.006*"home" + 0.006*"director" + 0.006*"violence" + 0.006*"affordable" + 0.005*"panic" + 0.005*"proposal"')]

In [42]:
corpus_transformed = ldana2[corpusna2]
list(zip([a for [(a,b)] in corpus_transformed], second_dtmna.index))

[(0, 'Donald Trump'), (1, 'Joe Biden')]

## Identify Topics in Each Document
When we look at the topic modelings, the nouns modeling made the most sense. So let's pull that down here and run it through some more iterations to get more fine-tuned topics.

### 1st Presidential Debate

In [49]:
# Our final LDA model 
ldan1 = models.LdaModel(corpus=corpusn1, num_topics=2, id2word=id2wordn1, passes=10)
ldan1.print_topics()

[(0,
  '0.026*"fact" + 0.017*"deal" + 0.017*"vote" + 0.015*"number" + 0.014*"job" + 0.011*"talk" + 0.011*"ballot" + 0.011*"tax" + 0.009*"man" + 0.008*"care"'),
 (1,
  '0.019*"year" + 0.018*"country" + 0.016*"ballot" + 0.014*"election" + 0.014*"dollar" + 0.012*"law" + 0.011*"lot" + 0.010*"job" + 0.010*"place" + 0.009*"car"')]

These topics look pretty decent. Let's settle on these for now.
* Topic 0: election, vote, job, tax 
* Topic 1: country, job, economic affairs, election


In [48]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldan1[corpusn1]

list(zip([a for [(a,b)] in corpus_transformed], first_dtmn.index))

[(1, 'Donald Trump'), (0, 'Joe Biden')]

For now, our topic modelling makes sense and these are our topics for each politician.

* Topic 0: election, vote, job, tax **[Joe Biden]** 
* Topic 1: country, job, economic affairs, election **[Donald Trump]**


* **Finding**            
There were specific discussion topic in the debate, similar to the topics we found in our model. What we notice here is the focus point of each politician during the debate.

### 2nd Presidential Debate

In [52]:
# Our final LDA model (for now)
ldan2 = models.LdaModel(corpus=corpusn2, num_topics=2, id2word=id2wordn2, passes=10)
ldan2.print_topics()

[(0,
  '0.019*"year" + 0.018*"country" + 0.016*"ballot" + 0.014*"election" + 0.014*"dollar" + 0.012*"law" + 0.011*"lot" + 0.010*"job" + 0.010*"place" + 0.010*"car"'),
 (1,
  '0.026*"fact" + 0.017*"vote" + 0.017*"deal" + 0.015*"number" + 0.014*"job" + 0.011*"talk" + 0.011*"ballot" + 0.011*"tax" + 0.009*"man" + 0.008*"election"')]

These topics look pretty decent. Let's settle on these for now.
* Topic 0: country, economic affairs, law, election
* Topic 1: election, vote, job, tax,

In [53]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldan2[corpusn2]
list(zip([a for [(a,b)] in corpus_transformed], second_dtmn.index))

[(0, 'Donald Trump'), (1, 'Joe Biden')]

For now, our topic modelling makes sense and these are our topics for each politician.
* Topic 0: country, economic affairs, law, election **[Donald Trump]**
* Topic 1: election, vote, job, tax **[Joe Biden]**

* **Finding**              
No surprise, we have similar focused topics in second debate as well.