# Topic Modeling with Latent Dirichlet Allocation (LDA)
In this project extension I apply an LDA model to the data. The model aims to uncover hidden structure in a collection of texts. Topic modeling is comparable to clustering (which makes it an interesting extension for this project), but LDA builds clusters of words rather than clusters of texts.


> LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
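
To make the generative story in the quote concrete, here is a minimal sketch that samples one toy document: draw a topic mixture for the document, then for each word position pick a topic and then a word from that topic. The vocabulary, the two topics, and the Dirichlet parameter are made-up assumptions for illustration, not values from this project, and the sketch uses only the standard library rather than gensim.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, doc_alpha, n_words, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word_probs[topic])[0])
    return theta, words


# Hypothetical toy vocabulary and two hand-written topics (not project data).
vocab = ["sun", "orbit", "shadow", "water", "cycle", "rust"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # an astronomy-like topic
    [0.0, 0.0, 0.0, 0.5, 0.3, 0.2],  # a water-like topic
]

rng = random.Random(42)
theta, words = generate_document(
    topic_word_probs, vocab, doc_alpha=[0.5, 0.5], n_words=8, rng=rng
)
print(theta)
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.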

# Libraries and Data

In [1]:
#custom functions 
from projectfunctions import * 

In [34]:
import pandas as pd  
import numpy as np  
import pickle   

%matplotlib inline
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors
import seaborn as sns

import gensim  #needed for gensim.models.LdaModel below
import gensim.corpora as corpora 

from pprint import pprint  

import os 

from wordcloud import WordCloud, STOPWORDS   

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Prepare Data For LDA Analysis 

In [101]:
#load in question data 
classroom_questions_csv = pd.read_csv(r'PDFfiles/classroom_questions.csv')
cq_list = classroom_questions_csv['question'].values.tolist()

In [102]:
import re

#requires the NLTK stopwords corpus: nltk.download('stopwords')
from nltk.corpus import stopwords


class CleaningText:
    """Text-cleaning helpers for a list of question strings."""

    def __init__(self, lst):
        self.lst = lst

    def lower_words(self):
        #return the list with every text lowercased
        return [text.lower() for text in self.lst]

    def remove_stopwords(self):
        #drops any entry that is itself a stopword; this compares whole
        #texts, so stopwords inside a question are kept (word-level
        #filtering would need each text split into words first)
        stopword = set(stopwords.words('english'))
        return [text for text in self.lst if text not in stopword]

    def remove_punc(self):
        #replace every non-alphanumeric character with a space
        return [re.sub(r'[^a-zA-Z0-9]', ' ', text) for text in self.lst]

    def cleaned(self):
        #lowercase the texts, then apply remove_stopwords
        lowered = CleaningText(self.lower_words())
        return lowered.remove_stopwords()

In [108]:
p1 = CleaningText(cq_list)
cleaned = p1.cleaned() 

p2 = CleaningText(cleaned)
alphanumeric = p2.remove_punc()

In [109]:
#sanity check
alphanumeric[:5]

[' how many ounces in a pound  ',
 ' how would you illustrate the water cycle  ',
 ' how would you use your knowledge of latitude and longitude to locate greenland  ',
 ' if you had eight inches of water in your basement and a hose  how would you use the hose to get the water out  ',
 ' what are some of the factors that cause rust  ']
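
The sanity check still contains common function words such as "how", "in", and "a". A word-level filter could look like the sketch below. The function name `remove_stopwords_wordwise` and the abbreviated stopword set are hypothetical and only for illustration; NLTK's English list is much longer, and this is not the cleaning that produced the outputs in this notebook.

```python
# Hypothetical, abbreviated stopword set; a real run would use NLTK's list.
TOY_STOPWORDS = {"how", "in", "a", "would", "you", "the", "of",
                 "what", "are", "some", "that"}


def remove_stopwords_wordwise(texts, stopwords):
    """Drop stopwords from each text by filtering its whitespace-separated words."""
    return [
        " ".join(word for word in text.split() if word not in stopwords)
        for text in texts
    ]


sample = [
    " how many ounces in a pound  ",
    " what are some of the factors that cause rust  ",
]
print(remove_stopwords_wordwise(sample, TOY_STOPWORDS))
```

Splitting on whitespace also discards the leading and trailing spaces left behind by the punctuation step.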

In [110]:
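#note: punctuation (commas included) was replaced by spaces above, so
#split(",") leaves each question as a single token; text.split() would
#break it into individual words instead
corpi_list = [text.split(",") for text in alphanumeric] 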
corpi_list

[[' how many ounces in a pound  '],
 [' how would you illustrate the water cycle  '],
 [' how would you use your knowledge of latitude and longitude to locate greenland  '],
 [' if you had eight inches of water in your basement and a hose  how would you use the hose to get the water out  '],
 [' what are some of the factors that cause rust  '],
 [' why do we call all these animals mammals  '],
 ['how would your life be different if you could breathe under water  '],
 [' construct a tower one foot tall using only four blocks  '],
 [' why do you think benjamin franklin is so famous  '],
 ['does the tilt change as the earth orbits the sun '],
 ['what direction does the shadow point directly at noon '],
 ['what direction in the sky would the observer look to see the noontime sun '],
 ['what direction does the sun set '],
 ['does the sun s path change when you change the date from march 20th to september 20th '],
 ['what direction in the sky would the observer look to see the noontime sun '
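
In the output above every document is a one-element list, so each whole question acts as a single "word". The sketch below contrasts that with whitespace tokenization, which is what word-level topic modeling usually needs. The helper name `tokenize` is an assumption, and the two sample strings are copied from the outputs above.

```python
def tokenize(texts):
    """Split each cleaned question into a list of word tokens on whitespace."""
    return [text.split() for text in texts]


sample = [" how many ounces in a pound  ", "what direction does the sun set "]

by_comma = [text.split(",") for text in sample]  # one token per question
by_space = tokenize(sample)                      # one token per word

print(len(by_comma[0]), len(by_space[0]))
print(by_space[0])
```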

# Train a Vanilla LDA Model 

In [111]:
#create a dictionary of words 
id2word = corpora.Dictionary(corpi_list) 

#create corpus 
texts = corpi_list

#term-document frequency: a bag of (token id, count) pairs per document
corpus = [id2word.doc2bow(text) for text in corpi_list]

print(corpus[:1][0][:30]) 

#sanity check 
[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]

[(0, 1)]


[[(' how many ounces in a pound  ', 1)]]
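
The `[(0, 1)]` above says that token id 0 occurs once in the first document; because each question is a single token here, the whole question is that token. As a rough illustration of the bag-of-words idea, here is a standard-library sketch on word tokens. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, and the ids they assign (order of first appearance) need not match the numbering gensim's `Dictionary` would produce.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Toy tokenized documents (illustrative only).
docs = [["water", "cycle"], ["water", "hose", "water"]]
token2id = build_vocab(docs)
print(token2id)
print(to_bow(docs[1], token2id))
```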

In [112]:
#build model 
lda_model = gensim.models.LdaModel(corpus=corpus, 
                                      id2word=id2word, 
                                      num_topics=10, 
                                      random_state=42, 
                                      chunksize=100, 
                                      alpha='auto', 
                                      per_word_topics=True)

#print keywords in each topic 
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.000*"which fertilization and development method is most typical of humans '
  'before birth occurs " + 0.000*"which statement is an inference " + '
  '0.000*"what is the approximate time interval between the two high tides " + '
  '0.000*"the plant seedlings and containers were identical  identify one '
  'additional factor that should be held constant in this experiment " + '
  '0.000*"the inference that earth s interior has an outer core and an inner '
  'core is based on studies of what " + 0.000*"describe one way that the '
  'student can determine the exact volume of one of the three blocks" + '
  '0.000*"the african savanna is a large grassland region with few trees that '
  'is hot and seasonally dry  a population of lions and a population of wild '
  'dogs living there are most likely to compete with each other for what " + '
  '0.000*"how many grams of the salt were dissolved in the solution at 24 c " '
  '+ 0.000*"identify the two organisms in this food web that bel

# Model Analysis 

## Dominant Topic & Percentage Contribution 

In [113]:
def format_topics_sentences(ldamodel, corpus, texts):
    # Collect one row per document: dominant topic, its share, its keywords
    rows = []

    # Get main topic in each document
    for row_list in ldamodel[corpus]:
        # With per_word_topics=True the model returns a tuple whose first
        # element is the document's topic distribution
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # Dominant topic = the (topic, probability) pair with the highest probability
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])

    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    sent_topics_df = pd.DataFrame(
        rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=texts)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(5)

| | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 4.0 | 0.1051 | which fertilization and development method is ... | [ how many ounces in a pound ] |
| 1 | 1 | 4.0 | 0.1051 | which fertilization and development method is ... | [ how would you illustrate the water cycle ] |
| 2 | 2 | 4.0 | 0.1051 | which fertilization and development method is ... | [ how would you use your knowledge of latitude... |
| 3 | 3 | 4.0 | 0.1051 | which fertilization and development method is ... | [ if you had eight inches of water in your bas... |
| 4 | 4 | 4.0 | 0.1051 | which fertilization and development method is ... | [ what are some of the factors that cause rust ] |
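
The core of the function above is picking, for each document, the topic with the highest probability. A minimal sketch of that step on its own, using a made-up topic distribution rather than real model output:

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


# Hypothetical topic distribution for one document (not model output).
doc_topics = [(0, 0.05), (4, 0.62), (7, 0.33)]
print(dominant_topic(doc_topics))
```

When several topics tie, `max` returns the first one it encounters.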


## The Most Representative Sentence for Each Topic

In [114]:
# Display setting to show more characters in column
pd.options.display.max_colwidth = 100

sent_topics_sorteddf_mallet = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib",
                                       "Keywords", "Representative Text"]

# Show
sent_topics_sorteddf_mallet.head(10)

| | Topic_Num | Topic_Perc_Contrib | Keywords | Representative Text |
|---|---|---|---|---|
| 0 | 4.0 | 0.1051 | which fertilization and development method is most typical of humans before birth occurs , ident... | [ how many ounces in a pound ] |
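
The groupby loop above keeps, for every dominant topic, the document with the highest contribution. The same idea in plain Python, as a sketch on made-up records; the function name and the `(topic, contribution, text)` layout are assumptions for illustration.

```python
def most_representative(records):
    """For each topic, keep the (contribution, text) with the highest contribution."""
    best = {}
    for topic, contrib, text in records:
        if topic not in best or contrib > best[topic][0]:
            best[topic] = (contrib, text)
    return best


# Hypothetical (topic, contribution, text) records, not model output.
records = [(4, 0.62, "doc a"), (4, 0.80, "doc b"), (7, 0.55, "doc c")]
print(most_representative(records))
```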


## pyLDAvis Visualization 

In [15]:
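import pyLDAvis
#the gensim helper is used below, not the sklearn one; recent pyLDAvis
#releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim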
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis



# Resources: 
* [Topic Modeling in Python: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0) 
* [Topic Modeling Visualization - How to present the results of LDA models?](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)