# Topic Modeling with a Latent Dirichlet Allocation (LDA) Model
In this project extension I apply a Latent Dirichlet Allocation (LDA) model to the data. LDA aims to uncover hidden structure in a collection of texts. It is comparable to clustering, which makes it an interesting extension for this project, but LDA groups words into topics rather than grouping whole texts into clusters.


> LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
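
To make the generative story in the quote concrete, here is a minimal sketch that generates one toy document: it draws a topic mixture for the document, then for each word position picks a topic and then a word from that topic. The vocabulary, the number of topics, the Dirichlet parameters, and the helper names (`sample_dirichlet`, `generate_document`) are all assumptions for illustration; this is not how gensim fits the model, only the process the model assumes produced the text.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(42)
vocab = ["sun", "orbit", "planet", "water", "rust", "dissolv"]  # toy vocabulary (assumed)
num_topics = 2                                                   # assumed for illustration
# Each topic is a probability distribution over the whole vocabulary.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(topic_word, vocab, [0.5] * num_topics, 8, rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's topic mixture.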

# Libraries and Data

In [1]:
#custom functions 
from projectfunctions import * 

In [15]:
import pandas as pd  
import numpy as np   
np.random.seed(42)

import pickle   

%matplotlib inline
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors
import seaborn as sns

import gensim 
from gensim.utils import simple_preprocess 
from gensim.parsing.preprocessing import STOPWORDS 
import gensim.corpora as corpora  

import nltk 
from nltk.stem import PorterStemmer
from nltk.stem.porter import * 

from pprint import pprint  

import os 

from wordcloud import WordCloud, STOPWORDS  #note: this rebinds STOPWORDS, shadowing the gensim import above

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Prepare Data For LDA Analysis 

In [6]:
#load in question data 
classroom_questions_csv = pd.read_csv(r'PDFfiles/classroom_questions.csv')
cq_list = classroom_questions_csv['question'].values.tolist()

In [7]:
cq_list

['“How many ounces in a pound?”',
 '“How would you illustrate the water cycle?”',
 '“How would you use your knowledge of latitude and longitude to locate Greenland?”',
 '“If you had eight inches of water in your basement and a hose, how would you use the hose to get the water out?”',
 '“What are some of the factors that cause rust?”',
 '“Why do we call all these animals mammals?”',
 'How would your life be different if you could breathe under water?”',
 '“Construct a tower one foot tall using only four blocks.”',
 '“Why do you think Benjamin Franklin is so famous?”',
 'Does the tilt change as the Earth orbits the Sun?',
 'What direction does the shadow point directly at noon?',
 'What direction in the sky would the observer look to see the noontime Sun?',
 'What direction does the Sun set?',
 'Does the Sun’s path change when you change the date from March 20th to September 20th?',
 'What direction in the sky would the observer look to see the noontime Sun?',
 'Describe what happens to 

In [8]:
def lower_words(text): 
    #return list lowered
    return [t.lower() for t in text]  

def remove_punc(text):  
    #returns a list without punctuation 
    import re 
    return [re.sub(r'[^a-zA-Z0-9]', ' ', t) for t in text]  

"""def lemmatize(text):  
    from nltk.stem.wordnet import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemma_list_of_words = []
    for i in range(0, len(text)):
         l1 = text[i]
         l2 = ' '.join([lemmatizer.lemmatize(word) for word in l1])
         lemma_list_of_words.append(l2)
    return lemma_list_of_words"""

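def remove_stopwords(text):   
    #returns a list with stopwords removed 
    #alias the gensim helper so it is not confused with this function's own name
    from gensim.parsing.preprocessing import remove_stopwords as gensim_remove_stopwords
    return [gensim_remove_stopwords(sentence) for sentence in text]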

def stemm(text): 
    ps = PorterStemmer()
    return [[ps.stem(token) for token in sentence.split(" ")] for sentence in text]

def preprocess(text): 
    lowered = lower_words(text) #lower all words 
    alphanumeric = remove_punc(lowered) #replace punctuation with spaces  
    stopped = remove_stopwords(alphanumeric) #remove stop words
    #lemmatization is skipped: lemmatize() above is commented out, so only stemming is applied
    stemmed = stemm(stopped) #stem words, giving one token list per sentence
    return [" ".join(x).split() for x in stemmed] #drop empty tokens left by extra spaces

In [9]:
#apply processing to document
corpi_list = preprocess(cq_list) 

In [10]:
corpi_list

[['ounc', 'pound'],
 ['illustr', 'water', 'cycl'],
 ['use', 'knowledg', 'latitud', 'longitud', 'locat', 'greenland'],
 ['inch', 'water', 'basement', 'hose', 'use', 'hose', 'water'],
 ['factor', 'caus', 'rust'],
 ['anim', 'mammal'],
 ['life', 'differ', 'breath', 'water'],
 ['construct', 'tower', 'foot', 'tall', 'block'],
 ['think', 'benjamin', 'franklin', 'famou'],
 ['tilt', 'chang', 'earth', 'orbit', 'sun'],
 ['direct', 'shadow', 'point', 'directli', 'noon'],
 ['direct', 'sky', 'observ', 'look', 'noontim', 'sun'],
 ['direct', 'sun', 'set'],
 ['sun',
  's',
  'path',
  'chang',
  'chang',
  'date',
  'march',
  '20th',
  'septemb',
  '20th'],
 ['direct', 'sky', 'observ', 'look', 'noontim', 'sun'],
 ['happen', 'altitud', 'sun', 'januari', 'decemb'],
 ['month', 'sun', 'lowest', 'sky', 'noon', 'highest', 'sky', 'noon'],
 ['23', '5', 'signific', 'number'],
 ['sun', 'directli', 'overhead', 'june', '21st', 'equat'],
 ['briefli', 'explain', 'differ', 'mass', 'weight'],
 ['identifi', '2', 'simi

In [11]:
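#filter out short tokens; the goal is to keep only words longer than 3 characters
#caution: calling remove() on a list while looping over it skips the token right after each removal,
#so some short tokens (e.g. 'use', 's') survive; the threshold is set to 5 to compensate
for sentence in corpi_list: 
    for word in sentence: 
        if len(word) < 5:
            sentence.remove(word)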

In [12]:
corpi_list

[['pound'],
 ['illustr', 'water'],
 ['knowledg', 'latitud', 'longitud', 'locat', 'greenland'],
 ['water', 'basement', 'use', 'water'],
 ['factor', 'rust'],
 ['mammal'],
 ['differ', 'breath', 'water'],
 ['construct', 'tower', 'tall', 'block'],
 ['think', 'benjamin', 'franklin', 'famou'],
 ['chang', 'earth', 'orbit'],
 ['direct', 'shadow', 'point', 'directli'],
 ['direct', 'observ', 'noontim'],
 ['direct', 'set'],
 ['s', 'chang', 'chang', 'march', 'septemb'],
 ['direct', 'observ', 'noontim'],
 ['happen', 'altitud', 'januari', 'decemb'],
 ['month', 'lowest', 'noon', 'highest', 'noon'],
 ['5', 'signific', 'number'],
 ['directli', 'overhead', '21st', 'equat'],
 ['briefli', 'explain', 'differ', 'weight'],
 ['identifi', 'similar', 'differ', 'inner', 'outer', 'planet'],
 ['planet', 'weigh', 'explain'],
 ['object',
  'solar',
  'greatest',
  'graviti',
  'happen',
  'gravit',
  'increas',
  'reduc'],
 ['planet', 'orbit'],
 ['eclips', 'occur', 'new', 'moon'],
 ['affect', 'phase'],
 ['locat', 'ti
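
The length filter above removes items from each list while iterating over it, which makes Python skip the element right after every removal. A minimal sketch of a filter that avoids this by building new lists is shown below. The helper name `filter_short_tokens` and the `min_len=4` default (that is, "longer than 3 characters") are assumptions for illustration, and switching to it would change the token lists and therefore the topics shown in the rest of this notebook.

```python
def filter_short_tokens(docs, min_len=4):
    """Return new token lists keeping only tokens with at least min_len characters."""
    return [[tok for tok in doc if len(tok) >= min_len] for doc in docs]

sample = [["inch", "water", "basement", "hose", "use", "hose", "water"],
          ["sun", "s", "path", "chang"]]

# The pattern to avoid: removing from a list while looping over it.
buggy = [list(doc) for doc in sample]  # work on copies so `sample` stays intact
for doc in buggy:
    for tok in doc:
        if len(tok) < 4:
            doc.remove(tok)  # shifts later items left, so the next one is skipped

print(buggy[1])                        # 's' survives because it was skipped
print(filter_short_tokens(sample)[1])  # every short token is dropped
```

Building a new list also avoids the second pitfall of `list.remove`, which deletes the first matching value rather than the item currently being visited.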

# Train a Vanilla LDA Model 

In [13]:
#create a dictionary of words 
id2word = corpora.Dictionary(corpi_list) 

#create corpus 
texts = corpi_list

#term-document frequency: each document becomes a bag of (token_id, count) pairs
corpus = [id2word.doc2bow(text) for text in corpi_list]

print(corpus[:1][0][:30]) 

#sanity check 
[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]

[(0, 1)]


[[('pound', 1)]]
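
The `(0, 1)` pair above reads as "token id 0 occurs once". As a rough illustration of what a bag-of-words conversion does, here is a plain-Python sketch. The helpers `build_vocab` and `to_bow` are hypothetical, and the order in which ids are assigned here is an assumption that may not match the ids gensim's `Dictionary` produces.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document to a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["pound"],
        ["illustr", "water"],
        ["water", "basement", "use", "water"]]
token2id = build_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(bows[0])  # one token, seen once
print(bows[2])  # 'water' appears twice, so its count is 2
```

Word order is discarded in this representation; only which tokens occur and how often is passed on to the model.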

In [14]:
#build model 
lda_model = gensim.models.LdaModel(corpus=corpus, 
                                      id2word=id2word, 
                                      num_topics=10,  
                                      random_state=42,  
                                      alpha='auto', 
                                      per_word_topics=True)

#print keywords in each topic 
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.036*"typic" + 0.035*"data" + 0.035*"hurrican" + 0.034*"contain" + '
  '0.023*"consid" + 0.022*"earth" + 0.020*"power" + 0.020*"determin" + '
  '0.020*"bacteria" + 0.019*"energi"'),
 (1,
  '0.066*"increas" + 0.045*"plant" + 0.036*"experi" + 0.035*"function" + '
  '0.034*"approxim" + 0.027*"popul" + 0.021*"temperatur" + 0.020*"identifi" + '
  '0.020*"explain" + 0.020*"extinct"'),
 (2,
  '0.075*"weight" + 0.057*"student" + 0.051*"organ" + 0.050*"cloud" + '
  '0.037*"follow" + 0.036*"certain" + 0.025*"spread" + 0.024*"form" + '
  '0.024*"reproduct" + 0.023*"end"'),
 (3,
  '0.035*"statement" + 0.030*"primari" + 0.030*"system" + 0.029*"base" + '
  '0.029*"summer" + 0.029*"commun" + 0.029*"conclud" + 0.029*"respons" + '
  '0.027*"daili" + 0.026*"fiber"'),
 (4,
  '0.037*"damag" + 0.037*"identifi" + 0.035*"fertil" + 0.032*"occur" + '
  '0.029*"day" + 0.026*"method" + 0.021*"result" + 0.021*"solut" + '
  '0.020*"appear" + 0.019*"air"'),
 (5,
  '0.049*"chang" + 0.044*"chemic" + 0.036*"p
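
Each topic printed above is a weighted list of stemmed terms, written as `weight*"term"` joined with `+`. If the strings need to be handled programmatically, a small parser like the sketch below can split them into pairs. The function name `parse_topic_string` is hypothetical; `lda_model.show_topic(topic_id)`, which is used later in this notebook, already returns such pairs directly, so this is only meant to clarify how to read the printed format.

```python
import re

def parse_topic_string(topic_str):
    """Split a string like '0.036*"typic" + 0.035*"data"' into (term, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(term, float(weight)) for weight, term in pairs]

# First four terms of topic 0 from the output above
topic_0 = '0.036*"typic" + 0.035*"data" + 0.035*"hurrican" + 0.034*"contain"'
parsed = parse_topic_string(topic_0)
for term, weight in parsed:
    print(f"{term:10s} {weight:.3f}")
```

The weights are the term's probability within that topic, so a higher weight means the term is more characteristic of the topic.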

# Model Analysis 

## Dominant Topic & Percentage Contribution 

In [16]:
def format_topics_sentences(ldamodel, corpus, texts):
    # Collect one row per document, then build the DataFrame once
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for row_list in ldamodel[corpus]:
        # With per_word_topics=True each result is a tuple whose first item is the topic distribution
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # Dominant topic = the (topic_num, prop_topic) pair with the highest proportion
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])

    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=texts)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(5)

| | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.5702 | increas, plant, experi, function, approxim, po... | [pound] |
| 1 | 1 | 9.0 | 0.3753 | water, direct, chang, factor, state, dissolv, ... | [illustr, water] |
| 2 | 2 | 6.0 | 0.6082 | earth, layer, provid, reason, surfac, explain,... | [knowledg, latitud, longitud, locat, greenland] |
| 3 | 3 | 9.0 | 0.8336 | water, direct, chang, factor, state, dissolv, ... | [water, basement, use, water] |
| 4 | 4 | 9.0 | 0.7195 | water, direct, chang, factor, state, dissolv, ... | [factor, rust] |
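
The dominant topic of a document is simply the topic with the largest share in that document's topic distribution. A minimal sketch of that selection step follows; the `dominant_topic` helper and the example distribution are hypothetical, with only the winning pair (topic 9 at 0.8336) taken from the table above.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# Hypothetical topic distribution for one document, as (topic_id, probability) pairs
doc_topics = [(1, 0.05), (6, 0.11), (9, 0.8336)]

topic_id, share = dominant_topic(doc_topics)
print(topic_id, share)
```

A low winning share, such as the 0.3753 in row 1 of the table, signals that the document is spread across several topics and the "dominant" label should be read with caution.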


## The Most Representative Sentence for Each Topic

In [None]:
# Display setting to show more characters in column
pd.options.display.max_colwidth = 100

sent_topics_sorteddf_mallet = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib",
                                       "Keywords", "Representative Text"]

# Show
sent_topics_sorteddf_mallet.head(5)
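
The grouping logic in the cell above keeps, for each topic, the single document in which that topic has its highest contribution. The same idea can be sketched without pandas as below. The helper name `most_representative` is hypothetical, and the rows reuse the five documents from the dominant-topic table, so the result covers only those five rows rather than the full dataset.

```python
def most_representative(rows):
    """rows: iterable of (topic_id, contribution, text). Keep the top row per topic."""
    best = {}
    for topic_id, contrib, text in rows:
        # Replace the stored row only when this document has a higher contribution.
        if topic_id not in best or contrib > best[topic_id][0]:
            best[topic_id] = (contrib, text)
    return dict(sorted(best.items()))

rows = [
    (1, 0.5702, ["pound"]),
    (9, 0.3753, ["illustr", "water"]),
    (6, 0.6082, ["knowledg", "latitud", "longitud", "locat", "greenland"]),
    (9, 0.8336, ["water", "basement", "use", "water"]),
    (9, 0.7195, ["factor", "rust"]),
]

result = most_representative(rows)
for topic_id, (contrib, text) in result.items():
    print(topic_id, contrib, text)
```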

## pyLDAvis Visualization

In [17]:
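#pyLDAvis 3.x renamed its gensim helper module to pyLDAvis.gensim_models; fall back to the old name on older versions
import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis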

# Resources: 
* [Topic Modeling in Python: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0) 
* [Topic Modeling Visualization - How to present the results of LDA models?](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/) 
* [Topic Modeling and Latent Dirichlet Allocation (LDA) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)