## Topic Modeling and Visualization

Packages : <b> gensim, spaCy, NLTK </b>  <br>
Specify the number of topics you want generated and the number of passes, i.e. how many times the model iterates over the full set of documents (here, every sentence of a blog post is treated as one document). 
<br>

#### Steps :
1. Text preprocessing - functions: tokenize(), prepare_text_for_lda()
2. Corpus and dictionary creation from words appearing in the documents
3. Topic modeling
4. Visualization using <b>pyLDAvis</b>

In [None]:
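import itertools
import re

import nltk
import pandas as pd
from nltk.corpus import wordnet as wn
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from spacy.lang.en import English  # spaCy 2+ path; spacy.en only existed in spaCy 1.x

# Only needed if you turn stemming back on in prepare_text_for_lda()
stemmer = SnowballStemmer("english")
# Needs a one-time nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
# Blank English pipeline: tokenizer only, so no trained model has to be loaded
parser = English()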

### STEP 1 : Preprocessing

Total - 50 blogposts, 32251 words

In [6]:

# Tokenize one line of text into lowercase tokens, skipping whitespace-only tokens
def tokenize(text):
    return [token.lower_ for token in parser(text) if not token.orth_.isspace()]


# Preprocessing of text - drops tokens of 2 characters or fewer and stop words.
# You can stem/lemmatize if you like; I found the results unsatisfactory, so I skipped it

def prepare_text_for_lda(text):
    
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in en_stop]
    
    tokens = [w.replace('nbsp', '') for w in tokens]
    # tokens = [stemmer.stem(token) for token in tokens]
    return tokens

text_data = []

df = pd.read_csv("blog_content and titles.csv")
df.fillna('', inplace=True)

# Split every post on '.'; each resulting sentence is treated as one document
data = list(itertools.chain.from_iterable(post.split('.') for post in df["content"]))

# Remove @mentions, punctuation and special characters, then tokenize
for line in data:
    line = re.sub(r'@[a-zA-Z0-9]+', '', line)
    line = re.sub(r"[^A-Za-z0-9]", " ", line)
    tokens = prepare_text_for_lda(line)  # Python 3 strings are already Unicode
    text_data.append(tokens)

# Drop empty tokens (left behind by the 'nbsp' removal) and empty documents
text_data = [list(filter(None, x)) for x in text_data]
text_data = [x for x in text_data if x]
print(text_data)







Okay, so the above list of words is pretty exciting if you're implementing this on your personal blog - you can see every word you ever used, and often come across stuff you'd forgotten. 

I skipped lemmatization and stemming because the results were too poor (as they more often than not are) - <b><i> stemming "coffee" to "coffe"?</i></b> I'll do without! The intention and context were also getting lost. You could always include them if you want to - merging word forms shrinks the dictionary and makes your corpus denser. This being a personal project, the individual representation was more important than reducing the size of the dictionary.
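To make the cleaning in Step 1 concrete, here is a minimal sketch that runs the same regex substitutions and filters on one made-up post. It is not the notebook's exact pipeline: the sample text and the tiny stop word set are hypothetical, and plain whitespace splitting stands in for the spaCy tokenizer so the snippet runs without any extra packages.

```python
import re

# Hypothetical sample post; not taken from the actual blog data
post = "Met @friend42 for tea&nbsp;today. It's raining! Stayed home."

# Toy stop word set for illustration only (the notebook uses NLTK's English list)
toy_stop = {"for", "the", "and"}

def clean_line(line):
    line = re.sub(r'@[a-zA-Z0-9]+', '', line)   # drop @mentions
    line = re.sub(r"[^A-Za-z0-9]", " ", line)   # every other symbol becomes a space
    tokens = line.lower().split()                # whitespace split stands in for spaCy
    tokens = [t for t in tokens if len(t) > 2 and t not in toy_stop]
    tokens = [t.replace('nbsp', '') for t in tokens]
    return [t for t in tokens if t]              # drop tokens emptied by the replace

# One document per sentence, empty documents removed
documents = [clean_line(sentence) for sentence in post.split('.')]
documents = [d for d in documents if d]
print(documents)
```

Note that the apostrophe in "It's" turns into a space, so the word splits into "it" and "s", and both fall to the length filter. Splitting on '.' is equally blunt (it also cuts at decimals and abbreviations), which is fine for a personal project but worth knowing.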

### STEP 2 : Building a corpus and dictionary from the tokens

In [None]:
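import pickle
from gensim import corpora

# Map every unique token to an integer id
dictionary = corpora.Dictionary(text_data)
# Convert each document to a bag of words: a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in text_data]

# Save both so the model and visualization can be rebuilt without re-running Step 1
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')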
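If the bag-of-words step feels opaque, this pure-Python sketch mimics what the dictionary and doc2bow produce on two made-up documents. It is an illustration of the idea, not gensim's implementation: the names token2id, to_bow and toy_corpus are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter

# Two made-up tokenized documents, in the same shape as text_data
docs = [["tea", "home", "tea"], ["home", "college"]]

# Give each distinct token an integer id in order of first appearance.
# (Assumption: gensim's Dictionary may number tokens differently.)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(toy_corpus)
```

Word order inside a document is thrown away at this point; the model only ever sees which tokens occur and how often.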


### STEP 3 : Topic Modeling using gensim

<i>num_topics</i> : number of topics to be generated <br>
<i>passes</i> : number of full passes over the documents during training - too few and the topics may not have stabilized, so make sure this number is high enough <br>
<i>num_words</i> : number of words to be returned in relation to each topic

The topics are printed with probability of occurrence associated with every word in the topic. Here I've extracted the <b> top 10 words </b> and skipped printing the probabilities.

In [7]:

import gensim
num_topics = 12
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=350)
ldamodel.save('model.gensim')

# Number of top words to print per topic. The top 30 words are visualized later using pyLDAvis.
# formatted=False returns (word, probability) pairs instead of one formatted string per topic
topics = ldamodel.show_topics(num_topics=num_topics, num_words=10, formatted=False)

for i, (topic_id, word_probs) in enumerate(topics):
    print('Topic ' + str(i + 1) + ': ' + '\n' + ' '.join(word for word, prob in word_probs))

# Loading the model
# ldamodel = gensim.models.ldamodel.LdaModel.load('model.gensim')

Topic 1: 
know life people could tea one like every even never
Topic 2: 
back day time would home always people tea years ever
Topic 3: 
one also water steel time long like decided last amma
Topic 4: 
know year like new make team though really day much
Topic 5: 
like family always everyone away would small bought stuff counter
Topic 6: 
like would know shit well almost think look college much
Topic 7: 
never could really college think life one two good would
Topic 8: 
time room like long could class think though tiny behind
Topic 9: 
know right first home time could life one new never
Topic 10: 
first kids remember like would know book day back class
Topic 11: 
back get even going like time life post little award
Topic 12: 
like people room every said mean life little probably remember
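gensim's print_topics() returns each topic as a single formatted string of weight*"word" terms joined by plus signs. If you'd rather keep the probabilities than throw them away, a small parser along these lines would do. The sample string below is made up to match that format (the weights are not real output from this model), and parse_topic is a hypothetical helper name.

```python
import re

# Made-up example in the format print_topics() returns; weights are illustrative only
topic_string = '0.031*"know" + 0.027*"life" + 0.019*"people"'

def parse_topic(formatted):
    """Split a formatted topic string into (word, weight) pairs."""
    matches = re.findall(r'([0-9.]+)\*"([^"]+)"', formatted)
    return [(word, float(weight)) for weight, word in matches]

pairs = parse_topic(topic_string)
words = [word for word, _ in pairs]
print(pairs)
print(' '.join(words))
```

Matching on the quoted word instead of deleting the numbers means the parser does not care how many decimal places the weights are printed with.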


### STEP 4 : Topic visualization using pyLDAvis

This gives a representation of the top 30 words associated with each of the 12 generated topics. Hover over a topic bubble to display its most relevant terms.

Do they make a lot of sense? You decide. <br>
(Topic 12 is my favorite - <i>medical college quarters</i> is like my thing, you guys!)

In [2]:
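import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # newer pyLDAvis releases
except ImportError:
    import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases
pyLDAvis.enable_notebook()
# Reload the model saved in Step 3 (12 topics, despite the "10" in the variable names)
lda10 = gensim.models.ldamodel.LdaModel.load('model.gensim')
lda_display10 = gensimvis.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)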


In [8]:
pyLDAvis.save_html(lda_display10, 'lda.html')

Hope you guys have as much fun doing this as I did!