# Topic Modeling

<b>NOTE</b>: This project is taken from "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python" by Susan Li. See the blog post on [towardsdatascience.com](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) for a fuller explanation.

Read the training data, keep only the headline text, and add an explicit index column.

In [None]:
import pandas as pd

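# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0.
data = pd.read_csv('../data/extracted/abcnews-date-text.csv', on_bad_lines='skip')
# Copy the selection so that adding a column does not write to a view of `data`.
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text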

Peek at the training data.

In [None]:
print(len(documents))
print(documents[:5])
print(documents[-5:])

Load the gensim and NLTK libraries, and download the `wordnet` data through NLTK.

In [None]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)

import nltk
nltk.download('wordnet')

Define the functions that lemmatize and stem words and that preprocess a whole headline.

In [None]:
# Build the stemmer and lemmatizer once rather than on every call.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
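
The filtering step in `preprocess` can be illustrated without gensim or NLTK. The sketch below is a simplified stand-in, not the notebook's pipeline: `TOY_STOPWORDS` and `toy_preprocess` are hypothetical names, the stopword set is a tiny sample rather than gensim's `STOPWORDS`, the regex tokenizer only approximates `simple_preprocess`, and lemmatization and stemming are left out.

```python
import re

# Tiny illustrative stopword set (assumption); gensim's STOPWORDS is much larger.
TOY_STOPWORDS = {"with", "from", "that", "this", "have", "over"}

def toy_preprocess(text, min_len=4):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

sample_tokens = toy_preprocess("Rain helps dampen bushfires over the weekend")
print(sample_tokens)
```

Here `min_len=4` plays the same role as the `len(token) > 3` check in `preprocess`.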

Check the preprocessing output on a sample document.

In [None]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('Original document: ')
print(doc_sample.split(' '))
print('\n\n Tokenized and lemmatized document: ')
print(preprocess(doc_sample))

Preprocess every headline and peek at the first ten results.

In [None]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

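Build a dictionary that maps each unique token to an integer ID. This is the vocabulary for the bag-of-words representation.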

In [None]:
dictionary = gensim.corpora.Dictionary(processed_docs)

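Filter out tokens that appear in fewer than 15 documents or in more than 50% of all documents, then keep at most the 100,000 most frequent of the remaining tokens.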

In [None]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
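
To make the three thresholds concrete, here is a minimal pure-Python sketch of document-frequency filtering. `toy_filter_extremes` is a hypothetical helper written for illustration; it mirrors the meaning of `no_below`, `no_above`, and `keep_n` as described above, but it is not gensim's implementation and may differ in edge cases such as rounding.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above, keep_n):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: the number of documents that contain each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept[:keep_n])

toy_docs = [
    ["rain", "flood", "warn"],
    ["rain", "crop", "farmer"],
    ["rain", "flood", "rescu"],
    ["rain", "crop", "price"],
]
surviving = toy_filter_extremes(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(sorted(surviving))
```

In this toy corpus `rain` is dropped because it appears in every document, and the tokens that appear only once are dropped for being too rare.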

Convert each document to its bag-of-words representation: a list of `(token_id, count)` pairs.

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
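
The shape of a bag-of-words vector is easy to see on a toy vocabulary. The following is a hedged sketch: `toy_doc2bow` and `toy_token2id` are hypothetical names, and the function only imitates the behaviour of `doc2bow` described here (counting known tokens and ignoring tokens that are not in the dictionary).

```python
from collections import Counter

def toy_doc2bow(token2id, tokens):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_token2id = {"crop": 0, "flood": 1, "rain": 2}
toy_bow = toy_doc2bow(toy_token2id, ["rain", "flood", "rain", "drought"])
print(toy_bow)
```

The token `drought` is not in the toy vocabulary, so it contributes nothing to the vector. The same thing happens to tokens removed by `filter_extremes`.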

Inspect the bag-of-words vector of the sample document.

In [None]:
bow_doc_4310 = bow_corpus[4310]

for token_id, count in bow_doc_4310:
    print("Word {} (`{}`) appears {} time(s).".format(token_id,
                                                      dictionary[token_id],
                                                      count))

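Fit a TF-IDF model on the bag-of-words corpus, apply it to the whole corpus, and preview the weights of the first document.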

In [None]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break
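
For intuition about the numbers printed above, the sketch below computes TF-IDF weights by hand. It assumes the weighting scheme that, to my knowledge, `TfidfModel` uses by default: raw count times `log2(N / document_frequency)`, followed by L2 normalization, with zero weights dropped. `toy_tfidf` is a hypothetical helper, so treat it as an approximation rather than a reimplementation.

```python
import math

def toy_tfidf(bow_docs):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_docs)
    doc_freq = {}
    for doc in bow_docs:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_docs = []
    for doc in bow_docs:
        weights = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
        # Tokens that occur in every document get weight 0 and are dropped.
        weights = [(tid, w) for tid, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_docs.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return weighted_docs

toy_bows = [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
toy_weighted = toy_tfidf(toy_bows)
print(toy_weighted)
```

Token `0` appears in both toy documents, so its inverse document frequency is zero and it disappears from the output. Each remaining vector has unit length.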

Train an LDA model on the bag-of-words corpus (without TF-IDF weighting).

In [None]:
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10,
                                       id2word=dictionary,
                                       passes=2,
                                       workers=2)
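
gensim fits LDA with online variational Bayes. As a rough illustration of what an LDA fit produces, the sketch below uses a different and simpler inference method, collapsed Gibbs sampling, on a toy corpus of word IDs. `toy_gibbs_lda`, its hyperparameter values, and the toy documents are all assumptions made for illustration. It is not what `LdaMulticore` does internally.

```python
import random

def toy_gibbs_lda(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=2018):
    """Collapsed Gibbs sampling for LDA; docs are lists of word IDs.

    Returns one topic distribution (a list of probabilities) per document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total words per topic
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                # Resample in proportion to P(topic | document) * P(word | topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    # Smoothed per-document topic proportions.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]

toy_id_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
toy_theta = toy_gibbs_lda(toy_id_docs, num_topics=2, vocab_size=4)
for row in toy_theta:
    print([round(p, 3) for p in row])
```

Each row is a document's mixture over the two topics and sums to 1, which is the same kind of output that `lda_model[bow]` returns for a real document.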

Print the most heavily weighted words of each topic so the topics can be interpreted.

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))

Train a second LDA model on the TF-IDF corpus.

In [None]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                             num_topics=10,
                                             id2word=dictionary,
                                             passes=2,
                                             workers=4)

Print the most heavily weighted words of each topic in the TF-IDF model.

In [None]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))

Evaluate the LDA bag-of-words model by classifying the sample document and listing its topics from highest to lowest score.

In [None]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))
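
The ranking logic in the loop above can be tried on made-up numbers. In this sketch `rank_topics` is a hypothetical helper and the probabilities are invented. The `minimum_probability` cutoff imitates gensim's parameter of the same name (default 0.01), which is why the scores printed for a document need not sum to 1.

```python
def rank_topics(topic_probs, minimum_probability=0.01):
    """Drop low-probability topics and sort the rest, highest score first."""
    kept = [(k, p) for k, p in enumerate(topic_probs) if p >= minimum_probability]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

ranked = rank_topics([0.005, 0.62, 0.3, 0.075])
print(ranked)
```

Passing `reverse=True` is equivalent to negating the score in the sort key, as the notebook does with `-1*tup[1]`.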

Evaluate the LDA TF-IDF model by classifying the same sample document.

In [None]:
# This model was trained on TF-IDF weights, so weight the query document the same way.
for index, score in sorted(lda_model_tfidf[tfidf[bow_corpus[4310]]],
                           key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score,
                                             lda_model_tfidf.print_topic(index, 10)))

Test the LDA bag-of-words model on an unseen document.

In [None]:
unseen_document = "How a Pentagon deal became an identity crisis for Google"
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector],
                           key=lambda tup: -1*tup[1]):
    print("Score: {}\nTopic: {}\n".format(score, lda_model.print_topic(index, 5)))