# Generate Topic Models
Generates topic models of forum posts with LDA (Latent Dirichlet Allocation).

## Data Sources
- corpus (created with 3-Lemmatize_Text.ipynb)
- dictionary (created with 3-Lemmatize_Text.ipynb)

## Changes
- 2020-09-16: Created

## TODO
- Tutorial: https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21

## Imports

In [1]:
from gensim import corpora, models
import pickle
from pathlib import Path
# Newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

## Functions

In [2]:
# none yet

## File Locations

In [3]:
p = Path.cwd()
path_parent = p.parents[0]
path_corpus_pkl = path_parent / "clean_data" / "corpus.pkl"
path_dictionary_gensim = path_parent / "clean_data" / "dictionary.gensim"
path_model = path_parent / "clean_data"
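
Both inputs are produced by 3-Lemmatize_Text.ipynb, so it can help to confirm they exist before loading them. The following is a minimal sketch, not part of the original notebook; the helper name `missing_inputs` is hypothetical, and the paths assume the same `clean_data` layout as above.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Assumed layout: a clean_data folder next to the notebook's working directory.
clean_data = Path.cwd().parent / "clean_data"
required = [clean_data / "corpus.pkl", clean_data / "dictionary.gensim"]

missing = missing_inputs(required)
if missing:
    print("Missing inputs:", [str(p) for p in missing])
else:
    print("All inputs found")
```

If anything is reported as missing, rerun 3-Lemmatize_Text.ipynb first.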

## Load Data

In [4]:
# Use a context manager so the file handle is closed after loading
with open(path_corpus_pkl, 'rb') as f:
    corpus = pickle.load(f)

In [5]:
dictionary = corpora.Dictionary.load(str(path_dictionary_gensim))

## Perform LDA
Train one LDA model for each topic count from 3 to 10, save each model, and print the top words per topic:

In [6]:
lda_models = {}
NUM_WORDS = 7  # number of top words to print per topic
for i in range(3, 11):  # topic counts 3 through 10
    name = f"model_LDA_{i}"
    # Train an LDA model with i topics
    ldamodel = models.ldamodel.LdaModel(corpus, num_topics=i, id2word=dictionary, passes=15)
    lda_models[name] = ldamodel
    # Save the model next to the cleaned data so it can be reloaded later
    ldamodel.save(str(path_model / f"{name}.gensim"))
    print("LDA with {} topics".format(i))
    for topic in ldamodel.print_topics(num_words=NUM_WORDS):
        print(topic)
    print("\n")

LDA with 3 topics
(0, '0.016*"like" + 0.015*"op" + 0.014*"know" + 0.012*"dont" + 0.010*"think" + 0.010*"im" + 0.010*"say"')
(1, '0.013*"ds" + 0.013*"time" + 0.012*"get" + 0.011*"kid" + 0.010*"go" + 0.010*"take" + 0.008*"adhd"')
(2, '0.047*"school" + 0.019*"kid" + 0.012*"need" + 0.011*"get" + 0.010*"dc" + 0.008*"college" + 0.008*"sn"')


LDA with 4 topics
(0, '0.021*"time" + 0.020*"get" + 0.015*"take" + 0.011*"go" + 0.010*"year" + 0.010*"need" + 0.009*"would"')
(1, '0.016*"like" + 0.014*"kid" + 0.013*"adhd" + 0.013*"ds" + 0.011*"help" + 0.010*"med" + 0.009*"really"')
(2, '0.054*"school" + 0.025*"kid" + 0.016*"need" + 0.012*"sn" + 0.009*"teacher" + 0.009*"dc" + 0.008*"get"')
(3, '0.016*"sorry" + 0.014*"im" + 0.013*"op" + 0.011*"say" + 0.010*"post" + 0.010*"doe" + 0.009*"know"')


LDA with 5 topics
(0, '0.019*"get" + 0.014*"year" + 0.013*"take" + 0.012*"yes" + 0.012*"go" + 0.012*"time" + 0.011*"years"')
(1, '0.057*"school" + 0.021*"private" + 0.016*"public" + 0.015*"nyc" + 0.014*"thank" +
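
Each printed topic is a `(topic_id, formatted_string)` tuple, which is awkward to compare across models. The sketch below, which is not part of the original notebook, splits such a string into `(word, weight)` pairs using only the standard library. The helper name `parse_topic` is hypothetical, and the example input is copied from the 3-topic output above.

```python
import re

# Matches one `weight*"word"` term in a formatted topic string.
TERM_PATTERN = re.compile(r'(\d*\.?\d+)\*"([^"]+)"')


def parse_topic(topic):
    """Turn (topic_id, 'w1*"a" + w2*"b"') into (topic_id, [(a, w1), (b, w2)])."""
    topic_id, formatted = topic
    terms = [(word, float(weight)) for weight, word in TERM_PATTERN.findall(formatted)]
    return topic_id, terms


example = (2, '0.047*"school" + 0.019*"kid" + 0.012*"need"')
topic_id, terms = parse_topic(example)
print(topic_id, terms)
```

If the trained model is still in memory, gensim's `show_topics(formatted=False)` returns word and weight pairs directly, so parsing is mainly useful when only the printed output is available.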

## Visualize the Topics

In [11]:
pyLDAvis.display(pyLDAvis.gensim.prepare(lda_models["model_LDA_3"], corpus, dictionary, sort_topics=False))

In [9]:
pyLDAvis.display(pyLDAvis.gensim.prepare(lda_models["model_LDA_6"], corpus, dictionary, sort_topics=False))

In [10]:
pyLDAvis.display(pyLDAvis.gensim.prepare(lda_models["model_LDA_10"], corpus, dictionary, sort_topics=False))