# LDA Model

>With the raw input text tokenized and cleaned, we can train the LDA model.
>
>The cells below train the model and save it to disk for later use.
>
>Some *seed params* are hardcoded to nudge the model toward the desired genres.

## Read Text Input

In [1]:
from gensim.corpora.dictionary import Dictionary
from lda_helpers import read_lda_input  # Package with helpers

texts = read_lda_input('lda_input/lda_input.jl')
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]
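
To illustrate what the bag-of-words conversion above produces, here is a minimal sketch in plain Python. It is not gensim's implementation: `build_vocab`, `to_bow`, and the toy documents are hypothetical names and data made up for this example, and gensim may assign token ids in a different order. The output shape is the same, though: each document becomes a list of `(token_id, count)` pairs sorted by id.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            # setdefault leaves known tokens untouched and gives new ones the next id
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in text if tok in token2id)
    return sorted(counts.items())


# Hypothetical toy documents, already tokenized
toy_texts = [["space", "alien", "space"], ["love", "alien"]]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, toy_vocab) for text in toy_texts]
print(toy_corpus)
```

Word order inside a document is discarded; only the counts per token survive, which is all LDA needs.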

## Construct Eta Prior Matrix

>Left to train unguided, the model tends to produce genres that are hard to interpret.
>
>We therefore supply seed words that push the model toward the desired output genres.
>
>The seed words are converted into a matrix and passed as the *eta* parameter.
>
>This matrix serves as the prior over each topic's word distribution, from which the LDA model is initialized.

In [2]:
import numpy as np
from lda_input.lda_seed import genre_seed_words, r, smooth  # Hardcoded params used to train LDA model

seed_structs = [dict.fromkeys(x[1], smooth/len(x[1])) for x in genre_seed_words]

k = len(genre_seed_words)  # Topic Count
n = len(id2word)           # Vocabulary size

# Convert to prior eta matrix, for LDA starting point
f = lambda i,j: seed_structs[i].get(id2word[j], (1-smooth)/(n-len(seed_structs[i])))
seed_matrix = np.fromfunction(np.vectorize(f, otypes=[np.float64]), (k,n), dtype=int)
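
To make the shape of this prior concrete, the following is a small sketch in plain Python that mirrors the formula above on made-up data. The vocabulary, seed words, and `toy_smooth` value are hypothetical, and the sketch assumes `genre_seed_words` holds `(label, words)` pairs and that every seed word occurs in the vocabulary. Each topic row gives its seed words a total mass of `smooth`, split evenly, and spreads the remaining `1 - smooth` evenly over all other words.

```python
# Hypothetical toy inputs, chosen only for illustration
toy_vocab = ["alien", "space", "love", "kiss", "gun", "chase"]
toy_seeds = [("scifi", ["alien", "space"]), ("romance", ["love"])]
toy_smooth = 0.5


def build_eta(vocab, seeds, smooth):
    """Build one prior row per topic: seed words share `smooth`, the rest share 1 - smooth."""
    n = len(vocab)
    eta = []
    for _label, words in seeds:
        seed_set = set(words)
        seed_mass = smooth / len(seed_set)               # weight of each seed word
        rest_mass = (1 - smooth) / (n - len(seed_set))   # weight of each non-seed word
        eta.append([seed_mass if tok in seed_set else rest_mass for tok in vocab])
    return eta


toy_eta = build_eta(toy_vocab, toy_seeds, toy_smooth)
for (label, _words), row in zip(toy_seeds, toy_eta):
    print(label, row)
```

Under these assumptions every row sums to 1. If a seed word were missing from the vocabulary, its share of `smooth` would be lost and the row would sum to less than 1, so it may be worth checking the seed lists against `id2word` before training.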

## Train LDA Model

In [3]:
from lda_helpers import get_lda_model
import warnings
warnings.filterwarnings('ignore')  # Suppress noisy runtime warnings that do not affect the results

lda_model = get_lda_model(corpus, id2word, k, r=r, eta=seed_matrix)

## Visualize LDA Genres

In [4]:
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
LDAvis_display = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, sort_topics=False)
LDAvis_display

## Save Model to Disk

>Use gensim's *save* functionality to store this model for later usage.

In [5]:
from os import makedirs
makedirs('lda_model', exist_ok=True)  # Create the output directory if it does not exist yet
lda_model.save('lda_model/lda_model')

