# Building Topic Models

The core of this project is building multiple topic models from the same collection of documents, the *JBL*. Three primary variations are being tested:
* The types of words included in the corpus, whether they should be from all parts of speech or just nouns
* How many topics should be assigned to each model
* What value should be assigned to the alpha hyper-parameter

In other notebooks I make several comparisons to test these variations. First, I hold the general corpus and alpha = 'symmetric' constant and compare how the number of topics affects the model. On the basis of that comparison, I hold the number of topics and alpha = 'symmetric' constant and investigate how limiting the parts of speech in the corpus affects the model. Finally, I hold the general corpus and the number of topics constant and compare how the alpha hyper-parameter affects the model. In order to make all of those comparisons, I build the following models in this notebook:
1. General corpus model, 100 topics, alpha = 'symmetric'
2. General corpus model, 100 topics, alpha = 'asymmetric'
3. General corpus model, 100 topics, alpha = 'auto'
4. General corpus model, 250 topics, alpha = 'symmetric'
5. General corpus model, 500 topics, alpha = 'symmetric'
6. Noun corpus model, 100 topics, alpha = 'symmetric'
7. Noun corpus model, 250 topics, alpha = 'symmetric'
8. Noun corpus model, 500 topics, alpha = 'symmetric'
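
The list above can also be written down as data, which makes it easier to check that every planned combination actually gets built. This is only a minimal sketch: the function name `planned_models` and the tuple layout `(corpus_name, num_topics, alpha)` are hypothetical and do not appear in the cells below.

```python
# Hypothetical helper (not part of the original notebook): enumerate the
# eight planned models as (corpus_name, num_topics, alpha) tuples, in the
# same order as the numbered list.

def planned_models():
    """Return the eight model configurations built in this notebook."""
    specs = []
    # General corpus, 100 topics: vary the alpha hyper-parameter.
    for alpha in ('symmetric', 'asymmetric', 'auto'):
        specs.append(('general', 100, alpha))
    # General corpus, larger topic counts: alpha stays 'symmetric'.
    for num_topics in (250, 500):
        specs.append(('general', num_topics, 'symmetric'))
    # Noun corpus: vary the number of topics, alpha stays 'symmetric'.
    for num_topics in (100, 250, 500):
        specs.append(('noun', num_topics, 'symmetric'))
    return specs


for corpus_name, num_topics, alpha in planned_models():
    print(corpus_name, num_topics, alpha)
```

Keeping the plan in one place like this is a design convenience only; the cells below still build each model explicitly.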

## Imports and Set-up

In [1]:
from gensim import corpora, models
import logging

In [None]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# delete when reproduced in cells below
def build_models(num, corpus, dictionary, base_name):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num, passes=25)
    lda.save(path + base_name + str(num) + '.model')

## Build models from the General Corpus

### Load Corpus and Dictionary for General Corpus

In [None]:
path = '../general_corpus/'
dictionary = corpora.Dictionary.load(path + 'general_corpus.dict')
corpus = corpora.MmCorpus(path + 'general_corpus.mm')

### 100 Topics, alpha = 'symmetric'

In [None]:
lda_100 = models.LdaModel(corpus, id2word=dictionary, num_topics=100, passes=25)
lda_100.save(path + 'alpha_symmetric/general_100.model')

### 100 Topics, alpha = 'asymmetric'

In [None]:
lda_100_asymmetric = models.LdaModel(corpus, id2word=dictionary, num_topics=100, passes=25, alpha='asymmetric')
lda_100_asymmetric.save(path + 'alpha_asymmetric/general_100_asymmetric.model')

### 100 Topics, alpha = 'auto'

In [None]:
lda_100_auto = models.LdaModel(corpus, id2word=dictionary, num_topics=100, passes=25, alpha='auto')
lda_100_auto.save(path + 'alpha_auto/general_100_auto.model')
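
To make the difference between the alpha settings concrete, here is a minimal sketch of the two fixed priors as I understand them from the gensim documentation: 'symmetric' uses 1.0 / num_topics for every topic, and 'asymmetric' uses 1.0 / (topic_index + sqrt(num_topics)), normalized to sum to 1. The function names are hypothetical, and this does not call gensim itself. The 'auto' setting is not shown because it learns an asymmetric prior from the corpus during training rather than following a fixed formula.

```python
import math

# Hypothetical helpers (not gensim functions): reproduce the fixed alpha
# priors described in the gensim documentation, as an assumption.

def symmetric_alpha(num_topics):
    """Every topic gets the same prior weight, 1.0 / num_topics."""
    return [1.0 / num_topics] * num_topics


def asymmetric_alpha(num_topics):
    """Earlier topics get more prior weight; values are normalized to sum to 1."""
    raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]


sym = symmetric_alpha(100)
asym = asymmetric_alpha(100)
print('symmetric  first/last:', sym[0], sym[-1])
print('asymmetric first/last:', asym[0], asym[-1])
```

With the asymmetric prior the first topics start with more weight than the last ones, whereas the symmetric prior treats all 100 topics alike.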

### 250 Topics, alpha = 'symmetric'

In [None]:
lda_250 = models.LdaModel(corpus, id2word=dictionary, num_topics=250, passes=25)
lda_250.save(path + 'alpha_symmetric/general_250.model')

### 500 Topics, alpha = 'symmetric'

In [None]:
lda_500 = models.LdaModel(corpus, id2word=dictionary, num_topics=500, passes=25)
lda_500.save(path + 'alpha_symmetric/general_500.model')

## Build models from the Noun Corpus with 100, 250, and 500 topics, alpha = 'symmetric'

### Load Corpus and Dictionary for Noun Corpus

In [3]:
path = '../noun_corpus/'
dictionary = corpora.Dictionary.load(path + 'noun_corpus.dict')
corpus = corpora.MmCorpus(path + 'noun_corpus.mm')

### 100 Topics

In [None]:
lda_100 = models.LdaModel(corpus, id2word=dictionary, num_topics=100, passes=25)
lda_100.save(path + 'noun_100.model')

### 250 Topics

In [None]:
lda_250 = models.LdaModel(corpus, id2word=dictionary, num_topics=250, passes=25)
lda_250.save(path + 'noun_250.model')

### 500 Topics

In [None]:
lda_500 = models.LdaModel(corpus, id2word=dictionary, num_topics=500, passes=25)
lda_500.save(path + 'noun_500.model')