# Guided LDA Modeling

Often an LDA model returns less-than-optimal results. This happens because the natural occurrence of linguistic features may not be conducive to identifying the deeper meaning conveyed across an entire corpus. A guided LDA lets us predetermine elements (seed words) that may sit at the center of clusters. Guided LDA uses a process known as [Gibbs sampling](https://www.pnas.org/content/101/suppl_1/5228.abstract) to identify the observations that most closely relate to a predetermined set of criteria. Guided LDA was first introduced in [this paper](https://www.aclweb.org/anthology/E12-1021), which was used to produce the Python library used in this analysis.

In the previous notebook, where we examined language models, we identified potential topics to set as seeds. In this notebook, we will use guided LDA to try to improve on the results from our earlier LDA models. As before, we will use coherence to select the number of topics in each instance.

In [1]:
from gensim.corpora import Dictionary
import os
import time
import glob 
import numpy as np
import modeling_tools as mt
import bear_necessities as bn
import lda_analysis as ld
from importlib import reload
import warnings

warnings.filterwarnings('ignore')

ld = reload(ld)
mt = reload(mt)

# import the ranges (we will need the range indices at the end)
range_indices = bn.loosen(os.getcwd() + '/data/by_rating_range.pickle')
ranges = list(np.sort(list(range_indices.keys())))
ranges = [ranges[0]]

# available configurations 
data_configs = {'full clean':'A1',
                'with nots':'C1'}
print(data_configs) 

dconf = input('Type in the configuration you want to use:')
print(dconf)
print('ranges:',ranges)



Available Cores: 3
Available Cores: 3
{'full clean': 'A1', 'with nots': 'C1'}
A1
ranges: ['[0, 35)']


**Set model parameters for guided LDA:**

In [2]:
mconfig = {} 
mconfig['ntrange'] = [4] + list(range(8, 58, 7)) # topic counts to try with each corpus, one model per count: 4, 8, 15, ..., 57
mconfig['iters'] = 32 # number of passes through the corpus for each model 
mconfig['seed_confidence'] = 0.8
mconfig['nbelow'] = 30
mconfig['nabove'] = .5
mconfig['name'] = 'G-LDA1'
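
The `seed_confidence` parameter controls how strongly seeded words are pushed toward their seed topic before sampling begins. The sketch below is a simplified illustration of that idea, not the library's code: the helper `init_topic_assignments` and the toy data are hypothetical, and the real initialization may differ in detail.

```python
import numpy as np

def init_topic_assignments(doc_word_ids, seed_topics, n_topics, seed_confidence, rng):
    """Assign an initial topic to every token (hypothetical helper).

    A seeded word is placed in its seed topic with probability
    `seed_confidence`; every other token gets a uniformly random topic.
    """
    assignments = []
    for word_ids in doc_word_ids:
        doc_assignments = []
        for w in word_ids:
            if w in seed_topics and rng.random() < seed_confidence:
                doc_assignments.append(seed_topics[w])
            else:
                doc_assignments.append(int(rng.integers(n_topics)))
        assignments.append(doc_assignments)
    return assignments

rng = np.random.default_rng(7)
toy_docs = [[0, 1, 2, 0], [2, 3, 0, 1]]  # documents as lists of word ids
toy_seed_topics = {0: 1}                 # word id 0 is a seed for topic 1

# With full confidence, every occurrence of word 0 starts in topic 1.
z = init_topic_assignments(toy_docs, toy_seed_topics, n_topics=4,
                           seed_confidence=1.0, rng=rng)
```

With `seed_confidence=0.0` in this sketch the seeds have no effect on initialization; the value of 0.8 used in this notebook sits between those two extremes.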

**Retrieve seeds for guided LDA:**

In [4]:
# open seed words for topics 
files = glob.glob(os.getcwd() + '/data/subject_words/larger/*primary.txt')
# throw them all into one list
seed_topic_list = []
for file in files: 
    with open(file, 'r') as f: 
        seed_topic_list.append(list(filter(None,f.read().split('\n'))))
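
The cell above reads one word per line from each seed file and drops empty lines. As a minimal sketch of that format, the example below writes two hypothetical seed files to a temporary directory and loads them the same way. The file names and words are made up for illustration. Sorting the paths is an added assumption, not something the cell above does: `glob.glob` returns files in arbitrary order, and since each list's position later becomes its topic id, sorting keeps that mapping stable between runs.

```python
import glob
import os
import tempfile

def load_seed_topics(pattern):
    """Read each matching file into a list of non-empty, stripped lines."""
    seed_lists = []
    for path in sorted(glob.glob(pattern)):  # sorted for a reproducible order
        with open(path, 'r') as f:
            words = [line.strip() for line in f.read().split('\n')]
        seed_lists.append([w for w in words if w])
    return seed_lists

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical seed files: one seed word per line.
    with open(os.path.join(tmp, 'a_primary.txt'), 'w') as f:
        f.write('teach\nleson\n\n')
    with open(os.path.join(tmp, 'b_primary.txt'), 'w') as f:
        f.write('quiz\nread_bok\n')
    demo_seeds = load_seed_topics(os.path.join(tmp, '*primary.txt'))
```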

**Train guided LDA models:**

In [7]:
ld = reload(ld)

model_directory = ld.run_guidedlda(dconf, mconfig, [ranges[0]], seed_topic_list)

Available Cores: 7


KeyboardInterrupt: 

In [None]:
n_top_words = 10
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(list(id2word.values()))[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
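
The slice `[:-(n_top_words+1):-1]` in the cell above can be hard to read. The toy example below, with a made-up vocabulary and topic distribution, shows that it takes the last `n_top_words` entries of the ascending sort in reverse, which gives the most probable words first. It assumes the vocabulary array is in the same order as the columns of the topic-word matrix.

```python
import numpy as np

vocab = np.array(['alpha', 'beta', 'gamma', 'delta'])  # toy vocabulary
topic_dist = np.array([0.1, 0.4, 0.2, 0.3])            # toy word probabilities
n_top = 2

# argsort is ascending, so walk backwards from the end for n_top items.
top_words = vocab[np.argsort(topic_dist)][:-(n_top + 1):-1]

# An equivalent and arguably clearer form: reverse first, then truncate.
top_words_alt = vocab[np.argsort(topic_dist)[::-1][:n_top]]
```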

In [None]:
from collections import OrderedDict
import warnings
import guidedlda

warnings.filterwarnings('ignore')

# No multiprocessing for guided LDA
model_directory = {} 

total_time = 0 
# for each rating range (corpus) train a series of LDA models and test out coherence. 
for rng in ranges: 
    st = time.time()
    # instantiate the dictionaries 
    docs, stem_map, lemma_map, phrase_freq, dictionary, literal_dictionary, id2word, word2id = mt.setup_text_training_data(rng, 
                                                                                                                           dconf, 
                                                                                                                           mconfig['nbelow'], 
                                                                                                                           mconfig['nabove'])
    # assign topic numbers to seed words 
    seed_topics = {}
    # (the loop variable must not be named `st`, which holds the start time)
    for t_id, seed_words in enumerate(seed_topic_list):
        for word in seed_words:
            try:
                seed_topics[word2id[word]] = t_id
            except KeyError:
                print('Was not able to find %s in %s' % (str(word), str(rng)))

    # format the data as a document term matrix (required for guided LDA)
    dtm = mt.bow2dtm(docs, dictionary)
    dtm = dtm.astype(int) 
    iters = mconfig['iters'] 
    ntranges = mconfig['ntrange']
    seed_confidence=mconfig['seed_confidence']

    # train the guided lda models
#     trained_models = mt.train_guidedLDAs(dtm) 
#                                    mconfig['iters'], 
#                                    mconfig['ntrange'], 
#                                    seed_topics, 
#                                    seed_confidence=mconfig['seed_confidence'])

    # create an ordered dictionary to store the results from each model
    trained_models = OrderedDict()

    # we will train models for different numbers of topics and evaluate the coherence for each 
    for num_topics in ntranges:
        
        # train guided LDA model
        print("Training Guided LDA(k=%d)" % num_topics)
        model = guidedlda.GuidedLDA(n_topics=num_topics, 
                                    n_iter=iters, 
                                    random_state=7, 
                                    refresh=2)
        model.fit(dtm, 
                  seed_topics=seed_topics, 
                  seed_confidence=seed_confidence)
        
        # add it to the dictionary of trained models 
        trained_models[num_topics] = model
        
    # print how long it took to train
    print('Training all the models on the corpus for %s took %s' % (str(rng), str(time.time() - st)))
    total_time += (time.time() - st) # accumulate the total training time across corpora
    name = 'G-LDA_'+rng+'_'+dconf+'_'+mconfig['name'] # Each model will be named after its data configuration and corpus range
    models_dir = os.getcwd() + '/models/'+name # Set the model directory name (will be created if does not exist)    
    mt.save_models(trained_models, models_dir, name)

    # save the models to a dictionary 
    model_directory[rng] = {}
    model_directory[rng]['models'] = trained_models
    model_directory[rng]['dictionary'] = dictionary 
    model_directory[rng]['docs'] = docs
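
The helper `mt.bow2dtm` is not shown in this notebook. As a rough sketch of what such a conversion might look like, the hypothetical function below turns bag-of-words documents, assumed here to be lists of `(term_id, count)` pairs, into a dense integer document-term matrix with one row per document and one column per term. The real helper may take different inputs or return a sparse matrix.

```python
import numpy as np

def bow_to_dtm(bow_docs, n_terms):
    """Build a dense integer document-term matrix (hypothetical helper)."""
    dtm = np.zeros((len(bow_docs), n_terms), dtype=int)
    for row, bow in enumerate(bow_docs):
        for term_id, count in bow:
            dtm[row, term_id] += count
    return dtm

# Two toy documents over a three-term vocabulary.
toy_bow = [[(0, 2), (2, 1)], [(1, 3)]]
toy_dtm = bow_to_dtm(toy_bow, n_terms=3)
```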


In [27]:
seed_topics = {}
for t_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        try:
            seed_topics[word2id[word]] = t_id
        except KeyError:
            print('Was not able to find %s in %s' % (str(word), str(rng)))


**Rank models by coherence:**

In [None]:
ranked_models = ld.rank_by_coherence(model_directory, ranges)

Topic 0: ask_question bok learn properli idea want posibl prepar unfair far_behind
Topic 1: time say efect god sit_around apeal use alow click confer
Topic 2: part manipul world beter teach make lock blame downhil easi
Topic 3: shel reali atitud quiz hate read_bok favorit say hand task
Topic 4: teribl milion club told leson cla restrict easili especiali think
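
The internals of `ld.rank_by_coherence` are not shown here, so the measure it uses is unknown. To illustrate what a coherence score captures, the sketch below implements one common variant, UMass coherence, on a toy document-term matrix: it rewards topics whose top words tend to appear in the same documents. The function name and data are hypothetical, and the notebook's actual measure may differ.

```python
import numpy as np

def umass_coherence(top_word_ids, dtm):
    """UMass coherence for one topic.

    `top_word_ids` is ordered from most to least probable. Each top word
    must occur in at least one document.
    """
    present = dtm > 0  # boolean document-term presence
    score = 0.0
    for i in range(1, len(top_word_ids)):
        for j in range(i):
            w_i, w_j = top_word_ids[i], top_word_ids[j]
            co_docs = np.sum(present[:, w_i] & present[:, w_j])
            docs_j = np.sum(present[:, w_j])
            score += np.log((co_docs + 1) / docs_j)
    return float(score)

toy_counts = np.array([[1, 1, 0],
                       [1, 1, 0],
                       [0, 0, 1],
                       [1, 0, 1]])

coherent = umass_coherence([0, 1], toy_counts)    # words 0 and 1 often co-occur
incoherent = umass_coherence([0, 2], toy_counts)  # words 0 and 2 rarely co-occur
```

Scores closer to zero indicate a more coherent topic under this measure, so `coherent` is larger than `incoherent` in the toy example.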
