# Introduction

In the last notebook we found that most of the review structure we care about for clustering on nose, palate, and finish is concentrated in topics 2 and 5. There are two separate ways to get at that information in the reviews:

 1. Create a list of terms for topics 2 and 5 from the word-topic probabilities
 2. Run LDA inference and use the per-token topic probabilities to extract terms
 
In this notebook we will run both approaches and compare their outputs.

## Imports and Functions

In [29]:
import os
import pandas as pd
from lda_funcs import *
from gensim.corpora import Dictionary

## Data Load

### Restricted Dictionaries from LDA Topics

First we will go through the simple method: building custom gensim dictionaries from the LDA output.

In [5]:
term_topic_matrix = pd.read_pickle(os.getenv('DOMINO_WORKING_DIR') + '/data/processed/k7lemmas_pertopicprobs.pkl')
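The load above concatenates `DOMINO_WORKING_DIR` with a relative path. If that variable is unset, `os.getenv` returns `None` and the concatenation raises a `TypeError`. A minimal sketch of a more defensive path helper follows; the name `project_path` and the fallback to the current working directory are assumptions for illustration, not part of the original notebook.

```python
import os


def project_path(*parts, env_var="DOMINO_WORKING_DIR"):
    """Join path parts onto the project root (hypothetical helper).

    Falls back to the current working directory when the environment
    variable is unset or empty; that fallback is an assumption.
    """
    root = os.getenv(env_var) or os.getcwd()
    return os.path.join(root, *parts)


# Builds the same relative location used in the cell above.
pickle_path = project_path("data", "processed", "k7lemmas_pertopicprobs.pkl")
```

The resulting `pickle_path` could then be passed to `pd.read_pickle` in place of the concatenated string.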

In [15]:
term_topic_matrix.head()

| token | 0 | 1 | 2 | 3 | 4 | 5 | 6 | sum | highest_prob_topic |
|---|---|---|---|---|---|---|---|---|---|
| $ | 0.00455404 | 1.322508e-06 | 0.005524675 | 0.003675863 | 1.8e-05 | 6.960281e-05 | 0.00725685 | 0.0211 | 7 |
| 100 | 0.01059977 | 0.01049358 | 0.005802063 | 0.001320193 | 0.006939 | 0.002476366 | 0.01348264 | 0.051114 | 7 |
| 12 | 0.007131748 | 0.002287706 | 0.001863797 | 0.002552788 | 0.000242 | 0.001064249 | 0.001178023 | 0.016321 | 1 |
| 220 | 1.385378e-07 | 1.545671e-07 | 9.1132e-08 | 1.811082e-07 | 3e-06 | 1.595077e-05 | 0.0001229628 | 0.000143 | 7 |
| 35cl | 1.411301e-07 | 1.551792e-07 | 1.590261e-05 | 1.803219e-07 | 2.8e-05 | 2.635139e-07 | 2.67215e-07 | 4.5e-05 | 5 |


In [32]:
topic25_terms = term_topic_matrix[term_topic_matrix.highest_prob_topic.isin([2,5])].index.tolist()

In [26]:
# highest_prob_topic looks 1-based in the head() output, so topic 2 maps to column 1
topic2_terms_sorted = term_topic_matrix[term_topic_matrix.highest_prob_topic == 2].\
    reset_index().sort_values(by=1, ascending=False)['token'].tolist()

In [27]:
# highest_prob_topic looks 1-based in the head() output, so topic 5 maps to column 4
topic5_terms_sorted = term_topic_matrix[term_topic_matrix.highest_prob_topic == 5].\
    reset_index().sort_values(by=4, ascending=False)['token'].tolist()
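The two cells above sort by column 1 for topic 2 and by column 4 for topic 5. That matches the `head()` output, where `highest_prob_topic` appears to be the position of the largest probability plus one (for example, `35cl` peaks in column 4 and is labeled 5). The sketch below reproduces that relationship on a toy matrix. The toy tokens, the probabilities, and the helper name `terms_for_topic` are made up for illustration, and the "argmax plus one" rule is an assumption inferred from the displayed rows, not a confirmed description of how the pickle was built.

```python
import pandas as pd

# Toy term-topic matrix: rows are tokens, columns 0..2 are zero-based topic ids.
toy = pd.DataFrame(
    {
        0: [0.70, 0.10, 0.05, 0.20],
        1: [0.20, 0.60, 0.15, 0.75],
        2: [0.10, 0.30, 0.80, 0.05],
    },
    index=pd.Index(["oak", "peat", "honey", "smoke"], name="token"),
)

# Assumption: the label is the argmax column shifted to a 1-based topic number.
toy["highest_prob_topic"] = toy[[0, 1, 2]].idxmax(axis=1) + 1


def terms_for_topic(matrix, topic_label):
    """Return tokens assigned to a 1-based topic label, most probable first."""
    column = topic_label - 1  # convert the 1-based label to the column id
    subset = matrix[matrix["highest_prob_topic"] == topic_label]
    return subset.sort_values(by=column, ascending=False).index.tolist()


topic2_toy_terms = terms_for_topic(toy, 2)
```

Deriving the sort column from the label inside one helper avoids hard-coding `by=1` and `by=4` separately from the `== 2` and `== 5` filters.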

In [35]:
profile_term_dict = Dictionary([topic25_terms])

In [36]:
print(profile_term_dict)

Dictionary(2277 unique tokens: ['+', '002', '003', '004', '005']...)


In [37]:
topic2_dict = Dictionary([topic2_terms_sorted])

In [38]:
print(topic2_dict)

Dictionary(995 unique tokens: ['+', '003', '005', '006', '100~~']...)


In [41]:
topic2_dict_top500 = Dictionary([topic2_terms_sorted[0:500]])

In [42]:
print(topic2_dict_top500)

Dictionary(500 unique tokens: ['+', '105', '10yo', '110', '115']...)


### Save off the Dictionaries

In [44]:
profile_term_dict.save(os.getenv('DOMINO_WORKING_DIR') + '/models/tastingnotes_dictionary.gendict')
topic2_dict.save(os.getenv('DOMINO_WORKING_DIR') + '/models/topic2_dictionary.gendict')
topic2_dict_top500.save(os.getenv('DOMINO_WORKING_DIR') + '/models/topic2_top500_dictionary.gendict')

## Per-Token Inferred Topic from LDA 