# Language Models Using Paragraph Vectors

In this notebook we use Gensim's [word2vec](https://radimrehurek.com/gensim/models/word2vec.html), [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html) and [fasttext](https://radimrehurek.com/gensim/models/fasttext.html) models, which train neural networks to identify relationships between the words in a text. The existing pre-trained fasttext models were insufficient to accurately retrieve information for the corpus, so we must train our own models. We will evaluate these using a simple information retrieval task on a number of selected words.

<br>1) First we will run each language model on several different parameters and return the experimental results. 
<br>2) Next, we will use the best language model to begin putting together topics based on the direct synonyms of words. 

***Make the necessary imports:***

In [1]:
import os 
import time
import numpy as np
import pandas as pd
import visuals as vs
import glob
import modeling_tools as mt
import bear_necessities as bn
import multiprocessing
import lm_analysis_v2 as lma
import warnings 

from importlib import reload
lma = reload(lma)
vs = reload(vs)

warnings.filterwarnings('ignore')

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


Available Cores: 3


In [2]:
experimental_configurations = pd.read_csv('language_model_parameters.txt')

# Pull in the different model type configurations 
d2vexp = experimental_configurations[experimental_configurations['nickname'].str.contains('D2V')]
fstexp = experimental_configurations[experimental_configurations['nickname'].str.contains('FST')]
w2vexp = experimental_configurations[experimental_configurations['nickname'].str.contains('W2V')]

experimental_configurations['review length'] = 80

experimental_configurations

Unnamed: 0,none below,none above,epochs,vector size,window length,alpha,nickname,negative sample,review length
0,30,0.5,20000,300,5,0.05,'D2V1',,80
1,30,0.5,20000,300,7,0.05,'D2V2',,80
2,30,0.5,20000,300,9,0.05,'D2V3',,80
3,30,0.5,20000,400,5,0.05,'D2V4',,80
4,30,0.5,20000,400,7,0.05,'D2V5',,80
5,30,0.5,20000,400,9,0.05,'D2V6',,80
6,30,0.5,20000,400,5,0.05,'D2V7',10.0,80
7,30,0.5,20000,400,7,0.05,'D2V8',10.0,80
8,30,0.5,20000,400,9,0.05,'D2V9',10.0,80
9,30,0.5,1000,300,5,0.05,'FST1',,80


In [3]:
# import the data if need be
data = bn.decompress_pickle(os.getcwd() + '/data/review_stats.pbz2')

# import the ranges (don't need the indices but its how we can get the up to date ranges quickly)
range_indices = bn.loosen(os.getcwd() + '/data/by_rating_range.pickle')
ranges = list(np.sort(list(range_indices.keys())))

**Word2Vec Training**

The original word2vec model is a useful benchmark for evaluating information retrieval tasks. It will also help us choose among the other available models.

In [4]:
data_configurations = ['A1','E1']

for config in data_configurations:
    
    # load the cleaned text data
    text, stem_map, lemma_map, phrase_frequencies = bn.decompress_pickle(os.getcwd()+'/data/cleaned_data/cleaned_docs_'+config+'.pbz2')
    
    # define the directory where you want to save the models
    model_directory = os.getcwd()+'/models/D2V'+config

    # first we will create language models for every range 
    indices = data['Review_Length']
    
    # we will place a single filter on review length of 80 characters 
    docs = [text[idx] for idx in indices[indices>80].index][:1000]
        
    # for doc2vec 
    d2v_params = experimental_configurations.loc[experimental_configurations['nickname'].str.contains('D2V')]

    # for word2vec 
    w2v_params = experimental_configurations.loc[experimental_configurations['nickname'].str.contains('W2V')]
    
    # language models are not as sensitive to input as LDA so we do not vary these parameters 
    none_below = 30 
    none_above = 0.5 

    # run the doc2vec analysis for the full dataset 
    # (the analysis module is imported as `lma` in the first cell)
    lma.run_d2v_analysis(docs, 
                         d2v_params,
                         none_below,
                         none_above,
                         config+'_full',
                         model_directory)
        
    
    # now for each range
    for rng in ranges: 

        # get the rows in the data that you need 
        indices = data.loc[range_indices[rng],'Review_Length']

        # filter the training corpus by review length 
        docs = [text[idx] for idx in indices[indices>80].index][:1000]
        
        # run the doc2vec analysis for ranges in the dataset 
        # (the analysis module is imported as `lma` in the first cell)
        lma.run_d2v_analysis(docs, 
                             d2v_params,
                             none_below,
                             none_above,
                             config+'_'+rng,
                             model_directory)


IndexError: list index out of range

In [11]:
for idx in indices[indices>80].index:
    text[idx]


IndexError: list index out of range

In [14]:
indices

0           36
1           40
2           65
3           17
4          107
5          223
6          195
7          116
8          192
9           56
10         169
11         243
12          85
13          77
14         171
15         200
16         125
17          27
18          52
19          82
20          61
21          31
22          57
23          67
24          87
25          78
26          18
27          36
28          86
29         185
          ... 
4884449     29
4884450     44
4884451    119
4884452     30
4884453    149
4884454     44
4884455      9
4884456     97
4884457    197
4884458     41
4884459     20
4884460     58
4884461     59
4884462     52
4884463     58
4884464     11
4884465     37
4884466     24
4884467     92
4884468    160
4884469     11
4884470     97
4884471    108
4884472     60
4884473     14
4884474     12
4884475     11
4884476    153
4884477     69
4884478     37
Name: Review_Length, Length: 4884479, dtype: int64

In [12]:
idx

4863978
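
The failing row number above (4863978) comes from `data`, which has 4,884,479 rows, while the comprehension indexes into `text`. One plausible cause, which the notebook does not confirm, is that `text` holds fewer documents than `data` has rows. A minimal sketch of a guard for that case follows; `select_docs` and the toy inputs are hypothetical stand-ins, not part of `bn` or `lma`.

```python
def select_docs(text, review_lengths, min_length=80, limit=1000):
    """Return up to `limit` documents whose review length exceeds `min_length`.

    `review_lengths` maps a row number to a review length. Row numbers that
    have no entry in `text` are collected instead of raising IndexError.
    """
    docs, missing = [], []
    for idx, length in review_lengths.items():
        if length <= min_length:
            continue
        if 0 <= idx < len(text):
            if len(docs) < limit:
                docs.append(text[idx])
        else:
            missing.append(idx)
    return docs, missing


# Toy stand-ins: three cleaned documents, but four rows of length data.
text = [["good", "class"], ["hard", "exam"], ["fun", "lab"]]
review_lengths = {0: 36, 1: 107, 2: 223, 3: 195}

docs, missing = select_docs(text, review_lengths)
print(docs)     # documents for the rows that pass the length filter
print(missing)  # rows that pass the filter but have no entry in `text`
```

Inspecting `missing` (rather than silently dropping those rows) should make it easier to tell whether the cleaned text and the review statistics were built from the same set of reviews.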

**Doc2Vec Training**

We will use gensim's doc2vec framework to generate the word vectors for the vocabulary in our corpus. Doc2Vec can run on multiple cores, but to take full advantage of this we need to use file-based training, as described in [this notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb). For this we need to modify our input slightly.

In [None]:
lma.run_d2v_analysis()
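
For file-based training, gensim expects the corpus as a plain-text file with one document per line and tokens separated by whitespace (the `LineSentence` format), passed through the `corpus_file` argument. The sketch below covers only that input conversion; `write_line_sentence_file` is a hypothetical helper, not part of `lma` or gensim, and the toy documents are made up.

```python
import os
import tempfile


def write_line_sentence_file(docs, path):
    """Write tokenized documents to `path`, one document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in docs:
            # Tokens must not contain whitespace, or they would split on reload.
            f.write(" ".join(tokens) + "\n")
    return path


docs = [["group", "project", "partner"], ["exam", "hard"]]
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_line_sentence_file(docs, corpus_path)

# Read the file back to confirm the one-document-per-line layout.
with open(corpus_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting path could then be passed as `Doc2Vec(corpus_file=corpus_path, ...)`; in that mode gensim assigns document tags from the line numbers, so the line order in the file is what ties a vector back to a review.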

**FastText Training**

As an alternative to doc2vec we can use the FastText model. Instead of learning a single vector per whole word, FastText represents each word as a set of character n-grams, making it capable of handling words that aren't in the dictionary or are misspelled.

In [None]:
lma.run_ft_analysis()
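
To illustrate the character n-gram idea behind FastText, here is a minimal pure-Python sketch. It wraps the word in `<` and `>` boundary markers and uses n-gram lengths 3 to 6, which match gensim's default `min_n` and `max_n`. The helper `char_ngrams` is hypothetical, and the real model additionally keeps the whole word as its own unit and hashes the n-grams into buckets, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with < and > as boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))

# A misspelling still shares many n-grams with the correct spelling,
# which is why an unseen or misspelled word can still receive a useful vector.
shared = set(char_ngrams("partner")) & set(char_ngrams("partnr"))
print(sorted(shared))
```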

# Evaluate Models

Now that we've trained our language models, we will evaluate their ability to retrieve information. We want to see whether these models can identify synonyms and antonyms that make sense in the context of the given corpus. We use a standard set of evaluation words to test the models' performance and evaluate the output manually to select a model.

**First we select the rating range we want to work with, set the results file, and pull in the evaluation words:**

In [23]:
corpus_range = '[85, 95)' # remember, ranges have a space after the comma

# set the output file 
results_file = os.getcwd()+'/results/language_model_experimental_results.xlsx'

# open eval words for topics 
files = glob.glob(os.getcwd() + '/eval_words/stemmed_lemmatized/*.txt')

**Import the available language models and save the results:**

In [16]:
lma.library_eval(results_file, files, corpus_range)

## Creating Seeds

With the loaded model we can begin exploring word similarities. We can use this to evaluate model performance (i.e. do the synonyms make sense?). Your goal should be to identify key words that describe similar topics. In the next section we will save all of these words. If you've reviewed the results file and selected a model to use, go ahead and highlight the model and begin building the seeds. 

**Import the gensim model needed to load your model of choice:**

In [4]:
from gensim.models.doc2vec import Doc2Vec

moc = 'D2V_A1_D2V5'
model = Doc2Vec.load(os.getcwd()+'/models/'+moc+'/'+moc.replace('D2V_','D2V_'+corpus_range)+'.model')
materials = bn.decompress_pickle(os.getcwd()+'/models/'+moc+'/'+moc.replace('D2V_','materials_'+corpus_range)+'.pbz2')
stem_map, lemma_map, phrase_freq, dictionary = materials

*Load all of the existing words in the destination folder so you can check if a word has been used before:*

In [12]:
files = glob.glob(os.getcwd()+'/data/subject_words/narrow/*.txt')
topic_dict = {}
registered_topic_words = [] 
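for f in files: 
    # take the file name without its directory or extension (works on any OS)
    topic_name = os.path.splitext(os.path.basename(f))[0]
    topic_dict[topic_name]=[]
    # use a context manager so the file handle is closed after reading
    with open(f,'r') as fh:
        for w in fh.read().split(): 
            registered_topic_words.append(w)
            topic_dict[topic_name].append(w)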

***These words will have been stemmed and lemmatized before training the model. Enter a word below to find its specific spelling and all the words or phrases it appears in:***

In [27]:
# Look up specific words stems and phrases 
word = 'think'

print('Words: ',[d for d in [w for w in stem_map if word in w] if d in [dictionary[i] for i in dictionary]])
print('Phrases: ',[w.replace("b'",'').replace(' ','_').replace("'",'') for w in [str(k) for k in list(phrase_freq.keys())] if word in w]) 

Words:  ['think', 'rethink']
Phrases:  ['ned_rethink']


In [None]:
[t for t in topic_dict.keys() if '_primary' not in t]

In [8]:
# Reset the topic and primary lists
subject_words = []
primary = []

In [None]:
subject_name = ''
topic_dict[subject_name]

In [16]:
[w for w in subject_words if w not in primary]

['partner']

In [17]:
# Enter the words you want to list synonyms for. 
word_list = ['partner']

primary += word_list
primary = list(set(primary))

In [18]:
syns = []
sims = [] 
removals = []
for w in word_list: 
    if w in model.wv:
        syns.append(w)
        sims.append(1)
    
        for entry in model.wv.similar_by_word(w):    
            syns.append(entry[0])
            sims.append(entry[1])
    else:
        print('removing',w)
        removals.append(w)

if len(removals)>0:
    word_list = [w for w in word_list if w not in removals]

table = vs.split_tables(syns,word_list,sims=sims)
vs.display_side_by_side(table)  

# format the table so the index you create is capable of selecting from it 
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
# (assumes vs.split_tables returns a non-empty list of DataFrames)
tbl = pd.concat(list(table))
table = tbl 

print('In the following cell block, make a list of the indices above you would like to keep')
print('Full list is %i' % len(list(tbl['Word'])))
print('There are %i unique words' % len(list(set(tbl['Word']))))

Unnamed: 0,Word,Score
0,partner,1.0
1,group,0.465183
2,exact,0.426385
3,tabl,0.415747
4,mate,0.398736
5,seat,0.382509
6,exit,0.376244
7,flash,0.3699
8,comput,0.365927
9,card,0.36051


In the following cell block, make a list of the indices above you would like to keep
Full list is 11
There are 11 unique words
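
The scores in the table above come from `similar_by_word`, which ranks vocabulary words by cosine similarity to the query vector; the query word itself is entered with a score of 1, the similarity of any vector with itself. A minimal pure-Python sketch of that ranking follows. The 2-D vectors and the helper names are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (illustrative only).
vectors = {
    "partner": [1.0, 0.0],
    "mate": [0.9, 0.1],
    "group": [0.6, 0.8],
    "exit": [0.0, 1.0],
}

ranking = most_similar("partner", vectors)
for other, score in ranking:
    print(other, round(score, 3))
```

Because the score depends only on the angle between vectors, a low-scoring neighbour such as `exit` in the real table is better read as "least dissimilar of what is left" than as a true synonym, which is why the manual selection step that follows is needed.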


In [19]:
# Write the index numbers of the words you want to include. 
print('Make a list (separated by commas) with the numbers of the words you would like to include')
idx = input('These will be added to the existing list: ')
idx = [int(i) for i in idx.split(',')]
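# keep any words selected earlier; start a new list only if none exists yet
try: subject_words
except NameError: subject_words = [] 
for i in idx: subject_words += [table.loc[i].values[0]]
# de-duplicate and sort alphabetically
subject_words = sorted(set(subject_words))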

print(subject_words)
for s in subject_words: 
    if s in registered_topic_words: 
        for t in topic_dict: 
            if '_primary' not in t:
                if s in topic_dict[t]:
                    print(s,'is in',t)

Make a list (separated by commas) with the numbers of the words you would like to include
['group', 'group_project', 'partner']
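
The selection cell above converts the typed answer with `int(...)`, so an empty answer or a stray comma raises a `ValueError`, and a number that is not a row label in the table raises a `KeyError` at `table.loc[i]`. A minimal sketch of a more forgiving parser is shown below; `parse_indices` is a hypothetical helper and is not part of the notebook's modules.

```python
def parse_indices(raw, valid_indices):
    """Parse a comma-separated string of row numbers, keeping only valid ones."""
    valid = set(valid_indices)
    kept, rejected = [], []
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            continue  # tolerate trailing commas and empty input
        if piece.isdigit() and int(piece) in valid:
            if int(piece) not in kept:
                kept.append(int(piece))
        else:
            rejected.append(piece)
    return kept, rejected


# Example answer with spaces, an empty entry, an out-of-range row and a typo.
kept, rejected = parse_indices("0, 1,4,,12,x", range(10))
print(kept)      # row numbers that exist in the table
print(rejected)  # entries that were skipped
```

In the notebook, `valid_indices` could be the table's index (for example `table.index`), and printing `rejected` gives the user a chance to correct a typo before the words are saved.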


In [21]:
subject_name = 'group_work'

In [22]:
# set a name to save your current list of words
#subject_name = 'fun_or_funny'

print(subject_words)
with open(os.getcwd() + '/data/subject_words/narrow/' + subject_name + '.txt', 'w') as f:
    for w in subject_words: 
        f.write(w)
        f.write('\n')

primary = list(set(primary))
print(primary)
with open(os.getcwd() + '/data/subject_words/narrow/' + subject_name + '_primary.txt', 'w') as f:
    for w in primary: 
        f.write(w)
        f.write('\n')

['group', 'group_project', 'partner']
['group_project', 'partner', 'group']
