# LDA Topic Modeling

In this notebook we will use LDA models optimized for coherence to review the topics present in the rmt corpus.

We will train a [latent Dirichlet allocation (LDA) model](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) to discover topics in the data. LDA is an unsupervised natural language processing model that infers topics from a corpus. The number of topics must be specified in advance, so we train models over a range of topic counts for each corpus.
We will use `gensim`'s built-in functionality to train the models. Later on we will select between these models using topic coherence as our criterion.
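The train-then-select workflow described above can be sketched in a few lines. This is only an illustration: `select_num_topics`, `fake_train` and `fake_coherence` are hypothetical names, and the stand-in functions replace real training and scoring so that the example runs without any data.

```python
def select_num_topics(topic_counts, train_fn, coherence_fn):
    """Train one model per topic count and return the best count with all scores."""
    scores = {}
    for k in topic_counts:
        model = train_fn(k)              # e.g. fit an LDA model with k topics
        scores[k] = coherence_fn(model)  # e.g. compute its topic coherence
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Stand-ins invented for illustration; they do not train anything.
def fake_train(num_topics):
    return {"num_topics": num_topics}


def fake_coherence(model):
    # Peaks at 12 topics by construction.
    return -abs(model["num_topics"] - 12)


best_k, scores = select_num_topics(range(3, 31, 3), fake_train, fake_coherence)
print(best_k, scores)
```

In a real run, `train_fn` could wrap `gensim.models.LdaModel` and `coherence_fn` could wrap `gensim.models.CoherenceModel(...).get_coherence()`.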

A lot of the code in this section was pulled from the [topic_coherence_model_selection notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_coherence_model_selection.ipynb) in the gensim repo. 

In [4]:
import os
import numpy as np
import pandas as pd

from student_voices import sv_utils as bn

## Setting Experimental Parameters 

Just as we did in the cleaning steps, there are certain things we will want to vary to retrieve the optimal results. The version of cleaned data is one of them. Here are all the experimental parameters we will vary:

* Data configuration: of the configurations produced by our cleaning steps, four (A1, B1, D1 & E1) are suited for LDA analysis.
* Review length: very short reviews are less likely to convey valuable information. In the following cells we check the review length distribution to set the values for this parameter.
* Number of topics: LDA requires the number of topics to be set beforehand. We do not know how many topics can be found in each corpus, so we iterate through a range of values and select the best later.
* Passes: number of passes to make over the corpus (akin to epochs or iterations).
* N-below: exclude words that occur fewer than this number of times (meaning cannot be extracted from too few contexts).
* N-above: exclude words that occur in more than this percentage of reviews (meaning cannot be extracted from too many contexts).
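To make the N-below and N-above parameters concrete, here is a minimal sketch of vocabulary filtering on a toy corpus. It assumes both thresholds are applied to document frequency (the number of reviews containing a word), which is how `gensim`'s `Dictionary.filter_extremes(no_below, no_above)` behaves. The function name `filter_vocabulary` and the toy reviews are invented for illustration.

```python
from collections import Counter


def filter_vocabulary(docs, n_below=2, n_above=0.5):
    """Return the set of tokens kept after document-frequency filtering.

    docs    : list of tokenized reviews (each a list of str)
    n_below : drop tokens that appear in fewer than this many reviews
    n_above : drop tokens that appear in more than this fraction of reviews
    """
    # Count in how many reviews each token appears (not total occurrences).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = n_above * len(docs)
    return {tok for tok, df in doc_freq.items() if n_below <= df <= max_docs}


# Toy corpus, invented for illustration only.
toy_docs = [
    ["teacher", "boring", "lecture"],
    ["teacher", "helpful", "lecture"],
    ["teacher", "unfair", "grading"],
    ["teacher", "boring", "grading"],
]
kept = filter_vocabulary(toy_docs, n_below=2, n_above=0.75)
print(sorted(kept))
```

Here "teacher" is dropped because it appears in every review, while "helpful" and "unfair" are dropped because each appears in only one.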

**Selecting a Target**

Besides these parameters, there is one other item to decide: the target of our analysis. We have generated two general sets of indices: reviews in certain rating ranges, and reviews with certain characteristic labels. Either can be used in LDA (unsupervised) or supervised learning contexts.

Our goal is to identify the profiles of bad teaching and use good reviews to contrast and contextualize these. We will assume that longer reviews contain more information, thus we will vary the input to our LDA models based on review length. Below we examine the statistics around review length and decide exactly what we want to target. 

In [5]:
%%time 

root_dir = 'C:/Projects/VirtualMachines/Student_Voices/svvm/Student_Voices/student_voices/'

# import the data if need be
data = bn.decompress_pickle(root_dir+'data/review_stats.pbz2')

Wall time: 1min 1s


In [6]:
# import the labels indices 
label_dict = bn.decompress_pickle(root_dir+'/data/labeled_indices.pbz2') 
# import the range indices 
#range_indices = bn.loosen(root_dir + '/data/by_rating_range.pickle')   # original groupings 
range_indices = bn.loosen(root_dir + '/data/by_rating_range_2.pickle')  # second groupings ([0, 60), [0, 65))
# create a list of each range 
ranges = list(np.sort(list(range_indices.keys())))

**Examine "review-length" distribution:**

Display a graph and tables with summary stats for the distribution of `Review_Length` across the corpora.

In [12]:
from student_voices import visuals as vs 

tables = []
for rng in ranges: 
    # setup the summary stat table, format into thousands and append to the list of descriptive tables 
    t = pd.DataFrame(data.loc[range_indices[rng],'Review_Length'].describe().astype(int)).rename(columns={'Review_Length':rng})
    t[rng] = t[rng].apply(lambda x: "{:,}".format(x))
    tables.append(t)

for k in label_dict: 
    for v in label_dict[k]: 
        if v != 5: 
            t = pd.DataFrame(data.loc[label_dict[k][v],'Review_Length'].describe().astype(int)).rename(columns={'Review_Length':k+': '+str(int(v))})
            t[k+': '+str(int(v))] = t[k+': '+str(int(v))].apply(lambda x: "{:,}".format(x))
            tables.append(t)

vs.chart_review_lengths(tables, save='C:/Projects/VirtualMachines/Student_Voices/svvm/Student_Voices/graphs/review length percentile distributions.png')
vs.display_side_by_side(tables)

No handles with labels found to put in legend.


<Figure size 700x700 with 1 Axes>

| Corpus | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| [0, 60) | 669237 | 139 | 110 | 1 | 63 | 119 | 175 | 1977 |
| [0, 65) | 913741 | 132 | 105 | 1 | 59 | 112 | 166 | 1977 |
| Clarity: 4 | 780688 | 122 | 97 | 1 | 54 | 103 | 159 | 1749 |
| Clarity: 3 | 465739 | 124 | 98 | 1 | 56 | 105 | 160 | 893 |
| Clarity: 2 | 272943 | 136 | 106 | 2 | 63 | 117 | 172 | 1345 |
| Clarity: 1 | 348314 | 146 | 117 | 1 | 64 | 123 | 181 | 1977 |
| Easiness: 4 | 945497 | 120 | 95 | 1 | 53 | 102 | 159 | 841 |
| Easiness: 2 | 519437 | 131 | 102 | 1 | 60 | 113 | 164 | 1749 |
| Easiness: 3 | 1164881 | 122 | 98 | 1 | 54 | 103 | 159 | 1749 |
| Easiness: 1 | 306242 | 148 | 119 | 1 | 64 | 122 | 187 | 1670 |
| Exam Difficulty: 4 | 147272 | 220 | 149 | 4 | 96 | 183 | 317 | 1747 |
| Exam Difficulty: 1 | 108101 | 203 | 146 | 2 | 83 | 160 | 286 | 1310 |
| Exam Difficulty: 2 | 87353 | 219 | 147 | 4 | 95 | 181 | 313 | 1527 |
| Exam Difficulty: 3 | 213756 | 208 | 145 | 2 | 88 | 169 | 293 | 1749 |
| Helpfulness: 2 | 268457 | 134 | 104 | 2 | 61 | 115 | 169 | 1345 |
| Helpfulness: 1 | 369222 | 145 | 115 | 1 | 65 | 123 | 180 | 1977 |
| Helpfulness: 3 | 414405 | 120 | 95 | 1 | 54 | 102 | 158 | 893 |
| Helpfulness: 4 | 654338 | 117 | 93 | 2 | 52 | 99 | 157 | 1749 |
| Knowledge: 2 | 33487 | 251 | 150 | 5 | 127 | 209 | 369 | 1977 |
| Knowledge: 4 | 122363 | 206 | 137 | 1 | 102 | 170 | 277 | 1055 |
| Knowledge: 1 | 58304 | 243 | 150 | 2 | 121 | 200 | 355 | 748 |
| Knowledge: 3 | 61162 | 227 | 144 | 4 | 113 | 193 | 320 | 1403 |
| Textbook Use: 4 | 94555 | 213 | 146 | 4 | 92 | 175 | 304 | 1749 |
| Textbook Use: 1 | 209921 | 213 | 148 | 2 | 90 | 173 | 306 | 1977 |
| Textbook Use: 3 | 129412 | 205 | 144 | 3 | 86 | 165 | 289 | 981 |
| Textbook Use: 2 | 85916 | 214 | 147 | 2 | 92 | 176 | 305 | 1527 |
| Determination: 1 | 8685 | 245 | 157 | 21 | 109 | 204 | 370 | 527 |
| Effective: 1 | 8249 | 248 | 158 | 21 | 110 | 209 | 379 | 527 |
| Empathy: 1 | 8569 | 247 | 157 | 21 | 110 | 206 | 375 | 527 |
| Homework: 1 | 10396 | 237 | 155 | 21 | 103 | 193 | 353 | 527 |
| Integrity: 1 | 8260 | 248 | 157 | 21 | 110 | 209 | 378 | 527 |
| Parent Relation: 1 | 8254 | 249 | 158 | 21 | 111 | 210 | 381 | 527 |
| Respect: 1 | 8346 | 249 | 158 | 21 | 110 | 209 | 378 | 527 |


* Reviewing the distribution of review lengths above, there is a clear negative correlation between corpus size and review length. This is curious, since the corpora are determined by fairly arbitrary characteristics (rating & corpus size) and not statistical parameters.

* Most of the corpora with fewer (but longer) reviews are those with low ratings on teacher characteristics (for perspective, the clause before this parenthesis is 105 characters long). Interestingly, the review lengths of the lowest rated comments were right skewed, meaning most reviews are relatively short while a long tail of reviews is much longer.

* This finding suggests that high ratings are a sort of "default", while lower ratings tend to be more informative (note that we make no claim about whether these describe teachers more accurately, as there is no way to verify that).

**Setting review length:** We use the summary statistics above to set different minimum review lengths.

**# of Topics to Try:** We need a manageable number of topics. Since there are 13 characteristics highlighted by the website, 3 of which have been present since the beginning of data collection, we iterate between 3 and 30 topics.
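As a rough sketch of how these two choices combine into a set of experiments, the snippet below builds a grid of minimum review lengths and topic counts. The specific values are assumptions for illustration: the minima loosely echo the quartiles in the summary statistics, and the step of 3 between topic counts is arbitrary. The actual values used in this project are the ones returned by `lda_analysis.hardcoded_lda_parameters`.

```python
from itertools import product

# Hypothetical values, chosen only to illustrate the idea.
min_review_lengths = [0, 60, 120]       # minimum review length in characters
topic_counts = list(range(3, 31, 3))    # 3, 6, ..., 30 topics

experiment_grid = [
    {"min_length": min_len, "num_topics": k}
    for min_len, k in product(min_review_lengths, topic_counts)
]


def select_reviews(lengths, min_length):
    """Return the positions of reviews with at least `min_length` characters."""
    return [i for i, n in enumerate(lengths) if n >= min_length]


print(len(experiment_grid), "configurations")
print(select_reviews([12, 75, 130, 59], min_length=60))
```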

In [7]:
from student_voices import lda_analysis

lda_parameters = lda_analysis.hardcoded_lda_parameters(ranges, range_indices, 'E')

### Model Selection by Topic Coherence 

We will use topic coherence for model selection. Topic coherence measures summarize the "interpretability" of the topics resulting from a particular trained model. These measures are relative: they are used to compare models to one another rather than to evaluate the "absolute" coherence of topics.

A good explanation of how these measures work can be found [here](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/). In short, they look at the similarity of words within topics and the similarity between topics, then combine these factors into an aggregate measure. The intuition is that increasing consistency within topics and minimizing redundancy across topics yields more human-interpretable results.

There are several different coherence measures. We will be using the "c_v" measure, which was found to outperform other measures in the paper ["Exploring the Space of Topic Coherence Measures"](https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf). If time (and computing power) permits, we will also use the "c_w2v" measure, which scores topics using Word2Vec embeddings. Such embeddings have been [shown](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization) to implicitly factorize a word-context matrix, which may reduce the noise in the estimate relative to "c_v".
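To give a feel for what a coherence score rewards, here is a deliberately simplified sketch that averages normalized pointwise mutual information (NPMI) over the word pairs of a single topic. It is not the "c_v" measure, which additionally uses sliding windows and compares context vectors, and it only captures consistency within a topic, not redundancy across topics. The function name `npmi_coherence` and the toy reviews are invented for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, docs):
    """Average NPMI over all word pairs of one topic.

    Probabilities are estimated from document co-occurrence: p(w) is the
    share of documents containing w, p(w1, w2) the share containing both.
    """
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # the words never co-occur: minimum NPMI
        elif p12 == 1:
            scores.append(0.0)    # degenerate case (NPMI undefined), treated as 0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenized reviews, invented for illustration only.
docs = [
    ["boring", "lecture", "slides"],
    ["boring", "lecture", "monotone"],
    ["unfair", "grading", "exam"],
    ["unfair", "grading", "harsh"],
]
coherent = npmi_coherence(["boring", "lecture", "slides"], docs)
mixed = npmi_coherence(["boring", "grading", "exam"], docs)
print(round(coherent, 3), round(mixed, 3))
```

The topic whose words tend to appear in the same reviews scores higher than the one that mixes words from unrelated reviews, which is the behaviour we rely on when comparing models.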

**Using AWS to Train Models:**

First we create the scripts to run the analysis on AWS. Then we set the AWS specifications and launch the analysis.
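The scripts are converted to base64 before being handed to the fleet as user data. The snippet below sketches that conversion with the standard library only. It is not the `spot_connect` implementation: `to_user_data` and the two-line script are hypothetical and exist only to show the encoding round trip.

```python
import base64


def to_user_data(script: str) -> str:
    """Base64-encode a shell script so it can be passed as EC2 user data."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


# Hypothetical two-line start-up script, for illustration only.
example_script = "#!/bin/bash\necho EFS Mounted\n"
encoded = to_user_data(example_script)

# Decoding recovers the original script text.
decoded = base64.b64decode(encoded).decode("utf-8")
print(encoded)
print(decoded == example_script)
```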

In [28]:
from student_voices import ec2_scripts 
from spot_connect import bash_scripts

filesystem = 'student_data'  # File system to connect to 
region='us-east-2'           # Region

configs = ['D1']#,'A1']#,'B1','E1']  # When we look at the results summary, 'D1' cleaned corpus outperforms the rest consistently at all levels w.r.t. coherence. 
settings = ['LDA1']#,'LDA3']# 'LDA2','LDA4']

ntop = 'E' # In follow-ups we've added the "number of topics option, ntop" to the scripts so that we can execute custom topic number in each instance
cg = 'B' # corresponding corpus group (See "run_lda.py" for coding)

model_dir = '/home/ec2-user/efs/models/'
config_path = '/home/ec2-user/efs/data/cleaned_data/'

scripts = [] 
uploads = [] 
for config in configs: 
    for setting in settings:    
#         if (setting, config) in exclude: 
#             continue            
        print('Prepping ',config, setting)
        script = ec2_scripts.get_instance_setup_script(filesystem,region,run_as_user='ec2-user')
        log_file_name = 'log_'+str(setting)+'_'+str(config)+'.txt'        # Logfile that will be saved on the instance 
        script = ec2_scripts.get_lda_script(config,setting,ntop,cg, model_dir,config_path,log_file_name,region='us-east-2', cancel_fleet=False,run_as_user='ec2-user',script=script) 
        user_data_script = bash_scripts.script_to_userdata(script)        # Convert the working script to base-64 encoded so the fleet can run it 
        scripts.append(user_data_script)
    
n_jobs = len(scripts)

Prepping  D1 LDA1
...EFS file system already exists
Waiting for availability......Available


In [27]:
print(n_jobs)
print(script)

1
#!/bin/bash
mkdir /home/ec2-user/efs
sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-9d3d4fe5.efs.us-east-2.amazonaws.com:/   /home/ec2-user/efs 
cd /home/ec2-user/efs
sudo chmod go+rw .
echo EFS Mounted

cd efs
sudo runuser -l ec2-user -c 'sudo update-alternatives --set python /usr/bin/python3.6'
sudo runuser -l ec2-user -c 'pip install -e /home/ec2-user/efs/Student-Voices/'
sudo runuser -l ec2-user -c 'python -m nltk.downloader all'
cd /home/ec2-user/efs/models
sudo runuser -l ec2-user -c 'python /home/ec2-user/efs/Student-Voices/student_voices/modeling_tools.py'
sudo runuser -l ec2-user -c 'python /home/ec2-user/efs/Student-Voices/student_voices/run_lda.py -c D1 -cp /home/ec2-user/efs/data/cleaned_data/ -md /home/ec2-user/efs/models/ -s LDA1 -nt E -cg 2> log_LDA1_D1.txt'



In [26]:
instance_type = 'c5.4xlarge' # Instance type 
n_cores = 4                  # Number of physical cores in the instance type 

from spot_connect import instance_manager
aws_link = instance_manager.InstanceManager()

account_number_file = 'C:/Users/Computer/Documents/AWS/account_number.txt'
with open(account_number_file) as f:
    account_num = f.read().strip()  # close the file and drop any trailing newline

aws_link.run_distributed_jobs(account_num,
                              'student_data',                     # Instance prefix 
                              n_jobs,                             # Number of jobs 
                              instance_type,                      # Instance type to use
                              availability_zone='us-east-2c',
                              user_data=scripts,                  # List of scripts, 1 for each job 
                              instance_profile='instance_manager')

Default key-pair directory is "C:/Projects/VirtualMachines/Key_Pairs"
Key pair detected, re-using...
Security group detected, re-using...


<br><br>**Using AWS to Estimate Coherence:**

Again, create the scripts, then distribute the analysis on AWS instances. This time we don't need much computing power. 

In [2]:
from student_voices import ec2_scripts 
from spot_connect import bash_scripts

configs = ['D1']#,'A1','D1','B1']
settings = ['LDA1','LDA2','LDA3','LDA4']

model_dir = '/home/ec2-user/efs/models/'
config_path = '/home/ec2-user/efs/data/cleaned_data/'
results_path = '/home/ec2-user/efs/results/'
filesystem = 'student_data'  # File system to connect to 
region='us-east-2'           # Region

ntop = 'E' # In follow-ups we've added the "number of topics option, ntop" to the scripts so that we can execute custom topic number in each instance
cg = 2 # corresponding corpus group (See "run_lda.py" for coding)

scripts = [] 
uploads = [] 
for config in configs: 
    for setting in settings:            
#         if (setting, config) in exclude: 
#             continue
        print('Prepping ',config, setting)        
        script = ec2_scripts.get_instance_setup_script(filesystem,region,run_as_user='ec2-user')
        log_file_name = 'coh_'+str(setting)+'_'+str(config)+'.txt'
        script = ec2_scripts.get_coherence_script(config,setting,ntop, cg, model_dir,config_path,results_path,log_file_name,region='us-east-2', cancel_fleet=True,run_as_user='ec2-user',script=script)         
        # Convert the working script to base-64 encoded so the fleet can run it 
        user_data_script = bash_scripts.script_to_userdata(script)
        scripts.append(user_data_script)
        
n_jobs = len(scripts)

Prepping  E1 LDA1
...EFS file system already exists
Waiting for availability......Available
Prepping  E1 LDA2
...EFS file system already exists
Waiting for availability......Available
Prepping  E1 LDA3
...EFS file system already exists
Waiting for availability......Available
Prepping  E1 LDA4
...EFS file system already exists
Waiting for availability......Available


In [3]:
#print(script)

In [4]:
from spot_connect import instance_manager
aws_link = instance_manager.InstanceManager()

instance_type = 'c5.2xlarge' # Instance type 
n_cores = 2                  # Number of physical cores in the instance type 

account_number_file = 'C:/Users/Computer/Documents/AWS/account_number.txt'
with open(account_number_file) as f:
    account_num = f.read().strip()  # close the file and drop any trailing newline
aws_link.run_distributed_jobs(account_num,
                              'student_data',                     # Instance prefix 
                              n_jobs,                             # Number of jobs 
                              instance_type,                      # Instance type to use
                              availability_zone='us-east-2c',
                              user_data=scripts,                  # List of scripts, 1 for each job 
                              instance_profile='instance_manager')

Default key-pair directory is "C:/Projects/VirtualMachines/Key_Pairs"
Key pair detected, re-using...
Security group detected, re-using...
Key pair detected, re-using...
Security group detected, re-using...
Key pair detected, re-using...
Security group detected, re-using...
Key pair detected, re-using...
Security group detected, re-using...


**Download Models and Results**:

First we transfer the data from the instance to our S3 repository; then we can download it locally.

In [5]:
from spot_connect import instance_manager
aws_link = instance_manager.InstanceManager()

# Transfer data from <folder on instance> to <s3 bucket> using <instance profile access> to connect to <efs>
aws_link.instance_s3_transfer('/home/ec2-user/efs/models', 's3://student-reviews', 'instance_manager', efs='student_data')

Default key-pair directory is "C:/Projects/VirtualMachines/Key_Pairs"
Instance will be mounted on the student_data elastic filesystem

#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#
#~#~#~#~#~#~#~# Spotting downloader_9VN
#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#

Profile:
{'efs_mount': True, 'firewall_ingress': ('tcp', 22, 22, '0.0.0.0/0'), 'image_id': 'ami-0f3c887052a4defe9', 'image_name': 'Deep Learning AMI (Amazon Linux 2) Version 29.0', 'instance_type': 't3.small', 'min_price': '0.0072', 'price': '0.00828', 'region': 'us-east-2', 'scripts': [], 'username': 'ec2-user'}

Key pair KP-downloader_9VN created...
Security Group SG-downloader_9VN Created...Requesting spot instance
Launching.........Retrieving instance by id
Got instance: i-008cc1bc435e49d65[running].
Waiting for instance to boot...................................Online
Requesting EFS mount...
...EFS file system already exists
Waiting for availability......Available
Region us-east-2
F

In [6]:
# Transfer data from <folder on instance> to <s3 bucket> using <instance profile access> to connect to <efs>
aws_link.instance_s3_transfer('/home/ec2-user/efs/results', 's3://student-reviews', 'instance_manager', efs='student_data')

Instance will be mounted on the student_data elastic filesystem

#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#
#~#~#~#~#~#~#~# Spotting downloader_cFt
#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#~#

Profile:
{'efs_mount': True, 'firewall_ingress': ('tcp', 22, 22, '0.0.0.0/0'), 'image_id': 'ami-0f3c887052a4defe9', 'image_name': 'Deep Learning AMI (Amazon Linux 2) Version 29.0', 'instance_type': 't3.small', 'min_price': '0.0072', 'price': '0.00828', 'region': 'us-east-2', 'scripts': [], 'username': 'ec2-user'}

Key pair KP-downloader_cFt created...
Security Group SG-downloader_cFt Created...Requesting spot instance
Launching.........Retrieving instance by id
Got instance: i-01ce080fdc235fe4a[running].
Waiting for instance to boot.....................................................Online
Requesting EFS mount...
...EFS file system already exists
Waiting for availability......Available
Region us-east-2
FSID fs-9d3d4fe5
Connecting instance to link EFS...
E