# Data Cleaning

To build language models on the text corpus we've retrieved from RateMyTeacher.com, we first need to clean the data. How we clean it (removing conjunctions, pluralisations, etc.) can have a considerable effect on the results, so we clean the data in several different ways and compare how the results change.

In [1]:
from importlib import reload
import os, sys, path

**Begin by collecting the scraped data and formatting it into datasets**:

The `clean_data.gen_data()` function creates two files:
- ***review_stats.pbz2*** : a file with all the summary information including scores and teacher info. 
- ***full_review_text.pbz2*** : a file with all the review texts.
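
For readers unfamiliar with the `.pbz2` extension, here is a minimal sketch of how such files are commonly written and read. It assumes the extension denotes a bz2-compressed pickle; `save_pbz2` and `load_pbz2` are hypothetical helpers for illustration and are not the `student_voices` implementation.

```python
import bz2
import os
import pickle
import tempfile


def save_pbz2(path, obj):
    """Write obj to path as a bz2-compressed pickle (hypothetical helper)."""
    with bz2.BZ2File(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_pbz2(path):
    """Read back an object written by save_pbz2 (hypothetical helper)."""
    with bz2.BZ2File(path, "rb") as handle:
        return pickle.load(handle)


# Round trip on toy data written to a temporary folder
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_reviews.pbz2")
demo_reviews = ["Great teacher", "Too much homework"]

save_pbz2(demo_path, demo_reviews)
restored = load_pbz2(demo_path)
```

Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.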

<font color=darkred size=1>**No need to run this again if you already have these two files**</font>

In [2]:
# from student_voices.clean_data import gen_data
# gen_data('D:/Student_Voices_Database/')

**Bin reviews**:

The original study binned the ratings into several ranges from lowest to highest. Use the `create_hardcoded_ratings_bins` function to create the same bins.

- This will create the file ***by_rating_range.pbz2***, which is a dictionary of the form `{bin1: [indices], bin2: [indices], ...}`

<font color=darkred size=1>**No need to run this again if you already have this file**</font>

In [3]:
# from student_voices.clean_data import create_hardcoded_ratings_bins
# from student_voices.sv_utils import decompress_pickle

# review_data = decompress_pickle('D:/Student_Voices/review_stats.pbz2')
# create_hardcoded_ratings_bins(review_data)

## Cleaning Data for NLP

### Use AWS to Clean the Data

Cleaning this amount of text data can be memory intensive. **This notebook** uses the [spot-connect](https://pypi.org/project/spot-connect/) module to launch virtual machines on AWS to clean the data.

**Cleaning the data** means preparing the text data for review by an NLP model. This involves: 
- Tokenizing 
- Lemmatizing/Stemming 
- Removing Stop Words 
- Removing Numeric Characters 
- Removing Contractions 
- and more... 
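
The steps listed above can be sketched on a single review as follows. This is a toy illustration using only the standard library: the stop-word list, the contraction table and the suffix-stripping `naive_stem` function are all assumptions made for the example, not the presets returned by `clean_data.data_configuration_hardcodes()`.

```python
import re

# Tiny illustrative lists; a real pipeline would use much larger ones
STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of"}
CONTRACTIONS = {"don't": "do not", "can't": "can not", "she's": "she is"}


def naive_stem(token):
    """Strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_review(text):
    """Lowercase, expand contractions, drop digits and stop words, then stem."""
    text = text.lower()
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    text = re.sub(r"\d+", " ", text)        # remove numeric characters
    tokens = re.findall(r"[a-z]+", text)    # tokenize on runs of letters
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


example = clean_review("She's often yelling and they don't explain the 3 assignments")
# example == ['she', 'often', 'yell', 'they', 'do', 'not', 'explain', 'assignment']
```

Changing any one of these choices (for example, skipping the stemmer) changes the vocabulary the model sees, which is why several configurations are compared.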

A list of preset cleaning parameters can be retrieved with the `clean_data.data_configuration_hardcodes()` command: 

In [4]:
from student_voices import clean_data
data_configurations = clean_data.data_configuration_hardcodes()

**Set the region and instance AMIs**:

In [None]:
from spot_connect import sutils 

# Change the region for the default profiles
sutils.reset_profiles()

**Specify the AWS parameters**:

In [5]:
# Instance type 
instance_type = 'r5.2xlarge'

# Number of instances to run 
n_jobs = 1

# Number of physical cores in the instance type 
n_cores = 4

# File system to connect to 
filesystem = 'student_data'

# Region
region='us-east-2'

**Uploading the data to AWS**: 

If you have followed all the steps to use AWS with the `spot-connect` module, you should be able to use the AWS command line interface (awscli). Open a command prompt and type the following command to upload your data to an S3 bucket:

`aws s3 sync <local_folder> <s3-bucket>`

This will upload every file and folder in `<local_folder>` to the S3 bucket you choose, whose name should be formatted as `s3://<bucket_name>`.

In my case, this command was:

`aws s3 sync D:/Student_Voices_Database s3://student-voices`

S3 storage is very affordable so once you've uploaded your data feel free to leave it on there.
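
If you would rather trigger the upload from Python, the same command can be assembled and passed to `subprocess`. This is a sketch only: `build_s3_sync_command` is a hypothetical helper, and the folder and bucket names are placeholders. The `--dryrun` flag makes the CLI list what it would copy without transferring anything.

```python
import subprocess


def build_s3_sync_command(source, destination, dry_run=True):
    """Return the argument list for `aws s3 sync` (hypothetical helper)."""
    if not (source.startswith("s3://") or destination.startswith("s3://")):
        raise ValueError("at least one side of the sync must be an s3:// URI")
    command = ["aws", "s3", "sync", source, destination]
    if dry_run:
        command.append("--dryrun")  # preview the transfer without copying
    return command


# Placeholder names; substitute your own folder and bucket
upload_cmd = build_s3_sync_command("my_local_folder", "s3://my-example-bucket")

# Uncomment to execute (requires awscli to be installed and configured):
# subprocess.run(upload_cmd, check=True)
```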

Once the data is on S3, use the `LinkAWS` class to create an instance and connect it to a new or existing elastic file system (EFS), then download the data from S3 to the EFS via the instance. The instance will terminate automatically once the transfer is complete. 

<font color=blue size=1>Note that the `LinkAWS` class uses `awscli` to perform these transfers on the instance, which makes them faster than regular FTP transfers</font>

In [6]:
from spot_connect import instance_manager

# Use the LinkAWS to move data and run jobs on AWS 
aws_link = instance_manager.InstanceManager()

Default key-pair directory is "C:/Projects/VirtualMachines/Key_Pairs"


In [None]:
# Transfer data from : <s3 bucket>  to  <folder on instance> using <instance profile access> to connect to <efs>   
# aws_link.instance_s3_transfer('s3://student_reviews', '/home/ec2-user/efs/', 'ec2_s3_access', efs='student-reviews')

**Create a monitor instance to upload the repo to the EFS**:

In [None]:
# Create a very low cost instance to download the github repo for the project onto the EFS 
# aws_link.launch_monitor()
# aws_link.update_repo(aws_link.monitor, 
#                      '/home/ec2-user/efs/', 
#                      branch='master', 
#                      repo_link='https://github.com/losDaniel/Student-Voices.git')
# aws_link.terminate_monitor()

**Create the job scripts for each instance**:

We are working with a Python module that connects to a Linux instance, which in turn runs a Python script that needs specific arguments, so passing arguments can get complicated. One easy way around this is to write functions that take the arguments you need and return the bash script to run on the instance as a single "\n"-separated string.

In the example below, we create one script for each data cleaning configuration we want to apply, because we will be using one instance per configuration. 
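
As a rough sketch of the idea, the snippet below builds one "\n"-separated bash script per configuration and base64-encodes it, which is the form EC2 user data is typically passed in. The helpers `build_job_script` and `encode_user_data`, and the `clean.py --config` command line, are hypothetical; the real scripts come from `ec2_scripts` and `bash_scripts` and may differ.

```python
import base64


def build_job_script(config, log_file, data_path):
    """Return a bash script as one newline-separated string (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"cd {data_path}",
        # Hypothetical command line; the project's real entry point may differ
        f"python clean.py --config {config} > {log_file} 2>&1",
    ]
    return "\n".join(lines) + "\n"


def encode_user_data(script):
    """Base64-encode a script so it can be supplied as instance user data."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


# One script per configuration, mirroring the one-instance-per-configuration setup
job_scripts = {
    c: build_job_script(c, f"clean_{c}_log.txt", "/home/ec2-user/efs/data/")
    for c in ["A1", "B1"]
}
encoded_scripts = {c: encode_user_data(s) for c, s in job_scripts.items()}
```

Generating the scripts from function arguments keeps every instance's command line in one place, so adding a configuration only means adding an entry to the list.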

In [7]:
from student_voices import ec2_scripts 
from spot_connect import bash_scripts

configs = ['E1']  # other available configurations: ['A1', 'B1', 'D1']

n_jobs = len(configs)  # number of instances we'll run (one per configuration)

scripts = [] 
uploads = [] 
for config in configs: 
    script = ec2_scripts.get_instance_setup_script(
        filesystem,
        region,
        run_as_user='ec2-user')

    script = ec2_scripts.get_clean_data_script(
        config,
        'clean_'+config+'_log.txt', 
        region=region, 
        path='/home/ec2-user/efs/data/', 
        cancel_fleet=True,
        run_as_user='ec2-user',
        script=script) 
    
    # Base64-encode the finished script so the fleet can run it as user data 
    user_data_script = bash_scripts.script_to_userdata(script)

    scripts.append(user_data_script)

...EFS file system already exists
Waiting for availability......Available


In [8]:
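account_number_file = 'C:/Users/Computer/Documents/AWS/account_number.txt'

# Read the AWS account number, dropping any trailing whitespace or newline
with open(account_number_file) as f:
    account_num = f.read().strip()

# Note: `config` is the loop variable left over from the previous cell (the last entry in `configs`)
aws_link.run_distributed_jobs(account_num,
                              'data_cleaning_'+config,            # Instance prefix 
                              n_jobs,                             # Number of jobs 
                              instance_type,                      # Instance type to use
                              availability_zone='us-east-2c',
                              user_data=scripts,                  # List of scripts, 1 for each job 
                              instance_profile='instance_manager') 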

Key pair detected, re-using...
Security group detected, re-using...


**Download cleaned data (optional)**:

The cleaned data is now on the EFS. Since the language models will also run on AWS instances, there is no need to download it. If you do want a local copy, first transfer the cleaned data from the EFS to S3 and then download it from there. 

In [None]:
# Transfer data from : <folder on instance>  to  <s3 bucket> using <instance profile access> to connect to <efs>
aws_link.instance_s3_transfer('/home/ec2-user/efs/cleaned_data', 's3://student_reviews', 'ec2_s3_access', efs='student_data')

Once this transfer is complete the data will be on S3 and can be downloaded locally using: 

`aws s3 sync <s3-bucket> <local_folder>`

## Review the Data

Here we can review the data and split it into the corpora we want to evaluate. 

In [3]:
from student_voices import sv_utils as bn

root_dir = 'C:/Projects/VirtualMachines/Student_Voices/svvm/Student_Voices/student_voices/data'

# import the data if need be
data = bn.decompress_pickle(root_dir+'/review_stats.pbz2')

**Cut the data into groups based on ratings:**

In [6]:
import pandas as pd 

# Change the ratings to a 0-100 measure
data['Rating'] = ((data['Rating'] - data['Rating'].min())/ (data['Rating'].max()-data['Rating'].min()))*100

# Bin the data by rating; the edges were chosen to keep the lower bins at broadly similar sizes, 
# and the last edge is 101 so that a rating of exactly 100 falls inside the final left-closed bin 
rating_bins = [0, 35, 60, 65, 75, 85, 95, 101]

# split the ratings data into the bins above and place it in the Range field 
data['Range'] = pd.cut(data['Rating'], rating_bins, right = False)
range_dist = data['Range'].value_counts()

bn.full_value_count(data,'Range')

| Range | Range_# | Range_% |
|---|---|---|
| [0, 35) | 359387 | 0.073887 |
| [35, 60) | 309850 | 0.063703 |
| [60, 65) | 244504 | 0.050268 |
| [65, 75) | 248963 | 0.051185 |
| [75, 85) | 404741 | 0.083212 |
| [85, 95) | 593581 | 0.122036 |
| [95, 101) | 2702952 | 0.555708 |

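
To make the binning rule concrete, here is a small pandas-free sketch of the two operations in the cell above: min-max rescaling to 0-100 and left-closed binning, which is what `pd.cut(..., right=False)` does. The helpers `rescale` and `bin_label` and the toy ratings are assumptions for illustration only.

```python
from bisect import bisect_right

RATING_BINS = [0, 35, 60, 65, 75, 85, 95, 101]


def rescale(values):
    """Min-max rescale a list of numbers to the 0-100 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * 100 for v in values]


def bin_label(rating, edges=RATING_BINS):
    """Return the left-closed bin [a, b) that contains rating."""
    if not edges[0] <= rating < edges[-1]:
        raise ValueError(f"rating {rating} is outside the bin edges")
    i = bisect_right(edges, rating) - 1
    return f"[{edges[i]}, {edges[i + 1]})"


raw_ratings = [1, 2, 3, 5]                 # toy ratings on a 1-5 scale
scaled = rescale(raw_ratings)              # [0.0, 25.0, 50.0, 100.0]
labels = [bin_label(r) for r in scaled]
# labels == ['[0, 35)', '[0, 35)', '[35, 60)', '[95, 101)']
```

A rating that sits exactly on an edge, such as 35, lands in the bin that starts there (`[35, 60)`), and a rating of 100 lands in `[95, 101)` only because the last edge is 101 rather than 100.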

**Export the indices for each group** 

In [None]:
# # find the indices for each rating bin
# range_indices = {} 
# for v in range_dist.keys():
#     range_indices[str(v)] = list(data.loc[data['Range']==v].index)
    
# # and save the dictionary with the indices
# bn.full_pickle(root_dir+'/by_rating_range', range_indices)

# range_dist.sort_index()

**Generate other groups:**

After analyzing the initial groupings, we also wanted to review topic models covering the following wider groups:
<br>
- [0, 60)
- [0, 65)

In [18]:
range_indices = {} 

keys = sorted(list(range_dist.keys()))

range_indices['[0, 60)'] = list(data.loc[(data['Range']==keys[0])|(data['Range']==keys[1])].index)
range_indices['[0, 65)'] = list(data.loc[(data['Range']==keys[0])|(data['Range']==keys[1])|(data['Range']==keys[2])].index)

# and save the dictionary with the indices
bn.full_pickle(root_dir+'/by_rating_range_2', range_indices)
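
An alternative way to build wider groups is to merge the per-bin index lists directly rather than filtering the DataFrame again. The sketch below assumes a dictionary of the form `{bin_label: [indices]}` with string keys, like the one exported earlier; `merge_bins` and the toy indices are hypothetical.

```python
def merge_bins(range_indices, labels):
    """Concatenate the index lists of several bins into one sorted list."""
    merged = []
    for label in labels:
        merged.extend(range_indices[label])
    return sorted(merged)


# Toy stand-in for the by_rating_range dictionary
toy_indices = {
    "[0, 35)": [4, 0],
    "[35, 60)": [2],
    "[60, 65)": [7, 1],
}

wider_groups = {
    "[0, 60)": merge_bins(toy_indices, ["[0, 35)", "[35, 60)"]),
    "[0, 65)": merge_bins(toy_indices, ["[0, 35)", "[35, 60)", "[60, 65)"]),
}
# wider_groups == {'[0, 60)': [0, 2, 4], '[0, 65)': [0, 1, 2, 4, 7]}
```

Because the bins do not overlap, concatenating their lists cannot introduce duplicate indices.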

We were likewise interested in topic models covering these mid-range groups:
<br>
- [35, 85)
- [60, 85)

In [10]:
range_indices = {} 

keys = sorted(list(range_dist.keys()))

range_indices['[35, 85)'] = list(data.loc[(data['Range']==keys[1])|(data['Range']==keys[2])|(data['Range']==keys[3])|(data['Range']==keys[4])].index)
range_indices['[60, 85)'] = list(data.loc[(data['Range']==keys[2])|(data['Range']==keys[3])|(data['Range']==keys[4])].index)

# and save the dictionary with the indices
bn.full_pickle(root_dir+'/by_rating_range_3', range_indices)

'[0, 35)'

In [15]:
sorted(list(range_dist.keys()))

[Interval(0, 35, closed='left'),
 Interval(35, 60, closed='left'),
 Interval(60, 65, closed='left'),
 Interval(65, 75, closed='left'),
 Interval(75, 85, closed='left'),
 Interval(85, 95, closed='left'),
 Interval(95, 101, closed='left')]

# Cleaning Results 

Correcting for some of the things omitted in the cleaning.

**Remove verbatim duplicates from the ratings range**:

In [40]:
from student_voices import sv_utils as bn

config_path = 'D:/Student_Voices_Database/s3mirror/data'

full_text = bn.decompress_pickle(config_path+'/full_review_text.pbz2')

In [86]:
indices = bn.decompress_pickle('D:/Student_Voices_Database/s3mirror/data/by_rating_range.pbz2')

In [88]:
import pandas as pd 

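# One row per review, in the same order as full_text
data = pd.DataFrame({'text':full_text})
#data['text'] = data['text'].apply(lambda x: ' '.join(x))

# Within the [0, 35) bin, keep only the first occurrence of each distinct review text
newidx = {}
newidx['[0, 35)'] = list(data.iloc[indices['[0, 35)']].drop_duplicates().index)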
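
The effect of `drop_duplicates()` here can be illustrated without pandas: for the indices in a bin, keep an index only the first time its text is seen. The helper `first_occurrence_indices` and the toy reviews below are assumptions for illustration.

```python
def first_occurrence_indices(texts, candidate_indices):
    """Return the candidate indices whose text has not appeared earlier in the scan."""
    seen = set()
    kept = []
    for i in candidate_indices:
        if texts[i] not in seen:
            seen.add(texts[i])
            kept.append(i)
    return kept


toy_texts = ["good", "bad teacher", "good", "bad teacher", "boring"]
low_bin = [1, 3, 4]  # toy indices belonging to one rating bin

deduplicated = first_occurrence_indices(toy_texts, low_bin)
# deduplicated == [1, 4]
```

Only duplicates within the bin are removed: index 3 is dropped because index 1 in the same bin has identical text.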

In [90]:
bn.compressed_pickle('D:/Student_Voices_Database/s3mirror/data/by_rating_range_lnd', newidx)
bn.compressed_pickle('C:/Projects/VirtualMachines/Student_Voices/svvm/Student_Voices/student_voices/data/by_rating_range_lnd', newidx)

**Remove the word teacher from the cleaned text**:

In [43]:
# Load the cleaned data 
# A1 uses both the stemmer and the lemmatizer 
# B1 uses only the lemmatizer 
# D1 uses neither; it's damaged
# E1 uses the stemmer but not the lemmatizer

config = 'A1'
config_path = 'D:/Student_Voices_Database/s3mirror/data/cleaned_data'

# This is the full set of reviews (over 4 million) 
text, stem_map, lemma_map, phrase_frequencies = bn.decompress_pickle(config_path+'/cleaned_docs_'+config+'.pbz2')

# # Load the indices 
# indices = bn.decompress_pickle('D:/Student_Voices_Database/s3mirror/data/by_rating_range.pbz2')

In [95]:
%%time 
no_teacher_text = [[ele for ele in doc if ele != 'teacher'] for doc in text] 

Wall time: 2min 31s


In [97]:
# Save the new text without the word teacher as the cleaned data
config_material = (no_teacher_text, stem_map, lemma_map, phrase_frequencies)
bn.compressed_pickle(config_path+'/cleaned_docs_'+config+'_no_teacher', config_material)

In [102]:
newidx = bn.decompress_pickle('D:/Student_Voices_Database/s3mirror/data/by_rating_range_lnd.pbz2')

# Drop reviews with no tokens left from the indices loaded from by_rating_range_lnd above
data = pd.DataFrame({'no_teacher_text':no_teacher_text})
data['no_teacher_text'] = data['no_teacher_text'].apply(lambda x: ' '.join(x))

lengths = [len(x) for x in no_teacher_text]
data['lengths'] = lengths
sample = data.loc[newidx['[0, 35)']]
sample = sample[sample['lengths']>0]

# Save the new indices 
doneidx = {}
doneidx['[0, 35)'] = list(sample.index)

bn.compressed_pickle('D:/Student_Voices_Database/s3mirror/data/by_rating_range_noteach', doneidx)
bn.compressed_pickle('C:/Projects/VirtualMachines/Student_Voices/svvm/Student_Voices/student_voices/data/by_rating_range_noteach', doneidx)
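
The two steps above (stripping one token from every document, then dropping documents left empty) can be sketched on toy data as follows. `drop_token` and `non_empty_indices` are hypothetical helpers, not part of `student_voices`.

```python
def drop_token(docs, token):
    """Remove every occurrence of token from each tokenized document."""
    return [[t for t in doc if t != token] for doc in docs]


def non_empty_indices(docs, candidate_indices):
    """Keep only the candidate indices whose document still has tokens."""
    return [i for i in candidate_indices if len(docs[i]) > 0]


toy_docs = [["teacher"], ["great", "teacher"], [], ["boring"]]
stripped = drop_token(toy_docs, "teacher")
# stripped == [[], ['great'], [], ['boring']]

kept = non_empty_indices(stripped, [0, 1, 3])
# kept == [1, 3]
```

Document 0 consisted only of the removed word, so it becomes empty and is dropped from the index list even though it was non-empty before.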

In [99]:
review_text, stem_map, lemma_map, phrase_frequencies = bn.decompress_pickle(config_path+'/cleaned_docs_'+config+'_no_teacher.pbz2')

In [101]:
doneidx

[23,
 81,
 94,
 110,
 124,
 129,
 130,
 149,
 184,
 207,
 226,
 271,
 294,
 296,
 301,
 304,
 326,
 351,
 353,
 369,
 431,
 439,
 446,
 474,
 514,
 516,
 518,
 547,
 583,
 585,
 609,
 612,
 629,
 668,
 683,
 701,
 703,
 713,
 731,
 742,
 768,
 773,
 794,
 799,
 803,
 805,
 818,
 820,
 829,
 830,
 843,
 846,
 860,
 861,
 871,
 887,
 898,
 905,
 910,
 940,
 971,
 974,
 975,
 977,
 1003,
 1095,
 1107,
 1142,
 1143,
 1190,
 1209,
 1215,
 1216,
 1237,
 1297,
 1298,
 1310,
 1325,
 1329,
 1422,
 1423,
 1425,
 1441,
 1514,
 1532,
 1587,
 1611,
 1647,
 1663,
 1675,
 1774,
 1784,
 1789,
 1791,
 1792,
 1796,
 1797,
 1809,
 1812,
 1814,
 1858,
 1884,
 1894,
 1895,
 1933,
 1961,
 1962,
 1963,
 1968,
 1969,
 1975,
 1980,
 1981,
 1999,
 2006,
 2011,
 2026,
 2031,
 2041,
 2044,
 2062,
 2069,
 2080,
 2089,
 2090,
 2092,
 2095,
 2119,
 2120,
 2126,
 2127,
 2128,
 2131,
 2174,
 2176,
 2187,
 2282,
 2294,
 2296,
 2347,
 2353,
 2355,
 2367,
 2373,
 2377,
 2418,
 2427,
 2428,
 2429,
 2432,
 2433,
 2434,
 24