# Data Cleaning

To build language models on the text corpus we've retrieved from RateMyTeacher.com, we first need to clean the data. The way we clean the data (removing conjunctions, pluralisations, etc.) can have a considerable effect on the results, so we want to clean the data in a few different ways and see how the results are affected.

In [1]:
from importlib import reload
import os, sys, path

**Begin by collecting the scraped data and formatting it into datasets**:

The `clean_data.gen_data()` function creates two files:
- ***review_stats.pbz2*** : a file with all the summary information including scores and teacher info. 
- ***full_review_text.pbz2*** : a file with all the review texts.

<font color=darkred size=1>**No need to run this again if you already have these two files**</font>

In [2]:
# from student_voices.clean_data import gen_data
# gen_data('D:/Student_Voices_Database/')
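For readers unfamiliar with the `.pbz2` extension, here is a minimal sketch of how such files could be written and read, assuming they are bz2-compressed pickles. The `compress_pickle` helper and the sample data are hypothetical. Only the name `decompress_pickle` comes from this project, and its actual implementation in `sv_utils` may differ.

```python
import bz2
import pickle
import tempfile
from pathlib import Path


def compress_pickle(path, obj):
    """Pickle `obj` and write it bz2-compressed to `path` (hypothetical helper)."""
    with bz2.BZ2File(path, "wb") as handle:
        pickle.dump(obj, handle)


def decompress_pickle(path):
    """Read a bz2-compressed pickle from `path` and return the stored object."""
    with bz2.BZ2File(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip a small, made-up stand-in for the review statistics.
sample = {"teacher_id": [101, 102], "rating": [4.5, 2.0]}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "review_stats_sample.pbz2"
    compress_pickle(target, sample)
    restored = decompress_pickle(target)
```

Compressing the pickles keeps large text corpora manageable on disk, at the cost of slower reads and writes.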

**Bin reviews**:

The original study binned the ratings into several ranges from lowest to highest. Use the `create_hardcoded_ratings_bins` method to create the same bins. 

- This will create the file ***by_ratings_range.pbz2***, which is a dictionary of the form `{bin1: [indices], bin2: [indices], ...}`

<font color=darkred size=1>**No need to run this again if you already have this file**</font>

In [3]:
# from student_voices.clean_data import create_hardcoded_ratings_bins
# from student_voices.sv_utils import decompress_pickle

# review_data = decompress_pickle('D:/Student_Voices/review_stats.pbz2')
# create_hardcoded_ratings_bins(review_data)
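To illustrate the `{bin: [indices]}` structure described above, the following sketch bins a short list of ratings. The bin labels and edges here are made up for the example. They are not the ranges used in the original study or in `create_hardcoded_ratings_bins`.

```python
from collections import defaultdict

# Hypothetical bins: (label, inclusive lower bound, exclusive upper bound).
BINS = [
    ("low", 0.0, 2.0),
    ("mid", 2.0, 4.0),
    ("high", 4.0, 5.01),
]


def bin_ratings(ratings, bins=BINS):
    """Return {bin_label: [indices]} listing which positions fall in each bin."""
    by_range = defaultdict(list)
    for index, rating in enumerate(ratings):
        for label, lower, upper in bins:
            if lower <= rating < upper:
                by_range[label].append(index)
                break
    return dict(by_range)


ratings = [1.0, 4.5, 3.2, 5.0, 2.0]
by_ratings_range = bin_ratings(ratings)
```

Storing indices rather than the reviews themselves keeps the bin file small and lets the same bins be applied to any cleaned version of the text.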

## Use AWS to Clean the Data

Cleaning this amount of text data can be memory intensive. **This notebook** uses the [spot-connect](https://pypi.org/project/spot-connect/) module to launch virtual machines on AWS to clean the data.

**Cleaning the data** means preparing the text data for review by an NLP model. This involves: 
- Tokenizing 
- Lemmatizing/Stemming 
- Removing Stop Words 
- Removing Numeric Characters 
- Removing Contractions 
- and more... 
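As a rough illustration of the steps listed above, here is a minimal, standard-library-only sketch. It is not the `student_voices` cleaning code. The contraction table, stop-word set and suffix-stripping "stemmer" are tiny placeholders, and the sketch assumes that handling contractions means expanding them.

```python
import re

# Tiny illustrative lookups; a real pipeline would use much larger resources.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "she's": "she is"}
STOP_WORDS = {"a", "an", "the", "is", "and", "to", "of", "she"}


def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer/lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_review(text):
    """Lowercase, expand contractions, tokenize, drop stop words, then stem."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Tokenizing on runs of letters also drops digits and punctuation.
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


cleaned = clean_review("She's explaining 3 topics and can't skip the tests.")
```

Each step can be switched on or off independently, which is what makes it natural to define several cleaning configurations and compare them.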

A list of preset cleaning parameters can be retrieved with the `clean_data.data_configuration_hardcodes()` command: 

In [4]:
from student_voices import clean_data
data_configurations = clean_data.data_configuration_hardcodes()

**Set the region and instance AMIs**:

In [5]:
from spot_connect import sutils 

# Change the region for the default profiles
sutils.reset_profiles()

c:\users\computer\dropbox\projects\spot-connect\spot_connect\data\profiles.txt


**Specify the AWS parameters**:

In [5]:
# Instance type 
instance_type = 'r5.2xlarge'

# Number of instances to run 
n_jobs = 1

# Number of physical cores in the instance type 
n_cores = 4

# File system to connect to 
filesystem = 'student_data'

# Region
region='us-east-2'

**Uploading the data to AWS**: 

If you have followed all the steps to use AWS with the `spot-connect` module, you should be able to use the AWS command line interface (awscli). Open a command prompt and type the following command to upload your data to an S3 bucket:

`aws s3 sync <local_folder> <s3-bucket>`

This will upload every file and folder in `<local_folder>` to the S3 bucket you choose, whose name should be formatted as `s3://<bucket_name>`.

In my case, this command was:

`aws s3 sync D:/Student_Voices_Database s3://student-voices`

S3 storage is very affordable, so once you've uploaded your data, feel free to leave it there.
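If you would rather trigger the sync from Python than from a command prompt, one possible approach is to build the same `aws s3 sync` command as an argument list and hand it to `subprocess`. The `build_sync_command` helper below is hypothetical and only assembles the command. Actually running it requires a configured awscli.

```python
import shlex


def build_sync_command(source, destination):
    """Return the argument list for `aws s3 sync <source> <destination>`."""
    if not (source.startswith("s3://") or destination.startswith("s3://")):
        raise ValueError("at least one side of the sync must be an s3:// URI")
    return ["aws", "s3", "sync", source, destination]


upload_cmd = build_sync_command("D:/Student_Voices_Database", "s3://student-voices")
printable = shlex.join(upload_cmd)  # handy for logging the exact command

# To actually run it (requires a configured awscli):
# import subprocess
# subprocess.run(upload_cmd, check=True)
```

Swapping the two arguments gives the download direction, since `aws s3 sync` works both ways.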

Once the data is on S3, use the `InstanceManager` class to launch an instance and connect it to a new or existing Elastic File System (EFS), then download the data from S3 to the EFS via the instance. The instance will terminate automatically once the transfer is complete.

<font color=blue size=1>Note that these transfers are performed with `awscli` on the instance itself, which makes them faster than regular FTP transfers.</font>

In [6]:
from spot_connect import instance_manager

# Use the InstanceManager to move data and run jobs on AWS
aws_link = instance_manager.InstanceManager()

Default key-pair directory is "C:/Projects/VirtualMachines/Key_Pairs"


In [None]:
# Transfer data from : <s3 bucket>  to  <folder on instance> using <instance profile access> to connect to <efs>   
# aws_link.instance_s3_transfer('s3://student_reviews', '/home/ec2-user/efs/', 'ec2_s3_access', efs='student-reviews')

**Create a monitor instance to upload the repo to the EFS**:

In [None]:
# Create a very low cost instance to download the github repo for the project onto the EFS 
# aws_link.launch_monitor()
# aws_link.update_repo(aws_link.monitor, 
#                      '/home/ec2-user/efs/', 
#                      branch='master', 
#                      repo_link='https://github.com/losDaniel/Student-Voices.git')
# aws_link.terminate_monitor()

**Create the job scripts for each instance**:

We are working with a Python module that connects to a Linux instance, which in turn runs a Python script that needs specific arguments, so passing arguments can get complicated. One easy way around this is to write functions that take the arguments you need and generate the corresponding bash scripts as "\n"-separated strings to be run on the instances.

For this notebook, we create one script per data-cleaning configuration, because each configuration runs on its own instance.
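The script-generation idea described above can be sketched in a few lines, independently of `spot-connect`. This is only an illustration of the pattern. The `make_job_script` and `to_user_data` helpers, the `clean.py` entry point and its `--config` flag are all hypothetical and do not reflect what `ec2_scripts` or `bash_scripts` actually generate.

```python
import base64


def make_job_script(config, log_file, data_path="/home/ec2-user/efs/data/"):
    """Build a newline-separated bash script for one cleaning configuration."""
    lines = [
        "#!/bin/bash",
        f"cd {data_path}",
        # `clean.py` and `--config` are placeholders, not the project's real entry point.
        f"python clean.py --config {config} > {log_file} 2>&1",
    ]
    return "\n".join(lines) + "\n"


def to_user_data(script):
    """Base64-encode a script so it can be passed to EC2 as user data."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


# One script per configuration, mirroring the one-instance-per-configuration setup.
job_scripts = {c: make_job_script(c, f"clean_{c}_log.txt") for c in ["B1", "D1", "E1"]}
encoded = {c: to_user_data(s) for c, s in job_scripts.items()}
```

Because every argument is baked into the script text, nothing needs to be passed to the instance at launch time beyond the encoded script itself.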

In [7]:
from student_voices import ec2_scripts 
from spot_connect import bash_scripts

configs = ['B1', 'D1', 'E1']  # other presets, e.g. 'A1' or 'C1', can be added here

n_jobs = len(configs) # number of instances we'll run 

scripts = [] 
uploads = [] 
for config in configs: 
    script = ec2_scripts.get_instance_setup_script(
        filesystem,
        region,
        run_as_user='ec2-user')

    script = ec2_scripts.get_clean_data_script(
        config,
        'clean_'+config+'_log.txt', 
        region=region, 
        path='/home/ec2-user/efs/data/', 
        cancel_fleet=True,
        run_as_user='ec2-user',
        script=script) 
    
    # Base64-encode the working script so the fleet can run it as user data
    user_data_script = bash_scripts.script_to_userdata(script)

    scripts.append(user_data_script)

...EFS file system already exists
Waiting for availability......Available
...EFS file system already exists
Waiting for availability......Available
...EFS file system already exists
Waiting for availability......Available


In [8]:
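account_number_file = 'C:/Users/Computer/Documents/AWS/account_number.txt'
# Read the AWS account number, stripping any surrounding whitespace or newline
with open(account_number_file) as f:
    account_num = f.read().strip()

# Note: `config` still holds the last value from the loop above ('E1'),
# so every instance shares the prefix 'data_cleaning_E1'
aws_link.run_distributed_jobs(account_num,
                              'data_cleaning_'+config,            # Instance prefix 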
                              n_jobs,                             # Number of jobs 
                              instance_type,                      # Instance type to use
                              availability_zone='us-east-2c',
                              user_data=scripts,                  # List of scripts, 1 for each job 
                              instance_profile='instance_manager') 

Key pair KP-data_cleaning_E1 created...
Security Group SG-data_cleaning_E1 Created...Key pair detected, re-using...
Security group detected, re-using...
Key pair detected, re-using...
Security group detected, re-using...


**Download cleaned data (optional)**:

The cleaned data is now on the EFS. Since the language models will also run on AWS instances, there is no need to download the data. If you want a local copy anyway, upload the cleaned data to S3 and then download it from there.

In [None]:
# Transfer data from : <folder on instance>  to  <s3 bucket> using <instance profile access> to connect to <efs>   
aws_link.instance_s3_transfer('/home/ec2-user/efs/cleaned_data', 's3://student_reviews', 'ec2_s3_access', efs='reviews_efs')

Once this transfer is complete the data will be on S3 and can be downloaded locally using: 

`aws s3 sync <s3-bucket> <local_folder>`