# CORE #4 Text Processing

* Data ingested from the CORE API is stored in S3 as JSON files, each holding up to 100 search results. 
Per the [BlazingText Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html), the algorithm requires each line of the input file to contain a single sentence of space-separated tokens, so the raw data must be processed to accommodate the training format. 
* In CORE #3, text was extracted from the JSON results and stored in S3. This code picks up from there, prepares the text for modeling, and stores the result in S3. 
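
As a rough illustration of the one-sentence-per-line format described above, the sketch below turns a list of sentences into newline-delimited, space-separated tokens. The helper name `to_blazingtext_lines`, the lowercasing, and the regex tokenizer are assumptions made for illustration; they are not taken from CORE #3 and are not required by BlazingText.

```python
import re

def to_blazingtext_lines(sentences):
    """Hypothetical helper: return one space-separated sentence per line."""
    lines = []
    for sentence in sentences:
        # Assumed tokenization: lowercase, keep only alphanumeric runs
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        if tokens:  # skip sentences that contain no tokens
            lines.append(" ".join(tokens))
    # Terminate every sentence with a newline so each sits on its own line
    return "".join(line + "\n" for line in lines)

sample = ["Open access research papers.", "   ", "Word2Vec learns embeddings!"]
training_text = to_blazingtext_lines(sample)
print(training_text, end="")
```

The resulting string could be encoded as UTF-8 and written to S3 as a plain-text object, which would match the `text/plain` content type used later in this notebook.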

## Initial Prep

Imports

In [None]:
import time
from datetime import datetime
import boto3
import pandas as pd
import pickle
import sagemaker
from sagemaker import get_execution_role

Declarations

In [None]:
core_bucket_name = 'core0823'
stg_bucket = 'core0823-stg'
fnl_bucket = 'core0823-fnl'
psent_key='BT_STG/prepd_sentences.txt'

train_data_path = 's3://{}/{}'.format(stg_bucket,psent_key)
model_path = 's3://{}/{}'.format(fnl_bucket,'blztxt')

Prep

In [None]:
s3_client = boto3.client('s3')

Generic Functions

In [None]:
def s3_file_location(f_bucket, f_file):
    """
    Simply returns a formatted string with the S3 file location
    """
    data_location = 's3://{}/{}'.format(f_bucket,f_file)
    return data_location

def serialobj_file_to_list(f_bucket, f_obj_key):
    """
    Intakes bucket and the key for a serialized object. 
    In this case it is a serialized list object from CORE #3.
    Returns list.
    """
    try:
        s3_obj = s3_client.get_object(Bucket=f_bucket, Key=f_obj_key)['Body'].read()
        return pickle.loads(s3_obj)
    except Exception as err:
        # Report the failure; the function then implicitly returns None
        print('Failed to get or deserialize object: {}'.format(err))

In [None]:
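# prepd_sentences_serialized is not defined in the cells shown here; it must hold the prepared sentences before this cell runs
response = s3_client.put_object(Body=prepd_sentences_serialized, Bucket=stg_bucket, Key=psent_key)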

## BlazingText

In [None]:
sess = sagemaker.Session()
role = get_execution_role()

In [None]:
region_name = boto3.Session().region_name
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, 'blazingtext','latest')

In [None]:
bt_model = sagemaker.estimator.Estimator(container,
                                        role,
                                        train_instance_count=2,
                                        train_instance_type='ml.c4.2xlarge',
                                        train_volume_size=5,
                                        train_max_run=360000,
                                        input_mode='File',
                                        output_path=model_path,
                                        sagemaker_session=sess)

bt_model.set_hyperparameters(mode='batch_skipgram',
                            epochs=5,
                            min_count=5,
                            sampling_threshold=0.0001,
                            learning_rate=0.05,
                            window_size=5,
                            vector_dim=100,
                            negative_samples=5,
                            batch_size=11,
                            evaluation=True,
                            subwords=False)

In [None]:
# TODO: confirm that content_type='text/plain' is correct, because pickle.dumps was used to write the list to the file
bt_train_data = sagemaker.session.s3_input(train_data_path, distribution='FullyReplicated',
                                           content_type='text/plain', s3_data_type='S3Prefix')

bt_data_channels = {'train': bt_train_data}
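
The comment in the cell above questions whether `text/plain` matches data written with `pickle.dumps`. One way to check, sketched below, is to inspect the staged bytes before training. The helper `looks_like_plain_text` is hypothetical, and the check is only a heuristic: it catches binary pickles (protocol 2 and later), while a protocol 0 pickle is ASCII and would pass.

```python
import pickle

def looks_like_plain_text(data: bytes) -> bool:
    """Hypothetical check: True if bytes decode as UTF-8 and are not a binary pickle."""
    # Binary pickle protocols (2 and later) begin with the PROTO opcode 0x80
    if data[:1] == b"\x80":
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

sentences = ["open access research papers", "word2vec learns embeddings"]
pickled = pickle.dumps(sentences)                       # serialized list object
plain = ("\n".join(sentences) + "\n").encode("utf-8")   # one sentence per line

print(looks_like_plain_text(pickled))
print(looks_like_plain_text(plain))
```

In this notebook the bytes to test could come from `s3_client.get_object(Bucket=stg_bucket, Key=psent_key)['Body'].read()`. If the check fails, the staged file is probably a pickled object rather than the line-per-sentence text that the BlazingText documentation describes.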

In [None]:
bt_model.fit(inputs=bt_data_channels, logs=True)