# CORE #4 Text Processing

* Ingestion from the CORE API stored the data in S3 as JSON files, each holding up to 100 search results. 
Per the [BlazingText Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html), the algorithm requires each line of the input file to contain a single sentence of space-separated tokens, so the raw data must be processed to accommodate the training format. 
* In #3, text was extracted from the JSON results and stored in S3. This code picks up from there, prepares the text for modeling, and stores the result in S3. 
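
As a rough illustration of the training format described above (one sentence per line, tokens separated by single spaces), here is a minimal sketch. The helper name `to_blazingtext_lines`, the regex-based sentence splitter, and the lowercasing step are all assumptions made for illustration; they are not the preprocessing used in #3.

```python
import re


def to_blazingtext_lines(text):
    """Turn raw text into a list of lines, one sentence per line,
    each made of lowercase tokens joined by single spaces."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    # (an illustrative assumption, not a robust sentence tokenizer).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines = []
    for sentence in sentences:
        # Keep alphanumeric tokens, allowing a simple apostrophe suffix.
        tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())
        if tokens:
            lines.append(" ".join(tokens))
    return lines


sample = "Word embeddings map words to vectors. BlazingText trains them quickly!"
prepared = to_blazingtext_lines(sample)
print("\n".join(prepared))
```

Writing `"\n".join(prepared)` to a text file would give the kind of input the documentation describes; a real pipeline would likely swap in a proper sentence tokenizer.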

## Initial Prep

Imports

In [1]:
import time
from datetime import datetime
import boto3
import pandas as pd
import pickle
import sagemaker
from sagemaker import get_execution_role

Declarations

In [2]:
core_bucket_name = 'core0823'
stg_bucket = 'core0823-stg'
fnl_bucket = 'core0823-fnl'
psent_key = 'BT_STG/prepd_sentences.txt'

train_data_path = 's3://{}/{}'.format(stg_bucket,psent_key)
model_path = 's3://{}/{}'.format(fnl_bucket,'blztxt')

## BlazingText

In [4]:
sess = sagemaker.Session()
role = get_execution_role()

In [5]:
region_name = boto3.Session().region_name
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, 'blazingtext','latest')

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


In [7]:
bt_model = sagemaker.estimator.Estimator(container,
                                        role,
                                        train_instance_count=2,
                                        train_instance_type='ml.m4.xlarge',
                                        train_volume_size=5,
                                        train_max_run=360000,
                                        input_mode='File',
                                        output_path=model_path,
                                        sagemaker_session = sess)

bt_model.set_hyperparameters(mode='batch_skipgram',
                            epochs=5,
                            min_count=5,
                            sampling_threshold=0.0001,
                            learning_rate=0.05,
                            window_size=5,
                            vector_dim=100,
                            negative_samples=5,
                            batch_size=11,
                            evaluation=True,
                            subwords=False)

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
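
The deprecation warnings in this notebook point at SageMaker Python SDK v2, which renames several `Estimator` keyword arguments. The sketch below maps the v1 names used in the cell above to their v2 equivalents. The helper `rename_estimator_kwargs` is hypothetical and written only for illustration; it does not import or call the SDK, and the values are copied from the cell above.

```python
# v1 -> v2 keyword renames for sagemaker.estimator.Estimator,
# limited to the arguments that appear in this notebook.
V1_TO_V2_ESTIMATOR_KWARGS = {
    "image_name": "image_uri",
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_max_run": "max_run",
}


def rename_estimator_kwargs(v1_kwargs):
    """Return a copy of v1_kwargs with v1 names replaced by v2 names.

    Keys without a known rename (e.g. input_mode) pass through unchanged.
    """
    return {V1_TO_V2_ESTIMATOR_KWARGS.get(k, k): v for k, v in v1_kwargs.items()}


v1_kwargs = {
    "train_instance_count": 2,
    "train_instance_type": "ml.m4.xlarge",
    "train_volume_size": 5,
    "train_max_run": 360000,
    "input_mode": "File",
}
v2_kwargs = rename_estimator_kwargs(v1_kwargs)
print(v2_kwargs)
```

Under SDK v2 the resulting dictionary could be unpacked into the estimator constructor alongside the image URI, role, output path, and session.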


In [8]:
# TODO: confirm content_type='text/plain' is correct; pickle.dumps was used to write the sentence list to this file
bt_train_data = sagemaker.session.s3_input(train_data_path, distribution='FullyReplicated',
                                          content_type='text/plain',s3_data_type='S3Prefix')

bt_data_channels = {'train' : bt_train_data }

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
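
The comment in the cell above raises the question of whether the training file is really plain text, since it was written with `pickle.dumps`. One way to check is sketched below. The helpers `is_binary_pickle` and `to_plain_text` are hypothetical names, and the sketch assumes the pickled object is a list of sentence strings. It relies on the fact that pickles written with protocol 2 or later start with the byte `0x80`, which can never be the first byte of valid UTF-8 text.

```python
import pickle


def is_binary_pickle(raw):
    """Heuristic: protocol >= 2 pickles start with 0x80 followed by the protocol number."""
    return len(raw) >= 2 and raw[0] == 0x80 and raw[1] <= pickle.HIGHEST_PROTOCOL


def to_plain_text(raw):
    """Return newline-delimited text from either a pickled list of
    sentences or bytes that are already UTF-8 plain text."""
    if is_binary_pickle(raw):
        # Only unpickle data you wrote yourself; pickle is not safe for untrusted input.
        sentences = pickle.loads(raw)
        return "\n".join(sentences) + "\n"
    return raw.decode("utf-8")


pickled = pickle.dumps(["first sentence here", "second sentence here"])
print(is_binary_pickle(pickled))
print(to_plain_text(pickled), end="")
```

If the check reports a pickle, re-uploading the output of `to_plain_text` as the training object would match the `text/plain` content type declared on the channel.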


In [9]:
bt_model.fit(inputs=bt_data_channels, logs=True)

2020-09-21 23:02:48 Starting - Starting the training job...
2020-09-21 23:02:51 Starting - Launching requested ML instances......
2020-09-21 23:04:07 Starting - Preparing the instances for training.........
2020-09-21 23:05:44 Downloading - Downloading input data
2020-09-21 23:05:44 Training - Downloading the training image...
2020-09-21 23:06:04 Training - Training image download completed. Training in progress.
Arguments: train
Found 10.2.83.8 for host algo-1
Found 10.2.103.143 for host algo-2
Arguments: train
Found 10.2.83.8 for host algo-1
Found 10.2.103.143 for host algo-2
[09/21/2020 23:06:16 INFO 139719394350912] nvidia-smi took: 0.0251860618591 secs to identify 0 gpus
[09/21/2020 23:06:16 INFO 139719394350912] Running distributed CPU BlazingText training using batch_skipgram on 2 hosts.
[09/21/2020 23:06:16 INFO 139719394350912] Number of hosts: 2, master IP address: 10.2.83.8, host IP address: 10.2.103