# SageMaker Scikit-learn Example
Let's use the final project dataset.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

Read our data from the S3 bucket.

In [3]:
import pandas as pd

bucket='mlu-data-example-test'
data_key = 'training.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

df = pd.read_csv(data_location, header=0)
print(df.head())

        doc_id                                               text  \
0  16452961621  Smells.like burnt coffee and taste disgusting....   
1  17116740071  It was ok, but definitely not worth paying $31...   
2  16550647171  It stings and burns under my tongue.  I have u...   
3  17119506041  Great idea but paper does not burn uniformly n...   
4  16969366511           Burned almost anything I tried to toast.   

            date  star_rating                     title  \
0  9/15/18 15:18            1  Terrible taste and smell   
1  9/17/18 15:50            1      not worth $31/bottle   
2  9/16/18 18:26            1                It stings!   
3  9/20/18 21:18            3                Ok product   
4  9/11/18 17:17            1                         .   

                  human_tag  
0  Not Product Safety Issue  
1  Not Product Safety Issue  
2  Not Product Safety Issue  
3  Not Product Safety Issue  
4  Not Product Safety Issue  


# 1-Pre-processing:
We will pre-process our data. In this step, we will:
* Handle missing values
* Lowercase the text and title fields
* Normalize the star_rating field

In [4]:
# Let's remove rows with NaN value
df.dropna(inplace=True)

In [5]:
# Lowercase the free-text fields so that TF-IDF treats "Burn" and "burn" as the same token
df["text"] = df["text"].str.lower()
df["title"] = df["title"].str.lower()

In [6]:
print("Normalizing star rating field")
# Min-max scale star_rating into the range [0, 1]
rating_min = df["star_rating"].min()
rating_max = df["star_rating"].max()
df["star_rating"] = (df["star_rating"] - rating_min) / (rating_max - rating_min)

Normalizing star rating field
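
As a small illustration of the min-max formula used for star_rating, the sketch below applies it to a made-up column of ratings. The values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical star ratings on a 1-5 scale (not from the real dataset)
ratings = pd.Series([1, 3, 5, 2])

# Min-max scaling: (x - min) / (max - min) maps the column into [0, 1]
scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())

print(scaled.tolist())  # [0.0, 0.5, 1.0, 0.25]
```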


# 2-Training-Validation Split and Vectorizing:

Let's split our data into training and validation subsets. We will use the title, text and star_rating fields as predictor variables, and human_tag as our label (target variable).

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X_train, X_val, y_train_text, y_val_text = train_test_split(
    df[["title", "text", "star_rating"]],
    df["human_tag"].values,
    test_size=0.3,
    shuffle=True,
)

le = LabelEncoder()
le.fit(y_train_text)
y_train = le.transform(y_train_text)
y_val = le.transform(y_val_text)
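
To see which integer each label receives, here is a minimal sketch with two assumed label strings (the real dataset may contain other values). LabelEncoder sorts the classes, so it is worth checking `classes_` before hard-coding what 0 and 1 mean later on.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings; the real dataset may contain different values
labels = ["Product Safety Issue", "Not Product Safety Issue", "Not Product Safety Issue"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, and each class is encoded as its index in that array
print(list(le_demo.classes_))  # ['Not Product Safety Issue', 'Product Safety Issue']
print(encoded.tolist())        # [1, 0, 0]

# inverse_transform maps integer predictions back to the original strings
decoded = list(le_demo.inverse_transform([0, 1]))
print(decoded)
```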

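Next, we extract TF-IDF features. The vectorizers are fit on the training subset only and then applied to both the training and validation subsets.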

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_title_vectorizer = TfidfVectorizer(max_features=350)
tfidf_text_vectorizer = TfidfVectorizer(max_features=850) 

tfidf_title_vectorizer.fit(X_train["title"].values)
tfidf_text_vectorizer.fit(X_train["text"].values)

X_train_title_vectors = tfidf_title_vectorizer.transform(X_train["title"].values)
X_train_text_vectors = tfidf_text_vectorizer.transform(X_train["text"].values)

X_val_title_vectors = tfidf_title_vectorizer.transform(X_val["title"].values)
X_val_text_vectors = tfidf_text_vectorizer.transform(X_val["text"].values)
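
The fit-on-train, transform-both pattern can be sketched on a toy corpus. The documents and the `max_features=4` value below are hypothetical and only chosen to keep the output small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the review text
train_docs = ["it burnt my toast", "great toaster", "burnt smell and burnt taste"]
val_docs = ["the toast burnt again", "unseen words only"]

vectorizer = TfidfVectorizer(max_features=4)
vectorizer.fit(train_docs)  # the vocabulary comes from the training data only

train_vectors = vectorizer.transform(train_docs)
val_vectors = vectorizer.transform(val_docs)

print(train_vectors.shape)  # (3, 4)
print(val_vectors.shape)    # (2, 4)

# A validation document with no known vocabulary terms becomes an all-zero row
print(val_vectors[1].nnz)   # 0
```

Both matrices have the same number of columns because the validation data reuses the vocabulary learned from the training data.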

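Let's put everything together. Each row holds the star rating, the title vector, the text vector, and the label in the last column.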

In [9]:
import numpy as np

train_data = np.column_stack([X_train["star_rating"].values,
                              X_train_title_vectors.toarray(),
                              X_train_text_vectors.toarray(),
                              y_train
                             ])

validation_data = np.column_stack([X_val["star_rating"].values,
                                   X_val_title_vectors.toarray(),
                                   X_val_text_vectors.toarray(),
                                   y_val
                                  ])
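
Because the label is stacked as the last column, any code that consumes these arrays has to split it off again. The sketch below shows the layout on small hypothetical arrays. With the vectorizer settings above, a real row would have up to 1 + 350 + 850 + 1 = 1202 columns.

```python
import numpy as np

# Hypothetical small arrays: 4 rows, 2 title features, 3 text features
star_rating = np.array([0.0, 0.25, 1.0, 0.5])
title_vectors = np.random.rand(4, 2)
text_vectors = np.random.rand(4, 3)
labels = np.array([0, 1, 0, 1])

data = np.column_stack([star_rating, title_vectors, text_vectors, labels])
print(data.shape)  # (4, 7): 1 rating + 2 title + 3 text + 1 label

# The label is the last column; everything before it is a feature
X = data[:, :-1]
y = data[:, -1].astype(int)
print(X.shape, y.tolist())  # (4, 6) [0, 1, 0, 1]
```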

# 3-Saving the features and vectorizers into S3 bucket
We successfully extracted features. It is now time to upload the data to our S3 bucket so that our training algorithm can use it. It will be uploaded to the "processed_data" folder. We will also save our vectorizers to our S3 bucket.

In [10]:
import os 
import pickle
import numpy as np

# Let's save our vectorizers locally
with open("tfidf_title_vectorizer.pickle", "wb") as f:
    pickle.dump(tfidf_title_vectorizer, f)
with open("tfidf_text_vectorizer.pickle", "wb") as f:
    pickle.dump(tfidf_text_vectorizer, f)

# Save training and validation data locally
np.save('train_data', train_data)
np.save('validation_data', validation_data)

# Upload the data to our S3 bucket
prefix = 'processed_data'
bucket = 'mlu-data-example-test'

s3_bucket = boto3.Session().resource('s3').Bucket(bucket)
# S3 keys always use forward slashes, so build them without os.path.join
s3_bucket.Object('{}/train/train_data.npy'.format(prefix)).upload_file('train_data.npy')
s3_bucket.Object('{}/validation/validation_data.npy'.format(prefix)).upload_file('validation_data.npy')
# The vectorizers are stored at the root of the bucket
s3_bucket.Object('tfidf_title_vectorizer.pickle').upload_file('tfidf_title_vectorizer.pickle')
s3_bucket.Object('tfidf_text_vectorizer.pickle').upload_file('tfidf_text_vectorizer.pickle')
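
Whoever consumes these files later has to load them back. The following sketch shows the local save/load round trip for both file types, using a temporary directory and made-up contents instead of the real data and S3.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

workdir = tempfile.mkdtemp()  # stand-in for a local download directory

# Save: np.save appends ".npy" when the file name has no extension
array_path = os.path.join(workdir, "train_data")
np.save(array_path, np.arange(6, dtype=float).reshape(2, 3))

vectorizer = TfidfVectorizer().fit(["burnt toast", "great toaster"])
pickle_path = os.path.join(workdir, "vectorizer.pickle")
with open(pickle_path, "wb") as f:
    pickle.dump(vectorizer, f)

# Load: the array and the fitted vocabulary come back unchanged
restored_array = np.load(array_path + ".npy")
with open(pickle_path, "rb") as f:
    restored_vectorizer = pickle.load(f)

print(restored_array.shape)                     # (2, 3)
print(sorted(restored_vectorizer.vocabulary_))  # ['burnt', 'great', 'toast', 'toaster']
```

Pickled vectorizers should be loaded with the same scikit-learn version that created them, since pickles are not guaranteed to be compatible across versions.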

# 4-Training and Deployment Script

Let's get the training data path.

In [11]:
bucket = 'mlu-data-example-test'
prefix = 'processed_data/train'

train_input = 's3://{}/{}'.format(bucket, prefix)
print(train_input)

s3://mlu-data-example-test/processed_data/train


We will use a separate instance for training. The "script_path" variable holds the file name/path of our training and inference code.

In [12]:
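from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn_training_burn_project.py'

# Note: train_instance_type is the SageMaker Python SDK v1 argument name.
# In SDK v2 it was renamed to instance_type, and framework_version is required.
sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session
)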
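
The contents of `sklearn_training_burn_project.py` are not shown in this notebook. Purely as an illustration, an entry point script for the SageMaker scikit-learn container could look roughly like the sketch below. The model choice (logistic regression), the `train` helper, and the `model.joblib` file name are assumptions, not the actual script. A real script for this example would also need to turn the JSON request into features with the saved vectorizers, which is omitted here.

```python
import os

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression


def train(train_dir, model_dir):
    """Fit a classifier on train_data.npy and save it into model_dir (hypothetical helper)."""
    data = np.load(os.path.join(train_dir, "train_data.npy"))
    X, y = data[:, :-1], data[:, -1].astype(int)  # the label is the last column

    # Assumption: logistic regression is only a placeholder model choice
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)

    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Called by the SageMaker scikit-learn container to load the model for inference."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Inside a training job, SageMaker sets these environment variables:
# SM_CHANNEL_TRAIN points at the downloaded 'train' channel, and
# SM_MODEL_DIR at the directory whose contents are uploaded as the model artifact.
if __name__ == "__main__" and "SM_MODEL_DIR" in os.environ:
    train(os.environ["SM_CHANNEL_TRAIN"], os.environ["SM_MODEL_DIR"])
```

The channel name passed to `fit` (here `'train'`) determines the name of the `SM_CHANNEL_TRAIN` variable.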

The following command will take some time. It will print information about the instance and the training process.

In [13]:
sklearn.fit({'train': train_input}, logs=False)


2019-11-14 22:39:09 Starting - Starting the training job.
2019-11-14 22:39:18 Starting - Launching requested ML instances............
2019-11-14 22:40:19 Starting - Preparing the instances for training.......
2019-11-14 22:41:03 Downloading - Downloading input data.....
2019-11-14 22:41:30 Training - Downloading the training image..
2019-11-14 22:41:48 Training - Training image download completed. Training in progress..
2019-11-14 22:41:59 Uploading - Uploading generated training model
2019-11-14 22:42:04 Completed - Training job completed


We will deploy our model to an "ml.t2.medium" instance with the endpoint name "endpoint-sklearn-test". This will also take some time to finish.

In [14]:
predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.t2.medium", endpoint_name="endpoint-sklearn-test")

--------------------------------------------------------------------------------------------------------------!

# 5-Test the endpoint:

In [38]:
payload = '{"data": [{"text":"the laptop gets hot even when idling", "title":"it burnt!", "star_rating":1}]}'
sm=boto3.client("runtime.sagemaker")
response = sm.invoke_endpoint(
      EndpointName='endpoint-sklearn-test',
      Body=payload,
      ContentType="application/json",
      Accept="application/json")

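# The predicted label is read from character index 6 of the response body,
# so this line depends on the exact response format returned by the inference script
a = int(response['Body'].read().decode()[6])

if a == 0:
    answer = "product safety issue"
else:
    answer = "not product safety issue"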
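
Two parts of the request cell can be made less fragile. The sketch below builds the payload with `json.dumps` instead of a hand-written string, and reads the label with a regular expression instead of a fixed character index. `extract_label` is a hypothetical helper. It assumes the first integer in the response body is the predicted label, which may not hold for every response format.

```python
import json
import re

# Build the request body from a dict so that quoting and escaping are handled by json
review = {"text": "the laptop gets hot even when idling", "title": "it burnt!", "star_rating": 1}
payload = json.dumps({"data": [review]})


def extract_label(body):
    """Return the first integer found in the response body (hypothetical helper)."""
    match = re.search(r"\d+", body)
    if match is None:
        raise ValueError("no integer label found in response: {!r}".format(body))
    return int(match.group())
```

With this helper, the parsing line would become `a = extract_label(response['Body'].read().decode())`.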

In [39]:
print(answer)

product safety issue
