### Deploying the model in SageMaker

**Steps to follow**

    Import the necessary libraries
    Create an S3 bucket
    Map the train and test data in S3
    Map the path of the models in S3

## Importing the Libraries

In [1]:
import sagemaker
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri 
from sagemaker.session import s3_input, Session
import os
#! pip install textblob

**Initialize Your Resources**

SageMaker needs each training job to be unique, and as a user you need to be able to find your job. So here we provide our own name and use it to track our resources throughout the lab.

Let's start by specifying:

- The S3 bucket and prefix used for the training and model data. The bucket should be in the same region as the notebook instance, training, and hosting. If no bucket is given, the SageMaker SDK creates a default bucket with a predefined naming convention.
- The IAM role ARN that gives SageMaker access to the data. It can be fetched with the `get_execution_role` method from the SageMaker Python SDK.

**Set the bucket name**

In [59]:
bucket_name = 'youtubecommentanalysis' # <--- CHANGE THIS VARIABLE TO A UNIQUE NAME FOR YOUR BUCKET
my_region = boto3.session.Session().region_name # set the region of the instance
print(my_region)

ap-southeast-1


**Create the bucket with the given name**

In [60]:
s3 = boto3.resource('s3')
try:
    if  my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else: 
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={ 'LocationConstraint': my_region })
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ',e)

S3 bucket created successfully
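
The region check in the cell above exists because `us-east-1` is the default S3 location and is created without a `LocationConstraint`, while every other region needs one. The following is a minimal sketch of that branching as a pure function; `create_bucket_kwargs` is a hypothetical helper name, not part of boto3.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for s3.create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location, so no constraint is passed there.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("youtubecommentanalysis", "ap-southeast-1"))
```

With this helper, the call would become `s3.create_bucket(**create_bucket_kwargs(bucket_name, my_region))`, which keeps the region logic testable without touching AWS.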


#### You can open the S3 console in AWS to confirm that the bucket was created

In [61]:
# set an output path where the trained model will be saved
prefix = 'xgboost-as-a-built-in-algo'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

s3://youtubecommentanalysis/xgboost-as-a-built-in-algo/output
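
S3 keys always use forward slashes. The upload cells in this notebook build keys with `os.path.join`, which works on a Linux notebook instance but would produce backslashes on Windows. The sketch below shows one way to build keys and URIs independently of the host OS; `s3_key` and `s3_uri` are hypothetical helper names.

```python
import posixpath


def s3_key(*parts):
    """Join key segments with forward slashes regardless of the host OS."""
    return posixpath.join(*parts)


def s3_uri(bucket, *parts):
    """Return an s3:// URI for the given bucket and key segments."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


bucket_name = "youtubecommentanalysis"
prefix = "xgboost-as-a-built-in-algo"
print(s3_uri(bucket_name, prefix, "output"))
print(s3_key(prefix, "train/train.csv"))
```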


## Preprocessing the data

In [73]:
# Continuing from the data-cleaning notebook, we take its result (comment_clean_dataset) and convert it to
# numerical data. For SageMaker, the output (label) column must come first, followed by the feature columns.
# This layout is required for SageMaker model training and deployment.
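
To make the layout requirement concrete, here is a small sketch that writes rows as label-first CSV with no header row, which is the shape the built-in XGBoost algorithm expects for CSV training data. The toy labels and features are hypothetical, and `to_sagemaker_csv` is an illustrative helper rather than a SageMaker API.

```python
import csv
import io


def to_sagemaker_csv(labels, feature_rows):
    """Serialise rows as label-first CSV without a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, features in zip(labels, feature_rows):
        # The label goes in column 0, the features follow.
        writer.writerow([label] + list(features))
    return buffer.getvalue()


# Toy data (hypothetical): two comments, three bag-of-words features each.
print(to_sagemaker_csv([1, 0], [[0, 2, 1], [1, 0, 0]]))
```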

In [27]:
import pandas as pd
import numpy as np
from textblob import TextBlob

In [6]:
comments_clean = pd.read_csv("comment_clean_dataset.csv")
comments_clean.head()

Unnamed: 0,commentId,videoId,commentLikesCount,comments
0,UgxsNXiQRd3_KsLgdPV4AaABAg,BRMS3T11Cdw,0,I am agricultural engineering in machine ...
1,Ugxyg6P0ohA7sUMdyTN4AaABAg,BRMS3T11Cdw,1,i am from the mechanical engineering bac...
2,UgynJARwxVpywHiSMPh4AaABAg,BRMS3T11Cdw,0,What are the prerequisite for this cours...
3,UgxurtxhcNEWR_1sAW54AaABAg,BRMS3T11Cdw,1,there are many videos which one to st...
4,Ugy9hN39JI5EjSPpx694AaABAg,BRMS3T11Cdw,0,www mltut com For machine Learning blogs


In [74]:
# Get the sentiment polarity of each comment using TextBlob

In [7]:
comments_clean['comment_polarity'] = comments_clean['comments'].apply(lambda x: TextBlob(x).sentiment.polarity)

In [8]:
comments_clean = comments_clean.sample(frac=1).reset_index(drop=True)

In [75]:
# Convert polarity to a label: 0 -> negative (or neutral) sentiment, 1 -> positive sentiment

In [9]:
# Label is 1 when the polarity is positive, otherwise 0.
# Computing it in one step avoids chained assignment and the SettingWithCopyWarning it raises.
comments_clean['comment_pol_cat'] = (comments_clean['comment_polarity'] > 0).astype(int)
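
The labelling rule used here can be sketched as a plain function. Note that a polarity of exactly 0 (neutral) falls into class 0 together with the negative comments; that follows the thresholds in the cell above. The sample polarity values below are hypothetical.

```python
def polarity_to_label(polarity):
    """Map a polarity in [-1, 1] to 1 (positive) or 0 (neutral or negative)."""
    return 1 if polarity > 0 else 0


sample_polarities = [0.35, 0.0, -0.6]  # hypothetical values
print([polarity_to_label(p) for p in sample_polarities])
```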


In [10]:
from nltk.corpus import stopwords
from nltk import word_tokenize
import string
import re
import nltk

In [15]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
stop_words = set(stopwords.words('english'))
def remove_stopwords(line):
    word_tokens = word_tokenize(line)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return " ".join(filtered_sentence)
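
For readers without NLTK installed, the same idea can be sketched with the standard library only. This version assumes whitespace tokenisation and a tiny hypothetical stopword set; NLTK's English list is much longer and `word_tokenize` splits punctuation more carefully. It also lowercases each word before the lookup, because NLTK's stopwords are lowercase and a case-sensitive match would likely keep capitalised words such as "I".

```python
# Hypothetical mini stopword list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"i", "am", "the", "from", "a", "is", "this", "for"}


def remove_stopwords_simple(line, stop_words=TOY_STOP_WORDS):
    """Drop stopwords using whitespace tokenisation and a case-insensitive match."""
    kept = [w for w in line.split() if w.lower() not in stop_words]
    return " ".join(kept)


print(remove_stopwords_simple("I am from the mechanical engineering background"))
```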

In [76]:
# This step will remove all the stopwords

In [16]:
comments_clean['stop_comments'] = comments_clean['comments'].apply(lambda x : remove_stopwords(x))

In [17]:
tokenized_comment = comments_clean['stop_comments'].apply(lambda x: x.split())

In [18]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [19]:
tokenized_comment = tokenized_comment.apply(lambda x: [stemmer.stem(i) for i in x]) # stemming

In [20]:
# Join each list of stemmed tokens back into a single string
tokenized_comment = tokenized_comment.apply(' '.join)

In [21]:
comments_clean['token_comments'] = tokenized_comment

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(stop_words='english')
# bag-of-words feature matrix
bow_dataset = bow_vectorizer.fit_transform(comments_clean['token_comments'])

In [78]:
# This step gives us a vectorised form of the data that we can use for training

In [24]:
bow_dataset.toarray().shape

(274, 1019)
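
The shape above means 274 comments (rows) by 1019 vocabulary terms (columns). As a rough illustration of what a bag-of-words matrix contains, here is a standard-library sketch. It is not how `CountVectorizer` is implemented: it assumes simple whitespace tokenisation and applies no stopword filtering, and the two documents are hypothetical stemmed comments.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary gives each term a stable column index.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["machin learn cours", "learn machin learn blog"]  # hypothetical stemmed comments
vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
```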

In [44]:
comment_label = comments_clean[['comment_pol_cat']].values

In [79]:
# Concatenating the comment labels and our comment data in vectorised form

In [49]:
result_dataset = pd.DataFrame(np.concatenate((comment_label,bow_dataset.toarray()),axis = 1))

In [51]:
# Splitting in training and testing set
train_data, test_data = np.split(result_dataset.sample(frac=1, random_state=1729), [int(0.8 * len(result_dataset))])
print(train_data.shape, test_data.shape)

(219, 1020) (55, 1020)
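
The 80/20 split can be sketched without pandas as a seeded shuffle followed by a cut at `int(0.8 * n)`. With 274 rows the cut lands at 219, matching the shapes printed above. The helper name is hypothetical, and the shuffle order will differ from the one `DataFrame.sample` produces, so the individual rows in each part will not be the same.

```python
import random


def train_test_split_rows(rows, train_fraction=0.8, seed=1729):
    """Shuffle rows with a fixed seed and cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = train_test_split_rows(range(274))
print(len(train_rows), len(test_rows))
```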


In [80]:
# Now that we have the data in the required form, let's look at the first few rows
train_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019
26,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
222,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
165,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [81]:
test_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019
115,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
260,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
263,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
167,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Saving the train and test data in the S3 bucket

In [62]:
# Save the training data locally, then upload it to the bucket
train_data.to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


In [63]:
# Save the test data locally, then upload it to the bucket
test_data.to_csv('test.csv',index = False,header = False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


## Using the built-in XGBoost algorithm to train in a container

In [64]:
# Look up the image URI of the built-in XGBoost container for the current region.
# Specify the repo_version depending on your preference.
container = get_image_uri(boto3.Session().region_name,
                          'xgboost', 
                          repo_version='1.0-1')

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


In [68]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":50
        }
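
The dictionary above mixes string values with an integer `num_round`. The SageMaker training API stores hyperparameters as a string-to-string map, so one option is to normalise every value to a string before passing the dictionary on. This is a sketch under that assumption; `normalise_hyperparameters` is a hypothetical helper, and the SDK may already perform an equivalent conversion internally.

```python
def normalise_hyperparameters(params):
    """Return a copy of the dict with every value converted to a string."""
    return {name: str(value) for name, value in params.items()}


hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": 50,
}
print(normalise_hyperparameters(hyperparameters))
```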

In [69]:
# construct a SageMaker estimator that calls the xgboost-container
# (SDK v1 names this parameter image_name; it is renamed to image_uri in SDK v2)
estimator = sagemaker.estimator.Estimator(image_name=container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          train_instance_count=1, 
                                          train_instance_type='ml.m5.2xlarge', 
                                          train_volume_size=5, # 5 GB 
                                          output_path=output_path,
                                          )

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


## With the estimator ready, we can train on the training and validation datasets

In [70]:
estimator.fit({'train': s3_input_train,'validation': s3_input_test})

2020-10-24 14:10:55 Starting - Starting the training job...
2020-10-24 14:10:58 Starting - Launching requested ML instances......
2020-10-24 14:12:26 Starting - Preparing the instances for training...
2020-10-24 14:12:54 Downloading - Downloading input data...
2020-10-24 14:13:00 Training - Downloading the training image.
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[14:13:29] 219x1019 matrix with 223161 entries loaded from /opt/ml/input/data/tra

In [None]:
# Depending on the validation error, we can tune the estimator's hyperparameters and retrain.
# Once training is done, we can deploy the model in SageMaker with the command below.

## Model Deployment

In [71]:
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-------------!
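
Once the endpoint is up, a request needs the same feature columns as the training data but without the label column. The sketch below assumes the endpoint is invoked with `text/csv` content and returns comma-separated probabilities, which is the usual behaviour of the built-in XGBoost container with `binary:logistic`. The helper names, the sample response body, and the 0.5 threshold are all assumptions for illustration.

```python
def to_csv_payload(features):
    """Serialise one feature row (no label column) as a CSV line."""
    return ",".join(str(value) for value in features)


def parse_predictions(body, threshold=0.5):
    """Turn a comma- or newline-separated string of scores into 0/1 labels."""
    tokens = body.replace("\n", ",").split(",")
    scores = [float(token) for token in tokens if token.strip()]
    return [1 if score > threshold else 0 for score in scores]


print(to_csv_payload([0, 2, 1]))
print(parse_predictions("0.81,0.12"))  # hypothetical response body
```

The payload string would be sent to the predictor created above, and the decoded response passed to `parse_predictions`.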

## Our model has now been deployed in SageMaker successfully