# Sentiment Analysis

## Using XGBoost in SageMaker


I constructed a gradient-boosted decision tree model with XGBoost to predict the sentiment of a movie review.

## Step 1: Downloading the data

The dataset is very popular among researchers in Natural Language Processing, usually referred to as the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/). It consists of movie reviews from the website [imdb.com](http://www.imdb.com/), each labeled as either '**pos**itive', if the reviewer enjoyed the film, or '**neg**ative' otherwise.

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

I began by using some Jupyter Notebook magic to download and extract the dataset.

In [None]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

## Step 2: Preparing the data

The data downloaded is split into various files, each of which contains a single review. It will be much easier going forward if we combine these individual files into two large files, one for training and one for testing.

In [None]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [None]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

In [None]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels and test labels
    return data_train, data_test, labels_train, labels_test

In [None]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

In [None]:
train_X[100]

## Step 3: Processing the data

Now that the training and testing datasets are merged and ready to use, I started processing the raw data into something usable by the machine learning algorithm. I removed any HTML formatting that may appear in the reviews and performed some standard natural language processing to homogenize the data.
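
To make the cleaning steps concrete, here is a minimal, standard-library-only sketch of the same idea applied to one string. It is a simplified stand-in, not the pipeline used in this notebook: `toy_review_to_words` and `TOY_STOPWORDS` are hypothetical names, the tag removal is a crude regular expression rather than a real HTML parser, the stop-word list is tiny, and no stemming is done.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "a", "an", "is", "was", "this", "and", "of", "it"}

def toy_review_to_words(review):
    """Simplified stand-in for the cleaning steps: no real HTML parser, no stemming."""
    text = re.sub(r"<[^>]+>", " ", review)             # drop anything that looks like an HTML tag
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # lower-case, replace punctuation with spaces
    return [w for w in text.split() if w not in TOY_STOPWORDS]

toy_words = toy_review_to_words("This movie was GREAT!<br />The acting is superb.")
print(toy_words)
```

The real `review_to_words` adds Porter stemming on top of this, so "acting" would be reduced to a stem rather than kept as-is.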

In [None]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [None]:
import re
from bs4 import BeautifulSoup

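# Build the stop-word set once; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words("english"))

def review_to_words(review):
    """Convert a raw review string into a list of stemmed, lower-case words."""
    text = BeautifulSoup(review, "html.parser").get_text()  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # Lower-case; replace non-alphanumerics with spaces
    words = text.split()  # Split string into words
    words = [w for w in words if w not in english_stopwords]  # Remove stopwords
    words = [stemmer.stem(w) for w in words]  # Stem with the PorterStemmer created above
    
    return words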

In [None]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [None]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

### Extract Bag-of-Words features

Rather than feeding the raw reviews to the model, I transformed each review into a Bag-of-Words feature representation: a fixed-length vector that counts how often each word of a chosen vocabulary occurs in the review.
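
As a small illustration of the Bag-of-Words idea, the sketch below builds a vocabulary from the most frequent words of two toy reviews and encodes each review as a count vector. It uses only the standard library and loosely mirrors what `CountVectorizer` does with `max_features`. The helper names `build_vocabulary` and `encode` and the toy reviews are made up for this example.

```python
from collections import Counter

def build_vocabulary(documents, vocabulary_size):
    """Map the most frequent words across all documents to column indices."""
    counts = Counter(word for doc in documents for word in doc)
    most_common = [word for word, _ in counts.most_common(vocabulary_size)]
    # Sort alphabetically so the column order is deterministic
    return {word: idx for idx, word in enumerate(sorted(most_common))}

def encode(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in document:
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

# Hypothetical, already-tokenized toy reviews
toy_docs = [["great", "movi", "great", "act"],
            ["bad", "movi", "bad", "plot"]]
toy_vocab = build_vocabulary(toy_docs, vocabulary_size=3)
toy_features = [encode(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_features)
```

Every review ends up with the same number of features (the vocabulary size), which is what a tabular model such as XGBoost needs.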

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
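try:
    import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
except ImportError:
    from sklearn.externals import joblib  # fallback for older scikit-learn releases
# joblib is an enhanced version of pickle that is more efficient for storing NumPy arrays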

def extract_BoW_features(words_train, words_test, vocabulary_size=5000,
                         cache_dir=cache_dir, cache_file="bow_features.pkl"):
    """Extract Bag-of-Words for a given set of documents, already preprocessed into words."""
    
    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = joblib.load(f)
            print("Read features from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Fit a vectorizer to training documents and use it to transform them
        # NOTE: Training documents have already been preprocessed and tokenized into words;
        #       pass in dummy functions to skip those steps, e.g. preprocessor=lambda x: x
        vectorizer = CountVectorizer(max_features=vocabulary_size,
                preprocessor=lambda x: x, tokenizer=lambda x: x)  # already preprocessed
        features_train = vectorizer.fit_transform(words_train).toarray()

        # Apply the same vectorizer to transform the test documents (ignore unknown words)
        features_test = vectorizer.transform(words_test).toarray()
        
        # NOTE: .toarray() converts the sparse count matrix into a dense NumPy array, which is
        #       easier to work with here but uses considerably more memory
        
        # Write to cache file for future runs (store vocabulary as well)
        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train=features_train, features_test=features_test,
                             vocabulary=vocabulary)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                joblib.dump(cache_data, f)
            print("Wrote features to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        features_train, features_test, vocabulary = (cache_data['features_train'],
                cache_data['features_test'], cache_data['vocabulary'])
    
    # Return both the extracted features as well as the vocabulary
    return features_train, features_test, vocabulary

In [None]:
# Extract Bag of Words features for both training and test datasets
train_X, test_X, vocabulary = extract_BoW_features(train_X, test_X)

## Step 4: Classification using XGBoost

After creating the feature representation of our training (and testing) data, it is time to start setting up and using the XGBoost classifier provided by SageMaker.

### Writing the dataset

The XGBoost classifier requires the dataset to be written to a file and stored using Amazon S3. To do this, I started by splitting the training dataset into two parts: the data the model is trained on and a validation set. Then, I wrote those datasets to files and uploaded them to S3. In addition, I wrote the test set input to a file and uploaded it to S3 so that I can use SageMaker's Batch Transform functionality to test the model once I've fit it.

In [None]:
import pandas as pd

val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])

The SageMaker documentation for the XGBoost algorithm requires that the saved datasets contain no header row or index column and that, for the training and validation data, the label comes first in each row.

For more information about this and other algorithms, the SageMaker developer documentation can be found on __[Amazon's website.](https://docs.aws.amazon.com/sagemaker/latest/dg/)__
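
To show what that file layout looks like, here is a minimal sketch that writes two toy samples with the standard `csv` module. The notebook itself uses `pd.concat(...).to_csv(..., header=False, index=False)` for this. `write_labeled_csv` and the toy values are hypothetical.

```python
import csv
import io

def write_labeled_csv(handle, labels, features):
    """Write one row per sample: label first, then features; no header, no index."""
    writer = csv.writer(handle, lineterminator="\n")
    for label, row in zip(labels, features):
        writer.writerow([label] + list(row))

# Hypothetical toy data: two samples with three bag-of-words counts each
toy_labels = [1, 0]
toy_features = [[0, 2, 1], [3, 0, 0]]

buffer = io.StringIO()
write_labeled_csv(buffer, toy_labels, toy_features)
csv_text = buffer.getvalue()
print(csv_text)
```

The first column of each line is the label and the remaining columns are the features. The test file follows the same layout but without the label column.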

In [None]:
# First we make sure that the local directory in which we'd like to store the training and validation csv files exists.
data_dir = '../data/xgboost'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [None]:
# First, save the test data to test.csv in the data_dir directory. Note that we do not save the associated ground truth
# labels, instead we will use them later to compare with our model output.

pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

# Save the training and validation data to train.csv and validation.csv in the data_dir directory.
# Make sure that the files you create are in the correct format.

pd.concat([val_y, val_X], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([train_y, train_X], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

In [None]:
# To save a bit of memory we can set test_X, train_X, val_X, train_y and val_y to None.

test_X = train_X = val_X = train_y = val_y = None

### Uploading Training / Validation files to S3

Amazon's S3 service allows us to store files that can be accessed by SageMaker's built-in training algorithms, such as the XGBoost model.

For tasks using SageMaker, there are two methods we could use. The first is to use the low level functionality of SageMaker which requires knowing each of the objects involved in the SageMaker environment. The second is to use the high level functionality in which certain choices have been made on the user's behalf. The low level approach benefits from allowing the user a great deal of flexibility while the high level approach makes development much quicker. For my purposes I will opt to use the high level approach although using the low-level approach is certainly an option.

The method `upload_data()` is a member of the object representing the current SageMaker session. It uploads the data to the default bucket (which is created if it does not exist), under the path given by the `key_prefix` argument.

For additional resources, see the __[SageMaker API documentation](http://sagemaker.readthedocs.io/en/latest/)__ and in addition the __[SageMaker Developer Guide.](https://docs.aws.amazon.com/sagemaker/latest/dg/)__
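
As a rough illustration of where `upload_data()` puts a single file, the sketch below rebuilds the expected URI by hand. It never contacts AWS. `expected_s3_uri` and the bucket name `example-bucket` are hypothetical, and the `s3://<bucket>/<key_prefix>/<filename>` layout reflects my understanding of the SDK's behaviour for single-file uploads rather than a guarantee.

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the URI at which a single uploaded file is expected to land."""
    # Only the file name is kept; the local directory part is dropped
    filename = posixpath.basename(local_path.replace("\\", "/"))
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, filename))

# 'example-bucket' is a placeholder, not the real default bucket name
toy_uri = expected_s3_uri("example-bucket", "sentiment-xgboost", "../data/xgboost/train.csv")
print(toy_uri)
```

In the notebook, the value returned by `session.upload_data()` is what should actually be passed on to training and batch transform jobs.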

In [None]:
import sagemaker

session = sagemaker.Session() # Store the current SageMaker session

# S3 prefix (which folder will we use)
prefix = 'sentiment-xgboost'

# Upload the test.csv, train.csv and validation.csv files which are contained in data_dir to S3 using session.upload_data().
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

### Creating the XGBoost model

Now that the data has been uploaded it is time to create the XGBoost model. A model is comprised of three different objects in the SageMaker ecosystem, which interact with one another.

- Model Artifacts
- Training Code (Container)
- Inference Code (Container)

The Model Artifacts are what you might think of as the actual model itself. For example, if you were building a neural network, the model artifacts would be the weights of the various layers. In this case, for an XGBoost model, the artifacts are the actual trees that are created during training.

The other two objects, the training code and the inference code, are then used to manipulate the model artifacts. More precisely, the training code uses the training data that is provided and creates the model artifacts, while the inference code uses the model artifacts to make predictions on new data.

The way that SageMaker runs the training and inference code is by making use of Docker containers. A container is a way of packaging code up so that dependencies aren't an issue.

In [None]:
from sagemaker import get_execution_role

# Our current execution role is required when creating the model as the training
# and inference code will need to access the model artifacts.
role = get_execution_role()

In [None]:
# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.
# As a matter of convenience, the training and inference code both use the same container.
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(session.boto_region_name, 'xgboost')

In [None]:
# Create a SageMaker estimator using the container location determined in the previous cell.
# A single training instance of type ml.m4.xlarge is used, and the model artifacts are written
# to 's3://{}/{}/output'.format(session.default_bucket(), prefix).
xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # What is our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

# Set the XGBoost hyperparameters on the xgb object. The label is binary, so the
# 'binary:logistic' objective is used.
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

### Fit the XGBoost model

Now that the model has been set up I simply need to attach the training and validation datasets and then ask SageMaker to set up the computation.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

In [None]:
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

### Testing the model

Now that I've fit our XGBoost model, it's time to see how well it performs. To do this I will use SageMaker's Batch Transform functionality. Batch Transform is a convenient way to perform inference on a large dataset in a way that is not realtime. That is, I don't necessarily need to use the model's results immediately and instead I can perform inference on a large number of samples. An example of this in industry might be producing an end-of-month report. This method of inference is also useful here because it means I can perform inference on the entire test set.

To perform a Batch Transform job I first need to create a transformer object from our trained estimator object.

In [None]:
#       Create a transformer object from the trained model. Using an instance count of 1 and an instance type of ml.m4.xlarge
#       should be more than enough.
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Next I will perform the transform job. When doing so I need to make sure to specify the type of data sent so that it is serialized correctly in the background. In this case I am providing our model with csv data so I specify `text/csv`. Also, if the test data that I have provided is too large to process all at once then I need to specify how the data file should be split up. Since each line is a single entry in the data set I tell SageMaker that it can split the input on each line.
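
To illustrate what splitting the input on each line makes possible, the sketch below groups whole CSV lines into batches that stay within a size limit, so no record is ever cut in half. This is a conceptual illustration only and not SageMaker's implementation. `split_into_batches` and the `max_bytes` limit are hypothetical.

```python
def split_into_batches(payload, max_bytes):
    """Group whole lines into batches whose encoded size stays within max_bytes.

    A single line longer than max_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for line in payload.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        if current and current_size + size > max_bytes:
            batches.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append("".join(current))
    return batches

toy_payload = "0,2,1\n3,0,0\n1,1,1\n"
toy_batches = split_into_batches(toy_payload, max_bytes=12)
print(toy_batches)
```

Because each line is one complete sample, the batches can be processed independently and their outputs concatenated in order.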

In [None]:
# Start the transform job. Make sure to specify the content type and the split type of the test data.
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

The transform job is now running in the background. Since I want to wait until it is done, and I would like a bit of feedback, I can run the `wait()` method.

In [None]:
xgb_transformer.wait()

Now the transform job has executed and the result, the estimated sentiment of each review, has been saved on S3. Since we would rather work on this file locally we can perform a bit of notebook magic to copy the file to the `data_dir`.

In [None]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

The last step is to read in the output from our model, convert it to something a little more usable, and compare it to the ground truth labels. In this case I want the sentiment to be either `1` (positive) or `0` (negative).

In [None]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)
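
For reference, the conversion and the metric can be sketched in plain Python. The helpers `to_labels` and `accuracy` and the toy numbers are hypothetical. One detail worth knowing is that Python's built-in `round()` rounds halves to the nearest even integer, so `round(0.5)` is `0`. An explicit threshold makes the treatment of borderline scores unambiguous.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels with an explicit threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical model outputs and ground-truth labels
toy_probabilities = [0.91, 0.12, 0.5, 0.47]
toy_truth = [1, 0, 1, 1]

toy_predictions = to_labels(toy_probabilities)  # 0.5 maps to 1 here, unlike round(0.5)
toy_accuracy = accuracy(toy_truth, toy_predictions)
print(toy_predictions, toy_accuracy)
```

With real model outputs an exact `0.5` is rare, so the two approaches will usually agree.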

## Step 5: Looking at New Data

So now I have an XGBoost sentiment analysis model that I believe is working pretty well. As a result, I deployed it and I'm using it in some sort of app.

However, as users use the app I periodically record submitted movie reviews to perform some quality control on the deployed model. Once I've accumulated enough reviews I go through them by hand and evaluate whether they are positive or negative. The reason for doing this is so that I can check to see how well the model is doing.

In [None]:
import new_data

new_X, new_Y = new_data.get_new_data()

**NOTE:** The `new_data` module assumes that the cache created earlier in Step 3 is still stored in `../cache/sentiment_analysis`.

### Testing the current model

Now that I've loaded the new data, I'll check how the current XGBoost model performs on it.

The data loaded has already been pre-processed so that each entry in `new_X` is a list of words that have been processed using `nltk`. However, I have not yet constructed the bag of words encoding, which I will do now.

I will use the vocabulary constructed earlier using the original training data to construct a `CountVectorizer` which I will use to transform the new data into its bag of words encoding.

In [None]:
# Create the CountVectorizer using the previously constructed vocabulary
vectorizer = CountVectorizer(vocabulary=vocabulary,
                preprocessor=lambda x: x, tokenizer=lambda x: x)

# Transform our new data set and store the transformed data in the variable new_XV
new_XV = vectorizer.transform(new_X).toarray()

As a quick sanity check, we make sure that the length of each of our bag of words encoded reviews is correct. In particular, it must be the same size as the vocabulary which in our case is `5000`.

In [None]:
len(new_XV[100])

Now that I've performed the data processing that is required by the model I can save it locally and then upload it to S3 so that I can construct a batch transform job in order to see how well the model is working.

First, I save the data locally.

In [None]:
# Save the data contained in new_XV locally in the data_dir with the file name new_data.csv

pd.DataFrame(new_XV).to_csv(os.path.join(data_dir, 'new_data.csv'), header=False, index=False)

Next, I upload the data to S3.

In [None]:
#       Upload the new_data.csv file contained in the data_dir folder to S3 and save the resulting
#       URI as new_data_location
new_data_location = session.upload_data(os.path.join(data_dir, 'new_data.csv'), key_prefix=prefix)

Then, once the new data has been uploaded to S3, I create and run the batch transform job to get the model's predictions about the sentiment of the new movie reviews.

In [None]:
# Using xgb_transformer, transform the new_data_location data, then wait until
# the batch transform job has finished.

xgb_transformer.transform(new_data_location, content_type='text/csv', split_type='Line')
xgb_transformer.wait()

As usual, we copy the results of the batch transform job to our local instance.

In [None]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

Read in the results of the batch transform job.

In [None]:
predictions = pd.read_csv(os.path.join(data_dir, 'new_data.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

And check the accuracy of our current model.

In [None]:
accuracy_score(new_Y, predictions)

So it would appear that *something* has changed since the model is no longer (as) effective at determining the sentiment of a user provided review.

In a real life scenario you would check a number of different things to see what exactly is going on. In this case, I am only going to check one: whether some aspect of the underlying distribution has changed. In other words, I want to see if the words that appear in our new collection of reviews match the words that appear in the original training set. Of course, I want to narrow our scope a little bit, so I will only look at the `5000` most frequently appearing words in each data set, that is, the vocabulary generated by each data set.

Before doing that, however, let's take a look at some of the incorrectly classified reviews in the new data set.

To start, I will deploy the original XGBoost model. I will then use the deployed model to infer the sentiment of some of the new reviews. This will also serve as a nice excuse to deploy our model so that I can mimic a real life scenario where I have a model that has been deployed and is being used in production.

In [None]:
# Deploy the model that was created earlier. Recall that the object name is 'xgb'.

xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

### Diagnose the problem

Now that I have our deployed "production" model, I can send some of the new data to it and filter out some of the incorrectly classified reviews.

In [None]:
from sagemaker.predictor import csv_serializer

# I need to tell the endpoint what format the data sent is in so that SageMaker can perform the serialization.
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

It will be useful to look at a few different examples of incorrectly classified reviews so I will start by creating a *generator* which I will use to iterate through some of the new reviews and find ones that are incorrect.

In [None]:
def get_sample(in_X, in_XV, in_Y):
    for idx, smp in enumerate(in_X):
        res = round(float(xgb_predictor.predict(in_XV[idx])))
        if res != in_Y[idx]:
            yield smp, in_Y[idx]

In [None]:
gn = get_sample(new_X, new_XV, new_Y)

At this point, `gn` is the *generator* which yields samples from the new data set that are not classified correctly. To get the *next* sample I simply pass the generator to the built-in `next()` function.

In [None]:
print(next(gn))
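
The same pattern can be tried without a deployed endpoint. In the sketch below, `toy_predict` is a hypothetical stand-in for the call to the endpoint, and `misclassified` is a generic version of the generator above. Passing a default to `next()` avoids a `StopIteration` error once every misclassified sample has been seen.

```python
def misclassified(samples, labels, predict):
    """Lazily yield (sample, true_label) pairs that `predict` gets wrong."""
    for sample, label in zip(samples, labels):
        if predict(sample) != label:
            yield sample, label

# Hypothetical stand-in for the endpoint: calls a review positive if it mentions 'great'
def toy_predict(words):
    return 1 if "great" in words else 0

toy_reviews = [["great", "movi"], ["not", "great"], ["bad", "plot"], ["love", "it"]]
toy_labels = [1, 0, 0, 1]

toy_gen = misclassified(toy_reviews, toy_labels, toy_predict)
first = next(toy_gen)
second = next(toy_gen)
exhausted = next(toy_gen, None)  # returns None instead of raising StopIteration
print(first, second, exhausted)
```

Because the generator is lazy, the predictor is only called for as many samples as are needed to find the next mistake, which matters when each prediction is a network request.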

After looking at a few examples, I decide to look at the most frequently appearing `5000` words in each data set, the original training data set and the new data set. The reason for looking at this is that I expect the frequency of use of different words may have changed. Maybe some new slang has been introduced, or some other artifact of popular culture has changed the way that people write movie reviews.

To do this, I start by fitting a `CountVectorizer` to the new data.

In [None]:
new_vectorizer = CountVectorizer(max_features=5000,
                preprocessor=lambda x: x, tokenizer=lambda x: x)
new_vectorizer.fit(new_X)

Now that I have this new `CountVectorizer` object, I can check to see if the corresponding vocabulary has changed between the two data sets.

In [None]:
original_vocabulary = set(vocabulary.keys())
new_vocabulary = set(new_vectorizer.vocabulary_.keys())

I can look at the words that were in the original vocabulary but not in the new vocabulary.

In [None]:
print(original_vocabulary - new_vocabulary)

And similarly, I can look at the words that are in the new vocabulary but which were not in the original vocabulary.

In [None]:
print(new_vocabulary - original_vocabulary)

These words themselves don't tell us much. However, if one of them occurred with a large frequency, that might tell us something. In particular, I wouldn't really expect any of the words above to appear with too much frequency.
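
One way to act on that observation is to rank the words that are new to the vocabulary by how often they occur in the new data. The sketch below does this on toy data with the standard library only. `top_words` and `new_words_by_frequency` are hypothetical helpers, and the toy reviews are made up.

```python
from collections import Counter

def top_words(documents, size):
    """Return a dict of the `size` most frequent words and their counts."""
    counts = Counter(word for doc in documents for word in doc)
    return dict(counts.most_common(size))

def new_words_by_frequency(original_docs, new_docs, size):
    """List words in the new top-`size` vocabulary but not in the original one, most frequent first."""
    original_vocab = set(top_words(original_docs, size))
    new_counts = top_words(new_docs, size)
    added = [(word, count) for word, count in new_counts.items() if word not in original_vocab]
    return sorted(added, key=lambda item: (-item[1], item[0]))

# Hypothetical tokenized reviews
toy_original = [["great", "movi", "bad"], ["bad", "movi", "great", "plot"]]
toy_new = [["great", "banana", "banana", "movi"], ["banana", "movi", "plot", "great"]]

toy_added = new_words_by_frequency(toy_original, toy_new, size=3)
print(toy_added)
```

A word that is new to the vocabulary and sits near the top of this list is a strong hint that the data has changed, or that something odd has been injected into it.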

### Build a new model

Supposing that something has changed about the underlying distribution of the words that the reviews are made up of, I need to create a new model. This way the new model will take into account whatever it is that has changed.

To begin with, I will use the new vocabulary to create a bag of words encoding of the new data. I will then use this data to train a new XGBoost model.

**NOTE:** Because I believe that the underlying distribution of words has changed, it should follow that the original vocabulary that I used to construct a bag-of-words encoding of the reviews is no longer valid. This means that I need to be careful with the data. If I send the new model a bag-of-words encoded review built with the *original* vocabulary, I should not expect any sort of meaningful results.

In particular, this means that if I had deployed our XGBoost model then I would need to implement this vocabulary change in the Lambda function as well.
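
To see why mixing vocabularies is a problem, the sketch below compares two toy vocabularies and reports every word whose column index differs or that exists in only one of them. `vocabulary_mismatches` and both toy vocabularies are hypothetical.

```python
def vocabulary_mismatches(old_vocabulary, new_vocabulary):
    """Return {word: (old_column, new_column)} for every word whose column changed.

    A column of None means the word is missing from that vocabulary.
    """
    mismatched = {}
    for word in set(old_vocabulary) | set(new_vocabulary):
        old_col = old_vocabulary.get(word)
        new_col = new_vocabulary.get(word)
        if old_col != new_col:
            mismatched[word] = (old_col, new_col)
    return mismatched

# Hypothetical word -> column mappings
toy_old = {"bad": 0, "great": 1, "movi": 2}
toy_new = {"bad": 0, "banana": 1, "great": 2}

toy_mismatches = vocabulary_mismatches(toy_old, toy_new)
print(toy_mismatches)
```

Even a word that appears in both vocabularies ("great" here) can move to a different column, so a vector encoded with one vocabulary means something else entirely to a model trained with the other.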

In [None]:
new_XV = new_vectorizer.transform(new_X).toarray()

And a quick check to make sure that the newly encoded reviews have the correct length, which should be the size of the new vocabulary created.

In [None]:
len(new_XV[0])

Now that I have our newly encoded, newly collected data, I can split it up into a training and validation set so that we can train a new XGBoost model. As usual, I first split up the data, then save it locally and then upload it to S3.

In [None]:
import pandas as pd

# Earlier we shuffled the training dataset so to make things simple we can just assign
# the first 10 000 reviews to the validation set and use the remaining reviews for training.
new_val_X = pd.DataFrame(new_XV[:10000])
new_train_X = pd.DataFrame(new_XV[10000:])

new_val_y = pd.DataFrame(new_Y[:10000])
new_train_y = pd.DataFrame(new_Y[10000:])

In order to save some memory I will effectively delete the `new_X` variable. 

In [None]:
new_X = None

Next I save the new training and validation sets locally. Note that I overwrite the `new_data.csv` file saved earlier with the re-encoded data rather than keeping both copies. This is mostly because the amount of space available on the notebook instance is limited. Of course, you can increase this if you'd like, but doing so may increase the cost of running the notebook instance.

In [None]:
pd.DataFrame(new_XV).to_csv(os.path.join(data_dir, 'new_data.csv'), header=False, index=False)

pd.concat([new_val_y, new_val_X], axis=1).to_csv(os.path.join(data_dir, 'new_validation.csv'), header=False, index=False)
pd.concat([new_train_y, new_train_X], axis=1).to_csv(os.path.join(data_dir, 'new_train.csv'), header=False, index=False)

Now that I've saved our data to the local instance, I can safely delete the variables to save on memory.

In [None]:
new_val_y = new_val_X = new_train_y = new_train_X = new_XV = None

Lastly, I upload the new training and validation sets to S3.

In [None]:
#  Upload the new data and the new validation.csv and train.csv files in the data_dir directory to S3.
new_data_location = session.upload_data(os.path.join(data_dir, 'new_data.csv'), key_prefix=prefix)
new_val_location = session.upload_data(os.path.join(data_dir, 'new_validation.csv'), key_prefix=prefix)
new_train_location = session.upload_data(os.path.join(data_dir, 'new_train.csv'), key_prefix=prefix)

Once the new training data has been uploaded to S3, I can create a new XGBoost model that will take into account the changes that have occurred in our data set.

In [None]:
# First, create a SageMaker estimator object for our model.
new_xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # What is our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

# Then set the algorithm-specific parameters, using the same values that were
# used when training the original model.
new_xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

Once the model has been created, I can train it with our new data.

In [None]:
# First, create s3 input objects so that SageMaker knows where to
# find the training and validation data.

s3_new_input_train = sagemaker.s3_input(s3_data=new_train_location, content_type='csv')
s3_new_input_validation = sagemaker.s3_input(s3_data=new_val_location, content_type='csv')

In [None]:
# Using the new validation and training data, 'fit' your new model.
new_xgb.fit({'train': s3_new_input_train, 'validation': s3_new_input_validation})

### Check the new model

So now I have a new XGBoost model that I believe more accurately represents the state of the world at this time, at least in how it relates to the sentiment analysis problem that I'm working on. The next step is to double check that the model is performing reasonably.

To do this, I will first test our model on the new data.

**Note:** In practice this is a pretty bad idea. I already trained our model on the new data, so testing it shouldn't really tell us much. In fact, this is sort of a textbook example of leakage. I'm only doing it here so that I have a numerical baseline.
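
To make the leakage concern concrete, here is a small, self-contained sketch that is not part of the pipeline in this notebook. It uses a hypothetical "memorizing" classifier and random labels, both chosen purely for illustration. The classifier scores perfectly on the rows it was trained on, while on unseen rows it can do no better than chance.

```python
import random


def make_random_dataset(n_rows, n_features, rng):
    """Rows of random features with labels that carry no real signal."""
    X = [tuple(rng.random() for _ in range(n_features)) for _ in range(n_rows)]
    y = [rng.randint(0, 1) for _ in range(n_rows)]
    return X, y


class MemorizingClassifier:
    """Hypothetical model that simply memorizes its training rows."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        return self

    def predict(self, X):
        # Rows never seen during training fall back to a constant guess of 0.
        return [self.lookup.get(row, 0) for row in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(0)
train_X, train_y = make_random_dataset(1000, 5, rng)
held_out_X, held_out_y = make_random_dataset(1000, 5, rng)

model = MemorizingClassifier().fit(train_X, train_y)
train_accuracy = accuracy(train_y, model.predict(train_X))
held_out_accuracy = accuracy(held_out_y, model.predict(held_out_X))

print("accuracy on training data:", train_accuracy)
print("accuracy on held-out data:", held_out_accuracy)
```

The gap between the two numbers is the point: a score measured on training data says little about how a model behaves on data it has not seen.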

First, I create a new transformer based on our new XGBoost model.

In [None]:
# Create a transformer object from the new_xgb model
new_xgb_transformer = new_xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Next I test our model on the new data.

In [None]:
# Using new_xgb_transformer, transform the new_data_location data. 

new_xgb_transformer.transform(new_data_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()

Copy the results to our local instance.

In [None]:
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir

And see how well the model did.

In [None]:
predictions = pd.read_csv(os.path.join(data_dir, 'new_data.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [None]:
accuracy_score(new_Y, predictions)
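
The two cells above turn the model's scores into labels and compare them with the true labels. The sketch below spells out that computation in plain Python, using made-up scores and labels (assumptions for illustration only, not output from the model). It uses an explicit threshold rather than `round`, because Python's built-in `round` rounds exact halves to the nearest even integer, so `round(0.5)` is `0`.

```python
def to_labels(scores, threshold=0.5):
    """Map each predicted probability to 1 (positive) or 0 (negative)."""
    return [1 if score >= threshold else 0 for score in scores]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical scores and labels, not output from the model above.
example_scores = [0.91, 0.12, 0.50, 0.47, 0.78]
example_labels = [1, 0, 1, 1, 1]

example_predictions = to_labels(example_scores)
print(example_predictions)
print(accuracy(example_labels, example_predictions))
```

With an explicit threshold, the treatment of a score of exactly 0.5 is a visible choice rather than a side effect of the rounding rule.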

As expected, since I trained the model on this data, the model performs pretty well. On its own this is weak evidence, given the leakage noted above, but it is at least consistent with the new XGBoost model being a "better" model.

However, before I start changing the deployed model, I should first make sure that the new model isn't too different. In other words, if the new model performed really poorly on the original test data then this might be an indication that something else has gone wrong.

To start with, since I got rid of the variable that stored the original test reviews, I will read them in again from the cache that we created in Step 3. Note that I need to make sure that we read in the original test data after it has been pre-processed with `nltk` but before it has been bag of words encoded. This is because we need to use the new vocabulary instead of the original one.

In [None]:
cache_data = None
with open(os.path.join(cache_dir, "preprocessed_data.pkl"), "rb") as f:
    cache_data = pickle.load(f)
    print("Read preprocessed data from cache file:", "preprocessed_data.pkl")

test_X = cache_data['words_test']
test_Y = cache_data['labels_test']

# Here we set cache_data to None so that it doesn't occupy memory
cache_data = None

Once I've loaded the original test reviews, I need to create a bag of words encoding of them using the new vocabulary that we created, based on the new data.

In [None]:
# Use the new_vectorizer object that you created earlier to transform the test_X data.
test_X = new_vectorizer.transform(test_X).toarray()
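
The original test reviews have to be re-encoded because each column of a bag-of-words matrix corresponds to one vocabulary word, so a model trained on the new vocabulary expects the new column layout. A minimal pure-Python sketch of that idea follows. The two tiny vocabularies and the review are made up for illustration, and the function is a stand-in for what `new_vectorizer.transform` does, not its actual implementation.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count tokens per vocabulary column; out-of-vocabulary tokens are dropped."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabularies: the column order is part of the encoding.
old_vocabulary = ["bad", "good", "plot"]
new_vocabulary = ["bad", "banana", "good"]

review = ["good", "plot", "good", "banana"]

old_encoding = bag_of_words(review, old_vocabulary)
new_encoding = bag_of_words(review, new_vocabulary)

print(old_encoding)
print(new_encoding)
```

The same review produces different vectors under the two vocabularies, which is why data encoded with the original vocabulary cannot be fed directly to the new model.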

Now that I have correctly encoded the original test data, I can write it to the local instance, upload it to S3 and test it.

In [None]:
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

In [None]:
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)

In [None]:
new_xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()

In [None]:
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir

In [None]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [None]:
accuracy_score(test_Y, predictions)

It would appear that the new XGBoost model is performing quite well on the old test data. This gives us some indication that our new model should be put into production and replace our original model.

## Step 6: Updating the Model

So I have a new model that I'd like to use instead of one that is already deployed. Furthermore, I'm assuming that the model that is already deployed is being used in some sort of application. As a result, what I want to do is update the existing endpoint so that it uses our new model.

Of course, to do this we need to create an endpoint configuration for the newly created model.

First, note that I can access the name of the model that I created above using the `model_name` property of the transformer. This works because, in order to run a batch transform job, the transformer first has to create the model object inside SageMaker. Since that model object already exists, I can reuse it here.

In [None]:
new_xgb_transformer.model_name

Next, I create an endpoint configuration using the low level approach of creating the dictionary object which describes the endpoint configuration we want.

In [None]:
from time import gmtime, strftime


# Give our endpoint configuration a name. Remember, it needs to be unique.
new_xgb_endpoint_config_name = "sentiment-update-xgboost-endpoint-config-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Using the SageMaker Client, construct the endpoint configuration.
new_xgb_endpoint_config_info = session.sagemaker_client.create_endpoint_config(
                            EndpointConfigName = new_xgb_endpoint_config_name,
                            ProductionVariants = [{
                                "InstanceType": "ml.m4.xlarge",
                                "InitialVariantWeight": 1,
                                "InitialInstanceCount": 1,
                                "ModelName": new_xgb_transformer.model_name,
                                "VariantName": "XGB-Model"
                            }])
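
Because endpoint configuration names need to be unique, the cell above appends a UTC timestamp to a fixed prefix. That pattern can be wrapped in a small helper, sketched below. The function name `timestamped_name` is an assumption for illustration and is not part of the SageMaker SDK.

```python
from time import gmtime, strftime


def timestamped_name(prefix, moment=None):
    """Append a UTC timestamp (second resolution) to prefix.

    moment is a time.struct_time; it defaults to the current UTC time.
    """
    if moment is None:
        moment = gmtime()
    return prefix + strftime("%Y-%m-%d-%H-%M-%S", moment)


# gmtime(0) is the Unix epoch, used here only to get a reproducible example.
example_name = timestamped_name("sentiment-update-xgboost-endpoint-config-", gmtime(0))
print(example_name)
```

Note that the timestamp only has second resolution, so two names generated within the same second would collide.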

Once the endpoint configuration has been constructed, it is a straightforward matter to ask SageMaker to update the existing endpoint so that it uses the new endpoint configuration.

Of note here is that SageMaker does this in such a way that there is no downtime. Essentially, SageMaker deploys the new model and then updates the original endpoint so that it points to the newly deployed model. After that, the original model is shut down. This way, whatever app is using our endpoint won't notice that I've changed the model that is being used.

In [None]:
# Update the xgb_predictor.endpoint so that it uses new_xgb_endpoint_config_name.
session.sagemaker_client.update_endpoint(EndpointName=xgb_predictor.endpoint, EndpointConfigName=new_xgb_endpoint_config_name)

And, as is generally the case with SageMaker requests, this is being done in the background so if I want to wait for it to complete I need to call the appropriate method.

In [None]:
session.wait_for_endpoint(xgb_predictor.endpoint)
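
Conceptually, waiting on a background request amounts to polling its status until it leaves the in-progress state. The sketch below shows that loop with a stand-in status function instead of a real AWS call. The helper `wait_until_done`, the fake status sequence and the choice of in-progress status strings are assumptions for illustration, not the SageMaker SDK's implementation.

```python
import time


def wait_until_done(get_status, in_progress=("Creating", "Updating"),
                    poll_seconds=0.0, max_polls=100):
    """Call get_status() repeatedly until it returns a status that is not in progress."""
    for _ in range(max_polls):
        status = get_status()
        if status not in in_progress:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("status was still in progress after {} polls".format(max_polls))


# Stand-in for a real status lookup: reports 'Updating' twice, then 'InService'.
fake_statuses = iter(["Updating", "Updating", "InService"])
final_status = wait_until_done(lambda: next(fake_statuses))
print(final_status)
```

Bounding the number of polls means a stuck update surfaces as an error instead of blocking the notebook indefinitely.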

## Step 7: Delete the Endpoint

Of course, since I am done with the deployed endpoint, I need to make sure to shut it down; otherwise I will continue to be charged for it.

In [None]:
xgb_predictor.delete_endpoint()

## Clean up

The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook.

In [None]:
# First we will remove all of the files contained in the data_dir directory
!rm $data_dir/*

# And then we delete the directory itself
!rmdir $data_dir

# Similarly we will remove the files in the cache_dir directory and the directory itself
!rm $cache_dir/*
!rmdir $cache_dir
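
The shell commands above assume that each directory contains only regular, non-hidden files: `rm dir/*` skips dotfiles and does not remove subdirectories, and in either case the following `rmdir` fails. A Python alternative is `shutil.rmtree`, sketched below on a throwaway temporary directory so that it is safe to run. The helper name `remove_dir` is an assumption for illustration.

```python
import os
import shutil
import tempfile


def remove_dir(path):
    """Delete path and everything inside it; do nothing if it is already gone."""
    if os.path.isdir(path):
        shutil.rmtree(path)


# Build a throwaway directory with a file, a hidden file, and a subdirectory.
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "nested"))
for name in ("train.csv", ".hidden", os.path.join("nested", "test.csv")):
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("0,1,2\n")

remove_dir(demo_dir)
print(os.path.exists(demo_dir))
```

In this notebook the equivalent calls would be `remove_dir(data_dir)` and `remove_dir(cache_dir)`.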