# Sentiment Analysis with Apache MXNet and Gluon

This tutorial shows how to train and test a Sentiment Analysis (Text Classification) model on Amazon SageMaker using Apache MXNet and the Gluon API.

## Download training and test data

In this notebook, we train a Sentiment Analysis model on the [SST-2 (Stanford Sentiment Treebank 2) dataset](https://nlp.stanford.edu/sentiment/index.html). This dataset consists of movie reviews with one sentence per review. The task is to classify the review as either positive or negative.

We download the preprocessed version of this dataset from the links below. Each line in the dataset consists of space-separated tokens, with the first token being the label: 1 for positive and 0 for negative.
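As a rough illustration of this line format, the following sketch splits one line into its label and tokens. The function name `parse_line` and the example sentence are hypothetical; they are not taken from the training script or from the dataset.

```python
def parse_line(line):
    """Split one dataset line into (label, tokens).

    Assumes the format described above: space-separated tokens,
    with the first token being the label (1 = positive, 0 = negative).
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    label = int(fields[0])
    if label not in (0, 1):
        raise ValueError("unexpected label: {!r}".format(fields[0]))
    return label, fields[1:]


# Hypothetical example line, not copied from the dataset.
example = "1 a charming and often affecting journey"
label, tokens = parse_line(example)
print(label, tokens)
```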

In [1]:
%%bash
mkdir -p data

curl https://raw.githubusercontent.com/saurabh3949/Text-Classification-Datasets/master/stsa.binary.phrases.train > data/train
curl https://raw.githubusercontent.com/saurabh3949/Text-Classification-Datasets/master/stsa.binary.test > data/test



## Upload the data

We use `sagemaker.s3.S3Uploader` to upload our datasets to an Amazon S3 location. The return value, `inputs`, identifies that location; we use it later when we start the training job.

In [2]:
from sagemaker import s3, session

bucket = session.Session().default_bucket()
inputs = s3.S3Uploader.upload('data', 's3://{}/mxnet-gluon-sentiment-example/data'.format(bucket))

## Implement the training function

We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, but it can also read useful properties of the training environment from various environment variables. In addition, hyperparameters are passed to the script as command-line arguments. For more about writing an MXNet training script for use with SageMaker, see [the SageMaker documentation](https://sagemaker.readthedocs.io/en/stable/using_mxnet.html#prepare-an-mxnet-training-script).
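The following is a minimal sketch of how such a script might read its configuration. The hyperparameter names match the ones used later in this notebook, while `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are environment variables set by the SageMaker training containers. The function name `parse_args`, the argument names `--model-dir` and `--training-channel`, and the local fallback values are assumptions for illustration and may differ from what `sentiment.py` actually does.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--embedding-size", type=int, default=50)
    parser.add_argument("--log-interval", type=int, default=1000)
    # Properties of the training environment come from environment variables.
    # The fallbacks ("model", "data") are assumptions for running locally.
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--training-channel",
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    return parser.parse_args(argv)


# Simulate the arguments SageMaker would pass for two overridden hyperparameters.
args = parse_args(["--epochs", "3", "--learning-rate", "0.05"])
print(args.epochs, args.learning_rate, args.batch_size)
```

Passing `argv` explicitly keeps the function easy to exercise outside a training job; calling `parse_args()` with no argument reads `sys.argv` as usual.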

The script here is a simplified implementation of ["Bag of Tricks for Efficient Text Classification"](https://arxiv.org/abs/1607.01759), as implemented by Facebook's [FastText](https://github.com/facebookresearch/fastText/) for text classification. The model maps each word to a vector and averages the vectors of all the words in a sentence to form a hidden representation of the sentence, which is then fed to a softmax classification layer. For more details, please refer to [the paper](https://arxiv.org/abs/1607.01759).
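To make the forward pass concrete, here is a small sketch in plain Python rather than Gluon, so it runs without MXNet installed. It is not the code from `sentiment.py`: the function names, the tiny vocabulary, and the randomly initialized parameters are all assumptions chosen for illustration.

```python
import math
import random


def average_embeddings(token_ids, embeddings):
    """Mean of the embedding vectors of the given tokens (token_ids must be non-empty)."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for idx in token_ids:
        for d in range(dim):
            total[d] += embeddings[idx][d]
    return [value / len(token_ids) for value in total]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def predict_proba(token_ids, embeddings, weights, bias):
    """Average the word vectors, apply a linear layer, then softmax."""
    hidden = average_embeddings(token_ids, embeddings)
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical sizes and random parameters; a trained model would learn these.
random.seed(0)
vocab_size, embedding_size, num_classes = 6, 4, 2
embeddings = [[random.uniform(-1, 1) for _ in range(embedding_size)]
              for _ in range(vocab_size)]
weights = [[random.uniform(-1, 1) for _ in range(embedding_size)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

# A "sentence" of three token ids; the output is one probability per class.
probs = predict_proba([0, 3, 5], embeddings, weights, bias)
print(probs)
```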

At the end of every epoch, our script also checks the validation accuracy, and checkpoints the best model so far, along with the optimizer state, in the folder `/opt/ml/checkpoints`. (If the folder `/opt/ml/checkpoints` does not exist, this checkpointing step is skipped.)
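The checkpointing rule described above could be sketched as follows. The helper `maybe_checkpoint` and its `save_fn` callback are hypothetical names, not part of the training script; in a Gluon script, `save_fn` would presumably write the network parameters and the trainer state into the given folder.

```python
import os


def maybe_checkpoint(val_accuracy, best_accuracy, save_fn,
                     checkpoint_dir="/opt/ml/checkpoints"):
    """Save a checkpoint when validation accuracy improves.

    Returns the best validation accuracy seen so far. If checkpoint_dir
    does not exist, the save step is skipped.
    """
    if val_accuracy <= best_accuracy:
        return best_accuracy
    if os.path.isdir(checkpoint_dir):
        save_fn(checkpoint_dir)
    return val_accuracy


# Usage sketch: record "saves" in a list instead of writing real files.
# The accuracies are made-up values, and the current directory stands in
# for the checkpoint folder so that it is guaranteed to exist.
saved = []
best = 0.0
for acc in [0.70, 0.68, 0.75]:
    best = maybe_checkpoint(acc, best, saved.append, checkpoint_dir=os.getcwd())
print(best, len(saved))
```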

In [3]:
!pygmentize 'sentiment.py'

from __future__ import print_function

import argparse
import bisect
import json
import logging
import os
import random
import time
from collections import Counter
from itertools import chain, islice

import mxnet as mx
import numpy as np
from mxnet import gluon, autograd, nd
from mxnet.io import DataIter, DataBatch, DataDesc


## Run a SageMaker training job

The `MXNet` class allows us to run our training function on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. In this case, we run our training job on a single `ml.c4.xlarge` instance.

In [9]:
from sagemaker import get_execution_role
from sagemaker.mxnet import MXNet

m2 = MXNet('sentiment.py',
           role=get_execution_role(),
           train_instance_count=1,
           train_instance_type='ml.c4.xlarge',
           framework_version='1.6.0',
           py_version='py3',
           distributions={'parameter_server': {'enabled': True}},
           hyperparameters={'batch-size': 8,
                            'epochs': 2,
                            'learning-rate': 0.01,
                            'embedding-size': 50,
                            'log-interval': 1000})

After we've constructed our `MXNet` estimator, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [10]:
m2.fit(inputs)

2020-05-25 23:12:42 Starting - Starting the training job...
2020-05-25 23:12:44 Starting - Launching requested ML instances......
2020-05-25 23:13:44 Starting - Preparing the instances for training...
2020-05-25 23:14:29 Downloading - Downloading input data...
2020-05-25 23:15:06 Training - Training image download completed. Training in progress..
2020-05-25 23:15:08,173 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2020-05-25 23:15:08,176 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-05-25 23:15:08,190 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"batch-size":8,"embedding-size":50,"epochs":2,"learning-rate":0.01,"log-interval":1000}', 'SM_USER_ENTRY_POINT': 'sentiment.py', 'SM_FRAMEWORK_PARAMS': '{"sagemaker_parameter_server_enabled":true}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1",

DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.07480485737323761,Timestamp=1590448524.556141,IterationNumber=1500)
[Epoch 0 Batch 2000] Training: accuracy=0.775300, 271.801445 samples/s
DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.34132832288742065,Timestamp=1590448530.9327826,IterationNumber=2000)
DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.1680513322353363,Timestamp=1590448538.7392519,IterationNumber=2500)
[Epoch 0 Batch 3000] Training: accuracy=0.794818, 203.684855 samples/s
DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.19035762548446655,Timestamp=1590448547.2512016,IterationNumber=3000)
DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.50604


2020-05-25 23:21:28 Uploading - Uploading generated training model
2020-05-25 23:21:28 Completed - Training job completed
Training seconds: 419
Billable seconds: 419


As the training logs show, the model reaches over 80% accuracy on the test set with the hyperparameters above.
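The training-progress lines in the log output follow a regular pattern, so they can be pulled out programmatically if you want to track accuracy over time. The sketch below assumes the line format seen in the output above; the pattern and the function name `parse_training_line` are illustrative and not part of the SageMaker SDK.

```python
import re

# Matches lines such as:
# [Epoch 0 Batch 3000] Training: accuracy=0.794818, 203.684855 samples/s
LOG_PATTERN = re.compile(
    r"\[Epoch (\d+) Batch (\d+)\] Training: accuracy=([0-9.]+), ([0-9.]+) samples/s"
)


def parse_training_line(line):
    """Return the fields of a training-progress line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, batch, accuracy, speed = match.groups()
    return {
        "epoch": int(epoch),
        "batch": int(batch),
        "accuracy": float(accuracy),
        "samples_per_sec": float(speed),
    }


# Line copied from the log output above.
line = "[Epoch 0 Batch 3000] Training: accuracy=0.794818, 203.684855 samples/s"
parsed = parse_training_line(line)
print(parsed)
```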

After training, we use our `MXNet` object to build and deploy an `MXNetPredictor` object. This creates a SageMaker Endpoint that we can use to perform inference. 

In [18]:
predictor = m2.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Using already existing model: mxnet-training-2020-05-25-23-12-41-684



With our predictor, we can perform inference on a JSON-encoded string array. 

The predictor runs inference on our input data and returns the predicted sentiment (1 for positive and 0 for negative).

In [12]:
data = ["this movie was extremely good .",
        "the plot was very boring .",
        "this film is so slick , superficial and trend-hoppy .",
        "i just could not watch it till the end .",
        "the movie was so enthralling !"]

response = predictor.predict(data)
print(response)

[1, 0, 0, 0, 1]
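For readability, the numeric predictions can be paired with their input sentences and mapped to label names. This is a small sketch: `LABEL_NAMES` and `describe` are hypothetical helpers, and the predictions are copied from the first two values of the response above rather than fetched from the endpoint.

```python
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe(sentences, predictions):
    """Pair each sentence with the name of its predicted label."""
    if len(sentences) != len(predictions):
        raise ValueError("expected one prediction per sentence")
    return [(sentence, LABEL_NAMES[int(pred)])
            for sentence, pred in zip(sentences, predictions)]


sentences = ["this movie was extremely good .",
             "the plot was very boring ."]
predictions = [1, 0]  # first two values of the response shown above

results = describe(sentences, predictions)
for sentence, sentiment in results:
    print("{:<8} {}".format(sentiment, sentence))
```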


In [15]:
data = ["the store was not very good i didnt have a pleasant time"]

response = predictor.predict(data)
print(response)

[1]


In [17]:
data = ["the store was not very good i didnt have a pleasant time",
        "one two three"]

response = predictor.predict(data)
print(response)

[1, 1]


## Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
predictor.delete_endpoint()