# Sentiment Analysis with an RNN

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis. 

In [36]:
import numpy as np
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sess = sagemaker.Session()
bucket = sess.default_bucket()                    # Set a default S3 bucket
prefix = 'sentiment_rnn'

# read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

## Data pre-processing
First, let's lowercase the text and remove all punctuation. Then split the text on newlines to get the individual reviews, and join and re-split it on whitespace to get a flat list of words.

In [37]:
from string import punctuation

# lowercase to standardize, then get rid of punctuation
reviews = reviews.lower()
all_text = ''.join([c for c in reviews if c not in punctuation])

# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()
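
The same steps can be traced on a tiny input. This is a minimal sketch: the two-review `sample` string and the `sample_*` names are made up for illustration and are not part of the dataset.

```python
from string import punctuation

# Hypothetical two-review sample; the notebook itself reads data/reviews.txt.
sample = "Great movie, loved it!\nTerrible... would not watch again.\n"

# Lowercase, then drop every punctuation character (newlines are kept).
cleaned = ''.join(c for c in sample.lower() if c not in punctuation)

# One entry per review; the trailing newline leaves a final empty string.
sample_reviews = cleaned.split('\n')

# Flat list of all words across the reviews.
sample_words = ' '.join(sample_reviews).split()

print(sample_reviews)
print(sample_words)
```

Note the empty string at the end of `sample_reviews`: a file that ends with a newline produces one zero-length review, which is the kind of entry the outlier-removal step later filters out.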

### Encoding the words

In [38]:
# feel free to use this import 
from collections import Counter

## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
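
To see why the mapping starts at 1, here is a small sketch on a made-up sentence (the `toy_*` names are illustrative only). The most frequent word gets index 1, and index 0 is never assigned, which leaves it free to act as the padding value later on.

```python
from collections import Counter

# Hypothetical word list; "the" is the most frequent word.
toy_words = "the movie was good the acting was good the end".split()

toy_counts = Counter(toy_words)
# Most frequent words first, exactly as in the cell above.
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
# enumerate(..., 1) starts numbering at 1, so 0 stays unused.
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

print(toy_vocab_to_int)
print([toy_vocab_to_int[w] for w in "the movie was good".split()])
```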

**Test your processing code**

In [39]:
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))  # should be roughly 74,000
print()

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  74072

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]


### Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.


In [40]:
# 1=positive, 0=negative label conversion
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])

### Removing Outliers

Reviews vary widely in length, so we bring them into a uniform shape in two steps:

1. Remove extremely long or short reviews (the outliers).
2. Pad or truncate the remaining reviews so that they all have the same length.

First, remove *any* reviews with zero length from the `reviews_ints` list and their corresponding label in `encoded_labels`.

In [41]:
print('Number of reviews before removing outliers: ', len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.

# get indices of any reviews with length 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))

Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000


## Padding sequences

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some `seq_length`, we'll pad with 0s. For reviews longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.
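
A quick sketch of the rule on a single review, using a toy `seq_length` of 5. The helper name `left_pad` is hypothetical and only illustrates the behaviour described above; it is not the notebook's `pad_features`.

```python
import numpy as np

def left_pad(tokens, seq_length):
    """Hypothetical helper: left-pad with 0s, or keep the first seq_length tokens."""
    row = np.zeros(seq_length, dtype=int)
    kept = tokens[:seq_length]       # truncate long reviews
    if kept:                         # an empty review stays all zeros
        row[-len(kept):] = kept      # right-align, so the padding is on the left
    return row

print(left_pad([7, 8, 9], 5))              # short review: [0 0 7 8 9]
print(left_pad([1, 2, 3, 4, 5, 6, 7], 5))  # long review:  [1 2 3 4 5]
```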

In [42]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is left-padded with 0s
        or truncated to the input seq_length.
    '''

    # start from an all-zero array of shape (number of reviews, seq_length)
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)

    # for each review, keep at most its first seq_length tokens and
    # right-align them, so any remaining leading positions stay 0
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]

    return features

# Test your implementation!

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 rows
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0

## Split Training, Validation, Test

In [43]:
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)


In [44]:
# append the labels as the last column of each split
train = np.concatenate((train_x,
                        train_y.reshape(train_y.shape[0], 1)),
                       axis=1)
valid = np.concatenate((val_x,
                        val_y.reshape(val_y.shape[0], 1)),
                       axis=1)
test = np.concatenate((test_x,
                       test_y.reshape(test_y.shape[0], 1)),
                      axis=1)

In [45]:
!rm -rf data/processed
!mkdir -p data/processed
train_file = 'data/processed/train.txt'
valid_file = 'data/processed/valid.txt'
test_file = 'data/processed/test.txt'


# each row holds the 200 token ids followed by the label in the last column
np.savetxt(train_file, train, delimiter=',')
np.savetxt(valid_file, valid, delimiter=',')
np.savetxt(test_file, test, delimiter=',')

## Test loading the data back

In [46]:
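# train = np.loadtxt('data/processed/train.txt', delimiter=',')
# train
import tempfile

# copy the validation file into a temporary file, then load it back by name
with open('data/processed/valid.txt', "rb") as f:
    tfile = tempfile.NamedTemporaryFile(delete=False)
    tfile.write(f.read())
    tfile.close()  # flush to disk before np.loadtxt reopens the file

print(tfile.name)
# use a separate name so the `test` array is not overwritten
loaded = np.loadtxt(tfile.name, delimiter=',')
loaded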

/tmp/tmp32hiy5hs


array([[0.000e+00, 0.000e+00, 0.000e+00, ..., 4.500e+01, 4.000e+00,
        1.000e+00],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 8.870e+02, 4.332e+03,
        0.000e+00],
       [9.273e+03, 4.300e+01, 4.650e+02, ..., 2.000e+00, 5.227e+03,
        1.000e+00],
       ...,
       [4.600e+01, 1.100e+01, 6.000e+00, ..., 1.000e+00, 2.140e+02,
        0.000e+00],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 9.980e+02, 2.700e+02,
        1.000e+00],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 2.418e+03, 2.100e+01,
        0.000e+00]])

In [47]:
train_x = train[:, :-1]
train_y = train[:, -1:].reshape(train.shape[0])
train_x

array([[    0,     0,     0, ...,     8,   215,    23],
       [    0,     0,     0, ...,    29,   108,  3324],
       [22382,    42, 46418, ...,   483,    17,     3],
       ...,
       [    0,     0,     0, ...,    28,    77,   384],
       [    0,     0,     0, ...,     1,  1893,  3610],
       [    0,     0,     0, ...,     2,  2428,     8]])

In [48]:
# Model hyperparameters
vocab_size = len(vocab_to_int)+1 # +1 because index 0 is reserved for padding
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2
vocab_size

74073

### Upload files to S3

In [49]:
s3 = boto3.client('s3')

s3_data_path = dict()

for f in [train_file, valid_file, test_file]:
    filename = f.split('/')[-1]
    object_name = f"{prefix}/{filename}"
    s3.upload_file(f, bucket, object_name)
    
    s3_data_path[filename.split('.')[0]] = f"s3://{bucket}/{object_name}"

### Train

In [50]:
s3_data_path['train']

's3://sagemaker-us-west-2-987720697751/sentiment_rnn/train.txt'

In [55]:
from sagemaker.pytorch import PyTorch

hyperparameters = {"epochs": 4, "batch_size": 50} 

metric_definitions = [{'Name': 'Loss',      'Regex': 'Loss: ([0-9\\.]+)'},
                      {'Name': 'Val_Loss',  'Regex': 'Val_Loss: ([0-9\\.]+)'},
                      {'Name': 'val_loss',  'Regex': 'val_loss: ([0-9\\.]+)'},
                      {'Name': 'val_acc',   'Regex': 'val_accuracy: ([0-9\\.]+)'}]


estimator = PyTorch(
    base_job_name="sentiment-rnn-pytorch",
    entry_point="sentiment_rnn.py", 
    role=role,
    framework_version="1.8.0",
    py_version="py3",
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    enable_sagemaker_metrics=True,
)
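
Each `Regex` in `metric_definitions` is matched against the training job's log output, and the captured group becomes the metric value. The sketch below applies the same patterns with Python's `re` module to a made-up log line; the real line format depends on what `sentiment_rnn.py` prints, which is not shown here, and the stricter pattern is only a suggestion.

```python
import re

# Hypothetical log line; the actual format depends on the training script.
log_line = "Epoch: 1/4 Loss: 0.512345 Val_Loss: 0.634210"

# Same patterns as in metric_definitions above.
loss_pattern = re.compile('Loss: ([0-9\\.]+)')
val_loss_pattern = re.compile('Val_Loss: ([0-9\\.]+)')

loss_matches = loss_pattern.findall(log_line)
val_loss_matches = val_loss_pattern.findall(log_line)

# 'Loss: ' also occurs inside 'Val_Loss: ', so the first pattern matches twice.
print(loss_matches)      # ['0.512345', '0.634210']
print(val_loss_matches)  # ['0.634210']

# A negative lookbehind is one way to keep the two metrics apart.
strict_loss_pattern = re.compile(r'(?<!_)Loss: ([0-9.]+)')
strict_loss_matches = strict_loss_pattern.findall(log_line)
print(strict_loss_matches)  # ['0.512345']
```

If the script does log both values in a form like this, the plain `Loss` pattern may pick up validation losses as well, so it is worth checking the patterns against a few real log lines before relying on the charts.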

In [56]:
inputs = {"train": s3_data_path['train'], 
          "valid": s3_data_path['valid']}

estimator.fit(inputs, wait=True, logs=True)


2022-07-09 18:28:48 Starting - Starting the training job...
2022-07-09 18:29:15 Starting - Preparing the instances for training
ProfilerReport-1657391328: InProgress
.........
2022-07-09 18:30:37 Downloading - Downloading input data...
2022-07-09 18:31:14 Training - Downloading the training image...........................
2022-07-09 18:35:44 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-07-09 18:35:47,275 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-07-09 18:35:47,299 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-07-09 18:35:47,306 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-07-09 18:35:47,824 sagemaker-training-toolkit INFO     Invoking user script
Trainin

In [59]:
from sagemaker.pytorch import PyTorchModel

inference_prefix = "batch_transform"

model_artifact_s3_location = estimator.model_data  # "s3://<BUCKET>/<PREFIX>/model.tar.gz"
s3_output_path = f"s3://{bucket}/{prefix}/{inference_prefix}"
# !aws s3 cp $model_artifact_s3_location model.tar.gz
# !tar -xf model.tar.gz

# Create PyTorchModel from saved model artifact
pytorch_model = PyTorchModel(
    model_data=model_artifact_s3_location,
    role=role,
    framework_version="1.8.0",
    py_version="py3",
    entry_point="sentiment_rnn.py",
)

transformer = pytorch_model.transformer(instance_count=1, 
                                        instance_type="ml.c5.xlarge",
                                        output_path = s3_output_path,
                                        assemble_with = 'Line')

In [60]:
transformer.transform(
    data=s3_data_path['test'],
    data_type="S3Prefix",
    content_type="text/plain",
    split_type='Line',
    wait=True,
)

...........................
2022-07-09 19:31:37,788 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.3.0
TS Home: /opt/conda/lib/python3.6/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 938 M
Python executable: /opt/conda/bin/python3.6
Config file: /etc/sagemaker-ts.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Metrics address: http://127.0.0.1:8082
Model Store: /.sagemaker/ts/models
Initial Models: model.mar
Log dir: /logs
Metrics dir: /logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Prefer direct buffer: false
2022-07-09 1

In [None]:
import re


def get_bucket_and_prefix(s3_output_path):
    # strip the scheme, then split only once so the prefix may contain "/"
    trim = re.sub("^s3://", "", s3_output_path)
    bucket, prefix = trim.split("/", 1)
    return bucket, prefix


local_path = "output"  # Where to save the output locally

bucket, output_prefix = get_bucket_and_prefix(s3_output_path)
print(bucket, output_prefix)

# `sess` is the sagemaker.Session() created in the first cell
sess.download_data(path=local_path, bucket=bucket, key_prefix=output_prefix)
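
As an alternative to the regular expression, the standard library can split an S3 URI. This is a minimal sketch: `split_s3_uri` and the example bucket name are hypothetical and not part of the notebook. The prefix of the output path here contains a `/` (`sentiment_rnn/batch_transform`), so any helper has to keep everything after the bucket together.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Hypothetical helper: return (bucket, key_prefix) for an s3:// URI."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # netloc is the bucket; the path keeps any nested "/" in the prefix
    return parsed.netloc, parsed.path.lstrip("/")

example_bucket, example_prefix = split_s3_uri(
    "s3://example-bucket/sentiment_rnn/batch_transform"
)
print(example_bucket, example_prefix)
```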

In [None]:
import json
import os

# read every downloaded output file and print its predictions
for name in os.listdir(local_path):
    path = os.path.join(local_path, name)
    with open(path, "r") as f:
        pred = json.load(f)
        print(pred)