In [1]:
%mkdir -p ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2019-01-02 09:03:55--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2019-01-02 09:04:07 (6.84 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [2]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [3]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [4]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    

    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    return data_train, data_test, labels_train, labels_test

In [5]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


In [6]:
print(train_X[100])
print(train_y[100])

This film is famous for several qualities: a literate script, for once in partly-religious film-making, by Philip Dunne, some very good performances, a first-rate production in every department and its intelligent direction by veteran Henry King. If one were making a film, then getting such talents as Leon Shamroy as cinematographer, Lyle Wheeler as art director and Alfred Newman as composer of original music would guarantee a quality production. Add the cast of this film, including Gregory Peck and Susan Hayward as the title characters, James Robertson Justice, Raymond Massey, Kieron Moore, Jayne Meadows and John Sutton plus a dance by Gwen Verdon and expectations might be raised that the resulting film could be made into something special. But in a biblical subject script, usually a sub-genre prone to illogical motivations and miraculous interventions, everything would ultimately depend on the author's skills. Philip Dunne here has supplied human beings, a rare achievement in biblica

In [7]:
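import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    # Build the stopword set once; calling stopwords.words() for every token is slow.
    stop_words = set(stopwords.words("english"))
    
    text = BeautifulSoup(review, "html.parser").get_text()  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # Lowercase, replace punctuation with spaces
    words = text.split()  # Split on whitespace
    words = [w for w in words if w not in stop_words]  # Remove stopwords
    words = [stemmer.stem(w) for w in words]  # Stem each word
    
    return words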

In [8]:
# TODO: Apply review_to_words to a review (train_X[100] or any other review)
review_to_words(train_X[100])

['film',
 'famou',
 'sever',
 'qualiti',
 'liter',
 'script',
 'partli',
 'religi',
 'film',
 'make',
 'philip',
 'dunn',
 'good',
 'perform',
 'first',
 'rate',
 'product',
 'everi',
 'depart',
 'intellig',
 'direct',
 'veteran',
 'henri',
 'king',
 'one',
 'make',
 'film',
 'get',
 'talent',
 'leon',
 'shamroy',
 'cinematograph',
 'lyle',
 'wheeler',
 'art',
 'director',
 'alfr',
 'newman',
 'compos',
 'origin',
 'music',
 'would',
 'guarante',
 'qualiti',
 'product',
 'add',
 'cast',
 'film',
 'includ',
 'gregori',
 'peck',
 'susan',
 'hayward',
 'titl',
 'charact',
 'jame',
 'robertson',
 'justic',
 'raymond',
 'massey',
 'kieron',
 'moor',
 'jayn',
 'meadow',
 'john',
 'sutton',
 'plu',
 'danc',
 'gwen',
 'verdon',
 'expect',
 'might',
 'rais',
 'result',
 'film',
 'could',
 'made',
 'someth',
 'special',
 'biblic',
 'subject',
 'script',
 'usual',
 'sub',
 'genr',
 'prone',
 'illog',
 'motiv',
 'miracul',
 'intervent',
 'everyth',
 'would',
 'ultim',
 'depend',
 'author',
 'skill

**Question:** Above we mentioned that the `review_to_words` method removes HTML formatting and tokenizes the words found in a review, for example converting *entertained* and *entertaining* into *entertain* so that they are treated as the same word. What else, if anything, does this method do to the input?

**Answer:** The method also converts the text to lower case, replaces punctuation (any non-alphanumeric character) with spaces, and removes English stopwords.
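
The normalization steps can be traced on a short string. The sketch below is a simplified stand-in for `review_to_words`, not the method itself: it uses only the standard library, a tiny hypothetical stopword set (`TOY_STOPWORDS`) in place of NLTK's English list, and it skips HTML stripping and stemming, so its output will differ from the real method's.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "was", "and", "it", "a", "i"}

def normalize(text):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [w for w in text.split() if w not in TOY_STOPWORDS]

tokens = normalize("The plot was GREAT, and I loved it!")
print(tokens)
```

With this toy stopword set, only `plot`, `great` and `loved` survive: case and punctuation are gone, and the stopwords have been dropped.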

The method below applies `review_to_words` to every review in the training and testing datasets and caches the results, because this processing step can take a long time. If you are unable to complete the notebook in the current session, you can come back later without needing to process the data a second time.
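
The caching idea can be shown in isolation. This is a minimal sketch, assuming a hypothetical helper `load_or_compute` and a stand-in `slow_step` function; it is not the notebook's `preprocess_data`, only the same read-or-recompute pattern on toy data.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the cached result if it can be read; otherwise compute and cache it."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True  # True: served from the cache
    except (OSError, EOFError, pickle.UnpicklingError):
        pass  # No usable cache file yet
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result, False

# Hypothetical stand-in for the slow preprocessing step.
def slow_step():
    return [["good", "film"], ["bad", "plot"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cache.pkl")
    first, first_cached = load_or_compute(path, slow_step)
    second, second_cached = load_or_compute(path, slow_step)
```

The first call computes and writes the file; the second call reads it back without running `slow_step` again.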

In [9]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis") 
os.makedirs(cache_dir, exist_ok=True)  

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
  
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, EOFError, pickle.UnpicklingError):
            pass  # No usable cache file; fall through and recompute
    
   
    if cache_data is None:
       
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
      
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
       
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [10]:

train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl


In [11]:
import numpy as np

def build_dict(data, vocab_size = 5000):
   
    word_count = {}
    for sentence in data:
        for word in sentence:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
  
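    # Sort the words from most to least frequent.
    sorted_words = sorted(word_count, key=word_count.get, reverse=True)
    
    # Ids 0 and 1 are reserved for 'no word' and 'infrequent word', so the
    # vocabulary keeps the vocab_size - 2 most frequent words, starting at id 2.
    word_dict = {}
    for idx, word in enumerate(sorted_words[:vocab_size - 2]):
        word_dict[word] = idx + 2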
        
    return word_dict

In [12]:
word_dict = build_dict(train_X)

**Question:** What are the five most frequently appearing (tokenized) words in the training set? Does it make sense that these words appear frequently in the training set?

**Answer:** The five most frequent words are *movi*, *film*, *one*, *like* and *time*. Most of them are words you would expect in movie reviews; the appearance of *one* in the list is the interesting exception.
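
An alternative way to get the most frequent words is `collections.Counter` from the standard library. The sketch below runs on a hypothetical toy corpus (`toy_reviews`), not on the IMDb data, and is only meant to show the counting technique.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenized reviews.
toy_reviews = [
    ["movi", "film", "movi"],
    ["movi", "film", "good"],
    ["movi", "bad"],
]

# Count every token across all reviews, then take the two most frequent.
counts = Counter(word for review in toy_reviews for word in review)
top_two = [word for word, _ in counts.most_common(2)]
print(top_two)
```

On the real data, the same call with `train_X` and `most_common(5)` should agree with the first five keys of `word_dict`, up to the ordering of words with equal counts.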

In [13]:

# Dictionaries preserve insertion order, so the first five keys are the five most frequent words.
for word in list(word_dict)[:5]:
    print(word)

movi
film
one
like
time


In [14]:
data_dir = '../data/pytorch' 
if not os.path.exists(data_dir): 
    os.makedirs(data_dir)

In [15]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

In [16]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 
    INFREQ = 1 
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [17]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

In [18]:
# Use this cell to examine one of the processed reviews to make sure everything is working as intended.
train_X[0]

array([3871,    1,   59,  807,    1,   79,    2,   12, 3028,  168,   10,
        297, 1377,  137,   61,   22,   49,  197, 1377,   20,   28, 1059,
          1,  841,  197, 2412,  228,  719,  335,    3, 2186,    3,  125,
        113,  128,  387,   54, 1235,   23, 1812, 3049,   59,  807,  835,
         36,  109, 4248, 1812,  546,    4,   84,  740,  488,    1,  626,
        105,    5,  627,  295, 2349,    2,   92,  561,  465,    5, 4584,
       2462,    9, 3166, 2851,  557,   71,   13,  283,   10,  854,   13,
        283,   10,  314,  734,   53,   43,  284, 1571,    1, 1465,    1,
       1308, 3533, 1696, 2132,  700,    1,  700,    1,  140,    4,   53,
         60,   56,   47,    3,    7,  719,  965, 1740,    1, 1272,  343,
        282,    1,  879,  486, 4909, 1696, 2132,   33,    1,   55, 2013,
       3067,  700,    1,  286,    1, 1793, 1425, 1275,    1,   16,  594,
         37, 1696, 2132,  487,  700,    1,  594,   29,   68,  180,   26,
        271,  384,   17, 3180,   93,   18,   45, 15

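**Question:** In the cells above we use the `preprocess_data` and `convert_and_pad_data` methods to process both the training and testing set. Why might this be, or not be, a problem?

**Answer:** Although the reviews contain different numbers of words, `convert_and_pad_data` pads or truncates every review to the same fixed length (500 entries), so short reviews are stored mostly as padding zeros, which can be memory intensive. Applying the same steps to the test set does not by itself leak test information, because `word_dict` is built from the training reviews only.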
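
The effect of fixed-length padding is easy to see on small inputs. The sketch below uses a hypothetical helper `pad_ids` with a short `pad=8` for readability; it mirrors the pad-or-truncate behaviour of `convert_and_pad` but works on word ids directly.

```python
def pad_ids(word_ids, pad=8, no_word=0):
    """Truncate or right-pad a list of word ids to exactly `pad` entries."""
    clipped = word_ids[:pad]
    return clipped + [no_word] * (pad - len(clipped)), len(clipped)

short, short_len = pad_ids([12, 7, 1])             # 3 ids, padded with 5 zeros
long, long_len = pad_ids(list(range(2, 14)))       # 12 ids, truncated to 8
print(short, short_len)
print(long, long_len)
```

With the notebook's `pad=500`, the 25,000 training reviews become a 25000 x 500 array, that is 12.5 million integers, no matter how short the individual reviews are.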

In [19]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

In [20]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [21]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

In [22]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by settingg up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sig = nn.Sigm

In [23]:
import torch
import torch.utils.data


train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)


train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()


train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)

train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

In [24]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            optimizer.zero_grad()          # Clear gradients from the previous step
            out = model(batch_X)           # Forward pass (calls model.forward)
            loss = loss_fn(out, batch_y)
            loss.backward()                # Backpropagate
            optimizer.step()               # Update the weights
            
            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

In [25]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6899807333946228
Epoch: 2, BCELoss: 0.680549705028534
Epoch: 3, BCELoss: 0.6721559405326843
Epoch: 4, BCELoss: 0.662817370891571
Epoch: 5, BCELoss: 0.6515201926231384


In [26]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.m4.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [27]:
estimator.fit({'training': input_data})

INFO:sagemaker:Creating training-job with name: sagemaker-pytorch-2019-01-02-09-04-41-294


2019-01-02 09:04:44 Starting - Starting the training job...
2019-01-02 09:04:46 Starting - Launching requested ML instances......
2019-01-02 09:05:52 Starting - Preparing the instances for training......
2019-01-02 09:07:01 Downloading - Downloading input data...
2019-01-02 09:07:27 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-01-02 09:07:27,788 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-01-02 09:07:27,791 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-01-02 09:07:27,804 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-01-02 09:07:30,814 sagemaker_pytorch_container.training INFO     Invoking user training script.
2019-01-02 09:07:31,070 sagemaker-containers INFO    

In [28]:
# TODO: Deploy the trained model
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-pytorch-2019-01-02-09-04-41-294
INFO:sagemaker:Creating endpoint with name sagemaker-pytorch-2019-01-02-09-04-41-294


---------------------------------------------------------------------------!

In [29]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [30]:


def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions

In [31]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [32]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.85116

**Question:** How does this model compare to the XGBoost model you created earlier? Why might these two models perform differently on this dataset? Which do *you* think is better for sentiment analysis?

**Answer:** The XGBoost model created earlier had an accuracy of 0.85696, compared with 0.85116 for this model.
Both models produce approximately the same results.

In [33]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

In [34]:

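# Tokenize the review, then convert it to a padded sequence of word ids.
# Note: unlike the rows of test_X sent earlier, this array holds only the
# 500 word ids and does not carry the review length in its first column.
test_data_review_to_words = review_to_words(test_review)
test_data = [np.array(convert_and_pad(word_dict, test_data_review_to_words)[0])]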

Now that we have processed the review, we can send the resulting array to our model to predict the sentiment of the review.

In [35]:
predictor.predict(test_data)

array(0.6346543, dtype=float32)

Since the return value of our model (about `0.63`) is above `0.5`, the model classifies the review we submitted as positive, although not with high confidence.
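
Turning the model's score into a label comes down to a threshold. The sketch below is illustrative only: `to_label` and `margin` are hypothetical helpers, and the 0.5 threshold is an assumption chosen to match the `round(num)` step applied to the batch predictions earlier.

```python
def to_label(score, threshold=0.5):
    """Map a score in [0, 1] to a 0/1 sentiment label."""
    return 1 if score >= threshold else 0

def margin(score, threshold=0.5):
    """Distance from the decision threshold; small values mean a weak call."""
    return abs(score - threshold)

score = 0.6346543  # Value returned by the endpoint in the cell above
label = to_label(score)
print(label, margin(score))
```

The label is positive, but the margin of roughly 0.13 shows that the score sits much closer to the threshold than to `1`. One detail to keep in mind: Python's built-in `round` sends exactly `0.5` to `0`, whereas `to_label` as written sends it to `1`.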

In [36]:
estimator.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2019-01-02-09-04-41-294


In [37]:
!pygmentize serve/predict.py

import argparse
import json
import os
import pickle
import sys
import sagemaker_containers
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data

from model import LSTMClassifier

from utils import review_to_words, 

In [39]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-pytorch-2019-01-02-11-07-56-568
INFO:sagemaker:Creating endpoint with name sagemaker-pytorch-2019-01-02-11-07-56-568


---------------------------------------------------------------!

In [40]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
       
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
       
        for f in files:
            with open(f) as review:
                
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
               
                review_input = review.read().encode('utf-8')
              
                results.append(float(predictor.predict(review_input)))
                
          
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [41]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [42]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.886

In [43]:
predictor.predict(test_review)

b'1.0'

In [44]:
predictor.endpoint

'sagemaker-pytorch-2019-01-02-11-07-56-568'

Now that your web app is working, try playing around with it and see how well it works.

**Question**: Give an example of a review that you entered into your web app. What was the predicted sentiment of your example review?

**Answer:** The review was taken from the Rotten Tomatoes critic reviews for *Spider-Man: Into the Spider-Verse*: "The spectacularly colorful, varied, and busy animation is impressive but bombastic, leaving little room for wonder and suggesting exertion rather than inspiration." The predicted sentiment was positive.

In [45]:
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2019-01-02-11-07-56-568
