
In [None]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2018-12-24 11:55:52--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2018-12-24 11:55:56 (20.3 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [None]:
# Load the raw IMDb reviews and their labels from disk
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f, encoding='utf-8') as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [None]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [None]:
# combine the positive and negative reviews and shuffle the resulting records.
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, and test labels
    return data_train, data_test, labels_train, labels_test
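
`sklearn.utils.shuffle` applies the same random permutation to every sequence it is given, which is what keeps each review paired with its label. A minimal standard-library sketch of that idea follows; the helper name `shuffle_together` is hypothetical and is not part of scikit-learn.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists with one shared permutation."""
    assert len(items) == len(labels), "inputs must have the same length"
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # one permutation, reused for both lists
    return [items[i] for i in order], [labels[i] for i in order]

reviews = ["great film", "awful plot", "loved it", "fell asleep"]
sentiments = [1, 0, 1, 0]
shuffled_reviews, shuffled_sentiments = shuffle_together(reviews, sentiments, seed=0)
```

Whatever order comes out, every review still sits at the same position as its own label.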

In [None]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


In [None]:
print(train_X[100])
print(train_y[100])

Before I'd seen this, I had seen some pretty bad Christmas films. But once I saw this, "Jingle All the Way" looked better than "The Godfather". "Santa Claus" is a jolly film about Santa helping out some kids, but it almost feels demonic watching it. Santa's jolly ho-ho-ho is replaces by an evil, devilish laugh that I'm sure has turned many kids off of Christmas. The plot of this massacre is very strange, which fits along with all of the performances and dialog. Santa lives high above Earth in the North Pole where he, and kids from all around the world get ready for Christmas. But Santa has an enemy named Pitch, or Satan. Pitch tries to ruin Santa's Christmas by making three boys naughty, and by creating diversions, like moving the chimney and making the doorknob hot. When Pitch causes Santa to be attacked by a dog, it's up to Santa's helper Pedro and Merlin the wizard to get Santa out of this pickle. <br /><br />Everything about this film, along with being downright bad, is so bizarre.

In [None]:
# Remove any HTML tags that appear, then tokenize the input
import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))  # build once; set lookup is fast
STEMMER = PorterStemmer()

# review_to_words uses BeautifulSoup to strip HTML tags before tokenizing
def review_to_words(review):
    text = BeautifulSoup(review, "html.parser").get_text()  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # Lowercase; replace non-alphanumerics with spaces
    words = text.split()  # Split string into words
    words = [w for w in words if w not in STOPWORDS]  # Remove stopwords
    words = [STEMMER.stem(w) for w in words]  # Stem each word
    
    return words
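
To see what the normalization steps do without the NLTK or BeautifulSoup dependencies, here is a standard-library sketch. It is a simplified stand-in rather than the function above: tags are stripped with a naive regex, the stopword set is a tiny hypothetical sample, and stemming is omitted.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much longer.
SAMPLE_STOPWORDS = {"the", "is", "a", "of", "and", "this", "it"}

def normalize(review):
    """Strip tags, lowercase, keep alphanumerics, drop sample stopwords."""
    text = re.sub(r"<[^>]+>", " ", review)             # naive tag removal
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # same pattern as review_to_words
    return [w for w in text.split() if w not in SAMPLE_STOPWORDS]

tokens = normalize("The plot is bizarre.<br /><br />Downright BAD!")
print(tokens)  # ['plot', 'bizarre', 'downright', 'bad']
```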

In [None]:
# TODO: Apply review_to_words to a review (train_X[100] or any other review)
review_to_words(train_X[100])

['seen',
 'seen',
 'pretti',
 'bad',
 'christma',
 'film',
 'saw',
 'jingl',
 'way',
 'look',
 'better',
 'godfath',
 'santa',
 'clau',
 'jolli',
 'film',
 'santa',
 'help',
 'kid',
 'almost',
 'feel',
 'demon',
 'watch',
 'santa',
 'jolli',
 'ho',
 'ho',
 'ho',
 'replac',
 'evil',
 'devilish',
 'laugh',
 'sure',
 'turn',
 'mani',
 'kid',
 'christma',
 'plot',
 'massacr',
 'strang',
 'fit',
 'along',
 'perform',
 'dialog',
 'santa',
 'live',
 'high',
 'earth',
 'north',
 'pole',
 'kid',
 'around',
 'world',
 'get',
 'readi',
 'christma',
 'santa',
 'enemi',
 'name',
 'pitch',
 'satan',
 'pitch',
 'tri',
 'ruin',
 'santa',
 'christma',
 'make',
 'three',
 'boy',
 'naughti',
 'creat',
 'divers',
 'like',
 'move',
 'chimney',
 'make',
 'doorknob',
 'hot',
 'pitch',
 'caus',
 'santa',
 'attack',
 'dog',
 'santa',
 'helper',
 'pedro',
 'merlin',
 'wizard',
 'get',
 'santa',
 'pickl',
 'everyth',
 'film',
 'along',
 'downright',
 'bad',
 'bizarr',
 'satan',
 'danc',
 'lot',
 'actual',
 'seem

In [None]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, EOFError, pickle.UnpicklingError):
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [None]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Wrote preprocessed data to cache file: preprocessed_data.pkl


In [None]:
import numpy as np
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)
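
The next cell expects a `word_dict` that maps each word to an integer, but the cell that builds it is not shown here. The sketch below is one plausible way to construct it, assuming indices 0 and 1 stay reserved for `NOWORD` and `INFREQ` and that the vocabulary size is 5000 (the value later passed to `LSTMClassifier`). The function name `build_dict` and its details are assumptions, not the notebook's confirmed code.

```python
from collections import Counter

def build_dict(data, vocab_size=5000):
    """Map the most frequent words to integers starting at 2.

    Indices 0 and 1 are reserved for 'no word' and 'infrequent word',
    so only vocab_size - 2 words receive their own index.
    """
    counts = Counter(word for sentence in data for word in sentence)
    most_common = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: idx + 2 for idx, word in enumerate(most_common)}

toy_reviews = [["santa", "film", "bad"], ["santa", "film"], ["santa"]]
toy_dict = build_dict(toy_reviews, vocab_size=4)
print(toy_dict)  # {'santa': 2, 'film': 3}
```

With this layout, any word missing from the dictionary (here `"bad"`) would be encoded as `INFREQ` by `convert_and_pad`.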

In [None]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

In [None]:
# Use this cell to examine one of the processed reviews to make sure everything is working as intended.
train_X[20]

array([ 135,    2,  424,  838,  168, 1345,   37,  298,    4,  729,    1,
          1, 3508, 2982,  346, 2032,  178,   94,  181,   15,  165,  423,
          1,   52,    3,  433,  156,  353,  383, 4910, 1380, 3508, 2982,
         76,   73,  161,  770,  321,  403,    1,    1,  895, 3417,  235,
       2025, 1826,  586,    1, 3508,   33, 4095, 2014,    1, 1108,  241,
        723,    8,  725,  329, 3508,  122,   16,    1, 1826,    1,  312,
          4,  283, 1871,   52,  247,  237,  725,  400, 3376, 2062,  486,
          3,    1,   65,    1,  338,   33,  642,   81,    1,   99,  465,
         31,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   


In [None]:
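import pandas as pd

# data_dir is the folder for processed data (same path as set later in the notebook)
data_dir = '../data/pytorch'
os.makedirs(data_dir, exist_ok=True)
# Column 0 is the label, column 1 the review length, the rest the padded word indices
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)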

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-805470203735


In [None]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

In [None]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sig = nn.Sigm

In [None]:
import torch
import torch.utils.data

# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

In [None]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch_X, batch_y in train_loader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)

            optimizer.zero_grad()  # clear gradients from the previous step

            out = model(batch_X)  # forward pass
            loss = loss_fn(out, batch_y)
            loss.backward()  # backpropagate
            optimizer.step()  # update the weights

            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

In [None]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6946825623512268
Epoch: 2, BCELoss: 0.6842933416366577
Epoch: 3, BCELoss: 0.675960099697113
Epoch: 4, BCELoss: 0.667292320728302
Epoch: 5, BCELoss: 0.6573679685592652
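
The first-epoch loss of about 0.69 is what binary cross-entropy gives when a model predicts 0.5 for every example, since -ln(0.5) ≈ 0.693. The sketch below computes the mean BCE by hand to make that baseline concrete; `binary_cross_entropy` is a hypothetical helper written for illustration, not `torch.nn.BCELoss` itself.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

# An untrained classifier that outputs 0.5 everywhere sits at ln(2).
chance_loss = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(round(chance_loss, 4))  # 0.6931
```

Seen against that baseline, a loss moving from roughly 0.69 to 0.66 over five epochs on 250 rows is only a small step away from chance, which is expected for such a short sanity-check run.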


In [None]:
# TODO: Deploy the trained model
predictor = estimator.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-pytorch-2018-12-25-11-38-48-314
INFO:sagemaker:Creating endpoint with name sagemaker-pytorch-2018-12-25-11-38-48-314


---------------------------------------------------------------------------!

In [None]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")
cache_file = "preprocessed_data.pkl"
with open(os.path.join(cache_dir, cache_file), "rb") as f:
    cache_data = pickle.load(f)
train_X, test_X = cache_data['words_train'], cache_data['words_test']
train_y, test_y = cache_data['labels_train'], cache_data['labels_test']

In [None]:
data_dir = '../data/pytorch' # The folder we will use for storing data

with open(os.path.join(data_dir, 'word_dict.pkl'), "rb") as f1:
    word_dict=pickle.load(f1)

In [None]:
import pandas as pd
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [None]:
# We split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions
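
`np.array_split` divides the rows into `int(n / rows + 1)` nearly equal chunks, so no chunk exceeds `rows`. The standard-library sketch below reproduces the resulting chunk sizes; the helper `chunk_sizes` is hypothetical and only mirrors the arithmetic, it does not call NumPy or the endpoint.

```python
def chunk_sizes(n_rows, rows=512):
    """Return the chunk lengths produced by splitting n_rows into
    int(n_rows / rows + 1) nearly equal sections."""
    n_chunks = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` sections get one additional row.
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)
print(len(sizes), max(sizes), sum(sizes))  # 49 511 25000
```

For the 25,000 test reviews this means 49 requests of at most 511 rows each.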

In [None]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.84932


In [None]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

In [None]:
# TODO: Convert test_review into a form usable by the model and save the results in test_data
test_data_review_to_words = review_to_words(test_review)

test_data = [np.array(convert_and_pad(word_dict, test_data_review_to_words)[0])]

In [None]:
test_data

[array([   1, 1376,   49,   53,    3,    4,  878,  173,  392,  682,   29,
         724,    2, 4420,  275, 2082, 1061,  760,    1,  582,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

In [None]:
predictor.predict(test_data)

array(0.5848401, dtype=float32)

The model returns about `0.58`, which is above the `0.5` decision threshold, so it classifies the review we submitted as positive, though only with modest confidence.
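
Turning the sigmoid output into a label amounts to comparing it with 0.5. A minimal sketch of that step, with `to_label` as a hypothetical helper rather than anything defined in the notebook:

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to 1 (positive) or 0 (negative)."""
    return 1 if probability >= threshold else 0

score = 0.5848401  # value returned by the endpoint above
label = to_label(score)
margin = score - 0.5  # distance from the decision boundary
print(label, round(margin, 4))  # 1 0.0848
```

The small margin shows that the prediction is positive but sits close to the boundary.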

In [None]:
estimator.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2018-12-25-11-38-48-314


In [None]:
!pygmentize serve/predict.py

import argparse
import json
import os
import pickle
import sys
import sagemaker_containers
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data

from model import LSTMClassifier

from utils import review_to_words, 

In [None]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and encode it as UTF-8 for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(float(predictor.predict(review_input)))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [None]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.854

As an additional test, we can try sending the `test_review` that we looked at earlier.

In [None]:
predictor.predict(test_review)

b'1.0'

In [None]:
predictor.endpoint

'sagemaker-pytorch-2018-12-25-12-16-23-374'

In [None]:
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2018-12-25-12-16-23-374
