# Movie Reviews Sentiment Analysis Web App


## 1. Download the Data


In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2024-04-19 22:27:42--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2024-04-19 22:27:47 (14.6 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## 2. Data Preparation


In [2]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [3]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [4]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [5]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


In [6]:
print(train_X[10])
print(train_y[10])

This movie was for a while in my collection, but it wasn't before a friend of mine reminded me about it  until I decided that I should watch it. I did not know much about Close to Leo  just that it was supposed to be excellent coming out of age movie and it deals with a very serious topic  Aids. <br /><br />Although the person who has aids  is Leo  the scenario wraps around the way in which Marcel (the youngest brother of Leo) coupes with the sickness of his relative. At first everyone is trying to hide the truth from Marcel  he is believed to be too young to understand the sickness of his brother  the fact that Leo is also a homosexual contributes to the unwillingness of the parents to discus the matter with the young Marcel. I know from experience that on many occasions most older people do not want to accept the fact that sometimes even when someone is young this does not automatically means that he will not be able to accept the reality and act in more adequate manner then e

## 3. Data Cleaning

In [9]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    """Convert a raw review string into a list of stemmed, lower-case words."""
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build the stopword set once per call
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case and replace non-alphanumeric characters with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word
    
    return words

In [10]:
review_to_words(train_X[10])

['movi',
 'collect',
 'friend',
 'mine',
 'remind',
 'decid',
 'watch',
 'know',
 'much',
 'close',
 'leo',
 'suppos',
 'excel',
 'come',
 'age',
 'movi',
 'deal',
 'seriou',
 'topic',
 'aid',
 'although',
 'person',
 'aid',
 'leo',
 'scenario',
 'wrap',
 'around',
 'way',
 'marcel',
 'youngest',
 'brother',
 'leo',
 'coup',
 'sick',
 'rel',
 'first',
 'everyon',
 'tri',
 'hide',
 'truth',
 'marcel',
 'believ',
 'young',
 'understand',
 'sick',
 'brother',
 'fact',
 'leo',
 'also',
 'homosexu',
 'contribut',
 'unwilling',
 'parent',
 'discu',
 'matter',
 'young',
 'marcel',
 'know',
 'experi',
 'mani',
 'occas',
 'older',
 'peopl',
 'want',
 'accept',
 'fact',
 'sometim',
 'even',
 'someon',
 'young',
 'automat',
 'mean',
 'abl',
 'accept',
 'realiti',
 'act',
 'adequ',
 'manner',
 'even',
 'except',
 'fact',
 'famili',
 'tri',
 'conceal',
 'truth',
 'marcel',
 'left',
 'quit',
 'impress',
 'way',
 'support',
 'son',
 'even',
 'discov',
 'truth',
 'sexual',
 'sick',
 'fact',
 'allow',


In [11]:
import pickle

cache_dir = os.path.join("../cache", "movie_review_sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test
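
The caching idea in `preprocess_data` can be seen in isolation in the sketch below. It is only an illustration: the `cached` helper, the temporary directory, and the toy computation are hypothetical and are not part of the notebook. The first call computes and pickles the result, and the second call reads it back instead of recomputing.

```python
import os
import pickle
import tempfile

def cached(cache_path, compute):
    """Return (result, was_cache_hit); compute and pickle the result on a miss."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True  # cache hit
    except (OSError, pickle.UnpicklingError, EOFError):
        pass  # no usable cache file, fall through and compute
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result, False

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cache.pkl")  # hypothetical cache file
    first, hit_first = cached(path, lambda: [w.lower() for w in ["Good", "Movie"]])
    # The second compute function is never called because the cache file now exists
    second, hit_second = cached(path, lambda: ["never", "called"])

print(first, hit_first, second, hit_second)
```

Because the cache is keyed only by file name, a stale file has to be deleted by hand whenever the preprocessing logic changes.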

In [None]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)


In [17]:
import numpy as np
from collections import Counter

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""

    words = []
    for sentence in data:
        word = set(sentence)
        words.extend(word)
    word_count = Counter(words) # Maps each word to the number of reviews it appears in (each review contributes a word at most once)
    
    
    sorted_words = sorted(word_count, key=word_count.get, reverse=True)
    
    word_dict = {} # Word dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 saves room for the 'no word' (0)
        word_dict[word] = idx + 2                              # and 'infrequent' (1) labels
        
    return word_dict
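
As a rough illustration of what `build_dict` produces, the sketch below applies the same idea to three toy reviews. The reviews, the `toy_build_dict` name, and `vocab_size=5` are hypothetical, and the alphabetical tie-break is an addition made here only to keep the output deterministic. Because each review is reduced to a set first, a word is counted once per review that contains it.

```python
from collections import Counter

def toy_build_dict(data, vocab_size=5):
    """Hypothetical miniature of build_dict; ids 0 and 1 stay reserved."""
    counts = Counter()
    for sentence in data:
        counts.update(set(sentence))  # one count per review containing the word
    # Sort by count (descending), breaking ties alphabetically
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx + 2 for idx, word in enumerate(ranked[:vocab_size - 2])}

toy_reviews = [["good", "movi", "good"], ["bad", "movi"], ["good", "plot"]]
toy_dict = toy_build_dict(toy_reviews)
print(toy_dict)  # {'good': 2, 'movi': 3, 'bad': 4}
```

Here `plot` falls outside the three available slots, so it would later be encoded as the 'infrequent' id.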

In [18]:
word_dict = build_dict(train_X)

In [19]:
data_dir = '../data/pytorch' # The folder to store the data
if not os.path.exists(data_dir): # Check that the folder exists
    os.makedirs(data_dir)

In [20]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

## 4. Data Transformation

In [21]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # 0 represents the 'no word' category
    INFREQ = 1 # 1 represents the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)
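
To make the encoding concrete, here is a small sketch of the same logic on toy input. The `toy_convert_and_pad` name, the three-word dictionary, and `pad=6` are hypothetical and chosen only so that the result is easy to read; the notebook itself pads to 500.

```python
def toy_convert_and_pad(word_dict, sentence, pad=6):
    """Hypothetical miniature of convert_and_pad using plain lists."""
    NOWORD, INFREQ = 0, 1
    # Known words map to their id, unknown words to INFREQ; long input is truncated
    encoded = [word_dict.get(word, INFREQ) for word in sentence[:pad]]
    # Short input is right-padded with NOWORD up to the fixed width
    encoded += [NOWORD] * (pad - len(encoded))
    return encoded, min(len(sentence), pad)

toy_word_dict = {"good": 2, "movi": 3, "bad": 4}
short = toy_convert_and_pad(toy_word_dict, ["good", "plot", "movi"])
long = toy_convert_and_pad(toy_word_dict, ["bad"] * 8)
print(short)  # ([2, 1, 3, 0, 0, 0], 3)
print(long)   # ([4, 4, 4, 4, 4, 4], 6)
```

The returned length lets a model tell real tokens from padding, which is why it is stored next to each encoded review.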

In [22]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

In [23]:
# Examine one of the processed reviews.
train_X[0], train_X_len[0]

(array([   1,  350, 1473,  612, 1340, 1942, 1578,  228,  595,    1, 1279,
        3681,    1,    1, 4302,  403, 2720,  142,  101,  382,  292,    4,
        1285,    3,   10,   30,  571, 1221,   53,    1,  108, 1158, 3109,
        1978,    1,    1,   78, 2010, 1500, 1505,    3, 4675,    1,  552,
         382, 1479,  163,   69, 2411,   20,  445,   34,    3,  552,  196,
           1,   17,    4,  114,   60, 1001, 4483,    3,    2, 2410, 1593,
          30,  716, 3509, 2451, 1929,  212,    6,  142,  111,  159,  108,
         111,  721,  368,  563,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

## 5. Uploading Data to S3

In [24]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
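
Each row of `train.csv` therefore holds the label first, then the review length, then the padded token ids. The sketch below reproduces that layout with the standard `csv` module on two hypothetical rows (pad width 4 instead of 500) and shows one way a reader can split the columns apart again; it is an illustration, not the project's training script.

```python
import csv
import io

# Hypothetical toy data: two reviews padded to width 4
toy_labels = [1, 0]
toy_lengths = [3, 2]
toy_tokens = [[2, 1, 3, 0], [4, 2, 0, 0]]

# Write rows as: label, length, token ids (no header, no index)
buf = io.StringIO()
writer = csv.writer(buf)
for y, n, row in zip(toy_labels, toy_lengths, toy_tokens):
    writer.writerow([y, n] + row)

# Read the rows back and separate the label from the features
buf.seek(0)
rows = [[int(v) for v in r] for r in csv.reader(buf)]
ys = [r[0] for r in rows]   # column 0: label
xs = [r[1:] for r in rows]  # remaining columns: length followed by token ids
print(ys, xs)
```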

In [25]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'movie_review/sagemaker/sentiment_data'

role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [26]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

## 6. Build Model

In [27]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

In [29]:
!pip install torch


In [30]:
import torch
import torch.utils.data

# Read in the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

In [31]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    """
    This is the training method that is called by the PyTorch training script. The parameters
    passed are as follows:
    model        - The PyTorch model that we wish to train.
    train_loader - The PyTorch DataLoader that should be used during training.
    epochs       - The total number of epochs to train for.
    optimizer    - The optimizer to use during training.
    loss_fn      - The loss function used for training.
    device       - Where the model and data should be loaded (gpu or cpu).
    """
        
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # clear the gradients
            optimizer.zero_grad()
            # forward pass
            outputs = model(batch_X)
            # calculate loss
            loss = loss_fn(outputs, batch_y)
            # backward pass
            loss.backward()
            # optimization step
            optimizer.step()
            
            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

In [32]:
# check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [33]:
import torch.optim as optim
from train.model import LSTMClassifier

model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6946756720542908
Epoch: 2, BCELoss: 0.6845051765441894
Epoch: 3, BCELoss: 0.675368320941925
Epoch: 4, BCELoss: 0.6648377060890198
Epoch: 5, BCELoss: 0.6511309385299683
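
A starting loss near 0.69 is what binary cross-entropy gives when every prediction is 0.5, since -ln(0.5) is about 0.693. The sketch below computes the mean binary cross-entropy by hand on hypothetical probabilities to show that reference point and how the value drops as predictions move towards the correct labels. The `bce` helper and the numbers are illustrative assumptions, not output from the model above.

```python
import math

def bce(probs, targets):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(probs)

targets = [1, 0, 1, 0]
uninformed = bce([0.5, 0.5, 0.5, 0.5], targets)  # model that has learned nothing
confident = bce([0.9, 0.1, 0.8, 0.2], targets)   # mostly correct, fairly confident
print(round(uninformed, 4), round(confident, 4))
```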


In [37]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='1.10.0',
                    py_version='py38',
                    instance_count=1,  # named train_instance_count before sagemaker 2
                    instance_type='ml.m4.xlarge',  # named train_instance_type before sagemaker 2
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })


In [38]:
estimator.fit({'training': input_data})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2024-04-20-00-01-07-679


2024-04-20 00:01:08 Starting - Starting the training job...
2024-04-20 00:01:22 Starting - Preparing the instances for training......
2024-04-20 00:02:19 Downloading - Downloading input data...
2024-04-20 00:02:59 Downloading - Downloading the training image......
2024-04-20 00:03:49 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-04-20 00:03:58,811 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-04-20 00:03:58,813 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-04-20 00:03:58,825 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-04-20 00:03:58,830 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-04-20 00:03:59,020 sagemaker-trainin

Epoch: 1, BCELoss: 0.6682921623697087
Epoch: 2, BCELoss: 0.6077206718678377
Epoch: 3, BCELoss: 0.5563158514548321
Epoch: 4, BCELoss: 0.4570438880093244
Epoch: 5, BCELoss: 0.4085615812515726
Epoch: 6, BCELoss: 0.3655042715218602
Epoch: 7, BCELoss: 0.34548429627807775
Epoch: 8, BCELoss: 0.32394267649066694
Epoch: 9, BCELoss: 0.29707914955761966
Epoch: 10, BCELoss: 0.3147812634706497
2024-04-20 00:46:36,974 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2024-04-20 00:46:43 Uploading - Uploading generated training model

2024-04-20 00:46:54 Completed - Training job completed
Training seconds: 2675
Billable seconds: 2675


## 8. Deploy the Model

In [39]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-2-637423629228/pytorch-training-2024-04-20-00-01-07-679/output/model.tar.gz), script artifact (s3://sagemaker-us-east-2-637423629228/pytorch-training-2024-04-20-00-01-07-679/source/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-2-637423629228/pytorch-training-2024-04-20-00-52-36-559/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-training-2024-04-20-00-52-36-559
INFO:sagemaker:Creating endpoint-config with name pytorch-training-2024-04-20-00-52-36-559
INFO:sagemaker:Creating endpoint with name pytorch-training-2024-04-20-00-52-36-559


------!

## 9. Model Testing

In [40]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [41]:
# We split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions
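
The chunk count used in `predict` keeps every request at or below `rows` rows. The sketch below works out the resulting chunk sizes in plain Python, assuming NumPy's documented `array_split` behaviour (the first `n % k` chunks get one extra row). The `chunk_sizes` helper is hypothetical and only mirrors the arithmetic; it does not call the endpoint.

```python
def chunk_sizes(n_rows, rows=512):
    """Sizes of the chunks that predict() would send for n_rows inputs."""
    n_chunks = int(n_rows / float(rows) + 1)  # same expression as in predict()
    base, extra = divmod(n_rows, n_chunks)
    # array_split gives the first `extra` chunks one additional row
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)
print(len(sizes), max(sizes), min(sizes), sum(sizes))  # 49 511 510 25000
```

For the 25,000 test reviews this gives 49 requests of 510 or 511 rows each.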

In [42]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [43]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.8556

In [44]:
test_review = 'This an Amazing Movie. Very good story and plot'

In [45]:
# Convert test_review into a form usable by the model and save the results in test_data
test_review_words = review_to_words(test_review)
test_review_words, test_review_len = convert_and_pad(word_dict, test_review_words)
test_data = np.hstack((test_review_len, test_review_words))
test_data = test_data.reshape(1, -1)
test_data.shape, test_data[0, :8]

((1, 501), array([  5, 355,   2,   7,  15,  41,   0,   0]))

In [46]:
predictor.predict(test_data)

array(0.66853648)

## 10. Inference Code for Model

In [99]:
!pygmentize serve/predict.py

import argparse
import json
import os
import pickle
import sys
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.

## 11. Deploy the Model

In [101]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

In [103]:
model = PyTorchModel(model_data=estimator.model_data,
                     role=role,
                     framework_version='1.10.0',
                     py_version='py38',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-2-637423629228/pytorch-training-2024-04-20-00-01-07-679/output/model.tar.gz), script artifact (serve), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-2-637423629228/pytorch-inference-2024-04-20-02-41-37-624/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-inference-2024-04-20-02-41-38-385
INFO:sagemaker:Creating endpoint-config with name pytorch-inference-2024-04-20-02-41-38-968
INFO:sagemaker:Creating endpoint with name pytorch-inference-2024-04-20-02-41-38-968


------!


## 12. Testing the Model

In [109]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(int(predictor.predict(review_input, initial_args={'ContentType':'text/plain'})))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [110]:
ground, results = test_reviews()


Starting  pos  files
Starting  neg  files


In [111]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.872

In [115]:
test_review = 'The Movie was very bad! poor!'

In [116]:
predictor.predict(test_review, initial_args={'ContentType':'text/plain'})

b'0'

In [114]:
predictor.endpoint_name


'pytorch-inference-2024-04-20-02-41-38-968'

In [117]:
# https://vtfg80f45l.execute-api.us-east-2.amazonaws.com/Prod

## 13. Model Use for Web App

### Setting up a Lambda function

 1. Create an IAM role for the Lambda function
 2. Create a Lambda function
 3. Set up API Gateway
 4. Deploy the web app

In [123]:
!pygmentize lambda_function.py

import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(EndpointName = 'pytorch-inference-2024-04-20-02-41-38-968',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
