# Overview

In this example, we use AWS's Comprehend service via boto3 to perform sentiment analysis on Kaggle's [Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview) dataset. We have downloaded the dataset to this repo.

The sentiment labels are:

* 0 - negative
* 1 - somewhat negative
* 2 - neutral
* 3 - somewhat positive
* 4 - positive

# Import Libs

In [42]:
import boto3
import dill
import sys, os, shutil
import pandas as pd
import warnings
import joblib
import sklearn.metrics

# Define Paths

In [11]:
path_data = './sentiment-analysis-on-movie-reviews_data/'

path_home_dir = os.path.expanduser(os.path.join('~','Desktop'))
path_report_dir = os.path.join(path_home_dir, 'AWS_Comprehend')

# Load Data

In [13]:
os.listdir(path_data)

['train.tsv', 'sampleSubmission.csv', 'test.tsv']

In [102]:
df = pd.read_table(os.path.join(path_data,'train.tsv'))

# Bin the five labels into three classes: 0/1 -> negative, 2 -> neutral, 3/4 -> positive.
# A single .map() avoids chained assignment, so no warnings need to be suppressed.
df['Sentiment'] = df['Sentiment'].map({0: 'negative', 1: 'negative',
                                       2: 'neutral',
                                       3: 'positive', 4: 'positive'})

# Optionally slice out a subset for this example.
# Note: the AWS free tier only covers 50K units of text.
# The outputs shown below come from the full dataset, so the sampling line is left commented out.
# df = df.sample(50000)

df

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | negative |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | neutral |
| 2 | 3 | 1 | A series | neutral |
| 3 | 4 | 1 | A | neutral |
| 4 | 5 | 1 | series | neutral |
| ... | ... | ... | ... | ... |
| 156055 | 156056 | 8544 | Hearst 's | neutral |
| 156056 | 156057 | 8544 | forced avuncular chortles | negative |
| 156057 | 156058 | 8544 | avuncular chortles | positive |
| 156058 | 156059 | 8544 | avuncular | neutral |


In [103]:
true_sentiments = list(df['Sentiment'])
phrases = list(df['Phrase'])
len(phrases)

156060
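Because the free-tier allowance is counted in units of text rather than in phrases, it can help to estimate usage before sending anything. The sketch below assumes the billing model described on the Comprehend pricing page (one unit per 100 characters, with a three-unit minimum per document); both constants and the helper name `estimate_units` are assumptions to check against current AWS pricing, not values taken from this notebook.

```python
import math

UNIT_CHARS = 100  # assumed size of one billing unit, in characters
MIN_UNITS = 3     # assumed minimum number of units charged per document

def estimate_units(texts):
    """Estimate billed units for an iterable of strings under the assumptions above."""
    return sum(max(MIN_UNITS, math.ceil(len(text) / UNIT_CHARS)) for text in texts)

# Three made-up documents of 8, 250 and 450 characters.
sample = ["A series", "x" * 250, "y" * 450]
print(estimate_units(sample))  # 3 + 3 + 5 = 11
```

Under these assumptions, every short phrase would count as at least three units, so calling `estimate_units(phrases)` before the run gives a rough idea of how far the request exceeds the free allowance.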

# Load AWS Access Keys

In [104]:
path_access_keys_file = '../accessKeys.csv'

access_keys = pd.read_csv(path_access_keys_file)

personal_access_key = access_keys['Access key ID'].iloc[0]
secret_access_key = access_keys["Secret access key"].iloc[0]

In [105]:
personal_access_key

'AKIA6KBRM63JS2OARHM7'
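Echoing a full key ID into a notebook that may be shared is easy to do by accident. A minimal sketch of masking the value before displaying it is shown here; `mask_key` is a hypothetical helper, not part of boto3.

```python
def mask_key(key_id, visible=4):
    """Return key_id with all but the last `visible` characters replaced by '*'."""
    visible = max(0, min(visible, len(key_id)))
    hidden = len(key_id) - visible
    return "*" * hidden + key_id[hidden:]

# Made-up value used purely for illustration.
print(mask_key("ABCDEFGH"))  # ****EFGH
```

Alternatively, boto3 can pick up credentials from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables or from the shared credentials file, which avoids loading the keys into notebook variables at all.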

# Detect Sentiment

## Start Client

In [106]:
region = 'us-east-1'

compr_client = boto3.client(service_name='comprehend',
                            region_name=region,
                            aws_access_key_id=personal_access_key,
                            aws_secret_access_key=secret_access_key)

## Detect Sentiment

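Below, we send the phrases to the client in batches of 25, the maximum number of documents that `batch_detect_sentiment` accepts per call. Batching reduces the number of requests and speeds things up; the batches themselves are sent one after another.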

In [107]:
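def detect_sentiment(phrases_batch):
    # batch_detect_sentiment accepts at most 25 documents per call
    response = compr_client.batch_detect_sentiment(TextList=phrases_batch,
                                                   LanguageCode='en')

    # Documents that fail are reported in response['ErrorList'] and are absent here
    result_list = response['ResultList']

    # Comprehend returns upper-case labels (POSITIVE, NEGATIVE, NEUTRAL, MIXED)
    pred_sentiments = [result['Sentiment'].lower() for result in result_list]

    return pred_sentiments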
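The function above assumes that every document in a batch comes back in `ResultList`. If Comprehend rejects a document, it is listed in `ErrorList` instead, and the returned predictions would be shorter than the input and shifted relative to the true labels. The sketch below shows one way to keep the two aligned by using the `Index` field of each result; `align_predictions` is a hypothetical helper, and the response shown is hand-written in the documented shape rather than real API output.

```python
def align_predictions(response, batch_size):
    """Return one label per input document, with None where no result was returned."""
    preds = [None] * batch_size
    for item in response.get("ResultList", []):
        # Index is the zero-based position of the document in the submitted TextList.
        preds[item["Index"]] = item["Sentiment"].lower()
    return preds

# Hand-written example response: document 1 failed, documents 0 and 2 succeeded.
fake_response = {
    "ResultList": [
        {"Index": 0, "Sentiment": "POSITIVE"},
        {"Index": 2, "Sentiment": "NEUTRAL"},
    ],
    "ErrorList": [
        {"Index": 1, "ErrorCode": "InternalServerException", "ErrorMessage": "example"},
    ],
}
print(align_predictions(fake_response, 3))  # ['positive', None, 'neutral']
```

Rows whose prediction is `None` can then be dropped from both the predictions and the true labels before scoring.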

In [108]:
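pred_sentiments = []
batch_size = 25
# Integer division drops the final partial batch (the last len(phrases) % 25 phrases)
batches = len(phrases) // batch_size
for batch_idx in range(batches):

    phrases_batch = phrases[batch_idx*batch_size:(batch_idx+1)*batch_size]

    # extend() appends in place instead of rebuilding the list on every iteration
    pred_sentiments.extend(detect_sentiment(phrases_batch))

    print('Progress:', round((batch_idx+1)/batches*100, 2), end='\r')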

Progress: 100.0
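The loop above issues one request at a time and skips any phrases left over after the last full batch of 25. As a rough sketch of an alternative, the code below chunks the input so that the final partial batch is included and submits the batches from a small thread pool. The names `chunked`, `predict_all`, and `fake_detect` are hypothetical; `fake_detect` stands in for the real `detect_sentiment` so the example runs without AWS access. Running this variant on the full dataset would score a few more phrases than the report in this notebook.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size=25):
    """Yield consecutive slices of `items`, including a shorter final slice."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_all(items, detect_fn, size=25, max_workers=4):
    """Apply detect_fn to each batch concurrently and flatten the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = pool.map(detect_fn, chunked(items, size))
        return [label for batch in results for label in batch]

def fake_detect(batch):
    """Stand-in for detect_sentiment: returns one label per document."""
    return ["neutral"] * len(batch)

preds = predict_all(["text"] * 60, fake_detect)
print(len(preds))  # 60
```

Threads are used here because the work is network-bound. AWS may throttle requests that arrive too quickly, so `max_workers` is best kept small.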

In [112]:
print(sklearn.metrics.classification_report(true_sentiments[:len(pred_sentiments)], 
                                            pred_sentiments))



              precision    recall  f1-score   support

       mixed       0.00      0.00      0.00         0
    negative       0.55      0.64      0.59     34342
     neutral       0.75      0.68      0.71     79576
    positive       0.68      0.64      0.66     42132

    accuracy                           0.66    156050
   macro avg       0.50      0.49      0.49    156050
weighted avg       0.69      0.66      0.67    156050
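The `mixed` row appears because Comprehend can return a MIXED label that has no counterpart among the true labels, so its support is 0 and its scores are 0.00. That zero row is still included in the macro average, for example (0.00 + 0.55 + 0.75 + 0.68) / 4 ≈ 0.50 for precision. The sketch below recomputes the macro averages over the three real classes only, using the rounded figures from the report, so the results are approximate; `macro_average` is a hypothetical helper.

```python
# Per-class scores copied from the report above (already rounded to two decimals).
report = {
    "negative": {"precision": 0.55, "recall": 0.64, "f1": 0.59},
    "neutral": {"precision": 0.75, "recall": 0.68, "f1": 0.71},
    "positive": {"precision": 0.68, "recall": 0.64, "f1": 0.66},
}

def macro_average(scores, metric):
    """Unweighted mean of one metric across the classes in `scores`."""
    return sum(row[metric] for row in scores.values()) / len(scores)

for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_average(report, metric), 2))
```

With `mixed` excluded, the macro averages come out at roughly 0.66, 0.65, and 0.65, much closer to the weighted averages in the report.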

