# Sentiment Analysis Assessment - Solution

## Task #1: Perform vector arithmetic on your own words
Write code that evaluates vector arithmetic on your own set of related words. The goal is to come as close to an expected word as possible. Please feel free to share success stories in the Q&A Forum for this section!

In [1]:
# Import spaCy and load the language library. Remember to use a larger model!
import spacy
nlp = spacy.load('en_core_web_md')

In [2]:
# Choose the words you wish to compare, and obtain their vectors
word1 = nlp.vocab['wolf'].vector
word2 = nlp.vocab['dog'].vector
word3 = nlp.vocab['cat'].vector

In [3]:
# Import spatial and define a cosine_similarity function
from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
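
For reference, the same quantity can be computed by hand. The sketch below is a pure-Python illustration (not the SciPy implementation) of cosine similarity as the dot product divided by the product of the vector norms. The helper name `cosine_similarity_py` and the toy vectors are made up for this example.

```python
import math

def cosine_similarity_py(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy 2-d vectors (illustrative only, not real word embeddings)
same_direction = cosine_similarity_py([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity_py([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity_py([1.0, 0.0], [-1.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Values near 1 mean the two vectors point the same way, which is why the later cells sort candidates by descending similarity.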

In [7]:
# Write an expression for vector arithmetic
# For example: new_vector = word1 - word2 + word3
new_vector = word1 - word2 + word3
new_vector.shape

(300,)

In [8]:
# List the top ten closest vectors in the vocabulary to the result of the expression above
computed_similarities = []

for word in nlp.vocab:
    # Keep only lowercase alphabetic tokens that actually have a vector
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([w[0].text for w in computed_similarities[:10]])

['wolf', 'cat', 'i', 'cuz', 'lovin', 'dare', 'u', 'dog', 'she', 'ai']


### CHALLENGE: Write a function that takes in 3 strings, performs a-b+c arithmetic, and returns a top-ten result

In [9]:
def vector_math(a, b, c):
    # Compute a - b + c and return the ten vocabulary words closest to the result
    new_vector = nlp.vocab[a].vector - nlp.vocab[b].vector + nlp.vocab[c].vector
    computed_similarities = []

    for word in nlp.vocab:
        # Keep only lowercase alphabetic tokens that actually have a vector
        if word.has_vector and word.is_lower and word.is_alpha:
            similarity = cosine_similarity(new_vector, word.vector)
            computed_similarities.append((word, similarity))

    computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

    return [w[0].text for w in computed_similarities[:10]]

In [10]:
# Test the function on known words:
vector_math('king','man','woman')

['king', 'woman', 'she', 'who', 'wolf', 'when', 'dare', 'cat', 'was', 'not']
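
In the output above, the query words themselves ('king', 'woman') rank at the top. That often happens with this kind of arithmetic, because the result vector tends to stay close to its inputs, so a frequent tweak is to exclude the query words from the candidates. The sketch below shows the idea on a tiny hand-made vocabulary. The vectors, the `toy_vectors` dictionary, and the `analogy` helper are all hypothetical stand-ins, not spaCy objects.

```python
import math

# Hand-made 3-d vectors: (royalty, gender, fruitiness). Illustrative only.
toy_vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(x, y):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def analogy(a, b, c, vectors, topn=3):
    """Rank words by similarity to a - b + c, skipping the three query words."""
    target = [va - vb + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [(word, cosine(target, vec))
                  for word, vec in vectors.items()
                  if word not in (a, b, c)]
    candidates.sort(key=lambda item: -item[1])
    return [word for word, _ in candidates[:topn]]

top_words = analogy("king", "man", "woman", toy_vectors)
print(top_words)
```

The same filter could be added to `vector_math` by skipping any `word.text` that equals `a`, `b`, or `c`.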

## Task #2: Perform VADER Sentiment Analysis on your own review
Write code that returns a set of SentimentIntensityAnalyzer polarity scores based on your own written review.

In [11]:
# Import SentimentIntensityAnalyzer and create an sid object
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

In [23]:
# Write a review as one continuous string (multiple sentences are ok)
review_neutral = 'This movie portrayed real people, and was based on actual events.'
review_negative = 'This movie was awful, the worst movie ever done !'

In [24]:
# Obtain the sid scores for your review
sid.polarity_scores(review_neutral)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [25]:
# Obtain the sid scores for your review
sid.polarity_scores(review_negative)

{'neg': 0.514, 'neu': 0.486, 'pos': 0.0, 'compound': -0.8122}

### CHALLENGE: Write a function that takes in a review and returns a score of "Positive", "Negative" or "Neutral"

In [27]:
def review_rating(string):
    # Label the review by the sign of VADER's compound score and append the score itself
    compound = sid.polarity_scores(string)['compound']
    if compound == 0:
        label = 'Neutral'
    elif compound > 0:
        label = 'Positive'
    else:
        label = 'Negative'
    return '{}_{:2.4}'.format(label, compound)

In [28]:
# Test the function on your review above:
review_rating(review_neutral)

'Neutral_0.0'

In [29]:
review_rating(review_negative)

'Negative_-0.8122'

In [30]:
my_text = 'we love you'
review_rating(my_text)

'Positive_0.6369'
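
The function above treats only an exact 0 as neutral. The VADER documentation suggests a small dead zone instead, typically labelling compound scores at or above 0.05 as positive and at or below -0.05 as negative. A minimal sketch of that variant follows. The name `label_from_compound` and the `threshold` parameter are introduced here for illustration and are not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

# Compound scores taken from the cells above, plus one hypothetical borderline value
print(label_from_compound(0.0))      # the neutral review
print(label_from_compound(-0.8122))  # the negative review
print(label_from_compound(0.6369))   # 'we love you'
print(label_from_compound(0.02))     # hypothetical weakly positive score
```

With a dead zone, a barely positive score such as 0.02 is reported as neutral rather than positive.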

### Let's compare this with [transformers](https://huggingface.co/transformers/task_summary.html#sequence-classification) classification!

In [33]:
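from transformers import pipeline

# Note: this rebinds `nlp`, replacing the spaCy model loaded in Task #1
nlp = pipeline("sentiment-analysis")

def print_transformer_sentiment_scores(nlp_pipe, phrase):
    # Print the phrase (left-aligned, padded to 20 characters) with its predicted label and score
    result = nlp_pipe(phrase)[0]
    print(f"{phrase:<20}\nlabel: {result['label']}, with score: {round(result['score'], 4)}")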

In [34]:
print_transformer_sentiment_scores(nlp, my_text)
print_transformer_sentiment_scores(nlp, review_neutral)
print_transformer_sentiment_scores(nlp, review_negative)
print_transformer_sentiment_scores(nlp, 'I hate you')
print_transformer_sentiment_scores(nlp, 'I love you')

This movie portrayed real people, and was based on actual events.
label: POSITIVE, with score: 0.9893
This movie was awful, the worst movie ever done !
label: NEGATIVE, with score: 0.9998
I hate you
label: NEGATIVE, with score: 0.9991
I love you
label: POSITIVE, with score: 0.9999


## Let's use Transformers to check the movie dataset!

In [25]:
import numpy as np
import pandas as pd

CONST_DATA_FILE = 'data/moviereviews.zip'

df = pd.read_csv(CONST_DATA_FILE, sep='\t', compression='zip')
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [26]:
# Remove NaN values (whitespace-only reviews are handled in the next cell):
df.dropna(inplace=True)
df.describe()

|        | label | review |
|--------|-------|--------|
| count  | 1965 | 1965 |
| unique | 2 | 1939 |
| top    | neg |  |
| freq   | 983 | 27 |


In [27]:
# Whitespace-only reviews are not NaN, so find and drop them separately
blanks = df['review'].str.strip() == ''
print(f"Number of empty reviews : {blanks.sum()}")
df.drop(df[blanks].index, inplace=True)
df.describe()

Number of empty reviews : 27


|        | label | review |
|--------|-------|--------|
| count  | 1938 | 1938 |
| unique | 2 | 1938 |
| top    | pos | let's face it : since waterworld floated by , ... |
| freq   | 969 | 1 |


In [28]:
from transformers import pipeline
nlp = pipeline("sentiment-analysis")

def print_transformer_sentiment_scores(nlp_pipe, phrase):
    result = nlp_pipe(phrase)[0]
    print(f"{phrase:<{20}}\nlabel: {result['label']}, with score: {round(result['score'], 4)}")

In [29]:
def get_transformer_sentiment_scores(nlp_pipe, phrase):
    return nlp_pipe(phrase)[0]

In [30]:
#df.iloc[0]['review']
df.iloc[0]

label                                                   neg
review    how do films like mouse hunt get into theatres...
Name: 0, dtype: object

In [31]:
get_transformer_sentiment_scores(nlp, df.iloc[0]['review'])

{'label': 'NEGATIVE', 'score': 0.997870683670044}
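
The pipeline returns a label plus a confidence for that label, whereas VADER returns a signed compound score. To put the two on a comparable axis, one option is to fold the label into the sign. This is only a sketch: `signed_score` is a hypothetical helper, and it assumes the result dictionary has the 'label' and 'score' keys shown above, with labels 'POSITIVE' or 'NEGATIVE'.

```python
def signed_score(result):
    """Turn {'label': ..., 'score': ...} into a value in [-1, 1]."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']

# Result dictionary copied from the cell output above
example = {'label': 'NEGATIVE', 'score': 0.997870683670044}
print(round(signed_score(example), 4))
```

Keep in mind that a classifier confidence is not a sentiment intensity, so any comparison with VADER's compound score is rough at best.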

In [32]:
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [33]:
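# Add empty columns for the transformer confidence score and the predicted label
df['scores'] = None
df['sentiment'] = None
df.at[0, 'scores'] = 0  # quick check that a single cell can be assigned with .at
df.head()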


|   | label | review | scores | sentiment |
|---|-------|--------|--------|-----------|
| 0 | neg | how do films like mouse hunt get into theatres... | 0.0 |  |
| 1 | neg | some talented actresses are blessed with a dem... |  |  |
| 2 | pos | this has been an extraordinary year for austra... |  |  |
| 3 | pos | according to hollywood movies made in last few... |  |  |
| 4 | neg | my first press screening of 1998 and already i... |  |  |


In [37]:
# Slicing check: s[:n] keeps the first n characters (used to truncate reviews in the next cell)
s = '123456'
s[:3]

'123'

In [38]:
for i, lb, review, score, sentiment in df.itertuples():  # iterate over the DataFrame rows
    # Truncate each review to its first 500 characters before classification
    result = get_transformer_sentiment_scores(nlp, review[:500])
    df.at[i, 'scores'] = round(result['score'], 4)
    df.at[i, 'sentiment'] = result['label']

df.head()

|   | label | review | scores | sentiment |
|---|-------|--------|--------|-----------|
| 0 | neg | how do films like mouse hunt get into theatres... | 0.9997 | NEGATIVE |
| 1 | neg | some talented actresses are blessed with a dem... | 0.9514 | NEGATIVE |
| 2 | pos | this has been an extraordinary year for austra... | 0.9997 | POSITIVE |
| 3 | pos | according to hollywood movies made in last few... | 0.9982 | NEGATIVE |
| 4 | neg | my first press screening of 1998 and already i... | 0.9767 | NEGATIVE |
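
Truncating each review to its first 500 characters keeps the input short for the model but discards most of the text. One alternative, sketched below under assumptions, is to score fixed-size character chunks and combine the signed scores. `chunk_text` and `score_in_chunks` are hypothetical helpers, `classify` is any callable returning a dictionary with 'label' and 'score' keys, and the stand-in `fake_classify` exists only so the example runs without downloading a model.

```python
def chunk_text(text, size=500):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def score_in_chunks(classify, text, size=500):
    """Average the signed chunk scores and map the mean back to a label.

    Assumes `text` is non-empty (empty reviews were dropped earlier).
    """
    signed = []
    for chunk in chunk_text(text, size):
        result = classify(chunk)
        sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
        signed.append(sign * result['score'])
    mean = sum(signed) / len(signed)
    return {'label': 'POSITIVE' if mean > 0 else 'NEGATIVE', 'score': abs(mean)}

def fake_classify(chunk):
    # Stand-in for the real pipeline: "positive" only if the chunk contains 'good'
    label = 'POSITIVE' if 'good' in chunk else 'NEGATIVE'
    return {'label': label, 'score': 0.9}

review = 'bad ' * 3 + 'good ' * 2   # 22 characters
print(chunk_text(review, size=10))
print(score_in_chunks(fake_classify, review, size=10))
```

Splitting on a fixed character count can cut words in half, so a real version would more likely split on sentence or token boundaries. Whether averaging chunk scores improves accuracy on this dataset is untested here.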


In [39]:
# Map the pipeline labels onto the dataset's 'pos'/'neg' labels
df['comp_score'] = df['sentiment'].apply(lambda c: 'pos' if c == 'POSITIVE' else 'neg')
df.head()

|   | label | review | scores | sentiment | comp_score |
|---|-------|--------|--------|-----------|------------|
| 0 | neg | how do films like mouse hunt get into theatres... | 0.9997 | NEGATIVE | neg |
| 1 | neg | some talented actresses are blessed with a dem... | 0.9514 | NEGATIVE | neg |
| 2 | pos | this has been an extraordinary year for austra... | 0.9997 | POSITIVE | pos |
| 3 | pos | according to hollywood movies made in last few... | 0.9982 | NEGATIVE | neg |
| 4 | neg | my first press screening of 1998 and already i... | 0.9767 | NEGATIVE | neg |


### Perform a comparison analysis between the original label and comp_score

In [40]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [41]:
accuracy_score(df['label'],df['comp_score'])

0.7069143446852425

##### We got an accuracy_score of 0.6357 with NLTK VADER, so the Hugging Face transformers pipeline (0.7069) does better here

In [42]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.70      0.72      0.71       969
         pos       0.71      0.69      0.70       969

    accuracy                           0.71      1938
   macro avg       0.71      0.71      0.71      1938
weighted avg       0.71      0.71      0.71      1938



In [43]:
print(confusion_matrix(df['label'],df['comp_score']))

[[701 268]
 [300 669]]
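
The figures in the classification report can be recovered from the confusion matrix by hand. In scikit-learn's layout, rows are true labels and columns are predicted labels, in sorted label order ('neg', 'pos'). The variable names below are chosen for this sketch.

```python
# Confusion matrix from the cell above: rows = true label, columns = predicted label
matrix = [[701, 268],   # true neg: predicted neg, predicted pos
          [300, 669]]   # true pos: predicted neg, predicted pos

true_neg_as_neg, true_neg_as_pos = matrix[0]
true_pos_as_neg, true_pos_as_pos = matrix[1]
total = sum(sum(row) for row in matrix)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = (true_neg_as_neg + true_pos_as_pos) / total
# Precision: of everything predicted as a class, how much was right (column-wise)
precision_neg = true_neg_as_neg / (true_neg_as_neg + true_pos_as_neg)
precision_pos = true_pos_as_pos / (true_pos_as_pos + true_neg_as_pos)
# Recall: of everything truly in a class, how much was found (row-wise)
recall_neg = true_neg_as_neg / (true_neg_as_neg + true_neg_as_pos)
recall_pos = true_pos_as_pos / (true_pos_as_pos + true_pos_as_neg)

print(f"accuracy: {accuracy:.4f}")
print(f"neg precision/recall: {precision_neg:.2f} / {recall_neg:.2f}")
print(f"pos precision/recall: {precision_pos:.2f} / {recall_pos:.2f}")
```

The errors are fairly balanced: 268 negative reviews were predicted positive and 300 positive reviews were predicted negative.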
