# Semantics & Sentiment Analysis

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

# 6.1.0 - Introduction to Semantics & Sentiment Analysis   

Goals and coverage:
- Understanding semantic word vectors
- Understanding sentiment analysis
- Leveraging sentiment analysis for text classification

# 6.2.0 - Semantics & word vectors

## Required library downloads
Ensure the following spaCy models have been installed into your virtual/conda env:
- `python -m spacy download en_core_web_md`
- `python -m spacy download en_core_web_lg`

**Note:** both are large files and may take some time to download.
    
## Word2vec

#### Q: What is Word2vec?
A: [Word2vec](https://en.wikipedia.org/wiki/Word2vec) is a two-layer neural net that processes text. The input is a text corpus and the output is a set of vectors: feature vectors for the words in that corpus. 

#### Q: What does it do?
A: The purpose of the vectors is to group the vectors of similar words together in vector space; the similarities are detected mathematically. The vectors created by `Word2vec` are distributed numerical representations of word features, such as the context of individual words. This is done without human intervention. Given enough data, usage and contexts, `Word2vec` can therefore make highly accurate guesses about the meaning of a word based on its past appearances. These guesses can be used to establish a word's association with other words.

#### Q: How does it work?
A: Word2vec trains words against the other words that neighbour them in the input corpus. This is done in one of two ways: 
1. Continuous bag of words (CBOW), which uses a context to predict a target word.
2. Skip-gram, which uses a word to predict its context. 

The two are essentially inverses of each other. At the end of the process each word is represented by a vector; in the spaCy models used here, each of these vectors has 300 dimensions. 
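To make the difference between the two training setups concrete, here is a minimal sketch of how training pairs could be generated from a tokenised sentence. The function names `skipgram_pairs` and `cbow_pairs` and the `window` parameter are hypothetical, chosen for illustration; this only shows the shape of the training data, not the neural net itself.

```python
def skipgram_pairs(tokens, window=2):
    """Build (centre_word, context_word) pairs: the centre word predicts each neighbour."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs: the neighbours jointly predict the centre word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Note how each CBOW pair has the context as input and one word as output, while skip-gram turns the same window into several pairs pointing the other way.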

## Cosine similarity

The vectorisation process means we can use cosine similarity to measure how similar words are to one another. To get a result, we take a number of 300-dimensional vectors and compare them to derive a similarity score. It also enables us to perform vector arithmetic with these word vectors. 
For example: 
- `new_vec = (king - man) + woman`
- `new_vec = giraffe - (tiger + zebra)`

This creates new vectors that are not directly associated with any word, which lets us search for the most similar existing vectors. In the `king` example the closest match may be `queen`, because of the gender contexts. In the `giraffe` example it may be something related to skin markings and patterns, given that all three animals have distinguishing markings.
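As a rough illustration of the idea, the sketch below computes cosine similarity by hand and applies the `king - man + woman` arithmetic. The 3-dimensional vectors are made up for this example (real spaCy vectors have 300 dimensions and are learned, not hand-written), so treat the numbers as assumptions.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors: dot product over the product of the norms."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm1 * norm2)


# Hypothetical toy vectors; the axes loosely stand for royalty, maleness, femaleness
king = [0.9, 0.8, 0.1]
man = [0.1, 0.9, 0.1]
woman = [0.1, 0.1, 0.9]
queen = [0.9, 0.1, 0.8]

# element-wise vector arithmetic: king - man + woman
new_vec = [k - m + w for k, m, w in zip(king, man, woman)]

print(cosine_similarity(new_vec, queen))
print(cosine_similarity(new_vec, king))
```

With these toy values the new vector sits closer to `queen` than to `king`, which is the behaviour the arithmetic is meant to demonstrate.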

# 6.3.0 - Semantics & word vectors with spaCy

In [2]:
import spacy

In [120]:
nlp = spacy.load('en_core_web_lg')

In [122]:
len(nlp.vocab)

490

In [121]:
nlp(u"lion").vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

Note that `Doc` and `Span` objects have vectors too, computed as the average of their token (word) vectors. This gives us not only word vectors but document vectors too. 

In [5]:
nlp(u'This is a doc that we can check the shape of').vector.shape

(300,)
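The averaging is why the document vector keeps the same shape as a single word vector. A minimal sketch with made-up 3-dimensional token vectors (the values and the `average_vector` helper are hypothetical) shows the calculation:

```python
def average_vector(token_vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(token_vectors)
    if count == 0:
        return []
    dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dims)]


# two hypothetical token vectors standing in for the tokens of a short doc
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

doc_vector = average_vector(token_vectors)
print(doc_vector, len(doc_vector))
```

However many tokens go in, the result has as many dimensions as one token vector, which matches the `(300,)` shape seen above.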

In [50]:
def compare_tokens(tokens):
    for token1 in tokens:
        print(f"{'Token (base)':<{15}}{'Token (Comparison)':<{25}}\t{'Score':<{30}}")
        print("-" * 100)
        for token2 in tokens:
            if token1.text != token2.text:
                print(f"{token1.text:<{15}}{token2.text:<{25}}\t{token1.similarity(token2):<{30}}")
        print('')

In [51]:
tokens = nlp(u"lion cat pet")

compare_tokens(tokens)

Token (base)   Token (Comparison)       	Score                         
----------------------------------------------------------------------------------------------------
lion           cat                      	0.5265436768531799            
lion           pet                      	0.399237722158432             

Token (base)   Token (Comparison)       	Score                         
----------------------------------------------------------------------------------------------------
cat            lion                     	0.5265436768531799            
cat            pet                      	0.7505456209182739            

Token (base)   Token (Comparison)       	Score                         
----------------------------------------------------------------------------------------------------
pet            lion                     	0.399237722158432             
pet            cat                      	0.7505456209182739            



In [52]:
# take some words that share a context but are not
# obviously related or converging. 
 
tokens = nlp(u"like love hate lust greed envy")
compare_tokens(tokens)

Token (base)   Token (Comparison)       	Score                         
----------------------------------------------------------------------------------------------------
like           love                     	0.6579039692878723            
like           hate                     	0.6574652194976807            
like           lust                     	0.28143566846847534           
like           greed                    	0.2609533965587616            
like           envy                     	0.334461510181427             

Token (base)   Token (Comparison)       	Score                         
----------------------------------------------------------------------------------------------------
love           like                     	0.6579039692878723            
love           hate                     	0.6393098831176758            
love           lust                     	0.47823452949523926           
love           greed                    	0.28427520394325256           
love 

In [54]:
# take some random, loosely connected words to see
# how typical/atypical pairings score. 

tokens = nlp(u"cat peanut burgler jelly urine monkey")
compare_tokens(tokens)

Token (base)   Token (Comparison)       	Score                         
----------------------------------------------------------------------------------------------------
cat            peanut                   	0.3071243464946747            
cat            burgler                  	0.19492697715759277           
cat            jelly                    	0.24843300879001617           
cat            urine                    	0.31303825974464417           
cat            monkey                   	0.5351812839508057            

Token (base)   Token (Comparison)       	Score                         
----------------------------------------------------------------------------------------------------
peanut         cat                      	0.3071243464946747            
peanut         burgler                  	0.02168104238808155           
peanut         jelly                    	0.591347336769104             
peanut         urine                    	0.1654939353466034            
peanu

In [98]:
# run an example looking for:
# - whether the token has a vector
# - whether the token is out of vocabulary 
tokens = nlp(u"Rapper guitarist drummer bassist Bezmondo")

print(f"{'token':{15}}{'has_vector':{8}}{'vector_norm':>{25}}{'oov':>{12}}")
print("-" * 100)
for token in tokens:
    print(f"{token.text:{15}}{str(token.has_vector):{8}}{token.vector_norm:>{27}}{str(token.is_oov):>{12}}")

token          has_vector              vector_norm         oov
----------------------------------------------------------------------------------------------------
Rapper         True              7.430685043334961       False
guitarist      True                  7.60302734375       False
drummer        True              7.250106334686279       False
bassist        True              7.421383857727051       False
Bezmondo       False                           0.0        True


In [106]:
from scipy import spatial

cosine_similarity = lambda vec1, vec2: 1 - spatial.distance.cosine(vec1,vec2)

In [119]:
print(len(nlp.vocab))

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

517


In [108]:
# king - man + woman --> new vector: queen princess highness
new_vec = king-man+woman

In [140]:
import time

computed_similarities = []

# -----------------
# Step 1 
start = time.time()

# load every lexeme that has a vector into the vocab: from spaCy 2.3 the vocab
# is not preloaded, so even the large model starts with only around 500 entries
for s in nlp.vocab.vectors:
    _ = nlp.vocab[s]

end = time.time()
print(f"Step1: {end - start}")
    
# -----------------
# Step 2 
start = time.time()
# search all of the vocab
for word in nlp.vocab:
    if word.has_vector:
        if word.is_lower:
            if word.is_alpha:
                similarity = cosine_similarity(new_vec, word.vector)
                computed_similarities.append((word, similarity))

end = time.time()
print(f"Step2: {end - start}")

Step1: 0.16414403915405273
Step2: 13.595584154129028


In [138]:
# sorted in descending order to get most similar words at the top
computed_similarities = sorted(computed_similarities, key=lambda item:-item[1])

In [139]:
print([t[0].text for t in computed_similarities[:10]])

['king', 'queen', 'prince', 'kings', 'princess', 'royal', 'throne', 'queens', 'monarch', 'kingdom']
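Input words such as `king` can show up in the results, so it can be useful to exclude the query words before ranking. The sketch below does this over a small made-up vocabulary using `heapq.nlargest`, which avoids sorting the whole list when only the top few are needed. The `most_similar` helper and the toy vectors are hypothetical, not part of spaCy.

```python
import heapq
import math


def cosine_similarity(vec1, vec2):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def most_similar(query_vec, vectors, exclude=(), top_n=3):
    """Return the top_n (word, similarity) pairs, skipping any word in `exclude`."""
    candidates = (
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vectors.items()
        if word not in exclude
    )
    return heapq.nlargest(top_n, candidates, key=lambda item: item[1])


# hypothetical toy vocabulary standing in for nlp.vocab
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "peanut": [-0.5, 0.2, -0.3],
}

query = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

results = most_similar(query, toy_vectors, exclude={"king", "man", "woman"}, top_n=2)
print([word for word, _ in results])
```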


# 6.4.0 - Sentiment Analysis overview

#### VADER and unlabeled sentiment reasoning

We have seen text classification used to predict sentiment where we have pre-labeled data. Where we don't have historical, labeled data, discerning sentiment may be more challenging. This is where we can use `VADER`, or **V**alence **A**ware **D**ictionary for s**E**ntiment **R**easoning. This model is used for sentiment analysis and is sensitive to both the polarity (pos/neg) and the intensity (strength) of emotion.

We'll use the NLTK package to explore this concept. 

Primarily, VADER sentiment analysis relies on a dictionary which maps lexical features to emotion intensities, or `sentiment scores`. The sentiment score of a text can be obtained by summing the intensity of each word in the text under analysis. 
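A minimal sketch of the dictionary lookup and summation is shown below. The tiny `TOY_LEXICON` and its values are assumptions for illustration only (the real lexicon is far larger), and the `normalise` step, which squashes the raw sum into the range -1 to 1, is written here as an approximation of how a compound-style score can be produced rather than a copy of the library code.

```python
import math

# Hypothetical valence scores, for illustration only
TOY_LEXICON = {"love": 3.2, "happy": 2.7, "good": 1.9, "bad": -2.5, "hate": -2.7}


def raw_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every word found in the lexicon; unknown words count as 0."""
    return sum(lexicon.get(word, 0.0) for word in text.lower().split())


def normalise(score, alpha=15):
    """Squash an unbounded score into (-1, 1); alpha is an assumed smoothing constant."""
    return score / math.sqrt(score * score + alpha)


for text in ["I love good food", "bad service", "the table is brown"]:
    total = raw_score(text)
    print(f"{text!r}: raw={total:.2f} normalised={normalise(total):.4f}")
```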

#### Q: So what can it do? 
A: VADER is smart enough to treat words such as `joy`, `happy` and `love` as positive, and also to recognise that negators in the same text can change the sentiment entirely, so `did not love` would be correctly determined to be negative. Areas that can still present some challenge are:
- Positive and negative sentiment in the same text segment. 
- Sarcasm using positive words in a negative way. 
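The negation handling can be sketched as a simple rule: if a negator appears shortly before a sentiment word, flip and dampen that word's score. Everything in this block is an assumption for illustration: the toy lexicon, the `NEGATORS` set, the three-word look-back window and the `-0.74` scalar are stand-ins and should not be read as the library's actual implementation.

```python
# Hypothetical values, for illustration only
TOY_LEXICON = {"love": 3.2, "hate": -2.7, "good": 1.9}
NEGATORS = {"not", "never", "no"}
NEGATION_SCALAR = -0.74  # assumed: flips the sign and slightly reduces the strength


def score_with_negation(text, window=3):
    """Sum word valences, flipping any word preceded by a negator within `window` words."""
    words = text.lower().split()
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)
        preceding = words[max(0, i - window):i]
        if valence and any(w in NEGATORS for w in preceding):
            valence *= NEGATION_SCALAR
        total += valence
    return total


print(score_with_negation("I love it"))
print(score_with_negation("I did not love it"))
```

A rule like this also hints at why sarcasm is hard: there is no negator to detect, so the positive words are scored at face value.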


# 6.5.0 - Sentiment analysis with VADER & NLTK

In [141]:
import nltk

In [142]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ed/nltk_data...


True

In [143]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [144]:
sid = SentimentIntensityAnalyzer()

In [153]:
data = "That was not the best test score I have had but I was DELIGHTED"

In [154]:
sid.polarity_scores(data)

{'neg': 0.123, 'neu': 0.564, 'pos': 0.313, 'compound': 0.6559}

#### Exercise using Amazon reviews

In [155]:
import pandas as pd

In [157]:
df = pd.read_csv('./resources/amazonreviews.tsv', sep='\t')

df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [161]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [162]:
df.dropna(inplace=True)

In [163]:
blanks = []

for i, lb, rev in df.itertuples():
    # (index, label, review_text)
    # collect the index of any review that contains only whitespace
    if isinstance(rev, str) and rev.isspace():
        blanks.append(i)

In [165]:
blanks

# df.drop(blanks)

[]

In [168]:
df.iloc[0]['review']

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [169]:
sid.polarity_scores(df.iloc[0]['review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

#### Add scores to the dataframe

In [170]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

In [171]:
df.head()

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."


In [172]:
df['compound'] = df['scores'].apply(lambda d:d['compound'])

In [173]:
df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


In [174]:
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score >=0 else 'neg')

In [175]:
df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos
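The rule above treats any compound score of 0 or more as positive, which fits a dataset with only `pos` and `neg` labels. A commonly suggested alternative for VADER is a three-way split with a small neutral band around zero. The sketch below assumes a threshold of `0.05`; the helper name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'pos', 'neg' or 'neu' using a symmetric neutral band."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


for score in [0.9454, 0.02, -0.3, 0.0]:
    print(score, label_from_compound(score))
```

Since the review labels here are binary, any `neu` predictions would need to be dropped or assigned to one side before comparing against the manual labels.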


#### Compare VADER scoring to the manual labels

In [176]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [178]:
accuracy_score(df['label'], df['comp_score'])

0.7092

In [179]:
print(classification_report(df['label'], df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [181]:
print(confusion_matrix(df['label'], df['comp_score']))

[[2623 2474]
 [ 434 4469]]
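The figures in the classification report can be recomputed directly from the confusion matrix, where rows are the true labels and columns are the predicted labels, both in the order `neg`, `pos`. This is a small worked sketch using the numbers printed above:

```python
# confusion matrix from the output above: rows = true label, columns = predicted label
matrix = [
    [2623, 2474],  # true neg: predicted neg, predicted pos
    [434, 4469],   # true pos: predicted neg, predicted pos
]
labels = ["neg", "pos"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    correct = matrix[i][i]
    predicted_as_label = sum(row[i] for row in matrix)  # column total
    actually_label = sum(matrix[i])                     # row total
    precision = correct / predicted_as_label
    recall = correct / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)

print(f"accuracy: {accuracy:.4f}")
for label, (precision, recall, f1) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The low recall for `neg` comes from the 2474 negative reviews that VADER scored as positive, which is the main source of error in this comparison.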


The next step is to run a full end-to-end example, which we'll do in the next notebook.