# **Part Three of the Course Project**
In this part of the course project, you will automatically expand VADER's lexicon by using TextBlob's sentiment analyzer to score words missing from VADER's vocabulary. Then you'll measure the F1 score with and without this expansion.<hr style="border-top: 2px solid #606366; background: transparent;">

# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries and corpora needed for this project. 

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = 'all'
import numpy as np, pandas as pd, nltk, unittest, numpy.testing as npt
from textblob import TextBlob  # version 0.17.1
from sklearn.metrics import classification_report as rpt
from nltk.sentiment.vader import SentimentIntensityAnalyzer
_ = nltk.download(['vader_lexicon', 'movie_reviews', 'punkt', 'sentence_polarity', 'wordnet', 'omw-1.4'], quiet=True)  # 'wordnet' is needed by wn.morphy()
from nltk.corpus import movie_reviews, sentence_polarity, wordnet as wn
from colorunittest import run_unittest
eq, aeq = npt.assert_equal, npt.assert_almost_equal
sia = SentimentIntensityAnalyzer()

pd.set_option('display.max_colwidth', 100, 'display.max_rows', 5)  # full option names avoid ambiguous pattern matches

In [None]:
print("Starting size of VADER's lexicon:", len(sia.lexicon))

# Task 1

Complete `ApplySIA()`, which takes lists of positive and negative reviews and returns a dataframe with these reviews in their original order (positive first, then negative) along with the relevant statistics from VADER's sentiment intensity analyzer: the negative (`neg`), neutral (`neu`), positive (`pos`), and `compound` scores, plus the original polarity `vY`. The predicted polarity `pY` is -1 for a negative compound score and 1 otherwise; as the tests below show, a compound score of zero counts as positive.

For example, `ApplySIA(['good'], ['bad'])` returns

|neg|neu|pos|compound|vY|pY|review|
|-|-|-|-|-|-|-|
|0|0.0|1.0|0.4404|1|1|good|
|1|0.0|0.0|-0.5423|-1|-1|bad|
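
The thresholding of the compound score can be sketched in isolation. This is a minimal illustration rather than the graded solution: the helper name `threshold` is hypothetical, and it assumes the convention used by the tests and by the later `df2` cell, where a zero compound score is predicted as positive.

```python
import math

def threshold(compound: float) -> int:
    """Map a compound score to a predicted polarity (hypothetical helper)."""
    if math.isnan(compound):   # undefined score: no prediction
        return 0
    return -1 if compound < 0 else 1   # zero counts as positive

# Compound scores of 'good' and 'bad' from the table above, plus a neutral zero
print([threshold(c) for c in (0.4404, -0.5423, 0.0)])
```

Treating zero as positive means every review without any lexicon word is predicted as positive, which is worth keeping in mind when reading the classification reports at the end.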

In [None]:
def ApplySIA(LsPos=[''], LsNeg=[''], sia=sia) -> pd.DataFrame:
    ''' Build a dataframe of positive reviews LsPos followed by negative reviews LsNeg,
        with their polarity_scores from SentimentIntensityAnalyzer().
        Add the true polarity vY (1 for LsPos, -1 for LsNeg) and the predicted polarity pY,
        which is the thresholded compound score.
    Input:
        LsPos, LsNeg: lists with positive and negative review texts, respectively
    Returns: dataframe with columns neg, neu, pos, compound, vY, pY, review     '''
    df = pd.DataFrame([], columns='neg neu pos compound vY pY review'.split()) # desired output structure
    # YOUR CODE HERE
    raise NotImplementedError()
    return df

df1 = ApplySIA([' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.pos')], 
               [' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.neg')])
df1

In [None]:
# RUN CELL TO TEST YOUR CODE
df0 = ApplySIA(['good'], ['bad'])

@run_unittest
class Test_ApplySIA(unittest.TestCase):
    def test00(self): eq(type(df0), pd.DataFrame)
    def test01(self): eq(df0.shape, (2,7))
    def test02(self): eq(list(df0.columns), ['neg', 'neu', 'pos', 'compound', 'vY', 'pY', 'review'])
    def test03(self): eq(df0.values.tolist(), [[0,0,1,.4404,1,1,'good'], [1,0,0,-.5423,-1,-1,'bad']])
    def test04(self): eq(df1.shape, (10662, 7))
    def test05(self): eq(df1.iloc[0,:6].tolist(), [0.0, 0.918, 0.082, 0.3612, 1, 1])
    def test06(self): eq(df1.iloc[-1].tolist(), [0,1,0,0,-1,1,"enigma is well-made , but it's just too dry and too placid ."])
    def test07(self): aeq(df1.compound[:5].tolist(), [0.3612, 0.8069, 0.2617, 0.8271, 0.6592], 4) # checking row order
    def test08(self): aeq(df1.mean(numeric_only=True).tolist(), [0.0913, 0.7449, 0.1639, 0.1633, 0.0, 0.4219], 4)  # checking all numeric columns


# Task 2

Complete the `OOV()` UDF, which returns the set of lower-cased words from the `Docs` iterable after removing every word that is one character long, contains non-alphabetic characters, or is already present in the `Vocab` iterable.

This function collects the words that are missing from VADER's lexicon, so that we can score these words and add them to the lexicon.

For example, `OOV(['Good Goods . ', 'a lucky7'])` returns `{'goods'}` because the other words in `['good', 'goods', '.', 'a', 'lucky7']` are either too short, non-alphabetic, or already have a valence score in `sia.lexicon`.
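
The three filtering rules can be illustrated on the example above without loading VADER. This is only a sketch under stated assumptions: `toy_vocab` is a hypothetical stand-in for `sia.lexicon`, and `is_candidate` is an illustrative helper name, not part of the assignment.

```python
def is_candidate(word: str, vocab) -> bool:
    """True if the word is longer than one character, purely alphabetic, and not in vocab."""
    return len(word) > 1 and word.isalpha() and word not in vocab

toy_vocab = {'good'}   # hypothetical stand-in for sia.lexicon

# Join the documents by a space, lower-case the text, and split on whitespace
tokens = ' '.join(['Good Goods . ', 'a lucky7']).lower().split()
print(tokens)
print({w for w in tokens if is_candidate(w, toy_vocab)})
```

Collecting the result in a set removes duplicates, so each missing word is scored only once later on.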

In [None]:
def OOV(Docs=df1.review, Vocab=sia.lexicon) -> set:
    ''' Out of Vocab (OOV) UDF returns a subset of words in Docs which are not in Vocab.
        Docs are joined by space, lower-cased, and split on whitespace using split() of a string.
    Inputs:
        Docs: an iterable of string elements (list, Series, tuple, set, dict.keys, etc.)
        Vocab: iterable with string keys.
    Returns: set of strings of alpha words (over 1 char), which are not in Vocab     '''
    SsNewWords = set('')  # desired output format
    # YOUR CODE HERE
    raise NotImplementedError()
    return SsNewWords

SsOOV = OOV()  # unscored words
OOV(['Good Goods . ', 'a lucky7'])

In [None]:
# RUN CELL TO TEST YOUR CODE
s0 = OOV(['Good Goods . ', 'a lucky7'])

@run_unittest
class Test_OOV(unittest.TestCase):
    def test00(self): eq(type(s0), set)
    def test01(self): eq(len(s0), 1)
    def test02(self): eq(s0, {'goods'})
    def test03(self): eq(sorted(SsOOV)[:5], ['aaa', 'aaliyah', 'aan', 'abandone', 'abandono'])
    def test04(self): eq(OOV([df1.review[6]]), {'and', 'combination', 'education', 'of', 'offers', 'rare', 'that'}) 
    def test05(self): eq(len(SsOOV), 14868)


# Task 3

Complete the `Expand()` UDF, which extends the given set of words with their other morphological forms using WordNet's `morphy()` method. All original and morphed words will later be assigned a valence score (by the TextBlob model) and added to VADER's lexicon with their new scores. Since a word taken out of context does not reveal which part of speech (POS) it has in a sentence, assume that each POS is equally likely and add every base form that `morphy()` returns for the noun, adjective, and verb POS back to the set.

The example below uses `wn.morphy()` to standardize different morphologies of `run` to their base form assuming different POS.

In [None]:
wn.morphy('runs', wn.NOUN), wn.morphy('runs', wn.ADJ), wn.morphy('runs', wn.VERB), \
wn.morphy('running', wn.NOUN), wn.morphy('running', wn.ADJ), wn.morphy('running', wn.VERB), \
wn.morphy('ran', wn.NOUN), wn.morphy('ran', wn.ADJ), wn.morphy('ran', wn.VERB)

For example, `Expand({'run','runs','ran','running'})` should return (ordered)

    ['ran', 'run', 'running', 'runs']
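
The expansion logic can be sketched without WordNet. The lookup table below is a hypothetical stand-in for `wn.morphy()`: its entries are made up for illustration and are not claimed to match WordNet's actual output. The one behavior it borrows from the real function is that `None` is returned when no base form is found.

```python
# Hypothetical stand-in for wn.morphy(word, pos); entries are illustrative only.
TOY_BASE = {('runs', 'n'): 'run', ('runs', 'v'): 'run',
            ('running', 'v'): 'run', ('ran', 'v'): 'run'}

def toy_morphy(word, pos):
    return TOY_BASE.get((word, pos))   # None when no base form is known

def expand_sketch(words, morph=toy_morphy, tags=('n', 'a', 'v')):
    """Return the sorted union of the words and every base form found for any POS tag."""
    expanded = set(words)
    for w in words:
        for pos in tags:
            base = morph(w, pos)
            if base is not None:       # skip POS tags without a base form
                expanded.add(base)
    return sorted(expanded)

print(expand_sketch({'runs', 'ran', 'running'}))
```

Because the result is built as a set before sorting, a base form reached from several words or several POS tags is stored only once.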

In [None]:
def Expand(SsWords={''}) -> list(''):
    ''' Expand() takes a set of words and adds morphed nouns, adjectives and verbs of these words
        using wn.morphy() function, which takes a word and a POS (wn.NOUN, wn.ADJ, or wn.VERB)
    Inputs:   SsWords: a set of word strings
    Returns:  an alphabetically ordered list of original and added words     '''
    SsWordsNew = set('')
    # YOUR CODE HERE
    raise NotImplementedError()
    return SsWordsNew

LsOOV2 = Expand(SsOOV)  # expanded list of words
print(LsOOV2[:10])

In [None]:
# RUN CELL TO TEST YOUR CODE
e0 = Expand({'run','runs','ran','running'})

@run_unittest
class Test_Expand(unittest.TestCase):
    def test00(self): eq(type(e0), list)
    def test01(self): eq(len(e0), 4)
    def test02(self): eq(e0, ['ran', 'run', 'running', 'runs'])
    def test03(self): eq(len(LsOOV2), 16224)
    def test04(self): eq(LsOOV2[:5], ['aaa', 'aaliyah', 'aan', 'abandone', 'abandono']) 
    def test05(self): eq(LsOOV2[-5:], ['ótimo', 'último', 'últimos', 'única', 'único'])


# Task 4

Complete the `GetScores()` UDF so that it takes a list of words, scores each word with TextBlob's `polarity` attribute scaled by 4, and keeps only the words with a non-zero score. 

For example, `GetScores(['smart','cry','liked','sweet','sour'])` drops `'cry'`, which has zero polarity, and returns 

    [('smart', 0.8571428571428571), ('liked', 2.4), ('sweet', 1.4), ('sour', -0.6000000000000001)]
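
The scaling and filtering step can be sketched without TextBlob. The polarities in `toy_polarity` are assumptions: they are back-computed from the scaled scores in the example above rather than read from TextBlob, and `scale_and_filter` is a hypothetical helper name.

```python
# Hypothetical polarities in [-1, 1], back-computed from the example scores above
toy_polarity = {'smart': 0.8571428571428571 / 4, 'cry': 0.0,
                'liked': 0.6, 'sweet': 0.35, 'sour': -0.15}

def scale_and_filter(words, polarity=toy_polarity, factor=4):
    """Scale each polarity by `factor`, keep input order, and drop zero scores."""
    scored = []
    for w in words:
        p = polarity.get(w, 0.0)       # unknown words are treated as neutral
        if p != 0:
            scored.append((w, factor * p))
    return scored

for word, score in scale_and_filter(['smart', 'cry', 'liked', 'sweet', 'sour']):
    print(f'{word}: {score:.4f}')
```

Multiplying by 4 stretches the polarity range [-1, 1] to [-4, 4], so the new entries are on the same scale as the valence scores already in VADER's lexicon.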

In [None]:
def GetScores(LsWords=LsOOV2) -> [(str, float)]:
    ''' For every word in LsWords, find its polarity score using TextBlob(..).polarity.
        Then append the tuple (word, 4*polarity) to the list LTsWS.
        Scaling by 4 maps TextBlob's polarity range [-1,1] to [-4,4],
        which coincides with the range of VADER's valence scores.
        Keep the word ordering the same and drop words with zero polarity.
    Input:    list of string words
    Return:   a list of tuples in the form (word, valence score)      ''' 
    LTsWS = [('',0)]    # desired output format
    # YOUR CODE HERE
    raise NotImplementedError()
    return LTsWS

%time LTsWS = GetScores()
print(LTsWS[:5])

In [None]:
# RUN CELL TO TEST YOUR CODE
g0 = GetScores(['smart','cry','liked','sweet','sour'])

@run_unittest
class Test_GetScores(unittest.TestCase):
    def test00(self): eq(type(g0), list)
    def test01(self): eq(len(g0), 4)
    def test02(self): eq(g0, [('smart', 0.8571428571428571), ('liked', 2.4), ('sweet', 1.4),  ('sour', -0.6000000000000001)])
    def test03(self): eq(len(LTsWS), 780)
    def test04(self): eq(LTsWS[:4], [('able', 2.0), ('ably', 2.0), ('abridged', 0.4), ('abrupt', -0.5)]) 
    def test05(self): eq(LTsWS[-4:], [('witty', 2.0), ('workmanlike', 2.0), ('yarn', -0.4), ('young', 0.4)])

# Reapply Expanded VADER

Next, VADER's lexicon is expanded with the new words in `LTsWS`.

In [None]:
sia.lexicon.update(dict(LTsWS))  # add (or overwrite) each word with its new valence score
print("New size of VADER's lexicon:", len(sia.lexicon))  # now VADER should have 7877 words in its lexicon

Then the `df2` dataframe is computed using VADER's sentiment analyzer with the expanded lexicon.

In [None]:
%time df2 = pd.DataFrame([sia.polarity_scores(s) for s in df1.review])
df2['review'] = df1.review
df2['vY'] = df1.vY
df2['pY'] = df2.compound.apply(lambda c: -1 if c < 0 else 1 if c >= 0 else 0)   # predicted polarity (based on threshold)
df2

# Compare the Performance

Next, the performance metrics are computed for the SIA model with the original and with the expanded vocabulary. Although precision and recall improved more than they deteriorated in the new model, the F1 score lost more than it gained: it dropped by 0.03 (from 0.51 to 0.48) for one class and rose by only 0.01 (from 0.68 to 0.69) for the other. 

An important takeaway is that a plausible idea for improvement does not guarantee better model performance. Each idea needs to be tested on the given data. In general, brainstorming many ideas is still worthwhile because it lets you choose the ones that appear most promising and least costly to try. 
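
Why F1 can fall even when precision or recall improves follows from its definition as the harmonic mean of the two. The sketch below uses made-up precision and recall values, not the numbers from the reports in this notebook, and `f1` is a hypothetical helper name.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up values: precision rises from 0.6 to 0.9 while recall falls from 0.5 to 0.3
before = f1(0.6, 0.5)
after = f1(0.9, 0.3)
print(f'before: {before:.2f}, after: {after:.2f}')
```

The harmonic mean is pulled toward the smaller of its two inputs, so a large gain in one metric does not compensate for a moderate loss in the other.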

In [None]:
print(rpt(y_true=df1.vY, y_pred=df1.pY))

In [None]:
print(rpt(y_true=df2.vY, y_pred=df2.pY))