# **Part Three of the Course Project**
In this part of the course project, you will automatically expand VADER's lexicon by applying TextBlob's sentiment analyzer to words missing from VADER's vocabulary. Then you'll measure the F1 score with and without this expansion.<hr style="border-top: 2px solid #606366; background: transparent;">

# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries and corpora needed for this project. 

In [1]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = 'all'
import numpy as np, pandas as pd, nltk, unittest, numpy.testing as npt
from textblob import TextBlob  # version 0.17.1
from sklearn.metrics import classification_report as rpt
from nltk.sentiment.vader import SentimentIntensityAnalyzer
_ = nltk.download(['vader_lexicon', 'movie_reviews', 'punkt', 'sentence_polarity', 'omw-1.4'], quiet=True)
from nltk.corpus import movie_reviews, sentence_polarity, wordnet as wn
from colorunittest import run_unittest
eq, aeq = npt.assert_equal, npt.assert_almost_equal
sia = SentimentIntensityAnalyzer()

pd.set_option('display.max_colwidth', 100, 'display.max_rows', 5)  # full option names avoid ambiguous matches

In [2]:
print("Starting size of VADER's lexicon:", len(sia.lexicon))

Starting size of VADER's lexicon: 7502


# Task 1

Complete `ApplySIA()`, which takes lists of positive and negative reviews and returns a dataframe with these reviews in their original order (positive reviews first) together with the relevant statistics from VADER's sentiment intensity analyzer: the negative (`neg`), neutral (`neu`), positive (`pos`), and `compound` scores, plus the original polarity `vY` (1 for positive reviews, -1 for negative ones). The predicted polarity `pY` is -1 for a negative compound score and 1 otherwise. Note that the unit tests below expect a compound score of exactly zero to map to 1.

For example, `ApplySIA(['good'], ['bad'])` returns

|neg|neu|pos|compound|vY|pY|review|
|-|-|-|-|-|-|-|
|0|0.0|1.0|0.4404|1|1|good|
|1|0.0|0.0|-0.5423|-1|-1|bad|
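
The thresholding behind `pY` can be sketched in isolation. The helper name `to_polarity` is hypothetical and not part of the assignment. The sketch follows the convention that the unit tests in this notebook accept, where a compound score of exactly zero is labeled 1.

```python
def to_polarity(compound: float) -> int:
    """Map a VADER compound score in [-1, 1] to a predicted label."""
    # Strictly negative scores are labeled -1; zero and positive scores are labeled 1.
    return -1 if compound < 0 else 1

# Compound scores of 'good', 'bad', and a neutral text, as shown in this notebook
print([to_polarity(c) for c in (0.4404, -0.5423, 0.0)])  # [1, -1, 1]
```

Because the true labels `vY` contain no neutral class, folding zero scores into one of the two classes keeps the later classification report binary.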

In [3]:
"""def ApplySIA(LsPos=[''], LsNeg=[''], sia=sia) -> pd.DataFrame:
    ''' Add positive reviews LsPos and negative reviews LsNeg to a dataframe
        with their polarity_scores from SentimentIntensityAnalyzer().
        Then add a test polarity vY as 1 and a predicted polarity pY as thresholded compound score.
        Then add negative reviews with vY=-1 and other statistics from SentimentIntensityAnalyzer()
    Input:
        LsPos, LsNeg: lists with positive and negative review texts, respectively
    Returns: dataframe     '''
    def get_sentiment_score(review):
        sentiment_score = sia.polarity_scores(review)
        return sentiment_score['neg'], sentiment_score['neu'], sentiment_score['pos'], sentiment_score['compound']

    positive_data = []
    for review in LsPos:
        neg, neu, pos, compound = get_sentiment_score(review)
        #compound = round(compound, 4)  # Round compound score to avoid precision issues
        pY = 1 if compound > 0 else (0 if compound == 0 else -1)
        positive_data.append([neg, neu, pos, compound, 1, pY, review])
    
    negative_data = []
    for review in LsNeg:
        neg, neu, pos, compound = get_sentiment_score(review)
        compound = round(compound, 4)  # Round compound score to avoid precision issues
        pY = -1 if compound < 0 else (0 if compound == 0 else 1)
        negative_data.append([neg, neu, pos, compound, -1, pY, review])
    
    df1 = pd.DataFrame(positive_data, columns=['neg', 'neu', 'pos', 'compound', 'vY', 'pY', 'review'])
    df2 = pd.DataFrame(negative_data, columns=['neg', 'neu', 'pos', 'compound', 'vY', 'pY', 'review'])
    
    df = pd.concat([df1, df2], ignore_index=True)
    df = df.drop_duplicates(subset=['review']).reset_index(drop=True)
  
    
    
    return df


    


df1 = ApplySIA([' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.pos')], 
               [' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.neg')])
df1"""

In [4]:
"""LsPos = [' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.pos')]
LsNeg = [' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.neg')]
LsReviews = LsPos + LsNeg
LnPolarity = [1] * len(LsPos) + [-1] * len(LsNeg)   # original polarity lacks neutral values

%time dfSP = pd.DataFrame([sia.polarity_scores(s) for s in LsReviews])
dfSP['nChar'] = [len(s) for s in LsReviews]
dfSP['vY'] = LnPolarity
dfSP['pY'] = dfSP.compound.apply(lambda c: -1 if c < 0 else 1 if c >= 0 else 0)   # predicted polarity (based on threshold)
dfSP['review'] = LsReviews
dfSP"""

In [5]:
def ApplySIA(LsPos=[''], LsNeg=[''], sia=None) -> pd.DataFrame:
    ''' Score positive reviews LsPos and negative reviews LsNeg with
        SentimentIntensityAnalyzer().polarity_scores() and collect the results in a dataframe.
        Then add the original polarity vY (1 for LsPos, -1 for LsNeg), the predicted
        polarity pY (thresholded compound score), and the review text.
    Input:
        LsPos, LsNeg: lists with positive and negative review texts, respectively
        sia: an instance of SentimentIntensityAnalyzer; a new one is created if omitted
    Returns:
        pd.DataFrame: dataframe with sentiment analysis results and review information
    '''
    # Fall back to a fresh analyzer when none is passed in
    if sia is None:
        sia = SentimentIntensityAnalyzer()

    # Positive reviews first, then negative ones, each in original order
    LsReviews = LsPos + LsNeg

    # Polarity scores (neg, neu, pos, compound) from VADER, one row per review
    dfSP = pd.DataFrame([sia.polarity_scores(s) for s in LsReviews])

    # Original polarity vY: 1 for positive reviews, -1 for negative reviews
    dfSP['vY'] = [1] * len(LsPos) + [-1] * len(LsNeg)

    # Predicted polarity pY: a zero compound score counts as positive;
    # the trailing 0 is reached only if the score is NaN
    dfSP['pY'] = dfSP['compound'].apply(lambda c: -1 if c < 0 else 1 if c >= 0 else 0)

    # Add the original review text
    dfSP['review'] = LsReviews

    return dfSP

df1 = ApplySIA([' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.pos')], 
               [' '.join(sSent) for sSent in sentence_polarity.sents('rt-polarity.neg')])
df1


| |neg|neu|pos|compound|vY|pY|review|
|-|-|-|-|-|-|-|-|
|0|0.0|0.918|0.082|0.3612|1|1|the rock is destined to be the 21st century's new " conan " and that he's going to make a splash...|
|1|0.0|0.758|0.242|0.8069|1|1|the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a col...|
|...|...|...|...|...|...|...|...|
|10660|0.0|0.667|0.333|0.3612|-1|1|the thing looks like a made-for-home-video quickie .|
|10661|0.0|1.000|0.000|0.0000|-1|1|enigma is well-made , but it's just too dry and too placid .|


In [6]:
ApplySIA(['good'], ['bad'])

| |neg|neu|pos|compound|vY|pY|review|
|-|-|-|-|-|-|-|-|
|0|0.0|0.0|1.0|0.4404|1|1|good|
|1|1.0|0.0|0.0|-0.5423|-1|-1|bad|


In [7]:
df1.iloc[-1].tolist()

[0.0,
 1.0,
 0.0,
 0.0,
 -1,
 1,
 "enigma is well-made , but it's just too dry and too placid ."]

In [8]:
# RUN CELL TO TEST YOUR CODE
df0 = ApplySIA(['good'], ['bad'])

@run_unittest
class Test_ApplySIA(unittest.TestCase):
    def test00(self): eq(type(df0), pd.DataFrame)
    def test01(self): eq(df0.shape, (2,7))
    def test02(self): eq(list(df0.columns), ['neg', 'neu', 'pos', 'compound', 'vY', 'pY', 'review'])
    def test03(self): eq(df0.values.tolist(), [[0,0,1,.4404,1,1,'good'], [1,0,0,-.5423,-1,-1,'bad']])
    def test04(self): eq(df1.shape, (10662, 7))
    def test05(self): eq(df1.iloc[0,:6].tolist(), [0.0, 0.918, 0.082, 0.3612, 1, 1])
    def test06(self): eq(df1.iloc[-1].tolist(), [0,1,0,0,-1,1,"enigma is well-made , but it's just too dry and too placid ."])
    def test07(self): aeq(df1.compound[:5].tolist(), [0.3612, 0.8069, 0.2617, 0.8271, 0.6592], 4) # checking row order
    def test08(self): aeq(df1.mean(numeric_only=True).tolist(), [0.0913, 0.7449, 0.1639, 0.1633, 0.0, 0.4219], 4)  # checking all numeric columns


Ran 9 tests in 0.398s

OK
test00 (__main__.Test_ApplySIA) ... ok
test01 (__main__.Test_ApplySIA) ... ok
test02 (__main__.Test_ApplySIA) ... ok
test03 (__main__.Test_ApplySIA) ... ok
test04 (__main__.Test_ApplySIA) ... ok
test05 (__main__.Test_ApplySIA) ... ok
test06 (__main__.Test_ApplySIA) ... ok
test07 (__main__.Test_ApplySIA) ... ok
test08 (__main__.Test_ApplySIA) ... ok

----------------------------------------------------------------------



# Task 2

Complete the `OOV()` UDF, which returns the set of lower-cased words from the `Docs` iterable after discarding every word that is one character long, contains non-alphabetic characters, or is already present in the `Vocab` iterable.

This function collects the words missing from VADER's lexicon so that we can score them and add them to the lexicon.

For example, `OOV(['Good Goods . ', 'a lucky7'])` returns `{'goods'}` because the other words in `['good', 'goods', '.', 'a', 'lucky7']` are too short, non-alphabetic, or already have a valence score in `sia.lexicon`.
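
The three filters can be tried without loading VADER. The sketch below is an illustration only: `oov_sketch` and `toy_vocab` are hypothetical names, and the one-entry dictionary is a stand-in for `sia.lexicon` with an illustrative score.

```python
def oov_sketch(docs, vocab):
    """Return lower-cased alphabetic words longer than one character that are not in vocab."""
    words = ' '.join(docs).lower().split()
    return {w for w in words if w.isalpha() and len(w) > 1 and w not in vocab}

toy_vocab = {'good': 1.9}   # illustrative stand-in for sia.lexicon
print(oov_sketch(['Good Goods . ', 'a lucky7'], toy_vocab))  # {'goods'}
```

Here `good` is dropped because it is in the vocabulary, `.` and `lucky7` fail `isalpha()`, and `a` is too short.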

In [9]:
def OOV(Docs=df1.review, Vocab=sia.lexicon) -> set:
    ''' Out of Vocab (OOV) UDF returns a subset of words in Docs which are not in Vocab.
        Docs are joined by space, lower-cased, and split on whitespace using split() of a string.
    Inputs:
        Docs: an iterable of string elements (list, Series, tuple, set, dict.keys, etc.)
        Vocab: iterable with string keys.
    Returns: set of strings of alpha words (over 1 char), which are not in Vocab     '''
    SsNewWords = set()  # desired output format

    all_words = ' '.join(Docs).lower().split()

    for word in all_words:
        # keep alphabetic words longer than one character that Vocab does not contain
        if word.isalpha() and len(word) > 1 and word not in Vocab:
            SsNewWords.add(word)

    return SsNewWords

SsOOV = OOV()  # unscored words
OOV(['Good Goods . ', 'a lucky7'])

{'goods'}

In [10]:
SsOOV

{'electrocute',
 'ducts',
 'dates',
 'pues',
 'valedictorian',
 'babysitter',
 'license',
 'reaffirms',
 'pauly',
 'gradually',
 'overstylized',
 'unmentioned',
 'athletic',
 'proportions',
 'drabness',
 'substitutes',
 'venus',
 'warlord',
 'fluids',
 'pacino',
 'hits',
 'stumblings',
 'goodfellas',
 'installment',
 'planned',
 'pander',
 'communications',
 'minutiae',
 'naught',
 'key',
 'tug',
 'textural',
 'bunch',
 'dogtown',
 'barbarian',
 'reginald',
 'proctologist',
 'peninsula',
 'psychotic',
 'cumulative',
 'unmolested',
 'sara',
 'bender',
 'leniency',
 'generations',
 'handicapped',
 'flashbulbs',
 'wildcard',
 'breathless',
 'padre',
 'wasp',
 'feelgood',
 'flagging',
 'folks',
 'broder',
 'clinical',
 'brett',
 'mode',
 'smelly',
 'varsity',
 'guessing',
 'liberal',
 'collection',
 'states',
 'chicken',
 'formed',
 'greatness',
 'offset',
 'beverage',
 'married',
 'shrek',
 'reminds',
 'unimaginable',
 'stein',
 'occurs',
 'normal',
 'sordid',
 'freshening',
 'cohesion',


In [11]:
# RUN CELL TO TEST YOUR CODE
s0 = OOV(['Good Goods . ', 'a lucky7'])

@run_unittest
class Test_OOV(unittest.TestCase):
    def test00(self): eq(type(s0), set)
    def test01(self): eq(len(s0), 1)
    def test02(self): eq(s0, {'goods'})
    def test03(self): eq(sorted(SsOOV)[:5], ['aaa', 'aaliyah', 'aan', 'abandone', 'abandono'])
    def test04(self): eq(OOV([df1.review[6]]), {'and', 'combination', 'education', 'of', 'offers', 'rare', 'that'}) 
    def test05(self): eq(len(SsOOV), 14868)


Ran 6 tests in 0.020s

OK
test00 (__main__.Test_OOV) ... ok
test01 (__main__.Test_OOV) ... ok
test02 (__main__.Test_OOV) ... ok
test03 (__main__.Test_OOV) ... ok
test04 (__main__.Test_OOV) ... ok
test05 (__main__.Test_OOV) ... ok

----------------------------------------------------------------------



# Task 3

Complete the `Expand()` UDF, which extends the given set of words with their base forms obtained from WordNet's `morphy()` method. All original and morphed words will later be assigned a valence (by the TextBlob model) and added to VADER's lexicon with their new scores. Since a word's part of speech (POS) in a sentence is unknown when the word is seen in isolation, assume that it can be a noun, an adjective, or a verb with equal chance, and add every base form that `morphy()` returns for these POS back to the set.

The example below uses `wn.morphy()` to standardize different morphologies of `run` to their base form assuming different POS.

In [12]:
wn.morphy('runs', wn.NOUN), wn.morphy('runs', wn.ADJ), wn.morphy('runs', wn.VERB), \
wn.morphy('running', wn.NOUN), wn.morphy('running', wn.ADJ), wn.morphy('running', wn.VERB), \
wn.morphy('ran', wn.NOUN), wn.morphy('ran', wn.ADJ), wn.morphy('ran', wn.VERB)

('run', None, 'run', 'running', 'running', 'run', None, None, 'run')

For example, `Expand({'run','runs','ran','running'})` should return (ordered)

    ['ran', 'run', 'running', 'runs']
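
The expansion loop can be sketched without WordNet data. In the sketch below, `TOY_MORPHY`, `toy_morphy`, and `expand_sketch` are hypothetical names, and the lookup table is copied from the `wn.morphy()` outputs shown above rather than computed. The POS tags `'n'`, `'a'`, and `'v'` are the string values of `wn.NOUN`, `wn.ADJ`, and `wn.VERB`.

```python
# Hypothetical lookup table copied from the wn.morphy() outputs shown above;
# any (word, pos) pair that is absent behaves like morphy() returning None.
TOY_MORPHY = {
    ('runs', 'n'): 'run', ('runs', 'v'): 'run',
    ('running', 'n'): 'running', ('running', 'a'): 'running', ('running', 'v'): 'run',
    ('ran', 'v'): 'run',
}

def toy_morphy(word, pos):
    """Stand-in for wn.morphy(): return the base form or None."""
    return TOY_MORPHY.get((word, pos))

def expand_sketch(words, morphy=toy_morphy):
    """Add every base form found for noun, adjective, and verb readings, then sort."""
    expanded = set(words)
    for word in words:
        for pos in ('n', 'a', 'v'):
            lemma = morphy(word, pos)
            if lemma:
                expanded.add(lemma)
    return sorted(expanded)

print(expand_sketch({'runs', 'ran', 'running'}))  # ['ran', 'run', 'running', 'runs']
```

The input deliberately omits `run`, which shows that the base form is added by the expansion rather than carried over from the input.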

In [13]:
"""def Expand(SsWords={''}) -> list(''):
    ''' Expand() takes a set of words and adds morphed nouns, adjectives and verbs of these words
        using wn.morphy() function, which takes a word and a POS (wn.NOUN, wn.ADJ, or wn.VERB)
    Inputs:   SsWords: a set of word strings
    Returns:  an alphabetically ordered list of original and added words     '''
    SsWordsNew = set()  # Initialize as an empty set
    
    for word in SsWords:
        SsWordsNew.add(word)
        for pos in [wn.NOUN, wn.ADJ, wn.VERB]:
            lemma = wn.morphy(word, pos)
            if lemma:
                SsWordsNew.add(lemma)
    
    return SsWordsNew

LsOOV2 = Expand(SsOOV)  # expanded list of words
print(LsOOV2[:10])"""

In [14]:


def Expand(SsWords=set()) -> list:
    ''' Expand() takes a set of words and adds morphed nouns, adjectives and verbs of these words
        using wn.morphy() function, which takes a word and a POS (wn.NOUN, wn.ADJ, or wn.VERB)
    Inputs:   SsWords: a set of word strings
    Returns:  an alphabetically ordered list of original and added words     '''
    SsWordsNew = set()

    for word in SsWords:
        SsWordsNew.add(word)                  # keep the original word
        for pos in [wn.NOUN, wn.ADJ, wn.VERB]:
            lemma = wn.morphy(word, pos)      # None when WordNet has no base form for this POS
            if lemma:
                SsWordsNew.add(lemma)

    return sorted(SsWordsNew)

LsOOV2 = Expand(SsOOV)
print(LsOOV2[:10])  # Print the first 10 words of the expanded list

['aaa', 'aaliyah', 'aan', 'abandone', 'abandono', 'abbas', 'abbass', 'abbott', 'abbreviate', 'abbreviated']

In [15]:
# RUN CELL TO TEST YOUR CODE
e0 = Expand({'run','runs','ran','running'})

@run_unittest
class Test_Expand(unittest.TestCase):
    def test00(self): eq(type(e0), list)
    def test01(self): eq(len(e0), 4)
    def test02(self): eq(e0, ['ran', 'run', 'running', 'runs'])
    def test03(self): eq(len(LsOOV2), 16224)
    def test04(self): eq(LsOOV2[:5], ['aaa', 'aaliyah', 'aan', 'abandone', 'abandono'])
    def test05(self): eq(LsOOV2[-5:], ['ótimo', 'último', 'últimos', 'única', 'único'])

Ran 6 tests in 0.001s

OK
test00 (__main__.Test_Expand) ... ok
test01 (__main__.Test_Expand) ... ok
test02 (__main__.Test_Expand) ... ok
test03 (__main__.Test_Expand) ... ok
test04 (__main__.Test_Expand) ... ok
test05 (__main__.Test_Expand) ... ok

----------------------------------------------------------------------



# Task 4

Complete the `GetScores()` UDF so that it takes a list of words and keeps those with a nonzero polarity from TextBlob's `polarity` attribute, pairing each kept word with that polarity multiplied by 4.

For example, `GetScores(['smart','cry','liked','sweet','sour'])` returns

    [('smart', 0.8571428571428571), ('liked', 2.4), ('sweet', 1.4), ('sour', -0.6000000000000001)]

Here `cry` is dropped because TextBlob assigns it zero polarity.
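
The scaling and filtering step can be sketched without TextBlob. The names `TOY_POLARITY` and `scale_scores` are hypothetical, and the polarity values are illustrative: they are the scores shown for `liked` and `sweet` in this notebook divided by 4, plus a zero for `cry`. In the real function the polarity comes from `TextBlob(word).polarity`.

```python
# Illustrative polarities: the documented scores for 'liked' and 'sweet' divided by 4,
# plus a zero for 'cry'. Real values would come from TextBlob(word).polarity.
TOY_POLARITY = {'liked': 0.6, 'sweet': 0.35, 'cry': 0.0}

def scale_scores(words, polarity_of, factor=4):
    """Keep words with nonzero polarity and scale polarity from [-1, 1] to [-4, 4]."""
    scored = []
    for word in words:
        polarity = polarity_of(word)
        if polarity != 0:                       # zero-polarity words carry no valence
            scored.append((word, factor * polarity))
    return scored

result = scale_scores(['cry', 'liked', 'sweet'], lambda w: TOY_POLARITY.get(w, 0.0))
print([(w, round(s, 2)) for w, s in result])    # [('liked', 2.4), ('sweet', 1.4)]
```

The factor of 4 maps TextBlob's polarity range onto the range of VADER's valence scores, and the input order of the surviving words is preserved.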

In [16]:
LsOOV2

['aaa',
 'aaliyah',
 'aan',
 'abandone',
 'abandono',
 'abbas',
 'abbass',
 'abbott',
 'abbreviate',
 'abbreviated',
 'abc',
 'abderrahmane',
 'abdul',
 'abel',
 'aberration',
 'abide',
 'abiding',
 'abject',
 'able',
 'ably',
 'abomination',
 'aborbing',
 'aboriginal',
 'abort',
 'aborted',
 'aboul',
 'abound',
 'about',
 'above',
 'abrams',
 'abrasive',
 'abridge',
 'abridged',
 'abroad',
 'abrupt',
 'abruptly',
 'absence',
 'absent',
 'absolutamente',
 'absolute',
 'absolutely',
 'absorb',
 'absorbed',
 'absorbing',
 'absorbs',
 'abstract',
 'absurd',
 'absurdist',
 'absurdities',
 'absurdity',
 'absurdly',
 'abundance',
 'abundant',
 'abundantly',
 'aburrido',
 'aby',
 'abysmal',
 'abysmally',
 'abyss',
 'acaba',
 'acabamos',
 'academic',
 'academy',
 'accelerate',
 'accelerated',
 'accent',
 'accents',
 'accentuate',
 'accentuating',
 'access',
 'accessibility',
 'accessible',
 'acclaim',
 'acclaimed',
 'accommodate',
 'accomodates',
 'accompanied',
 'accompanies',
 'accompany',
 

In [17]:
def GetScores(LsWords=LsOOV2) -> [(str, float)]:
    ''' For every word in LsWords, find its polarity score using TextBlob(..).polarity.
        Then add the (word, 4*polarity) tuple to the list LTsWS.
        Scaling the polarity from [-1,1] by 4 puts it in the range [-4,4],
        which coincides with the range of VADER's valence scores.
        Keep the word ordering the same and drop words with zero polarity.
    Input:    list of string words
    Return:   a list of tuples in the form (word, valence score)      '''
    LTsWS = []
    for word in LsWords:
        polarity = TextBlob(word).polarity
        if polarity != 0:                        # skip words TextBlob cannot score
            scaled_polarity = 4 * polarity
            LTsWS.append((word, scaled_polarity))

    return LTsWS

%time LTsWS = GetScores()
print(LTsWS[:5])

CPU times: user 8 s, sys: 4.31 ms, total: 8 s
Wall time: 8.04 s
[('able', 2.0), ('ably', 2.0), ('abridged', 0.4), ('abrupt', -0.5), ('abruptly', -0.5)]


In [18]:
GetScores(['smart','cry','liked','sweet','sour']) 

[('smart', 0.8571428571428571),
 ('liked', 2.4),
 ('sweet', 1.4),
 ('sour', -0.6000000000000001)]

In [19]:
"""LsDocs = \
  ['Yes', 'No', 'Yes :-(', "good idea", "GOOD idea", "good idea!", "good idea!!!",
   "idea's good!!!!!!!!", "idea's good !!!!!!!!", "good idea!!!!!!!!",      # too many exclamations may fail
   "not a good idea", "it isn't a good idea", "good and risky idea",   # negation and multi-attitude towards the movie
   "idea is good, but risky"]           # conjunction "but" signals a change in polarity towards the dominant phrase

def PolSub(sDoc='great idea!'):
    tb = TextBlob(sDoc)
    return (sDoc, tb.polarity, tb.subjectivity)

df = pd.DataFrame([PolSub(s) for s in LsDocs], columns=['doc','pol','subj']).set_index('doc')
df.T.style.background_gradient(cmap='coolwarm', vmin=-1, vmax=1).set_precision(2)"""


In [20]:
"""

def GetScores(LsWords) -> [(str, float)]:
    ''' For every word in LsWords, find its polarity score using TextBlob(..).polarity. 
        Then scale the polarity score by multiplying by 4 and return a tuple (word, valence score).
        Keep the word ordering the same and drop words with zero polarity.
    Input:    list of string words
    Return:   a list of tuples in the form (word, valence score) '''
    

    
    LTsWS = []
    for word in LsWords:
        blob = TextBlob(word).polarity  # Polarity score from TextBlob
    if blob != 0:
        scaled_polarity = blob * 4  # Scale to the range [-4, 4]
        LTsWS.append((word, scaled_polarity))
    
    
    return LTsWS
#%time LTsWS = GetScores()
print(LTsWS[:5])"""


In [25]:
LTsWS[:4]

[('able', 2.0), ('ably', 2.0), ('abridged', 0.4), ('abrupt', -0.5)]

In [26]:
# RUN CELL TO TEST YOUR CODE
g0 = GetScores(['smart','cry','liked','sweet','sour'])

@run_unittest
class Test_GetScores(unittest.TestCase):
    def test00(self): eq(type(g0), list)
    def test01(self): eq(len(g0), 4)
    def test02(self): eq(g0, [('smart', 0.8571428571428571), ('liked', 2.4), ('sweet', 1.4),  ('sour', -0.6000000000000001)])
    def test03(self): eq(len(LTsWS), 780)
    def test04(self): eq(LTsWS[:4], [('able', 2.0), ('ably', 2.0), ('abridged', 0.4), ('abrupt', -0.5)])
    def test05(self): eq(LTsWS[-4:], [('witty', 2.0), ('workmanlike', 2.0), ('yarn', -0.4), ('young', 0.4)])

Ran 6 tests in 0.002s

OK
test00 (__main__.Test_GetScores) ... ok
test01 (__main__.Test_GetScores) ... ok
test02 (__main__.Test_GetScores) ... ok
test03 (__main__.Test_GetScores) ... ok
test04 (__main__.Test_GetScores) ... ok
test05 (__main__.Test_GetScores) ... ok

----------------------------------------------------------------------



# Reapply Expanded VADER

Next, VADER's lexicon is expanded with the new words in `LTsWS`.

In [27]:
sia.lexicon.update(dict(LTsWS))   # add the new (word, score) pairs; words already present are overwritten
print("New size of VADER's lexicon:", len(sia.lexicon))

New size of VADER's lexicon: 8256
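
The lexicon grows from 7502 to 8256 entries, that is by 754, although `LTsWS` holds 780 pairs. This is consistent with 26 of the scored words already being present in the lexicon and having their scores replaced. The sketch below illustrates that behavior of `dict.update()` on a toy dictionary; all words and scores in it are hypothetical stand-ins.

```python
lexicon = {'good': 1.9, 'bad': -2.5}            # toy stand-in for sia.lexicon
new_scores = [('good', 2.8), ('witty', 2.0)]    # hypothetical (word, score) pairs

size_before = len(lexicon)
overlap = sum(word in lexicon for word, _ in new_scores)   # words that already have a score
lexicon.update(dict(new_scores))                # existing keys are overwritten

print(len(lexicon) - size_before, overlap)      # 1 1
print(lexicon['good'])                          # 2.8
```

If the original VADER scores should take precedence, one option is to skip words that are already keys before updating.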


Then the `df2` dataframe is computed using the sentiment analyzer with the expanded lexicon.

In [28]:
%time df2 = pd.DataFrame([sia.polarity_scores(s) for s in df1.review])
df2['review'] = df1.review
df2['vY'] = df1.vY
df2['pY'] = df2.compound.apply(lambda c: -1 if c < 0 else 1 if c >= 0 else 0)   # predicted polarity (based on threshold)
df2

CPU times: user 3.52 s, sys: 5.89 ms, total: 3.52 s
Wall time: 3.52 s


| |neg|neu|pos|compound|review|vY|pY|
|-|-|-|-|-|-|-|-|
|0|0.000|0.870|0.130|0.4670|the rock is destined to be the 21st century's new " conan " and that he's going to make a splash...|1|1|
|1|0.053|0.644|0.303|0.8521|the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a col...|1|1|
|...|...|...|...|...|...|...|...|
|10660|0.000|0.667|0.333|0.3612|the thing looks like a made-for-home-video quickie .|-1|1|
|10661|0.318|0.682|0.000|-0.4939|enigma is well-made , but it's just too dry and too placid .|-1|-1|


# Compare the Performance

Next, performance metrics are computed for the SIA model using the original and the expanded vocabulary. In the reports below, every metric for both classes is at least as good with the expanded lexicon as without it. For the negative class (-1), precision rises from 0.69 to 0.72, recall from 0.40 to 0.46, and the F1 score from 0.51 to 0.56. For the positive class (1), precision rises from 0.58 to 0.60, recall stays at 0.82, and the F1 score rises from 0.68 to 0.69. Overall accuracy improves from 0.61 to 0.64.

An important takeaway is that no hypothesis of improvement guarantees an improvement in model performance; each idea needs to be tested on the given data, as is done here. In general, brainstorming many ideas is valuable because it allows one to choose those that appear most profitable and least costly.

In [29]:
print(rpt(y_true=df1.vY, y_pred=df1.pY))

              precision    recall  f1-score   support

          -1       0.69      0.40      0.51      5331
           1       0.58      0.82      0.68      5331

    accuracy                           0.61     10662
   macro avg       0.64      0.61      0.59     10662
weighted avg       0.64      0.61      0.59     10662



In [30]:
print(rpt(y_true=df2.vY, y_pred=df2.pY))

              precision    recall  f1-score   support

          -1       0.72      0.46      0.56      5331
           1       0.60      0.82      0.69      5331

    accuracy                           0.64     10662
   macro avg       0.66      0.64      0.63     10662
weighted avg       0.66      0.64      0.63     10662
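
The F1 scores in the two reports can be reproduced from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses the rounded values printed above, so agreement to two decimals is expected but small rounding differences are possible in general. The helper name `f1` is hypothetical.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs read off the two classification reports
original = {-1: (0.69, 0.40), 1: (0.58, 0.82)}
expanded = {-1: (0.72, 0.46), 1: (0.60, 0.82)}

for label in (-1, 1):
    print(label, round(f1(*original[label]), 2), round(f1(*expanded[label]), 2))
# -1 0.51 0.56
# 1 0.68 0.69
```

The larger gain for the negative class comes mostly from its recall, which rises from 0.40 to 0.46 while recall for the positive class is unchanged.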

