# COLX 565 Lab Assignment 1: Polarity Lexicons (Cheat sheet)

## Assignment Objectives

In this assignment you will
- Compare sentiment lexicons for sentiment analysis
- Apply contextual valence shifters to improve polarity detection
- Identify challenges in lexicon-based sentiment analysis by carrying out error analysis

- Ex1: Human-built lexicons
    - NLTK's `opinion_lexicon`
    - `VADER` lexicon
    - `SO-CAL lexicon` *
    - `Subjectivity lexicon`
    - `evaluate_lexicon` on "Sentence Polarity" and "Pros and Cons" in NLTK
- Ex2: Induced lexicons
    - `SentiWordNet`
    - Semantic Axis using `word2vec` (See 25.4.1 in https://web.stanford.edu/~jurafsky/slp3/25.pdf)
    - `random_walk` (optional)
    - `intrinsic_evaluate` with `SO-CAL lexicon`
- Ex3: Valence shifters
    - intensification from `SO-CAL`
    - negation (*no* and *not*)
    - `evaluate_lexicon` with intensification and/or negation
- Ex4: Error analysis
    - print out sentences which are not labeled correctly (e.g. negative)
    - identify three categories of errors (manually)

## Getting Started

You need to have downloaded the following NLTK resources:

In [43]:
#provided code
import nltk
nltk.download("sentence_polarity")
nltk.download("pros_cons")
nltk.download("opinion_lexicon")
nltk.download("sentiwordnet")
nltk.download("word2vec_sample")

[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package sentence_polarity is already up-to-date!
[nltk_data] Downloading package pros_cons to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package pros_cons is already up-to-date!
[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


True

Run the code below to access relevant modules (you can add to this as needed)

In [44]:
#provided code
import gensim
from nltk.data import find
from nltk.corpus import sentence_polarity,pros_cons,opinion_lexicon
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
import numpy as np
from scipy.spatial.distance import cosine
from scipy.sparse import csr_matrix, lil_matrix
import urllib.request

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

### Exercise 1: Human-built lexicons

#### Exercise 1.1:
rubric={accuracy:1}

Access the `opinion_lexicon` included with NLTK, and convert its lists of words into a dictionary where positive words have the value 1 and negative words have the value -1.

https://www.nltk.org/_modules/nltk/corpus/reader/opinion_lexicon.html

```
from nltk.corpus import opinion_lexicon

def positive(self):
    """
    Return all positive words in alphabetical order.
    """
    a+
    abundance
    acclaim
    ...

def negative(self):
    """
    Return all negative words in alphabetical order.
    """
    abnormal
    abolish
    abominable
    ...
```

For example:

```
{"good": 1,
 "bad": -1,
 "terrible": -1,
 ...
 }
 ```
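As a minimal sketch of this conversion, assume two plain Python lists stand in for what `opinion_lexicon.positive()` and `opinion_lexicon.negative()` return. The helper name `build_polarity_dict` and the toy lists are illustrative only and not part of NLTK:

```python
def build_polarity_dict(positive_words, negative_words):
    """Map each positive word to 1 and each negative word to -1."""
    lexicon = {word: 1 for word in positive_words}
    # Negative entries are written second, so a word found in both lists ends up as -1.
    lexicon.update({word: -1 for word in negative_words})
    return lexicon

# Toy stand-ins for the two NLTK word lists.
toy_positive = ["a+", "abundance", "acclaim"]
toy_negative = ["abnormal", "abolish", "abominable"]

toy_lexicon = build_polarity_dict(toy_positive, toy_negative)
print(toy_lexicon["a+"], toy_lexicon["abnormal"])
```

How to treat a word that appears in both lists is a design choice; this sketch simply lets the negative entry win.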

In [47]:
NLTK_lexicon = {}

#Your code here
#Your code here

In [48]:
print("a+", NLTK_lexicon["a+"])
print("abnormal", NLTK_lexicon["abnormal"])

a+ 1
abnormal -1


#### Exercise 1.2:
rubric={accuracy:1}

Next, get the crowdsourced [`VADER`](https://github.com/cjhutto/vaderSentiment) lexicon by directly accessing the GitHub page [here](https://raw.githubusercontent.com/cjhutto/vaderSentiment/master/vaderSentiment/vader_lexicon.txt). The file is tab-delimited: the lexical entry is the first item in each row and the semantic orientation is the second. Those are the only two pieces of information you need to store, again in a dictionary.

VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon: https://github.com/cjhutto/vaderSentiment

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. *Proceedings of the Eighth International Conference on Weblogs and Social Media (ICWSM-14)*. Ann Arbor, MI, June 2014.

```
OK      OK      KO      KO   
--------------------------------------------------------------------------------
:(      -1.9    1.13578 [-2, -3, -2, 0, -1, -1, -2, -3, -1, -4]
:)      2.0     1.18322 [2, 2, 1, 1, 1, 1, 4, 3, 4, 1]
happy   2.7     0.9     [2, 2, 2, 4, 2, 4, 3, 4, 2, 2]
unhappy -1.8    0.6     [-2, -2, -1, -3, -2, -2, -2, -1, -2, -1]
```
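The row format above can be parsed roughly as follows. This is a sketch under assumptions: the function name `parse_vader_lines` is hypothetical, the sample rows are copied from the excerpt above, and UTF-8 decoding is assumed because `urllib.request.urlopen` yields bytes.

```python
def parse_vader_lines(lines):
    """Keep only the first two tab-separated fields: entry and semantic orientation."""
    lexicon = {}
    for raw in lines:
        # Lines read from a URL arrive as bytes; decode them (UTF-8 is an assumption).
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        lexicon[fields[0]] = float(fields[1])
    return lexicon

# Rows copied from the excerpt above, written as bytes to mimic a URL stream.
sample_rows = [
    b":(\t-1.9\t1.13578\t[-2, -3, -2, 0, -1, -1, -2, -3, -1, -4]\n",
    b":)\t2.0\t1.18322\t[2, 2, 1, 1, 1, 1, 4, 3, 4, 1]\n",
    b"happy\t2.7\t0.9\t[2, 2, 2, 4, 2, 4, 3, 4, 2, 2]\n",
]

sample_lexicon = parse_vader_lines(sample_rows)
print(sample_lexicon[":("], sample_lexicon[":)"])
```

Splitting on the tab character rather than on arbitrary whitespace matters here, because the fourth column contains spaces.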

In [None]:
VADER_lexicon = {}

location = "https://raw.githubusercontent.com/cjhutto/vaderSentiment/master/vaderSentiment/vader_lexicon.txt"
f = urllib.request.urlopen(location)
#Your code here
#Your code here

In [None]:
print(":(", VADER_lexicon[":("])
print(":)", VADER_lexicon[":)"])

:( -1.9
:) 2.0


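The `compound` score reported by the full VADER analyzer is computed by summing the valence scores of the words in the text, adjusting them according to VADER's rules, and normalizing the result to lie between -1 (most negative) and +1 (most positive). Typical thresholds are: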


* positive sentiment: compound score $\geq$ 0.05
* neutral sentiment:  -0.05 $<$ compound score  $<$ 0.05
* negative sentiment: compound score $\leq$ -0.05

```
> VADER is smart, handsome, and funny.
{'pos': 0.746, 'compound': 0.8316, 'neu': 0.254, 'neg': 0.0}
> VADER is not smart, handsome, nor funny.
{'pos': 0.0, 'compound': -0.7424, 'neu': 0.354, 'neg': 0.646}
```
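The thresholds listed above translate directly into a small labeling function. The name `label_from_compound` is hypothetical, and the two compound values are the ones shown in the example output:

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a label using the thresholds listed above."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(0.8316))   # compound score of the first example sentence
print(label_from_compound(-0.7424))  # compound score of the second example sentence
```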

#### Exercise 1.3:
rubric={accuracy:1}

Now access the [SO-CAL lexicon](https://github.com/sfu-discourse-lab/SO-CAL/tree/master/Resources/dictionaries/English), which, like VADER, you should pull directly from the internet rather than saving any files. Your code will be very similar to the VADER code, but note that there are actually 4 lexicons, one for each open-class part of speech, that you should collapse into a single Python dictionary (it's okay if you just overwrite any duplicates, i.e. the same word appearing in different parts of speech).

SO-CAL, the Semantic Orientation CALculator (https://github.com/sfu-discourse-lab/SO-CAL), has been around since 2004.

Taboada, M., Brooke, J., Tofiloski, M., Voll, K. and Stede, M. (2011) Lexicon-Based Methods for Sentiment Analysis. *Computational Linguistics* 37(2): 267-307.


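**4 open-class parts of speech** (adjective, adverb, noun, verb; the directory also lists `google_dict.txt` and `int_dictionary1.11.txt`, which are not part of this exercise): https://github.com/sfu-discourse-lab/SO-CAL/tree/master/Resources/dictionaries/English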

```
adj_dictionary1.11.txt
adv_dictionary1.11.txt
google_dict.txt
int_dictionary1.11.txt
noun_dictionary1.11.txt
verb_dictionary1.11.txt
```
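Collapsing several per-POS files into one dictionary might look like the sketch below. To stay runnable offline it reads from in-memory lists rather than from the URLs; the function name `merge_lexicon_sources` and the toy scores are invented for illustration, and splitting each line on its last run of whitespace is an assumption about the file layout.

```python
def merge_lexicon_sources(sources):
    """Collapse several word/score sources into one dict; later sources overwrite earlier ones."""
    merged = {}
    for name, lines in sources.items():
        print("processing...", name)
        for line in lines:
            # Split on the last run of whitespace so the score is always the final field.
            parts = line.strip().rsplit(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            word, score = parts
            merged[word] = float(score)
    return merged

# Toy stand-ins for the downloaded files; the scores are invented for illustration.
toy_sources = {
    "adj_dictionary1.11.txt": ["unrealistic\t-1\n", "good\t3\n"],
    "noun_dictionary1.11.txt": ["good\t2\n", "mess\t-2\n"],
}

toy_so_cal = merge_lexicon_sources(toy_sources)
print(toy_so_cal)
```

With the real data, each list of lines would presumably come from `urllib.request.urlopen(location + name)`, decoded line by line.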

In [None]:
SO_CAL_lexicon = {}

location = "https://raw.githubusercontent.com/sfu-discourse-lab/SO-CAL/master/Resources/dictionaries/English/"
#Your code here
#Your code here

processing... adj_dictionary1.11.txt
processing... noun_dictionary1.11.txt
processing... verb_dictionary1.11.txt
processing... adv_dictionary1.11.txt


In [None]:
assert SO_CAL_lexicon["unrealistic"] == -1.0

#### Exercise 1.4:
rubric={accuracy:2}

The last lexicon you have to include in this analysis is the [Subjectivity lexicon](http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/). Since downloading it is not entirely trivial, we have included it in your repo. It has a very different format from the other lexicons: each line consists of key/value pairs. You are interested in "word1", which provides the word, "priorpolarity", which gives the polarity, and "type", which gives the strength. The lexicon includes neutral words, which can be ignored. Otherwise, you can treat it much like the NLTK lexicon, except that words with strong subjectivity should have values of 2/-2 instead of 1/-1.

Subjectivity lexicon http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/ (Note that `subjectivity.txt` is provided)


T. Wilson, J. Wiebe, and P. Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. *Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing*. https://www.aclweb.org/anthology/H05-1044/

```
strength        length word          pos      stemmed    polarity
--------------------------------------------------------------------------------
type=strongsubj len=1  word1=happy   pos1=adj stemmed1=n priorpolarity=positive
type=strongsubj len=1  word1=unhappy pos1=adj stemmed1=n priorpolarity=negative
```
- `subjectivity.txt` is delimited with " "
- `positive` or `negative` =  2/-2 if `type=strongsubj`, otherwise = 1/-1 
- neutral words can be ignored
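The rules above can be sketched for a single line as follows. The function name `parse_subjectivity_line` is hypothetical; the first two sample lines follow the excerpt above, and the last two are invented to exercise the weak and neutral cases.

```python
def parse_subjectivity_line(line):
    """Return (word, value) for a positive/negative entry, or None for anything else."""
    # Each whitespace-separated item looks like key=value.
    fields = dict(item.split("=", 1) for item in line.split() if "=" in item)
    polarity = fields.get("priorpolarity")
    if polarity not in ("positive", "negative"):
        return None  # neutral, "both", or a malformed key
    strength = 2 if fields.get("type") == "strongsubj" else 1
    value = strength if polarity == "positive" else -strength
    return fields["word1"], value

sample_lines = [
    "type=strongsubj len=1 word1=happy pos1=adj stemmed1=n priorpolarity=positive",
    "type=strongsubj len=1 word1=unhappy pos1=adj stemmed1=n priorpolarity=negative",
    "type=weaksubj len=1 word1=okay pos1=adj stemmed1=n priorpolarity=positive",   # invented
    "type=weaksubj len=1 word1=chair pos1=noun stemmed1=n priorpolarity=neutral",  # invented
]

toy_subjectivity = {}
for sample in sample_lines:
    parsed = parse_subjectivity_line(sample)
    if parsed is not None:
        word, value = parsed
        toy_subjectivity[word] = value

print(toy_subjectivity)
```

A line whose polarity key is malformed, such as the single `ppriorpolarity=negative` entry in the counts in this section, is skipped by this sketch; whether to repair it instead is left open.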

```

% cut -f6 -d" " subjectivity.txt | sort | uniq -c  
   1 ppriorpolarity=negative
  21 priorpolarity=both
4911 priorpolarity=negative
 570 priorpolarity=neutral
2718 priorpolarity=positive

% cut -f1 -d" " subjectivity.txt | sort | uniq -c
5569 type=strongsubj
2652 type=weaksubj
```

In [None]:
subjectivity_lexicon = {}

f = open("subjectivity.txt")
#Your code here
#Your code here

In [None]:
print("happy", subjectivity_lexicon["happy"])
print("unhappy", subjectivity_lexicon["unhappy"])

happy 2
unhappy -2


#### 1.5
rubric={accuracy:1,quality:1}

Present some basic stats for these lexicons. You should include the total number of words, the number of positive and negative words, and the range of semantic orientation (SO) values.
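A sketch of the statistics involved, returning them as a dictionary instead of printing so they are easy to check. The name `summarize_lexicon` is hypothetical, and the toy lexicon reuses the four VADER rows quoted earlier:

```python
def summarize_lexicon(lexicon):
    """Collect the size, polarity counts, and SO range of a word -> SO dictionary."""
    values = list(lexicon.values())
    return {
        "total": len(values),
        "positive": sum(1 for v in values if v > 0),
        "negative": sum(1 for v in values if v < 0),
        "min": min(values),
        "max": max(values),
    }

toy_vader = {":(": -1.9, ":)": 2.0, "happy": 2.7, "unhappy": -1.8}
toy_stats = summarize_lexicon(toy_vader)
print(toy_stats)
```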

In [None]:
def get_stats(lexicon):
    '''
    A function that, when given a lexicon, will print out the total number of words in the lexicon,
    the number of positive words, the number of negative words, and the range of the SO values.
    '''
    #Your code here
    #Your code here 
    
print("NLTK lexicon")
get_stats(NLTK_lexicon)

print("VADER lexicon")
get_stats(VADER_lexicon)

print("SO-CAL lexicon")
get_stats(SO_CAL_lexicon)

print("Subjectivity lexicon")
get_stats(subjectivity_lexicon)

NLTK lexicon
total words: 6786
positive words: 2003
negative words: 4783
range -1 to 1
VADER lexicon
total words: 7506
positive words: 3337
negative words: 4169
range -3.9 to 3.4
SO-CAL lexicon
total words: 6091
positive words: 2478
negative words: 3613
range -5.0 to 5.0
Subjectivity lexicon
total words: 6451
positive words: 2301
negative words: 4150
range -2 to 2


#### 1.6
rubric={accuracy:2,quality:2}

Evaluate these four lexicons against two corpora included in NLTK: the `sentence_polarity` corpus (snippets from Rotten Tomatoes) and `pros_cons` (pros and cons from camera reviews). You can just sum the SO of the words in each document and consider it positive if the SO is greater than zero, and negative if it is less than zero. For each corpus and each lexicon, present an overall accuracy as well as an accuracy for each type of text (positive and negative). You should modularize your code and avoid repetition.

HINT: to get full quality points, you'll want at least two functions: one that evaluates a given corpus with a given lexicon and another that calculates the SO for a particular text. Within the former function, it might be easier to iterate over positive and negative sentences in separate loops (remember the *categories* keyword for NLTK corpora). It is possible to be modular here because, although the two corpora have different labels for positive and negative, the negative label for both corpora is \*corpus\*.categories()\[0\] and the positive label for both is \*corpus\*.categories()\[1\].

In [None]:
def calculate_SO(text,lexicon,ints=None,negs=None):
    '''calculate a semantic orientation for a text as the sum of the semantic orientation
    of the words of the text as provided by lexicon'''
    
    #Your code here
    #Your code here

def evaluate_lexicon(corpus,lexicon,ints=None,negs=None):
    '''given a lexicon and a sentiment analysis corpus with labels, calculate and print the
    percentage accuracy in polarity classification for each class, and overall'''
    
    pos_total = 0
    neg_total = 0
    neg_correct = 0
    pos_correct = 0

    #Your code here
    #Your code here

    print("-----")    
    print("positive correct")
    print(pos_correct/pos_total)    
    print("negative correct")
    print(neg_correct/neg_total)
    print("total")
    print((pos_correct + neg_correct)/(neg_total + pos_total))

    print("-------------------------")

```
from nltk.corpus import sentence_polarity, pros_cons

corpus.sents(categories=corpus.categories()[0]) #-> negative sentences
corpus.sents(categories=corpus.categories()[1]) #-> positive sentences
```
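The evaluation logic can be tried on toy data before running it on the NLTK corpora. In the sketch below the two sentence lists stand in for `corpus.sents(categories=...)`; the names `text_so` and `class_accuracy`, the toy lexicon, and the toy sentences are all invented for illustration.

```python
def text_so(tokens, lexicon):
    """Sum the SO of every token found in the lexicon (unknown tokens count as 0)."""
    return sum(lexicon.get(token.lower(), 0) for token in tokens)

def class_accuracy(neg_sents, pos_sents, lexicon):
    """Per-class and overall accuracy; a text with SO exactly 0 is counted as wrong."""
    neg_correct = sum(1 for sent in neg_sents if text_so(sent, lexicon) < 0)
    pos_correct = sum(1 for sent in pos_sents if text_so(sent, lexicon) > 0)
    total = len(neg_sents) + len(pos_sents)
    return {
        "positive": pos_correct / len(pos_sents),
        "negative": neg_correct / len(neg_sents),
        "total": (pos_correct + neg_correct) / total,
    }

toy_lexicon = {"good": 1, "great": 1, "bad": -1, "dull": -1}
toy_neg = [["a", "dull", "film"], ["not", "good"]]
toy_pos = [["a", "great", "film"], ["good", "fun"]]

toy_result = class_accuracy(toy_neg, toy_pos, toy_lexicon)
print(toy_result)
```

The second negative sentence ("not good") is scored as positive, which is the kind of error the valence shifters in Exercise 3 are meant to address.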

In [None]:
print("Positive sentence from the 'sentence_polarity' corpus:")

text = sentence_polarity.sents(categories=sentence_polarity.categories()[1])[0]
print(text)

print(sum(NLTK_lexicon.get(word.lower(),0) for word in text))
print(sum(VADER_lexicon.get(word.lower(),0) for word in text))
print(sum(SO_CAL_lexicon.get(word.lower(),0) for word in text))
print(sum(subjectivity_lexicon.get(word.lower(),0) for word in text))

Positive sentence from the 'sentence_polarity' corpus:
['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', "century's", 'new', '"', 'conan', '"', 'and', 'that', "he's", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean-claud', 'van', 'damme', 'or', 'steven', 'segal', '.']
0
1.5
6.0
2


In [None]:
# nltk.download("sentence_polarity")
# nltk.download("pros_cons")

#Provided code
for ev_name,corpus in [("Sentence Polarity",sentence_polarity),
                        ("Pros and Cons",pros_cons)]:
    print("******")
    print(ev_name)
    for lex_name,lexicon in [("NLTK",NLTK_lexicon),("VADER",VADER_lexicon),
                             ("SO-CAL",SO_CAL_lexicon),("Subjectivity",subjectivity_lexicon)]:
        print(lex_name)
        evaluate_lexicon(corpus,lexicon)
#Provided code

******
Sentence Polarity
NLTK
-----
positive correct
0.5670605890076909
negative correct
0.4483211404989683
total
0.5076908647533296
VADER
-----
positive correct
0.708309885574939
negative correct
0.3676608516225849
total
0.537985368598762
SO-CAL
-----
positive correct
0.7454511348715063
negative correct
0.4558244231851435
total
0.6006377790283249
Subjectivity
-----
positive correct
0.6479084599512287
negative correct
0.44982179703620334
total
0.548865128493716
******
Pros and Cons
NLTK
-----
positive correct
0.7653443766346992
negative correct
0.43562241116197953
total
0.6005013623978201
VADER
-----
positive correct
0.704925893635571
negative correct
0.3659908436886854
total
0.5354768392370572
SO-CAL
-----
positive correct
0.8216652136006974
negative correct
0.4293873991715718
total
0.6255476839237057
Subjectivity
-----
positive correct
0.7469485614646905
negative correct
0.4391105297580118
total
0.5930463215258855


#### 1.7
rubric={reasoning:1}

Discuss your results from 1.5 and 1.6. What stands out?

Answer:



#### 1.8 optional
rubric={accuracy:1}

Investigate these dictionaries further. Here are some questions you can try to answer: To what extent do these dictionaries contain the same words? Are there any clear patterns in the words they don't share? For the words they do share, do they always agree on the polarity? If not, why not? For the dictionaries with strength/intensity scores, how well are the intensity scores correlated? Is there any benefit to be gained from combining dictionaries, in terms of performance on the sentiment analysis task?

Provide both code and discussion.

In [None]:
# Open-ended, no official solution!

### Exercise 2: Induced lexicons

#### 2.1
rubric={accuracy:2}

SentiWordNet provides a positive, negative, and neutral score for each synset in WordNet, based on a random walk algorithm. Create a lexicon where the SO value for each word is the average of the positive score minus the negative score across all possible senses of each word. Do this using the words in SO-CAL, skipping words that aren't in WordNet. The [WordNet howto](https://www.nltk.org/howto/wordnet.html) and [SentiWordNet howto](http://www.nltk.org/howto/sentiwordnet.html) might be useful.

*good*:

```
<good.n.01: PosScore=0.5 NegScore=0.0>          : 0.5 - 0.0
<good.n.02: PosScore=0.875 NegScore=0.0>        : 0.875 - 0.0
<good.n.03: PosScore=0.625 NegScore=0.0>        : ...
<commodity.n.01: PosScore=0.0 NegScore=0.0>
<good.a.01: PosScore=0.75 NegScore=0.0>
<full.s.06: PosScore=0.0 NegScore=0.0>
<good.a.03: PosScore=1.0 NegScore=0.0>
<estimable.s.02: PosScore=1.0 NegScore=0.0>
<beneficial.s.01: PosScore=0.625 NegScore=0.0>
...
<good.s.17: PosScore=0.75 NegScore=0.0>
<good.s.18: PosScore=0.875 NegScore=0.0>
<good.s.19: PosScore=0.5 NegScore=0.0>
<good.s.20: PosScore=0.375 NegScore=0.125>
<good.s.21: PosScore=0.75 NegScore=0.0>
<well.r.01: PosScore=0.375 NegScore=0.0>
<thoroughly.r.02: PosScore=0.0 NegScore=0.0>

sum of (positive score - negative score) / len(synsets)
```
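The averaging step on its own can be sketched with plain `(pos_score, neg_score)` pairs. The function name `average_so` is hypothetical and the pairs are the first four senses of *good* listed above, so the result is not the full-word value. With NLTK, the pairs would presumably come from `swn.senti_synsets(word)` via each sense's `pos_score()` and `neg_score()`.

```python
def average_so(sense_scores):
    """Average (positive - negative) over all senses; None if there are no senses."""
    if not sense_scores:
        return None  # e.g. a word that is not in WordNet
    return sum(pos - neg for pos, neg in sense_scores) / len(sense_scores)

# (PosScore, NegScore) for the first four senses of "good" listed above.
first_four_senses = [(0.5, 0.0), (0.875, 0.0), (0.625, 0.0), (0.0, 0.0)]
print(average_so(first_four_senses))
```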

In [None]:
SWN_lexicon = {}

def get_synset_SO(synset_name):
    '''get the SO for a synset in sentiwordnet by subtracting the neg_score from the pos_score'''
    #Your code here
    #Your code here
    
for word in SO_CAL_lexicon:
    #Your code here
    #Your code here


In [None]:
assert 0.56 <SWN_lexicon["good"] < 0.57
assert -0.60 <SWN_lexicon["bad"] < -0.59
print("Success!")

Success!


#### 2.2
rubric={accuracy:2,efficiency:1}

Given the seeds provided below, build a sentiment lexicon from the word2vec vectors included in NLTK using the Semantic Axis method discussed in [25.4.1 of the J&M textbook](https://web.stanford.edu/~jurafsky/slp3/25.pdf). That is, you should calculate the centroid of the embeddings of the positive and negative seeds, create an axis by taking their difference, and then compute the cosine similarity of each word with that axis.

In [None]:
#provided code
positive_seeds = ["good", "excellent", "nice", "happy","exciting"]
negative_seeds = ["bad","poor","terrible","dull","unhappy"]

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False) # 300D

```
# use `np.sum`:
pos_vec = word2vec_model["good"] + word2vec_model["excellent"] +  ...
neg_vec = word2vec_model["bad"] + word2vec_model["poor"] +  ...

axis_vec = pos_vec - neg_vec

similarities = 1 - word2vec_model.distances(axis_vec)
...

```
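The same computation can be checked on a toy embedding table with plain NumPy. The 2-D vectors and the function name `semantic_axis_scores` below are invented for illustration; on the real model, the vectorized `distances` call shown above does the per-word work in one step.

```python
import numpy as np

def semantic_axis_scores(embeddings, pos_seeds, neg_seeds):
    """Cosine similarity of every word with the axis (positive centroid - negative centroid)."""
    pos_centroid = np.mean([embeddings[w] for w in pos_seeds], axis=0)
    neg_centroid = np.mean([embeddings[w] for w in neg_seeds], axis=0)
    axis = pos_centroid - neg_centroid
    axis_norm = np.linalg.norm(axis)
    scores = {}
    for word, vec in embeddings.items():
        scores[word] = float(np.dot(vec, axis) / (np.linalg.norm(vec) * axis_norm))
    return scores

# Invented 2-D vectors: the first coordinate loosely encodes sentiment.
toy_embeddings = {
    "good": np.array([1.0, 0.1]),
    "great": np.array([0.9, 0.2]),
    "bad": np.array([-1.0, 0.1]),
    "awful": np.array([-0.9, 0.2]),
    "nice": np.array([0.8, 0.3]),
    "table": np.array([0.0, 1.0]),
}

toy_axis_scores = semantic_axis_scores(toy_embeddings, ["good", "great"], ["bad", "awful"])
print(toy_axis_scores)
```

Because cosine similarity ignores vector length, summing the seed vectors (as in the snippet above) and averaging them (as here) give the same axis direction when both seed sets have the same size.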

In [1]:
axis_lexicon = {}

#Your code here
#Your code here

# after getting pos_vec and neg_vec, obtain axis_vec
# then, obtain similarities

# `awesome` where i = 22758 in word2vec_model.index_to_key[i] 
# print(similarities[22758]) --> 0.3929876
# print(word2vec_model.index_to_key[22758]) --> `awesome`
# axis_lexicon["awesome"] = 0.3929876

In [None]:
axis_lexicon["awesome"]

0.3929876

In [None]:
assert 0.39 <axis_lexicon["awesome"] < 0.40
assert -0.24 > axis_lexicon["bitter"] > -0.25
print("Success!")

Success!


#### 2.3 Optional
rubric={accuracy:1}

Now we will build a lexicon using a variation of the random walk/label propagation method described in [J&M 25.4.2, p.8-9](https://web.stanford.edu/~jurafsky/slp3/25.pdf). The algorithm works as follows:

1. Create a transition matrix (between nodes in the graph) by normalizing the cosine similarities of the word2vec vectors into a probability distribution. Only the closest 50 neighbors are preserved, so we make use of a sparse matrix. Code that does this is provided to you; it will take a little while to run.
2. For each of the polarities, do a random walk by iteratively multiplying the transition matrix by a vector corresponding to the probability of visiting each node from the seeds. That vector begins with its probability mass evenly distributed across the seeds; everything else is zero.
3. After each step of the "walk", some of the probability mass that has been distributed needs to be given back to the seeds. This is done by setting the probability of each seed to its original value (1/number of seeds) and then renormalizing to create a probability distribution.
4. This should be done 200 times for each of the positive and negative seeds.
5. A final score for each word is derived by subtracting its negative score from its positive score.
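The loop in steps 2 to 4 can be tried out on a tiny dense graph before running it on the full sparse matrix. The 4-node matrix and the function name `random_walk_with_restart` are invented for illustration. The sketch uses `@` because the toy matrix is a dense NumPy array; for SciPy sparse matrices, `*` also denotes matrix multiplication.

```python
import numpy as np

def random_walk_with_restart(transitions, seeds, iterations=200):
    """Repeatedly propagate probability mass, resetting the seed entries after each step."""
    output = np.zeros(transitions.shape[0])
    output[seeds] = 1 / len(seeds)          # all mass starts on the seeds
    for _ in range(iterations):
        output = transitions @ output       # one step of the walk
        output[seeds] = 1 / len(seeds)      # give mass back to the seeds
        output = output / output.sum()      # renormalize to a distribution
    return output

# Invented graph: nodes 0-1 are close, nodes 2-3 are close, with weak cross links.
toy_transitions = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
])

toy_pos = random_walk_with_restart(toy_transitions, [0])  # node 0 plays the positive seed
toy_neg = random_walk_with_restart(toy_transitions, [2])  # node 2 plays the negative seed
toy_diff = toy_pos - toy_neg
print(toy_diff)
```

Node 1, the close neighbor of the positive seed, ends up with a positive score, and node 3, the close neighbor of the negative seed, with a negative one.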

In [None]:
#provided code
#this code builds the transition matrix. Do not modify! In addition to the transitions, you will
#want to use the word2index and index2word dictionaries

# NOTE: this takes ~3 min to run

k= 50

word2index = {}

for i, word in enumerate(word2vec_model.index_to_key):
    word2index[word] = i # obtain indices of each word in the model

transitions = lil_matrix((len(word2index),len(word2index))) # create a transition matrix

for i,word in enumerate(word2vec_model.index_to_key):
    if i % 10000 ==0:
        print(i)
    nearest_neighbors = word2vec_model.most_similar(word,topn=k) # grab the k nearest neighbours
    for neighbor,similarity in nearest_neighbors:
        transitions[i,word2index[neighbor]] = similarity # create an edge weighted by the similarity

transitions = csr_matrix(transitions.multiply(1/np.sum(transitions,axis=1))) 
# normalize similarities to create probabilities

0
10000
20000
30000
40000


In [None]:
# your `transitions` is <43981x43981 sparse matrix of type '<class 'numpy.float64'>'
#   43981 = len(word2vec_model.index_to_key)


In [None]:
def random_walk(transitions, seeds,iterations=200):
    
    '''carry out a random walk on the graph defined by transitions, starting with probability mass
    distributed evenly over the words in seeds, and redistributed back to seeds in each iteration'''
    #Your code here
    # 1. create `output` with  `np.zeros`, and output.size = len(word2vec_model.index_to_key)
    # 2. initialize output for each seed word: output['seed_word'] = 1/len(seeds)
    # 3. iteration 1..200:
    #   output = transitions*output
    #   iterate for all seeds
    #       again, output['seed_word'] = 1/len(seeds)
    #   normalize output
    # 4. return output. 

    #Your code here

pos_seeds = [word2index[word] for word in positive_seeds]
neg_seeds = [word2index[word] for word in negative_seeds]

pos_values = random_walk(transitions,pos_seeds)
neg_values = random_walk(transitions,neg_seeds)

random_walk_lexicon = {}
diff = pos_values - neg_values
for i in range(len(pos_values)):
    random_walk_lexicon[word2vec_model.index_to_key[i]] = diff[i]

#### 2.4 
rubric={accuracy:1,quality:1}

Finally, you will carry out an intrinsic evaluation of these lexicons, by comparing them to **a gold standard lexicon, the SO-CAL**. You first need to **create a test set which is the intersection of the keys of the SO-CAL, SWN, and Axis lexicons**. Then, for each of the induced lexicons, calculate the percentage of the test words that have the same polarity in the SO-CAL (you should ignore intensity for this). 
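The comparison boils down to checking whether two scores have the same sign. The sketch below uses invented toy lexicons and a hypothetical name, `polarity_agreement`. Treating a score of exactly zero as non-positive is an assumption.

```python
def polarity_agreement(test_words, proposed, gold):
    """Fraction of test words whose sign in `proposed` matches their sign in `gold`."""
    matches = sum(1 for w in test_words if (proposed[w] > 0) == (gold[w] > 0))
    return matches / len(test_words)

# Invented toy lexicons; only the signs matter for this evaluation.
toy_gold = {"good": 3.0, "bad": -3.0, "dull": -2.0, "nice": 2.0}
toy_induced_a = {"good": 0.4, "bad": -0.2, "dull": 0.1, "nice": 0.3, "table": 0.0}
toy_induced_b = {"good": 0.5, "bad": -0.6, "dull": -0.1}

# The test set is the intersection of all three key sets.
toy_test_set = set(toy_gold) & set(toy_induced_a) & set(toy_induced_b)

agreement_a = polarity_agreement(toy_test_set, toy_induced_a, toy_gold)
agreement_b = polarity_agreement(toy_test_set, toy_induced_b, toy_gold)
print(agreement_a, agreement_b)
```

Evaluating every induced lexicon on the same intersection keeps the percentages comparable.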

In [None]:
def intrinsic_evaluate(test_set,proposed_lexicon,gold_lexicon):
    '''calculate and return percentage of words in test_set which have same polarity in 
    proposed_lexicon and gold_lexicon'''
    #Your code here
    
    #Your code here
test_set = set(SO_CAL_lexicon) & set(SWN_lexicon) & set(axis_lexicon)
print("SWN")
print(intrinsic_evaluate(test_set,SWN_lexicon,SO_CAL_lexicon)) 
print("Axis lexicon")
print(intrinsic_evaluate(test_set,axis_lexicon,SO_CAL_lexicon))
# print("Random Walk")
# print(intrinsic_evaluate(test_set,random_walk_lexicon,SO_CAL_lexicon)) # 

SWN
0.6712057439960386
Axis lexicon
0.8450111413716267
Random Walk
0.8625897499381034


### Exercise 3: Valence shifters

SO-CAL also has a dictionary of intensifiers (`int_dictionary`). The code below will load them into a dictionary where (true) intensifiers like "very" will have values greater than one and downplayers like "sorta" will have values less than one; you can multiply these modifiers by the SO scores to get the intended valence shifting effect.

The raw file stores each modifier as an offset, so the code adds 1 to turn it into a multiplier that is greater than one for intensifiers and less than one for downplayers:
```
very	0.2    -->   0.2 + 1 = 1.2 > 1
...
sorta	-0.3   -->  -0.3 + 1 = 0.7 < 1
```
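A small numeric sketch of the offset-to-multiplier step. The function name `offsets_to_multipliers` is hypothetical. The offsets for `very` and `sorta` are the ones quoted above; the offset for `less` is inferred from the printed multiplier of -0.5 in the output in this section; the SO of 3.0 for "good" is a toy value.

```python
def offsets_to_multipliers(offsets):
    """Turn raw intensifier offsets into multipliers by adding 1."""
    return {word: offset + 1 for word, offset in offsets.items()}

raw_offsets = {"very": 0.2, "sorta": -0.3, "less": -1.5}
multipliers = offsets_to_multipliers(raw_offsets)

toy_so = 3.0  # toy SO for a word such as "good"
for modifier, factor in multipliers.items():
    print(modifier, "good ->", toy_so * factor)
```

A multiplier below zero, as for `less`, flips the polarity of the following word rather than just weakening it.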


In [49]:
#provided code
SO_CAL_int = {}

location = "https://raw.githubusercontent.com/sfu-discourse-lab/SO-CAL/master/Resources/dictionaries/English/"
f = urllib.request.urlopen(location + "int_dictionary1.11.txt")
for line in f:
    line = line.decode("latin-1")
    if line.strip() \
        and "_" not in line: # skip multi-word entries, e.g. the_least	-3
        word, intensity = line.strip().split()
        SO_CAL_int[word] = float(intensity) + 1
        
print(SO_CAL_int)

{'less': -0.5, 'barely': -0.5, 'hardly': -0.5, 'almost': -0.5, 'only': 0.5, 'slightly': 0.5, 'marginally': 0.5, 'relatively': 0.7, 'mildly': 0.7, 'moderately': 0.7, 'somewhat': 0.7, 'partially': 0.7, 'arguably': 0.8, 'mostly': 0.8, 'mainly': 0.8, 'sorta': 0.7, 'kinda': 0.7, 'fairly': 0.8, 'pretty': 0.9, 'rather': 0.9, 'immediately': 1.1, 'quite': 1.1, 'perfectly': 1.1, 'consistently': 1.1, 'really': 1.2, 'clearly': 1.2, 'obviously': 1.2, 'certainly': 1.2, 'completely': 1.2, 'definitely': 1.2, 'absolutely': 1.2, 'constantly': 1.2, 'highly': 1.2, 'very': 1.2, 'significantly': 1.2, 'noticeably': 1.2, 'distinctively': 1.2, 'frequently': 1.2, 'awfully': 1.2, 'totally': 1.2, 'largely': 1.2, 'fully': 1.2, 'extra': 1.3, 'truly': 1.3, 'especially': 1.3, 'particularly': 1.3, 'damn': 1.3, 'intensively': 1.3, 'downright': 1.3, 'entirely': 1.3, 'strongly': 1.3, 'remarkably': 1.3, 'majorly': 1.3, 'amazingly': 1.3, 'strikingly': 1.3, 'stunningly': 1.3, 'quintessentially': 1.3, 'unusually': 1.3, 'dram

In [50]:
print("very", SO_CAL_int["very"])
print("sorta", SO_CAL_int["sorta"])

very 1.2
sorta 0.7


#### 3.1
rubric={accuracy:3,quality:1}

Add simple valence shifting to your SO calculation code from 1.6. For each word, look at the previous word. If it is an intensifier/downplayer, scale the SO value of the word according to the value in the intensifier dictionary, by multiplying. If it is the word "not" or "no", shift the SO by subtracting 4 from positive words, and adding 4 to negative words. You should implement this in a way such that it is possible to independently test the effects of intensification and negation, as well as do both together. You can make changes to your code in 1.6 for the purposes of avoiding major duplication (though make sure you don't break anything!), but the code that carries out the valence shifting should be in the code block below, even if some duplication of code is required.

-  For each word, look at the previous word. If it is an intensifier/downplayer, scale up the SO value of the word according to the value in the intensifier dictionary:  `SO *= SO_CAL_int[prev_word]`
- If it is the word "not" or "no", shift the SO by subtracting 4 from positive words, and adding 4 to negative words: `SO -= 4` or `SO += 4`
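The two rules above can be sketched on toy data as follows. The function name `shifted_so` and the toy lexicon are invented for illustration. Passing `None` for `ints` or `negs` switches that shifter off, which is one way to test the two effects independently.

```python
def shifted_so(tokens, lexicon, ints=None, negs=None):
    """Sum word SO values, letting the previous word intensify or negate each one."""
    total = 0.0
    for i, token in enumerate(tokens):
        so = lexicon.get(token.lower(), 0)
        if so == 0:
            continue  # not an SO-bearing word
        prev = tokens[i - 1].lower() if i > 0 else None
        if ints is not None and prev in ints:
            so *= ints[prev]              # intensifier or downplayer
        if negs is not None and prev in negs:
            so += -4 if so > 0 else 4     # shift toward the opposite polarity
        total += so
    return total

toy_lexicon = {"good": 3.0, "bad": -3.0}
toy_ints = {"very": 1.2}
toy_negs = {"no", "not"}

print(shifted_so(["not", "good"], toy_lexicon, negs=toy_negs))
print(shifted_so(["very", "good"], toy_lexicon, ints=toy_ints))
print(shifted_so(["very", "good"], toy_lexicon))
```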

In [None]:
negatives = {"no","not"}

def calculate_SO(text,lexicon,ints=None, negs=None,verbose=False):
    '''calculate a semantic orientation for a text as the sum of the semantic orientation
    of the words of the text as provided by lexicon, modified by the effect of intensifiers
    if ints is provided, and negators if negs is provided. Returns the SO unless verbose is True,
    in which case it returns both the SO and a list of tuples corresponding to each SO-bearing
    word and the SO calculated for that word'''
    #Your code here
    #Your code here


#### 3.2

rubric={accuracy:1}

Using the SO-CAL lexicon, evaluate the effect of intensification, negation, and the combination of the two valence shifters together, using the two corpora we used in Exercise 1. (You should see a consistent, though small, improvement)

```
    print("no valence shifters")
    evaluate_lexicon(...)
    print("intensification")
    evaluate_lexicon(...)
    print("negation")
    evaluate_lexicon(...)
    print("both intensification and negation")
    evaluate_lexicon(...)
```

In [None]:
lexicon = SO_CAL_lexicon

for ev_name,corpus in [("Sentence Polarity",sentence_polarity),
                        ("Pros and Cons",pros_cons)]:
    print("******")
    #Your code here
    #Your code here


******
Sentence Polarity
no valence shifters
-----
positive correct
0.7454511348715063
negative correct
0.4558244231851435
total
0.6006377790283249
-------------------------
intensification
-----
positive correct
0.7492027762145939
negative correct
0.46782967548302384
total
0.6085162258488088
-------------------------
negation
-----
positive correct
0.7447008066028887
negative correct
0.45920090039392236
total
0.6019508534984056
-------------------------
both intensification and negation
-----
positive correct
0.7480772838116676
negative correct
0.4710185706246483
total
0.609547927218158
-------------------------
******
Pros and Cons
no valence shifters
-----
positive correct
0.8216652136006974
negative correct
0.4293873991715718
total
0.6255476839237057
-------------------------
intensification
-----
positive correct
0.8256756756756757
negative correct
0.43514279485502505
total
0.6304305177111716
-------------------------
negation
-----
positive correct
0.8231909328683522
negative cor

### Exercise 4: Error Analysis

#### 4.1

rubric={accuracy:2}

Adapt your earlier code so that it optionally keeps track of information that will help you to do an error analysis. You can add this to the earlier cell, no need to duplicate your code here. In particular, you will want to remember words which were found in the lexicon for that sentence, and their corresponding SO *after* valence shifting.

Then, with this feature turned on, (1) **iterate over all the *negative* sentences** in one of the two corpora we have been using in this lab (you can choose which), and (2) **print out sentences which are not labeled as negative** by your system, along with the words in the lexicon (and their SO), as mentioned above. That is, for one incorrect sentence, your output should look something like this.

- iterate over `sentence_polarity.sents(categories="neg")`
- print out each sentence that is not labeled as negative, i.e. whose SO from `calculate_SO(sent,lexicon,ints=ints,negs=negs,verbose=True)` is `>= 0`
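The two steps above can be sketched on toy data. The names `so_details` and `non_negative_errors` are hypothetical, the toy lexicon is invented, and valence shifting is left out to keep the sketch short.

```python
def so_details(tokens, lexicon):
    """Return the total SO plus the (word, SO) pairs that contributed to it."""
    hits = [(tok, lexicon[tok.lower()]) for tok in tokens if tok.lower() in lexicon]
    return sum(so for _, so in hits), hits

def non_negative_errors(negative_sentences, lexicon):
    """Collect negative-class sentences whose SO is >= 0, i.e. not labeled negative."""
    errors = []
    for sent in negative_sentences:
        total, hits = so_details(sent, lexicon)
        if total >= 0:
            errors.append((sent, hits, total))
    return errors

toy_lexicon = {"sentimental": 2.0, "mess": -2.0, "dull": -2.0}
toy_negative_sentences = [
    ["a", "sentimental", "mess", "that", "never", "rings", "true", "."],
    ["a", "dull", "film", "."],
]

for sent, hits, total in non_negative_errors(toy_negative_sentences, toy_lexicon):
    print(sent)
    print(hits)
    print(total)
```

Printing the contributing words next to each sentence makes it easier to tell whether an error comes from a missing word, an unhandled negator, or something more structural.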

In [None]:
def sentence_polarity_error_analysis(lexicon,ints=None,negs=None):
    '''iterate over negative sentences in sentence polarity corpus and print out those which
    are incorrect according to SO calculated using lexicon, along with the SO calculation details'''

    #Your code here
    #Your code here       
            
sentence_polarity_error_analysis(SO_CAL_lexicon,ints=SO_CAL_int,negs=negatives)

["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.']
[('juvenile', -2.0), ('funny', 3.0)]
1.0
['a', 'visually', 'flashy', 'but', 'narratively', 'opaque', 'and', 'emotionally', 'vapid', 'exercise', 'in', 'style', 'and', 'mystification', '.']
[('flashy', 1.0), ('emotionally', -1.0)]
0.0
['the', 'story', 'is', 'also', 'as', 'unoriginal', 'as', 'they', 'come', ',', 'already', 'having', 'been', 'recycled', 'more', 'times', 'than', "i'd", 'care', 'to', 'count', '.']
[('unoriginal', -2.0), ('care', 2.0)]
0.0
['a', 'sentimental', 'mess', 'that', 'never', 'rings', 'true', '.']
[('sentimental', 2.0), ('mess', -2.0)]
0.0
['while', 'the', 'performances', 'are', 'often', 'engaging', ',', 'this', 'loose', 'collection', 'of', 'largely', 'improvised', 'numbers', 'would', 'probably', 'have', 'worked', 'better', 'as', 'a', 'one-hour', 'tv', 'documentary', '.']
[('engaging', 4.0), ('loose', -1.0), ('better', 2.0)]
5.0
['interesting'

[('wacky', 1.0), ('disloyal', -1.0)]
0.0
['the', 'whole', 'affair', ',', 'true', 'story', 'or', 'not', ',', 'feels', 'incredibly', 'hokey', '.', '.', '.', '[it]', 'comes', 'off', 'like', 'a', 'hallmark', 'commercial', '.']
[('incredibly', 5.0), ('like', 1.0), ('commercial', -1.0)]
5.0
["what's", 'missing', 'is', 'what', 'we', 'call', 'the', "'wow'", 'factor', '.']
[]
0
['.', '.', '.', 'tara', 'reid', 'plays', 'a', 'college', 'journalist', ',', 'but', 'she', 'looks', 'like', 'the', 'six-time', 'winner', 'of', 'the', 'miss', 'hawaiian', 'tropic', 'pageant', ',', 'so', 'i', "don't", 'know', 'what', "she's", 'doing', 'in', 'here', '.', '.', '.']
[('like', 1.0)]
1.0
['normally', ',', "rohmer's", 'talky', 'films', 'fascinate', 'me', ',', 'but', 'when', 'he', 'moves', 'his', 'setting', 'to', 'the', 'past', ',', 'and', 'relies', 'on', 'a', 'historical', 'text', ',', 'he', 'loses', 'the', 'richness', 'of', 'characterization', 'that', 'makes', 'his', 'films', 'so', 'memorable', '.']
[('normally'

2.0
['the', 'latest', 'installment', 'in', 'the', 'pokemon', 'canon', ',', 'pokemon', '4ever', 'is', 'surprising', 'less', 'moldy', 'and', 'trite', 'than', 'the', 'last', 'two', ',', 'likely', 'because', 'much', 'of', 'the', 'japanese', 'anime', 'is', 'set', 'in', 'a', 'scenic', 'forest', 'where', 'pokemon', 'graze', 'in', 'peace', '.']
[('latest', 1.0), ('moldy', 1.5), ('trite', -3.0), ('last', 1.0), ('scenic', 4.0), ('peace', 1.0)]
5.5
['the', 'backyard', 'battles', 'you', 'staged', 'with', 'your', 'green', 'plastic', 'army', 'men', 'were', 'more', 'exciting', 'and', 'almost', 'certainly', 'made', 'more', 'sense', '.']
[('green', 1.0), ('exciting', 2.5)]
3.5
['what', 'is', 'captured', 'during', 'the', 'conceptual', 'process', "doesn't", 'add', 'up', 'to', 'a', 'sufficient', 'explanation', 'of', 'what', 'the', 'final', 'dance', 'work', ',', 'the', 'selection', ',', 'became', 'in', 'its', 'final', 'form', '.']
[('sufficient', 1.0)]
1.0
['had', 'the', 'film', 'boasted', 'a', 'clearer', 

[('enjoy', 3.0), ('waste', -2.0)]
1.0
['the', 'film', 'is', 'so', 'packed', 'with', 'subplots', 'involving', 'the', 'various', 'silbersteins', 'that', 'it', 'feels', 'more', 'like', 'the', 'pilot', 'episode', 'of', 'a', 'tv', 'series', 'than', 'a', 'feature', 'film', '.']
[('like', 0.5), ('feature', 2.0)]
2.5
['despite', 'all', 'the', 'closed-door', 'hanky-panky', ',', 'the', 'film', 'is', 'essentially', 'juiceless', '.']
[]
0
['it', 'is', 'parochial', ',', 'accessible', 'to', 'a', 'chosen', 'few', ',', 'standoffish', 'to', 'everyone', 'else', ',', 'and', 'smugly', 'suggests', 'a', 'superior', 'moral', 'tone', 'is', 'more', 'important', 'than', 'filmmaking', 'skill']
[('accessible', 3.0), ('smugly', -2.0), ('superior', 4.0), ('moral', 2.0), ('important', 1.5), ('skill', 2.0)]
10.5
["it's", 'lost', 'the', 'politics', 'and', 'the', 'social', 'observation', 'and', 'become', 'just', 'another', 'situation', 'romance', 'about', 'a', 'couple', 'of', 'saps', 'stuck', 'in', 'an', 'inarticulate'

2.0
['it', 'sounds', 'like', 'another', 'clever', 'if', 'pointless', 'excursion', 'into', 'the', 'abyss', ',', 'and', "that's", 'more', 'or', 'less', 'how', 'it', 'plays', 'out', '.']
[('like', 1.0), ('clever', 3.0), ('pointless', -4.0)]
0.0
['report', 'card', ':', "doesn't", 'live', 'up', 'to', 'the', 'exalted', 'tagline', '-', "there's", 'definite', 'room', 'for', 'improvement', '.', "doesn't", 'deserve', 'a', 'passing', 'grade', '(', 'even', 'on', 'a', 'curve', ')', '.']
[('improvement', 2.0), ('deserve', 2.0)]
4.0
['.', '.', '.', 'if', 'it', 'had', 'been', 'only', 'half-an-hour', 'long', 'or', 'a', 'tv', 'special', ',', 'the', 'humor', 'would', 'have', 'been', 'fast', 'and', 'furious--', 'at', 'ninety', 'minutes', ',', 'it', 'drags', '.']
[('long', -1.0), ('special', 3.0), ('humor', 2.0)]
4.0
['bean', 'drops', 'the', 'ball', 'too', 'many', 'times', '.', '.', '.', 'hoping', 'the', 'nifty', 'premise', 'will', 'create', 'enough', 'interest', 'to', 'make', 'up', 'for', 'an', 'unfocused

#### 4.2
rubric={reasoning:3}

Identify 3 categories of errors that you see. They should NOT be issues with specific dictionary entries, but something more general. For each of the three types of errors, include 3 examples of the error (the sentence, with some indicator of the relevant word(s)) 

Answer:

