In [210]:
import csv as csv
import nltk as nltk
import codecs
import numpy as np
import matplotlib.pyplot as plt

## Project - Sentiment Analysis

The goal of the project is to associate text with a **sentiment class**; here, we will use three of them: *negative*, *neutral*, and *positive*. While sentiment analysis is often seen as an 'easy' task, this project is challenging, since it implies working with noisy data, in very little quantity: samples are *tweets*, and there is only 500 of them.

neg 0, neu 2, pos 4


In [211]:
tweetDataFile = './fp_data/testdata.manual.2009.06.14.csv'

In [212]:
# Get tweets data from the file
dataFile = open(tweetDataFile, 'rt')  
dataReader = csv.reader(dataFile, delimiter=',')
data = []
# Each tweet is a list (sentiment + text) 
for row in dataReader:
    data.append([row[0], row[5]])

In [213]:
print(data[21
          ])

['4', 'lebron and zydrunas are such an awesome duo']


### First approach: Rule-based 

- You will use SentiWordNet, containing sentiment scores (positive/negative) associated to each synset.
- You also have access to a dictionary of slang and abreviations. 
- Goal: obtain a sentiment values for each tweet in the data without any learning - only rules !

In [214]:
dicoSlangFile = './fp_data/SlangLookupTable.txt'
dicoSentiWordnetFile = './fp_data/SentiWordNet_3.0.0_20130122.txt'

In [215]:
# Get the slang dictionnary from the file 
file = codecs.open(dicoSlangFile, 'r', encoding="utf-8", errors='ignore')
slang_dict = {}
for line in file:
    line=line.strip()
    [key,value] = line.split("\t")
    slang_dict[key] = value

In [216]:
print(slang_dict['u'])

you


In [217]:
print(slang_dict.keys())

dict_keys(['121', 'a/s/l', 'adn', 'afaik', 'afk', 'aight', 'alol', 'b4', 'b4n', 'bak', 'bf', 'bff', 'bfn', 'bg', 'bta', 'btw', 'cid', 'cnp', 'cp', 'cu', 'cul', 'cul8r', 'cya', 'cyo', 'dbau', 'fud', 'fwiw', 'fyi', 'g', 'g2g', 'ga', 'gal', 'gf', 'gfn', 'gmbo', 'gmta', 'h8', 'hagn', 'hdop', 'hhis', 'iac', 'ianal', 'ic', 'idk', 'imao', 'imnsho', 'imo', 'iow', 'ipn', 'irl', 'jk', 'l8r', 'ld', 'ldr', 'llta', 'lmao', 'lmirl', 'lol', 'ltr', 'lulab', 'lulas', 'luv', 'm/f', 'm8', 'milf', 'oll', 'omg', 'otoh', 'pir', 'ppl', 'r', 'rofl', 'rpg', 'ru', 'shid', 'somy', 'sot', 'thanx', 'thx', 'ttyl', 'u', 'ur', 'uw', 'wb', 'wfm', 'wibni', 'wtf', 'wtg', 'wtgp', 'ym'])


In [218]:
from nltk.corpus import wordnet as wn
from nltk.corpus.reader import sentiwordnet as swn
nltk.download('wordnet')
nltk.download('omw-1.4')


# Get the dictionnary from the file
swn_dict = swn.SentiWordNetCorpusReader('', [dicoSentiWordnetFile])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\juand\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\juand\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


- Some advice for this first approach:
    - Be careful to specific symbols (@, #) and what they mean in this context.
    - Your first step will be to look for scores for each word in a tweet. Look into the ```nltk``` documentation to find details on how it works - info given by SentiWordNet looks like this:`

In [219]:
print(list(swn_dict.senti_synsets('you')))

[]


In [220]:
breakdown_scores = swn_dict.senti_synset('evil.a.01')

In [221]:
breakdown_scores.pos_score()

0.0

   - You get a list of synsets with their associated sentiment. Which one to choose ? Let us not care about sense and take the first appropriate. You still need to match for the Part of Speech tag - is it a noun, a verb ? Use the right function ```nltk``` to tag the words in the tweet.
   - Afterward, work towards more elaborated rules. How should a negation affect the sentiment of the next word ? You have access to other dictionaries in the files - use them for better rules ! 
   - The value associate to each tweet indicates its sentiment (0 = negative, 2 = neutral, 4 = positive). You can compute appropriate metrics (or use tools from scikit-learn to do that) to check how well your rules work.
   - Next step: using a learning method. 

### The actual work
Rule #1:
- we use simple tokenization
- we add up sentiment scores of each word, filtering synsets by POS tagging (we take the first appropriate as in the instructions)
- we try to learn a threshold

Rule #2 ideas:
- elongated words
- symbols
- sentence tokenizers, negations
- !, ?
- choose between synsets
- boosting words



#### Utility functions:

In [222]:
tag_to_category = {
    'NN' : 'n',
    'NNP' : 'n',
    'NNPS' : 'n',
    'NNS' : 'n',
    'VB' : 'n',
    'VBG' : 'v',
    'VBN' : 'v',
    'VBP': 'v',
    'VBZ': 'v',
    'JJ': 'a',
    'JJR': 'a',
    'JJS': 'a',
    'RB': 'r',
    'RBR': 'r',
    'RBS': 'r'
}

In [223]:
labels, tweets = zip(*data)
from nltk import word_tokenize, pos_tag
tweets = [t.lower() for t in tweets]

In [224]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\juand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\juand\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [225]:
tweets

['@stellargirl i loooooooovvvvvveee my kindle2. not that the dx is cool, but the 2 is fantastic in its own right.',
 'reading my kindle2...  love it... lee childs is good read.',
 'ok, first assesment of the #kindle2 ...it fucking rocks!!!',
 "@kenburbary you'll love your kindle2. i've had mine for a few months and never looked back. the new big one is huge! no need for remorse! :)",
 "@mikefish  fair enough. but i have the kindle2 and i think it's perfect  :)",
 "@richardebaker no. it is too big. i'm quite happy with the kindle2.",
 'fuck this economy. i hate aig and their non loan given asses.',
 'jquery is my new best friend.',
 'loves twitter',
 'how can you not love obama? he makes jokes about himself.',
 "check this video out -- president obama at the white house correspondents' dinner http://bit.ly/imxum",
 "@karoli i firmly believe that obama/pelosi have zero desire to be civil.  it's a charade and a slogan, but they want to destroy conservatism",
 'house correspondents dinner 

We use the tag from the POS tagger and map it to one of the four categories `(a, n, v, r)` recognised by `swn`.

In [226]:
tagged_tokenized_tweets = [pos_tag(word_tokenize(tweet)) for tweet in tweets]
tags = []
test_sc = []
for tweet in tagged_tokenized_tweets:
    tweet_tags = []
    for word in tweet:
        try:
            tweet_tags.append(tag_to_category[word[1]])
        except:
            tweet_tags.append('NULL')
    tags.append(tweet_tags)
    
    

We test with the first tweet our scoring approach. We take the first available synset that matches the word and the tag selected. 

In [227]:
accepted_tags = 'anvr'
test_sc = []
for i in range(len(tagged_tokenized_tweets[0])):
    if tags[0][i] in 'anvr':
        try:
            breakdown_scores = swn_dict.senti_synset(f'{tagged_tokenized_tweets[0][i][0]}.{tags[0][i]}.01')
            print(f'{tagged_tokenized_tweets[0][i][0]}.{tags[0][i]}.01')
            test_sc.append(breakdown_scores.pos_score() - breakdown_scores.neg_score())
        except:
            pass

i.n.01
not.r.01
cool.a.01
fantastic.a.01
own.a.01
right.n.01


That 'not' is going to be a problem later.

In [228]:
test_sc

[0.0, -0.625, 0.125, 0.375, 0.0, 0.0]

Based on the POS outputs, we select the (most) appropriate synset, and add up:

In [229]:
word_scores = []
scores = []
for ii in range(len(tagged_tokenized_tweets)):
    sc = []
    for jj in range(len(tagged_tokenized_tweets[ii])):
        if tags[ii][jj] in accepted_tags:
            try:
                breakdown_scores = swn_dict.senti_synset(f'{tagged_tokenized_tweets[ii][jj][0]}.{tags[ii][jj]}.01')
                sc.append(breakdown_scores.pos_score() - breakdown_scores.neg_score())
            except:
                pass
    word_scores.append(sc)
    scores.append(np.mean(sc))
scores = np.array(scores) 

In [230]:
# handle nans to 0
nans = np.isnan(scores)
scores[nans] = 0

In [231]:
labels = [int(x) for x in labels]
labels = np.array(labels)

We then use `hyperopt` to try to adjust a threshold that gives us the best scores. Given that we have three categories (positive, neutral, negative), we need to search for two thresholds $threshold_+ \gt threshold_-$, given that all our scores are in the range (-1, 1), and try to maximise f1 score.

In [232]:
from hyperopt import hp, tpe, fmin, STATUS_OK
import copy
from sklearn.metrics import f1_score
space = [
    hp.uniform('pos_threshold', -0.1, 0.1), # a first trial was done with -1,1 and gave values in this window.
                                            #We try to close up to the optimum.
    hp.uniform('neg_threshold', -0.1, 0.1),
    scores,
    labels
]


In [233]:
def score_loss(args):
    pos_threshold, neg_threshold, pred_labels, labels = args
    lab = pred_labels.copy()
    lab[lab >= pos_threshold] = 4
    lab[(lab < pos_threshold) & (lab > neg_threshold)] = 2
    lab[lab <= neg_threshold] = 0
    score = f1_score(labels, lab, average='weighted')
    return {
        'loss' : -score,
        'status' : STATUS_OK
    }

best = fmin(score_loss,
    space=space, 
    algo=tpe.suggest,
    max_evals=1000,
           )
    
    
    


100%|██████████████████████████████████████████| 1000/1000 [00:09<00:00, 101.06trial/s, best loss: -0.5505420993580069]


In [234]:
print(best)


{'neg_threshold': -0.004165212618324908, 'pos_threshold': 0.06088917725467967}


Our simple model returned 0.55 f1-score after searching for the best thresholds. We try to refine the rule.

### Rule #2 - Incorporating more complex rules

- Sentence tokenising and negation handling
 - We use a sentence tokeniser to separate the sentences of a tweet. The presence of a negation word inverts the score output of the sentence.
 - The final output of a tweet is the mean of its subsentences.
- Emojis
 - We tagged the emoji dataset as positive (1), neutral (0), and negative (-1). They will be treated as usual words in the rest of the processing, following these rules.
- Slang
 - In preprocessing, slang and abbreviations will be replaced by their formal language variants.


### Preprocessing

In [235]:
from nltk.tokenize import sent_tokenize
#import other datasets
file = codecs.open("./fp_data/EmoticonLookupTable.txt", 'r', encoding="utf-8", errors='ignore')
emoji_dict = {}
for line in file:
    line=line.strip()
    [key,value] = line.split("\t")
    emoji_dict[key] = value

file = open('./fp_data/NegatingWordList.txt')
negating_words = [line.strip() for line in file]

In [236]:
def replace_slang(sentence):
    for abbr in slang_dict.keys():
        if abbr in sentence:
            sentence.remove(abbr)
            sentence += word_tokenize(slang_dict[abbr]) # count-based metrics don't care about the order
    return sentence

def emoji_processor(sentence):
    '''
    returns string, but emojis are replaced by -1, 1, 0. Do before word tokenization.
    '''
    for emoji in emoji_dict.keys():
        if emoji in sentence:
            sentence = sentence.replace(emoji, emoji_dict[emoji])
    return sentence
    
def negate(sentence):
    '''
    returns original sentence with negation words replaced with NEGATE, which when parsed will multiply
    sentence score by -1
    '''
    for n in negating_words:
        if n in sentence:
            sentence[sentence.index(n)] =  'NEGATE'
    return sentence


def preprocess(tweet):
    tweet = sent_tokenize(tweet)
    tweet = [emoji_processor(sentence) for sentence in tweet]
    tweet = [word_tokenize(sentence) for sentence in tweet]
    tweet = [replace_slang(sentence) for sentence in tweet]
    tweet = [negate(sentence) for sentence in tweet]
    return tweet

def score_v2(tweet):
    tweet = preprocess(tweet)
    sentence_scores = []
    for sentence in tweet:
        # we first collect all our (-1,1,0) and 'NEGATE's
        wordwise_score = []
        invert = False
        modifiers = ['-1','1', '0']
        for modifier in modifiers:
            while modifier in sentence:
                sentence.remove(modifier)
                wordwise_score.append(int(modifier))
        while 'NEGATE' in sentence:
            sentence.remove('NEGATE')
            invert = not invert
            
        # we proceed with scoring as usual for the rest of the words: pos tagging
        sentence = pos_tag(sentence)
        tweet_tags = []
        for word in sentence:
            try:
                tweet_tags.append(tag_to_category[word[1]])
            except:
                tweet_tags.append('NULL')
        
        # scoring and tag checking        
        for ii in range(len(sentence)):
            if tweet_tags[ii] in accepted_tags:
                try:
                    s = swn_dict.senti_synset(f'{sentence[ii][0]}.{tweet_tags[ii]}.01')
                    wordwise_score.append(s.pos_score() - s.neg_score())
                except:
                    pass
        sentence_scores.append(np.mean(wordwise_score))   
    
    return np.mean(sentence_scores)


In [237]:
all_scores_v2 = [score_v2(tweet) for tweet in tweets]
all_scores_v2 = np.array(all_scores_v2)
nans = np.isnan(all_scores_v2)
all_scores_v2[nans] = 0

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


In [238]:
all_scores_v2

array([ 0.0625    ,  0.40625   ,  0.        ,  0.29375   ,  0.17708333,
        0.        , -0.16666667,  0.41666667,  0.        ,  0.        ,
       -0.03571429,  0.09375   ,  0.        ,  0.        ,  0.        ,
        0.05555556, -0.08333333,  0.        , -0.16666667,  0.        ,
        0.175     ,  0.20833333, -0.0625    ,  0.        ,  0.        ,
        0.1875    ,  0.025     ,  0.        ,  0.        , -0.18452381,
       -0.2375    ,  0.        ,  0.075     , -0.03472222,  0.        ,
        0.46527778, -0.5       ,  0.        ,  0.        ,  0.08035714,
        0.18229167,  0.        ,  0.        ,  0.02857143,  0.01785714,
        0.        ,  0.        ,  0.        ,  0.1875    , -0.02916667,
        0.        ,  0.28125   ,  0.025     ,  0.        ,  0.09375   ,
        0.        ,  0.625     ,  0.        ,  0.09375   ,  0.        ,
        0.17410714,  0.        ,  0.        ,  0.        ,  0.15      ,
       -0.15625   ,  0.        ,  0.35      ,  0.        ,  0.  

### Optimization and final results

We optimise using the same loss function we used for rule #1. We only modify the search space to incorporate our new `all_scores_v2`.

In [242]:
space_v2 = [
    hp.uniform('pos_threshold', 0, 0.1), # AGAIN, a first trial was done with -1,1 and gave values in this window.
                                            #We try to close up to the optimum.
    hp.uniform('neg_threshold', -0.1, 0),
    all_scores_v2,
    labels
]

In [243]:
best = fmin(score_loss,
    space=space_v2, 
    algo=tpe.suggest,
    max_evals=1000,
           )

100%|███████████████████████████████████████████| 1000/1000 [00:10<00:00, 98.91trial/s, best loss: -0.4800580250450823]


In [244]:
best

{'neg_threshold': -0.005490016736387422, 'pos_threshold': 0.06694629558485246}

In [245]:
tweets[0]

'@stellargirl i loooooooovvvvvveee my kindle2. not that the dx is cool, but the 2 is fantastic in its own right.'

### Observations

Our second rule did not perform better than its simple counterpart, obtaining an f1-score of 0.48 with its most optimised thresholds. It is possible that expanding slang obscured the general meaning of the sentences, or that emojis were given too much weight, or that our simple approach to negation (which in some cases such as `tweets[0]`: "not that the dx is cool" is part of a syntactical formula, not a negation per se) might have affected some scores. As well, it is possible that sentence-wise, and not word-wise averaging might have given extra weight to nuances that could mislead our rules.