## Project 10: Emotion Analysis in Dataset
This project investigates the emotion and sentiment expressed in a publicly available dataset and tests various approaches for identifying the emotion of each utterance. First, collect the emotion dataset from Kaggle, available at https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp. Note that a machine-learning-based approach to classification is also provided there.

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv('train.txt', header=None, names=['text','label'], sep=';')
test = pd.read_csv('test.txt', header=None, names=['text','label'], sep=';')
val = pd.read_csv('val.txt', header=None, names=['text','label'], sep=';')
harv_inquirer = pd.read_excel('http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls')
harv_inquirer['Entry'] = harv_inquirer.Entry.astype(str)

1.	Use the Harvard General Inquirer, available at http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls, to identify the words associated with each of the five categories “sadness”, “anger”, “love”, “surprise”, “joy”. Record the resulting word lists in a separate database D, which will be part of the deliverables.

Note: A description of the Harvard General Inquirer columns is available at http://www.wjh.harvard.edu/~inquirer/homecat.htm

In [10]:
from text_processing import *

The Harvard General Inquirer has no categories that map directly onto the emotion labels.
The include / exclude lists below can be adjusted to change which Inquirer categories contribute to each label.

In [3]:
category_dict = {
    'surprise':{
        'include': ['Arousal'],
        'exclude': []
        },
    'joy':{
        'include': ['Positiv'],
        'exclude': ['Affil']
        },
    'love':{
        'include': ['Affil'],
        'exclude': ['Negativ']
        },
    'anger':{
        'include': ['Hostile'],
        'exclude': []  
        },
    'sadness':{
        'include': ['Negativ'],
        'exclude': ['Hostile']
        },
    'fear':{
        'include': ['Weak'],
        'exclude': []
    }
}

In [4]:
match_targets = get_all_cat(category_dict, harv_inquirer)

In [5]:
match_targets

Unnamed: 0,entry,label
0,abhor,surprise
1,acrimonious,surprise
2,acrimony,surprise
3,adamant,surprise
4,affection,surprise
...,...,...
4735,worrier,fear
4736,worry,fear
4737,worsen,fear
4738,wound,fear
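
`get_all_cat` lives in the local `text_processing` module and is not shown in this notebook. The include / exclude filtering it is driven by could look roughly like the sketch below. This is a sketch under assumptions, not the actual implementation: the lexicon is modelled as a plain dict from word to its set of category tags, an entry qualifies when it carries any "include" tag and no "exclude" tag, and the names `toy_lexicon`, `toy_categories`, `select_entries` and `build_targets` are hypothetical.

```python
# Toy stand-in for the lexicon: each entry maps to the set of categories it is
# tagged with. The words and tags below are made up for illustration.
toy_lexicon = {
    "adore": {"Positiv", "Affil"},
    "cheer": {"Positiv"},
    "betray": {"Negativ", "Hostile", "Affil"},
    "grief": {"Negativ"},
    "attack": {"Negativ", "Hostile"},
}

# Same shape as `category_dict`, reduced to four labels.
toy_categories = {
    "joy": {"include": ["Positiv"], "exclude": ["Affil"]},
    "love": {"include": ["Affil"], "exclude": ["Negativ"]},
    "anger": {"include": ["Hostile"], "exclude": []},
    "sadness": {"include": ["Negativ"], "exclude": ["Hostile"]},
}


def select_entries(lexicon, rule):
    """Return entries tagged with any 'include' category and no 'exclude' one."""
    include, exclude = set(rule["include"]), set(rule["exclude"])
    return sorted(
        word for word, tags in lexicon.items()
        if tags & include and not tags & exclude
    )


def build_targets(lexicon, categories):
    """Flatten the per-label selections into (entry, label) pairs."""
    return [
        (word, label)
        for label, rule in categories.items()
        for word in select_entries(lexicon, rule)
    ]


targets = build_targets(toy_lexicon, toy_categories)
for entry, label in targets:
    print(entry, label)
```

Under these rules a word can land under more than one label, since the rules for different labels are applied independently of each other.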


2.	Use a simple string matching procedure to evaluate how well every utterance matches each of the categories. The category that yields the highest number of matches is assigned to the utterance. Calculate the accuracy of this prediction against the ground-truth labels.

In [6]:
def evaluate_matches(labels, targets, verbose=True):
    # One zero-initialised score column per emotion label, one row per utterance.
    matches = pd.get_dummies(labels) * 0

    # One row per lexicon word, one column per label the word is associated with.
    match_target = pd.get_dummies(targets, columns=['label'], prefix='', prefix_sep='')
    match_target = match_target.groupby('entry').sum()
    i = 0
    for word, vals in match_target.iterrows():
        # Literal substring match (regex=False) against the global `train` texts.
        t = train.text.str.contains(word, regex=False).astype(int)
        for col in match_target.columns:
            matches[col] += t * vals[col]
        if verbose:
            i += 1
            print("{:<5}%".format(round(i * 100 / len(match_target.index), 2)), end='\r')
    print('')
    return matches

In [7]:
matches = evaluate_matches(train.label, match_targets, verbose = True)
print('Accuracy of string matching: ', (matches.idxmax(1) == train.label).mean().round(2))

100.0%
Accuracy of string matching:  0.34
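
The matching above is substring based, so a lexicon word also counts when it appears inside a longer word. A variant that matches whole tokens only might look like the sketch below. It is an illustration under assumptions: `toy_targets` is a made-up mini-lexicon standing in for `match_targets`, and `predict_label` and `accuracy` are hypothetical helpers, not functions from `text_processing`.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; the real (entry, label) pairs come from `match_targets`.
toy_targets = [("hate", "anger"), ("rage", "anger"), ("cry", "sadness"), ("glad", "joy")]


def predict_label(text, targets):
    """Count whole-word lexicon hits per label and return the top label (or None)."""
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = Counter()
    for word, label in targets:
        scores[label] += tokens[word]
    best, hits = max(scores.items(), key=lambda kv: kv[1])
    return best if hits > 0 else None


def accuracy(texts, gold, targets):
    """Fraction of utterances whose predicted label equals the gold label."""
    preds = [predict_label(t, targets) for t in texts]
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


print(predict_label("i hate this and i am full of rage", toy_targets))
# "storage" contains "rage" and "encrypted" contains "cry", but neither is a token.
print(predict_label("the storage room is encrypted", toy_targets))
```

Returning `None` when nothing matches keeps unmatched utterances from being silently assigned to whichever label happens to come first.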


3.	Consider the categories generated by Empath Client https://github.com/Ejhfast/empath-client. Apply Empath Client to each utterance and record the categories that have non-zero weights in the database D. Explain how these categories can be matched to each of the five categories above using appropriate linguistic constructs (entailment, synonymy, hyponymy, hypernymy, etc.). Calculate the accuracy of this prediction approach.

Note: pip install empath

In [12]:
data = process_lexicon(train.text)

99.99%

In [13]:
pd.Series(data)

0        [cold, nervousness, body, violence, love, sham...
1        [hate, nervousness, swearing_terms, suffering,...
2        [cold, nervousness, wealthy, social_media, int...
3        [hate, nervousness, suffering, furniture, opti...
4        [hate, nervousness, suffering, optimism, fear,...
                               ...                        
15995    [cold, nervousness, body, violence, love, sham...
15996    [cold, hate, nervousness, weakness, school, co...
15997    [cold, aggression, masculine, nervousness, bod...
15998    [cold, hate, nervousness, swearing_terms, soci...
15999    [cold, nervousness, ridicule, body, violence, ...
Length: 16000, dtype: object
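
One way to turn the recorded Empath categories into an emotion prediction is a hand-made lookup table, sketched below. Everything in it is an assumption for illustration: the mapping `CATEGORY_TO_EMOTION` is a judgment call based on synonymy and hyponymy rather than part of Empath, the input is taken to be a dict from category name to weight for one utterance, and `emotion_from_categories` is a hypothetical helper.

```python
# Hand-made, hypothetical mapping from Empath category names to emotion labels.
# Each entry is a judgment call (synonymy or hyponymy), not part of Empath.
CATEGORY_TO_EMOTION = {
    "hate": "anger", "aggression": "anger", "rage": "anger",
    "suffering": "sadness", "shame": "sadness",
    "optimism": "joy", "cheerfulness": "joy",
    "love": "love", "affection": "love",
    "nervousness": "fear", "fear": "fear",
    "surprise": "surprise",
}


def emotion_from_categories(scores, mapping=CATEGORY_TO_EMOTION):
    """Sum the non-zero category weights per emotion and return the heaviest emotion."""
    totals = {}
    for category, weight in scores.items():
        emotion = mapping.get(category)
        if emotion is not None and weight > 0:
            totals[emotion] = totals.get(emotion, 0.0) + weight
    if not totals:
        return None  # no mapped category fired for this utterance
    return max(totals, key=totals.get)


# Made-up weights for one utterance; "furniture" is unmapped and therefore ignored.
example = {"hate": 0.2, "suffering": 0.1, "aggression": 0.1, "furniture": 0.3}
print(emotion_from_categories(example))
```

Accuracy then follows by comparing the returned label with the ground-truth label of each utterance, counting `None` as a miss.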

4.	We consider the semantic similarity between each of the five categories and every utterance. Define the overall semantic similarity between category C and utterance S as the arithmetic mean of the Wu and Palmer semantic similarity between C and each noun contained in S (use a part-of-speech tagger to identify the nouns). Report this information in database D. Then, for each utterance, assign the category that yields the highest semantic similarity. Calculate the overall accuracy accordingly.
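
The scoring in this task can be sketched on a toy hierarchy as follows. This is an illustration under assumptions, not a WordNet-based solution: `PARENT` is a made-up is-a hierarchy, the nouns are passed in directly instead of coming from a part-of-speech tagger, the root counts as depth 1, and the category with the highest mean similarity is picked, on the reading that a higher Wu and Palmer score means the two concepts are closer.

```python
# Toy is-a hierarchy (child -> parent); a made-up stand-in for WordNet.
PARENT = {
    "entity": None,
    "feeling": "entity",
    "object": "entity",
    "joy": "feeling",
    "sadness": "feeling",
    "grief": "sadness",
    "delight": "joy",
    "table": "object",
}


def path_to_root(node):
    """Return the chain node, parent, ..., root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), counting the root as depth 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # First node on a's path that is also an ancestor of b: the deepest shared one.
    lcs = next(n for n in path_a if n in ancestors_b)
    return 2 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


def utterance_similarity(category, nouns):
    """Arithmetic mean of the similarity between the category and each noun."""
    if not nouns:
        return 0.0
    return sum(wu_palmer(category, n) for n in nouns) / len(nouns)


def best_category(categories, nouns):
    """Category with the highest mean similarity to the utterance's nouns."""
    return max(categories, key=lambda c: utterance_similarity(c, nouns))


print(best_category(["joy", "sadness"], ["grief", "table"]))
```

With real data, the similarity would instead come from WordNet synsets, for example through NLTK's `wup_similarity` method, and a word with several synsets needs a rule for choosing among them.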

5.	Use SentiStrength from http://sentistrength.wlv.ac.uk/ to determine the positive, negative and overall (sum of positive and negative) sentiment score for each utterance. Provide this information in database D. Comment on whether the sentiment score can be used as an indicator to discriminate between the various emotion states.
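
Once the scores are available, a quick way to see whether they separate the emotions is to average the overall score per label, as in the sketch below. The numbers are made up for illustration and do not come from SentiStrength, the sketch does not call SentiStrength itself, and `mean_overall_by_emotion` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Made-up (positive, negative, emotion) triples for illustration only; real
# scores would come from running SentiStrength on each utterance.
scored = [
    (3, -1, "joy"), (4, -1, "joy"),
    (1, -4, "anger"), (1, -3, "anger"),
    (1, -3, "sadness"), (2, -4, "sadness"),
]


def mean_overall_by_emotion(rows):
    """Average overall score (positive + negative) for each emotion label."""
    buckets = defaultdict(list)
    for pos, neg, emotion in rows:
        buckets[emotion].append(pos + neg)
    return {emotion: mean(values) for emotion, values in buckets.items()}


summary = mean_overall_by_emotion(scored)
for emotion, value in summary.items():
    print(emotion, value)
```

In this toy data the overall score separates joy from the two negative emotions but barely distinguishes anger from sadness, which is the kind of observation the comment in this task should be based on once real scores are in hand.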

6.	Now we want to develop a machine learning approach for predicting the emotion state. For this purpose, tokenize the original data, split it into 70% training and 30% testing, and suggest various filtering strategies (e.g., no filtering, standard stopword removal, a selected set of stopwords, …).
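
The tokenization, filtering and split can be sketched with the standard library alone, as below. The stopword list is a small made-up set rather than a standard list, the regular expression is one simple tokenization choice among many, and `tokenize` and `split_70_30` are hypothetical helper names.

```python
import random
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"i", "a", "the", "and", "am", "to", "of"}


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on non-letters, and drop any word in `stopwords`."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]


def split_70_30(items, seed=0):
    """Shuffle a copy of `items` reproducibly and cut it at the 70% mark."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


print(tokenize("I am glad to see the sun"))             # no filtering
print(tokenize("I am glad to see the sun", STOPWORDS))  # stopwords removed
train_part, test_part = split_70_30(range(10))
print(len(train_part), len(test_part))
```

Passing a different `stopwords` set to `tokenize` is enough to switch between filtering strategies, and fixing `seed` keeps the split identical across the strategies being compared.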

7.	Use various feature engineering techniques, including CountVectorizer and TF-IDF, for different vocabulary sizes (full vocabulary, or the 3000, 2000, 1000, 500 and 100 most frequent words). Compare a set of state-of-the-art machine learning classifiers (Naive Bayes, Linear regression, SVM, Decision Tree and Random Forest). Draw a plot showing the accuracy of the different classifiers with the various features. For the classifier that yields the best accuracy, record the confusion matrix, precision and recall. Compare this to Naive Bayes and Linear regression. Repeat the above for the various filtering strategies in order to select the strategy that maximizes the overall accuracy. Provide the results in a table.
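
The confusion matrix, precision and recall asked for here can be computed from the gold and predicted labels alone, as in the sketch below. The label lists are made up for illustration, and `confusion_counts` and `precision_recall` are hypothetical helpers written with the standard library so that the arithmetic is visible.

```python
from collections import Counter


def confusion_counts(gold, pred):
    """Counter keyed by (gold_label, predicted_label)."""
    return Counter(zip(gold, pred))


def precision_recall(gold, pred, label):
    """Precision and recall for one label, with 0.0 when the denominator is empty."""
    counts = confusion_counts(gold, pred)
    tp = counts[(label, label)]
    predicted = sum(n for (g, p), n in counts.items() if p == label)
    actual = sum(n for (g, p), n in counts.items() if g == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Made-up labels for six utterances.
gold = ["joy", "joy", "anger", "sadness", "anger", "joy"]
pred = ["joy", "anger", "anger", "sadness", "joy", "joy"]

for label in ("joy", "anger", "sadness"):
    print(label, precision_recall(gold, pred, label))
```

In practice, scikit-learn's `confusion_matrix`, `precision_score` and `recall_score` in `sklearn.metrics` cover the same ground, and the `max_features` parameter of `CountVectorizer` and `TfidfVectorizer` caps the vocabulary at the most frequent terms.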

8.	We want to test the performance of a deep learning classifier. For this purpose, we shall follow the paper "Convolutional neural networks for sentence classification", arXiv preprint arXiv:1408.5882, by Yoon Kim (a Python implementation of the paper is also available online). Following the paper, represent each word in a sentence by its word2vec embedding. The features are then the embedding vectors, handled in the same way as in Kim's paper. Fine-tune the parameters of the CNN architecture to maximize accuracy. Report the accuracy, precision, recall and confusion matrix of the CNN classifier.
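
The core operation of that architecture, a filter sliding over windows of consecutive word embeddings followed by max-over-time pooling, can be sketched without any deep learning library. The sketch below shows only the forward pass of a single filter on made-up 2-dimensional embeddings. It is not a trainable model, and the names `conv_feature_map` and `max_over_time` are hypothetical.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def conv_feature_map(sentence, filt, bias=0.0):
    """Slide an h-word filter over the sentence matrix (one row per word)."""
    h = len(filt)  # number of consecutive words the filter spans
    feature_map = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        # Dot product between the filter and the window, plus the bias term.
        activation = bias + sum(
            w * x
            for filt_row, word_vec in zip(filt, window)
            for w, x in zip(filt_row, word_vec)
        )
        feature_map.append(relu(activation))
    return feature_map


def max_over_time(feature_map):
    """Keep the single strongest response of the filter over the sentence."""
    return max(feature_map)


# 4 words with made-up 2-dimensional embeddings (word2vec vectors would go here).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -3.0]]
# One filter spanning h = 2 words.
filt = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv_feature_map(sentence, filt)
pooled = max_over_time(feature_map)
print(feature_map, pooled)
```

A full model applies many such filters with several window sizes, concatenates the pooled values and feeds them to a softmax layer. Sentences shorter than the window would need padding, which this sketch leaves out.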

9.	Repeat the process of task 8 using FastText embeddings instead of word2vec.

10.	Design a simple GUI of your choice that shows the execution of each of the above tasks, to ease the work of the assessor or an external end user.