## Goal:
### Training a model for text classification based on dominant emotion.
<br>

## Data:
### 47288 tweets scraped from the Twitter API with corresponding labels. The labels represent 5 classes of emotion: Neutral 😐,  Happiness 😂,  Fear 😱,  Hate 😒,  Anger 😠
The numerical values corresponding to neutral, happiness, fear, hate and anger are 0, 1, 2, 3 and 4, respectively.
<br>

## Description:
### Multi-class classification is performed using a random forest classifier on n-gram features (up to n=4), and grid search from Scikit-learn is used for hyperparameter tuning.  
*Note: the data for this project was scraped from Twitter and pre-processed to clean it and make it ready for analysis. I am planning to release the scraping and preprocessing code shortly.
<br>

## Result:
### The F1-score is used to measure classification performance. The weighted F1 on the test set across the 5 classes is 0.48.


In [3]:
# import libraries
import re
import nltk
import pkg_resources
import numpy as np
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from sklearn.metrics import f1_score
from symspellpy import SymSpell, Verbosity
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier

In [4]:
# load data
data = pd.read_csv('data.csv')

### Preprocessing
- Spell correction
- Lower case letters
- Punctuation removal
- Correct letter repetition

Spell correction is based on the SymSpell Python library (`symspellpy`). It looks up candidate corrections within a maximum 
edit distance of n (set to 2 via `max_dictionary_edit_distance` in the code below) using fuzzy matching.<br>
The NLTK library is used for preprocessing and preparing sentences for analysis.
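
To make "maximum edit distance" concrete, here is a minimal sketch of a plain Levenshtein distance in pure Python. It is only an illustration of what an edit distance counts: `edit_distance` and `within_max_distance` are hypothetical helpers, not part of `symspellpy`, and SymSpell's own distance metric and lookup strategy are different and much faster.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions and substitutions."""
    # prev[j] = distance between the prefix of `a` processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def within_max_distance(word: str, candidate: str, max_distance: int = 2) -> bool:
    """True if `candidate` would be an acceptable correction for `word`."""
    return edit_distance(word, candidate) <= max_distance


print(edit_distance("coool", "cool"))  # 1 (one deletion)
print(edit_distance("whau", "what"))   # 1 (one substitution)
```

With a maximum distance of 2, both misspellings from the toy sentence used later in this notebook would be close enough to their intended words to be considered as corrections.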

In [5]:
# set up dictionary paths for spell correction with the symspellpy library
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")
# bigram dictionary path (defined for completeness; not loaded in this notebook)
bigram_path = pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt")

# set the maximum edit distance used for dictionary lookups
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

True

In [6]:
nltk.download('wordnet')

# remove repeated words, keeping the first occurrence of each
def de_repeat(tokens):
    return list(dict.fromkeys(tokens))
    
# perform preprocessing    
def pre_process(text):
    # str methods return new strings, so the result must be assigned back
    text = text.lower()
    text = ' '.join(de_repeat(text.split()))
    # segment and spell-correct the text with SymSpell
    text_spell_corr = sym_spell.word_segmentation(text)
    return text_spell_corr.corrected_string
    
# toy example to test functionality of the pre-processing step    
# pre_process("whau you do do is coool")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mk\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
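
The preprocessing list mentions correcting letter repetition, while `de_repeat` above works on whole words. As a hedged sketch of what a character-level version could look like, the hypothetical helper `collapse_letter_runs` below (not part of the original notebook) shortens any run of three or more identical characters to two, so that the spell checker has less work to do.

```python
import re


def collapse_letter_runs(text: str, max_run: int = 2) -> str:
    """Shorten runs of more than `max_run` identical characters to `max_run`."""
    # (.)\1{max_run,} matches a character followed by at least `max_run` copies
    # of itself, i.e. a run that is longer than `max_run`
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


print(collapse_letter_runs("coool"))         # cool
print(collapse_letter_runs("sooooo goood"))  # soo good
print(collapse_letter_runs("happy"))         # happy (runs of two are kept)
```

Keeping two characters rather than one is a design choice: it preserves legitimate doubles such as "happy" or "cool" and leaves the remaining errors ("soo") to the spell-correction step.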


### Tokenize sentences and extract features

In [7]:
# select tweets and convert them to list
sentences = data['sentence'].values.tolist()
emotions = data['emotion'].values.tolist()

In [8]:
def ngram(token, n): 
    # build all n-grams (as space-joined strings) from a list of tokens
    output = []
    for i in range(n-1, len(token)): 
        gram = ' '.join(token[i-n+1:i+1])
        output.append(gram) 
    return output


def create_feature(text, nrange=(1, 1)):
    text_features = [] 
    text = str(text).lower() 

    # 1. treat alphanumeric characters as word tokens
    # Since tweets contain #, we keep it as a feature
    # Then, extract all ngram lengths
    text_alphanum = re.sub('[^a-z0-9#]', ' ', text)
    for n in range(nrange[0], nrange[1]+1): 
        text_features += ngram(text_alphanum.split(), n)
    
    # 2. treat punctuation marks as word tokens
    text_punc = re.sub('[a-z0-9]', ' ', text)
    text_features += ngram(text_punc.split(), 1)
    
    # 3. Return a dictionary whose keys are the extracted features
    # and whose values are the number of times each appeared in the list.
    return Counter(text_features)

In [9]:
# calculate n-gram features (n = 1 to 4) for every sentence
sentences_ngram = [create_feature(text, nrange=(1, 4)) for text in sentences]
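
To see what kind of features this step produces, the sketch below re-implements the same idea in condensed form on a toy sentence, using only unigrams and bigrams to keep the output small. `word_ngrams` and `toy_features` are hypothetical stand-ins written for this illustration; they are meant to mirror `ngram` and `create_feature` above, not replace them.

```python
import re
from collections import Counter


def word_ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def toy_features(text, nrange=(1, 2)):
    text = text.lower()
    # word tokens: letters, digits and '#' are kept, everything else is a separator
    words = re.sub(r"[^a-z0-9#]", " ", text).split()
    feats = []
    for n in range(nrange[0], nrange[1] + 1):
        feats += word_ngrams(words, n)
    # punctuation tokens: whatever is left after blanking letters and digits
    feats += re.sub(r"[a-z0-9]", " ", text).split()
    return Counter(feats)


features = toy_features("So happy, so happy! #blessed")
print(features["happy"])     # 2 (unigram)
print(features["so happy"])  # 2 (bigram)
print(features["#blessed"])  # 1 (hashtag kept as a word token)
print(features["!"])         # 1 (punctuation token)
```

Note that `#` shows up twice in different roles: once inside the word token `#blessed` and once on its own as a punctuation token, because the punctuation pass only blanks out letters and digits.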

### Prepare for training
- Perform train/test split 80/20
- Vectorize sentences
- Define accuracy measure

In [10]:
# split data into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(sentences_ngram, emotions, test_size=0.2, random_state=101)

In [11]:
# convert the list of feature dicts into a sparse feature matrix
vectorizer = DictVectorizer(sparse=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
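
The vectorization step can be pictured with a small pure-Python sketch. It assumes the usual behaviour of dictionary vectorization: the vocabulary is learned from the training dicts only, each feature name gets one column, and features that appear only at transform time are dropped. `fit_vocabulary` and `to_rows` are hypothetical helpers for illustration; the real `DictVectorizer(sparse=True)` returns a SciPy sparse matrix rather than lists.

```python
from collections import Counter


def fit_vocabulary(feature_dicts):
    """Map every feature name seen in training to a column index (sorted by name)."""
    names = sorted({name for d in feature_dicts for name in d})
    return {name: col for col, name in enumerate(names)}


def to_rows(feature_dicts, vocabulary):
    """Turn each dict into a dense count row; unknown features are ignored."""
    rows = []
    for d in feature_dicts:
        row = [0] * len(vocabulary)
        for name, count in d.items():
            col = vocabulary.get(name)
            if col is not None:
                row[col] = count
        rows.append(row)
    return rows


train = [Counter({"happy": 2, "so happy": 1}), Counter({"sad": 1})]
test = [Counter({"happy": 1, "unseen ngram": 3})]

vocab = fit_vocabulary(train)  # {'happy': 0, 'sad': 1, 'so happy': 2}
print(to_rows(train, vocab))   # [[2, 0, 1], [0, 1, 0]]
print(to_rows(test, vocab))    # [[1, 0, 0]]
```

This is why the vectorizer is fitted on the training split only and then reused for the test split and for new sentences: the column layout must stay fixed for the classifier.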

In [12]:
# fit a classifier and report the weighted F1-score on the train and test sets
def train_test(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    # weighted F1 averages the per-class F1 scores, weighting each class by its
    # number of true instances, to account for label imbalance
    train_acc = f1_score(y_train, clf.predict(X_train), average='weighted')
    test_acc = f1_score(y_test, clf.predict(X_test), average='weighted')   
    print("Train acc: {}".format(np.round(train_acc, 2)))
    print("Test acc : {}".format(np.round(test_acc, 2)))
    return train_acc, test_acc
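
For readers who want to check what the weighted F1-score measures, here is a minimal hand-rolled version on a toy example. `weighted_f1` is a hypothetical helper written for this illustration; it is intended to mirror the support-weighted averaging described above, not to replace `f1_score`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# class 0: precision 3/4, recall 3/4, F1 0.75 (support 4)
# class 1: precision 1/2, recall 1/2, F1 0.50 (support 2)
print(round(weighted_f1(y_true, y_pred), 3))  # 0.667
```

Because the larger class carries more weight, the result (0.667) sits closer to the F1 of class 0 than a plain average of the two class scores (0.625) would.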

### Train & fine tune model
- Train a vanilla random forest classifier
- Perform grid search for hyperparameter tuning and to avoid overfitting

In [13]:
# define and fit a random forest classifier with 450 trees
forest_clf = RandomForestClassifier(max_depth=200, n_estimators=450,
                                    max_leaf_nodes=200, n_jobs=-1, random_state=101)

train_acc, test_acc = train_test(forest_clf, X_train_vec, X_test_vec, y_train, y_test)

Train acc: 0.51
Test acc : 0.48


Training a random forest with the default parameter settings results in overfitting. A possible solution is to limit the depth of the trees or to increase the minimum number of samples per split. The following code performs a grid search to find the best hyperparameter setting within a pre-defined range. The dominant hyperparameters for this experiment were found to be: 
- Depth of the trees
- Number of trees 
- Maximum number of leaf nodes per tree

In [134]:
# grid search for hyperparameter tuning.

# define grid search parameters
param_grid = {'max_depth': [100, 200, 300],
              'n_estimators': [300, 500, 700],
              'max_leaf_nodes': [100, 200],
              'min_samples_split': [2, 4, 6]}

# perform grid search with 3-fold cross-validation while using 
# 9 CPU workers (use -1 to utilize all available resources)
grid_search = GridSearchCV(forest_clf, param_grid, cv=3, verbose=1, n_jobs=9)
# fit on the vectorized features; the raw feature dicts in X_train are not a numeric matrix
grid_search.fit(X_train_vec, y_train)
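
Before launching a grid search of this size it can help to count how many models will be trained. The sketch below enumerates the same grid with `itertools.product`; it only does the arithmetic and does not train anything. The `combos` list is a name introduced here for illustration.

```python
from itertools import product

# same grid as above, repeated so that the snippet is self-contained
param_grid = {'max_depth': [100, 200, 300],
              'n_estimators': [300, 500, 700],
              'max_leaf_nodes': [100, 200],
              'min_samples_split': [2, 4, 6]}
cv_folds = 3

# every combination of one value per hyperparameter
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(combos))             # 54 candidate settings
print(len(combos) * cv_folds)  # 162 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` then fits the best setting once more on the full training set, so the search trains 163 forests of several hundred trees each. That is worth knowing before choosing the grid and the number of workers.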

In [None]:
print('Best setting of hyperparameters:')
# best parameter found by the grid search 
print(grid_search.best_params_)

# best accuracy score obtained by the grid search (corresponding to the above best parameters)
print('Best accuracy score is: {}'.format(np.round(grid_search.best_score_, 2)))

In [14]:
# train random forest with final setting of hyperparameters
forest_clf.fit(X_train_vec, y_train)

RandomForestClassifier(max_depth=200, max_leaf_nodes=200, n_estimators=450,
                       n_jobs=-1, random_state=101)

### Test the model

In [15]:
# define emoji dictionary
emoji_dict = {0:"😐", 1:"😂", 2:"😱", 3:"😒", 4:"😠"}

In [16]:
def suggest_emoji(text):
    text_preprocess = pre_process(text)
    sentence_ngram = create_feature(text_preprocess, nrange=(1, 4))
    sentence_ngram_vec = vectorizer.transform(sentence_ngram)
    pred = forest_clf.predict(sentence_ngram_vec)
    return emoji_dict[pred[0]]

In [17]:
text = 'He is a happy man'
suggest_emoji(text)

'😂'

In [18]:
text = 'you re missing the devil wears prada sad'
suggest_emoji(text)

'😱'

In [19]:
text = "the dance video that haunts my dreams india is confused angry wtf more cheese than a bollywood d movie"
suggest_emoji(text)

'😠'