
## Imports

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import collections

nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load data

In [0]:
def load_data(filename):
  data = pd.read_csv(filename, sep='\t')
  return data

In [0]:
file_path = '/content/drive/My Drive/datasets/'

train = load_data(file_path + '2018-E-c-En-train.txt')
val = load_data(file_path + '2018-E-c-En-dev.txt')
test = load_data(file_path + '2018-E-c-En-test-gold.txt')

In [5]:
train.head()

| | ID | Tweet | anger | anticipation | disgust | fear | joy | love | optimism | pessimism | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-En-21441 | “Worry is a down payment on a problem you may ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2017-En-31535 | Whatever you decide to do make sure it makes y... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2017-En-21068 | @Max_Kellerman it also helps that the majorit... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 2017-En-31436 | Accept the challenges so that you can literall... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 2017-En-22195 | My roommate: it's okay that we can't spell bec... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Let's create a list of all the labels to predict, plus a 'none' column that flags tweets with no labels at all.

In [6]:
class_names = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 
              'optimism', 'pessimism', 'sadness', 'surprise', 'trust']

train['none'] = 1-train[class_names].max(axis=1)
train.describe()

| | anger | anticipation | disgust | fear | joy | love | optimism | pessimism | sadness | surprise | trust | none |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 | 6838.0 |
| mean | 0.372039 | 0.143024 | 0.380521 | 0.181632 | 0.36224 | 0.102369 | 0.290143 | 0.116262 | 0.293653 | 0.052793 | 0.052208 | 0.029833 |
| std | 0.483384 | 0.350123 | 0.48555 | 0.385569 | 0.480683 | 0.303155 | 0.453862 | 0.320562 | 0.455468 | 0.223637 | 0.222463 | 0.17014 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
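
The `none` column works because every label column holds 0 or 1, so the row-wise maximum is 1 exactly when at least one emotion is present. Here is a minimal sketch on a made-up three-row frame; the `toy` DataFrame, its values, and `label_cols` are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy labels; the values are made up for illustration.
toy = pd.DataFrame({
    'anger': [1, 0, 0],
    'joy':   [0, 0, 1],
    'fear':  [1, 0, 0],
})
label_cols = ['anger', 'joy', 'fear']

# Row-wise max is 1 if any label is set, so 1 - max flags unlabeled rows.
toy['none'] = 1 - toy[label_cols].max(axis=1)

# The mean of a 0/1 column is the fraction of rows where it equals 1.
none_fraction = toy['none'].mean()
print(toy)
print(none_fraction)
```

Read the same way, the `none` mean of 0.029833 reported by `train.describe()` corresponds to about 204 of the 6838 training tweets carrying no label.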


In [7]:
len(train), len(test), len(val)

(6838, 3259, 886)

Since a few tweets are empty, we replace them with the placeholder 'unknown'.

In [0]:
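# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
train['Tweet'] = train['Tweet'].fillna("unknown")
test['Tweet'] = test['Tweet'].fillna("unknown")
val['Tweet'] = val['Tweet'].fillna("unknown")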
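
One way to check how many tweets are actually missing is to count the nulls in a split before filling them. The sketch below uses a hypothetical three-row frame (`df`) in place of the real splits.

```python
import pandas as pd

# Hypothetical stand-in for one split; the real frames come from load_data.
df = pd.DataFrame({'ID': ['a', 'b', 'c'], 'Tweet': ['hello', None, 'world']})

# Count missing tweets before replacing them.
n_missing = int(df['Tweet'].isna().sum())

# Replace missing tweets with the placeholder and assign the column back.
df['Tweet'] = df['Tweet'].fillna('unknown')

print(n_missing)
print(df['Tweet'].tolist())
```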

In [0]:
train_text = train['Tweet']
test_text = test['Tweet']
val_text = val['Tweet']
all_text = pd.concat([train_text, test_text, val_text])

## Text preprocessing

In [10]:
train_text.head()

0    “Worry is a down payment on a problem you may ...
1    Whatever you decide to do make sure it makes y...
2    @Max_Kellerman  it also helps that the majorit...
3    Accept the challenges so that you can literall...
4    My roommate: it's okay that we can't spell bec...
Name: Tweet, dtype: object

In [0]:
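# Raw strings avoid invalid-escape warnings. Despite its name, the first
# pattern is substituted with '' in preprocessor(), not with a space.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')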
STOPWORDS = set(stopwords.words('english'))

def preprocessor(text):
    text = text.lower()
    text = re.sub(' +', ' ', text)
    #text = re.sub('#', '', text) # remove hashtags
    text = REPLACE_BY_SPACE_RE.sub('', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = ' '.join([word for word in text.split() if word not in STOPWORDS])
    text = text.strip()
  
    return text

In [12]:
preprocessor(train_text[250])

'dont #afraid space #dreams #reality #dream #make'
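
To see what each cleaning step contributes, the sketch below records the string after every stage. It reuses the two regular expressions from the notebook but swaps NLTK's stopword list for a tiny hand-written set (`TOY_STOPWORDS`, an assumption made so the example runs without downloads). The `trace_preprocessor` helper and the sample tweet are hypothetical.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a tiny stand-in for NLTK's English stopword list.
TOY_STOPWORDS = {'is', 'a', 'the', 'to', 'be'}

def trace_preprocessor(text):
    """Return the intermediate string after each cleaning step."""
    steps = {}
    text = text.lower()
    steps['lower'] = text
    text = re.sub(' +', ' ', text)
    steps['collapse_spaces'] = text
    text = REPLACE_BY_SPACE_RE.sub('', text)
    steps['drop_delimiters'] = text
    text = BAD_SYMBOLS_RE.sub('', text)
    steps['drop_other_symbols'] = text
    text = ' '.join(w for w in text.split() if w not in TOY_STOPWORDS)
    steps['drop_stopwords'] = text.strip()
    return steps

# Made-up sample tweet.
sample = "@Some_User  Don't be AFRAID; this is a #dream!"
trace = trace_preprocessor(sample)
for name, value in trace.items():
    print(f'{name:>20}: {value}')
```

With the toy stopword set, the sample reduces to `some_user dont afraid this #dream`. Hashtags and underscores survive because `#` and `_` are in the allowed character class, while the apostrophe in "Don't" is deleted rather than replaced by a space.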

In [0]:
train_text = train_text.apply(preprocessor)
test_text = test_text.apply(preprocessor)
val_text = val_text.apply(preprocessor)

In [14]:
train_text.head()

0    worry payment problem may never joyce meyer #m...
1               whatever decide make sure makes #happy
2    max_kellerman also helps majority nfl coaching...
3    accept challenges literally even feel exhilara...
4    roommate okay cant spell autocorrect #terrible...
Name: Tweet, dtype: object

## Transforming text to a vector

### TF-IDF

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
vectorizer = TfidfVectorizer(
    ngram_range=(1,1),
    max_df=0.9,
    min_df=5,
    token_pattern=r'(\S+)'
)

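# Note: all_text was concatenated before preprocessing, so the vocabulary is
# learned from the raw tweets of all three splits.
vectorizer.fit(all_text)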
train_features = vectorizer.transform(train_text)
test_features = vectorizer.transform(test_text)
val_features = vectorizer.transform(val_text)

In [67]:
train_features.shape, test_features.shape, val_features.shape

((6838, 3345), (3259, 3345), (886, 3345))
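
The vectorizer above learns its vocabulary from `all_text`, which was assembled before preprocessing and includes the validation and test tweets. A common alternative is to fit on preprocessed training text only, then reuse that vocabulary for the other splits. The sketch below illustrates this on made-up corpora; `toy_train`, `toy_val`, and `toy_vectorizer` are hypothetical names, and `min_df` is lowered to 1 only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed toy corpora standing in for the real splits.
toy_train = ['dont afraid #dream', 'make #dream #reality', 'whatever makes #happy']
toy_val = ['afraid #nightmare', '#happy #dream']

toy_vectorizer = TfidfVectorizer(
    ngram_range=(1, 1),
    min_df=1,                 # assumption: lowered from 5 for the tiny toy corpus
    token_pattern=r'(\S+)',   # whitespace-delimited tokens, so hashtags stay intact
)

# Learn the vocabulary and IDF weights from the training text only.
toy_train_features = toy_vectorizer.fit_transform(toy_train)
# Tokens unseen in training (here '#nightmare') are ignored at transform time.
toy_val_features = toy_vectorizer.transform(toy_val)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_features.shape, toy_val_features.shape)
```

Fitting this way keeps validation and test vocabulary out of the features, at the cost of ignoring tokens that never appear in training.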

## Train classifier

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [0]:
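def train_classifier(X_train, X_val, X_test):
  # Fits one binary classifier per label, reading the targets from the global
  # train/val/test frames. The same estimator is refit for every label, so the
  # returned clf is the model for the last label in class_names.
  clf = LogisticRegression()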
  for class_name in class_names:
    print('\n... Processing {}'.format(class_name))

    train_target = train[class_name]
    test_target = test[class_name]
    val_target = val[class_name]

    clf.fit(X_train, train_target)

    # compute training accuracy
    y_pred_X_train = clf.predict(X_train)
    print('Training accuracy is {}'.format(accuracy_score(train_target, y_pred_X_train)))
    
    # compute validation accuracy
    y_pred_X_val = clf.predict(X_val)
    print('Validation accuracy is {}'.format(accuracy_score(val_target, y_pred_X_val)))

    # compute testing accuracy
    y_pred_X_test = clf.predict(X_test)
    print('Testing accuracy is {}'.format(accuracy_score(test_target, y_pred_X_test)))
    
  return clf
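
Accuracy alone can flatter rare labels. According to the `train.describe()` output, only about 14% of training tweets carry `anticipation`, so always predicting 0 would already score close to 86%. The sketch below, on synthetic data, keeps a separate fitted model per label and prints F1 and a majority-class baseline next to accuracy. The data, the label names, and the `models` dictionary are hypothetical, and scores are computed on the training data only for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 5 features, two hypothetical labels.
X = rng.normal(size=(200, 5))
labels = {
    'common_label': (X[:, 0] > 0).astype(int),    # roughly balanced
    'rare_label': (X[:, 1] > 1.2).astype(int),    # only a small share of positives
}

models = {}
for name, y in labels.items():
    model = LogisticRegression()   # fresh estimator per label
    model.fit(X, y)
    models[name] = model           # keep every fitted model, not just the last

    pred = model.predict(X)
    # Accuracy obtained by always predicting the majority class.
    baseline = max(y.mean(), 1 - y.mean())
    print(name,
          'accuracy=%.3f' % accuracy_score(y, pred),
          'baseline=%.3f' % baseline,
          'f1=%.3f' % f1_score(y, pred))
```

Comparing accuracy against the baseline shows how much a classifier adds beyond predicting the majority class, and storing the models in a dictionary keeps each label's classifier available for later prediction.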

In [74]:
clf = train_classifier(train_features, val_features, test_features)


... Processing anger
Training accuracy is 0.8628253875402164
Validation accuracy is 0.7663656884875847
Testing accuracy is 0.7707885854556612

... Processing anticipation
Training accuracy is 0.8606317636735887
Validation accuracy is 0.8577878103837472
Testing accuracy is 0.869898741945382

... Processing disgust
Training accuracy is 0.8379643170517695
Validation accuracy is 0.7178329571106095
Testing accuracy is 0.7477753912243019

... Processing fear
Training accuracy is 0.8928049137174613
Validation accuracy is 0.891647855530474
Testing accuracy is 0.8864682417919607

... Processing joy
Training accuracy is 0.862971629131325
Validation accuracy is 0.7257336343115124
Testing accuracy is 0.754525928198834

... Processing love
Training accuracy is 0.9140099444281954
Validation accuracy is 0.873589164785553
Testing accuracy is 0.8646824179196072

... Processing optimism
Training accuracy is 0.8188066686165546
Validation accuracy is 0.7291196388261851
Testing accuracy is 0.7229211414544