
## Imports

In [175]:
import pandas as pd
import numpy as np
import nltk
import re
import collections

nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [176]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load data

In [0]:
def load_data(filename):
  data = pd.read_csv(filename, sep='\t')
  return data

In [0]:
file_path = '/content/drive/My Drive/datasets/'

train = load_data(file_path + '2018-E-c-En-train.txt')
val = load_data(file_path + '2018-E-c-En-dev.txt')
test = load_data(file_path + '2018-E-c-En-test-gold.txt')

In [179]:
train.head()

Unnamed: 0,ID,Tweet,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,2017-En-21441,“Worry is a down payment on a problem you may ...,0,1,0,0,0,0,1,0,0,0,1
1,2017-En-31535,Whatever you decide to do make sure it makes y...,0,0,0,0,1,1,1,0,0,0,0
2,2017-En-21068,@Max_Kellerman it also helps that the majorit...,1,0,1,0,1,0,1,0,0,0,0
3,2017-En-31436,Accept the challenges so that you can literall...,0,0,0,0,1,0,1,0,0,0,0
4,2017-En-22195,My roommate: it's okay that we can't spell bec...,1,0,1,0,0,0,0,0,0,0,0


Let's create a list of all the labels to predict, plus a 'none' column that flags tweets with no labels at all.

In [180]:
class_names = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 
              'optimism', 'pessimism', 'sadness', 'surprise', 'trust']

train['none'] = 1-train[class_names].max(axis=1)
train.describe()

Unnamed: 0,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust,none
count,6838.0,6838.0,6838.0,6838.0,6838.0,6838.0,6838.0,6838.0,6838.0,6838.0,6838.0,6838.0
mean,0.372039,0.143024,0.380521,0.181632,0.36224,0.102369,0.290143,0.116262,0.293653,0.052793,0.052208,0.029833
std,0.483384,0.350123,0.48555,0.385569,0.480683,0.303155,0.453862,0.320562,0.455468,0.223637,0.222463,0.17014
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
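To make the `1 - max(axis=1)` trick concrete, here is a minimal sketch on a toy frame. The values and the three-label subset are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy label matrix with hypothetical values (not taken from the dataset).
labels = ['anger', 'joy', 'sadness']
toy = pd.DataFrame({
    'anger':   [1, 0, 0],
    'joy':     [0, 0, 1],
    'sadness': [1, 0, 0],
})

# max(axis=1) is 1 when any label is set on the row,
# so 1 - max is 1 only for rows that carry no label at all.
toy['none'] = 1 - toy[labels].max(axis=1)

print(toy['none'].tolist())
# The mean of a 0/1 column is the share of rows with no label.
print(toy['none'].mean())
```

Read the same way, the `mean` of 0.029833 for `none` in the `describe()` output above says that roughly 3% of the 6838 training tweets have no label.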


In [181]:
len(train), len(test), len(val)

(6838, 3259, 886)

A few tweets are empty, so we replace them with the placeholder 'unknown'.

In [0]:
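# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the frame.
train['Tweet'] = train['Tweet'].fillna("unknown")
test['Tweet'] = test['Tweet'].fillna("unknown")
val['Tweet'] = val['Tweet'].fillna("unknown")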
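Before filling, it can help to check how many tweets are actually missing. The sketch below does this on a hypothetical three-row frame standing in for `train`, `val`, or `test`; the example tweets are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for train / val / test.
toy = pd.DataFrame({'Tweet': ['so happy today', np.nan, 'this is awful']})

# Count missing tweets first, so the size of the problem is known.
n_missing = int(toy['Tweet'].isna().sum())

# Replace missing tweets with the same placeholder the notebook uses.
toy['Tweet'] = toy['Tweet'].fillna('unknown')

print(n_missing)
print(toy['Tweet'].tolist())
```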

In [0]:
train_text = train['Tweet']
test_text = test['Tweet']
val_text = val['Tweet']
all_text = pd.concat([train_text, test_text, val_text])

## Text preprocessing

In [184]:
train_text.head()

0    “Worry is a down payment on a problem you may ...
1    Whatever you decide to do make sure it makes y...
2    @Max_Kellerman  it also helps that the majorit...
3    Accept the challenges so that you can literall...
4    My roommate: it's okay that we can't spell bec...
Name: Tweet, dtype: object

In [0]:
# Punctuation that is stripped outright (despite the name, it is replaced by '' and not by a space).
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
# Anything that is not a digit, lowercase letter, space, '#', '+' or '_'.
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def preprocessor(text):
    text = text.lower()
    text = re.sub(' +', ' ', text)  # collapse runs of spaces
    #text = re.sub('#', '', text) # remove hashtags
    text = REPLACE_BY_SPACE_RE.sub('', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = ' '.join([word for word in text.split() if word not in STOPWORDS])
    text = text.strip()

    return text

In [186]:
preprocessor(train_text[250])

'dont #afraid space #dreams #reality #dream #make'
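To see what each cleaning step contributes, the sketch below traces a made-up tweet through the same two regular expressions. `trace_preprocessing` and `TOY_STOPWORDS` are hypothetical names introduced for illustration; the small stopword set stands in for the NLTK list so that the example runs without a download.

```python
import re

# Same patterns as the preprocessing cell above, written as raw strings.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Hypothetical stand-in for the NLTK English stopword list.
TOY_STOPWORDS = {'be', 'of', 'the', 'to', 'your'}

def trace_preprocessing(text):
    """Return the intermediate string after each cleaning step."""
    steps = {}
    text = text.lower()
    steps['lowercase'] = text
    text = REPLACE_BY_SPACE_RE.sub('', text)   # drops '@' and ','
    steps['strip_punctuation'] = text
    text = BAD_SYMBOLS_RE.sub('', text)        # drops apostrophes and '!'
    steps['strip_symbols'] = text
    text = ' '.join(w for w in text.split() if w not in TOY_STOPWORDS)
    steps['drop_stopwords'] = text
    return steps

for name, value in trace_preprocessing("Don't be afraid of the @space, #dreams!").items():
    print(f'{name}: {value}')
```

Note that apostrophes are removed before the stopword filter runs, so a contraction such as "don't" becomes "dont" and no longer matches a stopword entry. This is probably why tokens like `dont` and `cant` survive in the cleaned output shown in this section.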

In [0]:
train_text = train_text.apply(preprocessor)
test_text = test_text.apply(preprocessor)
val_text = val_text.apply(preprocessor)

In [188]:
train_text.head()

0    worry payment problem may never joyce meyer #m...
1               whatever decide make sure makes #happy
2    max_kellerman also helps majority nfl coaching...
3    accept challenges literally even feel exhilara...
4    roommate okay cant spell autocorrect #terrible...
Name: Tweet, dtype: object

## Transforming text to a vector

### TF-IDF

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
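def tfidf_features(train, val, test):
  # Unigrams and bigrams; a token is any run of non-whitespace, so hashtags stay intact.
  vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.9, min_df=2,
                               token_pattern=r'(\S+)')
  
  # Note: all_text is a global built before preprocessing, so the vocabulary
  # and IDF weights are fit on the raw tweets of all three splits.
  vectorizer.fit(all_text)

  train_features = vectorizer.transform(train)
  test_features = vectorizer.transform(test)
  val_features = vectorizer.transform(val)

  return vectorizer, train_features, val_features, test_features, vectorizer.vocabulary_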

In [0]:
vect, train_features, val_features, test_features, vocab = tfidf_features(train_text, val_text, test_text)
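A common alternative is to learn the vocabulary and IDF weights from the cleaned training split only and then reuse them for held-out text. The sketch below shows that pattern on hypothetical toy corpora; it is not the setup that produced the results in this notebook, and `min_df` is lowered to 1 only so that the tiny corpus keeps a vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned toy corpora (not from the dataset).
toy_train = ['feel #happy today', 'feel sad today', 'so #happy so glad']
toy_test = ['feel glad', 'unseen words only']

# Same settings as tfidf_features, except min_df=1 for the tiny corpus.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=1,
                             token_pattern=r'(\S+)')

# Learn vocabulary and IDF weights from the training split only.
train_matrix = vectorizer.fit_transform(toy_train)
# Reuse them for held-out text; n-grams not seen in training are ignored.
test_matrix = vectorizer.transform(toy_test)

print(train_matrix.shape, test_matrix.shape)
print('#happy' in vectorizer.vocabulary_)
```

Both matrices share the same number of columns, and a held-out document made up entirely of unseen words becomes an all-zero row.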

In [0]:
#vocab['alcohol']

## Train classifier

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

In [0]:
def train_classifier(X_train, X_val, X_test, pred_text):
  # Relies on the globals train, val, test, class_names and vect defined above.
  clf = OneVsRestClassifier(LogisticRegression())

  predictions = []
  scores = []

  # Fit one binary problem per emotion label.
  for class_name in class_names:
    print('\n... Processing {}'.format(class_name))

    train_target = train[class_name]
    test_target = test[class_name]
    val_target = val[class_name]

    # 3-fold cross-validated ROC AUC on the training split
    cv_score = np.mean(cross_val_score(clf, X_train, train_target, cv=3, scoring='roc_auc'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))

    clf.fit(X_train, train_target)

    # compute training accuracy
    y_pred_X_train = clf.predict(X_train)
    print('Training accuracy is {}'.format(accuracy_score(train_target, y_pred_X_train)))
    
    # compute validation accuracy
    y_pred_X_val = clf.predict(X_val)
    print('Validation accuracy is {}'.format(accuracy_score(val_target, y_pred_X_val)))

    # compute testing accuracy
    y_pred_X_test = clf.predict(X_test)
    print('Testing accuracy is {}'.format(accuracy_score(test_target, y_pred_X_test)))

    # record which of the free-text examples this label fires on
    for t in pred_text:
      pred = clf.predict(vect.transform([t]))
      if pred[0] == 1:
        predictions.append("\n" + t + "\nPrediction: " + class_name)

  # clf is refit on every iteration, so the returned model is the one for the last label only.
  return (clf, predictions, scores)
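The loop above fits one binary problem per label and keeps only the last fitted model. As an alternative sketch, `OneVsRestClassifier` also accepts a two-dimensional label indicator matrix and then fits one estimator per column in a single call. The features and labels below are hypothetical toy values, and `multi_clf` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy features and a two-column label indicator matrix.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
              [0.1, 0.9], [1.0, 1.0], [0.0, 0.0]])
Y = np.array([[1, 0], [1, 0], [0, 1],
              [0, 1], [1, 1], [0, 0]])

# One LogisticRegression is fitted per label column.
multi_clf = OneVsRestClassifier(LogisticRegression())
multi_clf.fit(X, Y)

# predict returns one 0/1 column per label, so a row can carry several labels.
pred = multi_clf.predict(X)
print(pred.shape)
print(len(multi_clf.estimators_))
```

In this notebook the equivalent call would pass `train[class_names]` as the target, which keeps every per-label model available after training.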

In [255]:
depressive_text = ["I kind of understand what you mean. I guess we are depressed so much it becomes familiar, almost comfortable in a way. I get it.",
                   "I thought I was the only one like this, I hate being this sad and all like one hurt myself and I wanna commit suicide but then again I wanna stay like this",
                   "I can’t stand it anymore. I’m committing suicide. This is my story. Being good and positive won’t help u stop suffering.",
                   "I'm sick of getting told life is always worth it and suicide isn't the answer.",
                   "Do you ever feel like listening to sad music and crying for no apparent reason?"]

happy_text = ["Just wanted to share my 6 month progress pic and how happy I am about it! I can't stop smiling everytime I see myself in the mirror. That's 60lbs of fat gone! For anyone out there struggling with self-esteem, you're beautiful. Don't ever let anyone tell you differently.",
              "After years of being fat and unhealthy i’m so happy to be closer to my goal."]

pred_text = depressive_text + happy_text

(clf, preds, scores) = train_classifier(train_features, val_features, test_features, pred_text)

print('\nTotal CV score is {}'.format(np.mean(scores)))


... Processing anger
CV score for class anger is 0.8552042250067343
Training accuracy is 0.8601930389002632
Validation accuracy is 0.7652370203160271
Testing accuracy is 0.7714022706351642

... Processing anticipation
CV score for class anticipation is 0.675491604157343
Training accuracy is 0.8613629716291313
Validation accuracy is 0.8566591422121896
Testing accuracy is 0.869898741945382

... Processing disgust
CV score for class disgust is 0.8110700291480288
Training accuracy is 0.8331383445451886
Validation accuracy is 0.7133182844243793
Testing accuracy is 0.7480822338140534

... Processing fear
CV score for class fear is 0.8699509994718566
Training accuracy is 0.8929511553085697
Validation accuracy is 0.8927765237020316
Testing accuracy is 0.8895366676894753

... Processing joy
CV score for class joy is 0.8649699547791666
Training accuracy is 0.8606317636735887
Validation accuracy is 0.7268623024830699
Testing accuracy is 0.7526848726603252

... Processing love
CV score for class 
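Accuracy is hard to read on imbalanced labels, so it helps to compare it with the accuracy of always predicting the more frequent class. The sketch below derives that baseline from the positive-label rates in the `mean` row of `train.describe()` above; `majority_baseline` is a hypothetical helper, and the rates describe the training split only, so validation and test baselines may differ.

```python
# Positive-label rates copied from the `mean` row of train.describe() above.
positive_rate = {
    'anger': 0.372039,
    'anticipation': 0.143024,
    'disgust': 0.380521,
    'fear': 0.181632,
    'joy': 0.362240,
}

def majority_baseline(rate):
    """Accuracy of always predicting the more frequent of the two classes."""
    return max(rate, 1.0 - rate)

for name, rate in positive_rate.items():
    print(f'{name}: {majority_baseline(rate):.3f}')
```

For `anticipation`, the training baseline is about 0.857, so the reported training accuracy of roughly 0.861 is only marginally better than predicting "no anticipation" for every tweet. This is in line with its comparatively low CV ROC AUC of about 0.675.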

In [256]:
for p in preds:
  print(p)


I'm sick of getting told life is always worth it and suicide isn't the answer.
Prediction: anger

After years of being fat and unhealthy i’m so happy to be closer to my goal.
Prediction: joy

Do you ever feel like listening to sad music and crying for no apparent reason?
Prediction: sadness


# Word embeddings (GloVe) 