### 10 Emoji Prediction

The task of this project is to make a system that would automatically fill the text with the appropriate emoticons. This can be done in two steps. First, for each position within the text a prediction is made whether an emoticon should be placed there. Second, an appropriate emoticon is chosen from a list of available emoticons. Both these tasks can be set up as supervised classification problems.

Competition website:
https://competitions.codalab.org/competitions/17344

Dataset:
https://competitions.codalab.org/competitions/17344

Entry point:
https://arxiv.org/pdf/1702.07285.pdf (Barbieri, Francesco, Miguel Ballesteros, and Horacio Saggion. Are Emojis Predictable?)

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline

Populating the interactive namespace from numpy and matplotlib


Define paths to folders containing the data and the results.

In [2]:
NUMBER_OF_TWEETS = "ALL"
# NUMBER_OF_TWEETS = 100000

MAX_WORDS_PER_TWEET = 30
# DATA_LOCATION = "./train/data/noise/"
DATA_LOCATION = "./train/data/clean/"
# DATA_LOCATION = "./train/data/words/"
RESULT_LOCATION = "./result/"
TWEET_FILE_NAME = "tweet_by_ID_28_4_2018__03_20_05" + "_"

k = 3

if NUMBER_OF_TWEETS is not None:
    TWEET_FILE_NAME += str(NUMBER_OF_TWEETS)
else:
    TWEET_FILE_NAME += "ALL"

### Load the data
Tweets are loaded in two ways: list of strings (for the TF-IDF vectorizer) and a list of lists of words (for feature extraction). Labels are read as a numpy array of N * MAX_WORDS_PER_TWEET dimensions.

In [3]:
base_file_name = DATA_LOCATION + TWEET_FILE_NAME

text_lines = []
text_lines_split = []

with open(base_file_name + ".text", 'r', encoding="utf-8") as out_text:
    for line in out_text:
        text_lines.append(line[:-1])
        text_lines_split.append(line[:-1].split())
        
loc_lines = []
with open(base_file_name + ".loclabels", 'r') as loc_labels:
    for line in loc_labels:
        loc_line = []
        for c in line[:-1]:
            loc_line.append(int(c))
        loc_lines.append(loc_line)

loc_lines = np.asarray(loc_lines)

emo_lines = []
with open(base_file_name + ".emolabels", 'r') as emo_labels:
    for e_line, loc in zip(emo_labels, loc_lines):
        emo_line = [0]*31
        e_line2 = e_line.split()
        
        br = 0
        for idx, val in enumerate(loc[:-1]):
            if val==1:
                emo_line[idx]=int(e_line2[br])+1
                br += 1
        emo_lines.append(emo_line)
        
emo_lines = np.asarray(emo_lines)

print(f"number of tweets {len(text_lines)}")
print(f"example of tweet texts:")
for i in range(10):
    print(f"{text_lines[i]}")
print(f"\nexample of labels (emoji locations):\n{loc_lines[:10]}")
print(f"\nexample of labels (emoji type):\n{emo_lines[:10]}")

number of tweets 473459
example of tweet texts:
lol @ west covina , california
things got a little festive at the office @ redrock
step out and explore . @ ellis island cafe
@user @ cathedral preparatory school
my baby bear @ bubby's
rupaul's drag race bingo fun . drag queens be sexy ! @user abwyman
black history like a mufffffaaaaaka done thru her yugioh trap card like hell
just light makeup
@ bj's restaurant and brewhouse
so lovely catching up with my soul sister @user @ university of victoria

example of labels (emoji locations):
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 

### TF-IDF Feature Extraction
TF-IDF is computed on the collection of tweets. Then for every position between words a new example is generated: a $2*k$ array containing the k left and k right tf-idf values of words. Labels are taken as 1 or 0 wether an emoji was there in the original tweet.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import random

X_tfidf = None
word_to_tfidf_index_dict = {}

def tfidf_features(tweets, loc_labels, emo_labels, k, func):
    global word_to_tfidf_index_dict, X_tfidf
    
    tfidf_model = TfidfVectorizer(input="content", analyzer="word", stop_words="english")
    X_tfidf = tfidf_model.fit_transform(text_lines)

    #word_to_tfidf_index_dict = {}
    for i, word in enumerate(tfidf_model.get_feature_names()):
        word_to_tfidf_index_dict[word] = i

    print(f"X shape {X_tfidf.shape}")
    print(f"y_l shape {loc_lines.shape}")
    print(f"y_e shape {emo_lines.shape}")
    print(f"some tf-idf values\n{X_tfidf[0]}\n")

    N = len(tweets)
    X = []
    y_l = []
    y_e = []
    
    for tweet_index, (tweet, l_label, e_label) in enumerate(zip(tweets, loc_labels, emo_labels)):
        for pos in range(len(tweet) + 1):
                
            x = []
            for i in range(pos - k, pos + k):
                if i < 0 or i >= len(tweet):
                    x.append(0.0)
                else:
                    x.append(func(tweet_index, tweet[i]))
            X.append(x)
            y_l.append(l_label[pos])
            y_e.append(e_label[pos])
            
    return np.asarray(X), np.asarray(y_l), np.asarray(y_e)

def word_to_tfidf(tweet_index, word):
    global word_to_tfidf_index_dict, X_tfidf
    
    if word in word_to_tfidf_index_dict:
        return X_tfidf[tweet_index, word_to_tfidf_index_dict[word]]
    else:
        return 0.0


def print_dataset(X, y_l, y_e):
    emoji_num = np.count_nonzero(y_l)
    class_freq_ratio = emoji_num / (X.shape[0] * X.shape[1])
    
    print("after feature extraction:")
    print(f"X shape {X.shape}")
    print(f"y_l shape {y_l.shape}")
    print(f"y_e shape {y_e.shape}")
    print(f"some X values\n{X[:5]}")
    print(f"some y_l values\n{y_l[:5]}")
    print(f"non zero elements (1 in label) in y_l {emoji_num}")
    print(f"class frequency ratio {class_freq_ratio}\n")
    print(f"some y_e values\n{y_e[:5]}")

### Word2Vector Feature Extraction

A similar dataset is obtained using w2vec. By setting the number of features to $2*k=6$, the dimensionality stays the same as the previous tf-idf feature extraction. For each word a 6 dimensional vector is obtained and then the average of the 6 neighboring words is taken. Labels stay the same.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

num_of_features = 2 * k
w2v_model_file_name = "w2v_model"

In [None]:
w2v_model = Word2Vec(text_lines_split, min_count=1, size=num_of_features)
w2v_model.save(w2v_model_file_name)

In [None]:
w2v_model = Word2Vec.load(w2v_model_file_name)

example = w2v_model.wv["california"]
print(example.shape)
print(example)

In [None]:
def w2v_features(tweets, labels, k):
    N = len(tweets)
    X = []
    y = []
    
    for tweet_index, (tweet, label) in enumerate(zip(tweets, labels)):
        for pos in range(len(tweet) + 1):
            x = []
            for i in range(pos - k, pos + k):
                if i < 0 or i >= len(tweet):
                    x.append(np.zeros(num_of_features))
                else:
                    x.append(w2v_model.wv[tweet[i]])
            x = np.average(x, axis=0)
            X.append(x)
            y.append(label[pos])
            
    return np.asarray(X), np.asarray(y)

In [None]:
from sklearn.model_selection import train_test_split

# should_use_tfidf = True
# should_use_hybrid = False

# X_tfidf, y_tfidf = tfidf_features(text_lines_split, loc_lines, k, word_to_tfidf)
# X_w2v, y_w2v = w2v_features(text_lines_split, loc_lines, k)
# X_hybrid = np.hstack((X_tfidf, X_w2v))

# if should_use_tfidf or should_use_hybrid:    
#     X, y = X_tfidf, y_tfidf

# if (not should_use_tfidf) or should_use_hybrid:
#     X, y = X_w2v, y_w2v

# if should_use_hybrid:
#     X = X_hybrid

# print_dataset(X, y)

X_tfidf, y_loc_tfidf, y_emo_tfidf = tfidf_features(text_lines_split, loc_lines, emo_lines, k, word_to_tfidf)

X_train, X_test, y_loc_train, y_loc_test = train_test_split(X_tfidf, y_loc_tfidf, shuffle=False, test_size=0.1, random_state=42)

### Baselines

Make sure to check should_train flags when training/testing.

In [None]:
should_use_kfold = True

should_train_global = True
should_train_linear_svm = True
should_train_bagging_svm = False
should_train_random_forest = False
should_train_adaboost = True

linear_svm_model_file_name = "linear_svm.pkl"
bagging_model_file_name = "bagging_svm.pkl"
random_forest_file_name = "random_forest.pkl"
adaboost_file_name = "adaboost.pkl"

In [22]:
from sklearn.externals import joblib
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, KFold

np.set_printoptions(threshold=np.inf)
EVALUATION_RESULTS_KEYS = ['Accuracy: ', 'Precision: ', 'Recall: ', 'F1: ', 'ratio of positive/negative predictions: ']

kf = KFold(n_splits=5, shuffle=True)

def vectorize_func(a, b):
    if a == b:
        return 1
    else:
        return 0

def calc_scores(y_pred, y_test, multi_class):
    result = []
    
    if multi_class:
        result_temp = []
    
        for i in range(0, 21):
            vfunc = np.vectorize(vectorize_func)
            y_test_new = vfunc(y_test, i)
            y_pred_new = vfunc(y_pred, i)
            
            result_temp2 = []
            alfa=1
            if i == 0:
                alfa=1/5
            
            result_temp2.append(alfa*accuracy_score(y_pred_new, y_test_new))
            result_temp2.append(alfa*precision_score(y_pred_new, y_test_new))
            result_temp2.append(alfa*recall_score(y_pred_new, y_test_new))
            result_temp2.append(alfa*f1_score(y_pred_new, y_test_new))
            result_temp.append(result_temp2)
    
        result_avg = np.average(np.asarray(result_temp), axis=0)
        result = result_avg[:]
    else:
        result.append(accuracy_score(y_pred, y_test))
        result.append(precision_score(y_pred, y_test))
        result.append(recall_score(y_pred, y_test))
        result.append(f1_score(y_pred, y_test))
        result.append(np.count_nonzero(y_pred) / y_pred.shape[0])
    return result

def print_scores(scores):
    print(f"Accuracy: {scores[0]}")
    print(f"Precision: {scores[1]}")
    print(f"Recall: {scores[2]}")
    print(f"F1: {scores[3]}")
    if len(scores) > 4:
        print(f"ratio of positive/negative predictions {scores[4]}")

def kfold_score(clf, X, y):
    fold_scores = []
    for train_index, test_index in kf.split(X):
        X_train_kf, X_test_kf = X[train_index], X[test_index]
        y_train_kf, y_test_kf = y[train_index], y[test_index]
        clf.fit(X_train_kf, y_train_kf)
        y_pred = clf.predict(X_test_kf)
        fold_scores.append(calc_scores(y_pred, y_test_kf))
    
    fold_scores = np.average(np.asarray(fold_scores), axis=0)
    print_scores(fold_scores)

def train_and_save_model(model, X_train, y_train, model_file_name):
    model.fit(X_train, y_train)
    joblib.dump(model, model_file_name)
    
def load_and_test_model(model_file_name, X_test, y_test):
    clf_loaded = joblib.load(model_file_name)
    y_pred = clf_loaded.predict(X_test)
    print_scores(calc_scores(y_pred, y_test))
    
def do_test(clf, X, y, clf_file_name, flag):
    if should_use_kfold:
        kfold_score(clf, X, y)
    else:
        if should_train_global and flag:
            train_and_save_model(clf, X_train, y_train, clf_file_name)
        load_and_test_model(clf_file_name, X_test, y_test)
        
def write_to_file(best_params, results, baseline_name, loc_or_emo):
    timestr = time.strftime("%d_%m_%Y__%H_%M_%S_")
    FOLDER_LOCATION = RESULT_LOCATION + "baselines/" + baseline_name + "/"
    FILE_NAME = baseline_name + "_" + loc_or_emo + "_" + timestr
    if NUMBER_OF_TWEETS is not None:
        FILE_NAME += str(NUMBER_OF_TWEETS)
    else:
        FILE_NAME += "ALL"
    FILE_NAME += '.text'
    RESULTS_FILE_NAME = FOLDER_LOCATION + FILE_NAME

    file = open(RESULTS_FILE_NAME, "w+")
    file.write(baseline_name + ", " + loc_or_emo + ":\n\n")
    for key, val in best_params.items():
        file.write(key+ ": "+str(val)+"\n")
    file.write("\n")
    for i, res in enumerate(results):
        file.write(EVALUATION_RESULTS_KEYS[i]+str(res)+"\n")
    file.close()

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

dummy_clf = DummyClassifier(strategy="stratified")
dummy_clf.fit(X_train, y_loc_train)
dummy_pred = dummy_clf.predict(X_test)
print(calc_scores(y_loc_test, dummy_pred, multi_class=False))

### Linear SVM Baseline

Linear SVM is based on the liblinear library and is faster on large datasets. Not using dual optimization problem makes the training extremely fast (pseudoinverse). It is the fastest so both feature extraction methods (TF-IDF and word2vec) are used for comparison. TF-IDF works better, but adding word2vec to it (hybrid approach) adds a little bit of improvement.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, f1_score

#svm_clf = LinearSVC(class_weight="balanced", dual=False)
X_train, X_test, y_loc_train, y_loc_test = train_test_split(X_tfidf, y_loc_tfidf, shuffle=False, test_size=0.1, random_state=42)


# Parameter tunning

"""

parameters = {'C':[0.05, 0.5, 1, 2]}
scoring = 'f1'
clf = GridSearchCV(svm_clf, parameters, cv=5, scoring=scoring)
clf.fit(X_tfidf, y_loc_tfidf)

best_estimator = clf.best_estimator_
best_params = {}
best_params['C'] = clf.best_params_['C']

y_pred = clf.predict(X_test)

"""

X_train2, X_val, y_loc_train2, y_loc_val = train_test_split(X_train, y_loc_train, test_size=0.2, random_state=42)
c_range = (-5, 15)
param_C = [2**i for i in range(c_range[0], c_range[1]+1)]

c_size = c_range[1]-c_range[0]+1
results_test = np.zeros(c_size, dtype = np.float64)
for i in range(c_size):
    C1 = param_C[i]
            
    svm_clf = LinearSVC(class_weight="balanced", dual=False, C=C1)
    results = kfold_score(svm_clf, X_train, y_loc_train, multi_class=False, print1=False)
    results_test[i] = results[3]
    
    #svm_clf.fit(X_train2, y_loc_train2)

    #results_test[i] = zero_one_loss(y_loc_val, svm_clf.predict(X_val))
    #results_test[i] = f1_score(y_loc_val, svm_clf.predict(X_val), average='binary') 
print(results_test)
Ci = results_test.argmax()
C_best = param_C[Ci]

best_params = {}
best_params['C'] = C_best

best_estimator = LinearSVC(class_weight="balanced", dual=False, C=C_best)
best_estimator.fit(X_train, y_loc_train)
y_pred = best_estimator.predict(X_test)



for key, val in best_params.items():
    print(key+ ": "+str(val))
print(classification_report(y_loc_test, y_pred))


# K-fold 

print("Emoji locations:\n")
print("TF-IDF results:")
results = do_test(best_estimator, X_tfidf, y_loc_tfidf, linear_svm_model_file_name, should_train_linear_svm)

# Writing results into a file
write_to_file(best_params, results, "linearSVM", "locations")


"""


# Training and testing the best estimator

svm_clf = LinearSVC(class_weight="balanced", dual=False, C=0.05)
svm_clf.fit(X_train, y_loc_train)
y_loc_pred_svm = svm_clf.predict(X_test)
fold_scores = calc_scores(y_loc_pred_svm, y_loc_test, False)
print_scores(fold_scores, False)

"""

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

svm_clf = LinearSVC(class_weight="balanced", dual=False)
X_train, X_test, y_emo_train, y_emo_test = train_test_split(X_tfidf, y_emo_tfidf, shuffle=False, test_size=0.1, random_state=42)


"""

# Parameter tunning

parameters = {'C':[0.05, 0.5, 1, 2]}
scoring = 'f1_micro'
clf = GridSearchCV(svm_clf, parameters, cv=5, scoring=scoring)
clf.fit(X_tfidf, y_emo_tfidf)

best_estimator = clf.best_estimator_
best_params = {}
best_params['C'] = clf.best_params_['C']

y_pred = clf.predict(X_test)

for key, val in best_params.items():
    print(key+ ": "+str(val))
print(classification_report(y_emo_test, y_pred))


# K-fold 

print("Emoji types:\n")
print("TF-IDF results:")
results = do_test(best_estimator, X_tfidf, y_emo_tfidf, linear_svm_model_file_name, should_train_linear_svm, True)

# Writing results into a file
write_to_file(best_params, results, "linearSVM", "emojis")

"""


# Training and testing the best estimator

svm_clf = LinearSVC(class_weight="balanced", dual=False, C=0.05)
svm_clf.fit(X_train, y_emo_train)
y_emo_pred_svm = svm_clf.predict(X_test)
fold_scores = calc_scores(y_emo_pred_svm, y_emo_test, True)
print_scores(fold_scores, True)

#print(y_emo_pred_svm)


### Bagging SVM

Warning! SVM is based on the libsvm library and it scales poorly with large datasets. That is why an ensemble (bagging) is used. Each classifier is trained on a portion of the data which greatly reduces training times and gives similar (if not better) results. Using 10 estimators it still takes a few hours to train. Results are just a bit better than a single linear SVM.

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

n_estimators = 10
bagging_svm_clf = BaggingClassifier(SVC(kernel='linear', class_weight='balanced'), max_samples=1.0 / n_estimators,
                                        n_estimators=n_estimators, bootstrap=False)

do_test(bagging_svm_clf, X, y, bagging_model_file_name, should_train_bagging_svm)  

### Random Forest Baseline

Random forests are pretty fast and are generally better than the SVM. Higher accuracy and recall, but worse precision.

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest_clf = RandomForestClassifier(min_samples_leaf=2, class_weight="balanced")

do_test(random_forest_clf, X, y, random_forest_file_name, should_train_random_forest)

### AdaBoost Baseline

AdaBoost gives the best results and is relatively fast to train. It is slower than random forests, but still faster than bagging svms.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, class_weight="balanced"), n_estimators=100)

do_test(adaboost_clf, X, y, adaboost_file_name, should_train_adaboost)

### BLSTM for emoji location prediction

Bidirectional long short-term memory recurrent network implementation using the Keras framework. Emoji location prediction is treated as a binary classification problem. TWEET_NUM determines how much of the tweets are used for training. The model uses an embedding layer and its size is a hyperparameter. Using pretrained Glove embeddings proved to be better. A single bidirectional layer is used which actually consists of two LSTM layers.

In [10]:
from pandas import DataFrame

GLOVE_DIR = "./embeddings/"
GLOVE_FILE_NAME = "glove.twitter.27B."
GLOVE_FILE_NAME_EXT = "d.txt"

MODEL_DIR = "./blstm_models/"
BLSTM_BASE_FILE_NAME = "blstm_model_"
BLSTM_FILE_NAME_EXT = ".h5"

N_TIMESTEPS = MAX_WORDS_PER_TWEET + 1
NUM_EMOJI_TYPES = 20
TEST_SPLIT_SIZE = 0.2
VALIDATION_SPLIT = 0.1
EARLY_STOPPING_PATIENCE = 2
MAX_EPOCH = 30

TWEET_NUM = len(text_lines)
# TWEET_NUM = 100000
INPUT_SIZE = int(TWEET_NUM * (1 - TEST_SPLIT_SIZE))
NUM_OF_VOCAB = None
EMBEDDING_SIZE = 200
HIDDEN_SIZE = 1000

### Data preparation

Words are converted to integer ids using a tokenizer. Each unique word has its own unique integer value. Tweets are padded with zeros to a set length. Padding is required by Keras. Since most of the labels are zero (emojis are present after every 15th word), class weights are calculated. Without class weights the model behaves like a majority class classifier which is of no use here.

In [6]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split

X_input = text_lines[:TWEET_NUM]
y_input = loc_lines[:TWEET_NUM]
y_emo_input = emo_lines[:TWEET_NUM]

abc = 100000

# X_blstm_train, y_blstm_train, y_blstm_emo_train = X_input[:abc], y_input[:abc], y_emo_input[:abc]
X_blstm_train, y_blstm_train, y_blstm_emo_train = X_input[:INPUT_SIZE], y_input[:INPUT_SIZE], y_emo_input[:INPUT_SIZE]
X_blstm_test, y_blstm_test, y_blstm_emo_test = X_input[INPUT_SIZE:], y_input[INPUT_SIZE:], y_emo_input[INPUT_SIZE:]

# X_blstm_train, X_blstm_test, y_blstm_train, y_blstm_test = train_test_split(X_input, y_input, test_size=TEST_SPLIT_SIZE)

if NUM_OF_VOCAB is not None:
    tokenizer = Tokenizer(num_words=NUM_OF_VOCAB)
else:
    tokenizer = Tokenizer()
    
tokenizer.fit_on_texts(X_input)
word_index = tokenizer.word_index
txt_to_seq = tokenizer.texts_to_sequences(X_blstm_train)
# print(f"encoded:\n{txt_to_seq[0:5]}\n")

if NUM_OF_VOCAB is not None:
    vocab_size = NUM_OF_VOCAB + 1
else:
    vocab_size = len(word_index) + 1

print(f"Vocabulary Size: {vocab_size}")

X_blstm = pad_sequences(txt_to_seq, maxlen=N_TIMESTEPS - 1, padding='post')
start_padding = np.zeros((X_blstm.shape[0], 1))
X_blstm = np.append(start_padding, X_blstm, axis=1).astype(int)
print(f"input shape: {X_blstm.shape}")
print(f"BLSTM input example:\n{X_blstm[:5]}\n")

y_loc = y_blstm_train
y_blstm = to_categorical(y_loc, num_classes=2)
print(f"location labels shape: {y_blstm.shape}\n")
print(f"BLSTM loc labels:\n{y_loc[:5]}\n")

y_emo = y_blstm_emo_train
y_emo_blstm = to_categorical(y_emo, num_classes=NUM_EMOJI_TYPES + 1)
print(f"emo labels shape: {y_emo_blstm.shape}\n")
print(f"BLSTM emo labels:\n{y_emo[:5]}\n")

def calc_sample_weights(y):
#     print(np.unique(y))
    weights = class_weight.compute_class_weight('balanced', np.unique(y), y.flatten())
    class_weight_dict = dict(enumerate(weights))
#     print(f"class weight dict:\n{class_weight_dict}\n")
    vfunc = np.vectorize(lambda x: class_weight_dict[x])
    return vfunc(y)

sample_weights_loc = calc_sample_weights(y_loc)
print(f"sample_weights_loc shape: {sample_weights_loc.shape}")
print(f"sample_weights_loc examples:\n{sample_weights_loc[:5]}\n")

sample_weights_emo = calc_sample_weights(y_emo)
print(f"sample_weights_emo shape: {sample_weights_emo.shape}")
print(f"sample_weights_emo examples:\n{sample_weights_emo[:5]}\n")

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


Vocabulary Size: 137097
input shape: (378767, 31)
BLSTM input example:
[[    0   148   206  5683    66     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0]
 [    0   289    58     6    62  2312    15     1   642 21191     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0]
 [    0  1463    36     7  3494  5219   205   353     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0]
 [    0     2  2278  7381    79     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0]
 [    0     3   129   944 15100     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0  

In [12]:
import os

def load_embedding_matrix(glove_size):
    embeddings_index = {}
    f = open(os.path.join(GLOVE_DIR, GLOVE_FILE_NAME + str(glove_size) + GLOVE_FILE_NAME_EXT), encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))

    embedding_matrix = np.zeros((vocab_size, glove_size))
    for word, i in word_index.items():
        if i >= vocab_size - 1:
            break
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i + 1] = embedding_vector

    print(f"embedding matrix shape {embedding_matrix.shape}")
    
    return embedding_matrix

gloves_dict = {}

def get_embedding_matrix(glove_size):
    global gloves_dict
    
    if gloves_dict[glove_size] is None:
        gloves_dict[glove_size] = load_embedding_matrix(glove_size)
        
    return gloves_dict[glove_size]

 ### Train model
 
 The following code is used to create and train a new model.

In [8]:
from keras.models import load_model
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

def get_param_str(embedding_size, hidden_size, n_classes, input_size):
    param_str = "CLA-" + str(n_classes) + "_INP-" + str(input_size) + "_EMB-" + str(embedding_size) + "_HID-" + str(hidden_size)
    return param_str

def get_blstm_file_name(embedding_size, hidden_size, n_classes, input_size):
    param_str = get_param_str(embedding_size, hidden_size, n_classes, input_size)
    return MODEL_DIR + BLSTM_BASE_FILE_NAME + param_str + BLSTM_FILE_NAME_EXT

def save_blstm(blstm, embedding_size, hidden_size, n_classes=2, input_size=TWEET_NUM):
    blstm.save(get_blstm_file_name(embedding_size, hidden_size, n_classes, input_size))

def load_blstm(embedding_size, hidden_size, n_classes=2, input_size=TWEET_NUM):
    return load_model(get_blstm_file_name(embedding_size, hidden_size, n_classes, input_size))

def get_callbacks():
    callbacks = []
    
    callbacks.append(EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=EARLY_STOPPING_PATIENCE,
                              verbose=0, mode='auto'))
    checkpoint_path = MODEL_DIR + BLSTM_BASE_FILE_NAME + "check" + BLSTM_FILE_NAME_EXT
    callbacks.append(ModelCheckpoint(filepath=checkpoint_path, save_best_only=True))
    
    return callbacks

In [25]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import Bidirectional
from keras.layers import Dropout

def get_bi_lstm_model(embedding_size=EMBEDDING_SIZE, hidden_size=HIDDEN_SIZE, n_classes=2, use_glove=True,
                      n_timesteps=N_TIMESTEPS, mode="concat"):
    model = Sequential()
    
    if use_glove:
        embedding_matrix = get_embedding_matrix(embedding_size)
        model.add(Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=N_TIMESTEPS,
                            trainable=True))
    else:
        model.add(Embedding(vocab_size_size,
                            embedding_size,
                            input_length=N_TIMESTEPS))
    
    model.add(Dropout(0.5))
    
    model.add(Bidirectional(LSTM(hidden_size, return_sequences=True), merge_mode=mode))
    model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
#     binary_crossentropy
#     categorical_crossentropy
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"], sample_weight_mode="temporal")
    return model

def get_predictions(blstm, texts):
    text_to_integer_sequences = tokenizer.texts_to_sequences(texts)
    blstm_input = pad_sequences(text_to_integer_sequences, maxlen=N_TIMESTEPS, padding='post')
    ypred = blstm.predict_classes(blstm_input)
    return ypred

def evaluate_blstm(blstm, texts=X_blstm_test, labels=y_blstm_test):
    predictions = get_predictions(blstm, texts)
#     print(predictions.shape)
#     print(labels.shape)
    print()
    for i in range(10):
        print(predictions[i])
        print(labels[i])
        print()
    print_scores(calc_scores(predictions.flatten(), labels.flatten(), np.max(labels)>1))
    print()

def do_blstm_test(embedding_size, hidden_size, is_emo=False, use_glove=True, input_size=TWEET_NUM):
    global blstm
    
    if is_emo:
        y_train = y_emo_blstm
        y_test = y_blstm_emo_test
        sample_weights = sample_weights_emo
    else:
        y_train = y_blstm
        y_test = y_blstm_test
        sample_weights = sample_weights_loc
        
    n_classes = y_train.shape[2]
    blstm = get_bi_lstm_model(embedding_size, hidden_size, n_classes, use_glove=use_glove)
    print(f"\nTraining with embedding {embedding_size} hidden {hidden_size} classes {n_classes} input {input_size} glove {use_glove}")    
    blstm.fit(X_blstm, y_train, epochs=MAX_EPOCH,
              validation_split=VALIDATION_SPLIT, verbose=2,
              sample_weight=sample_weights, callbacks=get_callbacks())
    save_blstm(blstm, embedding_size, hidden_size, n_classes, input_size)
    try:
        evaluate_blstm(blstm, labels=y_test)
    except Exception:
        print("couldn't evaluate")

In [None]:
do_blstm_test(200, 500, is_emo=True)

In [11]:
embedding_sizes = [50, 25, 100]
hidden_sizes = [20, 50, 100]

blstm = None
for embedding_size in embedding_sizes:
    for hidden_size in hidden_sizes:
        do_blstm_test(embedding_size, hidden_size, is_emo=False)
        do_blstm_test(embedding_size, hidden_size, is_emo=True)
        

Found 1193514 word vectors.
embedding matrix shape (137097, 50)

Training with embedding 50 hidden 20 classes 2 input 473459 glove True
Train on 340890 samples, validate on 37877 samples
Epoch 1/30
 - 911s - loss: 0.3501 - acc: 0.7824 - val_loss: 0.3193 - val_acc: 0.8054
Epoch 2/30
 - 898s - loss: 0.3148 - acc: 0.8146 - val_loss: 0.3061 - val_acc: 0.8259
Epoch 3/30
 - 898s - loss: 0.3023 - acc: 0.8253 - val_loss: 0.2975 - val_acc: 0.8394
Epoch 4/30
 - 906s - loss: 0.2942 - acc: 0.8322 - val_loss: 0.2946 - val_acc: 0.8386
Epoch 5/30
 - 906s - loss: 0.2878 - acc: 0.8376 - val_loss: 0.2940 - val_acc: 0.8428
Epoch 6/30
 - 905s - loss: 0.2828 - acc: 0.8416 - val_loss: 0.2952 - val_acc: 0.8402
Epoch 7/30
 - 910s - loss: 0.2781 - acc: 0.8456 - val_loss: 0.2940 - val_acc: 0.8475
(94692, 31)
(94692, 31)
[[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Training with embedding 50 hidden 100 classes 2 input 473459 glove True
Train on 340890 samples, validate on 37877 samples
Epoch 1/30
 - 913s - loss: 0.3377 - acc: 0.7908 - val_loss: 0.3067 - val_acc: 0.8233
Epoch 2/30
 - 912s - loss: 0.3041 - acc: 0.8234 - val_loss: 0.2929 - val_acc: 0.8389
Epoch 3/30
 - 910s - loss: 0.2917 - acc: 0.8341 - val_loss: 0.2886 - val_acc: 0.8453
Epoch 4/30
 - 917s - loss: 0.2831 - acc: 0.8411 - val_loss: 0.2889 - val_acc: 0.8561
Epoch 5/30
 - 913s - loss: 0.2766 - acc: 0.8460 - val_loss: 0.2862 - val_acc: 0.8537
Epoch 6/30
 - 910s - loss: 0.2707 - acc: 0.8508 - val_loss: 0.2859 - val_acc: 0.8597
Epoch 7/30
 - 917s - loss: 0.2657 - acc: 0.8548 - val_loss: 0.2871 - val_acc: 0.8594
Epoch 8/30
 - 915s - loss: 0.2612 - acc: 0.8580 - val_loss: 0.2909 - val_acc: 0.8663
(94692, 31)
(94692, 31)
[[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 


Training with embedding 25 hidden 50 classes 2 input 473459 glove True
Train on 340890 samples, validate on 37877 samples
Epoch 1/30
 - 882s - loss: 0.3518 - acc: 0.7772 - val_loss: 0.3187 - val_acc: 0.8079
Epoch 2/30
 - 882s - loss: 0.3184 - acc: 0.8107 - val_loss: 0.3079 - val_acc: 0.8195
Epoch 3/30
 - 883s - loss: 0.3075 - acc: 0.8212 - val_loss: 0.3001 - val_acc: 0.8225
Epoch 4/30
 - 879s - loss: 0.2996 - acc: 0.8282 - val_loss: 0.2965 - val_acc: 0.8241
Epoch 5/30
 - 882s - loss: 0.2939 - acc: 0.8328 - val_loss: 0.2942 - val_acc: 0.8461
Epoch 6/30
 - 885s - loss: 0.2890 - acc: 0.8369 - val_loss: 0.2925 - val_acc: 0.8370
Epoch 7/30
 - 885s - loss: 0.2851 - acc: 0.8401 - val_loss: 0.2918 - val_acc: 0.8357
Epoch 8/30
 - 876s - loss: 0.2816 - acc: 0.8429 - val_loss: 0.2906 - val_acc: 0.8453
Epoch 9/30
 - 1000s - loss: 0.2787 - acc: 0.8452 - val_loss: 0.2936 - val_acc: 0.8461
Epoch 10/30
 - 967s - loss: 0.2760 - acc: 0.8473 - val_loss: 0.2926 - val_acc: 0.8384
(94692, 31)
(94692, 31)
[

KeyboardInterrupt: 

In [None]:
blstm = load_blstm(200, 1000, 2)
blstm_emo = load_blstm(200, 500, 21)
imtired = 20
inputs = X_blstm_test[:imtired]
for i in range(imtired):
    print(inputs[i])
labels = y_blstm_test[:imtired]
labels_emo = y_blstm_emo_test[:imtired]

evaluate_blstm(blstm, inputs, labels)
evaluate_blstm(blstm_emo, inputs, labels_emo)

In [None]:
do_blstm_test(200, 500, is_emo=True)

In [None]:
evaluate_blstm(blstm, labels=y_blstm_test)

In [None]:
save_blstm(blstm, EMBEDDING_SIZE, HIDDEN_SIZE)

In [None]:
for embedding_size in embedding_sizes:
    for hidden_size in hidden_sizes:
        blstm = load_blstm(embedding_size, hidden_size, 2, 100000)
        evaluate_blstm(blstm, labels=y_blstm_test)

In [None]:
for embedding_size in embedding_sizes:
    for hidden_size in hidden_sizes:
        blstm = load_blstm(embedding_size, hidden_size)
        print(f"\nEmbedding {embedding_size} hidden {hidden_size}")
        evaluate_blstm(blstm, X_blstm_test, y_blstm_test)
        print(f"test:")
        evaluate_blstm(blstm, X_blstm_test, y_blstm_test)
        print(f"\ntrain:")
        evaluate_blstm(blstm, X_blstm_train, y_blstm_train)

In [26]:
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(MODEL_DIR) if isfile(join(MODEL_DIR, f))]

for file_path in onlyfiles:
    blstm = load_model(MODEL_DIR + file_path)
    if file_path[17] == "1":
        y_test = y_blstm_emo_test
    else:
        y_test = y_blstm_test
    print(file_path)
    evaluate_blstm(blstm, X_blstm_test, y_test)

blstm_model_CLA-21_INP-473459_EMB-200_HID-500.h5

[ 5  7  7  2 13 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 5  5 20  5  7 16 16  7  7 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 8  8  8  8  8  8  8  8 11  8  8  8  8  8  8  8  8  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[19 19 14 10 10 10 10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 1  1  1  1  4  1  4  4  4 14 14 14  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[9 9 9 9 9 9 9 9 9 9 9 9 9 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[16 16  1 16 11 

Accuracy: 0.9400932235762441
Precision: 0.24999451388985403
Recall: 0.026570847229349326
F1: 0.038197040433274335

blstm_model_CLA-21_INP-473459_EMB-50_HID-50.h5

[13  2  2 13 13 13 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 0 13 13  7 13 13 12 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 8  8  8  8  8  8  8  8 11  8  8  8  8  8  8  8  8  8  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[19 10  1  2  2  2  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[1 1 1 4 4 1 4 4 4 4 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Accuracy: 0.8058455733563349
Precision: 0.6808459878848229
Recall: 0.10852991314498488
F1: 0.18721665010004152
ratio of positive/negative predictions 0.2060333468235897

blstm_model_CLA-2_INP-473459_EMB-100_HID-1000.h5

[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 1

Accuracy: 0.8669686985172982
Precision: 0.7892291096174592
Recall: 0.1704957705450675
F1: 0.2804141644385396
ratio of positive/negative predictions 0.15202939785763828

blstm_model_CLA-2_INP-473459_EMB-25_HID-20.h5

[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 1 1 1