### 10 Emoji Prediction

The goal of this project is to build a system that automatically inserts appropriate emoticons into a text. This is done in two steps. First, for each position within the text, a prediction is made whether an emoticon should be placed there. Second, an appropriate emoticon is chosen from a list of available emoticons. Both tasks can be set up as supervised classification problems.

Competition website:
https://competitions.codalab.org/competitions/17344

Dataset:
https://competitions.codalab.org/competitions/17344

Entry point:
https://arxiv.org/pdf/1702.07285.pdf (Barbieri, Francesco, Miguel Ballesteros, and Horacio Saggion. Are Emojis Predictable?)

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline

Populating the interactive namespace from numpy and matplotlib


Define paths to folders containing the data and the results.

In [44]:
# NUMBER_OF_TWEETS = "ALL"
NUMBER_OF_TWEETS = 100000

MAX_WORDS_PER_TWEET = 30
DATA_LOCATION = "./train/data/"
RESULT_LOCATION = "./result/"
TWEET_FILE_NAME = "tweet_by_ID_28_4_2018__03_20_05" + "_"

k = 3

if NUMBER_OF_TWEETS is not None:
    TWEET_FILE_NAME += str(NUMBER_OF_TWEETS)
else:
    TWEET_FILE_NAME += "ALL"

### Load the data
Tweets are loaded in two forms: as a list of strings (for the TF-IDF vectorizer) and as a list of word lists (for feature extraction). Labels are read into a NumPy array of shape N x (MAX_WORDS_PER_TWEET + 1), with one label per candidate emoji position.

In [45]:
base_file_name = DATA_LOCATION + TWEET_FILE_NAME

text_lines = []
text_lines_split = []

with open(base_file_name + ".text", 'r', encoding="utf-8") as text_file:
    for line in text_file:
        line = line.rstrip("\n")  # drop only the newline, even if the last line lacks one
        text_lines.append(line)
        text_lines_split.append(line.split())

loc_lines = []
with open(base_file_name + ".loclabels", 'r') as loc_labels:
    for line in loc_labels:
        # one digit (0 or 1) per candidate emoji position
        loc_lines.append([int(c) for c in line.rstrip("\n")])

loc_lines = np.asarray(loc_lines)

# full_text = open(base_file_name + ".full", 'r')
# emoji_labels = open(base_file_name + ".emolabels", 'r')
# emoji_ids = open(base_file_name + ".ids", 'r')

print(f"number of tweets {len(text_lines)}")
print(f"example of tweet texts:\n{text_lines[:5]}\n")
print(f"example of labels (emoji locations):\n{loc_lines[:5]}")

number of tweets 100000
example of tweet texts:
['lol west covina california', 'things got a little festive at the office christmas2016 redrock', 'step out and explore ellis island cafe', 'rupauls drag race bingo fun drag queens be sexy rupaulsdragrace user abwyman la', 'just light makeup blueeyes lupusgirl photography modelingagency modeling smiling']

example of labels (emoji locations):
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
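
To make the label format concrete, here is a minimal sketch that rebuilds a tweet from its words and its location labels. It assumes that label index i marks the gap before word i (and index len(words) the gap after the last word), which is how the feature extraction loops below read the labels. The function name insert_emojis and the "<emoji>" placeholder are hypothetical.

```python
def insert_emojis(words, loc_labels, emoji="<emoji>"):
    """Rebuild a tweet, placing `emoji` in every gap whose label is 1.

    Assumption: loc_labels[i] == 1 means an emoji sits in the gap before
    words[i]; index len(words) is the gap after the last word.
    """
    out = []
    for pos in range(len(words) + 1):
        if loc_labels[pos] == 1:
            out.append(emoji)
        if pos < len(words):
            out.append(words[pos])
    return " ".join(out)


words = "lol west covina california".split()
labels = [0, 1] + [0] * 29  # first label row printed above
print(insert_emojis(words, labels))
```

The second step of the task (choosing which emoji to insert) would replace the fixed placeholder with a per-position prediction.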


### TF-IDF Feature Extraction
TF-IDF is computed on the collection of tweets. Then, for every position between words, a new example is generated: an array of length $2*k$ containing the TF-IDF values of the k words to the left and the k words to the right. The label is 1 if an emoji occupied that position in the original tweet and 0 otherwise.
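
The windowing step can be sketched on its own, without the vectorizer. The TF-IDF values below are made-up illustrative numbers and window_features is a hypothetical helper; positions outside the tweet are padded with 0.0, as in the notebook code.

```python
def window_features(values, pos, k, pad=0.0):
    """Return the k values left of gap `pos` followed by the k values right of it."""
    return [values[i] if 0 <= i < len(values) else pad
            for i in range(pos - k, pos + k)]


# one illustrative TF-IDF value per word of a four-word tweet
tweet_tfidf = [0.43, 0.44, 0.71, 0.36]

# a four-word tweet has five candidate positions (before, between, after)
for pos in range(len(tweet_tfidf) + 1):
    print(pos, window_features(tweet_tfidf, pos, k=3))
```

Each tweet of n words therefore contributes n + 1 examples of length 2*k.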

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
import random

word_to_tfidf_index_dict = {}
tfidf_model = TfidfVectorizer(input="content", analyzer="word", stop_words="english")
X_tfidf_features = tfidf_model.fit_transform(text_lines)

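def tfidf_features(tweets, labels, k, func):
    # map every vocabulary word to its column in the TF-IDF matrix
    for i, word in enumerate(tfidf_model.get_feature_names()):
        word_to_tfidf_index_dict[word] = i

    # report the fitted TF-IDF matrix (X_tfidf is only assigned after this function returns)
    print(f"X shape {X_tfidf_features.shape}")
    print(f"Y shape {loc_lines.shape}")
    print(f"some tf-idf values\n{X_tfidf_features[0]}\n")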

    N = len(tweets)
    X = []
    y = []
    
    for tweet_index, (tweet, label) in enumerate(zip(tweets, labels)):
        for pos in range(len(tweet) + 1):
                
            x = []
            for i in range(pos - k, pos + k):
                if i < 0 or i >= len(tweet):
                    x.append(0.0)
                else:
                    x.append(func(tweet_index, tweet[i]))
            X.append(x)
            y.append(label[pos])
            
    return np.asarray(X), np.asarray(y)

def word_to_tfidf(tweet_index, word):
    if word in word_to_tfidf_index_dict:
        return X_tfidf_features[tweet_index, word_to_tfidf_index_dict[word]]
    else:
        return 0.0

def print_dataset(X, y):
    emoji_num = np.count_nonzero(y)
    class_freq_ratio = emoji_num / (X.shape[0] * X.shape[1])
    
    print("after feature extraction:")
    print(f"X shape {X.shape}")
    print(f"y shape {y.shape}")
    print(f"some X values\n{X[:5]}")
    print(f"some y values\n{y[:5]}")
    print(f"non zero elements (1 in label) in y {emoji_num}")
    print(f"class frequency ratio {class_freq_ratio}\n")



### Word2Vector Feature Extraction

A similar dataset is obtained using word2vec. By setting the number of features to $2*k=6$, the dimensionality stays the same as in the previous TF-IDF feature extraction. For each word a 6-dimensional vector is obtained, and the vectors of the 6 neighboring words (k on each side of the position) are averaged. Labels stay the same.
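
The averaging step can be sketched without gensim. The two-dimensional vectors below are made-up stand-ins for word2vec embeddings, and average_window is a hypothetical helper. Positions outside the tweet contribute zero vectors and still count toward the average, as in the notebook code.

```python
def average_window(vectors, pos, k, dim):
    """Average the word vectors of the k words left and k words right of gap `pos`."""
    zero = [0.0] * dim
    window = [vectors[i] if 0 <= i < len(vectors) else zero
              for i in range(pos - k, pos + k)]
    # column-wise mean over the 2*k vectors in the window
    return [sum(column) / len(window) for column in zip(*window)]


# toy embeddings for a two-word tweet (illustrative values only)
tweet_vectors = [[2.0, 4.0], [6.0, 8.0]]

for pos in range(len(tweet_vectors) + 1):
    print(pos, average_window(tweet_vectors, pos, k=1, dim=2))
```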

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

num_of_features = 2 * k
w2v_model_file_name = "w2v_model"



In [6]:
w2v_model = Word2Vec(text_lines_split, min_count=1, size=num_of_features)
w2v_model.save(w2v_model_file_name)

In [7]:
w2v_model = Word2Vec.load(w2v_model_file_name)

example = w2v_model.wv["california"]
print(example.shape)
print(example)

(6,)
[ 4.807096   5.3436766 -3.5083148  5.000165  -1.2895648  4.4985213]


In [10]:
def w2v_features(tweets, labels, k):
    N = len(tweets)
    X = []
    y = []
    
    for tweet_index, (tweet, label) in enumerate(zip(tweets, labels)):
        for pos in range(len(tweet) + 1):
            x = []
            for i in range(pos - k, pos + k):
                if i < 0 or i >= len(tweet):
                    x.append(np.zeros(num_of_features))
                else:
                    x.append(w2v_model.wv[tweet[i]])
            x = np.average(x, axis=0)
            X.append(x)
            y.append(label[pos])
            
    return np.asarray(X), np.asarray(y)

In [17]:
should_use_tfidf = True
should_use_hybrid = False

X_tfidf, y_tfidf = tfidf_features(text_lines_split, loc_lines, k, word_to_tfidf)
X_w2v, y_w2v = w2v_features(text_lines_split, loc_lines, k)
X_hybrid = np.hstack((X_tfidf, X_w2v))

if should_use_tfidf or should_use_hybrid:    
    X, y = X_tfidf, y_tfidf

if (not should_use_tfidf) or should_use_hybrid:
    X, y = X_w2v, y_w2v

if should_use_hybrid:
    X = X_hybrid

print_dataset(X, y)

X shape (1187039, 6)
Y shape (100000, 31)
some tf-idf values
[0.         0.         0.         0.42782282 0.43669317 0.7050588 ]

after feature extraction:
X shape (1187039, 6)
y shape (1187039,)
some X values
[[0.         0.         0.         0.42782282 0.43669317 0.7050588 ]
 [0.         0.         0.42782282 0.43669317 0.7050588  0.3593867 ]
 [0.         0.42782282 0.43669317 0.7050588  0.3593867  0.        ]
 [0.42782282 0.43669317 0.7050588  0.3593867  0.         0.        ]
 [0.43669317 0.7050588  0.3593867  0.         0.         0.        ]]
some y values
[0 1 0 0 0]
non zero elements (1 in label) in y 100779
class frequency ratio 0.014149914198269813



### Build data sets
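The dataset is randomly split into train and test subsets. This split is skipped when k-fold cross-validation is used (should_use_kfold, which is set in the Baselines section below).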
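
For comparison with a single random split, here is a rough sketch of how k-fold cross-validation partitions the sample indices. It is a hand-rolled illustration, not scikit-learn's KFold implementation; kfold_indices and its seed parameter are hypothetical.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices into n_splits folds
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in kfold_indices(n_samples=10, n_splits=5):
    print(len(train), "train samples,", len(test), "test samples")
```

With a single split, scores depend on which 20% of the examples happen to be held out. Averaging over the folds reduces that dependence at the cost of training the model n_splits times.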

In [None]:
from sklearn.model_selection import train_test_split

if not should_use_kfold:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    print(X_train.shape)
    print(y_train.shape)
    print(X_train[30:50])
    print(y_train[30:50])

### Baselines

Make sure to check should_train flags when training/testing.

In [18]:
should_use_kfold = True

should_train_global = True
should_train_linear_svm = True
should_train_bagging_svm = False
should_train_random_forest = True
should_train_adaboost = False

linear_svm_model_file_name = "linear_svm.pkl"
bagging_model_file_name = "bagging_svm.pkl"
random_forest_file_name = "random_forest.pkl"
adaboost_file_name = "adaboost.pkl"

In [5]:
import joblib  # formerly sklearn.externals.joblib, which newer scikit-learn releases no longer provide
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=5, shuffle=True)

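def calc_scores(y_pred, y_test):
    # scikit-learn metrics take (y_true, y_pred) in that order. The scores printed in this
    # notebook were produced with the two arguments reversed, which exchanges the
    # Precision and Recall values (accuracy and F1 are unaffected).
    result = []
    result.append(accuracy_score(y_test, y_pred))
    result.append(precision_score(y_test, y_pred))
    result.append(recall_score(y_test, y_pred))
    result.append(f1_score(y_test, y_pred))
    result.append(np.count_nonzero(y_pred) / y_pred.shape[0])  # fraction of positive predictions
    return result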

def print_scores(scores):
    print(f"Accuracy: {scores[0]}")
    print(f"Precision: {scores[1]}")
    print(f"Recall: {scores[2]}")
    print(f"F1: {scores[3]}")
    print(f"ratio of positive/negative predictions {scores[4]}")

def kfold_score(clf, X, y):
    fold_scores = []
    for train_index, test_index in kf.split(X):
        X_train_kf, X_test_kf = X[train_index], X[test_index]
        y_train_kf, y_test_kf = y[train_index], y[test_index]
        clf.fit(X_train_kf, y_train_kf)
        y_pred = clf.predict(X_test_kf)
        fold_scores.append(calc_scores(y_pred, y_test_kf))
    
    fold_scores = np.average(np.asarray(fold_scores), axis=0)
    print_scores(fold_scores)

def train_and_save_model(model, X_train, y_train, model_file_name):
    model.fit(X_train, y_train)
    joblib.dump(model, model_file_name)
    
def load_and_test_model(model_file_name, X_test, y_test):
    clf_loaded = joblib.load(model_file_name)
    y_pred = clf_loaded.predict(X_test)
    print_scores(calc_scores(y_pred, y_test))
    
def do_test(clf, X, y, clf_file_name, flag):
    if should_use_kfold:
        kfold_score(clf, X, y)
    else:
        if should_train_global and flag:
            train_and_save_model(clf, X_train, y_train, clf_file_name)
        load_and_test_model(clf_file_name, X_test, y_test)  
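
scikit-learn's metric functions expect the ground truth first, as in precision_score(y_true, y_pred). The sketch below computes precision and recall by hand on made-up labels to show that the argument order matters: swapping the two arrays exchanges precision and recall. The helper precision_recall is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / predicted positives, recall = TP / actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


y_true = [1, 0, 0, 0, 1, 0]  # two real emoji positions
y_pred = [1, 1, 1, 0, 0, 0]  # three predicted positions, one of them correct

print(precision_recall(y_true, y_pred))  # correct order
print(precision_recall(y_pred, y_true))  # swapped order: the two values trade places
```

Accuracy and F1 are symmetric in the two arguments, so only precision and recall are affected by the order.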

### Linear SVM Baseline

LinearSVC is based on the liblinear library and scales better to large datasets than the libsvm-based SVC. Solving the primal rather than the dual optimization problem (dual=False) makes training very fast here, because the number of samples far exceeds the number of features. Since it is the fastest model, all feature extraction methods (TF-IDF, word2vec, and the hybrid of both) are compared with it. TF-IDF works better than word2vec, and adding word2vec to it (the hybrid approach) gives a small further improvement.

In [6]:
from sklearn.svm import LinearSVC

svm_clf = LinearSVC(class_weight="balanced", dual=False)

print("TF-IDF results:")
do_test(svm_clf, X_tfidf, y_tfidf, linear_svm_model_file_name, should_train_linear_svm)
print("\nword2vec results:")
do_test(svm_clf, X_w2v, y_w2v, linear_svm_model_file_name, should_train_linear_svm)
print("\nhybrid results:")
do_test(svm_clf, X_hybrid, y_w2v, linear_svm_model_file_name, should_train_linear_svm)

TF-IDF results:


NameError: name 'X_tfidf' is not defined

### Bagging SVM

Warning: SVC is based on the libsvm library and scales poorly with large datasets. That is why an ensemble (bagging) is used: each classifier is trained on a portion of the data (here one tenth, sampled without replacement), which greatly reduces training time and gives similar, if not better, results. Even with 10 estimators, training still takes a few hours. Results are only slightly better than those of a single linear SVM.
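
The idea of training several models on small subsets and combining them can be sketched with a toy base learner. This is only an illustration under stated assumptions: the base learner is a nearest-class-mean rule on one feature rather than an SVM, predictions are combined by majority vote, and all function names are hypothetical.

```python
import random


def fit_centroids(X, y):
    """Toy base learner: remember the mean feature value of each class."""
    means = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)
    return means


def predict_centroids(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def fit_bagging(X, y, n_estimators, seed=0):
    """Train each estimator on a random 1/n_estimators share of the data."""
    rng = random.Random(seed)
    subset_size = max(1, len(X) // n_estimators)
    models = []
    for _ in range(n_estimators):
        idx = rng.sample(range(len(X)), subset_size)  # sampled without replacement
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, x):
    """Combine the individual predictions by majority vote."""
    votes = [predict_centroids(m, x) for m in models]
    return max(set(votes), key=votes.count)


X = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]
models = fit_bagging(X, y, n_estimators=3)
print(predict_bagging(models, 1.5), predict_bagging(models, 10.5))
```

Because kernel SVM training time grows faster than linearly with the number of samples, ten models on a tenth of the data each are cheaper to train than one model on all of it.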

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

n_estimators = 10
bagging_svm_clf = BaggingClassifier(SVC(kernel='linear', class_weight='balanced'), max_samples=1.0 / n_estimators,
                                        n_estimators=n_estimators, bootstrap=False)

do_test(bagging_svm_clf, X, y, bagging_model_file_name, should_train_bagging_svm)  

### Random Forest Baseline

Random forests are pretty fast and are generally better than the SVM. Higher accuracy and recall, but worse precision.

In [21]:
from sklearn.ensemble import RandomForestClassifier

random_forest_clf = RandomForestClassifier(min_samples_leaf=2, class_weight="balanced")

do_test(random_forest_clf, X, y, random_forest_file_name, should_train_random_forest)

Accuracy: 0.9002408523237806
Precision: 0.21676942055870857
Recall: 0.3562221258643744
F1: 0.26951362532394985
ratio of positive/negative predictions 0.05166721459371121


### AdaBoost Baseline

AdaBoost gives the best results and is relatively fast to train. It is slower than random forests, but still faster than bagging svms.

In [22]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, class_weight="balanced"), n_estimators=100)

do_test(adaboost_clf, X, y, adaboost_file_name, should_train_adaboost)

Accuracy: 0.7394929726436992
Precision: 0.6803645745677411
Recall: 0.1984073000807087
F1: 0.30721909881604825
ratio of positive/negative predictions 0.2911336536033801


### BLSTM for emoji location prediction

A bidirectional long short-term memory (BLSTM) recurrent network is implemented with the Keras framework. Emoji location prediction is treated as a binary classification problem at every position. INPUT_SIZE determines how many of the tweets are used for training. The model starts with an embedding layer whose size is a hyperparameter. A single bidirectional layer follows; it consists of two LSTM layers, one reading the tweet left to right and the other right to left.
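
The bidirectional wrapper can be illustrated with a toy recurrence in place of an LSTM cell. The sketch runs the same recurrence once forward and once over the reversed sequence, then pairs the two states at each time step, so every position sees context from both sides. The decaying-sum recurrence and the function names are assumptions made for illustration only.

```python
def run_recurrence(xs, decay=0.5):
    """Toy recurrent layer: h_t = decay * h_{t-1} + x_t, one state per time step."""
    h, states = 0.0, []
    for x in xs:
        h = decay * h + x
        states.append(h)
    return states


def bidirectional(xs, decay=0.5):
    """Pair forward states with backward states, aligned by time step."""
    forward = run_recurrence(xs, decay)
    backward = run_recurrence(xs[::-1], decay)[::-1]
    return list(zip(forward, backward))


# a signal at the last time step is visible to earlier steps only via the backward pass
print(bidirectional([0.0, 0.0, 1.0]))
```

With merge_mode="concat", as in the model below, Keras concatenates the forward and backward outputs in the same way, so each time step yields 2 * hidden_size values.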

In [46]:
from pandas import DataFrame

TWEET_NUM = len(text_lines)
# TWEET_NUM = 125000
NUM_OF_VOCAB = 10000
N_TIMESTEPS = MAX_WORDS_PER_TWEET
N_CLASSES = 2
TEST_SPLIT_SIZE = 0.25
INPUT_SIZE = len(text_lines) * (1 - TEST_SPLIT_SIZE)
EMBEDDING_SIZE = 101
HIDDEN_SIZE = 52
PARAMETER_STR = str(N_CLASSES) + "_" + str(INPUT_SIZE) + "_" + str(EMBEDDING_SIZE) + "_" + str(HIDDEN_SIZE)

dummy_text = [""" Jack and Jill went up the hill\n
        To fetch a pail of water\n
        Jack fell down and broke his crown\n
        And Jill came tumbling after\n """]

BLSTM_MODEL_FILE_NAME = "blstm_model_" + PARAMETER_STR + ".h5"

### Data preparation

Words are converted to integer ids using a tokenizer, so that each distinct word maps to its own integer. Tweets are then padded with zeros to a fixed length, because Keras requires all input sequences in a batch to have the same length. Since most of the labels are zero (an emoji follows roughly every 15th word), class weights are calculated. Without class weights the model behaves like a majority-class classifier, which is of no use here.
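
Both preparation steps can be sketched in plain Python. pad_post is a hypothetical stand-in for post-padding: it keeps the last maxlen ids of an over-long tweet, which is assumed here to match the Keras default truncation. balanced_class_weights applies the formula n_samples / (n_classes * count_c) behind the 'balanced' option.

```python
from collections import Counter


def pad_post(seq, maxlen, value=0):
    """Right-pad a list of word ids to `maxlen`; keep the last ids if it is too long."""
    trimmed = seq[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c): rare classes get large weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {c: n_samples / (len(counts) * count) for c, count in counts.items()}


print(pad_post([3468, 37, 259], maxlen=5))

# nine "no emoji" positions for every "emoji" position (illustrative ratio)
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)

# per-position sample weights, analogous to the sample_weights array in the next cell
sample_weights = [weights[label] for label in labels]
print(sample_weights[-2:])
```

With these weights, errors on the rare emoji positions contribute as much to the loss in total as errors on the frequent empty positions.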

In [47]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split

X_input = text_lines[:TWEET_NUM]
y_input = loc_lines[:TWEET_NUM]

X_blstm_train, X_blstm_test, y_blstm_train, y_blstm_test = train_test_split(X_input, y_input, test_size=TEST_SPLIT_SIZE)

# integer encode text
tokenizer = Tokenizer(num_words=NUM_OF_VOCAB)
tokenizer.fit_on_texts(X_input)
txt_to_seq = tokenizer.texts_to_sequences(X_blstm_train)
print(f"encoded:\n{txt_to_seq[0:5]}\n")
# determine the vocabulary size

vocab_size = NUM_OF_VOCAB
# vocab_size = len(tokenizer.word_index) + 1
print(f"Vocabulary Size: {vocab_size}")

# pad input sequences
X_blstm = pad_sequences(txt_to_seq, maxlen=N_TIMESTEPS, padding='post')
print(f"input shape: {X_blstm.shape}")
print(f"BLSTM input example:\n{X_blstm[:5]}\n")

y_loc = y_blstm_train[:,1:]
y_blstm = to_categorical(y_loc, num_classes=N_CLASSES)
print(f"labels shape: {y_blstm.shape}")
# print(f"BLSTM labels:\n{y_blstm[:5]}\n")

weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_loc), y=y_loc.flatten())
class_weight_dict = dict(enumerate(weights))
print(f"class weight dict:\n{class_weight_dict}\n")

vfunc = np.vectorize(lambda x: class_weight_dict[x])
sample_weights = vfunc(y_loc)
print(f"sample_weights shape: {sample_weights.shape}")
print(f"sample_weights examples:\n{sample_weights[:5]}\n")

encoded:
[[3468, 37, 259, 347, 89, 659, 7, 6, 649, 259], [13, 268, 909, 3893, 3, 1475, 3109, 7035, 3363, 974, 5291, 114], [162, 29, 7501, 506, 26, 6, 845, 55, 29, 14, 56, 318, 7501, 2], [40, 321, 4575, 3865, 248], [46, 46, 643, 7, 229, 46]]

Vocabulary Size: 10000
input shape: (75000, 30)
BLSTM input example:
[[3468   37  259  347   89  659    7    6  649  259    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [  13  268  909 3893    3 1475 3109 7035 3363  974 5291  114    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [ 162   29 7501  506   26    6  845   55   29   14   56  318 7501    2
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [  40  321 4575 3865  248    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [  46   46  643    7  229   46    0    0    0    0    0    0

### Train model

The following code is used to create and train a new model.

In [48]:
from keras.models import load_model

def get_blstm_file_name(embedding_size, hidden_size):
    param_str = str(N_CLASSES) + "_" + str(INPUT_SIZE) + "_" + str(embedding_size) + "_" + str(hidden_size)
    return "blstm_model_" + param_str + ".h5"

def save_blstm(blstm, embedding_size, hidden_size):
    blstm.save(get_blstm_file_name(embedding_size, hidden_size))

def load_blstm(embedding_size, hidden_size):
    return load_model(get_blstm_file_name(embedding_size, hidden_size))

hidden_sizes = [10, 30, 100]
embedding_sizes = [2, 5, 10, 15]

In [50]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import Bidirectional
from keras.layers import Dropout

should_train_blstm = False

def get_bi_lstm_model(embedding_size=EMBEDDING_SIZE, hidden_size=HIDDEN_SIZE, n_timesteps=N_TIMESTEPS, mode="concat"):
    model = Sequential()
    model.add(Embedding(vocab_size, embedding_size, input_length=n_timesteps))
    model.add(Bidirectional(LSTM(hidden_size, return_sequences=True), merge_mode=mode))
#     model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(N_CLASSES, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"], sample_weight_mode="temporal")
    return model

if should_train_blstm:
    blstm = get_bi_lstm_model()
    blstm.fit(X_blstm, y_blstm, epochs=20, validation_split=0.1, verbose=2, sample_weight=sample_weights)
    save_blstm(blstm, EMBEDDING_SIZE, HIDDEN_SIZE)

In [13]:
save_blstm(blstm, EMBEDDING_SIZE, HIDDEN_SIZE)

In [51]:
for hidden_size in hidden_sizes:
    for embedding_size in embedding_sizes:
        blstm = get_bi_lstm_model(embedding_size=embedding_size, hidden_size=hidden_size)
        print(f"\nembedding {embedding_size} hidden {hidden_size}")
        blstm.fit(X_blstm, y_blstm, epochs=10, validation_split=0.1, verbose=2, sample_weight=sample_weights)
        save_blstm(blstm, embedding_size, hidden_size)


embedding 2 hidden 10
Train on 67500 samples, validate on 7500 samples
Epoch 1/10
 - 173s - loss: 0.3872 - acc: 0.7206 - val_loss: 0.3450 - val_acc: 0.7746
Epoch 2/10
 - 168s - loss: 0.3316 - acc: 0.7874 - val_loss: 0.3271 - val_acc: 0.7944
Epoch 3/10
 - 168s - loss: 0.3038 - acc: 0.8170 - val_loss: 0.3084 - val_acc: 0.8361
Epoch 4/10
 - 167s - loss: 0.2872 - acc: 0.8348 - val_loss: 0.3037 - val_acc: 0.8189
Epoch 5/10
 - 168s - loss: 0.2792 - acc: 0.8416 - val_loss: 0.3055 - val_acc: 0.8354
Epoch 6/10
 - 167s - loss: 0.2740 - acc: 0.8451 - val_loss: 0.3056 - val_acc: 0.8297
Epoch 7/10
 - 168s - loss: 0.2703 - acc: 0.8469 - val_loss: 0.3144 - val_acc: 0.8540
Epoch 8/10
 - 168s - loss: 0.2672 - acc: 0.8487 - val_loss: 0.3124 - val_acc: 0.8411
Epoch 9/10
 - 167s - loss: 0.2648 - acc: 0.8498 - val_loss: 0.3185 - val_acc: 0.8560
Epoch 10/10
 - 167s - loss: 0.2629 - acc: 0.8506 - val_loss: 0.3182 - val_acc: 0.8342

embedding 5 hidden 10
Train on 67500 samples, validate on 7500 samples
Epoch

Epoch 10/10
 - 167s - loss: 0.2501 - acc: 0.8554 - val_loss: 0.3291 - val_acc: 0.8459

embedding 5 hidden 100
Train on 67500 samples, validate on 7500 samples
Epoch 1/10
 - 173s - loss: 0.3304 - acc: 0.7891 - val_loss: 0.2962 - val_acc: 0.8231
Epoch 2/10
 - 166s - loss: 0.2832 - acc: 0.8374 - val_loss: 0.2924 - val_acc: 0.8377
Epoch 3/10
 - 166s - loss: 0.2681 - acc: 0.8480 - val_loss: 0.2915 - val_acc: 0.8375
Epoch 4/10
 - 167s - loss: 0.2565 - acc: 0.8548 - val_loss: 0.3094 - val_acc: 0.8593
Epoch 5/10
 - 167s - loss: 0.2465 - acc: 0.8598 - val_loss: 0.3075 - val_acc: 0.8446
Epoch 6/10
 - 167s - loss: 0.2370 - acc: 0.8643 - val_loss: 0.3292 - val_acc: 0.8591
Epoch 7/10
 - 167s - loss: 0.2286 - acc: 0.8686 - val_loss: 0.3465 - val_acc: 0.8623
Epoch 8/10
 - 167s - loss: 0.2210 - acc: 0.8720 - val_loss: 0.3789 - val_acc: 0.8711
Epoch 9/10
 - 167s - loss: 0.2147 - acc: 0.8754 - val_loss: 0.3902 - val_acc: 0.8657
Epoch 10/10
 - 167s - loss: 0.2090 - acc: 0.8781 - val_loss: 0.4130 - val_ac

### Load a model

The following block is used to load models.

In [10]:
blstm = load_model(BLSTM_MODEL_FILE_NAME)

In [17]:
NUM_OF_TWEETS_TO_TEST = 10

def get_predictions(blstm, texts):
    text_to_integer_sequences = tokenizer.texts_to_sequences(texts)
    blstm_input = pad_sequences(text_to_integer_sequences, maxlen=N_TIMESTEPS, padding='post')
    ypred = blstm.predict_classes(blstm_input)
    return np.insert(ypred, 0, 0, axis=1)

print(f"example of labels (emoji locations):\n{y_input[:NUM_OF_TWEETS_TO_TEST]}\n")
print(f"model predictions:\n{get_predictions(blstm, X_input[:NUM_OF_TWEETS_TO_TEST])}\n")

example of labels (emoji locations):
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

model predictions:
[[0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 

### Load/Save predictions

Use the code below to save or load a model's predictions. Training and testing the model requires a powerful Nvidia GPU and is time-consuming, so storing the predictions makes it possible to evaluate results without having to run the model itself.
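
The save/load round trip can be tried without a trained model. The sketch below uses the standard library's pickle as a stand-in for joblib; the file name and the dictionary contents are made up for illustration.

```python
import os
import pickle
import tempfile


def save_results(result, path):
    """Serialize a result dictionary to disk."""
    with open(path, "wb") as fh:
        pickle.dump(result, fh)


def load_results(path):
    """Read a result dictionary back from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# hypothetical predictions for two short tweets
result = {"embedding_size": 101, "hidden_size": 52, "ypred": [[0, 1, 0], [0, 0, 1]]}
path = os.path.join(tempfile.gettempdir(), "blstm_predictions_example.pkl")

save_results(result, path)
print(load_results(path) == result)
```

Storing the hyperparameters next to the predictions makes it clear later which model produced a given file.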

In [None]:
blstm_predictions_flag = False

if blstm_predictions_flag:
    BLSTM_PREDICTIONS_FILE_NAME = "blstm_predictions_" + PARAMETER_STR + ".pkl"
    predictions = get_predictions(blstm, text_lines)  # get_predictions expects the model first

    result_dict = {}
    result_dict["num_of_classes"] = N_CLASSES
    result_dict["num_of_input_tweets"] = INPUT_SIZE
    result_dict["embedding_size"] = EMBEDDING_SIZE
    result_dict["hidden_size"] = HIDDEN_SIZE
    result_dict["input_X"] = X_blstm
    result_dict["input_y"] = y_blstm
    result_dict["ypred"] = predictions

    joblib.dump(result_dict, BLSTM_PREDICTIONS_FILE_NAME)

In [None]:
if blstm_predictions_flag:
    result_dict = joblib.load(BLSTM_PREDICTIONS_FILE_NAME)

### Evaluation

In [15]:
def evaluate_blstm(blstm, texts, labels):
    predictions = get_predictions(blstm, texts)
    print_scores(calc_scores(predictions.flatten(), labels.flatten()))

In [43]:
print(f"test:")
evaluate_blstm(blstm, X_blstm_test, y_blstm_test)
print(f"\ntrain:")
evaluate_blstm(blstm, X_blstm_train, y_blstm_train)

test:
Accuracy: 0.8842885074631341
Precision: 0.6137139825591034
Recall: 0.16381856843512066
F1: 0.25860716936645856
ratio of positive/negative predictions 0.12319002745709369

train:
Accuracy: 0.929224556024226
Precision: 0.8889981466085145
Recall: 0.30345990224475017
F1: 0.4524692351651533
ratio of positive/negative predictions 0.09636777284635417


Accuracy: 0.8684525806451613
Precision: 0.8862431529333372
Recall: 0.18585372870558742
F1: 0.3072699815349909
ratio of positive/negative predictions 0.15697741935483872


In [None]:
for hidden_size in hidden_sizes:
    for embedding_size in embedding_sizes:
        blstm = load_blstm(embedding_size, hidden_size)
        print(f"\nEmbedding {embedding_size} hidden {hidden_size}")
        evaluate_blstm(blstm, X_blstm_test, y_blstm_test)


Embedding 2 hidden 10
Accuracy: 0.8404877419354839
Precision: 0.8914559721011334
Recall: 0.15689296823780496
F1: 0.2668256114630038
ratio of positive/negative predictions 0.18500387096774193

Embedding 5 hidden 10
Accuracy: 0.8631083870967742
Precision: 0.8475073313782991
Recall: 0.1729853028011227
F1: 0.2873245870364026
ratio of positive/negative predictions 0.15952129032258064

Embedding 10 hidden 10
Accuracy: 0.8726167741935484
Precision: 0.8230958230958231
Recall: 0.1805647320652363
F1: 0.2961600433474498
ratio of positive/negative predictions 0.14842322580645162

Embedding 15 hidden 10
Accuracy: 0.8778064516129033
Precision: 0.8175081239597368
Recall: 0.18631010440374263
F1: 0.30346136306800625
ratio of positive/negative predictions 0.14286967741935483

Embedding 2 hidden 30
