### 10 Emoji Prediction

The task of this project is to make a system that would automatically fill the text with the appropriate emoticons. This can be done in two steps. First, for each position within the text a prediction is made whether an emoticon should be placed there. Second, an appropriate emoticon is chosen from a list of available emoticons. Both these tasks can be set up as supervised classification problems.

Competition website:
https://competitions.codalab.org/competitions/17344

Dataset:
https://competitions.codalab.org/competitions/17344

Entry point:
https://arxiv.org/pdf/1702.07285.pdf (Barbieri, Francesco, Miguel Ballesteros, and Horacio Saggion. Are Emojis Predictable?)

In [44]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline

Populating the interactive namespace from numpy and matplotlib


Define paths to folders containing data and results.

In [45]:
# promjena
#NUMBER_OF_TWEETS = None
NUMBER_OF_TWEETS = 100000

MAX_WORDS_PER_TWEET = 30
DATA_LOCATION = "./train/data/"
RESULT_LOCATION = "./result/"

# promjena
#TWEET_FILE_NAME = "tweet_by_ID_28_4_2018__03_20_05" + "_"
TWEET_FILE_NAME = "tweet_by_ID_02_6_2018__05_40_21" + "_"

if NUMBER_OF_TWEETS is not None:
    TWEET_FILE_NAME += str(NUMBER_OF_TWEETS)
else:
    TWEET_FILE_NAME += "ALL"

### Load the data
Tweets are loaded in two ways: list of strings (for the TF-IDF vectorizer) and a list of list of words (for feature extraction). Labels are read as a numpy array of N * MAX_WORDS_PER_TWEET dimensions.

In [46]:
base_file_name = DATA_LOCATION + TWEET_FILE_NAME

text_lines = []
text_lines_split = []

with open(base_file_name + ".text", 'r', encoding="utf-8") as out_text:
    for line in out_text:
        text_lines.append(line[:-1])
        text_lines_split.append(line[:-1].split())
        
loc_lines = []
with open(base_file_name + ".loclabels", 'r') as loc_labels:
    for line in loc_labels:
        loc_line = []
        for c in line[:-1]:
            loc_line.append(int(c))
        loc_lines.append(loc_line)

loc_lines = np.asarray(loc_lines)

emo_lines = []
with open(base_file_name + ".emolabels", 'r') as emo_labels:
    for e_line, loc in zip(emo_labels, loc_lines):
        emo_line = [0]*31
        e_line2 = e_line.split()
        
        br = 0
        for idx, val in enumerate(loc[:-1]):
            if val==1:
                if int(e_line2[br])==0:
                    emo_line[idx]=20
                else:
                    emo_line[idx]=int(e_line2[br])+1
                br += 1
        emo_lines.append(emo_line)
        
emo_lines = np.asarray(emo_lines)

# full_text = open(base_file_name + ".full", 'r')
# emoji_labels = open(base_file_name + ".emolabels", 'r')
# emoji_ids = open(base_file_name + ".ids", 'r')

print(f"example of tweet texts:\n{text_lines[:5]}\n")
print(f"example of labels (emoji locations):\n{loc_lines[:5]}")
print(f"example of labels (emoji type):\n{emo_lines[:5]}")

example of tweet texts:
['lol @ west covina , california', 'things got a little festive at the office #christmas2016 @ redrock', 'step out and explore . @ ellis island cafe', '@user @ cathedral preparatory school', "my baby bear @ bubby's"]

example of labels (emoji locations):
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
example of labels (emoji type):
[[ 0  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0 18  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0]
 [ 0  0  0  0  0 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0]
 [19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  

### TF-IDF Feature Extraction
TF-IDF is computed on the collection of tweets. Then for every position between words a new example is generated: a 2 * k array containing the k left and k right tfidf values of words. 
Labels for locations are taken as 1 or 0 wether an emoji was there in the original tweet.
Labels for emojis are mapped to a number between 0 and 20 where 20 marks no appearance of emoji and the rest describes 20 most frequent emojis.

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
import random

X_tfidf = None
word_to_tfidf_index_dict = {}


# promjena
def tfidf_features(tweets, loc_labels, emo_labels, k, func):
    global word_to_tfidf_index_dict, X_tfidf
    
    tfidf_model = TfidfVectorizer(input="content", analyzer="word", stop_words="english")
    X_tfidf = tfidf_model.fit_transform(text_lines)

    #word_to_tfidf_index_dict = {}
    for i, word in enumerate(tfidf_model.get_feature_names()):
        word_to_tfidf_index_dict[word] = i

    print(f"X shape {X_tfidf.shape}")
    print(f"y_l shape {loc_lines.shape}")
    print(f"y_e shape {emo_lines.shape}")
    print(f"some tf-idf values\n{X_tfidf[0]}\n")

    N = len(tweets)
    X = []
    y_l = []
    y_e = []
    
    for tweet_index, (tweet, l_label, e_label) in enumerate(zip(tweets, loc_labels, emo_labels)):
        for pos in range(len(tweet) + 1):
                
            x = []
            for i in range(pos - k, pos + k):
                if i < 0 or i >= len(tweet):
                    x.append(0.0)
                else:
                    x.append(func(tweet_index, tweet[i]))
            X.append(x)
            y_l.append(l_label[pos])
            y_e.append(e_label[pos])
            
    return np.asarray(X), np.asarray(y_l), np.asarray(y_e)

def word_to_tfidf(tweet_index, word):
    global word_to_tfidf_index_dict, X_tfidf
    
    if word in word_to_tfidf_index_dict:
        return X_tfidf[tweet_index, word_to_tfidf_index_dict[word]]
    else:
        return 0.0

# promjena
def print_dataset(X, y_l, y_e):
    emoji_num = np.count_nonzero(y_l)
    class_freq_ratio = emoji_num / (X.shape[0] * X.shape[1])
    
    print("after feature extraction:")
    print(f"X shape {X.shape}")
    print(f"y_l shape {y_l.shape}")
    print(f"y_e shape {y_e.shape}")
    print(f"some X values\n{X[:5]}")
    print(f"some y_l values\n{y_l[:5]}")
    print(f"non zero elements (1 in label) in y_l {emoji_num}")
    print(f"class frequency ratio {class_freq_ratio}\n")
    print(f"some y_e values\n{y_e[:5]}")

### Word2Vector Feature Extraction

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

num_of_features = 6
w2v_model_file_name = "w2v_model"

In [49]:
w2v_model = Word2Vec(text_lines_split, min_count=1, size=num_of_features)
w2v_model.save(w2v_model_file_name)

In [50]:
w2v_model = Word2Vec.load(w2v_model_file_name)

example = w2v_model.wv["california"]
print(example.shape)
print(example)

(6,)
[ 2.972653  -1.3512479  4.148444   7.4149575  2.2700298  4.582564 ]


In [51]:
def w2v_features(tweets, loc_labels, emo_labels, k):
    N = len(tweets)
    X = []
    y_l = []
    y_e = []
    
    for tweet_index, (tweet, l_label, e_label) in enumerate(zip(tweets, loc_labels, emo_labels)):
        for pos in range(len(tweet) + 1):
            x = []
            for i in range(pos - k, pos + k):
                if i < 0 or i >= len(tweet):
                    x.append(np.zeros(num_of_features))
                else:
                    x.append(w2v_model.wv[tweet[i]])
            x = np.average(x, axis=0)
            X.append(x)
            y_l.append(l_label[pos])
            y_e.append(e_label[pos])
            
    return np.asarray(X), np.asarray(y_l), np.asarray(y_e)

In [52]:
should_use_tfidf = False
should_use_hybrid = True
k = 3

# promjena
#X_tfidf, y_tfidf = neighbor_features(text_lines_split, loc_lines, k, word_to_tfidf)
X_tfidf, y_loc_tfidf, y_emo_tfidf = tfidf_features(text_lines_split, loc_lines, emo_lines, k, word_to_tfidf)

X_w2v, y_loc_w2v, y_emo_w2v = w2v_features(text_lines_split, loc_lines, emo_lines, k)
X_hybrid = np.hstack((X_tfidf, X_w2v))

if should_use_tfidf:
    X, y_l, y_e = X_tfidf, y_loc_tfidf, y_emo_tfidf
else:
    X, y_l, y_e = X_w2v, y_loc_w2v, y_emo_w2v
    
if should_use_hybrid:
    X = X_hybrid

print_dataset(X, y_l, y_e)

  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):


X shape (100000, 95000)
y_l shape (100000, 31)
y_e shape (100000, 31)
some tf-idf values
  (0, 50292)	0.4161419391955878
  (0, 91420)	0.4402275555538724
  (0, 20437)	0.7091913039760875
  (0, 14685)	0.36065673449735347

after feature extraction:
X shape (1346405, 12)
y_l shape (1346405,)
y_e shape (1346405,)
some X values
[[ 0.          0.          0.          0.41614194  0.          0.44022756
   1.31879329 -0.68362592  0.9591155  -0.11707349 -0.00876852  2.159688  ]
 [ 0.          0.          0.41614194  0.          0.44022756  0.7091913
   1.39896623 -0.73007096  0.9767785  -0.04547271 -0.00633871  2.21500153]
 [ 0.          0.41614194  0.          0.44022756  0.7091913   0.
   1.75498541 -0.80632033  1.59922015 -0.03095887 -0.47036003  2.79820529]
 [ 0.41614194  0.          0.44022756  0.7091913   0.          0.36065673
   2.25042748 -1.03152835  2.29062748  1.20486736 -0.09202174  3.56196594]
 [ 0.          0.44022756  0.7091913   0.          0.36065673  0.
   2.3913765  -0.6136179

### Build data sets
Dataset is randomly split into train and test subsets. Ignored while using KFold.

In [53]:
from sklearn.model_selection import train_test_split

if not should_use_kfold:
    X_train, X_test, y_loc_train, y_loc_test, y_emo_train, y_emo_test = train_test_split(X, y_l, y_e, test_size=0.2, random_state=42)

    print(X_train.shape)
    print(y_loc_train.shape)
    print(y_emo_train.shape)
    print(X_train[30:50])
    print(y_loc_train[30:50])
    print(y_emo_train[30:50])
    

### Baselines

Make sure to check should_train flags when training/testing.

In [54]:
should_use_kfold = True

should_train_global = False
should_train_linear_svm = True
should_train_bagging_svm = False
should_train_random_forest = True
should_train_adaboost = True

linear_svm_model_file_name = "linear_svm.pkl"
bagging_model_file_name = "bagging_svm.pkl"
random_forest_file_name = "random_forest.pkl"
adaboost_file_name = "adaboost.pkl"

In [62]:
from sklearn.externals import joblib
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, KFold

np.set_printoptions(threshold=np.inf)

kf = KFold(n_splits=5, shuffle=True)

def calc_scores(y_pred, y_test, multi_class):
    result = []
    result.append(accuracy_score(y_pred, y_test))
    if multi_class:
        average='weighted'
        #print(precision_score(y_pred[:200], y_test[:200], average=average))
        #print(recall_score(y_pred[:200], y_test[:200], average=average))
        #print(f1_score(y_pred[:200], y_test[:200], average=average))
        result.append(precision_score(y_pred, y_test, average=average))
        result.append(recall_score(y_pred, y_test, average=average))
        result.append(f1_score(y_pred, y_test, average=average))
    else:
        result.append(precision_score(y_pred, y_test))
        result.append(recall_score(y_pred, y_test))
        result.append(f1_score(y_pred, y_test))
    result.append(np.count_nonzero(y_pred) / y_pred.shape[0])
    return result

def print_scores(scores):
    print(f"Accuracy: {scores[0]}")
    print(f"Precision: {scores[1]}")
    print(f"Recall: {scores[2]}")
    print(f"F1: {scores[3]}")
    print(f"ratio of positive/negative predictions {scores[4]}")

def kfold_score(clf, X, y, multi_class):
    fold_scores = []
    for train_index, test_index in kf.split(X):
        X_train_kf, X_test_kf = X[train_index], X[test_index]
        y_train_kf, y_test_kf = y[train_index], y[test_index]
        clf.fit(X_train_kf, y_train_kf)
        y_pred = clf.predict(X_test_kf)
        fold_scores.append(calc_scores(y_pred, y_test_kf, multi_class))
        
    print(f"{y_test_kf[:200]}")
    print(f"{y_pred[:200]}")
    
    fold_scores = np.average(np.asarray(fold_scores), axis=0)
    print_scores(fold_scores)

def train_and_save_model(model, X_train, y_train, model_file_name):
    model.fit(X_train, y_train)
    joblib.dump(model, model_file_name)
    
def load_and_test_model(model_file_name, X_test, y_test):
    clf_loaded = joblib.load(model_file_name)
    y_pred = clf_loaded.predict(X_test)
    print_scores(calc_scores(y_pred, y_test))
    
def do_test(clf, X, y, clf_file_name, flag, multi_class=False):
    if should_use_kfold:
        kfold_score(clf, X, y, multi_class)
    else:
        if should_train_global and flag:
            train_and_save_model(clf, X_train, y_train, clf_file_name)
        load_and_test_model(clf_file_name, X_test, y_test)  

### Linear SVM Baseline

Linear SVM is based on the liblinear library and is faster on large datasets. Not using dual optimization problem makes the train extremely fast. It is the fastest so both feature extraction methods are used for comparison.

In [None]:
from sklearn.svm import LinearSVC

svm_clf = LinearSVC(class_weight="balanced", dual=False)

print("Emoji locations:\n")

print("TF-IDF results:")
do_test(svm_clf, X_tfidf, y_loc_tfidf, linear_svm_model_file_name, should_train_linear_svm)
print("\nword2vec results:")
do_test(svm_clf, X_w2v, y_loc_w2v, linear_svm_model_file_name, should_train_linear_svm)
print("\nhybrid results:")
do_test(svm_clf, X_hybrid, y_loc_w2v, linear_svm_model_file_name, should_train_linear_svm)

In [None]:
print("Emoji types:\n")

print("TF-IDF results:")
do_test(svm_clf, X_tfidf, y_emo_tfidf, linear_svm_model_file_name, should_train_linear_svm, True)
print("\nword2vec results:")
do_test(svm_clf, X_w2v, y_emo_w2v, linear_svm_model_file_name, should_train_linear_svm, True)
print("\nhybrid results:")
do_test(svm_clf, X_hybrid, y_emo_w2v, linear_svm_model_file_name, should_train_linear_svm, True)

Emoji types:

TF-IDF results:


  'recall', 'true', average, warn_for)
  'recall', 'true', average, warn_for)


### Bagging SVM

Warning! SVM is based on the libsvm library and it scales poorly with large datasets. That is why an ensemble (bagging) is used. Each classifier is trained on a portion of the data which greatly reduces training times and gives similar (if not better) results. Using 10 estimators it still takes a few hours to train. Results are just a bit better than a single linear SVM.

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

n_estimators = 10
bagging_svm_clf = BaggingClassifier(SVC(kernel='linear', class_weight='balanced'), max_samples=1.0 / n_estimators,
                                        n_estimators=n_estimators, bootstrap=False)


print("Emoji locations:\n")

do_test(bagging_svm_clf, X, y_l, bagging_model_file_name, should_train_bagging_svm)  

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

n_estimators = 10
bagging_svm_clf = BaggingClassifier(SVC(kernel='linear', class_weight='balanced'), max_samples=1.0 / n_estimators,
                                        n_estimators=n_estimators, bootstrap=False)

print("Emoji types:\n")

do_test(bagging_svm_clf, X, y_e, bagging_model_file_name, should_train_bagging_svm, True)

### Random Forest Baseline

Random forests are pretty fast, but the results are generally worse (F1). High accuracy and recall with low precision.

In [37]:
from sklearn.ensemble import RandomForestClassifier

random_forest_clf = RandomForestClassifier(min_samples_leaf=2)

print("Emoji locations:\n")

do_test(random_forest_clf, X, y_l, random_forest_file_name, should_train_random_forest)

Emoji locations:

Accuracy: 0.9325411001890218
Precision: 0.22312907029411772
Recall: 0.6638448028335906
F1: 0.3339893263151609
ratio of positive/negative predictions 0.02548044607677482


In [38]:
from sklearn.ensemble import RandomForestClassifier

random_forest_clf = RandomForestClassifier(min_samples_leaf=2)

print("Emoji types:\n")

do_test(random_forest_clf, X, y_e, random_forest_file_name, should_train_random_forest, True)

Emoji types:

Accuracy: 0.9247061619646392
Precision: 0.9247061619646392
Recall: 0.9247061619646392
F1: 0.9247061619646392
ratio of positive/negative predictions 0.9954924409817254


### AdaBoost Baseline



In [21]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, class_weight="balanced"), n_estimators=100)

print("Emoji locations:\n")

do_test(adaboost_clf, X, y_l, adaboost_file_name, should_train_adaboost)

Emoji locations:

Accuracy: 0.7456389422202087
Precision: 0.7587678978137375
Recall: 0.19594506116349328
F1: 0.31144955027282484
ratio of positive/negative predictions 0.2935929382318099


In [39]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, class_weight="balanced"), n_estimators=100)

print("Emoji types:\n")

do_test(adaboost_clf, X, y_e, adaboost_file_name, should_train_adaboost, True)

Emoji types:



KeyboardInterrupt: 

### BLSTM for emoji location prediction

Bidirectional long short-term memory recurrent network implementation using the Keras framework. Emoji location prediction is treated as a binary classification problem. INPUT_SIZE determines how much of the tweets are used for training. The model uses an embedding layer and its size is a hyperparameter. A single bidirectional layer is used which actually consists of two LSTM layers.

In [117]:
from pandas import DataFrame

N_TIMESTEPS = MAX_WORDS_PER_TWEET
N_CLASSES = 2
INPUT_SIZE = 100000
EMBEDDING_SIZE = 10
HIDDEN_SIZE = 20
PARAMETER_STR = str(N_CLASSES) + "_" + str(INPUT_SIZE) + "_" + str(EMBEDDING_SIZE) + "_" + str(HIDDEN_SIZE)

dummy_text = [""" Jack and Jill went up the hill\n
        To fetch a pail of water\n
        Jack fell down and broke his crown\n
        And Jill came tumbling after\n """]

### Data preparation

Words are converted to integer ids using a tokenizer. Each unique word has its own unique integer value. Tweets are padded with zeros to a set length. Padding is required by Keras. Since most of the labels are zero (emojis are present after every 15th word), class weights are calculated. Without class weights the model behaves like a majority class classifier which is of no use here.

In [114]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.utils.class_weight import compute_sample_weight

# integer encode text
X_input = text_lines[:INPUT_SIZE]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_input)
txt_to_seq = tokenizer.texts_to_sequences(X_input)
print(f"encoded:\n{txt_to_seq[0:5]}\n")
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f"Vocabulary Size: {vocab_size}")

# pad input sequences
X_blstm = pad_sequences(txt_to_seq, maxlen=N_TIMESTEPS, padding='post')
print(f"input shape: {X_blstm.shape}")
print(f"BLSTM input example:\n{X_blstm[:5]}\n")

y_loc = loc_lines[:INPUT_SIZE,1:]
y_blstm = to_categorical(y_loc, num_classes=N_CLASSES)
print(f"labels shape: {y_blstm.shape}")
# print(f"BLSTM labels:\n{y_blstm[:5]}\n")

weights = class_weight.compute_class_weight('balanced', np.unique(y_loc), y_loc.flatten())
class_weight_dict = dict(enumerate(weights))
print(f"class weight dict:\n{class_weight_dict}\n")

vfunc = np.vectorize(lambda x: class_weight_dict[x])
sample_weights = vfunc(y_loc)
print(f"sample_weights shape: {sample_weights.shape}")
print(f"sample_weights examples:\n{sample_weights[:5]}\n")

encoded:
[[178, 212, 4853, 65], [264, 63, 6, 60, 1939, 15, 1, 630, 2897, 12363], [1769, 43, 7, 1831, 6340, 213, 358], [29310, 3338, 1568, 6757, 70, 3338, 1361, 25, 1216, 7895, 2, 29311, 144], [40, 446, 643, 9518, 29312, 481, 19204, 2635, 1117]]

Vocabulary Size: 101805
input shape: (100000, 30)
BLSTM input example:
[[  178   212  4853    65     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0]
 [  264    63     6    60  1939    15     1   630  2897 12363     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0]
 [ 1769    43     7  1831  6340   213   358     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0]
 [29310  3338  1568  6757    70  3338  1361    25  1216  7895     2 29311
    144     0     0     0     0     0     0     

 ### Train model
 
 The following code is used to create and train a new model.

In [102]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import Bidirectional

def get_bi_lstm_model(n_timesteps=N_TIMESTEPS, mode="concat"):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=N_TIMESTEPS))
    model.add(Bidirectional(LSTM(20, return_sequences=True), merge_mode=mode))
    model.add(TimeDistributed(Dense(N_CLASSES, activation='softmax')))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"], sample_weight_mode="temporal")
    return model

blstm = get_bi_lstm_model()
blstm.fit(X_blstm, y_blstm, epochs=10, validation_split=0.05, verbose=2, sample_weight=sample_weights)

Train on 95000 samples, validate on 5000 samples
Epoch 1/10
 - 279s - loss: 0.2689 - acc: 0.8361 - val_loss: 0.2323 - val_acc: 0.8754
Epoch 2/10
 - 291s - loss: 0.1933 - acc: 0.9030 - val_loss: 0.2334 - val_acc: 0.9024
Epoch 3/10
 - 288s - loss: 0.1539 - acc: 0.9278 - val_loss: 0.2574 - val_acc: 0.9159
Epoch 4/10
 - 305s - loss: 0.1207 - acc: 0.9455 - val_loss: 0.3019 - val_acc: 0.9239
Epoch 5/10
 - 330s - loss: 0.0967 - acc: 0.9572 - val_loss: 0.3513 - val_acc: 0.9250
Epoch 6/10
 - 282s - loss: 0.0800 - acc: 0.9648 - val_loss: 0.4064 - val_acc: 0.9307
Epoch 7/10
 - 282s - loss: 0.0682 - acc: 0.9702 - val_loss: 0.4658 - val_acc: 0.9351
Epoch 8/10
 - 283s - loss: 0.0594 - acc: 0.9741 - val_loss: 0.5623 - val_acc: 0.9390
Epoch 9/10
 - 304s - loss: 0.0521 - acc: 0.9774 - val_loss: 0.6244 - val_acc: 0.9408
Epoch 10/10
 - 313s - loss: 0.0463 - acc: 0.9799 - val_loss: 0.6857 - val_acc: 0.9438


<keras.callbacks.History at 0x1d78c4ad748>

### Load/Save model

The following two blocks are used to save and load models.

In [119]:
BLSTM_MODEL_FILE_NAME = "blstm_model_" + PARAMETER_STR + ".h5"
blstm.save(BLSTM_MODEL_FILE_NAME)

In [120]:
from keras.models import load_model

blstm = load_model(BLSTM_MODEL_FILE_NAME)

In [121]:
NUM_OF_TWEETS_TO_TEST = 10

def get_predictions(texts):
    text_to_integer_sequences = tokenizer.texts_to_sequences(texts)
    blstm_input = pad_sequences(text_to_integer_sequences, maxlen=N_TIMESTEPS, padding='post')
    ypred = blstm.predict_classes(blstm_input)
    return np.insert(ypred, 0, 0, axis=1)

print(f"example of labels (emoji locations):\n{loc_lines[:NUM_OF_TWEETS_TO_TEST]}\n")
print(f"model predictions:\n{get_predictions(text_lines[:NUM_OF_TWEETS_TO_TEST])}\n")

example of labels (emoji locations):
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

model predictions:
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 

### Load/Save predictions

Use the code below to save/load a model's predictions. Training and testing the model requires a powerful Nvidia GPU and is time consuming. This also enables the evaluation of results without having to work with the model itself.

In [115]:
BLSTM_PREDICTIONS_FILE_NAME = "blstm_predictions_" + PARAMETER_STR + ".pkl"
predictions = get_predictions(text_lines)

result_dict = {}
result_dict["num_of_classes"] = N_CLASSES
result_dict["num_of_input_tweets"] = INPUT_SIZE
result_dict["embedding_size"] = EMBEDDING_SIZE
result_dict["hidden_size"] = HIDDEN_SIZE
result_dict["input_X"] = X_blstm
result_dict["input_y"] = y_blstm
result_dict["ypred"] = predictions

joblib.dump(result_dict, BLSTM_PREDICTIONS_FILE_NAME)

['blstm_predictions_2_100000_10_20.pkl']

In [None]:
result_dict = joblib.load(BLSTM_PREDICTIONS_FILE_NAME)