# Deep Products: Deep Tag Labeler

This is the first project for the book Deep Products, about using NLP and weakly supervised learning to build complete machine learning products. We use the non-code text of Stack Overflow posts (questions and answers) to tag them with a multi-class, multi-label classifier built on LSTMs and ELMo embeddings.
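As a minimal sketch of what "multi-label" means here, each post's tags can be represented as a multi-hot vector with one slot per known tag. The tag vocabulary and the helper `encode_tags` are hypothetical and only illustrate the encoding; the real tag-to-index map is loaded from JSON later in the notebook.

```python
# Hypothetical tag vocabulary, for illustration only
demo_tag_index = {"c#": 0, "java": 1, "python": 2, ".net": 3}

def encode_tags(tags, tag_index):
    """Return a multi-hot list with a 1 at the index of every known tag."""
    vector = [0] * len(tag_index)
    for tag in tags:
        if tag in tag_index:  # unknown tags are silently ignored
            vector[tag_index[tag]] = 1
    return vector

# A post can carry several tags at once, hence "multi-label"
print(encode_tags(["c#", ".net"], demo_tag_index))  # [1, 0, 0, 1]
```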

In [1]:
import json
import os
import re

import pandas as pd
import numpy as np
import tensorflow as tf
# import tensorflow_hub as hub

from tqdm import tqdm_notebook
#import tensorflow_addons as tfa

tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None
)

True

In [2]:
np.random.seed(seed=1337)

## Load a Stratified Sample of Answered Stack Overflow Questions with Tags

We load a sample drawn from all answered Stack Overflow questions. The data was converted from XML to Parquet format via [code/stackoverflow/xml_to_parquet.py](stackoverflow/sample_json.spark.py). A more balanced stratified sample was then computed for tags with over 50,000, 20,000 and 10,000 instances using [code/stackoverflow/get_questions.spark.py](stackoverflow/get_questions.spark.py), which reduced the maximum imbalance from 100-1000:1 to 8:1.

These scripts were run on a Spark cluster via Amazon Elastic MapReduce using 13 r5.12xlarge machines for about 24 hours, at a cost of about \$300 per full run and about \$1,500 overall to create and debug. Big data is expensive.

With this dataset the challenge isn't the number of records per se but rather the imbalance of the dataset, which becomes a problem if we wish to expand the number of tags the model can predict beyond the low hundreds. This leads us to some of the other techniques we'll cover involving weakly supervised learning.
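One way to quantify the imbalance discussed above is the ratio between the most and least frequent tag counts. The sketch below assumes that definition, and the counts are made up purely to illustrate the arithmetic; they are not taken from the dataset.

```python
def imbalance_ratio(label_counts):
    """Ratio of the most frequent label's count to the least frequent label's count."""
    counts = list(label_counts.values())
    return max(counts) / min(counts)

# Hypothetical counts before and after a stratified sample that caps large tags
raw_counts = {"javascript": 1_800_000, "python": 1_200_000, "haskell": 18_000}
sampled_counts = {"javascript": 80_000, "python": 80_000, "haskell": 18_000}

print(imbalance_ratio(raw_counts))      # 100.0
print(imbalance_ratio(sampled_counts))  # well under 8
```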

In [3]:
sorted_all_tags = json.load(open('data/stackoverflow/08-05-2019/sorted_all_tags.50000.json'))
max_index = sorted_all_tags[-1][0] + 1

In [4]:
import pyarrow
posts_df = pd.read_parquet(
    'data/stackoverflow/08-05-2019/Questions.Stratified.Final.50000.parquet',
    columns=['_Body'] + ['label_{}'.format(i) for i in range(0, max_index)],
    engine='pyarrow'
)
posts_df.head(5)

Unnamed: 0,_Body,label_0,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,...,label_14,label_15,label_16,label_17,label_18,label_19,label_20,label_21,label_22,label_23
0,"[C, Mono, Winforms, MessageBox, problem, I, fi...",1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[Are, NET, data, providers, Oracle, require, O...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[How, I, focus, foreign, window, I, applicatio...",1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[Default, button, hit, windows, forms, trying,...",1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[Can, I, avoid, JIT, net, Say, code, always, g...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
'{:,}'.format(len(posts_df.index))

'1,293,018'

## Map from Tags to IDs

In [6]:
tag_index = json.load(open('data/stackoverflow/08-05-2019/tag_index.50000.json'))
index_tag = json.load(open('data/stackoverflow/08-05-2019/index_tag.50000.json'))

## Count the Most Common Tags

In [7]:
label_counts = json.load(open('data/stackoverflow/08-05-2019/label_counts.50000.json'))

# Sanity check that the different files agree on the number of tags
assert(len(label_counts.keys()) == len(tag_index.keys()) == len(index_tag.keys()) == len(sorted_all_tags))

## Make Record Count a Multiple of the Batch Size and Post Sequence Length

The ELMo embedding requires that the number of records be a multiple of the batch size times the number of tokens in the padded posts.

In [8]:
import math

BATCH_SIZE = 32
MAX_LEN = 100
TOKEN_COUNT = 10000
EMBED_SIZE = 50

# Convert label columns to numpy array
labels = posts_df[list(posts_df.columns)[1:]].to_numpy()

# training_count must be a multiple of BATCH_SIZE times MAX_LEN for the ELMo embedding layer
highest_factor = math.floor(len(posts_df.index) / (BATCH_SIZE * MAX_LEN))
training_count = highest_factor * BATCH_SIZE * MAX_LEN
print('Highest Factor: {:,} Training Count: {:,}'.format(highest_factor, training_count))

# posts_text = np.stack(
#     posts_df[0:training_count]['_Body'].values
# )

# Remove stopwords - now done in Spark, so can remove once that runs
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

sentences = []
for body in posts_df[0:training_count]['_Body'].values.tolist():
    words = body.tolist()
#     words = tokenizer.tokenize(' '.join(words))
    
#     new_words = []
#     pad_count = 0
#     for i in range(MAX_LEN):
#         if i < len(words):
#             word = words[i].lower()  
#             if word not in stop_words:
#                 new_words.append(word)
#             else:
#                 pad_count += 1
#         else:
#             pad_count += 1
    
#     new_words += ['__pad__'] * pad_count
#     new_words = new_words[0:MAX_LEN]
      
    sentences.append(' '.join(words))

labels = labels[0:training_count]

sentences
# print(min([len(x) for x in posts_text]))
# print(max([len(x) for x in posts_text]))
# assert( min([len(x) for x in posts_text]) == MAX_LEN == max([len(x) for x in posts_text]) )

Highest Factor: 404 Training Count: 1,292,800


['C Mono Winforms MessageBox problem I file called hellowf cs On Ubuntu 8 10 I following looks like alt text The second part message missing Why happening The binary hellowf exe works fine Windows Update This really annoying Here mono versions I tried make work far My Current mono version clues __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__ __PAD__',
 'Are NET data providers Oracle require Oracle Client installed I developing rich client application use Entity Framework DevArt DotConnect Oracle connect central Oracle database However I found scenario requires every client machine Oracle client installed order connect Oracle 10g server Is easy way around D

In [9]:
# sentences = [' '.join(x) for x in posts_text]
# sentences

## Create an ELMo Embedding Layer using TensorFlow Hub

Note that this layer takes a padded two-dimensional array of strings.
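A minimal sketch of producing such a padded two-dimensional array of strings follows. The helper `pad_tokens` is hypothetical, and the `__PAD__` token is assumed from the sample output shown earlier; the notebook's own padding is done upstream in Spark.

```python
import numpy as np

PAD_TOKEN = "__PAD__"  # assumed padding token, matching the sample output

def pad_tokens(token_lists, max_len, pad_token=PAD_TOKEN):
    """Truncate or right-pad every token list to exactly max_len tokens."""
    padded = [
        tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))
        for tokens in token_lists
    ]
    # dtype=object keeps each cell as a Python string
    return np.array(padded, dtype=object)

demo_posts = [["C", "Mono", "Winforms"], ["How", "I", "focus", "foreign", "window"]]
demo_batch = pad_tokens(demo_posts, max_len=4)
print(demo_batch.shape)  # (2, 4)
```

The first post is padded with one `__PAD__` token and the second is truncated to four tokens, so every row has the same length.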

In [10]:
# # From https://www.depends-on-the-definition.com/named-entity-recognition-with-residual-lstm-and-elmo/
# tf.compat.v1.disable_eager_execution()

# sess = tf.compat.v1.Session()

# elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

# sess.run(tf.global_variables_initializer())
# sess.run(tf.tables_initializer())

# def ElmoEmbedding(x):
#     return elmo_model(inputs={
#                             "tokens": tf.squeeze(tf.cast(x, tf.string)),
#                             "sequence_len": tf.constant(BATCH_SIZE*[MAX_LEN])
#                       },
#                       signature="tokens",
#                       as_dict=True)["elmo"]

# text_input = Input(shape=(MAX_LEN,), dtype=tf.string)
# elmo_embedding = Lambda(ElmoEmbedding, output_shape=(MAX_LEN, 1024))(text_input)

## Create a GloVe Embedding Layer

In [13]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=TOKEN_COUNT)
tokenizer.fit_on_texts(sentences)
# encoded_docs = tokenizer.texts_to_matrix(posts_text, mode='tfidf')
sequences = tokenizer.texts_to_sequences(sentences)

padded_sequences = pad_sequences(
    sequences,
    maxlen=MAX_LEN,
    dtype='int32',
    padding='post',
    truncating='pre',
    value=1
)

print(max([len(x) for x in padded_sequences]), min([len(x) for x in padded_sequences]))
assert( min([len(x) for x in padded_sequences]) == MAX_LEN == max([len(x) for x in padded_sequences]))

padded_sequences.shape

100 100


(1292800, 100)

In [14]:
padded_sequences

array([[  71, 3402, 2170, ...,    1,    1,    1],
       [  54,  170,   34, ...,    1,    1,    1],
       [  21,    2, 1478, ...,    1,    1,    1],
       ...,
       [   2,   13,   35, ...,  956,    1,    1],
       [ 119,  462,   32, ...,    1,    1,    1],
       [   2,   25,   44, ...,    1,    1,    1]], dtype=int32)

In [15]:
def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open('data/GloVe/glove.6B.50d.txt'))

In [16]:
# Create embeddings matrix
# np.stack needs a sequence, not a dict view, so materialize the values first
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# Create embedding matrix using our vocabulary
word_index = tokenizer.word_index
nb_words = min(TOKEN_COUNT, len(word_index))

# Initialize embedding matrix
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, EMBED_SIZE))

# Loop through each word and copy in its GloVe vector when one exists
for word, i in word_index.items():
    if i >= nb_words: 
        continue # Skip words ranked outside the vocabulary kept by the tokenizer
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector

  


## Experimental Setup

We use a single `train_test_split` rather than k-fold cross-validation, because training a separate model for each fold on this much data is too expensive.
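To make the trade-off concrete: a hold-out split trains one model, while k-fold cross-validation trains k. The NumPy-only sketch below shows a 90/10 hold-out split on toy data. The helper `holdout_split` is hypothetical and is not scikit-learn's implementation; it only illustrates the idea of shuffling once and slicing.

```python
import numpy as np

def holdout_split(X, y, test_size=0.1, seed=1337):
    """Shuffle row indices once, then slice off a test fraction."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: row i of X_demo is [2i, 2i + 1] and its label is i
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (18, 2) (2, 2)
```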

In [17]:
from sklearn.model_selection import train_test_split

TEST_SPLIT = 0.1

# X_train, X_test, y_train, y_test = train_test_split(
#     posts_text,
#     labels,
#     test_size=TEST_SPLIT,
#     random_state=1337
# )
X_train, X_test, y_train, y_test = train_test_split(
    padded_sequences,
    labels,
    test_size=TEST_SPLIT,
    random_state=1337
)

#assert(X_train.shape[0] == y_train.shape[0])
#assert(X_train.shape[1] == MAX_LEN)
#assert(X_test.shape[0] == y_test.shape[0]) 
#assert(X_test.shape[1] == MAX_LEN)

## Create an LSTM Model to Classify Posts into Tags

We use the padded/tokenized posts as input, with an ELMo embedding feeding a Long Short-Term Memory (LSTM) layer, followed by a Dense layer with the same number of output neurons as our tag list.

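We use focal loss as a loss function, which is used in applications like object detection, because it down-weights easy, well-classified examples so that training concentrates on the hard ones, which helps when classes are heavily imbalanced.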
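For reference, the standard binary focal loss is $-\alpha_t (1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the predicted probability of the true class. The NumPy sketch below implements that formula under the usual default assumptions (`gamma=2.0`, `alpha=0.25`); the function name `binary_focal_loss` is hypothetical and this is not the loss implementation used by the notebook.

```python
import numpy as np

def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over all labels."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the correct outcome
    p_t = y_true * y_prob + (1.0 - y_true) * (1.0 - y_prob)
    # alpha_t balances positives against negatives
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

easy = binary_focal_loss([1], [0.9])  # confident and correct
hard = binary_focal_loss([1], [0.1])  # confident and wrong
print(easy < hard)  # True
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted binary cross-entropy.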

In [18]:
# from keras.layers import Input, concatenate, Activation, Dense, LSTM, BatchNormalization, Embedding, Dropout, Lambda, Bidirectional
# from keras.metrics import categorical_accuracy, top_k_categorical_accuracy
# from keras.models import Model
# from keras.optimizers import Adam
# from keras_metrics import precision, f1_score, false_negative, true_positive, false_positive, true_negative

# # Text model
# text_input = Input(shape=(MAX_LEN,), dtype=tf.string)

# elmo_embedding = Lambda(ElmoEmbedding, output_shape=(MAX_LEN, 1024))(text_input)

# text_lstm = LSTM(
#     input_shape=(MAX_LEN, 1024,),
#     units=512,
#     recurrent_dropout=0.2,
#     dropout=0.2)(elmo_embedding)

# text_dense = Dense(200, activation='relu')(text_lstm)

# text_output = Dense(record_count, activation='sigmoid')(text_dense)

# text_model = Model(
#     inputs=text_input, 
#     outputs=text_output
# )



# from sklearn.metrics import hamming_loss

# from keras.optimizers import Adam
# adam = Adam(lr=0.0005)

# text_model.compile(
#     loss='binary_crossentropy',
#     optimizer=adam,
#     metrics=[
#         precision_m,
#         recall_m,
#         f1_m,
#         'mae',
#         abs_KL_div,
#         'accuracy'
#     ]
# )
# 
# text_model.summary()

## Compute Sample and Class Weights

Because we have skewed classes and multiple classes per example, we employ sample or class weights which weight the importance of each row according to the relative frequency of their labels.
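As a small illustration of inverse-frequency class weights, the sketch below gives the most frequent label a weight of 1.0 and scales rarer labels up in proportion. The helper name and toy label matrix are hypothetical, and the sketch assumes every label has at least one positive example.

```python
import numpy as np

def inverse_frequency_class_weights(y):
    """Weight each label by max(label count) / its own count."""
    counts = y.sum(axis=0).astype(np.float64)
    return counts.max() / counts

# Toy multi-hot labels: label 0 appears 4 times, labels 1 and 2 once each
y_toy = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
])

toy_weights = inverse_frequency_class_weights(y_toy)
print(toy_weights)  # [1. 4. 4.]
```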

In [19]:
from sklearn.utils.class_weight import compute_sample_weight

train_sample_weights = compute_sample_weight('balanced', y_train)
test_sample_weights = compute_sample_weight('balanced', y_test)

train_sample_weights.shape, test_sample_weights

((1163520,), array([1.30316677e-07, 1.15411547e-04, 1.30316677e-07, ...,
        1.30316677e-07, 1.38388975e-04, 1.79070271e-06]))

In [20]:
train_weight_vec = list(np.max(np.sum(y_train, axis=0))/np.sum(y_train, axis=0))
train_class_weights = {i: train_weight_vec[i] for i in range(y_train.shape[1])}

test_weight_vec = list(np.max(np.sum(y_test, axis=0))/np.sum(y_test, axis=0))
test_class_weights = {i: test_weight_vec[i] for i in range(y_test.shape[1])}

sorted(list(train_class_weights.items()), key=lambda x: x[1]), sorted(list(test_class_weights.items()), key=lambda x: x[1])

([(16, 1.0),
  (5, 1.174390280129519),
  (12, 1.3624019987938314),
  (17, 1.620059420141379),
  (20, 1.8687102644702323),
  (10, 1.8711868417938706),
  (8, 2.009173379411989),
  (9, 2.0102461100376283),
  (7, 2.147567699703941),
  (22, 2.1867688137843295),
  (13, 2.188281855419019),
  (14, 2.292860457023547),
  (11, 2.338430143144446),
  (1, 2.3760254830663823),
  (23, 2.4299895506792057),
  (6, 2.6281203257437262),
  (21, 2.647214410070979),
  (2, 2.666767850517724),
  (15, 2.9316648127549128),
  (0, 3.058921386567626),
  (3, 3.0622385747482572),
  (18, 3.2606293043012085),
  (4, 3.4826674888781217),
  (19, 4.0100928133083125)],
 [(16, 1.0),
  (5, 1.1816222042576126),
  (12, 1.3517262638717633),
  (17, 1.564116283217407),
  (10, 1.8412765064035272),
  (20, 1.8823781927452243),
  (9, 1.9712294897729827),
  (8, 1.9954493742889647),
  (13, 2.1750992063492065),
  (7, 2.1821348594177654),
  (22, 2.1897627965043696),
  (14, 2.294610151753009),
  (23, 2.370911057042444),
  (1, 2.379918588873

## Establish a Log for Performance

## Simple Baseline Model using `Conv1D`

In [21]:
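def hamming_loss(y_true, y_pred, mode='multilabel'):
    """Hamming loss on tensors. It counts positions where y_true and y_pred
    differ, so y_pred should already be thresholded to 0/1 values."""
    if mode not in ['multiclass', 'multilabel']:
        raise ValueError('mode must be one of: [multiclass, multilabel]')

    # Keras may pass integer labels, so match dtypes before doing arithmetic
    y_true = tf.cast(y_true, y_pred.dtype)

    if mode == 'multiclass':
        nonzero = tf.cast(tf.math.count_nonzero(y_true * y_pred, axis=-1), tf.float32)
        return 1.0 - nonzero

    else:
        nonzero = tf.cast(tf.math.count_nonzero(y_true - y_pred, axis=-1), 
            tf.float32)
        return nonzero / y_true.get_shape()[-1]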
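For intuition, the multilabel Hamming loss is simply the fraction of label slots that are predicted wrongly. The NumPy sketch below shows that on toy arrays; `hamming_loss_np` is a hypothetical helper and it expects predictions that are already binarized.

```python
import numpy as np

def hamming_loss_np(y_true, y_pred_bin):
    """Fraction of label positions where the prediction differs from the truth."""
    y_true = np.asarray(y_true)
    y_pred_bin = np.asarray(y_pred_bin)
    return float(np.mean(y_true != y_pred_bin))

y_true_toy = np.array([[1, 0, 0, 1], [0, 1, 0, 0]])
y_pred_toy = np.array([[1, 0, 1, 1], [0, 0, 0, 0]])

# Two of the eight label slots are wrong
print(hamming_loss_np(y_true_toy, y_pred_toy))  # 0.25
```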

In [None]:
# Model imports
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from tensorflow.keras.layers import ( Input, Embedding, GlobalMaxPooling1D, Conv1D, Dense, Activation, 
                                      Dropout, Lambda, BatchNormalization, concatenate )
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit imports
from tensorflow.keras.losses import hinge, mae, binary_crossentropy, kld, Huber, squared_hinge
# from tensorflow_addons.losses.focal_loss import SigmoidFocalCrossEntropy
# from tensorflow_addons.metrics.f_scores import F1Score

# Hyperparameter/method search space
import itertools

learning_rates = [0.01, 0.001, 0.0001, 0.00005]
# Huber is a class, so it must be instantiated; the others are plain functions
losses = [binary_crossentropy, hinge, squared_hinge, mae, kld, Huber(), hamming_loss]
activations = ['relu', 'selu']
optimizers = ['adam', 'sgd']
dropout_ratios = [0.2]
filter_lengths = [128]
class_weight_set = [None, train_class_weights]
sample_weight_set = [None, train_sample_weights]

args = itertools.product(
    learning_rates,
    losses,
    activations,
    optimizers,
    dropout_ratios,
    filter_lengths,
    class_weight_set,
    sample_weight_set
)

performance_log = {}
for learning_rate, loss_function, activation, optimizer, dropout_ratio, filter_length, class_weights, sample_weights in args:
    
    #
    # Build ze model...
    #
    def build_model(
        token_count=20000,
        max_words=100,
        embedding_dim=50,
        label_count=y_train.shape[1],
        dropout_ratio=0.2,
        filter_length=filter_length,
        loss_function='binary_crossentropy',
        learning_rate=0.001,
        optimizer=Adam,
        activation='relu'
    ):
        """Build the model using this experiment's parameters"""
        
        hashed_input = Input(shape=(X_train.shape[1],), dtype='int64')
        
        emb = Embedding(token_count, embedding_dim, weights=[embedding_matrix])(hashed_input)

        # One convolution branch per kernel size, i.e. per n-gram width
        pooled = []
        for kernel_size in (3, 4, 5, 6):
            conv = Conv1D(filters=filter_length, kernel_size=kernel_size)(emb)
            btch = BatchNormalization()(conv)
            drp  = Dropout(dropout_ratio)(btch)
            # Use the activation under test rather than a hard-coded 'relu'
            actv = Activation(activation)(drp)
            pooled.append(GlobalMaxPooling1D()(actv))

        # Gather all convolution layers
        cnct = concatenate(pooled, axis=1)
        drp1 = Dropout(dropout_ratio)(cnct)

        dns1  = Dense(32, activation='relu')(drp1)
        btch1 = BatchNormalization()(dns1)
        drp2  = Dropout(dropout_ratio)(btch1)

        out = Dense(y_train.shape[1], activation='sigmoid')(drp2)

        text_model = Model(
            inputs=hashed_input, 
            outputs=out
        )
        
        # Turn the optimizer name into an optimizer instance so the learning rate is applied
        if optimizer == 'adam':
            optimizer = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        elif optimizer == 'sgd':
            optimizer = SGD(learning_rate=learning_rate)

        text_model.compile(
            optimizer=optimizer,
            loss=loss_function,
            metrics=[
                'categorical_accuracy',
                tf.keras.metrics.Precision(),
                tf.keras.metrics.Recall(),
                tf.keras.metrics.BinaryAccuracy(),
                tf.keras.metrics.Hinge(),
                tf.keras.metrics.AUC(),
                tf.keras.metrics.Accuracy(),
                tf.keras.metrics.MeanAbsoluteError(),
                tf.keras.metrics.MeanAbsolutePercentageError(),
                tf.keras.metrics.TruePositives(),
                tf.keras.metrics.FalsePositives(),
                tf.keras.metrics.TrueNegatives(),
                tf.keras.metrics.FalseNegatives()
            ]
        )
        text_model.summary()

        return text_model
    
    #
    # Train ze model...
    #
    def train_model(
        model=None,
        dropout_ratio=0.1,
        learning_rate=0.001,
        optimizer='adam',
        activation='relu',
        epochs=5,
        class_weights=None,
        sample_weights=None,
    ):
        """Train the model using the current parameters and evaluate performance"""

        callbacks = [
            #ReduceLROnPlateau(), 
            EarlyStopping(patience=1), 
            #ModelCheckpoint(filepath='model-conv1d.h5', save_best_only=True)
        ]

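        history = model.fit(
            X_train, 
            y_train,
            class_weight=class_weights,
            sample_weight=sample_weights,
            epochs=epochs,
            batch_size=BATCH_SIZE,
            validation_data=(X_test, y_test),
            callbacks=callbacks
        )

        # Evaluate on the test set. The training sample weights have one entry per
        # training row, so they cannot be reused here.
        accr = model.evaluate(X_test, y_test)

        # Metrics are reported in compile order after the loss: categorical accuracy,
        # precision, recall, ... so precision and recall sit at indices 2 and 3
        precision, recall = accr[2], accr[3]
        f1_score = 2.0 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

        # Include the weighting choices in the key so runs do not overwrite each other
        loss_name = getattr(loss_function, '__name__', str(loss_function))
        description_key = ' '.join([
            loss_name, str(learning_rate), str(optimizer), str(activation),
            'class_weights={}'.format(class_weights is not None),
            'sample_weights={}'.format(sample_weights is not None)
        ])
        print(model.metrics_names)
        return description_key, list(zip(accr + [f1_score], model.metrics_names + ['val_f1_score']))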

    #
    # main()
    #
    text_model = build_model(
        token_count=TOKEN_COUNT,
        max_words=MAX_LEN,
        embedding_dim=EMBED_SIZE,
        label_count=y_train.shape[1],
        filter_length=filter_length,
        loss_function=loss_function,
        learning_rate=learning_rate,
        optimizer=optimizer,
        activation=activation,
        dropout_ratio=dropout_ratio
    )

    description_key, accuracies = train_model(
        model=text_model,
        dropout_ratio=dropout_ratio,
        learning_rate=learning_rate,
        optimizer=optimizer,
        activation=activation,
        epochs=1,
        class_weights=class_weights,
        sample_weights=sample_weights,
    )
    performance_log[description_key] = accuracies
    print(description_key, accuracies)

Model: "model_18"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_18 (InputLayer)           [(None, 100)]        0                                            
__________________________________________________________________________________________________
embedding_17 (Embedding)        (None, 100, 50)      500000      input_18[0][0]                   
__________________________________________________________________________________________________
conv1d_68 (Conv1D)              (None, 98, 128)      19328       embedding_17[0][0]               
__________________________________________________________________________________________________
conv1d_69 (Conv1D)              (None, 97, 128)      25728       embedding_17[0][0]               
___________________________________________________________________________________________

In [None]:
performance_log

In [None]:
# from keras.callbacks import EarlyStopping

# EPOCHS = 4

# history = text_model.fit(
#     X_train,
#     y_train,
#     epochs=EPOCHS,
#     batch_size=BATCH_SIZE,
#     callbacks=[
#         EarlyStopping(monitor='loss', patience=1, min_delta=0.0001),
#         EarlyStopping(monitor='val_loss', patience=1, min_delta=0.0001),
#     ],
#     class_weight=class_weights,
#     # sample_weight=train_sample_weights,
#     validation_data=(X_test, y_test)
# )

In [None]:
accr = text_model.evaluate(X_test, y_test) #, sample_weight=test_sample_weights)
[i for i in zip(accr, text_model.metrics_names)]

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

# `history` must be the object returned by a `text_model.fit(...)` call at notebook scope
print(history.history)
# Plot only the metrics that were actually recorded; key names depend on the compiled metrics
for key in ['val_loss', 'categorical_accuracy', 'mean_absolute_error', 'precision', 'recall']:
    if key in history.history:
        plt.plot(history.history[key], label=key)
plt.title('model metrics')
plt.ylabel('metric')
plt.xlabel('epoch')
plt.legend(loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
import statistics

# Note: this shadows the TensorFlow hamming_loss defined earlier in the notebook
from sklearn.metrics import hamming_loss, jaccard_score

y_pred = text_model.predict(X_test)

best_cutoff = 0
max_score = 0
for cutoff in [0.0001, 0.001, 0.01, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8]:
    # y_pred is a NumPy array, so threshold it directly; no TensorFlow session is needed
    y_pred_bin = (y_pred > cutoff).astype(int)
    print('Cutoff: {:,}'.format(cutoff))
    print('Hamming loss: {:,}'.format(
        hamming_loss(y_test, y_pred_bin)
    ))
    scores = []
    for j_type in ['micro', 'macro', 'weighted']:
        j_score = jaccard_score(y_test, y_pred_bin, average=j_type)
        print('Jaccard {} score: {:,}'.format(
            j_type,
            j_score
        ))
        scores.append(j_score)
    print('')
    # Keep the cutoff with the best mean of the three Jaccard averages
    mean_score = statistics.mean(scores)
    if mean_score > max_score:
        best_cutoff = cutoff
        max_score = mean_score

print('Best cutoff was: {:,} with mean jaccard score of {:,}'.format(best_cutoff, max_score))
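The cutoff search can be illustrated on toy data with NumPy alone. The sketch below is a simplification: it scores each cutoff with the micro-averaged Jaccard index only, rather than the mean of the micro, macro and weighted averages, and the helpers `micro_jaccard` and `pick_cutoff` are hypothetical.

```python
import numpy as np

def micro_jaccard(y_true, y_pred_bin):
    """Micro-averaged Jaccard index: |intersection| / |union| over all label slots."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred_bin = np.asarray(y_pred_bin).astype(bool)
    intersection = np.logical_and(y_true, y_pred_bin).sum()
    union = np.logical_or(y_true, y_pred_bin).sum()
    return float(intersection / union) if union else 0.0

def pick_cutoff(y_true, y_prob, cutoffs):
    """Score every cutoff and return the best one along with all scores."""
    scores = {c: micro_jaccard(y_true, y_prob > c) for c in cutoffs}
    return max(scores, key=scores.get), scores

y_true_toy = np.array([[1, 0, 1], [0, 1, 0]])
y_prob_toy = np.array([[0.9, 0.3, 0.6], [0.2, 0.7, 0.4]])

toy_cutoff, toy_scores = pick_cutoff(y_true_toy, y_prob_toy, [0.1, 0.5, 0.8])
print(toy_cutoff)  # 0.5
```

A very low cutoff predicts every tag and a very high one predicts almost none, so the score peaks somewhere in between.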

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report, multilabel_confusion_matrix

y_pred = text_model.predict(X_test, batch_size=32, verbose=1)
y_pred_bool = np.where(y_pred > best_cutoff, 1, 0)

print(classification_report(y_test, y_pred_bool))

print(multilabel_confusion_matrix(y_test, y_pred_bool))

## View the Results

Now let's map from the one-hot-encoded tags back to the text tags and view them alongside the text of the original posts to sanity check the model and see if it really works.
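The mapping itself is a lookup from every position holding a 1 to its tag name. Here is a minimal sketch with a hypothetical index-to-tag dictionary and helper `decode_tags`. Note that a map loaded from JSON has string keys, so a lookup against it would need `str(i)` rather than `i`.

```python
# Hypothetical index-to-tag map, for illustration only
demo_index_tag = {0: "c#", 1: "java", 2: "python", 3: ".net"}

def decode_tags(multi_hot, index_tag):
    """Return the tag names for every position set to 1."""
    return [index_tag[i] for i, flag in enumerate(multi_hot) if flag == 1]

print(decode_tags([1, 0, 0, 1], demo_index_tag))  # ['c#', '.net']
```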

In [None]:
predicted_tags = []
for test, pred in zip(y_test, y_pred_bool):
    tags = []
    for i, val in enumerate(test):
        if pred[i] == 1.0:
            tags.append(sorted_all_tags[i])
    predicted_tags.append(tags)

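for text, tags in zip(X_test, predicted_tags):
    # X_test holds integer token IDs, so convert them back to words before printing
    print(tokenizer.sequences_to_texts([text])[0], tags)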