# Deep Products: Deep Tag Labeler

This is the first project for the book Deep Products, which is about using NLP and weakly supervised learning to build complete machine learning products. It uses the non-code text of Stack Overflow posts (questions and answers) to tag them with a multi-class, multi-label classifier built from LSTMs and ELMo embeddings.

In [2]:
import json
import os
import re

from tensorflow.keras import backend as K
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from tqdm import tqdm_notebook


tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None
)

True

## Load 14 Million Answered Questions with Tags from Stack Overflow

We load all answered questions from Stack Overflow. This data was converted from XML to JSON and then sampled using Spark on a single `r5.12xlarge` machine cluster with [code/stackoverflow/sample_json.spark.py](stackoverflow/sample_json.spark.py).

In [4]:
posts_df = pd.read_parquet(
    'data/Questions.Stratified.Final.50000.parquet',
    columns=['_Body'] + ['label_{}'.format(i) for i in range(0, 100)],
    #filters=[('_Tags','!=',None),],
    engine='pyarrow'
)
posts_df.head(5)

Unnamed: 0,_Body,label_0,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,...,label_90,label_91,label_92,label_93,label_94,label_95,label_96,label_97,label_98,label_99
0,"[I, have, the, following, code:, Can, anybody,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[I, have, created, a, new, project, from, the,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[I, want, to, zoom, in, and, out, a, UIVIew, o...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[How, do, I, save, a, MapView, to, the, PhotoA...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[I, have, looked, at, the, facebook, iphone, a...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
'{:}'.format(len(posts_df.index))

'1049164'

## Count the Most Common Tags

In [6]:
label_counts = json.load(open('data/label_counts.50000.json'))
list(label_counts.items())[0:10]

[('0', [1034673, 14491]),
 ('1', [1023250, 25914]),
 ('2', [1030462, 18702]),
 ('3', [1035645, 13519]),
 ('4', [1037445, 11719]),
 ('5', [1010664, 38500]),
 ('6', [1031699, 17465]),
 ('7', [1031501, 17663]),
 ('8', [1033207, 15957]),
 ('9', [1035151, 14013])]

## Map from Tags to IDs

In [7]:
tag_index = json.load(open('data/tag_index.50000.json'))
index_tag = json.load(open('data/index_tag.50000.json'))

## Make Record Count a Multiple of the Batch Size and Post Sequence Length

The ELMo embedding requires the number of records to be a multiple of the batch size times the number of tokens in each padded post, so we truncate the dataset to the largest such multiple.

In [8]:
import math

BATCH_SIZE = 128
MAX_LEN = 100

# Convert label columns to numpy array
labels = posts_df[list(posts_df.columns)[1:]].to_numpy()

# training_count must be a multiple of the BATCH_SIZE times the MAX_LEN for the Elmo embedding layer
highest_factor = math.floor(len(posts_df.index) / (BATCH_SIZE * MAX_LEN))
training_count = highest_factor * BATCH_SIZE * MAX_LEN
print('Highest Factor: {:,} Training Count: {:,}'.format(highest_factor, training_count))

posts_text = np.stack(
    posts_df[0:training_count]['_Body'].values
)
labels = labels[0:training_count]

Highest Factor: 81 Training Count: 1,036,800
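As a quick check of the truncation arithmetic above, the sketch below recomputes the training count from the row count printed earlier and shows how many rows are discarded. The helper name `truncate_to_multiple` is hypothetical and is not part of the notebook.

```python
def truncate_to_multiple(record_count, batch_size, max_len):
    """Largest multiple of batch_size * max_len that is <= record_count."""
    chunk = batch_size * max_len
    return (record_count // chunk) * chunk

record_count = 1049164  # row count printed by the notebook
training_count = truncate_to_multiple(record_count, batch_size=128, max_len=100)
dropped = record_count - training_count  # rows discarded by the truncation

print('Training count: {:,} Dropped rows: {:,}'.format(training_count, dropped))
```

The discarded rows are a small fraction of the dataset, which is presumably why the notebook truncates rather than pads the final batch.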


In [9]:
posts_text

array([['I', 'have', 'the', ..., '__PAD__', '__PAD__', '__PAD__'],
       ['I', 'have', 'created', ..., '__PAD__', '__PAD__', '__PAD__'],
       ['I', 'want', 'to', ..., '__PAD__', '__PAD__', '__PAD__'],
       ...,
       ['OK.', 'I', 'am', ..., '__PAD__', '__PAD__', '__PAD__'],
       ['My', 'situation:', 'I', ..., 'into', 'a', 'space'],
       ['I', 'found', 'this', ..., '__PAD__', '__PAD__', '__PAD__']],
      dtype=object)

## Create an ELMo Embedding Layer using TensorFlow Hub

Note that this layer takes a padded two-dimensional array of strings.
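The notebook does not show how the posts were padded, so the following is only a minimal sketch of one way to produce a rectangular array of string tokens. It assumes right-padding with the `__PAD__` token that appears in the `posts_text` output; the helper `pad_tokens` is hypothetical.

```python
PAD_TOKEN = '__PAD__'  # pad token visible in the notebook's posts_text output

def pad_tokens(tokens, max_len, pad_token=PAD_TOKEN):
    """Truncate or right-pad a token list to exactly max_len entries."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))

posts = [
    ['I', 'have', 'the', 'following', 'code:', 'Can', 'anybody', 'help'],
    ['How', 'do', 'I'],
]
# Every row ends up with the same length, so the result is two-dimensional
padded = [pad_tokens(post, max_len=6) for post in posts]
for row in padded:
    print(row)
```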

In [15]:
# From https://www.depends-on-the-definition.com/named-entity-recognition-with-residual-lstm-and-elmo/

# hub.Module is a TF1-style API, so the session and initializers come from tf.compat.v1.
# Note: hub.Module is expected to need graph mode as well (eager execution disabled at startup).
sess = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess)

elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

sess.run(tf.compat.v1.global_variables_initializer())
sess.run(tf.compat.v1.tables_initializer())

def ElmoEmbedding(x):
    # Embed a batch of padded token rows; every row is declared to be MAX_LEN tokens long
    return elmo_model(inputs={
                            "tokens": tf.squeeze(tf.cast(x, tf.string)),
                            "sequence_len": tf.constant(BATCH_SIZE*[MAX_LEN])
                      },
                      signature="tokens",
                      as_dict=True)["elmo"]

## Experimental Setup

We use a single `train_test_split` rather than k-fold cross-validation because training k models on this much data would be too expensive.

In [None]:
from sklearn.model_selection import train_test_split

TEST_SPLIT = 0.15

X_train, X_test, y_train, y_test = train_test_split(
    posts_text,
    labels,
    test_size=TEST_SPLIT,
    random_state=1337
)

assert X_train.shape[0] == y_train.shape[0]
assert X_train.shape[1] == MAX_LEN
assert X_test.shape[0] == y_test.shape[0]
assert X_test.shape[1] == MAX_LEN
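The split sizes can be checked by hand. The sketch below uses integer arithmetic to take 15% of the 1,036,800 truncated rows; the helper `split_sizes` is hypothetical and only approximates how `train_test_split` rounds. Both resulting counts are multiples of the batch size of 128, which appears to matter here because `ElmoEmbedding` builds its `sequence_len` tensor for exactly `BATCH_SIZE` rows.

```python
def split_sizes(n_samples, test_percent):
    """Return (n_train, n_test) for a hold-out split, using integer arithmetic."""
    n_test = n_samples * test_percent // 100
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(1036800, 15)
print('Train: {:,} Test: {:,}'.format(n_train, n_test))

# Both partitions divide evenly into batches of 128
print(n_train % 128, n_test % 128)
```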

## Create an LSTM Model to Classify Posts into Tags

We use the padded, tokenized posts as input to an ELMo embedding, which feeds a Long Short-Term Memory (LSTM) layer followed by a Dense layer with one output neuron per tag in our tag list.

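We use focal loss as the loss function. Focal loss is common in applications such as object detection because it down-weights easy, well-classified examples so that training concentrates on the hard ones, which helps when classes are heavily imbalanced.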

In [11]:
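def focal_loss(y_true, y_pred):
    gamma = 2.0
    alpha = 0.25
    # Clip predictions away from 0 and 1 so the logs below stay finite
    y_pred = K.clip(y_pred, K.epsilon(), 1. - K.epsilon())
    # pt_1: predicted probability where the label is 1 (set to 1 elsewhere, adding no loss)
    pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
    # pt_0: predicted probability where the label is 0 (set to 0 elsewhere, adding no loss)
    pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
    positive_loss = K.sum(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1))
    negative_loss = K.sum((1 - alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0))
    return -positive_loss - negative_loss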
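To see the down-weighting effect of the focal loss above, here is a scalar sketch in plain Python for a single positive label, using the same `alpha` and `gamma`. The function names are hypothetical and the two probabilities are illustrative values, not model outputs.

```python
import math

def cross_entropy_term(p):
    """Cross-entropy for one positive label predicted with probability p."""
    return -math.log(p)

def focal_term(p, alpha=0.25, gamma=2.0):
    """Focal loss for one positive label: cross-entropy scaled by alpha * (1 - p) ** gamma."""
    return alpha * (1.0 - p) ** gamma * cross_entropy_term(p)

easy, hard = 0.9, 0.1  # a confident correct prediction and a badly wrong one
for p in (easy, hard):
    print('p={} cross-entropy={:.4f} focal={:.6f}'.format(p, cross_entropy_term(p), focal_term(p)))

# How much more the hard example costs than the easy one under each loss
ratio_ce = cross_entropy_term(hard) / cross_entropy_term(easy)
ratio_focal = focal_term(hard) / focal_term(easy)
print('hard/easy ratio: cross-entropy={:.1f} focal={:.1f}'.format(ratio_ce, ratio_focal))
```

With `gamma = 2`, the hard-to-easy ratio grows by a factor of `(0.9 / 0.1) ** 2 = 81` relative to plain cross-entropy, so easy examples contribute far less to the total.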

In [12]:
# from keras.layers import Input, concatenate, Activation, Dense, LSTM, BatchNormalization, Embedding, Dropout, Lambda, Bidirectional
# from keras.metrics import categorical_accuracy, top_k_categorical_accuracy
# from keras.models import Model
# from keras.optimizers import Adam
# from keras_metrics import precision, f1_score, false_negative, true_positive, false_positive, true_negative

# # Text model
# text_input = Input(shape=(MAX_LEN,), dtype=tf.string)

# elmo_embedding = Lambda(ElmoEmbedding, output_shape=(MAX_LEN, 1024))(text_input)

# text_lstm = LSTM(
#     input_shape=(MAX_LEN, 1024,),
#     units=512,
#     recurrent_dropout=0.2,
#     dropout=0.2)(elmo_embedding)

# text_dense = Dense(200, activation='relu')(text_lstm)

# text_output = Dense(record_count, activation='sigmoid')(text_dense)

# text_model = Model(
#     inputs=text_input, 
#     outputs=text_output
# )



# from sklearn.metrics import hamming_loss

# from keras.optimizers import Adam
# adam = Adam(lr=0.0005)

# text_model.compile(
#     loss='binary_crossentropy',
#     optimizer=adam,
#     metrics=[
#         precision_m,
#         recall_m,
#         f1_m,
#         'mae',
#         abs_KL_div,
#         'accuracy'
#     ]
# )
# 
# text_model.summary()

## Compute Sample and Class Weights

Because the classes are skewed and each example can carry multiple labels, we use sample or class weights, which weight the importance of each row according to the relative frequency of its labels.

In [13]:
from sklearn.utils.class_weight import compute_sample_weight

train_sample_weights = compute_sample_weight('balanced', y_train)
test_sample_weights = compute_sample_weight('balanced', y_test)

NameError: name 'y_train' is not defined

In [14]:
tag_counts = json.load(open('data/label_counts.50000.json'))

max_count = max(map(lambda x: x[1], tag_counts.values()))

class_weights = {}
for i, counts in tag_counts.items():
    class_weights[int(i)] = 1 / (counts[1] / max_count)

list(class_weights.items())[0:10]

[(0, 6.324753295148713),
 (1, 3.536775488153122),
 (2, 4.900652336648487),
 (3, 6.779495524816924),
 (4, 7.820803822851778),
 (5, 2.3805714285714283),
 (6, 5.247752648153449),
 (7, 5.188926003510162),
 (8, 5.743686156545717),
 (9, 6.54049810889888)]
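The class weights above are inverse frequencies scaled by the most common tag's count. The sketch below applies the same formula to the positive counts of labels 0 to 9 copied from the `label_counts` output. The maximum here is taken over those ten labels only, so the absolute weights differ from the notebook's (which uses all 100 labels), but the ratios between weights are the same. The function name is hypothetical.

```python
# Positive counts for labels 0-9, copied from the label_counts output
positive_counts = {0: 14491, 1: 25914, 2: 18702, 3: 13519, 4: 11719,
                   5: 38500, 6: 17465, 7: 17663, 8: 15957, 9: 14013}

def inverse_frequency_weights(counts):
    """Weight each label by max_count / count, so the most common label gets 1.0."""
    max_count = max(counts.values())
    return {label: max_count / count for label, count in counts.items()}

weights = inverse_frequency_weights(positive_counts)
for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```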

## Establish a Log for Performance

## Simple Baseline Model using `Conv1D`

In [18]:
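# Import everything from tensorflow.keras; `tf` is an alias, not an importable package name
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from tensorflow.keras.layers import Input, Embedding, GlobalMaxPool1D, Conv1D, Dense, Activation, Dropout, Lambda
from tensorflow.keras.models import Model, Sequential
# multi_gpu_model is only needed by the commented-out parallel model below
# from tensorflow.keras.utils import multi_gpu_model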

def build_model(max_len, label_count, dropout_ratio=0.1, filter_length=50):
    
    text_input = Input(shape=(max_len,), dtype=tf.string)

    elmo_embedding = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(text_input)

    dropout = Dropout(dropout_ratio)(elmo_embedding)

    conv1d = Conv1D(filter_length, 4, padding='valid', activation='relu', strides=1)(dropout)

    global_1d = GlobalMaxPool1D()(conv1d)

    dense = Dense(label_count, activation='sigmoid')(global_1d)

    text_model = Model(
        inputs=text_input, 
        outputs=dense
    )
    
    #parallel_model = multi_gpu_model(text_model, gpus=2)

    text_model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=[
            'categorical_accuracy',
            precision_m,
            recall_m,
            f1_m,
            'mae',
            abs_KL_div,
            'accuracy',
        ]
    )
    text_model.summary()
    
    return text_model
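`build_model` references `precision_m`, `recall_m` and `f1_m`, whose definitions are not shown in this notebook. The pure-Python sketch below illustrates what such metrics typically compute for multi-label predictions: counts of true positives, false positives and false negatives pooled over every label of every example. It is not the author's implementation; the function name, the 0.5 cutoff and the toy data are assumptions.

```python
def precision_recall_f1(y_true, y_prob, cutoff=0.5):
    """Pooled precision, recall and F1 over all labels of all examples."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            predicted = p > cutoff
            if predicted and t == 1:
                tp += 1
            elif predicted and t == 0:
                fp += 1
            elif not predicted and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data: two examples, three labels each
y_true_toy = [[1, 0, 0], [0, 1, 1]]
y_prob_toy = [[0.9, 0.6, 0.1], [0.2, 0.8, 0.3]]
precision, recall, f1 = precision_recall_f1(y_true_toy, y_prob_toy)
print(precision, recall, f1)
```

A Keras metric would compute the same quantities on tensors per batch, so its reported values are averages of batch-level scores rather than a single pooled score.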

In [19]:
text_model = build_model(MAX_LEN, y_train.shape[1])

callbacks = [
    #ReduceLROnPlateau(), 
    EarlyStopping(patience=4), 
    #ModelCheckpoint(filepath='model-conv1d.h5', save_best_only=True)
]

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0803 01:46:57.610231 140282935768832 saver.py:1489] Saver not created because there are no variables in the graph to restore


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 100)               0         
_________________________________________________________________
lambda_2 (Lambda)            (None, 100, 1024)         0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 100, 1024)         0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 97, 50)            204850    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 100)               5100      
Total params: 209,950
Trainable params: 209,950
Non-trainable params: 0
_________________________________________________________________
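The shapes and parameter counts in the summary above can be reproduced by hand. This sketch assumes the standard formulas for a `valid` 1-D convolution and a fully connected layer with biases; the helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def conv1d_params(in_channels, kernel_size, filters):
    """One kernel_size x in_channels weight block plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_units, out_units):
    """One weight per input/output pair plus one bias per output."""
    return in_units * out_units + out_units

conv_len = conv1d_output_length(input_length=100, kernel_size=4)
conv_p = conv1d_params(in_channels=1024, kernel_size=4, filters=50)
dense_p = dense_params(in_units=50, out_units=100)
total = conv_p + dense_p

print(conv_len, conv_p, dense_p, total)
```

Note that the `filter_length=50` argument of `build_model` is passed to `Conv1D` as the number of filters, while the kernel size is fixed at 4. The summary also reports no parameters for the Lambda layer, so the ELMo weights are not counted in the total.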


In [17]:
history = text_model.fit(
    X_train, 
    y_train,
    #class_weight=class_weights,
    epochs=20,
    batch_size=BATCH_SIZE,
    validation_data=(X_test, y_test),
    callbacks=callbacks
)

Train on 881280 samples, validate on 155520 samples
Epoch 1/20
  7168/881280 [..............................] - ETA: 4:26:52 - loss: 0.0944 - categorical_accuracy: 0.0389 - precision_m: 0.0000e+00 - recall_m: 0.0000e+00 - f1_m: 0.0000e+00 - mean_absolute_error: 0.0393 - abs_KL_div: 31.6757 - true_positive: 0.0000e+00 - false_positive: 0.0000e+00 - true_negative: 7056.0000 - false_negative: 112.0000 - acc: 0.9796

KeyboardInterrupt: 

In [None]:
# from keras.callbacks import EarlyStopping

# EPOCHS = 4

# history = text_model.fit(
#     X_train,
#     y_train,
#     epochs=EPOCHS,
#     batch_size=BATCH_SIZE,
#     callbacks=[
#         EarlyStopping(monitor='loss', patience=1, min_delta=0.0001),
#         EarlyStopping(monitor='val_loss', patience=1, min_delta=0.0001),
#     ],
#     class_weight=class_weights,
#     # sample_weight=train_sample_weights,
#     validation_data=(X_test, y_test)
# )

In [None]:
accr = text_model.evaluate(X_test, y_test, sample_weight=test_sample_weights)
list(zip(accr, text_model.metrics_names))

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

print(history.history)
# summarize history for accuracy
plt.plot(history.history['val_loss'])
plt.plot(history.history['f1_m'])
plt.plot(history.history['categorical_accuracy'])
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['precision_m'])
plt.title('model accuracy')
plt.ylabel('metric')
plt.xlabel('epoch')
plt.legend(['val_loss', 'f1', 'categorical accuracy', 'MAE', 'precision'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
import statistics

from sklearn.metrics import hamming_loss, jaccard_score

y_pred = text_model.predict(X_test)

best_cutoff = 0
max_score = 0
# Sweep candidate cutoffs and keep the one with the best mean Jaccard score
for cutoff in [0.0001, 0.001, 0.01, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8]:
    # Binarize the sigmoid outputs with NumPy; no TensorFlow session is needed
    y_pred_bin = (y_pred > cutoff).astype(int)
    print('Cutoff: {:,}'.format(cutoff))
    print('Hamming loss: {:,}'.format(
        hamming_loss(y_test, y_pred_bin)
    ))
    scores = []
    for j_type in ['micro', 'macro', 'weighted']:
        j_score = jaccard_score(y_test, y_pred_bin, average=j_type)
        print('Jaccard {} score: {:,}'.format(
            j_type,
            j_score
        ))
        scores.append(j_score)
    print('')
    mean_score = statistics.mean(scores)
    if mean_score > max_score:
        best_cutoff = cutoff
        max_score = mean_score

print('Best cutoff was: {:,} with mean jaccard score of {:,}'.format(best_cutoff, max_score))
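For intuition about what the cutoff sweep optimizes, here is a hand-rolled micro-averaged Jaccard score on toy data: the number of (example, label) pairs that are positive in both truth and prediction, divided by the number that are positive in either. The function names and the toy probabilities are hypothetical; the notebook itself relies on `sklearn.metrics.jaccard_score`.

```python
def binarize(y_prob, cutoff):
    """Turn probabilities into 0/1 predictions with a strict > cutoff."""
    return [[1 if p > cutoff else 0 for p in row] for row in y_prob]

def micro_jaccard(y_true, y_pred_bin):
    """Pooled intersection over union across all labels of all examples."""
    intersection = union = 0
    for true_row, pred_row in zip(y_true, y_pred_bin):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                intersection += 1
            if t == 1 or p == 1:
                union += 1
    return intersection / union if union else 0.0

y_true_toy = [[1, 0, 0], [0, 1, 1]]
y_prob_toy = [[0.9, 0.6, 0.1], [0.2, 0.8, 0.3]]

# Score each candidate cutoff and pick the best one
scores = {c: micro_jaccard(y_true_toy, binarize(y_prob_toy, c)) for c in (0.1, 0.5, 0.7)}
best = max(scores, key=scores.get)
print(scores, best)
```

On this toy data the highest cutoff wins because it removes a false positive without losing any true positive.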

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report, multilabel_confusion_matrix

y_pred = text_model.predict(X_test, batch_size=32, verbose=1)
y_pred_bool = np.where(y_pred > best_cutoff, 1, 0)

print(classification_report(y_test, y_pred_bool))

print(multilabel_confusion_matrix(y_test, y_pred_bool))

## View the Results

Now let's map the multi-hot-encoded tag predictions back to the text tags and view them alongside the text of the original posts, as a sanity check that the model really works.

In [None]:
predicted_tags = []
for pred in y_pred_bool:
    # Assumes index_tag maps a label index to its tag text; keys loaded from JSON are strings
    tags = [index_tag[str(i)] for i, val in enumerate(pred) if val == 1]
    predicted_tags.append(tags)

# Print each test post next to its predicted tags
for text, tags in zip(X_test, predicted_tags):
    print(' '.join(text), tags)