# Multi-label text classification with Keras
## by Rocco Schulz



https://blog.mimacom.com/text-classification/

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/learning-stack/Colab-ML-Playbook/blob/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/multi-label-classification-with-keras.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/learning-stack/Colab-ML-Playbook/blob/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/multi-label-classification-with-keras.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import keras

import os
%matplotlib inline

In [3]:
# Download data files
!wget https://github.com/learning-stack/Colab-ML-Playbook/blob/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/data/Questions.csv?raw=true
!wget https://github.com/learning-stack/Colab-ML-Playbook/blob/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/data/Tags.csv?raw=true


--2019-01-04 17:22:39--  https://github.com/learning-stack/Colab-ML-Playbook/blob/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/data/Questions.csv?raw=true
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/learning-stack/Colab-ML-Playbook/raw/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/data/Questions.csv [following]
--2019-01-04 17:22:39--  https://github.com/learning-stack/Colab-ML-Playbook/raw/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/data/Questions.csv
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/learning-stack/Colab-ML-Playbook/master/NLP/Performing%20Multi-label%20Text%20Classification%20with%20Keras/data/Questions.csv [following

In [4]:
df_questions = pd.read_csv('Questions.csv?raw=true', encoding='iso-8859-1')
df_tags = pd.read_csv('Tags.csv?raw=true', encoding='iso-8859-1')
df_questions.head(n=2)

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learn...,"<p>Last year, I read a blog post from <a href=..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demog...


In [5]:
grouped_tags = df_tags.groupby("Tag").size().reset_index(name='count')
grouped_tags.Tag.describe()

count                1315
unique               1315
top       crostons-method
freq                    1
Name: Tag, dtype: object

## Reducing the problem to the most common tags in the dataset
Rare tags have too few samples to produce reliable results, so only the 100 most frequent tags were kept. Even among these 100 tags the distribution is imbalanced, as some tags are used far more often than others.

 ![first vs last 5 tags](https://blog.mimacom.com/en/media/ef1a71298726a6f3babd46e9e2988fa1/first_vs_last_5_tags.png)

To address this imbalance we calculated class weights that are passed to the loss function of our model. Multiplying the per-class losses by these weights counters the imbalance, so that a false classification for the tag `algorithms` is as expensive as one for the tag `r`.
The calculated class weights are plotted against the tag counts below.

 ![class weights plotted vs class observations](https://blog.mimacom.com/en/media/ba1508ba102ac1f7ddb254e9ba06d5a3/class-weights.png)

There are alternative ways to address class imbalance. We could also have used resampling, either duplicating samples of under-represented classes or reducing the number of samples in over-represented classes, or we could have trained several models on balanced subsets of the data and averaged them (cf. Longadge2013). For resampling there is a scikit-learn compatible library, imbalanced-learn, whose documentation also illustrates the class imbalance problem and the supported resampling strategies.
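As a rough illustration of the resampling alternative, the sketch below duplicates randomly chosen rows that carry an under-represented label until every label reaches a minimum count. It is a plain NumPy sketch for a multi-label indicator matrix and does not use the imbalanced-learn API; the function name `oversample_rare_labels` and the `min_count` parameter are made up for this example.

```python
import numpy as np

def oversample_rare_labels(x, y, min_count, seed=0):
    """Duplicate rows so that every label with at least one positive row
    ends up with at least `min_count` positive rows."""
    rng = np.random.default_rng(seed)
    extra = []  # indices of rows that will be duplicated
    counts = y.sum(axis=0)
    for label in np.argsort(counts):  # process the rarest label first
        extra_idx = np.array(extra, dtype=int)
        # Recount: rows duplicated for earlier labels may carry this label too
        current = int(y[:, label].sum() + y[extra_idx, label].sum())
        candidates = np.flatnonzero(y[:, label])
        if len(candidates) == 0 or current >= min_count:
            continue
        picks = rng.choice(candidates, size=min_count - current, replace=True)
        extra.extend(int(i) for i in picks)
    idx = np.concatenate([np.arange(len(y)), np.array(extra, dtype=int)])
    return x[idx], y[idx]

# Toy data: label 0 is frequent, label 1 is rare
x = np.arange(6).reshape(6, 1)
y = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [0, 1]])
x_res, y_res = oversample_rare_labels(x, y, min_count=4)
print(y.sum(axis=0), "->", y_res.sum(axis=0))
```

Because a duplicated row can carry several tags, oversampling one label may also raise the counts of others, which is one reason class weights are simpler to apply in the multi-label setting.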

In [8]:
num_classes = 100
grouped_tags = df_tags.groupby("Tag").size().reset_index(name='count')
most_common_tags = grouped_tags.nlargest(num_classes, columns="count")
# Keep only the rows whose tag is among the most common ones
df_tags = df_tags[df_tags.Tag.isin(most_common_tags.Tag)]
df_tags

Unnamed: 0,Id,Tag
0,1,bayesian
3,2,distributions
7,4,distributions
8,4,statistical-significance
9,6,machine-learning
10,7,dataset
16,10,ordinal
21,17,anova
22,17,chi-squared
23,17,generalized-linear-model


## Preparing the contents of the dataframe

The question body contains HTML tags that we don't want to feed into our model. We therefore strip all tags and combine the title and the question body into a single field for simplicity.

In [0]:
import re 

def strip_html_tags(body):
    regex = re.compile('<.*?>')
    return re.sub(regex, '', body)

df_questions['Body'] = df_questions['Body'].apply(strip_html_tags)
df_questions['Text'] = df_questions['Title'] + ' ' + df_questions['Body']
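The regular expression above removes tags but leaves HTML entities such as `&amp;` in the text. A possible refinement, sketched here with the standard library only, decodes entities after the tags have been stripped. The name `strip_html` is hypothetical and not part of the notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')  # same non-greedy pattern as above, compiled once

def strip_html(body):
    """Remove tags first, then decode entities such as &amp; or &lt;."""
    return html.unescape(TAG_PATTERN.sub('', body))

print(strip_html('<p>Mean &amp; variance of <em>X</em></p>'))  # Mean & variance of X
```

Decoding after stripping matters: if entities were decoded first, an escaped `&lt;` would turn into `<` and could be mistaken for the start of a tag.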

In [0]:
# Denormalize the tables: attach each question's tags as a new 'Tags' column

def tags_for_question(question_id):
    return df_tags[df_tags['Id'] == question_id].Tag.values

def add_tags_column(row):
    row['Tags'] = tags_for_question(row['Id'])
    return row

df_questions = df_questions.apply(add_tags_column, axis=1)
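The row-wise lookup above filters `df_tags` once per question, which can become slow on a dataset of this size. One possible alternative, sketched on two toy frames (`questions` and `tags` are illustrative stand-ins for `df_questions` and `df_tags`), groups the tags once and maps them onto the question ids. Note that this variant yields Python lists instead of NumPy arrays.

```python
import pandas as pd

questions = pd.DataFrame({"Id": [1, 2, 3], "Title": ["a", "b", "c"]})
tags = pd.DataFrame({"Id": [1, 1, 3], "Tag": ["r", "regression", "bayesian"]})

# One pass over the tag table: question id -> list of tags
tags_by_id = tags.groupby("Id")["Tag"].apply(list)

# Questions without any remaining tag map to NaN; replace those with an empty list
questions["Tags"] = questions["Id"].map(tags_by_id).apply(
    lambda t: t if isinstance(t, list) else []
)
print(questions)
```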

In [10]:
pd.set_option('display.max_colwidth', 400)
df_questions[['Id', 'Text', 'Tags']].head()

Unnamed: 0,Id,Text,Tags
0,6,"The Two Cultures: statistics vs. machine learning? Last year, I read a blog post from Brendan O'Connor entitled ""Statistics vs. Machine Learning, fight!"" that discussed some of the differences between the two fields. Andrew Gelman responded favorably to this:\n\nSimon Blomberg: \n\n\n From R's fortunes\n package: To paraphrase provocatively,\n 'machine learning is statistics minus\n any c...",[machine-learning]
1,21,"Forecasting demographic census What are some of the ways to forecast demographic census with some validation and calibration techniques?\n\nSome of the concerns:\n\n\nCensus blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?\nif let's say I have census data\ndating back to 4 - 5 census periods,\nhow far ca...",[forecasting]
2,22,Bayesian and frequentist reasoning in plain English How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?\n,[bayesian]
3,31,"What is the meaning of p values and t values in statistical tests? After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests. It seems that students easily learn how to perform the calculations required by a given test but get hung up on interpreting the resul...","[hypothesis-testing, t-test, p-value, interpretation]"
4,36,"Examples for teaching: Correlation does not mean causation There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:\n\n\nnumber of storks and birth rate in Denmark;\nnumber of priests in America and alcoholism;\nin the start of the 20th century it was noted that there was a strong correlation between 'N...",[correlation]


## Tokenizing the text
The text has to be vectorized so that we can feed it into our model. Keras comes with [several text preprocessing classes](https://keras.io/preprocessing/text/) that we can use for that.

The labels need to be encoded as well, so that the 100 labels are represented as 100 binary values in an array. This can be done with the [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) from the sklearn library.

In [0]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df_questions.Tags)
labels = multilabel_binarizer.classes_

maxlen = 180
max_words = 5000
tokenizer = Tokenizer(num_words=max_words, lower=True)
tokenizer.fit_on_texts(df_questions.Text)

def get_features(text_series):
    """
    transforms text data to feature_vectors that can be used in the ml model.
    tokenizer must be available.
    """
    sequences = tokenizer.texts_to_sequences(text_series)
    return pad_sequences(sequences, maxlen=maxlen)


def prediction_to_label(prediction):
    tag_prob = [(labels[i], prob) for i, prob in enumerate(prediction.tolist())]
    return dict(sorted(tag_prob, key=lambda kv: kv[1], reverse=True))

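In the snippet above, only the roughly 5000 most frequent words (`max_words`) are used when texts are converted to sequences; rarer words are dropped. Every sequence is then padded or truncated to a length of 180 tokens (`maxlen`).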
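To make the vectorization step more concrete, here is a simplified pure-Python sketch of what the tokenizer and `pad_sequences` do with their default settings: frequent words get small integer ids, unknown words are dropped, and sequences are padded and truncated at the front. It is an illustration only, not the Keras implementation, and the helper names are made up.

```python
from collections import Counter

def build_word_index(texts, max_words):
    """Map the most frequent words to ids 1..max_words-1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words - 1)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}

def to_padded_sequence(text, word_index, maxlen):
    """Convert a text to ids, keep the last `maxlen` ids, and left-pad with zeros."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

texts = ["the mean of the sample", "the mean of the population mean"]
word_index = build_word_index(texts, max_words=4)
print(word_index)                                                     # {'the': 1, 'mean': 2, 'of': 3}
print(to_padded_sequence("the sample mean", word_index, maxlen=5))    # [0, 0, 0, 1, 2]
print(to_padded_sequence("the mean of the mean", word_index, maxlen=3))  # [3, 1, 2]
```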

The labels are encoded as well: the fitted MultiLabelBinarizer turns the tags of each question into an array of 100 binary elements, one per label.
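The label encoding can be tried out on a toy example with three questions and three distinct tags (the tag lists below are made up). Each row of the result has a 1 for every tag attached to that question, and `inverse_transform` maps the rows back to tags.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_tags = [["bayesian"], ["regression", "r"], []]
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(toy_tags)

print(mlb.classes_)  # ['bayesian' 'r' 'regression']
print(encoded)
# [[1 0 0]
#  [0 1 1]
#  [0 0 0]]
print(mlb.inverse_transform(encoded))  # [('bayesian',), ('r', 'regression'), ()]
```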

Finally, we split our data into a training set and a test set to conclude the data preparation:

In [12]:
from sklearn.model_selection import train_test_split

x = get_features(df_questions.Text)
y = multilabel_binarizer.transform(df_questions.Tags)
print(x.shape)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=9000)

(85085, 180)


## Imbalanced Classes
Some tags occur more often than others, so the classes are not well balanced. This can be addressed by applying class weights that weight less frequent tags higher than very frequent ones.

In [13]:
most_common_tags['class_weight'] = len(df_tags) / most_common_tags['count']
class_weight = {}
for index, label in enumerate(labels):
    class_weight[index] = most_common_tags[most_common_tags['Tag'] == label]['class_weight'].values[0]
    
most_common_tags.head()

Unnamed: 0,Tag,count,class_weight
74,r,13236,11.552811
79,regression,10959,13.953189
41,machine-learning,6089,25.112991
98,time-series,5559,27.507285
71,probability,4217,36.261086
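The idea of multiplying class weights with the per-class losses can be sketched in NumPy as follows. This is only an illustration of the principle with made-up numbers; how Keras applies the `class_weight` argument internally may differ.

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, class_weights, eps=1e-7):
    """Mean binary cross-entropy with one weight per label column."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_label = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(per_label * class_weights))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.1, 0.4]])  # the second label is missed in row 2

uniform = weighted_binary_crossentropy(y_true, y_pred, np.array([1.0, 1.0]))
weighted = weighted_binary_crossentropy(y_true, y_pred, np.array([1.0, 5.0]))
print(uniform, weighted)
```

With a weight of 5 on the second label, the same mistake contributes five times as much to the loss, which pushes the optimizer to pay more attention to that label.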


## Model Creation

There are various approaches to text classification, and in practice you should try several of them to see which one works best for the task at hand. For brevity we focus on Keras in this article, but we encourage you to try LightGBM, Support Vector Machines or Logistic Regression with n-gram or tf-idf input features. Such shallow classifiers can be trained as binary classifiers, one for each category; running all of them yields a probability for each category. Sklearn comes with the OneVsRestClassifier, which supports this strategy. It is briefly demonstrated in our Kaggle notebook [multi-label classification with sklearn](https://www.kaggle.com/roccoli/multi-label-classification-with-sklearn), which you may use as a starting point for further experimentation.
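A minimal sketch of the one-vs-rest strategy with tf-idf features and logistic regression might look like this. The four texts and their tags are invented toy data, and the hyperparameters are arbitrary choices rather than tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "how to fit a linear regression in r",
    "arima forecast for monthly sales time series",
    "plotting regression residuals in r",
    "seasonal decomposition of a time series",
]
tags = [["r", "regression"], ["time-series"], ["r", "regression"], ["time-series"]]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tags)

# One binary logistic regression per tag, all sharing the same tf-idf features
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, y)

proba = clf.predict_proba(["forecasting a time series"])[0]
print(dict(zip(binarizer.classes_, proba.round(2))))
```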

## Word Embeddings
In the previous steps we tokenized our text and mapped each token to an integer index in the vocabulary. Feeding these indices to a model as one-hot vectors would yield sparse, binary representations which mainly contain zeros and are high-dimensional (one dimension per word in the vocabulary).

Word embeddings on the other hand are low dimensional as they represent tokens as dense floating point vectors and thus pack more information into fewer dimensions. Words with similar meanings are associated with similar representations. Word embeddings can be obtained in two ways:

1.   Learn word embeddings together with the weights of the neural network
2.   Load pretrained word embeddings which were precomputed as part of a different machine learning task.

We decided to learn new word embeddings as the dataset contains vocabulary which is specific to the domain of statistics and we didn't expect to benefit from pretrained embeddings which use a broader vocabulary.
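The relationship between one-hot vectors and embeddings can be shown in a few lines of NumPy: looking up a row of the embedding matrix gives the same result as multiplying a one-hot vector with that matrix, only without materializing the sparse vector. The sizes and the random matrix below are placeholders; in the real model the matrix is learned during training.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))  # stand-in for learned weights

token_ids = np.array([4, 1, 5])  # a tokenized text of three words

one_hot = np.eye(vocab_size)[token_ids]          # sparse, shape (3, 6)
dense_via_matmul = one_hot @ embedding_matrix    # shape (3, 3)
dense_via_lookup = embedding_matrix[token_ids]   # same values, no one-hot needed

print(one_hot.shape, dense_via_lookup.shape)  # (3, 6) (3, 3)
print(np.allclose(dense_via_matmul, dense_via_lookup))  # True
```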

## Simple Baseline
We started with a simple model which only consists of an embedding layer, a dropout layer to prevent overfitting, a global max pooling layer to reduce the size, and one dense layer with a sigmoid activation to produce probabilities for each of the 100 classes that we want to predict.

In [19]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalMaxPool1D, Dropout
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint

model = Sequential()
model.add(Embedding(max_words, 20, input_length=maxlen))
model.add(Dropout(0.15))
model.add(GlobalMaxPool1D())
model.add(Dense(num_classes, activation='sigmoid'))

model.compile(optimizer=Adam(0.015), loss='binary_crossentropy', metrics=['categorical_accuracy'])
callbacks = [
    ReduceLROnPlateau(),
    EarlyStopping(patience=4),
    ModelCheckpoint(filepath='model-simple.h5', save_best_only=True)
]

history = model.fit(x_train, y_train,
                    class_weight=class_weight,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.1,
                    callbacks=callbacks)



Train on 61261 samples, validate on 6807 samples
Epoch 1/20
...
Epoch 20/20


In [20]:
# With this simple model the categorical accuracy was 22 % on the held out test dataset. This is better than guessing but not really satisfactory.
simple_model = keras.models.load_model('model-simple.h5')
metrics = simple_model.evaluate(x_test, y_test)
print("{}: {}".format(simple_model.metrics_names[0], metrics[0]))
print("{}: {}".format(simple_model.metrics_names[1], metrics[1]))

loss: 0.06268962885289744
categorical_accuracy: 0.21660692249102675
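The baseline ends in a sigmoid layer rather than a softmax because several tags can apply to the same question. The NumPy sketch below, using made-up logits for three tags, shows the difference: sigmoid scores each tag independently, while softmax forces the tags to compete for a total probability of 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the maximum for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.5, -3.0])  # hypothetical scores for three tags

print(sigmoid(logits).round(3), sigmoid(logits).sum().round(3))
print(softmax(logits).round(3), softmax(logits).sum().round(3))
```

With sigmoid, the first two tags can both receive a high probability, which matches the multi-label setting; with softmax, raising one tag necessarily lowers the others.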


## 1D Convolutional Neural Network

In [14]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, GlobalMaxPool1D, Dropout, Conv1D
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint

num_filters = 300  # number of convolution filters (output channels)
kernel_size = 3    # each filter spans 3 consecutive tokens

model = Sequential()
model.add(Embedding(max_words, 20, input_length=maxlen))
model.add(Dropout(0.1))
model.add(Conv1D(num_filters, kernel_size, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPool1D())
model.add(Dense(num_classes))
model.add(Activation('sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])
model.summary()

callbacks = [
    ReduceLROnPlateau(), 
    EarlyStopping(patience=4), 
    ModelCheckpoint(filepath='model-conv1d.h5', save_best_only=True)
]

history = model.fit(x_train, y_train,
                    class_weight=class_weight,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.1,
                    callbacks=callbacks)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 180, 20)           100000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 180, 20)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 178, 300)          18300     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 300)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
activation_1 (Activation)    (None, 100)               0         
=================================================================
Total params: 148,400
Trainable params: 148,400
Non-trainable params: 0
_________________________________________________________________
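The shapes and parameter counts in the model summary above can be checked by hand. The small helpers below (names chosen for this example) reproduce the numbers for a convolution with 'valid' padding, a kernel spanning 3 tokens, 20 input channels from the embedding and 300 filters.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def conv1d_param_count(kernel_size, in_channels, filters):
    """One weight per kernel position, input channel and filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

embedding_params = 5000 * 20      # vocabulary size * embedding dimension
dense_params = 300 * 100 + 100    # weights plus biases of the output layer

print(conv1d_output_length(180, 3))    # 178
print(conv1d_param_count(3, 20, 300))  # 18300
print(embedding_params + conv1d_param_count(3, 20, 300) + dense_params)  # 148400
```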

In [15]:
# This improved the categorical accuracy to 33 %.
cnn_model = keras.models.load_model('model-conv1d.h5')
metrics = cnn_model.evaluate(x_test, y_test)
print("{}: {}".format(cnn_model.metrics_names[0], metrics[0]))
print("{}: {}".format(cnn_model.metrics_names[1], metrics[1]))

loss: 0.051077511381576166
categorical_accuracy: 0.3342539813093022
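Categorical accuracy compares the position of the highest predicted probability with the argmax of the target vector, so for questions with several tags it only checks a single label. A metric that counts every tag, such as the micro-averaged F1 score, may therefore be a useful complement. The NumPy sketch below uses made-up predictions and an assumed decision threshold of 0.5; it was not part of the original evaluation.

```python
import numpy as np

def micro_f1(y_true, y_prob, threshold=0.5):
    """Micro-averaged F1: pool true/false positives over all labels and samples."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.7]])
score = micro_f1(y_true, y_prob)
print(round(score, 3))  # 0.667
```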


## Testing the Model on New Cross Validated Questions
We can send a request to the Stack Exchange API to get a new unanswered question and list the tags associated with the question:

In [16]:
import requests
import random
url = "https://api.stackexchange.com/2.2/questions/unanswered?pagesize=10&order=desc&sort=votes&site=stats&filter=!-MOiNm40F1U6n0W(EFNR1)GdsWAepKpT_"

data = requests.get(url).json()
item = random.choice(data.get('items'))
q = item.get('title') + " " + strip_html_tags(item.get('body'))
print(q)
print(item.get('tags'))

$ARIMA(p,d,q)+X_t$, Simulation over Forecasting period I have time series data and I used an $ARIMA(p,d,q)+X_t$ as the model to fit the data. The $X_t$ is an indicator random variable that is either 0 (when I don’t see a rare event) or 1 (when I see the rare event). Based on previous observations that I have for $X_t$ , I can develop a model for $X_t$ using Variable Length Markov Chain methodology. This enables me to simulate the $X_t$ over the forecasting period and gives a sequence of zeros and ones. Since this is a rare event, I will not see $X_t=1$  often. I can forecast and obtain the prediction intervals based on the simulated values for $X_t$.   

Question:  

How can I develop an efficient simulation procedure to take into account the occurrence of 1’s in the simulated $X_t$ over the forecasting period? I need to obtain the mean and the forecasting intervals.   

The probability of observing 1 is too small for me to think that the regular Monte Carlo simulation will work well i

The question is: "Simulation over Forecasting period...Thank you."

And the author tagged the question with: ['time-series', 'forecasting', 'simulation']

Now let's see which tags our models predict for the given text. We feed the question into our convolutional model and into the simple model to compare the actual tags with the computed predictions and to see how both models' predictions differ.

In [21]:
f = get_features([q])
p1 = prediction_to_label(cnn_model.predict(f)[0])
p2 = prediction_to_label(simple_model.predict(f)[0])
df = pd.DataFrame()
df['label'] = list(p1.keys())
df['p_cnn'] = list(p1.values())
df['p_simple'] = df.label.apply(lambda label : p2.get(label))
df['weighted'] = (2 * df['p_cnn'] + df['p_simple']) / 3
df.sort_values(by='p_cnn', ascending=False)[:10]

Unnamed: 0,label,p_cnn,p_simple,weighted
0,forecasting,0.577815,0.326433,0.494021
1,simulation,0.568521,0.108114,0.415052
2,time-series,0.353661,0.256785,0.321369
3,r,0.185621,0.129774,0.167005
4,monte-carlo,0.157432,0.058432,0.124432
5,prediction,0.117087,0.040378,0.091517
6,bootstrap,0.050716,0.002933,0.034789
7,self-study,0.04653,0.005762,0.032941
8,probability,0.039894,0.010047,0.029945
9,mcmc,0.036926,0.176397,0.083417


**So, the CNN model is better than the simple model at estimating the three actual tags.**

There are a few noteworthy things about these results:

* The first three tags are quite obvious as they are part of the question text. It is thus not a surprise that these were predicted with high confidence.
* MCMC is an interesting prediction as it isn't mentioned in the text but fits the question well (MCMC stands for Markov Chain Monte Carlo). The ARIMA tag has a low confidence score despite being a word in the text.
* Our simple model predicted the time-series tag with about 26% confidence. Both models also assign a noticeable probability to r (about 19% for the CNN and 13% for the simple model), which is the most frequent tag in our training set, although the author did not tag the question with it. This is an indicator that some bias towards the majority class remains despite the class weights that we used in the training phase.
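To turn the probabilities in the table above into actual tag suggestions, one option is to apply a threshold to the weighted score. The sketch below copies a few rows from the table and uses an assumed threshold of 0.3, which is an arbitrary value picked for this example rather than a tuned one; the function name `suggest_tags` is hypothetical.

```python
# Probabilities copied from the table above (subset of rows)
p_cnn = {"forecasting": 0.577815, "simulation": 0.568521,
         "time-series": 0.353661, "r": 0.185621, "mcmc": 0.036926}
p_simple = {"forecasting": 0.326433, "simulation": 0.108114,
            "time-series": 0.256785, "r": 0.129774, "mcmc": 0.176397}

def suggest_tags(p_cnn, p_simple, threshold=0.3, cnn_weight=2.0, simple_weight=1.0):
    """Weighted average of both models, then keep the tags above the threshold."""
    total = cnn_weight + simple_weight
    weighted = {
        tag: (cnn_weight * p_cnn[tag] + simple_weight * p_simple[tag]) / total
        for tag in p_cnn
    }
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, prob in ranked if prob >= threshold]

print(suggest_tags(p_cnn, p_simple))  # ['forecasting', 'simulation', 'time-series']
```

For this question the thresholded suggestions coincide with the three tags chosen by the author; on other questions a different threshold may work better.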