
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Data cleansing, feature engineering and 1D CNN code borrowed from https://www.kaggle.com/code/shree77/multi-label-classification

# Multi-label text classification with keras


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import keras

import os
#print(os.listdir("../input"))
%matplotlib inline

In [None]:
questions_df = pd.read_csv("drive/My Drive/Colab Notebooks/Data/Questions.csv", encoding='iso-8859-1')
tags_df = pd.read_csv("drive/My Drive/Colab Notebooks/Data/Tags.csv", encoding='iso-8859-1')

In [None]:
questions_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learning?,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Machine Learning, fight!""</a> that discussed some of the differences between the two fields. <a href=""http://andrewgelman.com/2008/12/machine_learnin/"">Andrew Gelman responded favorably to this</a>:</p>\..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,"<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?</li>\n<li>if let's say I have census data\ndating back to 4 - 5 census periods,\nhow far ca..."
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain English,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values in statistical tests?,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests. It seems that students easily learn how to perform the calculations required by a given test but get hung up on interpreting the results. Many computerized tools report test results in terms of ""p ..."
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not mean causation,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth rate in Denmark;</li>\n<li>number of priests in America and alcoholism;</li>\n<li>in the start of the 20th century it was noted that there was a strong correlation between 'Number of radios' and 'Numb..."


In [None]:
grouped_tags = tags_df.groupby("Tag").size().reset_index(name='count')
grouped_tags.Tag.describe()

count     1315
unique    1315
top         2d
freq         1
Name: Tag, dtype: object

## Reducing the problem to the most common tags in the dataset
We keep only the 20 most frequent tags (an arbitrarily picked cutoff), because rare tags simply have too few samples to give reliable results.
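
As a rough illustration of this filtering step, here is a minimal plain-Python sketch on made-up data. The `question_tags` dictionary and `top_n = 2` are hypothetical and not taken from the dataset. It keeps only the most frequent tags and shows that a question whose tags are all rare ends up with an empty tag list.

```python
from collections import Counter

# Hypothetical toy data: question id -> tags (not from the real dataset).
question_tags = {
    1: ["r", "regression"],
    2: ["r", "bayesian"],
    3: ["regression", "r"],
    4: ["census"],
}
top_n = 2  # stands in for num_classes

# Count how often each tag occurs across all questions.
tag_counts = Counter(tag for tags in question_tags.values() for tag in tags)

# Keep the top_n most frequent tags.
common_tags = {tag for tag, _ in tag_counts.most_common(top_n)}

# Drop every tag that is not among the most frequent ones.
filtered = {
    qid: [tag for tag in tags if tag in common_tags]
    for qid, tags in question_tags.items()
}

print(sorted(common_tags))
print(filtered)
```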

In [None]:
num_classes = 20  # can be increased: more classes keep more of the data, but add complexity and training time
grouped_tags = tags_df.groupby("Tag").size().reset_index(name='count')
most_common_tags = grouped_tags.nlargest(num_classes, columns="count")
tags_df.Tag = tags_df.Tag.apply(lambda tag : tag if tag in most_common_tags.Tag.values else None)
tags_df = tags_df.dropna()

## Preparing the contents of the dataframe

The question body contains HTML tags that we don't want to feed into our model. We therefore strip all tags and combine the title and the question body into a single field for simplicity.
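
A minimal sketch of this cleaning step on a made-up snippet. The sample string and the helper name `clean_html` are hypothetical. The call to `html.unescape` is an optional extra that the notebook itself does not apply; it is shown here because tag stripping alone leaves entities such as `&amp;` in the text.

```python
import html
import re

# Non-greedy pattern: matches one tag at a time, e.g. <p> or <a href="...">.
TAG_PATTERN = re.compile(r"<.*?>")

def clean_html(body: str) -> str:
    """Remove HTML tags, then decode entities like &amp; (optional extra step)."""
    without_tags = TAG_PATTERN.sub("", body)
    return html.unescape(without_tags)

sample = '<p>Bias &amp; variance in <a href="http://example.com">regression</a></p>'
print(clean_html(sample))
```

A regex like this is a pragmatic shortcut rather than a real HTML parser: it would also remove plain text that happens to sit between a `<` and a `>`.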

In [None]:
import re

def strip_html_tags(body):
    regex = re.compile('<.*?>')
    return re.sub(regex, '', body)

questions_df['Body'] = questions_df['Body'].apply(strip_html_tags)
questions_df['Text'] = questions_df['Title'] + ' ' + questions_df['Body']

In [None]:
# denormalize tables

def tags_for_question(question_id):
    return tags_df[tags_df['Id'] == question_id].Tag.values

def add_tags_column(row):
    row['Tags'] = tags_for_question(row['Id'])
    return row

questions_df = questions_df.apply(add_tags_column, axis=1)
questions_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,Text,Tags
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learning?,"Last year, I read a blog post from Brendan O'Connor entitled ""Statistics vs. Machine Learning, fight!"" that discussed some of the differences between the two fields. Andrew Gelman responded favorably to this:\n\nSimon Blomberg: \n\n\n From R's fortunes\n package: To paraphrase provocatively,\n 'machine learning is statistics minus\n any checking of models and\n assumptions'.\n -- Brian ...","The Two Cultures: statistics vs. machine learning? Last year, I read a blog post from Brendan O'Connor entitled ""Statistics vs. Machine Learning, fight!"" that discussed some of the differences between the two fields. Andrew Gelman responded favorably to this:\n\nSimon Blomberg: \n\n\n From R's fortunes\n package: To paraphrase provocatively,\n 'machine learning is statistics minus\n any c...",[machine-learning]
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,"What are some of the ways to forecast demographic census with some validation and calibration techniques?\n\nSome of the concerns:\n\n\nCensus blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?\nif let's say I have census data\ndating back to 4 - 5 census periods,\nhow far can i forecast it into the\nfutur...","Forecasting demographic census What are some of the ways to forecast demographic census with some validation and calibration techniques?\n\nSome of the concerns:\n\n\nCensus blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?\nif let's say I have census data\ndating back to 4 - 5 census periods,\nhow far ca...",[]
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain English,How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?\n,Bayesian and frequentist reasoning in plain English How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?\n,[bayesian]
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values in statistical tests?,"After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests. It seems that students easily learn how to perform the calculations required by a given test but get hung up on interpreting the results. Many computerized tools report test results in terms of ""p val...","What is the meaning of p values and t values in statistical tests? After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests. It seems that students easily learn how to perform the calculations required by a given test but get hung up on interpreting the resul...",[hypothesis-testing]
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not mean causation,"There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:\n\n\nnumber of storks and birth rate in Denmark;\nnumber of priests in America and alcoholism;\nin the start of the 20th century it was noted that there was a strong correlation between 'Number of radios' and 'Number of people in Insane Asylums'\n...","Examples for teaching: Correlation does not mean causation There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:\n\n\nnumber of storks and birth rate in Denmark;\nnumber of priests in America and alcoholism;\nin the start of the 20th century it was noted that there was a strong correlation between 'N...",[correlation]


In [None]:
pd.set_option('display.max_colwidth', 400)
questions_df[['Id', 'Text', 'Tags']].head()

Unnamed: 0,Id,Text,Tags
0,6,"The Two Cultures: statistics vs. machine learning? Last year, I read a blog post from Brendan O'Connor entitled ""Statistics vs. Machine Learning, fight!"" that discussed some of the differences between the two fields. Andrew Gelman responded favorably to this:\n\nSimon Blomberg: \n\n\n From R's fortunes\n package: To paraphrase provocatively,\n 'machine learning is statistics minus\n any c...",[machine-learning]
1,21,"Forecasting demographic census What are some of the ways to forecast demographic census with some validation and calibration techniques?\n\nSome of the concerns:\n\n\nCensus blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?\nif let's say I have census data\ndating back to 4 - 5 census periods,\nhow far ca...",[]
2,22,Bayesian and frequentist reasoning in plain English How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?\n,[bayesian]
3,31,"What is the meaning of p values and t values in statistical tests? After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests. It seems that students easily learn how to perform the calculations required by a given test but get hung up on interpreting the resul...",[hypothesis-testing]
4,36,"Examples for teaching: Correlation does not mean causation There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:\n\n\nnumber of storks and birth rate in Denmark;\nnumber of priests in America and alcoholism;\nin the start of the 20th century it was noted that there was a strong correlation between 'N...",[correlation]


## Tokenizing the text
The text has to be vectorized so that we can feed it into our model. Keras comes with [several text preprocessing classes](https://keras.io/preprocessing/text/) that we can use for that.
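
To make the vectorization concrete, here is a simplified plain-Python sketch of the idea: build a frequency-ranked vocabulary, map words to integer ids, and left-pad or left-truncate each sequence to a fixed length. It only approximates what the Keras `Tokenizer` and `pad_sequences` do with their default settings (the real tokenizer also filters punctuation, for instance). The helper names and toy sentences are hypothetical.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the most frequent words to ids 1..num_words-1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: idx for idx, word in enumerate(ranked, start=1)}

def to_sequence(text, vocab):
    """Replace each known word by its id; out-of-vocabulary words are dropped."""
    return [vocab[word] for word in text.lower().split() if word in vocab]

def pad(sequence, maxlen):
    """Keep the last maxlen ids and left-pad with zeros."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

texts = ["what is a p value", "what is a bayesian prior for a p value"]
vocab = build_vocab(texts, num_words=6)
for text in texts:
    print(pad(to_sequence(text, vocab), maxlen=5))
```

Every question ends up as a vector of the same length, which is what the embedding layer expects as input.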

The labels need to be encoded as well, so that the 20 labels are represented as 20 binary values in an array. This can be done with the [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) from the sklearn library.
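
As an illustration of the target format, the following plain-Python sketch builds the same kind of multi-hot vectors by hand. It mirrors the idea behind `MultiLabelBinarizer` (one column per class, in sorted order) but is not sklearn's implementation. The toy tag lists and helper names are hypothetical.

```python
def fit_classes(tag_lists):
    """Collect all distinct tags in sorted order, one column per tag."""
    return sorted({tag for tags in tag_lists for tag in tags})

def multi_hot(tags, classes):
    """Return a 0/1 vector with a 1 for every class present in `tags`."""
    present = set(tags)
    return [1 if cls in present else 0 for cls in classes]

tag_lists = [["machine-learning"], [], ["bayesian", "regression"]]
classes = fit_classes(tag_lists)
encoded = [multi_hot(tags, classes) for tags in tag_lists]

print(classes)
for row in encoded:
    print(row)
```

Because the columns follow the sorted class order rather than tag frequency, anything that names the columns later on should take its names from `multilabel_binarizer.classes_`.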

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(questions_df.Tags)
labels = multilabel_binarizer.classes_

maxlen = 80  # can be increased to improve accuracy, but training will take longer
max_words = 3000  # can be increased to improve accuracy, but training will take longer; the vocabulary could also be filtered better, e.g. with a PoS tagger that keeps only nouns and verbs
tokenizer = Tokenizer(num_words=max_words, lower=True)
tokenizer.fit_on_texts(questions_df.Text)

def get_features(text_series):
    """
    Transform text data into padded integer sequences that can be fed to the model.
    Relies on the module-level `tokenizer` having been fitted beforehand.
    """
    sequences = tokenizer.texts_to_sequences(text_series)
    return pad_sequences(sequences, maxlen=maxlen)


def prediction_to_label(prediction):
    tag_prob = [(labels[i], prob) for i, prob in enumerate(prediction.tolist())]
    return dict(sorted(tag_prob, key=lambda kv: kv[1], reverse=True))

In [None]:
from sklearn.model_selection import train_test_split

x = get_features(questions_df.Text)
y = multilabel_binarizer.transform(questions_df.Tags)
print(x.shape)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=9000)

(85085, 80)


## Imbalanced Classes
Some tags occur more often than others, so the classes are not well balanced. This imbalance can be addressed with class weights that weight less frequent tags higher than very frequent ones.
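
A minimal sketch of this weighting scheme, assuming the same inverse-frequency rule as the cell that follows (total number of tag assignments divided by the count of the tag). The counts below are made up for illustration.

```python
# Hypothetical tag counts (not the real dataset numbers).
tag_counts = {"r": 600, "regression": 300, "bayesian": 100}
total = sum(tag_counts.values())

# Classes are indexed in sorted order, matching the label columns.
labels = sorted(tag_counts)

# Inverse-frequency weight: rare tags get a larger weight.
class_weight = {
    index: total / tag_counts[label] for index, label in enumerate(labels)
}

for index, label in enumerate(labels):
    print(index, label, class_weight[index])
```

The dictionary is keyed by class index rather than by tag name, since that is the form in which Keras accepts `class_weight`.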

In [None]:
most_common_tags['class_weight'] = len(tags_df) / most_common_tags['count']
class_weight = {}
for index, label in enumerate(labels):
    class_weight[index] = most_common_tags[most_common_tags['Tag'] == label]['class_weight'].values[0]

most_common_tags.head()

Unnamed: 0,Tag,count,class_weight
986,r,13236,6.046162
1020,regression,10959,7.3024
669,machine-learning,6089,13.142881
1220,time-series,5559,14.395935
946,probability,4217,18.977235


## Building a 1D Convolutional Neural Network

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, Flatten, GlobalMaxPool1D, Dropout, Conv1D, SimpleRNN, GRU, LSTM, Bidirectional
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.losses import binary_crossentropy
from keras.optimizers import Adam

num_filters = 300  # number of convolution filters (the kernel size is 3)

model = Sequential()
model.add(Embedding(max_words, 20, input_length=maxlen))
model.add(Dropout(0.1))
model.add(Conv1D(num_filters, 3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPool1D())
model.add(Dense(num_classes))
model.add(Activation('sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])
model.summary()

callbacks = [
    ReduceLROnPlateau(),
    EarlyStopping(patience=4),
    ModelCheckpoint(filepath='model-conv1d.h5', save_best_only=True)
]

history = model.fit(x_train, y_train,
                    class_weight=class_weight,
                    epochs=5, #can increase to improve accuracy but training will take longer
                    batch_size=32,
                    validation_split=0.1,
                    callbacks=callbacks)

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 80, 20)            60000     
                                                                 
 dropout_9 (Dropout)         (None, 80, 20)            0         
                                                                 
 conv1d_10 (Conv1D)          (None, 78, 300)           18300     
                                                                 
 global_max_pooling1d_3 (Glo  (None, 300)              0         
 balMaxPooling1D)                                                
                                                                 
 dense_5 (Dense)             (None, 20)                6020      
                                                                 
 activation_4 (Activation)   (None, 20)                0         
                                                      

In [None]:
cnn_model = keras.models.load_model('model-conv1d.h5')
y_pred = cnn_model.predict(x_test)
metrics = cnn_model.evaluate(x_test, y_test)
print("{}: {}".format(cnn_model.metrics_names[0], metrics[0]))
print("{}: {}".format(cnn_model.metrics_names[1], metrics[1]))

loss: 0.12152751535177231
categorical_accuracy: 0.3361932039260864
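
A note on reading this number: categorical accuracy only checks whether the position of the largest prediction matches the position of the largest target value, which is designed for single-label problems and may understate how well a multi-label model does. The plain-Python sketch below, on made-up targets and predictions, contrasts it with a per-label accuracy at a 0.5 threshold. The helper names are hypothetical and the logic is a simplified approximation of the Keras metrics.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def categorical_accuracy(y_true, y_pred):
    """Share of rows where the top predicted index equals the top target index."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def per_label_accuracy(y_true, y_pred, threshold=0.5):
    """Share of individual label decisions that are correct after thresholding."""
    correct = total = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            correct += int((p > threshold) == bool(t))
            total += 1
    return correct / total

# Made-up multi-hot targets and sigmoid outputs for three questions.
y_true = [[0, 1, 1], [0, 0, 0], [1, 0, 0]]
y_pred = [[0.1, 0.6, 0.9], [0.2, 0.3, 0.1], [0.8, 0.2, 0.1]]

print(categorical_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

In this toy case every thresholded label is right, yet two of the three rows count as misses under categorical accuracy. Keras provides `binary_accuracy` for the thresholded, per-label comparison.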


In [None]:
from sklearn.metrics import multilabel_confusion_matrix
y_pred1 = (y_pred > 0.5)
#multilabel_confusion_matrix(y_test, y_pred1, labels=None, sample_weight=None)

In [None]:
from sklearn.metrics import classification_report
y_pred1 = (y_pred > 0.5)
# The columns of y follow multilabel_binarizer.classes_ (sorted), so the row names must come from `labels`
print(classification_report(y_test, y_pred1, target_names=labels))

                          precision    recall  f1-score   support

                       r       0.68      0.43      0.53       526
              regression       0.60      0.46      0.52       525
        machine-learning       0.49      0.40      0.44       563
             time-series       0.66      0.70      0.68       377
             probability       0.61      0.49      0.54       595
      hypothesis-testing       0.44      0.07      0.12       705
              self-study       0.57      0.18      0.27       736
           distributions       0.66      0.50      0.57       657
                logistic       0.65      0.07      0.13      1256
          classification       0.25      0.01      0.03       371
             correlation       0.51      0.54      0.53       405
statistical-significance       0.32      0.25      0.28       408
                bayesian       0.75      0.58      0.65       382
                   anova       0.51      0.26      0.34       481
     norm



In [None]:
BiLSTMmodel = Sequential()
# Configuring the parameters
BiLSTMmodel.add(Embedding(max_words, output_dim=50, input_length=maxlen))
BiLSTMmodel.add(Bidirectional(LSTM(128, return_sequences=True)))
# Adding a dropout layer
BiLSTMmodel.add(Dropout(0.5))
BiLSTMmodel.add(Bidirectional(LSTM(64)))
BiLSTMmodel.add(Dropout(0.5))
# Adding a dense output layer with sigmoid activation
BiLSTMmodel.add(Dense(num_classes))
BiLSTMmodel.add(Activation('sigmoid'))

BiLSTMmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])
BiLSTMmodel.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_9 (Embedding)     (None, 80, 50)            150000    
                                                                 
 bidirectional (Bidirectiona  (None, 80, 256)          183296    
 l)                                                              
                                                                 
 dropout_12 (Dropout)        (None, 80, 256)           0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              164352    
 nal)                                                            
                                                                 
 dropout_13 (Dropout)        (None, 128)               0         
                                                                 
 dense_7 (Dense)             (None, 20)               

In [None]:
callbacks = [
    ReduceLROnPlateau(),
    EarlyStopping(patience=4),
    ModelCheckpoint(filepath='model-bilstm.h5', save_best_only=True)
]

BiLSTMhistory = BiLSTMmodel.fit(x_train, y_train,
                    class_weight=class_weight,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.1,
                    callbacks=callbacks)



In [None]:
bilstm_model = keras.models.load_model('model-bilstm.h5')
y_pred = bilstm_model.predict(x_test)
metrics = bilstm_model.evaluate(x_test, y_test)
print("{}: {}".format(bilstm_model.metrics_names[0], metrics[0]))
print("{}: {}".format(bilstm_model.metrics_names[1], metrics[1]))

loss: 0.1830926537513733
categorical_accuracy: 0.10295587033033371


In [None]:
LSTMmodel = Sequential()
# Configuring the parameters
LSTMmodel.add(Embedding(max_words, output_dim=50, input_length=maxlen))
LSTMmodel.add(LSTM(128, return_sequences=True))
# Adding a dropout layer
LSTMmodel.add(Dropout(0.5))
LSTMmodel.add(LSTM(64))
LSTMmodel.add(Dropout(0.5))
# Adding a dense output layer with sigmoid activation
LSTMmodel.add(Dense(num_classes))
LSTMmodel.add(Activation('sigmoid'))

LSTMmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])
LSTMmodel.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_8 (Embedding)     (None, 80, 50)            150000    
                                                                 
 lstm_4 (LSTM)               (None, 80, 128)           91648     
                                                                 
 dropout_10 (Dropout)        (None, 80, 128)           0         
                                                                 
 lstm_5 (LSTM)               (None, 64)                49408     
                                                                 
 dropout_11 (Dropout)        (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 20)                1300      
                                                                 
 activation_5 (Activation)   (None, 20)               

In [None]:
callbacks = [
    ReduceLROnPlateau(),
    EarlyStopping(patience=4),
    ModelCheckpoint(filepath='model-lstm.h5', save_best_only=True)
]

LSTMhistory = LSTMmodel.fit(x_train, y_train,
                    class_weight=class_weight,
                    epochs=2,
                    batch_size=32,
                    validation_split=0.1,
                    callbacks=callbacks)

Epoch 1/2
Epoch 2/2


In [None]:
lstm_model = keras.models.load_model('model-lstm.h5')
y_pred = lstm_model.predict(x_test)
metrics = lstm_model.evaluate(x_test, y_test)
print("{}: {}".format(lstm_model.metrics_names[0], metrics[0]))
print("{}: {}".format(lstm_model.metrics_names[1], metrics[1]))

loss: 0.17990423738956451
categorical_accuracy: 0.10213316231966019


In [None]:
GRUmodel = Sequential()
# Configuring the parameters
GRUmodel.add(Embedding(max_words, output_dim=50, input_length=maxlen))
GRUmodel.add(GRU(128, return_sequences=True))
# Adding a dropout layer
GRUmodel.add(Dropout(0.5))
GRUmodel.add(GRU(64))
GRUmodel.add(Dropout(0.5))
# Adding a dense output layer with sigmoid activation
GRUmodel.add(Dense(num_classes))
GRUmodel.add(Activation('sigmoid'))

GRUmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])
GRUmodel.summary()

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, 80, 50)            150000    
                                                                 
 gru (GRU)                   (None, 80, 128)           69120     
                                                                 
 dropout_14 (Dropout)        (None, 80, 128)           0         
                                                                 
 gru_1 (GRU)                 (None, 64)                37248     
                                                                 
 dropout_15 (Dropout)        (None, 64)                0         
                                                                 
 dense_8 (Dense)             (None, 20)                1300      
                                                                 
 activation_7 (Activation)   (None, 20)              

In [None]:
callbacks = [
    ReduceLROnPlateau(),
    EarlyStopping(patience=4),
    ModelCheckpoint(filepath='model-gru.h5', save_best_only=True)
]

GRUhistory = GRUmodel.fit(x_train, y_train,
                    class_weight=class_weight,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.1,
                    callbacks=callbacks)



In [None]:
gru_model = keras.models.load_model('model-gru.h5')
y_pred = gru_model.predict(x_test)
metrics = gru_model.evaluate(x_test, y_test)
print("{}: {}".format(gru_model.metrics_names[0], metrics[0]))
print("{}: {}".format(gru_model.metrics_names[1], metrics[1]))

loss: 0.16791535913944244
categorical_accuracy: 0.13274960219860077


In [None]:
RNNmodel = Sequential()
# Configuring the parameters
RNNmodel.add(Embedding(max_words, output_dim=50, input_length=maxlen))
RNNmodel.add(SimpleRNN(128, return_sequences=True))
# Adding a dropout layer
RNNmodel.add(Dropout(0.5))
RNNmodel.add(SimpleRNN(64))
RNNmodel.add(Dropout(0.5))
# Adding a dense output layer with sigmoid activation
RNNmodel.add(Dense(num_classes))
RNNmodel.add(Activation('sigmoid'))

RNNmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])
RNNmodel.summary()

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_11 (Embedding)    (None, 80, 50)            150000    
                                                                 
 simple_rnn (SimpleRNN)      (None, 80, 128)           22912     
                                                                 
 dropout_16 (Dropout)        (None, 80, 128)           0         
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 64)                12352     
                                                                 
 dropout_17 (Dropout)        (None, 64)                0         
                                                                 
 dense_9 (Dense)             (None, 20)                1300      
                                                                 
 activation_8 (Activation)   (None, 20)              
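
The parameter counts in the recurrent model summaries above can be checked by hand. The sketch below uses the usual formulas for these Keras layers with their default settings. This is an assumption: in particular, the GRU formula assumes `reset_after=True`, which adds a second bias vector per gate. The function names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    """One weight block: input kernel + recurrent kernel + bias."""
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    """Four gates, each with its own kernels and bias."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    """Three gates, each with kernels and two bias vectors (reset_after=True)."""
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    """Weight matrix plus bias."""
    return input_dim * units + units

embedding_dim = 50  # output_dim of the Embedding layer in the recurrent models

# First and second recurrent layer of each model.
print(simple_rnn_params(embedding_dim, 128), simple_rnn_params(128, 64))
print(lstm_params(embedding_dim, 128), lstm_params(128, 64))
print(gru_params(embedding_dim, 128), gru_params(128, 64))
# Bidirectional wrappers hold two copies; the second one sees 2 * 128 inputs.
print(2 * lstm_params(embedding_dim, 128), 2 * lstm_params(256, 64))
# Output layer on top of a 64-unit recurrent layer.
print(dense_params(64, 20))
```

The ratio between the counts reflects the number of gates: an LSTM layer has four times, and a GRU layer roughly three times, the parameters of a `SimpleRNN` layer of the same size.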

In [None]:
callbacks = [
    ReduceLROnPlateau(),
    EarlyStopping(patience=4),
    ModelCheckpoint(filepath='model-rnn.h5', save_best_only=True)
]

RNNhistory = RNNmodel.fit(x_train, y_train,
                    class_weight=class_weight,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.1,
                    callbacks=callbacks)



In [None]:
rnn_model = keras.models.load_model('model-rnn.h5')
y_pred = rnn_model.predict(x_test)
metrics = rnn_model.evaluate(x_test, y_test)
print("{}: {}".format(rnn_model.metrics_names[0], metrics[0]))
print("{}: {}".format(rnn_model.metrics_names[1], metrics[1]))

loss: 0.1858353167772293
categorical_accuracy: 0.06681554019451141
