## Overview and Abstract
**MULTILABEL CLASSIFICATION PROBLEM**

This project consists of creating a MutliLabel classifier that predicts the category associated to a text-based search query. Nine categories are concatenated as Label. Data was processed using word-tokenization ('title') to conduct Natural Language Processing. Further, Labels were transformed into nine binary vectors which eased learning. Random Forest was again selected as the baseline, as tokenized words created a highly dimensional space. The chosen MultiClass methods were a dense, deep NN; a LSTM; and a CNN. These were evaluated using F1 scores. Differences between models used was negligible. Alternative embedding techniques and additional features would likely improve performance.

In [118]:
### PACKAGES ###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Packages used for Feature Engineering
  # General
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
  # A: Correcting class imbalances
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler  
  # B: Tokenizing words
from keras.preprocessing.sequence import pad_sequences
import nltk
from nltk.corpus import stopwords
import string
# Packages used for Shallow ML (scikit-learn)
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier # Shallow method deployed
# Packages used for Deep ML (keras)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from keras import backend as K
from keras.layers.recurrent import LSTM
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.wrappers.scikit_learn import KerasClassifier

## NLP project: Method

Some features of the dataset were altered to improve its appropriacy for the chosen classification models. The feature processing is as follows:
- Cleaning text by removing punctuation, capitalization, 'stopwords', and other characters of minimal value.
- Tokenization and padding were carried out to make sure the dimensions were fixed in the input.
- One-hot encoding target labels (y) to transform MultiLabel inputs to Multiclass. This simplifies inferences made.
- A validation set was created.

A Random Forest was chosen as a baseline model for subsequent neural networks. A Random Forest is beneficial because text padding creates a highly dimensional space. Yet, the independence of decisions trees within a Random Forest will capture the relationships between other dimensions / words.

As Multiclass  classifiers, each neural network outputs 9 neurons using a sigmoid activation function. Optimal hyperparameters for each classifier were found using Grid Search. Furthermore, the loss function for all classifiers was set to ‘binary_crossentropy’. All other activation functions were set to ‘relu’ with a ‘he-normal’ initializer.
- The 4-layer dense, deep neural network attempts to capture complexities present in tokenized 'title' data exclusively. Optimized F1 score was found when epochs were set to 10 and batch size to 400.
- A Recurrent Neural Network with Long-Short-Term Memory (**Hochreiter, S. & Schmidhuber, J., 1997**) attempts to capture immediate and longer-term representations, suitable for sentences of text.
- Though primarily deployed for image classification problems, more recently  Convolutional Neural Networks may be used for NLP, to see whether Convolutional Layers can identify spatial dependence and order between specific words (**Kalchbrenner, N. et al., 2016**).


In [119]:
## ARCHITECTURE ##
# Dense, deep neural network
class Model_Multiclass_Dense(keras.Model):
  def __init__(self, input_dim, input_length, output_dim = 264, activation1 = 'relu', activation2 = 'relu', kernel_initializer1 = 'he-normal', **kwargs):
    super(Model_Multiclass_Dense, self).__init__(**kwargs)
    self.embedding = layers.Embedding(input_dim = input_dim, output_dim = output_dim, input_length = input_length)
    self.flatten = layers.Flatten()
    self.hidden1 = layers.Dense(units = 50)
    self.hidden2 = layers.Dense(units = 50)
    self.main_output = layers.Dense(units = 9, activation = 'sigmoid')
  def call(self, inputs):
    embedding = self.embedding(inputs)
    flatten = self.flatten(embedding)
    hidden1 = self.hidden1(flatten)
    hidden2 = self.hidden2(hidden1)
    main_output = self.main_output(hidden2)
    return main_output
# LSTM
class Model_Multiclass_LSTM(keras.Model):
  def __init__(self, input_dim, input_length, output_dim = 100, activation1 = 'relu', activation2 = 'relu', kernel_initializer1 = 'he-normal', kernel_initializer2 = 'he-normal', **kwargs):
    super(Model_Multiclass_LSTM, self).__init__(**kwargs)
    self.embedding = layers.Embedding(input_dim = input_dim, output_dim = output_dim, input_length = input_length)
    self.LSTM = LSTM(128)
    self.main_output = layers.Dense(units = 9, activation = 'sigmoid')
  def call(self, inputs):
    embedding = self.embedding(inputs)
    LSTM = self.LSTM(embedding)
    main_output = self.main_output(LSTM)
    return main_output
# CNN
class Model_Multiclass_CNN(keras.Model):
  def __init__(self, input_dim, input_length, output_dim = 100, activation1 = 'relu', activation2 = 'relu', kernel_initializer1 = 'he-normal', kernel_initializer2 = 'he-normal', **kwargs):
    super(Model_Multiclass_CNN, self).__init__(**kwargs)
    self.embedding = layers.Embedding(input_dim = input_dim, output_dim = output_dim, input_length = input_length)
    self.conv = layers.Conv1D(filters=32, kernel_size=8, activation='relu')
    self.pooling = layers.MaxPooling1D(pool_size=2)
    self.flatten = layers.Flatten()
    self.hidden = layers.Dense(units = 10, activation='relu')
    self.main_output = layers.Dense(units = 9, activation = 'sigmoid')
  def call(self, inputs):
    embedding = self.embedding(inputs)
    conv = self.conv(embedding)
    pooling = self.pooling(conv)
    flatten = self.flatten(pooling)
    hidden = self.hidden(flatten)
    main_output = self.main_output(hidden)
    return main_output
## TRAINING FUNCTIONS ##
# Dense, deep neural network, LSTM and CNN
def ModelTrainerMulti(X, y, batch_size, epochs, model, input_dim, input_length):
  model = model(input_dim = input_dim, input_length=input_length)
  model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy', f1_m])
  model.fit(X, y, epochs = epochs, batch_size = batch_size, verbose = 0)
  return model

# Results and Discussion


### Task B: Results and discussion

| Model | Epochs | Batch size | Neurons per HiddenLayer(s) | F1 score | Kaggle score |
| --- | --- | --- | --- | --- | --- |
| Random Forest | N/A | N/A | N/A | 0.89 | 0.92 |
| 4-layer DNN | 10 | 500 | 100 | 0.95 | 0.91 |
| LSTM | 10 | 500 | N/A | 0.94 | 0.90 |
| CNN | 10 | 400 | 10 | 0.95 | 0.91 |

*With title*

NN architectures deployed performed similarly to the baseline method. The 'title' feature captured a lot of the information relating to each document's category, illustrated by correctly classifying ~92% of test instances, alone. This makes it a well-suited predictor. Non-discernible performance differences between both shallow and deep methods deployed suggests exploration of other features may have improved performance, and allowed NNs to flourish. Moreover, alternative word-embedding methods may have been taken, which may have learned better representations.

# References

- Géron, A. (2019). *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*. 2nd ed. Sebastopol: O'Reilly.
- Cheng, H. et al. (2016). *Wide & Deep Learning for Recommender Systems*. Google, Inc.
- Hochreiter, S. & Schmidhuber, J. (1997). *Long Short-Term Memory*. Neural Computation, 9, pp. 1735-1780.
- Kalchbrenner, N. et al. (2016). *A Convolutional Neural Network for Modelling Sentences*. Association for Computational Linguistics, p. 655–665.


# Code

In [145]:
# If loading files via Google Drive
from google.colab import drive
drive.mount('/content/gdrive')
# Load files
trainMulti = pd.read_csv('gdrive/My Drive/MSc Data Analytics/CS987 assignment/german-contracts-train.csv',  dtype={
        "docid":str, "publication_date":str, "contract_type":str, "nature_of_contract":str, "country_code":str,
        "country_name":str, "sector":str, "category":str, "value":float, "title":str,
        "description":str, "awarding_authority":str, "complete_entry":str,
        "label":str   
    })
testMulti = pd.read_csv('gdrive/My Drive/MSc Data Analytics/CS987 assignment/german-contracts-test.csv',  dtype={
        "docid":str, "publication_date":str, "contract_type":str, "nature_of_contract":str, "country_code":str,
        "country_name":str, "sector":str, "category":str, "value":float, "title":str,
        "description":str, "awarding_authority":str, "complete_entry":str,
        "label":str   
    })

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# If loading files via local PC
from google.colab import files
files.upload()
# Load files
import io
train = pd.read_csv('german-contracts-train.csv')
test = pd.read_csv('german-contracts-test.csv')

### Feature Processing
Preprocessing for NLP tasks consists of text cleaning (lower case-ing, removal of stopwords, punctuation and special characters), tokenization and padding to make sure we have fixed dimensions to our input. 


#### Brief exploratory analysis of data

In [82]:
# Understanding the shape of the data
r,c = trainMulti.shape
r1, c1 = testMulti.shape
print("- There are %s rows and %s columns in the Training dataset, and %s rows and %s columns in the Test dataset." % (r,c, r1, c1))
# Identifying any null values
print('- In the training dataset, there are %s Null cells.' % trainMulti.isna().sum().sum())
print('- In the testing dataset, there are %s Null cells.' % testMulti.isna().sum().sum())
# Identifying the number of unique values found across features
print('- Unique values found in the training dataset are:\n', trainMulti.nunique())
print('\n- Unique values found in the testing dataset are:\n', testMulti.nunique())

- There are 98320 rows and 13 columns in the Training dataset, and 24581 rows and 11 columns in the Test dataset.
- In the training dataset, there are 63095 Null cells.
- In the testing dataset, there are 15826 Null cells.
- Unique values found in the training dataset are:
 docid                 98320
publication_date        254
contract_type             2
nature_of_contract        3
country_code              1
country_name              1
sector                    1
category                307
value                 23063
title                 32207
description           70652
awarding_authority    13191
label                   176
dtype: int64

- Unique values found in the testing dataset are:
 docid                 24581
publication_date        250
contract_type             2
nature_of_contract        4
country_code              1
country_name              1
sector                    1
value                  6225
title                 13600
description           21365
awarding_authori

In [83]:
# Understanding the characteristics of the training data
trainMulti.head(3)

Unnamed: 0,docid,publication_date,contract_type,nature_of_contract,country_code,country_name,sector,category,value,title,description,awarding_authority,label
0,2493527426,2020-10-14,award,services,DE,Germany,public,['Energy & Environment'],75658.0,Germany-Wilhelmshaven: Cleaning services,Unterhalts- und Glasreinigung.\n,Staatliches Baumanagement Ems-Weser,100000
1,2538215982,2020-11-16,notice,services,DE,Germany,public,['Infrastructure & Construction'],,Germany-Dresden: Engineering-design services f...,ABS Karlsruhe-Stuttgart-Nürnberg-Leipzig/Dresd...,DB Netz AG,1000
2,2204943443,2020-02-13,notice,works,DE,Germany,public,['Infrastructure & Construction'],470000.0,"Germany-Germering: Heating, ventilation and ai...",Nach Fertigstellung des ersten Bauabschnitts e...,Große Kreisstadt Germering,1000


In [84]:
# Understanding the characteristics of the testing data
testMulti.head(3)

Unnamed: 0,docid,publication_date,contract_type,nature_of_contract,country_code,country_name,sector,value,title,description,awarding_authority
0,2535443526,2020-11-13,notice,services,DE,Germany,public,,Germany-Stuttgart: Software-related services,Pflege und Anpassung SKoKa-BW (1.1.2021-31.12....,"Regierungspräsidium Tübingen, Abteilung 9 — La..."
1,2487195007,2020-10-09,notice,services,DE,Germany,public,,"Germany-Mühlacker: Architectural, construction...",Vergabeverfahren der Stadt Mühlacker zur Verga...,Stadtverwaltung Mühlacker
2,2573583192,2020-12-11,notice,supplies,DE,Germany,public,,Germany-Darmstadt: Integrated circuit packages,Für die Gruppe HEL der Abteilung ACO werden fü...,GSI Helmholtzzentrum für Schwerionenforschung ...


**EDA informs feature processing**. 
- Many features have no unique or too many unique features: 'docid', 'country_code', 'country_name', 'sector'.
- 'title' appears to contain similar information to 'category' / 'label'. If the text can be extracted, then this is likely the best bet to predicting classes.
- 'description' also appears to have interesting information.

The most relevant elements of this dataset are the text features: description and title. For the standard baseline model and the dense, deep neural net we will use the title feature. 

For the more advanced models we will use the description feature.

#### Feature preprocessing

**Text processing 'title'**

In [146]:
# Extracting X, and removing geographic element, simultaneously
# Train
title_train = pd.DataFrame(trainMulti.title.str.split(':', 1).tolist(),columns = ['Location', 'title'])
title_train.drop(columns = 'Location', inplace = True)
# Test
title_test = pd.DataFrame(testMulti.title.str.split(':', 1).tolist(),columns = ['Location', 'title'])
title_test.drop(columns = 'Location', inplace = True)

In [147]:
# Preparing text corpora (set of texts) for Tokenisation
title_train = title_train['title']
title_test = title_test['title']

In [148]:
# Set up Tokenizer
from keras_preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level = False, oov_token = True, lower = True,                                               
                      filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n1234567890') 
# oov_token: Any new words found in title_test will receive a new value
# lower: All words are made lowercase, avoiding duplicates due to capitalisation
# filters: Removes characters, numbers, etc.

In [149]:
# Encoding 'title' using Tokenizer
tokenizer.fit_on_texts(title_train)
title_train = tokenizer.texts_to_sequences(title_train)
title_test = tokenizer.texts_to_sequences(title_test)

In [150]:
# Pad sentences, so all 'titles' (inputs) are of fixed size
# Ensure that 'titles' is padded to the longest sentence in title_train / title_test
maxlen_title = max(max([len(i) for i in title_train]), max([len(i) for i in title_test]))
title_train = pad_sequences(title_train, padding='post', maxlen = maxlen_title) # Train
title_test = pad_sequences(title_test, padding = 'post', maxlen = maxlen_title) # Test

In [151]:
train_vocab_title = len(np.unique(title_train))
print('The amount of vocabulary found by the Tokenizer in the train data is', train_vocab_title)
test_vocab_title = len(np.unique(title_test))
print('The amount of vocabulary found by the Tokenizer in the test data is', test_vocab_title)

The amount of vocabulary found by the Tokenizer in the train data is 2638
The amount of vocabulary found by the Tokenizer in the test data is 1850


**Text preprocessing 'description'**

In [152]:
# 'stopwords' = useful German filler words
nltk.download('stopwords')
stop = stopwords.words('german')
# Alternative Tokenizing method
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [153]:
# Remove German 'stopwords' and any grammar prior to splitting
def StopwordRemover(text):
  text = text.str.replace("'",'').str.replace(",","").str.replace(".","").str.replace("/","").str.replace("-"," ") # Punctuation found in strings
  text = text.str.lower().str.split().apply(lambda x: [item for item in x if item not in stop]) # Lowercase, split, & remove stopwords
  return (text)
trainMulti['description'] = StopwordRemover(trainMulti['description'])
testMulti['description'] = StopwordRemover(testMulti['description'])

In [154]:
# Preparing text corpora (set of texts) for Tokenisation
description_train = trainMulti['description']
description_test = testMulti['description']
# Apply Tokenizer
tokenizer.fit_on_texts(description_train)
description_train = tokenizer.texts_to_sequences(description_train)
description_test = tokenizer.texts_to_sequences(description_test)

In [155]:
# Pad sentences, so all 'descriptions' (inputs) are of fixed size
# Ensure that 'titles' is padded to the longest sentence in title_train / title_test
maxlen_title = max(max([len(i) for i in description_train]), max([len(i) for i in description_test]))
description_train = pad_sequences(description_train, padding='post', maxlen = maxlen_title) # Train
description_test = pad_sequences(description_test, padding = 'post', maxlen = maxlen_title) # Test

In [156]:
train_vocab_description = len(np.unique(description_train))
print('The amount of vocabulary found by the Tokenizer in the train data is', train_vocab_description)
test_vocab_description = len(np.unique(description_test))
print('The amount of vocabulary found by the Tokenizer in the test data is', test_vocab_description)

The amount of vocabulary found by the Tokenizer in the train data is 188612
The amount of vocabulary found by the Tokenizer in the test data is 73578


**One-Hot Encoding 'label' (y)**

This approach is preferred over label encoding, since it reduces predictions down to binary classifications.

In [157]:
y_train = trainMulti.copy()
# Loop which assigns each binary label to a new column
for i in range(len(y_train['label'][0])):
  y_train[i] = y_train['label'].apply(lambda x: x[i])
# Subsetting only newly created columns, and converting to integers (required by the NN)
y_train = y_train[list(range(9))].astype('int64')

**Separating into training and validation data**

In [158]:
X_train, X_valid, y_train, y_valid = train_test_split(pd.concat([pd.DataFrame(title_train), pd.DataFrame(description_train)], axis = 1), y_train, test_size = 0.3, random_state = 42)
# Separate out title and description again
X_train_title = np.array(X_train)[:, :25] # Title
X_valid_title = np.array(X_valid)[:, :25]
X_train_description = np.array(X_train)[:, 25:] # Description
X_valid_description = np.array(X_valid)[:, 25:]

### Standard baseline model

In [None]:
# Shallow baseline: Random Forest 
clf = RandomForestClassifier()
# Title
clf.fit(X_train_title, y_train)
RFpredictions_title = clf.predict(X_valid_title)
print('The F1 score of RF on title data:', metrics.f1_score(y_true = y_valid, y_pred = RFpredictions_title, average = 'macro'))
# Description
clf.fit(X_train_description, y_train)
RFpredictions_description = clf.predict(X_valid_description)
print('The F1 score of RF on description data:', metrics.f1_score(y_true = y_valid, y_pred = RFpredictions_description, average = 'macro'))

The F1 score of RF on title data: 0.8897913251713283
The F1 score of RF on description data: 0.6214208203646809


It appears that 'title' is a better predictor than 'description'. Since Random Forest can take only one input array, 'title' is chosen.

In [None]:
# Self-developed functions to enable upload to Kaggle
# Function: Combines each predicted column.
def ColumnCombiner(x):
  int_pred = np.array(x, dtype='int')
  final_preds = []
  for i in range(len(int_pred)):
    string_ints = [str(int) for int in int_pred[i]]
    str_of_ints = "".join(string_ints)
    final_preds.append(str_of_ints)
  return (final_preds)
# Functions: Correctly formats data for submission to Kaggle.
def KaggleUploader_Multi(x):
  # Combine predictions w/ 'docid'
  predictions = pd.DataFrame(x, columns = ['label'], dtype = 'str')
  docid = pd.DataFrame(testMulti['docid'])
  # Create prediction output for Kaggle
  predictions = pd.concat([docid, predictions], axis=1) #concatenate ID of each query with its prediction
  predictions.to_csv('Prediction.csv', index = False)
  from google.colab import files
  files.download('Prediction.csv')
# Functions: For neural networks, converts sigmoid predictions to (0,1)
def PredictionConverter_B(model, test_data):
  predictions = model.predict(test_data)
  # Transforming predictions to (0,1)
  for i in range(len(predictions)):
    for x in range(9):
      if predictions[i][x] > 0.5:
        predictions[i][x] = 1
      else:
        predictions[i][x] = 0
  return predictions

In [None]:
clf.fit(X_train_title, y_train_title)
predictions = clf.predict(title_test)
predictions = KaggleUploader_Multi(ColumnCombiner(predictions))
print('Kaggle prediction for title data: 0.91654')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Kaggle prediction for title data: 0.91654


### Dense, deep neural network

In [99]:
# Use scikit-learn to grid search neurons, epochs, and batch size
# Function to create model, required for Keras classifier
def DenseLayerConfig_Multi():
  model = keras.models.Sequential()
  model.add(layers.Embedding(input_dim = train_vocab_description + 1, output_dim = 264, input_length = title_train.shape[1]))
  model.add(layers.Flatten())
  model.add(layers.Dense(units = 100))
  model.add(layers.Dense(units = 50))
  model.add(layers.Dense(units = 9, activation = 'sigmoid'))
  # compile model
  model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
  return model
# random seed for reproducability
seed = 42
np.random.seed(seed)
# Create model for GridSearch
model = KerasClassifier(build_fn = DenseLayerConfig_Multi, epochs = 10, verbose = 1)
# Define GridSearch Parameters
#epochs = [10,15]
batch_size = [400,500]
param_grid = dict(batch_size = batch_size) # dictionary of parameters we wish to GridSearch through
grid = GridSearchCV(estimator = model, param_grid = param_grid, n_jobs = -1, cv = 3)
# Best parameters
clf_title_DL = grid.fit(X_train_title, y_train, verbose = 0)
print('Best F1 score:', clf_title_DL.best_score_)
print('Best Hyperparameters', clf_title_DL.best_params_)

Best F1 score: 0.9419388771057129
Best Hyperparameters {'batch_size': 400}


In [159]:
# Predictions using Title data
clf_title_DL = ModelTrainerMulti(X_train_title, y_train, batch_size = 400, epochs = 10, model = Model_Multiclass_Dense,
                                   input_dim = train_vocab_description +1, input_length = X_train_description.shape[1])
loss, accuracy, f1_score = clf_title_DL.evaluate(X_valid_title, y_valid)
print('Using title data, the F1 score on the validation set is', f1_score)

Using title data, the F1 score on the validation set is 0.9492183923721313


In [160]:
# Upload Deep NN predictions
KaggleUploader_Multi(ColumnCombiner(PredictionConverter_B(clf_title_DL, title_test)))
print('Kaggle prediction for title data: 0.91268')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Kaggle prediction for title data: 0.91268


### Additional 1: Recurrent Neural Network (LSTM)

In [None]:
# Predictions
clf_title_RNN = ModelTrainerMulti(X_train_title, y_train, batch_size = 500, epochs = 10, model = Model_Multiclass_LSTM, 
                                  input_dim = train_vocab_title +1, input_length = X_train_title.shape[1])
loss, accuracy, f1_score = clf_title_RNN.evaluate(X_valid_title, y_valid)
print('Using title data, the F1 score on the validation set is', f1_score)

Using title data, the F1 score on the validation set is 0.9359655976295471


In [None]:
# Upload LSTM predictions
KaggleUploader_Multi(ColumnCombiner(PredictionConverter_B(clf_title_RNN, title_test)))
print('Kaggle prediction for title data: 0.90210')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Kaggle prediction for title data: 0.90210


### Additional 2: Convolutional Neural Network

In [None]:
# Prediction
clf_title_CNN = ModelTrainerMulti(X_train_title, y_train, batch_size = 400, epochs = 10, model = Model_Multiclass_CNN, 
                                  input_dim = train_vocab_title +1, input_length = X_train_title.shape[1])
loss, accuracy, f1_score = clf_title_CNN.evaluate(X_valid_title, y_valid)
print('Using title data, the F1 score on the validation set is', f1_score)

Using title data, the F1 score on the validation set is 0.9457349181175232


In [None]:
# Upload CNN predictions
KaggleUploader_Multi(ColumnCombiner(PredictionConverter_B(clf_title_CNN, title_test)))
print('Kaggle prediction for title data: 0.91232')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Kaggle prediction for title data: 0.91232
