## Overview and Abstract



**Team J**


# Task overview

The dataset provided for CS98-DL-Task1 Relevance Modelling contains entries of search results returned by a procurement search engine during a user’s search session. The aim of our task was to decide whether each document is relevant or not by creating a binary classifier model.

To solve this problem, we implemented three models. The first was a standard machine learning baseline, for which we chose a Gradient Boosting Classifier. The second was a simple deep learning network with three hidden layers, whose results served as a point of comparison for a more sophisticated deep learning model. The third was a Recurrent Neural Network (RNN) with LSTM hidden layers.


In [None]:
#Imports and seed specification
import tensorflow as tf
from tensorflow import keras
import os
import tempfile
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
import random as rn
import imblearn
from imblearn.over_sampling import SMOTE
from google.colab import files 
from sklearn.ensemble import  GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

SEED = 23
os.environ['PYTHONHASHSEED']=str(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
rn.seed(SEED)
uploaded = files.upload()



# Method

## Data Processing and Feature Extraction

The work began with an exploration of the features in the dataset, followed by data preprocessing. The dataset is relatively small, with 33,000 entries and 19 columns containing both numerical and textual data. Only the 'nature' column contained empty values, which can be problematic for the models. Because the data is limited, we filled the missing entries with the neutral value 'Other' rather than dropping them. Furthermore, the values in the 'psrel' column, which holds the labels for training the binary classifiers, were found to be highly imbalanced: only 5.95% belong to the positive class. Such an imbalance is likely to lead to poor predictions for the minority class and thus prevent us from achieving reasonable results, so some form of data augmentation has to be employed. One common approach is to oversample the minority class. Rather than simply duplicating minority entries, a better approach is to synthesize new data from the existing examples. This technique is known as the 'Synthetic Minority Oversampling Technique', or SMOTE for short, and it was used in this task. Applying the SMOTE implementation from the imblearn library to the training dataset increased the number of positive entries until the classes reached a 50:50 distribution. **[1]**

In addition, we dropped the columns 'user', 'session', 'query', 'timestamp' and 'cpvs'. These columns mostly hold identifiers related to the query and have a large number of distinct values (for example, 12,906 distinct queries and 32,610 distinct timestamps among the 33,000 entries), so they carry little information that is directly useful for training. Nevertheless, a more sophisticated solution could have established a relationship between the session and user identifiers, since some entries share values in these columns. The remaining categorical columns were one-hot encoded before being fed into the models. Finally, the data was scaled with sklearn's StandardScaler to improve the learning process, which matters mainly for the neural network models, since tree-based models are largely insensitive to feature scale.
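
To make the idea behind SMOTE more concrete, the following is a minimal NumPy sketch of the interpolation step: each synthetic row is placed at a random position on the line between a minority sample and one of its nearest minority neighbours. The function name `smote_like_oversample` and the toy data are assumptions made for illustration; this is not the imblearn implementation used in the notebook.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=3, seed=23):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)        # random starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                             # position on the segment
    return minority[base] + gap * (minority[picked] - minority[base])

# Hypothetical minority-class rows with two features each.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_oversample(minority, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic row is a convex combination of two existing rows, it always lies inside the region already covered by the minority class.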

## Baseline Machine Learning Model
A standard machine learning model had to be created to serve as a baseline for the other models. We chose an ensemble model, since ensembles usually offer better performance and generalization than simpler models. The classifier selected for the task was the GradientBoostingClassifier (GBC). It combines multiple weak learners (decision trees) into a strong predictive model capable of handling complex datasets. In recent years GBC models have become more popular due to their effectiveness, and such models frequently win Kaggle competitions. **[2]**
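
The way boosting combines weak learners can be sketched in a few lines of NumPy. The example below is a simplified illustration, assuming a single feature, one-split decision stumps and a plain gradient step on the log loss; sklearn's GradientBoostingClassifier uses full regression trees and further refinements, so this is not its implementation. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_stump(x, residual):
    """Best single-threshold split of a 1-D feature, judged by squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_val, right_val = residual[left].mean(), residual[~left].mean()
        error = np.sum((residual - np.where(left, left_val, right_val)) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_val, right_val)
    return best[1:]

def boost(x, y, n_estimators=10, learning_rate=0.5):
    # Start from the log-odds of the positive class, then add one stump per round.
    score = np.full(len(y), np.log(y.mean() / (1 - y.mean())))
    losses = [log_loss(y, sigmoid(score))]
    for _ in range(n_estimators):
        residual = y - sigmoid(score)          # negative gradient of the log loss
        threshold, left_val, right_val = fit_stump(x, residual)
        score += learning_rate * np.where(x <= threshold, left_val, right_val)
        losses.append(log_loss(y, sigmoid(score)))
    return losses

# Toy data: one feature, labels that no single threshold separates perfectly.
x = np.array([0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.4])
y = np.array([0, 0, 1, 0, 1, 1, 0, 1], dtype=float)
losses = boost(x, y)
print(losses[0] > losses[-1])
```

Each stump on its own is a poor classifier, but every round corrects part of the error left by the previous ones, so the training loss keeps falling as stumps are added.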

## Neural Network Models
Two neural models were created, differing in architecture and complexity, in order to test the effectiveness of each approach. First, a baseline neural network with three hidden layers was developed. Since this model was to be used as a basis for comparison, neither regularization techniques such as regularizers and dropout nor more complex weight initialization schemes were used. The architecture is sequential (feedforward) and starts with an input layer that expects 44 features (the total number of features after one-hot encoding). This layer is followed by three hidden Dense layers with the "relu" activation function and an equal number of neurons. Finally, the output layer has a single neuron with a sigmoid activation function, which makes it suitable for the binary classification task. The sigmoid function returns a value between 0 and 1, which is straightforward to interpret: values below 0.5 are assigned to the negative class, while values above 0.5 are assigned to the positive one.
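
The forward pass of such a network, and the 0.5 threshold on its sigmoid output, can be sketched in plain NumPy. The weights below are random and untrained, and the layer width of 30 is just the default used in the class definition; the sketch only illustrates the shape of the computation, not the trained Keras model.

```python
import numpy as np

rng = np.random.default_rng(23)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(features, weights, biases):
    """Three ReLU hidden layers followed by a single sigmoid output unit."""
    activation = features
    for w, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ w + b)
    return sigmoid(activation @ weights[-1] + biases[-1])

# Layer sizes: 44 inputs -> 30 -> 30 -> 30 -> 1 output (random, untrained weights).
sizes = [44, 30, 30, 30, 1]
weights = [rng.normal(0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

batch = rng.normal(size=(5, 44))               # five hypothetical input rows
probabilities = forward(batch, weights, biases)
labels = (probabilities > 0.5).astype(int)     # 1 = positive class, 0 = negative
print(probabilities.shape, labels.ravel())
```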

To test the performance of a different and more complex architecture on the dataset, a Recurrent Neural Network (RNN) model was designed. Such models are usually employed for problems involving sequential or time-series data, because their memory cells, which are more complex than ordinary neurons, can capture information about what has been calculated so far. Typical applications include language translation, speech recognition and natural language processing. **[3]** Nevertheless, they have proven to be effective for binary classification as well, which is why a model with this architecture was developed and evaluated. The model consists of a series of N recurrent layers with LSTM memory cells, each followed by a dropout layer that helps the model generalise better and not overfit the data. Since the RNN expects a 3-dimensional array of training data, another version of the data was created, reshaped into the format (number of samples, sequence length, input dimensions). As in the basic neural network model, the final output layer has one neuron with a sigmoid activation function for binary classification. The class definitions for both neural network models are presented in the code snippet below.


In [None]:
#Class definition of the Baseline Deep Learning Model
class BaselineDeepLearningModel(keras.Model):
  def __init__(self, n_neurons=30, input_shape = [44,] ,activation = "relu",  **kwargs):
    super().__init__( **kwargs)
    #self.input_layer = keras.layers.Flatten(input_shape = input_shape)
    self.hidden1 = keras.layers.Dense(n_neurons, activation=activation)
    self.hidden2 = keras.layers.Dense(n_neurons, activation=activation)
    self.hidden3 = keras.layers.Dense(n_neurons, activation=activation)
    self.output_layer = keras.layers.Dense(1, activation = "sigmoid")

  def call(self, inputs):
    #input_layer = self.input_layer(inputs)
    hidden1 = self.hidden1(inputs)
    hidden2 = self.hidden2(hidden1)
    hidden3 = self.hidden3(hidden2)
    output_layer = self.output_layer(hidden3)
    return output_layer


In [None]:



#Class definition of the RNN Deep Learning Model
class RNNModel(keras.Model):
  def __init__(self, n_neurons=30, dropout = 0.2, input_shape = [1,44], activation = "relu",  **kwargs):
    super().__init__( **kwargs)
    self.lstm1 = keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED), return_sequences = True)
    self.dropout1 = keras.layers.Dropout(dropout)
    self.lstm2 = keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED), return_sequences = True)
    self.dropout2 = keras.layers.Dropout(dropout)
    self.lstm3 = keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED), return_sequences = True)
    self.dropout3 = keras.layers.Dropout(dropout)
    self.lstm4 = keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED), return_sequences = True)
    self.dropout4 = keras.layers.Dropout(dropout)
    self.lstm5 = keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED), return_sequences = True)
    self.dropout5 = keras.layers.Dropout(dropout)
    self.lstm6 = keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED), return_sequences = True)
    self.dropout6 = keras.layers.Dropout(dropout)
    self.lstm7 = keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED))
    self.output_layer = keras.layers.Dense(1, activation = "sigmoid")

  def call(self, inputs):
    lstm1 = self.lstm1(inputs)
    dropout1 = self.dropout1(lstm1)
    lstm2 = self.lstm2(dropout1)
    dropout2 = self.dropout2(lstm2)
    lstm3 = self.lstm3(dropout2)
    dropout3 = self.dropout3(lstm3)
    lstm4 = self.lstm4(dropout3)
    dropout4 = self.dropout4(lstm4)
    lstm5 = self.lstm5(dropout4)
    dropout5 = self.dropout5(lstm5)
    lstm6 = self.lstm6(dropout5)
    dropout6 = self.dropout6(lstm6)
    lstm7 = self.lstm7(dropout6)
    output_layer = self.output_layer(lstm7)
    return output_layer



## Training Schedule Approach
To achieve the best possible results, the configuration of each neural network, in particular the number of hidden layers and the number of neurons, had to be explored. For each of the two models, a builder function was created and passed to the KerasClassifier wrapper class. This wrapper was then used to initialize a RandomizedSearchCV together with a dictionary of the parameters to be explored. RandomizedSearchCV evaluates a random sample of the parameter combinations provided and records the result of each iteration. Once the search is finished, the best performing parameters can be retrieved. Due to the limited GPU time available in Google Colab, the space of values explored was fairly small. The builder functions for each of the deep learning models are shown below.

In [None]:
#Definition of the metrics to be used when compiling deep learning models.
metrics = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
]
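
The results later in the report are given as F1 scores, which are not in the metric list above but follow directly from the true positive, false positive and false negative counts. As a small illustration, the sketch below computes precision, recall and F1 from such counts; the function name and the counts are made up for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=30, fp=20, fn=70)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.3 0.4
```

Unlike accuracy, F1 ignores the true negatives, which is why it is more informative on a dataset where about 94% of the rows belong to the negative class.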

In [None]:
def build_model_basic(n_hidden=3, n_neurons=30, learning_rate=3e-3, input_shape=[44,]):

  model = keras.models.Sequential()
  model.add(keras.layers.Flatten(input_shape=input_shape))
  for layer in range(n_hidden):
    model.add(keras.layers.Dense(n_neurons, activation="relu"))
  model.add(keras.layers.Dense(1, activation = "sigmoid"))
  
  model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                loss = keras.losses.BinaryCrossentropy(),
                metrics = metrics)

  return model

In [None]:
def build_RNN_model(n_hidden=6, n_neurons=30, dropout = 0.2, input_shape=[1,44], learning_rate=0.01):

  model = keras.models.Sequential()
  model.add(keras.layers.LSTM(200,input_shape=input_shape, return_sequences = True))
  # Each intermediate LSTM layer is followed by a dropout layer
  for layer in range(n_hidden):
    model.add(keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED), return_sequences = True))
    model.add(keras.layers.Dropout(dropout))
  # The last LSTM layer returns only its final output, which feeds the sigmoid unit
  model.add(keras.layers.LSTM(n_neurons, kernel_initializer = keras.initializers.he_normal(seed=SEED)))
  model.add(keras.layers.Dense(1, activation = "sigmoid"))
  
  model.compile(optimizer = keras.optimizers.Adam(learning_rate=learning_rate),
                loss = keras.losses.BinaryCrossentropy(),
                metrics = metrics)
  model.summary()
  return model


# Results and Discussion

**Parameter settings for each approach:**

Gradient Boosting Classifier model:
*   n_estimators = 10   
*   max_depth = 5
*   learning_rate = 1.0

Parameters for the Base Dense Deep Learning Network:
*   Adam optimization with 0.01 learning rate
*   Hidden Dense Layers = 3
*   Dense Layer Neurons = 54
*   Activation = ReLU

Parameters for the RNN model:
*   Adam optimization with 0.001 learning rate
*   GRU Layers = 3
*   GRU hidden state neurons = 200
*   Dropout rate = 0.5
---


Ranked by Kaggle score, the best performing model was the GBC model without data augmentation (0.113), followed by the basic model without data augmentation (0.084), the GBC model with data augmentation (0.074) and the basic model with data augmentation (0.073). The RNN model heavily overfitted the data when trained without data augmentation: it reached a training F1 score of 0.73 but a Kaggle score of 0.00. When trained on the oversampled data, however, it performed nearly as well as the basic model (0.065). Overall, data augmentation proved quite successful with the RNN network, but it had a negative impact on the GBC and the basic model. It might be worth exploring other techniques for handling imbalanced datasets, for example undersampling the majority class.
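
Undersampling, mentioned above as a possible alternative, was not tried in this work. As a rough sketch of what it would involve, the NumPy example below keeps every minority row and an equally sized random subset of the majority rows. The function name and the toy data are assumptions for illustration only.

```python
import numpy as np

def random_undersample(features, labels, seed=23):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    kept = [rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
            for c in classes]
    order = rng.permutation(np.concatenate(kept))
    return features[order], labels[order]

# Toy data with roughly the class ratio of the dataset (about 6% positives).
labels = np.array([1] * 6 + [0] * 94)
features = np.arange(200, dtype=float).reshape(100, 2)
x_small, y_small = random_undersample(features, labels)
print(np.bincount(y_small))  # [6 6]
```

The trade-off is that most of the majority rows are discarded, which may matter on a dataset that is already small.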


In [None]:
d = {
    'Model': [
        "Gradient Boosting Classifier [Data Augmentation]",
        "Gradient Boosting Classifier [No Data Augmentation]",
        "Basic Deep Learning Model [Data Augmentation]",
        "Basic Deep Learning Model [No Data Augmentation]",
        "RNN LSTM Deep Learning Model [Data Augmentation]",
        "RNN LSTM Deep Learning Model [No Data Augmentation]",
    ],
    'Training F1 Score': [0.98, 0.68, 0.42, 0.45, 0.37, 0.73],
    'Kaggle Score': [0.074, 0.113, 0.073, 0.084, 0.065, 0.00],
}
df = pd.DataFrame(data=d)
df

| | Model | Training F1 Score | Kaggle Score |
|---|---|---|---|
| 0 | Gradient Boosting Classifier [Data Augmentation] | 0.98 | 0.074 |
| 1 | Gradient Boosting Classifier [No Data Augmenta... | 0.68 | 0.113 |
| 2 | Basic Deep Learning Model [Data Augmentation] | 0.42 | 0.073 |
| 3 | Basic Deep Learning Model [No Data Augmentation] | 0.45 | 0.084 |
| 4 | RNN LSTM Deep Learning Model [Data Augmentation] | 0.37 | 0.065 |
| 5 | RNN LSTM Deep Learning Model [No Data Augmenta... | 0.73 | 0.0 |


# Summary and Recommendation

In conclusion, to our surprise, the standard machine learning model outperformed the deep learning models, which we attribute to its ability to generalise better. Nevertheless, a more sophisticated deep learning model, combining more advanced techniques that are beyond the scope of this course, may be worth exploring, as it could produce better results.


# References

**[1]** Brownlee J., 2021, 'SMOTE for Imbalanced Classification with Python', *Machine Learning Mastery*,   Source: 
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/ (Last accessed 11/04/2021)



**[2]** Nelson D., 2021, 'Gradient Boosting Classifiers in Python with Scikit-Learn', *Stack Abuse*,   Source: 
https://stackabuse.com/gradient-boosting-classifiers-in-python-with-scikit-learn/ (Last accessed 11/04/2021)


**[3]**  'Recurrent Neural Networks', *IBM*,   Source: 
https://www.ibm.com/cloud/learn/recurrent-neural-networks  (Last accessed 11/04/2021)

# Code


In [None]:
# To get the notebook to work you need to install the older scikit-learn version 0.21.2.
# Install it, and then restart the runtime.

!pip install scikit-learn==0.21.2



In [None]:
test_df = pd.read_csv('test.csv')
x_test_Id = test_df.pop("Id")
train_df = pd.read_csv('train.csv')
train_df.head()



| | user | session | query | timestamp | search | rank | serp | hour | day | month | dwell | new-sub | premium-pack | psrel | source | type | nature | cpvs | #cpv45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8438057 | A311E564F0A79803FB564CEAB6D7499A | d4fe169251f77f0800245e2df8376856 | 2020-05-26 10:45:36 | quick | 1 | 1 | 10 | Tue | May | 1 | 1 | 0 | 0 | Intercon | notice | services | ['66131100', '66141000', '66519600', '66520000'] | 1 |
| 1 | 8438876 | 5E91CF19B8BEBA58A90E54EC97AAB3AF | 5066bca0a00273cf3925b0c2f260f763 | 2020-01-21 10:47:51 | saved | 75 | 8 | 10 | Tue | Jan | 10 | 1 | 0 | 0 | Contrax Weekly | notice | services | ['79421000', '92520000', '92521000'] | 2 |
| 2 | 922102585 | 7D717BA805FB42D51D6C8EC15C0DE2C1 | 174e0e6c62fd5d7b044dd05b47ce79c9 | 2020-02-05 09:37:42 | advanced | 4 | 1 | 9 | Wed | Feb | 21 | 1 | 0 | 0 | Contrax Weekly | notice | services | ['79421000', '92520000', '92521000'] | 2 |
| 3 | 2105483652 | D4855E55686DB80328B141598E3174CE | 0f9f7f67dc569a6e3dba1ef35ce8970a | 2020-01-21 14:43:57 | advanced | 66 | 4 | 14 | Tue | Jan | 21 | 0 | 0 | 0 | Contrax Weekly | notice | services | ['79421000', '92520000', '92521000'] | 2 |
| 4 | 8438876 | 5E91CF19B8BEBA58A90E54EC97AAB3AF | 5066bca0a00273cf3925b0c2f260f763 | 2020-01-21 10:48:33 | saved | 81 | 9 | 10 | Tue | Jan | 20 | 1 | 0 | 0 | Contrax Weekly | notice | services | ['72000000', '72263000', '72300000'] | 1 |


In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user          33000 non-null  int64 
 1   session       33000 non-null  object
 2   query         33000 non-null  object
 3   timestamp     33000 non-null  object
 4   search        33000 non-null  object
 5   rank          33000 non-null  int64 
 6   serp          33000 non-null  int64 
 7   hour          33000 non-null  int64 
 8   day           33000 non-null  object
 9   month         33000 non-null  object
 10  dwell         33000 non-null  int64 
 11  new-sub       33000 non-null  int64 
 12  premium-pack  33000 non-null  int64 
 13  psrel         33000 non-null  int64 
 14  source        33000 non-null  object
 15  type          33000 non-null  object
 16  nature        21598 non-null  object
 17  cpvs          33000 non-null  object
 18  #cpv45        33000 non-null  int64 
dtypes: i

In [None]:
train_df.nunique()

user             1171
session          8196
query           12906
timestamp       32610
search              4
rank              623
serp               78
hour               24
day                 7
month               6
dwell            1205
new-sub             2
premium-pack        2
psrel               2
source             12
type                4
nature              3
cpvs             9441
#cpv45             24
dtype: int64

In [None]:
plt.figure(figsize = (6,5))
sns.heatmap(train_df.corr(method = "kendall"), annot = True, fmt = ".1g", vmin = -1, vmax =1, center =0,cmap="coolwarm")
plt.show()

In [None]:
neg, pos = np.bincount(train_df['psrel'])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Examples:
    Total: 33000
    Positive: 1962 (5.95% of total)



In [None]:
cleaned_train_df = train_df.copy()
# Drop the identifier-like columns that are not used as features.
cleaned_train_df.drop(columns=['user', 'session', 'query', 'timestamp','cpvs'], inplace=True)

cleaned_train_df.fillna("Other", inplace=True)

In [None]:
#Train DataFrame encoding
# One-hot encode each categorical column and append the encoded columns to the frame.
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
categorical_columns = ['search', 'source', 'day', 'month', 'nature', 'type']

for column in categorical_columns:
  values = cleaned_train_df[column].to_numpy().reshape(-1, 1)
  onehot = enc.fit_transform(values)

  # Wrap the 2D array in a DataFrame so that it can be concatenated
  ohe_df = pd.DataFrame(onehot, columns=enc.get_feature_names())

  cleaned_train_df = cleaned_train_df.drop([column], axis=1)
  cleaned_train_df = cleaned_train_df.reset_index(drop=True)
  cleaned_train_df = pd.concat([cleaned_train_df, ohe_df], axis=1)

In [None]:
#Test DataFrame encoding
test_df.fillna("Other", inplace=True)

test_df.drop(columns=['user','session','query', 'timestamp','cpvs'], inplace = True)

# Note: the encoder is re-fitted on the test data here, so the encoded columns
# match the training columns only if both sets contain the same categories.
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)

for column in ['search', 'source', 'day', 'month', 'nature', 'type']:
  values = test_df[column].to_numpy().reshape(-1, 1)
  onehot = enc.fit_transform(values)

  # Wrap the 2D array in a DataFrame so that it can be concatenated
  ohe_df = pd.DataFrame(onehot, columns=enc.get_feature_names())

  test_df = test_df.drop([column], axis=1)
  test_df = test_df.reset_index(drop=True)
  test_df = pd.concat([test_df, ohe_df], axis=1)
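
The cells above fit a separate encoder on the training data and on the test data. An alternative, which guarantees that both sets end up with the same columns in the same order, is to learn the categories from the training column once and reuse them for the test column. The sketch below illustrates the idea in plain Python and NumPy with hypothetical helper names and made-up values; it mirrors the behaviour of `handle_unknown='ignore'` but is not the sklearn implementation.

```python
import numpy as np

def fit_categories(values):
    """Learn the sorted list of categories from the training column."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode values against fixed categories; unseen values become all-zero rows."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        if value in index:
            encoded[row, index[value]] = 1.0
    return encoded

train_search = ["quick", "saved", "advanced", "quick"]
test_search = ["saved", "quick", "browse"]    # "browse" is a made-up unseen value

categories = fit_categories(train_search)
print(categories)                              # ['advanced', 'quick', 'saved']
print(one_hot(test_search, categories))
```

With sklearn this corresponds to calling `fit` on the training column only and `transform` on both the training and the test column.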

In [None]:
# Use a utility from sklearn to split and shuffle our dataset.
train_df, val_df = train_test_split(cleaned_train_df, test_size=0.2,random_state=SEED)

# Form np arrays of labels and features.
train_labels = np.array(train_df.pop('psrel'))
val_labels = np.array(val_df.pop('psrel'))
train_features = np.array(train_df)
val_features = np.array(val_df)

In [None]:
#Scale the training and validation data
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
val_features = scaler.transform(val_features)
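
The scaler is fitted on the training split and then reused for the validation split. What this amounts to can be shown with a short NumPy sketch on made-up data: the mean and standard deviation are estimated from the training rows only and then applied unchanged to any other rows, which would include the test set before prediction.

```python
import numpy as np

rng = np.random.default_rng(23)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))     # hypothetical training rows
held_out = rng.normal(loc=50.0, scale=10.0, size=(50, 3))   # hypothetical unseen rows

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
held_out_scaled = (held_out - mean) / std   # reuse the statistics, do not re-estimate

print(np.allclose(train_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(train_scaled.std(axis=0), 1.0))   # True
```

The held-out rows will generally not have exactly zero mean and unit variance after scaling, which is expected: they are expressed in the units learned from the training data.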

In [None]:
#Data augmentation/oversampling
oversample = SMOTE(random_state=SEED)
train_features_oversampled, train_labels_oversampled = oversample.fit_resample(train_features, train_labels)

In [None]:
neg, pos = np.bincount(train_labels_oversampled)
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Examples:
    Total: 49666
    Positive: 24833 (50.00% of total)



In [None]:
#Grid search for Gradient Boosting Classifier (Takes 3hrs to run)
parameters = {
    "max_depth": [3, 5, 10, 15, 20, 25, 30, 60, None],
    "n_estimators": [100, 200, 300, 500],
    "learning_rate":[0.1, 0.3, 0.5, 0.8],
}
grid_search = GridSearchCV(estimator=GradientBoostingClassifier(), param_grid=parameters, return_train_score=True, scoring='f1_macro')
grid_search = grid_search.fit(train_features, train_labels)
grid_search.best_params_



In [None]:
#GradientBoostingClassifier initialization and prediction
gb_clf = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, max_depth=5)
#gb_clf.fit(train_features,train_labels)
gb_clf.fit(train_features_oversampled, train_labels_oversampled)
y_predicted_gbc = gb_clf.predict(test_df)
y_predicted_gbc = y_predicted_gbc.reshape(-1,1)

#scores = cross_validate(gb_clf, train_features, train_labels, cv=10,scoring=('f1_macro'), return_train_score=True)
scores = cross_validate(gb_clf, train_features_oversampled, train_labels_oversampled, cv=10,scoring=('f1_macro'), return_train_score=True)
print("Training Score: " + str(np.average(scores['train_score'])))
print("Test Score: " + str(np.average(scores['test_score'])))

Training Score: 0.9459554324378765
Test Score: 0.9396229597614969
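
The cross-validated scores above are computed on folds of the already oversampled data, so every validation fold also contains synthetic samples and is balanced 50:50. One way to obtain validation folds with the original class distribution is to oversample inside each fold, on the training part only. The NumPy sketch below illustrates this under simplifying assumptions: it uses random duplication instead of SMOTE, unstratified folds and hypothetical function names.

```python
import numpy as np

def oversample_minority(labels, indices, rng):
    """Duplicate random minority rows until both classes are the same size."""
    pos = indices[labels[indices] == 1]
    neg = indices[labels[indices] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if len(minority) == 0:                     # nothing to duplicate in this fold
        return indices
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    return np.concatenate([indices, extra])

def folds_with_inner_oversampling(labels, n_splits=5, seed=23):
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(labels))
    for fold in np.array_split(shuffled, n_splits):
        train_idx = np.setdiff1d(shuffled, fold)
        # Only the training part is balanced; the validation fold is left as it is.
        yield oversample_minority(labels, train_idx, rng), fold

labels = np.array([1] * 10 + [0] * 90)         # toy labels, 10% positives
for train_idx, val_idx in folds_with_inner_oversampling(labels):
    print(np.bincount(labels[train_idx]), np.bincount(labels[val_idx], minlength=2))
```

With imblearn, the same effect is usually achieved by placing the sampler and the classifier in an imblearn pipeline and passing that pipeline to the cross-validation routine.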


In [None]:
#Reshape train and validation data for the RNN
train_features_rnn = train_features.reshape(train_features.shape[0], 1, train_features.shape[1])
train_features_oversampled_rnn = train_features_oversampled.reshape(train_features_oversampled.shape[0], 1, train_features_oversampled.shape[1])
val_features_rnn = val_features.reshape(val_features.shape[0], 1,  val_features.shape[1])
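
The reshape above adds a sequence axis of length 1, so that each row is presented to the LSTM layers as a sequence with a single time step. A tiny NumPy example with made-up numbers shows the effect on the array shape:

```python
import numpy as np

features = np.arange(12, dtype=float).reshape(3, 4)    # 3 samples, 4 features

# Insert a sequence axis of length 1: (samples, features) -> (samples, 1, features).
as_sequences = features.reshape(features.shape[0], 1, features.shape[1])

print(as_sequences.shape)                                # (3, 1, 4)
print(np.array_equal(as_sequences[:, 0, :], features))   # True
```

The values themselves are unchanged; only the layout differs, which is why the same scaled features can be used for both the dense and the recurrent model.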

In [None]:
# Given this function, we pass it through as a parameter to a Keras Classification Wrapper
keras_class = keras.wrappers.scikit_learn.KerasClassifier(build_fn=build_RNN_model)

early_stopping_cb = keras.callbacks.EarlyStopping(monitor = 'loss', patience=3, restore_best_weights=True)
keras_class_fit_oversampled = keras_class.fit(train_features_oversampled_rnn, train_labels_oversampled, callbacks = [early_stopping_cb])

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_32 (LSTM)               (None, 1, 200)            196000    
_________________________________________________________________
lstm_33 (LSTM)               (None, 1, 30)             27720     
_________________________________________________________________
dropout_24 (Dropout)         (None, 1, 30)             0         
_________________________________________________________________
lstm_34 (LSTM)               (None, 1, 30)             7320      
_________________________________________________________________
dropout_25 (Dropout)         (None, 1, 30)             0         
_________________________________________________________________
lstm_35 (LSTM)               (None, 1, 30)             7320      
_________________________________________________________________
dropout_26 (Dropout)         (None, 1, 30)            

In [None]:
# Parameter space that we want to explore.
# Note that the keys match the parameters of the builder function.

param_distribs = {
    "n_neurons": np.arange(25,35),
    "learning_rate": [0.01,0.02]
}

# Number of possible combinations: 10 values of n_neurons * 2 learning rates = 20

# Set up the search - trying n_iter possibilities for cv folds OVERSAMPLED
#rnd_search_cv_oversampled = RandomizedSearchCV(keras_class, param_distribs, n_iter=10, cv=2)
#rnd_search_cv_oversampled.fit(train_features_oversampled, train_labels_oversampled, epochs=50, validation_data=(val_features, val_labels), callbacks=[early_stopping_cb])

# Randomized search for the RNN, fitted on the oversampled training data
rnd_search_cv = RandomizedSearchCV(keras_class, param_distribs, n_iter=10, cv=2)
rnd_search_cv.fit(train_features_oversampled_rnn, train_labels_oversampled, epochs=50, callbacks=[early_stopping_cb])

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_40 (LSTM)               (None, 1, 200)            196000    
_________________________________________________________________
lstm_41 (LSTM)               (None, 1, 32)             29824     
_________________________________________________________________
dropout_30 (Dropout)         (None, 1, 32)             0         
_________________________________________________________________
lstm_42 (LSTM)               (None, 1, 32)             8320      
_________________________________________________________________
dropout_31 (Dropout)         (None, 1, 32)             0         
_________________________________________________________________
lstm_43 (LSTM)               (None, 1, 32)             8320      
_________________________________________________________________
dropout_32 (Dropout)         (None, 1, 32)            

RandomizedSearchCV(cv=2, error_score='raise-deprecating',
                   estimator=<tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7fcdbeb152d0>,
                   iid='warn', n_iter=10, n_jobs=None,
                   param_distributions={'learning_rate': [0.01, 0.02],
                                        'n_neurons': array([25, 26, 27, 28, 29, 30, 31, 32, 33, 34])},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring=None, verbose=0)
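
The difference between a full grid search and the randomized search used here can be illustrated with the standard library alone. The sketch below enumerates every combination of the parameter space defined above and then draws `n_iter=10` of them without replacement, which is how a randomized search over plain lists of values behaves; the variable names are chosen for the example.

```python
import itertools
import random

param_space = {
    "n_neurons": list(range(25, 35)),
    "learning_rate": [0.01, 0.02],
}

# Every combination a full grid search would have to evaluate.
names = list(param_space)
grid = [dict(zip(names, values)) for values in itertools.product(*param_space.values())]
print(len(grid))  # 20

# A randomized search evaluates only n_iter of them, drawn without replacement.
rng = random.Random(23)
candidates = rng.sample(grid, k=10)
print(len(candidates))  # 10
```

With `cv=2`, each of the 10 sampled candidates is trained twice, so the search fits 20 models instead of the 40 a full grid search would need.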

In [None]:
# Best parameters found by each search:
#RandomSearch results for oversampled model {'n_neurons': 54, 'learning_rate': 0.01}
#RandomSearch results for non-oversampled model {'n_neurons': 51, 'learning_rate': 0.02}
#RNN network {'n_neurons': 32, 'learning_rate': 0.01}
print(rnd_search_cv.best_params_)
print(rnd_search_cv.best_score_)

{'n_neurons': 32, 'learning_rate': 0.01}
0.40635445713996887


In [None]:
model_class_nonoversampled = build_model_basic(n_neurons=51, learning_rate= 0.02)

In [None]:
#RNN training without data augmentation
model_rnn_func = build_RNN_model(n_neurons = 31, learning_rate= 0.01)

history_rnn_class = model_rnn_func.fit(train_features_rnn, train_labels,epochs = 20, validation_data = (val_features_rnn, val_labels))

#Prediction
test_df_rnn = test_df.to_numpy().reshape(test_df.shape[0],  1,test_df.shape[1])
y_predicted_rnn = model_rnn_func.predict(test_df_rnn)

Model: "sequential_26"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_229 (LSTM)              (None, 1, 200)            196000    
_________________________________________________________________
lstm_230 (LSTM)              (None, 1, 31)             28768     
_________________________________________________________________
dropout_174 (Dropout)        (None, 1, 31)             0         
_________________________________________________________________
lstm_231 (LSTM)              (None, 1, 31)             7812      
_________________________________________________________________
dropout_175 (Dropout)        (None, 1, 31)             0         
_________________________________________________________________
lstm_232 (LSTM)              (None, 1, 31)             7812      
_________________________________________________________________
dropout_176 (Dropout)        (None, 1, 31)           

In [None]:
#RNN training with data augmentation
model_rnn_func_oversamp = RNNModel(n_neurons = 31)
model_rnn_func_oversamp.compile(optimizer = keras.optimizers.Adam(learning_rate=0.01),
                loss = keras.losses.BinaryCrossentropy(),
                metrics = metrics)

history_rnn_class = model_rnn_func_oversamp.fit(train_features_oversampled_rnn, train_labels_oversampled,epochs = 20, validation_data = (val_features_rnn, val_labels))

#Prediction
test_df_rnn = test_df.to_numpy().reshape(test_df.shape[0],  1,test_df.shape[1])
y_predicted_rnn_oversampl = model_rnn_func_oversamp.predict(test_df_rnn)



In [None]:
#Transforming the predictions to boolean output
y_pred_bool = np.where(y_predicted_gbc > 0.5, 1, 0)
y_pred_bool

#Generate the output file for the submission
pd.DataFrame(y_pred_bool).set_index(x_test_Id).rename(columns={0:'psrel'}).to_csv('gbc_prediction_oversampl.csv')