# Submission from Martin King, 122108604.

I have written comments in markdown in some parts of this Jupyter notebook. These comments start with "My comment:" in red.

# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 07/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
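As a concrete illustration of the definition above, here is a minimal sketch of a CNF formula and a brute-force satisfiability check. The integer encoding of literals (DIMACS-style) and the function name `is_satisfiable` are assumptions made for this example; real solvers use far more efficient search than exhaustive enumeration.

```python
from itertools import product

# A formula is a list of clauses; each clause is a list of non-zero integers.
# Literal k means "variable k is true", -k means "variable k is false".
def is_satisfiable(num_vars, clauses):
    """Brute-force check: try every assignment of the num_vars variables."""
    for values in product([False, True], repeat=num_vars):
        # values[i] holds the truth value of variable i + 1.
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
sat_formula = [[1, -2], [2, 3], [-1, -3]]
# (x1) and (not x1)
unsat_formula = [[1], [-1]]

print(is_satisfiable(3, sat_formula))    # True
print(is_satisfiable(1, unsat_formula))  # False
```

The loop visits up to 2^n assignments, which is why exhaustive search is only workable for tiny formulas.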

One of the most successful approaches is an algorithm portfolio, where a solver is selected among a set of candidates depending on the problem type. Your task is to create a classifier that takes as input the SAT instance's features and identifies the class.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, the fraction of horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.
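To give a feel for what such features look like, the sketch below computes a handful of them from a small CNF formula. The keys `c`, `v`, `clauses_vars_ratio` and `vars_clauses_ratio` mirror column names in the dataset; `horn_fraction` and the function `basic_features` are hypothetical names, and the exact definitions used by the feature extractor may differ.

```python
def basic_features(num_vars, clauses):
    """Compute a few simple size and structure features of a CNF formula."""
    num_clauses = len(clauses)
    # A Horn clause has at most one positive literal.
    horn = sum(1 for clause in clauses if sum(lit > 0 for lit in clause) <= 1)
    return {
        "c": num_clauses,
        "v": num_vars,
        "clauses_vars_ratio": num_clauses / num_vars,
        "vars_clauses_ratio": num_vars / num_clauses,
        "horn_fraction": horn / num_clauses,  # hypothetical feature name
    }

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
formula = [[1, -2], [2, 3], [-1, -3]]
features = basic_features(3, formula)
print(features)
```

The same arithmetic can be checked against the first row of the dataset preview, where `c = 608` and `v = 71` give `clauses_vars_ratio = 608 / 71 ≈ 8.56338`.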


The original dataset is available at:
https://github.com/bprovanbessell/SATfeatPy/blob/main/features_csv/all_features.csv



## Data Preparation

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/train_dataset.csv", index_col=0)
df.head()



Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,608,71,8.56338,0.116776,0.045172,0.173688,0.029605,0.060855,2.802758,0.045172,...,5078250.0,1056.695041,1.0,2.981935e-09,2113.390083,1081.900778,1.0,1.30208e-29,2163.801556,matching
1,615,70,8.785714,0.113821,0.049617,0.168633,0.03252,0.069919,2.607264,0.049617,...,5469376.0,1207.488426,1.0,6.927306e-28,2414.976852,1186.623627,1.0,3.491123e-120,2373.247255,matching
2,926,105,8.819048,0.113391,0.033385,0.186444,0.017279,0.047516,3.022879,0.033385,...,4297025.0,441.327046,1.0,1.194627e-76,882.654092,474.697562,1.0,0.0,949.395124,matching
3,603,70,8.614286,0.116086,0.049799,0.133441,0.033167,0.063018,2.688342,0.049799,...,6640651.0,1181.583331,1.0,2.437278e-30,2363.166661,1149.059132,1.0,4.67009e-147,2298.118264,matching
4,228,43,5.302326,0.188596,0.067319,0.162581,0.048246,0.087719,2.203308,0.067319,...,2437500.0,1091.423921,0.999966,0.03723599,2182.810606,1296.888087,1.0,6.307424e-06,2593.776167,matching


In [None]:
# Label or target variable
df['target'].value_counts()

tseitin           298
dominating        294
cliquecoloring    268
php               266
subsetcard        263
op                201
tiling            120
5clique           108
3color            104
matching          102
5color             98
4color             98
3clique            98
4clique            94
Name: target, dtype: int64
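The class counts above are uneven (from 94 to 298 instances per class), so a purely random split can shift the class proportions between the training and test sets. The sketch below shows one possible remedy, the `stratify` argument of `train_test_split`, on synthetic stand-in data; the array names and class labels are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 rows with imbalanced three-class labels.
X_demo = rng.normal(size=(300, 5))
y_demo = np.array(["a"] * 200 + ["b"] * 70 + ["c"] * 30)

# stratify=y_demo keeps the class proportions (roughly) equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0
)

labels, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, test_counts)))
```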

# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate a decision tree classifier using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset.

In [None]:
# YOUR CODE HERE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#from sklearn.preprocessing import LabelEncoder
#label_encoder = LabelEncoder()
#label_encoder.fit(df['target'])
#df['target'] = label_encoder.transform(df['target'])

from sklearn.model_selection import train_test_split
X = df.iloc[:, :-1]
y = df['target']

# Clean data.
X = X.fillna(0)
X = X.replace([np.inf, -np.inf], 0)

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=122108)

# Scale the features.
#scaler = StandardScaler()
#train_X = scaler.fit_transform(train_X)
#test_X = scaler.transform(test_X)

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

classifierDT = Pipeline([
    # ('dr', PCA(327)),
    ("predictor", DecisionTreeClassifier())
])

classifierDT.fit(train_X, train_y)

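from sklearn.model_selection import cross_val_score
# cross_val_score clones and refits the pipeline on every fold, so each figure
# below is a cross-validated accuracy computed within that split only.
train_acc = np.mean(cross_val_score(classifierDT, train_X, train_y, scoring="accuracy"))
test_acc = np.mean(cross_val_score(classifierDT, test_X, test_y, scoring="accuracy"))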

print("CV Train accuracy: ", train_acc)
print("CV Test accuracy: ", test_acc)

from sklearn.metrics import accuracy_score
predictions = classifierDT.predict(train_X)
train_acc = accuracy_score(predictions, train_y)
predictions = classifierDT.predict(test_X)
test_acc = accuracy_score(predictions, test_y)

print("Pred. Train accuracy: ", train_acc)
print("Pred. Test accuracy: ", test_acc)


CV Train accuracy:  0.9733394202236931
CV Test accuracy:  0.9447413793103449
Pred. Train accuracy:  0.9994075829383886
Pred. Test accuracy:  0.9737569060773481


<font color="red">**My comment:**</font> The predicted training and test accuracies are 99.94% and 97.38% respectively. These values are already high. The following two sections, especially the final one, attempt to improve the accuracy further.

## Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature selection.
* Feature normalisation.

Your report should provide concrete information about your reasoning; everything should be well-explained.

The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.
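The techniques listed above can be combined in a single scikit-learn pipeline, so that normalisation and feature selection are refitted inside every cross-validation fold. The following is a minimal sketch on synthetic data, not the configuration used in this notebook; the step names, the choice of `SelectKBest` with `f_classif`, and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the SAT feature table.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=8, n_classes=3, random_state=0
)
# Hold-out: keep 30% aside and never use it during tuning.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # feature normalisation
    ("select", SelectKBest(score_func=f_classif)),  # feature selection
    ("tree", DecisionTreeClassifier(random_state=0)),
])
param_grid = {
    "select__k": [5, 10, 30],
    "tree__max_depth": [3, 6, None],
}
# Hyper-parameter tuning with 5-fold cross-validation on the training split.
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("CV accuracy:", search.best_score_)
print("Hold-out accuracy:", search.score(X_te, y_te))
```

Because the scaler and selector live inside the pipeline, the held-out rows never influence preprocessing or tuning, and the final hold-out score is an estimate on genuinely unseen data.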

<font color="red">**My comment:**</font> I use GridSearchCV to find optimal values for the number of PCA components to retain, the maximum tree depth, the minimum number of samples required to split a node, and the minimum number of samples required in a leaf.

In [None]:
# YOUR CODE HERE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

classifierDT = Pipeline([
    ('dr', PCA()),
    ("predictor", DecisionTreeClassifier())
])

dt_param_grid = {
    "dr__n_components": [150, 210, 270, 327],
    "predictor__max_depth": [10, 20, 30, 40, None],
    "predictor__min_samples_split": [2, 5, 10, 15],
    "predictor__min_samples_leaf": [1, 2, 4, 6]
}

from sklearn.model_selection import GridSearchCV
dt_gs = GridSearchCV(classifierDT, dt_param_grid, scoring="accuracy")
dt_gs.fit(train_X, train_y)

dt_gs.best_params_, dt_gs.best_score_

KeyboardInterrupt: 

<font color="red">**My comment:**</font> Evaluate the decision tree classifier fitted with the hyperparameters found by the grid search.

In [None]:
classifierDT.set_params(**dt_gs.best_params_)
classifierDT.fit(train_X, train_y)

from sklearn.model_selection import cross_val_score
train_acc = np.mean(cross_val_score(classifierDT, train_X, train_y, scoring="accuracy"))
test_acc = np.mean(cross_val_score(classifierDT, test_X, test_y, scoring="accuracy"))
print("CV Train accuracy: ", train_acc)
print("CV Test accuracy: ", test_acc)

from sklearn.metrics import accuracy_score
predictions = classifierDT.predict(train_X)
train_acc = accuracy_score(predictions, train_y)

predictions = classifierDT.predict(test_X)
test_acc = accuracy_score(predictions, test_y)
print("Pred. Train accuracy: ", train_acc)
print("Pred. Test accuracy: ", test_acc)

CV Train accuracy:  0.8690709883588223
CV Test accuracy:  0.816360153256705
Pred. Train accuracy:  0.9946682464454977
Pred. Test accuracy:  0.8908839779005525


<font color="red">**My comment:**</font>

Grid search found optimal:

({'dr__n_components': 150, \
  'predictor__max_depth': None, \
  'predictor__min_samples_leaf': 1, \
  'predictor__min_samples_split': 2}, \
 0.8803223710779063)

With these hyperparameter values the decision tree reaches a very high training accuracy of 0.9947. However, the test accuracy has dropped to 0.8909 from 0.9738 previously. The selected tree settings are the scikit-learn defaults, so the only change from the baseline is the PCA step with 150 components. The pipeline found by the grid search has probably overfitted the training data.

In the previous part I had also experimented with standardising the features using:

scaler = StandardScaler() \
train_X = scaler.fit_transform(train_X) \
test_X = scaler.transform(test_X)

I did not find that this had any substantial effect on the train and test accuracies.
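A decision tree splits on per-feature thresholds, so rescaling a feature does not change the splits it can find, which is consistent with the observation above. PCA, in contrast, is sensitive to scale: the features in this dataset range from small fractions to values in the millions, and unscaled PCA tends to be dominated by the largest-magnitude columns. The sketch below illustrates this effect on synthetic data; it is an assumption that the same effect contributes to the results in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: four independent features, one on a much larger scale.
X_demo = rng.normal(size=(500, 4)) * np.array([1.0, 1.0, 1.0, 1e6])

# PCA on raw features versus PCA after standardisation.
raw_pca = PCA(n_components=2).fit(X_demo)
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)

raw_ratio = raw_pca.explained_variance_ratio_
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_
print("Unscaled:", raw_ratio)
print("Scaled:  ", scaled_ratio)
```

In the unscaled case the first component captures almost all of the variance, because it simply follows the large-scale column. Placing `StandardScaler` before `PCA` in the pipeline is one way to avoid this.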

## New classifier (10 Marks)

Replicate the previous task for a classifier different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.


Save your best model into your GitHub. Create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, so load all the libraries you need in it as well.

<font color="red">**My comment:**</font>

Using experience I gained from the Deep Learning module this semester, I set up a simple neural network (NN) for the current classification problem. The NN has 5 dense (fully connected) layers, including the input and output layers. Each hidden layer is followed by batch normalisation and dropout at a rate of 0.35 to reduce overfitting to the training data. The training is very fast even without using a GPU on Colab. Training accuracy = 0.9995. In the final (next) part, I test the model using the test data provided.
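To make the role of dropout concrete, here is a minimal NumPy sketch of "inverted" dropout, the scheme in which units are zeroed at random during training and the survivors are rescaled by `1 / (1 - rate)`. The function name `dropout` and the rate used in the demo are illustrative assumptions; this is not the Keras implementation itself.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # at inference time the layer is the identity
    keep_mask = rng.random(activations.shape) >= rate
    # Rescaling keeps the expected activation unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((100, 128))  # stand-in for one batch of hidden activations
dropped = dropout(hidden, rate=0.3, rng=rng)

print("Fraction zeroed:", np.mean(dropped == 0.0))
print("Mean activation:", dropped.mean())
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single unit, which is the intuition behind its regularising effect.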

In [None]:
# YOUR CODE HERE
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.utils import to_categorical

# Read in data.
df = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/train_dataset.csv", index_col=0)

X = df.iloc[:, :-1]
y = df['target']

# Data cleaning. Missing and infinity values.
X = X.fillna(0)
X = X.replace([np.inf, -np.inf], 0)

# Split data into train and validation sets.
train_X, valid_X, train_y, valid_y = train_test_split(X, y, train_size=0.8)

# Preprocess the features and target class.
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
valid_X = scaler.transform(valid_X)

encoder = LabelEncoder()
train_y_encoded = encoder.fit_transform(train_y)
valid_y_encoded = encoder.transform(valid_y)

train_y_categorical = to_categorical(train_y_encoded)
valid_y_categorical = to_categorical(valid_y_encoded)

# For NN, we need to have validation and test sets.
# Commented now. Use provided test data in the next section.
# valid_X2, test_X, valid_y_categorical2, test_y_categorical = train_test_split(valid_X, valid_y_categorical, train_size=0.8)

# Set up a dense neural network.
model = Sequential()
model.add(Dense(327, activation='relu', input_shape=(train_X.shape[1],)))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(32, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(14, activation='softmax'))

# Compile the model.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model.
model.fit(train_X, train_y_categorical, epochs=100, batch_size=32, validation_data=(valid_X, valid_y_categorical))

# Check the model training accuracy.
# I use the provided test data for evaluation in the final part of this notebook.
loss, accuracy = model.evaluate(train_X, train_y_categorical)
print("Training loss: ", loss)
print("Training accuracy: ", accuracy)


[Keras training log: Epoch 1/100 to Epoch 77/100, output truncated at "Epoch 78". Per-epoch metrics not shown.]

In [None]:
from joblib import dump

# Saving the trained NN model.
# Only need to do once normally.
dump(model, 'mpk_model.joblib')

['mpk_model.joblib']
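The cell above stores only the trained model. The scaler and label encoder fitted on the training split are also part of the prediction pipeline, so one option is to persist them in the same way and reuse them at evaluation time. The sketch below shows the idea on synthetic data; the file name `preprocessing_demo.joblib` and the dictionary layout are hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the training features and labels.
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_train_demo = np.array(["php", "tiling", "op", "matching"] * 25)

# Fit the preprocessing objects on the training data only.
scaler_demo = StandardScaler().fit(X_train_demo)
encoder_demo = LabelEncoder().fit(y_train_demo)

# Persist both objects in one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "preprocessing_demo.joblib")
dump({"scaler": scaler_demo, "encoder": encoder_demo}, path)

# Later, in an independent cell: reload and apply without refitting.
bundle = load(path)
X_new = rng.normal(loc=5.0, scale=2.0, size=(10, 3))
X_new_scaled = bundle["scaler"].transform(X_new)
y_new_encoded = bundle["encoder"].transform(np.array(["op", "php"]))

print(bundle["encoder"].classes_)
print(y_new_encoded)
```

Reusing the fitted objects means new data is scaled with the training means and variances, and class indices keep the meaning they had during training.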

# <font color="blue">FOR GRADING ONLY</font>

Save your best model into your GitHub. Create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv


<font color="red">**My comment:**</font> Using the test data provided, the trained NN produces a test accuracy of 0.9984. I experimented a little with the dropout rate. In the end, I chose a dropout rate of 0.35. I hope this makes the model robust to unseen data, so that it attains at least 0.99 accuracy.

<font color="red">**Note:**</font> Running this evaluation took no more than 20 seconds.

In [None]:
from joblib import dump, load
from io import BytesIO
import requests

# These packages are needed for preprocessing input data.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.utils import to_categorical

# INSERT YOUR MODEL'S URL
mLink = 'https://github.com/martin-king/dm_assign2/blob/main/mpk_model.joblib?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model = load(mfile)

# Input test data.
df_test = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/test_dataset.csv", index_col=0)

# YOUR CODE HERE
X = df_test.iloc[:, :-1]
y = df_test['target']

# Data cleaning. Missing and infinity values.
X = X.fillna(0)
X = X.replace([np.inf, -np.inf], 0)

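# Preprocess the features and target class.
# This part is critical for my model, which was trained on standardised features.
# The scaler fitted on the training split was not saved with the model, so a
# new one is fitted on the test features here.
scaler = StandardScaler()
test_X = scaler.fit_transform(X)

# LabelEncoder sorts the class names, so this mapping matches the training-time
# encoding provided that all 14 classes occur in the test data.
encoder = LabelEncoder()
test_y_encoded = encoder.fit_transform(y)
test_y_categorical = to_categorical(test_y_encoded)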

# Evaluate the model.
loss, accuracy = model.evaluate(test_X, test_y_categorical)
print("Test loss: ", loss)
print("Test accuracy: ", accuracy)

# Another way of doing the above.
pred = model.predict(test_X)
pred = np.argmax(pred, axis=1)
true = np.argmax(test_y_categorical, axis=1)
#from sklearn.metrics import accuracy_score
#accuracy = accuracy_score(true, pred)
#print(f'Test accuracy: {accuracy * 100:.2f}%')


Test loss:  0.001877226517535746
Test accuracy:  0.9984050989151001
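A single accuracy figure hides which classes are confused with each other. As a sketch of the alternative evaluation path in the cell above, the example below turns a matrix of class probabilities into class names and builds a confusion matrix. The probability matrix is a hand-written stand-in for `model.predict(test_X)`, and the three class names are a subset chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder

class_names = np.array(["3color", "matching", "php"])
encoder_demo = LabelEncoder().fit(class_names)

# Stand-in for model.predict(test_X): one row of class probabilities per instance.
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
])
true_labels = np.array(["3color", "matching", "php", "matching"])

# The most probable class index per row, mapped back to class names.
pred_idx = np.argmax(probs, axis=1)
pred_labels = encoder_demo.inverse_transform(pred_idx)

acc = accuracy_score(true_labels, pred_labels)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_labels, pred_labels, labels=encoder_demo.classes_)
print(acc)
print(cm)
```

Here the last instance is a `matching` problem predicted as `3color`, which shows up as the off-diagonal entry in the second row of the matrix.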
