# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 06/04/22 at 23:59**. You should submit a single .ipnyb file with your python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment. I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.

Recent advances in supervised learning have provided powerful techniques for classifying problems. In this project, we see the SAT problem as a classification problem. Given a Boolean formula (represented by a vector of features), we are asked to predict if it is satisfiable or not.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, fraction of horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.

The dataset is available at:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_train.csv

This is original unpublished data.

## Data Preparation

In [154]:
import pandas as pd

df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/6d5738101d173b97c565f143f945dedb9c42a400/dm_assignment2/sat_dataset_train.csv?raw=true")
df.head()

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,420,10,42.0,0.02381,0.6,0.0,0.6,0.6,0.0,0.6,...,78750.0,8e-06,0.0,7.875e-06,8e-06,2.385082e-21,0.0,2.385082e-21,2.385082e-21,1
1,230,20,11.5,0.086957,0.137826,0.089281,0.117391,0.16087,2.180946,0.137826,...,6646875.0,17433.722184,1.0,2.981244e-12,34867.444369,17277.21,1.0,1.358551e-53,34554.42,0
2,240,16,15.0,0.066667,0.3,0.0,0.3,0.3,0.0,0.3,...,500000.0,1525.878932,0.0,1525.879,1525.878932,1525.879,0.0,1525.879,1525.879,1
3,424,30,14.133333,0.070755,0.226415,0.485913,0.056604,0.45283,2.220088,0.226415,...,87500.0,0.000122,1.0,6.535723e-14,0.000245,8.218628e-07,1.0,1.499676e-61,1.643726e-06,0
4,162,19,8.526316,0.117284,0.139701,0.121821,0.111111,0.185185,1.940843,0.139701,...,5859400.0,16591.49431,1.0,6.912725999999999e-42,33182.988621,16659.03,1.0,0.0,33318.07,1


In [155]:
df.dtypes

c                       int64
v                       int64
clauses_vars_ratio    float64
vars_clauses_ratio    float64
vcg_var_mean          float64
                       ...   
rwh_2_mean            float64
rwh_2_coeff           float64
rwh_2_min             float64
rwh_2_max             float64
target                  int64
Length: 328, dtype: object

In [156]:
df['target'].value_counts()

1    976
0    953
Name: target, dtype: int64

Expected accuracy

In [157]:
976/(976+953)

0.5059616381544841

Remove infinity and NAN

In [158]:
import numpy as np
df.replace([np.inf, -np.inf], np.nan,inplace=True)
df = df.fillna(0)

Select the features i.e, all predictors

In [159]:
features = df.iloc[:,:-1]

Split into train and test set

In [160]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(features, df.target, random_state=420, test_size=0.30)

Check shape to make sure its correct

In [161]:
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(1350, 327) (579, 327) (1350,) (579,)


# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate K-NN and decision tree classifiers using 70% of the dataset from training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset. Compare the results of both classifiers.

In [162]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import neighbors
from sklearn import metrics
from sklearn import model_selection

knn_scores = []
for i in range(1,15):
  knn = KNeighborsClassifier(n_neighbors=i)
  knn.fit(X_train, Y_train)
  knn_scores.append((i,knn.score(X_test, Y_test)*100))

Best k-NN

In [163]:
list(sorted(knn_scores, key=lambda x: x[1], reverse=True))
print(f'Best K: {knn_scores[0][0]} \nAccuracy of k-NN: {knn_scores[0][1]}')

Best K: 1 
Accuracy of k-NN: 81.69257340241796


Decision Tree Classifier

In [164]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=11,criterion='entropy')
dtc.fit(X_train, Y_train)
print(f'Accuracy of Decision Tree Classifier: {dtc.score(X_test, Y_test)*100}')

Accuracy of Decision Tree Classifier: 97.75474956822107


# Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature normalisation.

Your report should provide concrete information of your reasoning; everything should be well-explained.

Do not get stressed if the things you try do not improve the accuracy. The key to geting good marks is to show that you evaluated different methods and that you correctly selected the configuration.

## Feature Normalisation
##### https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
Compare model accuracies of k-NN and Decision Tree Classifer before and after normalising the data.

MinMaxScalar

In [166]:
from sklearn.preprocessing import MinMaxScaler

mn_scale = MinMaxScaler()
mn_scaled_xtrain = mn_scale.fit_transform(X_train)
mn_scaled_xtest = mn_scale.fit_transform(X_test)

StandardScalar

In [169]:
from sklearn.preprocessing import StandardScaler

std_scale = StandardScaler()
scaled_xtrain = std_scale.fit_transform(X_train)
scaled_xtest = std_scale.fit_transform(X_test)

In [170]:
# Unscaled
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, Y_train)
dtc = DecisionTreeClassifier(random_state=11,criterion='entropy')
dtc.fit(X_train, Y_train)
print(f'Unscaled:\nk-NN accuracy: {knn.score(X_test,Y_test)*100} \nDtc accuracy: {dtc.score(X_test, Y_test)*100}')

# MinMaxScalar
knn.fit(mn_scaled_xtrain,Y_train)
knn.score(mn_scaled_xtest, Y_test)
dtc = DecisionTreeClassifier(random_state=11,criterion='entropy')
dtc.fit(mn_scaled_xtrain, Y_train)
print(f'\nMinMaxScalar:\nk-NN accuracy: {knn.score(mn_scaled_xtest,Y_test)*100} \nDtc accuracy: {dtc.score(mn_scaled_xtest, Y_test)*100}')

# StandardScalar
knn.fit(scaled_xtrain,Y_train)
knn.score(scaled_xtest, Y_test)
dtc = DecisionTreeClassifier(random_state=11,criterion='entropy')
dtc.fit(scaled_xtrain, Y_train)
print(f'\nStandardScalar:\nk-NN accuracy: {knn.score(scaled_xtest,Y_test)*100} \nDtc accuracy: {dtc.score(scaled_xtest, Y_test)*100}')

Unscaled:
k-NN accuracy: 81.69257340241796 
Dtc accuracy: 97.75474956822107

MinMaxScalar:
k-NN accuracy: 89.98272884283247 
Dtc accuracy: 98.10017271157167

StandardScalar:
k-NN accuracy: 91.8825561312608 
Dtc accuracy: 94.12780656303973


## Cross Validation</br>
##### https://scikit-learn.org/stable/modules/cross_validation.html
Performed 5-fold CV for the task:<br>
Cross-validation on the k-NN model with best accuracy (K=1)<br> Cross-validation on Decision Tree Classifier

In [171]:
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(scaled_xtrain, Y_train)
dtc = DecisionTreeClassifier(random_state=11,criterion='entropy')
dtc.fit(scaled_xtrain, Y_train)

# Cross-validated scores
knn_cv_scores = cross_val_score(knn, scaled_xtest, Y_test, cv=10)
dtc_cv_scores = cross_val_score(dtc, scaled_xtest, Y_test, cv=10)
print(f'k-NN cv_scores when k=1: {knn_cv_scores*100} \nDecision Tree Classifier cv_scores: {dtc_cv_scores*100}')

k-NN cv_scores when k=1: [93.10344828 89.65517241 86.20689655 89.65517241 82.75862069 89.65517241
 94.82758621 91.37931034 87.93103448 87.71929825] 
Decision Tree Classifier cv_scores: [ 98.27586207  94.82758621 100.          98.27586207  98.27586207
  98.27586207  96.55172414 100.          96.55172414  94.73684211]


In [172]:
print(f'k-NN cv_scores when k=1: {knn_cv_scores.mean()*100} \nDecision Tree Classifier cv_scores: {dtc_cv_scores.mean()*100}')

k-NN cv_scores when k=1: 89.28917120387175 
Decision Tree Classifier cv_scores: 97.57713248638838


## Hyperparameter Tuning
##### https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Tune k-NN and Decision Tree Classifier using GridSearchCV

In [186]:
from sklearn.model_selection import GridSearchCV

# knn grid cv
knn = KNeighborsClassifier()
parameters = dict(n_neighbors = list(range(1,20,2)),
                  p = [1,2],
                  weights = ['uniform','distance'])
knn_gs = GridSearchCV(knn, parameters)
knn_gs.fit(scaled_xtrain, Y_train)
print(f'k-NN GridSearch: \nAccuracy: {knn_gs.score(scaled_xtest,Y_test)*100} \nBest Estimator: {knn_gs.best_estimator_} \nParameters used: {knn_gs.best_params_}')


k-NN GridSearch: 
Accuracy: 91.36442141623489 
Best Estimator: KNeighborsClassifier(n_neighbors=7, p=1, weights='distance') 
Parameters used: {'n_neighbors': 7, 'p': 1, 'weights': 'distance'}


In [187]:
# Dtc grid cv
dtc = DecisionTreeClassifier(random_state=11)
parameters = dict(criterion = ['gini', 'entropy'],
                  splitter = ['best', 'random'],
                  max_depth = list(range(1,30)))
dtc_gs = GridSearchCV(dtc, parameters)
dtc_gs.fit(scaled_xtrain, Y_train)
print(f'\nDtc GridSearch: \nAccuracy: {dtc_gs.score(scaled_xtest,Y_test)*100} \nBest Estimator: {dtc_gs.best_estimator_} \nParameters used: {dtc_gs.best_params_}')



Dtc GridSearch: 
Accuracy: 94.81865284974094 
Best Estimator: DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=11) 
Parameters used: {'criterion': 'entropy', 'max_depth': 5, 'splitter': 'best'}


## Feature reduction
### Principle Component Analysis
##### https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Made use of Principle Component Analysis to reduce the number of features in the models<br> 98% of information retained 


In [185]:
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pca = PCA(n_components=.98, random_state=11)
pca.fit(mn_scaled_xtrain)
pca.fit(mn_scaled_xtest)
pca_xtrain = pca.transform(mn_scaled_xtrain)
pca_xtest = pca.transform(mn_scaled_xtest)

# kNN
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(pca_xtrain, Y_train)
print(f'Accuracy of Decision Tree Classifier after PCA reduction: {knn.score(pca_xtest,Y_test)*100}')

# Dtc
dtc = DecisionTreeClassifier(random_state=11,criterion='entropy')
dtc.fit(pca_xtrain, Y_train)
print(f'Accuracy of Decision Tree Classifier after PCA reduction: {dtc.score(pca_xtest,Y_test)*100}')


Accuracy of Decision Tree Classifier after PCA reduction: 90.32815198618307
Accuracy of Decision Tree Classifier after PCA reduction: 85.83765112262522


# New classifier (10 Marks)

Replicate the previous task for a classifier that we did not cover in class. So different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.
Save your best model into your github. And create a single code cell that loads it and evaluate it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, load all the libraries you need as well.

## AdaBoost Classifer
##### https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

In [188]:
sat_df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv?raw=true")
sat_df.shape

(483, 328)

Remove infinity and NANs

In [189]:
import numpy as np
sat_df.replace([np.inf, -np.inf], np.nan,inplace=True)
sat_df = sat_df.fillna(0)

Select features i.e., all attributes expect target

In [190]:
features = sat_df.iloc[:,:-1]

Split dataset into train and test

In [191]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(features, sat_df.target, random_state=11, test_size=0.30)

StandardScalar

In [201]:
from sklearn.preprocessing import StandardScaler

std_scale = StandardScaler()
std_scaled_xtrain = std_scale.fit_transform(X_train)
std_scaled_xtest = std_scale.fit_transform(X_test)

Hyperparameter Search and fit AdaBoostClassifier

In [230]:
from sklearn.ensemble import AdaBoostClassifier

# hyperparameter search
parameters = dict(algorithm = ['SAMME', 'SAMME.R'],
                  learning_rate = [0.2,0.4,0.6,0.8,1.0])
abc = AdaBoostClassifier(random_state=11)
gs_abc = GridSearchCV(estimator = abc, param_grid = parameters, cv = 10)

# fit model
gs_abc.fit(std_scaled_xtrain, Y_train)
print(f'Accurary of AdaBoostClassifier: {gs_abc.score(std_scaled_xtest, Y_test)*100}')

Accurary of AdaBoostClassifier: 100.0


# <font color="blue">FOR GRADING ONLY</font>

Save your best model into your github. And create a single code cell that loads it and evaluate it on the following test dataset: 
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv

In [None]:
from joblib import dump, load
from sklearn.pipeline import Pipeline
from io import BytesIO
import requests
from sklearn.pipeline import Pipeline

# INSERT YOUR MODEL'S URL
mLink = 'URL_OF_YOUR_MODEL_SAVED_IN_YOUR_GITHUB_REPOSITORY?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model = load(mfile)
# YOUR CODE HERE