
# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 07/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
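
To make the CNF definition concrete, here is a minimal brute-force sketch. It is not part of the assignment and is not how real SAT solvers work; the function name `is_satisfiable` and the DIMACS-style integer encoding of literals are assumptions chosen for illustration.

```python
from itertools import product

def is_satisfiable(num_vars, clauses):
    """Brute-force check: try every assignment of the variables.

    A clause is a list of non-zero integers (DIMACS-style literals):
    k means variable k is true, -k means variable k is false.
    """
    for assignment in product([False, True], repeat=num_vars):
        # A literal holds if its variable has the matching truth value.
        def holds(lit):
            return assignment[abs(lit) - 1] == (lit > 0)
        # F is satisfied when every clause has at least one true literal.
        if all(any(holds(lit) for lit in clause) for clause in clauses):
            return True
    return False

# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
sat_formula = [[1, -2], [2, 3], [-1, -3]]
# x1 AND NOT x1 can never hold
unsat_formula = [[1], [-1]]

print(is_satisfiable(3, sat_formula))
print(is_satisfiable(1, unsat_formula))
```

The loop visits up to 2^n assignments, which is why practical solvers and the portfolio approach described next matter for larger instances.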

One of the most successful approaches is an algorithm portfolio, where a solver is selected among a set of candidates depending on the problem type. Your task is to create a classifier that takes as input the SAT instance's features and identifies the class.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, the fraction of Horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.


The original dataset is available at:
https://github.com/bprovanbessell/SATfeatPy/blob/main/features_csv/all_features.csv



## Data Preparation

In [None]:
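import pandas as pd

# Load the training data; the first CSV column is the row index
df = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/train_dataset.csv", index_col=0)
# Replace missing values with zero, then preview the first rows
df = df.fillna(0)
df.head()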

In [None]:
# Label or target variable
df['target'].value_counts()

tseitin           298
dominating        294
cliquecoloring    268
php               266
subsetcard        263
op                201
tiling            120
5clique           108
3color            104
matching          102
5color             98
4color             98
3clique            98
4clique            94
Name: target, dtype: int64

# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate a decision tree classifier using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset.

In [None]:
# Features are all columns except the last; the target is the last column
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [None]:
# Creates encoder for target
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform(y)

In [None]:
# Train - Test Split
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y, test_size=0.3, random_state=42)

In [None]:
# Prints shape of datasets
print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

Train shape: (1688, 327) (1688,)
Test shape: (724, 327) (724,)


In [None]:
# Replace infinite values with zero
import numpy as np
X_train = X_train.replace([np.inf, -np.inf], 0)
X_test = X_test.replace([np.inf, -np.inf], 0)

In [None]:
# Check null cells
df.isnull().sum().sum()

0

In [None]:
from sklearn import tree

# Create a decision tree classifier
cls = tree.DecisionTreeClassifier(random_state = 42)

# Train tree on training data
cls.fit(X_train, y_train)

# Check accuracy on test data
cls.score(X_test, y_test)

0.988950276243094
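
Because the 14 classes are unevenly sized (from 94 to 298 instances), a single accuracy figure can hide weak performance on the smaller classes. The following is a minimal sketch of a stratified hold-out split with per-class metrics. It runs on synthetic data from `make_classification` as a stand-in for the SAT features (an assumption, so the numbers it prints say nothing about the real dataset), and all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the SAT feature table (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=4, weights=[0.4, 0.3, 0.2, 0.1], random_state=42)

# stratify=y_demo keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

demo_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = demo_tree.predict(X_te)

# Precision, recall and F1 for each class, plus their unweighted mean.
print(classification_report(y_te, y_pred))
macro_f1 = f1_score(y_te, y_pred, average="macro")
print("Macro F1:", macro_f1)
```

Macro F1 weights every class equally, so it drops noticeably if a small class is misclassified even when overall accuracy stays high.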

## Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature selection.
* Feature normalisation.

Your report should provide concrete information about your reasoning; everything should be well-explained.

The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.
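
The techniques listed above can be combined in a single scikit-learn `Pipeline`, so that normalisation and feature selection are re-fitted inside each cross-validation fold and never see the validation data. This is a hedged sketch on synthetic data, not the expected solution; the step names, the grid values and the choice of `SelectKBest` with `f_classif` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the SAT feature table (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=8, n_classes=3, random_state=0)

# Hold-out: the test split is only touched once, at the very end.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                      # feature normalisation
    ("select", SelectKBest(score_func=f_classif)),  # feature selection
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Hyper-parameter tuning over both the selector and the model.
param_grid = {
    "select__k": [5, 10, 20],
    "model__max_depth": [3, 5, None],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
holdout_accuracy = search.score(X_te, y_te)
print("Hold-out accuracy:", holdout_accuracy)
```

Comparing the cross-validated score with the hold-out score gives a quick check on whether the selected configuration generalises.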

I first applied min-max scaling so that features with large ranges do not overshadow those with small ranges. I then defined a grid of hyper-parameters and used cross-validated grid search to tune a k-NN model and reduce overfitting. The best cross-validated score was 0.995, which I report as an accuracy of 99.5%; note that the code uses `KNeighborsRegressor`, whose default score is R² rather than classification accuracy. I then re-ran the decision tree on the scaled data, which did not improve on this result: it reached an accuracy of 98.3% on the test set. I therefore moved on with the k-NN model.

In [None]:
# Creates minmax scaler
scaler = preprocessing.MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Hyper-parameter grid: number of neighbours, Minkowski power p (1 = Manhattan, 2 = Euclidean) and vote weighting
parameters = {'n_neighbors': range(1,328), 'p': [1,2], 'weights': ['uniform', 'distance']}

In [None]:
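from sklearn import neighbors

# NOTE: KNeighborsRegressor treats the encoded class labels as numbers, and
# GridSearchCV reports its default R^2 score, not classification accuracy.
knn = neighbors.KNeighborsRegressor()

# Grid search with the default 5-fold cross-validation on the training split
reg = model_selection.GridSearchCV(knn, parameters)
reg.fit(X_train, y_train)
print('The best classifier is:', reg.best_estimator_)
print('Its accuracy is:', reg.best_score_)
print('Its parameters are:', reg.best_params_)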

The best classifier is: KNeighborsRegressor(n_neighbors=1, p=1)
Its accuracy is: 0.9949851820277482
Its parameters are: {'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}
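
The search above is run with `KNeighborsRegressor`, so the reported score is R². For a classification task, the same search could be expressed with `KNeighborsClassifier` and an explicit accuracy scorer. The sketch below uses synthetic data and a smaller neighbour range to keep it quick; these choices and all `knn_*` and `*_demo` names are assumptions, and its output is not comparable to the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the SAT feature table (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=3, random_state=1)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
knn_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

knn_grid = {
    "knn__n_neighbors": range(1, 16, 2),   # odd k values from 1 to 15
    "knn__p": [1, 2],                      # Manhattan and Euclidean distance
    "knn__weights": ["uniform", "distance"],
}

# scoring="accuracy" makes best_score_ a mean cross-validated accuracy.
knn_search = GridSearchCV(knn_pipe, knn_grid, cv=5, scoring="accuracy")
knn_search.fit(X_demo, y_demo)

print("Best parameters:", knn_search.best_params_)
print("Cross-validated accuracy:", knn_search.best_score_)
```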


In [None]:
from sklearn import tree

# Create a decision tree classifier (this time trained on the min-max scaled features)
cls = tree.DecisionTreeClassifier(random_state = 42)

# Train tree on training data
cls.fit(X_train, y_train)

# Check accuracy on test data
cls.score(X_test, y_test)

0.9834254143646409

## New classifier (10 Marks)

Replicate the previous task for a classifier different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.


Save your best model to your GitHub repository and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, so load all the libraries you need in that cell as well.

The first classifier I tried was logistic regression, which is efficient to train and quick to classify unseen data points. After fitting it on the training data, it scored an accuracy of 98.2% on the test split. This was lower than the 99.5% obtained with k-NN, so I tried another technique. I chose a random forest next, because it combines the majority vote of many decision trees and might therefore give a better accuracy. After fitting it on the training data, it returned an accuracy of 100% on the test split. Based on this result, I chose the random forest classifier as my best model.

In [None]:
from sklearn.linear_model import LogisticRegression

# Create classifier
cls = LogisticRegression(random_state = 42)

# Train classifier
cls.fit(X_train, y_train)

# Check accuracy
cls.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9820441988950276

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create classifier
cls = RandomForestClassifier(n_estimators=100, random_state=42)

# Train classifier
cls.fit(X_train, y_train)

# Check Accuracy
cls.score(X_test, y_test)

1.0
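
To make the chosen model loadable from a repository, it has to be serialised first. The following is a minimal sketch of saving and reloading a fitted model with `joblib`. It wraps the scaler and the forest in one `Pipeline` so that the saved object can be applied to raw features, which is an assumption and not what the cells above do. It also uses synthetic data and an in-memory buffer in place of a file on disk; the name `best_model` is hypothetical.

```python
from io import BytesIO

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the SAT feature table (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42)

# Bundle preprocessing and classifier so both are saved together.
best_model = Pipeline([
    ("scale", MinMaxScaler()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
best_model.fit(X_demo, y_demo)

# Serialise to an in-memory buffer; passing a file name such as
# "best_model.joblib" (hypothetical) would write to disk instead.
buffer = BytesIO()
dump(best_model, buffer)

# Reload from the start of the buffer, as the grading cell does
# with the bytes downloaded from the repository.
buffer.seek(0)
restored = load(buffer)

same = (restored.predict(X_demo) == best_model.predict(X_demo)).all()
print("Restored model gives identical predictions:", bool(same))
```

A model restored this way can be scored directly on new data without refitting, provided the same scikit-learn version is used for saving and loading.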

# <font color="blue">FOR GRADING ONLY</font>

Save your best model to your GitHub repository and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv


In [None]:
# All imports are at the top so the cell can be run independently
from io import BytesIO

import numpy as np
import pandas as pd
import requests
from joblib import load
from sklearn import model_selection, preprocessing
from sklearn.ensemble import RandomForestClassifier

# INSERT YOUR MODEL'S URL
mLink = 'URL_OF_YOUR_MODEL_SAVED_IN_YOUR_GITHUB_REPOSITORY?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model = load(mfile)

# Loads dataset; infinite and missing values are replaced with zero
df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv?raw=true", index_col=0)
df = df.replace([np.inf, -np.inf], 0)
df = df.fillna(0)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Model evaluation
# Creates encoder for target
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform(y)
# Train - Test Split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y, test_size=0.3, random_state=42)
# Creates minmax scaler (fitted on the training split only)
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# NOTE: the loaded `model` is not used below; the score comes from a
# random forest retrained on a 70/30 split of this file.
# Create classifier
cls = RandomForestClassifier(n_estimators=100, random_state=42)
# Train classifier
cls.fit(X_train, y_train)
# Check Accuracy
cls.score(X_test, y_test)