# Project Specification

For this project we will answer **THREE** short questions covering the breadth of the machine learning module and attempt **ONE** of the four longer project-style questions. All of the short questions and the longer project-style questions can be found in this notebook.

The learning objectives of these short questions are:
- To demonstrate a wide range of machine learning skills.
- To be able to apply the most appropriate approach at the right time.



---
## Question 1: Classification (10 marks)

Load the dataset below, where X and y are the feature (input) variables and the target (output) variable, respectively. Based on this dataset, build TWO classifiers using different machine learning approaches to predict the two classes in the target variable. You are free to use any appropriate machine learning models and libraries, but you need to split the dataset into training and test sets and optimise each model's hyperparameters (e.g. using GridSearchCV()). Finally, report the performance metrics of the best classifier on the test set. Please follow the steps below to complete the code.

The dataset is available at:
https://ncl.instructure.com/courses/53509/files/7659751?wrap=1 and
https://ncl.instructure.com/courses/53509/files/7659755?wrap=1


## Set up the environment and load the dataset

In [None]:
# just run this cell, don't change the code
import numpy as np
from numpy import loadtxt
X = loadtxt('cls_X.csv', delimiter=',')
y = loadtxt('cls_y.csv', delimiter=',')

## Q1.1 Split the data into training and test sets (20% for testing)

In [None]:
# write your code below to replace the ellipsis "..."

# Importing required libraries
from sklearn.model_selection import train_test_split

# Split the data, holding out 20% for testing (test_size=0.2)
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X1_train.shape)
print("X_test shape:", X1_test.shape)
print("y_train shape:", y1_train.shape)
print("y_test shape:", y1_test.shape)

X_train shape: (320, 6)
X_test shape: (80, 6)
y_train shape: (320,)
y_test shape: (80,)
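The split above is purely random. As a minimal sketch, assuming the two classes should keep the same proportions in both subsets, `train_test_split` also accepts a `stratify` argument. The data here is a synthetic stand-in generated with `make_classification` (400 samples, 6 features, matching the shapes printed above); the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative and not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cls_X.csv / cls_y.csv (assumption: 400 rows, 6 features)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

# stratify=y_demo keeps the class proportions equal in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train/test shapes:", X_tr.shape, X_te.shape)
print("share of class 1 (full, train, test):",
      y_demo.mean(), y_tr.mean(), y_te.mean())
```

Whether stratification matters for the real dataset depends on how balanced its classes are, which the notebook does not state.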


## Q1.2 Create your first classifier

#### Q1.2.1 First, make an attempt by using an appropriate machine learning method without optimising the hyperparameter(s). Report the model accuracy over the test set (i.e. test accuracy).

In [None]:
# write your code below to replace the ellipsis "..."

# Importing required libraries
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Creating the instance of support vector classifier
svm = SVC()

# Fitting the training data in order to train the model
svm.fit(X1_train, y1_train)

# Making predictions based on testing data
svm_predictions = svm.predict(X1_test)

# Evaluating performance of the model by checking its accuracy
print("SVM Performance:")
print("Accuracy of support vector machine: ", accuracy_score(y1_test, svm_predictions))

SVM Performance:
Accuracy of support vector machine:  0.8875


#### Q1.2.2 Then, optimise the hyperparameter(s) using the same machine learning method as above. Report the best hyperparameter(s), use them to build your first classifier, and print out its test accuracy.

In [None]:
# write your code below to replace the ellipsis "..."

# Importing required libraries
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV

# Creating the instance of support vector classifier
svm = SVC()

# Defining the parameter grid of hyperparameters to optimize
param_grid = {'C': [0.1, 10, 100], 'gamma': ['scale', 'auto'], 'kernel': ['linear', 'rbf']}

# Fitting the grid search to the data
svm_grid = GridSearchCV(svm, param_grid, cv=5)
svm_grid.fit(X1_train, y1_train)

# Searching for the best hyperparameters and printing them
best_parameters = svm_grid.best_params_
print("Best Hyperparameters found:", best_parameters)

# Use the best parameters found to create optimized SVM
optimized_svm = SVC(**best_parameters)
optimized_svm.fit(X1_train, y1_train)

# Making predictions on the testing data
optimized_svm_predictions = optimized_svm.predict(X1_test)

# Evaluate the model
print("\nSVM Performance with Best Hyperparameters:")
print("Accuracy of optimized support vector machine: ", accuracy_score(y1_test, optimized_svm_predictions))

Best Hyperparameters found: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}

SVM Performance with Best Hyperparameters:
Accuracy of optimized support vector machine:  0.9
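The cell above refits an `SVC` by hand from `best_params_`. As a hedged sketch of an equivalent shortcut: `GridSearchCV` uses `refit=True` by default, so after `fit` it already holds a model trained on the whole training set in `best_estimator_`. The data below is synthetic (`make_classification`) and the grid is a smaller illustrative one, so the selected values will not necessarily match those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the coursework dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Illustrative grid; refit=True (the default) retrains the best setting on X_tr
demo_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), demo_grid, cv=5)
search.fit(X_tr, y_tr)

# The refitted model is available directly; no manual SVC(**best_params_) needed
best_svm = search.best_estimator_
test_acc = best_svm.score(X_te, y_te)

print("Best hyperparameters:", search.best_params_)
print("Test accuracy of best_estimator_:", test_acc)
```

`search.score(X_te, y_te)` delegates to the same refitted estimator, so it returns the same accuracy here.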


## Q1.3 Create your second classifier

#### Q1.3.1 First, without optimising the hyperparameter(s), make an attempt by using a different machine learning method to the first classifier. Report the model accuracy over the test set (i.e. test accuracy).

In [None]:
# write your code below to replace the ellipsis "..."

# Importing required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


# Creating an instance of KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Training the algorithm with help of training set
knn.fit(X1_train, y1_train)

# Prediction on the testing data
knn_pred = knn.predict(X1_test)

# Evaluate the model
print(f"Accuracy of KNN with 3 nearest neighbours: {accuracy_score(y1_test, knn_pred)}")

Accuracy of KNN with 3 nearest neighbours: 0.8875


#### Q1.3.2 Then, optimise the hyperparameter(s) using the same machine learning method as above. Report the best hyperparameter(s), use them to build your second classifier, and print out its test accuracy.



In [None]:
# write your code below to replace the ellipsis "..."

# Importing required libraries
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Creating an instance of KNN classifier
knn = KNeighborsClassifier()

# Defining the parameter grid of hyperparameters to optimize
parameter_grid = {'n_neighbors': [1, 4, 6], 'p': [1, 2]}

# Fitting the grid search to the data
knn_grid_search = GridSearchCV(estimator=knn, param_grid=parameter_grid, cv=5, scoring='accuracy')
knn_grid_search.fit(X1_train, y1_train)

# Searching the best hyperparameter(s) and printing them
knn_best_parameters = knn_grid_search.best_params_
print("Best Hyperparameters:", knn_best_parameters)

# Using the best model for prediction
optimized_knn = KNeighborsClassifier(**knn_best_parameters)
optimized_knn.fit(X1_train, y1_train)

# Making predictions on the testing data
optimized_knn_predictions = optimized_knn.predict(X1_test)

# Calculate accuracy using the best model
print(f"\nAccuracy of optimized KNN with best parameters: {accuracy_score(y1_test, optimized_knn_predictions)}")

Best Hyperparameters: {'n_neighbors': 4, 'p': 2}

Accuracy of optimized KNN with best parameters: 0.925
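KNN is distance-based, so features on very different scales can dominate the neighbour search. The notebook does not say whether the six features share a scale, so the following is only a sketch of one option: wrap a `StandardScaler` and the classifier in a `Pipeline`, so that scaling is fitted inside each cross-validation fold. The data is synthetic and the step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the coursework dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled using only
# statistics from its own training portion
pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>
pipe_grid = {'knn__n_neighbors': [1, 4, 6], 'knn__p': [1, 2]}
knn_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring='accuracy')
knn_search.fit(X_tr, y_tr)

knn_test_acc = knn_search.score(X_te, y_te)
print("Best hyperparameters:", knn_search.best_params_)
print("Test accuracy:", knn_test_acc)
```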


## Q1.4 Report the precision, recall, F1 score and confusion matrix for the better of the two classifiers

In [None]:
# write your code below to replace the ellipsis "..."

# Importing required libraries
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Calculating precision, recall, and F1-score of optimized KNN algorithm
precision = precision_score(y1_test, optimized_knn_predictions)
recall = recall_score(y1_test, optimized_knn_predictions)
f1 = f1_score(y1_test, optimized_knn_predictions)

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")

# Calculate the confusion matrix
cf_matrix = confusion_matrix(y1_test, optimized_knn_predictions)
print("Confusion Matrix:")
print(cf_matrix)

Precision: 0.9523809523809523
Recall: 0.9090909090909091
F1-score: 0.9302325581395349
Confusion Matrix:
[[34  2]
 [ 4 40]]
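As a check on the numbers above, the three scores can be recomputed from the confusion matrix alone. This sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1 as the positive class, which is the default for `precision_score`, `recall_score` and `f1_score` on binary targets.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[34, 2],
               [4, 40]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)            # 40 / 42
recall = tp / (tp + fn)               # 40 / 44
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()       # 74 / 80

print(f"Precision: {precision:.4f}")  # 0.9524
print(f"Recall:    {recall:.4f}")     # 0.9091
print(f"F1-score:  {f1:.4f}")         # 0.9302
print(f"Accuracy:  {accuracy:.4f}")   # 0.9250
```

The recomputed values agree with the precision, recall and F1-score printed by the cell above and with the 0.925 test accuracy of the optimised KNN.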


---
## Question 2: Regression (10 marks)

In this question you are given a simple dataset on which you will perform regression. You will build TWO regression models, then take the better one and tune its hyperparameters.

## Set up the environment

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

## Read in the data

You'll need to download the data.csv file from https://ncl.instructure.com/courses/53509/files/7657710?wrap=1 and upload it to your Google Drive. I placed it in a folder called data. Then you need to mount your Google Drive in Colab (cell below).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Then read in the data

In [None]:
data = np.loadtxt('/content/drive/MyDrive/data/data.csv', delimiter=',')
print(data)

[[1.00000000e+00 9.00000000e+00 2.57259990e+00 2.49788633e+02]
 [1.00000000e+00 5.00000000e+00 9.21413366e+00 5.04502032e+02]
 [1.00000000e+00 1.70000000e+01 7.12330090e+00 1.33580225e+03]
 ...
 [5.00000000e+00 3.10000000e+01 6.80121067e+00 3.15979690e+03]
 [5.00000000e+00 1.00000000e+01 4.14995662e+00 6.21315789e+02]
 [5.00000000e+00 9.00000000e+00 9.61878173e+00 1.30105857e+03]]


## Q2.1 Split the data into X and y

X is the first three columns

y is the last column

In [None]:
# your answer here

# Splitting the data into X and y
X = data[:,:3]
y = data[:,3]
# Printing the shape of X and y
print("X: ", X.shape)
print("y: ", y.shape)

X:  (1000, 3)
y:  (1000,)


## Q2.2 Create the Train and Test datasets

20% of the data is kept back for testing

In [None]:
# your answer here

# Split the data into training and testing set
X2_train, X2_test, y2_train, y2_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Printing the shape of train and test data
print("X_train shape:", X2_train.shape)
print("X_test shape:", X2_test.shape)
print("y_train shape:", y2_train.shape)
print("y_test shape:", y2_test.shape)

X_train shape: (800, 3)
X_test shape: (200, 3)
y_train shape: (800,)
y_test shape: (200,)


## Q2.3 Use TWO Regression approaches on the dataset

In each case report the R^2 value against the test data.

Q2.3.1 Regression approach 1

### Approach 1: Linear Regression.

In [None]:
# your answer here

# Importing required libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Creating an instance of linear regression
linear_regression = LinearRegression()

# Training the model using train set
linear_regression.fit(X2_train, y2_train)

# Making prediction on test data
linear_prediction = linear_regression.predict(X2_test)

# Evaluating the performance of the model
linear_r2value = r2_score(y2_test, linear_prediction)
print("R^2 value for Linear Regression: ", linear_r2value)

R^2 value for Linear Regression:  0.8625374681259913
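For reference, the R^2 score reported here is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below writes that out with NumPy on a tiny made-up example; the helper `r2_by_hand` and the demo values are illustrative and not part of the assignment.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variation around the mean
    return 1.0 - ss_res / ss_tot

y_true_demo = [3.0, 5.0, 7.0, 9.0]                  # made-up targets, mean 6

r2_perfect = r2_by_hand(y_true_demo, y_true_demo)            # exact predictions
r2_mean = r2_by_hand(y_true_demo, [6.0, 6.0, 6.0, 6.0])      # always predict the mean
r2_close = r2_by_hand(y_true_demo, [2.0, 5.0, 8.0, 9.0])     # two errors of size 1

print(r2_perfect, r2_mean, r2_close)  # 1.0 0.0 0.9
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean scores below 0.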


Q2.3.2 Regression approach 2

### Approach 2: Decision Tree Regression

In [None]:
# your answer here

# Importing required libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Creating an instance of Decision Tree Regressor
decisiontree_regression = DecisionTreeRegressor(max_depth=5, random_state=42)

# Training the model using train set
decisiontree_regression.fit(X2_train, y2_train)

# Predicting on the test data
decisiontree_prediction = decisiontree_regression.predict(X2_test)

# Evaluating the performance of the model
decisiontree_r2value = r2_score(y2_test, decisiontree_prediction)
print("R^2 value for Decision Tree Regression: ", decisiontree_r2value)

R^2 value for Decision Tree Regression:  0.916664211341809


## Q2.4 Optimise the hyperparameters

Take your best regression approach from above and identify the best hyperparameters. Note: as some regression approaches have many hyperparameters, you may limit yourself here to just THREE.

Q2.4.1 Search for the best hyperparameters

In [None]:
# your answer here

# Importing required libraries
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

decisiontree_regression = DecisionTreeRegressor(random_state=42)

# Defining the grid of hyperparameters to search
tree_parameter_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Fitting the grid search to the data
tree_grid_search = GridSearchCV(estimator=decisiontree_regression, param_grid=tree_parameter_grid, cv=5, scoring='neg_mean_squared_error')
tree_grid_search.fit(X2_train, y2_train)

# Getting the best parameters found by GridSearchCV
tree_best_parameters = tree_grid_search.best_params_

Q2.4.2 Output the best hyperparameters found

In [None]:
# your answer here

# Printing the best parameters found
print("Best Hyperparameters found: ", tree_best_parameters)

Best Hyperparameters found:  {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}


Q2.4.3 Show the results for the best model

In [None]:
# your answer here

# Creating the instance of DecisionTreeRegressor using best hyperparameters
optimized_decisiontree_regression = DecisionTreeRegressor(**tree_best_parameters)

# Training the optimized decision tree model with train set
optimized_decisiontree_regression.fit(X2_train, y2_train)

# Making predictions on testing data
optimized_decisiontree_predictions = optimized_decisiontree_regression.predict(X2_test)

# Evaluating the performance of optimized decision tree regression
optimized_decisiontree_r2value = r2_score(y2_test, optimized_decisiontree_predictions)
print("R^2 value for optimized Decision Tree Regression: ", optimized_decisiontree_r2value)

R^2 value for optimized Decision Tree Regression:  0.9823223105850983
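The search above ranks candidates by negative mean squared error, while the reported metric is R^2. As a sketch under stated assumptions, the search can be scored with `scoring='r2'` so that `best_score_` is directly comparable with the test figure, and the training score can be printed alongside to see how far an unrestricted tree fits the training set. The data here comes from `make_regression`, so the numbers will differ from those in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for data.csv (assumption: 3 features, noisy linear target)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=3, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Illustrative grid scored with R^2 rather than neg_mean_squared_error
tree_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {'max_depth': [3, 5, 7, None], 'min_samples_leaf': [1, 2, 4]},
    cv=5,
    scoring='r2',
)
tree_search.fit(X_tr, y_tr)

cv_r2 = tree_search.best_score_           # mean R^2 across the 5 folds
train_r2 = tree_search.score(X_tr, y_tr)  # R^2 of the refitted tree on training data
test_r2 = tree_search.score(X_te, y_te)   # R^2 on the held-out test data

print("Best hyperparameters:", tree_search.best_params_)
print("CV R^2:", cv_r2, "| train R^2:", train_r2, "| test R^2:", test_r2)
```

A large gap between the training and test scores would suggest the selected tree is overfitting; whether that happens on the real dataset is not something this sketch can show.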


---
## Question 3: Deep Learning (10 marks)

Q3.1 For the MNIST dataset, implement a deep learning model with 3 hidden layers of sizes 128, 256 and 50.


In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.datasets import mnist
import keras.utils as utils

batch_size = 128
nb_classes = 10
im_dim = 784 # the total pixel number
nb_epoch = 2

In [None]:
# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, im_dim)
X_test = X_test.reshape(10000, im_dim)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
Y_train = utils.to_categorical(y_train, nb_classes)
Y_test = utils.to_categorical(y_test, nb_classes)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
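The `to_categorical` calls above turn integer digit labels into one-hot rows, which is the target format `categorical_crossentropy` expects. A minimal NumPy sketch of the same idea follows; the helper `one_hot` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an array with one row per label and a single 1.0 at the label's index."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype='float32')[labels]

demo_labels = [5, 0, 4]                  # made-up digit labels
encoded = one_hot(demo_labels, 10)

print(encoded.shape)                     # (3, 10)
print(encoded[0])                        # 1.0 at index 5, zeros elsewhere
print(encoded.argmax(axis=1))            # recovers [5 0 4]
```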


In [None]:
# Write down your code about MLP model for question Q3.1 here
# you should call your model 'model'

model = Sequential()
model.add(Dense(input_dim=im_dim,units=128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(units=256))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(units=50))
model.add(Activation('relu'))
model.add(Dense(units=nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Here we have implemented an MLP model with 3 hidden layers of size 128, 256 and 50.

Add code to output your network structure

In [None]:
# your code here

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               100480    
                                                                 
 activation (Activation)     (None, 128)               0         
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 256)               33024     
                                                                 
 activation_1 (Activation)   (None, 256)               0         
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 50)                1

In [None]:
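# For these layers output_shape is (batch_size, units): index 0 is the batch
# dimension (None until a batch is supplied), index 1 is the layer width.
for iterator, layer in enumerate(model.layers):
    print(f"Layer {iterator+1}: {layer.name} - Number of input dimensions: {layer.output_shape[0]} - Number of neurons: {layer.output_shape[1]}")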

Layer 1: dense - Number of input dimensions: None - Number of neurons: 128
Layer 2: activation - Number of input dimensions: None - Number of neurons: 128
Layer 3: dropout - Number of input dimensions: None - Number of neurons: 128
Layer 4: dense_1 - Number of input dimensions: None - Number of neurons: 256
Layer 5: activation_1 - Number of input dimensions: None - Number of neurons: 256
Layer 6: dropout_1 - Number of input dimensions: None - Number of neurons: 256
Layer 7: dense_2 - Number of input dimensions: None - Number of neurons: 50
Layer 8: activation_2 - Number of input dimensions: None - Number of neurons: 50
Layer 9: dense_3 - Number of input dimensions: None - Number of neurons: 10
Layer 10: activation_3 - Number of input dimensions: None - Number of neurons: 10
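The parameter counts in the model summary follow from the layer widths: a fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below applies that formula to the widths listed above. The first two counts match the summary (100480 and 33024); the last two are computed by the same formula, not read from the recorded output. The helper name `dense_params` is illustrative.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# Input size followed by the width of each Dense layer in the MLP above
layer_sizes = [784, 128, 256, 50, 10]

# Pair each width with the next one to get (inputs, units) per layer
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

for (n_in, n_out), count in zip(zip(layer_sizes, layer_sizes[1:]), counts):
    print(f"Dense {n_in} -> {n_out}: {count} parameters")
print("Total:", sum(counts))
```

Activation and Dropout layers contribute no trainable parameters, which is why they show 0 in the summary.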


Train the model for just two epochs to show it works. All code provided - just run.

In [None]:
# Train
history = model.fit(X_train, Y_train, epochs=nb_epoch,
                    validation_split = 0.2,
                    batch_size=batch_size, verbose=1)

# Evaluate
evaluation = model.evaluate(X_test, Y_test, verbose=1)
print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (evaluation[0], evaluation[1]))

Epoch 1/2
Epoch 2/2
Summary: Loss over the test dataset: 0.13, Accuracy: 0.96


Q3.2 For the MNIST dataset, implement a CNN model with only one 2D CNN layer as the hidden layer.

In [None]:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 2

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# input image dimensions
img_rows, img_cols = 28, 28
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


In [None]:
# Write down your code about the CNN model of Q3.2 here
# you should call your model 'model'

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(img_rows, img_cols, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy, optimizer='adam', metrics=['accuracy'])

Add code to output your network structure

In [None]:
# your code here

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2  (None, 13, 13, 32)        0         
 D)                                                              
                                                                 
 flatten (Flatten)           (None, 5408)              0         
                                                                 
 dense_4 (Dense)             (None, 10)                54090     
                                                                 
Total params: 54410 (212.54 KB)
Trainable params: 54410 (212.54 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
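The shapes and parameter counts in this summary can be reproduced by hand. The sketch below assumes the Keras defaults used by the model above: a 3x3 convolution with stride 1 and no padding ('valid'), followed by non-overlapping 2x2 max pooling. The helper `conv_out` is an illustrative name.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

filters, kernel, in_channels, num_classes = 32, 3, 1, 10

conv_side = conv_out(28, kernel)                 # 28 - 3 + 1 = 26
pool_side = conv_side // 2                       # 2x2 pooling halves each side: 13
flat_size = pool_side * pool_side * filters      # 13 * 13 * 32 = 5408

# Each filter has kernel*kernel*in_channels weights plus one bias
conv_params = kernel * kernel * in_channels * filters + filters   # 320
fc_params = flat_size * num_classes + num_classes                 # 54090
total_params = conv_params + fc_params                            # 54410

print("Conv output:", (conv_side, conv_side, filters))
print("Pool output:", (pool_side, pool_side, filters))
print("Flattened size:", flat_size)
print("Parameters (conv, dense, total):", conv_params, fc_params, total_params)
```

These values agree with the output shapes and the 54410 total parameters listed in the summary.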


We just train for two epochs to demonstrate that the network does work. Just run it.

In [None]:
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              verbose=1, shuffle=True,
              validation_split = 0.2)
score = model.evaluate(x_test, y_test, verbose=0)
print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (score[0], score[1]))

Epoch 1/2
Epoch 2/2
Summary: Loss over the test dataset: 0.12, Accuracy: 0.97
