Question 2
This question involves building a statistical model that can differentiate between authentic and
counterfeit banknotes. We use the Banknote Authentication dataset from the UCI Machine
Learning Repository, provided by Volker Lohweg at the University of Applied Sciences, Ostwestfalen-Lippe
(https://archive.ics.uci.edu/ml/datasets/banknote+authentication).
The dataset comprises 1372 samples derived from images of real and fake banknotes. These samples
are contained in the banknote.csv file. Each sample is described by five attributes: one binary
response variable and four predictors. The response variable is 1 for genuine notes and 0 for forgeries.
The predictors, extracted using a wavelet transform, are measurements of image variance,
skewness, kurtosis, and entropy. Table 2 provides a detailed description of the variables in the
banknote data.
Variable   Description
class      Response taking two values: 0 for a forged banknote and 1 for a genuine banknote
variance   Variance of the wavelet-transformed banknote image
skewness   Skewness of the wavelet-transformed banknote image
curtosis   Kurtosis of the wavelet-transformed banknote image
entropy    Entropy of the banknote image
Table 2: Description of the response and predictors in the banknote.csv file. The response is in the
first row and the remaining rows describe the four predictors, which are continuous and can take
any real values.
Answer the following questions based on classification models. Evaluate their predictive performance using accuracy, TPR, and FPR estimates based on 10-fold cross-validation.


1. (10 pts) How do logistic regression and the perceptron perform in classifying banknotes as
genuine or counterfeit based on the four given image-derived features? Summarize predictive
performance.

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold


In [21]:
# Load the dataset
data = pd.read_csv('banknote.csv')

data.head(5)

Unnamed: 0,variance,skewness,curtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


In [22]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

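# Split the dataset into features and target variable.
# Select the predictors by name: data.iloc[:, 0:4] would pick up the saved
# index column "Unnamed: 0" and leave out entropy.
X = data[["variance", "skewness", "curtosis", "entropy"]].values  # Features

y = data["class"].values  # Target variable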

# Function to evaluate a model by k-fold cross-validation
# Inputs: k-fold splitter, the model, feature matrix X and label vector y
# Output: dict with the per-fold accuracy, TPR and FPR
# Process: Divide the data into k folds. For each fold, train the model on the other k-1 folds
# and calculate accuracy, TPR and FPR on the held-out fold. Store the values in lists
def kFoldModelEvaluator(kFold, model, X, y):
    # Lists for storing the per-fold metrics
    accuracyScores = []
    tprScores = []
    fprScores = []

    # Perform k-fold cross-validation
    for train_index, test_index in kFold.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit the model on the training folds
        model.fit(X_train, y_train)

        # Calculate accuracy on the held-out fold
        accuracyScores.append(model.score(X_test, y_test))

        # Predict on the held-out fold
        y_pred = model.predict(X_test)

        # Calculate TPR and FPR; fixing the labels keeps the matrix 2x2
        # even if a fold happens to contain a single class
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
        tprScores.append(tp / (tp + fn))
        fprScores.append(fp / (fp + tn))

    return {"accuracyScores": accuracyScores, "tprScores": tprScores, "fprScores": fprScores}
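The TPR and FPR computed inside the evaluator follow directly from the four confusion-matrix counts. The sketch below works this out by hand in plain Python. The `binary_rates` helper and the `y_true` / `y_pred` lists are illustrative assumptions, not part of the notebook or of banknote.csv.

```python
def binary_rates(y_true, y_pred):
    """Return (accuracy, TPR, FPR) for 0/1 labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # genuine, predicted genuine
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # forged, predicted forged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # forged, predicted genuine
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # genuine, predicted forged
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn)  # share of genuine notes that are recognised
    fpr = fp / (fp + tn)  # share of forgeries that slip through as genuine
    return accuracy, tpr, fpr


# Made-up labels for illustration only: 4 genuine notes and 6 forgeries
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

accuracy, tpr, fpr = binary_rates(y_true, y_pred)
print(accuracy, tpr, fpr)
```

Note that both rates are undefined when a fold contains no positives or no negatives, which is one reason to shuffle the rows before splitting.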



In [23]:
y

array([0, 0, 0, ..., 1, 1, 1], dtype=int64)


In [24]:
from sklearn.linear_model import LogisticRegression, Perceptron



model = LogisticRegression(penalty=None)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Run the cross-validation once and reuse the per-fold scores
res = kFoldModelEvaluator(kf, model, X, y)

print("Accuracy: ", np.mean(res.get("accuracyScores")))
print("TPR: ", np.mean(res.get("tprScores")))
print("FPR: ", np.mean(res.get("fprScores")))

Accuracy:  0.9898021792023695
TPR:  0.9888331762323699
FPR:  0.009328434706096041


In [25]:
model = Perceptron(penalty=None)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Run the cross-validation once and reuse the per-fold scores
res = kFoldModelEvaluator(kf, model, X, y)

print("Accuracy: ", np.mean(res.get("accuracyScores")))
print("TPR: ", np.mean(res.get("tprScores")))
print("FPR: ", np.mean(res.get("fprScores")))

Accuracy:  0.9839627631439753
TPR:  0.9785143012113953
FPR:  0.011619557989003363


Logistic regression is marginally better on every metric: its average accuracy is about 99.0% versus 98.4% for the perceptron, its TPR is higher (98.9% versus 97.9%), and its FPR is lower (0.9% versus 1.2%).

2. (10 pts) Assess the predictive performance improvement of logistic regression and perceptron with polynomial features of degree 2.

In [26]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)

X_poly = poly.fit_transform(X)
model = LogisticRegression(penalty=None)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Run the cross-validation once and reuse the per-fold scores
res = kFoldModelEvaluator(kf, model, X_poly, y)

print("Accuracy: ", np.mean(res.get("accuracyScores")))
print("TPR: ", np.mean(res.get("tprScores")))
print("FPR: ", np.mean(res.get("fprScores")))

Accuracy:  1.0
TPR:  1.0
FPR:  0.0


In [27]:
model = Perceptron(penalty=None)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Run the cross-validation once and reuse the per-fold scores
res = kFoldModelEvaluator(kf, model, X_poly, y)

print("Accuracy: ", np.mean(res.get("accuracyScores")))
print("TPR: ", np.mean(res.get("tprScores")))
print("FPR: ", np.mean(res.get("fprScores")))

Accuracy:  0.9985401459854014
TPR:  0.9966517857142858
FPR:  0.0


Logistic regression with quadratic features achieves perfect cross-validated separation on the data (accuracy and TPR of 1.0, FPR of 0). The perceptron comes close, with an FPR of 0 and a TPR of 99.7%, so its only errors are a few false negatives.
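To see what the degree-2 expansion adds, it can help to list the monomials it generates. The sketch below enumerates them with `itertools` rather than scikit-learn; the `monomials` helper is a hypothetical name introduced here, and the feature names are taken from Table 2. For four inputs the expansion has 15 columns: one constant, four linear terms, four squares, and six pairwise products.

```python
from itertools import combinations_with_replacement

features = ["variance", "skewness", "curtosis", "entropy"]


def monomials(names, degree):
    """List every monomial of total degree <= `degree` as a tuple of factor names."""
    terms = []
    for d in range(degree + 1):
        # d factors chosen with repetition, order ignored
        terms.extend(combinations_with_replacement(names, d))
    return terms


terms = monomials(features, 2)
labels = ["*".join(t) if t else "1" for t in terms]

print(len(labels))
print(labels)
```

The cross terms such as `variance*entropy` are what allow a linear classifier on the expanded features to draw a curved boundary in the original four dimensions.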

3. (10 pts) Compare the predictive performance of the previous models with SVMs with polynomial
kernels of degrees 1, 2, 10, and 20. Does SVM perform better than the previous approaches?

In [28]:
from sklearn import svm

for degree in [1, 2, 10, 20]:
    model = svm.SVC(kernel="poly", degree=degree)
    kf = KFold(n_splits=10, shuffle=True, random_state=42)

    # Evaluate once per degree and reuse the per-fold scores
    res = kFoldModelEvaluator(kf, model, X, y)

    print("Accuracy for degree"+str(degree)+": ", np.mean(res.get("accuracyScores")))
    print("TPR for degree"+str(degree)+": ", np.mean(res.get("tprScores")))
    print("FPR for degree"+str(degree)+": ", np.mean(res.get("fprScores")))

Accuracy for degree1:  0.9846926901512747
TPR for degree1:  0.9984848484848484
FPR for degree1:  0.02596872678836103
Accuracy for degree2:  0.9730297260129059
TPR for degree2:  1.0
FPR for degree2:  0.04827484353572536
Accuracy for degree10:  0.7653549137839839
TPR for degree10:  0.557826547979445
FPR for degree10:  0.05838144965486223
Accuracy for degree20:  0.7150481328678727
TPR for degree20:  0.3666633172124153
FPR for degree20:  0.004938647997591086


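The SVM does not perform well with the degree-10 and degree-20 kernels, whose TPRs are only 0.56 and 0.37. The degree-2 kernel reaches a perfect TPR, but its 4.8% FPR keeps accuracy at 97.3%, so it does not separate the classes. Overall the SVM does not beat the earlier models: the degree-1 kernel trades a higher TPR (99.8%) for a higher FPR (2.6%) and lower accuracy than logistic regression, and none of the kernels matches the models with degree-2 polynomial features.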
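One possible contributor to the weak high-degree results is the form of the kernel itself. scikit-learn documents its polynomial kernel as (gamma * <x, x'> + coef0) ** degree, and `SVC` leaves `coef0` at 0.0 unless told otherwise. The plain-Python sketch below evaluates that formula on the first two rows of the `data.head(5)` output. The `poly_kernel` helper and the value `gamma = 0.1` are assumptions made for illustration; the notebook uses scikit-learn's default `gamma`, which depends on the data.

```python
import math


def poly_kernel(x, z, degree, gamma, coef0=0.0):
    """Polynomial kernel (gamma * <x, z> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** degree


# First two rows of data.head(): variance, skewness, curtosis, entropy
x1 = [3.6216, 8.6661, -2.8073, -0.44699]
x2 = [4.5459, 8.1674, -2.4586, -1.4621]
gamma = 0.1  # hypothetical value, chosen only for illustration

# Kernel value at each degree used in the notebook
values = {d: poly_kernel(x1, x2, d, gamma) for d in (1, 2, 10, 20)}
for d, k in values.items():
    print(f"degree {d:2d}: k(x1, x2) = {k:.3e}")

# With coef0 = 0 and an even degree, flipping the sign of x1 leaves the kernel unchanged
neg_x1 = [-a for a in x1]
same = math.isclose(poly_kernel(neg_x1, x2, 2, gamma), poly_kernel(x1, x2, 2, gamma))
print(same)
```

The kernel values grow by many orders of magnitude between degree 1 and degree 20, which suggests that standardizing the inputs or setting a nonzero `coef0` may be worth trying; neither was tested in this notebook.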

4. (10 pts) Implement a kernel approximation using random Fourier features for banknote classification using logistic regression with degree 3 polynomial features. Vary the random feature
dimension as 6, 12, and 24. How does this approach compare with the SVM in terms of predictive
performance?

In [29]:
from sklearn.kernel_approximation import RBFSampler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Create a list of random feature dimensions
random_feature_dims = [6, 12, 24]

# Create a PolynomialFeatures object
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

        
# Loop over random feature dimensions
for n_components in random_feature_dims:
    # Create a pipeline: RBFSampler for the random Fourier features, then logistic regression
    rbf_sampler = RBFSampler(n_components=n_components, gamma=1.0)
    logreg = LogisticRegression(penalty=None, max_iter=5000)
    model = Pipeline([("rbf_sampler", rbf_sampler),
                      ("logreg", logreg)])
    kf = KFold(n_splits=10, shuffle=True, random_state=42)

    res = kFoldModelEvaluator(kf, model, X_poly, y)

    print("Accuracy for components"+str(n_components)+": ", np.mean(res.get("accuracyScores")))
    print("TPR for components"+str(n_components)+": ", np.mean(res.get("tprScores")))
    print("FPR for components"+str(n_components)+": ", np.mean(res.get("fprScores")))

Accuracy for components6:  0.5430180894953984
TPR for components6:  0.07378987445320946
FPR for components6:  0.07823590198130417
Accuracy for components12:  0.5219083888712578
TPR for components12:  0.16578949331480483
FPR for components12:  0.18728302618683104
Accuracy for components24:  0.518978102189781
TPR for components24:  0.2476027461603516
FPR for components24:  0.2616714700295687


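The random-feature approach performs far worse than the SVM. Accuracy stays between about 52% and 54% for all three dimensions, and TPR and FPR rise together (from about 7-8% at 6 components to about 25-26% at 24), so the classifier is close to guessing. Accuracy is highest at 6 components (54.3%) and does not improve as components are added. One likely cause, not tested here, is that RBFSampler with gamma=1.0 is applied to unscaled cubic features, whose large magnitudes may leave the random cosine features nearly uninformative.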
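For reference, the random Fourier feature construction can be sketched directly in NumPy. This is a minimal illustration of the standard construction for the RBF kernel exp(-gamma * ||x - z||^2), which is the kernel that `RBFSampler` is documented to approximate. The `random_fourier_features` helper, the synthetic matrix `X_demo`, and `gamma_demo = 0.5` are assumptions for the example, not the notebook's data or settings.

```python
import numpy as np


def random_fourier_features(X, n_components, gamma, rng):
    """Map X (n_samples, n_features) to features whose inner products
    approximate the RBF kernel exp(-gamma * ||x - z||^2)."""
    n_features = X.shape[1]
    # Frequencies drawn from the kernel's spectral density N(0, 2 * gamma * I)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, n_components))
    # Random phases in [0, 2 * pi)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)


# Synthetic stand-in with four columns (not the banknote data)
X_demo = np.random.default_rng(0).normal(size=(5, 4))
gamma_demo = 0.5

# Exact RBF kernel matrix for comparison
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
exact = np.exp(-gamma_demo * sq_dists)

# Worst-case approximation error for the dimensions used above and a much larger one
for n_components in (6, 12, 24, 5000):
    Z = random_fourier_features(X_demo, n_components, gamma_demo, np.random.default_rng(1))
    max_err = np.abs(Z @ Z.T - exact).max()
    print(n_components, float(max_err))
```

With only 6 to 24 components the approximation is typically coarse, and it tightens as the dimension grows. Standardizing the inputs before the mapping is a common precaution, since large input magnitudes make the cosines oscillate rapidly.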

5. (10 pts) Fit a two-layered (shallow) neural network. Is its performance better than that of the previous methods? Justify the choice of tuning parameters such as number of hidden units, pre-activation function, learning rate, and epochs.

In [30]:
from sklearn.neural_network import MLPClassifier
for activation in ["relu", "logistic"]:
    for learning_rate in [0.001, 0.01, 0.05, 0.1, 1]:
        model = MLPClassifier(
            hidden_layer_sizes= (20,),
            activation=activation,
            learning_rate_init=learning_rate,
            max_iter= 5000 
        )
        kf = KFold(n_splits=10, shuffle=True, random_state=42)

        res = kFoldModelEvaluator(kf, model, X, y)

        print("Accuracy for activation:"+activation+", learning_rate:"+str(learning_rate)+": ", np.mean(res.get("accuracyScores")))
        print("TPR for activation:"+activation+", learning_rate:"+str(learning_rate)+": ", np.mean(res.get("tprScores")))
        print("FPR for activation:"+activation+", learning_rate:"+str(learning_rate)+": ", np.mean(res.get("fprScores")))

Accuracy for activation:relu, learning_rate:0.001:  1.0
TPR for activation:relu, learning_rate:0.001:  1.0
FPR for activation:relu, learning_rate:0.001:  0.0
Accuracy for activation:relu, learning_rate:0.01:  1.0
TPR for activation:relu, learning_rate:0.01:  1.0
FPR for activation:relu, learning_rate:0.01:  0.0
Accuracy for activation:relu, learning_rate:0.05:  1.0
TPR for activation:relu, learning_rate:0.05:  1.0
FPR for activation:relu, learning_rate:0.05:  0.0
Accuracy for activation:relu, learning_rate:0.1:  1.0
TPR for activation:relu, learning_rate:0.1:  1.0
FPR for activation:relu, learning_rate:0.1:  0.0
Accuracy for activation:relu, learning_rate:1:  1.0
TPR for activation:relu, learning_rate:1:  1.0
FPR for activation:relu, learning_rate:1:  0.0
Accuracy for activation:logistic, learning_rate:0.001:  0.991262033216968
TPR for activation:logistic, learning_rate:0.001:  0.9969696969696968
FPR for activation:logistic, learning_rate:0.001:  0.013182865621029239
Accuracy for activ

ReLU reaches perfect cross-validated accuracy at every learning rate tried, so we choose ReLU with an initial learning rate of 0.01, a value at which training converges.

Next, we vary the number of hidden units and the maximum number of iterations.

In [31]:
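import warnings
from sklearn.exceptions import ConvergenceWarning

# Small max_iter values stop before convergence on purpose; silence only that warning
warnings.filterwarnings('ignore', category=ConvergenceWarning)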

for size in [2, 4, 10, 20]:
    for max_iter in [10, 100, 500, 1000]:
        model = MLPClassifier(
            hidden_layer_sizes= (size,),
            activation="relu",
            learning_rate_init=0.01,
            max_iter= max_iter 
        )
        kf = KFold(n_splits=10, shuffle=True, random_state=42)

        res = kFoldModelEvaluator(kf, model, X, y)

        print("Accuracy for size:"+str(size)+", max_iter:"+str(max_iter)+": ", np.mean(res.get("accuracyScores")))
        print("TPR for size:"+str(size)+", max_iter:"+str(max_iter)+": ", np.mean(res.get("tprScores")))
        print("FPR for size:"+str(size)+", max_iter:"+str(max_iter)+": ", np.mean(res.get("fprScores")))


Accuracy for size:2, max_iter:10:  0.7318152967311965
TPR for size:2, max_iter:10:  0.7189076160189166
FPR for size:2, max_iter:10:  0.26819391453879565
Accuracy for size:2, max_iter:100:  0.9905321062096689
TPR for size:2, max_iter:100:  0.9903632086999024
FPR for size:2, max_iter:100:  0.009463729818560102
Accuracy for size:2, max_iter:500:  0.9956257272823443
TPR for size:2, max_iter:500:  0.996875
FPR for size:2, max_iter:500:  0.005288427326098559
Accuracy for size:2, max_iter:1000:  0.9963556542896435
TPR for size:2, max_iter:1000:  0.9952116935483872
FPR for size:2, max_iter:1000:  0.00251821349382325
Accuracy for size:4, max_iter:10:  0.9241986670898129
TPR for size:4, max_iter:10:  0.9430903318057402
FPR for size:4, max_iter:10:  0.08994532268950356
Accuracy for size:4, max_iter:100:  1.0
TPR for size:4, max_iter:100:  1.0
FPR for size:4, max_iter:100:  0.0
Accuracy for size:4, max_iter:500:  1.0
TPR for size:4, max_iter:500:  1.0
FPR for size:4, max_iter:500:  0.0
Accuracy fo

The model reaches perfect cross-validated metrics with just 4 hidden units, provided it is given enough iterations (100 or more in the results shown).

6. (10 pts) Compare the predictive performance when a deep neural network replaces the shallow
neural networks.

In [32]:
max_iter = 500

model = MLPClassifier(
    hidden_layer_sizes=(4, 4, 4),
    activation="relu",
    learning_rate_init=0.01,
    max_iter=max_iter,
)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

res = kFoldModelEvaluator(kf, model, X, y)

print("Accuracy for size:(4,4,4),: ", np.mean(res.get("accuracyScores")))
print("TPR for size:(4,4,4), : ", np.mean(res.get("tprScores")))
print("FPR for size:(4,4,4), : ", np.mean(res.get("fprScores")))

Accuracy for size:(4,4,4),:  1.0
TPR for size:(4,4,4), :  1.0
FPR for size:(4,4,4), :  0.0


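A deeper network with three hidden layers of 4 units each matches the perfect cross-validated metrics of the shallow network when given up to 500 iterations, so the added depth brings no measurable improvement on this data.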
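A rough way to compare the two architectures is to count their trainable parameters. The sketch below does the arithmetic for fully connected layers with one bias per unit. The `mlp_parameter_count` helper is a hypothetical name, and the single output unit is an assumption about how a binary classifier is usually set up rather than something the notebook states.

```python
def mlp_parameter_count(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


shallow = mlp_parameter_count(4, (4,))      # one hidden layer of 4 units
deep = mlp_parameter_count(4, (4, 4, 4))    # three hidden layers of 4 units
wide = mlp_parameter_count(4, (20,))        # the 20-unit network used earlier

print(shallow, deep, wide)
```

Under these assumptions the deep network has 65 parameters against 25 for the shallow one, and both are smaller than the 121 of the single 20-unit layer, so the extra depth adds capacity that this dataset does not appear to need.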

9. (10 pts) Perform a sensitivity analysis on the tuning parameters, such as degree of the polynomial, regularization parameter, random feature dimension, learning rate, number of layers, and
number of epochs. How do changes in these parameters affect the model’s performance? Which
model balances sensitivity to the choice of tuning parameter and predictive accuracy?

Among the parameters tried, the polynomial degree of the SVM kernel has the largest effect on accuracy, which drops from 98.5% at degree 1 to 71.5% at degree 20. The random feature dimension barely changes accuracy (54.3% to 51.9%) but moves TPR and FPR together from about 7% to about 25%. For the neural network, 100 to 500 iterations seem sufficient for convergence, and an initial learning rate between 0.01 and 0.05 works best. Quadratic features seem sufficient for prediction.
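One simple way to put numbers on sensitivity is the spread (maximum minus minimum) of the cross-validated accuracy as a single parameter varies. The sketch below applies this to accuracies printed earlier in this notebook, rounded to four decimals. The choice of spread as the sensitivity measure and the dictionary layout are assumptions made for illustration; a ranking by TPR or FPR could differ.

```python
# Mean 10-fold accuracies copied from the printed outputs above (rounded)
reported_accuracy = {
    "SVM polynomial degree": {1: 0.9847, 2: 0.9730, 10: 0.7654, 20: 0.7150},
    "random feature dimension": {6: 0.5430, 12: 0.5219, 24: 0.5190},
    "MLP max_iter (2 hidden units)": {10: 0.7318, 100: 0.9905, 500: 0.9956, 1000: 0.9964},
}

# Spread of accuracy over the values tried for each parameter
spread = {
    name: max(scores.values()) - min(scores.values())
    for name, scores in reported_accuracy.items()
}

# Print parameters from most to least sensitive
for name, value in sorted(spread.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.4f}")
```

A small spread is only reassuring when the accuracy itself is high: the random feature dimension varies little here because every setting stays close to chance.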