# CS 6140 Machine Learning: Assignment - 2 (Total Points: 100)
## Prof. Ahmad Uzair





## Question 1 - Naive Bayes Classification (20 points)

![Q1_1.png](attachment:Q1_1.png)

![Q1_2.png](attachment:Q1_2.png)

## Question 2 - Classification Metrics (10 points)

![Q2.png](attachment:Q2.png)

## Question 3 - Logistic Regression and Perceptron (70 points)

In this problem you will apply logistic regression and the perceptron to the following dataset for binary classification:

 **default of credit card clients**:  This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.



### Task
- Prepare a normalized version of data. Use min-max normalization. 
- Train two logistic regression models using gradient descent with raw as well as normalized data. 
- Train two perceptron classifiers with raw as well as normalized data.
- Compare training and test results of four models in terms of accuracy. 

Note:

The skeleton code is only a guide. You can change the method definitions where necessary with appropriate comments.
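
The min-max normalization named in the task rescales each feature to the range [0, 1] using `(x - min) / (max - min)`. Below is a minimal NumPy sketch, assuming the statistics are computed on the training split and reused for the test split. The helper names `min_max_fit` and `min_max_apply` and the toy arrays are illustrative and not part of the skeleton code.

```python
import numpy as np

def min_max_fit(x):
    """Return per-column minimum and range; constant columns get range 1 to avoid division by zero."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def min_max_apply(x, col_min, col_range):
    """Scale x with previously computed column statistics."""
    return (x - col_min) / col_range

# Toy data (hypothetical): two features on very different scales
train = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])
test = np.array([[2.0, 500.0]])

col_min, col_range = min_max_fit(train)                  # statistics from the training split only
train_scaled = min_max_apply(train, col_min, col_range)
test_scaled = min_max_apply(test, col_min, col_range)    # reuse the training statistics
```

Reusing the training statistics keeps both splits on the same scale; test values outside the training range would simply fall outside [0, 1].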

In [18]:
# import libraries

import numpy as np
import pandas as pd
import sys
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, precision_score, recall_score
from numpy import log,dot,exp,shape

from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()


In [19]:
def load_data(dataset):
    ''' data: input features (all columns except the last)
        labels: output labels (last column), reshaped to a column vector
    '''
    
    data = dataset.iloc[:, :-1].values
    labels = dataset.iloc[:, -1].values.reshape(-1,1)
    
    return data, labels

### 1) Implementation of sigmoid and cost function (10 points)

In [20]:
def sigmoid(z):
    ''' return sigmoid'''
    
    return 1/(1 + np.exp(-z))
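
For inputs with a large magnitude, `np.exp(-z)` can overflow and emit a runtime warning. One common workaround, shown here as a sketch and not as a required part of the assignment, evaluates the sigmoid in two algebraically equivalent branches. The name `stable_sigmoid` is hypothetical, and the function assumes an array input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid evaluated without overflowing exp() for large |z| (array input assumed)."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)); exp(z) is below 1 here
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
```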


In [21]:
## Implement the loss function for logistic regression

def compute_cost(y_pred, y_label):
    """
    Binary cross-entropy cost for logistic regression, averaged over the samples.
    Returns the scalar cost.
    """
    
    len_label=len(y_label)
    cost= (-y_label*np.log(y_pred) - (1-y_label)*np.log(1-y_pred)).sum()/len_label
    return cost
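
When a predicted probability is exactly 0 or 1, the cross-entropy expression contains a `0 * log(0)` term, which NumPy evaluates to NaN. A common remedy is to clip the predictions slightly away from 0 and 1 before taking logarithms. The sketch below assumes a hypothetical helper `clipped_cost` and an arbitrary `eps` value; it illustrates the idea and does not replace `compute_cost` above.

```python
import numpy as np

def clipped_cost(y_pred, y_label, eps=1e-12):
    """Mean binary cross-entropy with predictions clipped to [eps, 1 - eps]."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -y_label * np.log(y_pred) - (1 - y_label) * np.log(1 - y_pred)
    return float(np.mean(losses))

# Toy predictions (hypothetical): two saturated at exactly 1 and 0, one uncertain
y_label = np.array([[1.0], [0.0], [1.0]])
y_pred = np.array([[1.0], [0.0], [0.5]])
cost = clipped_cost(y_pred, y_label)   # finite; dominated by the uncertain sample
```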


### 2)  Implement logistic regression using batch gradient descent and evaluation (20 points)
Algorithm can be given as follows:

```
for j in 0 -> max_iteration: 
    for i in 0 -> m: 
        theta += (alpha / m) * (y[i] - h(x[i])) * x_bar
```
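
The inner loop over samples can be collapsed into a single matrix expression. The following is a minimal sketch of one vectorized update, assuming `x_bar` denotes the feature matrix with a leading column of ones and `y` is a column vector of 0/1 labels. The function name `batch_gradient_step` and the toy data are illustrative only.

```python
import numpy as np

def batch_gradient_step(theta, x_bar, y, alpha):
    """One batch update: theta += (alpha / m) * x_bar^T (y - h(x_bar))."""
    m = x_bar.shape[0]
    h = 1.0 / (1.0 + np.exp(-x_bar @ theta))   # sigmoid of the linear scores
    return theta + (alpha / m) * x_bar.T @ (y - h)

# Toy 1-D problem (hypothetical): labels switch from 0 to 1 between x = 1 and x = 2
x_bar = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])

theta = np.zeros((2, 1))
for _ in range(2000):
    theta = batch_gradient_step(theta, x_bar, y, alpha=0.5)

probabilities = 1.0 / (1.0 + np.exp(-x_bar @ theta))
predictions = (probabilities > 0.5).astype(int).ravel()
```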

In [41]:
def logistic_regression_using_batch_gradient_descent(x, y, alpha, max_iter):
    """
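    Fit logistic regression parameters with batch gradient descent.
    x: feature matrix of shape (m, n); a bias column of ones is prepended internally
    y: binary labels of shape (m, 1)
    alpha: learning rate
    max_iter: maximum number of iterations
    Returns the per-iteration cost array and the weights with the lowest cost.
    Note: the update applies the summed gradient, without the 1/m factor shown in the pseudocode above.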
    """ 
    # initialize iteration, number of samples, cost and parameter array
    iteration = 0
    cost = np.zeros(max_iter)
    mincost = sys.maxsize
    mincostweights = None

    
    theta = np.zeros((shape(x)[1]+1,1))
    x = np.c_[np.ones((shape(x)[0],1)),x]
    
    while iteration < max_iter:
        
        theta = theta - alpha * dot(x.T, sigmoid(dot(x, theta)) - np.reshape(y,(len(y),1)))
        y_pred = sigmoid(np.dot(x, theta))
        cost[iteration] = compute_cost(y_pred, y)
        
        if cost[iteration] < mincost:
            mincost = cost[iteration]
            mincostweights = theta
        
        iteration = iteration + 1
    
    return cost, mincostweights

def predict(x, weights):
    z = dot(np.c_[np.ones((shape(x)[0],1)),x], weights)
    lis = []
    for i in sigmoid(z):
        if i>0.5:
            lis.append(1)
        else:
            lis.append(0)
    return lis

def F1_score(y,y_hat):
    tp,tn,fp,fn = 0,0,0,0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
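    # Guard against division by zero when there are no predicted or actual positives
    precision = tp/(tp+fp) if (tp+fp) > 0 else 0.0
    recall = tp/(tp+fn) if (tp+fn) > 0 else 0.0
    f1_score = 2*precision*recall/(precision+recall) if (precision+recall) > 0 else 0.0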
    return f1_score
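
Precision and recall are not symmetric in their arguments: exchanging the true labels and the predictions exchanges the two metrics, while accuracy and F1 stay the same. scikit-learn's `precision_score` and `recall_score` take `y_true` first and `y_pred` second. The NumPy sketch below illustrates the effect with a hypothetical helper `precision_recall` and made-up labels.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary 0/1 labels, with y_true given first."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up example: four actual positives, only one of them predicted
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]

correct = precision_recall(y_true, y_pred)   # arguments in the intended order
swapped = precision_recall(y_pred, y_true)   # arguments exchanged
```

In this example `correct` is `(1.0, 0.25)` and `swapped` is `(0.25, 1.0)`.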
    

### 3) Implementation of perceptron.(20 points) 

In [23]:
class Perceptron:
# constructor 
    def __init__ (self):
        self.w = None   #weights
        self.b = None   #bias

        
    def model(self, x):
        return 1 if (np.dot(self.w, x) >= self.b) else 0
    
    def predict(self, X):
        Y = []
        for x in X:
            result = self.model(x)
            Y.append(result)
        return np.array(Y)
    
    def fit(self, X, Y, epochs = 1, lr = 1):
    
        self.w = np.ones(X.shape[1])
        self.b = 0

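        accuracy = {}
        max_accuracy = 0
        # Checkpoint of the best weights so far; initialized so it is always defined
        chkptw = self.w
        chkptb = self.b

        wt_matrix = []

        for i in range(epochs):
            for x, y in zip(X, Y):
                y_pred = self.model(x)
                if y == 1 and y_pred == 0:
                    # Missed positive: move weights toward x and lower the threshold
                    self.w = self.w + lr * x
                    self.b = self.b - lr * 1
                elif y == 0 and y_pred == 1:
                    # False positive: move weights away from x and raise the threshold
                    self.w = self.w - lr * x
                    self.b = self.b + lr * 1

            wt_matrix.append(self.w)    
            accuracy[i] = accuracy_score(Y, self.predict(X))
            if (accuracy[i] > max_accuracy):
                max_accuracy = accuracy[i]
                j = i
                chkptw = self.w
                chkptb = self.b

        # Restore the weights from the epoch with the highest training accuracy
        self.w = chkptw
        self.b = chkptb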

        #print("Max Training data accuracy =", max_accuracy, "iteration", j)
        #print(accuracy.values())
        
        return np.array(wt_matrix)
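
To see the update rule in isolation, here is a small stand-alone sketch that trains on the logical AND function, which is linearly separable. It uses the same threshold convention as the class above (predict 1 when `w . x >= b`), but the function `train_perceptron`, the zero initialization, and the toy data are assumptions made for illustration.

```python
import numpy as np

def train_perceptron(X, Y, epochs=20, lr=1.0):
    """Perceptron with a threshold b: predict 1 when w . x >= b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            y_pred = 1 if np.dot(w, x) >= b else 0
            # (y - y_pred) is +1 for a missed positive, -1 for a false positive, 0 if correct
            w = w + lr * (y - y_pred) * x
            b = b - lr * (y - y_pred)
    return w, b

# Logical AND: only the input (1, 1) belongs to class 1
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, Y)
predictions = [1 if np.dot(w, x) >= b else 0 for x in X]
```

Because the classes are linearly separable, the updates stop after a few epochs and `predictions` matches `Y`.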


### 4) Apply 80-20 split on data to prepare training and test sets. Report training and test results in terms of accuracy, precision and recall for both logistic regression and perceptron. (20 points)

In [46]:
# Sample training code cell change according to your variables and structure

# Training the model

from sklearn.model_selection import train_test_split
#reserve the test data, do not use them for cross-validation!

dataset = pd.read_excel('Credit card Default.xlsx')
data, labels = load_data(dataset)

x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.20)


In [47]:
'''

logistic regression algorithm 

'''

alpha = 0.003     # better learning rates 0.001, 0.003, 0.01, 0.005
max_iter = 1000

print(x_train)

# raw data
print("\nLogistic Regression Raw data details:\n")

# Z-score standardized features stand in for the unscaled data here,
# because the cost became NaN on unscaled inputs (see acknowledgement 4)
standard_x_train = np.copy(x_train)
standard_x_test = np.copy(x_test)

for i in range(shape(x_train)[1]):
    standard_x_train[:,i] = (x_train[:,i] - np.mean(x_train[:,i]))/np.std(x_train[:,i])
    
for i in range(shape(x_test)[1]):
    standard_x_test[:,i] = (x_test[:,i] - np.mean(x_test[:,i]))/np.std(x_test[:,i])

cost, mincostweights = logistic_regression_using_batch_gradient_descent(standard_x_train, y_train, alpha, max_iter)

# sklearn metrics take (y_true, y_pred) in that order
Y_pred_train = predict(standard_x_train, mincostweights)
print("Raw Training data accuracy", accuracy_score(y_train, Y_pred_train))
print("Raw Training data precision", precision_score(y_train, Y_pred_train))
print("Raw Training data recall", recall_score(y_train, Y_pred_train))

f1_score = 2 * (precision_score(y_train, Y_pred_train) * recall_score(y_train, Y_pred_train))/ (precision_score(y_train, Y_pred_train) + recall_score(y_train, Y_pred_train))
print("f1 score", f1_score)

print("-----------------")

Y_pred_test = predict(standard_x_test, mincostweights)
print("Raw Testing data accuracy", accuracy_score(y_test, Y_pred_test))
print("Raw Testing data precision", precision_score(y_test, Y_pred_test))
print("Raw Testing data recall", recall_score(y_test, Y_pred_test))

f1_score = 2 * (precision_score(y_test, Y_pred_test, zero_division=1) * recall_score(y_test, Y_pred_test, zero_division=1))/ (precision_score(y_test, Y_pred_test, zero_division=1) + recall_score(y_test, Y_pred_test, zero_division=1))
print("f1 score", f1_score)


# Normalized data
print("\nLogistic Regression Normalized data details:\n")

cost, mincostweights = logistic_regression_using_batch_gradient_descent(min_max_scaler.fit_transform(x_train), y_train, alpha, max_iter)

Y_pred_train = predict(min_max_scaler.fit_transform(x_train), mincostweights)
print("Normalized Training data accuracy", accuracy_score(y_train, Y_pred_train))
print("Normalized Training data precision", precision_score(y_train, Y_pred_train))
print("Normalized Training data recall", recall_score(y_train, Y_pred_train))

f1_score = 2 * (precision_score(y_train, Y_pred_train) * recall_score(y_train, Y_pred_train))/ (precision_score(y_train, Y_pred_train) + recall_score(y_train, Y_pred_train))
print("f1 score", f1_score)


print("-----------------")

Y_pred_test = predict(min_max_scaler.fit_transform(x_test), mincostweights)
print("Normalized Testing data accuracy", accuracy_score(y_test, Y_pred_test))
print("Normalized Testing data precision", precision_score(y_test, Y_pred_test))
print("Normalized Testing data recall", recall_score(y_test, Y_pred_test))

f1_score = 2 * (precision_score(y_test, Y_pred_test, zero_division=1) * recall_score(y_test, Y_pred_test, zero_division=1))/ (precision_score(y_test, Y_pred_test, zero_division=1) + recall_score(y_test, Y_pred_test, zero_division=1))
print("f1 score", f1_score)



[[  1916 240000      2 ...      0   3332      0]
 [  2051  30000      1 ...      0   1500      0]
 [  2049 120000      2 ...      0      0      0]
 ...
 [   800 210000      2 ...   1342   1038   2000]
 [  4154 280000      2 ...  20669   3003   3250]
 [  3692 500000      1 ...  24652  18060   3306]]

Logistic Regression Raw data details:

Raw Training data accuracy 0.7871485943775101
Raw Training data precision 0.6454545454545455
Raw Training data recall 0.08068181818181819
f1 score 0.14343434343434344
-----------------
Raw Testing data accuracy 0.7941767068273092
Raw Testing data precision 0.7105263157894737
Raw Testing data recall 0.12217194570135746
f1 score 0.2084942084942085

Logistic Regression Normalized data details:

Normalized Training data accuracy 0.7914156626506024
Normalized Training data precision 0.6310160427807486
Normalized Training data recall 0.1340909090909091
f1 score 0.22118088097469538
-----------------
Normalized Testing data accuracy 0.7961847389558233
Normaliz

In [51]:
'''

perceptron algorithm 

'''
print(x_train)

# For raw data
print("Perceptron raw data details:\n")

perceptron = Perceptron()
wt_matrix = perceptron.fit(x_train, y_train, 400, 0.0001) # better learning rates 0.0001, 0.01

# sklearn metrics take (y_true, y_pred) in that order
Y_pred_train = perceptron.predict(x_train)
print("Raw Training data accuracy", accuracy_score(y_train, Y_pred_train))
print("Raw Training data precision", precision_score(y_train, Y_pred_train))
print("Raw Training data recall", recall_score(y_train, Y_pred_train))

f1_score = 2 * (precision_score(y_train, Y_pred_train) * recall_score(y_train, Y_pred_train))/ (precision_score(y_train, Y_pred_train) + recall_score(y_train, Y_pred_train))
print("f1 score", f1_score)

print("-------------------")

Y_pred_test = perceptron.predict(x_test)
print("Raw Testing data accuracy", accuracy_score(y_test, Y_pred_test))
print("Raw Testing data precision", precision_score(y_test, Y_pred_test, zero_division=1))
print("Raw Testing data recall", recall_score(y_test, Y_pred_test, zero_division=1))

f1_score = 2 * (precision_score(y_test, Y_pred_test, zero_division=1) * recall_score(y_test, Y_pred_test, zero_division=1))/ (precision_score(y_test, Y_pred_test, zero_division=1) + recall_score(y_test, Y_pred_test, zero_division=1))
print("f1 score", f1_score)


# For Normalized data
print("\nPerceptron Normalized data details:\n")

perceptron = Perceptron()
wt_matrix = perceptron.fit(min_max_scaler.fit_transform(x_train), y_train, 400, 0.01) # better learning rates 0.01, 3, 0.001 


Y_pred_train = perceptron.predict(min_max_scaler.fit_transform(x_train))
print("Normalized Training data accuracy", accuracy_score(y_train, Y_pred_train))
print("Normalized Training data precision", precision_score(y_train, Y_pred_train))
print("Normalized Training data recall", recall_score(y_train, Y_pred_train))

f1_score = 2 * (precision_score(y_train, Y_pred_train) * recall_score(y_train, Y_pred_train))/ (precision_score(y_train, Y_pred_train) + recall_score(y_train, Y_pred_train))
print("f1 score", f1_score)

print("-------------------")

Y_pred_test = perceptron.predict(min_max_scaler.fit_transform(x_test))
print("Normalized Testing data accuracy", accuracy_score(y_test, Y_pred_test))
print("Normalized Testing data precision", precision_score(y_test, Y_pred_test))
print("Normalized Testing data recall", recall_score(y_test, Y_pred_test))

f1_score = 2 * (precision_score(y_test, Y_pred_test) * recall_score(y_test, Y_pred_test))/ (precision_score(y_test, Y_pred_test) + recall_score(y_test, Y_pred_test))
print("f1 score", f1_score)


[[  1916 240000      2 ...      0   3332      0]
 [  2051  30000      1 ...      0   1500      0]
 [  2049 120000      2 ...      0      0      0]
 ...
 [   800 210000      2 ...   1342   1038   2000]
 [  4154 280000      2 ...  20669   3003   3250]
 [  3692 500000      1 ...  24652  18060   3306]]
Perceptron raw data details:

Raw Training data accuracy 0.7796184738955824
Raw Training data precision 0.6666666666666666
Raw Training data recall 0.004545454545454545
f1 score 0.009029345372460496
-------------------
Raw Testing data accuracy 0.7781124497991968
Raw Testing data precision 1.0
Raw Testing data recall 0.0
f1 score 0.0

Perceptron Normalized data details:

Normalized Training data accuracy 0.7886546184738956
Normalized Training data precision 0.5785123966942148
Normalized Training data recall 0.1590909090909091
f1 score 0.24955436720142601
-------------------
Normalized Testing data accuracy 0.7991967871485943
Normalized Testing data recall 0.2895927601809955
Normalized Tes

In [306]:
'''


Acknowledgements:

1. For the perceptron algorithm I referred to https://hackernoon.com/perceptron-deep-learning-basics-3a938c5f84b6
2. For the logistic regression algorithm I referred to https://www.analyticsvidhya.com/blog/2022/02/implementing-logistic-regression-from-scratch-using-python/
3. I tried to use the built-in library functions for precision, recall and F1 score, and also used the from-scratch code above for understanding.
4. I faced issues with the raw data in logistic regression: the predicted values reached 0, which gave NaN values
   for the cost, so I applied general standardization to the raw data.
5. I tried different learning rates () and different numbers of epochs for the perceptron and logistic regression.
6. Accuracy on normalized data is slightly higher than on raw data for both algorithms.

'''
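
Accuracy alone can be hard to interpret when one class dominates the labels. As a point of comparison for the accuracies reported above, here is a minimal sketch of a majority-class baseline, which always predicts the most frequent training label. The helper name `majority_baseline_accuracy` and the label arrays are made up for illustration and are not taken from the credit card data.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy on y_test of always predicting the most frequent label in y_train."""
    y_train = np.asarray(y_train).ravel()
    y_test = np.asarray(y_test).ravel()
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(y_test == majority))

# Hypothetical imbalanced labels: class 0 is the majority in both splits
y_train_demo = np.array([0] * 78 + [1] * 22)
y_test_demo = np.array([0] * 8 + [1] * 2)

baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
```

A trained model whose accuracy is close to this baseline has learned little beyond the class proportions, which is why precision and recall are worth reporting alongside accuracy.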
