# CS345 Spring 2024 Assignment 2

## Part 1 - How well do Perceptrons classify?

Evaluate the perceptron algorithm (the implementation provided in the lecture notebook) on Haberman's dataset from the last assignment and on the Breast Cancer Wisconsin (Diagnostic) dataset provided by scikit-learn. Compare its accuracy against the SVC implementation (use the default value for C). Perform tenfold cross-validation and report the mean and standard deviation of the accuracy for each classifier on each dataset. Since you might not be able to get the perceptron code in the lecture notebook to play well with the Sklearn cross-validation code, you can implement the cross-validation yourself using nested for loops. Does one of the two classifiers appear to perform better? Discuss your observations.

Make sure to allow the perceptron algorithm to run for a sufficient number of epochs.

Reminder: [Notebook for cross validation](https://github.com/sarathsreedharan/CS345/blob/master/spring24/notebooks/module05_02_cross_validation.ipynb)

Note: Remember that the perceptron expects labels to be 1 and -1. If a dataset doesn't provide labels in that form, you must first convert them.

### Note on presenting the results
You can consider using pandas DataFrames to present your results.
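
One possible way to lay out such a table is sketched below. The helper name `summarize_scores`, the column names, and the toy fold scores are all assumptions made for illustration; the numbers are placeholders, not results from either dataset.

```python
import numpy as np
import pandas as pd

def summarize_scores(fold_scores):
    """Build one row per (dataset, classifier) pair.

    fold_scores maps (dataset_name, classifier_name) to a list of
    per-fold accuracies (hypothetical input format).
    """
    rows = []
    for (dataset, classifier), scores in fold_scores.items():
        scores = np.asarray(scores, dtype=float)
        rows.append({
            "dataset": dataset,
            "classifier": classifier,
            "mean_accuracy": scores.mean(),
            "std_accuracy": scores.std(),  # population standard deviation
        })
    return pd.DataFrame(rows)

# Placeholder fold scores for illustration only, not real results.
example_scores = {
    ("toy", "perceptron"): [0.5, 1.0],
    ("toy", "SVC"): [1.0, 1.0],
}
summary = summarize_scores(example_scores)
print(summary.round(2))
```

In the assignment, the lists would hold the ten per-fold accuracies for each classifier on each dataset.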

In [None]:
## Your code goes here
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load Haberman's dataset
haberman_df = pd.read_csv('haberman.csv')
X_haberman = haberman_df.drop('status', axis=1).values
y_haberman = haberman_df['status'].values
y_haberman = np.where(y_haberman == 2, -1, 1)  # Convert labels to -1 and 1

# Load Breast Cancer Wisconsin dataset
breast_cancer = load_breast_cancer()
X_bc = breast_cancer.data
y_bc = breast_cancer.target
y_bc = np.where(y_bc == 0, -1, 1)  # Convert labels to -1 and 1

# Preprocess data
scaler = StandardScaler()
X_bc = scaler.fit_transform(X_bc)

class perceptron :
    """An implementation of the perceptron algorithm.
    Note that this implementation does not include a bias term"""
 
    def __init__(self, iterations=100, learning_rate=0.2, 
                 plot_data=False, random_w=False, seed=42) :
        self.iterations = iterations
        self.learning_rate = learning_rate
        self.plot_data = plot_data
        self.random_w = random_w
        self.seed = seed
  
    def fit(self, X, y) :
        """
        Train a classifier using the perceptron training algorithm.
        After training the attribute 'w' will contain the perceptron weight vector.
 
        Parameters
        ----------
 
        X : ndarray, shape (num_examples, n_features)
        Training data.
 
        y : ndarray, shape (n_examples,)
        Array of labels.
 
        """
        
        if self.random_w :
            rng = np.random.default_rng(self.seed)
            self.w = rng.uniform(-1 , 1, len(X[0]))
            print("initialized with random weight vector")
        else :
            self.w = np.zeros(len(X[0]))
            print("initialized with a zeros weight vector")
        self.wold = self.w
        converged = False
        iteration = 0
        while (not converged and iteration < self.iterations) :
            converged = True
            for i in range(len(X)) :
                if y[i] * self.decision_function(X[i]) <= 0 :
                    self.wold = self.w
                    self.w = self.w + y[i] * self.learning_rate * X[i]
                    converged = False
                    if self.plot_data:
                        self.plot_update(X, y, i)
            iteration += 1
        self.converged = converged
        if converged :
            print ('converged in %d iterations ' % iteration)
 
    def decision_function(self, x) :
        return np.dot(x, self.w)
 
    def predict(self, X) :
        """
        make predictions using a trained linear classifier
 
        Parameters
        ----------
 
        X : ndarray, shape (num_examples, n_features)
        Data to classify.
        """
 
        scores = np.dot(X, self.w)
        return np.sign(scores)
    
    def plot_update(self, X, y, ipt) :
        fig = plt.figure(figsize=(4,4))
        plt.xlim(-1,1)
        plt.ylim(-1,1)
        plt.xlabel("Feature 1")
        plt.ylabel("Feature 2")
        plt.arrow(0,0,self.w[0],self.w[1], 
                  width=0.001,head_width=0.05, 
                  length_includes_head=True, alpha=1,
                  linestyle='-',color='darkred')
        plt.arrow(0,0,self.wold[0],self.wold[1], 
                  width=0.001,head_width=0.05, 
                  length_includes_head=True, alpha=1,
                  linestyle='-',color='orange')
        anew = -self.w[0]/self.w[1]
        aold = -self.wold[0]/self.wold[1]
        pts = np.linspace(-1,1)
        plt.plot(pts, anew*pts, color='darkred')
        plt.plot(pts, aold*pts, color='orange')
        plt.title("in orange:  old w; in red:  new w")
        cols = {1: 'g', -1: 'b'}
        for i in range(len(X)): 
            plt.plot(X[i][0], X[i][1], cols[y[i]]+'o', alpha=0.6,markersize=5) 
        plt.plot(X[ipt][0], X[ipt][1], 'ro', alpha=0.2,markersize=20)

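# Function to perform cross-validation with an explicit loop over the folds.
# The perceptron class is not a scikit-learn estimator (it has no get_params
# or score method), so sklearn's cross_val_score cannot be used with it.
def custom_cross_val_score(model, X, y, cv=10):
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])  # fit() retrains from scratch
        predictions = model.predict(X[test_idx])
        scores.append(np.mean(predictions == y[test_idx]))  # fold accuracy
    scores = np.array(scores)
    return scores.mean(), scores.std()

# Instantiate classifiers (the instance name must not shadow the perceptron class)
perceptron_clf = perceptron(iterations=100, learning_rate=0.01)
svc = SVC()

# Perform cross-validation
haberman_perceptron_mean, haberman_perceptron_std = custom_cross_val_score(perceptron_clf, X_haberman, y_haberman)
haberman_svc_mean, haberman_svc_std = custom_cross_val_score(svc, X_haberman, y_haberman)

bc_perceptron_mean, bc_perceptron_std = custom_cross_val_score(perceptron_clf, X_bc, y_bc)
bc_svc_mean, bc_svc_std = custom_cross_val_score(svc, X_bc, y_bc)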

# Print results
print("Haberman's Dataset:")
print("Perceptron Accuracy: {:.2f} (+/- {:.2f})".format(haberman_perceptron_mean, haberman_perceptron_std))
print("SVC Accuracy: {:.2f} (+/- {:.2f})".format(haberman_svc_mean, haberman_svc_std))

print("\nBreast Cancer Wisconsin Dataset:")
print("Perceptron Accuracy: {:.2f} (+/- {:.2f})".format(bc_perceptron_mean, bc_perceptron_std))
print("SVC Accuracy: {:.2f} (+/- {:.2f})".format(bc_svc_mean, bc_svc_std))



*Discussion of the results here*.

## Part 2 - Learning Curve

We looked briefly at the idea of learning curves in the notebook for cross-validation, which showed how accuracy changes with the number of training examples. In this part of the assignment, plot a learning curve for the perceptron algorithm for an increasing number of training examples. To plot this:
1. First, create a held-out validation set against which you will compare all your trained models.
2. Then, from the remaining data, create training sets of increasing size. To create more compact plots, you can use a logarithmic scale such as 10, 20, 40, 80, 160, etc.

In the plot, the X-axis should be the training set size and the Y-axis should be the accuracy of the model trained on a set of that size, as measured on the held-out validation set.

After generating the plot, discuss your observations about it.
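
The procedure above could be sketched roughly as follows. This is a minimal illustration on synthetic, linearly separable data (an assumption, not one of the assignment datasets), and it uses a small stand-in perceptron rather than the lecture class. The names `train_perceptron`, `learning_curve`, and `plot_learning_curve` are hypothetical.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, learning_rate=0.1):
    """Stand-in perceptron without a bias term; labels must be +1 / -1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(xi, w) <= 0:  # misclassified or on the boundary
                w = w + learning_rate * yi * xi
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: converged
            break
    return w

def accuracy(w, X, y):
    return float(np.mean(np.sign(X @ w) == y))

def learning_curve(X, y, n_validation, sizes, seed=0):
    """Return (train_size, validation_accuracy) pairs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    val_idx, pool_idx = order[:n_validation], order[n_validation:]
    results = []
    for n in sizes:
        if n > len(pool_idx):  # not enough remaining data for this size
            break
        train_idx = pool_idx[:n]
        w = train_perceptron(X[train_idx], y[train_idx])
        # Every model is scored on the same held-out validation set.
        results.append((n, accuracy(w, X[val_idx], y[val_idx])))
    return results

def plot_learning_curve(curve):
    import matplotlib.pyplot as plt  # imported here so the rest runs without it
    sizes, accs = zip(*curve)
    plt.semilogx(sizes, accs, marker="o")
    plt.xlabel("Number of training examples")
    plt.ylabel("Accuracy on held-out validation set")
    plt.show()

# Synthetic data for illustration: label is the sign of x1 + x2.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1, -1)
curve = learning_curve(X_demo, y_demo, n_validation=100,
                       sizes=[10, 20, 40, 80, 160, 320])
```

Calling `plot_learning_curve(curve)` draws the curve with a logarithmic X-axis. Taking each training set as a prefix of the same shuffled pool makes the sets nested, which is one design choice among several.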

In [None]:
# Code to generate the plot goes here

*Discussion of the plot here*.

## Part 3 - Standardization

Scaling your features is a core part of pre-processing data. One of the methods we have already seen in detail is [normalization](https://github.com/sarathsreedharan/CS345/blob/master/spring24/notebooks/module01_03_dot_products.ipynb). However, there are other methods. A popular one is called standardization. Under this method, you rescale each feature so that it has zero mean and unit variance.

You can find the details on how to do this on the following [Wikipedia page](https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)). In this part, you have to implement the code to perform standardization yourself (do not use the Sklearn function for it).
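
For reference, the z-score transformation can be written as below. The notation is an assumption introduced here: $x_{ij}$ is the value of feature $j$ for example $i$, $n$ is the number of examples, and the population form of the standard deviation is used (the sample form with $n-1$ is also common).

$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
$$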

1. Use that code on Haberman's dataset. Show that the mean of the resulting feature values is zero and the standard deviation is one. Note that the method is applied to each feature individually, so for a data matrix, each column has a mean of zero and a standard deviation of one.
2. Compare the accuracy obtained from the perceptron on Haberman's dataset with and without standardization. As before, use tenfold cross-validation to compute accuracy and report it the same way you did for Part 1. Provide a short discussion of the results you see.
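
A minimal NumPy sketch of column-wise standardization is shown below, checked on synthetic data rather than Haberman's dataset. The function name `standardize` and the guard for constant columns are assumptions, not part of the assignment.

```python
import numpy as np

def standardize(X):
    """Z-score each column of X; also return the column means and stds."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)  # one mean per feature (column)
    std = X.std(axis=0)    # population standard deviation per feature
    std = np.where(std == 0, 1.0, std)  # assumption: leave constant columns at zero
    return (X - mean) / std, mean, std

# Synthetic three-feature data with different scales, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50.0, 60.0, 4.0], scale=[10.0, 3.0, 7.0], size=(300, 3))
X_std, mu, sigma = standardize(X_demo)

# Each column should now have mean ~0 and standard deviation ~1.
print(np.abs(X_std.mean(axis=0)).max())
print(X_std.std(axis=0))
```

Returning `mean` and `std` lets the same statistics be reused on other data, for example applying training-fold statistics to the matching test fold during cross-validation.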

In [None]:
# Your code for standardization and its application on Haberman's dataset

In [None]:
# Your code to evaluate perceptron on a standardized and non-standardized dataset

*Discussion of the results comparing the perceptron on the two datasets goes here*.

### Grading 

Although we will not grade on a 100 pt scale, the following is a sample grading sheet that will give you a basic weightage of the different questions:  

```
Grading sheet for assignment 2

Part 1:  40 points.
  Fixing labels: 10
  Cross-validation code: 20
  Comparison, reporting, and discussion: 20
Part 2:  20 points.
  Creation of the learning curve: 10 points
  Discussion of the plot: 10 points
Part 3:  40 points
  Standardization code and demonstrating it on the dataset: 20 points
  Comparison, reporting, and discussion: 20 points
```