<a href="https://colab.research.google.com/github/jeffreyjgong/Intro-ML/blob/main/01_exercise_binary_classification_perceptron.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Exercise 1: Training, Validation, and Testing a Perceptron**

*CPSC 381/581: Machine Learning*

*Yale University*

*Instructor: Alex Wong*


**Prerequisites**:

1. Enable Google Colaboratory as an app on your Google Drive account

2. Create a new Google Colab notebook. This will also create a "Colab Notebooks" directory under "MyDrive", i.e.
```
/content/drive/MyDrive/Colab Notebooks
```

3. Create the following directory structure in your Google Drive
```
/content/drive/MyDrive/Colab Notebooks/CPSC 381-581: Machine Learning/Exercises
```

4. Move the 01_exercise.ipynb into
```
/content/drive/MyDrive/Colab Notebooks/CPSC 381-581: Machine Learning/Exercises
```
so that its absolute path is
```
/content/drive/MyDrive/Colab Notebooks/CPSC 381-581: Machine Learning/Exercises/01_exercise.ipynb
```

In this exercise, we will introduce basic data handling using NumPy to create training, validation and testing splits. We will implement a training and validation loop for a Perceptron and test it on the testing split.
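
For orientation, the update rule that a perceptron applies during training can be sketched in a few lines of NumPy. This is only a minimal illustration, not the scikit-learn implementation used in the exercise; the helper names `train_perceptron` and `predict`, the toy data, and the mapping of labels from {0, 1} to {-1, +1} are assumptions made for the example.

```python
import numpy as np

def train_perceptron(x, y, max_iter=10):
    """Hypothetical helper: classic perceptron updates for labels in {0, 1}."""
    # Map labels {0, 1} -> {-1, +1} so that a mistake means t * (w.x + b) <= 0
    t = np.where(y == 1, 1, -1)
    w = np.zeros(x.shape[1])
    b = 0.0

    for _ in range(max_iter):  # one pass over the data per iteration
        mistakes = 0
        for x_n, t_n in zip(x, t):
            if t_n * (np.dot(w, x_n) + b) <= 0:  # misclassified or on the boundary
                w += t_n * x_n
                b += t_n
                mistakes += 1
        if mistakes == 0:  # every sample is classified correctly, stop early
            break

    return w, b

def predict(x, w, b):
    # Positive score -> class 1, otherwise class 0
    return np.where(x @ w + b > 0, 1, 0)

# Toy, linearly separable data (made up for illustration)
x_toy = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, 0, 0])

w, b = train_perceptron(x_toy, y_toy)
y_hat_toy = predict(x_toy, w, b)
print('Toy training accuracy: {:0.2f}'.format(np.mean(y_hat_toy == y_toy)))
```

The `max_iter` argument here plays the same role as the one varied in the exercise: it caps the number of passes over the training data.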


**Submission**:

1. Implement all TODOs in the code blocks below.

2. Report your validation and testing scores.

```
Report validation and testing scores here. For example,

Max training iterations: 1
Training loss: 0.18681  Validation accuracy: 80.70%
Max training iterations: 2
Training loss: 0.61978  Validation accuracy: 43.86%
Max training iterations: 3
Training loss: 0.08791  Validation accuracy: 87.72%
Max training iterations: 4
Training loss: 0.08352  Validation accuracy: 91.23%

Test accuracy: 87.72%

```

3. List any collaborators.

```
Collaborators: Doe, Jane (Please write names in <Last Name, First Name> format)

Collaboration details: Discussed ... implementation details with Jane Doe.
```

Import packages

In [1]:
!pip install scikit-learn==1.1

Collecting scikit-learn==1.1
  Downloading scikit_learn-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30.7/30.7 MB 3.3 MB/s eta 0:00:00
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.19.1 requires scikit-learn>=1.2.2, but you have scikit-learn 1.1.0 which is incompatible.
Successfully installed scikit-learn-1.1.0


In [3]:
import numpy as np
import sklearn.datasets as skdata
from sklearn.linear_model import Perceptron
import warnings

warnings.filterwarnings(action='ignore')

Loading data

In [12]:
# TODO: Load breast cancer dataset
# dictionary object keyed by data, target, target_names, feature_names, ...
breast_cancer_data = skdata.load_breast_cancer()

# TODO: Get data, target, target_names, and feature_names from the dataset
x = breast_cancer_data['data']
y = breast_cancer_data['target']
target_names = breast_cancer_data['target_names']
feature_names = breast_cancer_data['feature_names']

# TODO: Show the number of samples and features in the dataset
print('Number of samples in dataset: {}'.format(x.shape[0]))
print('Number of features in each sample: {}'.format(x.shape[1]))

# TODO: Check to make sure that there are the same number of samples and ground truth labels
assert x.shape[0] == y.shape[0], 'Number of samples and ground truth labels do not match!'

Number of samples in dataset: 569
Number of features in each sample: 30
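
Before splitting, it can help to check how the two classes are distributed, since an accuracy score is easier to interpret when the class balance is known. A minimal sketch, assuming the same `load_breast_cancer` dataset; the variable names `data` and `counts` are chosen for this example only.

```python
import numpy as np
import sklearn.datasets as skdata

data = skdata.load_breast_cancer()
y = data['target']
target_names = data['target_names']

# np.bincount counts how many times each integer label (0, 1) occurs
counts = np.bincount(y)

for name, count in zip(target_names, counts):
    print('{}: {} samples ({:0.1f}%)'.format(name, count, 100 * count / y.shape[0]))
```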


Accessing the features of the dataset

In [13]:
# TODO: Examine the dataset by showing the first two data points
# Print each feature name and value in each line "<name>: <value>" followed by "target: <name> (<value>)"
for sample_idx in range(0, 2):

    print('sample: {}'.format(sample_idx + 1))

    for feature_value, feature_name in zip(x[sample_idx], feature_names):
        print('{} : {}'.format(feature_name, feature_value))

    target_name = target_names[y[sample_idx]]
    print('target: {} ({}) \n'.format(target_name, y[sample_idx]))

sample: 1
mean radius : 17.99
mean texture : 10.38
mean perimeter : 122.8
mean area : 1001.0
mean smoothness : 0.1184
mean compactness : 0.2776
mean concavity : 0.3001
mean concave points : 0.1471
mean symmetry : 0.2419
mean fractal dimension : 0.07871
radius error : 1.095
texture error : 0.9053
perimeter error : 8.589
area error : 153.4
smoothness error : 0.006399
compactness error : 0.04904
concavity error : 0.05373
concave points error : 0.01587
symmetry error : 0.03003
fractal dimension error : 0.006193
worst radius : 25.38
worst texture : 17.33
worst perimeter : 184.6
worst area : 2019.0
worst smoothness : 0.1622
worst compactness : 0.6656
worst concavity : 0.7119
worst concave points : 0.2654
worst symmetry : 0.4601
worst fractal dimension : 0.1189
target: malignant (0) 

sample: 2
mean radius : 20.57
mean texture : 17.77
mean perimeter : 132.9
mean area : 1326.0
mean smoothness : 0.08474
mean compactness : 0.07864
mean concavity : 0.0869
mean concave points : 0.07017
mean symmet

Creating the training, validation and testing splits

In [14]:
# TODO: Shuffle the dataset based on sample indices
shuffled_indices = np.random.permutation(x.shape[0])

# TODO: Choose the first 80% as training set, next 10% as validation and the rest as testing
train_split_size = int(0.80*x.shape[0])
val_split_size = int(0.90*x.shape[0])
# val_split_size is the point at which we truncate the dataset for validation

train_indices = shuffled_indices[0:train_split_size]
val_indices = shuffled_indices[train_split_size:val_split_size]
test_indices = shuffled_indices[val_split_size:]

# TODO: Select the examples from x and y to construct our training, validation, testing sets
x_train, y_train = x[train_indices, :], y[train_indices]
x_val, y_val = x[val_indices, :], y[val_indices]
x_test, y_test = x[test_indices, :], y[test_indices]

# TODO: Print the number of samples in training, validation and testing sets
print('Number of samples in dataset: {}'.format(x.shape[0]))
print('Number of training samples: {}'.format(x_train.shape[0]))
print('Number of validation samples: {}'.format(x_val.shape[0]))
print('Number of testing samples: {}'.format(x_test.shape[0]))
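
The shuffle in the cell above is not seeded, so the split (and therefore the reported scores) changes on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the helper name `split_indices`, its parameters, and the seed value are hypothetical choices for illustration.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.80, val_frac=0.10, seed=0):
    """Hypothetical helper: shuffled train/validation/test index arrays."""
    rng = np.random.default_rng(seed)  # seeded generator, same shuffle on every run
    shuffled = rng.permutation(n_samples)

    # Cut points: first train_frac for training, next val_frac for validation
    train_end = int(train_frac * n_samples)
    val_end = int((train_frac + val_frac) * n_samples)

    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

train_idx, val_idx, test_idx = split_indices(569)
print(len(train_idx), len(val_idx), len(test_idx))  # 455 57 57
```

Indexing `x` and `y` with the same index array keeps each sample paired with its label.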

Implementing training and validation loop

In [None]:
def mean_accuracy(predictions, ground_truths):
    '''
    Computes the mean accuracy between predictions and ground truths

    Arg(s):
        predictions : numpy[int64]
            predictions (y_hat)
        ground_truths : numpy[int64]
            groundtruth labels (y)
    Returns:
        float : mean accuracy score
    '''

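    # TODO: Implement mean accuracy
    # Fraction of samples whose prediction equals the ground truth label
    mean_accuracy = np.mean(predictions == ground_truths)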

    return mean_accuracy
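
As a quick sanity check of what mean accuracy measures, here is a small self-contained sketch on hand-made arrays. The function name `mean_accuracy_sketch` and the toy labels are assumptions for illustration; it shows one possible way to compute the score, not necessarily the intended solution.

```python
import numpy as np

def mean_accuracy_sketch(predictions, ground_truths):
    # Elementwise comparison gives a boolean array; its mean is the fraction correct
    return float(np.mean(predictions == ground_truths))

y_hat_toy = np.array([1, 0, 1, 1])
y_toy = np.array([1, 0, 0, 1])

print(mean_accuracy_sketch(y_hat_toy, y_toy))  # 3 of 4 correct -> 0.75
```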

In [None]:
# Define a list to store perceptron models
models = []

# Define a list of max iterations
max_iterations = [1, 2, 3, 4]

# Define a list to store training losses and validation accuracy scores
losses_train = []
mean_accuracies_val = []

for max_iter in max_iterations:

    '''
    Training the perceptron model
    '''
    # TODO: Set up our Perceptron model
    # max_iter is the maximum iterations through the data for training the perceptron
    # penalty and alpha relate to regularization (which we haven’t covered, so ignore them)
    model = Perceptron(penalty=None, alpha=0.0, max_iter=max_iter)

    # TODO: Train our perceptron model on the training set using fit function
    model.fit(x_train, y_train)

    # TODO: Store model into list of models
    models.append(model)

    # TODO: Make predictions on the training set using the predict function
    y_hat_train = model.predict(x_train)

    # TODO: Compute the loss on the training set
    scores_train = np.where(y_hat_train != y_train, 1, 0)
    loss_train = np.mean(scores_train)

    # TODO: Store the loss into our set of losses
    losses_train.append(loss_train)

    '''
    Validate our perceptron model on the validation set
    '''
    # TODO: Make predictions on the validation set using the predict function
    y_hat_val = model.predict(x_val)

    # TODO: Compute the accuracy on the validation set
    accuracy_val = np.where(y_hat_val == y_val, 1, 0)
    mean_accuracy_val = np.mean(accuracy_val)

    # TODO: Store the score into our list of validation accuracies
    mean_accuracies_val.append(mean_accuracy_val)

    print('Max training iterations: {}'.format(max_iter))
    print('Training loss: {:0.5f}  Validation accuracy: {:0.2f}%'.format(loss_train, 100 * mean_accuracy_val))

# TODO: Choose the best model based on highest validation accuracy
# np.argmax returns the first index of the maximum, so ties go to the smaller max_iter
best_model_idx = np.argmax(mean_accuracies_val)
best_model = models[best_model_idx]

Testing our model

In [None]:
# TODO: Make predictions on the testing set using our best model
y_hat_test = best_model.predict(x_test)

# TODO: Compute the accuracy on the testing set
accuracy_test = np.where(y_hat_test == y_test, 1, 0)
mean_accuracy_test = np.mean(accuracy_test)

print('Test accuracy: {:0.2f}%'.format(100 * mean_accuracy_test))
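
Choosing a model by validation accuracy can be illustrated with the example scores from the Submission section. This sketch assumes the accuracies are stored in a list in the same order as `max_iterations`; the numbers are the illustrative ones from the report template, not results produced by this notebook.

```python
import numpy as np

max_iterations = [1, 2, 3, 4]

# Example validation accuracies copied from the report template (illustrative only)
mean_accuracies_val = [0.8070, 0.4386, 0.8772, 0.9123]

# argmax returns the index of the highest score (the first one if there is a tie)
best_model_idx = int(np.argmax(mean_accuracies_val))

print('Best max_iter: {}  Validation accuracy: {:0.2f}%'.format(
    max_iterations[best_model_idx], 100 * mean_accuracies_val[best_model_idx]))
```

The testing split is used only once, after the model has been chosen. Selecting the model on test accuracy instead would make the reported test score optimistic.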