[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lyeskhalil/mlbootcamp2022/blob/main/lab_3_1_neuralnets.ipynb)

# UofT FASE ML Bootcamp
#### Wednesday June 15, 2022
#### Intro to Neural Networks in PyTorch - Lab 1, Day 3 
#### Teaching team: Elias Khalil, Alex Olson, Rahul Patel, and Jake Mosseri
##### Lab author: Kyle E. C. Booth, kbooth@mie.utoronto.ca, edited by Jake Mosseri

In this lab, we will take our first look at developing our own *neural networks* (NN) with [PyTorch](https://pytorch.org/), one of the most popular machine learning libraries for working with NNs.

In [2]:
!pip3 install torch torchvision torchaudio
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from time import time

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### An Intuitive Intro to Neural Nets

In today's lab, we're going to start with an intuitive exercise on the Titanic dataset using Logistic Regression and a simple Neural Network before moving onto some more complex stuff. Let's start by loading our dataset. 

Remember, the Titanic data is tabular, so we use Pandas to work with it. In this lab we fetch it from OpenML with scikit-learn's `fetch_openml`, and later separate it into our X (features) and y (target). We also need to: i) drop unimportant columns, and ii) impute missing values.

We're going to load the data and drop the unneeded columns in the next cell (the imputation happens further down, just before we fit the model) - refer to the decision tree lab from yesterday for the detailed steps.

In [3]:
from sklearn import preprocessing
from sklearn.datasets import fetch_openml


data = fetch_openml("titanic", version=1, as_frame=True).frame
data.survived = pd.to_numeric(data['survived'])
data.drop(['boat', 'body', 'home.dest'], axis=1, inplace=True)
data = data.drop(['name', 'ticket', 'cabin', 'embarked'], axis=1) # remove unimportant columns
le = preprocessing.LabelEncoder()
le.fit(data['sex'])
data['sex'] = le.transform(data['sex']) 
data.head()

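| | pclass | survived | sex | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 0 | 29.0 | 0.0 | 0.0 | 211.3375 |
| 1 | 1.0 | 1 | 1 | 0.9167 | 1.0 | 2.0 | 151.55 |
| 2 | 1.0 | 0 | 0 | 2.0 | 1.0 | 2.0 | 151.55 |
| 3 | 1.0 | 0 | 1 | 30.0 | 1.0 | 2.0 | 151.55 |
| 4 | 1.0 | 0 | 0 | 25.0 | 1.0 | 2.0 | 151.55 |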


Next, as we have become accustomed to doing, we will split the dataset into a training set (where we will do our cross validation) and a test set (our hold-out data). We've done this a few times during the labs, so hopefully you're getting used to the process!

In [4]:
from sklearn.model_selection import train_test_split

target_data = data["survived"]
feature_data = data.iloc[:, data.columns != "survived"]

X_train, X_test, y_train, y_test = train_test_split(feature_data, target_data, test_size=0.3, random_state=0)

Now, we're ready to try out some models on our training data (you haven't seen anything new yet!). Since we're solving a binary classification problem (i.e., predicting a 0 or 1 target), we want to design classifiers. So far in the course we've covered the following simple classifiers: **k-nearest neighbors**, **decision trees**, and **logistic regression**.

In this exercise, we're going to fit a logistic regression to our data and then design a neural network architecture that behaves exactly like a logistic regression and validate that we get the same result.

##### Recap: Logistic Regression

Logistic regression models are linear models similar to linear regression models. Hopefully you somewhat remember them from lecture. Let's review them, starting with the linear regression equation:


<center>$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$</center>

where $\hat{y}$ is our prediction, $\beta$ is our vector of coefficients (the things we learn), and $x$ is our feature vector. The linear regression equation defines a line when there is a single feature, and a flat (hyper)plane when there are $n$ features. The problem with linear regression is that it doesn't perform well on classification tasks, because its output is not restricted to the range of the class labels. Consider the following example:

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/linear-classification.png?raw=1" width="500"/>

The green line represents our trained linear regression model. Our feature is the size of a tumor, and our target is whether it is malignant or not (0 or 1). As we can see, even though our model is trained to the data to minimize error, for a lot of the values of tumor size it is going to give us a weird result (e.g., for some really small tumors, the prediction would be a negative value!).
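To see the problem numerically, here is a small sketch with made-up tumor sizes and labels (the numbers are illustrative assumptions, not the data behind the figure). It fits a straight line with `np.polyfit` and evaluates it at a small and a large tumor size.

```python
import numpy as np

# Toy data (hypothetical): tumor size and a 0/1 malignancy label
sizes = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Least-squares straight line: label ~ slope * size + intercept
slope, intercept = np.polyfit(sizes, labels, deg=1)

def linear_prediction(size):
    """Evaluate the fitted line at a given tumor size."""
    return slope * size + intercept

small_tumor_prediction = linear_prediction(1.0)  # falls below 0
large_tumor_prediction = linear_prediction(8.0)  # rises above 1
print(small_tumor_prediction, large_tumor_prediction)
```

Neither value can be read as a probability, which is the motivation for squashing the linear output.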

To resolve this, we use the *logistic function* (also called the *sigmoid* function) to 'squish' our linear model to be bounded by 0 and 1. The logistic (sigmoid) function is $\frac{1}{1+e^{-x}}$, and thus our logistic regression equation becomes:

<center>$\large\hat{y} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$</center>

In this equation, the values for $\hat{y}$ can never be below 0, and can never exceed 1 even for the most extreme feature values $x$.
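As a quick check of that bound, the sketch below evaluates the logistic function at a few values. Here `z` is an assumed name standing for the linear combination $\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"z = {z:6.1f}  ->  sigmoid(z) = {sigmoid(z):.5f}")
```

An input of 0 maps to exactly 0.5, the midpoint of the output range.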

**YOUR TURN:** 
* Assuming you've trained a nice logistic regression model to the below data (see Figure), what might the model fit look like (i.e., what will the line look like)? ____________________________________
* For new data samples with features $x$, how would you convert the output of the logistic regression, $\hat{y}$, into a classification (0 or 1)? ______________________________

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/logistic-classification.jpg?raw=1" width="400"/>

OK, cool! So a quick review of logistic regression. Let's use scikit-learn to fit a logistic regression model to our training set and then predict on our test set (we won't do cross validation this time). Remember, when we used decision trees and tree ensembles, our cross validation accuracy was somewhere from 75-80%. 

*Note: Remember to first impute missing values!*

In [5]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # Remember to impute missing values! (training set)
imp.fit(X_train)

X_train = imp.transform(X_train) 

logreg = LogisticRegression() 
logreg.fit(X_train, y_train)

X_test = imp.transform(X_test) # Remember to impute missing values! (test set)
predictions = logreg.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print ("Logistic accuracy: %.2f" % (accuracy * 100), "%")

Logistic accuracy: 79.13 %


And so our logistic regression gives us similar performance to the tree methods we explored yesterday. 

**YOUR TURN:** 
* The default regularization penalty for sklearn's logistic regression is L2 (the penalty used in ridge regression); can you figure out how to change it to L1 (the penalty used in LASSO)? ______________________
* What is the mean accuracy of an L1-regularized logistic regression model on the training set? ______________________

OK - time for the good stuff!

### Neural Networks

A *neural network* (NN) is a type of machine learning model that, like linear or logistic regression, takes a feature vector, X, as input and predicts a target, y. The way it does this is a little bit different, however. A typical NN architecture consists of: an input layer, hidden layers, and an output layer. Each layer consists of a set of nodes (neurons) connected by edges (outputs). Let's look at the figure below:

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/nn.jpeg?raw=1" width="400"/>

**Input layer**: This is a passive layer that simply takes in your feature data and outputs it to the hidden layers. You can think of each input layer neuron as being associated with a feature in your feature set.

**Hidden layer**: This is where the magic happens. The original features, as received by the input layer, go through a series of transformations within the hidden layer. You can think of each node (neuron) within the hidden layer as a highly transformed feature. 

**Output layer**: This is where we get our final result, the 0 or 1 prediction.

### Zooming In

Let's take a look at what is happening at any given node (neuron) within the hidden layer. Take a look at the following image of a neuron within an NN:

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/neuron.png?raw=1" width="500"/>

Every neuron has some inputs ($x_1, x_2, \dots, x_n$) with input weights ($w_1, w_2, \dots, w_n$) and an output, $Y$. The neuron itself applies a transformation, $f$, known as the *activation function*, to the linear combination of its inputs and input weights. The value $b$ is a constant weight called the bias.

There are many different types of activation functions, but the popular ones are the *sigmoid*, *tanh*, and *ReLU* activation functions. Yes, you heard correctly: the sigmoid function is a popular activation function! (This should be reminding you of the logistic regression model we discussed above).
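The neuron computation described above can be sketched in a few lines of plain Python. The inputs, weights, and bias below are hypothetical values chosen so the weighted sum is easy to check by hand, and `neuron` is an illustrative helper, not a PyTorch function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Apply `activation` to the weighted sum of the inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical inputs, weights, and bias, chosen only for illustration
x = [1.0, 2.0]
w = [0.5, -1.0]
b = 0.5  # weighted sum: 0.5*1.0 + (-1.0)*2.0 + 0.5 = -1.0

# The same neuron with three different activation functions
for name, f in [("sigmoid", sigmoid), ("tanh", math.tanh), ("ReLU", relu)]:
    print(name, neuron(x, w, b, f))
```

Only the activation changes between the three calls; the weighted sum feeding it is the same.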

**YOUR TURN:**
* If you were to develop a simple neural network architecture that was equivalent to a logistic regression model for the Titanic data, how would you do it? Get a pen and paper and draw it out. Make sure to specify: the input layer, the hidden layer(s), the output layer, the activation function(s), the weights, and the biases.
* How many hidden layers does your NN have? What type of activation function, $f$, does it use? _________________
* Say you wanted to add another layer to your NN architecture with 3x neurons, what would your new architecture look like? _________________

### Intro to PyTorch

OK, so now that we've made the connection between NNs and Logistic Regression, let's code up our little NN in PyTorch and use it to predict survivorship on the Titanic dataset.

First, *tensors* are the fundamental data type of PyTorch. Each tensor is effectively a multi-dimensional array, just like a numpy array. The primary difference is that tensors are designed to support the NN training process: they can be moved to a GPU, and they can track the gradients needed for learning.
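A minimal sketch of both points, using small made-up arrays: converting a numpy array to a tensor, and asking a tensor to track gradients.

```python
import numpy as np
import torch

array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

# NumPy array -> tensor (torch.from_numpy shares memory with the array)
tensor = torch.from_numpy(array)

# Tensors support the same kind of element-wise math as NumPy arrays
doubled = tensor * 2

# Unlike a NumPy array, a tensor can track gradients for training
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()  # computes d(loss)/dw = 2 * w

print(doubled.numpy())
print(w.grad)
```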

Let's load our X and y training data into tensors: 

In [6]:
import torch
from torch import nn, optim

# Variable has been deprecated since PyTorch 0.4; plain tensors work directly
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32)

X_test_tensor = torch.tensor(X_test, dtype=torch.float32)

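Next, we define our logistic regression network as a model class. The class below, `LogisticRegression`, applies a sigmoid transformation to the output of a single linear layer, as required. (Note that it reuses the name of the scikit-learn `LogisticRegression` class imported earlier, so from this cell on that name refers to the PyTorch model.)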

In [7]:
class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # A single fully connected layer: output = x @ W.T + b
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # Squash each linear output into (0, 1) with the sigmoid
        outputs = torch.sigmoid(self.linear(x))
        return outputs

Next, we identify the dimensions of our problem: 6 x 2 (6 features and 2 target classes: 0 or 1), initialize our model with those dimensions and then specify the loss function ([cross entropy](https://en.wikipedia.org/wiki/Cross_entropy)) and optimization technique ([stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)).

In [8]:
input_dim = 6
output_dim = 2
    
model = LogisticRegression(input_dim, output_dim)

criterion = torch.nn.CrossEntropyLoss() 
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
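One detail worth knowing about this loss: `CrossEntropyLoss` expects raw, unnormalized scores (logits) and applies a log-softmax to them internally. The sketch below checks that equivalence on made-up scores. Because the `forward` method above already applies a sigmoid, the loss in this lab receives sigmoid outputs rather than raw logits.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Made-up raw scores (logits) for 3 samples and 2 classes
logits = torch.tensor([[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]])
targets = torch.tensor([0, 1, 1])  # true class index for each sample

# Built-in cross entropy computed directly on the raw logits
ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps
log_probs = F.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())
```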

Next, we set up our dataset to be *iterable* such that we can train our neural network in *batches*. A batch is a subset of the total data such that if we combined them all, we'd get the whole dataset. Batching is done to speed up the training process and reduce memory requirements.
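As a toy illustration of batching (a 10-sample list stands in for a dataset, and `make_batches` is a hypothetical helper, not part of PyTorch):

```python
import math

def make_batches(samples, batch_size):
    """Split a list into consecutive batches; the last one may be smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

samples = list(range(10))  # a toy dataset of 10 samples
batches = make_batches(samples, batch_size=4)

print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# The number of batches is the dataset size divided by the batch size, rounded up
print(len(batches) == math.ceil(len(samples) / 4))
```

Concatenating the batches gives back the whole dataset, and only the final batch is allowed to be short.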

**YOUR TURN:** 
* If we select a batch size of 32, how many batches of training data will be generated?________________

The `DataLoader` class does this batching operation for us.

In [9]:
batch_size = 32

train_loader = torch.utils.data.DataLoader(dataset=X_train_tensor, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=X_test_tensor, batch_size=batch_size, shuffle=True)

Next, we train our NN for 3000 *epochs* and print its accuracy on the test set every 4000 iterations. An epoch is one complete pass of the entire dataset through the network. An iteration is a single forward/backward pass of the network over one batch of data.
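The relationship between epochs, batches, and iterations can be sketched with a bare counting loop. The dataset size, batch size, and epoch count below are toy values, not the lab's.

```python
num_samples = 10  # toy dataset size (illustrative)
batch_size = 4
epochs = 5

iterations = 0
for epoch in range(epochs):
    # One epoch: walk through the whole dataset, one batch at a time
    for start in range(0, num_samples, batch_size):
        # A real training loop would run one forward/backward pass here
        iterations += 1

batches_per_epoch = -(-num_samples // batch_size)  # ceiling division
print(iterations, epochs * batches_per_epoch)
```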

**YOUR TURN:**
* If our batch size is 32 and we elect to do 3000 epochs (i.e., the network sees the entire dataset 3000 times), how many iterations (forward/backward passes of the data) will our network see? _______________________

Run the below code and check out the incremental accuracy improvement output below.

In [12]:
import random

random.seed(1)
torch.manual_seed(1)
torch.backends.cudnn.deterministic = True

iterations = 0
epochs = 3000

for epoch in range(epochs):
    # enumerate yields (batch index, batch); neither is used below, so the
    # loader only sets the number of iterations per epoch
    for batch_index, batch in enumerate(train_loader):
        optimizer.zero_grad()
        # Note: each step runs on the full training tensors, not on `batch`
        outputs = model(X_train_tensor.float())
        loss = criterion(outputs, y_train_tensor.long())
        loss.backward()
        optimizer.step()

        iterations += 1

        if iterations % 4000 == 0:
            with torch.no_grad():  # no gradients needed for evaluation
                outputs = model(X_test_tensor.float())
            # The predicted class is the index of the larger output
            _, predicted = torch.max(outputs, 1)

            accuracy = accuracy_score(y_test, predicted.numpy())
            print("Logistic NN test set accuracy: %.2f after %d iterations" % (accuracy * 100, iterations))

Logistic NN test set accuracy: 76.08 after 4000 iterations
Logistic NN test set accuracy: 76.08 after 8000 iterations
Logistic NN test set accuracy: 76.08 after 12000 iterations
Logistic NN test set accuracy: 76.08 after 16000 iterations
Logistic NN test set accuracy: 75.83 after 20000 iterations
Logistic NN test set accuracy: 75.83 after 24000 iterations
Logistic NN test set accuracy: 75.83 after 28000 iterations
Logistic NN test set accuracy: 75.57 after 32000 iterations
Logistic NN test set accuracy: 75.57 after 36000 iterations
Logistic NN test set accuracy: 75.57 after 40000 iterations
Logistic NN test set accuracy: 75.57 after 44000 iterations
Logistic NN test set accuracy: 75.57 after 48000 iterations
Logistic NN test set accuracy: 75.57 after 52000 iterations
Logistic NN test set accuracy: 75.57 after 56000 iterations
Logistic NN test set accuracy: 75.83 after 60000 iterations
Logistic NN test set accuracy: 75.83 after 64000 iterations
Logistic NN test set accuracy: 75.83 after
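The loop above iterates over `train_loader`, but each step calls the model on the full `X_train_tensor`, so every update uses the whole training set. For comparison, here is a minimal sketch of a loop that trains on one mini-batch per step. It pairs features with targets through `torch.utils.data.TensorDataset` and runs on synthetic data; the data, the class name `TinyLogisticNet`, and the hyperparameters are assumptions for illustration, not the lab's setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data (illustrative): 100 samples, 6 features, 0/1 labels
X = torch.randn(100, 6)
y = (X[:, 0] + X[:, 1] > 0).long()

class TinyLogisticNet(nn.Module):
    """One linear layer producing one raw score (logit) per class."""
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(x)  # CrossEntropyLoss applies log-softmax itself

model = TinyLogisticNet(6, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# TensorDataset keeps each feature row together with its target
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

epochs = 20
iterations = 0
for epoch in range(epochs):
    for features, target in loader:  # one mini-batch per step
        optimizer.zero_grad()
        loss = criterion(model(features), target)
        loss.backward()
        optimizer.step()
        iterations += 1

# Evaluate on the same synthetic data, without tracking gradients
with torch.no_grad():
    predicted = model(X).argmax(dim=1)
train_accuracy = (predicted == y).float().mean().item()
print(iterations, train_accuracy)
```

With 100 samples and a batch size of 32, the loader yields 4 batches per epoch, so the counter ends at 80 iterations.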

Hopefully you got an accuracy in the same range as sklearn's built-in logistic regression from earlier in the lab (79.13%). The recorded run above settles at about 76%, so the two results are close but not identical.

Let's take a look at the trained model parameters using the `model.parameters()` method within PyTorch.

In [11]:
params = list(model.parameters())
params

[Parameter containing:
 tensor([[ 0.2954,  3.7111,  0.0665,  1.6594,  0.0412, -0.1659],
         [-0.2349, -5.7985,  0.0333, -0.5174, -0.4381,  0.0094]],
        requires_grad=True), Parameter containing:
 tensor([-1.2316,  1.3425], requires_grad=True)]

Hm, well this is interesting! We can see that our model consists of two tensors: the first has shape (2, 6), and the second has shape (2,). Refer back to how you drew what you thought this NN architecture would look like.
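To connect those shapes to the layer definition, here is a small sketch that builds a fresh `nn.Linear` with the same dimensions and inspects it. The layer is newly initialized, so its values differ from the trained parameters printed above.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=6, out_features=2)

# The weight matrix is stored as (out_features, in_features)
print(layer.weight.shape)  # torch.Size([2, 6])
# The bias vector has one entry per output
print(layer.bias.shape)    # torch.Size([2])

# A batch of 4 samples with 6 features each maps to 4 rows of 2 outputs
batch = torch.zeros(4, 6)
print(layer(batch).shape)  # torch.Size([4, 2])
```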

**YOUR TURN:**
* What do you think these values represent? ____________________________
* How many hidden layers does the architecture have? ______________________________
* Draw the architecture and label (some of) the weights (trained parameters). ______________________________

(*Hint: to answer these questions, try printing `outputs.data[0]` and `predicted[0]` to look at the model's assessment of the first sample*)

You'll notice that this took considerably longer to train than scikit-learn's logistic regression: this is because PyTorch is set up to be more flexible and to train architectures much more complex than a simple single-layer network, while scikit-learn's implementation of logistic regression is highly optimized for that one model.

**YOUR TURN:**
* How many epochs would you need for training to reach 100,000 iterations? ______________________
* Does increasing to 100,000 iterations improve your test set accuracy? ______________________
* Compare the predictions of your logistic regression from scikit-learn and your network developed with PyTorch. Are all the predictions the same? How many predictions are pair-wise different? ______________________

Congratulations! You've completed an introduction to neural networks and PyTorch. If you want to explore more sophisticated architectures and applications, check out this walkthrough of designing a PyTorch neural network to classify digit images:

https://towardsdatascience.com/handwritten-digit-mnist-pytorch-977b5338e627

Other than that, you're done the lab!