<center><h1>Very Brief Introduction to Deep Learning</h1></center>
<center><h3>Paul Stey</h3></center>


# Impact of Deep Learning

It would be difficult to overstate the impact.
  * AI chatbots
  * Facial recognition
  * Image processing
  * Voice recognition 
  * Medical imaging (e.g., tumor detection, pathology)
  * Self-driving cars
  * Game-playing AIs
  * Virtual assistants (e.g., Siri, Alexa, Google)

# Nomenclature

Deep learning is a family of modeling approaches with many names:

  * Neural networks (NN)
  * Deep neural networks (DNN)
  * Artificial neural networks (ANN)

## Neural Network Basics


What is a neural network?
  * Universal function approximator
  * A species of directed acyclic graphs (usually)
		

## What do neural networks do?

Like many other statistical or machine learning models (e.g., GLM, random forests, boosting), neural networks:
  * Attempt to approximate a data-generating mechanism
  * Can be used for classification problems
  * Can be used for regression problems
  * Can also be used for dimension reduction like principal components analysis (PCA)


## Neural Networks vs. other ML Modeling

Similarities to other types of machine learning models

  * Input variables (i.e., _**X**_, features, predictors, etc.) and output variable (i.e., _y_)

  
<center><img src="images/input_output.png" width=420/></center>

## Applications of Deep Learning
Deep learning is extremely flexible, and can be applied to many domains.
  
<center><img src="images/self-driving_car.jpg" width=860/></center>

## History of Neural Networks

* Neural networks have been around since the 1940s, with the first artificial neuron being created by Warren McCulloch and Walter Pitts in 1943.
* In the 1950s and 1960s, researchers developed the first neural network algorithms, including the Perceptron algorithm, which was capable of learning to classify images.
* During the 1970s and 1980s, neural networks fell out of favor as researchers found it difficult to train them effectively. However, the development of backpropagation, a learning algorithm that could effectively train deep neural networks, in the 1980s paved the way for their resurgence.
* In the 1990s, researchers developed Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), which improved the ability of neural networks to handle complex data such as images and sequences.
* In the 2010s, advances in computing power and the availability of large datasets led to a rapid increase in the use of deep learning for a wide range of applications, including computer vision, natural language processing, and speech recognition. 
* Today, deep learning is one of the most active areas of research in artificial intelligence.

### More Recently

Neural networks are experiencing a major resurgence. There are at least three reasons.

  * Better algorithms for back-propagation
  * GPUs are well suited to building neural networks
    - Matrix multiplies can be made embarrassingly parallel 
    - GPUs have much better memory bandwidth
  * More labeled data
  
  
<center><img src="images/two_johns.jpg" width=380/></center>

# Multi-Layer Perceptron

An early and fairly straightforward example of a neural network.

<center><img src="images/neural_network3.png" width = 420/></center>

## Single Neuron

A single neuron takes inputs, $x_j$, and applies the weights, $w_{\cdot j}$, by computing the dot product of the vectors $x$ and $w$. The result is passed to the "activation function".

<center><img src="images/neuron2.png" width = 420/></center>
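As a minimal sketch of the computation a single neuron performs, the snippet below takes the dot product of a weight vector and an input vector and passes the result through an activation. The input values, the weights, the bias term `b`, and the choice of `np.tanh` as the activation are all assumptions made for illustration.

```python
import numpy as np

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = np.dot(w, x) + b      # linear predictor (a single number)
    return activation(z)

# Hypothetical inputs, weights, and bias chosen only for illustration
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.3

output = neuron(x, w, b, np.tanh)   # tanh used here as the activation
print(output)
```

A layer is just many of these neurons sharing the same inputs, which is why the weights are usually stored as a matrix.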

# Multi-Layer Perceptron (cont.)

Larger networks can have many, _many_ weights!
  * Origin of the term "deep" neural networks 
  * Largest models have _trillions_ of weights (i.e., parameters)

<center><img src="images/neural_network4.png" width = 420/></center>


### Activation Functions
  * The notion of an activation function comes again from the conceptual relationship to neurons in the brain.

  * Activation functions are analogous to the inverse of the "link" functions in generalized linear models (GLMs). 
    
  * In fact, one common activation function is the sigmoid function, which is just our old friend the logistic function: the same function you use when you fit a logistic regression model.

### Purpose of Activation Functions

There are a few reasons we use activation functions.    

  * Need to take some linear predictor and transform it so that it is bounded appropriately. For instance, the value of the logistic function is in the range $(0, 1)$. 
  * Allows us to introduce non-linearities. 
    - Approximate a data-generating mechanism 
    - Trying to approximate a function that might be very complicated and include non-linearities
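To illustrate why the non-linearity matters, here is a small sketch showing that two stacked layers with no activation collapse into a single linear map, while placing a non-linearity between them breaks that equivalence. The layer sizes, the random weights, and the use of `np.tanh` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer": 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))   # second "layer": 4 units -> 2 outputs
x = rng.normal(size=3)

# Without an activation, two layers are equivalent to one linear map
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
linear_layers_collapse = np.allclose(stacked, collapsed)

# With a non-linearity in between, the single linear map no longer matches
with_activation = W2 @ np.tanh(W1 @ x)
nonlinear_differs = not np.allclose(with_activation, collapsed)

print(linear_layers_collapse, nonlinear_differs)
```

In other words, depth only adds expressive power when something non-linear sits between the layers.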

### Common Activation Functions

Some common activation functions include the following: 
  * Sigmoid (i.e., logistic)
  * Hyperbolic tangent: $\tanh$
  * Rectified linear unit (ReLU)
  * softplus
  
<center><img src="images/activation_functions.png" width = 420/></center>

<center><h1>Challenge Question</h1></center>

The sigmoid and the ReLU activation functions are two of the most common in deep learning. The formulas for these are below. Write a `sigmoid()` and a `relu()` function in Python that implements these.

$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

$$\text{relu}(x) = \max(0, x)$$

<br>
<br>

**Hint:** Note that the NumPy module includes the constant `e` (available as `numpy.e`).


# Varieties of Neural Network (and layers)

1. The "feed-forward" layer/network
  * Multi-layer perceptron is a feed-forward network
  * Most networks involve at least _some_ feed-forward layer
2. Convolutional neural network (CNN)
  * Ubiquitous in computer vision (i.e., image classification, object detection, facial recognition)
3. Recurrent neural networks (RNN)
  * Long short-term memory (LSTM) networks
4. Generative adversarial network (GAN)
  * Widely used to generate realistic synthetic images
5. Autoencoders

6. Transformers
  

# Convolutional Neural Networks (CNNs)

* Regular neural nets don't scale well to images
  - For images of size $32 \times 32 \times 3$, a _single_ fully-connected neuron in the first layer would have $3072$ weights.
  - For images of size $200 \times 200 \times 3$, a _single_ neuron would have $120000$ weights.
* Full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
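The weight counts above follow directly from multiplying the image dimensions. The sketch below reproduces that arithmetic; the helper name `weights_per_neuron` and the 100-neuron layer are hypothetical and used only to show how quickly the total grows.

```python
def weights_per_neuron(width, height, channels):
    """Number of weights for one fully-connected neuron that sees every pixel."""
    return width * height * channels

small = weights_per_neuron(32, 32, 3)      # 3072
large = weights_per_neuron(200, 200, 3)    # 120000

# A hypothetical first layer with 100 such neurons (layer size is an assumption)
layer_total = 100 * large                  # 12000000 weights, before biases
print(small, large, layer_total)
```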

## CNNs (cont.)
What are CNNs?
  * ConvNets are very similar to neural networks discussed thus far. Dot product, followed by non-linearity, and loss function at the end.
  * Explicit assumption that inputs are images.
  * Layers have neurons arranged in 3 dimensions (width, height, depth) to form an **activation volume**
  
<center><img src="images/cnn.png" width="750"></center>

<center><img src="images/convolution.gif" width="750"></center>

### Original Image
<center><img src="images/building.jpg" width="750"></center>

### Apply Sobel operator filter

<center><img src="images/building_sobel.jpg" width="750"></center>
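To make the filtering step concrete, here is a minimal NumPy sketch that slides a horizontal Sobel kernel over a tiny synthetic image. The helper `filter2d` and the toy image are assumptions for illustration; the function computes cross-correlation with no padding and stride 1, which is the operation deep learning libraries usually call "convolution".

```python
import numpy as np

def filter2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernel that responds to horizontal changes in intensity (vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Toy 5x6 image: dark on the left half, bright on the right half
image = np.zeros((5, 6))
image[:, 3:] = 1.0

edges = filter2d(image, sobel_x)
print(edges)   # each row is [0, 4, 4, 0]: large values only where the edge is
```

A convolutional layer works the same way, except that the kernel values are learned during training rather than fixed in advance.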

## Architecture of CNN
	

Types of layers used to build ConvNets
  * Convolutional Layer
    - Input: 3-d volume
    - Output: 3-d volume
    - Convolves "filters" with small regions of the image
    - Output depth depends on the number of filters
  * Pooling Layer
    - Downsampling along spatial dimensions (width, height)
  * Fully-Connected Layer (what we've seen so far)
    - Compute class score. Dimensions are transformed to $1 \times 1 \times k$, where $k$ is number of classes 
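The sketch below traces how the dimensions of a volume might change as it passes through these three layer types. It uses the standard output-size formula `(input - filter + 2 * padding) // stride + 1`; the specific filter sizes, strides, padding, number of filters, and `k = 10` classes are all hypothetical choices, not settings taken from this notebook.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size (width or height) after a convolution or pooling window."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# Hypothetical layer settings, chosen only to illustrate the bookkeeping
width, depth = 32, 3                      # a 32 x 32 x 3 input volume

# Convolutional layer: 16 filters of size 5 x 5, stride 1, padding 2
width = conv_output_size(width, filter_size=5, stride=1, padding=2)
depth = 16                                # output depth = number of filters
after_conv = (width, width, depth)

# Pooling layer: 2 x 2 window, stride 2 (depth is unchanged)
width = conv_output_size(width, filter_size=2, stride=2)
after_pool = (width, width, depth)

# Fully-connected layer: flatten the volume and map it to k class scores
k = 10
fc_weights = width * width * depth * k
print(after_conv, after_pool, fc_weights)
```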

# Training vs. Inference

1. Training neural network
  * Process that computes weights (i.e., parameter estimates)
  * Can take hours, days, weeks, or months
  * Typically done on specialized hardware
    - GPUs, TPUs, FPGAs
2. Inference
  * Use existing network (i.e., weights)
  * Make predictions (i.e., classification, or numerical prediction)
  * Happens fast; in many cases _extremely_ fast (e.g., milliseconds)
  * Needs to happen on all kinds of devices (e.g., phones, cameras, sensors)

# Deep Learning Packages

1. TensorFlow
  * Free, open-source software
  * Created and primarily developed by Google
  * C++ library, callable from C++, Python, or R
2. Keras
  * "Front-end" API for TensorFlow
3. PyTorch
  * Free, open-source software
  * Created and developed in large part by Facebook (now Meta)
  * C++ library with a Python interface


# Hyperparameters 

One of the nuances of deep learning models is there are often many hyperparameters that can be tuned. These are aspects of the model that can have a huge impact on the model's accuracy, rate of convergence, and total time for training and inference. 

Here are some of the most consequential hyperparameters:

* Learning rate

* Batch size

* Number of layers

* Number of units per layer

* Activation function

* Dropout layers

* Number of epochs

## Learning Rate

* Controls the step size taken by the optimization algorithm when updating the model's weights. 

* It determines the rate at which the model learns from the data and converges to the optimal solution.

* A smaller learning rate corresponds to smaller weight updates, resulting in a slower convergence towards the optimal solution. 
  - While this can lead to more precise and stable convergence, it can also make the training process take longer and increase the risk of getting stuck in local minima.
  
* A larger learning rate leads to more significant weight updates, which can speed up the training process and allow the model to escape local minima. 
  - However, using a learning rate that is too large may cause the model to overshoot the optimal solution, resulting in oscillation or divergence, and poor convergence.
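A toy example can make these trade-offs visible. The sketch below runs plain gradient descent on $f(w) = w^2$ with three learning rates; the function, the starting point, the number of steps, and the specific rates are assumptions chosen so that each regime shows up clearly.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize f(w) = w**2, whose gradient is 2*w and whose minimum is at w = 0."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

small = gradient_descent(0.01)   # slow: still far from 0 after 20 steps
good = gradient_descent(0.4)     # converges quickly toward 0
large = gradient_descent(1.1)    # overshoots on every step and diverges

print(small, good, large)
```

Real loss surfaces are far less well behaved than this parabola, so the "right" learning rate is usually found by experiment.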

## Batch Size 

* The batch size is an important hyperparameter in deep learning that controls the number of training examples used in a single update of the model's weights during training. 
  - It is the number of samples that are processed simultaneously in each iteration of the training process.
  
* Smaller batch sizes result in noisier gradient estimates, which can help escape local minima and promote better generalization. 
  - However, this may also lead to slower convergence. Larger batch sizes provide more accurate gradient estimates, which can lead to faster convergence but may risk getting stuck in local minima.
  
* Larger batch sizes require more memory to store intermediate values, both in terms of GPU/TPU memory and system memory (RAM). 
  - This can become a limiting factor, especially when working with large models and high-resolution data.
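The claim that small batches give noisier gradient estimates can be checked with a simulation. In the sketch below, an array of random numbers stands in for per-example gradients (an assumption; no real model is involved), and the spread of the mini-batch average is compared for two batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-example gradients: 10,000 noisy values with true mean 1.0
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=10_000)

def gradient_estimate_spread(batch_size, n_batches=500):
    """Standard deviation of the mini-batch gradient estimate across many batches."""
    estimates = [
        rng.choice(per_example_grads, size=batch_size, replace=False).mean()
        for _ in range(n_batches)
    ]
    return float(np.std(estimates))

spread_small = gradient_estimate_spread(batch_size=10)
spread_large = gradient_estimate_spread(batch_size=1000)
print(spread_small, spread_large)
```

The estimate from batches of 10 should vary far more from batch to batch than the estimate from batches of 1000.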

## Number of layers

* Defines the depth of a neural network, which influences the model's capacity to learn complex patterns and representations. 

* Deeper networks can potentially learn more intricate features, but they are also more prone to overfitting and require more computational resources.

* Shallower networks require fewer resources and less time for training, but can underfit the data.

## Number of Units Per Layer

* Determines the width of each layer in a neural network. More units per layer can increase the expressive power of the model, but may also lead to overfitting.

* More units per layer also increase the computational burden, since the number of weights grows with the width of each layer.
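To get a feel for how depth and width drive model size, the following sketch counts the weights and biases in a fully-connected network. The helper `count_parameters` and the layer sizes are illustrative assumptions.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully-connected network with the given layer sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out    # weight matrix + one bias per unit
    return total

# Layer sizes are illustrative: 784 inputs, 10 outputs, varying hidden layers
one_hidden = count_parameters([784, 524, 10])
wider = count_parameters([784, 1048, 10])
deeper = count_parameters([784, 524, 524, 524, 10])
print(one_hidden, wider, deeper)
```

Doubling the width of the single hidden layer roughly doubles the parameter count here, and adding two more hidden layers of the same width more than doubles it.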

## Activation Function

* The non-linear function applied to the output of each layer, which introduces non-linearity into the model and allows it to learn complex mappings. 

* Common activation functions include ReLU, sigmoid, and tanh.

## Number of Epochs

* The number of epochs in training a deep learning model refers to the number of times the entire training dataset is passed through the model during the training process. 

* An epoch consists of multiple iterations, where each iteration processes a batch of training examples and updates the model's weights based on the computed gradients.

* _Underfitting_: If the number of epochs is too low, the model may not have enough time to learn the underlying patterns in the data, leading to underfitting. In this case, the model performs poorly on both the training and validation datasets.

* _Overfitting_: If the number of epochs is too high, the model can start to memorize the training data, leading to overfitting. In this case, the model performs well on the training dataset but poorly on unseen data or the validation dataset.
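The relationship between epochs, batches, and iterations is simple arithmetic, sketched below. The sample counts and batch sizes are hypothetical, and the function assumes a final partial batch is still processed rather than dropped.

```python
import math

def total_iterations(num_samples, batch_size, num_epochs):
    """Number of weight updates when every epoch covers the full training set."""
    iterations_per_epoch = math.ceil(num_samples / batch_size)
    return iterations_per_epoch * num_epochs

# Hypothetical values: 56,000 training examples, batches of 100, 2 epochs
updates = total_iterations(num_samples=56_000, batch_size=100, num_epochs=2)
print(updates)   # 1120

# A final partial batch still counts as one iteration
print(total_iterations(num_samples=1_050, batch_size=100, num_epochs=1))   # 11
```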

## Dropout Layers

* _Regularization technique_: Used to prevent overfitting by randomly dropping out, or "deactivating," a proportion of neurons in a layer during training.

* _Dropout rate_: Determines the fraction of neurons to be deactivated in a given layer; set to a value between 0 and 1 (e.g., 0.5 corresponds to dropping out 50% of the neurons).

* _Training-time only_: Dropout is applied only during training. When the model is used for inference or evaluation, all neurons are active. To keep the expected size of the activations the same in both modes, either the outputs are scaled by the keep probability (1 minus the dropout rate) at inference, or, as in PyTorch's `nn.Dropout`, the surviving activations are scaled up by 1 / (1 - dropout rate) during training.
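Here is a minimal NumPy sketch of one common variant, often called inverted dropout, in which the surviving activations are rescaled during training so that nothing needs to change at inference. The function name, its arguments, and the all-ones input are assumptions for illustration; this is not how any particular library implements it internally.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction of units and rescale the survivors."""
    if not training or rate == 0.0:
        return activations                    # inference: every unit stays active
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob     # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones(100_000)

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)

print((train_out == 0).mean())   # fraction dropped, close to 0.5
print(train_out.mean())          # close to 1.0, matching the inference output
print(eval_out.mean())
```

Because a different random mask is drawn on every call, no single neuron can be relied on, which is what discourages the network from memorizing the training data.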


In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

In [None]:
X, y = fetch_openml('mnist_784', version = 1, return_X_y = True)

In [None]:

def show_image(x):
    x_resize = np.array(x).reshape(28, 28)
    plt.imshow(x_resize, 
               cmap = "Blues")
    plt.show()

In [None]:
n = 137                    # image number
show_image(X.iloc[n])      # plot image

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X.values, 
                                                    y.values.astype(int), 
                                                    test_size = 0.2, 
                                                    random_state = 0)

In [None]:
!pip install torch               # this may take a bit of time

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torch.nn.functional as F

In [None]:
# Feed forward neural network definition
class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

In [None]:
# Constants
input_size = 784
output_size = 10
num_samples = X_train.shape[0]

# Hyperparameters
hidden_size = 524            # number of neurons in hidden layers
batch_size = 100             # number of samples in each iteration
num_epochs = 2               # number of full passes through training set
learning_rate = 0.1          # step size for optimization algorithm


# Create DataLoader
dataset = data.TensorDataset(torch.Tensor(X_train), torch.LongTensor(y_train))
dataloader = data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Initialize the model, loss function, and optimizer
model = Net(input_size, hidden_size, output_size)
criterion = nn.NLLLoss()     # the model already returns log-probabilities via log_softmax
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
itr = 1
print_itr = 200
total_loss = 0
loss_list = []
acc_list = []

# Training loop
for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        # Forward pass
        outputs = model(inputs)

        # Compute loss
        loss = criterion(outputs, labels)
        total_loss += loss.item()

        # Zero gradients
        optimizer.zero_grad()

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()
        
        if itr % print_itr == 0:
            pred = torch.argmax(outputs, dim=1)
            correct = pred.eq(labels)
            acc = torch.mean(correct.float())
            print('[Epoch {}/{}] Iteration {} -> Train Loss: {:.4f}, Accuracy: {:.3f}'.format(epoch+1, num_epochs, itr, total_loss/print_itr, acc))
            loss_list.append(total_loss/print_itr)
            acc_list.append(acc)
            total_loss = 0
        
        itr += 1


In [None]:
def calculate_accuracy(model, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    
    with torch.no_grad():
        for inputs, target in test_loader:   # `inputs` avoids shadowing the torch.utils.data alias `data`

            output = model(inputs)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)                        # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nAverage loss: {:.4f}, Accuracy: {}/{} ({:.3f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

In [None]:
dataset_test = data.TensorDataset(torch.Tensor(X_test), torch.LongTensor(y_test))
dataloader_test = data.DataLoader(dataset_test, batch_size=batch_size, shuffle=True)


In [None]:
calculate_accuracy(model, dataloader_test)

In [None]:
calculate_accuracy(model, dataloader)

In [None]:
# Neural network definition
class Net2(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Net2, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.dropout1 = nn.Dropout(0.20)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.dropout2 = nn.Dropout(0.40)
        self.fc4 = nn.Linear(hidden_size, hidden_size)
        self.dropout3 = nn.Dropout(0.60)
        self.fc5 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.dropout1(x)
        x = self.fc3(x)
        x = self.relu(x)
        x = self.dropout2(x)
        x = self.fc4(x)
        x = self.relu(x)
        x = self.dropout3(x)
        x = self.fc5(x)
        output = F.log_softmax(x, dim=1)
        return output

In [None]:
# Constants
input_size = 784
output_size = 10
num_samples = X_train.shape[0]

# Hyperparameters
hidden_size = 424            # number of neurons in hidden layers
batch_size = 250             # number of samples in each iteration
num_epochs = 3               # number of full passes through training set
learning_rate = 0.5          # step size for optimization algorithm


# Create DataLoader
dataset = data.TensorDataset(torch.Tensor(X_train), torch.LongTensor(y_train))
dataloader = data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Initialize the model, loss function, and optimizer
model = Net2(input_size, hidden_size, output_size)
criterion = nn.NLLLoss()     # the model already returns log-probabilities via log_softmax
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
itr = 1
print_itr = 200
total_loss = 0
loss_list = []
acc_list = []

# Training loop
for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        # Forward pass
        outputs = model(inputs)

        # Compute loss
        loss = criterion(outputs, labels)
        total_loss += loss.item()

        # Zero gradients
        optimizer.zero_grad()

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()
        
        if itr % print_itr == 0:
            pred = torch.argmax(outputs, dim=1)
            correct = pred.eq(labels)
            acc = torch.mean(correct.float())
            print('[Epoch {}/{}] Iteration {} -> Train Loss: {:.4f}, Accuracy: {:.3f}'.format(epoch+1, num_epochs, itr, total_loss/print_itr, acc))
            loss_list.append(total_loss/print_itr)
            acc_list.append(acc)
            total_loss = 0
        
        itr += 1


In [None]:
calculate_accuracy(model, dataloader_test)   # accuracy on the held-out test set

In [None]:
calculate_accuracy(model, dataloader)        # accuracy on the training set