# Introduction to Neural Networks

In this assignment you will first be introduced to the components of a deep learning model. You will study the code that creates and executes a model, and you will apply the theoretical background to propose a model for image data.

Learning goals:
- understand the components of a deep learning model and how they work mathematically
- relate the components to the hyperparameters and model setup of a Keras (TensorFlow) model
- propose improvements to the design

Data:

The data we will use is the Breast Cancer Wisconsin (Diagnostic) Data Set of the UCI Machine Learning Repository. You are however free to use your own dataset of interest to study the code. 

In case you want to study image classification, you can use a dataset such as the Dataset of breast ultrasound images from Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Data in Brief. 2020 Feb;28:104863. DOI: 10.1016/j.dib.2019.104863


Sources used: 
- https://medium.com/mlearning-ai/binary-classification-of-breast-cancer-diagnosis-using-tensorflow-neural-networks-30ac8f40388
- Deep Learning for the Life Sciences by Bharath Ramsundar, Peter Eastman, Pat Walters, Vijay Pande Released April 2019 Publisher(s): O'Reilly Media, Inc. ISBN: 9781492039839 https://www.oreilly.com/library/view/deep-learning-for/9781492039822/



# Assignment

Study the material and answer the questions

1. Study the [background text](#0)
2. Study the [code steps](#1). Add comments in your own words and explain design choices such as
    - number of [layers](#01), 
    - [width](#02) of layers, 
    - number of [epochs](#03), 
    - [activation functions](#04), 
    - [loss function](#05), 
    - [gradient descent function](#06), 
    - [regularization function](#07)
3. Run the [code](#1). Evaluate the performance by discussing the results of the evaluation metrics. Which hyperparameters would you recommend changing? Explain your choices.
4. How do I set a `batch_size` and how does it affect the outcome? Why do you think the `batch_size` was not set in the first place?
5. (Optional) Would it be possible to perform cross-validation? How?
6. (Optional) How can I introduce a validation set? What would I need to change in the code?
7. Study the [tensor](#2) text. Consider a dataset of breast cancer images. What needs to be changed in the deep learning model design to make a model based on pictures? You can answer this in words, but if you like you can also try to code the solution.

<a name='0'></a>
# Background Deep Learning Models

## Linear models

One of the simplest models is a linear model

$y = \theta x + b$, in which $x$ is the data, $\theta$ contains the weights and $b$ is the bias vector. Their sizes are determined by the numbers of input and output values. If $x$ has length $m$ and you want $y$ to have length $s$, then $\theta$ will be an $s \times m$ matrix and $b$ will be a vector of length $s$. With this equation each output is a linear combination of the input components. By setting $\theta$ and $b$ you can choose any linear combination you want for each component. This model was introduced in 1957 and is called the perceptron.
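
To make the shapes concrete, the sketch below evaluates $y = \theta x + b$ with NumPy. The sizes `m = 4` and `s = 2` and the random values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

m, s = 4, 2                      # assumed input and output lengths
x = rng.normal(size=m)           # input vector of length m
theta = rng.normal(size=(s, m))  # weight matrix of shape (s, m)
b = np.zeros(s)                  # bias vector of length s

y = theta @ x + b                # each output is a linear combination of the inputs
print(theta.shape, x.shape, y.shape)
```

Each row of `theta` holds the weights for one output component.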

Unfortunately, straight lines often do not fit real datasets. This problem becomes worse in high-dimensional data.

<a name='01'></a>
## Multilayer perceptrons

A simple approach to this problem is to stack multiple linear transformations

$y = \theta_2 \varphi(\theta_1 x + b_1) + b_2$. Now the result of the ordinary linear transformation $\theta_1 x + b_1$ is passed through a nonlinear function $\varphi(x)$. We call this function $\varphi(x)$ the *activation function*. By combining linear and nonlinear steps we enable the model to learn a much wider range of functions.

And we do not need to stop there. We can stack as many layers as we want on top of each other.

$$ h_1 = \varphi_1(\theta_1 x + b_1)$$

$$ h_2 = \varphi_2(\theta_2 h_1 + b_2)$$

$$ h_3 = \varphi_3(\theta_3 h_2 + b_3)$$

$$...$$

$$ h_{n-1} = \varphi_{n-1}(\theta_{n-1} h_{n-2} + b_{n-1})$$

$$ y = \varphi_n(\theta_n h_{n-1} + b_n)$$
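
The stacked equations above can be sketched as a loop over layers. This is a minimal NumPy illustration, not the Keras implementation; the layer widths, the random weights and the helper name `forward` are assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Apply h = phi(theta @ h_prev + b) for each (theta, b, phi) in order."""
    h = x
    for theta, b, phi in layers:
        h = phi(theta @ h + b)
    return h

rng = np.random.default_rng(0)
sizes = [30, 20, 10, 1]               # assumed widths: input, two hidden layers, output
activations = [relu, relu, sigmoid]   # one activation function per layer
layers = [
    (rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out), phi)
    for n_in, n_out, phi in zip(sizes[:-1], sizes[1:], activations)
]

x = rng.normal(size=sizes[0])
y = forward(x, layers)
print(y.shape)
```

Because the last activation is a sigmoid, the single output lies between 0 and 1.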


## Neural network

Multilayer perceptrons start with an input layer $x$, and information flows from one layer to the next until it reaches the output layer $y$. A model built on this principle is called a *neural network*.

## Nodes
A deep learning node is "a computational unit that has one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection. Nodes are then organized into layers to comprise a network."

<a name='04'></a>
## Activation functions 
Activation functions serve two main purposes:

1. To help a model account for **interaction effects**. An interaction effect occurs when the effect of one variable, say A, on the prediction depends on the value of another variable, say B. To illustrate this, consider a model that aims to determine whether a certain body weight indicates an increased risk of diabetes. To make an accurate prediction, the model needs to take the individual's height into account as well. Some body weights may indicate a higher risk for shorter individuals while signaling good health for taller individuals. Thus, the impact of body weight on diabetes risk depends on height, and we say that weight and height have an interaction effect.

2. To help a model capture **non-linear effects**. This refers to the situation where plotting a variable on the horizontal axis and the corresponding predictions on the vertical axis does not result in a straight line. In other words, the impact of increasing the predictor by one unit varies at different values of that predictor. <a href="https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning" target="_blank">link</a>



What should the activation function be? The most popular choice is the *rectified linear unit* (`ReLU`) function $\varphi(x) = \max(0, x)$, which you can use as a default. Another popular function is the `logistic sigmoid` function $\varphi(x) = 1/(1 + e^{-x})$.

**Rectified Linear Function**: The Rectified Linear Unit is the most commonly used activation function in deep learning models. The function returns 0 if it receives any negative input, but for any positive value $x$ it returns that value back. So it can be written as $f(x) = \max(0, x)$.

**Sigmoid Functions**: A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point and exactly one inflection point. A sigmoid "function" and a sigmoid "curve" refer to the same object. <a href="https://en.wikipedia.org/wiki/Sigmoid_function" target="_blank">wiki</a>

A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. A common example of a sigmoid function is the logistic function, defined by the formula:

$$ \varphi(x) = 1/(1 + e^{-x}) $$

In some fields, most notably in the context of artificial neural networks, the term "sigmoid function" is used as an alias for the logistic function.

A diverse range of sigmoid functions, such as the logistic and hyperbolic tangent functions, have been employed as activation functions for artificial neurons. Sigmoid curves are also prevalent in statistics, where they serve as cumulative distribution functions that range from 0 to 1. Examples include the integral representations of the logistic density, the normal density, and the Student's t probability density functions. Notably, the logistic sigmoid function is invertible, and its inverse is known as the logit function.

A sigmoid function would seem to have a couple of advantages. Even though it gets close to flat, it isn't completely flat anywhere. So its output always reflects changes in its input, which we might expect to be a good thing. Secondly, it is non-linear (or curved everywhere). Accounting for non-linearities is one of the activation function's main purposes. So, we expect a non-linear function to work well.

However, researchers had great difficulty building models with many layers when using sigmoid functions such as tanh. The tanh function is relatively flat except for a very narrow range (about -2 to 2). The derivative of the function is very small unless the input is in this narrow range, and this flat derivative makes it difficult to improve the weights through gradient descent. This problem gets worse as the model has more layers. This is called the vanishing gradient problem.

The ReLU function has a derivative of 0 over half its range (the negative numbers). For positive inputs, the derivative is 1. When training on a reasonably sized batch, there will usually be some data points giving positive values to any given node. So the average derivative is rarely close to 0, which allows gradient descent to keep progressing.
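
The difference between the derivatives can be checked numerically. The sketch below evaluates the derivatives of ReLU, the logistic sigmoid and tanh at a few arbitrary inputs. The last two lines are a deliberately simplified illustration: they multiply one derivative per layer and ignore the weights, which is an assumption made only to show how quickly small factors shrink.

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(0, x): 0 for negative inputs, 1 for positive inputs
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print("relu'   :", relu_grad(x))
print("sigmoid':", np.round(sigmoid_grad(x), 4))
print("tanh'   :", np.round(tanh_grad(x), 4))

# Simplified view of backpropagation through 10 layers (weights ignored)
depth = 10
print("sigmoid, 10 layers:", sigmoid_grad(0.0) ** depth)
print("relu, 10 layers   :", relu_grad(np.array(1.0)) ** depth)
```

Even at its steepest point the logistic sigmoid has a derivative of only 0.25, so a product of ten such factors is already very small, while the ReLU factor for a positive input stays at 1.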

<a name='02'></a>
## Deep learning

In the model there are two parameters to consider: width and depth. The width refers to the size of the layers. We can choose $h_i$ to have any length. They can be larger or smaller than the input and output vectors. Depth refers to the number of layers. When we have only one hidden layer the model is shallow. When we have many layers the model is described as deep, hence deep learning. Often the choice of width and depth is more art than science.
<a name='05'></a>
## Loss functions

To train the model we need a training dataset with a large number of samples $(x, y)$ and a loss function $L(y, \hat{y})$, where $y$ is the true output and $\hat{y}$ is the model's prediction. The `Euclidean distance` is often used to calculate the loss: $L(y, \hat{y}) = \sqrt{\sum_i (y_i - \hat{y}_i)^2}$. When $y$ represents a probability distribution, a popular choice is the `cross entropy` $L(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$. In both formulas the index $i$ runs over the components of $y$. We measure the model performance by taking the average loss over all $N$ samples, where $(y^{(n)}, \hat{y}^{(n)})$ belongs to sample $n$.

average loss: $\langle L \rangle = \frac{1}{N} \sum_{n=1}^{N} L(y^{(n)}, \hat{y}^{(n)})$
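
As a worked example, the sketch below computes both losses and their averages for three samples. The targets and predictions are made-up numbers, and the small `eps` inside the logarithm is an implementation detail added here to avoid `log(0)`; it is not part of the formula.

```python
import numpy as np

def euclidean_loss(y, y_hat):
    return np.sqrt(np.sum((y - y_hat) ** 2))

def cross_entropy(y, y_hat, eps=1e-12):
    # eps guards against log(0); it is not part of the formula in the text
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

# Three made-up samples with one-hot targets over two classes
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
Y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

# Average loss: mean of the per-sample losses
average_ce = np.mean([cross_entropy(y, y_hat) for y, y_hat in zip(Y, Y_hat)])
average_l2 = np.mean([euclidean_loss(y, y_hat) for y, y_hat in zip(Y, Y_hat)])
print(average_ce, average_l2)
```

The third sample, where the model is least confident, contributes the largest term to both averages.
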
<a name='06'></a>
## Gradient descent
Now that we have a way to determine how well the model works, we need a way to improve it. We search for the parameters that minimize the average loss over the training set. Most work in deep learning uses some kind of gradient descent algorithm with learning rate $\epsilon$:

$$\theta := \theta - \epsilon \frac{\partial}{\partial\theta} \langle L \rangle$$

However, with deep learning this takes an enormous amount of time, because every step uses all training samples. Therefore it is better to use *stochastic gradient descent* (SGD). For every step we take a small set of samples (known as a batch) from the training set instead of all samples. The time of each step now depends on the batch size. The downside is that each step does a lesser job of reducing the loss, because it is based on an estimated gradient, not the true gradient. Most deep learning algorithms use SGD. Two of the most popular algorithms are `Adam` and `RMSProp`.
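
The update rule can be sketched for the simplest possible model, a line with one weight and one bias. This is a minimal illustration on synthetic data, not what Keras does internally. The data-generating rule, the learning rate, the batch size and the number of steps are all assumptions, and the loss used here is the mean squared error because its gradient is easy to write down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear rule: y = 3x - 1 plus a little noise
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
y = 3.0 * x - 1.0 + rng.normal(scale=0.05, size=N)

theta, b = 0.0, 0.0
epsilon = 0.1      # learning rate (assumed)
batch_size = 16    # assumed

for step in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)  # draw a random batch
    error = theta * x[idx] + b - y[idx]
    # Gradients of the mean squared error on this batch only
    grad_theta = 2.0 * np.mean(error * x[idx])
    grad_b = 2.0 * np.mean(error)
    # Move against the gradient, scaled by the learning rate
    theta -= epsilon * grad_theta
    b -= epsilon * grad_b

print(theta, b)
```

Each step looks at only 16 of the 200 samples, yet the parameters should end up close to the values 3 and -1 used to generate the data.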

<a name='03'></a>
## Epoch
`epochs = 10` means that 10 epochs of gradient descent training will be conducted.
During one epoch, the model is presented with each training example once, and the model's parameters are updated based on the loss incurred on those examples. In practical terms, an epoch consists of two main steps:

- Forward propagation: Each training example is fed through the model, and the model produces predictions for each example.

- Backward propagation (also known as backpropagation): The model calculates the loss between the predicted outputs and the actual labels. The gradients of the loss with respect to the model's parameters are then computed, allowing the model's parameters to be updated in the direction that minimizes the loss.

After one epoch is completed, the model has seen and learned from all the training examples once. Typically, multiple epochs are performed to allow the model to further refine its parameters and improve its performance.
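
When training uses batches, the parameters are updated once per batch, so one epoch contains several updates. The arithmetic can be sketched as follows; the training-set size of 455 and the batch sizes are assumptions chosen for illustration, and `updates_per_epoch` is a hypothetical helper name.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

n_samples = 455   # assumed training-set size, for illustration only
epochs = 10

for batch_size in (16, 64, n_samples):
    steps = updates_per_epoch(n_samples, batch_size)
    print(f"batch_size={batch_size}: {steps} updates per epoch, "
          f"{steps * epochs} updates in {epochs} epochs")
```

A smaller batch gives more, but noisier, updates per epoch. A batch as large as the training set gives exactly one update per epoch.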

<a name='07'></a>
## Regularization

To avoid overfitting we use regularization. In deep learning a popular method is called `dropout`. For each layer in the model, you randomly select a subset of elements in the output vector $h_i$ and set them to 0. On every step in the gradient descent, you pick a different random subset. By using dropout you assume that no individual calculation within the model should be too important.
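
The mechanism can be sketched in a few lines of NumPy. The function below is a hypothetical helper, not the Keras implementation. Besides zeroing a random subset, it rescales the surviving elements by `1 / (1 - rate)` so that the expected value of the output is unchanged. That rescaling is an addition to the description above; the Keras `Dropout` layer documents a similar scaling during training.

```python
import numpy as np

def dropout(h, rate, rng):
    """Zero a random subset of h and rescale the rest by 1 / (1 - rate)."""
    keep = rng.random(h.shape) >= rate   # True for elements that survive
    return h * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones(10)

# A different random subset is dropped on every call
print(dropout(h, rate=0.5, rng=rng))
print(dropout(h, rate=0.5, rng=rng))
```

Dropout is applied only during training; when the model makes predictions, all elements are kept.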


## Hyperparameters optimization

In summary there are a lot of choices to make. Such as

- Number of layers in the model
- Width of each layer
- Activation function
- Learning rate
- Batch size
- Loss function
- Number of epochs
- Number of elements to set to 0 when using dropout

Ideally we want a low loss on the test set. But we cannot use the test set for training. We can use another approach however, with a validation set. 

## Using validation set
- For each set of hyperparameters, train the model on the training set and compute the loss on the validation set.
- Whichever set of hyperparameters gives the lowest loss on the validation set, accept it as your final model.
- Evaluate that final model on the test set to get an unbiased measure.
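
The three steps above can be sketched as follows. To keep the example small it uses synthetic data and scikit-learn's `MLPClassifier` as a stand-in for a Keras model; the split ratios and the candidate layer widths are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): 600 samples, 10 features, 2 classes
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split into 60% training, 20% validation and 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = [(5,), (20,), (20, 10)]   # hypothetical layer widths to compare
val_losses = {}
models = {}
for widths in candidates:
    model = MLPClassifier(hidden_layer_sizes=widths, max_iter=500, random_state=0)
    model.fit(X_train, y_train)                                  # train on the training set
    val_losses[widths] = log_loss(y_val, model.predict_proba(X_val))  # loss on validation set
    models[widths] = model

# Keep the candidate with the lowest validation loss
best = min(val_losses, key=val_losses.get)

# The test set is used only once, for the final model
test_loss = log_loss(y_test, models[best].predict_proba(X_test))
print(best, val_losses[best], test_loss)
```

The test set plays no role in choosing between the candidates, which is what makes the final test loss an unbiased measure.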


<a name='2'></a>
# Background Tensors

Tensors are a fundamental concept in the field of mathematics and play a vital role in understanding and manipulating data in various dimensions. 

### scalar
A scalar is a **0D** tensor, representing a single number. It has no dimensions or axes. Scalars are the simplest form of data representation.

### vector
Moving on to the next level, we encounter vectors. Vectors are **1D** tensors consisting of an array of values along a single axis. 

### matrix
The next type of tensor is the matrix, which is a **2D** tensor. Matrices are arranged in rows and columns and can be thought of as a rectangular grid of numbers. They are extensively used in various mathematical operations, such as linear transformations and matrix multiplications. In machine learning the feature matrix X is a 2D tensor of shape (samples, features).


### 3D tensor
When we pack multiple matrices together in a new array, we obtain a 3D tensor. This tensor can be visually interpreted as a cube of numbers, with three axes representing depth, height, and width. Time series data is often represented by a 3D tensor with shape (samples, timesteps, features).


### 4D and beyond
By further extending this concept, we can create even higher-dimensional tensors. For instance, packing 3D tensors into an array gives rise to a 4D tensor. Similarly, the process can be repeated to form 5D tensors and beyond.

A single image has three dimensions: height, width and color channel. A dataset with multiple images is a 4D tensor of shape (samples, height, width, channels) or (samples, channels, height, width).

In the case of videos we even have a 5D tensor of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width).
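
The tensor ranks described above map directly onto NumPy arrays, where `ndim` gives the number of axes and `shape` the size along each axis. All sizes in this sketch are arbitrary assumptions.

```python
import numpy as np

tensors = {
    "scalar": np.array(3.0),                  # 0D
    "vector": np.array([1.0, 2.0, 3.0]),      # 1D
    "matrix": np.zeros((100, 30)),            # 2D: (samples, features)
    "series": np.zeros((100, 24, 5)),         # 3D: (samples, timesteps, features)
    "images": np.zeros((32, 64, 64, 3)),      # 4D: (samples, height, width, channels)
    "videos": np.zeros((4, 10, 64, 64, 3)),   # 5D: (samples, frames, height, width, channels)
}

for name, t in tensors.items():
    print(name, t.ndim, t.shape)

# Converting between the two image layouts mentioned above
images = tensors["images"]
channels_first = np.transpose(images, (0, 3, 1, 2))  # (samples, channels, height, width)
print(channels_first.shape)
```

Which of the two image layouts a library expects is a convention, so it is worth checking the shape of the data before building a model.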


### Tensor transformation
All transformations learned by deep neural networks can be reduced to a handful of tensor operations applied to tensors of numeric data. For instance, it’s possible to add tensors, multiply tensors, and so on.
An example of a tensor operation is the ReLU activation function. A layer is a data-processing module that takes one or more tensors as input and outputs one or more tensors. The layer’s weights are learned with gradient descent.

Learning happens by drawing random batches of data samples and their targets, and computing the gradient of the loss on the batch with respect to the network parameters. The network parameters are then moved a bit (the magnitude of the move is defined by the learning rate) in the opposite direction from the gradient.

The entire learning process is made possible by the fact that neural networks are chains of differentiable tensor operations, and thus it’s possible to apply the chain rule of differentiation to find the gradient function mapping the current parameters and current batch of data to a gradient value.
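
The chain rule can be illustrated on the smallest possible network: one weight, one bias and a sigmoid activation. The sketch below derives the gradient by hand and compares it with a numerical estimate. The squared-error loss and all numeric values are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Squared error of a one-weight model: (sigmoid(w*x + b) - y)^2."""
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    """dL/dw via the chain rule: dL/dy_hat * dy_hat/dz * dz/dw."""
    y_hat = sigmoid(w * x + b)
    dL_dyhat = 2.0 * (y_hat - y)
    dyhat_dz = y_hat * (1.0 - y_hat)   # derivative of the sigmoid
    dz_dw = x
    return dL_dyhat * dyhat_dz * dz_dw

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # arbitrary values
h = 1e-6
# Central finite difference as an independent check
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(grad_w(w, b, x, y), numeric)
```

Deep learning libraries apply the same rule automatically through every layer, which is why each operation in the network has to be differentiable.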


### Layers
Different layers are appropriate for different tensor formats and different types of data processing. For instance, simple vector data, stored in 2D tensors of shape (samples, features), is often processed by densely connected layers, also called fully connected or **dense layers** (the `Dense` class in Keras).
Sequence data, stored in 3D tensors of shape (samples, timesteps, features), is typically processed by **recurrent layers** such as an `LSTM` layer.
Image data, stored in 4D tensors, is usually processed by 2D **convolution layers** (`Conv2D`).





<a name='1'></a>
# Study Case

Consider the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) and the code below.


In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

df = pd.read_csv("data_wisconsin.csv")

le = LabelEncoder()
le.fit(df['diagnosis'])
df['diagnosis'] = le.transform(df['diagnosis'])

X = df[df.columns[2:-1]]
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.20, random_state=42)

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model = Sequential()

model.add(Dense(20, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(10, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam')

# Keep the History object returned by fit() so the loss curves can be plotted
history = model.fit(x=X_train, y=y_train, epochs=100, validation_data=(X_test, y_test))

model_loss = pd.DataFrame(history.history)
model_loss.plot()

predicted=(model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, predicted))

# Use a new name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, predicted)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm,
                                    display_labels=[False, True])

cm_display.plot()
plt.show()