# Neural Networks and Deep Learning

## by Charu C. Aggarwal

### Chapter 1: An Introduction to Neural Networks


“Thou shalt not make a machine to counterfeit a human mind.” 

                –Frank Herbert

# Chapter Organization

- Single-layer and multilayer networks
- Different components of neural networks
- Multilayer neural networks
- The importance of nonlinearity
- Advanced topics in deep learning
- Some notable benchmarks used by the deep learning community

# Biological and Artificial Neural Networks

- Artificial neural networks simulate biological learning mechanisms.
- Biological neurons connect via **axons**, **dendrites**, and **synapses**.
- Neurons in artificial neural networks are connected via **weights**, which simulate synaptic strength.

![Figure%201.1.png](attachment:Figure%201.1.png)

# Simulating Biological Learning in Neural Networks


- Weights in artificial neural networks scale inputs, much as synaptic strengths do in biological systems.
- Each neuron processes inputs, then propagates results to output neurons.

# Training Artificial Neural Networks


- Learning occurs by adjusting weights in response to **external stimuli** (training data).
- Training data consists of **input-output pairs** (e.g., pixel data of images and their labels).
- Feedback from training data helps adjust the weights to improve accuracy.

**Example:**

- Input: Pixel data of an image.
- Output: Predicted label (e.g., carrot, banana).

# Error Feedback and Weight Adjustment


- Errors in prediction lead to weight adjustments (similar to synaptic adjustments in biology).
- The goal is to reduce error in future iterations and improve the network’s accuracy.
- The process of adjusting weights to correct predictions is mathematically justified.

- *Model Generalization:* the ability to accurately compute functions of unseen inputs by training over a finite set of input-output pairs


# Generalization in Neural Networks


- The aim is to **generalize** learned knowledge to **unseen data**.
- Example: After training with many banana images, the network can recognize bananas in new, unseen images.
- Generalization is key to the usefulness of machine learning models.


# Neural Networks and Traditional Machine Learning Models

- Basic computation units in neural networks are inspired by **traditional machine learning** algorithms, like:
  - **Least-squares regression**.
  - **Logistic regression**.
- Power in neural networks comes from combining these basic units to minimize prediction error.


# Power of Neural Networks

- Neural networks gain power by combining multiple computational units.
- This allows the model to learn **more complex functions** compared to elementary models.
- In their most basic form, neural networks reduce to classical models such as least-squares or logistic regression.
- The true power of neural networks is seen when combining units to handle more complex data.


# Breakthroughs in Deep Learning

- Deep learning has shown **extraordinary performance** in specific tasks by utilizing deep architectures.
- Biological neural networks also derive power from **depth** and **connectivity**.
- Advances in neural networks often parallel breakthroughs in understanding **biological neural structures**.

# The Big Data Era and Neural Networks

- The success of neural networks is driven by **increased data availability** and **computational power**.
- Traditional machine learning still has an edge on **small datasets** due to:
  - **Model interpretability**.
  - **Hand-crafted features**.

![figure%201.2.png](attachment:figure%201.2.png)

# The Basic Idea of Neural Networks

- Neural networks compute functions from input to output nodes.
- Each node contains a variable that is either computed or externally set (input nodes).
- Networks are structured as **directed acyclic graphs** with parameterized edges.
- Functions at nodes are influenced by weights on incoming edges and variables in previous nodes.


# General Principle of Machine Learning

- Neural networks follow the general machine learning principle:
  - A basic function structure is chosen, and parameters (weights) are learned from data.
- By learning the weights, the network "learns" the function that relates inputs to outputs, consistent with the observed data.

# Single-Layer vs Multi-Layer Networks

- In a **single-layer network**, inputs are connected directly to output nodes.
- **Multi-layer networks** separate input and output layers with **hidden layers** of nodes.
- Hidden layers allow networks to learn more complex functions.
- Weights on edges are adjusted during training to relate inputs to outputs.

![image.png](attachment:image.png)

# Learning a Function with Neural Networks


- The goal is to **learn a function** that relates inputs to outputs using training examples.
- The neural network adjusts edge weights to construct a function that matches the data.
- This process is called **training** and uses **input-output pairs** from training data.


# The Target Variable and Training Data


- In classification, training instances are of the form $(\overline{X}, y)$, where:
  - $\overline{X}$: feature vector containing the input variables
      - $\overline{X} = [x_1, \ldots, x_d]$ contains $d$ feature variables
  - $y$: binary class variable
      - $y \in \{-1, +1\}$ contains the observed value
  - $\overline{W}$: weight vector
- The **target variable** $y$ represents the property being predicted by the network.
- Training aims to minimize the **mismatch** between $y$ and the predicted output $\widehat{y}$.


<p style="text-align: center;">$\hat{y} = f_{\overline{W}}(\overline{X})$.</p>


# Optimizing the Prediction Function


- The neural network learns a **prediction function** $f_{\overline{W}}(\overline{X})$ parameterized by weights $\overline{W}$.
- Optimization involves minimizing the error between the predicted and observed values.
- A **loss function** penalizes incorrect predictions to improve the model during training.
<br><br>
<p style="text-align: center;">$\text{Minimize}_{\overline{W}}\ \text{Mismatch between }y\ \text{and}\ f_{\overline{W}}(\overline{X})$</p>





# Testing Data and Function Prediction


- After training, the neural network uses the learned function $f_{\overline{W}}(\overline{X})$ to predict outcomes for **testing data**.

- The testing data includes cases where the target variable $(\mathit{y})$ is not yet observed.
- This step evaluates how well the model generalizes to new, unseen data.


# The Perceptron: A Single-Layer Neural Network


- The simplest neural network is the **perceptron**, with one input layer and one output node.
- It computes a **linear function** of the inputs to predict the output.

![figure%201.3.png](attachment:figure%201.3.png)

# Perceptron Function Calculation

- The output node computes the **linear function** 
<p style="text-align: center;">$\overline{W} \cdot \overline{X}^T = \sum_{i=1}^{d} w_i x_i$</p>
- This multiplies each feature by its weight and sums the results.


- The **sign function** is applied to convert the real value into a binary class label:
- This function allows the perceptron to classify input data into two classes {+1, -1}

![1.1.png](attachment:1.1.png)
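
The prediction step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the book's code: the function name `perceptron_predict`, the example weights and features, and the convention of mapping a score of exactly 0 to +1 are all assumptions made for the example.

```python
import numpy as np

def perceptron_predict(W, X):
    """Return a class label in {-1, +1} for feature vector X under weights W."""
    score = np.dot(W, X)             # linear function: sum_i w_i * x_i
    return 1 if score >= 0 else -1   # sign function (a score of 0 maps to +1 by assumption)

W = np.array([0.5, -1.0, 2.0])   # hypothetical weights
X = np.array([1.0, 2.0, 0.5])    # hypothetical feature vector
label = perceptron_predict(W, X)  # score = 0.5 - 2.0 + 1.0 = -0.5, so the label is -1
```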

# Learning Weights in the Perceptron


- Initially, the weights $\overline{W}$ are unknown and randomly set.
- The perceptron **updates the weights** based on the errors between the predicted and observed values:
<p align="center">
$\overline{W} \leftarrow \overline{W} + \alpha (y - \hat{y}) \overline{X}^T$
</p>

- The learning rate $\alpha$ controls the update size, and the algorithm iterates through training data until convergence.
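
A minimal sketch of this training procedure, assuming labels in {-1, +1}, no bias term, and a stopping rule of "one full pass without mistakes". The function name `train_perceptron`, the toy data, and the default hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

def train_perceptron(X, y, alpha=0.5, max_epochs=100, seed=0):
    """Train a perceptron on the rows of X with labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=X.shape[1])   # small random initial weights
    epochs_used = 0
    for epoch in range(max_epochs):
        epochs_used = epoch + 1
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(W, x_i) >= 0 else -1
            if y_hat != y_i:
                mistakes += 1
            # W <- W + alpha * (y - y_hat) * X; the update is zero when y_hat == y
            W = W + alpha * (y_i - y_hat) * x_i
        if mistakes == 0:                          # a full pass with no errors
            break
    return W, epochs_used

# Hypothetical toy data that is linearly separable through the origin.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
W, epochs_used = train_perceptron(X, y)
predictions = np.where(X @ W >= 0, 1, -1)
```

Because the toy data is linearly separable, the loop stops after a pass with no mistakes; on non-separable data it would simply run until `max_epochs`.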


# Epochs and Iterative Weight Updates

- Each pass through the entire training dataset is called an **epoch**.
- The perceptron cycles through training examples, updating weights until misclassifications are minimized.
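- When the prediction is wrong ($y \neq \hat{y}$), the weight update is triggered. In that case $(y - \hat{y}) = 2y$, so the factor of 2 can be absorbed into the learning rate:
<p align="center">
$\overline{W} \leftarrow \overline{W} + \alpha y \overline{X}^T$
</p>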


# Linear Separability

- The perceptron performs well when the data is **linearly separable**.
- Linearly separable data can be divided by a **hyperplane**, where the sign of $\overline{W} \cdot \overline{X}^T$ distinguishes between the two classes.
- If the data is not linearly separable, the perceptron may not converge.

![figure%201.4.png](attachment:figure%201.4.png)


# Convergence of the Perceptron

- For linearly separable data, the perceptron algorithm is guaranteed to converge with zero error on the training set.
- When the data is not linearly separable, convergence is not guaranteed.
- Alternative algorithms, like **support vector machines** or multi-layer networks, are used for non-linearly separable data.


# The Role of Bias in Predictions
- Bias helps capture the invariant part of a prediction.
    - It helps when the input features alone are not enough to predict the output.
- In cases of class imbalance, predictions may not match the class distribution without a bias term.
    - Class imbalance occurs when one class has more training data than the other.

# Example of Prediction Imbalance
## Imbalance in Predictions
- Feature variables may be mean-centered
    - If $\overline{X}$ is mean-centered, the feature vectors sum to the zero vector: $\sum_i \overline{X}_i = \overline{0}$.
- If the class distribution is not balanced, predictions based on these features will result in undesirable properties:

![image-2.png](attachment:image-2.png)
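
The mismatch can be checked numerically. The sketch below uses made-up data (random features, an 8-to-2 class split, and an arbitrary weight vector) to show that, without a bias term, the pre-activation scores on mean-centered features sum to zero for any weights, while the labels do not.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # hypothetical raw features
X_centered = X - X.mean(axis=0)       # mean-center each feature column
y = np.array([1] * 8 + [-1] * 2)      # imbalanced labels: 8 positive, 2 negative

W = np.array([0.7, -1.3, 2.1])        # arbitrary weights; any choice gives the same sum
sum_of_scores = np.sum(X_centered @ W)   # zero up to floating-point error
sum_of_labels = np.sum(y)                # 6, because the classes are imbalanced
```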

# The Need for a Bias Term
## Why Bias is Necessary
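- Without bias, the pre-activation predictions $\overline{W} \cdot \overline{X}_i^T$ over mean-centered training points always sum to 0, which cannot match an imbalanced class distribution.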
- Incorporating a bias variable, `b`, captures the invariant part of the prediction.

# Incorporating Bias in Neural Networks
## Adding a Bias Neuron
- A bias neuron transmits a constant value (usually 1) to the output node.
- The edge connecting the bias neuron provides the bias term.

![image.png](attachment:image.png)

# Feature Engineering for Bias
## Bias Through Feature Engineering
- Another method to incorporate bias is by adding a constant feature (value = 1).
- The coefficient of this feature acts as the bias in the model.
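
As a small sketch of why the two formulations agree, the snippet below compares an explicit bias `b` with an appended constant feature. All values (`X`, `W`, `b`) are hypothetical.

```python
import numpy as np

X = np.array([[0.5, 1.5], [2.0, -1.0], [-1.0, 0.0]])   # hypothetical inputs
W = np.array([1.0, -2.0])                               # hypothetical weights
b = 0.5                                                 # hypothetical bias

# Explicit bias term: W . X + b
scores_explicit = X @ W + b

# Feature engineering: append a constant 1 to every row and b to the weights.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
W_aug = np.append(W, b)
scores_augmented = X_aug @ W_aug
```

Both score vectors are identical, so any learning rule written for a bias-free model can learn the bias as one more weight.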

# The Original Perceptron Model
## Perceptron as Hardware
- The original perceptron was a hardware implementation.
- The Mark I perceptron used circuits, not software, for updates.

![figure%201.5.png](attachment:figure%201.5.png)

# The Role of Loss Functions

- A **loss function** quantifies the error when the predicted value $\hat{y}$ differs from the observed value $y$.
- The goal of training is to minimize the aggregate loss across all training examples.
- Loss functions guide the weight updates to improve the network's accuracy.


# Objective Functions in Neural Networks
## Loss Optimization in Neural Networks
- Most ML algorithms use gradient descent to optimize a loss function.

# Perceptron Loss Function
## Perceptron Criterion
- The perceptron criterion penalizes instance $i$ when the sign of $(\overline{W} \cdot \overline{X}_i^T)$ differs from $y_i$.
    - Larger errors (misclassified points farther from the decision boundary) are penalized more.
- Loss function:
<p align="center">
$L_i = \max\{-y_i (\overline{W} \cdot \overline{X}_i^T), 0\}$
</p>

- If the prediction is correct, the loss is 0.
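
The criterion above translates directly into NumPy. The function name `perceptron_criterion` and the example values are illustrative assumptions.

```python
import numpy as np

def perceptron_criterion(W, X, y):
    """Per-instance loss L_i = max(-y_i * (W . X_i), 0) for the rows X_i of X."""
    return np.maximum(-y * (X @ W), 0.0)

W = np.array([1.0, -1.0])                            # hypothetical weights
X = np.array([[2.0, 1.0], [1.0, 3.0], [0.5, 0.0]])   # hypothetical instances
y = np.array([1, 1, -1])
losses = perceptron_criterion(W, X, y)
# Scores are [1, -2, 0.5]: the first point is correct (loss 0), the second is
# wrong by 2 (loss 2), and the third is wrong by 0.5 (loss 0.5).
```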

# Gradient Descent in Perceptron
## Gradient Descent Updates
- Gradient descent updates weights based on the gradient of the loss.
![image-4.png](attachment:image-4.png)

- Perceptron update:

![image-2.png](attachment:image-2.png)

<small>More on this in Emily's presentation</small>


# Weakness of Perceptron
## Weakness of the Original Perceptron
- The perceptron criterion attains its minimum loss of 0 trivially by setting $\overline{W} = \overline{0}$.
    - The dot product $\overline{W} \cdot \overline{X}_i^T$ is then 0 for every training instance.
- Despite this, the perceptron still converges to meaningful solutions on linearly separable data.
- In linearly separable data sets, a nonzero weight vector $\overline{W}$ exists in which the sign of $\overline{W} \cdot \overline{X}_i^T$ correctly matches the sign of $y_i$ for each and every training instance $(\overline{X}_i, y_i)$.



# Proportionality
- Because the loss is directly proportional to the *magnitude* of the weight vector, it can be lowered simply by shrinking $\overline{W}$, which dilutes the goal of class separation.

- As a result, updates can reduce the weights without improving accuracy.
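
A short numerical sketch of this effect, using hypothetical data and weights: scaling the weight vector down reduces the perceptron criterion, yet the number of misclassified points stays the same.

```python
import numpy as np

def total_perceptron_loss(W, X, y):
    """Sum of max(-y_i * (W . X_i), 0) over all training instances."""
    return np.sum(np.maximum(-y * (X @ W), 0.0))

def count_errors(W, X, y):
    """Number of points whose predicted sign differs from the label."""
    return int(np.sum(np.where(X @ W >= 0, 1, -1) != y))

X = np.array([[2.0, 1.0], [1.0, 3.0], [0.5, 0.0]])   # hypothetical instances
y = np.array([1, 1, -1])
W = np.array([1.0, -1.0])                            # hypothetical weights

loss_full = total_perceptron_loss(W, X, y)         # 2.5
loss_half = total_perceptron_loss(0.5 * W, X, y)   # 1.25: halving W halves the loss
loss_zero = total_perceptron_loss(0.0 * W, X, y)   # 0.0: the trivial "optimum"

errors_full = count_errors(W, X, y)                # 2 misclassified points
errors_half = count_errors(0.5 * W, X, y)          # still 2: the signs are unchanged
```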

# Pocket Algorithm
## Handling Non-Separable Data
- In non-linearly separable data, perceptron behavior is unpredictable.
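- The pocket algorithm keeps the best solution found so far (the one with the fewest misclassifications) "in its pocket" during training.
    - It therefore also works for data that is not linearly separable.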
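
One way to sketch the pocket idea is shown below. It is an illustrative variant, not a reference implementation: the function names, the zero initialization, the fixed number of epochs, and the XOR-style toy data (with a constant bias feature in the last column) are all assumptions.

```python
import numpy as np

def count_errors(W, X, y):
    """Number of training points misclassified by weights W."""
    return int(np.sum(np.where(X @ W >= 0, 1, -1) != y))

def pocket_perceptron(X, y, alpha=1.0, epochs=50):
    """Run perceptron updates, but keep ("pocket") the best weights seen so far."""
    W = np.zeros(X.shape[1])
    best_W, best_errors = W.copy(), count_errors(W, X, y)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(W, x_i) >= 0 else -1
            if y_hat != y_i:
                W = W + alpha * y_i * x_i
                errors = count_errors(W, X, y)
                if errors < best_errors:       # strictly better: replace the pocket
                    best_W, best_errors = W.copy(), errors
    return best_W, best_errors

# XOR-like data is not linearly separable, so plain perceptron updates never settle.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
best_W, best_errors = pocket_perceptron(X, y)
```

The returned weights are never worse than the starting point, even though the running weight vector keeps oscillating on this data.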


# Neural Network Components
## Base Components of Neural Architectures
- The perceptron algorithm uses a single output node and a sign activation function.

# The Sign Function as an Activation Function

- The **sign function** in the perceptron serves as an **activation function**.
- The activation function determines the output based on the linear combination of inputs.
- Other activation functions (e.g., ReLU, sigmoid) are used in more complex networks.

A single-layer network with column vector $\overline{W}$ of weights and input (row) vector $\overline{X}$ would have a prediction of the following form: 
$$\hat{y} = \Phi(\overline{W} \cdot \overline{X}^T)$$



## Activation Functions

- Linear activation (identity activation): 
$$
\Phi(v) = v
$$

- Classical activation functions:

![image.png](attachment:image.png)

# Pre- and Post- Activation
- Neurons in neural networks perform two distinct computations:

![figure%201.6.png](attachment:figure%201.6.png)

$a_h = \overline{W} \cdot \overline{X}^T$ (pre-activation: linear transformation, i.e., the weighted sum of the input values)

$h = \Phi(a_h)$ (post-activation: nonlinear transformation)

$h$ is the value that is transferred to the next layer.
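
The two computations can be written out separately, as in this small sketch. The weights, inputs, and the choice of tanh as the activation are hypothetical.

```python
import numpy as np

W = np.array([0.4, -0.6, 0.2])    # hypothetical weights
X = np.array([1.0, 2.0, 3.0])     # hypothetical inputs

a_h = np.dot(W, X)                # pre-activation value: weighted sum of the inputs
h = np.tanh(a_h)                  # post-activation value: passed to the next layer
```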


# Nonlinear Activation Functions
## Role of Nonlinear Activation
- Nonlinear activation (e.g., ReLU, tanh) increases the modeling power of neural networks.
- These functions help create more complex compositions in larger networks.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)


![image.png](attachment:image.png)
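
For reference, here is a sketch of several commonly used activation functions in NumPy. The selection (sign, sigmoid, tanh, ReLU, hard tanh) is an assumption about what the figures above depict; the definitions themselves are the standard ones.

```python
import numpy as np

def sign(v):
    return np.sign(v)                  # -1, 0, or +1 (NumPy returns 0 at v = 0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))    # squashes to (0, 1)

def tanh(v):
    return np.tanh(v)                  # squashes to (-1, 1)

def relu(v):
    return np.maximum(v, 0.0)          # max{v, 0}

def hard_tanh(v):
    return np.clip(v, -1.0, 1.0)       # max{min[v, 1], -1}

v = np.array([-2.0, 0.0, 2.0])
outputs = {f.__name__: f(v) for f in (sign, sigmoid, tanh, relu, hard_tanh)}
# tanh is a rescaled and shifted sigmoid: tanh(v) = 2 * sigmoid(2v) - 1
```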

# Softmax Activation Function
## What is the Softmax Function?
- Almost always used in the output layer to map k real values into k probabilities of discrete events.
- The output layer takes the values from the last layer and converts them to class probabilities that add up to 1.
- Particularly useful in classification problems with unordered class labels.

![image.png](attachment:image.png)


- $\Phi(v)_i$: probability that the input belongs to class $i$
- $\sum_{j=1}^{k} \exp(v_j)$: sum of the exponentials of all $k$ outputs; dividing by it makes the probabilities add up to 1
- $\exp(v_j)$: exponential function applied to the $j$th output
- $\forall i \in \{1, \dots, k\}$: the formula is applied to every class from 1 to $k$
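
A minimal softmax sketch in NumPy follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not mentioned in the slides; it leaves the result unchanged. The input values are hypothetical.

```python
import numpy as np

def softmax(v):
    """Map k real values to k probabilities that sum to 1."""
    shifted = v - np.max(v)        # avoids overflow in exp; does not change the result
    exps = np.exp(shifted)
    return exps / np.sum(exps)

v = np.array([2.0, 1.0, 0.1])      # hypothetical real-valued outputs
probs = softmax(v)                 # the largest input receives the largest probability
```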


# Visualizing Softmax
## Softmax Layer Example
- Each output corresponds to the probability of a particular class
    - The higher the probability, the more likely the input belongs to that class.
- **No weights** are associated with the softmax layer.

![figure%201.8.png](attachment:figure%201.8.png)

# Loss Function: Least Squares
## Numeric Values
- Loss functions determine how predictions deviate from actual values.
- Choice of loss function depends on the problem (e.g., regression vs classification).
- For numeric outputs with a single training instance:  
  Loss = $(y - \hat{y})^2$
- Aims to minimize the difference between true and predicted values.

# Logistic Regression Loss (Binary)
## Binary Target: Logistic Regression
- Loss for binary targets:
  - Observed value y from {−1, +1}
  - Sigmoid activation outputs probability $\hat{y} \in (0, 1)$
  - Loss: Negative log probability of the correct prediction.
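
As a sketch of this loss, assume $\hat{y}$ is the predicted probability that $y = +1$. The probability of the observed class is then $\hat{y}$ when $y = +1$ and $1 - \hat{y}$ when $y = -1$, which can be written compactly as $|y/2 - 0.5 + \hat{y}|$. The function names and the example pre-activation value are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def logistic_loss(y, y_hat):
    """Negative log-probability of the observed class, with y in {-1, +1}.

    y_hat is assumed to be the predicted probability that y = +1.
    """
    prob_of_observed = y_hat if y == 1 else 1.0 - y_hat
    return -np.log(prob_of_observed)

y_hat = sigmoid(2.0)                           # hypothetical pre-activation of 2.0
loss_if_positive = logistic_loss(+1, y_hat)    # small: the prediction agrees with y
loss_if_negative = logistic_loss(-1, y_hat)    # large: the prediction disagrees with y
```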

# Categorical Target Loss
## Multiclass Loss: Cross-Entropy
- For categorical predictions, loss is defined as:  
![image.png](attachment:image.png)
- This is known as **cross-entropy loss**.
- The key point to remember is that the nature of the output nodes, the activation function, and the loss function depend on the application at hand.
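
A sketch of the single-instance cross-entropy loss, assuming the network's softmax outputs are given as a probability vector and the observed class is given as an index. The names and values are hypothetical.

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Loss = -log(probability assigned to the observed class)."""
    return -np.log(probs[true_class])

probs = np.array([0.7, 0.2, 0.1])        # hypothetical softmax outputs for 3 classes
loss_correct = cross_entropy(probs, 0)   # -log(0.7): low loss, confident and right
loss_wrong = cross_entropy(probs, 2)     # -log(0.1): high loss, true class got 0.1
```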

# Multilayer Neural Networks
## Introduction to Multilayer Networks
- Contains more than one computational layer.
- Hidden layers perform intermediate computations.


# Feed-Forward Networks
## Defining Feed-Forward Architecture
- Successive layers feed into one another from input to output.
- Common architecture connects every node in one layer to the next.

![figure%201.9.png](attachment:figure%201.9.png)


# Bias in Multilayer Networks
## Bias Neurons in Neural Networks
- Bias neurons can be used in both hidden and output layers.
- They help capture the invariant part of the predictions.

# Dimensionality of Layers
## Defining Layer Dimensionality
- The number of units in each layer is the **dimensionality**.
- Weights determine the strength of the connections between neurons.
    - The weights between two successive layers are stored in a matrix.

# Weights Between Layers
## Weight Matrix for Layers
- Weights between layers are stored in matrices.
![image.png](attachment:image.png)

# Vector-Based Neural Architecture
## Vector Representation of Neural Networks
- Neural network layers can be represented as vectors.
- Activation functions apply element-wise to vector arguments.
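
The vector view can be sketched as a forward pass in which each layer is one matrix-vector product followed by an element-wise activation. The layer sizes, the ReLU choice, and the orientation of the weight matrices (rows index the receiving layer) are assumptions made for this example.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
d, p1, p2, o = 4, 5, 3, 2                 # hypothetical layer dimensionalities
W1 = rng.normal(size=(p1, d))             # input layer -> hidden layer 1
W2 = rng.normal(size=(p2, p1))            # hidden layer 1 -> hidden layer 2
W3 = rng.normal(size=(o, p2))             # hidden layer 2 -> output layer

x = rng.normal(size=d)                    # one input vector
h1 = relu(W1 @ x)                         # activation applied element-wise
h2 = relu(W2 @ h1)
output = W3 @ h2                          # identity activation at the output
```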

# The Multilayer Network as a Computational Graph
## Neural Networks as Computational Graphs
- Neural networks compute nested compositions of parametric multivariate functions.
- These compositions make neural networks powerful approximators.
- This showcases the network's ability to model complex interactions between the features extracted by previous layers.

# Backpropagation in Neural Networks
## Learning Weights with Backpropagation
- Weights are updated via gradient descent to minimize the loss.
- Backpropagation computes the gradients needed for each layer.

![image.png](attachment:image.png)

- The loss function quantifies the difference between the predicted output and the observed output.


# Backpropagation Algorithm
## Straightforward loss functions
- Multilayer networks are often too large to write the loss function in closed form in terms of the weights.
- Instead of differentiating a closed-form loss, the backpropagation algorithm computes gradients by propagating errors backwards through the network.
- Modularity in neural network design leads to modularity in learning parameters.
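
To make the backward propagation of errors concrete, here is a hand-written sketch for a tiny network with one tanh hidden layer, a linear scalar output, and squared loss. The architecture, function names, and data are assumptions for illustration; the finite-difference comparison at the end is only a sanity check on one weight.

```python
import numpy as np

def forward(W1, W2, x):
    a = W1 @ x               # hidden pre-activation
    h = np.tanh(a)           # hidden post-activation
    y_hat = W2 @ h           # linear output (a scalar)
    return a, h, y_hat

def loss_fn(W1, W2, x, y):
    return (y - forward(W1, W2, x)[2]) ** 2

def backward(W1, W2, x, y):
    """Gradients of the squared loss via the chain rule, from output to input."""
    a, h, y_hat = forward(W1, W2, x)
    d_yhat = -2.0 * (y - y_hat)           # dL/dy_hat
    grad_W2 = d_yhat * h                  # dL/dW2
    d_h = d_yhat * W2                     # error propagated back to the hidden layer
    d_a = d_h * (1.0 - h ** 2)            # through tanh, whose derivative is 1 - tanh^2
    grad_W1 = np.outer(d_a, x)            # dL/dW1
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))              # hypothetical 2 -> 3 -> 1 network
W2 = rng.normal(size=3)
x, y = np.array([0.5, -1.0]), 1.0
grad_W1, grad_W2 = backward(W1, W2, x, y)

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 1] += eps
W1_minus = W1.copy()
W1_minus[0, 1] -= eps
numeric = (loss_fn(W1_plus, W2, x, y) - loss_fn(W1_minus, W2, x, y)) / (2 * eps)
```

Each backward step reuses quantities from the layer after it, which is the modularity the slide refers to: adding a layer adds one more local chain-rule step rather than a new closed-form derivation.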

# The Role of Activation Functions
## Power of Nonlinear Activations
- Nonlinear activation functions (e.g., ReLU, sigmoid) increase network expressiveness.
- Without nonlinear activations, a multilayer network is equivalent to a single layer.


# The Importance of Nonlinearity

- Neural networks are powerful due to repeated composition of functions.
- Not all functions are effective—nonlinear squashing functions are key.
- **Theorem 1.5.1:** A multi-layer network with only identity activations reduces to a single-layer network.
- This limits its expressivity, making it equivalent to linear regression.


# Proof of Theorem 1.5.1

- Consider a network with `k` hidden layers.
- The recurrence condition for multi-layer networks is defined as:
![image.png](attachment:image.png)
- When identity activations are used, the output reduces to:
  ![image-2.png](attachment:image-2.png)
  $W_{xo}$ is the single matrix obtained by multiplying all the weight matrices together.
  ![image-3.png](attachment:image-3.png)
  
  The output is this single matrix times the input vector.
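
The collapse can be verified numerically. In this sketch the layer sizes are arbitrary and the weight matrices are oriented so that each layer computes `W @ input`; the book's own notation may arrange the transposes differently.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))      # hypothetical layer sizes: 4 -> 5 -> 3 -> 2
W2 = rng.normal(size=(3, 5))
W3 = rng.normal(size=(2, 3))
x = rng.normal(size=4)

# Layer-by-layer computation with identity activations.
deep_output = W3 @ (W2 @ (W1 @ x))

# The same map as a single matrix: the product of all weight matrices.
W_xo = W3 @ W2 @ W1
single_layer_output = W_xo @ x
```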


# Lemma 1.5.1

- A network with identity activations and perceptron output reduces to a single-layer perceptron.
- This is due to the composition of linear functions remaining linear.
- **Observation 1.5.1:** Composition of linear functions is always linear.


# Nonlinear Activations

- Nonlinear activations like ReLU or sigmoid are crucial for deep learning.
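- Squashing functions such as sigmoid and tanh have their largest gradients near zero and saturate for large absolute values; ReLU, by contrast, does not saturate for positive inputs.
- **Universal Approximation Theorem:** A two-layer network (one hidden layer of nonlinear units plus a linear output layer) can approximate almost any reasonable function, provided it has enough hidden units.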


# Nonlinear Activations in Action

- Consider two-class data that is not linearly separable.
- A neural network with ReLU activations can learn new features like:

$h_1 = \max\{x_1, 0\}$
takes the value $x_1$ when $x_1$ is positive and 0 otherwise

$h_2 = \max\{-x_1, 0\}$
takes the value $-x_1$ when $x_1$ is negative and 0 otherwise


- These new features enable linear separability in hidden space.
![figure%201.10.png](attachment:figure%201.10.png)


# Role of ReLU in Linear Separability

- ReLU helps threshold negative values to 0, aiding in linear separability.
- In the hidden space, points become linearly separable.
- With appropriate weights, even simple linear functions can classify the data perfectly.
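
A simplified one-dimensional version of this idea is sketched below. The three data points, their labels, and the hand-picked output weights and threshold are hypothetical; the figure in the source may use different points.

```python
import numpy as np

# Hypothetical toy data: the two outer points are one class, the middle point another.
# No single threshold on x1 separates them.
x1 = np.array([-1.0, 0.0, 1.0])
y = np.array([1, -1, 1])

h1 = np.maximum(x1, 0.0)     # ReLU of  x1
h2 = np.maximum(-x1, 0.0)    # ReLU of -x1

# In the hidden space the points become (0, 1), (0, 0), (1, 0), and the line
# h1 + h2 = 0.5 separates the classes (weights and threshold chosen by hand).
scores = h1 + h2 - 0.5
predictions = np.where(scores > 0, 1, -1)
```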


# Two Notable Benchmarks
- Neural network benchmarks are dominated by computer vision data.
- UCI repository datasets can be used, but perceptually-oriented data is favored.
- Two key datasets dominate deep learning papers:
  1. MNIST (handwritten digits)
  2. ImageNet (visual categories)

# MNIST Database of Handwritten Digits
- MNIST: Modified National Institute of Standards and Technology database.
- Contains 60,000 training and 10,000 testing images of handwritten digits (0-9).
- Images normalized to 28x28 pixels, grayscale values (0-255).
- Created by NIST from American Census Bureau employees and high school students' handwriting.
![figure%201.11.png](attachment:figure%201.11.png)

# MNIST Usage in Neural Networks
- MNIST is often used for quick testing of machine learning algorithms.
- Image representation (28x28 = 784 dimensions) supports both traditional and neural network models.
- A simple support vector machine achieves a 0.56% error rate.
- Deeper neural networks and convolutional neural networks (CNNs) achieve lower error rates, down to 0.21%.


# Broader Use of MNIST
- Treating each image as a flat 784-dimensional vector ignores its spatial structure, which allows MNIST to be used for testing generic (non-vision) neural network algorithms.
- Useful for data reconstruction and generic neural network testing.
- Provides visual feedback when used for non-vision tasks, offering insights not available in UCI datasets.


# ImageNet Database
- ImageNet contains over 14 million images organized according to the WordNet hierarchy of nouns; the ILSVRC benchmark uses a subset covering 1,000 categories.
- Key for benchmarking due to the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
- Has led to many state-of-the-art architectures, surpassing human performance in image classification.


# Transfer Learning with ImageNet
- ImageNet-trained CNNs are used for feature extraction in other tasks.
- Hidden activations in the penultimate layer define new multidimensional representations.
- These representations can be used in traditional machine learning, enabling transfer learning across domains.


# Conclusion

- Neural networks can be seen as computational graphs that simulate the learning process.
- These graphs recursively compose simpler functions to learn complex functions.
- The primary challenge is learning the parameters of the graph to optimize a loss function.
- Basic neural networks can function like simple machine learning models (e.g., least-squares regression).
- The power of neural networks lies in complex combinations of underlying functions.
- Backpropagation allows efficient computation of gradients, enhancing learning.
- Deep learning methods require specialized architectures (e.g., recurrent neural networks, convolutional neural networks) tailored to specific domains like text and images.



Aggarwal, C. C. (2023). Neural Networks and Deep Learning: A Textbook. Springer International Publishing. https://doi.org/10.1007/978-3-031-29642-0

OpenAI. (2023). ChatGPT (October 2023 version) [Large language model]. https://chat.openai.com/