# An Introduction to Deep Learning

This notebook provides an introduction to the fundamental concepts of deep learning. 

**Goals:**
*   Understand what "deep" means in deep learning.
*   Visualize the structure of a deep neural network.
*   Grasp the mathematics of forward and backward propagation.
*   Apply these concepts to a practical problem and compare them to a traditional ML model.
*   See examples of how deep learning is used in the basic sciences.

We will use **scikit-learn** and **PyTorch** for the early examples, and **TensorFlow/Keras** for the final mini-project. The PyTorch examples run on a **GPU** when one is available.

**Why Deep Learning?**
- Automatic feature extraction from raw data.
- Scales to large datasets.
- Can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough hidden units (Universal Approximation Theorem).

**Applications in Basic Sciences:**
- Physics: Predicting particle trajectories.
- Chemistry: Predicting molecular properties.
- Biology: Classifying cells in microscopy images.

**Quick Links**
- [TensorFlow Playground](https://playground.tensorflow.org/)
- [3Blue1Brown — What is a Neural Network?](https://www.youtube.com/watch?v=aircAruvnKk)
- [MIT Lecture: Deep Learning Basics](https://youtu.be/n1ViNeWhC24)
- [Distill.pub Momentum Visualization](https://distill.pub/2017/momentum/)

## 1. From Machine Learning to Deep Learning: The Next Step

You've already encountered powerful machine learning algorithms like linear/logistic regression and SVMs. These models are excellent for many tasks, but they often require manual **feature engineering**. This means the data scientist must carefully select and craft the input features for the model to work well.

Deep learning models, on the other hand, can learn these features automatically. This is achieved through the use of "deep" neural networks, which have multiple layers. Each layer learns to recognize progressively more complex features from the data.

### Visualizing a Deep Neural Network

Here is a simple visualization of a deep network. It has an input layer, two "hidden" layers (which makes it *deep*), and an output layer.

```mermaid
graph TD
    subgraph Input Layer
        I1(Input 1)
        I2(Input 2)
        I3(...)
    end
    subgraph Hidden Layer 1
        H1_1(Neuron)
        H1_2(Neuron)
        H1_3(Neuron)
    end
    subgraph Hidden Layer 2
        H2_1(Neuron)
        H2_2(Neuron)
    end
    subgraph Output Layer
        O1(Output)
    end

    I1 --> H1_1; I1 --> H1_2; I1 --> H1_3;
    I2 --> H1_1; I2 --> H1_2; I2 --> H1_3;
    I3 --> H1_1; I3 --> H1_2; I3 --> H1_3;

    H1_1 --> H2_1; H1_1 --> H2_2;
    H1_2 --> H2_1; H1_2 --> H2_2;
    H1_3 --> H2_1; H1_3 --> H2_2;

    H2_1 --> O1;
    H2_2 --> O1;
```

---
:::{exercise} Remembering previous cases
Think about a dataset you have worked with before. Would a deep learning approach have been beneficial? Why or why not? Consider the size of the dataset, the complexity of the patterns, and whether you had to do a lot of feature engineering.
:::

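A simple example shows what a neural network can do that a linear model cannot. Here a small network with a single hidden layer fits a noisy sine curve, while linear regression can only draw a straight line through it: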

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Non-linear dataset: y = sin(x) + noise
np.random.seed(0)
X = np.linspace(-2*np.pi, 2*np.pi, 100).reshape(-1, 1)
y = np.sin(X) + 0.1 * np.random.randn(*X.shape)

# Linear regression
lin_reg = LinearRegression().fit(X, y)
y_pred_lin = lin_reg.predict(X)

# Simple NN
mlp = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh', max_iter=5000, random_state=0)
mlp.fit(X, y.ravel())
y_pred_mlp = mlp.predict(X)

# Plot
plt.figure(figsize=(8,5))
plt.scatter(X, y, color='gray', alpha=0.5, label='Data')
plt.plot(X, y_pred_lin, label='Linear Regression', color='red')
plt.plot(X, y_pred_mlp, label='Neural Network', color='blue')
plt.legend()
plt.title("Linear vs Neural Network Fit")
plt.show()

:::{exercise}
Experiment with the number and size of the hidden layers (`hidden_layer_sizes`) and with the activation function (`activation`). How does the fit change?
:::

The same model implemented in PyTorch:


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Toy dataset
X_torch = torch.tensor(X, dtype=torch.float32).to(device)
y_torch = torch.tensor(y, dtype=torch.float32).to(device)

# Model
model = nn.Sequential(
    nn.Linear(1, 20),
    nn.Tanh(),
    nn.Linear(20, 1)
).to(device)

loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(1000):
    optimizer.zero_grad()
    y_pred = model(X_torch)
    loss = loss_fn(y_pred, y_torch)
    loss.backward()
    optimizer.step()

print("Final loss:", loss.item())

## 2. The Core of Deep Learning: Deep Neural Networks

The process of passing input data through the network to get an output is called **forward propagation**. For each neuron, we calculate a weighted sum of the outputs from the previous layer, add a bias, and then pass this result through a non-linear **activation function**.

### The Mathematics

For a single neuron *j* in layer *l*, its output $a_j^{(l)}$ is:

$z_j^{(l)} = \sum_k (w_{jk}^{(l)} \cdot a_k^{(l-1)}) + b_j^{(l)}$

$a_j^{(l)} = g(z_j^{(l)})$

Where:
- $a_k^{(l-1)}$ is the activation of the *k*-th neuron in the previous layer.
- $w_{jk}^{(l)}$ is the weight of the connection from neuron *k* to neuron *j*.
- $b_j^{(l)}$ is the bias of neuron *j*.
- $g$ is the activation function.
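
The two equations translate almost line by line into NumPy. The following is a minimal sketch, not a reference implementation: the layer widths `[3, 3, 2, 1]` follow the diagram in Section 1 (treating the input layer as three values), the weights are random, and `tanh` is applied at every layer purely for illustration. The helper name `forward` is our own.

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """Forward propagation through fully connected layers.

    weights[l] has shape (n_out, n_in), so weights[l][j, k] plays the role of w_jk.
    The same activation g is used at every layer to keep the sketch short.
    """
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # z_j = sum_k w_jk * a_k + b_j
        a = g(z)        # a_j = g(z_j)
    return a

rng = np.random.default_rng(0)
sizes = [3, 3, 2, 1]  # assumed layer widths: input, hidden 1, hidden 2, output
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]

x = np.array([0.5, -1.0, 2.0])
output = forward(x, weights, biases)
print(output.shape)  # (1,)
```

Storing each layer as a matrix means the sum over *k* becomes a single matrix-vector product, which is also how deep learning libraries evaluate layers in practice.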

### Common Activation Functions

The most common activation functions are Sigmoid, Tanh, and **ReLU (Rectified Linear Unit)**. ReLU is the most popular choice for hidden layers in deep learning today because it helps mitigate the "vanishing gradient" problem, in which gradients shrink toward zero as they are propagated back through many layers.
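
To get a feel for the vanishing gradient, we can compare the slopes of sigmoid and ReLU. The sketch below is a back-of-the-envelope bound only: it multiplies activation derivatives across an assumed 10 layers and ignores the weight matrices, which also scale the gradient in a real network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return np.where(z > 0, 1.0, 0.0)

# The sigmoid slope is largest at z = 0, where it equals 0.25
max_sigmoid_slope = sigmoid_grad(0.0)

n_layers = 10  # assumed depth, for illustration only
sigmoid_factor = max_sigmoid_slope ** n_layers   # best case for sigmoid
relu_factor = relu_grad(1.0) ** n_layers         # active ReLU units

print(f"sigmoid: at most {sigmoid_factor:.1e}")  # about 9.5e-07
print(f"relu (active units): {relu_factor:.1f}")
```

Even in the best case, ten sigmoid layers scale the gradient by roughly one millionth, whereas active ReLU units pass it through unchanged.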

In [None]:
# Let's visualize the common activation functions
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

z = np.linspace(-5, 5, 200)

plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(z, sigmoid(z))
plt.title('Sigmoid Activation')
plt.grid(True)

plt.subplot(1, 3, 2)
plt.plot(z, relu(z))
plt.title('ReLU Activation')
plt.grid(True)

plt.subplot(1, 3, 3)
plt.plot(z, tanh(z))
plt.title('Tanh Activation')
plt.grid(True)

plt.show()

:::{exercise} ReLU activation
If a neuron in a hidden layer uses a ReLU activation function and its input *z* is -5, what will be its output? What if the input *z* is 5?
:::

## 3. Learning from Mistakes: Backpropagation

How does the network learn the correct values for its weights and biases?

1.  It first makes a prediction using **forward propagation**.
2.  It measures how wrong that prediction is using a **loss function** (e.g., Mean Squared Error or Cross-Entropy).
3.  It calculates the gradient of the loss with respect to every weight and bias in the network. This is done efficiently via an algorithm called **backpropagation**, which is essentially an application of the chain rule from calculus.
4.  It uses an **optimizer** (like Gradient Descent) to update the weights and biases in the direction that minimizes the loss.

This cycle is repeated many times with the training data. The core of backpropagation is figuring out how much each parameter contributed to the error, and propagating this error information "backward" from the output layer to the input layer.
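
The four steps can be sketched by hand for the simplest possible "network": a single linear neuron trained with plain gradient descent. Everything here is an illustrative assumption (the synthetic data `y = 3x + 1` plus noise, the learning rate, the number of steps); with one neuron, the chain rule in step 3 is short enough to write out directly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + 0.05 * rng.normal(size=200)  # synthetic targets

w, b = 0.0, 0.0       # parameters of a single linear neuron
learning_rate = 0.1   # assumed value, chosen for this toy problem

for step in range(500):
    y_hat = w * x + b                       # 1. forward propagation
    loss = np.mean((y_hat - y) ** 2)        # 2. loss (mean squared error)
    grad_w = np.mean(2 * (y_hat - y) * x)   # 3. gradients via the chain rule
    grad_b = np.mean(2 * (y_hat - y))
    w -= learning_rate * grad_w             # 4. gradient-descent update
    b -= learning_rate * grad_b

print(f"w = {w:.2f}, b = {b:.2f}, loss = {loss:.4f}")
```

The learned `w` and `b` should end up close to the values used to generate the data. In a deep network the loop is identical; only step 3 becomes more involved, which is where backpropagation comes in.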

The error $\delta^{(l)}$ in a hidden layer *l* (the gradient of the loss with respect to $z^{(l)}$) is calculated from the error in the next layer *l+1*:

$\delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot g'(z^{(l)})$

Where $\odot$ is element-wise multiplication and $g'(z^{(l)})$ is the derivative of the activation function.
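
One way to convince yourself that this equation is right is to implement it for a tiny network and compare the result with a numerical derivative. The sketch below assumes a 2-3-1 network with a `tanh` hidden layer, a linear output, and a squared-error loss on a single example; all of these choices, and the names used, are ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)     # one input example
y = np.array([0.5])        # its target
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # hidden layer (tanh)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # output layer (linear)

def loss_fn(W1_, b1_, W2_, b2_):
    a1_ = np.tanh(W1_ @ x + b1_)
    a2_ = W2_ @ a1_ + b2_
    return 0.5 * np.sum((a2_ - y) ** 2)

# Forward pass, keeping the intermediate values
z1 = W1 @ x + b1
a1 = np.tanh(z1)
a2 = W2 @ a1 + b2

# Backward pass
delta2 = a2 - y                                       # output error (linear output, squared loss)
delta1 = (W2.T @ delta2) * (1.0 - np.tanh(z1) ** 2)   # the delta equation with g = tanh
grad_W2 = np.outer(delta2, a1)   # dL/dW2
grad_W1 = np.outer(delta1, x)    # dL/dW1 (bias gradients are delta2 and delta1)

# Central finite difference on a single weight as an independent check
eps = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2) - loss_fn(W1_minus, b1, W2, b2)) / (2 * eps)

print(np.isclose(grad_W1[0, 1], numeric))
```

This kind of gradient check is a common debugging tool when writing backpropagation by hand; frameworks such as PyTorch perform the backward pass automatically when you call `loss.backward()`.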

:::{exercise} ReLU and linear activation
Why is the derivative of the activation function ($g'$) important in the backpropagation equation above? What would happen if we used a linear activation function (where $g'(z)$ is just a constant) in all hidden layers?
:::

## 4. Applications in Basic Sciences

Deep learning is revolutionizing scientific research:

-   **Biology:** DeepMind's **AlphaFold** uses a deep learning model to predict the 3D structure of proteins from their amino acid sequence, a grand challenge in biology.
-   **Chemistry:** Deep learning can predict the properties of molecules, accelerating drug discovery and materials science.
-   **Physics:** At the Large Hadron Collider (LHC), physicists use deep learning to classify particles and find anomalies in the immense stream of data from collisions.
-   **Astrophysics:** Neural networks are used to classify galaxies, find exoplanets, and detect gravitational waves in noisy sensor data.

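The following cell generates synthetic projectile-motion data that can be used to train a neural network. The inputs are the horizontal position and the time; the target is the height.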

In [None]:
import numpy as np

# Physics: projectile motion without air resistance
def projectile_data(n=100):
    """Sample n trajectories; return features (x, t) and target height y."""
    g = 9.81  # gravitational acceleration
    angles = np.random.uniform(20, 70, n) * np.pi / 180  # launch angles in radians
    speeds = np.random.uniform(10, 30, n)                # launch speeds
    t = np.linspace(0, 2, 20)                            # sample times
    X_data, y_data = [], []
    for v, a in zip(speeds, angles):
        x = v*np.cos(a)*t                     # horizontal position
        y_pos = v*np.sin(a)*t - 0.5*g*t**2    # height
        mask = y_pos >= 0                     # keep only points above the ground
        X_data.extend(np.column_stack([x[mask], t[mask]]))
        y_data.extend(y_pos[mask])
    return np.array(X_data), np.array(y_data).reshape(-1,1)

X_phys, y_phys = projectile_data(50)
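
As a rough sketch of how such data might be used, the cell below fits a small scikit-learn `MLPRegressor` to it. The helper `projectile_samples` is our own seeded copy of the generator so that the cell runs on its own, and the architecture, split, and iteration budget are arbitrary assumptions. Note that position and time fix only the horizontal speed, so the height is not uniquely determined by these two features and a perfect fit should not be expected.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def projectile_samples(n=50, seed=0):
    """Seeded copy of the projectile generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    g = 9.81
    angles = np.deg2rad(rng.uniform(20, 70, n))
    speeds = rng.uniform(10, 30, n)
    t = np.linspace(0, 2, 20)
    features, targets = [], []
    for v, a in zip(speeds, angles):
        x = v * np.cos(a) * t
        height = v * np.sin(a) * t - 0.5 * g * t**2
        mask = height >= 0                      # keep points above the ground
        features.append(np.column_stack([x[mask], t[mask]]))
        targets.append(height[mask])
    return np.vstack(features), np.concatenate(targets)

X_demo, y_demo = projectile_samples()
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scale the inputs, then fit a small two-hidden-layer network
reg = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
)
reg.fit(X_tr, y_tr)
r2_test = reg.score(X_te, y_te)
print(f"Test R^2: {r2_test:.3f}")
```

A natural follow-up is to add the launch speed and angle as input features and observe how the score changes.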

## 5. Final Exercises & Mini-Project

Now, let's put it all into practice. We will use the famous Iris dataset. The goal is to classify a flower as one of three species based on four measurements. We will build a simple Deep Neural Network for this and compare its performance to a classic SVM.

**1. Conceptual Exercise: From Logistic Regression to a Neural Network**
Explain how a single neuron with a sigmoid activation function is mathematically equivalent to logistic regression.

**2. Conceptual Exercise: Choosing the Right Model**
You are given a dataset of 500 patient records with 15 features to predict the likelihood of a certain disease. The relationships are expected to be complex but the dataset is small. Would you choose an SVM or a deep neural network as your first model to try? Justify your answer.

**3. Conceptual Exercise: Activation Functions in Practice**
You are building a deep neural network for classifying images into 10 different categories.
   - What activation function would you likely use for the hidden layers? Why?
   - What activation function would you use for the output layer? (Hint: The output needs to represent the probability for 10 different classes).

**4. Conceptual Exercise: Backpropagation Intuition**
If the loss of your neural network is very high after the first few training batches, what does this imply about the initial (randomly set) weights and biases? How does backpropagation help?

**5. Mini-Project: Iris Classification - DNN vs. SVM**
Follow the code below to load the data, train both an SVM and a DNN, and compare their performance.

In [None]:
# Setup: Import all necessary libraries
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn import datasets
import numpy as np

print("TensorFlow version:", tf.__version__)

In [None]:
# --- Mini-Project Step 1: Load and Prepare the Data ---

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale the features. This is important for both SVM and Neural Networks.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training data shape:", X_train_scaled.shape)
print("Test data shape:", X_test_scaled.shape)

In [None]:
# --- Mini-Project Step 2: Train and Evaluate the SVM Model ---

# Create an SVM classifier
svm_model = SVC(kernel='rbf', probability=True, random_state=42)

# Train the model
svm_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"Support Vector Machine (SVM) Accuracy: {accuracy_svm * 100:.2f}%")

In [None]:
# --- Mini-Project Step 3: Build, Train, and Evaluate the Deep Neural Network ---

# Build the model using Keras Sequential API
# This is a simple DNN with two hidden layers.
# Input layer (implicit) -> Dense(10) -> Dense(10) -> Output(3)
dnn_model = keras.Sequential([
    # First hidden layer with 10 neurons and ReLU activation.
    # input_shape tells Keras how many features each sample has.
    keras.layers.Dense(10, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    
    # Second hidden layer with 10 neurons and ReLU activation
    keras.layers.Dense(10, activation='relu'),
    
    # Output layer with 3 neurons (one for each Iris class)
    # 'softmax' activation is used for multi-class classification to get probability distribution
    keras.layers.Dense(3, activation='softmax')
])

# Compile the model
# We specify the optimizer, loss function, and metrics to track.
dnn_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Print a summary of the model architecture
dnn_model.summary()

In [None]:
# Now, train the DNN model
# An "epoch" is one full pass through the entire training dataset.
# Here the test set is passed as validation data only to monitor progress.
# If you tune the model based on these numbers, hold out a separate validation split instead.
history = dnn_model.fit(X_train_scaled, y_train, epochs=50, validation_data=(X_test_scaled, y_test), verbose=1)

# Evaluate the final model on the test set
loss, accuracy_dnn = dnn_model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"\nDeep Neural Network (DNN) Accuracy: {accuracy_dnn * 100:.2f}%")

### Conclusion of Mini-Project

Compare the accuracy of the SVM and the DNN. On a simple, small dataset like Iris, a classic SVM often performs just as well as, or even better than, a neural network. The real power of deep learning becomes apparent with much larger and more complex datasets (e.g., thousands of images, text documents, or scientific sensor data) where the model can learn intricate hierarchical features that would be impossible to engineer by hand.

**Experiment further!** Go back and change the DNN architecture. What happens if you:
-   Change the number of neurons in the hidden layers (e.g., to 5 or 20)?
-   Add another hidden layer?
-   Train for more or fewer epochs?
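
Because the test split holds only 30 flowers, a single accuracy number can shift by several percentage points from one split to another. One way to get a steadier comparison is cross-validation. The sketch below makes two assumptions worth flagging: it uses scikit-learn's `MLPClassifier` as a stand-in for the Keras model (same hidden-layer sizes, but not the same implementation or training settings), and it uses 5 stratified folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside each pipeline, so it is refit on every training fold
candidates = {
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "MLP (10, 10)": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000, random_state=42),
    ),
}

cv_scores = {
    name: cross_val_score(model, X_iris, y_iris, cv=cv)
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the two means differ by less than their spread across folds, the data do not support calling either model the winner.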