# Contents
- What is Deep Learning?
- Neural Network
    - Neuron
    - Perceptron
    - Neural Network
    - Implementation Steps
    - Activation Function
    - Loss Function
    - Gradient Descent
- Train Deep Neural Network
    - Data Augmentation
    - Batch Normalization
    - Dropout Regularization
    - Optimizers
- Convolutional Neural Networks

# What is Deep Learning?
Deep Learning is a subset of machine learning, which itself is a branch of artificial intelligence (AI). It focuses on teaching computers to learn and make decisions by mimicking the way the human brain works. It uses artificial neural networks (ANNs) to process data and make predictions or classifications. These networks consist of layers of interconnected nodes (neurons), where each layer learns to extract and process different features from the input data.

Deep learning is particularly powerful for tasks where feature extraction is challenging or where data is unstructured, such as images, audio, and text. It has enabled significant advancements in fields like natural language processing (NLP), computer vision, and speech recognition.

## Differences between Machine Learning and Deep Learning
| **Aspect**               | **Machine Learning (ML)**                              | **Deep Learning (DL)**                             |
|--------------------------|-------------------------------------------------------|--------------------------------------------------|
| **Definition**           | A subset of AI focused on enabling machines to learn from data using algorithms. | A subset of ML that uses neural networks with multiple layers to model complex patterns. |
| **Data Dependency**      | Performs well with smaller datasets.                  | Requires large amounts of labeled data to perform effectively. |
| **Feature Engineering**  | Relies on manual feature extraction and selection.     | Automatically extracts features through multiple layers of the network. |
| **Model Complexity**     | Models are simpler (e.g., linear regression, SVM).     | Models are complex with deep neural network architectures. |
| **Computation Power**    | Requires less computational power.                     | Requires significant computational resources (GPUs/TPUs). |
| **Training Time**        | Faster training times for most models.                 | Training can be time-consuming due to large datasets and model complexity. |
| **Interpretability**     | Easier to interpret and understand model decisions.    | Often considered a "black box" with lower interpretability. |
| **Applications**         | Suitable for tabular data (e.g., fraud detection, customer segmentation). | Excels in unstructured data like images, text, audio (e.g., image recognition, NLP). |
| **Scalability**          | Limited scalability with increasing data complexity.   | Highly scalable for large and complex datasets. |
| **Example Algorithms**   | Linear Regression, Decision Trees, Random Forests, SVM. | Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transformers. |

## Applications
| **Field**              | **Application**                    | **Deep Learning Model**         |
|------------------------|------------------------------------|---------------------------------|
| **Computer Vision**     | Object detection, facial recognition | Convolutional Neural Networks (CNNs) |
| **Natural Language Processing (NLP)** | Language translation, sentiment analysis | Recurrent Neural Networks (RNNs), Transformers |
| **Speech Recognition** | Voice assistants, transcription     | RNNs, Transformers             |
| **Generative Models**  | Image and text generation           | Generative Adversarial Networks (GANs), Transformers |
| **Autonomous Vehicles** | Perception and path planning        | CNNs, YOLO, RNNs               |
| **Healthcare**         | Disease diagnosis, drug discovery   | CNNs, AlphaFold                |
| **Recommendation Systems**     | E-commerce, streaming services      | Neural Collaborative Filtering |
| **Finance**            | Fraud detection, stock prediction   | Autoencoders, RNNs             |

## Frameworks
### 1. TensorFlow
TensorFlow is an open-source deep learning framework developed by Google Brain. It is widely used in research and production for creating complex machine learning models.
### 2. PyTorch
PyTorch, developed by Facebook’s AI Research (FAIR), is known for its dynamic computation graph and flexibility. It is popular among researchers for its intuitive design.
### 3. Keras
Keras is a high-level API for building and training neural networks. Initially developed independently, it is now integrated into TensorFlow as `tf.keras`. Keras focuses on user-friendliness and rapid prototyping.
### Comparison
| **Feature**               | **TensorFlow**                       | **PyTorch**                          | **Keras**                                |
|---------------------------|--------------------------------------|--------------------------------------|-----------------------------------------|
| **Ease of Use**           | Moderate                            | High                                 | Very High                               |
| **Computation Graph**     | Eager (dynamic) by default in TF 2.x; static graphs via `tf.function` | Dynamic                              | Inherits from its backend (TensorFlow for `tf.keras`) |
| **Flexibility**           | High                                | Very High                            | Moderate                                |
| **Debugging**             | Challenging                         | Easy (real-time debugging)           | Easy (abstracted)                       |
| **Deployment**            | Excellent (TensorFlow Serving, Lite) | Moderate                            | Integrated with TensorFlow (via `tf.keras`) |
| **Best For**              | Production, scalability             | Research, experimentation            | Beginners, rapid prototyping            |

### Choosing the Right Framework
- **TensorFlow**: If you need a production-ready system or plan to deploy on mobile/IoT devices.
- **PyTorch**: If you prioritize flexibility and debugging, or are working on research projects.
- **Keras**: If you are a beginner or need to quickly prototype a model.

# Neural Network
## Neuron
A neuron in deep learning is the basic building block of artificial neural networks, inspired by the biological neurons in the human brain. It is a computational unit that takes inputs, processes them, and produces an output.

### Components of a Neuron
#### 1. Inputs (\$ x_1, x_2, ... , x_n \$):
- These represent features or data points.
- For example, in image classification, they could be pixel values.
#### 2. Weights (\$ w_1, w_2, ... , w_n \$):
- Each input has an associated weight that signifies its importance.
- Weights are learned during training.
#### 3. Bias (\$ b \$):
- The bias shifts the weighted sum before the activation function is applied, so the neuron's output is not forced through the origin.
- It acts as an offset.
#### 4. Summation Function:
The neuron computes a weighted sum of inputs:

\$
 z = \sum_{i=1}^{n} w_i \cdot x_i + b
\$

#### 5. Activation Function 
- This introduces non-linearity, allowing the network to learn complex patterns.
- Common activation functions include:
    - Sigmoid: Outputs values between 0 and 1.
    - ReLU (Rectified Linear Unit): Outputs max(0, z)
    - Tanh: Outputs values between -1 and 1.
- Without an activation function, the neuron computes only a linear function of its inputs, so a network built from such neurons could learn only linear relationships.
#### 6. Output:
The final output is:

\$
y = f(z)
\$
### Neurons in a Layer
A single neuron is rarely used alone.
Multiple neurons are stacked to form layers in a neural network.
Outputs from one layer become inputs to the next.
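
The components above can be put together in a few lines. This is a minimal sketch in plain Python, assuming a sigmoid activation by default; the function names `neuron` and `layer` and all numeric values are illustrative and not part of any library.

```python
import math

def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def layer(inputs, weight_rows, biases, activation=sigmoid):
    """Apply several neurons to the same inputs; one output per neuron."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Illustrative values: both hidden neurons see z = 0, so each outputs 0.5.
hidden = layer([1.0, 2.0], [[0.5, -0.25], [0.0, 0.0]], [0.0, 0.0])

# The hidden outputs become the inputs of the next neuron (identity activation here).
output = neuron(hidden, [1.0, 1.0], 0.0, activation=lambda z: z)
print(hidden, output)  # [0.5, 0.5] 1.0
```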

## Perceptron
A perceptron is the fundamental building block of neural networks. It models a single neuron and is inspired by the biological neurons in the human brain. It works as a binary classifier.

### Components of a Perceptron
1. Inputs (\$ x_1, x_2, ..., x_n \$): Features of the data.
2. Weights (\$ w_1, w_2, ..., w_n \$): Each input is assigned a weight that signifies its importance.
3. Summation Function (\$ z = \sum_{i=1}^{n} w_i x_i + b \$): The weighted sum of inputs plus a bias term \$ b \$.
4. Activation Function (\$ f(z) \$): Applies a threshold to determine the output (0 or 1).
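
As a rough illustration, the sketch below trains a perceptron on the logical AND function using the classic perceptron rule (add `lr * (target - prediction) * x` to each weight). The threshold at zero, the learning rate of 1, and the function names are assumptions made for this example.

```python
def step(z):
    """Threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)

def train_perceptron(samples, labels, lr=1, epochs=20):
    """Perceptron rule: w += lr * (target - prediction) * x, b += lr * error."""
    weights = [0] * len(samples[0])
    bias = 0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so the rule converges.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)  # [0, 0, 0, 1]
```

The same loop never settles on XOR, because no single line separates its two classes.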

### Differences between perceptron and neuron
| **Aspect**         | **Perceptron**                                   | **Neuron in Neural Network**                       |
|---------------------|------------------------------------------------|---------------------------------------------------|
| **Complexity**      | Simple model for binary classification.         | Generalized for more complex tasks.              |
| **Activation**      | Step function (binary output).                  | Non-linear functions (e.g., ReLU, sigmoid).       |
| **Capability**      | Solves only linearly separable problems.         | Solves both linear and non-linear problems.       |
| **Learning**        | Simple weight adjustment (e.g., perceptron rule).| Uses backpropagation and gradient descent.        |
| **Usage**           | Standalone classifier (single-layer).           | Forms building blocks of deep neural networks.    |
| **Output**          | Binary (0 or 1).                                | Continuous or probabilistic values.              |

## Neural Network
A neural network is a computational model designed to simulate the way the human brain processes information. It consists of layers of interconnected nodes (neurons), organized into an input layer, one or more hidden layers, and an output layer. 
### Components of a Neural Network
#### 1. Input Layer:
- Accepts raw data as input. Each neuron in this layer corresponds to a feature in the input data.
- Example: If you are predicting house prices, features like "size", "location", and "number of rooms" will be inputs.
#### 2. Hidden Layers:
- Process data using weights, biases, and activation functions to learn intermediate representations.
- The number of layers and neurons depends on the complexity of the task.
#### 3. Output Layer:
- Provides the final result of the model.
- Example: In classification, this layer outputs probabilities or class labels.
#### 4. Weights and Biases:
- Weights determine the importance of a connection between neurons.
- Bias adjusts the weighted sum output for flexibility in learning.
#### 5. Activation Function:
- Introduces non-linearity, enabling the network to solve complex problems.
- Common activation functions: Sigmoid, ReLU, Tanh, Softmax.
### How a Neural Network Works: Step-by-Step
#### 1. Forward Propagation:
- Data flows through the network from the input layer to the output layer.
- Each neuron computes the weighted sum of inputs, applies an activation function, and sends the output to the next layer.
#### 2. Loss Calculation:
- A loss function measures the error between the predicted and actual values.
#### 3. Backpropagation:
- The network adjusts weights and biases to minimize the error.
- This involves calculating gradients using the chain rule and updating parameters with an optimization algorithm like gradient descent.
#### 4. Training:
- The cycle of forward propagation, loss calculation, and backpropagation repeats over many passes through the training data (epochs) until the network learns the patterns in the data.
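
To make the four steps concrete, here is a small sketch of one training step for a 2-2-1 network written in plain Python. The sigmoid activations, squared-error loss, starting weights, and learning rate of 0.5 are all assumptions chosen for illustration; real frameworks compute the gradients automatically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward propagation: input -> 2 hidden sigmoid units -> 1 sigmoid output."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y_hat

def loss(y_hat, y):
    """Squared error for a single example."""
    return (y_hat - y) ** 2

def backward(params, x, y):
    """Backpropagation: chain rule from the loss back to every parameter."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    d_out = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)  # dL/dz at the output unit
    grad_W2 = [d_out * hi for hi in h]
    grad_b2 = d_out
    # dL/dz at each hidden unit, propagated through W2 and the sigmoid derivative
    d_hidden = [d_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in d_hidden]
    grad_b1 = d_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2

def sgd_step(params, grads, lr):
    """Move every parameter a small step against its gradient."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    W1 = [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W1, gW1)]
    b1 = [b - lr * g for b, g in zip(b1, gb1)]
    W2 = [w - lr * g for w, g in zip(W2, gW2)]
    b2 = b2 - lr * gb2
    return W1, b1, W2, b2

# Illustrative starting weights and a single training example.
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, y = [1.0, 0.5], 1.0

loss_before = loss(forward(params, x)[1], y)
grads = backward(params, x, y)
params_after = sgd_step(params, grads, lr=0.5)
loss_after = loss(forward(params_after, x)[1], y)
print(loss_before, loss_after)  # the loss drops after one update
```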

## Implementation Steps
1. Import the Required Libraries
2. Load the Dataset
3. Extract Features
4. Preprocess Data
5. Split the Data into Training and Test Sets
6. Define the Deep Learning Model
7. Compile the Model
8. Train the Model
9. Evaluate the Model
10. Make Predictions
11. Visualization
12. Save the Model

## Activation Function
An activation function in a neural network defines the output of a node (neuron) given an input or a set of inputs. It determines whether a neuron should be activated or not based on the input it receives. Activation functions introduce non-linearity into the model, enabling it to learn and perform complex tasks.

Without activation functions, neural networks would behave like a linear regression model, unable to capture the non-linear patterns in data.

### Types
#### 1. Linear Activation Function:
- Formula: 𝑓(𝑥) = 𝑥
- Problem: Does not introduce non-linearity; a stack of linear layers is equivalent to a single linear layer.
#### 2. Non-Linear Activation Functions: These are the most commonly used.
- **Sigmoid**:
    - Formula: \$ \frac{1}{1 + e^{-x}} \$
    - Range: 0 to 1
    - Usage: Good for binary classification; used in the output layer.
    - Drawbacks: Can cause vanishing gradients.
- **Tanh (Hyperbolic Tangent)**:
    - Formula: 𝑓(𝑥) = tanh(𝑥) = \$ \frac{e^x - e^{-x}}{e^x + e^{-x}} \$
    - Range: −1 to 1
    - Usage: Good for hidden layers; addresses vanishing gradients better than sigmoid.
    - Drawbacks: Still susceptible to vanishing gradients for very large or small inputs.
- **ReLU (Rectified Linear Unit)**:
    - Formula: f(x)=max(0,x)
    - Range: 0 to ∞
    - Usage: Widely used in hidden layers; fast to compute.
    - Drawbacks: Can suffer from "dead neurons" where gradients become zero for inputs less than 0.
- **Leaky ReLU**:
    - Formula: f(x) = x if x > 0, otherwise f(x) = αx, where α is a small positive constant (e.g., 0.01)
    - Usage: Solves the "dead neurons" issue by allowing a small gradient for negative inputs.
- **Softmax**:
    - Formula: \$ \frac{e^{x_i}}{\sum_{j} e^{x_j}} \$
    - Range: 0 to 1
    - Usage: Used in the output layer for multi-class classification; ensures probabilities sum to 1.
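
The formulas above translate almost directly into code. The following is a minimal sketch in plain Python; the default `alpha=0.01` for Leaky ReLU and the max-subtraction trick in `softmax` are common conventions assumed here, not requirements stated in this section.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Written out as (e^x - e^-x) / (e^x + e^-x); math.tanh gives the same value.
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(sigmoid(0.0), relu(-2.0), leaky_relu(-2.0))  # 0.5 0.0 -0.02
print(sum(probs))                                   # approximately 1.0
```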

### Example
Suppose we have a neuron receiving the input vector x = [1, -2, 3] with the weights w = [0.5, 0.1, -0.4] and a bias b = 0.2.

#### Step 1: Compute the pre-activation output (z):
\$
z = w \cdot x + b
\$

\$
z = (1 \times 0.5) + (-2 \times 0.1) + (3 \times -0.4) + 0.2
\$

\$
z = 0.5 - 0.2 - 1.2 + 0.2 = -0.7
\$

#### Step 2: Apply an activation function.
- **Sigmoid**:
  \$
  f(z) = \frac{1}{1 + e^{0.7}} \approx 0.332
    \$
- **ReLU**:
    \$
  f(z) = \max(0, -0.7) = 0
    \$
- **Tanh**:
    \$
  f(z) = \tanh(-0.7) \approx -0.604
    \$

#### Explanation of Outputs:
- **Sigmoid** squashes the value to a range between 0 and 1, indicating activation strength.
- **ReLU** outputs 0, meaning the neuron is inactive for negative input.
- **Tanh** outputs a negative value, showing activation while preserving sign.
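
The arithmetic in this example can be checked with a few lines of Python. The sketch uses the input values as they appear in the Step 1 calculation (1, -2, 3) and only the standard `math` module.

```python
import math

x = [1, -2, 3]            # inputs as they appear in the Step 1 calculation
w = [0.5, 0.1, -0.4]
b = 0.2

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # pre-activation
sigmoid_out = 1.0 / (1.0 + math.exp(-z))
relu_out = max(0.0, z)
tanh_out = math.tanh(z)

print(round(z, 3), round(sigmoid_out, 3), relu_out, round(tanh_out, 3))
# -0.7 0.332 0.0 -0.604
```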

### Choosing an Activation Function

- **Hidden Layers**:
  - **ReLU**: Default choice due to simplicity and efficiency.
  - **Leaky ReLU**: Use if the dead neuron issue arises.
- **Output Layer**:
  - **Sigmoid**: For binary classification.
  - **Softmax**: For multi-class classification.
  - **Linear**: For regression problems.

## Loss Function
A loss function in deep learning quantifies how well the predicted outputs of a neural network match the actual target values. It is a critical component that guides the optimization process during training by measuring the error or deviation between predictions and ground truths. The goal of training a neural network is to minimize this loss.
### Types
Loss functions can be categorized based on the type of task:
#### 1. Regression Loss Functions
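- **Mean Squared Error (MSE)**: \$ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \$
- **Mean Absolute Error (MAE)**: \$ \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \$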
#### 2. Classification Loss Functions
**Binary Cross-Entropy:**

\$
\text{BCE} = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
\$
   
**Categorical Cross-Entropy:**

\$
\text{CCE} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})
\$
   
**Hinge Loss:**

\$
\text{Hinge Loss} = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \cdot \hat{y}_i)
\$
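
For reference, the losses in this section can be sketched in plain Python as follows. The clipping constant `eps` is an assumption added to avoid `log(0)`; the hinge loss expects labels in {-1, +1}, and the categorical cross-entropy expects one-hot target rows.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true in {0, 1}; y_pred are probabilities, clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true rows are one-hot; y_pred rows are class probabilities."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total += sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return -total / len(y_true)

def hinge(y_true, y_pred):
    """y_true in {-1, +1}; y_pred are raw scores."""
    return sum(max(0.0, 1.0 - t * p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([1, 2, 3], [1, 2, 5]))                  # 4/3
print(binary_cross_entropy([1, 0], [0.5, 0.5]))   # ln(2), about 0.693
print(hinge([1, -1], [2.0, 0.5]))                 # 0.75
```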

### Choosing a Loss Function
**1. Regression**:
- **MSE** when large errors should be penalized more heavily; its gradient is smooth everywhere.
- **MAE** when the data contains outliers, since the penalty grows only linearly with the error.

**2. Classification**:
- **Binary Cross-Entropy** for binary problems.
- **Categorical Cross-Entropy** for multi-class problems.
- **Hinge Loss** for margin-based classifiers like SVMs.

**3. Custom Tasks**:
- Combine loss functions or define your own based on specific requirements.

## Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function in machine learning and deep learning models by iteratively updating model parameters (weights and biases) in the direction of the negative gradient of the loss function. This process enables the model to learn the best parameters for making accurate predictions.

### Key Concepts
**1. Loss Function**: A function that quantifies the error between predicted and actual values. Common loss functions include Mean Squared Error (MSE) for regression and Cross-Entropy for classification.

**2. Gradient**: A vector of partial derivatives of the loss function with respect to model parameters. It indicates the direction and rate of change of the loss function.

**3. Learning Rate (η)**:
- A small positive value that determines the step size for updating parameters.
- A high learning rate can cause overshooting, while a low learning rate slows convergence.

### Step-by-Step Process
**1. Initialize Parameters**: Start with random values for weights (w) and biases (b).

**2. Compute Predictions**: Use the model to predict outputs (\$ \hat{y} \$).

**3. Calculate Loss:** Compute the loss using a predefined loss function, such as MSE.

**4. Compute Gradients**: Calculate the derivatives of the loss with respect to each parameter:

\$
\frac{\partial L}{\partial w}, \quad \frac{\partial L}{\partial b}
\$

**5. Update Parameters:** Adjust the parameters in the direction of the negative gradient:

\$
w = w - \eta \cdot \frac{\partial L}{\partial w}, \qquad b = b - \eta \cdot \frac{\partial L}{\partial b}
\$

**6. Repeat**: Iterate over the dataset multiple times (epochs) until the loss converges to a minimum or a stopping criterion is met.
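
The six steps can be sketched for the simplest possible model, a line y = w·x + b fitted with MSE. This is an illustrative example only: the toy data, the learning rate of 0.05, and the zero initialization (used instead of random values so the run is reproducible) are assumptions.

```python
def gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w*x + b by minimizing MSE with batch gradient descent."""
    w, b = 0.0, 0.0                                   # step 1: initialize
    n = len(xs)
    for _ in range(epochs):                           # step 6: repeat
        preds = [w * x + b for x in xs]               # step 2: predictions
        errors = [p - y for p, y in zip(preds, ys)]   # step 3: L = mean(errors^2)
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # step 4: dL/dw
        grad_b = 2.0 / n * sum(errors)                             # step 4: dL/db
        w -= lr * grad_w                              # step 5: update
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the fit should recover w = 2 and b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))  # 2.0 1.0
```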

### Variants
#### Batch Gradient Descent:
- Computes the gradient using the entire dataset in one iteration.
- Pros: Stable convergence.
- Cons: Slow for large datasets.
#### Stochastic Gradient Descent (SGD):
- Computes the gradient for one data point at a time.
- Pros: Faster updates, suitable for large datasets.
- Cons: Noisy convergence.
#### Mini-Batch Gradient Descent:
- Computes the gradient for small batches of data.
- Pros: Combines the advantages of batch and stochastic methods.
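
The three variants differ only in how many examples feed each parameter update. A small sketch of a batch iterator makes this visible; `iterate_batches` is a hypothetical helper written for this example, not a library function.

```python
import random

def iterate_batches(data, batch_size, seed=0):
    """Yield shuffled batches; the last batch may be smaller."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

data = list(range(10))
full_batch = list(iterate_batches(data, batch_size=len(data)))  # batch GD: 1 update per epoch
stochastic = list(iterate_batches(data, batch_size=1))          # SGD: 10 updates per epoch
mini_batch = list(iterate_batches(data, batch_size=4))          # mini-batch: 3 updates per epoch

print(len(full_batch), len(stochastic), len(mini_batch))  # 1 10 3
```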

# Train Deep Neural Network
## Data Augmentation
Data augmentation generates additional training examples by applying transformations to existing data. It helps the model generalize better by exposing it to variations.
### Why Use Data Augmentation?
- Reduces overfitting.
- Increases dataset size, especially when data is limited.
- Improves model robustness to variations.
### Common Data Augmentation Techniques
#### Image Data
- **Flipping**: Horizontal or vertical flips.
- **Rotation**: Rotate images by random angles.
- **Scaling**: Zoom in/out on images.
- **Cropping**: Randomly crop parts of images.
- **Brightness Adjustment**: Vary brightness to mimic lighting conditions.
- **Noise Addition**: Add Gaussian noise to simulate variability.
#### Text Data
- **Synonym Replacement**: Replace words with their synonyms.
- **Back Translation**: Translate text to another language and back.
- **Random Deletion**: Remove random words from sentences.
#### Time-Series Data:
- **Jittering**: Add random noise.
- **Time Warping**: Stretch or compress time intervals.
- **Random Sampling**: Drop some data points randomly.

```python
# For image data
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import numpy as np
from PIL import Image

# Configure the random transformations applied to each image
datagen = ImageDataGenerator(
    rotation_range=40,        # rotate by up to 40 degrees
    width_shift_range=0.2,    # shift horizontally by up to 20% of the width
    height_shift_range=0.2,   # shift vertically by up to 20% of the height
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'       # fill newly exposed pixels with the nearest value
)

# Load the image as a NumPy array and add a leading batch dimension
img = np.array(Image.open('example.jpg'))
img = img.reshape((1,) + img.shape)

# flow() yields batches indefinitely, so stop after the first one
for batch in datagen.flow(img, batch_size=1):
    augmented_image = batch[0]
    break
```
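
The Keras generator covers image data. For the text and time-series techniques listed earlier, a rough sketch in plain Python might look like this; `random_deletion` and `jitter` are hypothetical helper names, and the probability and noise level are arbitrary illustrative values.

```python
import random

def random_deletion(words, p, rng):
    """Drop each word with probability p, but always keep at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def jitter(series, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every point."""
    return [x + rng.gauss(0.0, sigma) for x in series]

rng = random.Random(42)  # fixed seed so the run is repeatable
sentence = "deep learning models need lots of data".split()
shorter = random_deletion(sentence, p=0.3, rng=rng)
noisy = jitter([1.0, 2.0, 3.0, 4.0], sigma=0.05, rng=rng)
print(shorter)
print(noisy)
```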


## Batch Normalization
Batch Normalization (BN) is a technique to improve the training of deep neural networks by normalizing intermediate layers.

### Why Batch Normalization?
1. **Internal Covariate Shift**:
- During training, the distribution of inputs to each layer changes due to updates in the parameters of the previous layers. This phenomenon is called internal covariate shift.
- BN addresses this by normalizing the input to each layer, reducing dependency on the initialization of weights.

2. **Improved Gradient Flow**: BN reduces the risk of vanishing or exploding gradients, enabling deeper networks to train efficiently.

3. **Faster Convergence**: By stabilizing the input distribution, BN allows the network to use higher learning rates.

4. **Regularization Effect**: BN introduces noise to the layer’s input during training, acting as a form of regularization and reducing the need for dropout.

### Where to Apply Batch Normalization
- BN is typically applied after the affine transformation (e.g., Wx+b) and before the activation function.
- It is commonly used in fully connected layers, convolutional layers, and sometimes recurrent layers.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, BatchNormalization, Dense, Flatten

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128),
    BatchNormalization(),  # Apply Batch Normalization
    Activation('relu'),    # Activation after BN
    Dense(64),
    BatchNormalization(),  # Apply Batch Normalization
    Activation('relu'),
    Dense(10, activation='softmax')  # Output layer
])
```
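
What the `BatchNormalization` layer computes during training can be sketched for a single feature in plain Python. This is a simplified illustration: `gamma` and `beta` stand for the learnable scale and shift, `eps` is a small constant assumed for numerical stability, and the running statistics used at inference time are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]   # pre-activation values of one unit for 4 samples
normalized = batch_norm(activations)

new_mean = sum(normalized) / len(normalized)
new_var = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(new_mean, new_var)  # approximately 0 and 1
```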

## Dropout Regularization
Dropout is a regularization technique where a fraction of the neurons in a layer are randomly "dropped out" (set to zero) during each training iteration. This prevents the network from becoming overly dependent on specific neurons, leading to better generalization.
###  How Dropout Works
- During training, for each forward pass, a random subset of neurons is ignored.
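- During inference (testing), all neurons are used. In the original formulation, outputs are scaled by the keep probability (1 − rate) at test time. Most libraries, including Keras, instead use "inverted" dropout, which scales the surviving activations by 1 / (1 − rate) during training so that no adjustment is needed at inference.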

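```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dropout(0.2),  # Dropout with rate 0.2
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
])
```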
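
The mechanism behind the `Dropout` layer can be sketched in plain Python. This example implements the "inverted" variant, which rescales the surviving values during training so that inference leaves the values untouched; the function name and the seeded generator are assumptions made for illustration.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: during training, zero each value with probability `rate`
    and scale the survivors by 1 / (1 - rate); at inference, return values unchanged."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)   # fixed seed so the run is repeatable
activations = [1.0] * 10
train_out = dropout(activations, rate=0.2, training=True, rng=rng)
test_out = dropout(activations, rate=0.2, training=False, rng=rng)
print(train_out)  # a mix of 0.0 and 1.25
print(test_out)   # unchanged
```

Scaling by 1 / (1 − rate) keeps the expected value of each activation the same in training and inference.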

## Optimizers
Optimizers are algorithms or methods used to update the weights and biases of a neural network to minimize the loss function.
### SGD:
- Uses a single global learning rate (fixed, or decayed on a schedule) for every parameter.
- Can converge slowly and oscillate, especially without momentum.
- Simple and effective for convex problems, but can struggle with complex deep learning models unless the learning rate is tuned carefully.

### RMSprop:
- Adapts the learning rate of each parameter using a running average of its squared gradients.
- Handles non-stationary objectives well, which often leads to faster convergence.
- Well suited to deep neural networks, where gradient scales differ widely between parameters.
### Adam:
- Combines momentum and adaptive learning rates.
- Often works well with little hyperparameter tuning.
- Versatile and widely used; it combines the strengths of momentum-based SGD and RMSprop, which makes it a common default choice.

### Comparison of SGD, RMSprop, and Adam
| Feature                | SGD                     | RMSprop            | Adam                        |
|------------------------|-------------------------|--------------------|-----------------------------|
| **Learning Rate**      | Fixed or decayed        | Adaptive (per parameter) | Adaptive (per parameter) |
| **Momentum**           | Optional                | Not in the basic algorithm | Yes                  |
| **Handling Gradients** | Uses raw gradients; path can oscillate | Divides by a running average of squared gradients | Combines momentum with RMSprop-style scaling |
| **Performance**        | Often slower            | Often faster in practice | Common default for deep learning |
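
To show how the three update rules differ, here is a sketch of a single update step for one scalar parameter. The hyperparameter defaults (`lr`, `rho`, `beta1`, `beta2`, `eps`) are commonly used values assumed for illustration, and the `state` dictionaries are a simplification of how libraries track optimizer state.

```python
import math

def sgd_update(w, grad, lr=0.01):
    return w - lr * grad

def rmsprop_update(w, grad, state, lr=0.001, rho=0.9, eps=1e-8):
    """state['v'] is the running average of squared gradients."""
    state["v"] = rho * state["v"] + (1.0 - rho) * grad ** 2
    return w - lr * grad / (math.sqrt(state["v"]) + eps)

def adam_update(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """state holds the step count t and the moment estimates m and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad        # momentum term
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2   # squared-gradient term
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# One step on L(w) = w^2 starting from w = 1.0, where the gradient is 2w = 2.0.
w0, g = 1.0, 2.0
w_sgd = sgd_update(w0, g)
w_rms = rmsprop_update(w0, g, {"v": 0.0})
w_adam = adam_update(w0, g, {"t": 0, "m": 0.0, "v": 0.0})
print(w_sgd, w_rms, w_adam)
```

On the first step, Adam's bias correction makes the update size close to `lr` regardless of the gradient's magnitude, while SGD's step scales directly with the gradient.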


# Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a class of deep learning models specifically designed for tasks involving spatial data, such as images and videos.
## Components
### 1. Convolutional Layers
- **Purpose**: Extract features from input images by applying filters (kernels) that slide across the input.
- **Operation**:
    - The filter computes a weighted sum of the pixel values it covers.
    - This produces a "feature map" that highlights certain features of the image (e.g., edges, textures).
- **Key Parameters**:
    - Kernel size: Dimensions of the filter (e.g., 3x3, 5x5).
    - Stride: Step size for the filter's movement.
    - Padding: Adding borders to the input to preserve dimensions.
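
The sliding-window operation and its three parameters can be sketched for a single-channel image in plain Python. This is an illustrative implementation only: it assumes a square kernel and zero padding, and the edge-detecting kernel is a made-up example.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a 2D `image` (list of lists) and return the feature map.
    Like deep learning libraries, this computes cross-correlation (no kernel flip)."""
    if padding:
        width = len(image[0]) + 2 * padding
        zero_row = [0] * width
        image = ([zero_row] * padding
                 + [[0] * padding + list(row) + [0] * padding for row in image]
                 + [zero_row] * padding)
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [[sum(kernel[a][b] * image[i * stride + a][j * stride + b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]

# A 4x4 image whose left half is bright (1) and right half is dark (0).
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
vertical_edge = [[1, -1],
                 [1, -1]]

feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The non-zero column in the feature map marks where the bright and dark halves meet. With a 3x3 kernel, `padding=1` keeps the output the same size as the input.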
### 2. Activation Function
- **Purpose**: Introduce non-linearity after each convolution; ReLU is the usual choice.
### 3. Pooling Layers
- **Purpose**: Downsample the feature maps to reduce dimensionality and computation while retaining important information.
- **Types**:
    - **Max Pooling**: Takes the maximum value in a region.
    - **Average Pooling**: Takes the average value in a region.
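
Both pooling types can be sketched in plain Python. The example assumes non-overlapping 2x2 windows (stride equal to the window size), which is a common setup; `pool2d` is a hypothetical helper name.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling (stride equals window size) over a 2D list."""
    reduce_fn = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [5, 7, 1, 1],
        [0, 2, 4, 8],
        [2, 0, 6, 2]]

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(max_pooled)  # [[7, 2], [2, 8]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 5.0]]
```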
### 4. Fully Connected Layers
- **Purpose**: Combine features extracted by the convolutional layers to perform final predictions.
- The output from the previous layers is flattened and passed through dense layers.
### 5. Dropout (Regularization)
- **Purpose**: Prevent overfitting by randomly deactivating neurons during training.
## Workflow
1. **Input**: An image (e.g., 32x32x3 for a color image with 3 channels: RGB).
2. **Convolution**: Apply filters to extract feature maps.
3. **Activation (ReLU)**: Introduce non-linearity.
4. **Pooling**: Downsample feature maps.
5. **Repeat Steps 2-4**: Extract higher-level features.
6. **Flatten**: Convert the feature maps to a 1D vector.
7. **Fully Connected Layer**: Predict class probabilities or outputs.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
```
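
It can help to trace how the spatial size shrinks through this model. The sketch below assumes the Keras defaults for these layers (stride 1 and no padding for `Conv2D`, stride equal to the window for `MaxPooling2D`); the helper names are made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Non-overlapping pooling with stride equal to the window size."""
    return size // window

side = 32                      # input is 32x32x3
side = conv_out(side, 3)       # Conv2D(32, 3x3)  -> 30x30x32
side = conv_out(side, 3)       # Conv2D(64, 3x3)  -> 28x28x64
side = pool_out(side, 2)       # MaxPooling2D(2)  -> 14x14x64
side = conv_out(side, 3)       # Conv2D(128, 3x3) -> 12x12x128
side = pool_out(side, 2)       # MaxPooling2D(2)  -> 6x6x128
flattened = side * side * 128  # Flatten          -> 4608 values
print(side, flattened)         # 6 4608
```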