# Convolutional Neural Networks (CNNs):
- Deep learning models that extract features from images using convolutional layers, followed by pooling and fully connected layers, for tasks such as image classification.
- They excel at capturing spatial hierarchies and patterns, which makes them well suited to analyzing visual data.


# CNN ARCHITECTURE:
1. Convolutional Layer
2. Pooling Layer
3. Flattening
4. Fully Connected Layer: Takes the output of the convolution process and predicts the class of the image from the features extracted in the previous stages.

- Feature extraction: the process of separating the image into features for analysis, using convolution.
- Feature extraction in a CNN aims to reduce the number of features present in the data. It creates new features that summarise the existing features of the original set.
- CNNs are also used for image classification, object detection, and image segmentation.
- Apart from these layers, two more components are important: the dropout layer and the activation function.


#### Sequential() : A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
#### Conv2D : This layer creates a convolution kernel that is convolved with the layer input over two spatial dimensions (height and width) to produce a tensor of outputs.
- A filter (kernel) in a Conv2D layer slides over the 2D input. At each position it multiplies its weights elementwise with the window of input beneath it and sums the products into a single output pixel.
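
The sliding multiply-and-sum described above can be sketched in plain Python. This is a minimal illustration, not how Keras implements `Conv2D`: it assumes a single channel, a stride of 1, and no padding, and it omits the bias term. The function name `conv2d_valid` and the sample values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the output map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Elementwise multiply the kernel with the window under it, then sum.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Hypothetical 4x4 input and 2x2 kernel, chosen only for illustration.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

A 4x4 input with a 2x2 kernel yields a 3x3 output, since the kernel fits in three positions along each axis.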

### Kernel (filter) size:
- The kernel size refers to the height x width of the filter window.



In [50]:
from keras.layers import Dense, Input, Dropout, GlobalAveragePooling2D, Flatten, Conv2D, BatchNormalization, Activation, MaxPooling2D
from keras.models import Model, Sequential
from keras.optimizers import Adam

# Number of output classes (labels)
num_labels = 7

# Create a sequential model
model = Sequential()


#### Determining the number of filters (kernels) in each convolutional layer (general principles and guidelines):
1. Increasing complexity: As you go deeper into the network, the features captured by the filters become more abstract and complex. Increasing the number of filters in deeper layers allows the network to learn more detailed and varied features. Typically, you start with a smaller number of filters (e.g., 32 or 64) and increase it in deeper layers (e.g., 128, 256, 512).
- For a CNN, begin with 32 or 64 filters in the first convolutional layer. These numbers are large enough to capture a variety of features in the initial layers without overburdening the computational resources.
2. Empirical guidelines: Common practice is to start with a base number (such as 32 or 64 filters) and then double the number of filters after each pooling layer or every few convolutional layers.
- Layer 1: 32 filters
- Layer 2: 64 filters
- Layer 3: 128 filters
- Layer 4: 256 filters, etc.
3. Network depth and computational resources: The number of filters is also limited by the computational resources available (e.g., GPU memory). More filters mean more parameters and a higher computational cost. Balance the network's capacity to learn against the available resources to avoid overfitting and keep training efficient.
4. Experimentation and tuning: The optimal number of filters varies with the dataset and task, so it often requires experimentation. Techniques such as hyperparameter optimization and cross-validation can help find the best number of filters for each layer.
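
To make the doubling rule and its cost concrete, the sketch below builds a filter schedule and counts the trainable parameters of each convolutional layer. It assumes 3x3 kernels and a single-channel input; the helper names `filter_schedule` and `conv2d_param_count` are hypothetical, not part of Keras.

```python
def filter_schedule(base_filters, num_layers, max_filters=512):
    """Double the filter count at each layer, capped at `max_filters`."""
    return [min(base_filters * 2 ** i, max_filters) for i in range(num_layers)]


def conv2d_param_count(kernel_h, kernel_w, in_channels, filters):
    """Trainable parameters of one 2D conv layer: weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


schedule = filter_schedule(base_filters=32, num_layers=4)
print(schedule)

in_channels = 1  # assumption: single-channel input
for filters in schedule:
    params = conv2d_param_count(3, 3, in_channels, filters)
    print(f"{in_channels:>3} -> {filters:>3} filters: {params} parameters")
    in_channels = filters  # the next layer sees one channel per filter
```

Because each layer's input channels equal the previous layer's filter count, doubling the filters roughly quadruples the parameters of the following layer, which is why depth and filter count have to be weighed against available memory.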

*Kernel size:*
- Common practice is to use 3x3 kernels.
- Larger kernels such as 5x5 can be used in the initial layers for broader feature extraction.

*For this project, we will start with:*
1. Number of convolutional layers: start with 2-3 convolutional layers for a simpler model and increase as needed based on performance.
2. Number of filters:
- Initial layers: use 32 or 64 filters.
- Intermediate layers: increase to 128 or 256 filters.
- Deeper layers: use 512 or more filters if the network is very deep.
3. Kernel size: 3x3
4. Pooling size: 2x2


### Strides:
- The stride determines how many pixels the kernel shifts over the input at each step.
- E.g. stride = 1 means the dot product is computed on an n x n window of the 2D input, then the kernel shifts by one pixel along each axis for the next operation.
- Smaller stride: neighbouring windows overlap more and the output feature map is larger, so finer spatial detail is kept.
- Larger stride: the output feature map has smaller dimensions.
- Purpose: to control the overlap of receptive fields, reduce the spatial dimensions of the output, and potentially speed up computation.
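
As a rough check on how stride changes the output size, the sketch below applies the standard formula for a convolution without padding, `floor((input - kernel) / stride) + 1`, along one axis. The function name `strided_output_size` is hypothetical; 48 is the image size used in this project and 3 is the kernel size.

```python
def strided_output_size(input_size, kernel_size, stride):
    """Output length along one axis for a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1


# 48 is the image size used in this project; the kernel is 3x3.
for stride in (1, 2, 3):
    size = strided_output_size(48, 3, stride)
    print(f"stride={stride}: output is {size}x{size}")
```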

*When to use stride:*
- In the early layers of a CNN, it's common to use strides of 1 to preserve as much spatial information as possible. These layers typically extract low-level features like edges and textures.
- Strides larger than 1 are often used in the deeper layers of the network, especially after several convolutional operations with a stride of 1. By this stage, you may want to reduce the spatial dimensions of the feature maps to make the network more computationally efficient and to increase the receptive field of the neurons. A common choice for these layers is a stride of 2.

*Use of strides:*

1. Reducing spatial dimensions: when the input image is large, strides can reduce the spatial dimensions more quickly than pooling layers alone.
2. Efficient computation: larger strides reduce the size of the output feature map, leading to fewer computations in subsequent layers. This matters most in deeper networks.
3. Avoiding pooling layers: some architectures, such as some versions of ResNet, use strided convolutional layers instead of pooling layers to reduce dimensions, so the downsampling is learned by the network rather than fixed.

*NOTE* : 
- While strides help in reducing dimensions, excessive downsampling can lead to loss of important spatial information.
- The choice of using strides depends on the overall design and objective of the CNN architecture.


For small images, as in this project (48x48 pixels), strides can be beneficial but should be used carefully:
- Downsampling: with a small image, large strides (e.g. a stride of 2) in the early layers quickly reduce the spatial dimensions, which can lose important information. Use strides cautiously, or rely on pooling layers for downsampling.
- Pooling layers: alternatively, pooling layers (e.g. max pooling) reduce the spatial dimensions while retaining the strongest responses. Pooling also reduces the computational load and helps control overfitting.
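
For reference, 2x2 max pooling keeps only the largest value in each non-overlapping 2x2 window. The sketch below is a plain-Python illustration on a single hypothetical feature map, assuming a pool size of 2 and a stride of 2; `max_pool_2x2` is an illustrative name, not a Keras function.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; an odd trailing row or column is dropped."""
    out_h, out_w = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [
            max(
                feature_map[2 * r][2 * c],
                feature_map[2 * r][2 * c + 1],
                feature_map[2 * r + 1][2 * c],
                feature_map[2 * r + 1][2 * c + 1],
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x4 feature map, chosen only for illustration.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each axis is halved, so a 48x48 map becomes 24x24 after one such pooling step.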

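### Padding:
- Adds extra pixels (usually zeros) around the borders of the input so that the spatial dimensions can be preserved.
- With padding, border pixels are covered by more kernel positions than they would be without it, so border information is better retained. The padding mode also determines the output spatial size of the feature maps.
1. Valid padding (no padding): the kernel is applied only where it fits entirely inside the input, so the output is smaller than the input.
2. Same padding (zero padding): zeros are added around the border so that, with a stride of 1, the output has the same height and width as the input.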
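
The difference between the two modes can be worked out with a little arithmetic. The sketch below assumes a stride of 1 and follows the common convention of splitting the zeros as evenly as possible between the two sides of each axis; the helper names are hypothetical.

```python
def same_padding_amounts(kernel_size):
    """Zeros added before and after one axis so a stride-1 conv keeps its size."""
    total = kernel_size - 1
    before = total // 2
    return before, total - before


def output_size(input_size, kernel_size, padding="valid"):
    """Output length along one axis for a stride-1 convolution."""
    if padding == "same":
        before, after = same_padding_amounts(kernel_size)
        input_size += before + after
    return input_size - kernel_size + 1


for kernel_size in (3, 5):
    print(
        f"kernel {kernel_size}x{kernel_size}:",
        "valid ->", output_size(48, kernel_size, "valid"),
        "| same ->", output_size(48, kernel_size, "same"),
        "| zeros added (before, after):", same_padding_amounts(kernel_size),
    )
```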



*NOTE*:
- For small images (48x48 pixels), same padding can be beneficial to preserve spatial dimensions, particularly in deeper layers where reducing spatial size too quickly can lead to loss of important information.
- In more complex architectures, like those with multiple convolutional layers, same padding helps maintain a consistent feature map size, which can be advantageous for architectural consistency and ease of debugging.


| Hyperparameter       | Typical Range or Value          | Notes                                                          |
|----------------------|---------------------------------|----------------------------------------------------------------|
| Convolutional layers | 3-10 layers                     | Simpler models can start with 2-3 layers; increase as needed   |
| Filters              | 32, 64, 128, 256, 512           | Increase with depth of the network                             |
| Kernel size          | 3x3 (commonly used)             | Can use 5x5 in initial layers for broader feature extraction   |
| Pooling layers       | 2x2 MaxPooling                  | After every few convolutional layers                           |
| Strides              | 1 (default), 2 for downsampling | Use larger strides to reduce spatial dimensions                |


#### Batch normalization:
- Normalizes the activations of the previous layer over each batch, applying a transformation that keeps the mean output close to 0 and the output standard deviation close to 1.
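
The core of that transformation can be sketched for a single feature as follows. This is a simplified illustration only: the Keras `BatchNormalization` layer additionally learns a scale and a shift, and uses moving averages at inference time, all of which are left out here. The function name `batch_normalize` and the sample batch are hypothetical.

```python
import math


def batch_normalize(values, epsilon=1e-5):
    """Normalize a batch of activations to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon guards against division by zero when the variance is tiny.
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one feature
normalized = batch_normalize(batch)
new_mean = sum(normalized) / len(normalized)
new_std = math.sqrt(sum((v - new_mean) ** 2 for v in normalized) / len(normalized))
print(normalized)
print(round(new_mean, 6), round(new_std, 3))
```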

## Dropout Layer:
- A regularization technique used in CNNs (and other deep learning models) to help prevent overfitting.
- Overfitting occurs when a model performs well on the training data but struggles to generalize to unseen data.
- The dropout layer randomly deactivates a fraction of its input units at each training update.
- During forward propagation, each neuron is "dropped out" (temporarily ignored, along with its connections) with a given probability. The remaining activations are scaled by a factor of 1/(1 - dropout_rate) during training so that the expected sum of the inputs is unchanged. At inference time no units are dropped.
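
The drop-and-rescale step can be sketched in a few lines of plain Python. This is an illustration of the idea under simple assumptions (a flat list of activations, a seeded random generator), not the Keras implementation; `inverted_dropout` is a hypothetical name.

```python
import random


def inverted_dropout(activations, rate, rng):
    """Zero each unit with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * scale for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = inverted_dropout(activations, rate=0.25, rng=rng)
print(dropped)
```

With a rate of 0.25, surviving units are multiplied by 1/0.75, which keeps the expected total activation the same as without dropout.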

In [51]:
# 1st convolution block
model.add(Input(shape=(48, 48, 1)))  # 48x48 single-channel input
model.add(Conv2D(36, (3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))  # halves height and width
model.add(Dropout(0.25))  # drops 25% of units during training
model.summary()



In [52]:
# 2nd Convolution layer
model.add(Conv2D(128,(5,5), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# 3rd Convolution layer
model.add(Conv2D(512,(3,3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# 4th Convolution layer
model.add(Conv2D(512,(3,3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

## Flattening : 
- Converts the pooled feature maps (multi-dimensional arrays) into a single long 1D vector.
- The flattened vector is fed as input to the fully connected layer to classify the image.
- Dense layers operate on 1D vectors, so the feature maps produced by the convolutional stack must be reshaped before classification.
- Does not affect the batch size.

- After the previous steps we have a pooled feature map, which we flatten into a single column.
- Reason: this data is fed into an artificial neural network (the dense layers) next.

#### Why convert to 1D?
1. Fully connected layers expect 1D input:
- Fully connected (dense) layers are designed to process vectors, not multi-dimensional tensors. They work like a traditional neural network, where each neuron is connected to every neuron in the previous layer.
- To move from the convolutional/pooling layers (which output multi-dimensional feature maps) to the fully connected layers, we need to flatten the data into a 1D vector.

2. Matrix multiplication:
- A fully connected layer multiplies its input by a weight matrix, so each sample must be a single vector whose length matches the input dimension of that matrix.
- Flattening arranges the multi-dimensional features into a single vector that can be multiplied by the weight matrix of the fully connected layer.
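
To see how long the flattened vector is for this notebook's model, the sketch below traces the feature-map shape through the four convolution blocks. It assumes what the code cells show: a 48x48 input, `padding='same'` convolutions with a stride of 1 (which keep height and width), and 2x2 max pooling after each block. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(input_size, conv_filters, pool_size=2):
    """Spatial size and channels after each 'same'-padded conv + max-pool block."""
    size = input_size
    shapes = []
    for filters in conv_filters:
        # 'same' padding with stride 1 keeps the size; pooling then divides it.
        size = size // pool_size
        shapes.append((size, size, filters))
    return shapes


# Filter counts taken from the four convolution blocks in this notebook.
shapes = trace_shapes(48, [36, 128, 512, 512])
for shape in shapes:
    print(shape)

height, width, channels = shapes[-1]
flattened_length = height * width * channels
print("Flatten output length:", flattened_length)
```

The result can be compared against the output shape that `model.summary()` reports for the `Flatten` layer.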

In [53]:
model.add(Flatten())
model.summary()

### Dropout Layers:
- In trying too hard to learn features from the dataset, a deep neural network sometimes learns the statistical noise in the data. This improves performance on the training set but fails badly on new data points (the test set): overfitting.
- Regularisation techniques that penalise the weights of the network help with this problem, but on their own they are often not enough.

1. Prevents overfitting: reduces the likelihood that the model will memorize the training data, leading to better generalization to new, unseen data.
2. Improves model robustness.
3. Acts as regularization: randomly drops units (along with their connections) during training.
4. Reduces co-adaptation: prevents neurons from forming complex co-adaptations that rely on specific other neurons.







# Fully Connected Layers:


In [54]:
# Fully connected layer 1st layer
model.add(Dense(256))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))

# Fully connected layer 2nd layer
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))


# Output layer:

In [55]:
model.add(Dense(num_labels, activation='softmax'))

# Optimizers:

# Learning rate:
- Q. What is the ideal range?
- Q. How do we decide the value?
- Q. What is the ideal value for a CNN?

# Loss

# Metrics

In [56]:
opt = Adam(learning_rate=0.0001)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
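
For intuition about the `softmax` output and the `categorical_crossentropy` loss used above, here is a plain-Python sketch of both for a single sample. It is an illustration of the standard formulas, not the Keras implementation; the raw scores and the true class are hypothetical values for the 7 labels.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities, epsilon=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(
        t * math.log(max(p, epsilon))
        for t, p in zip(one_hot_target, probabilities)
    )


# Hypothetical raw scores for the 7 classes; class index 2 is the true label.
logits = [0.5, 1.0, 3.0, 0.2, -1.0, 0.0, 0.3]
target = [0, 0, 1, 0, 0, 0, 0]
probs = softmax(logits)
loss = categorical_crossentropy(target, probs)
print(round(sum(probs), 6))
print(loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.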


In [57]:

model.save('/Users/pratiksha/Documents/Pratiksha/Documents/GitHub/GitHub/Face-expression-recognition-with-Deep-Learning/my_model.keras')


*References:*
- https://www.upgrad.com/blog/basic-cnn-architecture/
- https://www.simplilearn.com/tutorials/deep-learning-tutorial/convolutional-neural-network#layers_in_a_convolutional_neural_network
- https://learnopencv.com/understanding-convolutional-neural-networks-cnn/
- https://www.superdatascience.com/blogs/convolutional-neural-networks-cnn-step-3-flattening

In [61]:
from keras.callbacks import ModelCheckpoint

# Number of epochs to train the neural network
epochs = 50

# Save the model whenever validation accuracy improves.
# The metric compiled as 'accuracy' is logged as 'val_accuracy'.
checkpoint = ModelCheckpoint("model_weights.keras", monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

# Fit the model (train_generator and validation_generator come from an earlier cell)
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.n // train_generator.batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_generator.n // validation_generator.batch_size,
    callbacks=callbacks_list
)


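Note: this cell raises `NameError: name 'train_generator' is not defined` unless `train_generator` and `validation_generator` have been created before it runs.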