## Lecture Notes: Pre-trained Models in CNNs

### I. The Need for Pre-trained Models (Transfer Learning Context)

Deep Learning models, especially CNNs used for image classification, require vast amounts of data to perform well; this characteristic is described as **"data hungry"**.

**Challenges of Training Custom Models:**

1.  **Data Acquisition and Labeling:** Obtaining large quantities of image data (e.g., 10,000 photos) is difficult, and manually labeling that data (e.g., marking each photo as a dog or a cat) is a **tedious task**. Manual labeling takes your own time or requires hiring staff, which adds a **financial cost**.
2.  **Training Time:** Training models on huge datasets requires significant time, often taking **hours, days, or even weeks**, making the overall model building process slow.

**Solution: Pre-trained Models**
Pre-trained models are CNN architectures (or other neural network architectures) that someone else has already designed and trained, usually on a large dataset different from your own. The features they learn are general enough that the models can be reused for new problems.

**Benefits of Using Pre-trained Models:**

*   **Saves Training Time:** You do not have to train the model from scratch.
*   **Reduced Data Need:** You do not need to collect a massive dataset. With **zero data**, the model can be used as-is when the classes you care about are already among those it was trained on; with **very little data**, it can be adapted to the new task.

### II. The Foundation: ImageNet Dataset and ILSVRC

The concept of pre-trained models stems from the **ImageNet dataset**.

**ImageNet Dataset Details:**

*   **Definition:** ImageNet is a visual database of images.
*   **Scale:** It contains roughly 14 million images (1.4 crore images, in Indian terms).
*   **Content:** It covers approximately **20,000 categories** of everyday objects, such as cats, dogs, vehicles, tables, and chairs.
*   **Labeling:** The images are well-organized and labeled, often including the specific breed (e.g., a specific breed of dog) and sometimes featuring **bounding box labeling** to show object location (useful for object localization tasks).
*   **Creation Method:** The dataset was labeled through **crowdsourcing**, using Amazon's **Mechanical Turk** service.

**The ImageNet Large Scale Visual Recognition Challenge (ILSVRC)**

*   **Start:** The competition, also known as the ImageNet Challenge, started in 2010.
*   **Goal:** To surface the best image classification models.
*   **Competition Dataset:** A subset of the original ImageNet data was used, containing roughly **1.2 million training images** across **1,000 classes** (down from the 20,000 classes in the full ImageNet dataset).

### III. The Deep Learning Revolution (2010–2015)

The ILSVRC charted the progress of computer vision systems:

| Year | Model Type | Model/Architecture | Top-5 Error Rate | Notes |
| :--- | :--- | :--- | :--- | :--- |
| 2010 | Machine Learning | N/A | **28%** | Models relied on manual feature extraction. |
| 2011 | Machine Learning | N/A | **25%** | Minor improvement. |
| 2012 | Deep Learning (CNN) | **AlexNet** | **16.4%** | This was a revolutionary moment, improving over second place by roughly 10 percentage points. It used ReLU activation and focused the entire tech world's attention on CNNs. |
| 2013 | Deep Learning (CNN) | **ZFNet** | **11.7%** | Continued improvement. |
| 2014 | Deep Learning (CNN) | **VGG (Visual Geometry Group)** | **7.3%** | Runner-up in 2014; became a very famous and widely used architecture. |
| 2014 | Deep Learning (CNN) | **GoogLeNet** | **6.7%** | Winner of the 2014 challenge. |
| 2015 | Deep Learning (CNN) | **ResNet** | **3.6%** | This architecture surpassed the average human error rate, which is typically around 5%. |

**Key Trend:** As the years progressed, researchers consistently added **more layers** to the CNN models, increasing complexity, which resulted in a **reduction in the error rate**.

**AlexNet Architecture (2012 Winner):**

*   Input images were $227 \times 227$ colored images.
*   The first layer used 96 filters of $11 \times 11$ size.
*   It used max pooling with a size of $3 \times 3$ and a stride of 2.
*   The final stage included three fully connected layers; the last one has 1,000 units (one per class).
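
The layer sizes above can be checked with the usual output-size formula for a convolution or pooling window, $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below is only an illustration: the stride of 4 for the first convolution comes from the AlexNet paper rather than from these notes, zero padding is assumed, and `conv_output_size` is a hypothetical helper, not a library function.

```python
def conv_output_size(in_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling window."""
    return (in_size + 2 * padding - kernel) // stride + 1

# First convolution: 227x227 input, 11x11 filters, stride 4 (assumed, from the paper)
conv1 = conv_output_size(227, kernel=11, stride=4)

# Max pooling: 3x3 window, stride 2 (as stated in the notes)
pool1 = conv_output_size(conv1, kernel=3, stride=2)

print(f"conv1 output: {conv1} x {conv1} x 96")
print(f"pool1 output: {pool1} x {pool1} x 96")
```

Under these assumptions the first convolution produces a $55 \times 55 \times 96$ volume, and the pooling layer reduces it to $27 \times 27 \times 96$.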

### IV. Using Pre-trained Models in Keras

The Keras library provides many pre-trained models ready for use.

**Available Models:**

Keras Applications lists various famous pre-trained models, including:

*   **VGG16** (16 layers) and **VGG19** (19 layers)
*   **ResNet50** and **ResNet50 V2**
*   **Xception**
*   **MobileNet** (known for being light-weight)
*   **Inception V3**

**Model Characteristics (VGG16 vs. ResNet50):**

*   **VGG16** is large, around **528 MB**, containing roughly **138.4 million parameters** (weights/numbers stored).
*   **ResNet50** is more lightweight, around **98 MB**, with far fewer parameters (about 25.6 million).
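
The file sizes follow roughly from the parameter counts if each weight is stored as a 32-bit float (4 bytes). The sketch below is back-of-the-envelope arithmetic under that assumption. It uses about 138.4 million parameters for VGG16 and about 25.6 million for ResNet50, the counts listed in the Keras Applications table; `approx_size_mb` is a hypothetical helper written for this illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: every weight is stored as a 32-bit float

def approx_size_mb(num_params: float) -> float:
    """Approximate weight-file size in MB (1 MB taken as 1024 * 1024 bytes)."""
    return num_params * BYTES_PER_FLOAT32 / (1024 * 1024)

vgg16_mb = approx_size_mb(138.4e6)     # parameter count from the Keras table
resnet50_mb = approx_size_mb(25.6e6)   # parameter count from the Keras table

print(f"VGG16:    ~{vgg16_mb:.0f} MB")
print(f"ResNet50: ~{resnet50_mb:.0f} MB")
```

The estimates land close to the quoted 528 MB and 98 MB, which suggests the size difference is almost entirely a matter of how many weights each network stores.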

**Accuracy Metrics:**

*   **Top-1 Accuracy:** The fraction of images for which the model's single highest-scoring prediction is the correct class.
*   **Top-5 Accuracy:** The fraction of images for which the correct class appears among the model's five highest-scoring predictions.
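
As a minimal sketch of how these two metrics differ, the example below computes both on a toy set of scores. The numbers are made up for illustration (three images, six classes), and `top_k_accuracy` is a hypothetical helper, not a Keras function.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, true_label in zip(scores, labels):
        # Class indices sorted from highest to lowest score
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy scores for 3 images over 6 classes (made-up numbers)
scores = [
    [0.05, 0.60, 0.10, 0.10, 0.10, 0.05],  # true class 1 is ranked first
    [0.30, 0.05, 0.40, 0.10, 0.10, 0.05],  # true class 0 is ranked second
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],  # true class 5 is ranked last
]
labels = [1, 0, 5]

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
print(f"top-1 accuracy: {top1:.2f}, top-5 accuracy: {top5:.2f}")
```

The second image counts as a miss for top-1 but a hit for top-5, which is why top-5 accuracy is never lower than top-1 accuracy.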

**Keras Implementation Example (ResNet50):**

1.  **Loading the Model:** Import the model (e.g., `ResNet50`) and load it, specifying the required weights: `weights='imagenet'`. This ensures the model uses the weights obtained after training on the ImageNet dataset.
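2.  **Processing Input:** The image is loaded at the size the model expects ($224 \times 224$ for ResNet50), converted to an array, given a batch dimension, and passed through `preprocess_input`.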
3.  **Prediction:** The processed image is sent to the model for prediction.
4.  **Decoding:** The raw output is decoded to display human-readable predictions (e.g., showing not just 'dog' but 'Labrador Retriever' or 'Golden Retriever').

Using this method, one can create a **"universal classifier"** that predicts the content of various input images (e.g., dog, bread, tomato, chair) without requiring the user to build and train a custom architecture, as long as the object belongs to one of the 1,000 ImageNet classes the model was trained on.

***

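```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.utils import load_img, img_to_array

# Load ResNet50 with the weights learned on ImageNet (downloaded on first use)
model = ResNet50(weights='imagenet')

# Load the image at the 224x224 resolution that ResNet50 expects
img_path = './test_set/test_set/cats/cat.4001.jpg'
img = load_img(img_path, target_size=(224, 224))
x = img_to_array(img)            # shape (224, 224, 3)
x = np.expand_dims(x, axis=0)    # add a batch dimension: (1, 224, 224, 3)
x = preprocess_input(x)          # apply the preprocessing ResNet50 was trained with
```

```
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5
```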


```python
preds = model.predict(x)
# decode_predictions returns, for each sample in the batch, a list of
# (class_id, description, probability) tuples; [0] selects the first sample
print('Predicted:', decode_predictions(preds, top=3)[0])
```

```
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/imagenet_class_index.json
Predicted: [('n03598930', 'jigsaw_puzzle', 0.67379516), ('n02123597', 'Siamese_cat', 0.11601709), ('n04589890', 'window_screen', 0.10092571)]
```

For this cat image the top prediction is `jigsaw_puzzle` (about 0.67), with `Siamese_cat` only second (about 0.12), so a pre-trained model's predictions still need to be checked.
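
Each entry returned by `decode_predictions` is a `(class_id, label, probability)` tuple, so the result can be post-processed with plain Python. The sketch below reuses the three tuples printed above, so it runs without loading the model. The `min_prob` threshold and the `confident_labels` helper are assumptions made for illustration and are not part of Keras.

```python
# The three decoded predictions printed by the notebook cell above
decoded = [
    ('n03598930', 'jigsaw_puzzle', 0.67379516),
    ('n02123597', 'Siamese_cat', 0.11601709),
    ('n04589890', 'window_screen', 0.10092571),
]

def confident_labels(predictions, min_prob=0.5):
    """Return labels whose probability is at least min_prob (hypothetical helper)."""
    return [label for _, label, prob in predictions if prob >= min_prob]

# Print each prediction as a label, a percentage, and its ImageNet class ID
for class_id, label, prob in decoded:
    print(f"{label:<15} {prob:6.1%}  ({class_id})")

labels = confident_labels(decoded)                 # labels at or above 50%
best_label = max(decoded, key=lambda p: p[2])[1]   # label with the highest probability
```

A threshold like this is one simple way to flag low-confidence results for manual review instead of trusting the top label blindly.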
