**D3APL: Aplicações em Ciência de Dados** <br/>
IFSP Campinas

Prof. Dr. Samuel Martins (Samuka) <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

# Animal Dataset - v4
We will evaluate some **multiclass classification** CNNs to predict the classes of the **Animal Dataset**: https://www.kaggle.com/datasets/alessiocorrado99/animals10


Target goals:
- Custom model:
    - VGG16 (with pre-trained weights) for feature extraction
    - SVM for classification

## 1. Set up

### 1.1 TensorFlow

In [None]:
import tensorflow as tf

In [None]:
tf.__version__

**GPU available?**

In [None]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

### 1.2 Allocating memory on demand
By default, `TensorFlow` allocates _GPU memory_ for the **lifetime of the process**, not for the lifetime of the **session object**, so the memory can stay reserved much longer than the object that used it. That is why memory seems to linger after your code has finished running. <br/>
Instead, we can tell `TensorFlow` to allocate **memory on demand**.

Sources: <br/>
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

https://python.tutorialink.com/cuda-error-out-of-memory-python-process-utilizes-all-gpu-memory/ <br/>
https://blog.fearcat.in/a?ID=00950-b4887eea-22e7-4853-b4de-fe746a9e56e6 <br/>
https://stackoverflow.com/a/45553529

In [None]:
gpus = tf.config.list_physical_devices('GPU')

if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

### 1.3 Fixing the seed for reproducibility (optional)
This is an attempt at reproducibility in Keras. See more at:
- https://stackoverflow.com/a/59076062
- https://machinelearningmastery.com/reproducible-results-neural-networks-keras/

In [None]:
import os
import tensorflow as tf
import numpy as np
import random

def reset_random_seeds(seed=42):
    # fix the seed of every random number generator used in this notebook
    os.environ['PYTHONHASHSEED'] = str(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


# reset all seeds before running the experiments
reset_random_seeds()

### 1.4 Dataset
**Animal Dataset**: https://www.kaggle.com/datasets/alessiocorrado99/animals10

In [None]:
import pandas as pd
import numpy as np
import os

In [None]:
# train
dataset_df_train = pd.read_csv('../datasets/animals-dataset/preprocessed/train.csv')

# validation
dataset_df_validation = pd.read_csv('../datasets/animals-dataset/preprocessed/validation.csv')

# test
dataset_df_test = pd.read_csv('../datasets/animals-dataset/preprocessed/test.csv')

## 2. Building and Training a CNN via Keras

### 2.1 Defining the Network Architecture - VGG16

**VGG16 (with pre-trained weights) for feature extraction**

In [None]:
# https://keras.io/api/applications/vgg/
# https://towardsdatascience.com/transfer-learning-with-vgg16-and-keras-50ea161580b4

from tensorflow.keras.applications import VGG16

vgg16 = VGG16(include_top=False,  # drop the top layers, i.e., the MLP classifier of VGG16
              weights="imagenet", # use the weights learned on the ImageNet dataset
              input_shape=(100, 100, 3))  # smaller resolution than in the original paper due to lack of memory

# not strictly necessary, since we will not train this network
vgg16.trainable = False
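
Since the input resolution is smaller than in the original paper, it helps to know how many features the network will output per image. The sketch below is plain arithmetic, not a call into Keras: it assumes that the convolutional layers preserve the spatial size and that each of the five max-pooling stages halves it (rounding down). The function name `vgg16_feature_shape` is made up for this illustration.

```python
def vgg16_feature_shape(height, width, n_pools=5, out_channels=512):
    """Estimate the shape of the VGG16 output when the top layers are removed."""
    for _ in range(n_pools):
        # each 2x2 max pooling with stride 2 halves the spatial size (floor division)
        height //= 2
        width //= 2
    return height, width, out_channels


h, w, c = vgg16_feature_shape(100, 100)
n_features = h * w * c  # number of values per image after flattening

print(f'feature map shape: {(h, w, c)}')  # (3, 3, 512)
print(f'flattened features: {n_features}')  # 4608
```

Each 100x100 image is therefore expected to become a feature vector with 4608 values once the feature map is flattened.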

### 2.2 Preprocessing

- **Image Resizing**
    + Since the **input layer's shape** and the **images' shape** ***are different***, we need to **resize** the images to the **input layer's shape**.
    + Let's use the function `cv2.resize()` for that: https://learnopencv.com/image-resizing-with-opencv/#resize-by-wdith-height
- **Intensity (feature) Scaling**
    + The Animals dataset contains 24-bit color images, i.e., each of the three color channels is an 8-bit grayscale image (values from 0 to 255).
    + We will simply rescale the values to [0, 1] by dividing them by 255.
- **Label Encoder**
    + Encode the string classes into class integers from 0 to n_classes-1
    + https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
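
The scaling and label-encoding steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the arrays and labels are made up, `scale_intensities` is a hypothetical helper, and it is not the implementation of `preprocess_animals_dataset`. Resizing is left out so that the example depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def scale_intensities(images):
    """Rescale 8-bit intensities from [0, 255] to float32 values in [0, 1]."""
    return images.astype(np.float32) / 255.0


# toy stand-ins (hypothetical): one all-black and one all-white 2x2 RGB image
toy_images = np.array([np.zeros((2, 2, 3)), np.full((2, 2, 3), 255)], dtype=np.uint8)
toy_labels = ['dog', 'cat']

# LabelEncoder sorts the class names, so 'cat' -> 0 and 'dog' -> 1
encoder = LabelEncoder().fit(toy_labels)

X_toy = scale_intensities(toy_images)
y_toy = encoder.transform(toy_labels)

print(X_toy.dtype, X_toy.min(), X_toy.max())
print(encoder.classes_, y_toy)
```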

However, the _preprocessed data_ **may not fit into our memory**!!! <br/>
So, we need to deal with that first!
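
To get a feeling for the memory issue, we can estimate how much space a dense array of preprocessed images would take. This is a rough sketch: the image count below is a hypothetical placeholder (not the actual size of the dataset), and we assume the images are stored as `float32` (4 bytes per value) after scaling.

```python
def estimate_array_bytes(n_images, height, width, channels=3, bytes_per_value=4):
    """Bytes needed to store n_images as one dense array (float32 -> 4 bytes per value)."""
    return n_images * height * width * channels * bytes_per_value


n_images = 20_000  # hypothetical count, for illustration only

for dims in [(100, 100), (224, 224)]:
    size_gb = estimate_array_bytes(n_images, *dims) / 1e9
    print(f'{dims}: {size_gb:.2f} GB')
```

Under these assumptions, 100x100 images need about 2.4 GB, while the 224x224 resolution of the original VGG16 would need about 12 GB, which is one reason to work at a smaller resolution.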

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(dataset_df_train['class'])

In [None]:
from animals_utils import preprocess_animals_dataset

X_train, y_train = preprocess_animals_dataset(dataset_df_train, label_encoder, new_dims=(100, 100))

In [None]:
X_test, y_test = preprocess_animals_dataset(dataset_df_test, label_encoder, new_dims=(100, 100))

In [None]:
print(f'X_train.shape: {X_train.shape}')
print(f'y_train.shape: {y_train.shape}\n')

print(f'X_test.shape: {X_test.shape}')
print(f'y_test.shape: {y_test.shape}')

### 2.3 Feature Extraction by VGG16

If a GPU is available, we can monitor its usage with [_gpustat_](https://github.com/wookayin/gpustat).

On terminal, use: `gpustat -cpi`


In [None]:
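# feature extraction by VGG16
# NOTE: `feat_extractor` is not defined earlier in this notebook;
# we assume it is the frozen `vgg16` model built in Section 2.1
feat_extractor = vgg16
X_train = feat_extractor.predict(X_train)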

In [None]:
X_test = feat_extractor.predict(X_test)

In [None]:
print(f'X_train.shape: {X_train.shape}')
print(f'y_train.shape: {y_train.shape}\n')

print(f'X_test.shape: {X_test.shape}')
print(f'y_test.shape: {y_test.shape}')

### 2.4 Training a Linear SVM from the extracted Features
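
A minimal sketch of this step is shown below, assuming scikit-learn's `LinearSVC` as the linear SVM and random arrays as a stand-in for the VGG16 feature maps. The helper names (`flatten_features`, `train_linear_svm`) and the demo data are hypothetical and not part of the original notebook. The feature maps are flattened first because scikit-learn classifiers expect one feature vector per sample.

```python
import numpy as np
from sklearn.svm import LinearSVC


def flatten_features(features):
    """Turn (n, h, w, c) feature maps into (n, h*w*c) feature vectors."""
    return features.reshape(len(features), -1)


def train_linear_svm(features, labels, C=1.0, seed=42):
    """Fit a linear SVM on flattened feature maps."""
    clf = LinearSVC(C=C, random_state=seed, max_iter=5000)
    clf.fit(flatten_features(features), labels)
    return clf


# synthetic stand-in for extracted features: two well-separated classes
rng = np.random.default_rng(42)
n_per_class = 30
feats_a = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 3, 3, 8))
feats_b = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 3, 3, 8))
X_demo = np.concatenate([feats_a, feats_b])
y_demo = np.array([0] * n_per_class + [1] * n_per_class)

svm = train_linear_svm(X_demo, y_demo)
y_demo_pred = svm.predict(flatten_features(X_demo))

print(f'accuracy on the demo data: {(y_demo_pred == y_demo).mean():.2f}')
```

In the notebook, the same idea would be applied to the real features, e.g. `svm = train_linear_svm(X_train, y_train)` followed by `y_test_pred = svm.predict(flatten_features(X_test))`, which gives the predictions evaluated in the next section.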

## 3. Evaluating and Predicting New Samples by using our Overfitted Model

#### **Class Prediction**

In [None]:
from sklearn.metrics import classification_report

class_names = label_encoder.classes_

# `y_test_pred` holds the test-set predictions of the classifier trained in Section 2.4
print(classification_report(y_test, y_test_pred, target_names=list(class_names)))

We got the **best accuracy** so far.

# Exercise

Repeat the experiments considering different classifiers:
- Random Forest
- RBF SVM
- Some other ensemble learning classifier