# Homework: Deep Learning (week 8)

### Dataset

In this homework, we'll build a model that predicts whether an image shows a bee or a wasp.
For this, we will use the "Bee or Wasp?" dataset, which was obtained from [Kaggle](https://www.kaggle.com/datasets/jerzydziewierz/bee-vs-wasp) and slightly rebuilt.

You can download the dataset for this homework from [here](https://github.com/SVizor42/ML_Zoomcamp/releases/download/bee-wasp-data/data.zip):

```bash
wget https://github.com/SVizor42/ML_Zoomcamp/releases/download/bee-wasp-data/data.zip
unzip data.zip
```

The dataset contains around 2500 images of bees and around 2100 images of wasps, split into separate folders for the training and test sets.
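
Before building the generators, it can help to confirm how many images each split holds. The sketch below counts image files per class subfolder. It assumes the archive unpacks to `data/train` and `data/test` with one subfolder per class, which is the layout `flow_from_directory` expects. The `count_images` helper and the list of file extensions are illustrative assumptions, not part of the homework.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image extensions


def count_images(split_dir):
    """Return {class_name: image_count} for one split directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    for split in ("data/train", "data/test"):  # assumed paths after unzip
        if Path(split).is_dir():
            print(split, count_images(split))
```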

In [4]:
import pickle
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(rescale=1/255).flow_from_directory(
    '/kaggle/input/bee-or-wasp/data/train',
    class_mode='binary',
    batch_size=20,
    target_size=(150, 150)
)

test_gen = ImageDataGenerator(rescale=1/255).flow_from_directory(
    '/kaggle/input/bee-or-wasp/data/test',
    class_mode='binary',
    batch_size=20,
    target_size=(150, 150)
)

Found 3677 images belonging to 2 classes.
Found 918 images belonging to 2 classes.
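
With `batch_size=20`, the image counts reported above determine how many batches make up one pass over each split. A small sketch of that arithmetic follows; `steps_per_epoch` is a hypothetical helper name used here for illustration.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once (last batch may be smaller)."""
    return math.ceil(num_images / batch_size)


train_steps = steps_per_epoch(3677, 20)  # 183 full batches plus one batch of 17
test_steps = steps_per_epoch(918, 20)    # 45 full batches plus one batch of 18
print(train_steps, test_steps)
```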


In [5]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import SGD

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D(),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=SGD(learning_rate=0.002, momentum=0.8), 
              loss='binary_crossentropy', metrics=['accuracy'])

model.summary()  # prints the table itself and returns None, so no print() is needed

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d_1 (Conv2D)           (None, 148, 148, 32)      896       
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 74, 74, 32)        0         
 g2D)                                                            
                                                                 
 flatten_1 (Flatten)         (None, 175232)            0         
                                                                 
 dense_2 (Dense)             (None, 64)                11214912  
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 11215873 (42.79 MB)
Trainable params: 11215873 (42.79 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
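
The shapes and parameter counts in the summary can be reproduced by hand from the layer definitions. The sketch below does this arithmetic, assuming 32-bit floats for the size estimate. The helper names are illustrative only.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights (k*k*in_channels per filter) plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


def dense_params(inputs, units):
    """One weight per input plus one bias, for each unit."""
    return (inputs + 1) * units


conv_side = 150 - 3 + 1                  # 3x3 kernel, no padding, stride 1 -> 148
pool_side = conv_side // 2               # default 2x2 max pooling -> 74
flat_size = pool_side * pool_side * 32   # flattened feature vector length

layer_params = {
    "conv2d": conv2d_params(3, 3, 32),
    "dense_hidden": dense_params(flat_size, 64),
    "dense_output": dense_params(64, 1),
}
total_params = sum(layer_params.values())
size_mb = total_params * 4 / 2**20       # assuming 4 bytes per float32 parameter

print(flat_size, layer_params, total_params, round(size_mb, 2))
```

Almost all of the parameters sit in the first dense layer, because it connects every one of the flattened features to 64 units.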

In [6]:
# Fit the model to the training data
history = model.fit(
    train_gen,
    epochs=10,
    validation_data=test_gen
)

# Save the history to a file
with open('/kaggle/working/history.pkl', 'wb') as file:
    pickle.dump(history.history, file)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
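
The cell above pickles `history.history`, a plain dict of per-epoch lists, rather than the `History` object itself, so the file can be reloaded later without TensorFlow. A minimal round-trip sketch follows. The numbers are placeholders, not results from this run, and the temporary path is an assumption for the example.

```python
import os
import pickle
import tempfile

# Placeholder values shaped like history.history for two epochs
toy_history = {
    "loss": [0.70, 0.60],
    "accuracy": [0.55, 0.65],
    "val_loss": [0.68, 0.62],
    "val_accuracy": [0.57, 0.63],
}

path = os.path.join(tempfile.mkdtemp(), "history.pkl")

with open(path, "wb") as file:
    pickle.dump(toy_history, file)

with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == toy_history)
```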


In [9]:
# Training data augmentation
train_gen_aug = ImageDataGenerator(
    rescale=1/255,
    rotation_range=50,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    fill_mode='nearest'
)

train_gen_aug = train_gen_aug.flow_from_directory(
    '/kaggle/input/bee-or-wasp/data/train',
    class_mode='binary',
    batch_size=20,
    target_size=(150, 150)
)

# Continue training (using augmented data from now on)
history_aug = model.fit(
    train_gen_aug,
    epochs=10,
    validation_data=test_gen
)

# Save the history_aug to a file
with open('/kaggle/working/history_aug.pkl', 'wb') as file:
    pickle.dump(history_aug.history, file)

Found 3677 images belonging to 2 classes.
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
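
Augmentation leaves the labels untouched and only perturbs the pixels, so each epoch sees slightly different versions of the same training images. The test generator only rescales. As a rough illustration of one configured transform, `horizontal_flip=True`, the sketch below mirrors a tiny made-up grayscale image stored as nested lists. This is not how Keras implements it: Keras applies the configured transforms randomly per image when a batch is drawn.

```python
import random


def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def maybe_flip(image, rng, probability=0.5):
    """Flip with the given probability, mimicking a random augmentation."""
    return horizontal_flip(image) if rng.random() < probability else image


toy_image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = horizontal_flip(toy_image)
augmented = maybe_flip(toy_image, random.Random(0))
print(flipped)
```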


In [1]:
import pickle
# Load the histories from a file
with open('history.pkl', 'rb') as file:
    history = pickle.load(file)
with open('history_aug.pkl', 'rb') as file:
    history_aug = pickle.load(file)

In [10]:
import pandas as pd
import numpy as np

history_df = pd.DataFrame(history)
history_aug_df = pd.DataFrame(history_aug)
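
Because the second `fit` call continued training the same model, the two histories can also be read as one 20-epoch run. A sketch of stitching them together on plain dicts follows, with placeholder numbers and a hypothetical `concat_histories` helper.

```python
def concat_histories(first, second):
    """Append the per-epoch lists of `second` to those of `first`, key by key."""
    return {
        key: list(first[key]) + list(second[key])
        for key in first
        if key in second
    }


# Placeholder values, two epochs each
toy_history = {"loss": [0.70, 0.60], "val_loss": [0.68, 0.62]}
toy_history_aug = {"loss": [0.58, 0.55], "val_loss": [0.60, 0.57]}

full_run = concat_histories(toy_history, toy_history_aug)
print(full_run["loss"])
```

With the DataFrames from the cell above, `pd.concat([history_df, history_aug_df], ignore_index=True)` should give the same combined view.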

In [11]:
# Question 3
# What is the median of training accuracy for all the epochs for this model?
history_df.accuracy.median()

0.7652978003025055

In [12]:
# Question 4
# What is the standard deviation of training loss for all the epochs for this model?
history_df.loss.std()

0.09792663242856825
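
Note that pandas `Series.std()` computes the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to the population form (`ddof=0`), so the two give different answers on ten epochs. The standard-library sketch below shows the gap on placeholder loss values. `statistics.stdev` corresponds to the pandas default.

```python
import statistics

# Placeholder per-epoch losses, not taken from this run
losses = [0.68, 0.60, 0.55, 0.52, 0.50]

sample_std = statistics.stdev(losses)       # divides by n - 1, like pandas .std()
population_std = statistics.pstdev(losses)  # divides by n, like numpy.std default

print(sample_std, population_std)
```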

In [13]:
# Question 5
# What is the mean of test loss for all the epochs for the model trained with augmentations?
history_aug_df.val_loss.mean()

0.47552295923233034

In [17]:
# Question 6
# What's the average of test accuracy for the last 5 epochs (from 6 to 10) for the model trained with augmentations?
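# Test accuracy is logged as `val_accuracy`; `accuracy` is the training accuracy,
# whose mean over the same epochs is 0.7917867779731751.
history_aug_df.val_accuracy.loc[5:].mean()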
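
The training log numbers epochs from 1, while the history DataFrame uses a default zero-based index, so `.loc[5:]` picks rows 5 through 9, that is, epochs 6 to 10. The same selection on a plain list, with placeholder values rather than results from this run:

```python
from statistics import mean

# Placeholder per-epoch values; position 0 is epoch 1
val_accuracy = [0.60, 0.62, 0.65, 0.66, 0.68, 0.70, 0.71, 0.72, 0.73, 0.74]

last_five = val_accuracy[5:]       # positions 5..9 -> epochs 6..10
last_five_mean = mean(last_five)

print(len(last_five), last_five_mean)
```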