## Columbia University
### ECBM E4040 Neural Networks and Deep Learning. Fall 2022.

# ECBM E4040 - Assignment 2 - Task 5: Kaggle Open-ended Competition

Kaggle is a platform for predictive modelling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data.

If you don't have a Kaggle account, feel free to join at [www.kaggle.com](https://www.kaggle.com). To let the CAs do the grading more conveniently, please __use Lionmail to join Kaggle__ and __use UNI as your username__.

The competition is located here: https://www.kaggle.com/t/d908ef03b7244102a1e006516a6555a6

You can find a detailed description of this in-class competition on the website above. Please read it carefully and follow the instructions.

<span style="color:red">__TODO__:</span>

- Train a custom model for the bottle dataset classification problem. You are free to use any methods taught in class or found on the Internet (ALWAYS provide a reference to the source). General training methods include:
  - Dropout
  - Batch normalization
  - Early stopping
  - l1-norm & l2-norm penalization
- You are given the test set to generate your predictions (70% public + 30% private, but you don't know which ones are public/private). Students should achieve an accuracy of at least 70% on the public test set. Two points will be deducted for each 1% below the 70% accuracy threshold (e.g. 65% accuracy will have 10 points deducted). The accuracy will be shown on the public leaderboard once you submit your prediction .csv file. The private leaderboard will be released after the competition. The final ranking is based on the private leaderboard result, not the public leaderboard.
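
The deduction rule above can be checked with a few lines of arithmetic. This is only a minimal sketch: the function name `accuracy_deduction` is hypothetical, and the linear proration of fractional shortfalls is an assumption, since the assignment text gives whole-percent examples only.

```python
def accuracy_deduction(public_accuracy_pct, threshold_pct=70.0, points_per_pct=2.0):
    """Points deducted for falling below the public-leaderboard threshold.

    Assumption: fractional shortfalls are prorated linearly; the assignment
    text only describes whole-percent steps.
    """
    shortfall = max(0.0, threshold_pct - public_accuracy_pct)
    return points_per_pct * shortfall


print(accuracy_deduction(65.0))  # 10.0, the example given in the text
print(accuracy_deduction(72.5))  # 0.0, no deduction above the threshold
```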


NOTE: 
* Report your results on Kaggle for comparison with other students' optimal results (you can do this several times).
* Save your best model.
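
Early stopping, listed among the training methods above, pairs naturally with saving the best model: track the best validation score and stop once it has not improved for a set number of epochs. The sketch below is framework-independent. The class name `EarlyStopper`, the `patience` value, and the validation accuracies are illustrative assumptions, not part of the assignment code.

```python
class EarlyStopper:
    """Track the best validation accuracy and signal when to stop."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum gain that counts as an improvement
        self.best_score = float("-inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_accuracy):
        """Record one epoch; return True if training should stop."""
        if val_accuracy > self.best_score + self.min_delta:
            # New best: this is the point where the model would be saved
            self.best_score = val_accuracy
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation accuracies, for illustration only
val_history = [0.52, 0.61, 0.66, 0.65, 0.66, 0.64, 0.63]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_acc in enumerate(val_history, start=1):
    if stopper.update(epoch, val_acc):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopper.best_score, stopped_at)
```

In a real training loop, `val_accuracy` would come from a held-out portion of the training data rather than from a fixed list.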

__Hint__: You can start from what you implemented in task 4. Students are allowed to use pretrained networks and transfer learning.

## Useful Information: 

- Unzip zip files in GCP or acquire administrator permission to install other applications: When you upload your dataset to your VM instance, you may want to unzip your files. However, the `unzip` command is not built in. To run `sudo apt install unzip`, or to install other applications in the future, you need to:
  - Change the username to default (or just restart the VM instance)
  - Type `sudo su` to get root
  - You can then drop `sudo` from the subsequent installation commands (e.g. `apt install unzip`).
- If the kernel crashes (or a cell never finishes running), consider using a machine with more memory. Especially if you include a large network structure like VGG, 15 GB of memory or more is recommended.
- Some Python libraries that you might need to install first: pandas, scikit-learn. There are **2 OPTIONS** for installing them:
  - In the envTF24 environment in the Linux terminal, type: `pip install [package name]`
  - In the Jupyter notebook (i.e. this file), type `!pip install [package name]`. It is best to restart the virtual environment, or even the instance, so that these packages become functional.
- You might need extra pip libraries to handle the dataset, include networks, etc. You can install them with either of the two options in the previous item.
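
Before installing anything, it can help to check which packages are already importable in the current environment. The following is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# The import name can differ from the pip name: scikit-learn is imported as sklearn.
print(missing_packages(["pandas", "sklearn"]))
```

Any name printed by the last line still needs to be installed with one of the two `pip` options above.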

## HW Submission Details:

There are two components to reporting the results of this task: 

**(A) Submission (up to 20 submissions each day) of the .csv prediction file through the Kaggle platform**. You should start doing this __VERY early__, so that students can compare their work as they are making progress with model optimization.

**(B) Submitting your best CNN model through the GitHub Classroom repo.**

**Note** that assignments are submitted through GitHub Classroom only. All code for training your Kaggle model should be written in this task 5 Jupyter notebook, or in a user-defined module (.py file) that is imported for use in the Jupyter notebook.

<span style="color:red">__Submission content:__</span>

(i) In your Assignment 2 submission folder, create a subfolder called __KaggleModel__. Save your best model using `model.save()`. This will generate a `saved_model.pb` file, a folder called `variables`, and a folder called `checkpoints` all inside the __KaggleModel__ folder. Only upload your best model to GitHub classroom. 

(ii) <span style="color:red">If your saved model exceeds 100 MB, do not upload it to GitHub classroom (.gitignore it or you will get an error when pushing).</span> Upload it instead to Google Drive and explicitly provide the link under the 'Save your best model' cell. 

(iii) Remember to delete any intermediate results; we only want your best model. Do not upload any data files. The instructors will rerun the uploaded best model and verify it against the score that you reported on Kaggle.
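
To decide between GitHub and Google Drive, the size of the saved model folder can be measured before committing. This is a rough sketch: the helper name `directory_size_mb` is hypothetical, and it assumes the 100 MB limit is compared against the total size of the folder (with 1 MB taken as 10^6 bytes), which the instructions do not spell out.

```python
import os


def directory_size_mb(path):
    """Total size of all files under path, in MB (1 MB = 10**6 bytes here)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1e6


model_dir = "./KaggleModel"  # folder name taken from the instructions above
if os.path.isdir(model_dir):
    size = directory_size_mb(model_dir)
    print("{}: {:.1f} MB".format(model_dir, size))
    if size > 100:
        print("Too large for GitHub; upload to Google Drive instead.")
```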

**The top 10 final submissions of the Kaggle competition will receive up to 10 bonus points proportional to the private test accuracy.**

## Load Data

In [1]:
#Generate dataset
import os
import pandas as pd
import numpy as np
from PIL import Image


#Load Training images and labels
train_directory = "./data/train_128" #TODO: Enter path for train128 folder (hint: use os.getcwd())
image_list=[]
label_list=[]
for sub_dir in os.listdir(train_directory):
    print("Reading folder {}".format(sub_dir))
    sub_dir_name=os.path.join(train_directory,sub_dir)
    for file in os.listdir(sub_dir_name):
        filename = os.fsdecode(file)
        if filename.endswith(".jpg") or filename.endswith(".png"):
            image_list.append(np.array(Image.open(os.path.join(sub_dir_name,file))))
            label_list.append(int(sub_dir))
X_train=np.array(image_list)
y_train=np.array(label_list)

#Load Test images
test_directory = "./data/test_128" #TODO: Enter path for test128 folder (hint: use os.getcwd())
test_rows=[]
print("Reading Test Images")
for file in os.listdir(test_directory):
    filename = os.fsdecode(file)
    if filename.endswith(".jpg") or filename.endswith(".png"):
        # Collect rows in a list first: DataFrame.append was removed in pandas 2.0
        test_rows.append({
            'Id': filename,
            'X': np.array(Image.open(os.path.join(test_directory,file)))
        })
test_df = pd.DataFrame(test_rows, columns=['Id', 'X'])

# Sort numerically by file name so that row i of X_test is image "i.<ext>"
test_df['s'] = [int(x.split('.')[0]) for x in test_df['Id']]
test_df = test_df.sort_values(by=['s'])
test_df = test_df.drop(columns=['s'])
X_test = np.stack(test_df['X'])


print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)

Reading folder 1
Reading folder 3
Reading folder 0
Reading folder 2
Reading folder 4
Reading Test Images
Training data shape:  (15000, 128, 128, 3)
Training labels shape:  (15000,)
Test data shape:  (3500, 128, 128, 3)
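
The shapes printed above give a rough estimate of how much memory the arrays need, which is useful when diagnosing kernel crashes. The sketch assumes the images load as 8-bit integers (the usual result of `np.array(Image.open(...))` for RGB files) and shows the effect of casting to float32; the helper name `array_megabytes` is hypothetical.

```python
def array_megabytes(shape, bytes_per_element):
    """Memory needed by a dense array of the given shape, in MB (10**6 bytes)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 1e6


train_shape = (15000, 128, 128, 3)  # from the output above
test_shape = (3500, 128, 128, 3)

print(array_megabytes(train_shape, 1))  # training images as uint8
print(array_megabytes(train_shape, 4))  # same images after a float32 cast
print(array_megabytes(test_shape, 4))   # test images as float32
```

Under these assumptions the training set takes about 737 MB as uint8 and about 2.9 GB as float32, before counting any copies made during augmentation or batching.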


## Build and Train Your Model Here

In [8]:
# YOUR CODE HERE
import tensorflow as tf
import datetime
import numpy as np
from tensorflow.keras.layers import Dense, Flatten, Conv2D, AveragePooling2D, Dropout, BatchNormalization, MaxPooling2D
from tensorflow.keras import Model
from utils.image_generator import ImageGenerator
from tqdm import tqdm

class MyModel(Model):

    def __init__(self, input_shape, output_size=5):
       
        super(MyModel, self).__init__()
        # For example:
        
        # input_shape is only meaningful for the first layer
        self.conv_layer_1 = Conv2D(filters=32, kernel_size=(3, 3), strides=(1,1), activation='relu', input_shape=input_shape, padding="same")
        self.maxpool_layer_1 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')
        self.conv_layer_2 = Conv2D(filters=64, kernel_size=(3, 3), strides=(1,1), activation='relu', padding="same")
        self.maxpool_layer_2 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')
        self.conv_layer_3 = Conv2D(filters=128, kernel_size=(3, 3), strides=(1,1), activation='relu', padding="same")
        self.maxpool_layer_3 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')
        self.batch_norm_1 = BatchNormalization()
        self.conv_layer_4 = Conv2D(filters=256, kernel_size=(3, 3), strides=(1,1), activation='relu', padding="same")
        self.maxpool_layer_4 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')
        self.batch_norm_2 = BatchNormalization()
        self.drop_out_1 = Dropout(0.5)
        self.flatten_layer = Flatten()
        self.fc_layer_1 = Dense(256, activation='relu')
        self.drop_out_2 = Dropout(0.5)
        self.fc_layer_2 = Dense(128,activation='relu')
        self.fc_layer_3 = Dense(output_size, activation='softmax')      
        
    def call(self, x):
        
        x = self.conv_layer_1(x)
        x = self.maxpool_layer_1(x)
        x = self.conv_layer_2(x)
        x = self.maxpool_layer_2(x)
        x = self.conv_layer_3(x)
        x = self.maxpool_layer_3(x)
        x = self.batch_norm_1(x)
        x = self.conv_layer_4(x)
        x = self.maxpool_layer_4(x)
        x = self.batch_norm_2(x)
        x = self.drop_out_1(x)
        x = self.flatten_layer(x)
        x = self.fc_layer_1(x)
        x = self.drop_out_2(x)
        x = self.fc_layer_2(x)
        out = self.fc_layer_3(x)
        
        return out


class My_trainer():
       
    def __init__(self,X_train, y_train, X_test,epochs=10, batch_size=256, lr=1e-3):
        self.X_train = X_train.astype("float32")
        self.y_train = y_train.astype("float32")
        self.X_test = X_test.astype("float32")
        self.epochs = epochs
        self.batch_size = batch_size
        self.lr = lr

    # Initialize MyModel
    def init_model(self):
        self.model = MyModel(self.X_train[0].shape)

    #initialize loss function and metrics to track over training
    def init_loss(self):
        self.loss_function = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

        self.train_loss = tf.keras.metrics.Mean(name='train_loss')
        self.train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

    # Initialize optimizer
    def init_optimizer(self):
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=self.lr)

    # Prepare batches of train data using ImageGenerator
    def batch_train_data(self, shuffle=True):
        train_data = ImageGenerator(self.X_train, self.y_train)
        self.train_data_next_batch = train_data.next_batch_gen(self.batch_size,shuffle=shuffle)
        self.n_batches = train_data.N_aug // self.batch_size
    
    # Define training step
    def train_step(self, images, labels, training=True):
        with tf.GradientTape() as tape:
            # Pass training=True while training: some layers (e.g. Dropout,
            # BatchNormalization) behave differently in training and inference.
            predictions = self.model(images, training=training)
            loss = self.loss_function(labels, predictions)
        gradients = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))

        self.train_loss(loss)
        self.train_accuracy(labels, predictions)

    # train epoch
    def train_epoch(self, epoch):
        self.train_loss.reset_states()
        self.train_accuracy.reset_states()
        for _ in tqdm(range(self.n_batches)):
            x_batch,y_batch = next(self.train_data_next_batch)
            self.train_step(x_batch,y_batch)

        template = 'Loss: {}, Accuracy: {} '
        print(template.format(self.train_loss.result(),
                            self.train_accuracy.result() * 100))
                            
            
    # start training
    def run(self):
        self.init_model()
        self.init_loss()
        self.init_optimizer()
        self.batch_train_data()

        for epoch in range(self.epochs):
            print('Training Epoch {}'.format(epoch + 1))
            self.train_epoch(epoch)
            
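    def predict(self):
        # Call the model object (not model.call) with training=False so that
        # Dropout and BatchNormalization run in inference mode
        predictions = self.model(self.X_test, training=False)
        predictions = np.argmax(predictions,axis =1)

        return predictions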
    
Trainer = My_trainer(X_train,y_train,X_test,epochs= 30,batch_size=32,lr = 0.0005)
Trainer.run()


Training Epoch 1


100%|██████████| 468/468 [00:17<00:00, 26.81it/s]


Loss: 1.4969576597213745, Accuracy: 96.1872329711914 
Training Epoch 2


100%|██████████| 468/468 [00:18<00:00, 25.72it/s]


Loss: 2.141601324081421, Accuracy: 25.88140869140625 
Training Epoch 3


100%|██████████| 468/468 [00:18<00:00, 25.89it/s]


Loss: 1.4605143070220947, Accuracy: 31.477031707763672 
Training Epoch 4


100%|██████████| 468/468 [00:18<00:00, 25.93it/s]


Loss: 1.3375842571258545, Accuracy: 39.83039474487305 
Training Epoch 5


100%|██████████| 468/468 [00:18<00:00, 25.82it/s]


Loss: 1.194035530090332, Accuracy: 48.103633880615234 
Training Epoch 6


100%|██████████| 468/468 [00:18<00:00, 25.73it/s]


Loss: 1.0511714220046997, Accuracy: 56.52376937866211 
Training Epoch 7


100%|██████████| 468/468 [00:18<00:00, 25.52it/s]


Loss: 0.9186573624610901, Accuracy: 64.44978332519531 
Training Epoch 8


100%|██████████| 468/468 [00:18<00:00, 25.66it/s]


Loss: 0.7618063688278198, Accuracy: 72.32906341552734 
Training Epoch 9


100%|██████████| 468/468 [00:18<00:00, 25.83it/s]


Loss: 0.6335304975509644, Accuracy: 77.57746124267578 
Training Epoch 10


100%|██████████| 468/468 [00:18<00:00, 25.87it/s]


Loss: 0.5153590440750122, Accuracy: 82.1380844116211 
Training Epoch 11


100%|██████████| 468/468 [00:18<00:00, 25.93it/s]


Loss: 0.43831667304039, Accuracy: 85.16960906982422 
Training Epoch 12


100%|██████████| 468/468 [00:18<00:00, 25.95it/s]


Loss: 0.3818196952342987, Accuracy: 87.44657897949219 
Training Epoch 13


100%|██████████| 468/468 [00:18<00:00, 25.92it/s]


Loss: 0.326097697019577, Accuracy: 89.20272827148438 
Training Epoch 14


100%|██████████| 468/468 [00:18<00:00, 25.87it/s]


Loss: 0.28427767753601074, Accuracy: 90.83867645263672 
Training Epoch 15


100%|██████████| 468/468 [00:18<00:00, 25.94it/s]


Loss: 0.25318560004234314, Accuracy: 92.14075469970703 
Training Epoch 16


100%|██████████| 468/468 [00:18<00:00, 25.93it/s]


Loss: 0.23114848136901855, Accuracy: 92.8619155883789 
Training Epoch 17


100%|██████████| 468/468 [00:18<00:00, 25.89it/s]


Loss: 0.18989714980125427, Accuracy: 93.85015869140625 
Training Epoch 18


100%|██████████| 468/468 [00:18<00:00, 25.88it/s]


Loss: 0.18283507227897644, Accuracy: 94.14396667480469 
Training Epoch 19


100%|██████████| 468/468 [00:18<00:00, 25.89it/s]


Loss: 0.16037869453430176, Accuracy: 94.90518188476562 
Training Epoch 20


100%|██████████| 468/468 [00:18<00:00, 25.88it/s]


Loss: 0.14772939682006836, Accuracy: 95.30582427978516 
Training Epoch 21


100%|██████████| 468/468 [00:18<00:00, 25.83it/s]


Loss: 0.13377350568771362, Accuracy: 95.69978332519531 
Training Epoch 22


100%|██████████| 468/468 [00:18<00:00, 25.95it/s]


Loss: 0.135394886136055, Accuracy: 95.73985290527344 
Training Epoch 23


100%|██████████| 468/468 [00:18<00:00, 25.91it/s]


Loss: 0.11268161237239838, Accuracy: 96.46768188476562 
Training Epoch 24


100%|██████████| 468/468 [00:18<00:00, 25.89it/s]


Loss: 0.10865306854248047, Accuracy: 96.45433044433594 
Training Epoch 25


100%|██████████| 468/468 [00:18<00:00, 25.78it/s]


Loss: 0.09885963797569275, Accuracy: 96.75480651855469 
Training Epoch 26


100%|██████████| 468/468 [00:18<00:00, 25.94it/s]


Loss: 0.08907879143953323, Accuracy: 97.14209747314453 
Training Epoch 27


100%|██████████| 468/468 [00:18<00:00, 25.82it/s]


Loss: 0.08400726318359375, Accuracy: 97.23558044433594 
Training Epoch 28


100%|██████████| 468/468 [00:18<00:00, 25.80it/s]


Loss: 0.08263134211301804, Accuracy: 97.31570434570312 
Training Epoch 29


100%|██████████| 468/468 [00:18<00:00, 25.88it/s]


Loss: 0.0834260955452919, Accuracy: 97.5494155883789 
Training Epoch 30


100%|██████████| 468/468 [00:18<00:00, 25.95it/s]

Loss: 0.07064913213253021, Accuracy: 97.80315399169922 
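
The log above reports accuracy on the training data only, which says little about how the model will score on the Kaggle test set. One common way to estimate that, and to drive early stopping, is to hold out part of the training set. The sketch below builds a class-balanced index split using only the standard library; the function name `stratified_split_indices`, the 10% validation fraction, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, val_fraction=0.1, seed=0):
    """Split sample indices into train/validation lists, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_val = int(round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Toy labels: 5 classes with 20 samples each, ordered by class like the folders
toy_labels = [c for c in range(5) for _ in range(20)]
train_idx, val_idx = stratified_split_indices(toy_labels, val_fraction=0.1)
print(len(train_idx), len(val_idx))
```

With the arrays from the data-loading cell, `X_train[val_idx]` and `y_train[val_idx]` would give the held-out images and labels, and the remaining indices would be passed to the trainer.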





## Save your best model

**Link to large model on Google Drive: [insert link here]** (if model exceeds 100 MB, else upload to GitHub classroom)

In [9]:
# YOUR CODE HERE
Trainer.model.save(filepath = "./model/task5_model")

INFO:tensorflow:Assets written to: ./model/task5_model/assets


## Generate .csv file for Kaggle

The following code snippet can be an example used to generate your prediction .csv file.

NOTE: If your Kaggle results indicate random performance, it's likely that the indices of your csv predictions are misaligned.
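
One likely source of such misalignment is file ordering: sorting file names as text puts `10.png` before `2.png`. The data-loading cell avoids this by sorting on the integer part of the name, and the CSV cell below relies on that order when it writes `str(index) + '.png'`. A minimal illustration follows; the helper name `numeric_order` is hypothetical.

```python
def numeric_order(filenames):
    """Sort names like '12.png' by their integer stem rather than as text."""
    return sorted(filenames, key=lambda name: int(name.split('.')[0]))


names = ['10.png', '2.png', '0.png', '1.png']
print(sorted(names))         # ['0.png', '1.png', '10.png', '2.png']
print(numeric_order(names))  # ['0.png', '1.png', '2.png', '10.png']
```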

In [10]:
import csv
with tf.device('/CPU:0'):
    predictions = Trainer.predict()
with open('./model/task5_model/predictions_task_5.csv','w', newline='') as csvfile:
    fieldnames = ['Id','label']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for index,l in enumerate(predictions):
        filename = str(index) + '.png'
        label = str(l)
        writer.writerow({'Id': filename, 'label': label})