### Capstone 2 - Bengali Grapheme Classification Project

This workbook is my first attempt at experimenting with a CNN for this project. I've followed this [starter code](https://www.kaggle.com/kaushal2896/bengali-graphemes-starter-eda-multi-output-cnn) for ideas.

This is also my first attempt at using MLflow for tracking deep learning experiments.

In [1]:
#importing necessary packages
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn
from collections import defaultdict
import mlflow

Although my virtual environment uses tensorflow-gpu, I want to make doubly sure that Keras actually sees my GPU.

In [2]:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()

Using TensorFlow backend.


['/job:localhost/replica:0/task:0/device:GPU:0']
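The check above goes through a private Keras helper (`_get_available_gpus`), which may not exist in other Keras versions. A minimal sketch of an equivalent check that asks TensorFlow directly is shown here; the function name `list_gpus` is made up for illustration, and the import is deferred so that defining the function does not require TensorFlow to be loaded.

```python
def list_gpus():
    """Return the names of the GPU devices that TensorFlow can see."""
    # Imported lazily; this module lives under TensorFlow's internal namespace,
    # so treat it as a convenience rather than a stable public API.
    from tensorflow.python.client import device_lib

    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']
```

An empty list returned by `list_gpus()` would suggest that training will fall back to the CPU.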

In [3]:
# using custom scripts for creating model and loading data
import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, './model/')

In [4]:
from model_creator import model_create
from data_loader import data_loader
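The `data_loader` script itself is not shown in this notebook. Judging only from how it is called later (it returns train/validation splits for the images and the three label sets), a rough sketch of what such a loader could look like is given below. Everything in it is an assumption: the name `sketch_data_loader`, the choice to pass DataFrames instead of a parquet path, the scaling by 255, the one-hot encoding, and the 10% hold-out (which is at least consistent with the 45189/5021 split reported in the training log).

```python
import numpy as np
import pandas as pd

# Number of classes per target in the Bengali.AI grapheme dataset.
N_ROOT, N_VOWEL, N_CONSONANT = 168, 11, 7
HEIGHT, WIDTH = 137, 236


def one_hot(labels, num_classes):
    """Return a (n_samples, num_classes) float32 one-hot matrix."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels, dtype=int)]


def sketch_data_loader(img_df, train_df, val_fraction=0.1, seed=0):
    """Hypothetical stand-in for data_loader (not the author's code)."""
    label_cols = ['grapheme_root', 'vowel_diacritic', 'consonant_diacritic']
    # attach the three labels to each image row via the shared image_id column
    merged = img_df.merge(train_df[['image_id'] + label_cols], on='image_id')
    pixels = merged.drop(columns=['image_id'] + label_cols).to_numpy(dtype=np.float32)
    x = pixels.reshape(-1, HEIGHT, WIDTH, 1) / 255.0  # assumed scaling to [0, 1]

    # shuffle once, then hold out the first n_val shuffled rows for validation
    order = np.random.default_rng(seed).permutation(len(x))
    n_val = int(round(len(x) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]

    y_root = one_hot(merged['grapheme_root'], N_ROOT)
    y_vowel = one_hot(merged['vowel_diacritic'], N_VOWEL)
    y_cons = one_hot(merged['consonant_diacritic'], N_CONSONANT)

    # same return order as the call in the training cell
    return (x[train_idx], x[val_idx],
            y_root[train_idx], y_root[val_idx],
            y_cons[train_idx], y_cons[val_idx],
            y_vowel[train_idx], y_vowel[val_idx])
```

The return order mirrors the unpacking used in the training loop (root, then consonant, then vowel), which is easy to get wrong when the outputs are later passed to `model.fit` by name.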

In [5]:
# read the csv files into a dict keyed by file name
filenames = ['train','test','class_map','class_map_corrected','train_multi_diacritics','sample_submission']
df_dict = {}  # a plain dict is enough here; defaultdict() without a factory behaves the same

for file in filenames:
    df_dict[file] = pd.read_csv('./data/{}.csv'.format(file))

In [6]:
# using mlflow autologger to track models, artifacts and parameters
import mlflow.keras
mlflow.keras.autolog()

In [8]:
#creating mlflow experiment
mlflow.create_experiment(name="conv_kernel_size_6_pool_size_3_default_arch")

'1'
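Note that `mlflow.create_experiment` only registers the experiment and returns its ID (`'1'` here); it does not make that experiment the active one, so a bare `mlflow.start_run()` still logs to the default experiment. A minimal sketch of one way to route runs to a named experiment follows. The helper name `start_run_in_experiment` is hypothetical, and the import is deferred so the function can be defined without MLflow loaded.

```python
def start_run_in_experiment(experiment_name):
    """Activate the named experiment and start a run inside it."""
    import mlflow  # imported lazily

    # set_experiment creates the experiment if it does not exist yet
    # and makes it the active one for subsequent runs
    mlflow.set_experiment(experiment_name)
    return mlflow.start_run()
```

It could be used as `with start_run_in_experiment("conv_kernel_size_6_pool_size_3_default_arch"): ...`. Passing the ID returned by `create_experiment` to `mlflow.start_run(experiment_id=...)` should have the same effect.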

In [9]:
# this version of model_create builds three alternating conv/max-pool layers followed by dropout,
# then two dense layers with a dropout in between, and three output layers

# to do: add batch normalization; expose optimizer and initializer as parameters
model = model_create(conv_kernel_size=6,pool_size=3)

In [10]:
#view summary of model created
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_layer (InputLayer)        (None, 137, 236, 1)  0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 137, 236, 64) 2368        input_layer[0][0]                
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 45, 78, 64)   0           conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 45, 78, 64)   147520      max_pooling2d_1[0][0]            
____________________________________________________________________________________________
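The numbers in the (truncated) summary above can be checked by hand. The sketch below assumes what the shapes suggest: the convolutions use `same` padding (the spatial size is unchanged), and the pooling layers use the Keras defaults of `valid` padding with a stride equal to the pool size. The helper names are made up for this example.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights (k * k * in * out) plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


def pooled_size(size, pool_size):
    """Output length of a 'valid' pooling layer whose stride equals pool_size."""
    return (size - pool_size) // pool_size + 1


print(conv2d_params(6, 1, 64))                    # conv2d_1: 2368
print(conv2d_params(6, 64, 64))                   # conv2d_2: 147520
print(pooled_size(137, 3), pooled_size(236, 3))   # max_pooling2d_1: 45 78
```

These match the `Param #` and `Output Shape` columns for `conv2d_1`, `conv2d_2`, and `max_pooling2d_1`.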

In [11]:
#specify hyper-parameters
batch_size = 256
epochs = 50

In [12]:
# note: this run was not saved under the experiment created above; it was logged under the default experiment instead

# to do: investigate why
with mlflow.start_run():
    for i in range(4):
        #iterate through all 4 training parquet files 
        print('Reading parquet file #{}'.format(i+1))
        print('---------------------------------------')
        print('Transforming data for parquet file #{}'.format(i+1))
        print('---------------------------------------')
        x_train, x_test, y_train_root, y_test_root, y_train_consonant, y_test_consonant, y_train_vowel, y_test_vowel=data_loader('./data/train_image_data_{}.parquet'.format(i),df_dict['train'])
        print('Training model on parquet file #{}'.format(i+1))
        print('---------------------------------------')
        history=model.fit(x=x_train, y={"output_root":y_train_root,
                                    "output_vowel":y_train_vowel,"output_consonant":y_train_consonant},
                      epochs=epochs,batch_size=batch_size,
                      validation_data=(x_test,{"output_root":y_test_root,"output_vowel":y_test_vowel,
                                               "output_consonant":y_test_consonant}))
        print('Deleting variables after training')
        del x_train, x_test, y_train_root, y_test_root, y_train_consonant, y_test_consonant, y_train_vowel, y_test_vowel


Reading parquet file #1
---------------------------------------
Transforming data for parquet file #1
---------------------------------------
Training model on parquet file #1
---------------------------------------
Train on 45189 samples, validate on 5021 samples
Epoch 1/50
...
Epoch 50/50
Deleting variables after training
Reading parquet file #2
---------------------------------------
Transforming data for parquet file #2
---------------------------------------
Training model on parquet file #2
---------------------------------------
Train on 45189 samples, validate on 5021 samples
Epoch 1/50
...
Epoch 50/50
Deleting variables after training
Reading parquet file #3
---------------------------------------
Transforming data for parquet file #3
---------------------------------------
Training model on parquet file #3
---------------------------------------
Train on 45189 samples, validate on 5021 samples
Epoch 1/50
...
Epoch 50/50
Deleting variables after training
Reading parquet file #4
---------------------------------------
Transforming data for parquet file #4
---------------------------------------
Training model on parquet file #4
---------------------------------------
Train on 45189 samples, validate on 5021 samples
Epoch 1/50
...
Epoch 50/50
Deleting variables after training


In [45]:
# getting predictions on the test set
# this code is borrowed from the starter code mentioned at the top but modified for my implementation

preds_dict = {
    'grapheme_root': [],
    'vowel_diacritic': [],
    'consonant_diacritic': []
}

components = ['consonant_diacritic', 'grapheme_root', 'vowel_diacritic']
target = []  # model predictions placeholder
row_id = []  # row_id placeholder
for file_idx in range(4):
    df_test_img = pd.read_parquet('./data/test_image_data_{}.parquet'.format(file_idx))
    df_test_img.set_index('image_id', inplace=True)

    X_test = df_test_img.values.reshape(-1, 137, 236, 1)

    preds = model.predict(X_test)

    # assumes the model outputs are ordered root, vowel, consonant (the key order of preds_dict)
    for out_idx, name in enumerate(preds_dict):
        preds_dict[name] = np.argmax(preds[out_idx], axis=1)

    # one row per (image, component) pair, as the submission format requires
    for k, image_id in enumerate(df_test_img.index.values):
        for comp in components:
            row_id.append(image_id + '_' + comp)
            target.append(preds_dict[comp][k])
    del df_test_img
    del X_test

df_sample = pd.DataFrame(
    {
        'row_id': row_id,
        'target': target
    },
    columns=['row_id', 'target']
)
df_sample.to_csv('submission.csv', index=False)
df_sample


| | row_id | target |
|---|---|---|
| 0 | Test_0_consonant_diacritic | 0 |
| 1 | Test_0_grapheme_root | 3 |
| 2 | Test_0_vowel_diacritic | 0 |
| 3 | Test_1_consonant_diacritic | 0 |
| 4 | Test_1_grapheme_root | 93 |
| 5 | Test_1_vowel_diacritic | 2 |
| 6 | Test_2_consonant_diacritic | 0 |
| 7 | Test_2_grapheme_root | 19 |
| 8 | Test_2_vowel_diacritic | 0 |
| 9 | Test_3_consonant_diacritic | 0 |
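The `row_id`/`target` layout shown above (three consecutive rows per image, in the order consonant, root, vowel) can also be assembled without nested Python loops over the predictions. The following is only a sketch: `build_submission` and its arguments are hypothetical names, and it assumes the per-component class predictions are already available as 1-D arrays aligned with the image IDs.

```python
import numpy as np
import pandas as pd

COMPONENTS = ['consonant_diacritic', 'grapheme_root', 'vowel_diacritic']


def build_submission(image_ids, preds_by_component):
    """Return a DataFrame with one row per (image, component) pair."""
    # row ids: every image id paired with each component, image-major order
    row_id = ['{}_{}'.format(img, comp) for img in image_ids for comp in COMPONENTS]
    # stack to shape (n_images, 3), then flatten row-wise so the targets
    # follow the same image-major order as row_id
    target = np.stack([np.asarray(preds_by_component[c]) for c in COMPONENTS],
                      axis=1).ravel()
    return pd.DataFrame({'row_id': row_id, 'target': target})


demo = build_submission(
    ['Test_0', 'Test_1'],
    {'grapheme_root': [3, 93], 'vowel_diacritic': [0, 2], 'consonant_diacritic': [0, 0]},
)
print(demo)
```

With the first two test images from the table above, this reproduces rows 0 to 5 of the submission.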
