# Image Classification on the COCO Dataset for the Dog and Cat Categories

Aditya Rachman Putra  
Harits Abdurrohman

Checkpoint data, saved models, and raw features can be accessed on the [VM](https://3a4285ffc4f4b3f1-dot-asia-southeast1.notebooks.googleusercontent.com/lab/tree/modified-FCOS/DTL), or by contacting adityarputra@gmail.com or harits.abrd@gmail.com.

## COCO Dataset

COCO is an object detection dataset with many categories, but for this assignment it is simplified into an image classification problem with two categories (binary classification). The experiments are broadly divided into two parts: experiments on the image data representation (and, indirectly, feature engineering), and experiments on the parameters of the models used.

Because there are only two classes, the images used are those that contain exactly one of the target classes (dog or cat), not both.
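
The selection rule can be illustrated with plain Python sets. This is a minimal sketch with made-up image IDs (the values are hypothetical, not real COCO IDs): an image is kept only if it appears in exactly one of the two per-class ID sets.

```python
# Hypothetical image IDs per class; real IDs would come from the COCO annotations.
dog_ids = {1, 2, 3, 4}
cat_ids = {3, 4, 5}

# Images containing both classes are excluded from both groups.
both = dog_ids & cat_ids
dog_only = sorted(dog_ids - both)
cat_only = sorted(cat_ids - both)

# Label convention used later in this notebook: dog = 1, cat = 0.
image_ids = dog_only + cat_only
labels = [1] * len(dog_only) + [0] * len(cat_only)
print(image_ids, labels)
```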

## Model Parameter Experiments

### Decision Tree Learning

- Maximum tree Depth : 10, 100, 1000
- Criterion : Gini, Entropy

### Random Forest and XGBoost

- Maximum Depth : 4, 6, 10
- Learning Rate : 0.1 and 0.3
- Number of Estimators : 1, 5, 10, 50
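
As a quick sanity check on the size of the search, the number of parameter combinations in each grid can be counted with scikit-learn's `ParameterGrid`. The sketch below assumes the grids are written as dictionaries in the form `GridSearchCV` accepts; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Grid for the decision tree: 3 depths x 2 criteria.
dt_grid = {"max_depth": [10, 100, 1000], "criterion": ["gini", "entropy"]}

# Grid shared by Random Forest and XGBoost:
# 3 depths x 2 learning rates x 4 estimator counts.
ensemble_grid = {
    "max_depth": [4, 6, 10],
    "learning_rate": [0.1, 0.3],
    "n_estimators": [1, 5, 10, 50],
}

n_dt = len(ParameterGrid(dt_grid))
n_ensemble = len(ParameterGrid(ensemble_grid))
print(n_dt, n_ensemble)
```

These counts are the number of candidates a grid search has to fit per fold.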


## Representation Experiments

- PCA
- SVD
- Texture
- Color Moment



# Setup COCO API


In [1]:
!pip install Cython
!pip install -q "tqdm>=4.36.1"
!pip3 install xgboost
!pip3 install scikit-image scikit-learn graphviz



In [2]:
!git clone https://github.com/waleedka/coco
!pip install -U setuptools
!pip install -U wheel
!make install -C coco/PythonAPI

fatal: destination path 'coco' already exists and is not an empty directory.
Requirement already up-to-date: setuptools in /opt/conda/lib/python3.7/site-packages (50.3.2)
Requirement already up-to-date: wheel in /opt/conda/lib/python3.7/site-packages (0.35.1)
make: Entering directory '/home/jupyter/modified-FCOS/DTL/coco/PythonAPI'
# install pycocotools to the Python site-packages
python setup.py build_ext install
running build_ext
building 'pycocotools._mask' extension
creating build
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/pycocotools
creating build/common
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -I../common -I/opt/conda/include/python3.7m -c pycocotools/_mask.c -o build/temp.linux-x86_64-3.7/pycocotools/_mask.o -Wno-cpp -Wno-unused-function -std=c99
^C
interrupted
Makefile:7: recipe for target 'install' 

In [3]:
!wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
!unzip ./annotations_trainval2017.zip

--2020-11-06 01:44:33--  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.88.163
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.88.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252907541 (241M) [application/zip]
Saving to: ‘annotations_trainval2017.zip’

      annotations_t   0%[                    ] 223.26K   263KB/s               ^C
Archive:  ./annotations_trainval2017.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of ./annotations_trainval2017.zip or
        ./annotations_trainval2017.zip.zip, and cannot find ./annotations_trainval2017.zip.ZIP, period.


# Import needed Packages and Load COCO categories

In [76]:
from pycocotools.coco import COCO
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from skimage.transform import rescale, resize, downscale_local_mean
from skimage.util import crop
from skimage import io, color
from math import floor, ceil
from PIL import Image
from joblib import dump, load
from tqdm.notebook import tqdm
from os import path
import xgboost as xgb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
import json

In [6]:
def split_integer(num):
    return (floor(num/2), ceil(num/2))

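def load_resize_crop_img(img_url, target_width=64, rgb = True):
    I = io.imread(img_url)
    # Scale so the shorter side equals target_width, then center-crop to a square.
    factor = target_width/min(I.shape[0:2])
    resize_target = (round(I.shape[0] * factor), round(I.shape[1] * factor))
    resized_image = (resize(I, resize_target, anti_aliasing=False) * 255).astype(np.uint8)
    crop_width = (split_integer(resize_target[0] - target_width), split_integer(resize_target[1] - target_width))
    if rgb:
        # Convert only grayscale (2-D) inputs; RGB inputs already have a channel axis.
        if resized_image.ndim == 2:
            resized_image = color.gray2rgb(resized_image)
        cropped_image = crop(resized_image, crop_width + ((0,0),))
    else:
        # Drop the channel axis first so the 2-D crop widths match the image.
        if resized_image.ndim == 3:
            resized_image = color.rgb2gray(resized_image)
        cropped_image = crop(resized_image, crop_width)
    return cropped_image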

def get_coco_id(annotation_file, cat1 = "dog", cat2 = "cat", n = None):
    coco = COCO(annotation_file)
    catIds = coco.getCatIds(catNms=[cat1,cat2])
    img_combine = coco.getImgIds(catIds=catIds)
    img_cat1 = list(set(coco.getImgIds(catIds=coco.getCatIds(catNms=[cat1]))) - set(img_combine))
    img_cat2 = list(set(coco.getImgIds(catIds=coco.getCatIds(catNms=[cat2]))) - set(img_combine))
    if n is None:
        return img_cat1 + img_cat2, np.array(len(img_cat1) * [1] + len(img_cat2) * [0])
    return img_cat1[:n] + img_cat2[:n], np.array(len(img_cat1[:n]) * [1] + len(img_cat2[:n]) * [0])

In [57]:
train_path="annotations/instances_train2017.json"
val_path="annotations/instances_val2017.json"
annotation = {"train":train_path, "val":val_path}

N_TRAIN = 4000
IMG_SIZE = 64

# Get COCO Image Data

To make repeated experiments easier, the COCO image data is stored on the machine's local storage and loaded into variables as needed. Because COCO already provides its own train-val split, that split is used as-is.

For the initial experiments, a "naive" representation of the image data is used: each image is converted into a flat array of size 1 x (size x size x 3), with size set to 64 pixels (because RAM is limited and grid search does not support partial_fit, the image size is chosen so the data still _fits_ in memory). Images in grayscale format are converted to RGB with scikit-image's `gray2rgb`.
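
To make the size of this representation concrete, the sketch below flattens one stand-in image and estimates the storage for the whole training set. The all-zero image is a placeholder, and the estimate assumes `uint8` storage; libraries may convert the data to a wider floating-point type during fitting, which would increase the footprint.

```python
import numpy as np

IMG_SIZE = 64  # same side length as used in this notebook

# Stand-in for one preprocessed RGB image (all zeros, for illustration only).
image = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
flat = image.flatten()

# 8083 is the number of training images reported later in this notebook.
n_images = 8083
bytes_needed = n_images * flat.size * flat.itemsize
print(flat.shape, round(bytes_needed / 1e6, 1), "MB as uint8")
```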

> **What's Next**
> - Compare the results between RGB and grayscale images
> - Find a way to train without being so constrained by RAM size, for example with NumPy
> - Try a train-val split using scikit-learn's `train_test_split`, with the test set being 1/3 of all the data
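
The last item could look roughly like the sketch below. It uses small synthetic arrays in place of the real features, and `stratify` is an added assumption (not something the notebook specifies) to keep the dog/cat ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the flattened features and binary labels.
rng = np.random.default_rng(0)
X_all = rng.integers(0, 256, size=(30, 12), dtype=np.uint8)
y_all = np.array([1] * 15 + [0] * 15)

# Hold out one third of the data; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=1/3, stratify=y_all, random_state=0
)
print(X_tr.shape, X_te.shape, y_te.mean())
```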

In [54]:
X_train, y_train = get_coco_id(annotation["train"], n=N_TRAIN)
X_val, y_val = get_coco_id(annotation["val"])

loading annotations into memory...
Done (t=15.57s)
creating index...
index created!
loading annotations into memory...
Done (t=0.40s)
creating index...
index created!


In [16]:
coco = COCO(annotation["train"])
for x in tqdm(X_train):
    fn = "train/{}.png".format(x)
    if path.exists(fn):
#         print(f"skipping {x}")
        continue
    I = io.imread(coco.loadImgs(x)[0]['coco_url'])
    io.imsave(fn,I)

loading annotations into memory...
Done (t=18.34s)
creating index...
index created!






In [17]:
coco = COCO(annotation["val"])
for x in tqdm(X_val):
    fn = "val/{}.png".format(x)
    if path.exists(fn):
#         print(f"skipping {x}")
        continue
    I = io.imread(coco.loadImgs(x)[0]['coco_url'])
    io.imsave(fn,I)

loading annotations into memory...
Done (t=1.89s)
creating index...
index created!






## Preprocess the image and flatten it

In [55]:
# coco = COCO(annotation["train"])
X_loaded = []
for x in tqdm(X_train):
    try : 
        fn = "train/{}.png".format(x)
        X_loaded += [load_resize_crop_img(fn, target_width=IMG_SIZE).flatten()]
    except :
        print(x)
print("Finished with length: ",len(X_loaded))
X_loaded = np.array(X_loaded)
# df_train = pd.Dataframe({"X":X_loaded,"y":y_train})
print("Converted to ndarray with shape: ",X_loaded.shape, y_train.shape)





Finished with length:  8083
Converted to ndarray with shape:  (8083, 12288) (8083,)


In [56]:
# coco = COCO(annotation["val"])
X_loaded_val = []
for x in tqdm(X_val):
    try : 
        fn = "val/{}.png".format(x)
        X_loaded_val += [load_resize_crop_img(fn, target_width=IMG_SIZE).flatten()]
    except :
        print(x)
print("Finished with length: ",len(X_loaded_val))
X_loaded_val = np.array(X_loaded_val)
print("Converted to ndarray with shape: ",X_loaded_val.shape, y_val.shape)





Finished with length:  337
Converted to ndarray with shape:  (337, 12288) (337,)


## Prepare the fold for train-validate GridSearch

As discussed earlier, the training and validation data currently follow the split of the COCO dataset, giving 8083 training images and 337 validation images. Because the number of validation images is relatively small, this could be changed to use scikit-learn's `train_test_split` so that the training and validation sets are more _balanced_.
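
A toy example may help show how `PredefinedSplit` interprets the fold array: entries of `-1` are never used for testing, while entries of `0` form the single test fold. The sample counts below are arbitrary and only for illustration.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy setup: 5 training samples followed by 2 validation samples.
n_train, n_val = 5, 2
test_fold = np.concatenate([
    np.full(n_train, -1, dtype=np.int8),  # -1: never placed in a test fold
    np.zeros(n_val, dtype=np.int8),       # 0: belongs to test fold number 0
])
cv = PredefinedSplit(test_fold)

# A single (train, test) pair is produced, matching the fixed split.
for train_idx, test_idx in cv.split():
    print(train_idx, test_idx)
```

Because there is only one fold, a grid search using this splitter fits each candidate once and scores it on the validation samples.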

In [58]:
test_fold = np.concatenate([
    # The training data.
    np.full(len(X_train), -1, dtype=np.int8),
    # The development data.
    np.zeros(len(X_val), dtype=np.int8)
])
cv = PredefinedSplit(test_fold)

X = np.concatenate([X_loaded, X_loaded_val])
y = np.concatenate([y_train, y_val])

# dtrain = xgb.DMatrix(X_loaded, label=y_train)
# dval = xgb.DMatrix(X_loaded_val, label=y_val)

### Save features and label for use later

The NumPy array of flattened images is saved to an npy file so it can be reused without preprocessing the images again. Variables that are no longer used and still occupy memory are deleted before fitting starts.
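
When reloading saved features, one possible way to limit memory use is NumPy's memory mapping. The sketch below is an assumption about how that could be done, not part of the original workflow; it uses a small synthetic array and a hypothetical file name in a temporary directory.

```python
import os
import tempfile
import numpy as np

# Small synthetic feature matrix standing in for the flattened images.
features = np.arange(24, dtype=np.uint8).reshape(4, 6)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "features_demo.npy")  # hypothetical file name
    np.save(file_path, features)

    # mmap_mode="r" maps the file read-only instead of loading it all into RAM.
    mapped = np.load(file_path, mmap_mode="r")
    first_rows = np.array(mapped[:2])  # copy only the rows that are needed
    del mapped  # release the memory map before the directory is removed

print(first_rows.shape)
```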

In [60]:
np.save("coco_dog_cat_feature_{}_{}.npy".format(N_TRAIN, IMG_SIZE), X)
np.save("coco_dog_cat_label_{}_{}.npy".format(N_TRAIN, IMG_SIZE), y)

# X = np.load("coco_dog_cat_feature_{}_{}.npy".format(N_TRAIN, IMG_SIZE))
# y = np.load("coco_dog_cat_label_{}_{}.npy".format(N_TRAIN, IMG_SIZE))

In [61]:
# Magic to delete unneeded variable (in this case the features array and label array) as it has been aggregated to X and y respectively
# %reset_selective "^X_loaded(_val)?$"
# %reset_selective "^y_(train|val)$"

Once deleted, variables cannot be recovered. Proceed (y/[n])?   y
Once deleted, variables cannot be recovered. Proceed (y/[n])?   y


# Grid Search Model Parameter

In this section, a grid search is run for each model over the predetermined parameters to find the combination that gives the best mean accuracy. Each model's default scorer is used, which in this case is the [mean accuracy of the model](https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html). Because the predefined split has a single fold, this mean is simply the accuracy on the validation set.

> **What's Next**
> - Explore scoring that is more informative for a binary classification task (e.g. F1-score, AUC-ROC, etc.)
> - Use scikit-learn's `classification_report` to get a more comprehensive report
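
Both follow-ups could be sketched as below, on synthetic data rather than the image features. The choice of `refit="f1"` and the small grid are illustrative assumptions; `scoring` accepts a list of scorer names, and each one appears in `cv_results_` as `mean_test_<name>`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the image features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Fixed split: first 150 samples for training, last 50 for validation.
fold = np.concatenate([np.full(150, -1), np.zeros(50, dtype=int)])
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4]},
    cv=PredefinedSplit(fold),
    scoring=["accuracy", "f1", "roc_auc"],
    refit="f1",  # the candidate with the best F1 is refit and exposed for predict()
)
search.fit(X_demo, y_demo)

# Caveat: the refit model has seen all 200 samples, so this report is illustrative only.
report = classification_report(y_demo[150:], search.predict(X_demo[150:]))
print(search.cv_results_["mean_test_f1"], search.cv_results_["mean_test_roc_auc"])
print(report)
```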

## Decision Tree Learning 

In [62]:
base = DecisionTreeClassifier()
grid = {
"max_depth" : [10, 100, 1000],
"criterion" : ["gini", "entropy"]
}
model = GridSearchCV(base, grid, cv=cv,n_jobs=4, verbose=100)

In [63]:
model = model.fit(X,y)

Fitting 1 folds for each of 6 candidates, totalling 6 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:  1.8min
[Parallel(n_jobs=4)]: Done   2 out of   6 | elapsed:  2.6min remaining:  5.3min
[Parallel(n_jobs=4)]: Done   3 out of   6 | elapsed:  2.6min remaining:  2.6min
[Parallel(n_jobs=4)]: Done   4 out of   6 | elapsed:  2.7min remaining:  1.3min
[Parallel(n_jobs=4)]: Done   6 out of   6 | elapsed:  5.1min remaining:    0.0s
[Parallel(n_jobs=4)]: Done   6 out of   6 | elapsed:  5.1min finished


In [81]:
pd.concat([pd.DataFrame(model.cv_results_["rank_test_score"], columns=["Rank"]), pd.DataFrame(model.cv_results_["params"]),pd.DataFrame(model.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)

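|   | Rank | criterion | max_depth | Accuracy |
|---|------|-----------|-----------|----------|
| 0 | 1 | gini | 10 | 0.557864 |
| 1 | 4 | gini | 100 | 0.519288 |
| 2 | 2 | gini | 1000 | 0.545994 |
| 3 | 6 | entropy | 10 | 0.504451 |
| 4 | 3 | entropy | 100 | 0.531157 |
| 5 | 4 | entropy | 1000 | 0.519288 |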


In [65]:
# dump(model, 'gs_dtl_{}_{}.pkl'.format(N_TRAIN, IMG_SIZE))

['gs_dtl_4000_64.pkl']

In [None]:
# %reset_selective "^model$"

For the decision tree classifier, the best result on this dataset is obtained with `max_depth = 10` and `criterion = gini`.
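
A results table like the one above can also be built by converting `cv_results_` to a DataFrame directly and selecting columns. This is a minimal sketch on synthetic data with a reduced grid; the column names `param_criterion` and `param_max_depth` follow scikit-learn's `param_<name>` convention.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a fixed train/validation split, for illustration only.
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)
fold = np.concatenate([np.full(90, -1), np.zeros(30, dtype=int)])

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 5], "criterion": ["gini", "entropy"]},
    cv=PredefinedSplit(fold),
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, so it converts directly to a DataFrame.
columns = ["rank_test_score", "param_criterion", "param_max_depth", "mean_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```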

## Random Forest Classifier

In [66]:
rf_base = xgb.XGBRFClassifier()
rf_grid = {
"max_depth" : [4, 6, 10],
"learning_rate" : [0.1, 0.3],
"n_estimators" : [1, 5, 10, 50]
}
rf_model = GridSearchCV(rf_base, rf_grid, cv=cv,n_jobs=4, verbose=100, return_train_score = True)

In [67]:
rf_model = rf_model.fit(X, y)

Fitting 1 folds for each of 24 candidates, totalling 24 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:   24.2s
[Parallel(n_jobs=4)]: Done   2 tasks      | elapsed:   44.5s
[Parallel(n_jobs=4)]: Done   3 tasks      | elapsed:   47.4s
[Parallel(n_jobs=4)]: Done   4 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done   6 tasks      | elapsed:  2.3min
[Parallel(n_jobs=4)]: Done   7 tasks      | elapsed:  2.5min
[Parallel(n_jobs=4)]: Done   8 tasks      | elapsed:  4.1min
[Parallel(n_jobs=4)]: Done   9 tasks      | elapsed:  4.5min
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  4.8min
[Parallel(n_jobs=4)]: Done  11 tasks      | elapsed:  5.4min
[Parallel(n_jobs=4)]: Done  12 tasks      | elapsed:  6.0min
[Parallel(n_jobs=4)]: Done  13 tasks      | elapsed:  6.5min
[Parallel(n_jobs=4)]: Done  14 tasks      | elapsed:  6.9min
[Parallel(

In [83]:
pd.concat([pd.DataFrame(rf_model.cv_results_["rank_test_score"], columns=["Rank"]), pd.DataFrame(rf_model.cv_results_["params"]),pd.DataFrame(rf_model.cv_results_["mean_test_score"], columns=["Accuracy"]),pd.DataFrame(rf_model.cv_results_["mean_fit_time"], columns=["Fit Time"])],axis=1)

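|   | Rank | learning_rate | max_depth | n_estimators | Accuracy | Fit Time |
|---|------|---------------|-----------|--------------|----------|----------|
| 0 | 15 | 0.1 | 4 | 1 | 0.608309 | 21.783935 |
| 1 | 17 | 0.1 | 4 | 5 | 0.590504 | 42.091922 |
| 2 | 1 | 0.1 | 4 | 10 | 0.623145 | 66.157668 |
| 3 | 13 | 0.1 | 4 | 50 | 0.611276 | 265.103034 |
| 4 | 21 | 0.1 | 6 | 1 | 0.569733 | 23.052423 |
| 5 | 9 | 0.1 | 6 | 5 | 0.614243 | 59.69389 |
| 6 | 9 | 0.1 | 6 | 10 | 0.614243 | 104.709658 |
| 7 | 7 | 0.1 | 6 | 50 | 0.617211 | 467.231941 |
| 8 | 19 | 0.1 | 10 | 1 | 0.575668 | 32.680511 |
| 9 | 23 | 0.1 | 10 | 5 | 0.560831 | 109.912144 |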


In [None]:
# dump(rf_model, 'gs_rf_{}_{}.pkl'.format(N_TRAIN, IMG_SIZE))

In [None]:
# %reset_selective "^rf_model$"

For this Random Forest, `learning_rate` does not affect the model's accuracy but does affect fit time, especially for models with a large number of estimators and a high tree depth.

## XGBoost Classifier

In [70]:
xg_base = xgb.XGBClassifier()
xg_grid = {
"max_depth" : [4, 6, 10],
"learning_rate" : [0.1, 0.3],
"n_estimators" : [1, 5, 10, 50]
}
xg_model = GridSearchCV(xg_base, xg_grid, cv=cv, n_jobs=4, verbose=100)

In [71]:
xg_model = xg_model.fit(X, y)

Fitting 1 folds for each of 24 candidates, totalling 24 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:   24.8s
[Parallel(n_jobs=4)]: Done   2 tasks      | elapsed:   50.3s
[Parallel(n_jobs=4)]: Done   3 tasks      | elapsed:   50.3s
[Parallel(n_jobs=4)]: Done   4 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:  2.1min
[Parallel(n_jobs=4)]: Done   6 tasks      | elapsed:  2.7min
[Parallel(n_jobs=4)]: Done   7 tasks      | elapsed:  3.0min
[Parallel(n_jobs=4)]: Done   8 tasks      | elapsed:  5.1min
[Parallel(n_jobs=4)]: Done   9 tasks      | elapsed:  5.4min
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  5.7min
[Parallel(n_jobs=4)]: Done  11 tasks      | elapsed:  6.5min
[Parallel(n_jobs=4)]: Done  12 tasks      | elapsed:  7.5min
[Parallel(n_jobs=4)]: Done  13 tasks      | elapsed:  7.8min
[Parallel(n_jobs=4)]: Done  14 tasks      | elapsed:  8.2min
[Parallel(

In [84]:
pd.concat([pd.DataFrame(xg_model.cv_results_["rank_test_score"], columns=["Rank"]), pd.DataFrame(xg_model.cv_results_["params"]),pd.DataFrame(xg_model.cv_results_["mean_test_score"], columns=["Accuracy"]),pd.DataFrame(xg_model.cv_results_["mean_fit_time"], columns=["Fit Time"])],axis=1)

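|   | Rank | learning_rate | max_depth | n_estimators | Accuracy | Fit Time |
|---|------|---------------|-----------|--------------|----------|----------|
| 0 | 23 | 0.1 | 4 | 1 | 0.551929 | 22.435227 |
| 1 | 5 | 0.1 | 4 | 5 | 0.632047 | 47.953519 |
| 2 | 3 | 0.1 | 4 | 10 | 0.637982 | 79.806288 |
| 3 | 2 | 0.1 | 4 | 50 | 0.643917 | 319.830331 |
| 4 | 21 | 0.1 | 6 | 1 | 0.554896 | 25.296498 |
| 5 | 15 | 0.1 | 6 | 5 | 0.578635 | 72.879623 |
| 6 | 11 | 0.1 | 6 | 10 | 0.602374 | 130.896523 |
| 7 | 4 | 0.1 | 6 | 50 | 0.635015 | 556.089526 |
| 8 | 19 | 0.1 | 10 | 1 | 0.560831 | 40.804119 |
| 9 | 15 | 0.1 | 10 | 5 | 0.578635 | 140.886188 |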


In [None]:
# dump(xg_model, 'gs_xg_{}_{}.pkl'.format(N_TRAIN, IMG_SIZE))

XGBoost has a higher mean accuracy than the other two models. The best result is produced with a learning rate of 0.1, a tree depth of 10, and 50 estimators, with a mean accuracy of 0.649852.

# Conclusion

Experiments were run with various parameter combinations. Looking at the best performance of each model, DTL performs worst, close to what would be obtained by guessing the class uniformly at random (mean accuracy 0.558). Random Forest performs somewhat better, with a mean accuracy of 0.623. XGBoost performs best, with a mean accuracy of 0.650.
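
To put these accuracies in context, the sketch below estimates their sampling uncertainty given the 337 validation images. It assumes a simple binomial model with independent validation images, which is an approximation introduced here and not part of the original experiments.

```python
import math

n_val = 337  # number of validation images reported above
best_accuracy = {"DTL": 0.557864, "Random Forest": 0.623145, "XGBoost": 0.649852}

# Standard error of an accuracy estimate under a binomial model: sqrt(p * (1 - p) / n).
standard_error = {
    name: math.sqrt(acc * (1 - acc) / n_val) for name, acc in best_accuracy.items()
}

# Distance of the DTL accuracy from chance level (0.5), in standard errors at p = 0.5.
se_chance = math.sqrt(0.5 * 0.5 / n_val)
z_dtl = (best_accuracy["DTL"] - 0.5) / se_chance

for name, se in standard_error.items():
    print(f"{name}: {best_accuracy[name]:.3f} +/- {se:.3f}")
print(f"DTL vs. chance: {z_dtl:.2f} standard errors")
```

Under this model each accuracy carries a standard error of a little under 0.03, so differences of a few hundredths between models should be read with that in mind.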