# Pre-processing & Classification Try-Outs

## Datasets

- ~~Full~~
- ~~Cleaned~~
- ~~Cleaned+MinMaxScaled~~
- ~~Cleaned+RobustScaled~~
- ~~Cleaned+QuantileTransformed~~
- Cleaned+Extended+MinMaxScaled
- Cleaned+CleanExtended+MinMaxScaled

## Classifiers

- ~~RandomForest~~
- SVM/C
- kNN
- ~~SGD~~

## Winner

SVM/C `kernel='rbf', C=3.0, gamma=0.2` with Cleaned+CleanExtended+MinMaxScaled

In [1]:
# IMPORTS AND NOTEBOOK SETUP
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

\# | Rank | Classifier | Options | Dataset | Score
--- | --- | --- | --- | ---| ---
1 |1 |SVM/C | `kernel='rbf', C=1.5, gamma=0.2` | Cleaned+Extended+MinMaxScaled | `0.98541666666666672`
2 |1 |SVM/C | `kernel='rbf', C=8.7, gamma=0.04` | Cleaned+Extended+MinMaxScaled | `0.98333333333333328`
3 |1 |kNN | `n_neigbors=3` | Cleaned+Extended+MinMaxScaled | `0.98333333333333328`
4 |1 |SVM/C | `kernel='rbf', C=3.0, gamma=0.2` | Cleaned+CleanExtended+MinMaxScaled | `0.97916666666666663`

Default options:

- Scikit Learn train/test split ratio: `.25`.
- Normalizing all columns but `num_holes`.

## outdated

\# | Rank | Classifier | Options | Dataset | Score
--- | --- | --- | --- | ---| ---
1 |1 |SVM/C | `kernel='rbf', C=6.6, gamma=0.35` | Cleaned+MinMaxScaled | `0.978873239436`
1 |1 |SVM/C | `kernel='rbf', C=3.9, gamma=0.59` | Cleaned+MinMaxScaled | `0.978873239436`
1 |1 |SVM/C | `C=2.0` | Cleaned+MinMaxScaled | `0.973958333333`
2 |1 |SVM/C | `C=4.9` | Cleaned+RobustScaled | `0.973958333333`
3 |3 |SVM/C | `kernel='rbf', C=3.9` | Cleaned+MinMaxScaled | `0.967391304347`
4 |3 |SVM/C | `kernel='sigmoid', C=9.6` | Cleaned+MinMaxScaled | `0.967391304347`
5 |5 |RandomForest | `n_estimators=70` | Cleaned | `0.953125000000`
6 |5 |RandomForest | `n_estimators=70` | Cleaned+MinMaxScaled | `0.953125000000`
7 |5 |SVM/C | `C=2.6` | Cleaned+QuantileTransformed | `0.953125000000`
8 |8 |SVM/C | `C=4.3` | Cleaned | `0.947916666667`
9 |8 |RandomForest | `n_estimators=16` | Cleaned+RobustScaled | `0.947916666667`
10|8 |RandomForest | `n_estimators=32` | Cleaned+QuantileTransformed | `0.947916666667`
11|11|RandomForest | `n_estimators=90` | Full | `0.942708333333`
12|12|SVM/C | `default` | Full | `0.932291666667`

Default options:

- RandomForest with `n_estimators=50`, `oob_score=True` and `random_state=123456`.
- SVM/C with `kernel=linear`, `C=1.0`.

Normalizing all columns but `num_holes`.

In [2]:
# IMPORTING OUR DATASET
data_full = pd.read_csv('../dataset-numpy/dataset.csv')
data_clean_manual = pd.read_csv('../dataset-numpy/dataset-clean-manual.csv')
data_ext_clean_manual = pd.read_csv('../dataset-numpy/dataset-extended-clean-manual.csv')
data_v4 = pd.read_csv('../dataset-numpy/dataset-v4.csv')
mnist = pd.read_csv('../dataset-numpy/mnist.csv')

mnist.describe()

Unnamed: 0,area,contours,radius,hull_radius,centroid_x,centroid_y,weight_0_0,weight_0_1,weight_0_2,weight_0_3,...,weight_7_0,weight_7_1,weight_7_2,weight_7_3,weight_7_4,weight_7_5,weight_7_6,weight_7_7,num_holes,label
count,41998.0,41998.0,41998.0,41998.0,41998.0,41998.0,41998,41998,41998.0,41998.0,...,41998,41998.0,41998.0,41998.0,41998.0,41998.0,41998.0,41998,41998.0,41998.0
mean,130.065515,38.106005,7.884814,10.212555,16.114101,16.114185,0,0,0.003429,0.017572,...,0,0.005,0.15165,0.418568,0.276561,0.075813,0.003095,0,0.297705,4.456807
std,67.09485,11.016017,1.076821,1.08275,0.644487,0.966974,0,0,0.155041,0.348104,...,0,0.182251,1.076752,1.777347,1.466151,0.765854,0.14733,0,0.509892,2.8877
min,10.0,4.0,1.999121,2.128939,4.25,3.978495,0,0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
25%,83.5,31.0,7.203928,9.558402,15.776173,15.731418,0,0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,2.0
50%,116.0,38.0,7.848646,10.277427,16.106667,16.141785,0,0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,4.0
75%,158.5,46.0,8.524543,10.924278,16.424546,16.554601,0,0,0.0,0.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0,1.0,7.0
max,412.5,83.0,11.709348,14.369267,25.428571,27.266667,0,0,9.0,11.0,...,0,8.0,15.0,15.0,16.0,15.0,9.0,0,3.0,9.0


In [3]:
data_v5 = data_v4.append(mnist).fillna(0)

# Reorder columns
column_order = [
    "area",
    "contours",
    "radius",
    "hull_radius",
    "centroid_x",
    "centroid_y",
]

COUNT = 8
for x in range(COUNT):
    for y in range(COUNT):
        column_order.append('_'.join(['weight', str(x), str(y)]))
column_order.append("num_holes")
column_order.append("label")

data_v5 = data_v5[column_order]
means = data_v5.mean()
data_v5 = data_v5.drop(means[means < .015].index, axis=1)
data_v5.describe()

Unnamed: 0,area,contours,radius,hull_radius,centroid_x,centroid_y,weight_0_2,weight_0_3,weight_0_4,weight_0_5,...,weight_6_4,weight_6_5,weight_6_6,weight_7_2,weight_7_3,weight_7_4,weight_7_5,weight_7_6,num_holes,label
count,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,...,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0
mean,137.129787,38.405301,7.998856,10.363452,16.102961,16.082107,0.244888,0.579876,0.570654,0.252197,...,5.477754,1.91195,0.26677,0.404094,0.961383,0.801721,0.3803,0.090988,0.301289,4.46013
std,74.80108,11.015813,1.194949,1.283819,0.679833,1.034079,1.679701,2.84788,2.786978,1.789785,...,5.493916,4.085573,1.636103,2.021423,3.246829,3.084245,2.159046,1.0567,0.51366,2.886857
min,10.0,4.0,1.999121,2.128939,4.25,3.978495,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,85.0,31.0,7.240011,9.595095,15.76073,15.702476,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
50%,119.0,38.0,7.903739,10.333949,16.100467,16.126446,0.0,0.0,0.0,0.0,...,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
75%,167.5,46.0,8.649492,11.024899,16.427673,16.552574,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0
max,512.0,88.0,13.11759,16.446964,25.428571,27.266667,16.0,16.0,16.0,16.0,...,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,3.0,9.0


## Normalization

In [4]:
less_columns = data_clean_manual.columns.values
columns_v4 = data_v4.columns.values
columns_v5 = data_v5.columns.values
columns_to_not_normalize = ['num_holes', 'label']

columns_v4_to_normalize = [c for c in columns_v4 if not c in columns_to_not_normalize]
columns_v5_to_normalize = [c for c in columns_v5 if not c in columns_to_not_normalize]
less_columns_to_normalize = [c for c in less_columns if not c in columns_to_not_normalize]
        
def scale(data, scaler, columns):
    return pd.DataFrame(scaler.fit_transform(data[columns]), columns=columns)

### MinMaxScaler

In [7]:
from sklearn.preprocessing import MinMaxScaler

minmaxscaled = data_clean_manual.copy()
minmaxscaled[less_columns_to_normalize] = scale(data_clean_manual, MinMaxScaler(), less_columns_to_normalize)

minmaxscaled_ext = data_ext_clean_manual.copy()
minmaxscaled_ext[columns_v4_to_normalize] = scale(data_ext_clean_manual, MinMaxScaler(), columns_v4_to_normalize)

scaled_v4 = data_v4.copy()
scaled_v4[columns_v4_to_normalize] = scale(data_v4, MinMaxScaler(), columns_v4_to_normalize)

scaled_v5 = data_v5.copy()
scaled_v5[columns_v5_to_normalize] = scale(data_v5, MinMaxScaler(), columns_v5_to_normalize)

print minmaxscaled.shape, minmaxscaled_ext.shape, scaled_v4.shape, scaled_v5.shape
scaled_v5.describe()

(1920, 24) (1920, 72) (1920, 60) (43918, 55)


Unnamed: 0,area,contours,radius,hull_radius,centroid_x,centroid_y,weight_0_2,weight_0_3,weight_0_4,weight_0_5,...,weight_6_4,weight_6_5,weight_6_6,weight_7_2,weight_7_3,weight_7_4,weight_7_5,weight_7_6,num_holes,label
count,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,...,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0,43918.0
mean,0.267452,0.413163,0.549997,0.585844,0.559136,0.518326,0.030406,0.07134,0.069311,0.03054,...,0.35454,0.140158,0.025124,0.041108,0.094153,0.082957,0.04282,0.011171,0.301289,4.46013
std,0.161695,0.131089,0.11622,0.100673,0.0338,0.04716,0.146616,0.245778,0.239957,0.155518,...,0.349772,0.280165,0.128061,0.164099,0.260332,0.252497,0.182754,0.092606,0.51366,2.886857
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.153386,0.321429,0.474827,0.524436,0.542671,0.502061,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
50%,0.224104,0.416667,0.536625,0.57758,0.559245,0.520902,0.0,0.0,0.0,0.0,...,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
75%,0.338396,0.5,0.611857,0.630165,0.575136,0.539844,0.0,0.0,0.0,0.0,...,0.6875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,9.0


### ~~RobustScaler~~

In [None]:
from sklearn.preprocessing import RobustScaler

robustscaled = data_clean_manual.copy()
robustscaled[columns] = scale(data_clean_manual, RobustScaler(), columns)
robustscaled.describe()

### ~~QuantileTransformer~~

In [None]:
from sklearn.preprocessing import QuantileTransformer

quantiletransformed = data_clean_manual.copy()
quantiletransformed[columns] = scale(data_clean_manual, QuantileTransformer(), columns)
quantiletransformed.describe()

## Splitting the dataset

In [8]:
from sklearn.model_selection import train_test_split

def split(data, ratio):
    return train_test_split(data.iloc[:,:-1], data.iloc[:,-1], test_size=ratio)

X_train, X_test, Y_train, Y_test = split(scaled_v5, .25)

## Try Random Forest Classifier

In [9]:
# RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier

top_rf = (2, 0.0)
for n_e in range(2, 101):
    rf = RandomForestClassifier(n_estimators=n_e, oob_score=True, random_state=123456)
    rf.fit(X_train, Y_train)
    score = rf.score(X_test, Y_test)
    if score > top_rf[1]:
        top_rf = (n_e, score)
        print('N_E:', n_e, 'Score:', score)
print('Top:', top_rf)

  warn("Some inputs do not have OOB scores. "


('N_E:', 2, 'Score:', 0.18460837887067394)
('N_E:', 3, 'Score:', 0.19672131147540983)
('N_E:', 5, 'Score:', 0.19863387978142077)
('N_E:', 6, 'Score:', 0.19908925318761383)
('N_E:', 7, 'Score:', 0.2034608378870674)
('N_E:', 8, 'Score:', 0.20564663023679416)
('N_E:', 9, 'Score:', 0.20683060109289617)


KeyboardInterrupt: 

## Try Support Vector Machine

100 x random train test split:

\# | Dataset | Options | Min | Max | Mean | Variance
---| --- | --- | --- | --- | --- | ---
1 | Cleaned+Extended+MinMaxScaled | `kernel='rbf', C=1.5, gamma=0.2` | `0.9708333333` | `0.9937500000` | `0.9837291666` | `0.022916667`
2 | ~~Cleaned+CleanExtended+MinMaxScaled~~ | `kernel='rbf', C=3.0, gamma=0.2` | `0.9687500000` | `0.9958333333` | `0.9814791666` | `0.027083333`
3 | Cleaned+Extended+MinMaxScaled | `kernel='linear', C=1.1` | `0.9604166666` | `0.9916666666` | `0.9743750000`
4 | Cleaned+MinMaxScaled | `kernel='rbf', C=6.6, gamma=0.35` | `0.9500000000` | `0.9833333333` | `0.9712708333`
5 | ~~Cleaned+CleanExtended+MinMaxScaled~~ | `kernel='linear', C=0.4` | `0.9437500000` | `0.9875000000` | `0.9702708333`
6 | Cleaned+MinMaxScaled | `kernel='linear', C=1.5` | `0.9395833333` | `0.9812500000` | `0.9626250000`

In [10]:
# SVM
from sklearn import svm

ITERS = 1
scores = np.zeros((ITERS))
for i in range(ITERS):
    print i, '/', ITERS - 1
    X_train, X_test, Y_train, Y_test = split(scaled_v5, .25)
    svc = svm.SVC(kernel='rbf', C=3.0, gamma=0.2)
    svc.fit(X_train, Y_train)
    scores[i] = svc.score(X_test, Y_test)
    
print 'Min Score:', scores.min()
print 'Max Score:', scores.max()
print 'Mean Score:', scores.mean()

0 / 0
Min Score: 0.23825136612
Max Score: 0.23825136612
Mean Score: 0.23825136612


In [None]:
print 'Min Score:', scores.min()
print 'Max Score:', scores.max()
print 'Mean Score:', scores.mean()

## Compared Datasets with Same Options

300 Iterations of 25% splits.

Options: `kernel='rbf', C=2.8, gamma=0.1`

Options: `kernel='rbf', C=3.0, gamma=0.2`

\# | Dataset | Min Score | Mean Score | Max Score
---| --- | --- | --- | ---
1| v4 | `0.96458333` | `0.98163194`| `0.99583333`
2| Cleaned+MinMaxScaled+Extended | `0.96041666` | `0.98059722` | `0.99583333`
3| Cleaned+MinMaxScaled | `0.94791666` | `0.96754861` | `0.98541666`

Options: `kernel='rbf', C=1.5, gamma=0.2`

\# | Dataset | Min Score | Mean Score | Max Score
---| --- | --- | --- | ---
1| v4 | `0.96666666` | `0.98113888`| `0.99583333`
2| Cleaned+MinMaxScaled+Extended | `0.96041666` | `0.97952083` | `0.99375000`
3| Cleaned+MinMaxScaled | `0.93750000` | `0.95971527` | `0.98125000`

## Validating on unseen data

In [None]:
DATASET = scaled_v4
X_train_test, X_validation, Y_train_test, Y_validation = split(DATASET, .1)

print X_train_test.shape, X_validation.shape, Y_train_test.shape, Y_validation.shape

In [None]:
# Testing on train/test data
svc = svm.SVC(kernel='rbf', C=2.8, gamma=0.1)

FOLDS = 10
for i in range(FOLDS):
    X_train, X_test, Y_train, Y_test = train_test_split(X_train_test, Y_train_test, test_size=.25)
    svc.fit(X_train, Y_train)
    print 'Score ', i, ':', svc.score(X_test, Y_test)

print
# Validating on unseen validation data
print 'Score: ', svc.score(X_validation, Y_validation)

### Finding the optimal C and Gamma for RBF kernel

In [None]:
X_train, X_test, Y_train, Y_test = split(scaled_v4, .25)

Gs = np.arange(.1, 4, .1)
Cs = np.arange(.1, 10, .1)

steps = len(Gs) * len(Cs)
scores = np.zeros((steps))
index = 0
top = (.1, .01, 0)

for g in Gs:
    for c in Cs:
        print('%d / %d' % (index, steps))
        svc = svm.SVC(kernel='rbf', C=c, gamma=g)
        svc.fit(X_train, Y_train)
        score = svc.score(X_test, Y_test)
        scores[index] = score
        if score > top[2]:
            top = (c, g, score)
        index += 1

print('Top:', top)

## Try SGD

In [None]:
from sklearn import linear_model

sgd = linear_model.SGDClassifier(max_iter=1000)
sgd.fit(X_train, Y_train)
score = sgd.score(X_test, Y_test)
print(score)

## Try kNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
score = knn.score(X_test, Y_test)
print(score)

In [None]:
Ns = range(3, 30)
scores = np.zeros((len(Ns)))
index = 0
top = (3, 0)
for n in Ns:
    print('%d / %d (%d)' % (index, len(Ns)-1, n))
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, Y_train)
    score = knn.score(X_test, Y_test)
    scores[index] = score
    if score > top[1]:
        top = (n, score)
    index += 1

print('Top:', top)

## Combining Automation

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import svm

datasets = [
    ('Cleaned+MinMaxScaled', minmaxscaled),
    ('Cleaned+Extended+MinMaxScaled', minmaxscaled_ext)
]

options = {
    'AdaBoost': {
        'base_estimator': [
            svm.SVC(kernel='rbf', C=8.7, gamma=0.04),
            RandomForestClassifier(n_estimators=69)
        ],
        'algorithm': ['SAMME']
    },
    'SVM/C': {
        'kernel': ('linear', 'rbf'),
        'C': np.arange(.1, 12.0, .2),
        'gamma': [.001, .005, .01, .025, .05, .075, .1, .25, .5, 1, 3, 5, 8]
    },
    'RandomForest': {
        'n_estimators': range(10, 70)
    }
}

classifiers = [
#     ('AdaBoost', AdaBoostClassifier),
    ('SVM/C', svm.SVC),
#     ('RandomForest', RandomForestClassifier)
]

def search(classifiers, options, datasets, test_size, random_state):
    results = {
        'rank': [],
        'classifier': [],
        'options': [],
        'dataset': [],
        'score': []
    }

    for dataset in datasets:
        X_train, X_test, Y_train, Y_test = train_test_split(dataset[1].iloc[:,:-1], dataset[1].iloc[:,-1],\
                                                            test_size=test_size, random_state=random_state)
        
        for classifier in classifiers:
            name = classifier[0]
            print 'Testing', dataset[0], 'on', name, '...'

            model = GridSearchCV(classifier[1](), options[name], verbose=1, cv=3)
            model.fit(X_train, Y_train)
            
            print 'Params:', model.best_params_
            print 'MSE:', model.best_score_
            print
            
            results['rank'].append(0)
            results['classifier'].append(name)
            results['options'].append(str(model.best_params_))
            results['dataset'].append(dataset[0])
            results['score'].append(model.best_score_)
            
    return results

results = search(classifiers, options, datasets, .35, 123456)

In [None]:
results_df = pd.DataFrame(results).sort_values(['score'], ascending=[False])
results_df['rank'] = pd.Series(range(1, len(results_df) + 1), index=results_df.index)
results_df[['rank', 'classifier', 'options', 'dataset', 'score']].to_csv('../classifiers/results_testsize35.csv', sep=',', encoding='utf-8', index=False)

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
import itertools

best = svm.SVC(kernel='rbf', C=3.0, gamma=0.2)
X_train, X_test, Y_train, Y_test = split(scaled_v4, .25)
Y_pred = best.fit(X_train, Y_train).predict(X_test)

def plot_confusion_matrix(cm, classes):
#     print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, Y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()

print('Accuracy', best.score(X_test, Y_test))
plot_confusion_matrix(cnf_matrix, classes=range(0,10))

plt.show()