***Issue#4 - Traversal of the space of cross-validation folds.***

Fix for #4: https://github.com/mozilla/PRESC/issues/4

In this notebook, we run a train-test split experiment and see how the accuracy of the model varies. We use an SVM classifier with MinMax data transformation and hyperparameter tuning, as in my notebook for issue #2. I chose these settings because transformation and hyperparameter tuning improved my results in the earlier experiments for issue #2.

***Steps***

1. Data Exploration and Outlier Fixing
2. Data Transformation (MinMaxScaler)
3. Hyperparameter Tuning (GridSearchCV)
- I am using the GridSearchCV function to find the optimal parameters, i.e. parameter tuning. I will be using the RBF (radial basis function) kernel. The parameter values for C and gamma are based on this Kaggle kernel (https://www.kaggle.com/rajansharma780/vehicle/kernels).
4. Training using multiple split ratios
5. Display result in tabular format
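
Steps 2 and 3 can be sketched as a single scikit-learn pipeline. This is only a minimal illustration, not the code in `issue4_helper`: the data is synthetic (`make_classification` with 16 features and an arbitrary 3 classes), the grid is a shortened version of the one used later in this notebook, and the 5-fold setting is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the vehicles dataset).
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training part only.
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC(kernel="rbf"))])

# Shortened grid; the "svc__" prefix routes parameters to the SVC step.
param_grid = {"svc__C": [0.1, 1, 10, 100],
              "svc__gamma": [1, 0.1, 0.01, 0.001]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Putting the scaler inside the pipeline is a design choice made here to avoid leaking test-fold statistics into the scaling step; the helper used in this notebook may handle scaling differently.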

***References:***

1. https://scikit-learn.org
2. https://www.kaggle.com/rajansharma780/vehicle/kernels

In [1]:
# Ignore all FutureWarning and DeprecationWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt 
import seaborn as sns
pd.set_option("display.precision", 2) 

    
from sklearn.preprocessing import MinMaxScaler
from issue4_helper import cross_validation
from sklearn.svm import SVC

In [2]:
# Read the data
df = pd.read_csv("../../datasets/vehicles.csv")

In [3]:
# Set Feature and Label Column
feature_cols = ['COMPACTNESS', 'CIRCULARITY', 'DISTANCE_CIRCULARITY', 'RADIUS_RATIO',
       'PR.AXIS_ASPECT_RATIO', 'MAX.LENGTH_ASPECT_RATIO', 'SCATTER_RATIO',
       'ELONGATEDNESS', 'PR.AXIS_RECTANGULARITY', 'MAX.LENGTH_RECTANGULARITY',
       'SCALED_VARIANCE_MAJOR', 'SCALED_VARIANCE_MINOR',
       'SCALED_RADIUS_OF_GYRATION', 'SKEWNESS_ABOUT_MAJOR',
       'SKEWNESS_ABOUT_MINOR', 'KURTOSIS_ABOUT_MAJOR']
label_col = ['Class']

**Removing Outlier Rows**

In [4]:
# Upper limits, read off the boxplots, for the 8 columns where outliers were identified.
df_columns_with_outliers = pd.DataFrame([[255, 77, 13, 288, 980, 88, 19, 40]], columns=['RADIUS_RATIO', 
                                                                                'PR.AXIS_ASPECT_RATIO', 
                                                                                'MAX.LENGTH_ASPECT_RATIO', 
                                                                                'SCALED_VARIANCE_MAJOR', 
                                                                                'SCALED_VARIANCE_MINOR', 
                                                                                'SKEWNESS_ABOUT_MAJOR', 
                                                                                'SKEWNESS_ABOUT_MINOR', 
                                                                                'KURTOSIS_ABOUT_MAJOR'])

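# Count the values above each column's limit (a row with outliers in
# several columns is counted once per column)
total_outliers = 0
for column in df_columns_with_outliers.columns: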
    total_outliers += df[column][df[column] > df_columns_with_outliers[column][0]].size
print('Out of {} rows {} Outliers '.format(len(df.index), total_outliers))

Out of 846 rows 51 Outliers 


In [5]:
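df_new = df.copy()

# Fill nulls with the column mean (numeric columns only, which recent
# pandas versions require when non-numeric columns are present)
df_new.fillna(df_new.mean(numeric_only=True), inplace=True)

# Remove outliers based on the max values identified earlier from the boxplots.
# Note: the strict '<' also drops rows whose value equals the limit.
for column in df_columns_with_outliers.columns:
    df_new = df_new[df_new[column] < df_columns_with_outliers[column][0]]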
 
df_new['Class'] = pd.Categorical(df_new['Class']).codes

# Reset the index after removing the outliers
df_new = df_new.reset_index(drop=True)
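
The limits above were picked by eye from boxplots. As a rough alternative, the upper boxplot fence (Q3 + 1.5 * IQR) can be computed directly. The sketch below uses a made-up toy column and a hypothetical helper `upper_whisker`; it is not how the limits in this notebook were obtained.

```python
import pandas as pd

# Toy column with made-up values and one obvious high outlier.
toy = pd.DataFrame({"RADIUS_RATIO": [120, 130, 140, 150, 160, 170, 400]})

def upper_whisker(series, k=1.5):
    """Upper boxplot fence: Q3 + k * IQR (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + k * (q3 - q1)

limit = upper_whisker(toy["RADIUS_RATIO"])

# Keep rows at or below the fence, then reset the index.
cleaned = toy[toy["RADIUS_RATIO"] <= limit].reset_index(drop=True)
print(limit, len(toy) - len(cleaned))
```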

**Train/Test Split Testing**

We will train the model with different train-test split ratios and check how the accuracy varies. The goal is to see whether the split ratio has any relationship with the performance (accuracy) of the model.
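
A minimal sketch of such a sweep over split ratios is shown here. It runs on synthetic data, and the list of ratios and the fixed `C` and `gamma` values are assumptions chosen for illustration rather than tuned values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the vehicles dataset).
X, y = make_classification(n_samples=400, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

results = {}
for test_size in [0.1, 0.2, 0.3, 0.4, 0.5]:
    # Stratify so every split keeps the class proportions.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    # Illustrative hyperparameters, not tuned.
    model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=10, gamma=0.1))
    model.fit(X_tr, y_tr)
    results[test_size] = model.score(X_te, y_te)  # accuracy on the test part

for test_size, acc in results.items():
    print(f"test_size={test_size:.1f}  accuracy={acc:.3f}")
```

A single split per ratio is noisy, so repeating each ratio over several random seeds and averaging would give a steadier picture.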




In [6]:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold

param_grid = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['rbf']}  

X_new = df_new[feature_cols]
y_new = df_new[label_col]

scaler_mm = MinMaxScaler()

# Run the test and print the results
cross_validation(X_new, y_new.values.ravel(), param_grid, scaler_mm)


<class 'sklearn.preprocessing.data.MinMaxScaler'>




0.8006685521483041




True

**Cross Validation: KFold**

**Cross Validation: Stratified KFold**
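
For orientation, both schemes named in the headings above can be traversed over several fold counts with `cross_val_score`. The sketch below is an assumption about the general shape of such an experiment, not the implementation of `cross_validation` in `issue4_helper`; the data is synthetic and the fold counts and hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the vehicles dataset).
X, y = make_classification(n_samples=400, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

# Illustrative hyperparameters, not tuned.
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=10, gamma=0.1))

summary = {}
for name, splitter_cls in [("KFold", KFold), ("StratifiedKFold", StratifiedKFold)]:
    for k in [3, 5, 10]:
        cv = splitter_cls(n_splits=k, shuffle=True, random_state=0)
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        # Mean and spread of accuracy across the k folds.
        summary[(name, k)] = (scores.mean(), scores.std())

for (name, k), (mean, std) in summary.items():
    print(f"{name:16s} k={k:2d}  mean={mean:.3f}  std={std:.3f}")
```

StratifiedKFold keeps the class proportions roughly equal in every fold, which tends to matter most when classes are imbalanced or folds are small.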