# A notebook to practice and test the skills I learned in my Data Science & ML courses
This notebook is open for everyone to read and comment on. It is a test of my skills along with some research, so feel free to tell me about my mistakes and what I should have done or could do better ❤

# 1. Problem Definition

> Predict target values and submit the score on Kaggle


# 2. Data
As we can see from the data, this is a classification problem. Some of the columns hold string values, which we must convert into numerical values before training.

> For this competition, you will be predicting a binary target based on a number of feature columns given in the data. All of the feature columns, cat0 - cat18 are categorical, and the feature columns cont0 - cont10 are continuous.  
Files  
train.csv - the training data with the target column  
test.csv - the test set; you will be predicting the target for each row in this file (the probability of the binary target)  
sample_submission.csv - a sample submission file in the correct format
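
The column naming scheme described above makes it easy to separate the two feature types by prefix. The following is only a minimal sketch on a made-up toy frame (`toy_df` and its values are hypothetical, not the competition data):

```python
import pandas as pd

# Toy frame that mimics the competition's column naming; the values are made up.
toy_df = pd.DataFrame({
    "id": [0, 1, 2],
    "cat0": ["A", "B", "A"],
    "cat1": ["I", "K", "I"],
    "cont0": [0.1, 0.5, 0.9],
    "cont1": [0.3, 0.2, 0.7],
    "target": [0, 1, 0],
})

# Split the feature columns by their name prefix.
cat_cols = [c for c in toy_df.columns if c.startswith("cat")]
cont_cols = [c for c in toy_df.columns if c.startswith("cont")]

print(cat_cols)
print(cont_cols)

# Share of each class in the binary target.
print(toy_df["target"].value_counts(normalize=True))
```

The same two list comprehensions should work on `train_df`, assuming its columns follow the `cat*` / `cont*` naming from the description.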



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV


from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, accuracy_score

import pickle

In [5]:
test_df = pd.read_csv('test.csv')
train_df = pd.read_csv('train.csv')
submission_df = pd.read_csv('sample_submission.csv')

In [3]:
train_df.head()

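| | id | cat0 | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | ... | cont2 | cont3 | cont4 | cont5 | cont6 | cont7 | cont8 | cont9 | cont10 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A | I | A | B | B | BI | A | S | Q | ... | 0.759439 | 0.795549 | 0.681917 | 0.621672 | 0.592184 | 0.791921 | 0.815254 | 0.965006 | 0.665915 | 0 |
| 1 | 1 | A | I | A | A | E | BI | K | W | AD | ... | 0.386385 | 0.541366 | 0.388982 | 0.357778 | 0.600044 | 0.408701 | 0.399353 | 0.927406 | 0.493729 | 0 |
| 2 | 2 | A | K | A | A | E | BI | A | E | BM | ... | 0.343255 | 0.616352 | 0.793687 | 0.552877 | 0.352113 | 0.388835 | 0.412303 | 0.292696 | 0.549452 | 0 |
| 3 | 3 | A | K | A | C | E | BI | A | Y | AD | ... | 0.831147 | 0.807807 | 0.800032 | 0.619147 | 0.221789 | 0.897617 | 0.633669 | 0.760318 | 0.934242 | 0 |
| 4 | 4 | A | I | G | B | E | BI | C | G | Q | ... | 0.338818 | 0.277308 | 0.610578 | 0.128291 | 0.578764 | 0.279167 | 0.351103 | 0.357084 | 0.32896 | 1 |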


## Now we're going to convert the categorical values and assume they are ordinal

When to use label encoding vs. one-hot encoding: the answer generally depends on your dataset and the model you want to apply, but there are a few points to note before choosing an encoding technique:


> **We apply One-Hot Encoding when:**  
◽ The categorical feature is not ordinal (like the countries above)  
◽ The number of categorical features is less so one-hot encoding can be effectively applied  
**We apply Label Encoding when:**  
◽ The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, high school)  
◽ The number of categories is quite large as one-hot encoding can lead to high memory consumption  
link: https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/  
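
To make the difference between the two encodings concrete, here is a small sketch on a made-up column (`toy` and its `color` values are hypothetical). It only illustrates what each encoding produces; it is not the preprocessing used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the values are made up for illustration.
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one integer per category, assigned in sorted order.
encoder = LabelEncoder()
label_encoded = encoder.fit_transform(toy["color"])
print(list(encoder.classes_))    # ['blue', 'green', 'red']
print(label_encoded.tolist())    # [2, 1, 0, 1]

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["color"], prefix="color")
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Label encoding keeps a single column but implies an order (blue < green < red) that may not exist, while one-hot encoding avoids the implied order at the cost of one extra column per category.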

We are also dropping the `id` column in the cell below. It is only a row identifier, so dropping it keeps our algorithms from treating it as a meaningful feature, which should help us get better results.

In [8]:
# collect the categorical columns (cat0 - cat18)
cat_cols = [col for col in train_df.columns if col.startswith('cat')]

# encode each categorical column as integers; fit one encoder per column on the
# train and test values together so both sets share the same category-to-integer mapping
for col in cat_cols:
    label_encoder = LabelEncoder()
    label_encoder.fit(pd.concat([train_df[col], test_df[col]]))
    train_df[col] = label_encoder.transform(train_df[col])
    test_df[col] = label_encoder.transform(test_df[col])
    
# set random seed to reproduce
np.random.seed(42)

# shuffle the dataset and drop the id column
train_df = train_df.sample(frac=1).reset_index(drop=True)
train_df = train_df.drop(columns='id')
test_df = test_df.drop(columns='id')

# separate the features from the target
X = train_df.drop(columns='target')
y = train_df['target']

# split the data into train and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Since we're dealing with a classification problem, we're going to try three different models and compare them before we get to the experimentation part.



In [5]:
# we're going to create a dict of the different models we're going to use
models = {
    'Logistic Regression' : LogisticRegression(solver='liblinear'),
    'RandomForest' : RandomForestClassifier(),
    'KNN': KNeighborsClassifier(),
}

In [6]:
# create a function that loops through the dict, fits each model, and scores it
# (for classifiers, model.score returns the mean accuracy)
def fit_and_score(models, X_train, X_test, y_train, y_test):
    model_scores = {}
    for name, model in models.items(): 
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

In [7]:
model_scores = fit_and_score(models=models, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
model_scores

{'Logistic Regression': 0.8376222222222223,
 'RandomForest': 0.8464,
 'KNN': 0.8109222222222222}

Based on the accuracy scores above, we're going to use RandomForestClassifier. Unless we have plenty of time (or a fast CPU) to experiment with all of these algorithms, tuning only the highest-scoring one is the most time-efficient choice.
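
The comparison above uses accuracy, while the later cells evaluate with ROC AUC on predicted probabilities. As a rough sketch of how both metrics could be reported side by side, the example below uses synthetic data from `make_classification` (an assumption standing in for the competition data) and hypothetical names such as `demo_models` and `demo_scores`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the competition data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

demo_scores = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    demo_scores[name] = {
        "accuracy": accuracy_score(y_te, model.predict(X_te)),
        # ROC AUC is computed from the predicted probability of the positive class
        "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
    }

for name, scores in demo_scores.items():
    print(name, scores)
```

The two metrics do not always rank models in the same order, so it can be worth checking both before picking a model to tune.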

# 3. Experimentation

## We're then going to manually test out hyperparameters
This is the part where we're going to experiment the most. Set yourself a limit on the result you want, or you'll never stop training: with a dataset this large and this many hyperparameter combinations, fine-tuning can take a lot of time.

For the meaning of these hyperparameters, see scikit-learn's documentation.

In [8]:
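rf_grid = {
    'n_estimators' : np.arange(600, 1500, 50),
    'max_depth' : [None, 1,3,5],
    # for classifiers 'auto' means sqrt(n_features); newer scikit-learn releases no longer accept it, use 'sqrt' there
    'max_features': ['auto'],
    # an integer min_samples_split must be at least 2 (the recorded run below started this range at 1)
    'min_samples_split' : np.arange(2,15,1),
    'min_samples_leaf' : np.arange(1,15,1),
    'bootstrap' : [True ,False]
}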
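
To get a feel for how large this search space is, the sketch below counts the combinations with scikit-learn's `ParameterGrid`. The `demo_grid` name is hypothetical; it copies the value ranges from the grid above, assumes `min_samples_split` starts at 2 (the smallest valid integer), and leaves out `max_features` because a single value does not change the count.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Copy of the search ranges, assuming min_samples_split starts at 2.
demo_grid = {
    "n_estimators": np.arange(600, 1500, 50),   # 18 values
    "max_depth": [None, 1, 3, 5],               # 4 values
    "min_samples_split": np.arange(2, 15, 1),   # 13 values
    "min_samples_leaf": np.arange(1, 15, 1),    # 14 values
    "bootstrap": [True, False],                 # 2 values
}

# ParameterGrid enumerates every combination; len() gives the total count.
n_combinations = len(ParameterGrid(demo_grid))
print(n_combinations)  # 26208
```

A random search with `n_iter=5` therefore looks at only a tiny fraction of the possible combinations, which is why it finishes so much faster than an exhaustive search would.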

### It is much better to train on your own PC (locally), since you can use all of your CPU threads, especially if you have a mid-range PC or better

In [9]:
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                          param_distributions=rf_grid,
                          cv=5,
                          n_iter=5,
                          verbose=2,
                          n_jobs=-1
                          )
rs_rf.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  25 | elapsed: 21.9min remaining: 14.6min
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed: 24.9min finished


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=5,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [None, 1, 3, 5],
                                        'max_features': ['auto'],
                                        'min_samples_leaf': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
                                        'min_samples_split': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
                                        'n_estimators': array([ 600,  650,  700,  750,  800,  850,  900,  950, 1000, 1050, 1100,
       1150, 1200, 1250, 1300, 1350, 1400, 1450])},
                   verbose=2)
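
The search above uses the default scoring, which for a classifier is accuracy, and no fixed random seed. As a hedged sketch of one possible variation, the example below passes `scoring='roc_auc'` and `random_state` to `RandomizedSearchCV`. It runs on small synthetic data with a deliberately tiny, hypothetical parameter grid so it finishes quickly; it is not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; not the competition data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [20, 40, 60],
        "min_samples_leaf": [1, 5, 10],
    },
    n_iter=3,            # number of sampled combinations
    cv=3,
    scoring="roc_auc",   # rank candidates by ROC AUC instead of accuracy
    random_state=0,      # make the sampled combinations reproducible
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

With `scoring='roc_auc'`, `best_score_` is the mean cross-validated ROC AUC, so it is directly comparable to the hold-out AUC computed later.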

#### Now let's check the best parameters our RandomizedSearchCV has found

In [10]:
rs_rf.best_params_

{'n_estimators': 1200,
 'min_samples_split': 6,
 'min_samples_leaf': 10,
 'max_features': 'auto',
 'max_depth': None,
 'bootstrap': False}

In [11]:
rs_y_preds_proba = rs_rf.predict_proba(X_test)[:,1]
score = roc_auc_score(y_test, rs_y_preds_proba)
print(f'{score:0.5f}')

0.88822
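
The score above is computed from column 1 of `predict_proba`, that is, the predicted probability of the positive class. ROC AUC can be read as the share of (positive, negative) pairs in which the positive example gets the higher score. A small worked example with made-up labels and scores:

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Manual check: compare every positive score against every negative score.
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = list(product(pos, neg))
manual_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(auc, manual_auc)  # 0.75 0.75
```

Three of the four pairs are ranked correctly (only 0.35 vs. 0.4 is not), which gives 0.75.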


#### We then store our fitted search object in a file using pickle, just in case something happens (I do this since my electricity goes out a lot)

In [12]:
# save pickle
pickle.dump(rs_rf, open('rs_random_forest_model.pkl', 'wb'))
# load pickle
load_pickle_model = pickle.load(open('rs_random_forest_model.pkl', 'rb'))
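
As a side note, the save/load round trip can also be written with `with` blocks so the file handles are closed automatically. This is only a sketch: `demo_model`, `demo_path`, and the tiny training data are hypothetical, and the file goes into a temporary directory.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Tiny made-up model standing in for the fitted search object.
demo_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Hypothetical file location inside a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

# Save: the file is closed as soon as the block ends.
with open(demo_path, "wb") as f:
    pickle.dump(demo_model, f)

# Load the model back from disk.
with open(demo_path, "rb") as f:
    restored_model = pickle.load(f)

# The restored model should give the same predictions as the original.
same = (restored_model.predict([[0.0], [3.0]]) == demo_model.predict([[0.0], [3.0]])).all()
print(same)
```

Keep in mind that a pickled model should be loaded with the same scikit-learn version that created it, and that pickle files should only be loaded from sources you trust.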

### We then use our saved model to predict on the test dataset we were given, save the predictions, and submit the results on the Kaggle page

In [13]:
pickle_y_preds = load_pickle_model.predict_proba(test_df)[:,1]
submission_df['target'] = pickle_y_preds
submission_df.to_csv('random_forest_rs.csv', index=False)

## Now if we want to go a bit further and fine-tune our model, we can use GridSearchCV
GridSearchCV takes more time because it doesn't randomly sample a few combinations and skip the rest;  
instead, it tries every combination in your parameter dictionary.
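
The difference between the two search strategies can be seen without fitting any model. In this sketch, `ParameterGrid` lists every combination (what GridSearchCV evaluates), while `ParameterSampler` draws only `n_iter` of them (what RandomizedSearchCV evaluates). The `toy_grid` values are made up for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Made-up toy grid: 3 x 2 = 6 combinations in total.
toy_grid = {"max_depth": [3, 5, None], "bootstrap": [True, False]}

# Exhaustive: every combination, as GridSearchCV would try.
all_candidates = list(ParameterGrid(toy_grid))

# Random: only n_iter combinations, as RandomizedSearchCV would try.
sampled_candidates = list(ParameterSampler(toy_grid, n_iter=2, random_state=42))

print(len(all_candidates))      # 6
print(len(sampled_candidates))  # 2
for candidate in sampled_candidates:
    print(candidate)
```

Each candidate is then fitted once per cross-validation fold, so the total number of fits is the number of candidates multiplied by `cv`.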

In [14]:
gs_grid = {
    'n_estimators' : np.arange(1100, 1350, 50),
    'max_depth' : [None],
    'max_features': ['auto'],
    'min_samples_split' : np.arange(8,10,1),
    'min_samples_leaf' : np.arange(12,14,1),
    'bootstrap' : [False]
}

##### Do note: I based these values on the RandomizedSearchCV best params and created a small range of nearby values to look for a sweet spot.

In [15]:
gs_rf = GridSearchCV(RandomForestClassifier(),
                          param_grid=gs_grid,
                          cv=5,
                          verbose=2,
                          n_jobs=-1
                          )
gs_rf.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 38.5min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 169.5min finished


GridSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'bootstrap': [False], 'max_depth': [None],
                         'max_features': ['auto'],
                         'min_samples_leaf': array([12, 13]),
                         'min_samples_split': array([8, 9]),
                         'n_estimators': array([1100, 1150, 1200, 1250, 1300])},
             verbose=2)

In [6]:
# save pickle
pickle.dump(gs_rf, open('gs_random_forest_model.pkl', 'wb'))
# load pickle
load_pickle_model_gs = pickle.load(open('gs_random_forest_model.pkl', 'rb'))

In [15]:
load_pickle_model_gs.best_params_

{'bootstrap': False,
 'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 12,
 'min_samples_split': 8,
 'n_estimators': 1200}

In [16]:
gs_y_preds_proba = load_pickle_model_gs.predict_proba(X_test)[:,1]
score = roc_auc_score(y_test, gs_y_preds_proba)
print(f'{score:0.5f}')

0.88822


In [9]:
pickle_y_preds_gs = load_pickle_model_gs.predict_proba(test_df)[:,1]
submission_df['target'] = pickle_y_preds_gs
submission_df.to_csv('random_forest_gs.csv', index=False)

Special Thanks to 

https://www.kaggle.com/inversion/get-started-mar-tabular-playground-competition