
# Business Data Science: Kaggle Competition


Name: Harris Azerf

Kaggle Username: Harris Azerf

## Code Breakdown and Description

### Quick Description
***
I broke the code into four sections. The first uses the raw data with no manipulation. The second removes the Id column. The third standardizes the data. The last looks at feature importance and applies it to the models. I ran all the code in Google Colab to make use of the GPU for certain models. After each model, I recorded the Kaggle public AUC score to keep track of how well each model performed.

### Setup
***
This is where I imported all the libraries and the data for the models. For the data, I simply imported the data and then manipulated it to remove the id column. Also, I concatenated the two data sets together and standardized the features to use in the different models. This gave 3 different types of data sets to use for the models. 


### Section 1: Raw Data using different models
***
The models I looked at were the baseline Logistic Regression, Random Forest, and XGBoost models. These models used the data straight from the csv files to fit and predict the probabilities. This section has a single subsection, **No Tuning**, which is broken down by model.

### Section 2: Data without Id Column
***
After a small confusion that I explain below, I removed the Id column and looked at the Random Forest, XGBoost, Catboost, and Light GB models. I tried all of these models without any tuning initially. Then, I tuned each model with different parameter ranges to reduce overfitting and produce different results. The best result that I was able to get using these models was with Catboost. I also tried stacking different models but ended up with models that overfit.

Initially, there was some confusion when I started using the dataset without the Id column. The first few times I ran this section, I thought I was using standardized data. I had standardized the data at first, but at some point I unknowingly deleted the portion of the code that did so. Later, when looking back at how I standardized, I realized I had only removed the Id column and had not actually standardized the data. After I realized this, I renamed the variables to reflect what the data actually was.
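
A quick sanity check can catch this kind of mix-up early. The sketch below is a minimal example on made-up data; the helper `looks_standardized` and the toy columns are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def looks_standardized(frame, tol=1e-6):
    """Return True if every column has mean ~0 and population std ~1."""
    values = np.asarray(frame, dtype=float)
    means_ok = np.allclose(values.mean(axis=0), 0.0, atol=tol)
    stds_ok = np.allclose(values.std(axis=0), 1.0, atol=tol)
    return bool(means_ok and stds_ok)


# Toy stand-in for the competition features (hypothetical values).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "f1": rng.normal(50.0, 10.0, size=200),
    "f2": rng.integers(0, 1000, size=200),
})

# StandardScaler uses the population standard deviation, matching np.std.
toy_std = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

raw_ok = looks_standardized(toy)
std_ok = looks_standardized(toy_std)
print("raw data standardized?   ", raw_ok)
print("scaled data standardized?", std_ok)
```

Running a check like this on `X_std` right before fitting would have flagged that the data had only lost its Id column.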

This section was split into subsections for **No Tuning**, **Tuning**, **Tuned Models**, and **Stacking**. For each of these subsections I looked at the models mentioned above. In the **No Tuning** section, I used the baseline model to see how each model performed with no tuning. In the **Tuning** section, I tuned and obtained different parameters using GridSearchCV and RandomizedSearchCV. For each model, I came back multiple times and set different ranges and parameters based on different resources that I found. The parameters that I found in **Tuning** were used in the **Tuned Models** section, which breaks down the success of each model. In the **Stacking** section, I tried stacking several of the models in several different ways to hopefully get a better result. Most of the models that I stacked resulted in a lower score. I believe this occurred due to overfitting.

### Section 3: Standardized Data
***
In this section, I standardized the data and then tried tuning similar to what I did in the models above. The models I looked at were XGBoost, Random Forest, and Catboost. During this time, I tried to improve the XGBoost model as much as I could. I also tried tuning the Random Forest and Catboost without much success.

This section was split into subsections for **No Tuning**, **Tuning**, and **Tuned Models**. As above, I started with the baseline models and then moved to tuning the parameters for each of the classifiers using GridSearchCV and RandomizedSearchCV. I again created different ranges of parameters to produce multiple slightly different models. During this time, I focused heavily on XGBoost and improving that score.

### Section 4: Feature Selection
***
In this section, I dropped some of the features by using feature importance for each of the different models, mainly focusing on XGBoost, Catboost, and Light GB. I went through the same process of tuning with slightly different ranges to improve the score. I also tried stacking the models again to see if I could get a better result. During this part, I used Catboost and dropped a few features to get the best overall results.

This section is split based on the classifier and dataset that was being used. I used **XGBoost with the Standardized Values**, **CatBoost with the Standardized Values**, **CatBoost with Data with No Id column**, and **Light GB with Data with No Id column**. Each of these subsections is broken down even further. First, I found the importance of each feature using a baseline model and its feature importance scores. Based on this, I removed the features with low importance. Then I tuned each model to find the best parameters and used those parameters to create several tuned models. As mentioned above, I also tried stacking, which I placed into **XGBoost stacked with XGBoost**.
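
The feature-dropping step can be sketched as follows. This is a minimal illustration on synthetic data, using a Random Forest as a stand-in baseline; the mean-importance cutoff is a hypothetical choice and not necessarily the rule used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 4 columns are informative
# and the last 4 are pure noise.
X_arr, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    shuffle=False, random_state=42,
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i + 1}" for i in range(8)])

# Fit a baseline model and read its impurity-based importances.
baseline = RandomForestClassifier(n_estimators=200, random_state=42)
baseline.fit(X_toy, y_toy)
importances = pd.Series(baseline.feature_importances_, index=X_toy.columns)

# Hypothetical cutoff: drop every feature below the mean importance.
threshold = importances.mean()
low_importance = importances[importances < threshold].index.tolist()
X_reduced = X_toy.drop(columns=low_importance)

print(importances.sort_values(ascending=False))
print("Dropped:", low_importance)
```

The reduced frame `X_reduced` would then go through the same tuning process as the full data.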

## Model Testing Order
***
Overall, I started by looking at the baseline models for a few of the different classifiers, mainly Random Forest, XGBoost, and Catboost. For each of these models I tried various things such as tuning, stacking, parameter changes, and feature selection. The one I initially focused on was XGBoost since it gave me the highest AUC score at first. After dropping several unimportant features and improving the score as much as possible, I moved to looking at Catboost. Using Catboost, I was able to get my best score by once again dropping features and trying different parameters. Lastly, through my research I came across the Light GB model and tried the same methods that I used for both XGBoost and Catboost.

## What I learned
***
Different datasets worked better for different classifiers. For XGBoost, the standardized data worked best for my models, but when I used Catboost, the regular data worked better. In fact, my best score was Catboost using the regular data with a few features dropped. 

Also, during my research I read that having a high number of iterations helps prevent overfitting. To help prevent overfitting in most of the models, I used iterations from around 1000-2000. 
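
Whether more iterations help or hurt usually depends on the learning rate and the data, so it may be worth checking on a hold-out split. The sketch below is a minimal example that tracks validation AUC after every boosting iteration. It assumes synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the boosted models in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with some label noise.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1, random_state=42)
gbm.fit(X_tr, y_tr)

# staged_predict_proba yields predictions after each boosting iteration,
# so one fit gives the whole validation curve.
val_auc = [
    roc_auc_score(y_val, proba[:, 1])
    for proba in gbm.staged_predict_proba(X_val)
]
best_iter = int(np.argmax(val_auc)) + 1

print(f"best iteration: {best_iter}")
print(f"validation AUC at best iteration: {max(val_auc):.4f}")
print(f"validation AUC at iteration {len(val_auc)}: {val_auc[-1]:.4f}")
```

If the curve peaks early and then declines, the extra iterations are adding overfitting rather than preventing it, and a lower learning rate or fewer iterations may be a better fit.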

Looking back, stacking did not help me at all. This could be because the model was overfitting the data. 

Initially, I thought the Light GB models could help improve my score, but I think I did not tune the model well enough to get a better score.

Feature selection helped me improve my score every time I hit a bump in tuning the models. It did not help when using Catboost with the standardized data, but it helped for both XGBoost using the standardized data and Catboost using the regular data. For both the XGBoost classifier and the Catboost classifier, I was able to get the highest score on the public leaderboard by dropping several of the features. But looking back at the private leaderboard, the original models without any features dropped performed better.
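
One way to reduce reliance on the public leaderboard when judging feature selection is to compare cross-validated AUC with and without the dropped features. The sketch below is a minimal example on synthetic data; the `dropped` list and the Random Forest model are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data: first 4 columns informative, last 4 noise.
X_arr, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i + 1}" for i in range(8)])
dropped = ["f7", "f8"]  # hypothetical low-importance features

# Same folds for both runs so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

auc_full = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")
auc_reduced = cross_val_score(
    model, X_toy.drop(columns=dropped), y_toy, cv=cv, scoring="roc_auc"
)

print(f"all features:     {auc_full.mean():.4f} +/- {auc_full.std():.4f}")
print(f"dropped features: {auc_reduced.mean():.4f} +/- {auc_reduced.std():.4f}")
```

If the two means differ by less than the fold-to-fold spread, a gain on the public leaderboard from dropping features may not carry over to the private one.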


## Best Model
***

*Public Leaderboard:* For the Public Leaderboard, the best model I was able to create was a tuned Catboost with several features dropped. I took a baseline Catboost model, measured the importance of each feature, and dropped those that had low values. After dropping the features, I tuned the model to find the best parameters based on the data with the dropped features. Using these parameters, I created the model that gave me my highest AUC score on the public leaderboard.

*Private Leaderboard:* For the Private Leaderboard, I found that the models with the dropped features did worse than the models with all of the features in the data. The best model I created was a tuned Catboost using the complete regular data without any normalization and no dropped features. 

***

# Setup


## Importing Libraries

In [None]:
#!pip install xgboost
!pip install catboost
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV,RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

%matplotlib inline



## Importing Data

In [None]:
train = pd.read_csv('train_final.csv')
test = pd.read_csv('test_final.csv')

In [None]:
train.shape

(16383, 26)

In [None]:
y = train['Y']
X = train.drop('Y', axis=1)

### Standardizing the data and removing the Id Column

In [None]:
# Combining all the data together
all_data = pd.concat((train.loc[:,'f1':'f24'],
                      test.loc[:,'f1':'f24']))
feature_col_names=list(all_data.columns)

In [None]:
# Removing Id Column
all_data = pd.concat((train.loc[:,'f1':'f24'],
                      test.loc[:,'f1':'f24']))
X_no_id = all_data[:train.shape[0]]
test_no_id = all_data[train.shape[0]:]

In [None]:
# Standardizing the data
sc = StandardScaler()
std=sc.fit_transform(all_data)

#creating matrices for sklearn:
X_std = std[:train.shape[0]]
X_std = pd.DataFrame(data=X_std, columns=feature_col_names)
test_std = std[train.shape[0]:]
test_std = pd.DataFrame(data=test_std, columns=feature_col_names)

# S1: Raw Data using different models

## S1-A: No Tuning

### Logistic Regression Model

In [None]:
lr = LogisticRegression(random_state = 42)
lr.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
y_pred = lr.predict(X)
y_probas = lr.predict_proba(X)
print('Misclassified samples: %d' %(y != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y, y_pred))
print('AUC: %.2f' % roc_auc_score(y, y_probas[:, 1]))

Misclassified samples: 948
Accuracy: 0.94
AUC: 0.53


In [None]:
test_probas = lr.predict_proba(test)
lr_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
lr_solution.to_csv("lr_no_std_no_tuning_sol.csv", index = False)

Kaggle Score: 0.54023

### Random Forest

In [None]:
rf = RandomForestClassifier(n_jobs = -1,random_state = 42)
rf.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
test_probas = rf.predict_proba(test)
rf_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
rf_solution.to_csv("rf_no_std_no_tuning_sol.csv", index = False)

Kaggle Score: 0.84233

### XGBoost

In [None]:
xgb_model = XGBClassifier(n_estimators=1000, random_state=42)
xgb_model.fit(X, y)

In [None]:
test_probas = xgb_model.predict_proba(test)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std2_tuning2_sol.csv", index = False)

Kaggle Score: 0.86479

***

# S2: Data without Id Column


## S2-A: No Tuning

### Random Forest

In [None]:
rf = RandomForestClassifier(n_jobs = -1,random_state = 42)
rf.fit(X_no_id, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
test_probas = rf.predict_proba(test_no_id)
rf_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
rf_solution.to_csv("rf_std_no_tuning_sol.csv", index = False)

Kaggle Score: 0.86222

### XGBoost

In [None]:
xgb_model = XGBClassifier(max_depth=10,min_child_weight=1,gamma=1,scale_pos_weight=.6,subsample=0.5,n_estimators=1000, random_state=42)
xgb_model.fit(X_no_id, y)

In [None]:
test_probas = xgb_model.predict_proba(test_no_id)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std_no_tuning_no_param_sol.csv", index = False)

Kaggle Score: 0.87718

### Catboost

In [None]:
cb_model = CatBoostClassifier(random_state=42,task_type='GPU', verbose=0)
cb_model.fit(X_no_id, y)

In [None]:
test_probas = cb_model.predict_proba(test_no_id)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_std_no_tuning_sol.csv", index = False)

Kaggle Score: 0.86237

### Light GB Model

In [None]:
lgb_model = LGBMClassifier(metric='auc',random_state=42, n_estimators=1500, verbose=0)
lgb_model.fit(X_no_id, y)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               metric='auc', min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
               objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0, verbose=0)

In [None]:
test_probas = lgb_model.predict_proba(test_no_id)
lgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
lgb_solution.to_csv("lgb_model_noid_no_tuning_sol.csv", index = False)

Kaggle Score: 0.86656

## S2-B: Tuning Section

### Tuning the Random Forest Model

In [None]:
rf = RandomForestClassifier(n_jobs = -1,random_state = 42)
rf.fit(X_no_id, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
# Hypertuning the Random Forest Model

# Number of trees in random forest
#n_estimators = [500,1000,1500,2000]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {#'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
               }

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, scoring='roc_auc', cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_no_id, y)
# print results
print(rf_random.best_params_)
print(rf_random.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   51.9s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  6.8min finished


{'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': False}
0.8615652409245795


### Tuning the XGBoost
Because there are many parameters to tune, I tuned them in stages, building up the model one parameter at a time to get a rough range for each. Then I ran a smaller search over a narrower range for all the parameters together. I ran this multiple times with slightly different ranges to get two or three different models to test.

#### Tuning the eta and max_depth

In [None]:
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_no_id, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# Hypertuning the XGBoost
max_depth=[1,2,3,4,5,6,7,8,9,10]

# Create the parameter grid
grid = {
        'max_depth': max_depth,
               }

# Exhaustive grid search over the parameters
xgb_search = GridSearchCV(estimator = xgb_model, param_grid = grid, cv = 5, scoring='roc_auc', verbose=2, n_jobs = -1)
# Fit the model
xgb_search.fit(X_no_id, y)
# print results
print(xgb_search.best_params_)
print(xgb_search.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   54.7s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.5min finished


{'max_depth': 6}
0.8555425240205217


#### Tuning the min_child_weight

In [None]:
xgb_model = XGBClassifier(eta=0.4,max_depth=6, random_state=42)
xgb_model.fit(X_no_id, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.4, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# Hypertuning the XGBoost
min_child_weight=[1,2,3,4,5,6,7,8,9,10]


# Create the parameter grid
grid = {
        'min_child_weight': min_child_weight,

               }

# Exhaustive grid search over the parameters
xgb_search = GridSearchCV(estimator = xgb_model, param_grid = grid, cv = 5, scoring='roc_auc', verbose=2, n_jobs = -1)
# Fit the model
xgb_search.fit(X_no_id, y)
# print results
print(xgb_search.best_params_)
print(xgb_search.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.4min finished


{'min_child_weight': 6}
0.8575492595361996


#### Tuning the gamma

In [None]:
xgb_model = XGBClassifier(eta=0.4,max_depth=6,min_child_weight=6, random_state=42)
xgb_model.fit(X_no_id, y)

In [None]:
# Hypertuning the XGBoost
gamma=[1,2,3,4,5,6,7,8,9,10]

# Create the parameter grid
grid = {
        'gamma': gamma,
               }

# Exhaustive grid search over the parameters
xgb_search = GridSearchCV(estimator = xgb_model, param_grid = grid, cv = 5, scoring='roc_auc', verbose=2, n_jobs = -1)
# Fit the model
xgb_search.fit(X_no_id, y)
# print results
print(xgb_search.best_params_)
print(xgb_search.best_score_)

#### Tuning the subsample and the scale_pos_weight

In [None]:
xgb_model = XGBClassifier(eta=0.4,max_depth=6,min_child_weight=6,gamma=4, random_state=42)
xgb_model.fit(X_no_id, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.4, gamma=4,
              learning_rate=0.1, max_delta_step=0, max_depth=6,
              min_child_weight=6, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# Hypertuning the XGBoost

subsample=[0.5,0.6,0.7,0.8,0.9,1]
scale_pos_weight=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]

# Create the parameter grid
grid = {
        'subsample': subsample,
        'scale_pos_weight': scale_pos_weight,
               }

# Exhaustive grid search over the parameters
xgb_search = GridSearchCV(estimator = xgb_model, param_grid = grid, cv = 5, scoring='roc_auc', verbose=2, n_jobs = -1)
# Fit the model
xgb_search.fit(X_no_id, y)
# print results
print(xgb_search.best_params_)
print(xgb_search.best_score_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  8.7min finished


{'scale_pos_weight': 0.8, 'subsample': 0.8}
0.8590735440985217


#### Tuning everything with a smaller scale

In [None]:
xgb_model = XGBClassifier(eta=0.4,max_depth=6,min_child_weight=6,gamma=4,scale_pos_weight=0.8,subsample=0.8, random_state=42)
xgb_model.fit(X_no_id, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.4, gamma=4,
              learning_rate=0.1, max_delta_step=0, max_depth=6,
              min_child_weight=6, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=0.8, seed=None,
              silent=None, subsample=0.8, verbosity=1)

In [None]:
# Hypertuning the XGBoost
eta=[0.4,0.3,0.5]
max_depth=[5,6,7,8,9,10]
min_child_weight=[5,6,7]
gamma=[3,4,5,6]
subsample=[0.6,0.7,0.8,0.9]
scale_pos_weight=[0.6,0.7,0.8,0.9]

# Create the random grid
grid = {'eta': eta,
        'max_depth': max_depth,
        'min_child_weight': min_child_weight,
        'gamma': gamma,
        'subsample': subsample,
        'scale_pos_weight': scale_pos_weight,
               }

# Random search of parameters
xgb_random = RandomizedSearchCV(estimator = xgb_model, param_distributions = grid, n_iter = 100, scoring='roc_auc', cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
xgb_random.fit(X_no_id, y)
# print results
print(xgb_random.best_params_)
print(xgb_random.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  8.9min finished


{'subsample': 0.9, 'scale_pos_weight': 0.7, 'min_child_weight': 7, 'max_depth': 10, 'gamma': 3, 'eta': 0.4}
0.8612441721715811


### Tuning Catboost

In [None]:
cb_model = CatBoostClassifier(random_state=42, eval_metric='AUC', verbose=0)
cb_model.fit(X_no_id, y)

<catboost.core.CatBoostClassifier at 0x7f69d891fc18>

In [None]:
# Grid Search
grid = {'depth':[3,1,2,6,4,5,7,8,9,10],
          'iterations':[250,100,500,1000],
          'learning_rate':[0.03,0.001,0.01,0.1,0.2,0.3], 
          'l2_leaf_reg':[3,1,5,10,100],
          'border_count':[32,5,10,20,50,100,200],
          'scale_pos_weight':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
          'thread_count':[4]
          }
# Exhaustive grid search over the parameters
cb_search = GridSearchCV(estimator = cb_model, param_grid = grid, cv = 5, scoring='roc_auc', verbose=2, n_jobs = -1)
# Fit the model
cb_search.fit(X_std, y)
# print results
print(cb_search.best_params_)
print(cb_search.best_score_)

### Tuning the Light GB Model

In [None]:
# np.float was removed in NumPy 1.24; the builtin float is equivalent here
cat_feat = np.where(X_no_id.dtypes != float)[0]
lgb_model = LGBMClassifier(boosting_type='dart',cat_features=cat_feat,metric='auc',random_state=42, verbose=0)
lgb_model.fit(X_no_id, y)

LGBMClassifier(boosting_type='dart',
               cat_features=array([ 0,  1,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23]),
               class_weight=None, colsample_bytree=1.0, importance_type='split',
               learning_rate=0.1, max_depth=-1, metric='auc',
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0,
               verbose=0)

In [None]:
random_grid = {
          "max_depth": [25,50, 75],
          "learning_rate" : [0.01,0.05,0.1],
          "num_leaves": [300,900,1200],
          }

# Random search of parameters
lgb_random = RandomizedSearchCV(estimator = lgb_model, param_distributions = random_grid, n_iter = 10, cv = 5, verbose=2, n_jobs=-1, random_state=42)
# Fit the model
lgb_random.fit(X_no_id, y)
# print results
print(lgb_random.best_params_)
print(lgb_random.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  2.0min finished


{'num_leaves': 900, 'max_depth': 50, 'learning_rate': 0.05}
0.9604470310146234


## S2-C: Tuned Models

### Random Forest with tuned parameters

In [None]:
rf = RandomForestClassifier(min_samples_split=2, min_samples_leaf=2, max_features='sqrt', max_depth=20, bootstrap=False, n_estimators = 1500, n_jobs = -1,random_state = 42)
rf.fit(X_std, y)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=20, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1500,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
test_probas = rf.predict_proba(test_std)
rf_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
rf_solution.to_csv("rf_std2_tuning2_sol.csv", index = False)

Kaggle Score: 0.87501

### XGBoost with tuned parameters

In [None]:
xgb_model = XGBClassifier(eta=0.4,booster='dart',max_depth=9,min_child_weight=7,gamma=2,scale_pos_weight=0.8,subsample=0.8, n_estimators=1500, random_state=42)
xgb_model.fit(X_no_id, y)


In [None]:
test_probas = xgb_model.predict_proba(test_no_id)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std2_tuning2_sol.csv", index = False)

Kaggle Score: 0.87830

### XGBoost with tuned parameters (2)

In [None]:
xgb_model = XGBClassifier(booster='dart',max_depth=24,min_child_weight=1,gamma=9,scale_pos_weight=0.8,subsample=0.6, n_estimators=1500)
xgb_model.fit(X_no_id, y)

In [None]:
test_probas = xgb_model.predict_proba(test_no_id)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_no_id_tuning2_sol.csv", index = False)

Kaggle Score: 0.87863

### Catboost with tuned parameters

In [None]:
cb_model = CatBoostClassifier(learning_rate=0.005, n_estimators=1500, max_depth=7, l2_leaf_reg=1, scale_pos_weight=0.7, random_state=42,task_type='GPU', verbose=0)
cb_model.fit(X_no_id, y)

In [None]:
test_probas = cb_model.predict_proba(test_no_id)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_std_tuning_sol.csv", index = False)

Kaggle Score: 0.85171

### Catboost with tuned parameters (3)

In [None]:
# np.float was removed in NumPy 1.24; the builtin float is equivalent here
cat_feat = np.where(X_no_id.dtypes != float)[0]
cb_model = CatBoostClassifier(cat_features=cat_feat,eval_metric='AUC', learning_rate=0.01, thread_count=4, n_estimators=1500, depth=3, l2_leaf_reg=100, scale_pos_weight=0.4, border_count=100, random_state=42, verbose=0)
cb_model.fit(X_no_id, y)

<catboost.core.CatBoostClassifier at 0x7f69d6d6a1d0>

In [None]:
test_probas = cb_model.predict_proba(test_no_id)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_noid_tuning3_sol.csv", index = False)

Kaggle Score: 0.92915

### Catboost with tuned parameters (4)
Testing Logloss as the loss function, since AUC is not available as a `loss_function` option.

In [None]:
cb_model = CatBoostClassifier(loss_function='Logloss',eval_metric='AUC', learning_rate=0.01, thread_count=4, n_estimators=1500, depth=3, l2_leaf_reg=100, scale_pos_weight=0.4, border_count=100, random_state=42, verbose=0)
cb_model.fit(X_no_id, y)

In [None]:
test_probas = cb_model.predict_proba(test_no_id)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_noid_tuning4_sol.csv", index = False)

Kaggle Score: 0.83829

### Light GB Model with tuned parameters

In [None]:
# np.float was removed in NumPy 1.24; the builtin float is equivalent here
cat_feat = np.where(X_no_id.dtypes != float)[0]
lgb_model = LGBMClassifier(boosting_type='dart',cat_features=cat_feat,max_depth=50, num_leaves=900, learning_rate= 0.05, metric='auc',random_state=42, n_estimators=1500, verbose=0)
lgb_model.fit(X_no_id, y)

LGBMClassifier(boosting_type='dart',
               cat_features=array([ 0,  1,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23]),
               class_weight=None, colsample_bytree=1.0, importance_type='split',
               learning_rate=0.05, max_depth=50, metric='auc',
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=1500, n_jobs=-1, num_leaves=900, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0,
               verbose=0)

In [None]:
test_probas = lgb_model.predict_proba(test_no_id)
lgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
lgb_solution.to_csv("lgb_model_noid_tuning_sol.csv", index = False)

Kaggle Score: 0.87897

## S2-D: Stacking
At this point, I tried to use stacking to improve my score. This decreased my score, and I believe this is due to the models overfitting. I then realized I was not using the standardized data, so I did not continue stacking.
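
A possible contributor to the overfitting is that the stacked features below are predictions made on the same rows the base model was trained on. A common alternative is to build the stacked feature from out-of-fold predictions. The sketch below is a minimal example on synthetic data, with a logistic regression meta-model as an assumption; it is not the approach used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X_toy, y_toy = make_classification(n_samples=800, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

base = RandomForestClassifier(n_estimators=100, random_state=42)

# Out-of-fold probabilities: each training row is scored by a model
# that never saw that row during fitting.
oof = cross_val_predict(base, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]

# Refit the base model on all training rows to score the test set.
base.fit(X_tr, y_tr)
test_meta = base.predict_proba(X_te)[:, 1]

# Meta-model trained on the original features plus the out-of-fold column.
meta = LogisticRegression(max_iter=1000)
meta.fit(np.column_stack([X_tr, oof]), y_tr)
stacked_probas = meta.predict_proba(np.column_stack([X_te, test_meta]))[:, 1]

print(stacked_probas[:5])
```

scikit-learn also provides `StackingClassifier`, which handles the out-of-fold bookkeeping internally.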



### Random Forest stacked with same Random Forest classified by Random Forest

In [None]:
rf = RandomForestClassifier(min_samples_split=2, min_samples_leaf=1, max_features='sqrt', max_depth=30, bootstrap=False, n_estimators = 1500, n_jobs = -1,random_state = 42)
rf.fit(X_no_id, y)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=30, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1500,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
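# Copy the training features, then append the base model's predictions as extra columns.
# rf was trained on these same rows, so its in-sample predictions will likely be close to y.
X_no_id_stacked=X_no_id.copy()
rf1_preds = rf.predict(X_no_id)
rf2_preds = rf.predict(X_no_id)
# Index each Series like the training frame so the values align row by row
rf1_series= pd.Series(rf1_preds, index=X_no_id_stacked.index)
rf2_series= pd.Series(rf2_preds, index=X_no_id_stacked.index)
# Add the prediction columns to the frame the stacked model is fit on
X_no_id_stacked['rf1_preds'] = rf1_series
X_no_id_stacked['rf2_preds'] = rf2_series
stacked_rf = RandomForestClassifier(min_samples_split=2, min_samples_leaf=1, max_features='sqrt', max_depth=30, bootstrap=False, n_estimators = 1500, n_jobs = -1,random_state = 42)
stacked_rf.fit(X_no_id_stacked, y)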

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=30, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1500,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
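test_no_id_stacked=test_no_id.copy()
rf1_test_preds = rf.predict(test_no_id)
rf2_test_preds = rf.predict(test_no_id)
# Index each Series like the test frame so the values align row by row
rf1_test_series= pd.Series(rf1_test_preds, index=test_no_id_stacked.index)
rf2_test_series= pd.Series(rf2_test_preds, index=test_no_id_stacked.index)
# Use the training column names so the stacked model sees matching features
test_no_id_stacked['rf1_preds'] = rf1_test_series
test_no_id_stacked['rf2_preds'] = rf2_test_series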

In [None]:
test_pred = stacked_rf.predict(test_no_id_stacked)
test_probas = stacked_rf.predict_proba(test_no_id_stacked)

In [None]:
stacked_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
stacked_solution.to_csv("stacked_rf_sol.csv", index = False)

Kaggle Score: 0.77292

***

# S3: Standardized Data
After I realized my mistake, I fixed it and reran both the Random Forest and the XGBoost models on the standardized data. I tried several parameter ranges to see whether I could improve the AUC score. Some of these changes helped, but most improved the score only marginally.
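For reference, standardization rescales each feature to zero mean and unit variance. The block below is a minimal sketch on made-up data; `train_demo` and `test_demo` are hypothetical frames, not the competition files. It fits the scaler on the training rows only and reuses those statistics for the test rows, which is a common variation on fitting the scaler on the concatenated data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up frames standing in for the train and test features (assumption).
rng = np.random.default_rng(42)
cols = ["f1", "f2", "f3"]
train_demo = pd.DataFrame(rng.normal(50, 10, size=(200, 3)), columns=cols)
test_demo = pd.DataFrame(rng.normal(50, 10, size=(80, 3)), columns=cols)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
sc = StandardScaler()
X_std_demo = pd.DataFrame(sc.fit_transform(train_demo), columns=cols)
test_std_demo = pd.DataFrame(sc.transform(test_demo), columns=cols)

# The training columns now have mean 0 and (population) standard deviation 1.
print(X_std_demo.mean().round(3).tolist())
print(X_std_demo.std(ddof=0).round(3).tolist())
```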

## S3-A: No Tuning

### Random Forest

In [None]:
rf = RandomForestClassifier(n_jobs = -1,random_state = 42)
rf.fit(X_std, y)

In [None]:
test_probas = rf.predict_proba(test_std)
rf_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
rf_solution.to_csv("rf_std2_no_tuning_sol.csv", index = False)

Kaggle Score: 0.86628

## S3-B: Tuning Section

### Tuning the Random Forest Model

In [None]:
rf = RandomForestClassifier(n_jobs = -1,random_state = 42)
rf.fit(X_std, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
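# Hyperparameter tuning for the Random Forest model

# Number of trees in random forest
#n_estimators = [500,1000,1500,2000]
# Number of features to consider at every split.
# For classifiers 'auto' means the same as 'sqrt'; it was removed in scikit-learn 1.3.
max_features = ['auto', 'sqrt']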
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {#'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
               }

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, scoring='roc_auc', cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_std, y)
# print results
print(rf_random.best_params_)
print(rf_random.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   51.9s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  6.8min finished


{'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': False}
0.8615652409245795


### Tuning the XGBoost
Because XGBoost has many parameters to tune, I tuned them in stages, building the model up gradually to find a rough range for each parameter. I then ran a smaller search over the narrowed ranges for all the parameters together. I repeated this several times with slightly different ranges to get two or three candidate models to test.
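The coarse-then-narrow procedure can be sketched as follows. This is an illustration under stated assumptions: it uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for XGBoost, and the parameter ranges are invented for the example rather than taken from the searches in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data and a small booster so the sketch runs quickly (assumptions).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=42)
model = GradientBoostingClassifier(n_estimators=30, random_state=42)

# Stage 1: coarse random search over wide ranges.
coarse_grid = {
    "max_depth": [2, 4, 6, 8],
    "subsample": [0.5, 0.7, 0.9],
    "learning_rate": [0.05, 0.1, 0.2],
}
coarse = RandomizedSearchCV(
    model, coarse_grid, n_iter=6, scoring="roc_auc", cv=3, random_state=42
)
coarse.fit(X, y)
best = coarse.best_params_

# Stage 2: exhaustive search in a narrow window around the coarse optimum.
fine_grid = {
    "max_depth": sorted({max(1, best["max_depth"] - 1), best["max_depth"], best["max_depth"] + 1}),
    "subsample": [best["subsample"]],
    "learning_rate": [best["learning_rate"]],
}
fine = GridSearchCV(model, fine_grid, scoring="roc_auc", cv=3)
fine.fit(X, y)

print(coarse.best_params_, round(coarse.best_score_, 4))
print(fine.best_params_, round(fine.best_score_, 4))
```

Because the narrow grid contains the coarse winner and both searches use the same folds, the second stage cannot score worse than the first.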

In [None]:
xgb_model = XGBClassifier(booster='dart',random_state=42)
xgb_model.fit(X_std, y)

XGBClassifier(base_score=0.5, booster='dart', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# Hyperparameter tuning for XGBoost
eta=[0.1,0.2]
#n_estimators=[1500]
max_depth=[10,14,16,18,20,22,24]
min_child_weight=[1,3,5,6,7]
gamma=[3,4,5,6,7,9]
subsample=[0.5,0.6,0.7,0.8,0.9]
scale_pos_weight=[0.6,0.7,0.8,0.9,1]
max_delta_step=[1,2,3,4,5]

# Create the random grid
grid = {'eta': eta,
        #'n_estimators': n_estimators,
        'max_depth': max_depth,
        'min_child_weight': min_child_weight,
        'gamma': gamma,
        'subsample': subsample,
        'scale_pos_weight': scale_pos_weight,
        'max_delta_step': max_delta_step
               }

# Random search of parameters
xgb_random = RandomizedSearchCV(estimator = xgb_model, param_distributions = grid, n_iter = 100, scoring='roc_auc', cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
xgb_random.fit(X_std, y)
# print results
print(xgb_random.best_params_)
print(xgb_random.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed: 12.3min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 22.9min finished


{'subsample': 0.9, 'scale_pos_weight': 0.9, 'min_child_weight': 3, 'max_depth': 18, 'max_delta_step': 4, 'gamma': 5, 'eta': 0.2}
0.8664444198414749


In [None]:
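# In-sample check of the untuned random forest `rf` fitted on X_std earlier in this section.
# These are training-set metrics, so they overstate performance on unseen data.
y_pred = rf.predict(X_std)
y_probas = rf.predict_proba(X_std)
print('Misclassified samples: %d' %(y != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y, y_pred))
print('AUC score: %.2f' % roc_auc_score(y, y_probas[:, 1]))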

Misclassified samples: 127
Accuracy: 0.99
AUC score: 1.00
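An AUC of 1.00 measured on the rows the model was trained on says little about how it will score on Kaggle. A hold-out split gives a more realistic estimate. The block below is a minimal sketch on synthetic data; `rf_demo` and the 80/20 split are choices made for the illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy data standing in for the competition set (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Score the same model on rows it has seen and on rows it has not.
train_auc = roc_auc_score(y_tr, rf_demo.predict_proba(X_tr)[:, 1])
valid_auc = roc_auc_score(y_val, rf_demo.predict_proba(X_val)[:, 1])
print("Train AUC: %.3f" % train_auc)
print("Validation AUC: %.3f" % valid_auc)
```

The gap between the two numbers is a rough indicator of how much the model has overfit.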


### Tuning Catboost
I found a resource on CatBoost parameter tuning and used it to retune some of the parameters.

In [None]:
cb_model = CatBoostClassifier(random_state=42,task_type='GPU', eval_metric='AUC', verbose=0)
cb_model.fit(X_std, y)

<catboost.core.CatBoostClassifier at 0x7f6e7f9eaef0>

In [None]:
# Hyperparameter tuning for the CatBoost model
learning_rate = [0.03,0.001,0.01,0.1,0.2,0.3]
iterations = [500,750,1000,1250,1500,1750,2000]
max_depth = [1,2,3,4,5,6,7,8,9,10,12,14,16]
l2_leaf_reg =[1,3,5,10,100]
border_count =[5,10,20,30,50,100,200]
scale_pos_weight=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]


random_grid = {'learning_rate': learning_rate,
              'iterations':iterations,
              'max_depth': max_depth,
              'l2_leaf_reg':l2_leaf_reg,
               'border_count':border_count,
              'scale_pos_weight': scale_pos_weight

}

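# Random search of parameters, scored by ROC AUC to match the competition metric
# (without scoring=, the search would rank candidates by accuracy)
cb_random = RandomizedSearchCV(estimator = cb_model, param_distributions = random_grid, n_iter = 10, scoring='roc_auc', cv = 5, verbose=2, random_state=42)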
# Fit the model
cb_random.fit(X_std, y)
# print results
print(cb_random.best_params_)
print(cb_random.best_score_)

## S3-C: Tuned Models

### Random Forest with tuned parameters

In [None]:
rf = RandomForestClassifier(min_samples_split=2, min_samples_leaf=2, max_features='sqrt', max_depth=20, bootstrap=False, n_estimators = 1500, n_jobs = -1,random_state = 42)
rf.fit(X_std, y)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=20, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1500,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
test_probas = rf.predict_proba(test_std)
rf_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
rf_solution.to_csv("rf_std2_tuning2_sol.csv", index = False)

Kaggle Score: 0.87533

### XGBoost with tuned parameters (2)

In [None]:
xgb_model = XGBClassifier(eta=0.4,booster='dart',max_depth=16,min_child_weight=7,gamma=3,scale_pos_weight=0.7,subsample=0.9, n_estimators=1500, random_state=42)
xgb_model.fit(X_std, y)

In [None]:
test_probas = xgb_model.predict_proba(test_std)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std2_tuning2_sol.csv", index = False)

Kaggle Score: 0.87797

### XGBoost with tuned parameters (3)

In [None]:
xgb_model = XGBClassifier(booster='dart',max_depth=16, min_child_weight=7,gamma=3,scale_pos_weight=0.7,subsample=0.9, n_estimators=1500, random_state=42)
xgb_model.fit(X_std, y)

In [None]:
test_probas = xgb_model.predict_proba(test_std)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std2_tuning3_sol.csv", index = False)

Kaggle Score: 0.87831

### XGBoost with tuned parameters (4)

In [None]:
xgb_model = XGBClassifier(booster='dart',max_depth=10, min_child_weight=1,gamma=3,scale_pos_weight=0.7,subsample=0.8, n_estimators=1500, random_state=42)
xgb_model.fit(X_std, y)

In [None]:
test_probas = xgb_model.predict_proba(test_std)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std_tuning4_sol.csv", index = False)

Kaggle Score: 0.87946

### XGBoost with tuned parameters (5)

In [None]:
xgb_model = XGBClassifier(booster='dart',max_depth=10, min_child_weight=2,gamma=2,scale_pos_weight=0.9,subsample=0.6, n_estimators=1500, random_state=42)
xgb_model.fit(X_std, y)

In [None]:
test_probas = xgb_model.predict_proba(test_std)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std_tuning5_sol.csv", index = False)

Kaggle Score: 0.87684

### XGBoost with tuned parameters (6)

In [None]:
xgb_model = XGBClassifier(booster='dart',max_depth=24, min_child_weight=1,gamma=3,scale_pos_weight=0.9,subsample=0.6, n_estimators=1500, random_state=42)
xgb_model.fit(X_std, y)

In [None]:
test_probas = xgb_model.predict_proba(test_std)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std_tuning7_sol.csv", index = False)

Kaggle Score: 0.87882

### XGBoost with tuned parameters (7)

In [None]:
xgb_model = XGBClassifier(eta=0.4, booster='dart',max_depth=24, min_child_weight=1,gamma=3,scale_pos_weight=0.9,subsample=0.9, n_estimators=1500, random_state=42)
xgb_model.fit(X_std, y)

In [None]:
test_probas = xgb_model.predict_proba(test_std)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_std_tuning8_sol.csv", index = False)

Kaggle Score: 0.87756

### Catboost with tuned parameters

In [None]:
cb_model = CatBoostClassifier(learning_rate=0.01, iterations=1500, max_depth=9, l2_leaf_reg=3, scale_pos_weight=0.7, random_state=42,task_type='GPU', eval_metric='AUC', verbose=0)
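# Note: this cell trains on X_no_id (unscaled) rather than X_std, although it sits in the standardized-data section
cb_model.fit(X_no_id, y)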

In [None]:
test_probas = cb_model.predict_proba(test_no_id)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_std2_tuning_sol.csv", index = False)

Kaggle Score: 0.87168

### Catboost with tuned parameters (2)

In [None]:
cb_model = CatBoostClassifier(learning_rate=0.01, iterations=500, max_depth=9, l2_leaf_reg=3, scale_pos_weight=0.7, random_state=42,task_type='GPU', eval_metric='AUC', verbose=0)
# Note: trained on X_no_id (unscaled) rather than X_std
cb_model.fit(X_no_id, y)

In [None]:
test_probas = cb_model.predict_proba(test_no_id)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_std2_tuning2_sol.csv", index = False)

Kaggle Score: 0.85613

***

# S4: Feature Selection

## S4-A: XGBoost using Standardized Values-Feature Importance

### Using XGBoost to drop features

In [None]:
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_std, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# Finding the feature importance
importance = xgb_model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('f%0d, Score: %.5f' % (i+1,v))

f1, Score: 0.03690
f2, Score: 0.00000
f3, Score: 0.04208
f4, Score: 0.05820
f5, Score: 0.01866
f6, Score: 0.01802
f7, Score: 0.04918
f8, Score: 0.06070
f9, Score: 0.00000
f10, Score: 0.03252
f11, Score: 0.00000
f12, Score: 0.02465
f13, Score: 0.05682
f14, Score: 0.28747
f15, Score: 0.04232
f16, Score: 0.03977
f17, Score: 0.03821
f18, Score: 0.00000
f19, Score: 0.05056
f20, Score: 0.02410
f21, Score: 0.00000
f22, Score: 0.06337
f23, Score: 0.02433
f24, Score: 0.03214
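One way to turn the scores above into a drop list is a simple cut-off. The block below is an illustration rather than the procedure used in this notebook: the `threshold` value is hypothetical, and the scores are copied from the printed output on the assumption that the columns are ordered f1 to f24, as the print loop assumes.

```python
import pandas as pd

# XGBoost importances copied from the printed output above (f1 ... f24).
scores = [
    0.03690, 0.00000, 0.04208, 0.05820, 0.01866, 0.01802,
    0.04918, 0.06070, 0.00000, 0.03252, 0.00000, 0.02465,
    0.05682, 0.28747, 0.04232, 0.03977, 0.03821, 0.00000,
    0.05056, 0.02410, 0.00000, 0.06337, 0.02433, 0.03214,
]
importance = pd.Series(scores, index=["f%d" % (i + 1) for i in range(len(scores))])

# Hypothetical cut-off: flag every feature scoring below 0.02.
threshold = 0.02
low_importance = importance[importance < threshold].index.tolist()
print(low_importance)

# Show the flagged features with their scores, lowest first.
print(importance[low_importance].sort_values())
```

With this cut-off the flagged list contains f21 (score 0.00000) but not f20 (score 0.02410), whereas the drop list in the next cell names f20, so that list may be worth double-checking.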


In [None]:
# Selected f2, f5, f6, f9, f11, f18 and f20 to drop because of their low importance
all_data_drop = all_data.drop(['f2', 'f5', 'f6', 'f9', 'f11', 'f18', 'f20'], axis=1)
feature_col_names=list(all_data_drop.columns)

# Standardizing the data
sc = StandardScaler()
std_drop=sc.fit_transform(all_data_drop)

# Creating matrices for sklearn:
X_std_drop = std_drop[:train.shape[0]]
X_std_drop = pd.DataFrame(data=X_std_drop, columns=feature_col_names)
test_std_drop = std_drop[train.shape[0]:]
test_std_drop = pd.DataFrame(data=test_std_drop, columns=feature_col_names)

### Tuning XGBoost after features dropped

In [None]:
xgb_model = XGBClassifier(booster='dart',eval_metric='auc',random_state=42)
xgb_model.fit(X_std_drop, y)

XGBClassifier(base_score=0.5, booster='dart', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='auc',
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# Hyperparameter tuning for XGBoost
eta=[0.03,0.001,0.01,0.1,0.2,0.3]
n_estimators=[1000,1500,2000]
max_depth=[1,7,5,6,10,14,16,18,20,22,24]
min_child_weight=[1,3,5,6,7]
gamma=[1,3,4,5,6,7,9]
subsample=[0.3,0.4,0.5,0.6,0.7,0.8,0.9]
scale_pos_weight=[0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
max_delta_step=[1,2,3,4,5]

# Create the random grid
grid = {'eta': eta,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'min_child_weight': min_child_weight,
        'gamma': gamma,
        'subsample': subsample,
        'scale_pos_weight': scale_pos_weight,
        'max_delta_step': max_delta_step
               }

# Random search of parameters
xgb_random = RandomizedSearchCV(estimator = xgb_model, param_distributions = grid, n_iter = 10, scoring='roc_auc', cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
xgb_random.fit(X_std_drop, y)
# print results
print(xgb_random.best_params_)
print(xgb_random.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 165.9min finished


{'subsample': 0.7, 'scale_pos_weight': 0.9, 'n_estimators': 2000, 'min_child_weight': 1, 'max_depth': 5, 'max_delta_step': 4, 'gamma': 4, 'eta': 0.03}
0.8690272190853436


### XGBoost with dropped features

In [None]:
xgb_model = XGBClassifier(eta=0.2,booster='dart',max_depth=18, min_child_weight=3,gamma=5,scale_pos_weight=0.9,subsample=0.9, max_delta_step=4, n_estimators=1500, random_state=42)
xgb_model.fit(X_std_drop, y)
y_pred = xgb_model.predict(X_std_drop)
y_probas = xgb_model.predict_proba(X_std_drop)
print('Misclassified samples: %d' %(y != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y, y_pred))
print('AUC score: %.2f' % roc_auc_score(y, y_probas[:, 1]))

Misclassified samples: 319
Accuracy: 0.98
AUC score: 1.00


In [None]:
test_pred = xgb_model.predict(test_std_drop)
test_probas = xgb_model.predict_proba(test_std_drop)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_fd_sol.csv", index = False)

Kaggle Score: 0.88428


### XGBoost with dropped features (2)

In [None]:
xgb_model = XGBClassifier(eta=0.03,booster='dart',max_depth=5, min_child_weight=1,gamma=4,scale_pos_weight=0.9,subsample=0.7, max_delta_step=4, eval_metric='auc', n_estimators=2000, random_state=42)
xgb_model.fit(X_std_drop, y)
y_pred = xgb_model.predict(X_std_drop)
y_probas = xgb_model.predict_proba(X_std_drop)
print('Misclassified samples: %d' %(y != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y, y_pred))
print('AUC score: %.2f' % roc_auc_score(y, y_probas[:, 1]))

Misclassified samples: 348
Accuracy: 0.98
AUC score: 1.00


In [None]:
test_pred = xgb_model.predict(test_std_drop)
test_probas = xgb_model.predict_proba(test_std_drop)
xgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
xgb_solution.to_csv("xgb_model_fd2_sol.csv", index = False)

Kaggle Score: 0.88262


## S4-B: XGBoost stacked with XGBoost

In [None]:
# Selected f2, f5, f6, f9, f11, f18 and f20 to drop because of their low importance
all_data_drop = all_data.drop(['f2', 'f5', 'f6', 'f9', 'f11', 'f18', 'f20'], axis=1)
feature_col_names=list(all_data_drop.columns)

# Standardizing the data
sc = StandardScaler()
std_drop=sc.fit_transform(all_data_drop)

# Creating matrices for sklearn:
X_std_drop = std_drop[:train.shape[0]]
X_std_drop = pd.DataFrame(data=X_std_drop, columns=feature_col_names)
test_std_drop = std_drop[train.shape[0]:]
test_std_drop = pd.DataFrame(data=test_std_drop, columns=feature_col_names)

In [None]:
xgb = XGBClassifier(eta=0.2,booster='dart',max_depth=18, min_child_weight=3,gamma=5,scale_pos_weight=0.9,subsample=0.9, max_delta_step=4, n_estimators=1500, random_state=42)
xgb.fit(X_std_drop, y)

XGBClassifier(base_score=0.5, booster='dart', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.2, gamma=5,
              learning_rate=0.1, max_delta_step=4, max_depth=18,
              min_child_weight=3, missing=None, n_estimators=1500, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=0.9, seed=None,
              silent=None, subsample=0.9, verbosity=1)

In [None]:
X_std_stacked=X_std_drop.copy()
xgb_preds = xgb.predict(X_std_drop)
xgb_series= pd.Series(xgb_preds)
X_std_stacked['xgb_preds'] = xgb_series
stacked_xgb = XGBClassifier(eta=0.2,booster='dart',max_depth=18, min_child_weight=3,gamma=5,scale_pos_weight=0.9,subsample=0.9, max_delta_step=4, n_estimators=1500, random_state=42)
stacked_xgb.fit(X_std_stacked, y)

XGBClassifier(base_score=0.5, booster='dart', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.2, gamma=5,
              learning_rate=0.1, max_delta_step=4, max_depth=18,
              min_child_weight=3, missing=None, n_estimators=1500, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=0.9, seed=None,
              silent=None, subsample=0.9, verbosity=1)

In [None]:
test_std_stacked=test_std_drop.copy()
xgb_test_preds = xgb.predict(test_std_drop)
xgb_test_series= pd.Series(xgb_test_preds)
test_std_stacked['xgb_preds'] = xgb_test_series

In [None]:
test_pred = stacked_xgb.predict(test_std_stacked)
test_probas = stacked_xgb.predict_proba(test_std_stacked)

In [None]:
stacked_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
stacked_solution.to_csv("stacked_xgb_fd_sol.csv", index = False)

Kaggle Score: 0.83729

## S4-C: Catboost using Standardized Values-Feature Importance

### Using Catboost to drop features

In [None]:
cb_model = CatBoostClassifier(random_state=42,task_type='GPU', verbose=0)
cb_model.fit(X_std, y)

<catboost.core.CatBoostClassifier at 0x7fe15a0f06d8>

In [None]:
# Finding the feature importance
importance = cb_model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('f%0d, Score: %.5f' % (i+1,v))

f1, Score: 2.98164
f2, Score: 0.13857
f3, Score: 0.56606
f4, Score: 5.06800
f5, Score: 0.27503
f6, Score: 0.12787
f7, Score: 1.94005
f8, Score: 4.33906
f9, Score: 0.06631
f10, Score: 0.46247
f11, Score: 0.02798
f12, Score: 0.58186
f13, Score: 2.63522
f14, Score: 64.65835
f15, Score: 5.03643
f16, Score: 3.97784
f17, Score: 3.49381
f18, Score: 0.10235
f19, Score: 1.89136
f20, Score: 0.24434
f21, Score: 0.08251
f22, Score: 0.12418
f23, Score: 0.84710
f24, Score: 0.33161


In [None]:
# Selected f2, f6, f9, f11, f18, f21 and f22 to drop because of their low importance
all_data_drop = all_data.drop(['f2', 'f6', 'f9', 'f11', 'f18', 'f21','f22'], axis=1)
feature_col_names=list(all_data_drop.columns)

# Standardizing the data
sc = StandardScaler()
std_drop=sc.fit_transform(all_data_drop)

# Creating matrices for sklearn:
X_std_drop = std_drop[:train.shape[0]]
X_std_drop = pd.DataFrame(data=X_std_drop, columns=feature_col_names)
test_std_drop = std_drop[train.shape[0]:]
test_std_drop = pd.DataFrame(data=test_std_drop, columns=feature_col_names)

### Tuning Catboost after dropping features

In [None]:
cb_model = CatBoostClassifier(random_state=42,task_type='GPU', verbose=0)
cb_model.fit(X_std_drop, y)

<catboost.core.CatBoostClassifier at 0x7fe1508e7080>

In [None]:
# Hyperparameter tuning for the CatBoost model
learning_rate = [0.4,0.3,0.2,0.1,0.005]
#n_estimators = [500,1000,1500]
max_depth = [1,2,3,4,5,6,7,8,9,10]
l2_leaf_reg =[1,3,5,10,100]
scale_pos_weight=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]


random_grid = {'learning_rate': learning_rate,
              #'n_estimators':n_estimators,
              'max_depth': max_depth,
              'l2_leaf_reg':l2_leaf_reg,
              'scale_pos_weight': scale_pos_weight

}

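# Random search of parameters, scored by ROC AUC to match the competition metric
# (without scoring=, the search would rank candidates by accuracy)
cb_random = RandomizedSearchCV(estimator = cb_model, param_distributions = random_grid, n_iter = 10, scoring='roc_auc', cv = 5, verbose=2, random_state=42)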
# Fit the model
cb_random.fit(X_std_drop, y)
# print results
print(cb_random.best_params_)
print(cb_random.best_score_)

### Catboost with dropped features

In [None]:
cb_model = CatBoostClassifier(learning_rate=0.005, n_estimators=1500, max_depth=7, l2_leaf_reg=1, scale_pos_weight=0.7, random_state=42,task_type='GPU', verbose=0)
cb_model.fit(X_std_drop, y)
y_pred = cb_model.predict(X_std_drop)
y_probas = cb_model.predict_proba(X_std_drop)
print('Misclassified samples: %d' %(y != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y, y_pred))
print('AUC score: %.2f' % roc_auc_score(y, y_probas[:, 1]))

Misclassified samples: 653
Accuracy: 0.96
AUC score: 0.92


In [None]:
test_pred = cb_model.predict(test_std_drop)
test_probas = cb_model.predict_proba(test_std_drop)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_fd_tuning_sol.csv", index = False)

Kaggle Score: 0.85791

## S4-D: Catboost using Data Values with No Id Column-Feature Importance

### Using Catboost to drop features

In [None]:
cb_model = CatBoostClassifier(random_state=42, verbose=0)
cb_model.fit(X_no_id, y)

<catboost.core.CatBoostClassifier at 0x7f7db6959780>

In [None]:
# Finding the feature importance
importance = cb_model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('f%0d, Score: %.5f' % (i+1,v))

f1, Score: 5.46178
f2, Score: 0.27888
f3, Score: 1.35335
f4, Score: 6.03119
f5, Score: 0.18184
f6, Score: 0.21357
f7, Score: 2.79626
f8, Score: 5.71512
f9, Score: 0.06048
f10, Score: 1.08223
f11, Score: 0.21621
f12, Score: 1.44704
f13, Score: 3.53180
f14, Score: 47.39967
f15, Score: 6.82154
f16, Score: 6.27292
f17, Score: 4.54204
f18, Score: 0.25579
f19, Score: 2.97781
f20, Score: 0.23034
f21, Score: 0.21506
f22, Score: 0.46469
f23, Score: 2.06534
f24, Score: 0.38506


In [None]:
# Selected f2, f5, f6, f9, f11, f18, f20 and f21 to drop because of their low importance
all_data_drop = all_data.drop(['f2', 'f5','f6', 'f9', 'f11', 'f18', 'f20','f21'], axis=1)
feature_col_names=list(all_data_drop.columns)
X_no_id_drop= all_data_drop[:train.shape[0]]
test_no_id_drop = all_data_drop[train.shape[0]:]

### Tuning Catboost after dropping features

In [None]:
cb_model = CatBoostClassifier(random_state=42,eval_metric='AUC', verbose=0)
cb_model.fit(X_no_id_drop, y)

<catboost.core.CatBoostClassifier at 0x7fd9e20d1b00>

In [None]:
random_grid = {'depth':[3,1,2,6,4,5,7,8,9,10],
          'iterations':[250,100,500,1000],
          'learning_rate':[0.03,0.001,0.01,0.1,0.2,0.3], 
          'l2_leaf_reg':[3,1,5,10,100],
          'border_count':[32,5,10,20,50,100,200],
          'scale_pos_weight':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
          'thread_count':[4]
          }

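# Random search of parameters, scored by ROC AUC to match the competition metric
# (without scoring=, the search would rank candidates by accuracy)
cb_random = RandomizedSearchCV(estimator = cb_model, param_distributions = random_grid, n_iter = 10, scoring='roc_auc', cv = 5, verbose=2, random_state=42)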
# Fit the model
cb_random.fit(X_no_id_drop, y)
# print results
print(cb_random.best_params_)
print(cb_random.best_score_)

### Catboost tuned with dropped features

In [None]:
cb_model = CatBoostClassifier(eval_metric='AUC', learning_rate=0.01, thread_count=4, n_estimators=1500, depth=3, l2_leaf_reg=100, scale_pos_weight=0.4, border_count=100, random_state=42, verbose=0)
cb_model.fit(X_no_id_drop, y)

<catboost.core.CatBoostClassifier at 0x7fd9df973278>

In [None]:
test_probas = cb_model.predict_proba(test_no_id_drop)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_noid_fd_tuned_sol.csv", index = False)

Kaggle Score: 0.83829

### Catboost tuned with dropped features (2)
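Best score: this run reuses the tuned parameters from the previous model and adds one parameter, `cat_features`, which tells CatBoost to treat every non-float column as categorical.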
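The categorical column positions passed to CatBoost in the next cell are found by checking each column's dtype. The block below is a small sketch of that lookup on a made-up frame; `demo` and its values are hypothetical and only illustrate how the index array is built.

```python
import numpy as np
import pandas as pd

# Made-up frame with two integer and two float columns (assumption).
demo = pd.DataFrame({
    "f1": [1, 2, 3],
    "f2": [0.5, 1.5, 2.5],
    "f3": [7, 8, 9],
    "f4": [0.1, 0.2, 0.3],
})

# Positions of the columns whose dtype is not float.
cat_feat = np.where(demo.dtypes != float)[0]
print(cat_feat)

# The same positions mapped back to column names.
cat_names = demo.columns[cat_feat].tolist()
print(cat_names)
```

Note that this rule treats every integer column as categorical, including any integer column that is really a count or an ordinal measure.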

In [None]:
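# Positions of the non-float columns, which CatBoost will treat as categorical
# (builtin float replaces the np.float alias removed in NumPy 1.24)
cat_feat = np.where(X_no_id_drop.dtypes != float)[0]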
cb_model = CatBoostClassifier(cat_features=cat_feat,eval_metric='AUC', learning_rate=0.01, thread_count=4, n_estimators=1500, depth=3, l2_leaf_reg=100, scale_pos_weight=0.4, border_count=100, random_state=42, verbose=0)
cb_model.fit(X_no_id_drop, y)

<catboost.core.CatBoostClassifier at 0x7fd9de81f550>

In [None]:
test_probas = cb_model.predict_proba(test_no_id_drop)
cb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
cb_solution.to_csv("cb_model_noid_fd_tuning2_sol.csv", index = False)

Kaggle Score: 0.92947

## S4-E: Light GB Model using Data Values with No Id Column-Feature Importance

### Using Light GB to drop features

In [None]:
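# Positions of the non-float columns (builtin float replaces the np.float alias removed in NumPy 1.24)
cat_feat = np.where(X_no_id.dtypes != float)[0]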
lgb_model = LGBMClassifier(boosting_type='dart',cat_features=cat_feat,metric='auc',random_state=42, n_estimators=1500, verbose=0)
lgb_model.fit(X_no_id, y)

LGBMClassifier(boosting_type='dart',
               cat_features=array([ 0,  1,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23]),
               class_weight=None, colsample_bytree=1.0, importance_type='split',
               learning_rate=0.1, max_depth=-1, metric='auc',
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=1500, n_jobs=-1, num_leaves=31, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0,
               verbose=0)

In [None]:
# Finding the feature importance
importance = lgb_model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('f%0d, Score: %.5f' % (i+1,v))

f1, Score: 5458.00000
f2, Score: 118.00000
f3, Score: 1443.00000
f4, Score: 3804.00000
f5, Score: 91.00000
f6, Score: 82.00000
f7, Score: 1460.00000
f8, Score: 3819.00000
f9, Score: 64.00000
f10, Score: 1172.00000
f11, Score: 90.00000
f12, Score: 1367.00000
f13, Score: 2830.00000
f14, Score: 6198.00000
f15, Score: 5751.00000
f16, Score: 4688.00000
f17, Score: 2412.00000
f18, Score: 46.00000
f19, Score: 1197.00000
f20, Score: 64.00000
f21, Score: 127.00000
f22, Score: 114.00000
f23, Score: 2436.00000
f24, Score: 169.00000
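The three importance outputs in this section are on different scales: the XGBoost scores sum to 1, the CatBoost scores are on a roughly 0 to 100 scale, and the LightGBM scores are split counts. Converting each to a share of its own total makes them easier to compare. The block below is a sketch of that comparison, with values copied from the printed outputs (XGBoost on the standardized data, CatBoost and LightGBM on the data without the Id column) and assuming the columns are ordered f1 to f24.

```python
import pandas as pd

names = ["f%d" % i for i in range(1, 25)]

# Scores copied from the printed feature-importance outputs in this section.
xgb_imp = [0.03690, 0.00000, 0.04208, 0.05820, 0.01866, 0.01802,
           0.04918, 0.06070, 0.00000, 0.03252, 0.00000, 0.02465,
           0.05682, 0.28747, 0.04232, 0.03977, 0.03821, 0.00000,
           0.05056, 0.02410, 0.00000, 0.06337, 0.02433, 0.03214]
cb_imp = [5.46178, 0.27888, 1.35335, 6.03119, 0.18184, 0.21357,
          2.79626, 5.71512, 0.06048, 1.08223, 0.21621, 1.44704,
          3.53180, 47.39967, 6.82154, 6.27292, 4.54204, 0.25579,
          2.97781, 0.23034, 0.21506, 0.46469, 2.06534, 0.38506]
lgb_imp = [5458, 118, 1443, 3804, 91, 82, 1460, 3819, 64, 1172, 90, 1367,
           2830, 6198, 5751, 4688, 2412, 46, 1197, 64, 127, 114, 2436, 169]

imp = pd.DataFrame(
    {"xgboost": xgb_imp, "catboost": cb_imp, "lightgbm": lgb_imp}, index=names
)

# Divide each column by its own total so every model's scores sum to 1.
share = imp / imp.sum()

# Most important feature per model, and the share each model gives f14.
print(share.idxmax())
print(share.loc["f14"].round(3))

# Features that fall in the bottom eight of every model's ranking.
ranks = share.rank(ascending=True, method="min")
weak_everywhere = ranks[(ranks <= 8).all(axis=1)].index.tolist()
print(weak_everywhere)
```

Features that rank low for every model are safer candidates to drop than features that only one model considers unimportant.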


In [None]:
# Selected f2, f5, f6, f9, f11, f18, f20, f21, f22 and f24 to drop because of their low importance
all_data_drop = all_data.drop(['f2', 'f5','f6', 'f9', 'f11', 'f18', 'f20','f21','f22','f24'], axis=1)
feature_col_names=list(all_data_drop.columns)
X_no_id_drop= all_data_drop[:train.shape[0]]
test_no_id_drop = all_data_drop[train.shape[0]:]

### Tuning the Light GB Model after dropping features

In [None]:
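# Positions of the non-float columns (builtin float replaces the np.float alias removed in NumPy 1.24)
cat_feat = np.where(X_no_id_drop.dtypes != float)[0]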
lgb_model = LGBMClassifier(boosting_type='dart',cat_features=cat_feat,metric='auc',random_state=42, verbose=0)
lgb_model.fit(X_no_id_drop, y)

LGBMClassifier(boosting_type='dart',
               cat_features=array([ 0,  2,  3,  4,  5,  6,  7,  9, 10, 11, 12, 13]),
               class_weight=None, colsample_bytree=1.0, importance_type='split',
               learning_rate=0.1, max_depth=-1, metric='auc',
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0,
               verbose=0)

In [None]:
random_grid = {
          "max_depth": [25,50, 75],
          "learning_rate" : [0.01,0.05,0.1],
          "num_leaves": [300,900,1200],
          }

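# Random search of parameters. No scoring argument is given, so best_score_
# below is the classifier's default score (mean accuracy), not AUC.
lgb_random = RandomizedSearchCV(estimator = lgb_model, param_distributions = random_grid, n_iter = 10, cv = 5, verbose=2, n_jobs=-1, random_state=42)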
# Fit the model
lgb_random.fit(X_no_id_drop, y)
# print results
print(lgb_random.best_params_)
print(lgb_random.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.8min finished


{'num_leaves': 300, 'max_depth': 25, 'learning_rate': 0.05}
0.9604470496444864
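The score printed above is not directly comparable to the AUC values from the earlier searches, because a search without a `scoring` argument falls back to the estimator's own `score` method, which for scikit-learn style classifiers is accuracy. The block below is a minimal sketch that demonstrates the difference on synthetic data, with a small random forest as a stand-in model (both are assumptions for the illustration).

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic data and a small stand-in model (assumptions).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=42)
model = RandomForestClassifier(n_estimators=30, random_state=42)
grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 2, 4]}

# Same search twice: once with the default scorer, once with ROC AUC.
default_search = RandomizedSearchCV(model, grid, n_iter=4, cv=3, random_state=42)
default_search.fit(X, y)
auc_search = RandomizedSearchCV(
    model, grid, n_iter=4, scoring="roc_auc", cv=3, random_state=42
)
auc_search.fit(X, y)

# Re-score the default search's winner with an explicit accuracy scorer.
best_model = clone(model).set_params(**default_search.best_params_)
check_accuracy = cross_val_score(best_model, X, y, cv=3, scoring="accuracy").mean()

print("Default best_score_: %.4f" % default_search.best_score_)
print("Explicit accuracy:   %.4f" % check_accuracy)
print("ROC AUC best_score_: %.4f" % auc_search.best_score_)
```

The first two numbers agree, which confirms that the default search was ranking candidates by accuracy.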


### Light GB Model with dropped features

In [None]:
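# Positions of the non-float columns (builtin float replaces the np.float alias removed in NumPy 1.24)
cat_feat = np.where(X_no_id_drop.dtypes != float)[0]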
lgb_model = LGBMClassifier(boosting_type='dart',max_depth=25, num_leaves=300, learning_rate= 0.05, cat_features=cat_feat,metric='auc',random_state=42, n_estimators=1500, verbose=0)
lgb_model.fit(X_no_id_drop, y)

LGBMClassifier(boosting_type='dart',
               cat_features=array([ 0,  2,  3,  4,  5,  6,  7,  9, 10, 11, 12, 13]),
               class_weight=None, colsample_bytree=1.0, importance_type='split',
               learning_rate=0.05, max_depth=25, metric='auc',
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=1500, n_jobs=-1, num_leaves=300, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0,
               verbose=0)

In [None]:
test_probas = lgb_model.predict_proba(test_no_id_drop)
lgb_solution = pd.DataFrame({'Id': test.Id, 'Y' : test_probas[:, 1]})
lgb_solution.to_csv("lgb_model_fd_tuning_sol.csv", index = False)

Kaggle Score: 0.87865