<a href="https://www.kaggle.com/code/datascientistsohail/bayesian-optimisation-ps-se01e02?scriptVersionId=116230892" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Explaination abouth the Method

**This process is a type of hyperparameter tuning.**

In machine learning, Bayesian Optimization with Gaussian Process (GP-BO) is a powerful method for global optimization of expensive black-box functions, particularly when the number of evaluations of the function is limited. It is often used to optimize the hyperparameters of a machine learning model. GP-BO models the function to be optimized as a Gaussian Process, which is a probability distribution over functions. This allows it to make probabilistic predictions about the function's behavior and use this information to decide where to sample next. The algorithm iteratively improves its model of the function and selects the next point to evaluate based on the acquisition function which balances exploration and exploitation. The goal is to find the global minimum or maximum of the function by using the least number of evaluations.


In [1]:
from numbers import Real
import pandas as pd
import numpy as np

from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import pipeline
from functools import partial
from skopt import space
from sklearn.preprocessing import OneHotEncoder
from skopt import gp_minimize 
from xgboost import XGBClassifier
import os

In [2]:
df_train = pd.read_csv('/kaggle/input/playground-series-s3e2/train.csv')
df_test = pd.read_csv('/kaggle/input/playground-series-s3e2/test.csv')
submission = pd.read_csv('/kaggle/input/playground-series-s3e2/sample_submission.csv')

In [3]:
df_train.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,0,Male,28.0,0,0,Yes,Private,Urban,79.53,31.1,never smoked,0
1,1,Male,33.0,0,0,Yes,Private,Rural,78.44,23.9,formerly smoked,0
2,2,Female,42.0,0,0,Yes,Private,Rural,103.0,40.3,Unknown,0
3,3,Male,56.0,0,0,Yes,Private,Urban,64.87,28.8,never smoked,0
4,4,Female,24.0,0,0,No,Private,Rural,73.36,28.8,never smoked,0


In [4]:
df_test.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,15304,Female,57.0,0,0,Yes,Private,Rural,82.54,33.4,Unknown
1,15305,Male,70.0,1,0,Yes,Private,Urban,72.06,28.5,Unknown
2,15306,Female,5.0,0,0,No,children,Urban,103.72,19.5,Unknown
3,15307,Female,56.0,0,0,Yes,Govt_job,Urban,69.24,41.4,smokes
4,15308,Male,32.0,0,0,Yes,Private,Rural,111.15,30.1,smokes


In [5]:
print(df_train.shape)
print(df_test.shape)

(15304, 12)
(10204, 11)


**Dividing the data into X as features and y as target**

In [6]:
y = df_train.stroke
X = df_train.drop(['id', 'stroke'], axis = 1)
X_test = df_test.drop(['id'], axis = 1)
del df_train
del df_test
X.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,Male,28.0,0,0,Yes,Private,Urban,79.53,31.1,never smoked
1,Male,33.0,0,0,Yes,Private,Rural,78.44,23.9,formerly smoked
2,Female,42.0,0,0,Yes,Private,Rural,103.0,40.3,Unknown
3,Male,56.0,0,0,Yes,Private,Urban,64.87,28.8,never smoked
4,Female,24.0,0,0,No,Private,Rural,73.36,28.8,never smoked


In [7]:
X_test.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,Female,57.0,0,0,Yes,Private,Rural,82.54,33.4,Unknown
1,Male,70.0,1,0,Yes,Private,Urban,72.06,28.5,Unknown
2,Female,5.0,0,0,No,children,Urban,103.72,19.5,Unknown
3,Female,56.0,0,0,Yes,Govt_job,Urban,69.24,41.4,smokes
4,Male,32.0,0,0,Yes,Private,Rural,111.15,30.1,smokes


In [8]:
X.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
dtype: int64

In [9]:
X_test.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
dtype: int64

There are no null values in both training and testing data but there seems to be categorical columns.

**Identifying Categorical columns**

In [10]:
low_cardinality_cols = [cname for cname in X.columns if X[cname].nunique() < 10 and 
                        X[cname].dtype == "object"]
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]

print('low cardinality columns :', low_cardinality_cols)
print('Numerical columns :', numerical_cols)

low cardinality columns : ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
Numerical columns : ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi']


Simply the identification can be done as follows:

In [11]:
s = (X.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']


In [12]:
"""def optimize(params, param_names, x, y):

    # convert params to dictionary
    params = dict(zip(param_names, params))
    
    model = XGBClassifier(**params, random_state  = 42, 
                          use_label_encoder=False, 
                         tree_method = 'gpu_hist',
                         gpu_id = 0,
                         predictor = 'gpu_predictor')
    
    kf = model_selection.StratifiedKFold(n_splits=5)
    accuracies = []
    count=0
    for idx in kf.split(X=x, y = y):
        train_idx, test_idx = idx[0], idx[1]
        
        #xtrain, xtest = x.iloc[train_idx], x.iloc[test_idx]
        #ytrain, ytest = y.iloc[train_idx], y.iloc[test_idx]
        x_train, x_valid = x.iloc[train_idx], x.iloc[test_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]
        
        # For categorical columns:
        OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
        OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(x_train[object_cols]))
        OH_cols_valid = pd.DataFrame(OH_encoder.transform(x_valid[object_cols]))
        
        # One-hot encoding removed index; put it back
        OH_cols_train.index = x_train.index
        OH_cols_valid.index = x_valid.index
        

        # Remove categorical columns (will replace with one-hot encoding)
        num_x_train = x_train.drop(object_cols, axis=1)
        num_x_valid = x_valid.drop(object_cols, axis=1)
        
        # Add one-hot encoded columns to numerical features
        OH_x_train = pd.concat([num_x_train, OH_cols_train], axis=1)
        OH_x_valid = pd.concat([num_x_valid, OH_cols_valid], axis=1)

        #Applying standard scalar
        st_scaler = StandardScaler()
        scaled_x_train = st_scaler.fit_transform(OH_x_train)
        scaled_x_valid = st_scaler.fit_transform(OH_x_valid)
        
        model.fit(scaled_x_train, y_train, 
                  eval_set = [(scaled_x_train, y_train), (scaled_x_valid, y_valid)],
              early_stopping_rounds = 100,
              eval_metric = 'auc',
             verbose = False)
        preds = model.predict_proba(scaled_x_valid)[:,1]
        fold_acc = metrics.roc_auc_score(y_valid, preds)
        accuracies.append(fold_acc)

    return -1.0*np.mean(accuracies)"""

"def optimize(params, param_names, x, y):\n\n    # convert params to dictionary\n    params = dict(zip(param_names, params))\n    \n    model = XGBClassifier(**params, random_state  = 42, \n                          use_label_encoder=False, \n                         tree_method = 'gpu_hist',\n                         gpu_id = 0,\n                         predictor = 'gpu_predictor')\n    \n    kf = model_selection.StratifiedKFold(n_splits=5)\n    accuracies = []\n    count=0\n    for idx in kf.split(X=x, y = y):\n        train_idx, test_idx = idx[0], idx[1]\n        \n        #xtrain, xtest = x.iloc[train_idx], x.iloc[test_idx]\n        #ytrain, ytest = y.iloc[train_idx], y.iloc[test_idx]\n        x_train, x_valid = x.iloc[train_idx], x.iloc[test_idx]\n        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]\n        \n        # For categorical columns:\n        OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)\n        OH_cols_train = pd.DataFrame(OH_encoder.fit

In [13]:
"""param_space = [space.Real(0.01, 0.1, name = "eta"),
               space.Integer(3,25, name = "max_depth"),
               space.Integer(1, 7, name = "min_child_weight"),
               space.Real(0.6, 1.0, name = "subsample"),
               space.Real(0.6, 1.0, name= "colsample_bytree"),
               space.Real(0.01, 1.0, name = "alpha")]

param_names = ["eta", "max_depth", "min_child_weight", "subsample", 
                "colsample_bytree", "alpha"]

optimization_function = partial(optimize, param_names = param_names, x = X, y = y)
result = gp_minimize(optimization_function, dimensions = param_space, n_calls = 15, 
                    n_random_starts = 10, verbose = 10)

best_params = dict(zip(param_names, result.x))
print(best_params)"""

'param_space = [space.Real(0.01, 0.1, name = "eta"),\n               space.Integer(3,25, name = "max_depth"),\n               space.Integer(1, 7, name = "min_child_weight"),\n               space.Real(0.6, 1.0, name = "subsample"),\n               space.Real(0.6, 1.0, name= "colsample_bytree"),\n               space.Real(0.01, 1.0, name = "alpha")]\n\nparam_names = ["eta", "max_depth", "min_child_weight", "subsample", \n                "colsample_bytree", "alpha"]\n\noptimization_function = partial(optimize, param_names = param_names, x = X, y = y)\nresult = gp_minimize(optimization_function, dimensions = param_space, n_calls = 15, \n                    n_random_starts = 10, verbose = 10)\n\nbest_params = dict(zip(param_names, result.x))\nprint(best_params)'

### Best Parameters known: 
{'eta': 0.06424024027690198, 'max_depth': 11, 'min_child_weight': 1, 'subsample': 0.7801999418890287, 'colsample_bytree': 0.7810719584070286, 'alpha': 0.9036042999413357}

In [14]:
#optimized_params = best_params
optimized_params = {'eta': 0.06424024027690198, 'max_depth': 11, 
                    'min_child_weight': 1, 'subsample': 0.7801999418890287, 
                    'colsample_bytree': 0.7810719584070286, 
                    'alpha': 0.9036042999413357}

**Preprocissing Test Data and running the model for based on the tunned parameters**

In [15]:
y_preds = np.zeros(len(X_test))
socres = []
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
model = XGBClassifier(**optimized_params, random_state = 42, use_label_encoder=False, 
                     tree_method = 'gpu_hist',
                     gpu_id = 0,
                     eval_metric = 'auc',
                     predictor = 'gpu_predictor')

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
OH_cols_test.index = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis =1)


#Applying standard scalar
st_scaler = StandardScaler()
scaled_X_train = st_scaler.fit_transform(OH_X_train)
scaled_X_valid = st_scaler.fit_transform(OH_X_valid)
scaled_X_test = st_scaler.fit_transform(OH_X_test)

model.fit(scaled_X_train, y_train, eval_set = [(scaled_X_train, y_train), (scaled_X_valid, y_valid)], 
              verbose = False)
final_preds = model.predict_proba(scaled_X_valid)[:,1]
accuracy = metrics.roc_auc_score(y_valid, final_preds)
print('Accuracy: ', accuracy)
y_preds = model.predict_proba(scaled_X_test)[:,1] 



Accuracy:  0.8820865918621784


In [16]:
print(y_preds.shape)
print(submission.shape)

(10204,)
(10204, 2)


In [17]:
submission['stroke'] = y_preds
submission.to_csv('submission.csv',index = False)