**id** : A unique identifier for each individual in the dataset.

**Gender** : The individual's gender, indicating whether they are male or female.

**Age** : The age of the individual, representing their age in years.

**Height** : The height of the individual, typically measured in meters.

**Weight** : The weight of the individual, typically measured in kilograms.

**family_history_with_overweight** : Indicates whether there is a family history of overweight for the individual (yes/no).

**FAVC** : Stands for "Frequency of consuming high caloric food," representing how often the individual consumes high-calorie foods (yes/no).

**FCVC** : Stands for "Frequency of consuming vegetables," representing how often the individual consumes vegetables.

**NCP** : Stands for "Number of main meals," indicating the number of main meals the individual consumes daily.

**CAEC** : Stands for "Consumption of food between meals," representing the frequency of consuming food between meals.

**SMOKE** : Indicates whether the individual smokes or not (yes/no).

**CH2O** : Represents the amount of water consumption for the individual.

**SCC** : Stands for "Calories consumption monitoring," indicating whether the individual monitors their calorie consumption (yes/no).

**FAF** : Stands for "Physical activity frequency," representing the frequency of the individual's physical activities.

**TUE** : Stands for "Time using technology devices," indicating the amount of time the individual spends using technology devices.

**CALC** : Stands for "Consumption of alcohol," representing the frequency of alcohol consumption.

**MTRANS** : Stands for "Mode of transportation," indicating the mode of transportation the individual uses.

**NObeyesdad** : The target variable, representing the obesity risk category of the individual. It has multiple classes such as 'Overweight_Level_II', 'Normal_Weight', 'Insufficient_Weight', 'Obesity_Type_III', 'Obesity_Type_II', 'Overweight_Level_I', and 'Obesity_Type_I'.

In [2]:

import pandas as pd
import numpy as np
import json

from sklearn.model_selection import StratifiedKFold

from sklearn.preprocessing import StandardScaler, FunctionTransformer, LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

import optuna
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')


In [3]:
train = pd.read_csv('data/train.csv', index_col = 'id')

In [4]:
train.head()

Unnamed: 0_level_0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
0,Male,24.443011,1.699998,81.66995,yes,yes,2.0,2.983297,Sometimes,no,2.763573,no,0.0,0.976473,Sometimes,Public_Transportation,Overweight_Level_II
1,Female,18.0,1.56,57.0,yes,yes,2.0,3.0,Frequently,no,2.0,no,1.0,1.0,no,Automobile,Normal_Weight
2,Female,18.0,1.71146,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight
3,Female,20.952737,1.71073,131.274851,yes,yes,3.0,3.0,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III
4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20758 entries, 0 to 20757
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          20758 non-null  object 
 1   Age                             20758 non-null  float64
 2   Height                          20758 non-null  float64
 3   Weight                          20758 non-null  float64
 4   family_history_with_overweight  20758 non-null  object 
 5   FAVC                            20758 non-null  object 
 6   FCVC                            20758 non-null  float64
 7   NCP                             20758 non-null  float64
 8   CAEC                            20758 non-null  object 
 9   SMOKE                           20758 non-null  object 
 10  CH2O                            20758 non-null  float64
 11  SCC                             20758 non-null  object 
 12  FAF                             20758

In [6]:
def report(data) : 
    report = pd.DataFrame(index = data.columns)
    report['type'] = data.dtypes
    report['count'] = data.count()
    report['nunique'] = data.nunique()
    report['%unique'] = report['nunique'] / len(data) * 100
    report['null'] = data.isnull().sum()
    report['%null'] = report['null'] / len(data) * 100
    report['min'] = data.min()
    report['max'] = data.max()
    return report

In [7]:
report(train)

Unnamed: 0,type,count,nunique,%unique,null,%null,min,max
Gender,object,20758,2,0.009635,0,0.0,Female,Male
Age,float64,20758,1703,8.204066,0,0.0,14.0,61.0
Height,float64,20758,1833,8.83033,0,0.0,1.45,1.975663
Weight,float64,20758,1979,9.533674,0,0.0,39.0,165.057269
family_history_with_overweight,object,20758,2,0.009635,0,0.0,no,yes
FAVC,object,20758,2,0.009635,0,0.0,no,yes
FCVC,float64,20758,934,4.49947,0,0.0,1.0,3.0
NCP,float64,20758,689,3.319202,0,0.0,1.0,4.0
CAEC,object,20758,4,0.01927,0,0.0,Always,no
SMOKE,object,20758,2,0.009635,0,0.0,no,yes


In [8]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,20758.0,23.841804,5.688072,14.0,20.0,22.815416,26.0,61.0
Height,20758.0,1.700245,0.087312,1.45,1.631856,1.7,1.762887,1.975663
Weight,20758.0,87.887768,26.379443,39.0,66.0,84.064875,111.600553,165.057269
FCVC,20758.0,2.445908,0.533218,1.0,2.0,2.393837,3.0,3.0
NCP,20758.0,2.761332,0.705375,1.0,3.0,3.0,3.0,4.0
CH2O,20758.0,2.029418,0.608467,1.0,1.792022,2.0,2.549617,3.0
FAF,20758.0,0.981747,0.838302,0.0,0.008013,1.0,1.587406,3.0
TUE,20758.0,0.616756,0.602113,0.0,0.0,0.573887,1.0,2.0


In [9]:
X = train

lb = LabelEncoder()
y = lb.fit_transform(X.pop('NObeyesdad'))

SEED = 42
SPLITS = 5

SKF = StratifiedKFold(n_splits = SPLITS, random_state = SEED, shuffle = True)

In [10]:
def features_encoding(data:pd.DataFrame) -> pd.DataFrame: 
    data['Gender'] = data['Gender'].replace({'Male':0,'Female':1})
    data[['family_history_with_overweight','FAVC','SMOKE','SCC']] = data[['family_history_with_overweight','FAVC','SMOKE','SCC']].replace({'no':0,'yes':1})
    data[['CAEC','CALC']] = data[['CAEC','CALC']].replace({'no':0,'Sometimes':1,'Frequently':2,'Always':3})
    data = pd.get_dummies(data, columns=['MTRANS'], dtype='int8')
    return data
FeaturesEncoding = FunctionTransformer(features_encoding)

In [11]:
def xgb_objective(trial):
    params = {
        'eta' : trial.suggest_float('eta', .001, .3, log = True),
        'max_depth' : trial.suggest_int('max_depth', 2, 30),
        'subsample' : trial.suggest_float('subsample', .5, 1),
        'colsample_bytree' : trial.suggest_float('colsample_bytree', .1, 1),
        'min_child_weight' : trial.suggest_float('min_child_weight', .1, 20, log = True),
        'reg_lambda' : trial.suggest_float('reg_lambda', .01, 20, log = True),
        'reg_alpha' : trial.suggest_float('reg_alpha', .01, 10, log = True),
        'n_estimators' : trial.suggest_int('max_depth', 10, 500),
        
        'random_state' : SEED,
        'tree_method' : 'hist',
    }
    
    optuna_model = make_pipeline(
        FeaturesEncoding,
        StandardScaler(),
        XGBClassifier(**params)
    )
    
    optuna_score = cross_val_score(optuna_model, X, y, scoring='accuracy', cv=SKF)
    
    return np.mean(optuna_score)

In [12]:
def lgbm_objective(trial):
    params = {
        'boosting_type': 'gbdt',
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'num_leaves': trial.suggest_int('num_leaves', 20, 100),
        'max_depth': trial.suggest_int('max_depth', 5, 20),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 20),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'random_state': SEED
    }
    
    optuna_model = make_pipeline(
        FeaturesEncoding,
        StandardScaler(),
        LGBMClassifier(**params)
    )
    
    optuna_score = cross_val_score(optuna_model, X, y, scoring='accuracy', cv=SKF)
    
    return np.mean(optuna_score)

In [13]:
cat_features = [
    'Gender',
    'family_history_with_overweight',
    'FAVC',
    'CAEC',
    'SMOKE',
    'SCC',
    'CALC'
]

In [14]:
def cat_objective(trial):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'depth': trial.suggest_int('depth', 4, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
        'iterations': trial.suggest_int('iterations', 50, 300),
        'border_count': trial.suggest_int('border_count', 32, 255),
        'thread_count': 4,
        'eval_metric': 'AUC',
        'loss_function': 'MultiClass',
        'random_seed': SEED,
        'verbose': False,
        'cat_features' : [0, 4, 5, 8, 9, 11, 14, 15]
    }
    
    optuna_model = make_pipeline(
        CatBoostClassifier(**params)
    )
    
    optuna_score = cross_val_score(optuna_model, X, y, scoring='accuracy', cv=SKF)
    
    return np.mean(optuna_score)

In [15]:
xgb_study = optuna.create_study(direction = 'maximize')
xgb_study.optimize(xgb_objective,100, n_jobs=-1)

[I 2024-02-03 18:20:54,657] A new study created in memory with name: no-name-327331aa-acf1-466c-b0dc-af582021a689
[I 2024-02-03 18:21:07,879] Trial 3 finished with value: 0.8662684062015374 and parameters: {'eta': 0.002695904257318622, 'max_depth': 6, 'subsample': 0.999346020947002, 'colsample_bytree': 0.8077531973739216, 'min_child_weight': 17.4701070165456, 'reg_lambda': 0.183027504216396, 'reg_alpha': 4.592783867138206}. Best is trial 3 with value: 0.8662684062015374.
[I 2024-02-03 18:21:11,966] Trial 5 finished with value: 0.8492148745177822 and parameters: {'eta': 0.06346148141832621, 'max_depth': 7, 'subsample': 0.5726762875709589, 'colsample_bytree': 0.47992365584948626, 'min_child_weight': 12.607633413610953, 'reg_lambda': 10.149371491061261, 'reg_alpha': 5.387930731044843}. Best is trial 3 with value: 0.8662684062015374.
[I 2024-02-03 18:21:12,235] Trial 7 finished with value: 0.8833698869599406 and parameters: {'eta': 0.0010350209616449438, 'max_depth': 7, 'subsample': 0.8766

In [16]:
optuna.visualization.plot_param_importances(xgb_study)

In [17]:
xgb_study.best_params, xgb_study.best_value

({'eta': 0.2377243998901392,
  'max_depth': 30,
  'subsample': 0.6934145922176526,
  'colsample_bytree': 0.7871183095074598,
  'min_child_weight': 0.2631986020154868,
  'reg_lambda': 5.443297437315757,
  'reg_alpha': 0.047334295966659064},
 0.9069271443285712)

In [18]:
with open('xgb.json', 'w') as json_file:
    json.dump(xgb_study.best_params, json_file, indent=4)

In [19]:
lgbm_study = optuna.create_study(direction = 'maximize')
lgbm_study.optimize(lgbm_objective,50, n_jobs=-1)

[I 2024-02-03 18:33:53,736] A new study created in memory with name: no-name-7bc36c0e-e768-4a03-af7e-e3809dfddf38
[I 2024-02-03 18:48:02,261] Trial 1 finished with value: 0.9004237899821247 and parameters: {'learning_rate': 0.013135853194914206, 'num_leaves': 30, 'max_depth': 12, 'min_child_samples': 6, 'subsample': 0.5802292242474745, 'colsample_bytree': 0.7338709150363816, 'reg_alpha': 0.5819397648632936, 'reg_lambda': 0.6247780486949354, 'n_estimators': 97}. Best is trial 1 with value: 0.9004237899821247.
[I 2024-02-03 18:50:33,254] Trial 5 finished with value: 0.9042293126200758 and parameters: {'learning_rate': 0.24075926582530852, 'num_leaves': 72, 'max_depth': 7, 'min_child_samples': 20, 'subsample': 0.7101687813199193, 'colsample_bytree': 0.6050859591892961, 'reg_alpha': 0.5577893206013281, 'reg_lambda': 0.8320855415048715, 'n_estimators': 87}. Best is trial 5 with value: 0.9042293126200758.
[I 2024-02-03 18:52:04,421] Trial 7 finished with value: 0.9052894142089863 and paramet

In [20]:
optuna.visualization.plot_param_importances(lgbm_study)

In [21]:
lgbm_study.best_params, lgbm_study.best_value

({'learning_rate': 0.05340658953221508,
  'num_leaves': 87,
  'max_depth': 10,
  'min_child_samples': 16,
  'subsample': 0.5716554335282593,
  'colsample_bytree': 0.6044802806390872,
  'reg_alpha': 0.45290329883942215,
  'reg_lambda': 0.8774336985419824,
  'n_estimators': 76},
 0.9088541006670632)

In [22]:
with open('lgbm.json', 'w') as json_file:
    json.dump(lgbm_study.best_params, json_file, indent=4)

In [23]:
cat_study = optuna.create_study(direction = 'maximize')
cat_study.optimize(cat_objective,50, n_jobs=-1)

[I 2024-02-03 20:32:51,405] A new study created in memory with name: no-name-f55f8851-4860-497e-9ec2-ac975490c1db
[I 2024-02-03 20:33:50,379] Trial 9 finished with value: 0.8782153034136677 and parameters: {'learning_rate': 0.047338069013288395, 'depth': 8, 'l2_leaf_reg': 2.588419949576218, 'iterations': 60, 'border_count': 239}. Best is trial 9 with value: 0.8782153034136677.
[I 2024-02-03 20:33:56,317] Trial 1 finished with value: 0.8777817193804776 and parameters: {'learning_rate': 0.05495707066027985, 'depth': 9, 'l2_leaf_reg': 8.771303356718581, 'iterations': 55, 'border_count': 208}. Best is trial 9 with value: 0.8782153034136677.
[I 2024-02-03 20:33:58,285] Trial 0 finished with value: 0.8986895582882969 and parameters: {'learning_rate': 0.2193068689013425, 'depth': 4, 'l2_leaf_reg': 3.748285992900594, 'iterations': 124, 'border_count': 198}. Best is trial 0 with value: 0.8986895582882969.
[I 2024-02-03 20:34:43,958] Trial 11 finished with value: 0.8990264666823558 and parameter

In [24]:
optuna.visualization.plot_param_importances(cat_study)

In [25]:
cat_study.best_params, cat_study.best_value

({'learning_rate': 0.27024349994017527,
  'depth': 6,
  'l2_leaf_reg': 7.162070222037684,
  'iterations': 243,
  'border_count': 155},
 0.9043257329640373)

In [26]:
with open('cat.json', 'w') as json_file:
    json.dump(cat_study.best_params, json_file, indent=4)

In [144]:
params = {
    'eta': 0.2377243998901392,
    'max_depth': 30,
    'subsample': 0.6934145922176526,
    'colsample_bytree': 0.7871183095074598,
    'min_child_weight': 0.2631986020154868,
    'reg_lambda': 5.443297437315757,
    'reg_alpha': 0.047334295966659064,
    'random_state' : SEED,
    'tree_method' : 'hist'
}

xgb = make_pipeline(
    FeaturesEncoding,
    StandardScaler(),
    XGBClassifier(**params)
)
xgb_scores = cross_val_score(xgb, X,y,scoring='accuracy',cv=SKF, n_jobs=-1)
print(f'Mean score XGBoost : {np.mean(xgb_scores):.5f} ± {np.std(xgb_scores):.5f}')

Mean score XGBoost : 0.90823 ± 0.00462


In [None]:
params = {
    'boosting_type': 'gbdt',
    'learning_rate': ,
    'num_leaves': ,
    'max_depth': ,
    'min_child_samples': ,
    'subsample': ,
    'colsample_bytree': ,
    'reg_alpha': ,
    'reg_lambda': ,
    'n_estimators': ,
    'random_state': SEED
}
lgbm = make_pipeline(
        FeaturesEncoding,
        StandardScaler(),
        LGBMClassifier(**params)
)
cat_scores = cross_val_score(lgbm, X,y,scoring='accuracy',cv=SKF, n_jobs=-1)
print(f'Mean score CatBoost : {np.mean(cat_scores):.5f} ± {np.mean(cat_scores):.5f}')

In [None]:
params = {
    'learning_rate': ,
    'depth': ,
    'l2_leaf_reg': ,
    'iterations': ,
    'border_count': ,
    'thread_count': 4,
    'eval_metric': 'AUC',
    'loss_function': 'MultiClass',
    'random_seed': SEED,
    'verbose': False,
    'cat_features' : [0, 4, 5, 8, 9, 11, 14, 15]
}
cat = CatBoostClassifier(**params, cat_features=[0, 4, 5, 8, 9, 11, 14, 15], show_importance = False)
cat_scores = cross_val_score(cat, X,y,scoring='accuracy',cv=SKF, n_jobs=-1)
print(f'Mean score CatBoost : {np.mean(cat_scores):.5f} ± {np.mean(cat_scores):.5f}')

In [None]:
weights = RidgeClassifier(random_state = seed).fit(oof_list, train.Exited).coef_[0]
weights /= weights.sum()
pd.DataFrame(weights, index = list(oof_list), columns = ['weight per model'])