# Infos

**id** : A unique identifier for each individual in the dataset.

**Gender** : The individual's gender, indicating whether they are male or female.

**Age** : The age of the individual, representing their age in years.

**Height** : The height of the individual, typically measured in meters.

**Weight** : The weight of the individual, typically measured in kilograms.

**family_history_with_overweight** : Indicates whether there is a family history of overweight for the individual (yes/no).

**FAVC** : Stands for "Frequency of consuming high caloric food," representing how often the individual consumes high-calorie foods (yes/no).

**FCVC** : Stands for "Frequency of consuming vegetables," representing how often the individual consumes vegetables.

**NCP** : Stands for "Number of main meals," indicating the number of main meals the individual consumes daily.

**CAEC** : Stands for "Consumption of food between meals," representing the frequency of consuming food between meals.

**SMOKE** : Indicates whether the individual smokes or not (yes/no).

**CH2O** : Represents the amount of water consumption for the individual.

**SCC** : Stands for "Calories consumption monitoring," indicating whether the individual monitors their calorie consumption (yes/no).

**FAF** : Stands for "Physical activity frequency," representing the frequency of the individual's physical activities.

**TUE** : Stands for "Time using technology devices," indicating the amount of time the individual spends using technology devices.

**CALC** : Stands for "Consumption of alcohol," representing the frequency of alcohol consumption.

**MTRANS** : Stands for "Mode of transportation," indicating the mode of transportation the individual uses.

**NObeyesdad** : The target variable, representing the obesity risk category of the individual. It has multiple classes such as 'Overweight_Level_II', 'Normal_Weight', 'Insufficient_Weight', 'Obesity_Type_III', 'Obesity_Type_II', 'Overweight_Level_I', and 'Obesity_Type_I'.

# Import

In [1]:
from datetime import datetime as dt
import pandas as pd
import numpy as np
import json

from sklearn.model_selection import StratifiedKFold

from sklearn.preprocessing import StandardScaler, FunctionTransformer, LabelEncoder
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

import optuna
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')


In [2]:
train = pd.read_csv('data/train.csv', index_col = 'id')

# Exploration

In [3]:
train.head()

Unnamed: 0_level_0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
0,Male,24.443011,1.699998,81.66995,yes,yes,2.0,2.983297,Sometimes,no,2.763573,no,0.0,0.976473,Sometimes,Public_Transportation,Overweight_Level_II
1,Female,18.0,1.56,57.0,yes,yes,2.0,3.0,Frequently,no,2.0,no,1.0,1.0,no,Automobile,Normal_Weight
2,Female,18.0,1.71146,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight
3,Female,20.952737,1.71073,131.274851,yes,yes,3.0,3.0,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III
4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20758 entries, 0 to 20757
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          20758 non-null  object 
 1   Age                             20758 non-null  float64
 2   Height                          20758 non-null  float64
 3   Weight                          20758 non-null  float64
 4   family_history_with_overweight  20758 non-null  object 
 5   FAVC                            20758 non-null  object 
 6   FCVC                            20758 non-null  float64
 7   NCP                             20758 non-null  float64
 8   CAEC                            20758 non-null  object 
 9   SMOKE                           20758 non-null  object 
 10  CH2O                            20758 non-null  float64
 11  SCC                             20758 non-null  object 
 12  FAF                             20758

In [5]:
def report(data) : 
    report = pd.DataFrame(index = data.columns)
    report['type'] = data.dtypes
    report['count'] = data.count()
    report['nunique'] = data.nunique()
    report['%unique'] = report['nunique'] / len(data) * 100
    report['null'] = data.isnull().sum()
    report['%null'] = report['null'] / len(data) * 100
    report['min'] = data.min()
    report['max'] = data.max()
    return report
report(train)

Unnamed: 0,type,count,nunique,%unique,null,%null,min,max
Gender,object,20758,2,0.009635,0,0.0,Female,Male
Age,float64,20758,1703,8.204066,0,0.0,14.0,61.0
Height,float64,20758,1833,8.83033,0,0.0,1.45,1.975663
Weight,float64,20758,1979,9.533674,0,0.0,39.0,165.057269
family_history_with_overweight,object,20758,2,0.009635,0,0.0,no,yes
FAVC,object,20758,2,0.009635,0,0.0,no,yes
FCVC,float64,20758,934,4.49947,0,0.0,1.0,3.0
NCP,float64,20758,689,3.319202,0,0.0,1.0,4.0
CAEC,object,20758,4,0.01927,0,0.0,Always,no
SMOKE,object,20758,2,0.009635,0,0.0,no,yes


In [6]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,20758.0,23.841804,5.688072,14.0,20.0,22.815416,26.0,61.0
Height,20758.0,1.700245,0.087312,1.45,1.631856,1.7,1.762887,1.975663
Weight,20758.0,87.887768,26.379443,39.0,66.0,84.064875,111.600553,165.057269
FCVC,20758.0,2.445908,0.533218,1.0,2.0,2.393837,3.0,3.0
NCP,20758.0,2.761332,0.705375,1.0,3.0,3.0,3.0,4.0
CH2O,20758.0,2.029418,0.608467,1.0,1.792022,2.0,2.549617,3.0
FAF,20758.0,0.981747,0.838302,0.0,0.008013,1.0,1.587406,3.0
TUE,20758.0,0.616756,0.602113,0.0,0.0,0.573887,1.0,2.0


# Transformers

In [7]:
def features_encoding(data:pd.DataFrame) -> pd.DataFrame: 
    data['Gender'] = data['Gender'].replace({'Male':0,'Female':1})
    data[['family_history_with_overweight','FAVC','SMOKE','SCC']] = data[['family_history_with_overweight','FAVC','SMOKE','SCC']].replace({'no':0,'yes':1})
    data[['CAEC','CALC']] = data[['CAEC','CALC']].replace({'no':0,'Sometimes':1,'Frequently':2,'Always':3})
    data = pd.get_dummies(data, columns=['MTRANS'], dtype='int8')
    return data
FeaturesEncoding = FunctionTransformer(features_encoding)

In [8]:
numeric_features = ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
FeatureScaler = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), numeric_features)
    ],
    remainder='passthrough'  # Conserver les autres colonnes inchangées
)

# Optuna

In [9]:
X = train.copy()

lb = LabelEncoder()
y = lb.fit_transform(X.pop('NObeyesdad'))

SEED = 42
SPLITS = 5
TRIALS = 200
SKF = StratifiedKFold(n_splits = SPLITS, random_state = SEED, shuffle = True)

## XGB

In [10]:
params_xgb = {
        'random_state' : SEED,
        'tree_method' : 'hist',
}
def xgb_objective(trial):

    params = {
        'eta' : trial.suggest_float('eta', .001, .3, log = True),
        'max_depth' : trial.suggest_int('max_depth', 2, 30),
        'subsample' : trial.suggest_float('subsample', .5, 1),
        'colsample_bytree' : trial.suggest_float('colsample_bytree', .1, 1),
        'min_child_weight' : trial.suggest_float('min_child_weight', .1, 20, log = True),
        'reg_lambda' : trial.suggest_float('reg_lambda', .01, 20, log = True),
        'reg_alpha' : trial.suggest_float('reg_alpha', .01, 10, log = True),
        'n_estimators' : trial.suggest_int('max_depth', 10, 500),
        **params_xgb

    }
    
    optuna_model = make_pipeline(
        FeaturesEncoding,
        FeatureScaler,
        XGBClassifier(**params)
    )
    
    optuna_score = cross_val_score(optuna_model, X, y, scoring='accuracy', cv=SKF)
    
    return np.mean(optuna_score)

In [11]:
xgb_study = optuna.create_study(direction = 'maximize')
xgb_study.optimize(xgb_objective,n_trials=TRIALS, n_jobs=-1, show_progress_bar=True)
print("")
print(f'scores : {xgb_study.best_value}, params : {xgb_study.best_params} ')
with open('xgb.json', 'w') as json_file:
    json.dump(xgb_study.best_params, json_file, indent=4)

optuna.visualization.plot_param_importances(xgb_study)

[I 2024-02-05 20:42:16,259] A new study created in memory with name: no-name-2250b09b-fde2-49a1-9cf0-d118456b5c0c


  0%|          | 0/200 [00:00<?, ?it/s]

[I 2024-02-05 20:42:24,508] Trial 11 finished with value: 0.7184700369342485 and parameters: {'eta': 0.016534797795007987, 'max_depth': 5, 'subsample': 0.9943816391567825, 'colsample_bytree': 0.24446505026163534, 'min_child_weight': 6.372863493429788, 'reg_lambda': 0.012268567092660644, 'reg_alpha': 6.878744190447381}. Best is trial 11 with value: 0.7184700369342485.
[I 2024-02-05 20:42:27,384] Trial 5 finished with value: 0.8088449100409447 and parameters: {'eta': 0.025737228035154065, 'max_depth': 6, 'subsample': 0.5011688814235205, 'colsample_bytree': 0.32584654823661396, 'min_child_weight': 0.6958241338206285, 'reg_lambda': 0.19149332799194194, 'reg_alpha': 0.1803980513729415}. Best is trial 5 with value: 0.8088449100409447.
[I 2024-02-05 20:42:36,793] Trial 3 finished with value: 0.8573081259524249 and parameters: {'eta': 0.03579665906505937, 'max_depth': 9, 'subsample': 0.5493923537663203, 'colsample_bytree': 0.3581421829018418, 'min_child_weight': 0.44037347323563397, 'reg_lambd

## LGBM

In [12]:
params_lgbm = {
    'boosting_type': 'gbdt',
    'random_state': SEED
}
def lgbm_objective(trial):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'num_leaves': trial.suggest_int('num_leaves', 20, 100),
        'max_depth': trial.suggest_int('max_depth', 5, 20),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 20),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        **params_lgbm
        
    }
    
    optuna_model = make_pipeline(
        FeaturesEncoding,
        FeatureScaler,
        LGBMClassifier(**params)
    )
    
    optuna_score = cross_val_score(optuna_model, X, y, scoring='accuracy', cv=SKF)
    
    return np.mean(optuna_score)

In [13]:
lgbm_study = optuna.create_study(direction = 'maximize')
lgbm_study.optimize(lgbm_objective,n_trials=TRIALS, n_jobs=-1, show_progress_bar=True)
print("")
print(f'scores : {lgbm_study.best_value}, params : {lgbm_study.best_params} ')
with open('lgbm.json', 'w') as json_file:
    json.dump(lgbm_study.best_params, json_file, indent=4)
optuna.visualization.plot_param_importances(lgbm_study)

[I 2024-02-05 20:59:03,967] A new study created in memory with name: no-name-47709e44-0c5d-41e0-baff-c4fed6df2b49


  0%|          | 0/200 [00:00<?, ?it/s]

[I 2024-02-05 21:12:31,806] Trial 3 finished with value: 0.9038440141869847 and parameters: {'learning_rate': 0.22711624056257798, 'num_leaves': 20, 'max_depth': 7, 'min_child_samples': 20, 'subsample': 0.674238608520485, 'colsample_bytree': 0.5750167109910995, 'reg_alpha': 0.43218518900550995, 'reg_lambda': 0.8645644133452034, 'n_estimators': 197}. Best is trial 3 with value: 0.9038440141869847.
[I 2024-02-05 21:14:22,843] Trial 0 finished with value: 0.9034587041495676 and parameters: {'learning_rate': 0.2164530696401225, 'num_leaves': 78, 'max_depth': 17, 'min_child_samples': 20, 'subsample': 0.7992785147296793, 'colsample_bytree': 0.7358468661736302, 'reg_alpha': 0.9158243941117413, 'reg_lambda': 0.2752619732005117, 'n_estimators': 77}. Best is trial 3 with value: 0.9038440141869847.
[I 2024-02-05 21:16:21,090] Trial 8 finished with value: 0.9061564198148042 and parameters: {'learning_rate': 0.059234770356500877, 'num_leaves': 50, 'max_depth': 16, 'min_child_samples': 1, 'subsample

## CatBoost

In [14]:
params_cat ={    
    'thread_count': 4,
    'eval_metric': 'AUC',
    'loss_function': 'MultiClass',
    'random_seed': SEED,
    'verbose': False,
    'cat_features' : [8,9,10,11,12,13,14,15]
    
}
def cat_objective(trial):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'depth': trial.suggest_int('depth', 4, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
        'iterations': trial.suggest_int('iterations', 50, 300),
        'border_count': trial.suggest_int('border_count', 32, 255),
        **params_cat
    }
    
    optuna_model = make_pipeline(
        FeatureScaler,
        CatBoostClassifier(**params)
    )
    
    optuna_score = cross_val_score(optuna_model, X, y, scoring='accuracy', cv=SKF)
    
    return np.mean(optuna_score)

In [15]:
cat_study = optuna.create_study(direction = 'maximize')
cat_study.optimize(cat_objective,n_trials=TRIALS, n_jobs=-1, show_progress_bar=True)
print("")
print(f'scores : {cat_study.best_value}, params : {cat_study.best_params}')
with open('cat.json', 'w') as json_file:
    json.dump(cat_study.best_params, json_file, indent=4)
optuna.visualization.plot_param_importances(cat_study)

[I 2024-02-06 00:05:18,445] A new study created in memory with name: no-name-b2837778-bc1b-4bd6-ab52-2c135b8e990a


  0%|          | 0/200 [00:00<?, ?it/s]

[I 2024-02-06 00:06:44,681] Trial 2 finished with value: 0.8746503500560954 and parameters: {'learning_rate': 0.03739944155725026, 'depth': 8, 'l2_leaf_reg': 5.173271362547643, 'iterations': 80, 'border_count': 151}. Best is trial 2 with value: 0.8746503500560954.
[I 2024-02-06 00:07:02,851] Trial 6 finished with value: 0.8987855028548962 and parameters: {'learning_rate': 0.26114494554501494, 'depth': 7, 'l2_leaf_reg': 8.735731373685564, 'iterations': 101, 'border_count': 152}. Best is trial 6 with value: 0.8987855028548962.
[I 2024-02-06 00:08:34,735] Trial 7 finished with value: 0.9027358126671892 and parameters: {'learning_rate': 0.2446158661721171, 'depth': 6, 'l2_leaf_reg': 7.890124707330498, 'iterations': 190, 'border_count': 111}. Best is trial 7 with value: 0.9027358126671892.
[I 2024-02-06 00:09:16,759] Trial 10 finished with value: 0.8996045593860661 and parameters: {'learning_rate': 0.13066013731601567, 'depth': 8, 'l2_leaf_reg': 1.1054488900509147, 'iterations': 188, 'borde

## Summary

In [16]:
params = {
    **xgb_study.best_params,
    **params_xgb
}

xgb = make_pipeline(
    FeaturesEncoding,
    FeatureScaler,
    XGBClassifier(**params)
)
xgb_scores = cross_val_score(xgb, X,y,scoring='accuracy',cv=SKF, n_jobs=-1)
print(f'Mean score XGBoost : {np.mean(xgb_scores):.5f} ± {np.std(xgb_scores):.5f}')

Mean score XGBoost : 0.90828 ± 0.00462


In [17]:
params = {
    **lgbm_study.best_params,
    **params_lgbm
}
lgbm = make_pipeline(
        FeaturesEncoding,
        FeatureScaler,
        LGBMClassifier(**params)
)
cat_scores = cross_val_score(lgbm, X,y,scoring='accuracy',cv=SKF, n_jobs=-1)
print(f'Mean score CatBoost : {np.mean(cat_scores):.5f} ± {np.mean(cat_scores):.5f}')

Mean score CatBoost : 0.91015 ± 0.91015


In [18]:
params = {
    **cat_study.best_params,
    **params_cat
}
cat = make_pipeline(
    FeatureScaler,
    CatBoostClassifier(**params)
)
cat_scores = cross_val_score(cat, X,y,scoring='accuracy',cv=SKF, n_jobs=-1)

print(f'Mean score CatBoost : {np.mean(cat_scores):.5f} ± {np.mean(cat_scores):.5f}')

Mean score CatBoost : 0.90534 ± 0.90534


# Voting

In [19]:
xgb_predict_proba = cross_val_predict(xgb, X,y,cv=SKF, n_jobs=-1, method='predict_proba')
lgbm_predict_proba = cross_val_predict(lgbm, X,y,cv=SKF, n_jobs=-1, method='predict_proba')
cat_predict_proba = cross_val_predict(cat, X,y,cv=SKF, n_jobs=-1, method='predict_proba')

In [20]:
def proba_true(proba) : 
    df = pd.concat([pd.DataFrame(y, columns=['true']),pd.DataFrame(proba)], axis=1)
    return df.apply(lambda row : row[row['true']], axis=1)

In [21]:
prediction = pd.DataFrame()
prediction.insert(0,'xgb',proba_true(xgb_predict_proba))
prediction.insert(1,'lgbm',proba_true(lgbm_predict_proba))
prediction.insert(2,'cat',proba_true(cat_predict_proba))

In [55]:
ridge = RidgeClassifier()
ridge.fit(prediction,y)
weights = ridge.coef_[0]/sum(ridge.coef_[1])
print(weights)

[ 0.12128787 -0.58146858 -0.21770786]


In [56]:
estimators = [
    ('xgb',xgb),
    ('lgbm',lgbm),
    ('cat',cat)
]
voting = VotingClassifier(estimators, voting='soft',weights=weights)

In [50]:
scores = cross_val_score(voting, X,y,scoring='accuracy',cv=SKF, n_jobs=-1)
print(f'Mean score : {np.mean(scores):.5f} ± {np.mean(scores):.5f}')

Mean score : 0.91112 ± 0.91112


# Submission

In [57]:
voting.fit(X,y)

In [58]:
test = pd.read_csv('data/test.csv', index_col='id')
submission = pd.read_csv("data/sample_submission.csv", index_col='id')

In [59]:
voting.predict(test)

array([3, 5, 4, ..., 0, 1, 3])

In [60]:
submission.loc[:,'NObeyesdad'] = lb.inverse_transform(voting.predict(test))

In [61]:
name = dt.now().strftime("%Y%m%d_%H%M")

In [62]:
submission.to_csv(f"submission/{name}.csv")

In [63]:
submission

Unnamed: 0_level_0,NObeyesdad
id,Unnamed: 1_level_1
20758,Obesity_Type_II
20759,Overweight_Level_I
20760,Obesity_Type_III
20761,Obesity_Type_I
20762,Obesity_Type_III
...,...
34593,Overweight_Level_II
34594,Normal_Weight
34595,Insufficient_Weight
34596,Normal_Weight
