> Open /work/merged_dataset.csv, create features, and preprocess it for feature selection using scikit-learn's advanced modules

In [1]:
# First, we need to load the csv file into a pandas dataframe
import pandas as pd

df = pd.read_csv('/work/merged_dataset.csv')
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,id
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight,
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight,
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight,
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I,
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II,


In [2]:
# Next, we need to preprocess the data. In particular, we need to handle categorical variables and missing values.
# We'll start by identifying the categorical variables and converting them into dummy variables.

# Select the categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# For each categorical column
# We use pd.get_dummies to convert the categorical variable into dummy/indicator variables
for col in categorical_cols:
    df = pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col], prefix=col, drop_first=True)], axis=1)
    
df.head()

Unnamed: 0,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE,id,Gender_Male,...,MTRANS_Bike,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking,NObeyesdad_Normal_Weight,NObeyesdad_Obesity_Type_I,NObeyesdad_Obesity_Type_II,NObeyesdad_Obesity_Type_III,NObeyesdad_Overweight_Level_I,NObeyesdad_Overweight_Level_II
0,21.0,1.62,64.0,2.0,3.0,2.0,0.0,1.0,,False,...,False,False,True,False,True,False,False,False,False,False
1,21.0,1.52,56.0,3.0,3.0,3.0,3.0,0.0,,False,...,False,False,True,False,True,False,False,False,False,False
2,23.0,1.8,77.0,2.0,3.0,2.0,2.0,1.0,,True,...,False,False,True,False,True,False,False,False,False,False
3,27.0,1.8,87.0,3.0,3.0,2.0,2.0,0.0,,True,...,False,False,False,True,False,False,False,False,True,False
4,22.0,1.78,89.8,2.0,1.0,2.0,0.0,0.0,,True,...,False,False,True,False,False,False,False,False,False,True


In [3]:
# Next, we'll check for missing values. If there are any, we'll have to decide how to handle them.
df.isnull().sum()

Age                                      0
Height                                   0
Weight                                   0
FCVC                                     0
NCP                                      0
CH2O                                     0
FAF                                      0
TUE                                      0
id                                    2111
Gender_Male                              0
family_history_with_overweight_yes       0
FAVC_yes                                 0
CAEC_Frequently                          0
CAEC_Sometimes                           0
CAEC_no                                  0
SMOKE_yes                                0
SCC_yes                                  0
CALC_Frequently                          0
CALC_Sometimes                           0
CALC_no                                  0
MTRANS_Bike                              0
MTRANS_Motorbike                         0
MTRANS_Public_Transportation             0
MTRANS_Walk

In [4]:
# The 'id' column has 2111 missing values.
# It is only a record identifier with no predictive value, so we can safely drop it.

df = df.drop(['id'], axis=1)
df.head()

Unnamed: 0,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE,Gender_Male,family_history_with_overweight_yes,...,MTRANS_Bike,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking,NObeyesdad_Normal_Weight,NObeyesdad_Obesity_Type_I,NObeyesdad_Obesity_Type_II,NObeyesdad_Obesity_Type_III,NObeyesdad_Overweight_Level_I,NObeyesdad_Overweight_Level_II
0,21.0,1.62,64.0,2.0,3.0,2.0,0.0,1.0,False,True,...,False,False,True,False,True,False,False,False,False,False
1,21.0,1.52,56.0,3.0,3.0,3.0,3.0,0.0,False,True,...,False,False,True,False,True,False,False,False,False,False
2,23.0,1.8,77.0,2.0,3.0,2.0,2.0,1.0,True,True,...,False,False,True,False,True,False,False,False,False,False
3,27.0,1.8,87.0,3.0,3.0,2.0,2.0,0.0,True,False,...,False,False,False,True,False,False,False,False,True,False
4,22.0,1.78,89.8,2.0,1.0,2.0,0.0,0.0,True,False,...,False,False,True,False,False,False,False,False,False,True


We have successfully opened 'merged_dataset.csv' and created features by converting the categorical variables to dummy variables with pandas' get_dummies. We also dropped the 'id' column, which had 2111 missing values. The data is now preprocessed and ready for feature selection with scikit-learn's modules, or for any other data science method.

> Apply scikit-learn's advanced modules to find the relevant features

Alright. To estimate feature importance, we can use the `feature_importances_` attribute of tree-based classifiers, the magnitude of the coefficients (`coef_`) of linear models, or Recursive Feature Elimination (RFE). Here, we will fit a RandomForestClassifier and read its feature importances. Note that a more thorough feature selection would apply several of these methods and compare the results.
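
To make the comparison of methods concrete, here is a small sketch that puts the three approaches side by side. It runs on synthetic data from `make_classification` (a hypothetical stand-in, not the obesity dataset), and the estimator settings are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the obesity dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
names = [f"f{i}" for i in range(X_demo.shape[1])]

# Coefficients are only comparable when features share a common scale
X_scaled = StandardScaler().fit_transform(X_demo)

# 1) Impurity-based importances from a tree ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_scaled, y_demo)

# 2) Mean absolute coefficient across classes from a linear model
linear = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)
coef_magnitude = np.abs(linear.coef_).mean(axis=0)

# 3) Recursive Feature Elimination: rank 1 marks a selected feature
rfe_demo = RFE(
    RandomForestClassifier(n_estimators=100, random_state=1),
    n_features_to_select=3,
).fit(X_scaled, y_demo)

comparison = pd.DataFrame(
    {
        "rf_importance": forest.feature_importances_,
        "abs_coef": coef_magnitude,
        "rfe_rank": rfe_demo.ranking_,
    },
    index=names,
).sort_values("rf_importance", ascending=False)
print(comparison)
```

Features that rank highly under all three columns are the safest candidates to keep; disagreement between methods is a hint to look closer.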

But first, let's split the data into features X and target y. We'll consider `NObeyesdad_Normal_Weight`, `NObeyesdad_Obesity_Type_I`, `NObeyesdad_Obesity_Type_II`, `NObeyesdad_Obesity_Type_III`, `NObeyesdad_Overweight_Level_I`, `NObeyesdad_Overweight_Level_II` as target variables (representing different categories of obesity). The remaining columns will be our feature set. 
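
One caveat with using the dummy columns as the target: they were created with `drop_first=True`, so one category has no column of its own and its rows are all False. The sketch below, on a hypothetical three-class toy target, shows how a single label column can be recovered while keeping that baseline class distinct.

```python
import numpy as np
import pandas as pd

# Toy target with three classes (labels are illustrative only)
labels = pd.Series(["A", "B", "C", "A"], name="target")

# drop_first=True removes the column for the first category, "A"
dummies = pd.get_dummies(labels, prefix="target", drop_first=True)

# Pitfall: for an all-False row, idxmax returns the first column,
# so rows of the dropped class would be merged into another class
naive = dummies.idxmax(axis=1)

# Reserve code 0 for the dropped baseline and 1..k for the remaining columns
codes = np.where(
    dummies.any(axis=1),
    dummies.to_numpy().argmax(axis=1) + 1,
    0,
)
print(codes)
```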

Then, we'll fit a Random Forest classifier to our data. After that, we'll get the feature importances from the fitted model's `feature_importances_` attribute.

In [5]:
# Separate the data into features and target variables
target_cols = ['NObeyesdad_Normal_Weight', 'NObeyesdad_Obesity_Type_I', 'NObeyesdad_Obesity_Type_II', 
               'NObeyesdad_Obesity_Type_III', 'NObeyesdad_Overweight_Level_I', 'NObeyesdad_Overweight_Level_II']
X = df.drop(target_cols, axis=1)  # Features
y = df[target_cols]  # Target

# Import RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
model = RandomForestClassifier(random_state=1)

# Fit the model
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_

# Create a DataFrame for visualization
features = pd.DataFrame()
features['feature'] = X.columns
features['importance'] = importances
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

features

feature,importance
MTRANS_Bike,0.000338
MTRANS_Motorbike,0.000394
SMOKE_yes,0.001462
MTRANS_Walking,0.002872
CALC_Frequently,0.003177
CAEC_no,0.004169
SCC_yes,0.005189
CAEC_Frequently,0.008041
CAEC_Sometimes,0.011104
FAVC_yes,0.012212
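
An importance table like this can be turned into an actual selection with scikit-learn's `SelectFromModel`. This is a minimal sketch on synthetic data (a hypothetical stand-in); the `"median"` threshold is an assumption chosen for illustration, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (hypothetical; not the obesity dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=1,
)

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1),
    threshold="median",
)
selector.fit(X_demo, y_demo)

mask = selector.get_support()           # boolean mask over the input columns
X_reduced = selector.transform(X_demo)  # only the selected columns
print(mask.sum(), "features kept out of", X_demo.shape[1])
```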


In [14]:
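# First, collapse the dummy target columns back into a single label column
import numpy as np
from sklearn.feature_selection import RFE

# Map each dummy column to an integer code, reserving 0 for the baseline
# category that drop_first=True removed (its rows are all False)
class_map = {c: i for i, c in enumerate(target_cols, start=1)}

# idxmax alone would assign all-False rows to the first column, so set those rows to 0
y_single = y.idxmax(axis=1).map(class_map).where(y.any(axis=1), 0).astype(int)

# rfe is not defined in the cells shown here; this setup is an assumption
# (a random forest estimator with RFE's default of keeping half the features)
rfe = RFE(estimator=RandomForestClassifier(random_state=1))

# Fit RFE on the single-column target
rfe = rfe.fit(X, y_single)

# Get the selected features
features_selected = X.columns[rfe.support_]
features_selected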