# Decision Trees



Student: Noa Pereira Prada Schnor

Student ID: A00326381

### Dataset Overview:
#### Name: Estimation of Obesity Levels Based On Eating Habits and Physical Condition
#### Purpose:
This dataset is primarily used to estimate obesity levels among individuals from Mexico, Peru, and Colombia by analysing their eating habits and physical activity.
#### Data Generation:
- **77%** of the data was generated **synthetically** using the **Weka** tool and the **SMOTE** filter, while **23%** of the data was **collected directly** from users via a **web platform**.
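
The idea behind SMOTE can be illustrated with a small sketch: a synthetic record is placed at a random position on the line segment between an existing minority-class record and one of its nearest neighbours. The code below is a simplified illustration on made-up points, not the Weka filter used by the dataset authors; the function name `smote_like_sample` and the toy array `minority` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points (two numeric features), for illustration only
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

def smote_like_sample(points, k=2, rng=rng):
    """Create one synthetic point by interpolating towards a near neighbour."""
    base = points[rng.integers(len(points))]
    # Euclidean distance from the base point to every point
    dists = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbours, skipping the base point itself
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    # Random position on the segment between the base point and the neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)

synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies between two real points, the generated records stay inside the range of the observed data.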
    
#### Target Variable:
- **Name:** NObeyesdad
- **Description:** Ordinal variable created based on the **Body Mass Index (BMI)**. The dataset classifies individuals into 7 NObeyesdad categories, from Insufficient Weight to Obesity Type III.
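
Since the target is derived from BMI, the underlying quantity can be recomputed from weight and height. The snippet below is a minimal sketch on made-up values (the `sample` frame is an assumption, not rows from the dataset); BMI is weight in kilograms divided by the square of height in metres.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the dataset)
sample = pd.DataFrame({"Weight": [50.0, 70.0, 110.0], "Height": [1.70, 1.75, 1.80]})

# BMI = weight (kg) / height (m) squared
sample["BMI"] = sample["Weight"] / sample["Height"] ** 2
print(sample.round(2))
```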

#### Feature Variables:
The feature variables are related to eating habits attributes, physical activity attributes and additional attributes, such as gender and age. All the attributes are listed below.

##### Eating Habits Attributes:
The following attributes are related to the **eating habits** of individuals:
- **Frequent consumption of high-caloric food (FAVC)** - binary: Yes/No
- **Frequency of consumption of vegetables (FCVC)** - categorical: Never, Sometimes, Always
- **Number of main meals (NCP)** - categorical: Between 1 and 2, Three, More than three
- **Consumption of food between meals (CAEC)** - categorical: No, Sometimes, Frequently, Always
- **Consumption of water daily (CH2O)** - categorical: Less than a litre, Between 1 and 2 L, More than 2 L
- **Consumption of alcohol (CALC)** - categorical: No, Sometimes, Frequently, Always

##### Physical Activity Attributes:
The attributes related to the **physical activity** of individuals include:
- **Calories consumption monitoring (SCC)** - binary: Yes/No
- **Physical activity frequency (FAF)** - categorical: I do not have, 1 or 2 days, 2 or 4 days, 4 or 5 days
- **Time using technology devices (TUE)** - categorical: 0–2 hours, 3–5 hours, More than 5 hours
- **Transportation used (MTRANS)** - categorical: Automobile, Motorbike, Bike, Public Transportation, Walking


##### Additional Attributes:
Other variables obtained in the dataset are:
- **Gender** - binary: Male/Female
- **Age** - numerical, in years
- **Height** - numerical, in metres
- **Weight** - numerical, in kilograms
- **Smoke (SMOKE)** - binary: Yes/No
- **Family History of Obesity (family_history_with_overweight)** - binary: Yes/No


#### Additional Resources:
For detailed information and studies related to this dataset, please refer to the following sources:
- [UCI Machine Learning Repository: Estimation of Obesity Levels Based On Eating Habits and Physical Condition Dataset](https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition)
- [Dataset for Estimation of Obesity Levels Based on Eating Habits and Physical Condition](https://www.semanticscholar.org/paper/Dataset-for-estimation-of-obesity-levels-based-on-Palechor-Manotas/35b40bacd2ffa9370885b7a3004d88995fd1d011)
- The full paper can be accessed from [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6710633/).

### Import libraries

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split #to split into the training and test data
from sklearn import metrics #to calculate accuracy
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix #to check the prediction expected vs predicted
from sklearn.metrics import accuracy_score
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import pandas as pd
import numpy as np

### Load Data and Exploration

In [None]:
#Read the csv file
df = pd.read_csv("data/ObesityDataSet_raw_and_data_sinthetic.csv")

In [None]:
#Check first  rows of the dataset
df.head() 

- The target variable (NObeyesdad) is a categorical/ordinal variable
- Features: mixed data type - categorical, binary, and continuous (numerical)

In [None]:
#Check number of instances and columns
df.shape

- Instances: 2111
- No. columns: 17 (including the Target variable)

In [None]:
#Check the data type
df.info()

- No missing data
- All columns with 2111 data entries/instances
- 9 Non-numerical columns with Dtype 'object' - categorical
- 8 Numerical columns float-type

In [None]:
#Check the statistical summary of numerical columns
df.describe()

- Data spread: Weight varies widely (from 39 to 173 kg)
- The dataset is not limited to adults: Age ranges from 14 to 61

In [None]:
#Check the distributions of the categories of the target variable
print(df.NObeyesdad.value_counts())

- Fairly balanced distribution across Obesity Level categories
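
One way to put a number on "fairly balanced" is to compare class shares. The sketch below uses a made-up `target` series rather than the real column; `imbalance_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Toy target column for illustration only
target = pd.Series(["Normal_Weight"] * 3 + ["Obesity_Type_I"] * 4 + ["Insufficient_Weight"] * 3)

# Share of each class (the shares sum to 1)
shares = target.value_counts(normalize=True)
print(shares)

# Ratio between the largest and the smallest class; values near 1 indicate balance
imbalance_ratio = shares.max() / shares.min()
print(round(imbalance_ratio, 2))
```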

In [None]:
#Explore the relationships among numerical variables
scatter_matrix(df, figsize=(10, 10), diagonal='kde')  # 'kde' to show kernel density on the diagonal
# Save the figure
plt.savefig('plots/DecisionTrees/DT_scatter_matrix.png')
plt.show()

- Age distribution - right-skewed: most individuals are younger with fewer older individuals
- Height/Weight more normal-like distribution
- Variables like FCVC, NCP, and CH2O have distinct peaks, suggesting certain preferences or habits may dominate the dataset (like eating habits or water consumption). 
- There don’t appear to be strong linear relationships between these lifestyle factors and physical attributes like height, weight, or age.
- Weight vs. Age: older individuals possibly weigh more, but there doesn’t appear to be a strong correlation.
- Weight vs. Height: positive correlation, as expected. Taller individuals generally weigh more.
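
The visual impressions from a scatter matrix can be checked numerically with a Pearson correlation matrix. The sketch below runs on synthetic values in which weight is constructed to rise with height; the `toy` frame and its coefficients are assumptions for illustration, not estimates from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic columns: weight is built to increase with height, age is independent
height = rng.normal(1.70, 0.09, size=200)
weight = 60 * height - 20 + rng.normal(0, 5, size=200)
age = rng.integers(14, 62, size=200)

toy = pd.DataFrame({"Height": height, "Weight": weight, "Age": age})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()
print(corr.round(2))
```

On the real data, the same call on the numeric columns would give one coefficient per pair of variables shown in the scatter matrix.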

In [None]:
# Define mappings
frequency_mapping = {'no': 0, 'Sometimes': 1, 'Frequently': 2, 'Always': 3}
binary_mapping = {'no': 0, 'yes': 1}

# Columns to map
frequency_columns = ['CALC', 'CAEC']
binary_columns = ['family_history_with_overweight', 'SMOKE', 'FAVC', 'SCC']

# Apply mappings to frequency columns
for col in frequency_columns:
    # Cast to string and map (no case conversion: the keys must match the raw values exactly)
    df[col] = df[col].astype(str).map(frequency_mapping)
    
    # Values absent from the mapping become NaN, so count them as a sanity check
    print(f"Column {col} missing values after mapping:", df[col].isna().sum())

# Apply mappings to binary columns
for col in binary_columns:
    # Cast to string and map (no case conversion: the keys must match the raw values exactly)
    df[col] = df[col].astype(str).map(binary_mapping)
    
    # Values absent from the mapping become NaN, so count them as a sanity check
    print(f"Column {col} missing values after mapping:", df[col].isna().sum())

# Combine transformed columns
transformed_columns = frequency_columns + binary_columns


# One-hot encode 'MTRANS' and 'Gender' in one line
df = pd.get_dummies(df, columns=['MTRANS', 'Gender'], prefix=['MTRANS', 'Gender'])

# Check the first 5 rows of the transformed DataFrame
df.head()

- Categorical features were encoded.
- The target variable is ordinal and has not been encoded yet. The model will be run with both the non-encoded and the encoded target variable to check which approach gives the better model performance.
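
The two encoding styles used above can be seen side by side on a toy frame. This is a minimal sketch with made-up rows (the `toy` frame is an assumption); it also shows why the missing-value check matters, since a value whose spelling or casing differs from the mapping keys silently becomes NaN.

```python
import pandas as pd

# Toy rows for illustration only; the last CAEC value has unexpected casing
toy = pd.DataFrame({
    "CAEC": ["no", "Sometimes", "Always", "sometimes"],
    "MTRANS": ["Walking", "Bike", "Automobile", "Walking"],
})

frequency_mapping = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}

# Ordinal encoding: values missing from the mapping become NaN
toy["CAEC_encoded"] = toy["CAEC"].map(frequency_mapping)
n_unmapped = int(toy["CAEC_encoded"].isna().sum())
print("Unmapped values:", n_unmapped)

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["MTRANS"], prefix=["MTRANS"])
print(encoded.columns.tolist())
```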

### Modelling

#### Non encoded target variable (NObeyesdad)

In [None]:
# Defining the features
X = df.drop(['NObeyesdad'], axis=1)  # Dropping the target variable

In [None]:
#Defining the target variable
y = df['NObeyesdad']

In [None]:
#Splitting the data (training, test data)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1,stratify=y)

In [None]:
#Create the model and fit the training data
tree = DecisionTreeClassifier()
tree.fit(X_train,y_train)

In [None]:
#Predict the response for test dataset
y_hat = tree.predict(X_test)

In [None]:
#Model Accuracy, how often is the classifier correct?
accuracy = metrics.accuracy_score(y_test, y_hat)
print("Accuracy:", accuracy)

In [None]:
# Check the depth of the tree after fitting
print(f"Depth of the tree: {tree.get_depth()}")

In [None]:
#Create cm for a non-encoded target variable 

feature_names = X.columns
print(feature_names)

# Class names in the order scikit-learn uses internally (sorted alphabetically for string labels)
target_names = list(tree.classes_)

#Confusion matrix (rows: true classes, columns: predicted classes), with an explicit label order
cm = confusion_matrix(y_test, y_hat, labels=target_names)

# Convert the confusion matrix to a DataFrame with target names as row and column labels
cm_df = pd.DataFrame(cm, index=target_names, columns=target_names)

# Print the confusion matrix with labels
print("Confusion Matrix:")
print(cm_df)

- Good classification ability, especially in categories like Obesity_Type_III and Overweight_Level_II, where correct classification rates are very high.
- The misclassifications happen between normal weight and insufficient weight, and obesity I and II.
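
When reading a confusion matrix, the row and column labels must follow the same order that `confusion_matrix` uses; passing `labels=` explicitly removes any ambiguity. The sketch below uses made-up predictions (not results from this model) and also derives per-class recall from the matrix.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true = ["Normal_Weight", "Obesity_Type_I", "Insufficient_Weight", "Normal_Weight", "Obesity_Type_I"]
y_pred = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight", "Normal_Weight", "Normal_Weight"]

# Explicit label order: rows are true classes, columns are predicted classes
labels = ["Insufficient_Weight", "Normal_Weight", "Obesity_Type_I"]
cm_demo = confusion_matrix(y_true, y_pred, labels=labels)
cm_demo_df = pd.DataFrame(cm_demo, index=labels, columns=labels)
print(cm_demo_df)

# Per-class recall: correct predictions (diagonal) divided by the row totals
recall = cm_demo.diagonal() / cm_demo_df.sum(axis=1)
print(recall.round(2))
```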

In [None]:
#Check the splits, gini values, and how the features were used at every level
dot_data = StringIO()

# Export the fitted tree in Graphviz DOT format
export_graphviz(
    tree, 
    out_file=dot_data, 
    filled=True,               # colour nodes by majority class
    rounded=True,              # draw rounded node boxes
    special_characters=True,
    feature_names=list(X.columns), 
    class_names=list(tree.classes_)  # class names must follow the order of tree.classes_
)

# Creating the graph from dot data
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

#save figure
graph.write_png('plots/DecisionTrees/obesityDecisionTree_not_encoded.png')


# Displaying the image
Image(graph.create_png())

In [None]:
# Perform 5-fold cross-validation on the decision tree model (one accuracy score per fold)
scores = cross_val_score(tree, X_train, y_train, cv=5)

In [None]:
# Print the accuracy scores for each fold to see the model's performance on each subset of the data
print(scores)

# Calculate and print the mean accuracy across all folds to get an overall cross-validation accuracy estimate
print(scores.mean())

In [None]:
#Cross-validation helps ensure that the model generalizes well to unseen data. 

accuracy_depths = []

# Loop through the depths and calculate both cross-validation and training accuracy
for d in range(1, 20):
    tree = DecisionTreeClassifier(max_depth=d)
    
    # Cross-validation accuracy (using 5-fold CV)
    scores = cross_val_score(tree, X_train, y_train, cv=5)
    mean_cv_score = scores.mean()
    
    # Fit the model on the full training set to compute training accuracy
    tree.fit(X_train, y_train)
    train_predictions = tree.predict(X_train)
    train_accuracy = accuracy_score(y_train, train_predictions)
    
    # Store both accuracies and the depth
    accuracy_depths.append((mean_cv_score, train_accuracy, d))
    
    # Printing the CV and Training accuracies
    print(f"Depth {d}: Cross Validation Accuracy = {mean_cv_score:.4f}, Training Accuracy = {train_accuracy:.4f}")


# Finding the maximum cross-validation accuracy
max_accuracy = max(accuracy_depths, key=lambda x: x[0])[0]

# Finding all depths that have the maximum accuracy
best_depths = [d for cv_score, train_accuracy, d in accuracy_depths if cv_score == max_accuracy]

# Printing the depth with the highest Cross Validation accuracy 
print(f"\nMaximum Cross-Validation Accuracy: {max_accuracy:.4f} at Depth(s): {best_depths}")

- The best cross-validation accuracy is at depth 11. However, the training accuracy reaches 100% from depth 11 onwards, and the cross-validation accuracy barely changes beyond depth 10. To reduce the risk of overfitting, depth 10 could be chosen.
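
The reasoning above (prefer a shallower tree when deeper ones add little) can be turned into a simple rule: pick the smallest depth whose cross-validation accuracy is within a small tolerance of the best one. The sketch below applies that rule to synthetic data from `make_classification`; the data, the `tolerance` value and the variable names are assumptions for illustration, not the procedure used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the training set (illustration only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Mean 5-fold cross-validation accuracy for each candidate depth
cv_by_depth = {}
for d in range(1, 11):
    model = DecisionTreeClassifier(max_depth=d, random_state=0)
    cv_by_depth[d] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_score = max(cv_by_depth.values())
tolerance = 0.01  # hypothetical threshold for "does not change much"

# Smallest depth whose score is within the tolerance of the best score
chosen_depth = min(d for d, s in cv_by_depth.items() if s >= best_score - tolerance)
print(chosen_depth, round(cv_by_depth[chosen_depth], 4))
```

Fixing `random_state` in the classifier makes the comparison across depths repeatable between runs.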


#### Encoded target variable (NObeyesdad)

In [None]:
#Encoding the target variable to check if there is any change/improvement in the model performance

#Check unique values of the target variable
print(df['NObeyesdad'].unique())

#Category order mapping
category_order = {
    'Insufficient_Weight': 0,
    'Normal_Weight': 1,
    'Overweight_Level_I': 2,
    'Overweight_Level_II': 3,
    'Obesity_Type_I': 4,
    'Obesity_Type_II': 5,
    'Obesity_Type_III': 6
}

#Assign digits to the target variable
y_encoded = df['NObeyesdad'].map(category_order)

#Check the encoded target variable
print(y_encoded)

In [None]:
#Splitting the data (training, test data)
X_train,X_test,y_train_encoded,y_test_encoded = train_test_split(X,y_encoded,test_size=0.25,random_state=1,stratify=y)

In [None]:
#Create the model and fit the training data
tree_encoded = DecisionTreeClassifier()
tree_encoded.fit(X_train,y_train_encoded)

In [None]:
#Predict the response for the test dataset
y_hat_encoded = tree_encoded.predict(X_test)

In [None]:
#Model Accuracy, how often is the classifier correct?
accuracy_encoded = metrics.accuracy_score(y_test_encoded, y_hat_encoded)
print("Accuracy:", accuracy_encoded)

In [None]:
# Check the depth of the tree after fitting
print(f"Depth of the tree: {tree_encoded.get_depth()}")

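- The model behaves very similarly with the encoded and the non-encoded target variable (same depth).
- The model score improved slightly (from 0.9318 to 0.9337). `DecisionTreeClassifier` treats integer labels as unordered classes and no `random_state` was set, so a difference of this size may simply reflect run-to-run variation.
- The model handles a categorical target variable well.
- The model captured most patterns even without explicit ordinal encoding.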
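
To separate the effect of label encoding from run-to-run randomness, the two variants can be fitted with the same `random_state`. The sketch below does this on synthetic data; the class names in `names` are hypothetical and chosen so that their alphabetical order matches the integer order, which keeps the internal class indices identical in both fits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_int = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical class names whose alphabetical order matches the integer order
names = np.array(["a_low", "b_mid", "c_high"])
y_str = names[y_int]

# Same data, same random_state; only the label representation differs
tree_int = DecisionTreeClassifier(random_state=0).fit(X_demo, y_int)
tree_str = DecisionTreeClassifier(random_state=0).fit(X_demo, y_str)

# Map integer predictions back to names and compare with the string-label tree
same_predictions = np.array_equal(names[tree_int.predict(X_demo)], tree_str.predict(X_demo))
same_depth = tree_int.get_depth() == tree_str.get_depth()
print(same_predictions, same_depth)
```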

In [None]:
scores_encoded = cross_val_score(tree_encoded, X_train, y_train_encoded, cv=5)

In [None]:
print(scores_encoded)
print(scores_encoded.mean())

In [None]:
# Cross-validation helps ensure that the model generalizes well to unseen data.
accuracy_depths_encoded = []

# Loop through the depths and calculate both cross-validation and training accuracy
for d in range(1, 20):
    tree_encoded = DecisionTreeClassifier(max_depth=d)
    
    # Cross-validation accuracy (using 5-fold CV)
    scores_encoded = cross_val_score(tree_encoded, X_train, y_train_encoded, cv=5)
    mean_cv_score_encoded = scores_encoded.mean()
    
    # Fit the model on the full training set to compute training accuracy
    tree_encoded.fit(X_train, y_train_encoded)
    train_predictions_encoded = tree_encoded.predict(X_train)
    train_accuracy_encoded = accuracy_score(y_train_encoded, train_predictions_encoded)
    
    # Store both accuracies and the depth
    accuracy_depths_encoded.append((mean_cv_score_encoded, train_accuracy_encoded, d))
    
    # Printing the CV and Training accuracies
    print(f"Depth {d}: Cross Validation Accuracy = {mean_cv_score_encoded:.4f}, Training Accuracy = {train_accuracy_encoded:.4f}")

# Finding the maximum cross-validation accuracy
max_accuracy_encoded = max(accuracy_depths_encoded, key=lambda x: x[0])[0]

# Finding all depths that have the maximum accuracy
best_depths_encoded = [d for cv_score_encoded, train_accuracy_encoded, d in accuracy_depths_encoded if cv_score_encoded == max_accuracy_encoded]

# Printing the depth with the highest Cross Validation accuracy 
print(f"\nMaximum Cross-Validation Accuracy: {max_accuracy_encoded:.4f} at Depth(s): {best_depths_encoded}")


#### Checking the cross-validation accuracy of encoded vs non-encoded target variable:
- Encoding the target as integers gave a slight advantage in cross-validation accuracy in this run.
- In both cases (encoded and non-encoded target variable) the model overfits at greater depths, which is useful to know when tuning the model.
- The best depth for the encoded target seems to be 13; however, the training accuracy reaches 100% at depth 11, while the cross-validation accuracy does not change significantly.
- As the tree gets deeper, it can handle more complex decision boundaries, but this comes with the risk of overfitting, which was evident in the training and cross-validation accuracy results.
- The depth chosen is therefore 10.

In [None]:
#Create the model and fit the training data
tree_encoded = DecisionTreeClassifier(max_depth=10)
tree_encoded.fit(X_train,y_train_encoded)
#Predict the response for the test dataset
y_hat_encoded = tree_encoded.predict(X_test)

#Model Accuracy, how often is the classifier correct?
accuracy_encoded = metrics.accuracy_score(y_test_encoded, y_hat_encoded)
print("Accuracy:", accuracy_encoded)

In [None]:
#Check the splits, gini values, and how the features were used at every level
dot_data = StringIO()

# Export the depth-10 tree trained on the encoded target in Graphviz DOT format
export_graphviz(
    tree_encoded, 
    out_file=dot_data, 
    filled=True,               # colour nodes by majority class
    rounded=True,              # draw rounded node boxes
    special_characters=True,
    feature_names=list(X.columns), 
    class_names=list(category_order.keys())  # category names in encoded order (0 to 6)
)

# Creating the graph from dot data
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

#save figure
graph.write_png('plots/DecisionTrees/obesityDecisionTree_depth10_encoded.png')


# Displaying the image
Image(graph.create_png())

In [None]:
#Create a cm for an encoded target variable

# Create a list of labels in the correct order
labels = list(category_order.keys())

cm_encoded = confusion_matrix(y_test_encoded, y_hat_encoded)

# Create the DataFrame with the category names as labels
cm_df = pd.DataFrame(cm_encoded, index=labels, columns=labels)

# Print the confusion matrix with readable labels
print("Confusion Matrix with Labels:")
print(cm_df)

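- If the ordinal nature of the target needs to be considered explicitly, the encoded target should be chosen. Note that `DecisionTreeClassifier` itself treats the integer codes as unordered class labels, so the ordering only matters for steps that make use of it, such as ordinal-aware evaluation.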
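
One place where the encoded target is directly useful is evaluation: with integer codes, metrics can account for how far a prediction is from the true level. The sketch below compares two made-up sets of predictions with the same accuracy but different error sizes, using mean absolute error and quadratic-weighted Cohen's kappa. The arrays are illustrative assumptions, not outputs of the models above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_absolute_error

# Toy encoded labels (0 = Insufficient_Weight ... 6 = Obesity_Type_III), illustration only
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 3])
near_miss = np.array([0, 1, 2, 3, 4, 5, 5, 4])  # two errors, each one level off
far_miss = np.array([0, 1, 2, 3, 4, 5, 0, 6])   # two errors, several levels off

results = {}
for name, y_pred in [("near", near_miss), ("far", far_miss)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),  # average distance in levels
        "qwk": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
    }
    print(name, results[name])
```

Accuracy alone cannot tell these two cases apart, whereas the ordinal-aware metrics penalise the predictions that land further from the true level.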

## The end