# Exploring Class Imbalance

During our EDA, we found that several categorical features (FAVC, SMOKE, SCC) were imbalanced. Based on our [class imbalance research spike](https://docs.google.com/document/d/116kfJpA7369dsGvCV0twKVvFQefTEyGULMDg783SfOY/edit?tab=t.0), we decided to explore random oversampling to see whether it would produce a data set that leads to more accurate model predictions.

In [15]:
# Load imports and cleaned data
import pandas as pd

df = pd.read_csv('../data/processed/cleaned_data.csv')

print(df.columns)
df.info()

Index(['Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight',
       'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE',
       'CALC', 'MTRANS', 'NObeyesdad'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2087 non-null   object 
 1   Age                             2087 non-null   float64
 2   Height                          2087 non-null   float64
 3   Weight                          2087 non-null   float64
 4   family_history_with_overweight  2087 non-null   object 
 5   FAVC                            2087 non-null   object 
 6   FCVC                            2087 non-null   float64
 7   NCP                             2087 non-null   float64
 8   CAEC                            2087 non-null   object 
 9   S

In [16]:
# Define function to balance data on a targeted column
from sklearn.utils import resample


def balance_classes(df, target_column):
    """Randomly oversample each minority class in target_column up to the majority class count."""
    balanced_df = df.copy()

    print("=====")
    print("Balancing Column: ", target_column)

    # Class counts
    class_counts = balanced_df[target_column].value_counts()
    print("Prior to balancing class count: ", class_counts)

    # Find the majority class count
    majority_class_count = class_counts.max()

    # Upsample minority classes; the majority class is kept unchanged
    upsampled_dfs = []
    for class_value, count in class_counts.items():
        class_df = balanced_df[balanced_df[target_column] == class_value]
        if count < majority_class_count:
            class_df = resample(class_df,
                                replace=True,  # Sample with replacement
                                n_samples=majority_class_count,  # Match majority class
                                random_state=42)  # Reproducible results
        upsampled_dfs.append(class_df)

    # Combine all classes; original row labels are kept, so the index contains duplicates
    balanced_df = pd.concat(upsampled_dfs)

    print("After balancing class count: ", balanced_df[target_column].value_counts())
    print("=====")

    return balanced_df

## Review Balancing the Data Set


In [17]:
# Balance out NObeyesdad (Obesity Levels)

df = balance_classes(df, 'NObeyesdad')
df.info()

=====
Balancing Column:  NObeyesdad
Prior to balancing class count:  NObeyesdad
Obesity_Type_I         351
Obesity_Type_III       324
Obesity_Type_II        297
Overweight_Level_II    290
Normal_Weight          282
Overweight_Level_I     276
Insufficient_Weight    267
Name: count, dtype: int64
After balancing class count:  NObeyesdad
Obesity_Type_I         351
Obesity_Type_III       351
Obesity_Type_II        351
Overweight_Level_II    351
Normal_Weight          351
Overweight_Level_I     351
Insufficient_Weight    351
Name: count, dtype: int64
=====
<class 'pandas.core.frame.DataFrame'>
Index: 2457 entries, 10 to 686
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2457 non-null   object 
 1   Age                             2457 non-null   float64
 2   Height                          2457 non-null   float64
 3   Weight                         

**Outcome:** The target was already fairly balanced: class counts ranged from 267 to 351 before oversampling. Next, let's see what happens when we balance one of the imbalanced categorical features.
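One way to put a number on how balanced the target is: divide the largest class count by the smallest. The sketch below does this with the pre-balancing counts copied from the output above. `imbalance_ratio` is a hypothetical helper introduced for illustration, not part of the notebook.

```python
import pandas as pd


def imbalance_ratio(counts: pd.Series) -> float:
    """Return the largest class count divided by the smallest (1.0 means perfectly balanced)."""
    return counts.max() / counts.min()


# NObeyesdad class counts before balancing, copied from the printed output above
target_counts = pd.Series({
    "Obesity_Type_I": 351,
    "Obesity_Type_III": 324,
    "Obesity_Type_II": 297,
    "Overweight_Level_II": 290,
    "Normal_Weight": 282,
    "Overweight_Level_I": 276,
    "Insufficient_Weight": 267,
})

target_ratio = imbalance_ratio(target_counts)
print(round(target_ratio, 2))  # 1.31
```

A ratio this close to 1 supports the reading that the target needed little or no rebalancing.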

In [18]:
# Balance out FAVC (Eating high caloric food frequently)

df = balance_classes(df, 'FAVC')

=====
Balancing Column:  FAVC
Prior to balancing class count:  FAVC
yes    2161
no      296
Name: count, dtype: int64
After balancing class count:  FAVC
yes    2161
no     2161
Name: count, dtype: int64
=====


In [19]:
# Inspect balancing effect on Target Obesity Levels

df['NObeyesdad'].value_counts()

NObeyesdad
Normal_Weight          1030
Overweight_Level_II     838
Insufficient_Weight     715
Overweight_Level_I      556
Obesity_Type_I          430
Obesity_Type_II         402
Obesity_Type_III        351
Name: count, dtype: int64

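**Outcome:** Balancing FAVC, applied here on top of the already-balanced target, grew the data set to 4,322 rows and left the target more imbalanced than before: class counts now range from 351 (Obesity_Type_III) to 1,030 (Normal_Weight). Training on this data might produce skewed model results.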
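The mechanism behind this skew can be reproduced on a toy example. The sketch below uses made-up data (not the obesity data set) in which every `FAVC == "no"` row belongs to one target class, so duplicating those rows inflates that class. For brevity it oversamples with pandas' `DataFrame.sample` instead of the `resample` call used in `balance_classes`; the column name `target` and the class labels are assumptions for illustration.

```python
import pandas as pd

# Toy data: the target is balanced (4 rows per class),
# but both FAVC == "no" rows fall in class "A".
toy = pd.DataFrame({
    "FAVC":   ["no", "no", "yes", "yes", "yes", "yes", "yes", "yes"],
    "target": ["A",  "A",  "A",   "A",   "B",   "B",   "B",   "B"],
})

before = toy["target"].value_counts()  # A: 4, B: 4

# Oversample the minority FAVC value with replacement to match the majority count
majority = toy[toy["FAVC"] == "yes"]
n_majority = int(len(majority))
minority = toy[toy["FAVC"] == "no"].sample(n=n_majority, replace=True, random_state=42)
balanced = pd.concat([majority, minority])

after = balanced["target"].value_counts()  # A: 8, B: 4

print(balanced["FAVC"].value_counts().to_dict())
print(after.to_dict())
```

FAVC ends up balanced at 6 rows each, but the target goes from an even split to twice as many "A" rows as "B" rows, because every duplicated row carried the "A" label.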

## Class Balancing Analysis

Based on this analysis, the team decided not to apply class balancing to the existing data set. Preserving the distribution of obesity levels matters more to us, since the goal is to train a model that identifies the most significant factors for predicting obesity across all available levels.