# Multiclass Classification Final Project – Lifestyle Risk Category (Synthetic)

> **Educational Use Only.** This is a fully **synthetic**, interpretable tabular dataset created for teaching machine learning classification. Labels do **not** represent medical diagnoses or real medical risk; use this dataset only for ML practice.

## Overview
Build a **multiclass classifier** that predicts a person's **Lifestyle Risk Category** from human-understandable features such as **age, height, weight, BMI, blood pressure, resting heart rate, sleep, exercise, smoking, and alcohol use**.

**Target classes (`label`):**
- `Low`
- `Elevated`
- `High`
- `Very High`

The label is derived from a latent risk score that is computed from standardized features plus noise and then bucketed into four categories. The class distribution is moderately imbalanced by design, so prefer macro-averaged metrics over plain accuracy.
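
As a rough illustration of how a label of this kind could be produced, the sketch below standardizes three made-up features, combines them with hypothetical weights, adds Gaussian noise, and cuts the score at quantiles. The project's actual generator is not provided, so the weights, the noise level, and the 25/30/25/20 split are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Made-up raw features (illustrative only, not the project's generator)
raw = pd.DataFrame({
    "age_years": rng.integers(18, 76, size=n),
    "bmi": rng.normal(26, 5, size=n),
    "exercise_hours_per_week": rng.gamma(2.0, 1.5, size=n),
})

# Standardize each feature to zero mean and unit variance
z = (raw - raw.mean()) / raw.std()

# Hypothetical weights: positive raises the score, negative lowers it
weights = pd.Series({"age_years": 0.6, "bmi": 0.5, "exercise_hours_per_week": -0.4})
score = z.mul(weights, axis=1).sum(axis=1) + rng.normal(0, 0.3, size=n)

# Bucket the score at assumed cumulative quantiles (25%, 55%, 80%)
classes = ["Low", "Elevated", "High", "Very High"]
edges = score.quantile([0.25, 0.55, 0.80]).to_numpy()
codes = np.searchsorted(edges, score.to_numpy())
label = pd.Categorical.from_codes(codes, classes, ordered=True)

print(pd.Series(label).value_counts(normalize=True).round(2))
```

Because the cut points are quantiles of the noisy score, rows near a boundary can land on either side, which is one reason neighbouring classes tend to be confused with each other.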

## Files Provided
- `train.csv` — features **and** `label`
- `test.csv` — features only (**no** label)
- `sample_submission.csv` — required format for submissions (`id,label`)

## Feature List (short descriptions)

| Column | Type | Unit | Allowed Values / Range | Description |
|---|---|---|---|---|
| `id` | int | — | unique | Row identifier (must be preserved in submissions). |
| `sex` | category | — | `female`, `male` | Biological sex (categorical). Encode before modeling. |
| `age_years` | int | years | 18–75 (approx.) | Age in years. |
| `height_cm` | float | cm | ~140–205 | Body height. |
| `weight_kg` | float | kg | ~40–160 | Body weight. |
| `bmi` | float | kg/m² | ~15–50 | Body Mass Index derived from height/weight. |
| `waist_cm` | float | cm | ~55–160 | Waist circumference; correlated with BMI and height. |
| `sbp_mmHg` | int | mmHg | ~90–200 | Systolic blood pressure (higher is worse). |
| `dbp_mmHg` | int | mmHg | ~55–120 | Diastolic blood pressure. |
| `resting_hr_bpm` | int | bpm | ~45–120 | Resting heart rate (beats per minute). |
| `exercise_hours_per_week` | float | hours | ≥ 0 | Self-reported average weekly exercise. |
| `smoker` | int | — | 0 (non-smoker), 1 (smoker) | Smoking indicator. |
| `alcohol_units_per_week` | int | units | ≥ 0 | Approximate weekly alcohol units. |
| `sleep_hours_per_night` | float | hours | ~3.5–10.5 | Average nightly sleep duration. |
| `label` | category | — | `Low`, `Elevated`, `High`, `Very High` | **Target** class (only in `train.csv`). |

## Objective
Train a model to predict the `label` for every row in **test.csv** and submit a CSV exactly matching `sample_submission.csv` (`id,label`).
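
Before submitting, it can help to check the file against the required format. The sketch below is one possible check; `check_submission` is a hypothetical helper (not part of the project files), and the ids and labels in the example are made up.

```python
import pandas as pd

VALID_LABELS = {"Low", "Elevated", "High", "Very High"}

def check_submission(sub: pd.DataFrame, test_ids: pd.Series) -> None:
    """Raise AssertionError if `sub` does not follow the `id,label` format."""
    assert list(sub.columns) == ["id", "label"], "columns must be exactly id,label"
    assert len(sub) == len(test_ids), "one row per test id"
    assert sub["id"].is_unique, "ids must be unique"
    assert set(sub["id"]) == set(test_ids), "ids must match test.csv"
    assert set(sub["label"]) <= VALID_LABELS, "unknown label value"

# Tiny made-up example
test_ids = pd.Series([18, 21, 31])
sub = pd.DataFrame({"id": [18, 21, 31], "label": ["Very High", "High", "Low"]})
check_submission(sub, test_ids)
print("submission format OK")
```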

## Rules
1. **Multiclass** prediction: output *one* of `Low|Elevated|High|Very High` per test row.  
2. **No leakage**: you may not access the private labels.  
3. **Reproducibility**: set and report your random seeds.  
4. **Write-up** should include:
   - Problem framing & baseline
   - Preprocessing (encoding, scaling) and rationale
   - Model(s), training details, hyperparameters
   - Validation strategy (e.g., stratified K-fold)
   - Results with multiple metrics (see below)
   - Error analysis (confusion matrix; which classes are confused and why)
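
For the error analysis, a confusion matrix is easier to read when its rows and columns carry class names. The following is a minimal sketch on made-up labels and predictions; passing `labels=` fixes the row and column order to the risk order instead of the default sorted order.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

classes = ["Low", "Elevated", "High", "Very High"]

# Made-up validation labels and predictions, for illustration only
y_val  = ["Low", "Low", "Elevated", "Elevated", "High", "High", "Very High", "Very High"]
y_pred = ["Low", "Elevated", "Elevated", "High", "High", "High", "High", "Very High"]

cm = pd.DataFrame(
    confusion_matrix(y_val, y_pred, labels=classes),
    index=[f"true {c}" for c in classes],
    columns=[f"pred {c}" for c in classes],
)
print(cm)

# Row-normalized view: each row shows where that true class's samples ended up
print(cm.div(cm.sum(axis=1), axis=0).round(2))
```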

## Recommended Workflow
- Use a **stratified** split or cross-validation by `label`.
- Start with simple baselines (Logistic Regression / Linear SVM / Decision Tree).
- Encode categorical feature(s) and consider feature scaling for linear models.
- Compare **Accuracy**, **F1**, **Precision**, **Recall** (macro-averaged is recommended).
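
The workflow above can be sketched as a single scikit-learn pipeline evaluated with stratified cross-validation. This is only one possible arrangement: the data frame is a small made-up stand-in for `train.csv`, and the score thresholds used to create its labels are assumptions. Keeping the encoder and scaler inside the pipeline means they are refitted on the training part of every fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 400

# Small made-up stand-in for train.csv
X = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "age_years": rng.integers(18, 76, size=n),
    "bmi": rng.normal(26, 5, size=n),
})
score = (X["age_years"] - 46) / 17 + (X["bmi"] - 26) / 5 + rng.normal(0, 0.5, size=n)
y = pd.cut(score, bins=[-np.inf, -1, 0, 1, np.inf],
           labels=["Low", "Elevated", "High", "Very High"]).astype(str)

# Encode the categorical column and scale the numeric ones inside the pipeline
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(drop="if_binary"), ["sex"]),
    ("num", StandardScaler(), ["age_years", "bmi"]),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
print("accuracy:", scores["test_accuracy"].mean().round(3))
print("f1_macro:", scores["test_f1_macro"].mean().round(3))
```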

## Suggested Evaluation Snippet (for your validation split)

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    classification_report,
    confusion_matrix,
)

print("Accuracy:", accuracy_score(y_val, y_pred))
print("F1 (macro):", f1_score(y_val, y_pred, average="macro"))
print("Precision (macro):", precision_score(y_val, y_pred, average="macro"))
print("Recall (macro):", recall_score(y_val, y_pred, average="macro"))

print("\nClassification Report:\n", classification_report(y_val, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_val, y_pred))
```


In [279]:
# Setup: imports & paths (edit DATA_DIR if your CSVs are elsewhere)
import os

import numpy as np
import pandas as pd

# Try a few common locations; edit to your needs
CANDIDATE_DIRS = [
    ".",                # current folder
    "./data",           # a "data" subfolder
    "/mnt/data",        # preloaded path (if using this notebook as provided)
]

for _dir in CANDIDATE_DIRS:
    if os.path.exists(os.path.join(_dir, "train.csv")) and os.path.exists(os.path.join(_dir, "test.csv")):
        DATA_DIR = _dir
        break
else:
    DATA_DIR = "."  # default
    print("Could not auto-locate CSVs. Set DATA_DIR manually to the folder containing train.csv and test.csv.")

TRAIN_PATH = os.path.join(DATA_DIR, "train.csv")
TEST_PATH  = os.path.join(DATA_DIR, "test.csv")

print("Using DATA_DIR:", DATA_DIR)


Using DATA_DIR: .


In [280]:
# Only the imports this classification notebook actually uses
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

In [281]:
#Load data
train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)

print("train shape:", train.shape)
print("test shape :", test.shape)
display(train.head())
display(test.head())

train shape: (3750, 15)
test shape : (1250, 14)


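| | id | sex | age_years | height_cm | weight_kg | bmi | waist_cm | sbp_mmHg | dbp_mmHg | resting_hr_bpm | exercise_hours_per_week | smoker | alcohol_units_per_week | sleep_hours_per_night | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | female | 68 | 168.1 | 83.2 | 29.4 | 110.8 | 127 | 82 | 68 | 4.85 | 0 | 0 | 7.0 | High |
| 1 | 1 | male | 57 | 182.4 | 97.5 | 29.3 | 123.1 | 131 | 76 | 77 | 4.01 | 0 | 0 | 7.3 | High |
| 2 | 2 | male | 24 | 168.3 | 75.3 | 26.6 | 95.1 | 98 | 74 | 68 | 2.27 | 0 | 0 | 7.5 | Low |
| 3 | 3 | male | 49 | 178.2 | 95.4 | 30.0 | 115.0 | 122 | 77 | 66 | 1.78 | 0 | 3 | 7.6 | High |
| 4 | 4 | female | 65 | 162.0 | 60.2 | 22.9 | 79.6 | 137 | 80 | 67 | 1.55 | 0 | 2 | 5.7 | High |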


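| | id | sex | age_years | height_cm | weight_kg | bmi | waist_cm | sbp_mmHg | dbp_mmHg | resting_hr_bpm | exercise_hours_per_week | smoker | alcohol_units_per_week | sleep_hours_per_night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | female | 75 | 166.5 | 55.3 | 19.9 | 71.7 | 134 | 79 | 82 | 1.17 | 0 | 7 | 5.4 |
| 1 | 21 | female | 70 | 150.3 | 47.5 | 21.0 | 62.1 | 129 | 69 | 87 | 1.03 | 1 | 5 | 7.9 |
| 2 | 31 | female | 25 | 177.0 | 63.5 | 20.3 | 74.8 | 101 | 56 | 57 | 4.2 | 0 | 0 | 7.5 |
| 3 | 32 | female | 38 | 163.7 | 66.2 | 24.7 | 91.4 | 118 | 66 | 62 | 3.24 | 0 | 2 | 9.0 |
| 4 | 42 | female | 72 | 173.4 | 69.5 | 23.1 | 78.1 | 143 | 78 | 67 | 2.91 | 0 | 3 | 5.2 |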


In [282]:
#Quick EDA
print("Columns:", list(train.columns))
print("\nLabel distribution (counts):")
print(train["label"].value_counts())

print("\nLabel distribution (proportions):")
print(train["label"].value_counts(normalize=True).round(3))

print("\nNumerical summary:")
display(train.describe(include=[np.number]).T)


Columns: ['id', 'sex', 'age_years', 'height_cm', 'weight_kg', 'bmi', 'waist_cm', 'sbp_mmHg', 'dbp_mmHg', 'resting_hr_bpm', 'exercise_hours_per_week', 'smoker', 'alcohol_units_per_week', 'sleep_hours_per_night', 'label']

Label distribution (counts):
label
Elevated     1125
Low           938
High          937
Very High     750
Name: count, dtype: int64

Label distribution (proportions):
label
Elevated     0.30
Low          0.25
High         0.25
Very High    0.20
Name: proportion, dtype: float64

Numerical summary:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 3750.0 | 2475.5768 | 1442.293373 | 0.0 | 1233.5 | 2473.0 | 3711.75 | 4999.0 |
| age_years | 3750.0 | 46.230667 | 16.77246 | 18.0 | 32.0 | 46.0 | 61.0 | 75.0 |
| height_cm | 3750.0 | 168.159653 | 9.232249 | 140.0 | 161.4 | 167.5 | 174.9 | 198.8 |
| weight_kg | 3750.0 | 74.147147 | 16.233889 | 32.5 | 62.725 | 73.3 | 84.3 | 147.0 |
| bmi | 3750.0 | 26.145067 | 4.931151 | 15.0 | 22.7 | 26.0 | 29.5 | 45.0 |
| waist_cm | 3750.0 | 93.212987 | 18.997346 | 55.0 | 79.7 | 92.5 | 105.7 | 160.0 |
| sbp_mmHg | 3750.0 | 122.833067 | 13.757759 | 82.0 | 113.0 | 123.0 | 133.0 | 163.0 |
| dbp_mmHg | 3750.0 | 76.284267 | 9.10976 | 44.0 | 70.0 | 76.0 | 82.0 | 106.0 |
| resting_hr_bpm | 3750.0 | 70.5192 | 9.018136 | 45.0 | 64.0 | 71.0 | 76.0 | 105.0 |
| exercise_hours_per_week | 3750.0 | 2.996171 | 2.136621 | 0.04 | 1.43 | 2.5 | 4.05 | 14.01 |


In [283]:
train.isnull().sum()

id                         0
sex                        0
age_years                  0
height_cm                  0
weight_kg                  0
bmi                        0
waist_cm                   0
sbp_mmHg                   0
dbp_mmHg                   0
resting_hr_bpm             0
exercise_hours_per_week    0
smoker                     0
alcohol_units_per_week     0
sleep_hours_per_night      0
label                      0
dtype: int64

In [284]:
train.columns

Index(['id', 'sex', 'age_years', 'height_cm', 'weight_kg', 'bmi', 'waist_cm',
       'sbp_mmHg', 'dbp_mmHg', 'resting_hr_bpm', 'exercise_hours_per_week',
       'smoker', 'alcohol_units_per_week', 'sleep_hours_per_night', 'label'],
      dtype='object')

In [285]:
from sklearn.compose import make_column_selector as selector
cat_selector = selector(dtype_include=object)
num_selector = selector(dtype_exclude=object)
cat_cols = cat_selector(train)
num_cols = num_selector(train)
print("Categorical:", cat_cols)
print("Numerical:", num_cols)

Categorical: ['sex', 'label']
Numerical: ['id', 'age_years', 'height_cm', 'weight_kg', 'bmi', 'waist_cm', 'sbp_mmHg', 'dbp_mmHg', 'resting_hr_bpm', 'exercise_hours_per_week', 'smoker', 'alcohol_units_per_week', 'sleep_hours_per_night']


In [286]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

In [287]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline ,Pipeline
from sklearn.preprocessing import StandardScaler

In [288]:
idn= train.pop('id')


In [289]:
# LabelEncoder sorts classes alphabetically: female -> 0, male -> 1
le = LabelEncoder()
train['sex'] = le.fit_transform(train['sex'])

In [290]:
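# With default settings OrdinalEncoder sorts categories alphabetically:
# Elevated -> 0, High -> 1, Low -> 2, Very High -> 3 (not the risk order)
oe = OrdinalEncoder()
train['label'] = oe.fit_transform(train[['label']]).ravel()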
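
Note that `OrdinalEncoder` with default settings sorts categories alphabetically, so the integer codes follow `Elevated`, `High`, `Low`, `Very High` rather than the risk order. The sketch below, on a made-up frame, shows how to inspect the learned order and how an explicit `categories` list could be passed if risk-ordered codes are wanted. The variable names are illustrative and not used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

demo = pd.DataFrame({"label": ["Low", "Very High", "Elevated", "High", "Low"]})

# Default: categories are sorted alphabetically
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(demo[["label"]]).ravel()
print(default_enc.categories_[0])   # learned category order
print(default_codes)

# Explicit risk order
risk_order = ["Low", "Elevated", "High", "Very High"]
ordered_enc = OrdinalEncoder(categories=[risk_order])
ordered_codes = ordered_enc.fit_transform(demo[["label"]]).ravel()
print(ordered_codes)

# inverse_transform recovers the original strings from the codes
restored = ordered_enc.inverse_transform(ordered_codes.reshape(-1, 1)).ravel()
print(restored)
```

Whichever order is used, decoding predictions through the fitted encoder (`categories_` or `inverse_transform`) avoids a hand-written mapping that can drift out of sync with the codes.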

In [291]:
y = train.pop('label')

In [292]:
# Continuous columns to standardize (binary `sex` and `smoker` stay as 0/1)
scale_cols = ['age_years', 'height_cm', 'weight_kg', 'bmi', 'waist_cm',
              'sbp_mmHg', 'dbp_mmHg', 'resting_hr_bpm', 'exercise_hours_per_week',
              'alcohol_units_per_week', 'sleep_hours_per_night']

sscalar = StandardScaler()
train[scale_cols] = sscalar.fit_transform(train[scale_cols])
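
Fitting the scaler on every training row before splitting lets the validation rows influence the scaling statistics. A common alternative, sketched here on a single made-up column, is to fit on the training part only and reuse the fitted mean and standard deviation for held-out rows. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=120, scale=14, size=(200, 1))   # made-up blood-pressure-like column

X_tr, X_val = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)      # statistics come from the training part only
X_tr_s = scaler.transform(X_tr)
X_val_s = scaler.transform(X_val)        # same mean/std reused, no refit

print("train mean/std:", X_tr_s.mean().round(3), X_tr_s.std().round(3))
print("val   mean/std:", X_val_s.mean().round(3), X_val_s.std().round(3))
```

The validation column will generally not have mean exactly 0 or standard deviation exactly 1 after scaling, which is expected.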



In [293]:
feature_cols = ['sex', 'age_years', 'height_cm', 'weight_kg', 'bmi', 'waist_cm',
                'sbp_mmHg', 'dbp_mmHg', 'resting_hr_bpm', 'exercise_hours_per_week',
                'smoker', 'alcohol_units_per_week', 'sleep_hours_per_night']
prep_train = train[feature_cols].copy()

In [294]:
from sklearn.model_selection import KFold , cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

In [295]:
X_train, X_test, y_train, y_test = train_test_split(
    prep_train, y, test_size=0.2, random_state=42
)


In [296]:
!pip install xgboost



In [297]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

SEED = 42  # fixed seed for the stochastic models (Rules: reproducibility)

models = {
    'Logistic Regression': make_pipeline(LogisticRegression(max_iter=1000)),
    'Linear SVM': make_pipeline(SVC(kernel='linear')),
    'Decision Tree': make_pipeline(DecisionTreeClassifier(random_state=SEED)),
    'Logistic Regression (Lasso)': LogisticRegression(penalty='l1', solver='liblinear', random_state=SEED),
    'Logistic Regression (Ridge C=0.1)': LogisticRegression(penalty='l2', C=0.1),
    'Logistic Regression (Ridge C=0.01)': LogisticRegression(penalty='l2', C=0.01),
    'Logistic Regression (Ridge C=10.0)': LogisticRegression(penalty='l2', C=10.0),
    'Random Forest': RandomForestClassifier(random_state=SEED),
    'AdaBoost': AdaBoostClassifier(random_state=SEED),
    'XGBoost': XGBClassifier(random_state=SEED),
}

In [298]:
cv = KFold(n_splits = 5 , shuffle= True , random_state =42)
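
Plain `KFold` does not look at the labels, so class shares can vary from fold to fold. As a hedged comparison, the sketch below builds made-up labels with the same 30/25/25/20 class mix as the training set and measures the share of one class in each validation fold under `KFold` and `StratifiedKFold`. The helper `fold_shares` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up labels with a 30/25/25/20 class mix
y = np.repeat([0, 1, 2, 3], [300, 250, 250, 200])
X = np.zeros((len(y), 1))   # features are irrelevant for the split itself

def fold_shares(splitter):
    """Return the share of class 3 in each validation fold."""
    return [float(np.mean(y[val_idx] == 3)) for _, val_idx in splitter.split(X, y)]

plain = fold_shares(KFold(n_splits=5, shuffle=True, random_state=42))
strat = fold_shares(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold          :", np.round(plain, 3))
print("StratifiedKFold:", np.round(strat, 3))
```

With stratification every fold holds the smallest class at its overall share of 20%, which keeps per-fold macro metrics comparable.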

In [299]:
scorers = {
    'accuracy': 'accuracy',
    'precision_macro': 'precision_macro',
    'recall_macro': 'recall_macro',
    'f1_macro': 'f1_macro'
}

results = []

for name, model in models.items():
    cv_results = cross_validate(model, X_train, y_train, cv=cv, scoring=scorers)
    
    avg_accuracy = np.mean(cv_results['test_accuracy'])
    avg_precision_macro = np.mean(cv_results['test_precision_macro'])
    avg_recall_macro = np.mean(cv_results['test_recall_macro'])
    avg_f1_macro = np.mean(cv_results['test_f1_macro'])
    
    # Append to results (focus on CV averages)
    result_dict = {
        'Model': name,
        'CV Accuracy (Mean)': avg_accuracy,
        'CV Precision (Macro, Mean)': avg_precision_macro,
        'CV Recall (Macro, Mean)': avg_recall_macro,
        'CV F1 Score (Macro, Mean)': avg_f1_macro
    }
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    result_dict['Test Classification Report'] = classification_report(y_test, y_pred)
    result_dict['Test Confusion Matrix'] = confusion_matrix(y_test, y_pred)
    
    results.append(result_dict)

print(results)



[{'Model': 'Logistic Regression', 'CV Accuracy (Mean)': 0.8503333333333334, 'CV Precision (Macro, Mean)': 0.8536923284990656, 'CV Recall (Macro, Mean)': 0.8518220358020651, 'CV F1 Score (Macro, Mean)': 0.8524728654434606, 'Test Classification Report': '              precision    recall  f1-score   support\n\n         0.0       0.81      0.85      0.83       222\n         1.0       0.81      0.79      0.80       190\n         2.0       0.94      0.90      0.92       192\n         3.0       0.91      0.91      0.91       146\n\n    accuracy                           0.86       750\n   macro avg       0.87      0.86      0.86       750\nweighted avg       0.86      0.86      0.86       750\n', 'Test Confusion Matrix': array([[188,  22,  12,   0],
       [ 26, 151,   0,  13],
       [ 19,   0, 173,   0],
       [  0,  13,   0, 133]], dtype=int64)}, {'Model': 'Linear SVM', 'CV Accuracy (Mean)': 0.8506666666666666, 'CV Precision (Macro, Mean)': 0.8533474076329408, 'CV Recall (Macro, Mean)': 

In [300]:
model = models['Logistic Regression (Ridge C=10.0)']
y_new_pred = model.predict(X_test)

In [301]:
from sklearn.metrics import accuracy_score

new_accuracy = accuracy_score(y_test, y_new_pred)
new_report = classification_report(y_test, y_new_pred)
new_conf_matrix = confusion_matrix(y_test, y_new_pred)

# Print results
print(f"New Test Accuracy: {new_accuracy}")
print("New Test Classification Report:\n", new_report)
print("New Test Confusion Matrix:\n", new_conf_matrix)

New Test Accuracy: 0.8613333333333333
New Test Classification Report:
               precision    recall  f1-score   support

         0.0       0.81      0.85      0.83       222
         1.0       0.82      0.79      0.80       190
         2.0       0.94      0.90      0.92       192
         3.0       0.91      0.92      0.91       146

    accuracy                           0.86       750
   macro avg       0.87      0.86      0.87       750
weighted avg       0.86      0.86      0.86       750

New Test Confusion Matrix:
 [[189  22  11   0]
 [ 26 150   0  14]
 [ 19   0 173   0]
 [  0  12   0 134]]


In [302]:
idnf = test.pop('id')

# Reuse the encoder and scaler fitted on the training data; do not refit on test
test['sex'] = le.transform(test['sex'])

scale_cols = ['age_years', 'height_cm', 'weight_kg', 'bmi', 'waist_cm',
              'sbp_mmHg', 'dbp_mmHg', 'resting_hr_bpm', 'exercise_hours_per_week',
              'alcohol_units_per_week', 'sleep_hours_per_night']
test[scale_cols] = sscalar.transform(test[scale_cols])

# Same feature columns, in the same order, as used for training
prep_test = test[['sex', 'age_years', 'height_cm', 'weight_kg', 'bmi', 'waist_cm',
                  'sbp_mmHg', 'dbp_mmHg', 'resting_hr_bpm', 'exercise_hours_per_week',
                  'smoker', 'alcohol_units_per_week', 'sleep_hours_per_night']]

y_exam = model.predict(prep_test)

In [306]:
# The fitted OrdinalEncoder sorted the classes alphabetically
# (0=Elevated, 1=High, 2=Low, 3=Very High), so read the mapping from it
mapping = dict(enumerate(oe.categories_[0]))
label_series = pd.Series(y_exam).map(mapping).astype('category')
label_pd = pd.DataFrame({'label': label_series})
finally_pd = pd.DataFrame({'id': idnf})


In [332]:
print(label_pd)
print(finally_pd)


          label
0     Very High
1          High
2           Low
3           Low
4          High
...         ...
1245       High
1246        Low
1247       High
1248  Very High
1249  Very High

[1250 rows x 1 columns]
        id
0       18
1       21
2       31
3       32
4       42
...    ...
1245  4966
1246  4971
1247  4978
1248  4989
1249  4992

[1250 rows x 1 columns]


In [340]:
pd_tamam = pd.concat([finally_pd, label_pd], axis=1)

In [342]:
print(pd_tamam)

        id      label
0       18  Very High
1       21       High
2       31        Low
3       32        Low
4       42       High
...    ...        ...
1245  4966       High
1246  4971        Low
1247  4978       High
1248  4989  Very High
1249  4992  Very High

[1250 rows x 2 columns]


In [344]:
# index=False keeps the file to exactly the two required columns: id,label
pd_tamam.to_csv('output_file.csv', index=False)