Load the (same) athletes dataset.

With your dataset use AutoML to find the best model (if it is no-code, submit screenshots).

Report any data insights from AutoML run.

What are the reported top 5 features?

What are the top 3 models per validation score when using

all features

only the top features (if you have to choose a number put in 3)

What are the top 3 models per speed when using

all features

only the top features (if you have to choose a number put in 3)

How do the top models compare to your previously developed model (assignments 1 and 2) in terms of validation score and speed?

Is your platform AutoML no-code/low-code/full-code and why?

Repeat 2 - 5 using H2O AutoML (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html). Submit code.

### This notebook uses FLAML

I used ChatGPT 4o to help me understand key concepts and debug code.

### Load the (same) athletes dataset.

In [1]:
!pip install flaml

Collecting flaml
  Downloading FLAML-2.3.5-py3-none-any.whl.metadata (16 kB)
Downloading FLAML-2.3.5-py3-none-any.whl (322 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 322.2/322.2 kB 2.9 MB/s eta 0:00:00
Installing collected packages: flaml
Successfully installed flaml-2.3.5


In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Check current working directory
import os
os.getcwd()

'/content'

In [None]:
!find /content/drive/MyDrive/ -name "athletes_cleaned.csv"

/content/drive/MyDrive/ML_OPS_Assignment3/athletes_cleaned.csv


In [3]:
import pandas as pd

athletes_df = pd.read_csv("/content/drive/MyDrive/ML_OPS_Assignment3/athletes_cleaned.csv")
athletes_df.head()

Unnamed: 0,athlete_id,name,region,affiliate,gender,age,height,weight,eat,train,background,howlong,total_lift
0,21269.0,Erik Acevedo,Southern California,CrossFit Training Yard,Male,30.0,71.0,200.0,I eat whatever is convenient|,I workout mostly at a CrossFit Affiliate|I inc...,College sports,1-2 years|,1110.0
1,21685.0,Richard Ablett,Africa,Cape CrossFit,Male,28.0,70.0,176.0,I eat 1-3 full cheat meals per week|,I workout mostly at a CrossFit Affiliate|,No athletic background,2-4 years|,910.0
2,25464.0,Joe Abruzzo,North East,CrossFit Rapture,Male,35.0,68.0,225.0,I eat quality foods but don't measure the amount|,I workout mostly at a CrossFit Affiliate|I rec...,Other,2-4 years|,1335.0
3,43767.0,Brigham Abbott,North Central,River North CrossFit,Male,36.0,71.0,199.0,I eat quality foods but don't measure the amount|,I workout mostly at a CrossFit Affiliate|I hav...,College sports,1-2 years|,1354.0
4,55504.0,Jason Ackerman,North East,CrossFit Soulshine,Male,36.0,64.0,155.0,I eat strict Paleo|,"I workout mostly at home, work, or a tradition...",College sports,4+ years|,1225.0


In [5]:
athletes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30015 entries, 0 to 30014
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   athlete_id  30015 non-null  float64
 1   name        30015 non-null  object 
 2   region      30015 non-null  object 
 3   affiliate   29062 non-null  object 
 4   gender      30015 non-null  object 
 5   age         30015 non-null  float64
 6   height      30015 non-null  float64
 7   weight      30015 non-null  float64
 8   eat         30015 non-null  object 
 9   train       30015 non-null  object 
 10  background  30015 non-null  object 
 11  howlong     30015 non-null  object 
 12  total_lift  30015 non-null  float64
dtypes: float64(5), object(8)
memory usage: 3.0+ MB


In [4]:
# Encode categoricals
for col in athletes_df.select_dtypes(include="object").columns:
    athletes_df[col] = athletes_df[col].astype("category")

In [5]:
athletes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30015 entries, 0 to 30014
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   athlete_id  30015 non-null  float64 
 1   name        30015 non-null  category
 2   region      30015 non-null  category
 3   affiliate   29062 non-null  category
 4   gender      30015 non-null  category
 5   age         30015 non-null  float64 
 6   height      30015 non-null  float64 
 7   weight      30015 non-null  float64 
 8   eat         30015 non-null  category
 9   train       30015 non-null  category
 10  background  30015 non-null  category
 11  howlong     30015 non-null  category
 12  total_lift  30015 non-null  float64 
dtypes: category(8), float64(5)
memory usage: 3.0 MB


### With your dataset use AutoML to find the best model

In [6]:
# Split Data
from sklearn.model_selection import train_test_split

X = athletes_df.drop(columns=["total_lift"])
y = athletes_df["total_lift"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
!pip install numpy==1.24.3 --force-reinstall

Collecting numpy==1.24.3
  Downloading numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Downloading numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 41.7 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albucore 0.0.24 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.24.3 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.24

In [7]:
!pip install flaml



In [8]:
from flaml import AutoML

AutoML(): creates the AutoML engine

time_budget=600: would cap the search at 10 minutes (not set in the run below, which limits the search with max_iter=40 instead)

metric="rmse": optimizes for Root Mean Squared Error

task="regression": predicts a numeric target (total_lift)

automl.fit(...): trains multiple models on X_train, selects the best one automatically.

FLAML trains each model (e.g., lgbm, xgboost, etc.) using all features from X_train

It then tunes hyperparameters (like learning rate, tree depth, etc.) to minimize my chosen metric (rmse)

FLAML will select the model that achieves the lowest RMSE on validation data during training.

FLAML estimates model performance with internal cross-validation (or a hold-out validation split) on the training set (X_train, y_train). The run below reports `Evaluation method: cv`. These scores are:

not computed on the final X_test test set,

but validation scores calculated during the AutoML search process.

That means:

The best model is selected based on the lowest RMSE on validation data, not test data.
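
To make the validation/test distinction concrete, here is a minimal sketch of how a cross-validated RMSE can be computed from training data alone. It is not FLAML's internal code: the `kfold_rmse` helper, the 5-fold split, and the mean-only "model" are assumptions chosen to keep the example dependency-free, and the toy targets are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def kfold_rmse(y, k=5):
    """Average validation RMSE of a mean-only predictor over k folds.

    Hypothetical helper: each fold is held out once, the "model" (the mean
    of the remaining targets) is fit on the other folds, and the held-out
    fold is scored. No separate test set is touched.
    """
    fold_scores = []
    for fold in range(k):
        val = [v for i, v in enumerate(y) if i % k == fold]
        train = [v for i, v in enumerate(y) if i % k != fold]
        prediction = sum(train) / len(train)
        fold_scores.append(rmse(val, [prediction] * len(val)))
    return sum(fold_scores) / k

# Toy targets standing in for y_train (made-up numbers, not from the dataset)
toy_targets = [900.0, 1110.0, 1225.0, 1335.0, 1354.0, 910.0, 1000.0, 1200.0, 1100.0, 1300.0]
cv_score = kfold_rmse(toy_targets, k=5)
print(f"Cross-validated RMSE: {cv_score:.2f}")
```

The score printed here plays the role of the "best error" values in the FLAML log; the held-out X_test is only used afterwards, once, to check the selected model.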

In [22]:
# Run FLAML AutoML

automl = AutoML()

automl_settings = {
    "metric": "rmse",                            # Optimization metric
    'estimator_list': ['lgbm', 'xgboost', 'rf'], # Limit AutoML to these 3 fast and strong regressors
    "task": "regression",                        # Task type
    "log_file_name": "flaml_automl.log",         # Where to store training logs
    "max_iter": 40                               # limits the total number of trials (i.e., models)
}

automl.fit(X_train=X_train, y_train=y_train, **automl_settings)

[flaml.automl.logger: 07-19 22:11:50] {1752} INFO - task = regression
[flaml.automl.logger: 07-19 22:11:50] {1763} INFO - Evaluation method: cv
[flaml.automl.logger: 07-19 22:11:50] {1862} INFO - Minimizing error metric: rmse
[flaml.automl.logger: 07-19 22:11:50] {1979} INFO - List of ML learners in AutoML Run: ['lgbm', 'xgboost', 'rf']
[flaml.automl.logger: 07-19 22:11:50] {2282} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 07-19 22:11:50] {2417} INFO - Estimated sufficient time budget=10000s. Estimated necessary time budget=10s.
[flaml.automl.logger: 07-19 22:11:50] {2466} INFO -  at 0.5s,	estimator lgbm's best error=226.1395,	best estimator lgbm's best error=226.1395
[flaml.automl.logger: 07-19 22:11:50] {2282} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 07-19 22:11:51] {2466} INFO -  at 1.0s,	estimator lgbm's best error=226.1395,	best estimator lgbm's best error=226.1395
[flaml.automl.logger: 07-19 22:11:51] {2282} INFO - iteration 2, current le

* **Best model from FLAML:** **LightGBM (lgbm)**
* **Best validation RMSE:** **137.89**
* **Time taken:** **~60.3 seconds** (elapsed search time when this configuration was found)
* **Final model config:** 288 estimators, 19 leaves, learning rate ≈ 0.052, etc.

In [None]:
# !pip install "flaml[notebook]"

### Report any data insights from AutoML run.

In [23]:
# Print the best model and its parameters
print("Best model type:", automl.best_estimator)
print("Best hyperparameters:", automl.best_config)

Best model type: lgbm
Best hyperparameters: {'n_estimators': 288, 'num_leaves': 19, 'min_child_samples': 9, 'learning_rate': 0.05196155644285028, 'log_max_bin': 5, 'colsample_bytree': 0.38626024040974977, 'reg_alpha': 0.0016471893565638506, 'reg_lambda': 0.002581620911069276}


In [24]:
from sklearn.metrics import mean_squared_error
import numpy as np

# Predict on test set
y_pred = automl.predict(X_test)

# Manually calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Test RMSE:", rmse)

Test RMSE: 139.17101323933082


### What are the reported top 5 features?

howlong, train, eat, background, age

In [25]:
# The most important 5 features reported by the best model found by FLAML
import pandas as pd

# Get best estimator
best_model = automl.model

# Create DataFrame of feature importances
feature_importances = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_model.feature_importances_
}).sort_values(by='importance', ascending=False)

# Display top 5 features
top5 = feature_importances.head(5)
print(top5)

       feature  importance
11     howlong        1154
9        train         715
8          eat         648
10  background         563
5          age         492
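
As a small illustration of how the table above can be turned into a feature subset, the sketch below ranks the reported importances and expresses each as a share of the listed total. The `top_k_features` and `shares` helpers are hypothetical names; the values are copied from the printed output, so the shares are relative to these five features only, not to all twelve input columns.

```python
# Importance values copied from the printed top-5 table above
top5_importance = {
    "howlong": 1154,
    "train": 715,
    "eat": 648,
    "background": 563,
    "age": 492,
}

def top_k_features(importance, k):
    """Return the k feature names with the largest importance values."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def shares(importance):
    """Express each importance as a fraction of the listed total."""
    total = sum(importance.values())
    return {name: value / total for name, value in importance.items()}

selected = top_k_features(top5_importance, k=3)
print(selected)
for name, share in shares(top5_importance).items():
    print(f"{name}: {share:.1%} of the top-5 total")
```

Selecting the names programmatically avoids hard-coding the list if the ranking changes between runs.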


### Top 3 models per validation score using all features

In [26]:
import pandas as pd
import json

# Read log lines
with open("flaml_automl.log", "r") as file:
    log_lines = file.readlines()

# Parse one JSON record per log line, keeping only trial entries
data = []
for line in log_lines:
    try:
        record = json.loads(line)
        if "validation_loss" in record and "learner" in record:
            learner = record["learner"]
            val_loss = record["validation_loss"]
            time = record.get("wall_clock_time", None)
            config = record.get("config", {})
            data.append((learner, val_loss, time, config))
    except json.JSONDecodeError:
        continue

# Create DataFrame
df_trials = pd.DataFrame(data, columns=["learner", "val_loss", "wall_clock_time", "config"])

# Get top 3 by validation loss
top3 = df_trials.sort_values(by="val_loss").head(3)

# Print results
print("Top 3 models by validation RMSE with configs:")
for idx, row in top3.iterrows():
    print(f"\nLearner: {row['learner']}")
    print(f"Val Loss (RMSE): {row['val_loss']}")
    print(f"Wall Clock Time: {row['wall_clock_time']}")
    print("Config:")
    for k, v in row['config'].items():
        print(f"  {k}: {v}")

Top 3 models by validation RMSE with configs:

Learner: lgbm
Val Loss (RMSE): 137.88590673968528
Wall Clock Time: 60.33635425567627
Config:
  n_estimators: 288
  num_leaves: 19
  min_child_samples: 9
  learning_rate: 0.05196155644285028
  log_max_bin: 5
  colsample_bytree: 0.38626024040974977
  reg_alpha: 0.0016471893565638506
  reg_lambda: 0.002581620911069276

Learner: lgbm
Val Loss (RMSE): 137.98097414912297
Wall Clock Time: 49.810667514801025
Config:
  n_estimators: 561
  num_leaves: 9
  min_child_samples: 4
  learning_rate: 0.06419203531794242
  log_max_bin: 5
  colsample_bytree: 0.43644825283873734
  reg_alpha: 0.007721515950790995
  reg_lambda: 0.005535643981633633

Learner: lgbm
Val Loss (RMSE): 138.60969463147686
Wall Clock Time: 42.2374746799469
Config:
  n_estimators: 273
  num_leaves: 4
  min_child_samples: 5
  learning_rate: 0.19785013601703466
  log_max_bin: 6
  colsample_bytree: 0.5834987745562512
  reg_alpha: 0.012306726057063571
  reg_lambda: 0.0036001076881759493


### Top 3 models per validation score using only the top features


In [27]:
# Create a reduced dataset using only these top 3 features
top3_features = ['howlong', 'train', 'eat']
X_train_top3 = X_train[top3_features]
X_test_top3 = X_test[top3_features]

In [28]:
# Run FLAML AutoML

automl = AutoML()

automl_settings = {
    "metric": "rmse",                            # Optimization metric
    'estimator_list': ['lgbm', 'xgboost', 'rf'], # Limit AutoML to these 3 fast and strong regressors
    "task": "regression",                        # Task type
    "log_file_name": "flaml_automl_top3.log",    # Where to store training logs
    "max_iter": 40                               # limits the total number of trials (i.e., models)
}

automl.fit(X_train=X_train_top3, y_train=y_train, **automl_settings)

[flaml.automl.logger: 07-19 22:13:40] {1752} INFO - task = regression
[flaml.automl.logger: 07-19 22:13:40] {1763} INFO - Evaluation method: cv
[flaml.automl.logger: 07-19 22:13:40] {1862} INFO - Minimizing error metric: rmse
[flaml.automl.logger: 07-19 22:13:40] {1979} INFO - List of ML learners in AutoML Run: ['lgbm', 'xgboost', 'rf']
[flaml.automl.logger: 07-19 22:13:40] {2282} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 07-19 22:13:40] {2417} INFO - Estimated sufficient time budget=10000s. Estimated necessary time budget=10s.
[flaml.automl.logger: 07-19 22:13:40] {2466} INFO -  at 0.1s,	estimator lgbm's best error=268.1706,	best estimator lgbm's best error=268.1706
[flaml.automl.logger: 07-19 22:13:40] {2282} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 07-19 22:13:40] {2466} INFO -  at 0.3s,	estimator lgbm's best error=268.1706,	best estimator lgbm's best error=268.1706
[flaml.automl.logger: 07-19 22:13:40] {2282} INFO - iteration 2, current le

In [29]:
import pandas as pd
import json

# Read log lines
with open("flaml_automl_top3.log", "r") as file:
    log_lines = file.readlines()

# Parse one JSON record per log line, keeping only trial entries
data = []
for line in log_lines:
    try:
        record = json.loads(line)
        if "validation_loss" in record and "learner" in record:
            learner = record["learner"]
            val_loss = record["validation_loss"]
            time = record.get("wall_clock_time", None)
            config = record.get("config", {})
            data.append((learner, val_loss, time, config))
    except json.JSONDecodeError:
        continue

# Create DataFrame
df_trials = pd.DataFrame(data, columns=["learner", "val_loss", "wall_clock_time", "config"])

# Get top 3 by validation loss
top3 = df_trials.sort_values(by="val_loss").head(3)

# Print results
print("Top 3 models by validation RMSE with config using only the top 3 features:")
for idx, row in top3.iterrows():
    print(f"\nLearner: {row['learner']}")
    print(f"Val Loss (RMSE): {row['val_loss']}")
    print(f"Wall Clock Time: {row['wall_clock_time']}")
    print("Config:")
    for k, v in row['config'].items():
        print(f"  {k}: {v}")

Top 3 models by validation RMSE with config using only the top 3 features:

Learner: lgbm
Val Loss (RMSE): 256.99020140483015
Wall Clock Time: 2.866394281387329
Config:
  n_estimators: 47
  num_leaves: 4
  min_child_samples: 4
  learning_rate: 0.41929025492645006
  log_max_bin: 8
  colsample_bytree: 0.7610534336273627
  reg_alpha: 0.0009765625
  reg_lambda: 0.009280655005879927

Learner: lgbm
Val Loss (RMSE): 257.0689173821486
Wall Clock Time: 1.0220036506652832
Config:
  n_estimators: 13
  num_leaves: 5
  min_child_samples: 5
  learning_rate: 0.7590459488450945
  log_max_bin: 8
  colsample_bytree: 0.8304072431299575
  reg_alpha: 0.001951378031519758
  reg_lambda: 0.04792552866398477

Learner: lgbm
Val Loss (RMSE): 257.1304090799774
Wall Clock Time: 0.7351562976837158
Config:
  n_estimators: 11
  num_leaves: 4
  min_child_samples: 9
  learning_rate: 0.7260594590615893
  log_max_bin: 9
  colsample_bytree: 0.9285002286474459
  reg_alpha: 0.0036840681931986645
  reg_lambda: 0.753248050573

### Top 3 models per speed using all features

In [30]:
import pandas as pd
import json

# Read log lines
with open("flaml_automl.log", "r") as file:
    log_lines = file.readlines()

# Parse model entries
data = []
for line in log_lines:
    try:
        record = json.loads(line)
        if "validation_loss" in record and "learner" in record:
            learner = record["learner"]
            val_loss = record["validation_loss"]
            time = record.get("wall_clock_time", None)
            config = record.get("config", {})
            if time is not None:  # Filter only valid entries
                data.append((learner, val_loss, time, config))
    except json.JSONDecodeError:
        continue

# Create DataFrame
df_trials = pd.DataFrame(data, columns=["learner", "val_loss", "wall_clock_time", "config"])

# Get top 3 by lowest wall_clock_time (elapsed time since the search started, so this favors the earliest trials)
top3_speed = df_trials.sort_values(by="wall_clock_time").head(3)

# Print results
print("Top 3 models by speed (wall clock time):")
for idx, row in top3_speed.iterrows():
    print(f"\nLearner: {row['learner']}")
    print(f"Val Loss (RMSE): {row['val_loss']}")
    print(f"Wall Clock Time: {row['wall_clock_time']}")
    print("Config:")
    for k, v in row['config'].items():
        print(f"  {k}: {v}")

Top 3 models by speed (wall clock time):

Learner: lgbm
Val Loss (RMSE): 226.13949803386222
Wall Clock Time: 0.5328385829925537
Config:
  n_estimators: 4
  num_leaves: 4
  min_child_samples: 20
  learning_rate: 0.09999999999999995
  log_max_bin: 8
  colsample_bytree: 1.0
  reg_alpha: 0.0009765625
  reg_lambda: 1.0

Learner: lgbm
Val Loss (RMSE): 184.07472472034402
Wall Clock Time: 1.4764113426208496
Config:
  n_estimators: 4
  num_leaves: 4
  min_child_samples: 12
  learning_rate: 0.26770501231052046
  log_max_bin: 7
  colsample_bytree: 1.0
  reg_alpha: 0.001348364934537134
  reg_lambda: 1.4442580148221913

Learner: lgbm
Val Loss (RMSE): 148.45575068018815
Wall Clock Time: 2.2854721546173096
Config:
  n_estimators: 11
  num_leaves: 4
  min_child_samples: 9
  learning_rate: 0.7260594590615893
  log_max_bin: 9
  colsample_bytree: 0.9285002286474459
  reg_alpha: 0.0036840681931986645
  reg_lambda: 0.7532480505730402


### Top 3 models per speed using only the top features

In [31]:
import pandas as pd
import json

# Read log lines
with open("flaml_automl_top3.log", "r") as file:
    log_lines = file.readlines()

# Parse model entries
data = []
for line in log_lines:
    try:
        record = json.loads(line)
        if "validation_loss" in record and "learner" in record:
            learner = record["learner"]
            val_loss = record["validation_loss"]
            time = record.get("wall_clock_time", None)
            config = record.get("config", {})
            if time is not None:
                data.append((learner, val_loss, time, config))
    except json.JSONDecodeError:
        continue

# Create DataFrame
df_trials = pd.DataFrame(data, columns=["learner", "val_loss", "wall_clock_time", "config"])

# Get top 3 by lowest wall_clock_time (elapsed time since the search started, so this favors the earliest trials)
top3 = df_trials.sort_values(by="wall_clock_time").head(3)

# Print results
print("Top 3 models by speed using only the top 3 features:")
for idx, row in top3.iterrows():
    print(f"\nLearner: {row['learner']}")
    print(f"Val Loss (RMSE): {row['val_loss']}")
    print(f"Wall Clock Time: {row['wall_clock_time']}")
    print("Config:")
    for k, v in row['config'].items():
        print(f"  {k}: {v}")

Top 3 models by speed using only the top 3 features:

Learner: lgbm
Val Loss (RMSE): 268.1706210848114
Wall Clock Time: 0.14544463157653809
Config:
  n_estimators: 4
  num_leaves: 4
  min_child_samples: 20
  learning_rate: 0.09999999999999995
  log_max_bin: 8
  colsample_bytree: 1.0
  reg_alpha: 0.0009765625
  reg_lambda: 1.0

Learner: lgbm
Val Loss (RMSE): 261.53183392834137
Wall Clock Time: 0.4026212692260742
Config:
  n_estimators: 4
  num_leaves: 4
  min_child_samples: 12
  learning_rate: 0.26770501231052046
  log_max_bin: 7
  colsample_bytree: 1.0
  reg_alpha: 0.001348364934537134
  reg_lambda: 1.4442580148221913

Learner: lgbm
Val Loss (RMSE): 257.1304090799774
Wall Clock Time: 0.7351562976837158
Config:
  n_estimators: 11
  num_leaves: 4
  min_child_samples: 9
  learning_rate: 0.7260594590615893
  log_max_bin: 9
  colsample_bytree: 0.9285002286474459
  reg_alpha: 0.0036840681931986645
  reg_lambda: 0.7532480505730402
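
Both speed rankings above sort on `wall_clock_time`, which in these logs matches the elapsed time printed in the console output (`at 0.5s`, `at 1.0s`, ...), so they effectively list the earliest trials of each search. A possible alternative is to rank by the cost of each individual trial. The sketch below assumes each log record also carries a `trial_time` field; that is an assumption to verify against your own log file. The `load_trials` and `fastest` helpers are hypothetical, and the sample records are made up for illustration.

```python
import json

def load_trials(lines):
    """Parse JSON log lines and keep records that describe a trial."""
    trials = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and "validation_loss" in record and "learner" in record:
            trials.append(record)
    return trials

def fastest(trials, k=3, key="trial_time"):
    """Return the k trials with the smallest value of `key`, skipping records that lack it."""
    timed = [t for t in trials if t.get(key) is not None]
    return sorted(timed, key=lambda t: t[key])[:k]

# Made-up records for illustration only (not taken from flaml_automl.log)
sample_log = [
    json.dumps({"learner": "lgbm", "validation_loss": 226.1, "trial_time": 0.40, "wall_clock_time": 0.5}),
    json.dumps({"learner": "xgboost", "validation_loss": 190.0, "trial_time": 0.25, "wall_clock_time": 1.1}),
    json.dumps({"learner": "rf", "validation_loss": 170.0, "trial_time": 0.90, "wall_clock_time": 2.3}),
    json.dumps({"learner": "lgbm", "validation_loss": 150.0, "trial_time": 0.10, "wall_clock_time": 3.0}),
    "not a json line",
]

for trial in fastest(load_trials(sample_log), k=3):
    print(trial["learner"], trial["trial_time"], trial["validation_loss"])
```

In the made-up data, the cheapest trial is the last one run, so the two sort keys give different rankings. The same `load_trials` helper could also replace the parsing loop that is repeated in the four ranking cells.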


### How do the top models compare to your previously developed model (assignments 1 and 2) in terms of validation score and speed?

#### **AutoML (FLAML) Top Model (lowest RMSE)**

* **Learner:** `lgbm`
* **Validation RMSE:** **137.89**
* **Wall-clock time when found:** **60.34 sec** (elapsed search time when this configuration was logged, not the fit time of this single model)

#### **Assignment 2 (Feature Store + MLflow)**

* **Model:** `rf-model` (Random Forest)
* **Validation RMSE:** **178.85**
* **R²:** **0.589**
* **Training Time:** **5 sec**

#### **Assignment 1 (Data Versioning)**

* No RMSE was reported; R² was used instead.


---

### **Comparison Summary**

| Model Source     | RMSE                   | Speed         |
| ---------------- | ---------------------- | ------------- |
| **FLAML AutoML** | **137.89**             | 60.34 sec     |
| **Assignment 2** | 178.85                 | **5 sec**     |
| **Assignment 1** | Not reported (R² only) | Not measured  |

---

### **Conclusion**

* **FLAML AutoML** gives the **best RMSE (137.89)**, the lowest validation error of the models compared.
* **Assignment 2** trains much **faster (5 sec)** but with a **higher RMSE (178.85)**.
* **Assignment 1** reported R² only, and its training time is not available.
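
The size of these differences can be worked out directly from the numbers reported in this notebook. The sketch below is only arithmetic on those figures; the `relative_reduction` helper is a hypothetical name. Two caveats are assumptions worth checking: the FLAML time is elapsed search time while the Assignment 2 time is a single training run, and the two RMSE values may not come from identical validation splits.

```python
# Figures reported in this notebook
flaml_val_rmse = 137.89    # FLAML best validation RMSE
flaml_test_rmse = 139.17   # FLAML test RMSE (rounded)
a2_rmse = 178.85           # Assignment 2 random forest RMSE
flaml_elapsed_s = 60.34    # elapsed search time when the best model was logged
a2_train_s = 5.0           # Assignment 2 training time

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

rmse_gain = relative_reduction(a2_rmse, flaml_val_rmse)
time_ratio = flaml_elapsed_s / a2_train_s
val_test_gap = flaml_test_rmse - flaml_val_rmse

print(f"RMSE reduction vs Assignment 2: {rmse_gain:.1%}")
print(f"Time ratio (FLAML / Assignment 2): {time_ratio:.1f}x")
print(f"Test minus validation RMSE: {val_test_gap:.2f}")
```

A small gap between test and validation RMSE suggests the cross-validated score was a reasonable estimate of held-out performance for this run.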



### Is your platform AutoML no-code/low-code/full-code and why?

Low-code, because I used minimal code to run FLAML AutoML, which automatically selected, tuned, and evaluated the models; I did not have to code each one manually.