# W7: Automated Machine Learning
- Contributors: Dr. Zhonghua Zheng, Yuan Sun
- Course Unit: Earth and Environmental Data Science (EART60702)
- Last modified date: 9 March, 2025

## Intended Learning Outcomes (ILOs)
- Learn the workflow of AutoML
- Train and test an ML model, and check the training logs
- Learn to set up a time budget for AutoML
- Learn the differences between resampling strategies

## 0. Setup

Please download the data here: https://www.dropbox.com/scl/fi/azzx0olpeyx45rixlsgdn/project_1.csv?rlkey=b4fj8cnmc4ytyezppfbhpky3t&dl=0

### 0.1 Please use bash commands to launch JupyterLab
```bash
# check if conda works on your local PC
conda --version
# load the environment that you created last week
conda activate myenv
# launch JupyterLab
jupyter lab
```

### 0.2 Please load the necessary Python packages

Install the packages (run this in a terminal with your environment activated):
```
conda install -c conda-forge xgboost=1.6.2 flaml=1.2.4 scikit-learn=1.0.2
```

In [None]:
!conda --version

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from flaml import AutoML
import sklearn
from sklearn.ensemble import RandomForestRegressor as RF
import pickle
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from flaml.automl.data import get_output_from_log

import flaml  # the version string lives on the package, not on the AutoML class
print("FLAML version: {}".format(flaml.__version__))
print("Numpy version: {}".format(np.__version__))
print("Pandas version: {}".format(pd.__version__))
print("SKLearn version: {}".format(sklearn.__version__))
print("Seaborn version: {}".format(sns.__version__))

### **NOTE: If your local environment doesn't work, please run the code below to install necessary packages in Google Colab: https://colab.research.google.com/**
```python
# https://saturncloud.io/blog/how-to-install-conda-package-to-google-colab/
!pip install -q condacolab
import condacolab
condacolab.install()

# check if condacolab works
!conda --version

# please install packages below in your condacolab
!pip install --upgrade xarray zarr gcsfs cftime nc-time-axis climetlab
```

## 1. Automated Machine Learning (40 mins)

[FLAML: A Fast Library for Automated Machine Learning & Tuning](https://microsoft.github.io/FLAML/)

### 1.1 Load the data

In [None]:
# The data catalogue is stored as a CSV file. Here we read it with pandas.

data_path = '~/Downloads/project_1.csv' # Change this to the path of the data file on your system

# Load the data
df = pd.read_csv(data_path, index_col=0, parse_dates=True).drop(columns=['lat', 'lon'])
df.head(2)

### 1.2 Exploratory data analysis

In [None]:
df.describe().T

### 1.3 Split data for training and testing

We will use the first 80% of the rows for training and the last 20% for testing.

In [None]:
train_num = int(0.8 * len(df))
train, test = df.iloc[:train_num], df.iloc[train_num:]
train
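
Slicing by position keeps every test row after every training row, which matters when the index holds dates and the rows are in chronological order (an assumption about this dataset). The sketch below illustrates the idea on a small synthetic frame; the column name `value` and the dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (hypothetical data, for illustration only)
toy = pd.DataFrame(
    {"value": np.arange(10, dtype=float)},
    index=pd.date_range("2000-01-01", periods=10, freq="D"),
)

# Positional 80/20 split: first 80% of rows for training, the rest for testing
n_train = int(0.8 * len(toy))
toy_train, toy_test = toy.iloc[:n_train], toy.iloc[n_train:]

# Every test date comes after the last training date
print(toy_train.index.max(), "<", toy_test.index.min())
print(len(toy_train), len(toy_test))
```

A random shuffle before splitting would mix past and future samples, which can make test scores look better than they would be on genuinely unseen periods.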

In [None]:
# =============Plotting the data================
df.plot(subplots=True, figsize=(10, 10))
plt.show()

In [None]:
# =============exploratory data analysis================
# =============training data================
display(train.describe().T)
train.info()  # info() prints its summary directly and returns None

# =============test data================
display(test.describe().T)
test.info()


### 1.4 Define the features and target variable

In [None]:
feature_ls = df.columns.tolist()
feature_ls.remove('TREFMXAV_U')
print('The features are:', feature_ls)

label = 'TREFMXAV_U'
print('The label is:', label)

### 1.5 Train AutoML model

In [None]:
# ====== train model ======
time_budget = 60  # total running time in seconds

# specify the estimator list
estimator_list = ['lgbm', 'rf', 'xgboost']

# create the AutoML object
automl = AutoML()

# specify the automl settings
automl_settings = {
    "time_budget": time_budget,  # in seconds
    "estimator_list":estimator_list, # estimators
    "metric": 'rmse',
    "task": 'regression',
    "log_file_name": "log.log"
}

# fit the model
automl.fit(train[feature_ls], train[label], **automl_settings) #verbose=-1 for silent
print(automl.model.estimator)

Question: 

- What are the RMSE, R2, and MAE of the model on the training data?

    ```python
    # evaluate the final model performance
    y_train = train[label]
    y_pred = automl.predict(train[feature_ls])
    print("training rmse:", )
    print("training r2:", )
    print("training mean_absolute_error:", )
    ```

- Can we use other metrics to train the model (in `automl_settings`)? If yes, which metrics can we use?
    - [reference](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    - `rmse`, `mse`, `r2`, `mape`
- Plot the residuals of the model on the training data.

    ```python
    residual = observation - predictions
    ```

In [None]:
# ====== rmse, r2, mae of training data ======

print('Training data')
y_pred_train = automl.predict(train[feature_ls])
rmse = mean_squared_error(train[label], y_pred_train, squared=False)
r2 = r2_score(train[label], y_pred_train)
mae = mean_absolute_error(train[label], y_pred_train)
print('RMSE:', rmse)
print('R2:', r2)
print('MAE:', mae)
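
To see what the three scores measure, it can help to compute them from their definitions. The sketch below does this with NumPy on four made-up temperature values (all `toy_` names are hypothetical and independent of the notebook variables); for these numbers the result should be RMSE = 1, MAE = 1 and R2 = 0.8.

```python
import numpy as np

# Toy observations and predictions in kelvin (hypothetical values)
toy_obs = np.array([300.0, 302.0, 304.0, 306.0])
toy_hat = np.array([301.0, 301.0, 305.0, 305.0])

toy_err = toy_obs - toy_hat                          # residuals
toy_rmse = np.sqrt(np.mean(toy_err ** 2))            # root mean squared error
toy_mae = np.mean(np.abs(toy_err))                   # mean absolute error
ss_res = np.sum(toy_err ** 2)                        # residual sum of squares
ss_tot = np.sum((toy_obs - toy_obs.mean()) ** 2)     # total sum of squares
toy_r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(toy_rmse, toy_mae, toy_r2)
```

RMSE and MAE carry the units of the target (here K), while R2 is unitless and compares the model against always predicting the mean.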

In [None]:
# ====== plot the residual ======
residual = train[label] - y_pred_train
plt.figure(figsize=(10, 5))
sns.histplot(residual, kde=True)
plt.ylabel('Frequency')
plt.xlabel('Residual [K]')
plt.title('Residual distribution')
plt.show()

In [None]:
# save the model
with open('automl_model.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

Tips:

- Another way to save the model:

```python
import joblib

# Save the model
joblib.dump(automl, 'automl_model.pkl')

# Load the model
automl = joblib.load('automl_model.pkl')
```

- [What's the difference?](https://medium.com/nlplanet/is-it-better-to-save-models-using-joblib-or-pickle-776722b5a095)

**Check the logs**

In [None]:
print(automl.best_config_train_time)

print(automl.best_iteration)

print(automl.best_loss)

print(automl.time_to_find_best_model)

print(automl.config_history)

### 1.6 Check the learning curve from the log file

In [None]:
time_history, best_valid_loss_history, valid_loss_history, config_history, metric_history = get_output_from_log(
    filename=automl_settings["log_file_name"],
    time_budget=automl_settings["time_budget"],  # read the log over the full search budget
)

plt.title("Learning Curve")
plt.xlabel("Wall Clock Time (s)")
plt.ylabel("RMSE [K]")
plt.step(time_history, best_valid_loss_history, where="post")  # the loss is already RMSE, so plot it as-is
plt.show()
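
`best_valid_loss_history` can be read as the lowest validation loss found up to each logged trial, which is why a learning curve of this kind looks like a descending staircase. A minimal sketch of that relationship, using made-up loss values and assuming lower is better (as with `rmse`):

```python
import numpy as np

# Hypothetical validation losses (RMSE) logged for successive trials
toy_valid_loss = np.array([2.0, 1.5, 1.8, 1.2, 1.4, 1.1])

# Best-so-far loss: running minimum over the trials
toy_best_loss = np.minimum.accumulate(toy_valid_loss)

# The running minimum never increases, even when a single trial is worse
print(toy_best_loss)
```

Long flat stretches at the end of the curve suggest that extra search time is no longer finding better configurations.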

### 1.7 Predictions

In [None]:
# =============load model================
with open('automl_model.pkl', 'rb') as f:  # the context manager closes the file after loading
    automl = pickle.load(f)

In [None]:
# evaluate the final model performance
y_test = test[label]
y_pred = automl.predict(test[feature_ls])
print("testing rmse:", mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False))
print("testing r2:", r2_score(y_true=y_test, y_pred=y_pred))
print("testing mae:", mean_absolute_error(y_true = y_test, y_pred = y_pred))

Question:
- Plot the difference between the predicted and actual values of the test data
- Plot the residuals of the test data
- Compare the residuals of the test data with the residuals of the training data

In [None]:
# ====== example plotting ======

residuals_train = train[label] - y_pred_train
residuals_test = test[label] - y_pred

plt.figure(figsize=(12, 6))
sns.histplot(residuals_train, kde=True, color='blue', label='train')
sns.histplot(residuals_test, kde=True, color='red', label='test')

mean_train = residuals_train.mean()
mean_test = residuals_test.mean()

plt.axvline(mean_train, color='blue', linestyle='dashed', linewidth=1)
plt.axvline(mean_test, color='red', linestyle='dashed', linewidth=1)

plt.text(-5, 200, f'Mean residuals: {mean_train.round(2)} [K]', rotation=0, color='blue')
plt.text(-5, 100, f'Mean residuals: {mean_test.round(2)} [K]', rotation=0, color='red')

plt.ylabel('Frequency')
plt.xlabel('Residual [K]')
plt.title('Residual distribution')
plt.legend()
plt.show()

**Compare with a random forest without automated tuning**

In [None]:
rf = RF()
rf.fit(train[feature_ls], train[label])
pred_rf = rf.predict(test[feature_ls])
rmse_rf = mean_squared_error(test[label], pred_rf, squared=False)
r2_rf = r2_score(test[label], pred_rf)
mae_rf = mean_absolute_error(test[label], pred_rf)

print('RMSE of RF:', rmse_rf)
print('R2 of RF:', r2_rf)
print('MAE of RF:', mae_rf)

print('RMSE of FLAML:', mean_squared_error(test[label], y_pred, squared=False))
print('R2 of FLAML:', r2_score(test[label], y_pred))
print('MAE of FLAML:', mean_absolute_error(test[label], y_pred))

### 1.8 Feature Importance

In [None]:
# =========== feature importance =========== 

# feature importance can be read directly from the model only when it is tree-based
fi = automl.model.estimator.feature_importances_

# plot
plt.figure(figsize=(10, 5))
sns.barplot(x=fi, y=feature_ls)
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance")
plt.show()

Tips:

- Feature importance can be read directly from the model only when the model is **tree-based**.

- The feature importance is not the same as the correlation between the feature and the target.

- The feature importance is not the same as the coefficient in linear regression.

- The feature importance is not the same as the p-value in statistical tests.

- The feature importance is not the same as the mutual information between the feature and the target.
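
When the selected model is not tree-based, one model-agnostic alternative is permutation importance: shuffle one feature at a time and measure how much the error grows. The sketch below is a hand-rolled NumPy illustration on synthetic data with a least-squares model; it is not FLAML's method, and every `toy_` name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on x0, weakly on x1, not at all on x2
toy_X = rng.normal(size=(500, 3))
toy_y = 3.0 * toy_X[:, 0] + 0.5 * toy_X[:, 1] + rng.normal(scale=0.1, size=500)

# Fit ordinary least squares (with intercept) as a simple non-tree model
design = np.column_stack([toy_X, np.ones(len(toy_X))])
toy_coef, *_ = np.linalg.lstsq(design, toy_y, rcond=None)

def toy_predict(X_in):
    """Predict with the fitted linear model."""
    return np.column_stack([X_in, np.ones(len(X_in))]) @ toy_coef

def toy_rmse(y_true, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

baseline = toy_rmse(toy_y, toy_predict(toy_X))

# Permutation importance: increase in RMSE after shuffling one column
toy_importance = []
for j in range(toy_X.shape[1]):
    X_perm = toy_X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    toy_importance.append(toy_rmse(toy_y, toy_predict(X_perm)) - baseline)

print(toy_importance)
```

The feature that drives the target most shows the largest error increase, while shuffling the irrelevant feature changes the error very little.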


#### 1.8.1 Other feature importance methods

First, install the shap library by running the following in a terminal (or prefix the command with `!` to run it from a notebook cell):
```bash
conda install -c conda-forge shap=0.39.0 -y
```

In [None]:
# SHAP (Shapley) importances

# =========== shapley ===========
import shap

# explain the model's predictions using SHAP
explainer = shap.Explainer(automl.model.estimator)
shap_values = explainer(train[feature_ls])

# visualize the training set predictions
shap.plots.beeswarm(shap_values)


Tips:

- Negative SHAP values mean that the feature value is pushing the prediction lower (less than the expected value), while positive SHAP values mean that the feature value is pushing the prediction higher (more than the expected value).
- Each point is a single sample.
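
SHAP values are additive: the expected value plus the sum of a sample's SHAP values gives the model's prediction for that sample. The sketch below checks this by hand for a hypothetical linear model, where (treating features as independent) each SHAP value is the coefficient times the feature's deviation from its mean. It uses only NumPy, not the `shap` library, and all names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Background data with two features (synthetic)
bg_X = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(200, 2))

# Hypothetical linear model f(x) = bias + weights . x
toy_weights = np.array([2.0, -1.0])
toy_bias = 0.5

def toy_model(X_in):
    """Evaluate the linear model on a 2-D array of samples."""
    return toy_bias + X_in @ toy_weights

expected_value = toy_model(bg_X).mean()              # E[f(X)] over the background data
sample = bg_X[0]                                     # the sample to explain
toy_phi = toy_weights * (sample - bg_X.mean(axis=0)) # per-feature SHAP values (linear case)

# Additivity: E[f(X)] + sum of SHAP values recovers f(x)
prediction = toy_model(sample[None, :])[0]
print(expected_value + toy_phi.sum(), prediction)
```

A positive entry in `toy_phi` pushes this sample's prediction above the expected value and a negative entry pushes it below, which is how the colours and signs in the SHAP plots can be read.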

In [None]:
# visualize the first prediction's explanation
shap.initjs()
shap.plots.waterfall(shap_values[0])

# E[f(X)] is the expected value of the model prediction 
# f(X) is the model prediction for a single sample X

### 1.9 Others

Question:

- [How to set up the time budget of AutoML?](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML/#how-to-set-time-budget)
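
Conceptually, a time budget caps how long the search may keep trying new configurations and the best result so far is kept when time runs out. The sketch below mimics that with a toy random search using only the standard library; it is an illustration of the idea, not FLAML's search algorithm, and `toy_search` is a hypothetical helper.

```python
import random
import time

def toy_search(time_budget_s, seed=0):
    """Hypothetical random search: keep sampling until the budget runs out."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    best_loss, n_trials = float("inf"), 0
    while True:
        # Pretend each trial trains a model and returns a validation loss
        trial_loss = rng.random()
        best_loss = min(best_loss, trial_loss)
        n_trials += 1
        if time.monotonic() >= deadline:
            break
    return best_loss, n_trials

for budget in (0.01, 0.05):
    found_loss, trials = toy_search(budget)
    print(f"budget={budget}s trials={trials} best_loss={found_loss:.4f}")
```

A larger budget usually allows more trials, but the gain in the best loss tends to shrink, which is worth keeping in mind when choosing `time_budget`.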

In [None]:
# =========== try different time budget ===========
time_budget =  # update the time budget
# update the automl settings with the new time budget
automl_settings = {
    "time_budget": time_budget,  # in seconds
    "estimator_list":estimator_list, # estimators
    "metric": 'rmse',
    "task": 'regression',
    "log_file_name": "log.log"
}
# ====== train model ======

automl.fit(train[feature_ls], train[label], **automl_settings) #verbose=-1 for silent
print(automl.model.estimator)

# ====== rmse, r2, mae ======
print('Test data')
y_pred = automl.predict(test[feature_ls])
rmse = mean_squared_error(test[label], y_pred, squared=False)
r2 = r2_score(test[label], y_pred)
mae = mean_absolute_error(test[label], y_pred)
print('RMSE:', rmse)      
print('R2:', r2)
print('MAE:', mae)

Question:

- Can we use other estimators?

  - [The supported estimators](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#estimator)   

In [None]:
# =========== try different estimator ===========

# update the automl settings 
automl_settings = {
    "time_budget": time_budget,  # in seconds
    "estimator_list": # Update your estimators list here
    "metric": 'rmse',
    "task": 'regression',
    "log_file_name": "log.log"
}

automl.fit(train[feature_ls], train[label], **automl_settings) #verbose=-1 for silent
print(automl.model.estimator)

# ====== rmse, r2, mae ======
print('Test data')
y_pred = automl.predict(test[feature_ls])
rmse = mean_squared_error(test[label], y_pred, squared=False)
r2 = r2_score(test[label], y_pred)
mae = mean_absolute_error(test[label], y_pred)
print('RMSE:', rmse)      
print('R2:', r2)
print('MAE:', mae)

Question:
    
- How do we specify the resampling strategy?
    - [reference](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#resampling-strategy)
    - Try the `holdout` and `cv` resampling strategies
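
The two strategies differ in how the validation score is obtained: holdout scores the model once on a single held-out part of the training data, while k-fold cross-validation scores it k times so that every row is used for validation exactly once, then averages. The sketch below shows both on synthetic data with a straight-line fit in NumPy; it is a conceptual illustration rather than FLAML's internal implementation, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.uniform(0, 10, size=100)
toy_y = 2.0 * toy_x + 1.0 + rng.normal(scale=1.0, size=100)

def fit_and_score(train_idx, valid_idx):
    """Fit a straight line on the training rows, return RMSE on the validation rows."""
    slope, intercept = np.polyfit(toy_x[train_idx], toy_y[train_idx], deg=1)
    pred = slope * toy_x[valid_idx] + intercept
    return float(np.sqrt(np.mean((toy_y[valid_idx] - pred) ** 2)))

idx = rng.permutation(len(toy_x))

# Holdout: one split, one validation score
holdout_rmse = fit_and_score(idx[:80], idx[80:])

# k-fold CV: each fold is the validation set once; the scores are averaged
k = 5
folds = np.array_split(idx, k)
cv_scores = [
    fit_and_score(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
    for i in range(k)
]
cv_rmse = float(np.mean(cv_scores))

print(holdout_rmse, cv_rmse)
```

Cross-validation costs roughly k times more fitting per configuration but gives a less noisy score, so within a fixed time budget it trades the number of configurations tried for the reliability of each evaluation.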

In [None]:
# =========== try resampling strategy ===========

# update the automl settings 
automl_settings = {
    "time_budget": time_budget,  # in seconds
    "estimator_list": estimator_list, # estimators
    "metric": 'rmse',
    "task": 'regression',
    "log_file_name": "log.log",
    "eval_method": , # Update the resampling strategy: 'holdout' or 'cv'
    "n_splits": , # Update the number of folds (used when eval_method is 'cv')
}

automl.fit(train[feature_ls], train[label], **automl_settings) #verbose=-1 for silent
print(automl.model.estimator)

# ====== rmse, r2, mae ======
print('Test data')
y_pred = automl.predict(test[feature_ls])
rmse = mean_squared_error(test[label], y_pred, squared=False)
r2 = r2_score(test[label], y_pred)
mae = mean_absolute_error(test[label], y_pred)
print('RMSE:', rmse)      
print('R2:', r2)
print('MAE:', mae)