# AutoML and Hyperparameter Training

Here I use automated machine learning (AutoML) tools to quickly and efficiently assess a range of candidate models.

## Import Packages  

In [41]:
# plotting
import matplotlib.pyplot as plt

# general
import pandas as pd
import numpy as np
import calendar

# ml
from pycaret.regression import *

## Import data

This is my first time combining the PCs with the PWT data I'm trying to emulate, so the cells below do a bit of work to get the two into the same format.

In [None]:
# set target (the variable we are trying to predict)
target = 'pwt_500hpa'
#target = 'temperature_800hpa'
#target = 'precipitation'

In [None]:
# set lists of coordinate boxes and time ranges (pulled from Prepare_AI_Ready_Data.py); more entries can be added to either list
coords = [[180,240,45,65],[130,250,20,75]]
times = [['1970-01-01','2023-12-31']]  # Ensure the time range is valid

# set PC option - 'seperate' or 'combined' (the spelling matches the input file names)
#PC_option = 'combined'
PC_option = 'seperate'

# select which of the list I want to load
coords_num = 1
times_num = 0

# pull the correct coordinate and time (as set above)
c = coords[coords_num]
t = times[times_num]

# read in PCs (the file name embeds the PC option, the coordinate box and the year range)
if PC_option not in ('combined', 'seperate'):
    raise ValueError(f"Unknown PC_option: {PC_option!r}")
pc_df_raw = pd.read_csv(f'../data/dimensionality_reduction/principal_components_{PC_option}_{c[0]}-{c[1]}_{c[2]}-{c[3]}_{t[0][:4]}-{t[1][:4]}.csv')

In [44]:
target_df_raw = pd.read_csv('../data/target/era5_monthbymonth_allvars.csv')

# rename the 'Time' column to 'time' in the target data
target_df_raw.rename(columns={'Time':'time'}, inplace=True)

# lowercase all column names (e.g. any 'PWT' in a column name becomes 'pwt')
target_df_raw.columns = [i.lower() for i in target_df_raw.columns]

# merge the PC and target data; the inner join drops rows whose 'time' values don't appear in both tables
master_df = pd.merge(pc_df_raw, target_df_raw, on='time', how='inner')

# add a column for the month
master_df['month'] = pd.to_datetime(master_df['time']).dt.month

# Perform one-hot encoding on the month column
master_df = pd.get_dummies(master_df, columns=['month'])

# save the master_df (for use in subsequent scripts)
master_df.to_csv(f'../data/dimensionality_reduction/principal_components_{PC_option}_{c[0]}-{c[1]}_{c[2]}-{c[3]}_{t[0][:4]}-{t[1][:4]}_target.csv', index=False)
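
As a small illustration of the merge-and-encode step above, the sketch below uses two made-up tables (the column names `PC1_sst` and `pwt_500hpa` mirror this notebook, but the values and the `*_toy` names are invented). It shows that an inner merge keeps only the timestamps present in both tables, and that `pd.get_dummies` only creates columns for months that actually occur in the data.

```python
import pandas as pd

# Toy stand-ins for the PC and target tables (values are invented for illustration)
pc_toy = pd.DataFrame({
    'time': ['2000-01-01', '2000-02-01', '2000-03-01'],
    'PC1_sst': [0.1, 0.2, 0.3],
})
target_toy = pd.DataFrame({
    'time': ['2000-02-01', '2000-03-01', '2000-04-01'],
    'pwt_500hpa': [10.0, 11.0, 12.0],
})

# Inner merge: only rows whose 'time' appears in both tables survive
merged_toy = pd.merge(pc_toy, target_toy, on='time', how='inner')

# Derive the calendar month and one-hot encode it
merged_toy['month'] = pd.to_datetime(merged_toy['time']).dt.month
merged_toy = pd.get_dummies(merged_toy, columns=['month'])

print(merged_toy.columns.tolist())
```

Only February and March overlap in the toy tables, so only `month_2` and `month_3` are created. In the real data every calendar month occurs, which is why `master_df` ends up with all twelve `month_*` columns.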

## Apply autoML Frameworks

Here I apply PyCaret to automate model selection along with hyperparameter tuning. I used Copilot to set this up efficiently. Does this mean we can call this auto-autoML?

In [None]:
# Keep only columns that are target, contain "PC", or contain "month"
columns_to_keep = [col for col in master_df.columns if target in col or "PC" in col or "month" in col]
# take an explicit copy so the normalization below doesn't write into a slice of master_df
automl_data = master_df[columns_to_keep].copy()

# normalize all features and target to be between 0 and 1 (except the month columns)
for col in automl_data.columns:
    if 'month' not in col:
        automl_data[col] = (automl_data[col] - automl_data[col].min()) / (automl_data[col].max() - automl_data[col].min())

# check the data
automl_data.head()

Missing values per column: 0 for all 43 columns (PC1 to PC10 for each of sst, msl and z, plus pwt_500hpa and month_1 to month_12)
Column dtypes (listing truncated in the saved output): float64 for PC1_sst through PC3_msl

|   | PC1_sst | PC2_sst | PC3_sst | PC4_sst | PC5_sst | PC6_sst | PC7_sst | PC8_sst | PC9_sst | PC10_sst | ... | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.259173 | 0.635995 | 0.431253 | 0.605195 | 0.468394 | 0.450963 | 0.552845 | 0.414452 | 0.499316 | 0.399609 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 0.297425 | 0.604696 | 0.36809 | 0.608731 | 0.524775 | 0.468598 | 0.522726 | 0.454354 | 0.549438 | 0.483807 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 0.291333 | 0.66022 | 0.374712 | 0.550507 | 0.359033 | 0.560825 | 0.511349 | 0.435851 | 0.573838 | 0.458497 | ... | True | False | False | False | False | False | False | False | False | False |
| 3 | 0.310941 | 0.518791 | 0.416521 | 0.411727 | 0.288424 | 0.633701 | 0.453706 | 0.457244 | 0.544118 | 0.454177 | ... | False | True | False | False | False | False | False | False | False | False |
| 4 | 0.314177 | 0.318119 | 0.427844 | 0.399218 | 0.347708 | 0.638117 | 0.574364 | 0.38846 | 0.519845 | 0.464419 | ... | False | False | True | False | False | False | False | False | False | False |
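
The normalization in the cell above is min-max scaling: each value x becomes (x - min) / (max - min), so every scaled column spans 0 to 1. The sketch below wraps the same idea in a small helper. The name `min_max_scale`, its `skip_substring` argument and the guard for constant columns are my own additions for illustration and are not part of the notebook.

```python
import pandas as pd

def min_max_scale(df, skip_substring='month'):
    """Return a copy of df with each column rescaled to [0, 1].

    Columns whose name contains skip_substring are left untouched.
    """
    scaled = df.copy()
    for col in scaled.columns:
        if skip_substring in col:
            continue
        col_min, col_max = scaled[col].min(), scaled[col].max()
        if col_max == col_min:
            # constant column: avoid dividing by zero (an added safeguard)
            scaled[col] = 0.0
        else:
            scaled[col] = (scaled[col] - col_min) / (col_max - col_min)
    return scaled

# Toy data (invented values) with one numeric column and one month dummy
toy = pd.DataFrame({'PC1_sst': [2.0, 4.0, 6.0], 'month_1': [True, False, False]})
scaled_toy = min_max_scale(toy)
print(scaled_toy)
```

Working on a copy keeps the input frame unchanged, which also avoids pandas' chained-assignment warnings when the input is itself a slice of a larger frame.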


## Initialize PyCaret Setup

In [47]:
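# Initialize PyCaret setup
setup(data=automl_data,
      target=target,
      session_id=123,        # fixed seed for reproducibility
      normalize=False,       # data were already min-max scaled above
      transformation=True,   # apply a power transform to the features
      fold=5,                # 5-fold cross-validation
      verbose=True)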

|   | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | pwt_500hpa |
| 2 | Target type | Regression |
| 3 | Original data shape | (636, 43) |
| 4 | Transformed data shape | (636, 43) |
| 5 | Transformed train set shape | (445, 43) |
| 6 | Transformed test set shape | (191, 43) |
| 7 | Numeric features | 30 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


<pycaret.regression.oop.RegressionExperiment at 0x174284970>

## Identify best model

Here I identify the best model. The compare_models() function gives us a useful summary of the basic performance metrics of all models, each evaluated with cross-validation.

As we can see, we're comparing a large number of models. These range from very simple models (e.g. linear regression) through to more complex ones, including random forests. Note that compare_models() fits each model with its default hyperparameters; tuning the hyperparameters of a chosen model is a separate step (tune_model()).
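
To make concrete what a cross-validated model leaderboard involves, here is a minimal sketch in plain NumPy. It is not PyCaret's implementation: the synthetic data, the two candidate models (`fit_mean`, `fit_ridge`) and the helper `cv_mae` are all assumptions chosen for illustration. Each candidate is scored on the same five folds and the candidates are then ranked by mean absolute error.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic regression data (invented): y depends linearly on X plus noise
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

def fit_mean(X_train, y_train):
    """Baseline: always predict the training mean."""
    mean = y_train.mean()
    return lambda X_new: np.full(len(X_new), mean)

def fit_ridge(X_train, y_train, alpha=1.0):
    """Ridge regression with an intercept, solved in closed form."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return lambda X_new: (X_new - x_mean) @ coef + y_mean

def cv_mae(fit, X, y, folds):
    """Mean absolute error averaged over the held-out folds."""
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        predict = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(np.abs(predict(X[test_idx]) - y[test_idx])))
    return float(np.mean(scores))

# Use the same 5 folds for every candidate so the scores are comparable
folds = np.array_split(rng.permutation(len(y)), 5)
candidates = {'mean baseline': fit_mean, 'ridge': fit_ridge}
leaderboard = sorted(
    ((name, cv_mae(fit, X, y, folds)) for name, fit in candidates.items()),
    key=lambda item: item[1],
)

for name, mae in leaderboard:
    print(f'{name}: MAE = {mae:.3f}')
```

Sharing the folds across candidates means differences in the scores reflect the models rather than the luck of the split.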

In [None]:
best = compare_models(exclude=['ransac'])


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)



## Explore best model parameters

Here we can click through and look at details from the best model run. In particular, we can look at which hyperparameters the model settles on. I'm also interested in the Feature Importance plot: it seems to indicate that the pressure fields are getting more weight than both SST and month.

In [52]:
evaluate_model(best)

# print hyperparameters of best model
print(best.get_params())


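
One way to check the impression from the Feature Importance plot is to sum the importances within each variable group (the `sst`, `msl` and `z` suffixes, plus the month dummies). The sketch below assumes the per-feature importances are already available as a pandas Series indexed by feature name; how to extract them from `best` depends on the model type and is not shown. The numbers are invented for illustration, and `feature_group` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical importances (invented numbers); names follow this notebook's convention
importances = pd.Series({
    'PC1_sst': 0.10, 'PC2_sst': 0.05,
    'PC1_msl': 0.25, 'PC2_msl': 0.15,
    'PC1_z': 0.30, 'PC2_z': 0.05,
    'month_1': 0.06, 'month_7': 0.04,
})

def feature_group(name):
    """Map 'PC3_msl' to 'msl' and 'month_7' to 'month'."""
    prefix, _, suffix = name.partition('_')
    return 'month' if prefix == 'month' else suffix

# Group by the label derived from each index entry, then total per group
grouped = importances.groupby(feature_group).sum().sort_values(ascending=False)
print(grouped)
```

Summing within groups avoids over-reading a single PC: a field can carry a lot of weight overall even if no individual component tops the plot.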