# Introduction

This notebook gives a brief introduction to AutoML (Automated Machine Learning) using AutoGluon:

https://auto.gluon.ai/


# TabularPredictor

As an example we use this dataset:

https://www.kaggle.com/datasets/mchilamwar/predict-concrete-strength

The goal is to predict the strength of concrete (the `Strength` column) from the other columns.

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mchilamwar/predict-concrete-strength")

print("Path to dataset files:", path)


Path to dataset files: /home/juebrauer/.cache/kagglehub/datasets/mchilamwar/predict-concrete-strength/versions/1


In [2]:
!ls {path}

ConcreteStrengthData.csv


In [3]:
import pandas
df = pandas.read_csv(path + "/ConcreteStrengthData.csv")
df

Unnamed: 0,CementComponent,BlastFurnaceSlag,FlyAshComponent,WaterComponent,SuperplasticizerComponent,CoarseAggregateComponent,FineAggregateComponent,AgeInDays,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.30
...,...,...,...,...,...,...,...,...,...
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28,31.18
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28,23.70
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28,32.77


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   CementComponent            1030 non-null   float64
 1   BlastFurnaceSlag           1030 non-null   float64
 2   FlyAshComponent            1030 non-null   float64
 3   WaterComponent             1030 non-null   float64
 4   SuperplasticizerComponent  1030 non-null   float64
 5   CoarseAggregateComponent   1030 non-null   float64
 6   FineAggregateComponent     1030 non-null   float64
 7   AgeInDays                  1030 non-null   int64  
 8   Strength                   1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.6 KB


In [5]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CementComponent,1030.0,281.167864,104.506364,102.0,192.375,272.9,350.0,540.0
BlastFurnaceSlag,1030.0,73.895825,86.279342,0.0,0.0,22.0,142.95,359.4
FlyAshComponent,1030.0,54.18835,63.997004,0.0,0.0,0.0,118.3,200.1
WaterComponent,1030.0,181.567282,21.354219,121.8,164.9,185.0,192.0,247.0
SuperplasticizerComponent,1030.0,6.20466,5.973841,0.0,0.0,6.4,10.2,32.2
CoarseAggregateComponent,1030.0,972.918932,77.753954,801.0,932.0,968.0,1029.4,1145.0
FineAggregateComponent,1030.0,773.580485,80.17598,594.0,730.95,779.5,824.0,992.6
AgeInDays,1030.0,45.662136,63.169912,1.0,7.0,28.0,56.0,365.0
Strength,1030.0,35.817961,16.705742,2.33,23.71,34.445,46.135,82.6


In [6]:
# shuffle the data
df = df.sample(frac=1.0)

# split data into training and test data
N_train = int(len(df)*0.8)
df.iloc[:N_train].to_csv("concrete_strength_train.csv", index=False)
df.iloc[N_train:].to_csv("concrete_strength_test.csv", index=False)

In [7]:
from autogluon.tabular import TabularPredictor
model = TabularPredictor(label="Strength",
                         eval_metric="mean_absolute_percentage_error",
                         path="autogluon_concrete_strength_predictor")
model = model.fit("concrete_strength_train.csv", time_limit=4*60 )

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.5.0
Python Version:     3.13.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2
CPU Count:          32
Pytorch Version:    2.9.1+cu128
CUDA Version:       12.8
GPU Memory:         GPU 0: 11.60/11.60 GB
Total GPU Memory:   Free: 11.60 GB, Allocated: 0.00 GB, Total: 11.60 GB
GPU Count:          1
Memory Avail:       21.21 GB / 31.03 GB (68.4%)
Disk Space Avail:   14.85 GB / 195.80 GB (7.6%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme'  : New in v1.5: The state-of-the-art for tabular data. Massively better than 'best' on datasets <100000 samples by using new Tabular Foundation Models (TFMs) meta-learned on ht

[1000]	valid_set's l2: 26.7832	valid_set's mean_absolute_percentage_error: -0.117202
[2000]	valid_set's l2: 24.9799	valid_set's mean_absolute_percentage_error: -0.109467
[3000]	valid_set's l2: 24.3872	valid_set's mean_absolute_percentage_error: -0.106534
[4000]	valid_set's l2: 24.4581	valid_set's mean_absolute_percentage_error: -0.106246
[5000]	valid_set's l2: 24.6485	valid_set's mean_absolute_percentage_error: -0.106442


	-0.1059	 = Validation score   (-mean_absolute_percentage_error)
	3.1s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 236.85s of the 236.85s of remaining time.
	Fitting with cpus=24, gpus=0, mem=0.0/21.2 GB
	-0.1278	 = Validation score   (-mean_absolute_percentage_error)
	0.88s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 235.96s of the 235.96s of remaining time.
	Fitting with cpus=32, gpus=0


[1000]	valid_set's l2: 31.7302	valid_set's mean_absolute_percentage_error: -0.127854


	-0.1575	 = Validation score   (-mean_absolute_percentage_error)
	0.53s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 235.33s of the 235.33s of remaining time.
	Fitting with cpus=24, gpus=0
	-0.1497	 = Validation score   (-mean_absolute_percentage_error)
	1.51s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 233.82s of the 233.82s of remaining time.
	Fitting with cpus=32, gpus=0
	-0.1439	 = Validation score   (-mean_absolute_percentage_error)
	0.58s	 = Training   runtime
	0.08s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 233.14s of the 233.14s of remaining time.
	Fitting with cpus=24, gpus=0, mem=0.0/20.9 GB
Metric mean_absolute_percentage_error is not supported by this model - using mean_squared_error instead
	-0.1934	 = Validation score   (-mean_absolute_percentage_error)
	1.19s	 = Training   runtime
	0.01s	 = Validation runtime
F

In [8]:
model.leaderboard()

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-0.095755,mean_absolute_percentage_error,0.010197,52.31304,0.000382,0.032692,2,True,10
1,LightGBMXT,-0.105948,mean_absolute_percentage_error,0.004902,3.097236,0.004902,3.097236,1,True,1
2,NeuralNetTorch,-0.107995,mean_absolute_percentage_error,0.004912,49.183112,0.004912,49.183112,1,True,8
3,LightGBM,-0.127785,mean_absolute_percentage_error,0.001317,0.879704,0.001317,0.879704,1,True,2
4,XGBoost,-0.1333,mean_absolute_percentage_error,0.002702,1.03532,0.002702,1.03532,1,True,7
5,LightGBMLarge,-0.137117,mean_absolute_percentage_error,0.001605,2.375581,0.001605,2.375581,1,True,9
6,ExtraTreesMSE,-0.143915,mean_absolute_percentage_error,0.078691,0.580122,0.078691,0.580122,1,True,5
7,CatBoost,-0.149661,mean_absolute_percentage_error,0.001199,1.50851,0.001199,1.50851,1,True,4
8,RandomForestMSE,-0.15753,mean_absolute_percentage_error,0.074089,0.532181,0.074089,0.532181,1,True,3
9,NeuralNetFastAI,-0.193401,mean_absolute_percentage_error,0.008361,1.190964,0.008361,1.190964,1,True,6


In [9]:
type(model.leaderboard())

pandas.core.frame.DataFrame
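Because the leaderboard is an ordinary pandas DataFrame, it can be sorted and filtered like any other table. The sketch below rebuilds a few rows from the leaderboard output above by hand (so it runs without a trained model) and picks the best-scoring model and the models that trained quickly. The 5-second threshold is an arbitrary assumption.

```python
import pandas as pd

# A few rows copied from the leaderboard output above
leaderboard = pd.DataFrame({
    "model": ["WeightedEnsemble_L2", "LightGBMXT", "NeuralNetTorch", "LightGBM"],
    "score_val": [-0.095755, -0.105948, -0.107995, -0.127785],
    "fit_time": [52.31304, 3.097236, 49.183112, 0.879704],
})

# Scores are reported as "higher is better", so the best model has the largest score_val
best = leaderboard.sort_values("score_val", ascending=False).iloc[0]

# Models that trained in under 5 seconds (threshold chosen for illustration)
fast = leaderboard[leaderboard["fit_time"] < 5.0]

print(best["model"])
print(fast["model"].tolist())
```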

In [15]:
# Make new predictions!
from autogluon.tabular import TabularPredictor
model = TabularPredictor.load("autogluon_concrete_strength_predictor")

import pandas
df_test = pandas.read_csv("concrete_strength_test.csv")

df_test["preds"] = model.predict(df_test)
df_test.to_csv("data_with_predictions.csv", index=False)

In [13]:
model.predict(df_test)

0      52.309258
1      80.203018
2      13.280212
3      41.483154
4      43.198029
         ...    
201     9.801093
202    32.864487
203    43.014103
204    37.632629
205    26.504713
Name: Strength, Length: 206, dtype: float32

In [None]:
model.evaluate(df_test)
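The training log above labels its scores as `-mean_absolute_percentage_error`: the metric is negated so that higher values are better. To make the number itself concrete, here is a minimal sketch of how MAPE can be computed by hand with NumPy. The function `mape` and the sample values are illustrative assumptions, not AutoGluon code.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10 %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Hypothetical strengths and predictions
y_true = [40.0, 50.0, 20.0]
y_pred = [44.0, 45.0, 21.0]

error = mape(y_true, y_pred)
print(f"MAPE: {error:.4f}")
print(f"Score in 'higher is better' form: {-error:.4f}")
```

Note that MAPE divides by the true value, so it is undefined when a true value is zero. In this dataset the minimum `Strength` is 2.33, so that case does not occur.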

# TimeSeriesPredictor

We will forecast the hourly number of bike shares in London (column `cnt`).

Dataset:

https://www.kaggle.com/datasets/hmavrodiev/london-bike-sharing-dataset

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("hmavrodiev/london-bike-sharing-dataset")

print("Path to dataset files:", path)

In [None]:
!ls {path}

In [None]:
import pandas
df = pandas.read_csv(path + "/london_merged.csv")
df

In [None]:
df.info()

In [None]:
df['timestamp'] = pandas.to_datetime(df['timestamp'])
df = df.set_index("timestamp", drop=False)

# AutoGluon requires a time series ID (item ID) column
# so that it can distinguish between individual time series
df['series_id'] = 'London'

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.index

In [None]:
# prepare train period
df.loc["2015-01" : "2016-09"].to_csv("london_bikes_train.csv", index=False)

# prepare test period / ground truth data to compare forecast with
df.loc["2016-10-01" : "2016-10-14"].to_csv("london_bikes_test.csv", index=False)
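The two `.loc` slices above rely on pandas partial string indexing on a `DatetimeIndex`: a string such as `"2016-09"` stands for the whole month, and both ends of the slice are inclusive. A small sketch on a synthetic daily series (the values are made up for illustration) shows the behaviour:

```python
import pandas as pd

# Synthetic daily series spanning the train/test boundary used above
idx = pd.date_range("2016-08-30", "2016-10-02", freq="D")
s = pd.Series(range(len(idx)), index=idx)

# "2016-09" as the end label includes every timestamp in September
train = s.loc["2016-08":"2016-09"]

# Full dates are inclusive on both ends; dates beyond the data are simply absent
test = s.loc["2016-10-01":"2016-10-14"]

print(train.index.min().date(), train.index.max().date(), len(train))
print(test.index.min().date(), test.index.max().date(), len(test))
```

Since the training slice ends on the last timestamp of September and the test slice starts on October 1, the two periods do not overlap.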

In [None]:
from autogluon.timeseries import TimeSeriesPredictor, TimeSeriesDataFrame

# from_path reads the CSV file directly (from_data_frame is meant for pandas DataFrames)
train_data = TimeSeriesDataFrame.from_path(
    "london_bikes_train.csv",
    id_column="series_id",
    timestamp_column="timestamp"
)

model = TimeSeriesPredictor(
    prediction_length=48,  # Predict the next 48 hours
    target="cnt",
    eval_metric="MASE",
    freq="h",  # state explicitly that this is hourly data
    path="autogluon_london_bikesharing_predictor"
)

model.fit(
    train_data,
    presets="best_quality",
    time_limit=6*60
)
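The predictor above is scored with MASE (mean absolute scaled error). As a rough guide to reading that number, the sketch below computes MASE by hand: the forecast's mean absolute error divided by the in-sample mean absolute error of a seasonal naive forecast. The function `mase`, its `season` parameter, and the sample values are assumptions made for illustration; the seasonal period AutoGluon uses internally is not shown here.

```python
import numpy as np

def mase(y_train, y_true, y_pred, season=1):
    """MAE of the forecast, scaled by the in-sample MAE of a seasonal naive forecast."""
    y_train = np.asarray(y_train, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_forecast = np.mean(np.abs(y_true - y_pred))
    # Seasonal naive forecast: repeat the value observed `season` steps earlier
    mae_naive = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return float(mae_forecast / mae_naive)

history = [10, 12, 11, 13, 12, 14]   # hypothetical past values
actual = [13, 15]                    # hypothetical ground truth
forecast = [14, 14]                  # hypothetical forecast

score = mase(history, actual, forecast, season=1)
print(score)
```

A value below 1 means the forecast beats the naive baseline on average; a value above 1 means it does worse.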

In [None]:
import pandas
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
import matplotlib.pyplot as plt

# 1. Reload model
path = "autogluon_london_bikesharing_predictor"
model = TimeSeriesPredictor.load(path)
print(model.model_names())

# 2. Read in data for which to do a forecast
df = pandas.read_csv("london_bikes_train.csv")
df['timestamp'] = pandas.to_datetime(df['timestamp'])
df = df.set_index("timestamp", drop=False)
df["series_id"] = "London"
input_timeseries = TimeSeriesDataFrame.from_data_frame(
    df,
    id_column="series_id",
    timestamp_column="timestamp"
)

# 3. Read in ground truth data
df_gt = pandas.read_csv("london_bikes_test.csv")
df_gt['timestamp'] = pandas.to_datetime(df_gt['timestamp'])
df_gt = df_gt.set_index("timestamp", drop=False)
gt_data = df_gt.iloc[:48]

# 4. Predict / Forecast with the best model
preds = model.predict( input_timeseries )

# 5. Visualize ground truth vs. predictions
plt.plot(gt_data["cnt"], color="black")
plt.plot(preds["mean"]["London"], color="red", linestyle="--")
plt.xticks(rotation=45)
plt.show()

In [None]:
preds.head()

In [None]:
# Use AutoGluon's prediction visualization function
import matplotlib.pyplot as plt
model.plot(data=input_timeseries, predictions=preds, item_ids=['London'], max_history_length=200)
plt.show()

In [None]:
model.model_names()

In [None]:
import matplotlib.pyplot as plt

# Forecast with each trained model in turn and compare with the ground truth
for model_name in model.model_names():

    # Predict / Forecast with specific model
    preds = model.predict(input_timeseries, model=model_name)

    plt.plot(gt_data["cnt"], color="black")
    plt.plot(preds["mean"]["London"], color="red", linestyle="--")
    plt.xticks(rotation=45)
    plt.title(f"Predictions of model {model_name}")
    plt.show()

# Decompose time series

In [None]:
# Additive time series decomposition with visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Try to use statsmodels; if unavailable, fall back to a simple manual additive decomposition
try:
    from statsmodels.tsa.seasonal import seasonal_decompose
    HAVE_SM = True
except ImportError:
    HAVE_SM = False

print(HAVE_SM)

# 1. Generate synthetic time series (5 years)
rng = pd.date_range("2020-01-01", periods=5*12, freq="MS")
np.random.seed(42)

trend = np.linspace(10, 22, len(rng))  # linear trend
seasonal_true = 2*np.sin(2*np.pi * (rng.month-1)/12)  # yearly seasonality
noise = np.random.normal(0, 0.8, len(rng))

y = trend + seasonal_true + noise
ts = pd.Series(y, index=rng, name="Example series")

# 2. Decompose with an additive model
if HAVE_SM:
    decomp = seasonal_decompose(ts, model="additive", period=12, extrapolate_trend="freq")
    observed = decomp.observed
    trend_est = decomp.trend
    seasonal_est = decomp.seasonal
    resid = decomp.resid
else:
    # Trend: moving average with window=12 (months), centered
    trend_est = ts.rolling(window=12, center=True, min_periods=6).mean()
    # Detrend
    detrended = ts - trend_est
    # Seasonal component: mean per calendar month across all years, mapped back onto the index
    month_avgs = detrended.groupby(detrended.index.month).mean()
    seasonal_est = ts.index.month.map(month_avgs).to_series(index=ts.index)
    # Residuals
    resid = ts - trend_est - seasonal_est
    observed = ts

# 3. Visualization helper function
def make_plot(series, title, ylabel):
    plt.figure(figsize=(10, 3.2))
    plt.plot(series.index, series.values)
    plt.title(title)
    plt.xlabel("Date")
    plt.ylabel(ylabel)
    plt.tight_layout()
    plt.show()

make_plot(observed, "Observed", "value")
make_plot(trend_est, "Trend (additive)", "value")
make_plot(seasonal_est, "Seasonality", "value")
make_plot(resid, "Residuals", "value")
make_plot(trend_est + seasonal_est + resid, "Time series as addition of components", "value")

# Covariates

- In AutoGluon TimeSeriesPredictor, covariates are extra variables that help predict the target time series.
- They provide additional context beyond past target values.
- There are three main types of covariates.
    - Past covariates are known only up to the current time (e.g., past demand or sensor data).
    - Known (future) covariates are available in advance, including the forecast horizon (e.g., holidays or planned promotions).
    - Static covariates do not change over time and describe each series (e.g., store location or product category).
- Covariates help models learn seasonality and external effects.
- They can improve accuracy, especially over longer forecast horizons.
- Only use covariates that are truly available at prediction time.
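The three covariate types can be pictured as three kinds of tables. The following pandas-only sketch is an illustration with made-up names and values (`store_1`, `temp`, `region`); it does not call AutoGluon. It shows which columns exist for the history, which can be supplied for the forecast horizon, and which describe a series as a whole.

```python
import pandas as pd

# History: target, a past covariate (temp) and a known covariate (is_weekend)
history = pd.DataFrame({
    "item_id": ["store_1"] * 3,
    "timestamp": pd.date_range("2026-01-01", periods=3, freq="D"),
    "target": [100.0, 120.0, 90.0],
    "temp": [3.5, 4.0, 2.5],  # past covariate: only observed up to "now"
})
history["is_weekend"] = (history["timestamp"].dt.dayofweek >= 5).astype(int)

# Forecast horizon: only known covariates can be filled in ahead of time
future = pd.DataFrame({
    "item_id": ["store_1"] * 2,
    "timestamp": pd.date_range("2026-01-04", periods=2, freq="D"),
})
future["is_weekend"] = (future["timestamp"].dt.dayofweek >= 5).astype(int)

# Static covariates: one row per series, no timestamp
static = pd.DataFrame({"item_id": ["store_1"], "region": ["north"]})

print(history)
print(future)
print(static)
```

The `future` table has no `temp` column on purpose: a past covariate is not available for the forecast horizon, which is exactly the check implied by the last bullet above.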

## Generate train/test data

In [None]:
import pandas as pd
import numpy as np
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

rng = np.random.default_rng(0)

# We will simulate N days
N = 2*365
dates = pd.date_range("2026-01-01", periods=N, freq="D")

# We will simulate two time series
time_series_ids = ["A", "B"]

# Simulate two time series
rows = []
    
# Simulate each day
for t, ts in enumerate(dates):

    for time_series_id in time_series_ids:

        # Give each time series its own base level and seasonal period
        if time_series_id == "A":
            base = 50
            seasonality = 7
        elif time_series_id == "B":
            base = 80
            seasonality = 14
    
    
        # simulate time series value with a sinusoidal seasonality (period of 7 or 14 days)
        y = base + 10*np.sin(2*np.pi*t/seasonality) + rng.normal(0, 2)
    
        # simulate past covariate: temperature (yearly seasonality)
        # The temperature will not be known for the forecast horizon!
        temp = 15 + 8*np.sin(2*np.pi*t/365) + rng.normal(0, 1)
    
        # simulate known covariates: DayOfWeek (dow), Weekend (is_weekend)
        # This will be known in advance for the forecast horizon
        dow = ts.dayofweek
        is_weekend = int(dow >= 5)
    
        # New data row for our table to be created
        rows.append(
            {"series_id": time_series_id,
             "timestamp": ts,
             "target": y,
             "temp": temp,
             "day_of_week": dow,
             "is_weekend": is_weekend}
        )

# create table
df = pd.DataFrame(rows)

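# split into train and test data by date: the rows of A and B are interleaved,
# so a split by row position would not give both series the same days
split_date = dates[N//2]
# save training data (first half of the simulated period)
df[df["timestamp"] < split_date].to_csv("timeseries_train.csv", index=False)
# save test data (second half of the simulated period)
df[df["timestamp"] >= split_date].to_csv("timeseries_test.csv", index=False)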

# plot start of training data
import matplotlib.pyplot as plt
df_A = df.query("series_id=='A'").head(200)
df_B = df.query("series_id=='B'").head(200)
plt.plot(df_A["timestamp"], df_A["target"], color="black", label="A" )
plt.plot(df_B["timestamp"], df_B["target"], color="green", label="B" )
plt.legend()
plt.show()

## Train the time series predictor

In [None]:
# Now, re-read the training data to prepare for training
import pandas
df_train = pandas.read_csv("timeseries_train.csv", parse_dates=["timestamp"])

# This is mandatory in order to use the TimeSeriesPredictor
ts_train = TimeSeriesDataFrame.from_data_frame(
    df_train,
    id_column="series_id",
    timestamp_column="timestamp"
)


# Now, let us train a time series predictor
prediction_length = 28

model = TimeSeriesPredictor(
    path="autogluon_ts_predictor_using_covariates",
    target="target",
    prediction_length=prediction_length,
    freq="D",
    known_covariates_names=["day_of_week", "is_weekend"],
)

# "temp" will be automatically detected as "past covariate"
# since it is not a target and not a known covariate
model.fit(
    train_data=ts_train,
    presets="medium_quality",
)

## Use time series predictor with known covariates

In [None]:
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

prediction_length = 28
path = "autogluon_ts_predictor_using_covariates"
model = TimeSeriesPredictor.load(path)

# History: training data
df_train = pd.read_csv("timeseries_train.csv", parse_dates=["timestamp"])
ts_train = TimeSeriesDataFrame.from_data_frame(
    df_train, id_column="series_id", timestamp_column="timestamp"
)

# 1) Create the EXACT future index AutoGluon needs (per series)
future_known = model.make_future_data_frame(ts_train)

# 2) Fill known covariates for that future index
# future_known holds the (item_id, timestamp) pairs of the forecast horizon.
# reset_index() makes sure item_id and timestamp are ordinary columns,
# whether or not they were returned as an index.
fk = future_known.reset_index()
fk["day_of_week"] = fk["timestamp"].dt.dayofweek
fk["is_weekend"] = (fk["day_of_week"] >= 5).astype(int)

# Keep required columns + rename item_id back to your id column if you want (not necessary)
future_known_covariates = TimeSeriesDataFrame.from_data_frame(
    fk[["item_id", "timestamp", "day_of_week", "is_weekend"]],
    id_column="item_id",
    timestamp_column="timestamp",
)

# 3) Predict
preds = model.predict(ts_train, known_covariates=future_known_covariates)

In [None]:
future_known.head()

In [None]:
fk.head()

In [None]:
future_known_covariates.head()

In [None]:
preds

## Plot predictions

In [None]:
import matplotlib.pyplot as plt

# --- Plot helper ---
def plot_series(item_id: str, history_days: int = 120, lo_q="0.1", hi_q="0.9"):
    # history tail
    hist = ts_train.loc[item_id].reset_index()  # timestamp, target, ...
    hist_tail = hist.tail(history_days)

    # forecast
    fcst = preds.loc[item_id].reset_index()  # timestamp, mean, 0.1, 0.9, ...

    plt.figure()
    plt.plot(hist_tail["timestamp"], hist_tail["target"], label=f"{item_id} history")
    plt.plot(fcst["timestamp"], fcst["mean"], label=f"{item_id} forecast (mean)")

    # Optional: prediction interval if quantile columns exist
    if lo_q in fcst.columns and hi_q in fcst.columns:
        plt.fill_between(fcst["timestamp"], fcst[lo_q], fcst[hi_q], alpha=0.2, label=f"PI [{lo_q}, {hi_q}]")

    plt.axvline(hist_tail["timestamp"].iloc[-1], linestyle="--", label="train end")
    plt.title(f"Series {item_id}: history + {prediction_length}-day forecast")
    plt.legend()
    plt.tight_layout()
    plt.show()

# --- Plot A and B ---
plot_series("A", history_days=120)
plot_series("B", history_days=120)

# TimeSeriesPredictor predicts quantiles

Quantile prediction means that a model forecasts a range of possible future values, not just a single number.

A quantile answers the question: “Below which value will the target fall with probability q?”

For example, the 0.1 quantile means there is a 10 % chance the true value will be below that number.

The 0.5 quantile (median) splits the uncertainty in half: the outcome is equally likely to be above or below it.
This is often more robust than the mean when forecasts are skewed.

A 0.9 quantile means there is a 90 % chance the true value will be below that value.
Only 10 % of outcomes are expected to exceed it.

The interval between two quantiles (for example 0.1 and 0.9) forms a prediction interval.
In this case, the model expects the true value to lie inside that band 80 % of the time.
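The 80 % figure can be checked numerically. The sketch below draws samples from a normal distribution (an arbitrary stand-in for "possible future values", chosen only for illustration), computes the 0.1, 0.5 and 0.9 quantiles with NumPy, and measures which fraction of the samples falls between the 0.1 and 0.9 quantiles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the distribution of possible future values (assumed, for illustration)
samples = rng.normal(loc=100.0, scale=15.0, size=100_000)

q10, q50, q90 = np.quantile(samples, [0.1, 0.5, 0.9])

# Fraction of samples inside the [0.1, 0.9] prediction interval
inside = np.mean((samples >= q10) & (samples <= q90))

print(f"0.1 quantile: {q10:.1f}, median: {q50:.1f}, 0.9 quantile: {q90:.1f}")
print(f"Share of values inside the interval: {inside:.3f}")
```

The printed share should be close to 0.8, matching 0.9 - 0.1.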

If the interval is wide, the model is uncertain; if it is narrow, the model is confident.

Quantile forecasts are especially useful when decision-making must account for risk.

They allow planners to choose conservative, average, or aggressive strategies based on different quantiles.
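As a rough sketch of how such a choice might look in code, the example below maps a planning strategy to a quantile column of a forecast table. The column names `"0.1"`, `"0.5"` and `"0.9"` follow the quantile columns used in the plotting helper above. The numbers, the helper `planned_supply`, and the mapping of strategies to quantiles are assumptions for illustration; here "conservative" means providing enough supply to cover high demand.

```python
import pandas as pd

# Hypothetical forecast with quantile columns
forecast = pd.DataFrame(
    {
        "mean": [105.0, 98.0],
        "0.1": [80.0, 75.0],
        "0.5": [104.0, 97.0],
        "0.9": [130.0, 122.0],
    },
    index=pd.date_range("2026-01-01", periods=2, freq="D"),
)

# Assumed mapping from planning strategy to quantile column
STRATEGY_TO_QUANTILE = {"aggressive": "0.1", "average": "0.5", "conservative": "0.9"}

def planned_supply(forecast, strategy):
    """Return the forecast column that matches the chosen strategy."""
    return forecast[STRATEGY_TO_QUANTILE[strategy]]

for strategy in STRATEGY_TO_QUANTILE:
    print(strategy, planned_supply(forecast, strategy).tolist())
```

Which quantile counts as "conservative" depends on the cost of being wrong: if oversupply is the costly error, the mapping would be reversed.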

In short, quantile prediction turns forecasting into probabilistic decision support rather than a single-point guess.