# Bonus assignment

**Joris LIMONIER**

---

In this assignment, we try to predict the monthly number of airline passengers over time, using the airline passengers dataset.


## Data Preprocessing

In [1]:
from pathlib import Path

import airline_passengers as ap
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from torch.utils.data import DataLoader, Dataset, TensorDataset
from tqdm import tqdm

pio.templates.default = "plotly_white"



In [2]:
%reload_ext autoreload
%autoreload 2

### Load the dataset


In [3]:
filepath = Path("airline_passenger.txt")
passengers = pd.read_csv(
  filepath,
  parse_dates=["date"],
  names=["date", "passengers"],
  index_col="date",
  header=0,
  dtype={"passengers": "float32"},
)
passengers

date,passengers
1949-01-01,112.0
1949-02-01,118.0
1949-03-01,132.0
1949-04-01,129.0
1949-05-01,121.0
...,...
1960-08-01,606.0
1960-09-01,508.0
1960-10-01,461.0
1960-11-01,390.0


### Split the dataset into train and test
We use 1/3 of the dataset for testing. The remaining 2/3 is further split into training and validation.

We also scale the data using statistics computed on the training set only, because the validation and test sets must not be used for anything other than model evaluation.
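As a rough illustration of this procedure, here is a minimal sketch of a chronological split followed by min-max scaling fitted on the training split only. The helpers `chronological_split` and `scale_wrt_train` are hypothetical stand-ins written for this example. They are not the actual `ap.ttv_split` and `ap.scale_wrt`, whose implementations are not shown here, so the split sizes they produce need not match the output of the next cell.

```python
import numpy as np
import pandas as pd


def chronological_split(df, val_size, test_size):
    """Split a time-indexed frame into train/val/test without shuffling (hypothetical helper)."""
    n_test = int(round(len(df) * test_size))
    n_val = int(round((len(df) - n_test) * val_size))
    n_train = len(df) - n_test - n_val
    return (
        df.iloc[:n_train],
        df.iloc[n_train:n_train + n_val],
        df.iloc[n_train + n_val:],
    )


def scale_wrt_train(train, val, test):
    """Min-max scale all splits with the min and max of the training split only."""
    lo, hi = train.min(), train.max()

    def scale(part):
        # Assumes hi > lo for every column (no constant columns in train)
        return (part - lo) / (hi - lo)

    return scale(train), scale(val), scale(test)


# Toy monthly series standing in for the airline data (values are illustrative)
toy = pd.DataFrame(
    {"passengers": np.arange(1.0, 13.0)},
    index=pd.date_range("1949-01-01", periods=12, freq="MS"),
)
toy_train, toy_val, toy_test = chronological_split(toy, val_size=0.25, test_size=1 / 3)
toy_train, toy_val, toy_test = scale_wrt_train(toy_train, toy_val, toy_test)
```

In this toy example the series trends upward, so the scaled validation and test values exceed 1: the scaler only knows the range seen during training.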

In [4]:
val_size = 0.1  # fraction of the remaining (non-test) data held out for validation
test_size = 1 / 3  # fraction of the full dataset held out for testing

train, val, test = ap.ttv_split(df=passengers, val_size=val_size, test_size=test_size)

# Scale the data
train, val, test = ap.scale_wrt(train, val, test, wrt=train, feature_range=(0, 1))
print(f"{len(train) = }, {len(val) = }, {len(test) = }")


len(train) = 81, len(val) = 15, len(test) = 48


We plot the train, validation and test sets with different colors.

In [5]:
ap.plot_tts(train=train, val=val, test=test)

## Model

We define the model and start training.
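For orientation, a model of this kind could look like the sketch below: a stacked LSTM followed by a linear layer that maps the last hidden state to one output. `TinyLSTMRegressor` is a hypothetical class written for this example. The real `ap.PassengerLSTM` is not shown in this notebook and may be structured differently.

```python
import torch
import torch.nn as nn


class TinyLSTMRegressor(nn.Module):
    """Hypothetical stand-in for a model such as `ap.PassengerLSTM`."""

    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.0):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,
            batch_first=True,  # inputs are shaped (batch, sequence length, features)
        )
        self.head = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)  # out: (batch, sequence length, hidden_size)
        return self.head(out[:, -1, :])  # predict from the last time step only


sketch_model = TinyLSTMRegressor(input_size=1, hidden_size=8, num_layers=2, output_size=1)
sketch_out = sketch_model(torch.zeros(4, 1, 1))  # 4 windows of length 1 with 1 feature
```

The output has one prediction per input window, here a tensor of shape `(4, 1)`.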

We define constants to use for training:
- `seq_length`: the number of time steps to use for training, *i.e.* the number of previous months to use to predict the next month
- `n_epochs`: the maximum number of epochs to train for
- `batch_size`: the batch size to use for training
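To make the role of `seq_length` concrete, the sketch below builds sliding windows from a series: each input holds `seq_length` consecutive time steps and the target is the value that follows. `make_windows` is a hypothetical helper, and it assumes the target is the first column. How `ap.PassengerDataModule` actually builds its samples is not shown here.

```python
import numpy as np


def make_windows(values, seq_length):
    """Return (X, y): X[i] holds seq_length consecutive rows, y[i] the next row's first column."""
    values = np.asarray(values, dtype=np.float32)
    if values.ndim == 1:
        values = values[:, None]  # treat a 1-D series as a single feature
    n_samples = len(values) - seq_length
    X = np.stack([values[i:i + seq_length] for i in range(n_samples)])
    y = values[seq_length:, 0]  # assumption: the target is the first column
    return X, y


X_demo, y_demo = make_windows([10, 20, 30, 40, 50], seq_length=2)
# X_demo[0] is [[10], [20]] and its target y_demo[0] is 30
```

A series of `n` rows yields `n - seq_length` samples, which is consistent with the predictions later being indexed with `test.index[seq_length:]`.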

In [6]:
seq_length = 1
n_epochs = 100
batch_size = 2
n_features = train.shape[1]
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available

data_module = ap.PassengerDataModule(
  train=train,
  val=val,
  test=test,
  seq_length=seq_length,
  batch_size=batch_size,
  device=device,
)


# Create the model, optimizer and loss function
lstm = ap.PassengerLSTM(
  input_size=n_features, hidden_size=50, num_layers=4, output_size=1, device=device
)
optimizer = optim.AdamW(lstm.parameters(), lr=0.0002)
loss_fn = nn.MSELoss()


# Train the model
predictor = ap.PassengerPredictor(
  data_module=data_module, model=lstm, optimizer=optimizer, loss_fn=loss_fn
)
train_losses, val_losses = predictor.train(
  model=lstm,
  optimizer=optimizer,
  loss_fn=loss_fn,
  n_epochs=n_epochs,
)


# Plot the losses
losses = pd.DataFrame({"train": train_losses, "val": val_losses})
px.line(losses, y=["train", "val"], title="Losses", log_y=True)


Epoch 0: train loss 0.1209, val loss 0.5580
Epoch 10: train loss 0.0494, val loss 0.2322
Epoch 20: train loss 0.0458, val loss 0.2029
Epoch 30: train loss 0.0369, val loss 0.1313
Epoch 40: train loss 0.0168, val loss 0.0239
Epoch 50: train loss 0.0091, val loss 0.0269
Epoch 60: train loss 0.0083, val loss 0.0254
Epoch 70: train loss 0.0078, val loss 0.0229
Epoch 80: train loss 0.0075, val loss 0.0208
Epoch 90: train loss 0.0071, val loss 0.0192


In [7]:
y_pred = predictor.predict(model=lstm, dataloader=data_module.test_dataloader)
pred = pd.DataFrame(y_pred.flatten(), index=test.index[seq_length:], columns=["passengers"])

y_pred_val = predictor.predict(model=lstm, dataloader=data_module.val_dataloader)
pred_val = pd.DataFrame(y_pred_val.flatten(), index=val.index[seq_length:], columns=["passengers"])

fig = ap.plot_tts(train=train, val=val, test=test)
fig.add_trace(go.Scatter(x=pred.index, y=pred["passengers"], name="pred", mode="lines"))
fig.add_trace(go.Scatter(x=pred_val.index, y=pred_val["passengers"], name="pred_val", mode="lines"))

We see that the model's prediction for time $t+1$ is essentially the value it receives at time $t$. With `seq_length = 1` and the passenger count as the only feature, the model has no information other than the value at time $t$ with which to predict the value at time $t+1$. As a consequence, for a given value $v_t$ at time $t$, the model always predicts the same value $v_{t+1}$ at time $t+1$, wherever $v_t$ occurs in the series.
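One way to quantify this observation is to compare the model against a persistence baseline, which simply predicts $v_{t+1} = v_t$. The sketch below computes the mean squared error of that baseline. `persistence_mse` is a hypothetical helper and the input values are toy numbers, not results from this notebook.

```python
import numpy as np


def persistence_mse(series):
    """MSE of the naive forecast that predicts v_{t+1} = v_t."""
    series = np.asarray(series, dtype=float)
    forecast = series[:-1]  # value at time t, used as the prediction
    actual = series[1:]     # value actually observed at time t + 1
    return float(np.mean((actual - forecast) ** 2))


# Toy scaled values, for illustration only
baseline = persistence_mse([0.1, 0.2, 0.4, 0.3])
```

If the model's test error on the scaled series is close to this baseline computed on the same series, the model has learned little beyond copying its input.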

We can change this behavior by feeding more data into the model, such as the current year, the current month, the current season, etc. We will perform these changes in the next section.

## Feature engineering

### Add more features

We create extra features to feed into the model.


In [8]:
passengers_augmented = passengers.copy()
passengers_augmented["month"] = passengers_augmented.index.month
passengers_augmented["year"] = passengers_augmented.index.year
passengers_augmented["season"] = passengers_augmented.index.month % 12 // 3 + 1
passengers_augmented.head()

date,passengers,month,year,season
1949-01-01,112.0,1,1949,1
1949-02-01,118.0,2,1949,1
1949-03-01,132.0,3,1949,2
1949-04-01,129.0,4,1949,2
1949-05-01,121.0,5,1949,2


### Visualize the data
#### Number of passengers per month for each year

Let us visualize the number of passengers per month for each year.

In [9]:
px.line(
  passengers_augmented,
  x="month",
  y="passengers",
  color="year",
  title="Passengers by month",
).show()


We see that, for any given month, the number of passengers tends to increase from one year to the next (with rare exceptions, _e.g._ February 1953 had more passengers than February 1954). We expect the model to be able to learn this trend.

#### Number of passengers per month

Let us now visualize the number of passengers per month.


In [10]:
# Make a df with the month and number of passengers
passengers_month = (
  passengers_augmented.groupby("month").mean().reset_index().drop(columns="year")
)
px.bar(passengers_month, x="month", y="passengers", title="Average passengers by month").show()


We see that the number of passengers per month is not constant. Summer months tend to have more passengers than winter months. We expect that the model will be able to learn this behavior too.

#### Number of passengers per season

Let us now visualize the number of passengers per season. We encode the seasons as follows:
- 1 : winter
- 2 : spring
- 3 : summer
- 4 : fall
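The season code is obtained from the month number with the expression `month % 12 // 3 + 1` used when the `season` column was created. The short sketch below spells out the resulting mapping. `SEASON_NAMES` and `month_to_season` are names introduced for this example only.

```python
SEASON_NAMES = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}


def month_to_season(month):
    """Map a month number (1-12) to a season code (1-4), with December counted as winter."""
    # month % 12 sends December (12) to 0, so that Dec, Jan, Feb fall in the same bucket
    return month % 12 // 3 + 1


season_by_month = {m: SEASON_NAMES[month_to_season(m)] for m in range(1, 13)}
```

With this encoding, December, January and February are winter, and each following block of three months is the next season.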

In [11]:
passengers_season = (
  passengers_augmented.groupby("season").mean().reset_index()[["season", "passengers"]]
)
fig = px.bar(
  passengers_season, x="season", y="passengers", title="Average passengers by season"
)
# Add season number and name to the x-axis
fig.update_xaxes(
  ticktext=["Winter", "Spring", "Summer", "Autumn"],
  tickvals=[1, 2, 3, 4],
  title_text="Season",
)

fig.show()


We see that, as mentioned before, the number of passengers is higher in summer than in winter. Although the `month` feature is finer-grained, the `season` feature may give the model more general information.

### One-hot encoding

We replace the `month` column with a one-hot encoding of the month. This suits months because they come from a closed, cyclic set: months 12 (December) and 1 (January) are close in the real world but far apart as numbers. We do not do the same for the `year` column. Years will take values that were not present in the training set, and they are properly ordered, so two years that are close in our dataset are also close in the real world.

We also one-hot encode the season for the same reasons. Note that this produces a column equivalent to an `is_summer` feature (`season_3`), which could be useful and which we would otherwise have computed by hand.
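A minimal sketch of such an encoding with pandas is given below. It assumes that `ap.ohe` behaves like `pd.get_dummies` with `drop_first=True`. The absence of `month_1` and `season_1` columns in the output of the next cell is consistent with that assumption, but the actual implementation is not shown.

```python
import pandas as pd

# Three toy rows mimicking the augmented frame
demo = pd.DataFrame(
    {"passengers": [112.0, 118.0, 132.0], "month": [1, 2, 3], "season": [1, 1, 2]}
)
# drop_first=True removes the first level of each column (month_1, season_1),
# which then acts as the all-zeros reference category
demo_ohe = pd.get_dummies(demo, columns=["month", "season"], drop_first=True, dtype=int)
```

Dropping the first level avoids a redundant column, since a row with zeros in every `month_*` column can only be January.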


In [12]:
passengers_augmented = ap.ohe(df=passengers_augmented, columns=["month", "season"])
passengers_augmented.head()


date,passengers,year,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12,season_2,season_3,season_4
1949-01-01,112.0,1949,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1949-02-01,118.0,1949,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1949-03-01,132.0,1949,0,1,0,0,0,0,0,0,0,0,0,1,0,0
1949-04-01,129.0,1949,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1949-05-01,121.0,1949,0,0,0,1,0,0,0,0,0,0,0,1,0,0



#### Correlation between features

Let us now visualize the correlation between the features (dark colors mean high correlation).

In [13]:
# Compute the pairwise correlations between all columns (features and target)
corr = passengers_augmented.corr().round(4)
# Use a sequential blue color scale
fig = px.imshow(
  corr.values,
  color_continuous_scale="blues",
  color_continuous_midpoint=0,
  title="Correlation between features",
)

fig.update_xaxes(
  title="Features", ticktext=corr.columns, tickvals=np.arange(len(corr.columns))
)
fig.update_yaxes(
  title="Features", ticktext=corr.columns, tickvals=np.arange(len(corr.columns))
)
fig.show()

We see that the target (`passengers`) is highly correlated with the `year` feature. This is expected as the number of passengers tends to increase over time. We also see that the target is highly correlated with the `season_3` feature. This is also expected as this feature represents the summer season, which is the season with the most passengers. Finally, we see that the target is correlated with the `month_7` and `month_8` features. This is also expected as these features represent the months of July and August, which are the months with the most passengers.
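Reading these values off a heatmap can be imprecise, so it may help to rank the features by their correlation with the target. The sketch below does this on synthetic data built with an upward trend and a summer bump. The numbers are generated for illustration and are not the airline data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
year = np.repeat(np.arange(1949, 1953), 12)
month = np.tile(np.arange(1, 13), 4)
is_summer = np.isin(month, [6, 7, 8]).astype(int)
# Synthetic target: yearly trend plus a summer bump plus noise (not the real data)
passengers_toy = 100 + 30 * (year - 1949) + 40 * is_summer + rng.normal(0, 5, size=year.size)

toy_corr_df = pd.DataFrame(
    {"passengers": passengers_toy, "year": year, "season_3": is_summer}
)
# Correlation of every feature with the target, strongest first
target_corr = (
    toy_corr_df.corr()["passengers"].drop("passengers").sort_values(ascending=False)
)
```

Applied to the notebook's data, the same expression would be `passengers_augmented.corr()["passengers"]`.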


### Split the dataset into train, validation and test

We split the augmented dataset into train, validation and test sets, and scale it with respect to the training set as before.

In [14]:
val_size = 0.15  # fraction of the remaining (non-test) data held out for validation
test_size = 1 / 3  # fraction of the full dataset held out for testing

# Split the data
train, val, test = ap.ttv_split(df=passengers_augmented, val_size=val_size, test_size=test_size)

# Scale the data
train, val, test = ap.scale_wrt(train, val, test, wrt=train, feature_range=(0, 1))
print(f"{len(train) = }, {len(val) = }, {len(test) = }")


len(train) = 74, len(val) = 22, len(test) = 48


In [33]:
seq_length = 1
n_epochs = 1000
batch_size = 4
patience = 100
n_features = train.shape[1]
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available

data_module = ap.PassengerDataModule(
  train=train,
  val=val,
  test=test,
  seq_length=seq_length,
  batch_size=batch_size,
  device=device,
  target_col="passengers",
)


# Create the model, optimizer and loss function
lstm = ap.PassengerLSTM(
  input_size=n_features, hidden_size=100, num_layers=3, output_size=1, device=device, dropout=0.1
)
optimizer = optim.AdamW(lstm.parameters(), lr=0.00005)
loss_fn = nn.MSELoss()
early_stopping = ap.EarlyStopping(patience=patience, verbose=True)

# Train the model
predictor = ap.PassengerPredictor(
  data_module=data_module, model=lstm, optimizer=optimizer, loss_fn=loss_fn
)
train_losses, val_losses = predictor.train(
  model=lstm,
  optimizer=optimizer,
  loss_fn=loss_fn,
  n_epochs=n_epochs,
  early_stopping=early_stopping,
)


# Plot the losses
losses = pd.DataFrame({"train": train_losses, "val": val_losses})
px.line(losses, y=["train", "val"], title="Losses", log_y=True)


Epoch 0: train loss 0.2817, val loss 1.2756
Epoch 10: train loss 0.2083, val loss 1.0880
Epoch 20: train loss 0.1022, val loss 0.7359
Epoch 30: train loss 0.0464, val loss 0.3959
Epoch 40: train loss 0.0438, val loss 0.3226
Epoch 50: train loss 0.0385, val loss 0.2957
Epoch 60: train loss 0.0348, val loss 0.2710
Epoch 70: train loss 0.0332, val loss 0.2509
Epoch 80: train loss 0.0308, val loss 0.2318
Epoch 90: train loss 0.0265, val loss 0.2086
Epoch 100: train loss 0.0240, val loss 0.1835
Epoch 110: train loss 0.0211, val loss 0.1603
Epoch 120: train loss 0.0186, val loss 0.1386
Epoch 130: train loss 0.0149, val loss 0.1180
Epoch 140: train loss 0.0145, val loss 0.0962
Epoch 150: train loss 0.0117, val loss 0.0770
Epoch 160: train loss 0.0102, val loss 0.0611
EarlyStopping counter: 1 / 100
EarlyStopping counter: 2 / 100
Epoch 170: train loss 0.0083, val loss 0.0455
EarlyStopping counter: 1 / 100
EarlyStopping counter: 1 / 100
Epoch 180: train loss 0.0061, val loss 0.0346
EarlyStopping

In [34]:
# Predict first, so that the errors below are computed on this model's predictions
# rather than on the stale predictions of the previous model
y_pred = predictor.predict(model=lstm, dataloader=data_module.test_dataloader)
pred = pd.DataFrame(
  y_pred.flatten(), index=test.index[seq_length:], columns=["passengers"]
)

y_pred_val = predictor.predict(model=lstm, dataloader=data_module.val_dataloader)
pred_val = pd.DataFrame(
  y_pred_val.flatten(), index=val.index[seq_length:], columns=["passengers"]
)

# Compute the prediction errors on the test and validation sets
test_pred_error = ap.compute_pred_error(
  y_pred=y_pred,
  y_true=test["passengers"].values,
  loss_fn=loss_fn,
  seq_length=seq_length,
)
val_pred_error = ap.compute_pred_error(
  y_pred=y_pred_val,
  y_true=val["passengers"].values,
  loss_fn=loss_fn,
  seq_length=seq_length,
)

fig = ap.plot_tts(train=train, val=val, test=test)
fig.add_trace(go.Scatter(x=pred.index, y=pred["passengers"], name="pred", mode="lines"))
fig.add_trace(
  go.Scatter(x=pred_val.index, y=pred_val["passengers"], name="pred_val", mode="lines")
)
# fig.update_layout(title="Passengers prediction")
fig.layout.title.text += f"Val error: {val_pred_error:.4f}, Test error: {test_pred_error:.4f}"
fig


## Evaluation
