## How to Forecast Time Series Data using any Supervised Learning Model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/cleanlab-tools/blob/time-series-automl-notebook/time_series_automl/time_series_automl.ipynb)

This notebook shows how to forecast daily energy consumption levels by transforming a time series dataset into a tabular format using open-source libraries. We then fit a popular multiclass classification model and use AutoML with cleanlab to significantly boost our out-of-sample accuracy.

At a high level we will:

- Establish a baseline accuracy by fitting a Prophet forecasting model on our time series data
- Convert our time series data into a tabular format using open-source featurization libraries, then show that a multiclass classification approach outperforms our Prophet model by **38 percentage points in out-of-sample accuracy**
- Use cleanlab’s AutoML platform for multiclass classification to **improve our out-of-sample accuracy by a further 8 percentage points** compared to our classification model and by **46 percentage points** compared to our forecasting model
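The gains quoted above are absolute differences in accuracy, which can be checked against the three accuracy values printed later in this notebook. A minimal sketch of that arithmetic (the helper name `gain_in_points` is ours, not part of the notebook):

```python
# Accuracies reported in the cells further down this notebook.
acc_prophet = 0.4249   # Prophet forecast, binned into quartiles
acc_gbc = 0.8075       # GradientBoostingClassifier on tabular features
acc_automl = 0.8880    # Cleanlab Studio AutoML predictions


def gain_in_points(new_acc, old_acc):
    """Absolute accuracy gain, expressed in percentage points."""
    return (new_acc - old_acc) * 100


gbc_vs_prophet = gain_in_points(acc_gbc, acc_prophet)
automl_vs_gbc = gain_in_points(acc_automl, acc_gbc)
automl_vs_prophet = gain_in_points(acc_automl, acc_prophet)

print(f"GBC vs Prophet:    {gbc_vs_prophet:.2f} points")
print(f"AutoML vs GBC:     {automl_vs_gbc:.2f} points")
print(f"AutoML vs Prophet: {automl_vs_prophet:.2f} points")
```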

## Initialize time series data for Prophet

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None 

data = pd.read_csv('PJME_hourly.csv', parse_dates=['Datetime'], index_col='Datetime')

# Resample the hourly readings in `data` to daily means
daily_data = data.resample('D').mean() 

# Prepare data for Prophet
daily_data.reset_index(inplace=True)
daily_data.columns = ['ds', 'y']

## Initialize time series data for featurization into a tabular format

In [2]:
from sklearn.model_selection import train_test_split

# Move the datetime index back into a regular column
data["Datetime"] = data.index
data = data.reset_index(drop=True)

# Create copy for multiclass data 
df = data.copy()

# Convert the datetime column
df['Datetime'] = pd.to_datetime(df['Datetime'])  # Adjust the 'datetime' column name as necessary
df = df.sort_values('Datetime').reset_index(drop=True)


# Obtain day and hour
df['Date'] = pd.to_datetime(df['Datetime']).dt.floor('D')  
df['Hour'] = pd.to_datetime(df['Datetime']).dt.hour

# Create multi-index feature df to compute time series features on
features = df.set_index(['Date', 'Hour'])  
features.drop("Datetime", inplace=True, axis=1)

# Split the data into training and testing sets, respecting the temporal order
X_train, X_test, y_train, y_test = train_test_split(features, features["PJME_MW"], test_size=0.2, shuffle=False)

# Get group lengths
train_lengths = X_train.groupby(level=0).size()
test_lengths = X_test.groupby(level=0).size()

# Obtain common length value for train/test data
train_common_length = train_lengths.mode().iloc[0]
test_common_length = test_lengths.mode().iloc[0]

# Filter train/test data to groups with same common length for featurizer
X_train = X_train.groupby(level=0).filter(lambda x: len(x) == train_common_length)
X_test = X_test.groupby(level=0).filter(lambda x: len(x) == test_common_length)

# Create quartiles based on training data to avoid leakage
quartiles = [X_train['PJME_MW'].quantile(q) for q in [0.25, 0.50, 0.75]]
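To make the role of these thresholds concrete, here is a small sketch on toy numbers (not PJME data) of how cut points computed from training values alone are later used to bin unseen values into classes 1 to 4. The variable names `train_values`, `test_values`, and `thresholds` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for training and test loads (illustrative values only).
train_values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
test_values = pd.Series([5, 35, 55, 95])

# Thresholds are computed from the training values only, so nothing
# about the test distribution leaks into the class definitions.
thresholds = [train_values.quantile(q) for q in [0.25, 0.50, 0.75]]

# Open-ended outer bins make sure every value falls into some class.
bins = [-np.inf] + thresholds + [np.inf]
test_labels = pd.cut(test_values, bins=bins, labels=[1, 2, 3, 4]).astype(int)

print(thresholds)
print(test_labels.tolist())
```

Because the bins are fixed by the training data, the test set need not contain equal shares of each class.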

## Train and Evaluate Prophet Forecasting Model

In [3]:
# Cutoff date at 2015-04-09
cutoff_index = int(len(daily_data) * 0.8)

# Use 80% of data for training set and 20% for test set
train_df = daily_data.iloc[:cutoff_index]
test_df = daily_data.iloc[cutoff_index:]

print("Training Set Shape:", train_df.shape)
print("Testing Set Shape:", test_df.shape)

Training Set Shape: (4847, 2)
Testing Set Shape: (1212, 2)


In [4]:
train_df.tail()

| | ds | y |
|---|---|---|
| 4842 | 2015-04-05 | 24577.5 |
| 4843 | 2015-04-06 | 26996.666667 |
| 4844 | 2015-04-07 | 27177.833333 |
| 4845 | 2015-04-08 | 29136.041667 |
| 4846 | 2015-04-09 | 30535.291667 |


In [5]:
test_df.head()

| | ds | y |
|---|---|---|
| 4847 | 2015-04-10 | 29190.166667 |
| 4848 | 2015-04-11 | 24774.291667 |
| 4849 | 2015-04-12 | 24407.625 |
| 4850 | 2015-04-13 | 26825.333333 |
| 4851 | 2015-04-14 | 26952.125 |


In [6]:
import numpy as np
from prophet import Prophet
from sklearn.metrics import accuracy_score

# Initialize model and train it on training data
model = Prophet()
model.fit(train_df)

# Create a dataframe for future predictions covering the test period
future = model.make_future_dataframe(periods=len(test_df), freq='D')
forecast = model.predict(future)

# Categorize forecasted daily values into quartiles based on the thresholds
forecast['quartile'] = pd.cut(forecast['yhat'], bins = [-np.inf] + list(quartiles) + [np.inf], labels=[1, 2, 3, 4])

# Extract the forecasted quartiles for the test period
forecasted_quartiles = forecast.iloc[-len(test_df):]['quartile'].astype(int)


# Categorize actual daily values in the test set into quartiles
test_df['quartile'] = pd.cut(test_df['y'], bins=[-np.inf] + list(quartiles) + [np.inf], labels=[1, 2, 3, 4])
actual_test_quartiles = test_df['quartile'].astype(int)


# Calculate the evaluation metrics
accuracy = accuracy_score(actual_test_quartiles, forecasted_quartiles)

# Print the evaluation metrics
print(f'Accuracy: {accuracy:.4f}')



Accuracy: 0.4249


## Convert time series data to tabular format through featurization

In [7]:
import tsfel
from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor

# Define tsfresh feature extractor
tsfresh_trafo = TSFreshFeatureExtractor(default_fc_parameters="minimal")

# Transform the training data using the feature extractor
X_train_transformed = tsfresh_trafo.fit_transform(X_train)

# Transform the test data using the same feature extractor
X_test_transformed = tsfresh_trafo.transform(X_test)

# Retrieves a pre-defined feature configuration file to extract all available features
cfg = tsfel.get_features_by_domain()

# Function to compute tsfel features per day
def compute_features(group):
    # TSFEL computes features for each column of the DataFrame it receives;
    # here `group` holds one day's hourly readings
    features = tsfel.time_series_features_extractor(cfg, group, fs=1, verbose=0)
    return features


# Group by the 'Date' level of the index and apply the feature computation
train_features_per_day = X_train.groupby(level='Date').apply(compute_features).reset_index(drop=True)
test_features_per_day = X_test.groupby(level='Date').apply(compute_features).reset_index(drop=True)

# Combine each featurization into a set of combined features for our train/test data
train_combined_df = pd.concat([X_train_transformed, train_features_per_day], axis=1)
test_combined_df = pd.concat([X_test_transformed, test_features_per_day], axis=1)

# Drop features that are highly correlated (|r| > 0.8) with the daily mean, which defines our target,
# along with features whose correlation is undefined (NaN)
column_of_interest = "PJME_MW__mean"
train_corr_matrix = train_combined_df.corr()
train_corr_with_interest = train_corr_matrix[column_of_interest]
null_corrs = pd.Series(train_corr_with_interest.isnull())
false_features = null_corrs[null_corrs].index.tolist()

columns_to_exclude = list(set(train_corr_with_interest[abs(train_corr_with_interest) > 0.8].index.tolist() + false_features))
columns_to_exclude.remove(column_of_interest)

# Filtered DataFrame excluding columns with high correlation to the column of interest
train_combined_df = train_combined_df.drop(columns=columns_to_exclude)
test_combined_df = test_combined_df.drop(columns=columns_to_exclude)

Feature Extraction: 100%|███████████████████| 4817/4817 [00:00<00:00, 8337.50it/s]
Feature Extraction: 100%|███████████████████| 1205/1205 [00:00<00:00, 9386.94it/s]
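The core idea of the cell above is that each day's block of hourly readings is collapsed into a single row of summary features, giving one tabular example per day. The sketch below illustrates that idea with plain pandas aggregations on synthetic data; it is a simplified stand-in, not the tsfresh or TSFEL feature sets used in this notebook, and the column name `load` is hypothetical.

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly readings (illustrative values, not PJME data).
timestamps = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(48), unit="h")
hourly = pd.DataFrame({"Datetime": timestamps, "load": np.arange(48, dtype=float)})

# Tag every reading with its calendar day, as done for the real data.
hourly["Date"] = hourly["Datetime"].dt.floor("D")

# Collapse each day's 24 readings into one row of summary features.
daily_features = hourly.groupby("Date")["load"].agg(["mean", "std", "min", "max"])

print(daily_features)
```

Each row of `daily_features` plays the role of one training example; the featurization libraries follow the same pattern with a much larger set of statistics.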


In [14]:
# Define a function to classify each value into a quartile
def classify_into_quartile(value):
    if value < quartiles[0]:
        return 1  
    elif value < quartiles[1]:
        return 2  
    elif value < quartiles[2]:
        return 3  
    else:
        return 4  

X_train_transformed = train_combined_df.copy()
X_test_transformed = test_combined_df.copy()

y_train = X_train_transformed["PJME_MW__mean"]
X_train_transformed.drop("PJME_MW__mean", inplace=True, axis=1)

y_test = X_test_transformed["PJME_MW__mean"]
X_test_transformed.drop("PJME_MW__mean", inplace=True, axis=1)

y_train_labels = y_train.apply(classify_into_quartile)
y_test_labels = y_test.apply(classify_into_quartile)

X_train_transformed.head()

Unnamed: 0,PJME_MW__standard_deviation,PJME_MW__variance,PJME_MW_Centroid,PJME_MW_Entropy,PJME_MW_FFT mean coefficient_0,PJME_MW_FFT mean coefficient_1,PJME_MW_FFT mean coefficient_10,PJME_MW_FFT mean coefficient_11,PJME_MW_FFT mean coefficient_12,PJME_MW_FFT mean coefficient_2,...,PJME_MW_Spectral roll-off,PJME_MW_Spectral skewness,PJME_MW_Spectral slope,PJME_MW_Spectral spread,PJME_MW_Spectral variation,PJME_MW_Standard deviation,PJME_MW_Variance,PJME_MW_Wavelet entropy,PJME_MW_Wavelet variance_4,PJME_MW_Wavelet variance_5
0,4097.961271,16793290.0,12.727435,1.0,5207144.0,173670400.0,45827.482018,79791.456899,56591.123457,175815700.0,...,0.083333,5.166394,-0.74539,0.050256,0.247164,4097.961271,16793290.0,1.896687,166480600.0,222922500.0
1,3718.008117,13823580.0,12.554067,1.0,3425893.0,156421900.0,65442.702058,94361.095951,20417.234568,145574600.0,...,0.083333,5.700734,-0.751053,0.046701,0.234079,3718.008117,13823580.0,1.89596,153428700.0,209953900.0
2,3241.304817,10506060.0,12.395692,1.0,2600067.0,101540900.0,1155.348647,71254.542792,19028.669753,122764400.0,...,0.083333,5.583426,-0.753413,0.044695,0.280831,3241.304817,10506060.0,1.89528,137727900.0,185287300.0
3,2259.37171,5104761.0,12.204,1.0,42196.01,65022750.0,35704.669659,92795.408167,11306.777778,52567380.0,...,0.083333,5.868139,-0.750643,0.052614,0.267369,2259.37171,5104761.0,1.898934,103162300.0,126931400.0
4,3250.463504,10565510.0,12.751234,1.0,767862.7,230744500.0,8367.989724,2063.814511,1591.123457,30337230.0,...,0.041667,6.800978,-0.762032,0.036911,0.058819,3250.463504,10565510.0,1.893653,154753100.0,173575200.0


## Train and Evaluate GradientBoostingClassifier Model on multiclass tabular data

In [12]:
print(y_train_labels.value_counts())

PJME_MW__mean
2    1795
3    1333
4    1023
1     666
Name: count, dtype: int64


In [9]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=4,
    min_samples_leaf=20,
    max_features='sqrt',
    subsample=0.8,
    random_state=42
)

gbc.fit(X_train_transformed, y_train_labels)


y_pred_gbc = gbc.predict(X_test_transformed)
print(f'Accuracy: {accuracy_score(y_test_labels, y_pred_gbc):.4f}')

Accuracy: 0.8075
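Since the class counts printed above are uneven, a single accuracy number can hide weak performance on the smaller classes. One way to look closer, sketched here on toy labels rather than the notebook's predictions, is a confusion table with per-class recall. This check is our suggestion and not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy labels standing in for y_test_labels and y_pred_gbc (illustrative only).
actual = pd.Series([1, 1, 2, 2, 2, 3, 3, 4], name="actual")
predicted = pd.Series([1, 2, 2, 2, 3, 3, 3, 4], name="predicted")

# Rows are the true quartile, columns the predicted quartile.
confusion = pd.crosstab(actual, predicted)

# Recall per class: correct predictions divided by the number of true members.
per_class_recall = np.diag(confusion.values) / confusion.values.sum(axis=1)

print(confusion)
print(dict(zip(confusion.index.tolist(), per_class_recall.round(3).tolist())))
```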


## Using AutoML with Cleanlab Studio to improve out-of-sample accuracy

Use your API key to instantiate a Studio object, which can be used to analyze your dataset.

In [None]:
from cleanlab_studio import Studio

# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

# initialize studio object
studio = Studio(API_KEY)

Next load the dataset into Cleanlab Studio (more details/options can be found in [this guide](https://help.cleanlab.ai/guide/quickstart/api/#uploading-a-dataset)). This may take a while for big datasets.

In [None]:
# dataset_path is not defined earlier in this notebook; set it to the file you want to upload
dataset_id = studio.upload_dataset(dataset_path, dataset_name="YOUR DATASET NAME HERE")
print(f"Dataset ID: {dataset_id}")
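The notebook does not show how the uploaded file was prepared. One plausible approach, sketched here as an assumption, is to join the tabular features with their quartile labels and write them to a CSV. The frames below are toy stand-ins for `X_train_transformed` and `y_train_labels`, and both the `quartile_label` column name and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for X_train_transformed and y_train_labels (illustrative only).
toy_features = pd.DataFrame(
    {"feature_a": [0.1, 0.2, 0.3], "feature_b": [10.0, 20.0, 30.0]}
)
toy_labels = pd.Series([1, 2, 4])

# Put features and label side by side; "quartile_label" is a hypothetical name
# that would then be passed as the label column when creating a project.
upload_df = toy_features.copy()
upload_df["quartile_label"] = toy_labels.values

# Write to a temporary location; in practice choose any path you like.
dataset_path = os.path.join(tempfile.mkdtemp(), "train_upload.csv")
upload_df.to_csv(dataset_path, index=False)

print(dataset_path)
```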

Now you can create a project using this dataset. 

In [None]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="YOUR PROJECT NAME HERE",
    modality="tabular",
    task_type="multi-class",
    model_type="regular",
    label_column="YOUR LABEL COLUMN HERE",
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

You can then deploy a model using your Cleanlab Studio project to get AutoML results. [This guide](https://help.cleanlab.ai/tutorials/inference_api/) is a useful reference to learn more about model deployment. Below is some example code to run inference on a model once you have deployed it.

In [None]:
# load model from Studio
# you can find your model ID in the models table on the dashboard!
model_id = "<YOUR_MODEL_ID>"
model = studio.get_model(model_id)

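# test_data is not defined earlier in this notebook; it is assumed to be a DataFrame
# of test features with the same columns as the uploaded training data
predictions = model.predict(test_data, return_pred_proba=True)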

In [16]:
y_pred_automl_cleanlab = pd.read_csv("quartile-multiclass-pjme-testing-data_pred_probs.csv")
y_pred_automl_cleanlab = y_pred_automl_cleanlab["Suggested Label"]

In [15]:
print(f'Accuracy: {accuracy_score(y_test_labels, y_pred_automl_cleanlab):.4f}')

Accuracy: 0.8880
