# Datasets

This notebook shows how to prepare and explore existing datasets with smart meter and weather data.

## Preparing data sources

Cumulant Learning requires an instance of the `culearn.base.PredictionDataset` class to represent a dataset with X and Y variables. You can either create this instance directly or use one of the existing implementations of the `culearn.data.DataSource` class, which create predefined datasets for you:

In [None]:
from culearn.data import *

sources = [
    LCL('../data/LCL'),
    REFIT('../data/REFIT'),
    SGSC('../data/SGSC'),
    UMass('../data/UMass')
]

The existing data sources create datasets with weather data as X variables and smart meter data as Y variables. `LCL`, `REFIT`, and `SGSC` use [meteostat.net](https://meteostat.net/) as the default weather data source, while `UMass` uses the weather data provided alongside the smart meter data. Alternatively, you can replace [meteostat.net](https://meteostat.net/) with [worldweatheronline.com](https://www.worldweatheronline.com/) like this:
```python
api_key = '<YOUR_API_KEY>'
sources = [
    LCL('../data/LCL', WorldWeather, api_key=api_key),
    REFIT('../data/REFIT', WorldWeather, api_key=api_key),
    SGSC('../data/SGSC', WorldWeather, api_key=api_key),
    UMass('../data/UMass')
]
```

If neither of these weather data providers suits your use case, you can implement your own subclass of the `culearn.data.Weather` class and pass it to the data sources instead.

## Loading datasets

To create a dataset from a data source, call its `.dataset()` method. This downloads smart meter and weather data for load forecasting from external data sources, unzips the data files, and splits the larger CSV files with multiple time series into smaller files with individual time series to support parallel processing. The first call might take a while, but it makes the rest of the process much faster. There is one exception: `REFIT` currently does not support automatic download, so it raises an exception instructing you to download the data manually. You can create the dataset once the files are in place.
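
The splitting step can be pictured with a small standard-library sketch. It is not the code that `culearn` runs internally; the function name `split_by_series`, the column name `meter_id`, and the sample rows are all made up for illustration. It only shows the general idea of turning one CSV with several time series into one file per series.

```python
import csv
import tempfile
from pathlib import Path


def split_by_series(source_csv, target_dir, id_column):
    """Write one CSV per distinct value of id_column and return the new paths."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # series id -> open file handle
    writers = {}  # series id -> csv.DictWriter
    try:
        with open(source_csv, newline='') as source:
            reader = csv.DictReader(source)
            for row in reader:
                series_id = row[id_column]
                if series_id not in writers:
                    # First row of a new series: open its file and write the header.
                    handle = open(target_dir / f'{series_id}.csv', 'w', newline='')
                    handles[series_id] = handle
                    writers[series_id] = csv.DictWriter(handle, fieldnames=reader.fieldnames)
                    writers[series_id].writeheader()
                writers[series_id].writerow(row)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(target_dir / f'{series_id}.csv' for series_id in writers)


# Tiny demonstration with two made-up meters in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
combined = work_dir / 'combined.csv'
combined.write_text(
    'meter_id,timestamp,kwh\n'
    'A,2013-01-01 00:00,0.12\n'
    'B,2013-01-01 00:00,0.30\n'
    'A,2013-01-01 00:30,0.15\n'
)
parts = split_by_series(combined, work_dir / 'split', 'meter_id')
print([path.name for path in parts])
```

Once every series lives in its own file, each file can be read by a separate worker without scanning the combined file again. The sketch keeps one file handle open per series, which is fine for a handful of meters but would need batching for many thousands.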

In [None]:
datasets = {}
for source in sources:
    dataset_name = type(source).__name__
    print(f'Preparing {dataset_name} data.')
    datasets[dataset_name] = source.dataset()
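
Because `REFIT` raises an exception until its files have been downloaded manually, the loop above would stop at that source and never reach the remaining ones. One possible way around this is to collect failures and continue. The sketch below assumes only that each source has a `.dataset()` method. `load_all`, `ReadySource`, and `ManualDownloadSource` are hypothetical names, and the stand-in classes exist only so the example runs without `culearn` or any downloads. The exception type raised by `REFIT` is not stated here, so the sketch catches `Exception` broadly.

```python
def load_all(sources):
    """Call .dataset() on every source and collect failures instead of stopping."""
    datasets, failures = {}, {}
    for source in sources:
        name = type(source).__name__
        try:
            datasets[name] = source.dataset()
        except Exception as error:  # broad on purpose: the exact exception type is an unknown here
            failures[name] = str(error)
    return datasets, failures


# Hypothetical stand-ins for real data sources.
class ReadySource:
    def dataset(self):
        return 'dataset placeholder'


class ManualDownloadSource:
    def dataset(self):
        raise FileNotFoundError('Please download the data manually.')


loaded, failed = load_all([ReadySource(), ManualDownloadSource()])
print('Loaded:', list(loaded))
for name, message in failed.items():
    print(f'Skipped {name}: {message}')
```

With the real sources, `load_all(sources)` would be called in place of the loop, and the messages in `failed` would show which datasets still need a manual download.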

## Exploring data

Each `culearn.base.PredictionDataset` contains X variables in one `pandas.DataFrame` instance and Y variables in a collection of `culearn.csv.TimeSeriesCSV` instances. For each Y variable, the Y values can be accessed via the `.stream()` and `.series()` methods. The `.stream()` method returns the Y values iteratively as a collection of `culearn.base.TimeSeriesTuple` instances (for stream processing), while the `.series()` method returns them all at once as one `pandas.Series` instance (for batch processing).

In [None]:
for dataset_name, dataset in datasets.items():
    print(dataset_name)

    print(f'All {len(dataset.x.columns)} X variables:')
    display(dataset.x)

    print(f'1 of {len(dataset.y)} Y variables:')
    for y in dataset.y:
        print('Y tuple:')
        for y_tuple in y.stream():
            print(y_tuple)
            break

        print('Y series:')
        display(y.series().to_frame())
        break
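
The difference between stream and batch access can be illustrated without `culearn`. In the sketch below, `Reading` is a hypothetical stand-in for a time series tuple (the real `culearn.base.TimeSeriesTuple` may have different fields), and `stream()` and `series()` are plain functions over made-up values rather than the library's methods. Both paths compute the same mean: the stream version updates it one value at a time, while the batch version works on the full series.

```python
from collections import namedtuple
from statistics import mean

# Hypothetical stand-in for a time series tuple with a timestamp and a value.
Reading = namedtuple('Reading', ['timestamp', 'value'])

# Made-up half-hourly readings.
RAW = [('00:00', 0.12), ('00:30', 0.15), ('01:00', 0.30)]


def stream():
    """Yield readings one at a time (stream processing)."""
    for timestamp, value in RAW:
        yield Reading(timestamp, value)


def series():
    """Return all values at once (batch processing)."""
    return [value for _, value in RAW]


# Stream: update a running mean without keeping earlier readings in memory.
count, running_mean = 0, 0.0
for reading in stream():
    count += 1
    running_mean += (reading.value - running_mean) / count

# Batch: the whole series is available, so a single call is enough.
batch_mean = mean(series())

print(round(running_mean, 6), round(batch_mean, 6))
```

Streaming suits time series that are too long to hold in memory or that arrive incrementally, whereas batch access is more convenient when the whole series fits in memory and vectorized operations are wanted.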