# Template for creating a dataset from an existing dataset using a single function

This example derives a new dataset from the `covid-19-epidemiology` dataset, which was created in the notebook showing [how to create a dataset from a single .csv file](Add-csv-template.ipynb).

This notebook imports from the `src` module. Replace `src` with the name of your own project module wherever it appears.

## Basic imports

In [None]:
# Basic utility functions
import logging
import pathlib
from functools import partial

from src.log import logger
from src.data import Dataset
from src import paths

# data functions
from src import workflow

In [None]:
# Optionally set to debug log level
#logger.setLevel(logging.DEBUG)

In [None]:
%load_ext autoreload
%autoreload 2

## Load existing dataset

In [None]:
ds = Dataset.load('covid-19-epidemiology')

In [None]:
ds.data.shape

In [None]:
print(ds.DESCR)

In [None]:
print(ds.LICENSE)

## Create the function to transform by

Let's do something extremely simple: subselect rows by `key`, a column that identifies a geographic region.

We will use this function to create a derived dataset, so let's save it in the project module (`src` in this case) in `transformer_functions.py`.

In [None]:
%%writefile -a ../../test-env/src/data/transformer_functions.py

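def subselect_by_key(df, key):
    """
    Filter dataframe by key and return resulting dataframe.

    Keeps only the rows whose `key` column equals `key`.
    """
    # Bracket access avoids any clash with DataFrame attribute names.
    return df[df["key"] == key]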

In [None]:
from src.data.transformer_functions import subselect_by_key

In [None]:
subselect_by_key.__module__
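
To make the behavior of the filter concrete, here is a minimal sketch on a toy DataFrame. The toy table and its values are made up for illustration and only borrow the `key`, `date`, and `new_confirmed` column names used in this notebook; the function is redefined so that the block runs on its own.

```python
import pandas as pd


def subselect_by_key(df, key):
    """Filter dataframe by key and return resulting dataframe."""
    return df[df["key"] == key]


# Toy stand-in for the epidemiology table; the values are made up.
toy_df = pd.DataFrame(
    {
        "key": ["CA", "US", "CA", "FR"],
        "date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"],
        "new_confirmed": [4, 20, 7, 12],
    }
)

ca_df = subselect_by_key(toy_df, "CA")
print(ca_df.shape)  # (2, 3): two CA rows, all three columns kept
```

Note that the filter keeps the original index labels of the selected rows rather than renumbering them.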

In [None]:
df = ds.data.copy()

For example, `CA` will give us the numbers for Canada:

In [None]:
key_df = subselect_by_key(df, 'CA')
key_df.shape

Here are some trends:

In [None]:
# Put the date on the x-axis rather than plotting against the row index
key_df.plot(x='date', y='new_confirmed');

In [None]:
# Put the date on the x-axis rather than plotting against the row index
key_df.plot(x='date', y='new_deceased');

## Create a derived dataset

Let's create a dataset that contains only the Canadian epidemiological numbers. To do so, we only need to apply a single function to the existing data.

Here is the information we need to create a dataset using `workflow.dataset_from_single_function()`:

    source_dataset_name
    dataset_name
    data_function
    added_descr_txt

For reproducibility, `data_function` should be defined in the project module (`src` in this case), as we have already done with `subselect_by_key` above.

In [None]:
key = 'CA'

In [None]:
source_dataset_name = 'covid-19-epidemiology'
dataset_name = f'covid-19-epidemiology-{key}'
data_function = partial(subselect_by_key, key=key)

In [None]:
added_descr_txt = f"""The dataset {dataset_name} is the subset of \
{source_dataset_name} restricted to the key {key}."""

In [None]:
# test out the function
data_function(df).shape
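
The `partial` call above turns the two-argument `subselect_by_key` into a function of the dataframe alone. A small sketch of what that binding does, using a made-up toy table and a local copy of the function so the block is self-contained:

```python
from functools import partial

import pandas as pd


def subselect_by_key(df, key):
    """Filter dataframe by key and return resulting dataframe."""
    return df[df["key"] == key]


# Toy data; the values are made up.
toy_df = pd.DataFrame({"key": ["CA", "US", "CA"], "new_confirmed": [4, 20, 7]})

# Bind `key` once; the resulting callable takes only the dataframe.
ca_only = partial(subselect_by_key, key="CA")

# Both calls select the same rows.
assert ca_only(toy_df).equals(subselect_by_key(toy_df, "CA"))

# The wrapped function and its bound arguments stay inspectable.
print(ca_only.func.__name__, ca_only.keywords)  # subselect_by_key {'key': 'CA'}
```

Unlike a `lambda`, a `partial` of a module-level function can be pickled and still points back to the named function. That is probably helpful if the workflow records the function for later re-runs, though how the project stores it is an assumption here.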

### Use the workflow function to create the derived dataset

In [None]:
ds = workflow.dataset_from_single_function(
        source_dataset_name=source_dataset_name,
        dataset_name=dataset_name,
        data_function=data_function,
        added_descr_txt=added_descr_txt,
        overwrite_catalog=True)
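
Conceptually, deriving a dataset with a single function amounts to applying `data_function` to the source data and extending the description. The sketch below illustrates that idea only: `ToyDataset` and `toy_derive` are hypothetical names, not part of the project module, and the real `workflow.dataset_from_single_function()` may work quite differently (for example, by recording the step in a catalog).

```python
from dataclasses import dataclass
from functools import partial

import pandas as pd


@dataclass
class ToyDataset:
    """Hypothetical stand-in for the project's Dataset class."""
    name: str
    data: pd.DataFrame
    DESCR: str
    LICENSE: str


def toy_derive(source, dataset_name, data_function, added_descr_txt):
    """Apply data_function to the source data and extend the description."""
    return ToyDataset(
        name=dataset_name,
        data=data_function(source.data),
        DESCR=f"{source.DESCR}\n\n{added_descr_txt}",
        LICENSE=source.LICENSE,  # assumption: the license carries over unchanged
    )


def subselect_by_key(df, key):
    """Filter dataframe by key and return resulting dataframe."""
    return df[df["key"] == key]


# Toy source dataset; all values are made up.
source = ToyDataset(
    name="toy-epidemiology",
    data=pd.DataFrame({"key": ["CA", "US", "CA"], "new_confirmed": [4, 20, 7]}),
    DESCR="Toy epidemiology table.",
    LICENSE="Toy license.",
)

derived = toy_derive(
    source,
    dataset_name="toy-epidemiology-CA",
    data_function=partial(subselect_by_key, key="CA"),
    added_descr_txt="Subset restricted to the key CA.",
)
print(derived.data.shape)  # (2, 2)
print(derived.DESCR)
```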

In [None]:
dataset_name

In [None]:
ds = Dataset.load(dataset_name)

In [None]:
ds.data.shape
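
Beyond checking the shape, it can be worth confirming that the reloaded data really contains only the requested key. A minimal sketch, where `check_single_key` is a hypothetical helper (not part of the project module) and a made-up toy frame stands in for `ds.data`:

```python
import pandas as pd


def check_single_key(df, key):
    """Raise ValueError if df is empty or holds rows with any other key."""
    if df.empty:
        raise ValueError(f"no rows found for key {key!r}")
    other = set(df["key"].unique()) - {key}
    if other:
        raise ValueError(f"unexpected keys present: {sorted(other)}")


# Toy stand-in for the derived dataset's data; the values are made up.
derived_data = pd.DataFrame({"key": ["CA", "CA"], "new_confirmed": [4, 7]})
check_single_key(derived_data, "CA")  # passes silently
```

In this notebook the equivalent call would be `check_single_key(ds.data, key)`.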

In [None]:
print(ds.DESCR)

In [None]:
print(ds.LICENSE)

In [None]:
# Put the date on the x-axis rather than plotting against the row index
ds.data.plot(x='date', y='new_confirmed');