# Template for creating a dataset from a single .csv file

This example creates a dataset from a single, manually downloaded .csv file using a helper function in the `workflow` module.

Here, `src` stands for your project module; replace it with whatever name you gave yours.

In this case, we'll use one of the [COVID-19 Open-Data](https://github.com/GoogleCloudPlatform/covid-19-open-data) files from Google: https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv as an example.

## Basic imports

In [None]:
# Basic utility functions
import logging
import pathlib

from src.log import logger
from src.data import Dataset
from src import paths

# data functions
from src import workflow

In [None]:
# Optionally set to debug log level
#logger.setLevel(logging.DEBUG)

In [None]:
%load_ext autoreload
%autoreload 2

For reference, this is the current value of `paths['raw_data_path']`, as set in your conda environment:

In [None]:
paths['raw_data_path']

## Dataset creation information

This is the information that you need to provide to create this dataset:

* `ds_name`: The name you want to give your dataset in the Dataset catalog
* `csv_path`: The desired path to your .csv file (in this case `epidemiology.csv`), relative to `paths['raw_data_path']`
* `download_message`: The message shown to tell the user how to manually download your .csv file
* `license_str`: Information on the license for the dataset
* `descr_str`: Information on the dataset itself

In [None]:
ds_name = 'covid-19-epidemiology'
csv_path = 'epidemiology.csv' # path relative to paths['raw_data_path'] for the file

In [None]:
download_message = f"""Please retrieve epidemiology.csv from https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv \
and place it in {paths['raw_data_path']}"""
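
Before going further, it can help to confirm that the file is where the workflow will look for it. The following is a minimal sketch using only the standard library; `csv_is_in_place` is a hypothetical helper, not part of the `workflow` module, and the temporary directory merely stands in for `paths['raw_data_path']`.

```python
import tempfile
from pathlib import Path


def csv_is_in_place(raw_data_path, csv_path):
    """Return True if csv_path exists as a file under raw_data_path.

    Hypothetical helper for illustration; not part of the project's workflow module.
    """
    return (Path(raw_data_path) / csv_path).is_file()


# Demonstration against a throwaway directory standing in for paths['raw_data_path'].
with tempfile.TemporaryDirectory() as tmp:
    found_before = csv_is_in_place(tmp, "epidemiology.csv")
    # Simulate the manual download by creating a placeholder file.
    (Path(tmp) / "epidemiology.csv").write_text("placeholder\n")
    found_after = csv_is_in_place(tmp, "epidemiology.csv")

print(found_before, found_after)
```

In the notebook itself, the equivalent check would be `csv_is_in_place(paths['raw_data_path'], csv_path)`.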

In [None]:
license_str = """
[CC-BY 4.0](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/output/CC-BY)
"""

In [None]:
descr_str = """
The epidemiology table from Google's [COVID-19 Open-Data dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data). 

The full dataset contains datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world. The data is at the spatial resolution of states/provinces for most regions and at county/municipality resolution for many countries such as Argentina, Brazil, Chile, Colombia, Czech Republic, Mexico, Netherlands, Peru, United Kingdom, and USA. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:

    0: Country
    1: Province, state, or local equivalent
    2: Municipality, county, or local equivalent
    3: Locality which may not follow strict hierarchical order, such as "city" or "nursing homes in X location"

There are multiple types of data:

    Outcome data Y(i,t), such as cases, tests, hospitalizations, deaths and recoveries, for region i and time t
    Static covariate data X(i), such as population size, health statistics, economic indicators, geographic boundaries
    Dynamic covariate data X(i,t), such as mobility, search trends, weather, and government interventions

The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context, which can be easily merged due to the use of consistent geographic (and temporal) keys as it is done for the main table.

One of these files is the epidemiology.csv file used here.
"""

If you have not yet placed your `epidemiology.csv` file in the appropriate place, the `workflow.dataset_from_csv_manual_download` cell below will fail with a `FileNotFoundError` that names the path where it expects to find the file. Put your file there, and then try again.

## Create the dataset and explore it

In [None]:
%%time
ds = workflow.dataset_from_csv_manual_download(ds_name=ds_name,
                                               csv_path=csv_path,
                                               download_message=download_message,
                                               license_str=license_str,
                                               descr_str=descr_str,
                                               overwrite_catalog=False)

In [None]:
%%time
ds = Dataset.load(ds_name)

In [None]:
ds.data.head()

In [None]:
ds.data.shape

By default, the workflow helper function also creates a `covid-19-epidemiology_raw` dataset that has an empty `ds.data`, but keeps a record of the location of the final `epidemiology.csv` file, relative to `paths['raw_data_path']`, in `ds.EXTRA`.

The `.EXTRA` functionality is covered in other documentation.

In [None]:
%%time
ds_raw = Dataset.from_catalog(ds_name+"_raw")

In [None]:
print(ds_raw.data)

In [None]:
ds_raw.EXTRA

In [None]:
# Fully qualified path to the epidemiology.csv file
ds_raw.extra_file('epidemiology.csv')
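
Given a path like the one above, you may want a quick look at the raw file without loading it all into memory. The following is a minimal sketch using only the standard library; `peek_csv` is a hypothetical helper, and the sample file's column names and values are placeholders rather than the real `epidemiology.csv` schema.

```python
import csv
import itertools
import tempfile
from pathlib import Path


def peek_csv(csv_file, n_rows=5):
    """Return the header and up to n_rows data rows of csv_file.

    Hypothetical helper for illustration; not part of the project module.
    """
    with open(csv_file, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no lines
        rows = list(itertools.islice(reader, n_rows))
    return header, rows


# Demonstration on a tiny made-up file (placeholder columns, not the real schema).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("col_a,col_b\n1,2\n3,4\n5,6\n")
    header, rows = peek_csv(sample, n_rows=2)

print(header)
print(rows)
```

In the notebook, this would be called as `peek_csv(ds_raw.extra_file('epidemiology.csv'))`, assuming `extra_file` returns a path that `open` accepts.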

## Check in the catalog

The new dataset will now be in the catalog:

In [None]:
workflow.dataset_catalog(keys_only=True)

At this point, you'll need to check in your new catalog files so that they are shared with others. Then, anyone with the catalog files can load the new dataset with `Dataset.load('covid-19-epidemiology')`.