# General template for creating a new dataset from scratch

This example creates the same raw dataset as the [`Add-csv-template.ipynb`](https://cookiecutter-easydata.readthedocs.io/en/latest/Add-csv-template/) example, but does so in full generality, without using a function from `helpers`. Any (non-derived) dataset can be added this way.

We'll use this as an example of a non-manual download. 

## Basic imports

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Basic utility functions
import logging
import os
import pathlib
from functools import partial
from pprint import pprint

from src.log import logger
from src import paths
from src.utils import list_dir

# data functions
from src.data import DataSource, Dataset, DatasetGraph, Catalog
from src import helpers

In [None]:
# Optionally set to debug log level
logger.setLevel(logging.DEBUG)

## Create a DataSource



In [None]:
ds_name = 'covid-19-epidemiology-raw'
dsrc = DataSource(ds_name)

In [None]:
url = 'https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv'

In [None]:
filename = 'epidemiology.csv'  # file name, relative to paths['raw_data_path']

In [None]:
license = """
[CC-BY 4.0](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/output/CC-BY)
"""

In [None]:
metadata = """
The epidemiology table from Google's [COVID-19 Open-Data dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data). 

The full dataset contains datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world. The data is at the spatial resolution of states/provinces for most regions and at county/municipality resolution for many countries such as Argentina, Brazil, Chile, Colombia, Czech Republic, Mexico, Netherlands, Peru, United Kingdom, and USA. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:

    0: Country
    1: Province, state, or local equivalent
    2: Municipality, county, or local equivalent
    3: Locality which may not follow strict hierarchical order, such as "city" or "nursing homes in X location"

There are multiple types of data:

    Outcome data Y(i,t), such as cases, tests, hospitalizations, deaths and recoveries, for region i and time t
    Static covariate data X(i), such as population size, health statistics, economic indicators, geographic boundaries
    Dynamic covariate data X(i,t), such as mobility, search trends, weather, and government interventions

The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context, which can be easily merged due to the use of consistent geographic (and temporal) keys as it is done for the main table.

One of these files is the epidemiology.csv file used here.
"""

This example uses `add_url`, but there are other options such as `add_manual_download` and `add_google_drive`. 

In [None]:
dsrc.add_url(url=url, file_name=filename, unpack_action='copy')
dsrc.add_metadata(contents=metadata, force=True)
dsrc.add_metadata(contents=license, kind='LICENSE', force=True)

In [None]:
dsrc.file_dict

### Create a process function
We recommend using `process_extra_files` as the process function and then using a transformer function to create a derived dataset, but you can also write your own process function.

In [None]:
from src.data.extra import process_extra_files
process_function = process_extra_files
process_function_kwargs = {'file_glob':'*.csv',
                           'do_copy': True,
                           'extra_dir': ds_name+'.extra',
                           'extract_dir': ds_name}

In [None]:
help(process_function)

In [None]:
dsrc.process_function = partial(process_function, **process_function_kwargs)
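
The cell above relies on `functools.partial` to bind the keyword arguments to the process function ahead of time. The following is a minimal, self-contained sketch of that binding step only. `process_stub` is a hypothetical stand-in that just echoes its arguments; it is not `process_extra_files`, and it makes no claim about what the real function does or returns.

```python
from functools import partial


def process_stub(*, file_glob="*", do_copy=False, extra_dir=None, extract_dir=None):
    """Hypothetical stand-in for a process function; it only echoes its arguments."""
    return {
        "file_glob": file_glob,
        "do_copy": do_copy,
        "extra_dir": extra_dir,
        "extract_dir": extract_dir,
    }


ds_name = "covid-19-epidemiology-raw"
kwargs = {
    "file_glob": "*.csv",
    "do_copy": True,
    "extra_dir": ds_name + ".extra",
    "extract_dir": ds_name,
}

# partial() returns a new callable with the keyword arguments pre-bound,
# so it can later be called with no arguments at all.
bound = partial(process_stub, **kwargs)
result = bound()

# The bound keywords remain inspectable on the partial object.
print(bound.keywords["extra_dir"])
```

Binding the arguments this way means the stored callable carries its own configuration, which is presumably why the notebook assigns a `partial` rather than the bare function.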

In [None]:
dsrc.update_catalog()

In [None]:
dsc = Catalog.load('datasources')
dsc[ds_name]

In [None]:
%%time
dsrc.fetch()

In [None]:
%%time
dsrc.unpack()

## Create a Dataset from the DataSource

In [None]:
from src.data import DatasetGraph

In [None]:
paths['catalog_path']

In [None]:
dag = DatasetGraph(catalog_path=paths['catalog_path'])

In [None]:
dag.sources

In [None]:
dsc = Catalog.load('datasources')
dsc

In [None]:
dag.add_source(output_dataset=ds_name, datasource_name=ds_name, overwrite_catalog=True)

In [None]:
dc = Catalog.load('datasets')
dc

In [None]:
%%time
ds = Dataset.from_catalog(ds_name)

In [None]:
%%time
ds = Dataset.load(ds_name)

In [None]:
pprint(ds.metadata)

In [None]:
print(ds.LICENSE)

In [None]:
ds.EXTRA

In [None]:
ds.extra_file('epidemiology.csv')

In [None]:
ds.data is None

In [None]:
ds.target is None
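
If both checks above return `True`, the raw dataset carries its content only as extra files. The sketch below shows one way to peek at such a file using the standard library. It assumes, without confirmation from this notebook, that `ds.extra_file('epidemiology.csv')` returns a filesystem path. `preview_csv` is a hypothetical helper, and the temporary file with made-up columns exists only so the sketch runs on its own.

```python
import csv
import tempfile
from pathlib import Path


def preview_csv(path, n_rows=2):
    """Return the header and the first n_rows data rows of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Placeholder file with made-up columns and values, used only so that this
# sketch is runnable; in the notebook the path would instead come from
# ds.extra_file('epidemiology.csv') (assumed to return a path).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("col_a,col_b\n1,x\n2,y\n3,z\n")
    header, rows = preview_csv(sample)

print(header)
print(rows)
```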

## Check in the new dataset
Finally, don't forget to check in the new catalog files.