# Overview of Database

The `DB()` object in the `dl_utils.db` module encapsulates functionality to organize project data and to track data transformations and analysis. As the size and complexity of a project grow, use of the `DB()` object will ensure that:

1. Project file paths are defined in a well-organized directory hierarchy
2. Project metadata is serialized in a clear and standard format
3. Code to perform data transformations and analysis is well-documented with clear input(s) and output(s)
4. The entire data transformation pipeline can be applied easily to new cohorts

### Table of Contents

* Initialization
* Basic functionality
* Iteration

# Set Up

The following lines of code prepare the requisite environment to run this notebook.

In [None]:
import os
import pandas as pd
from dl_utils.db import DB
from dl_utils import datasets

In addition, this tutorial assumes that the `bet` example dataset has been downloaded to the `dl_utils/data` folder. If needed, the following lines of code will download the required data:

In [None]:
# --- Set paths
DAT_PATH = '../../../data'
CSV_PATH = '{}/bet/csvs/db-all.csv.gz'.format(DAT_PATH)
YML_PATH = '{}/bet/ymls/db-all.yml'.format(DAT_PATH)

# --- Download data
if not os.path.exists(CSV_PATH):
    datasets.download(name='bet', path=DAT_PATH)

# Initialization

All `DB()` object data, including filenames and raw values, are stored in a `*.csv` (or `*.csv.gz`) file. Where needed, `DB()` object metadata such as filename directory roots, filename patterns or method definitions are stored in a `*.yml` file. Either file type may be passed directly into the `DB(...)` constructor to create a new object.

### Creating from a `*.csv` file

All underlying raw `DB()` data is stored in a `*.csv` file. Each row in the `*.csv` file represents a single exam. Each column in the `*.csv` file may be one of three different types: sid, fnames, header.

An example template `*.csv` file is shown here:

```
sid           fname-dat       fname-lbl       hemorrhage
exam-id-000   /000/dat.hdf5   /000/lbl.hdf5   True
exam-id-001   /001/dat.hdf5   /001/lbl.hdf5   False
exam-id-002   /002/dat.hdf5   /002/lbl.hdf5   True
...           ...             ...             ...
```

#### sid (required)

Exactly one column in the `*.csv` file must be named `sid` and be populated with a **unique** study ID for each exam (row). The `sid` may be either numeric or alphanumeric in content.
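The uniqueness requirement can be checked before a table is ever passed to `DB()`. The following is a minimal sketch using only the standard library; the helper name `find_duplicate_sids` and the comma-separated `CSV_TEXT` table are hypothetical and are not part of `dl_utils`.

```python
import csv
import io
from collections import Counter

# Hypothetical table mirroring the template above (comma-separated here)
CSV_TEXT = """sid,fname-dat,fname-lbl,hemorrhage
exam-id-000,/000/dat.hdf5,/000/lbl.hdf5,True
exam-id-001,/001/dat.hdf5,/001/lbl.hdf5,False
exam-id-002,/002/dat.hdf5,/002/lbl.hdf5,True
"""

def find_duplicate_sids(csv_text):
    """Return a sorted list of sid values that appear more than once."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row['sid'] for row in rows)
    return sorted(sid for sid, n in counts.items() if n > 1)

# --- An empty list means every exam (row) has a unique sid
duplicates = find_duplicate_sids(CSV_TEXT)
```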

#### fnames (optional)

If a project utilizes one or more serialized data volumes, the file paths should be maintained in columns specified with a `fname-` prefix (e.g. `fname-dat` and `fname-lbl` as above). Files may be listed using either complete or relative paths, or using a number of keywords. See `*.yml` configuration below for more information.

#### header (optional)

All other data for a project should be maintained in the remaining columns of the `*.csv` file (i.e. columns that are not `sid` and not prefixed with `fname-`). It is best practice to serialize a single value per column (e.g. either a numeric value or a string), rather than storing multiple values as an object.

In [None]:
# --- Create from *.csv file
db = DB(CSV_PATH)

### Creating from a `*.yml` file

In addition to the raw `DB()` data stored in a `*.csv` file, various metadata that defines `DB()` behavior may also be specified in a corresponding `*.yml` file. 

An example template `*.yml` file is shown here:

```yml
files: 
  csv: /csvs/db-all.csv.gz
  yml: /ymls/db-all.yml
paths: 
  data: /path/to/data
  code: /path/to/code
sform: {}
query: {}
fdefs: []

```

#### files and paths (required)

To facilitate transfer of code and data, relative paths are stored in the `files` variable, with path roots stored in the `paths` variable. Thus:

```python
paths['code'] + files['csv'] # complete path to *.csv file
paths['code'] + files['yml'] # complete path to *.yml file
```

Additionally, `paths['data']` represents the root directory for serialized data volumes.
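To illustrate why the roots are kept separate, here is a minimal sketch in which relocating a project only requires changing `paths['code']`. The dictionaries mirror the template `*.yml` above; the helper `full_path` is hypothetical and simply guards against doubled or missing separators.

```python
# Hypothetical values mirroring the template *.yml above
files = {'csv': '/csvs/db-all.csv.gz', 'yml': '/ymls/db-all.yml'}
paths = {'data': '/path/to/data', 'code': '/path/to/code'}

def full_path(root, relative):
    """Join a path root and a relative path with exactly one separator."""
    return root.rstrip('/') + '/' + relative.lstrip('/')

# --- Complete paths under the original root
csv_full = full_path(paths['code'], files['csv'])
yml_full = full_path(paths['code'], files['yml'])

# --- Moving the project only changes the root; files stays untouched
paths['code'] = '/new/location/code'
csv_moved = full_path(paths['code'], files['csv'])
```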

#### sform (optional)

There are two different methods to store file paths in the `*.csv` table (in columns prefixed with `fname-`). As above, the simplest method is to use the complete file path. Alternatively, the `sform` dictionary may be set with key-value pairs where the key represents a column name and the value represents a Python string format pattern. The Python string format pattern may use any of three different keywords:

* *root*: the data root directory (`paths['data']` as above)
* *curr*: the current contents stored in the `*.csv` file
* *sid*: the current exam study ID

Consider the following examples (using the same template `*.csv` as above):

##### Example 1

```yml
sform:
  dat: '{root}/{curr}'
  lbl: '{root}/{curr}'
```

... would be expanded to ...

```
sid           fname-dat                    fname-lbl
exam-id-000   /path/to/data/000/dat.hdf5   /path/to/data/000/lbl.hdf5
exam-id-001   /path/to/data/001/dat.hdf5   /path/to/data/001/lbl.hdf5
exam-id-002   /path/to/data/002/dat.hdf5   /path/to/data/002/lbl.hdf5
...           ...                          ...
```

##### Example 2

```yml
sform:
  dat: '{root}/{sid}/dat.hdf5'
  lbl: '{root}/{sid}/lbl.hdf5'
```

... would be expanded to ...

```
sid           fname-dat                            fname-lbl
exam-id-000   /path/to/data/exam-id-000/dat.hdf5   /path/to/data/exam-id-000/lbl.hdf5
exam-id-001   /path/to/data/exam-id-001/dat.hdf5   /path/to/data/exam-id-001/lbl.hdf5
exam-id-002   /path/to/data/exam-id-002/dat.hdf5   /path/to/data/exam-id-002/lbl.hdf5
...           ...                                  ...
```
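Both examples can be reproduced with plain `str.format`. The following is a minimal sketch of the expansion, not the `DB()` implementation itself: the helper `expand_fname` is hypothetical, and stripping the slashes at the join (so that `{root}/{curr}` does not produce `//`) is an assumption made to match the tables above.

```python
def expand_fname(pattern, root, curr, sid):
    """Fill a format pattern using the root, curr and sid keywords."""
    # Assumption: trim slashes at the join so '{root}/{curr}' yields a single '/'
    return pattern.format(root=root.rstrip('/'), curr=curr.lstrip('/'), sid=sid)

root = '/path/to/data'

# --- (sid, current fname-dat contents) taken from the template *.csv
rows = [('exam-id-000', '/000/dat.hdf5'), ('exam-id-001', '/001/dat.hdf5')]

# --- Example 1: prepend the data root to the stored relative path
example_1 = [expand_fname('{root}/{curr}', root, curr, sid) for sid, curr in rows]

# --- Example 2: build the path from the study ID, ignoring the stored contents
example_2 = [expand_fname('{root}/{sid}/dat.hdf5', root, curr, sid) for sid, curr in rows]
```

Keywords that a pattern does not use are simply ignored by `str.format`, which is why the same helper serves both examples.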

#### query (optional)

As an alternative to manually identifying the relevant file paths, a simple query can be configured to automatically find (and update) the requested data. The query dictionary is defined using a root data directory and one or more matching suffix patterns.

Consider the following example:

```yml
query:
  root: /path/to/data
  dat: dat.hdf5
  lbl: lbl.hdf5
```

In this scenario:

* column `fname-dat`: populated with results from `glob.glob('/path/to/data/**/dat.hdf5')`
* column `fname-lbl`: populated with results from `glob.glob('/path/to/data/**/lbl.hdf5')`

Note that corresponding `dat.hdf5` and `lbl.hdf5` files for the same exam are expected to be in the **same subdirectory**.
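The pairing by subdirectory can be sketched as follows. This is an illustration under stated assumptions rather than the `DB()` implementation: the helper `run_query` is hypothetical, the directory tree is a throwaway built with `tempfile`, and `recursive=True` is passed to `glob.glob` so that `**` matches nested directories.

```python
import glob
import os
import tempfile

def run_query(root, suffixes):
    """Map each subdirectory to the file found for every suffix key."""
    found = {}
    for key, suffix in suffixes.items():
        pattern = os.path.join(root, '**', suffix)
        # Assumption: recursive=True so that ** spans nested directories
        for path in sorted(glob.glob(pattern, recursive=True)):
            found.setdefault(os.path.dirname(path), {})[key] = path
    return found

# --- Build a small throwaway directory tree to query
root = tempfile.mkdtemp()
for name in ['000', '001']:
    os.makedirs(os.path.join(root, name))
    for suffix in ['dat.hdf5', 'lbl.hdf5']:
        open(os.path.join(root, name, suffix), 'w').close()

# --- One entry per exam subdirectory, each holding a dat and lbl path
matches = run_query(root, {'dat': 'dat.hdf5', 'lbl': 'lbl.hdf5'})
```

Grouping by `os.path.dirname` is what makes the same-subdirectory expectation matter: files for one exam that live in different folders would end up in separate entries.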

#### fdefs (optional)

See notes below.

In [None]:
# --- Create from *.yml file
db = DB(YML_PATH)

# Basic functionality

Upon creating a `DB()` object, the underlying data structure is split into two separate `pandas` DataFrames, `db.fnames` and `db.header` (each DataFrame is an attribute of the main `DB()` object). The `db.fnames` DataFrame comprises all `*.csv` columns prefixed with `fname-` (with the `fname-` prefix itself removed upon import); the `db.header` DataFrame contains all remaining columns.

In [None]:
# --- Inspect fnames and header
assert type(db.header) is pd.DataFrame
assert type(db.fnames) is pd.DataFrame

# --- Ensure all five exams are available
assert db.fnames.shape[0] == 5
assert db.header.shape[0] == 5
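For intuition, the split into `fnames` and `header` can be approximated directly in `pandas`. This is a minimal sketch, not the `DB()` implementation: the inline `CSV_TEXT` table is hypothetical, and using `sid` as the index of both DataFrames is an assumption.

```python
import io
import pandas as pd

# Hypothetical table mirroring the template *.csv (comma-separated here)
CSV_TEXT = """sid,fname-dat,fname-lbl,hemorrhage
exam-id-000,/000/dat.hdf5,/000/lbl.hdf5,True
exam-id-001,/001/dat.hdf5,/001/lbl.hdf5,False
"""

# Assumption: sid serves as the shared index of both DataFrames
df = pd.read_csv(io.StringIO(CSV_TEXT), index_col='sid')

# --- Columns prefixed with fname- go to fnames (prefix removed)
is_fname = df.columns.str.startswith('fname-')
fnames = df.loc[:, is_fname].rename(columns=lambda c: c[len('fname-'):])

# --- All remaining columns go to header
header = df.loc[:, ~is_fname]
```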

## db.fnames

See below for some basic functionality related to `db.fnames`:

In [None]:
# --- Check to see if fnames exist

# Tear Down