# Prepare Age and Gender Data from the MIMIC-III Demo Dataset for Conquery

This tutorial shows how the data and meta data tables from the [MIMIC-III Demo Dataset](https://physionet.org/content/mimiciii-demo/1.4/) can be used to prepare the data structures
needed by Conquery.

Specifically, we will generate three kinds of meta JSON for the table [PATIENTS.csv](https://physionet.org/files/mimiciii-demo/1.4/PATIENTS.csv): a Table-JSON that describes the table schema, an Import-JSON that describes the import operation (it closely resembles the corresponding Table-JSON and is used for the preprocessing), and Concept-JSONs that provide the query functionality.
This table contains information about age and gender. We will use it to create two corresponding concepts.

In [None]:
## The imports for this notebook
import pandas as pd
import io
import requests as r
import os
import json
from pathlib import Path
from zipfile import ZipFile
from io import BytesIO
import lib.conquery_util as cq
import re

# Define working directory
wd = Path(".")

## Meta Data Creation
We will start by creating the meta data. For the Table-JSON and Import-JSON we need the header of the data table (PATIENTS.csv) that we want to use later in Conquery.
This process is rather generic, as it is usually just an annotation of the columns with type information.

For the Concept-JSONs, we will create two objects that will reference the columns from the Table-JSON.
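
As a rough illustration of what annotating the columns with type information means, the sketch below builds a table description from a single sample row. The helper `annotate_columns`, the `TYPE_NAMES` mapping, and the exact output shape are assumptions made for this example; in this notebook the real structure is produced by `cq.generate_table` and checked against `json_schema/table.json`.

```python
import json
from datetime import date

# Hypothetical mapping from Python value types to column type names
# (assumed for illustration, not taken from lib.conquery_util).
TYPE_NAMES = {str: "STRING", int: "INTEGER", float: "REAL", date: "DATE"}


def annotate_columns(table_name, sample_row, primary_column):
    """Describe each column of a sample row by its name and type."""
    columns = [
        {"name": name, "type": TYPE_NAMES[type(value)]}
        for name, value in sample_row.items()
        if name != primary_column
    ]
    return {
        "name": table_name,
        "primaryColumn": {
            "name": primary_column,
            "type": TYPE_NAMES[type(sample_row[primary_column])],
        },
        "columns": columns,
    }


# Made-up sample row; the values are not taken from the dataset.
sample_row = {"subject_id": "1", "gender": "F", "dob": date(2000, 1, 31)}
table_sketch = annotate_columns("PATIENTS", sample_row, "subject_id")
print(json.dumps(table_sketch, indent="\t"))
```

The primary column is listed separately from the remaining columns here, which mirrors how `subject_id` is passed as its own argument to `cq.generate_table` later on.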

### Download Data Table

In [None]:
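data_url = "https://physionet.org/files/mimiciii-demo/1.4/PATIENTS.csv?download"
# Download the table and fail early if the server reports an HTTP error
response = r.get(data_url)
response.raise_for_status()
data_df = pd.read_csv(io.StringIO(response.content.decode('utf-8')), index_col="row_id", dtype={"subject_id": str})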

### Clean Data
For now, Conquery can only work with dates, not times. We therefore strip the time component from every column that we will reference later. These are the two columns `dob` (date of birth) and `dod` (date of death).

In [None]:
datetime_matcher = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})\s*(?P<time>\d{2}:\d{2}:\d{2})$")

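# Clean dob and convert to datetime type.
# regex=True is passed explicitly: recent pandas versions default to regex=False,
# which rejects compiled patterns and callable replacements.
data_df['dob'] = pd.to_datetime(data_df['dob'].str.replace(datetime_matcher, lambda match: match.group('date'), regex=True))
# Clean dod and convert to datetime type
data_df['dod'] = pd.to_datetime(data_df['dod'].str.replace(datetime_matcher, lambda match: match.group('date'), regex=True))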

# Write out the csv because it is needed for the preprocessing
data_file = wd / "data" / "csv" / cq.get_csv_name(data_url)
data_file.parent.mkdir(parents=True, exist_ok=True)

data_df.to_csv(data_file)
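
To see what the cleaning step does to a single value, here is a small sketch that applies the same pattern with the standard `re` module instead of pandas. The helper name `strip_time` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the cell above: a date followed by a time of day.
datetime_matcher = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})\s*(?P<time>\d{2}:\d{2}:\d{2})$")


def strip_time(value: str) -> str:
    """Keep only the date part if the value is a full timestamp."""
    return datetime_matcher.sub(lambda match: match.group("date"), value)


# Made-up sample values, not taken from the dataset.
samples = ["2000-01-31 00:00:00", "2000-01-31", "not a date"]
cleaned = [strip_time(sample) for sample in samples]
print(cleaned)
```

Values that do not match the full timestamp pattern, such as a bare date, pass through unchanged because the pattern is anchored at both ends.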

### Generate Table-JSON and Validate

In [None]:
# Extract the table name that we will use throughout this notebook
table_name = data_file.name.split(".")[0]

# We add an extra column "lifetime" to the table description; it is not a column of the CSV itself
extra_columns = [cq.DateRangeColumn(name="lifetime", start_column="dob", end_column="dod")]

# Generate the Table-JSON
table = cq.generate_table(table_name, data_df, "subject_id", extra=extra_columns)

# Load the validation schema for Table-JSON (it is under ./json_schema/table.json) and validate the generated object
cq.get_validator(wd/"json_schema"/"table.json").validate(table)

# Prepare the folder for the Table-JSONs
table_json_file = wd / "data" / "tables" / f"{table_name}.table.json"
table_json_file.parent.mkdir(parents=True, exist_ok=True)

# Write the Table-JSON 
with open(table_json_file, "w") as f:
    json.dump(table, f, indent="\t")

### Generate Import-JSON and Validate

In [None]:
table_name = data_file.name.split(".")[0]

# We pass the extra columns here as well, because the import file then defines a special operation
# that creates a virtual column "lifetime" that references "dob" and "dod".
import_ = cq.generate_import(data_df, "subject_id", data_file, extra=extra_columns)

cq.get_validator(wd/"json_schema"/"import.json").validate(import_)

import_json_file = wd / "data" / "imports" / f"{table_name}.import.json"
import_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(import_json_file, "w") as f:
    json.dump(import_, f, indent="\t")
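
The idea behind the virtual `lifetime` column can be sketched in plain Python: it pairs the start date `dob` with the end date `dod`. The function `lifetime` below is hypothetical, and treating a missing `dod` as an open-ended range is an assumption of this sketch; how Conquery represents such ranges internally is not covered by this notebook.

```python
from datetime import date
from typing import Optional, Tuple


def lifetime(dob: date, dod: Optional[date]) -> Tuple[date, Optional[date]]:
    """Combine date of birth and date of death into one (start, end) range.

    A missing date of death is kept as None, meaning "open end" in this sketch.
    """
    if dod is not None and dod < dob:
        raise ValueError("dod must not be earlier than dob")
    return (dob, dod)


# Made-up dates for illustration.
closed_range = lifetime(date(1950, 1, 1), date(2020, 5, 17))
open_range = lifetime(date(1950, 1, 1), None)
print(closed_range)
print(open_range)
```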

### Generate Concept-JSON

In this section we generate an age and a gender concept.

#### Age Concept
The age concept allows you to build a cohort based on an age restriction (see `concept->connectors[0]->filters[0]`). It can also output a patient's date of birth or, by default, the patient's age relative to today or to the upper bound of a date restriction.

In [None]:
concept = {
    "connectors": [
        {
            "filters": [
                {
                    "column": f"{table_name}.dob",
                    "description": "Allowed ages within the given date restriction",
                    "label": "Age Restriction",
                    "name": "age_restriction",
                    "timeUnit": "YEARS",
                    "type": "DATE_DISTANCE"
                }
            ],
            "label": "Age",
            "name": "age",
            "selects": [
                {
                    "column": f"{table_name}.dob",
                    "default": True,
                    "description": "Age at upper bound of date restriction",
                    "label": "Age",
                    "name": "age_select",
                    "timeUnit": "YEARS",
                    "type": "DATE_DISTANCE"
                },
                {
                    "column": f"{table_name}.dob",
                    "label": "Date of Birth",
                    "name": "date_of_birth",
                    "type": "LAST"
                }
            ],
            "validityDates": [
                {
                    "column": f"{table_name}.lifetime",
                    "label": "Lifetime",
                    "name": "lifetime"
                },
            ],
            "table": f"{table_name}",
        }
    ],
    "label": "Age",
    "name": "age",
    "type": "TREE"
}

Validate and write the concept.

In [None]:
cq.get_validator(wd / "json_schema" / "concept.json").validate(concept)

concept_json_file = wd / "data" / "concepts" / f"{concept['name']}.concept.json"
concept_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(concept_json_file, "w") as f:
    json.dump(concept, f, indent="\t")
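
To make the `DATE_DISTANCE` entries with `"timeUnit": "YEARS"` more concrete, the following sketch computes an age relative to a reference date and checks it against a restriction. It assumes the distance is counted in completed years; the function names are hypothetical, and Conquery's own calculation may differ in detail.

```python
from datetime import date
from typing import Optional


def age_in_years(dob: date, reference: date) -> int:
    """Number of completed years between the date of birth and a reference date."""
    years = reference.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (dob.month, dob.day):
        years -= 1
    return years


def in_age_range(dob: date, reference: date,
                 min_age: Optional[int] = None, max_age: Optional[int] = None) -> bool:
    """Check whether the age at the reference date lies within optional bounds."""
    age = age_in_years(dob, reference)
    if min_age is not None and age < min_age:
        return False
    if max_age is not None and age > max_age:
        return False
    return True


# Made-up dates: the reference date plays the role of the upper bound of a date restriction.
dob = date(1980, 6, 15)
print(age_in_years(dob, date(2020, 6, 14)))
print(age_in_years(dob, date(2020, 6, 15)))
print(in_age_range(dob, date(2020, 6, 15), min_age=40, max_age=65))
```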

#### Gender Concept
The gender concept allows the cohort to be filtered by gender. Therefore, a filter on the `gender` column is added to the concept.

In [None]:
concept = {
    "connectors": [
        {
            "column": f"{table_name}.gender",
            "filters": [
                {
                    "column": f"{table_name}.gender",
                    "label": "Gender",
                    "labels": {
                        "F": "female",
                        "M": "male"
                    },
                    "name": "gender",
                    "type": "BIG_MULTI_SELECT"
                }
            ],
            "label": "Gender",
            "name": "gender",
            "selects": [
                {
                    "column": f"{table_name}.gender",
                    "description": "The most recent gender as a code",
                    "label": "Gender Code",
                    "name": "gender_code",
                    "type": "LAST"
                },
            ],
            "validityDates": [
                {
                    "column": f"{table_name}.lifetime",
                    "label": "Lifetime",
                    "name": "lifetime"
                }
            ]
        }
    ],
    "label": "Gender",
    "name": "gender",
    "type": "TREE"
}

Validate and write the concept.

In [None]:
cq.get_validator(wd / "json_schema" / "concept.json").validate(concept)

concept_json_file = wd / "data" / "concepts" / f"{concept['name']}.concept.json"
concept_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(concept_json_file, "w") as f:
    json.dump(concept, f, indent="\t")
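
Conceptually, the gender filter keeps the rows whose code is among the selected values, and the `LAST` select returns one value per patient. The sketch below imitates both on a few made-up rows; the helper names are hypothetical, and reading `LAST` as "the last value seen" is an assumption of this example.

```python
# Same code-to-label mapping as in the filter definition above.
LABELS = {"F": "female", "M": "male"}


def filter_by_gender(rows, selected_codes):
    """Keep only the rows whose gender code is one of the selected codes."""
    return [row for row in rows if row["gender"] in selected_codes]


def last_value(values):
    """Return the last value of a sequence, or None if it is empty."""
    return values[-1] if values else None


# Made-up rows, not taken from the dataset.
rows = [
    {"subject_id": "1", "gender": "F"},
    {"subject_id": "2", "gender": "M"},
    {"subject_id": "3", "gender": "F"},
]
cohort = filter_by_gender(rows, {"F"})
cohort_ids = [row["subject_id"] for row in cohort]
last_label = LABELS[last_value([row["gender"] for row in cohort])]
print(cohort_ids)
print(last_label)
```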

## Preprocessing and Upload

The next tutorial, [Preprocess and Upload](./preprocess_and_upload.ipynb), shows how to preprocess and upload all data and meta data produced by this notebook.
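
Before moving on, it can help to check that all files written by this notebook are in place. The sketch below lists the expected paths relative to the working directory. It assumes that `cq.get_csv_name` yields `PATIENTS.csv`, so that `table_name` is `PATIENTS`; adjust the name if your run produced a different one.

```python
from pathlib import Path

wd = Path(".")
table_name = "PATIENTS"  # assumption: derived from the CSV name used above

# Files written by the cells of this notebook.
expected_files = [
    wd / "data" / "csv" / f"{table_name}.csv",
    wd / "data" / "tables" / f"{table_name}.table.json",
    wd / "data" / "imports" / f"{table_name}.import.json",
    wd / "data" / "concepts" / "age.concept.json",
    wd / "data" / "concepts" / "gender.concept.json",
]

# Report which of the expected files exist in the working directory.
for path in expected_files:
    status = "found" if path.exists() else "missing"
    print(f"{path.as_posix()}: {status}")
```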
