# Demonstration notebook: converting a variety of input files to HEAL variable-level metadata (i.e., a data dictionary)
This notebook takes a specified input file and uses healdata-utils to export HEAL-formatted data dictionaries.
The data dictionary titles are inferred from the file names. 
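
As a rough illustration of that fallback, the sketch below derives a default title from a file path. The helper name `default_title` is hypothetical, and whether the tool uses the stem or the full file name is an assumption, not something confirmed by the package.

```python
from pathlib import Path


def default_title(filepath, title=None):
    """Return the given title, or fall back to the file name without its suffix.

    Hypothetical helper: it only mimics the "title inferred from file name"
    behavior described above and is not part of healdata-utils.
    """
    return title if title else Path(filepath).stem


print(default_title("tests/data/example.sav"))
print(default_title("tests/data/example.sav", title="My data dictionary"))
```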

> Note: currently a few fields do not have descriptions, so they return validation failure warnings.
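
One way to spot such fields before running validation is to scan a csv data dictionary for empty `description` cells. This is a minimal sketch using only the standard library. The sample rows are invented for illustration, and the helper `fields_missing_description` is hypothetical rather than part of healdata-utils.

```python
import csv
import io

# Invented sample rows; a real csv data dictionary has many more columns.
csv_text = """name,description,type
id,Unique identifier for participant,integer
visit_dt,,string
race,Self-reported race,integer
"""


def fields_missing_description(csv_file):
    """Return the names of fields whose description cell is empty or blank."""
    reader = csv.DictReader(csv_file)
    return [
        row["name"]
        for row in reader
        if not (row.get("description") or "").strip()
    ]


missing = fields_missing_description(io.StringIO(csv_text))
print(missing)
```

To check a written file instead, pass an open file handle in place of the `io.StringIO` object.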

This notebook demonstrates two ways to create a data dictionary with the healdata-utils `vlmd` tool:

1. Via python
2. Via command line

In [None]:
!pip install git+https://github.com/norc-heal/healdata-utils

## Via python

In [2]:
import json
import os
import shutil
from pathlib import Path

import pandas as pd
from IPython.display import Markdown, display

from healdata_utils.cli import convert_to_vlmd, input_descriptions

In [3]:
def printdir(dirname):
    for d in Path(dirname).iterdir():
        print(d)
        if Path(d).is_dir():
            for _d in Path(d).iterdir():
                print(f"   {_d}")

In [4]:
# available inputs
display(Markdown("Available inputs (except por):"))
display(Markdown("".join(["- "+ext+"\n" for ext in list(input_descriptions.keys())])))
display(Markdown("Change the variable `input_type` to one of the extensions "))

Available inputs (except por):

- csv
- sav
- dta
- por
- sas7bdat
- json
- redcap.csv


Change the variable `input_type` to one of the extensions 

In [5]:
input_type = "sav"

In [6]:
display(Markdown((input_descriptions[input_type])))

Converts a "metadata-rich" (ie statistical software file) 
    into a HEAL-specified data dictionary in both csv format and json format.

    This function relies on [readstat](https://github.com/Roche/pyreadstat) which supports SPSS (sav and por), 
    SAS (sas7bdat), and Stata (dta). 

    > Currently, this function uses both data and metadata to generate 
    a HEAL specified data dictionary. That is, types are inferred from the 
    data (so at least test or synthetic data needed) in addition to the metadata 
    (ie variable labels and value labels). 
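
To make the "types from data, labels from metadata" idea concrete, here is a minimal sketch that combines a pandas DataFrame with label dictionaries. The dtype-to-type mapping and the helpers `infer_field_type` and `sketch_fields` are assumptions for illustration only. The actual inference logic in healdata-utils may differ.

```python
import pandas as pd
from pandas.api import types as pdt


def infer_field_type(series):
    """Map a pandas dtype to a schema-style type name (assumed mapping)."""
    if pdt.is_bool_dtype(series):
        return "boolean"
    if pdt.is_integer_dtype(series):
        return "integer"
    if pdt.is_float_dtype(series):
        return "number"
    if pdt.is_datetime64_any_dtype(series):
        return "datetime"
    return "string"


def sketch_fields(df, variable_labels, value_labels):
    """Build one field entry per column from the data plus label metadata."""
    fields = []
    for name in df.columns:
        field = {
            "name": name,
            "type": infer_field_type(df[name]),        # inferred from the data
            "description": variable_labels.get(name, ""),  # from the metadata
        }
        if name in value_labels:
            # Keys become strings, as in the JSON output of this notebook.
            field["encodings"] = {str(k): v for k, v in value_labels[name].items()}
        fields.append(field)
    return fields


# Invented data and labels standing in for a statistical software file.
df = pd.DataFrame({"id": [1, 2], "weight": [0.5, 1.5], "sex_at_birth": [1, 2]})
fields = sketch_fields(
    df,
    variable_labels={"id": "Unique identifier for participant"},
    value_labels={"sex_at_birth": {1: "Male", 2: "Female"}},
)
print(fields)
```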

    

In [17]:
data_repo = "https://raw.githubusercontent.com/norc-heal/healdata-utils/tests/data"
os.chdir("c:\\Users\\kranz-michael\\projects\\healdata-utils") # TODO: delete when repo is public
data_repo = "tests/data"
inputpath = data_repo+f"/example.{input_type}"
description = "This is a proof of concept to demonstrate the healdata-utils functionality"
title = "Healdata-utils Demonstration Data Dictionary"
healdir = "output"

In [18]:
# make python demo output
Path(healdir).mkdir(exist_ok=True)

In [19]:
data_dictionaries = convert_to_vlmd(
    filepath=inputpath,
    outputdir=healdir, #if not specified, will not write to file
    inputtype=input_type, # if not specified, inferred from the file suffix
    data_dictionary_props={
        "name":Path(inputpath).stem,
        "title":title,
        "description":description}
)

Validating csv data dictionary...
Csv is VALID
Validating heal-specified json fields.....
JSON array of data dictionary fields is VALID


In [22]:
display(Markdown("Here are the resulting contents of the output directory:"))
printdir("output")

Here are the resulting contents of the output directory:

output\errors
   output\errors\heal-csv-errors-summary.txt
   output\errors\heal-csv-errors.json
   output\errors\heal-json-errors.json
output\heal-csvtemplate-data-dictionary.csv
output\heal-jsontemplate-data-dictionary.json


Resulting CSV fields

Examine the human-readable csv validation report. If a data dictionary is not valid, the csv report summary lists the errors. In that case, you can edit the csv data dictionary and re-run `convert_to_vlmd` with the csv input type. For an example, see the csv validation demo notebook. In this notebook all files are valid, so the summary reports that the data dictionary is valid.
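
The edit-and-re-run step could look roughly like the sketch below. It reuses only the `convert_to_vlmd` parameters shown earlier in this notebook. The wrapper `rerun_from_csv` is hypothetical, and the assumption is that the csv input type accepts the csv data dictionary written to the output folder.

```python
def rerun_from_csv(csv_path, outputdir, props):
    """Re-validate an edited csv data dictionary by converting it again.

    Hypothetical wrapper around convert_to_vlmd; assumes the edited csv
    data dictionary is a valid input for inputtype="csv".
    """
    # Imported lazily so the sketch can be defined without the package installed.
    from healdata_utils.cli import convert_to_vlmd

    return convert_to_vlmd(
        filepath=csv_path,
        outputdir=outputdir,
        inputtype="csv",
        data_dictionary_props=props,
    )


# Example call (requires healdata-utils and the output from the cells above):
# rerun_from_csv(
#     "output/heal-csvtemplate-data-dictionary.csv",
#     "output",
#     {"name": "example", "title": "Edited data dictionary", "description": ""},
# )
```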

In [24]:
print(Path("output/errors/heal-csv-errors-summary.txt").read_text())


# -----
# valid: memory 
# -----

## Summary 

+------------------------+-------------------+
| Description            | Size/Name/Count   |
| File name (Not Found)  | memory            |
+------------------------+-------------------+
| File size              | N/A               |
+------------------------+-------------------+
| Total Time Taken (sec) | 0.021             |
+------------------------+-------------------+




You can view the data dictionary either by loading the written file into a pandas DataFrame or directly from the returned data dictionary object.

In [25]:
pd.DataFrame(data_dictionaries['csvtemplate']).head()

Unnamed: 0,module,name,title,description,type,format,constraints.maxLength,constraints.enum,constraints.pattern,constraints.maximum,...,univar_stats.median,univar_stats.mean,univar_stats.std,univar_stats.min,univar_stats.max,univar_stats.mode,univar_stats.count,univar_stats.twenty_five_percentile,univar_stats.seventy_five_percentile,univar_stats.cat_marginals
0,,id,,\tUnique identifier for participant,integer,,,,,,...,,,,,,,,,,
1,,visit_dt,,Date of the interview,string,,,,,,...,,,,,,,,,,
2,,sex_at_birth,,The self-reported sex of the participant/subje...,integer,,,1|2|3|-99|-98,,,...,,,,,,,,,,
3,,race,,Self-reported race,integer,,,1|2|3|4|5|6|7|-99|-98,,,...,,,,,,,,,,
4,,hispanic_ethnicity,,"Are you of Hispanic, Latino, or Spanish origin?",integer,,,,,,...,,,,,,,,,,
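
The `constraints.enum` column above stores the permissible values as a pipe-delimited string. A minimal sketch of splitting it into lists follows, using a small in-memory csv in place of the written file. The sample rows are trimmed to three columns for illustration, and the `enum_values` column name is made up.

```python
import io

import pandas as pd

# Trimmed stand-in for output/heal-csvtemplate-data-dictionary.csv.
csv_text = (
    "name,type,constraints.enum\n"
    "id,integer,\n"
    "sex_at_birth,integer,1|2|3|-99|-98\n"
)

# Read everything as text so empty cells stay as empty strings, not NaN.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)


def split_enum(value):
    """Turn a pipe-delimited enum string into a list of values."""
    return value.split("|") if value else []


df["enum_values"] = df["constraints.enum"].map(split_enum)
print(df[["name", "enum_values"]])
```

Reading the real file works the same way: replace the `io.StringIO` object with the path to the csv data dictionary.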


Resulting JSON object 

> Note that the fields are currently nested within the `data_dictionary` property, whereas the csv template contains only the fields.
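
Because of that nesting, code that only needs the list of fields has to walk down to it. The helper below is a hypothetical sketch, not part of healdata-utils. It assumes the fields live under one or more nested `data_dictionary` keys and that the innermost value is a list.

```python
def extract_fields(data_dictionary):
    """Follow nested "data_dictionary" keys until a list of fields is found."""
    node = data_dictionary
    while isinstance(node, dict):
        node = node.get("data_dictionary")
    # Anything other than a list (for example a missing key) yields no fields.
    return node if isinstance(node, list) else []


# Invented sample mirroring the general shape of the JSON template.
sample = {
    "name": "example",
    "title": "Example data dictionary",
    "data_dictionary": [
        {"name": "id", "type": "integer"},
        {"name": "race", "type": "integer"},
    ],
}
names = [field["name"] for field in extract_fields(sample)]
print(names)
```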

In [12]:
print(json.dumps(data_dictionaries['jsontemplate'],indent=4)[0:1000])

{
    "name": "JCOIN_NORC-Omnibus_SURVEY5_Feb2021_072821",
    "title": "protocol1-survey5",
    "data_dictionary": {
        "name": "JCOIN_NORC-Omnibus_SURVEY5_Feb2021_072821",
        "title": "protocol1-survey5",
        "data_dictionary": [
            {
                "name": "CaseId",
                "type": "integer",
                "description": "Case ID"
            },
            {
                "name": "WEIGHT",
                "type": "number",
                "description": "Post-stratification weights - 18+ general population (N=1,161)"
            },
            {
                "name": "FluVax",
                "type": "integer",
                "encodings": {
                    "1": "Yes, I already got the vaccine for the 2020-2021 flu season",
                    "2": "Yes, I plan to get the vaccine for the 2020-2021 flu season",
                    "3": "No, not this flu season",
                    "4": "No, I never get the flu vaccine",
                    

## Via command line

We will demonstrate the `vlmd` command line utility using one of the data dictionaries. 

In [36]:
# make a separate output-cli folder for cli demo

Path("output-cli").mkdir(exist_ok=True)

In [13]:
!vlmd --help

Usage: vlmd [OPTIONS]

Options:
  --filepath TEXT                 Path to the file you want to convert to a
                                  HEAL data dictionary  [required]
  --title TEXT                    The title of your data dictionary. If not
                                  specified, then the file name will be used
  --description TEXT              Description of data dictionary
  --inputtype [csv|sav|dta|por|sas7bdat|json|redcap.csv]
                                  The type of your input file.
  --outputdir TEXT                The folder where you want to output your
                                  HEAL data dictionary
  --help                          Show this message and exit.


To create the above data dictionary via the command line, run the cell below directly in this notebook:

In [32]:
!vlmd --filepath "tests/data/example.sav" \
--outputdir "output-cli" \
--title "Healdata-utils Demonstration Data Dictionary" \
--description "This is a proof of concept to demonstrate the healdata-utils functionality" 

Validating csv data dictionary...
Csv is VALID
Validating heal-specified json fields.....
JSON array of data dictionary fields is VALID
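
When the same command needs to run from a script rather than a notebook cell, the argument list can be assembled in Python. This is a minimal sketch: the flags are the ones listed in `vlmd --help` above, while the helper `build_vlmd_command` is hypothetical.

```python
import shlex


def build_vlmd_command(filepath, outputdir=None, title=None,
                       description=None, inputtype=None):
    """Assemble the argument list for the vlmd CLI, skipping unset options."""
    command = ["vlmd", "--filepath", str(filepath)]
    optional = {
        "--outputdir": outputdir,
        "--title": title,
        "--description": description,
        "--inputtype": inputtype,
    }
    for flag, value in optional.items():
        if value is not None:
            command += [flag, str(value)]
    return command


command = build_vlmd_command(
    "tests/data/example.sav",
    outputdir="output-cli",
    title="Healdata-utils Demonstration Data Dictionary",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with titles and descriptions that contain spaces.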


In [34]:
printdir("output-cli")

output-cli\errors
   output-cli\errors\heal-csv-errors-summary.txt
   output-cli\errors\heal-csv-errors.json
   output-cli\errors\heal-json-errors.json
output-cli\heal-csvtemplate-data-dictionary.csv
output-cli\heal-jsontemplate-data-dictionary.json
