# SDOH Dataset Coverage Checker
This notebook makes it easier to run the dataset coverage checker interactively.

### Prerequisites

First, execute the cell below using `Shift + Enter`.

This downloads the `coverage.py` script and installs the dependencies it needs so that it can be used in your workspace.

In [1]:
import os

# Helper function to download a remote file from GitHub for local use
def download_file(url):
    # Imported here so the import only runs when a download is actually needed
    import requests
    # Save the file under the last path segment of the URL
    local_filename = url.split('/')[-1]
    # stream=True avoids loading the whole response into memory at once
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            # Write the response to disk in 8 KiB chunks
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

# Download coverage.py to your workspace
if not os.path.exists('coverage.py'):
    print('coverage.py not found - downloading it to your workspace...')
    download_file('https://raw.githubusercontent.com/healthyregions/SDOHPlace-MetadataManager/refs/heads/main/manager/coverage/coverage.py')
    print('coverage.py has been downloaded to your workspace!')

# Install geopandas (required by coverage.py), plus requests and ipywidgets (used by this notebook)
print('Installing dependencies...')
! pip3 install geopandas requests ipywidgets
print('Dependencies installed successfully!')

coverage.py not found - downloading it to your workspace...
coverage.py has been downloaded to your workspace!
Installing dependencies...
Dependencies installed successfully!


### Specify User Inputs

Execute the cell below to show 3 widgets that will allow you to choose your desired `input_file`, `geography`, and name of the `id_field` to use for the coverage checker.

In [2]:
import ipywidgets as widgets
import coverage

# Allow user to select an input file, a geography type, and the name of the field that represents the ID
input_file_choice = widgets.Text(value='')
geography_choice = widgets.Dropdown(value=1, options=[('state', 1), ('county', 2), ('tract', 3), ('bg', 4), ('zcta', 5)])
id_field_choice = widgets.Text(value='FIPS')

# Finally, display all of our widgets
print('Path to input file:')
display(input_file_choice)
print('Name of geography to check against (using 2018 US Census files):')
display(geography_choice)
print('Name of field in input file that contains FIPS:')
display(id_field_choice)

Path to input file:


Text(value='')

Name of geography to check against (using 2018 US Census files):


Dropdown(options=(('state', 1), ('county', 2), ('tract', 3), ('bg', 4), ('zcta', 5)), value=1)

Name of field in input file that contains FIPS:


Text(value='FIPS')


Use the widgets above to input your desired parameters for the coverage checker.

### How to Use Coverage Checker

You will need to upload your input CSV file to your notebook workspace.

If you need an example input file, you can use [SVI_2010_US.csv](https://uofi.app.box.com/s/fqtslnfkpmgi32pb1cah1eyvmimvp740/file/1765513583484).

For the first input, specify the path in your workspace to the uploaded file.

Then select the values of the other two inputs based on the `input_file` chosen. 

For more information about the relationship between FIPS codes and HEROP_IDs, expand the HEROP_IDs section at the top of the [healthyregions GitHub organization](https://github.com/healthyregions) page.
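
As a rough illustration of that relationship: in the sample output at the end of this notebook, tract-level HEROP_IDs look like `140US` followed by the 11-digit tract FIPS code. The sketch below builds an ID in that shape. The function name `fips_to_herop_id`, the default prefix, and the zero-padding width are assumptions inferred from that output, not taken from `coverage.py`. Check the HEROP_IDs documentation for the authoritative format and for the prefixes used by other geographies.

```python
def fips_to_herop_id(fips, prefix="140US", width=11):
    """Build a HEROP_ID-shaped string from a FIPS code.

    Hypothetical helper: the '140US' prefix and 11-digit width are
    assumptions based on the tract-level sample output in this notebook.
    """
    digits = str(fips).strip()
    if not digits.isdigit():
        raise ValueError(f"FIPS code must contain only digits, got {fips!r}")
    # FIPS codes read as integers lose their leading zeros, so pad them back
    return prefix + digits.zfill(width)


print(fips_to_herop_id("01001020100"))
print(fips_to_herop_id(1001020100))  # leading zero was dropped by integer parsing
```

Both calls produce the same ID, which is why it helps to read FIPS columns as strings when loading a CSV.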

#### Examples

For example: let's choose `SVI_2010_US.csv` from the `CDC - Social Vulnerability Index` dataset as our `input_file`.

1. First, we download this file to our workspace using the Box webpage
2. Examining the file, we see that it contains a list of `tract` objects, so we select `tract`
3. Looking at the header in the CSV input file, we find that the column where these IDs are listed is named `HEROPID`

Another example: let's choose `SVI_2010_US_county.csv` from the `CDC - Social Vulnerability Index` dataset as our `input_file`.

1. First, we download this file to our workspace using the Box webpage
2. Examining the file, we see that it contains a list of `county` objects, so we select `county`
3. Looking at the header in the CSV input file, we find that the column where these IDs are listed is named `HEROPID`
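
Steps 2 and 3 can also be done from a notebook cell instead of opening the file by hand. The sketch below uses only the standard library to print the column names and the first few rows of a CSV. The helper name `peek_csv` is hypothetical, and the file name in the usage lines is only an example; substitute the path to your own uploaded file.

```python
import csv
import os


def peek_csv(path, n_rows=3):
    """Return the header and the first n_rows data rows of a CSV file.

    Hypothetical helper for inspecting an input file before running the
    coverage checker. Values are returned as strings, so leading zeros in
    ID columns are preserved.
    """
    # utf-8-sig strips a byte-order mark if the file has one
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Example usage; runs only if the example file is present in the workspace
example_path = "SVI_2010_US.csv"
if os.path.exists(example_path):
    header, rows = peek_csv(example_path)
    print("Columns:", header)
    for row in rows:
        print(row)
```

The length of the IDs in the first rows hints at the geography, and the header shows which column name to enter in the `id_field` widget.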

### Running the Coverage Checker
Execute the final cell to run the coverage checker.

In [3]:
input_file = input_file_choice.value
geography = geography_choice.label
id_field = id_field_choice.value

highlight_ids = coverage.check_coverage(input_file, geography, id_field)

# Print the list of highlight IDs at the end
print("\n")
print(f"Highlight IDs ({len(highlight_ids)}):")
print(", ".join(highlight_ids))

Checking coverage...
Reading SVI_2010_US.csv...
This file has FIPS column: 72891 rows found
0        01001020100
1        01001020200
2        01001020300
3        01001021100
4        01001020900
            ...     
72886    56043000301
72887    56043000200
72888    56043000302
72889    56045951100
72890    56045951300
Name: FIPS, Length: 72891, dtype: object
0        140US01117030317
1        140US01119011500
2        140US01121010900
3        140US01125010102
4        140US01089000701
               ...       
73869    140US78010970400
73870    140US78010970700
73871    140US78010970200
73872    140US78010970100
73873    140US78010970300
Name: HEROP_ID, Length: 73874, dtype: object
73874 HEROP_IDs are present in master geography file.
1070 are missing from the input dataset (SVI_2010_US.csv).
generating highlight_ids list...
Done checking!


Highlight IDs (1070):
-140US02158000100, -140US04019002704, -140US04019004125, -140US04019004121, -140US04019005300, -140US04019002906, -140US
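
The counts in the output above (73874 IDs in the master geography file, 1070 of them missing from the input) describe a comparison between two collections of IDs. Conceptually, this kind of comparison can be expressed as a set difference, as in the minimal sketch below. This is an illustration with made-up IDs and a hypothetical function name, `summarize_coverage`; it is not the code in `coverage.py`, and it does not reproduce the leading `-` shown on each highlight ID.

```python
def summarize_coverage(master_ids, input_ids):
    """Compare two collections of IDs.

    Hypothetical helper. Returns (missing, unexpected), where missing are
    IDs in the master list that the input lacks, and unexpected are IDs in
    the input that the master list does not contain.
    """
    master = set(master_ids)
    present = set(input_ids)
    missing = sorted(master - present)
    unexpected = sorted(present - master)
    return missing, unexpected


# Made-up IDs for illustration only
master = ["140US01001020100", "140US01001020200", "140US01001020300"]
found = ["140US01001020100", "140US01001020300", "140US99999999999"]

missing, unexpected = summarize_coverage(master, found)
print(f"{len(missing)} missing from input: {missing}")
print(f"{len(unexpected)} not in master list: {unexpected}")
```

Tracking both directions is useful because an input file can contain IDs that the reference geography does not, so the number of missing IDs is not simply the difference between the two row counts.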