# MIT (TA1): Data cards for Scenarios 3.1a and 3.4a

## 0. Preprocessing

In [10]:
import io, json, pandas as pd, requests
from gpt_key import GPT_KEY  # GPT_KEY is the only name from this module used below
API_ROOT = "http://localhost:8000/"

# display imports
from IPython.display import display, HTML, Markdown
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Scenario 3-1a. MA Wastewater Flow Rate and Viral Load

`Extract the appropriate data columns (cRNA2 and F2, which represent the viral load in wastewater and the flow rate, respectively) from the paper’s supplemental materials.`

The flow rate and viral load columns are found in two `.csv` files in the supplemental materials. We annotate each file with a data card, which includes general annotations about the dataset, summary statistics, and column-level DKG groundings.

Below, we start with the flow rate table:

In [11]:
csv_name = "scenario-3.1a-flow-rates.csv"
pd.read_csv(csv_name)

Unnamed: 0,Sample Date,"North System Flows to DITP, MGD","South System Flows to DITP, MGD",Total flows to DITP (MGD: Million gallon per day)
0,10/2/20,134.0,64.2,198.2
1,10/3/20,130.5,61.7,192.2
2,10/4/20,126.4,59.2,185.6
3,10/5/20,127.4,60.0,187.5
4,10/6/20,130.1,60.7,190.9
...,...,...,...,...
111,1/21/21,200.3,129.2,329.5
112,1/22/21,193.9,124.3,318.2
113,1/23/21,186.9,119.9,306.8
114,1/24/21,185.1,115.0,300.1


We also take a look at the source paper to add context to the datasets; in this scenario, we use the same documentation for both the flow rate and viral concentration datasets.

In [12]:
doc_name = "scenario-3.1a-data-documentation.txt"

with open(doc_name, 'r') as f:
    for line in f.readlines()[:3]:
        print(line)
    print('...') # ellipsis to indicate there's more

2. Materials and methods

2.1. Samples and wastewater data

Raw, 24-h composite wastewater samples were collected from the Deer Island wastewater treatment plant in Massachusetts from October 02, 2020 to January 25, 2021. The Massachusetts wastewater treatment plant where we obtained samples has two major influent streams, which are referred to as the “northern” and “southern” influents. The daily flow rates during the sampling period for the northern and southern influents are 4.54e5–2.3e6 m3/day, and 2.16e5–1.19e6m3/day, respectively. Together the two catchments represent approximately 2.3 million wastewater customers in Middlesex, Norfolk, and Suffolk counties, primarily in urban and suburban neighborhoods. There are 5100 miles of local sewers transporting wastewater into 227 miles of interceptor pipes to the wastewater treatment plant (www.mwra.com), and the typical turnaround time for the plant to treat wastewater is 24 h. Samples were processed as they were received. Experimental
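
The documentation quotes influent flow in m3/day, whereas the flow rate table is in MGD. A minimal conversion sketch follows, assuming "gallon" means the US liquid gallon (3.785411784 L); the function name is ours and is not part of the data card pipeline:

```python
# Assumption: "MGD" means million US liquid gallons per day.
LITERS_PER_US_GALLON = 3.785411784  # exact by definition of the US liquid gallon

def mgd_to_m3_per_day(flow_mgd):
    """Convert a flow in million gallons per day to cubic meters per day."""
    liters_per_day = flow_mgd * 1e6 * LITERS_PER_US_GALLON
    return liters_per_day / 1000.0  # 1 m3 = 1000 L

# North-system flow from the first row of the flow rate table (134.0 MGD)
print(f"{mgd_to_m3_per_day(134.0):.3g} m3/day")
```

Under this assumption 1 MGD is about 3785 m3/day, so the first row's north-system flow falls inside the 4.54e5–2.3e6 m3/day range the documentation gives for the northern influent.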

Now that we have our inputs, we are ready to fetch our data card:

In [18]:
with open(csv_name, 'rb') as f_csv, open(doc_name,  'rb') as f_doc:
    files = {'csv_file': ('filename', f_csv), 'doc_file': ('filename', f_doc)}
    params = {"gpt_key": GPT_KEY}
    response = requests.post(API_ROOT + "cards/get_data_card/",  params=params,  files=files)
    json_str = response.text
profile31a_flow = json.loads(json_str)

...and read it:

In [14]:
display(Markdown("# Data Card Output"))
for key, value in profile31a_flow.items():
    if type(value) is str and value != 'UNKNOWN':
        display(Markdown("**" + key.capitalize().replace('_', ' ') + "**: " + value))

# Data Card Output

**Description**: The dataset contains wastewater data collected from the Deer Island wastewater treatment plant in Massachusetts from October 02, 2020 to January 25, 2021.

**Provenance**: The data was collected from the Deer Island wastewater treatment plant in Massachusetts.

**Dataset type**: tabular

In [15]:
profiling_result = profile31a_flow["DATA_PROFILING_RESULT"]
for column_name, output in profiling_result.items():
    # string formatting one of the DKG groundings for display
    selected_grounding = output['dkg_groundings'][0]
    grounding_keys = ['id', 'name', 'class']
    grounding_output = [f"<b>{k}</b>: {v}" for k, v in zip(grounding_keys, selected_grounding)]
    profiling_result[column_name]['dkg_groundings'] = '\\n'.join(grounding_output) + '\\n...'

    column_stats = output['column_stats']
    # truncate numbers to a lower floating point precision
    for key, val in column_stats.items():
        if type(val) is float:
            column_stats[key] = round(val, ndigits=3)
    output['column_stats'] = ''
    # pulling up the column type first
    if column_stats.get('type'):
        output['column_stats'] += f"<b>type: {column_stats['type']}</b>\\n\\n"
    # truncating and string formatting most common entries due to length
    if column_stats.get('most_common_entries'):
        output_str = ''
        for i, entry_key in enumerate(column_stats['most_common_entries'].keys()):
            if i > 1:
                continue
            entry_value = column_stats['most_common_entries'][entry_key]
            output_str += f"\\n\\t{entry_key} ({entry_value} times)"
        output_str += "\\n\\t..."
        column_stats['most_common_entries'] = output_str
    # truncate column stats key set
    smaller_key_set = ['earliest', 'latest', 'mean', 'std'] # ['most_common_entries', 'earliest', 'latest', 'mean', 'std', 'min', 'max', '50%']
    if smaller_key_set:
        output['column_stats'] += '\\n'.join(key + ': ' + str(value) for key, value in column_stats.items() if key != 'type' and key in smaller_key_set)
        output['column_stats'] += '\\n\\n'
        output['column_stats'] += '\\n'.join(key + ': ' + str(value) for key, value in column_stats.items() if key != 'type' and key not in smaller_key_set)
    else:
        output['column_stats'] += '\\n'.join(key + ': ' + str(value) for key, value in column_stats.items() if key != 'type')

In [16]:
display(Markdown("# Data Profiling Output"))
df = pd.read_json(io.StringIO(json.dumps(profiling_result)), orient='index')
# deduplicate column names
df = df.drop('col_name', axis=1)
# print the whole description
df['description'] = df['description'].str.wrap(100)
# transpose table
df = df.T
# make sure the table is aligned to the left
df_style = df.style.set_properties(**{'text-align': 'left'})
# make sure the column headers are also aligned to the left
df_style = df_style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
# increase the font size
df_style = df_style.set_table_attributes('style="font-size: 14px"')
# replace delimiters for HTML parsing
df_style = df_style.to_html().replace("\\n","<br>").replace("\\t"," ")

display(HTML(df_style))


# Data Profiling Output

Unnamed: 0,Sample Date,"North System Flows to DITP, MGD","South System Flows to DITP, MGD",Total flows to DITP (MGD: Million gallon per day)
concept,Date of sample collection,Flow rate of the northern influent,Flow rate of the southern influent,Total flow rate to the treatment plant
unit,Date,Million gallons per day (MGD),Million gallons per day (MGD),Million gallons per day (MGD)
description,The date when the wastewater sample was collected from the Deer Island wastewater treatment plant in Massachusetts.,The daily flow rate of the northern influent to the Deer Island wastewater treatment plant in Massachusetts.,The daily flow rate of the southern influent to the Deer Island wastewater treatment plant in Massachusetts.,The total daily flow rate (sum of the northern and southern influents) to the Deer Island wastewater treatment plant in Massachusetts.
dkg_groundings,id: oboinowl:date name: date class: property ...,id: geonames:2297169 name: Northern class: individual ...,id: geonames:7533609 name: Southern class: individual ...,id: oae:0001658 name: protein total increased AE class: class ...
column_stats,type: date num_null_entries: 0 num_unique_entries: 116 most_common_entries: 2020-10-02T00:00:00 (1 times)  2020-12-14T00:00:00 (1 times)  ...,type: numeric mean: 210.637 std: 84.609 num_null_entries: 0 min: 120.7 max: 617.0 quantile_25: 157.0 quantile_50: 191.85 quantile_75: 231.5,type: numeric mean: 115.939 std: 47.982 num_null_entries: 0 min: 57.0 max: 314.0 quantile_25: 81.375 quantile_50: 111.25 quantile_75: 138.625,type: numeric mean: 326.576 std: 124.737 num_null_entries: 0 min: 177.7 max: 931.0 quantile_25: 241.5 quantile_50: 306.75 quantile_75: 369.475


Now that we have gone through this data card pipeline, we will wrap it in a function:

In [35]:
def fetch_data_card(csv_filename, documentation_filename):
    # open the files passed as arguments rather than the notebook-level globals
    with open(csv_filename, 'rb') as f_csv, open(documentation_filename, 'rb') as f_doc:
        files = {'csv_file': ('filename', f_csv), 'doc_file': ('filename', f_doc)}
        params = {"gpt_key": GPT_KEY}
        response = requests.post(API_ROOT + "cards/get_data_card/", params=params, files=files)
        json_str = response.text
    data_profile = json.loads(json_str)

    display(Markdown("# Data Card Output"))
    for key, value in data_profile.items():
        if type(value) is str and value != 'UNKNOWN':
            display(Markdown("**" + key.capitalize().replace('_', ' ') + "**: " + value))

    profiling_result = data_profile["DATA_PROFILING_RESULT"]
    for column_name, output in profiling_result.items():
        # string formatting one of the DKG groundings for display
        selected_grounding = output['dkg_groundings'][0]
        grounding_keys = ['id', 'name', 'class']
        grounding_output = [f"<b>{k}</b>: {v}" for k, v in zip(grounding_keys, selected_grounding)]
        profiling_result[column_name]['dkg_groundings'] = '\\n'.join(grounding_output) + '\\n...'

        column_stats = output['column_stats']
        # truncate numbers to a lower floating point precision
        for key, val in column_stats.items():
            if type(val) is float:
                column_stats[key] = round(val, ndigits=3)
        output['column_stats'] = ''
        # pulling up the column type first
        if column_stats.get('type'):
            output['column_stats'] += f"<b>type: {column_stats['type']}</b>\\n\\n"
        # truncating and string formatting most common entries due to length
        if column_stats.get('most_common_entries'):
            output_str = ''
            for i, entry_key in enumerate(column_stats['most_common_entries'].keys()):
                if i > 1:
                    continue
                entry_value = column_stats['most_common_entries'][entry_key]
                output_str += f"\\n\\t{entry_key} ({entry_value} times)"
            output_str += "\\n\\t..."
            column_stats['most_common_entries'] = output_str
        # truncate column stats key set
        smaller_key_set = ['earliest', 'latest', 'mean', 'std'] # ['most_common_entries', 'earliest', 'latest', 'mean', 'std', 'min', 'max', '50%']
        if smaller_key_set:
            output['column_stats'] += '\\n'.join(key + ': ' + str(value) for key, value in column_stats.items() if key != 'type' and key in smaller_key_set)
            if column_stats.get('type') and column_stats['type'] == 'categorical':
                pass
            else:
                output['column_stats'] += '\\n\\n'
            output['column_stats'] += '\\n'.join(key + ': ' + str(value) for key, value in column_stats.items() if key != 'type' and key not in smaller_key_set)
        else:
            output['column_stats'] += '\\n'.join(key + ': ' + str(value) for key, value in column_stats.items() if key != 'type')

    display(Markdown("# Data Profiling Output"))
    df = pd.read_json(io.StringIO(json.dumps(profiling_result)), orient='index')
    # deduplicate column names
    df = df.drop('col_name', axis=1)
    # print the whole description
    df['description'] = df['description'].str.wrap(100)
    # transpose table
    df = df.T
    # make sure the table is aligned to the left
    df_style = df.style.set_properties(**{'text-align': 'left'})
    # make sure the column headers are also aligned to the left
    df_style = df_style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
    # increase the font size
    df_style = df_style.set_table_attributes('style="font-size: 14px"')
    # replace delimiters for HTML parsing
    df_style = df_style.to_html().replace("\\n","<br>").replace("\\t"," ")

    display(HTML(df_style))
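
The function assumes the service returns a `DATA_PROFILING_RESULT` entry whose columns each carry `col_name`, `description`, `dkg_groundings`, and `column_stats`. A small check such as the following sketch could report a malformed card before the formatting code fails; `check_data_card` is a hypothetical helper, and the toy card below is made up for illustration:

```python
# Keys that the formatting code in this notebook reads for every column
REQUIRED_COLUMN_KEYS = ("col_name", "description", "dkg_groundings", "column_stats")

def check_data_card(data_profile):
    """Return a list of problems found in a data card dict (empty if none)."""
    profiling = data_profile.get("DATA_PROFILING_RESULT")
    if not isinstance(profiling, dict):
        return ["missing or malformed DATA_PROFILING_RESULT"]
    problems = []
    for column_name, output in profiling.items():
        for key in REQUIRED_COLUMN_KEYS:
            if key not in output:
                problems.append(f"{column_name}: missing '{key}'")
        # the display code indexes dkg_groundings[0], so an empty list would fail
        groundings = output.get("dkg_groundings")
        if groundings is not None and len(groundings) == 0:
            problems.append(f"{column_name}: no DKG groundings")
    return problems

# Toy card (hypothetical values) with an empty grounding list
toy_card = {
    "DATA_PROFILING_RESULT": {
        "Date": {
            "col_name": "Date",
            "description": "Sampling date",
            "dkg_groundings": [],
            "column_stats": {"type": "date"},
        }
    }
}
print(check_data_card(toy_card))
```

Returning a list of problems rather than raising lets the caller decide whether a partially annotated card is still worth displaying.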

Now, we switch to looking at the viral concentration dataset:

In [21]:
csv_name = "scenario-3.1a-viral-concentration.csv"
pd.read_csv(csv_name)

Unnamed: 0,Date,SARS-CoV-2 levels [copies/L],log_SARS-CoV-2,seven_day_average,seven_day_average_log
0,10/2/20,4.791207e+04,4.680454,43998.28696,4.643446
1,10/3/20,,,43998.28696,4.643446
2,10/4/20,,,52228.97821,4.717920
3,10/5/20,4.742341e+04,4.676002,52228.97821,4.717920
4,10/6/20,,,55536.09344,4.744583
...,...,...,...,...,...
111,1/21/21,5.159181e+05,5.712582,924447.10450,5.965883
112,1/22/21,6.448364e+05,5.809450,967594.89500,5.985694
113,1/23/21,9.203620e+05,5.963959,884487.78690,5.946692
114,1/24/21,2.038592e+06,6.309331,856670.00150,5.932814


...and present data annotations on this viral concentration dataset through another data card:

In [22]:
csv_name = "scenario-3.1a-viral-concentration.csv"
doc_name = "scenario-3.1a-data-documentation.txt"
fetch_data_card(csv_name, doc_name)

# Data Card Output

**Description**: The dataset contains wastewater data collected from the Deer Island wastewater treatment plant in Massachusetts from October 02, 2020 to January 25, 2021.

**Provenance**: The data was collected from the Deer Island wastewater treatment plant in Massachusetts.

**Dataset type**: tabular

# Data Profiling Output

Unnamed: 0,Date,SARS-CoV-2 levels [copies/L],log_SARS-CoV-2,seven_day_average,seven_day_average_log
concept,Sampling date,SARS-CoV-2 concentration,Logarithm of SARS-CoV-2 concentration,Seven-day average of SARS-CoV-2 concentration,Logarithm of seven-day average of SARS-CoV-2 concentration
unit,Date,copies/L,Log(copies/L),copies/L,Log(copies/L)
description,The date when the wastewater sample was collected,"The concentration of SARS-CoV-2 in the wastewater sample, measured in copies per liter",The natural logarithm of the SARS-CoV-2 concentration in the wastewater sample,The seven-day moving average of the SARS-CoV-2 concentration in the wastewater sample,The natural logarithm of the seven-day moving average of the SARS-CoV-2 concentration in the wastewater sample
dkg_groundings,id: oboinowl:date name: date class: property ...,id: cido:0004830 name: SARS-CoV-2-SARS-CoV-2 N-S physical association class: class ...,id: cido:0003058 name: SARS-CoV-2-SARS-CoV-2 M-S physical association class: class ...,id: opmi:0000091 name: day of birth class: class ...,id: opmi:0000091 name: day of birth class: class ...
column_stats,type: date num_null_entries: 0 num_unique_entries: 116 most_common_entries: 2020-10-02T00:00:00 (1 times)  2020-12-14T00:00:00 (1 times)  ...,type: numeric mean: 601107.122 std: 407072.939 num_null_entries: 9 min: 47423.412 max: 2038591.913 quantile_25: 249928.689 quantile_50: 566181.008 quantile_75: 875858.806,type: numeric mean: 5.642 std: 0.391 num_null_entries: 9 min: 4.676 max: 6.309 quantile_25: 5.397 quantile_50: 5.753 quantile_75: 5.942,type: numeric mean: 552940.252 std: 367337.12 num_null_entries: 0 min: 43998.287 max: 1256651.488 quantile_25: 156967.388 quantile_50: 614862.797 quantile_75: 873505.164,type: numeric mean: 5.587 std: 0.426 num_null_entries: 0 min: 4.643 max: 6.099 quantile_25: 5.196 quantile_50: 5.789 quantile_75: 5.941
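
The generated descriptions call `log_SARS-CoV-2` and `seven_day_average_log` natural logarithms, but the rows displayed earlier look more consistent with base 10. The sketch below checks this on row 0 of the viral concentration table; the values are copied from that output, and the tolerance is an arbitrary choice:

```python
import math

# (raw value, logged value) pairs copied from row 0 of the viral concentration table
pairs = [
    (4.791207e4, 4.680454),   # SARS-CoV-2 levels vs. log_SARS-CoV-2
    (43998.28696, 4.643446),  # seven_day_average vs. seven_day_average_log
]

def matches_base(raw, logged, base, tol=1e-3):
    """True if log_base(raw) agrees with the logged value within tol."""
    return abs(math.log(raw, base) - logged) < tol

for raw, logged in pairs:
    print(raw,
          "| base 10:", matches_base(raw, logged, 10),
          "| base e:", matches_base(raw, logged, math.e))
```

If the base-10 check holds, the `Log(copies/L)` unit is better read as log10(copies/L) when these columns are used downstream.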


## Scenario 3-4a. NYC Wastewater Flow Rate and Viral Load

`The City of New York maintains openly available COVID-19 wastewater monitoring data at https://data.cityofnewyork.us/Health/SARS-CoV-2-concentrations-measured-in-NYC-Wastewat/f7dc-2q9f/data. Extract the relevant data columns (viral load, population served); you will also need the flow rate to implement the SEIR-V model. This can be found in Table S1 of the supplemental materials of https://doi.org/10.1039/D1EW00747E. Extract this data from the table.`

For this scenario, we provide data annotations for both indicated datasets via data cards. For Table S1, (WISCONSIN) extracted the table from the supplemental materials, and we generate data annotations for it below. As mentioned above, these data cards include general annotations about the dataset, statistics about the dataset, and column-level DKG groundings.

Below, we start with the NYC wastewater monitoring table:

In [23]:
csv_name = "scenario-3.4a-nyc-wastewater.csv"
pd.read_csv(csv_name)

Unnamed: 0,Sample Date,Test date,WRRF Name,WRRF Abbreviation,Concentration SARS-CoV-2 gene target (N1 Copies/L),Per capita SARS-CoV-2 load (N1 copies per day per population),Annotation,"Population Served, estimated"
0,08/31/2020,09/01/2020,26th Ward,26W,389.0,263535.64,Concentration below Method Limit of Quantifica...,290608
1,08/31/2020,09/01/2020,Bowery Bay,BB,1204.0,443632.86,,924695
2,08/31/2020,09/01/2020,Coney Island,CI,304.0,168551.56,Concentration below Method Limit of Quantifica...,682342
3,08/31/2020,09/01/2020,Hunts Point,HP,940.0,574446.57,,755948
4,08/31/2020,09/01/2020,Jamaica Bay,JA,632.0,233077.74,,748737
...,...,...,...,...,...,...,...,...
3271,04/11/2023,04/12/2023,Port Richmond,PR,4616.0,1850000.00,,226167
3272,04/11/2023,04/12/2023,Red Hook,RH,4726.0,2080000.00,,224029
3273,04/11/2023,04/12/2023,Rockaway,RK,1697.0,906000.00,,120539
3274,04/11/2023,04/12/2023,Tallman Island,TI,3340.0,1210000.00,,449907


... and take a peek at the documentation, taken from the dataset description webpage hosted by the NYC wastewater data portal:

In [28]:
doc_name = "scenario-3.4a-nyc-wastewater-documentation.txt"

with open(doc_name, 'r') as f:
    for line in f.readlines()[:10]:
        print(line)
    print('...') # ellipsis to indicate there's more

Skip to Main Content

NYC Open Data logo

SARS-CoV-2 concentrations measured in NYC Wastewater

Health

View Data



Visualize

ExportAPI



Results of sampling to determine the SARS-CoV-2 N gene levels in NYC DEP Wastewater Resource Recovery Facility (WRRF) influent, disaggregated by the WRRF where the sample was collected, date sample was collected, and date sample was tested.

...


With these two inputs, we retrieve the following data card:

In [36]:
csv_name = "scenario-3.4a-nyc-wastewater.csv"
doc_name = "scenario-3.4a-nyc-wastewater-documentation.txt"
fetch_data_card(csv_name, doc_name)

# Data Card Output

**Description**: Results of sampling to determine the SARS-CoV-2 N gene levels in NYC DEP Wastewater Resource Recovery Facility (WRRF) influent, disaggregated by the WRRF where the sample was collected, date sample was collected, and date sample was tested.

**Author name**: Department of Environmental Protection (DEP)

**Date**: May 1, 2023

**Dataset type**: tabular

# Data Profiling Output

Unnamed: 0,Sample Date,Test date,WRRF Name,WRRF Abbreviation,Concentration SARS-CoV-2 gene target (N1 Copies/L),Per capita SARS-CoV-2 load (N1 copies per day per population),Annotation,"Population Served, estimated"
concept,Date of sample collection,Date of sample analysis,Wastewater Resource Recovery Facility name,Abbreviation of Wastewater Resource Recovery Facility name,Concentration of SARS-CoV-2 gene target,Normalized SARS-CoV-2 load,Notes on sampling and testing,Estimated population served
unit,Date,Date,Text,Text,Copies/L,Copies per day per population,Text,People
description,Date sample was collected; The “sample” is a 24 hour composite of influent wastewater. The “sample date” is the date of start of collection.,Date sample was analyzed; This date is the date the analysis started (this is a three-days analysis protocol).,Wastewater Resource Recovery Facility (waste water treatment plant) where sample was taken. Samples are taken from WRRF influent.,WRRF Abbreviation; Two letter abbreviation for WRRF name.,Concentration of the N1 target of SARS-CoV2 genetic material measured in wastewater influent.,Normalized SARS-CoV-2 N gene concentration (taking into account average daily flow and total population).,Notes on sampling and testing.,Population of sewershed; Estimated from 2020 New York Department of City Planning population estimate model.
dkg_groundings,id: dc:date name: Date class: property ...,id: oboinowl:date name: date class: property ...,id: rdfs:Resource name: Resource class: class ...,id: geogeo:000000010 name: US State Two Letter Abbreviation class: property ...,id: ro:0002463 name: target participant in class: property ...,id: cemo:per_capita_mobility name: per capita mobility class: class ...,id: iao:0000634 name: notes section class: class ...,id: idomal:0001255 name: human population class: class ...
column_stats,type: date num_null_entries: 0 num_unique_entries: 234 most_common_entries: 2020-08-31T00:00:00 (14 times)  2022-06-05T00:00:00 (14 times)  ...,type: date num_null_entries: 15 num_unique_entries: 237 most_common_entries: 2020-09-30T00:00:00 (18 times)  2021-08-02T00:00:00 (17 times)  ...,type: categorical num_null_entries: 0 num_unique_entries: 14 most_common_entries: 26th Ward (234 times)  Bowery Bay (234 times)  ...,type: categorical num_null_entries: 0 num_unique_entries: 14 most_common_entries: 26W (234 times)  BB (234 times)  ...,type: numeric mean: 11412.836 std: 14586.518 num_null_entries: 120 min: 30.0 max: 194978.0 quantile_25: 2964.75 quantile_50: 7374.0 quantile_75: 15472.25,type: numeric mean: 5545685.215 std: 6915624.234 num_null_entries: 119 min: 0.0 max: 107298936.35 quantile_25: 1500000.0 quantile_50: 3731524.81 quantile_75: 7540000.0,type: categorical num_null_entries: 2658 num_unique_entries: 42 most_common_entries: This sample was analyzed in duplicate. The higher of the 2 results is reported (202 times)  Concentration below Method Limit of Quantification (above Method Limit of Detection) (117 times)  ...,type: numeric mean: 614621.357 std: 345445.583 num_null_entries: 0 min: 120539.0 max: 1201485.0 quantile_25: 258731.0 quantile_50: 670469.0 quantile_75: 906442.0
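
The per capita load column is described as the concentration normalized by average daily flow and total population. One plausible reading, which is our assumption and not a formula stated in the source, is load = concentration × flow / population. A sketch under that assumption, with flow given in MGD (US liquid gallons assumed) and a hypothetical function name:

```python
LITERS_PER_US_GALLON = 3.785411784  # assumes US liquid gallons

def per_capita_load(conc_copies_per_l, flow_mgd, population):
    """Copies per day per person, assuming load = concentration * flow / population."""
    if population <= 0:
        raise ValueError("population must be positive")
    flow_l_per_day = flow_mgd * 1e6 * LITERS_PER_US_GALLON
    return conc_copies_per_l * flow_l_per_day / population

# Hypothetical inputs, chosen only to illustrate the units
print(per_capita_load(1000.0, 100.0, 1_000_000))
```

Values computed this way need not reproduce the published column, since the flow used for each sample is not part of this dataset.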


Last, but not least, we inspect Table S1 from the paper's supplemental materials:

In [30]:
csv_name = "scenario-3.4a-supplement.csv"
pd.read_csv(csv_name)

Unnamed: 0,Wastewater Resource Recovery Facility (WRRF),Borough(s),Population Served*,Daily Flow Range (Average)† in MGD
0,Hunts Point,Bronx,755948,115 - 215 (136)
1,Wards Island,Bronx and Manhattan,1201485,143 - 273 (180)
2,North River,Manhattan,658596,81 - 143 (94)
3,Newtown Creek,"Manhattan, Brooklyn, \nand Queens",1156473,158 - 296 (188)
4,Red Hook,Brooklyn,224029,21 - 46 (26)
5,Owls Head,Brooklyn,906442,81 - 159 (95)
6,Coney Island,Brooklyn,682342,70 - 102 (82)
7,26th Ward,Brooklyn,290608,44 - 89 (55)
8,Rockaway,Queens,120539,18 - 25 (20)
9,Jamaica Bay,Queens,748737,74 - 103 (81)


...and peek at the paper contents, which are used as this dataset's documentation:

In [33]:
doc_name = "scenario-3.4a-supplement-documentation.txt"

with open(doc_name, 'r') as f:
    for line in f.readlines()[:5]:
        print(line)
    print('...') # ellipsis to indicate there's more

Monitoring SARS-CoV-2 in wastewater during New York City's second wave of COVID-19: sewershed-level trends and relationships to publicly available clinical testing data†	Check for updates

Catherine Hoar, ORCID logo a   Francoise Chauvin,b   Alexander Clare,b   Hope McGibbon,b   Esmeraldo Castro,b   Samantha Patinella,b   Dimitrios Katehis,b   John J. Dennehy, ORCID logo cd   Monica Trujillo,e   Davida S. Smyth‡f  and  Andrea I. Silverman ORCID logo *a  

 Author affiliations

Abstract

New York City's wastewater monitoring program tracked trends in sewershed-level SARS-CoV-2 loads starting in the fall of 2020, just before the start of the city's second wave of the COVID-19 outbreak. During a five-month study period, from November 8, 2020 to April 11, 2021, viral loads in influent wastewater from each of New York City's 14 wastewater treatment plants were measured and compared to new laboratory-confirmed COVID-19 cases for the populations in each corresponding sewershed, estimated from

Finally, we present data annotations on the supplemental table via data card:

In [34]:
csv_name = "scenario-3.4a-supplement.csv"
doc_name = "scenario-3.4a-supplement-documentation.txt"
fetch_data_card(csv_name, doc_name)

# Data Card Output

**Description**: Monitoring SARS-CoV-2 in wastewater during New York City's second wave of COVID-19: sewershed-level trends and relationships to publicly available clinical testing data† Check for updates

**Author name**: Catherine Hoar, Francoise Chauvin, Alexander Clare, Hope McGibbon, Esmeraldo Castro, Samantha Patinella, Dimitrios Katehis, John J. Dennehy, Monica Trujillo, Davida S. Smyth, Andrea I. Silverman

**Provenance**: New York City's wastewater monitoring program tracked trends in sewershed-level SARS-CoV-2 loads starting in the fall of 2020, just before the start of the city's second wave of the COVID-19 outbreak.

**Dataset type**: tabular

# Data Profiling Output

Unnamed: 0,Wastewater Resource Recovery Facility (WRRF),Borough(s),Population Served*,Daily Flow Range (Average)† in MGD
concept,Wastewater treatment plant,Geographic location,Number of people,Wastewater flow
unit,,,People,Million gallons per day (MGD)
description,The name of the wastewater treatment plant in New York City,The borough(s) in New York City that the wastewater treatment plant serves,The estimated number of people whose wastewater is treated by the plant,"The average daily flow of wastewater treated by the plant, given as a range in million gallons per day"
dkg_groundings,id: apollosv:00000618 name: wastewater surveillance data set class: class ...,id: geonames:3333125 name: City and Borough of Birmingham class: individual ...,id: askemo:0000001 name: population class: class ...,id: cemo:average_daily_number_of_new_infections_generated_per_case_rt name: average daily number of new infections generated per case (rt) class: class ...
column_stats,type: categorical num_null_entries: 0 num_unique_entries: 14 most_common_entries: Hunts Point (1 times)  Wards Island (1 times)  ...,type: categorical num_null_entries: 0 num_unique_entries: 7 most_common_entries: Brooklyn (4 times)  Queens (4 times)  ...,type: numeric mean: 614621.357 std: 358431.105 num_null_entries: 0 min: 120539.0 max: 1201485.0 quantile_25: 266700.25 quantile_50: 670469.0 quantile_75: 868818.5,type: categorical num_null_entries: 0 num_unique_entries: 14 most_common_entries: 115 - 215 (136) (1 times)  143 - 273 (180) (1 times)  ...
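
The flow column is profiled as categorical because each entry is a string such as `115 - 215 (136)`. Before it can feed a model, it needs to be split into numbers. A minimal parsing sketch, assuming every entry follows the `low - high (average)` layout seen in the table above; `parse_flow_range` is a hypothetical helper:

```python
import re

# Matches "<low> - <high> (<average>)", e.g. "115 - 215 (136)";
# an en dash is also accepted in case the source table uses one.
FLOW_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*[-\u2013]\s*(\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\)\s*$"
)

def parse_flow_range(text):
    """Split a flow-range string into (low, high, average) floats, all in MGD."""
    match = FLOW_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized flow range: {text!r}")
    low, high, average = (float(group) for group in match.groups())
    return low, high, average

print(parse_flow_range("115 - 215 (136)"))  # (115.0, 215.0, 136.0)
```

With the table loaded in pandas, the same function could be mapped over the `Daily Flow Range (Average)† in MGD` column to obtain numeric low, high, and average flows.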


In [37]:
csv_name = "scenario-3.4a-nyc-wastewater.csv"
doc_name = "scenario-3.4a-nyc-wastewater-documentation.txt"
fetch_data_card(csv_name, doc_name)

# Data Card Output

**Description**: Results of sampling to determine the SARS-CoV-2 N gene levels in NYC DEP Wastewater Resource Recovery Facility (WRRF) influent, disaggregated by the WRRF where the sample was collected, date sample was collected, and date sample was tested.

**Author name**: Department of Environmental Protection (DEP)

**Date**: May 1, 2023

**Dataset type**: tabular

# Data Profiling Output

Unnamed: 0,Sample Date,Test date,WRRF Name,WRRF Abbreviation,Concentration SARS-CoV-2 gene target (N1 Copies/L),Per capita SARS-CoV-2 load (N1 copies per day per population),Annotation,"Population Served, estimated"
concept,Date of sample collection,Date of sample analysis,Wastewater Resource Recovery Facility name,Abbreviation of Wastewater Resource Recovery Facility name,Concentration of SARS-CoV-2 gene target,Normalized SARS-CoV-2 gene concentration,Notes on sampling and testing,Estimated population served
unit,Date,Date,Text,Text,Copies/L,Copies per day per population,Text,People
description,Date sample was collected; The “sample” is a 24 hour composite of influent wastewater. The “sample date” is the date of start of collection.,Date sample was analyzed; This date is the date the analysis started (this is a three-days analysis protocol).,Wastewater Resource Recovery Facility (waste water treatment plant) where sample was taken. Samples are taken from WRRF influent.,WRRF Abbreviation; Two letter abbreviation for WRRF name.,Concentration of the N1 target of SARS-CoV2 genetic material measured in wastewater influent.,Normalized SARS-CoV-2 N gene concentration (taking into account average daily flow and total population).,Notes on sampling and testing.,Population of sewershed; Estimated from 2020 New York Department of City Planning population estimate model.
dkg_groundings,id: dc:date name: Date class: property ...,id: oboinowl:date name: date class: property ...,id: rdfs:Resource name: Resource class: class ...,id: geogeo:000000010 name: US State Two Letter Abbreviation class: property ...,id: ro:0002463 name: target participant in class: property ...,id: cemo:per_capita_mobility name: per capita mobility class: class ...,id: iao:0000634 name: notes section class: class ...,id: idomal:0001255 name: human population class: class ...
column_stats,type: date earliest: 2020-08-31T00:00:00 latest: 2023-04-11T00:00:00 num_null_entries: 0 num_unique_entries: 234 most_common_entries: 2020-08-31T00:00:00 (14 times)  2022-06-05T00:00:00 (14 times)  ...,type: date earliest: 2020-09-01T00:00:00 latest: 2023-04-12T00:00:00 num_null_entries: 15 num_unique_entries: 237 most_common_entries: 2020-09-30T00:00:00 (18 times)  2021-08-02T00:00:00 (17 times)  ...,type: categorical num_null_entries: 0 num_unique_entries: 14 most_common_entries: 26th Ward (234 times)  Bowery Bay (234 times)  ...,type: categorical num_null_entries: 0 num_unique_entries: 14 most_common_entries: 26W (234 times)  BB (234 times)  ...,type: numeric mean: 11412.836 std: 14586.518 num_null_entries: 120 min: 30.0 max: 194978.0 25%: 2964.75 50%: 7374.0 75%: 15472.25,type: numeric mean: 5545685.215 std: 6915624.234 num_null_entries: 119 min: 0.0 max: 107298936.35 25%: 1500000.0 50%: 3731524.81 75%: 7540000.0,type: categorical num_null_entries: 2658 num_unique_entries: 42 most_common_entries: This sample was analyzed in duplicate. The higher of the 2 results is reported (202 times)  Concentration below Method Limit of Quantification (above Method Limit of Detection) (117 times)  ...,type: numeric mean: 614621.357 std: 345445.583 num_null_entries: 0 min: 120539.0 max: 1201485.0 25%: 258731.0 50%: 670469.0 75%: 906442.0
