# MIT (TA1): Data cards for Scenarios 3.1a and 3.4a


## 0. Preprocessing

In [1]:
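import json
import os
import requests

# GPT_KEY is defined in the local gpt_key module; import it by name rather than with *.
from gpt_key import GPT_KEY

# Root URL of the locally running service that serves the data card endpoint.
API_ROOT = "http://localhost:8000/"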

## Scenario 3-1a. MA Wastewater Flow Rate and Viral Load

`Extract the appropriate data columns (cRNA2 and F2, which represent the viral load in wastewater and the flow rate, respectively) from the paper’s supplemental materials.`

The flow rate and viral load columns are in two `.csv` files in the supplemental materials. We provide a data card for each file. Each card includes general annotations about the dataset, dataset statistics, and column-level DKG groundings.

[http://100.26.10.46/#/Data-and-model-cards/get_data_card_cards_get_data_card_post](http://100.26.10.46/#/Data-and-model-cards/get_data_card_cards_get_data_card_post)
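Every data card in this notebook comes from the same POST request, so the call can be wrapped once and reused. The following is a minimal sketch, not part of the original notebook: the helper name `get_data_card`, the `post` hook (added only so the function can be exercised without a running server), and the `timeout` value are assumptions. The endpoint path, the `gpt_key` query parameter, and the `csv_file`/`doc_file` form fields are taken from the cells below.

```python
from urllib.parse import urljoin


def get_data_card(csv_path, doc_path, gpt_key,
                  api_root="http://localhost:8000/", post=None, timeout=300):
    """Request a data card for one CSV file and its documentation file.

    `post` is a hypothetical hook: pass a stand-in callable to test offline.
    When it is None, `requests.post` is used, as in the notebook cells.
    """
    if post is None:
        import requests  # imported lazily so the sketch loads without requests installed
        post = requests.post

    url = urljoin(api_root, "cards/get_data_card/")
    with open(csv_path, "rb") as f_csv, open(doc_path, "rb") as f_doc:
        # Same multipart field names and placeholder filename as the notebook cells.
        files = {"csv_file": ("filename", f_csv), "doc_file": ("filename", f_doc)}
        response = post(url, params={"gpt_key": gpt_key}, files=files, timeout=timeout)

    response.raise_for_status()  # surface HTTP errors before trying to parse JSON
    return response.json()
```

With a running server, a call such as `get_data_card("scenario-3.1a-flow-rates.csv", "scenario-3.1a-data-documentation.txt", GPT_KEY)` should return the same dictionary that the cells below print.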

Below, we start with the flow rate table:

In [3]:
csv_name = "scenario-3.1a-flow-rates.csv"
doc_name = "scenario-3.1a-data-documentation.txt"

with open(csv_name, 'rb') as f_csv, open(doc_name, 'rb') as f_doc:
    # Upload the table and its documentation as multipart form fields.
    files = {'csv_file': ('filename', f_csv), 'doc_file': ('filename', f_doc)}
    params = {"gpt_key": GPT_KEY}
    response = requests.post(API_ROOT + "cards/get_data_card/", params=params, files=files)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
profile31a_flow = response.json()

profile31a_flow

{'DESCRIPTION': 'This dataset contains wastewater data collected from the Deer Island wastewater treatment plant in Massachusetts from October 02, 2020 to January 25, 2021.',
 'AUTHOR_NAME': 'UNKNOWN',
 'AUTHOR_EMAIL': 'UNKNOWN',
 'DATE': 'UNKNOWN',
 'PROVENANCE': 'The data was collected from the Deer Island wastewater treatment plant in Massachusetts.',
 'SENSITIVITY': 'UNKNOWN',
 'LICENSE': 'UNKNOWN',
 'SCHEMA': ['Sample Date',
  'North System Flows to DITP, MGD',
  'South System Flows to DITP, MGD',
  'Total flows to DITP (MGD: Million gallon per day)'],
 'EXAMPLES': {'Sample Date': '11/26/20',
  'North System Flows to DITP, MGD': 178.9,
  'South System Flows to DITP, MGD': 106.9,
  'Total flows to DITP (MGD: Million gallon per day)': 285.8},
 'DATA_PROFILING_RESULT': {'Sample Date': {'col_name': 'Sample Date',
   'concept': 'Date of sample collection',
   'unit': 'Date',
   'description': 'The date when the wastewater sample was collected from the Deer Island wastewater treatment p
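The full card is long, so a compact per-column view can be easier to scan. The sketch below is illustrative only: `summarize_columns` is a hypothetical helper, the field names (`DATA_PROFILING_RESULT`, `concept`, `unit`, `dkg_groundings`) are the ones visible in the card outputs printed in this notebook, and `dkg_groundings` is assumed to be a list whose entries start with an identifier. Nothing here assumes the first grounding is the best one; it is simply the first one listed.

```python
def summarize_columns(card):
    """Return one (column, concept, unit, first_grounding_id) tuple per profiled column."""
    rows = []
    for name, info in card.get("DATA_PROFILING_RESULT", {}).items():
        # Read groundings defensively: the field may be missing or empty.
        groundings = info.get("dkg_groundings") or []
        first_id = groundings[0][0] if groundings else None
        rows.append((name, info.get("concept"), info.get("unit"), first_id))
    return rows
```

For example, `for row in summarize_columns(profile31a_flow): print(row)` would print one line per column of the flow rate table.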

Now, we present data annotations on the viral concentration dataset through another data card:

In [4]:
csv_name = "scenario-3.1a-viral-concentration.csv"
doc_name = "scenario-3.1a-data-documentation.txt"

with open(csv_name, 'rb') as f_csv, open(doc_name, 'rb') as f_doc:
    # Upload the table and its documentation as multipart form fields.
    files = {'csv_file': ('filename', f_csv), 'doc_file': ('filename', f_doc)}
    params = {"gpt_key": GPT_KEY}
    response = requests.post(API_ROOT + "cards/get_data_card/", params=params, files=files)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
profile31a_viral = response.json()

profile31a_viral

{'DESCRIPTION': 'This dataset contains wastewater data collected from the Deer Island wastewater treatment plant in Massachusetts from October 02, 2020 to January 25, 2021.',
 'AUTHOR_NAME': 'UNKNOWN',
 'AUTHOR_EMAIL': 'UNKNOWN',
 'DATE': 'UNKNOWN',
 'PROVENANCE': 'The data was collected from the Deer Island wastewater treatment plant in Massachusetts.',
 'SENSITIVITY': 'UNKNOWN',
 'LICENSE': 'UNKNOWN',
 'SCHEMA': ['Date',
  'SARS-CoV-2 levels [copies/L]',
  'log_SARS-CoV-2',
  'seven_day_average',
  'seven_day_average_log'],
 'EXAMPLES': {'Date': '10/24/20',
  'SARS-CoV-2 levels [copies/L]': 118741.7144,
  'log_SARS-CoV-2': 5.074606972,
  'seven_day_average': 136504.2122,
  'seven_day_average_log': 5.135149235},
 'DATA_PROFILING_RESULT': {'Date': {'col_name': 'Date',
   'concept': 'Sampling date',
   'unit': 'Date',
   'description': 'The date when the wastewater sample was collected',
   'dkg_groundings': [['oboinowl:date', 'date', 'property'],
    ['dc:date', 'Date', 'property'],
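With both cards in hand, the two columns the scenario asks for can be pulled out and aligned by date. This is a rough sketch under stated assumptions: that F2 corresponds to the `Total flows to DITP` column and cRNA2 to the `SARS-CoV-2 levels [copies/L]` column, that both files use `MM/DD/YY` dates as in the `EXAMPLES` fields, and that the values parse as plain floats. The helper names `read_series` and `join_on_date` are hypothetical.

```python
import csv
from datetime import datetime

# Column names copied from the SCHEMA fields of the two data cards.
FLOW_DATE, FLOW_VALUE = "Sample Date", "Total flows to DITP (MGD: Million gallon per day)"
VIRAL_DATE, VIRAL_VALUE = "Date", "SARS-CoV-2 levels [copies/L]"


def read_series(handle, date_col, value_col, date_format="%m/%d/%y"):
    """Read one numeric column from an open CSV file, keyed by parsed date."""
    series = {}
    for row in csv.DictReader(handle):
        raw = (row.get(value_col) or "").strip()
        if not raw:
            continue  # skip rows with no measurement
        day = datetime.strptime(row[date_col].strip(), date_format).date()
        series[day] = float(raw)
    return series


def join_on_date(flow, viral):
    """Return (date, flow, viral) tuples for dates present in both series."""
    return [(day, flow[day], viral[day]) for day in sorted(flow.keys() & viral.keys())]


# Usage with the files from this notebook:
# with open("scenario-3.1a-flow-rates.csv", newline="") as f:
#     flow = read_series(f, FLOW_DATE, FLOW_VALUE)
# with open("scenario-3.1a-viral-concentration.csv", newline="") as f:
#     viral = read_series(f, VIRAL_DATE, VIRAL_VALUE)
# paired = join_on_date(flow, viral)
```

An inner join on the date is used here so that every returned row has both measurements; dates present in only one file are dropped.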
  

## Scenario 3-4a. NYC Wastewater Flow Rate and Viral Load

`The City of New York maintains openly available COVID-19 wastewater monitoring data at https://data.cityofnewyork.us/Health/SARS-CoV-2-concentrations-measured-in-NYC-Wastewat/f7dc-2q9f/data. Extract the relevant data columns (viral load, population served); you will also need the flow rate to implement the SEIR-V model. This can be found in Table S1 of the supplemental materials of https://doi.org/10.1039/D1EW00747E. Extract this data from the table.`

For this scenario, we provide a data card for each of the two indicated datasets. For Table S1, (WISCONSIN) extracted the table from the supplemental materials, and we generate data annotations for that extracted table below. As in Scenario 3-1a, each data card includes general annotations about the dataset, dataset statistics, and column-level DKG groundings.

[http://100.26.10.46/#/Data-and-model-cards/get_data_card_cards_get_data_card_post](http://100.26.10.46/#/Data-and-model-cards/get_data_card_cards_get_data_card_post)

Below, we start with the NYC wastewater monitoring table:

In [5]:
csv_name = "scenario-3.4a-nyc-wastewater.csv"
doc_name = "scenario-3.4a-nyc-wastewater-documentation.txt"

with open(csv_name, 'rb') as f_csv, open(doc_name, 'rb') as f_doc:
    # Upload the table and its documentation as multipart form fields.
    files = {'csv_file': ('filename', f_csv), 'doc_file': ('filename', f_doc)}
    params = {"gpt_key": GPT_KEY}
    response = requests.post(API_ROOT + "cards/get_data_card/", params=params, files=files)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
profile34a_nyc = response.json()

profile34a_nyc

{'DESCRIPTION': 'Results of sampling to determine the SARS-CoV-2 N gene levels in NYC DEP Wastewater Resource Recovery Facility (WRRF) influent, disaggregated by the WRRF where the sample was collected, date sample was collected, and date sample was tested.',
 'AUTHOR_NAME': 'Department of Environmental Protection (DEP)',
 'AUTHOR_EMAIL': 'UNKNOWN',
 'DATE': 'May 1, 2023',
 'PROVENANCE': 'UNKNOWN',
 'SENSITIVITY': 'UNKNOWN',
 'LICENSE': 'UNKNOWN',
 'SCHEMA': ['Sample Date',
  'Test date',
  'WRRF Name',
  'WRRF Abbreviation',
  'Concentration SARS-CoV-2 gene target (N1 Copies/L) ',
  'Per capita SARS-CoV-2 load (N1 copies per day per population)',
  'Annotation',
  'Population Served, estimated '],
 'EXAMPLES': {'Sample Date': '01/12/2022',
  'Test date': '01/13/2022',
  'WRRF Name': 'Coney Island',
  'WRRF Abbreviation': 'CI',
  'Concentration SARS-CoV-2 gene target (N1 Copies/L)': 44315.0,
  'Per capita SARS-CoV-2 load (N1 copies per day per population)': 17209241.19,
  'Annotation':
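The `SCHEMA` above lists two headers with a trailing space (`'Concentration SARS-CoV-2 gene target (N1 Copies/L) '` and `'Population Served, estimated '`), while the `EXAMPLES` key for the concentration column has none. When extracting the viral load and population columns, it may therefore help to match headers after stripping whitespace. The sketch below is one way to do that; `read_columns` and `NYC_COLUMNS` are hypothetical names, and values are kept as strings because the full cell formats are not shown in this notebook.

```python
import csv

# Headers of interest, written without the trailing spaces seen in SCHEMA.
NYC_COLUMNS = [
    "Sample Date",
    "WRRF Abbreviation",
    "Concentration SARS-CoV-2 gene target (N1 Copies/L)",
    "Population Served, estimated",
]


def read_columns(handle, wanted):
    """Return the requested columns from an open CSV file, matching headers by stripped name."""
    reader = csv.DictReader(handle)
    # Map each whitespace-stripped header to the raw header used in the file.
    raw_by_clean = {name.strip(): name for name in (reader.fieldnames or [])}
    missing = [name for name in wanted if name not in raw_by_clean]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [{name: row[raw_by_clean[name]] for name in wanted} for row in reader]


# Usage with the file from this notebook:
# with open("scenario-3.4a-nyc-wastewater.csv", newline="") as f:
#     rows = read_columns(f, NYC_COLUMNS)
```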

Next, we present data annotations on the supplemental table through another data card:

In [6]:
csv_name = "scenario-3.4a-supplement.csv"
doc_name = "scenario-3.4a-supplement-documentation.txt"

with open(csv_name, 'rb') as f_csv, open(doc_name, 'rb') as f_doc:
    # Upload the table and its documentation as multipart form fields.
    files = {'csv_file': ('filename', f_csv), 'doc_file': ('filename', f_doc)}
    params = {"gpt_key": GPT_KEY}
    response = requests.post(API_ROOT + "cards/get_data_card/", params=params, files=files)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
profile34a_supplement = response.json()

profile34a_supplement

{'DESCRIPTION': "Monitoring SARS-CoV-2 in wastewater during New York City's second wave of COVID-19: sewershed-level trends and relationships to publicly available clinical testing data† Check for updates",
 'AUTHOR_NAME': 'Catherine Hoar, Francoise Chauvin, Alexander Clare, Hope McGibbon, Esmeraldo Castro, Samantha Patinella, Dimitrios Katehis, John J. Dennehy, Monica Trujillo, Davida S. Smyth, Andrea I. Silverman',
 'AUTHOR_EMAIL': 'UNKNOWN',
 'DATE': 'UNKNOWN',
 'PROVENANCE': "New York City's wastewater monitoring program tracked trends in sewershed-level SARS-CoV-2 loads starting in the fall of 2020, just before the start of the city's second wave of the COVID-19 outbreak.",
 'SENSITIVITY': 'UNKNOWN',
 'LICENSE': 'UNKNOWN',
 'SCHEMA': ['Wastewater Resource Recovery Facility (WRRF)',
  'Borough(s)',
  'Population Served*',
  'Daily Flow Range (Average)† in MGD'],
 'EXAMPLES': {'Wastewater Resource Recovery Facility (WRRF)': 'Newtown Creek',
  'Borough(s)': 'Manhattan, Brooklyn, \nan
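The flow values in this table are in MGD (million gallons per day), while the NYC concentrations are in copies per liter, so combining them needs a unit conversion. The sketch below shows the arithmetic only. It assumes US liquid gallons and assumes that a daily per-capita load is concentration times daily flow divided by population served; this may not match how the city's published per-capita column was calculated. Both function names are hypothetical.

```python
LITERS_PER_US_GALLON = 3.785411784  # exact, by definition of the US liquid gallon


def mgd_to_liters_per_day(flow_mgd):
    """Convert a flow in million US gallons per day to liters per day."""
    return flow_mgd * 1_000_000 * LITERS_PER_US_GALLON


def per_capita_load(concentration_copies_per_l, flow_mgd, population):
    """Copies per day per person, under the assumption stated above."""
    if population <= 0:
        raise ValueError("population must be positive")
    return concentration_copies_per_l * mgd_to_liters_per_day(flow_mgd) / population
```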