# MIT (TA1): From Paper and Code to annotated extraction

* [Step 3 - M3: Data profiling API](#cellm3)

* [Step 5 - M1: MIT extraction pipeline demo](#cellm1)

* [Step 6 - M2: TA1 abstractions merge pipeline demo](#cellm2)


## 0. Preprocessing

In [59]:
import ast, json, requests, os
from gpt_key import GPT_KEY  # local module that defines GPT_KEY
API_ROOT = "http://localhost:8000/"

#### A local script (not shown in this notebook) consolidates the "content" fields to get just the text of the paper.

## 1. Extracting variables and annotating them from the text and the DKG

#### We extract variables from the paper alongside a list of possible definitions, and ground each of these variables to the MIRA DKG.

In [60]:
with open("../../resources/models/Bucky/bucky_short.txt", "r") as f:
    text = f.read()

dct_extract = {"text": text, "gpt_key": GPT_KEY}
json_str = requests.post(API_ROOT + "annotation/find_text_vars/", params=dct_extract).text

In [61]:
ast.literal_eval(json_str)

[{'type': 'variable',
  'name': 'S_i j',
  'id': 'v0',
  'text_annotations': [' Proportion of individuals who are susceptible to the virus'],
  'dkg_annotations': [['geonames:2479536', 'Skikda'],
   ['geonames:487495', 'Sterlitamak']]},
 {'type': 'variable',
  'name': 'E_i j',
  'id': 'v1',
  'text_annotations': [' Proportion of individuals who have been exposed to the virus'],
  'dkg_annotations': [['geonames:1816670', 'Beijing'],
   ['geonames:1799491', 'Neijiang']]},
 {'type': 'variable',
  'name': 'I_i, j^hosp',
  'id': 'v2',
  'text_annotations': [' Proportion of individuals that are exhibiting severe disease symptoms and are in need of hospitalization'],
  'dkg_annotations': []},
 {'type': 'variable',
  'name': 'I_i, j^mild',
  'id': 'v3',
  'text_annotations': [' Proportion of individuals that are exhibiting mild disease symptoms'],
  'dkg_annotations': []},
 {'type': 'variable',
  'name': 'I_i, j^asym',
  'id': 'v4',
  'text_annotations': [' Proportion of individuals who are in
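
The response body is JSON, so `json.loads` is arguably a safer way to parse it than `ast.literal_eval`, which rejects JSON literals such as `null`, `true`, and `false`. The following is a minimal sketch on a hand-copied sample of the output above; `index_by_id` is a hypothetical helper and not part of the MIT API.

```python
import json

# Sample copied from the output above (first two variables only).
sample_json = json.dumps([
    {"type": "variable", "name": "S_i j", "id": "v0",
     "text_annotations": [" Proportion of individuals who are susceptible to the virus"],
     "dkg_annotations": [["geonames:2479536", "Skikda"],
                         ["geonames:487495", "Sterlitamak"]]},
    {"type": "variable", "name": "E_i j", "id": "v1",
     "text_annotations": [" Proportion of individuals who have been exposed to the virus"],
     "dkg_annotations": [["geonames:1816670", "Beijing"],
                         ["geonames:1799491", "Neijiang"]]},
])


def index_by_id(json_text):
    """Parse the endpoint's JSON text and key each variable by its 'id' field."""
    return {var["id"]: var for var in json.loads(json_text)}


variables = index_by_id(sample_json)
for var_id, var in variables.items():
    # Definitions come back with a leading space, so strip before printing.
    definition = var["text_annotations"][0].strip()
    print(f"{var_id}: {var['name']} = {definition}")
```

Keying by `id` makes it easy to look up a variable again after later steps attach more annotations to it.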

## 2. Adding annotations from dataset columns

#### Alongside the text, we may also have found a collection of datasets that look relevant:

In [62]:
with open("../../resources/dataset/covid_confirmed_usafacts.csv") as f:
    dataset_1 = f.read()
    print(dataset_1[:150])

countyFIPS,County Name,State,StateFIPS,2020-01-22
0,"Statewide Unallocated","AL","01",0
1001,"Autauga County ","AL","01",0


In [63]:
with open("../../resources/dataset/covid_deaths_usafacts.csv") as f:
    dataset_2 = f.read()
    print(dataset_2[:150])

countyFIPS,County Name,State,StateFIPS,2020-01-22
0,"Statewide Unallocated","AL","01",0
1001,"Autauga County ","AL","01",0


#### Let's collect just the column names into a single file:

In [64]:
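dataset_dir = "../../resources/dataset/"  # renamed from `dir` to avoid shadowing the builtin
with open(os.path.join(dataset_dir, "headers.txt"), "w") as fw:
    for filename in os.listdir(dataset_dir):
        path = os.path.join(dataset_dir, filename)
        if os.path.isfile(path) and path.endswith(".csv"):
            # Write "<filename>:\t<header line>"; close each CSV after reading its first line.
            with open(path, "r") as fr:
                fw.write("{}:\t{}".format(filename, fr.readline()))

In [65]:
with open(os.path.join(dataset_dir, "headers.txt")) as f:
    dataset_str = f.read()
    print(dataset_str[:419])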

COVID-19_Reported_Patient_Impact_and_Hospital_Capacity_by_State_Archive_Repository.csv:	Update Date,Days Since Update,User,Rows,Row Change,Columns,Column Change,Metadata Published,Metadata Updates,Column Level Metadata,Column Level Metadata Updates,Archive Link
StatewideCOVID-19CasesDeathsTests.csv:	date,area,area_type,population,cases,cumulative_cases,deaths,cumulative_deaths,total_tests,cumulative_total_tests,posi
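
If the column names are needed as Python lists rather than one string, the `filename:<TAB>header` lines can be split back apart. This is a rough sketch that assumes each line follows exactly the format written above; `parse_headers` is a hypothetical helper, and the two sample lines are copied from headers printed elsewhere in this notebook.

```python
import csv

# Two lines in the "filename:\theader" format written to headers.txt.
headers_text = (
    "covid_confirmed_usafacts.csv:\tcountyFIPS,County Name,State,StateFIPS,2020-01-22\n"
    "us-counties.csv:\tdate,county,state,fips,cases,deaths\n"
)


def parse_headers(text):
    """Map each CSV filename to its list of column names."""
    columns_by_file = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        filename, _, header = line.partition(":\t")
        # csv.reader copes with quoted column names that contain commas.
        columns_by_file[filename] = next(csv.reader([header]))
    return columns_by_file


columns_by_file = parse_headers(headers_text)
print(columns_by_file["us-counties.csv"])
```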


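#### Now we can call our `annotation/link_datasets_to_vars` endpoint to map the variables discovered earlier to any matching dataset columns (garbage in, garbage out: the links can only be as good as the extracted variables and column names they are built from):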

In [66]:
dct_cols = {"json_str": json_str, "dataset_str": dataset_str, "gpt_key": GPT_KEY}
json_str = requests.post(API_ROOT + "annotation/link_datasets_to_vars/", params=dct_cols).text
json_str

'[{"type":"variable","name":"S_i j","id":"v0","text_annotations":[" Proportion of individuals who are susceptible to the virus"],"dkg_annotations":[["geonames:2479536","Skikda"],["geonames:487495","Sterlitamak"]],"data_annotations":[""]},{"type":"variable","name":"E_i j","id":"v1","text_annotations":[" Proportion of individuals who have been exposed to the virus"],"dkg_annotations":[["geonames:1816670","Beijing"],["geonames:1799491","Neijiang"]],"data_annotations":[["StatewideCOVID-19CasesDeathsTests.csv","reported_cases"],["StatewideCOVID-19CasesDeathsTests.csv","cumulative_reported_cases"]]},{"type":"variable","name":"I_i, j^hosp","id":"v2","text_annotations":[" Proportion of individuals that are exhibiting severe disease symptoms and are in need of hospitalization"],"dkg_annotations":[],"data_annotations":["","",["Reported_State_Hospital_Capacity_and_COVID19_Patient_Impact.csv","staffed_adult_icu_bed_occ"],["Reported_State_Hospital_Capacity_and_COVID19_Patient_Impact.csv","staffed_a

In [67]:
ast.literal_eval(json_str)

[{'type': 'variable',
  'name': 'S_i j',
  'id': 'v0',
  'text_annotations': [' Proportion of individuals who are susceptible to the virus'],
  'dkg_annotations': [['geonames:2479536', 'Skikda'],
   ['geonames:487495', 'Sterlitamak']],
  'data_annotations': ['']},
 {'type': 'variable',
  'name': 'E_i j',
  'id': 'v1',
  'text_annotations': [' Proportion of individuals who have been exposed to the virus'],
  'dkg_annotations': [['geonames:1816670', 'Beijing'],
   ['geonames:1799491', 'Neijiang']],
  'data_annotations': [['StatewideCOVID-19CasesDeathsTests.csv',
    'reported_cases'],
   ['StatewideCOVID-19CasesDeathsTests.csv', 'cumulative_reported_cases']]},
 {'type': 'variable',
  'name': 'I_i, j^hosp',
  'id': 'v2',
  'text_annotations': [' Proportion of individuals that are exhibiting severe disease symptoms and are in need of hospitalization'],
  'dkg_annotations': [],
  'data_annotations': ['',
   '',
   ['Reported_State_Hospital_Capacity_and_COVID19_Patient_Impact.csv',
    'staf
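
The `data_annotations` lists above mix empty strings with `[file, column]` pairs. One possible cleanup, sketched here on a hand-copied sample, keeps only the well-formed pairs; `clean_data_annotations` is a hypothetical helper, not something the endpoint provides.

```python
def clean_data_annotations(variables):
    """Return copies of the variables with only non-empty [file, column] pairs kept."""
    cleaned = []
    for var in variables:
        pairs = [
            list(item)
            for item in var.get("data_annotations", [])
            # Drop placeholders such as '' and anything that is not a 2-item pair.
            if isinstance(item, (list, tuple)) and len(item) == 2 and all(item)
        ]
        cleaned.append({**var, "data_annotations": pairs})
    return cleaned


# Sample copied from the output above (other fields omitted for brevity).
sample = [
    {"name": "S_i j", "id": "v0", "data_annotations": [""]},
    {"name": "E_i j", "id": "v1", "data_annotations": [
        ["StatewideCOVID-19CasesDeathsTests.csv", "reported_cases"],
        ["StatewideCOVID-19CasesDeathsTests.csv", "cumulative_reported_cases"],
    ]},
]

cleaned = clean_data_annotations(sample)
for var in cleaned:
    print(var["id"], var["data_annotations"])
```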

## 3. Annotate dataset and ground to the DKG
<a id="cellm3"></a>

#### Let's read a CSV file, including its header row of column names:

In [68]:
with open("../../resources/dataset/us-counties.csv", "r") as f:
    col_str = f.read()
print(col_str)

date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-23,Snohomish,Washington,53061,1,0
2020-01-24,Cook,Illinois,17031,1,0
2020-01-24,Snohomish,Washington,53061,1,0
2020-01-25,Orange,California,06059,1,0
2020-01-25,Cook,Illinois,17031,1,0


In [69]:
with open("../../resources/dataset/us-counties_doc.txt", "r") as f:
    col_doc = f.read()
print(col_doc)

The data is the product of dozens of journalists working across several time zones to monitor news conferences, analyze data releases and seek clarification from public officials on how they categorize cases.

It is also a response to a fragmented American public health system in which overwhelmed public servants at the state, county and territorial level have sometimes struggled to report information accurately, consistently and speedily. On several occasions, officials have corrected information hours or days after first reporting it. At times, cases have disappeared from a local government database, or officials have moved a patient first identified in one state or county to another, often with no explanation. In those instances, which have become more common as the number of cases has grown, our team has made every effort to update the data to reflect the most current, accurate information while ensuring that every known case is counted.

When the information is available, we count

#### Now we can call our `annotation/link_dataset_col_to_dkg` endpoint to ground each of these column names to the MIRA DKG:
[http://100.26.10.46/#/Paper-2-annotated-vars/link_dataset_columns_to_dkg_info_annotation_link_dataset_col_to_dkg_post](http://100.26.10.46/#/Paper-2-annotated-vars/link_dataset_columns_to_dkg_info_annotation_link_dataset_col_to_dkg_post)

In [70]:
dct_cols_dkg = {"csv_str":col_str, "doc":col_doc, "gpt_key":GPT_KEY}
ground_res = requests.post(API_ROOT + "annotation/link_dataset_col_to_dkg/", params=dct_cols_dkg).text
ast.literal_eval(ground_res)

{'date': {'col_name': 'date',
  'concept': 'Date of Report',
  'unit': 'YYYY-MM-DD',
  'description': 'The date when the cases and deaths were reported.',
  'dkg_groundings': [['apollosv:00000429', 'date'],
   ['oboinowl:date', 'date'],
   ['dc:date', 'Date'],
   ['geonames:2130188', 'Hakodate'],
   ['oboinowl:hasDate', 'has_date'],
   ['idocovid19:0001277', 'COVID-19 incidence', 'class'],
   ['ido:0000480', 'infection incidence', 'class'],
   ['idocovid19:0001283', 'SARS-CoV-2 incidence', 'class'],
   ['hp:0001402', 'Hepatocellular carcinoma', 'class'],
   ['oae:0000178', 'AE incidence rate', 'class'],
   ['orphanet.ordo:409966', 'point prevalence', 'class'],
   ['obcs:0000064', 'period prevalence', 'class'],
   ['cemo:weighted_prevalence', 'weighted prevalence', 'class'],
   ['idocovid19:0001272', 'COVID-19 prevalence', 'class'],
   ['ido:0000486', 'infection prevalence', 'class']]},
 'county': {'col_name': 'county',
  'concept': 'County Name',
  'unit': 'Text',
  'description': 'The
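
The groundings for a single column can span several ontologies, so it may help to group them by identifier prefix before reviewing them. A small sketch, assuming every grounding is a list whose first two items are an identifier of the form `prefix:local_id` and a label; `group_by_prefix` is a hypothetical helper.

```python
from collections import defaultdict

# Subset of the 'date' column groundings shown above.
date_groundings = [
    ["apollosv:00000429", "date"],
    ["dc:date", "Date"],
    ["geonames:2130188", "Hakodate"],
    ["idocovid19:0001277", "COVID-19 incidence", "class"],
    ["hp:0001402", "Hepatocellular carcinoma", "class"],
]


def group_by_prefix(groundings):
    """Group grounding labels by the ontology prefix of their identifier."""
    by_prefix = defaultdict(list)
    for grounding in groundings:
        identifier, label = grounding[0], grounding[1]
        prefix = identifier.split(":", 1)[0]
        by_prefix[prefix].append(label)
    return dict(by_prefix)


grouped = group_by_prefix(date_groundings)
for prefix, labels in grouped.items():
    print(f"{prefix}: {labels}")
```

Grouping this way makes it easier to spot candidates that come from an unexpected source, such as a place name returned for a date column.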

## 4. Getting a Petri net (as a py-acset) from code

#### Let's now turn our attention to code. We have a Python function that describes the Bucky dynamics:

In [71]:
with open("../../resources/models/Bucky/bucky.py", "r") as f:
    code = f.read()
print(code)

def RHS_func(self, t, y_flat, Nij, contact_mats, Aij, par, npi, aij_sparse, y):
    # constraint on values
    lower, upper = (0.0, 1.0)  # bounds for state vars  ## TODO multiple_value_asignment

    # grab index of OOB values so we can zero derivatives (stability...)
    too_low = y_flat <= lower
    too_high = y_flat >= upper

    # TODO we're passing in y.state just to overwrite it, we probably need another class
    # reshape to the usual state tensor (compartment, age, node)
    y.state = y_flat.reshape(y.state_shape)

    # Clip state to be in bounds (except allocs b/c thats a counter)
    xp.clip(y.state, a_min=lower, a_max=upper, out=y.state)

    # init d(state)/dt
    dy = buckyState(y.consts, Nij)  # TODO make a pseudo copy operator w/ zeros

    # effective params after damping w/ allocated stuff
    t_index = min(int(t), npi["r0_reduct"].shape[0] - 1)  # prevent OOB error when integrator overshoots
    BETA_eff = npi["r0_reduct"][t_index] * par["BETA"]
    F_eff = par["F_

#### Using calls to the public MIT API, we can get Petri net components (places, transitions, hypothesized arcs) from this piece of code.

In [72]:
dict_petri = {"code": code, "gpt_key": GPT_KEY}
places = requests.post(API_ROOT + "petri/get_places", params=dict_petri).text
ast.literal_eval(places)

['S', ' E', ' I', ' Ia', ' Ic', ' Rh', ' R', ' D', ' incH', ' incC']

In [73]:
transitions = requests.post(API_ROOT + "petri/get_transitions", params=dict_petri).text
ast.literal_eval(transitions)

['BETA',
 ' F_eff',
 ' HOSP',
 ' THETA',
 ' GAMMA',
 ' GAMMA_H',
 ' SIGMA',
 ' SYM_FRAC',
 ' CASE_REPORT']

In [74]:
arcs = requests.post(API_ROOT + "petri/get_arcs", params=dict_petri).text
ast.literal_eval(arcs)

[['S', ' E'],
 ['E', ' Ia'],
 ['E', ' I'],
 ['Ia', ' I'],
 ['I', ' Ic'],
 ['Ic', ' Rh'],
 ['Rh', ' R'],
 ['R', ' D'],
 ['E', ' Rh'],
 ['E', ' R'],
 ['E', ' D'],
 ['Ia', ' D'],
 ['I', ' D'],
 ['Ic', ' D']]
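
Several of the returned names carry a leading space, so a quick normalization and consistency check can be useful before going further. This sketch works on the place and arc lists copied from the outputs above; the helper and variable names are illustrative only.

```python
places = ['S', ' E', ' I', ' Ia', ' Ic', ' Rh', ' R', ' D', ' incH', ' incC']
arcs = [['S', ' E'], ['E', ' Ia'], ['E', ' I'], ['Ia', ' I'], ['I', ' Ic'],
        ['Ic', ' Rh'], ['Rh', ' R'], ['R', ' D'], ['E', ' Rh'], ['E', ' R'],
        ['E', ' D'], ['Ia', ' D'], ['I', ' D'], ['Ic', ' D']]


def normalize(names):
    """Strip surrounding whitespace from each name."""
    return [name.strip() for name in names]


place_names = normalize(places)
clean_arcs = [tuple(normalize(arc)) for arc in arcs]

# Arcs that mention a name missing from the place list.
unknown_arcs = [arc for arc in clean_arcs if not set(arc) <= set(place_names)]

# Places that no arc touches.
used = {name for arc in clean_arcs for name in arc}
unused_places = [name for name in place_names if name not in used]

print("unknown arcs:", unknown_arcs)
print("unused places:", unused_places)
```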

#### We can then convert these outputs into a py-acset (thanks to Justin Lieffers from Arizona for some of the conversion code and to Owen Lynch for the py-acset code!)

In [75]:
dict_acset = {"places_str": places, "transitions_str": transitions, "arcs_str": arcs}
pyacset_str = requests.post(API_ROOT + "petri/get_pyacset", params=dict_acset).text

In [76]:
ast.literal_eval(pyacset_str)

{'S': [{'sname': 'S', 'uid': 1},
  {'sname': 'E', 'uid': 2},
  {'sname': 'I', 'uid': 3},
  {'sname': 'Ia', 'uid': 4},
  {'sname': 'Ic', 'uid': 5},
  {'sname': 'Rh', 'uid': 6},
  {'sname': 'R', 'uid': 7},
  {'sname': 'D', 'uid': 8},
  {'sname': 'incH', 'uid': 9},
  {'sname': 'incC', 'uid': 10}],
 'T': [{'tname': 'BETA', 'uid': 11},
  {'tname': 'F_eff', 'uid': 12},
  {'tname': 'HOSP', 'uid': 13},
  {'tname': 'THETA', 'uid': 14},
  {'tname': 'GAMMA', 'uid': 15},
  {'tname': 'GAMMA_H', 'uid': 16},
  {'tname': 'SIGMA', 'uid': 17},
  {'tname': 'SYM_FRAC', 'uid': 18},
  {'tname': 'CASE_REPORT', 'uid': 19}],
 'I': [{'it': 1, 'is': 1},
  {'it': 2, 'is': 2},
  {'it': 3, 'is': 2},
  {'it': 4, 'is': 4},
  {'it': 5, 'is': 3},
  {'it': 6, 'is': 5},
  {'it': 7, 'is': 6},
  {'it': 8, 'is': 7},
  {'it': 9, 'is': 2},
  {'it': 10, 'is': 2},
  {'it': 11, 'is': 2},
  {'it': 12, 'is': 4},
  {'it': 13, 'is': 3},
  {'it': 14, 'is': 5}],
 'O': [{'ot': 1, 'os': 2},
  {'ot': 2, 'os': 4},
  {'ot': 3, 'os': 3},
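
To sanity-check the conversion, the `I` and `O` tables can be joined back into named source and target pairs. The sketch below assumes that `is` and `os` are 1-based positions in the `S` table and that each transition index has a single input and a single output, which matches the rows visible above but is not confirmed by the py-acset documentation; `flows` is a hypothetical helper.

```python
# Trimmed copy of the output above: four states and the first three I/O rows.
acset = {
    "S": [{"sname": "S", "uid": 1}, {"sname": "E", "uid": 2},
          {"sname": "I", "uid": 3}, {"sname": "Ia", "uid": 4}],
    "I": [{"it": 1, "is": 1}, {"it": 2, "is": 2}, {"it": 3, "is": 2}],
    "O": [{"ot": 1, "os": 2}, {"ot": 2, "os": 4}, {"ot": 3, "os": 3}],
}


def flows(acset):
    """Pair each transition's input state with its output state, by name."""
    # Assumption: state references are 1-based positions in the S table.
    names = {pos: row["sname"] for pos, row in enumerate(acset["S"], start=1)}
    # Assumption: one input row per transition index.
    inputs = {row["it"]: row["is"] for row in acset["I"]}
    pairs = []
    for row in acset["O"]:
        source = inputs.get(row["ot"])
        if source is not None:
            pairs.append((names[source], names[row["os"]]))
    return pairs


recovered = flows(acset)
print(recovered)
```

On this trimmed sample the recovered pairs line up with the first three arcs returned by `petri/get_arcs`.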
  

## 5. MIT annotation end-to-end pipeline
<a id="cellm1"></a>

#### Finally, we bring every annotation step together: given the original paper text, the endpoint below runs the extraction modules above and outputs the MIT extraction:
[http://100.26.10.46/#/Paper-2-annotated-vars/upload_file_annotate_annotation_upload_file_extract__post](http://100.26.10.46/#/Paper-2-annotated-vars/upload_file_annotate_annotation_upload_file_extract__post)

In [77]:
with open('../../resources/models/Bucky/bucky.txt', 'rb') as f:
    files = {'file': ('filename', f)}
    params = {"gpt_key": GPT_KEY}
    response = requests.post(API_ROOT + "annotation/upload_file_extract/", params=params, files=files)
    json_str = response.text
ast.literal_eval(json_str)


[{'type': 'variable',
  'name': 'S_i j',
  'id': 'v0',
  'text_annotations': [' Proportion of individuals who are susceptible to the virus'],
  'dkg_annotations': [['geonames:2479536', 'Skikda'],
   ['geonames:487495', 'Sterlitamak']],
  'title': 'd096ade60e503c888b756724e3818835__filename',
  'data_annotations': []},
 {'type': 'variable',
  'name': 'E_i j',
  'id': 'v1',
  'text_annotations': [' Proportion of individuals who have been exposed to the virus'],
  'dkg_annotations': [['geonames:1816670', 'Beijing'],
   ['geonames:1799491', 'Neijiang']],
  'title': 'd096ade60e503c888b756724e3818835__filename',
  'data_annotations': [[[5,
     'usa-cases-hospitalized-by-age.csv',
     2,
     'new_confirmed_age_0'],
    'https://github.com/DARPA-ASKEM/program-milestones/blob/main/6-month-milestone/evaluation/scenario_3/ta_1/google-health-data/usa-cases-hospitalized-by-age.csv'],
   [[5, 'usa-cases-hospitalized-by-age.csv', 3, 'new_confirmed_age_1'],
    'https://github.com/DARPA-ASKEM/progr
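
In this output each `data_annotations` entry is nested: a four-item list followed by a URL. The sketch below pulls out only the file name, column name, and URL. It deliberately ignores the two integers, since their meaning is not stated in the output; `summarize_data_annotations` is a hypothetical helper, and the sample is the first entry shown above.

```python
# First data annotation of variable 'v1', copied from the output above.
data_annotations = [
    [[5, 'usa-cases-hospitalized-by-age.csv', 2, 'new_confirmed_age_0'],
     'https://github.com/DARPA-ASKEM/program-milestones/blob/main/'
     '6-month-milestone/evaluation/scenario_3/ta_1/google-health-data/'
     'usa-cases-hospitalized-by-age.csv'],
]


def summarize_data_annotations(annotations):
    """Flatten nested annotations into dicts with file, column, and url keys."""
    rows = []
    for item in annotations:
        # Shape assumed from the output: [[int, file, int, column], url].
        (_, filename, _, column), url = item
        rows.append({"file": filename, "column": column, "url": url})
    return rows


rows = summarize_data_annotations(data_annotations)
for row in rows:
    print(row["file"], "->", row["column"])
```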

## 6. Interacting with the University of Arizona extraction
<a id="cellm2"></a>

#### With both the University of Arizona and MIT extractions in hand, we first build an entity-matching mapping over all the extracted variable entities, and then integrate the two extractions under the unified TA1 data model.
[http://100.26.10.46/#/TA1-Integration/upload_files_integration_integration_get_mapping_post](http://100.26.10.46/#/TA1-Integration/upload_files_integration_integration_get_mapping_post)

In [78]:
with open('../../resources/xDD/mit-extraction/bucky__mit-extraction_id.json', 'rb') as f_mit, open('../../resources/xDD/arizona-extraction/bucky_arizona_output_example.json', 'rb') as f_arizona:
    files = { 'mit_file': ('bucky__mit-extraction_id.json', f_mit, 'application/json'),
        'arizona_file': ('bucky_arizona_output_example.json', f_arizona, 'application/json')}
    params = {"gpt_key": GPT_KEY}
    response = requests.post(API_ROOT + "integration/get_mapping", params=params, files=files)
    print(response.text)

{"attributes":[{"type":"anchored_extraction","amr_element_id":null,"payload":{"id":{"id":"R:190348269"},"names":[{"id":{"id":"T:-1709799622"},"name":"Bucky","extraction_source":{"page":0,"block":0,"char_start":738,"char_end":743,"document_reference":{"id":"buckymodel_webdocs.pdf"}},"provenance":{"method":"Skema TR Pipeline rules","timestamp":"2023-06-27T22:31:51.485020"}}],"descriptions":[{"id":{"id":"T:-486841659"},"source":"time","grounding":[{"grounding_text":"time since time scale zero","grounding_id":"apollosv:00000272","source":[],"score":0.8945620059967041,"provenance":{"method":"SKEMA-TR-Embedding","timestamp":"2023-06-27T22:31:51.486950"}}],"extraction_source":{"page":0,"block":0,"char_start":732,"char_end":736,"document_reference":{"id":"buckymodel_webdocs.pdf"}},"provenance":{"method":"Skema TR Pipeline rules","timestamp":"2023-06-27T22:31:51.485020"}}],"value_specs":[],"groundings":[],"data_columns":null}},{"type":"anchored_extraction","amr_element_id":null,"payload":{"id":
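
Unlike the earlier outputs, this response contains JSON `null` values, so `ast.literal_eval` would raise an error on it and `json.loads` is the appropriate parser. A minimal sketch on a heavily trimmed copy of the response above (most payload fields are omitted), counting attributes by type and listing the extracted names:

```python
import ast
import json
from collections import Counter

# Heavily trimmed copy of the response above; most payload fields are omitted.
sample = (
    '{"attributes":[{"type":"anchored_extraction","amr_element_id":null,'
    '"payload":{"names":[{"name":"Bucky"}],"value_specs":[],'
    '"groundings":[],"data_columns":null}}]}'
)

# ast.literal_eval does not understand the JSON literal `null`.
try:
    ast.literal_eval(sample)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

mapping = json.loads(sample)
type_counts = Counter(attr["type"] for attr in mapping["attributes"])
names = [
    name["name"]
    for attr in mapping["attributes"]
    for name in attr["payload"].get("names", [])
]

print("literal_eval succeeded:", literal_eval_ok)
print(dict(type_counts))
print(names)
```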