# MIT (TA1): From Paper and Code to Annotated Petri Nets

#### Mike Cafarella, Chunwei Liu, Markos Markakis, Peter Chen

## 0. Preprocessing

In [None]:
import ast, json, requests, os
from IPython import display

API_ROOT = "http://100.26.10.46/"
GPT_KEY = ""

#### Starting with the SIDARTHE [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7175834/pdf/41591_2020_Article_883.pdf) provided in Scenario 2, we can run COSMOS (thanks to Enrique Noriega from UArizona!) to extract a JSON file with entries like this:

In [None]:
with open("documents_sidarthe--COSMOS-data.json", "r") as f:
    text = f.read()
    print(ast.literal_eval(text)[0])

#### We can run a local script to consolidate the "content" fields to get just the text of the paper:

In [None]:
with open("sidarthe.txt", "r") as f:
    text = f.read()
    print(text.replace('\n', ' ')[:500])
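
#### The consolidation script itself is not shown here. A minimal sketch of what it might do, assuming the COSMOS output is a list of dictionaries that each carry a string `"content"` field (the `consolidate_content` helper and the sample entries, including the `page_num` key, are hypothetical):

```python
def consolidate_content(entries):
    """Join the non-empty "content" fields of COSMOS-style entries into one string."""
    parts = []
    for entry in entries:
        content = entry.get("content")
        # Skip entries without usable text (missing, non-string, or blank content).
        if isinstance(content, str) and content.strip():
            parts.append(content.strip())
    return "\n".join(parts)


# Hypothetical entries, made up for illustration; real COSMOS entries carry more fields.
sample_entries = [
    {"content": "SIDARTHE model. ", "page_num": 1},
    {"content": "", "page_num": 1},
    {"content": "We consider eight stages of infection.", "page_num": 2},
]
consolidated = consolidate_content(sample_entries)
print(consolidated)
```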

#### From the COSMOS output, we can also keep metadata such as the paper name and DOI for later:

In [None]:
with open("sidarthe_info.json", "r") as f:
    info = json.load(f)
    info_s = json.dumps(info)
    print(info_s)

## 1. Extracting variables and annotating them

#### Using our API (powered by GPT-3), we can extract variables from the paper alongside a list of possible definitions, and ground each of these variables to the MIRA DKG (thanks Harvard team!). If you're interested, the JSON format of our intermediate output can be found [here](https://github.com/mikecafarella/mitaskem/blob/main/JSONformat.md).

In [None]:
with open("sidarthe_short.txt", "r") as f:
    text = f.read()

# Send the paper text to the variable-extraction endpoint.
dct = {"text": text, "gpt_key": GPT_KEY}
r = requests.post(API_ROOT + "annotation/find_text_vars/", params=dct)
print(r)

In [None]:
json_str = r.text
ast.literal_eval(json_str)
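
#### `ast.literal_eval` only accepts Python literals, so it would fail on a response containing JSON-only tokens such as `true` or `null`. A hedged sketch of a more tolerant parser (the `parse_api_text` helper is hypothetical and not part of our API) tries `json.loads` first and falls back to `ast.literal_eval`:

```python
import ast
import json


def parse_api_text(text):
    """Parse a response body that may be strict JSON or a Python-literal string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back for Python-style output, e.g. single-quoted keys.
        return ast.literal_eval(text)


# Both spellings of the same payload parse to the same dictionary.
from_json = parse_api_text('{"name": "beta", "grounded": true}')
from_literal = parse_api_text("{'name': 'beta', 'grounded': True}")
print(from_json == from_literal)
```

#### If neither parser accepts the text, the exception from `ast.literal_eval` propagates, which makes a malformed response visible immediately.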

## 2. Extracting LaTeX from formula images

#### Here is a formula image from the SIDARTHE paper:

In [None]:
display.Image("../../resources/images/SIDARTHE/sidarthe_dAdt.png")

#### As we demoed last week, we can extract LaTeX from such formula images (powered by `pix2tex`), also through a public API.

In [None]:
directory = '../../resources/images/SIDARTHE'
latex_strs = []
 
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    latex_str = !python3 img_latex.py -p {f} # This is a local script that resizes the image and calls the public API.
    print(latex_str)
    latex_strs.append(latex_str[0])
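
#### `os.listdir` returns entries in arbitrary order and includes any non-image files in the directory. A small sketch, assuming the formula images are PNG or JPEG files (the `list_formula_images` helper, the extension list, and the demo file names other than `sidarthe_dAdt.png` are hypothetical), that yields a sorted, filtered list of paths to loop over instead:

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")  # assumption: formats used for formula images


def list_formula_images(directory):
    """Return the image file paths in `directory`, sorted by file name."""
    paths = []
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if os.path.isfile(path) and filename.lower().endswith(IMAGE_EXTENSIONS):
            paths.append(path)
    return paths


# Demo on a temporary directory so the sketch runs without the real image folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sidarthe_dSdt.png", "sidarthe_dAdt.png", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_names = [os.path.basename(p) for p in list_formula_images(tmp)]
print(demo_names)
```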

## 3. Linking variables from the LaTeX formulas to variables from the text

#### We just extracted all these equations in LaTeX, and they include variables. Let's link these variables to the variables we found in the text in part 1. Again, the (internal) JSON output format can be found [here](https://github.com/mikecafarella/mitaskem/blob/main/JSONformat.md).

In [None]:
# Thread the annotation JSON through one linking call per formula.
full_json_str = json_str
for latex_str in latex_strs:
    print(latex_str)
    dct2 = {"json_str": full_json_str, "formula": latex_str, "gpt_key": GPT_KEY}
    r2 = requests.post(API_ROOT + "annotation/link_latex_to_vars/", params=dct2)
    print(r2)
    full_json_str = r2.text

In [None]:
ast.literal_eval(full_json_str)

## 4. Getting a Petri net (as a py-acset) from code

#### Let's now turn our attention to code. We have a Python function that describes the SIDARTHE dynamics:

In [None]:
with open("../../resources/jan_evaluation/scenario_2_sidarthe/sidarthe_code.py", "r") as f:
    code = f.read()
print(code)

#### Using calls to the public MIT API, we can get Petri net components (places, transitions, hypothesized arcs) from this piece of code.

In [None]:
dict_petri = {"code": code, "gpt_key": GPT_KEY}
places = requests.post(API_ROOT + "petri/get_places", params=dict_petri).text
print(places)

In [None]:
transitions = requests.post(API_ROOT + "petri/get_transitions", params=dict_petri).text
print(transitions)

In [None]:
arcs = requests.post(API_ROOT + "petri/get_arcs", params=dict_petri).text
print(arcs)

#### We can then convert these outputs into a py-acset (thanks to Justin Lieffers from Arizona for some of the conversion code and to Owen Lynch for the py-acset code!)

In [None]:
dict_acset = {"places_str": places, "transitions_str": transitions, "arcs_str": arcs}

acset = requests.post(API_ROOT + "petri/get_pyacset", params=dict_acset).text

pyacset_s = acset
print(acset)

In [None]:
ast.literal_eval(pyacset_s)
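
#### To make the conversion step concrete, here is an illustrative sketch of how places, transitions, and arcs can be assembled into ACSet-style tables. The layout (tables `S`, `T`, `I`, `O` with 1-based references and `sname`/`tname` labels) is an assumption modelled on the labelled Petri net JSON used by AlgebraicPetri; the output of our `petri/get_pyacset` endpoint may differ, and `build_petri_tables` is a hypothetical helper, not the conversion code credited above:

```python
def build_petri_tables(places, transitions, arcs):
    """Assemble S/T/I/O tables from place names, transition names, and (source, target) arcs."""
    place_idx = {name: i + 1 for i, name in enumerate(places)}
    trans_idx = {name: i + 1 for i, name in enumerate(transitions)}
    inputs, outputs = [], []
    for src, dst in arcs:
        if src in place_idx and dst in trans_idx:
            # Place -> transition: an input arc.
            inputs.append({"is": place_idx[src], "it": trans_idx[dst]})
        elif src in trans_idx and dst in place_idx:
            # Transition -> place: an output arc.
            outputs.append({"ot": trans_idx[src], "os": place_idx[dst]})
        else:
            raise ValueError(f"arc {src!r} -> {dst!r} does not connect a place and a transition")
    return {
        "S": [{"sname": p} for p in places],
        "T": [{"tname": t} for t in transitions],
        "I": inputs,
        "O": outputs,
    }


# Toy infection step S + I -> 2I through a transition named "alpha".
toy_net = build_petri_tables(
    places=["S", "I"],
    transitions=["alpha"],
    arcs=[("S", "alpha"), ("I", "alpha"), ("alpha", "I"), ("alpha", "I")],
)
print(toy_net)
```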

## 5. Linking the annotations to the py-acset and paper info

#### Finally, we bring everything together: for every place and transition in the py-acset, let's map it to the annotations from earlier:

In [None]:
dct3 = {"pyacset_str":pyacset_s, "annotations_str":full_json_str, "info_str":info_s}           
r3 = requests.post(API_ROOT + "annotation/link_annos_to_pyacset/", params=dct3)
print(r3)

In [None]:
ast.literal_eval(r3.text)

#### Data in this format can be ingested, visualized and edited by TA4!

## 6. Interacting with the University of Arizona codepaths

#### The University of Arizona team can also produce an annotated py-acset as an output. We can integrate the two outputs by matching on the names of places and transitions, to get a more complete picture of the model. The metadata extracted by both teams can then be accessed by using the associated `uid` of each place/transition as a key into the metadata JSON file.
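
#### A rough sketch of the name-based integration described above. The record layout (`name` and `uid` keys), the metadata dictionaries, and both helper functions are hypothetical stand-ins for illustration; the actual formats produced by the two teams may differ:

```python
def merge_by_name(mit_items, arizona_items):
    """Pair up places (or transitions) from the two outputs that share a name."""
    arizona_by_name = {item["name"]: item for item in arizona_items}
    merged = []
    for item in mit_items:
        match = arizona_by_name.get(item["name"])
        merged.append({
            "name": item["name"],
            "mit_uid": item["uid"],
            "arizona_uid": match["uid"] if match is not None else None,
        })
    return merged


def collect_metadata(entry, mit_metadata, arizona_metadata):
    """Use each team's uid as a key into that team's metadata dictionary."""
    combined = {}
    combined.update(arizona_metadata.get(entry["arizona_uid"], {}))
    # In this sketch, MIT values win when both teams provide the same key.
    combined.update(mit_metadata.get(entry["mit_uid"], {}))
    return combined


# Made-up records for illustration only.
mit_places = [{"name": "S", "uid": "mit-0"}, {"name": "I", "uid": "mit-1"}]
arizona_places = [{"name": "I", "uid": "az-7"}]
mit_metadata = {"mit-0": {"definition": "susceptible"}, "mit-1": {"definition": "infected"}}
arizona_metadata = {"az-7": {"units": "persons"}}

merged_places = merge_by_name(mit_places, arizona_places)
infected_meta = collect_metadata(merged_places[1], mit_metadata, arizona_metadata)
print(merged_places)
print(infected_meta)
```

#### Places found by only one team keep a `None` uid for the other, so unmatched items stay visible rather than being silently dropped.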
