# Extracting (chemical compound, property, value) tuples from ChemRxiv

This notebook extracts tuples from scientific publications taken from ChemRxiv and processed with our pipeline. It assumes that the [Corpus](./Corpus.ipynb) notebook has already been run and that the processing results for each publication are saved as spaCy DocBins in the `~/.cprex/data` directory.

In [1]:
from pathlib import Path

DATA_DIR = Path.home() / ".cprex" / "data"

In [2]:
from cprex.pipeline import get_pipeline

nlp = get_pipeline(spacy_model="en_core_web_sm", enable_ner_pipelines=False)



The `get_tuples` function extracts the tuples present in the Docs parsed by the pipeline. With `triplets_only=True`, only full triplets (a chemical, a property and a value) are kept; with `triplets_only=False`, pairs of named entities (a chemical and a value with no linked property) are kept as well.
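
The filtering rule can be tried out in isolation, without running the pipeline. The following is a minimal sketch: `FakeTuple` and `keep` are hypothetical names introduced only for illustration, and the only assumption about the real tuple objects is that they expose `chemicals` and `properties` attributes that are `None` when nothing was linked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTuple:
    """Hypothetical stand-in for an extracted tuple (not the cprex class)."""

    chemicals: Optional[list] = None
    properties: Optional[list] = None


def keep(tuple_, triplets_only: bool = False) -> bool:
    # A chemical is always required; a property only when triplets_only is set.
    return tuple_.chemicals is not None and (
        not triplets_only or tuple_.properties is not None
    )


# Placeholder entities, chosen only to exercise the three cases.
triplet = FakeTuple(chemicals=["compound A"], properties=["property X"])
pair = FakeTuple(chemicals=["compound A"])
orphan = FakeTuple(properties=["property X"])

kept_all = [keep(t) for t in (triplet, pair, orphan)]
kept_triplets = [keep(t, triplets_only=True) for t in (triplet, pair, orphan)]
print(kept_all)
print(kept_triplets)
```

A tuple without a chemical is rejected in both modes, so the result for `triplets_only=False` is always a superset of the result for `triplets_only=True`.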

In [3]:
import json
from cprex.corpus.corpus import load_docs
from cprex.corpus.tuples import extract_tuple_relations

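def get_tuples(nlp, triplets_only: bool = False):
    """Collect extracted tuples from every DocBin in DATA_DIR as dicts."""
    res = []
    for doc_file in DATA_DIR.glob("*.spacy"):
        docs = load_docs(doc_file, nlp, set_doi=True)
        for doc in docs:
            for tuple_ in extract_tuple_relations(doc):
                # A chemical is always required; a property only if triplets_only.
                has_chemical = tuple_.chemicals is not None
                has_property = tuple_.properties is not None
                if has_chemical and (has_property or not triplets_only):
                    res.append(tuple_.to_dict())

    return res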

In [4]:
triplets = get_tuples(nlp, triplets_only=True)
tuples = get_tuples(nlp, triplets_only=False)

In [5]:
print(f"Triplets: {len(triplets)}, pairs: {len(tuples)}")

Triplets: 447, pairs: 1000


Once extraction is complete, we can save the triplets to a JSON file.

In [6]:
import json
from pathlib import Path

out_file = Path() / "triplets_chemrxiv.json"
with open(out_file, 'w') as f:
    json.dump(triplets, f, indent=2)
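
To reuse the saved results later, the file can be read back with `json.load`. The sketch below round-trips a small list through a temporary file. The record keys (`chemical`, `property`, `value`) and their values are placeholders invented for this example; the real keys are whatever `tuple_.to_dict()` produces.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; the real structure comes from `tuple_.to_dict()`.
sample = [
    {"chemical": "compound A", "property": "property X", "value": "1.0"},
    {"chemical": "compound B", "property": None, "value": "2.0"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "triplets_sample.json"
    # Write with an explicit encoding so the file reads back identically.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Keep only records that have a linked property (full triplets).
full = [record for record in loaded if record["property"] is not None]
print(len(loaded), len(full))
```

Passing `ensure_ascii=False` together with a UTF-8 encoding keeps non-ASCII characters, such as unit symbols, readable in the output file.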

### Displaying docs for a single paper

From the DOI of a paper, we can display the paragraphs which contain the tuples that were extracted.

In [7]:
from pathlib import Path
from cprex.corpus.corpus import load_docs
from cprex.corpus.tuples import extract_tuple_relations
from cprex.displacy.render import render_docs

def display_tuples_for_doc(nlp, doi: str, triplets_only: bool = False):
    """Render the docs of one paper that contain at least one extracted tuple."""
    # DocBin files are named after the DOI, with "/" replaced by "_".
    doi = doi.replace("/", "_") + ("" if doi.endswith(".spacy") else ".spacy")
    docs = load_docs(DATA_DIR / doi, nlp)
    display_docs = []
    for doc in docs:
        tuples = extract_tuple_relations(doc)
        # Keep the doc if any tuple passes the same filter as in get_tuples.
        if any(
            tuple_.chemicals is not None
            and (not triplets_only or tuple_.properties is not None)
            for tuple_ in tuples
        ):
            display_docs.append(doc)

    render_docs(display_docs)

In [8]:
display_tuples_for_doc(nlp, "10.26434/chemrxiv-2024-f58l0-v2", triplets_only=False)
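
The mapping from a DOI to a DocBin file name used in `display_tuples_for_doc` can be checked on its own. This is a small sketch of that one step; `doi_to_filename` is a hypothetical helper name and is not part of `cprex`.

```python
def doi_to_filename(doi: str) -> str:
    """Map a DOI to the DocBin file name expected in the data directory."""
    # Slashes are not valid in file names, so they are replaced by underscores.
    name = doi.replace("/", "_")
    # Accept input that already carries the ".spacy" suffix.
    return name if name.endswith(".spacy") else name + ".spacy"


filename = doi_to_filename("10.26434/chemrxiv-2024-f58l0-v2")
print(filename)
```

Checking that `DATA_DIR / filename` exists before calling `load_docs` is one way to catch a mistyped DOI early.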