# CPREx - Chemical Properties Relation Extraction

This notebook shows how to use CPREx to perform Named Entity Recognition (NER), extracting entities such as chemical compounds, properties and values from scientific articles. CPREx can also perform Relation Extraction (RE) to link these named entities to one another.

### Loading the CPREx pipeline

The first step in using CPREx is to load a spaCy pipeline with the `get_pipeline` function. This function requires a few models to perform NER and RE. We use a PubMedBERT model to identify chemical compound entities. This transformer model was trained on the NLM-CHEM corpus for the [BioCreative VII NLM-CHEM track](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-2/). We also use a custom Relation Extraction model, pretrained on our annotated data, to extract relations between chemical compounds, their properties and their values. You can download these models by running `cprex install-models`.

CPREx also needs a base spaCy model to perform standard NLP tasks such as tokenization, lemmatization, dependency parsing, etc. Make sure to [install a model](https://github.com/explosion/spacy-models) (*e.g.* `en_core_web_sm`) as well.
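Before loading the pipeline, it can help to confirm that the base model is actually installed. The following is a minimal sketch using only the standard library; the helper name `spacy_model_installed` is hypothetical and not part of CPREx or spaCy. It relies on the fact that spaCy models installed through pip are ordinary importable Python packages.

```python
import importlib.util


def spacy_model_installed(model_name: str) -> bool:
    """Return True if a package with this name can be imported.

    Hypothetical helper: pip-installed spaCy models are regular Python
    packages, so checking importability is enough for a quick sanity check.
    """
    return importlib.util.find_spec(model_name) is not None


model_name = "en_core_web_sm"
if not spacy_model_installed(model_name):
    # Standard spaCy command for downloading a model package.
    print(f"Model missing; install it with: python -m spacy download {model_name}")
```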

Finally, to perform text extraction from PDF articles, and also to extract values for the chemical properties, CPREx requires a running instance of [GROBID](https://github.com/kermitt2/grobid) with the [grobid-quantities](https://github.com/kermitt2/grobid-quantities) extension installed. You can install both by running `cprex install-grobid` and then start a server with `cprex start-grobid`.
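If you want to verify that GROBID is reachable before loading the pipeline, a quick health check can be sketched as below. This assumes the server listens on GROBID's default address `http://localhost:8070` and uses GROBID's `/api/isalive` endpoint; the address that `cprex start-grobid` actually uses may differ, and the helper name `grobid_is_alive` is hypothetical.

```python
import http.client
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 3.0) -> bool:
    """Return True if a GROBID server answers its health-check endpoint.

    The default base_url is an assumption (GROBID's default port);
    adjust it to match your local setup.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers malformed URLs.
        return False


print("GROBID reachable:", grobid_is_alive())
```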

In [1]:
from cprex.pipeline import get_pipeline

nlp = get_pipeline(spacy_model="en_core_web_sm")

 2024-06-13 15:58:42,024 - grobid_quantities.quantities - INFO - Grobid-quantities server is up and running


### Perform automatic data extraction on a PDF file

Once the pipeline is loaded, all we need to do is provide the path to a PDF file on the file system. CPREx converts the PDF to text, extracts the named entities, and then performs Relation Extraction on the candidate entities. The convenience function `parse_and_filter_pdf` performs all those steps and returns only the paragraphs that contain relevant information, *i.e.* those where chemical compound properties are expressed.

In [2]:
from pathlib import Path
pdf = Path.cwd().parent / "resources" / "chemrxiv.pdf"

In [3]:
from cprex.corpus.corpus import parse_and_filter_pdf
docs = parse_and_filter_pdf(pdf, nlp, segment_sentences=False)

The `parse_and_filter_pdf` function returns a list of spaCy Docs. Each paragraph is a `Doc` with its named entities and relations stored as attributes. We'll use a custom visualiser to display the Named Entities as well as the relations between them.

In [4]:
from cprex.displacy.render import render_docs

render_docs(docs)
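Besides the visual rendering, the named entities can also be inspected programmatically. The sketch below counts entities per label using spaCy's standard `Doc.ents` and `Span.label_` attributes; the helper name `count_entity_labels` is hypothetical. Relations are not covered here, since the attribute names under which CPREx stores them are not shown in this notebook.

```python
from collections import Counter
from typing import Iterable


def count_entity_labels(docs: Iterable) -> Counter:
    """Count named entities per label across a collection of spaCy Docs.

    Hypothetical helper: it only relies on the standard `doc.ents`
    and `ent.label_` attributes.
    """
    counts: Counter = Counter()
    for doc in docs:
        for ent in doc.ents:
            counts[ent.label_] += 1
    return counts
```

With the `docs` list returned by `parse_and_filter_pdf`, calling `count_entity_labels(docs)` gives a quick overview of how many entities of each type were found.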