# Corpus extraction

This notebook shows how to process scientific papers, extracting named entities and the relations between them, to build a corpus of chemical compound properties.

We start by setting up some logging to track the progress of our functions.

In [1]:
import logging

logger = logging.getLogger('cprex.corpus.corpus')
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(ch)
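
In a notebook, re-running the logging cell attaches another handler to the same logger, so every message is then printed more than once. A minimal sketch of a guard against that is shown here; `configure_logger` is a hypothetical helper name used for illustration and is not part of `cprex`.

```python
import logging


def configure_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Attach a single stream handler to the named logger, at most once."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Only add a handler if none is attached yet, so re-running the cell
    # does not duplicate every log line.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setLevel(level)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logger("cprex.corpus.corpus")
```

Calling the helper a second time with the same name returns the same logger and leaves its single handler in place.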

### 1 - Online Archive Crawling

The first step is to crawl an online archive of scientific papers (preferably chemistry papers) that we want to parse and process. For this notebook, we'll crawl [ChemRxiv](https://chemrxiv.org/). We first query the archive with a keyword of interest to retrieve the list of matching papers, then store each paper's metadata (title, DOI, PDF URL) in a JSON Lines file.

In [2]:
from pathlib import Path

papers_metadata = Path() / "fuel_papers.json"

In [3]:
from cprex.corpus.corpus import crawl_chemrxiv_papers

crawl_chemrxiv_papers(papers_metadata, "fuel")

2024-06-13 16:13:40,285 - INFO - Starting to crawl chemRxiv API.
3797it [02:32, 24.92it/s]
2024-06-13 16:16:12,684 - INFO - Crawl finished. Dumping results to fuel_papers.json
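
Before moving on, it can be useful to check what the crawl wrote. The sketch below assumes the output is a JSON Lines file (one JSON object per line), as described above. `read_jsonl` is a hypothetical helper, not a `cprex` function, and the field names are printed rather than assumed.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Same file name as `papers_metadata` in the notebook.
metadata_file = Path("fuel_papers.json")
if metadata_file.exists():
    records = read_jsonl(metadata_file)
    print(f"{len(records)} papers found")
    if records:
        # Show the field names of the first record instead of guessing them.
        print(sorted(records[0]))
```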


### 2 - Download and parse PDFs

Once we have our list of papers, we can start parsing them. To do so, we need a processing pipeline and a directory in which to save the PDF files we download.

In [4]:
from cprex.pipeline import get_pipeline

nlp = get_pipeline(spacy_model="en_core_web_sm")

2024-06-13 16:16:13,322 - grobid_quantities.quantities - INFO - Grobid-quantities server is up and running


In [5]:
from pathlib import Path

DATA_DIR = Path.home() / ".cprex" / "data"

In [6]:
from cprex.corpus.corpus import parse_papers

parsed_papers = parse_papers(papers_metadata, DATA_DIR, nlp, limit=5, save_parsed_docs=True)

2024-06-13 16:16:21,287 - INFO - Reading paper metadata ...
2024-06-13 16:16:21,307 - INFO - Processing papers (max 5) ...
100%|██████████| 5/5 [02:22<00:00, 28.47s/it]
2024-06-13 16:18:43,681 - INFO - Done processing. Writing output.
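
Not every paper necessarily yields parsed paragraphs, so a quick per-paper count can help before rendering anything. This is a minimal sketch that relies only on the `title` and `docs` attributes used later in this notebook. It assumes `docs` is either empty or a sized collection, and `summarise_papers` is a hypothetical helper, not part of `cprex`.

```python
from types import SimpleNamespace


def summarise_papers(papers):
    """Return (title, number of docs) per paper; empty or missing docs count as 0."""
    return [(paper.title, len(paper.docs) if paper.docs else 0) for paper in papers]


# Stand-in objects so the sketch runs on its own; in the notebook,
# pass `parsed_papers` instead.
demo = [
    SimpleNamespace(title="Paper A", docs=["paragraph 1", "paragraph 2"]),
    SimpleNamespace(title="Paper B", docs=[]),
]
for title, n_docs in summarise_papers(demo):
    print(f"{n_docs:3d} docs  {title}")
```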


### 3 - Visualise results

Once we've parsed some papers and extracted information, we can visualise the results with our custom `render_docs` function, which displays each paragraph's text with named entities and relations highlighted.

In [7]:
from cprex.displacy.render import render_docs

for paper in parsed_papers:
    if paper.docs:
        print(paper.title)
        print()
        render_docs(paper.docs)

Potential-dependent polaron formation activates TiO2 for the hydrogen evolution reaction



Solution-Phase Synthesis of Alloyed Ba(Zr1-xTix)S3 Perovskite and Non-Perovskite Nanomaterials  



Phosphorylated Sporopollenin as a Sustainable Catalyst for Selective 5-Hydroxymethylfurfural Formation in Water: Insights into Phosphate Functionalization, Kinetics, and Mechanism

