# Tutorial: Large Language Models for scientific data extraction

Structured data is at the heart of machine learning. LLMs offer a convenient way to generate structured data based on unstructured inputs. This tutorial gives hands-on examples of the different steps in the extraction workflow using LLMs.

More extensive and detailed examples can be found at [Matextract.pub](https://matextract.pub).


**Tutorial Instructions**

This notebook is designed as a hands-on exercise.  
Selected parts of the code are intentionally left incomplete and marked with `___` to encourage active participation.  
You do not need to download any external files or documents; all example data is already included and ready to use.


## Obtaining data

At the start of the data extraction process, you have to collect a set of potentially relevant data sources. You can assemble this dataset manually or use a tool to automate and speed up the process, such as the Crossref API, which provides bibliographic metadata for published works.

Besides calling the API directly, there are multiple Python libraries that make access to it easier. One of these libraries is crossrefapi. As an example, 10 sources on the topic ‘copolymerization’ are retrieved together with their metadata and saved to a JSON file.

In [2]:
from crossref.restful import Works
import json

works = Works()

# Performing the search for sources on the topic of copolymerizations 
query_result = (
    works.query(bibliographic="___") # TODO define the search topic 
    .select("DOI", "title", "author", "type", "publisher", "issued")
    .sample(___) # TODO define the number of papers 
)

results = [item for item in query_result]

# Save results including their metadata in a json file
with open("copolymerization_results.json", "w") as file:
    json.dump(results, file)

# TODO print the results

[{'DOI': '10.1021/jacs.4c05094.s001', 'issued': {'date-parts': [[None]]}, 'publisher': 'American Chemical Society (ACS)', 'title': ['Radical Stitching Polymerization and Its Alternating Copolymerization'], 'type': 'component'}, {'DOI': '10.1016/0032-3950(62)90580-4', 'issued': {'date-parts': [[1962, 1]]}, 'publisher': 'Elsevier BV', 'title': ['Studies in cyclic polymerization and copolymerization—V. Cyclic copolymerization of divinylacetals with vinyl acetate'], 'type': 'journal-article'}, {'DOI': '10.1021/acs.biomac.7b00229.s002', 'issued': {'date-parts': [[None]]}, 'publisher': 'American Chemical Society (ACS)', 'title': ['Enzymatically Debranched Xylans in Graft Copolymerization'], 'type': 'component'}, {'DOI': '10.1021/acsmacrolett.7b00904.s002', 'issued': {'date-parts': [[None]]}, 'publisher': 'American Chemical Society (ACS)', 'title': ['Nickel-Catalyzed Propylene/Polar Monomer Copolymerization'], 'type': 'component'}, {'DOI': '10.1021/acs.macromol.8b00696.s001', 'issued': {'date
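The output above mixes journal articles with records of type `component`, which for these DOIs appear to be supporting-information files and carry no publication date. The following is a minimal sketch of how such records could be filtered before downloading anything. The two records are abbreviated copies from the output above, and the helper names `keep_journal_articles` and `publication_year` are hypothetical, not part of crossrefapi.

```python
# Two records abbreviated from the output above; in the notebook, `results` holds the full list.
records = [
    {
        "DOI": "10.1021/jacs.4c05094.s001",
        "issued": {"date-parts": [[None]]},
        "type": "component",
    },
    {
        "DOI": "10.1016/0032-3950(62)90580-4",
        "issued": {"date-parts": [[1962, 1]]},
        "type": "journal-article",
    },
]


def keep_journal_articles(items):
    """Return only the records whose Crossref type is 'journal-article'."""
    return [item for item in items if item.get("type") == "journal-article"]


def publication_year(item):
    """Return the year from 'issued' -> 'date-parts', or None if it is missing."""
    date_parts = item.get("issued", {}).get("date-parts", [[None]])
    return date_parts[0][0] if date_parts and date_parts[0] else None


articles = keep_journal_articles(records)
for article in articles:
    print(article["DOI"], publication_year(article))
```

Whether to drop `component` records depends on the task; if supporting information is relevant for your extraction, you may want to keep them.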

The next step is to download the relevant papers. There are multiple datasets and tools which can be used for data mining. While downloading papers, please always be aware of copyright.

Since this step can take some time, example papers that have already been downloaded can be found in the `pdfs` folder.

In [3]:
import os

folder_path = "pdfs"
files = os.listdir(folder_path)
pdf_files = [f for f in files if f.lower().endswith('.pdf')]

print(pdf_files)

['10.26434_chemrxiv-2024-1l0sn.pdf', '10.26434_chemrxiv-2024-rfzjm.pdf', '10.26434_chemrxiv-2021-2x06r-v3.pdf', '10.26434_chemrxiv-2024-tddfc-v2.pdf', '10.26434_chemrxiv.14217314.v1.pdf']
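The file names listed above look like DOIs in which the `/` has been replaced by `_`. Assuming this naming convention holds (it is an inference from the names, not something the notebook states), a small hypothetical helper can map each file back to its DOI, which is handy for joining extracted data with the Crossref metadata.

```python
import os


def filename_to_doi(filename):
    """Turn '10.26434_chemrxiv-2024-1l0sn.pdf' into '10.26434/chemrxiv-2024-1l0sn'.

    Assumes the first underscore stands in for the '/' between DOI prefix and suffix.
    """
    stem, _extension = os.path.splitext(filename)
    return stem.replace("_", "/", 1)


example_files = [
    "10.26434_chemrxiv-2024-1l0sn.pdf",
    "10.26434_chemrxiv.14217314.v1.pdf",
]
for name in example_files:
    print(name, "->", filename_to_doi(name))
```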


## Annotating a test and validation dataset

To evaluate the data extraction and find the best hyperparameters, you need a test and a validation set. Annotating at least a small part of the collected articles is therefore crucial. A reasonable number of annotated test and validation papers is between 10 and 20. The more diverse and representative of the whole paper corpus the test and validation sets are, the better.

In [None]:
# I will add some schema here
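Once papers are annotated, they need to be divided into a validation set (for tuning hyperparameters) and a test set (for the final evaluation). The following is a minimal sketch of a reproducible split. The paper identifiers, the 50/50 ratio, and the function name `split_annotated_papers` are placeholders chosen for illustration.

```python
import random


def split_annotated_papers(paper_ids, validation_fraction=0.5, seed=42):
    """Shuffle the paper IDs with a fixed seed and split them into validation and test sets."""
    ids = sorted(paper_ids)  # sort first so the result does not depend on the input order
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_validation = round(len(ids) * validation_fraction)
    return ids[:n_validation], ids[n_validation:]


# Hypothetical identifiers for 12 annotated papers
paper_ids = [f"paper_{i:02d}" for i in range(1, 13)]
validation_ids, test_ids = split_annotated_papers(paper_ids)
print(len(validation_ids), len(test_ids))
```

Fixing the seed keeps the split stable across runs, so results obtained with different hyperparameters stay comparable.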

## Converting the PDF documents to text

An important part of data extraction pipelines is converting the inputs into a form that text-based pipelines can use.

In many cases, image inputs (e.g., scans of a paper) must be converted into text with an OCR (Optical Character Recognition) tool, and a variety of such tools is available. PDFs that already contain a text layer can be read directly. We will use the PyMuPDF Python package here since it is fast and easy to use; note that its `get_text()` method reads the embedded text layer rather than running OCR. There are also dedicated tools for the conversion of scientific publications, such as [NOUGAT](https://facebookresearch.github.io/nougat/) (Warning: NOUGAT is resource intensive and should be run on a GPU).

In [None]:
import os
import fitz  # PyMuPDF

pdf_folder = ___ # TODO: Specify the folder path where the PDF files are located

output_folder = ___ # TODO: Specify the output folder path for the text files

# Make sure the output folder exists
os.makedirs(output_folder, exist_ok=True)

# Loop through all PDF files in the folder
for filename in os.listdir(pdf_folder):
    if filename.lower().endswith(".pdf"):
        file_path = os.path.join(pdf_folder, filename)
        
        # Open the PDF file
        doc = fitz.open(file_path)
        
        # Extract text from all pages
        full_text = ""
        for page in doc:
            full_text += ___  # TODO: use page.get_text() to extract the text of the page
        
        # TODO print the extracted text
        doc.close()
        
        # TODO: Write the extracted text into a .txt file with the same name
        txt_filename = os.path.splitext(filename)[0] + ".txt"
        txt_path = os.path.join(output_folder, txt_filename)
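Pages that consist only of a scanned image carry no text layer, so direct text extraction typically yields little or no text for them. The following heuristic sketch flags such documents so they can be routed to an OCR tool instead. It works on a plain list of per-page strings (for example, the values returned by `page.get_text()`); the function name `looks_scanned` and the threshold of 20 characters are assumptions made for illustration.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Return True if more than half of the pages contain (almost) no extractable text."""
    if not page_texts:
        return True
    empty_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) > 0.5


pages_with_text = [
    "Introduction\nCopolymers combine two or more monomers.",
    "Results and discussion of the experiments.",
]
pages_scanned = ["", " \n", ""]

print(looks_scanned(pages_with_text))
print(looks_scanned(pages_scanned))
```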


## Cleaning the document

Since most of the downloaded files include irrelevant sections such as the bibliography and acknowledgments, the next step is to remove them.

The simplest way to remove irrelevant sections is to use hard-coded rules. We will remove extraneous line breaks and irrelevant sections using regular expressions ([regex](https://docs.python.org/3/library/re.html)). They are a powerful tool for pattern matching, allowing for complex searches, substitutions, and data extraction based on specific string patterns.
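As a small illustration of rule-based cleaning with `re.sub`, the sketch below repairs two common artifacts of PDF text extraction: words hyphenated across a line break and hard line breaks inside a paragraph. The sample text is made up, and the two patterns are one possible choice rather than the rules used later in this notebook.

```python
import re

raw = (
    "The copoly-\nmerization was carried out\nat room temperature.\n\n"
    "The product was dried\nunder vacuum."
)

# Join words that were hyphenated across a line break: "copoly-\nmerization" -> "copolymerization"
dehyphenated = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

# Replace single line breaks with a space, but keep blank lines (paragraph breaks) intact
unwrapped = re.sub(r"(?<!\n)\n(?!\n)", " ", dehyphenated)

print(unwrapped)
```

Note that the first rule also joins genuine compounds that happen to break at their hyphen (for example `cross-` followed by `linked`), so it is worth spot-checking the output on your own corpus.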

In [None]:
import re

def clean_document(text):
    """
    Cleans a single document's text.
    - Collapses double line breaks into single ones
    - Keeps only the content between 'Introduction' and 'Acknowledgements'
    - Removes the heading 'Acknowledgements'
    """
    
    # Collapse double newlines into single ones
    text = text.replace("\n\n", "\n")

    # Extract the part between Introduction and Acknowledgements
    pattern = re.compile(r"Introduction.*?Acknowledgements", re.DOTALL)
    matches = re.findall(___, ___) # TODO: define the used regex pattern and the text to be cleaned 

    if matches:
        filtered = matches[0].replace("Acknowledgements", "")
    else:
        # If pattern not found, keep original text
        filtered = ___  # TODO: define the filtered text when no pattern is found
    
    return filtered
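The pattern above only matches the British spelling "Acknowledgements", so papers that use "Acknowledgments" would fall through to the fallback branch. The following is a hedged sketch of a more tolerant variant; the names `SECTION_PATTERN` and `extract_main_text` are hypothetical, and it is shown separately so the exercise above stays unchanged.

```python
import re

# Accepts "Acknowledgment(s)" as well as "Acknowledgement(s)"; the capture group
# holds everything between the two headings.
SECTION_PATTERN = re.compile(r"Introduction(.*?)Acknowledge?ments?", re.DOTALL)


def extract_main_text(text):
    """Return the text between 'Introduction' and the acknowledgments heading.

    Falls back to the unchanged input when the headings are not found.
    """
    match = SECTION_PATTERN.search(text)
    return match.group(1).strip() if match else text


sample = (
    "Title\nAbstract ...\nIntroduction\nPolymers are useful.\n"
    "Acknowledgments\nWe thank our colleagues.\nReferences\n[1] ..."
)
print(extract_main_text(sample))
```

The match is still case-sensitive, so headings set in capitals (`INTRODUCTION`) would need `re.IGNORECASE`, at the cost of possibly matching the word "introduction" inside running text.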

In [None]:
# TODO: Replace this with the actual folder where the .txt files are saved
text_folder = ___  

# TODO: Replace this with the desired output folder for the cleaned text
cleaned_folder = ___  
os.makedirs(cleaned_folder, exist_ok=True)

# Loop through all text files and clean them
for filename in os.listdir(text_folder):
    if filename.endswith(".txt"):
        file_path = os.path.join(text_folder, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        
        # Clean the content
        cleaned = ___  # TODO: add the correct call of the cleaning function

        # TODO: Print the cleaned content

        
        # Save the cleaned version
        cleaned_path = os.path.join(cleaned_folder, filename)
        with open(cleaned_path, "w", encoding="utf-8") as f:
            f.write(cleaned)

## Chunking the text

LLMs have a limited context window, which is the maximum number of tokens ([short explanation for tokens](https://www.geeksforgeeks.org/nlp/tokenization-in-natural-language-processing-nlp/)) they can process at a given time.  
This becomes a problem when we want to process long texts that exceed this limit.

To handle this, we break the input into **smaller overlapping segments**, called **chunks**, that fit within the model's context window.
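To make the idea of overlapping chunks concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not the splitter used below, and it measures sizes in characters rather than tokens; the function name `sliding_window_chunks` is chosen for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into pieces of chunk_size characters; consecutive pieces share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means that a sentence cut at a chunk boundary still appears in full in at least one of the neighboring chunks, as long as it is shorter than the overlap.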

We use the `RecursiveCharacterTextSplitter` class from LangChain.  
This splitter breaks the text at **logical boundaries**: by default it tries paragraph breaks first, then line breaks, then spaces, and only then individual characters, so that semantic structure is preserved rather than cut arbitrarily.
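The recursive idea can be sketched in a few lines of plain Python: split at the coarsest separator, and only for pieces that are still too long fall back to the next finer one. This is a simplified illustration and not LangChain's implementation; in particular, it neither merges small pieces back together nor adds overlap, and the function name `recursive_split` is hypothetical.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text at the first separator and recurse with finer separators on pieces that are too long."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every max_chars characters
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, max_chars, rest))
    return pieces


text = "First paragraph.\n\nSecond paragraph is a bit longer."
pieces = recursive_split(text, max_chars=20)
print(pieces)
```

The short first paragraph stays intact, while the longer second one is broken down to single words. This is why a practical splitter also needs a merging step that recombines small pieces up to the chunk size.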


In [None]:
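# Note: newer LangChain releases ship the splitter in the separate
# `langchain_text_splitters` package; adjust the import if this one fails.
from langchain.text_splitter import RecursiveCharacterTextSplitter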
import os

# Function to chunk a given text into overlapping segments
def chunk_text(text):
    """
    Splits a given text into overlapping chunks using RecursiveCharacterTextSplitter.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=___,       # TODO: Define chunk size (measured in characters by default)
        chunk_overlap=___     # TODO: Define overlap size (also in characters)
    )
    
    chunks = text_splitter.split_text(text)
    return chunks

In [None]:
# TODO: Specify folder containing the cleaned text files
cleaned_folder = ___  

for filename in os.listdir(cleaned_folder):
    if filename.endswith(".txt"):
        with open(os.path.join(cleaned_folder, filename), "r", encoding="utf-8") as f:
            content = f.read()

        chunks = ___  # TODO: call the chunking function with the content

        print(f"{len(chunks)} chunks created for {filename}.")
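The loop above only reports the number of chunks and overwrites `chunks` on every iteration. If the chunks should be reused later, one option is to store them together with their source file. The following is a minimal sketch using the JSON Lines format; the record fields (`source`, `chunk_index`, `text`), the file name, and both helper functions are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_chunks_jsonl(chunks, source_name, output_path):
    """Write one JSON object per line, keeping track of where each chunk came from."""
    with open(output_path, "w", encoding="utf-8") as f:
        for index, chunk in enumerate(chunks):
            record = {"source": source_name, "chunk_index": index, "text": chunk}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_chunks_jsonl(path):
    """Read the records back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_chunks.jsonl")
    save_chunks_jsonl(["first chunk", "second chunk"], "example.txt", path)
    loaded = load_chunks_jsonl(path)

print(loaded[1])
```

Keeping the source name and chunk index with every chunk makes it possible to trace an extracted value back to the paper and passage it came from.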

## ✅ Summary: What We've Learned So Far

In this first part of the tutorial, you have gone through the essential preprocessing steps for scientific document extraction using LLMs:

1. **Data collection**  
   You used the Crossref API to collect metadata about relevant scientific articles and stored them in a structured JSON format.

2. **PDF handling**  
   You explored how to programmatically list and process PDF files using PyMuPDF (`fitz`) to extract raw text content.

3. **Text cleaning**  
   You implemented rule-based cleaning techniques using regular expressions to remove irrelevant sections such as bibliographies or acknowledgements.

4. **Text chunking**  
   You split cleaned texts into overlapping chunks to prepare them for LLM processing, taking context window limitations into account.

These steps are foundational for building reliable data extraction workflows and ensure that the inputs are clean, structured, and manageable in size for downstream tasks like embedding, classification, or entity extraction.

In the next section, you will use this structured and chunked data to **perform actual information extraction using LLMs**.
