# Scientific Data for Large Language Models

-------------------------------------------------------------
Notebook created by Marcelo Amaral.
Assisted by GPT, a language model developed by OpenAI.
-------------------------------------------------------------

This Jupyter notebook is part of a process that is designed to convert scientific articles in PDF format into a structured data format (JSON), which can be used to train large language models. The process involves two main steps:

    PDF to XML Conversion: This step is performed outside of this notebook, using a tool called GROBID. GROBID (or GeneRation Of BIbliographic Data) is a machine learning library that converts PDFs into structured TEI-encoded documents with a particular focus on technical and scientific articles.

    XML to JSON Conversion: This is the step carried out in this notebook. We use Python's xml.etree.ElementTree to parse the XML files and extract relevant information such as the title, date, authors, abstract, and the content of the paper in LaTeX format.

The output of this process is a collection of JSON files, each corresponding to a scientific paper. Each JSON file contains the paper's title, date, authors, abstract, and the main content formatted in LaTeX. The main content includes the sections of the paper and any equations written in LaTeX.

This structured data format is much more convenient for training large language models, as it allows for easy access to the different components of a scientific paper. Furthermore, the use of LaTeX for formatting the main content preserves the structure and presentation of mathematical equations and scientific expressions.

Note: This process assumes that the directory structure used for storing PDFs is maintained when storing the corresponding XML and JSON files. This makes it easy to trace back the processed data to the original PDF if required.

Please be aware that the quality of the output data depends heavily on the quality of the input PDFs and the effectiveness of the GROBID tool in converting these PDFs into structured XML files. Some documents may not be processed correctly due to inconsistencies in the XML structure, errors in the original PDFs, or limitations of the GROBID tool.

## Import necessary libraries

from langchain.document_loaders import TextLoader: This line imports the TextLoader class from the langchain.document_loaders module. TextLoader is a LangChain document loader that reads a text file from a given file path and returns its content as a list of Document objects, each exposing the text through its page_content attribute. In this notebook, it is used to load the XML files for processing.

import os: This line imports the built-in Python os module. This module provides a portable way of using operating system dependent functionality such as reading or writing to the file system, starting or killing processes, and more. In this notebook, it is mainly used for file and directory manipulation.

import re: This line imports the built-in Python re module, which provides regular expressions. Regular expressions are a compact pattern language for matching and manipulating strings, available in some form in almost all modern programming languages. In this notebook, the slugStrip function uses re.sub to remove special characters from titles.

import json: This line imports the built-in Python json module. This module provides an easy way to encode and decode data in JSON. The JSON format is a popular data interchange format and is used in this notebook to save the extracted information from the scientific papers.

The xml.etree.ElementTree module in Python provides a lightweight and efficient API for parsing and creating XML data. It is part of Python's standard library and doesn't require any additional installation.

In [1]:
from langchain.document_loaders import TextLoader
import os
import re
import json
import xml.etree.ElementTree as ET

## Functions

The function recursive_text_extraction(element) is used to extract all the text nested within an XML element, including text within child elements.

Here is a brief explanation of the function:

    The function takes as input an XML element.

    It initializes text with the text of the current XML element, or an empty string if the element has no text.

    It then iterates over each child of the current XML element.

    For each child, it recursively calls recursive_text_extraction(child) to extract the text within the child, including any text within nested child elements. This text is then appended to text.

    If the child element has a tail (text that comes after the child element but is still inside the parent element), this is also appended to text.

    Finally, the function returns the complete text that has been extracted.

In this way, the function is able to extract all the text within an XML element, no matter how deeply nested the text might be within child elements.

In [2]:
def recursive_text_extraction(element):
    text = element.text or ""
    for child in element:
        text += recursive_text_extraction(child)
        if child.tail:
            text += child.tail
    return text
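
As a small illustration of the difference between an element's own text and the text gathered recursively, the sketch below runs the same function on a hypothetical paragraph (the XML snippet is invented for this example and is not taken from a real GROBID file). The standard library's itertext() method should give the same result, which can serve as a cross-check.

```python
import xml.etree.ElementTree as ET

def recursive_text_extraction(element):
    # Same logic as the notebook function, repeated so the example runs on its own
    text = element.text or ""
    for child in element:
        text += recursive_text_extraction(child)
        if child.tail:
            text += child.tail
    return text

# Hypothetical paragraph with an inline reference and nested emphasis
paragraph = ET.fromstring(
    '<p>Spin foams <ref>[1]</ref> describe <hi>quantum <b>geometry</b></hi> well.</p>'
)

# element.text only covers the text before the first child element
print(repr(paragraph.text))

# The recursive version also collects child text and tails
full_text = recursive_text_extraction(paragraph)
print(full_text)

# Cross-check against the standard library iterator
print(''.join(paragraph.itertext()) == full_text)
```

Using paragraph.text alone would silently drop everything after the first inline element, which is why the paragraphs in the body are read with the recursive function.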

The extract_information_from_tei function is designed to parse an XML string that is structured according to the Text Encoding Initiative (TEI) guidelines. The TEI is a standard for the representation of texts in digital form, widely used in the field of digital humanities, libraries, and linguistics. This function extracts key information from the XML file such as the title, authors, abstract, keywords, and content of the document.

Here's a brief overview of what the function does:

    Parse XML: The function starts by using the ET.fromstring() function from the xml.etree.ElementTree module to parse the XML string. This returns the root Element of the document, from which the rest of the tree can be searched.

    Extract Metadata: The function then extracts the title, authors, date, abstract, and keywords from the XML tree. This is done using the find and findall methods to locate the relevant elements in the XML tree.

    Extract Content: The function also extracts the body content of the document, including the text of each section and any LaTeX equations in each section. This is done by iterating over the div elements in the body of the document, which represent individual sections.

    Convert to LaTeX: The extracted text is then formatted into LaTeX syntax. This is done by wrapping section titles in \section{...}, wrapping equations in \begin{equation} ... \end{equation}, and simply appending the text of each paragraph.

    Return as Dictionary: Finally, the function returns a dictionary with the extracted information. The dictionary keys are 'title', 'authors', 'date', 'abstract', 'keywords', and 'latex_doc'. The 'latex_doc' key contains the full LaTeX-formatted text of the document, while the other keys contain the respective pieces of metadata.

In short, this function serves as a TEI-XML to LaTeX converter, specifically tailored for scientific documents. It extracts both the metadata and content from the XML file and formats them in a way that's ready for further LaTeX processing or display.
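
The lookups described above depend on the TEI namespace being passed to find and findall. The following is a minimal sketch on a hypothetical, heavily trimmed TEI string (real GROBID output contains many more elements); it only illustrates why the ns dictionary is needed and is not a replacement for the full function.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed TEI document used only for illustration
tei_string = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An Example Paper</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
    </body>
  </text>
</TEI>"""

ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
root = ET.fromstring(tei_string)

# With the namespace prefix, the title is found
title = root.find('.//tei:titleStmt/tei:title', ns)
print(title.text)

# Without the prefix, the same path matches nothing and find returns None
unqualified = root.find('.//titleStmt/title')
print(unqualified)

# Every tag carries the namespace in braces, which is why the body loop
# compares child.tag against '{http://www.tei-c.org/ns/1.0}p'
print(root.tag)

heads = [head.text for head in root.findall('.//tei:body/tei:div/tei:head', ns)]
print(heads)
```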

In [3]:
def extract_information_from_tei(tei_string):
    # Parse the XML string
    root = ET.fromstring(tei_string)

    # XML namespaces
    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

    # LaTeX document
    latex_doc = []
    
    # Details dictionary
    details = {}

    # Extract the title - only proceed if we have a title
    title = root.find('.//tei:titleStmt/tei:title', ns)
    if title is None or title.text is None:
        raise ValueError("Missing title in document")

    # Truncate the stored title to 1000 characters to match the Milvus schema:
    # FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=1000)
    details['title'] = title.text[:1000]
    latex_doc.append('\\title{' + title.text + '}')

    # Extract the date
    date = root.find('.//tei:publicationStmt/tei:date', ns)
    if date is not None and date.text is not None:
        details['date'] = date.text
        latex_doc.append('\\date{' + date.text + '}')
        
    # Extract the authors from the main document only
    authors = []
    for author in root.findall('.//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:analytic/tei:author', ns):
        forename = author.find('tei:persName/tei:forename', ns)
        surname = author.find('tei:persName/tei:surname', ns)
        if forename is not None and surname is not None:
            authors.append(forename.text + ' ' + surname.text)
            latex_doc.append('\\author{' + forename.text + ' ' + surname.text + '}')
    details['authors'] = authors

    # Extract the abstract
    abstract = root.find('.//tei:abstract', ns)
    if abstract is not None:
        # recursive_text_extraction keeps text inside nested elements and
        # returns "" instead of None when a paragraph has no leading text
        paragraphs = [recursive_text_extraction(p) for p in abstract.findall('.//tei:p', ns)]
        details['abstract'] = ' '.join(paragraphs)
        latex_doc.append('\\begin{abstract}')
        latex_doc.extend(paragraphs)
        latex_doc.append('\\end{abstract}')

    # Extract the keywords (skipping empty <term> elements, whose text is None)
    keywords = root.find('.//tei:keywords', ns)
    if keywords is not None:
        terms = [term.text for term in keywords.findall('tei:term', ns) if term.text]
        details['keywords'] = terms
        latex_doc.append('\\keywords{' + ', '.join(terms) + '}')

    # Process each section in the body of the document
    for body in root.findall('.//tei:body', ns):
        for div in body.findall('.//tei:div', ns):
            # Add section
            section_title = div.find('tei:head', ns)
            if section_title is not None and section_title.text:
                latex_doc.append('\\section{' + section_title.text + '}')
            
            # Process each child element in the div
            for child in div:
                # If the child is a paragraph
                if child.tag == '{http://www.tei-c.org/ns/1.0}p':
                    paragraph_text = recursive_text_extraction(child)
                    latex_doc.append(paragraph_text)

                # If the child is a formula
                elif child.tag == '{http://www.tei-c.org/ns/1.0}formula':
                    # Add equation
                    latex_doc.append('\\begin{equation}')
                    latex_doc.append((child.text or '').strip())  # strip whitespace; tolerate a formula with no leading text
                    latex_doc.append('\\end{equation}')

                    
    # Begin the bibliography
    latex_doc.append('\\begin{thebibliography}{99}')

    # Extract the references
    for biblStruct in root.findall('.//tei:div[@type="references"]/tei:listBibl/tei:biblStruct', ns):
        # Extract the authors
        authors = [forename.text + ' ' + surname.text for forename, surname in zip(biblStruct.findall('.//tei:author/tei:persName/tei:forename', ns), biblStruct.findall('.//tei:author/tei:persName/tei:surname', ns))]

        # Extract the title
        title = biblStruct.find('.//tei:title', ns)

        # Extract the year
        year = biblStruct.find('.//tei:date', ns)

        # Extract the publisher (for books) or journal title (for articles)
        publisher = biblStruct.find('.//tei:publisher', ns)
        journal = biblStruct.find('.//tei:title[@level="j"]', ns)

        # Extract the volume and page numbers (for articles)
        volume = biblStruct.find('.//tei:biblScope[@unit="volume"]', ns)
        page = biblStruct.find('.//tei:biblScope[@unit="page"]', ns)

        # Format the reference for the bibliography
        reference = ', '.join(filter(None, [', '.join(authors), (title.text if title is not None else None), (year.text if year is not None else None), (publisher.text if publisher is not None else None), (journal.text if journal is not None else None), (volume.text if volume is not None else None), (page.text if page is not None else None)]))
        # Fall back to an empty key if the entry has no xml:id attribute
        ref_id = biblStruct.attrib.get('{http://www.w3.org/XML/1998/namespace}id', '')
        latex_doc.append('\\bibitem{' + ref_id + '} ' + reference + '.')

    # End the bibliography
    latex_doc.append('\\end{thebibliography}')

    details['latex_doc'] = '\n'.join(latex_doc)
    return details



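The slugStrip function cleans a string so that it can be used as a file name or URL slug. It removes every character that is not an ASCII letter, a digit, or a space, collapses repeated spaces, replaces spaces with hyphens, converts the result to lowercase, and trims a leading or trailing hyphen.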

In [4]:
def slugStrip(instr):
    outstr = re.sub('[^a-zA-Z0-9 ]','',instr)
    while "  " in outstr:
        outstr = outstr.replace("  "," ")
    outstr = outstr.replace(" ","-")
    outstr = outstr.lower()
    if len(outstr) > 0:
        if outstr[0] == "-":
            outstr = outstr[1:]
    if len(outstr) > 0: 
        if outstr[-1] == "-":
            outstr = outstr[:-1]
    return outstr
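
To make the behaviour concrete, here is a short sketch that applies a condensed but equivalent version of slugStrip to a few sample strings. Apart from the paper title used later in this notebook, the inputs are hypothetical. Note that hyphens and non-ASCII letters in the input are removed rather than preserved, and that a string made only of special characters produces an empty slug.

```python
import re

def slugStrip(instr):
    # Condensed version of the notebook function with the same steps
    outstr = re.sub('[^a-zA-Z0-9 ]', '', instr)
    while "  " in outstr:
        outstr = outstr.replace("  ", " ")
    outstr = outstr.replace(" ", "-").lower()
    if outstr.startswith("-"):
        outstr = outstr[1:]
    if outstr.endswith("-"):
        outstr = outstr[:-1]
    return outstr

samples = [
    "QUASICRYSTALLINE SPIN FOAM WITH MATTER: DEFINITIONS AND EXAMPLES",
    "  Spin-foam models (2nd ed.)  ",  # the hyphen is dropped, joining "Spin" and "foam"
    "???",                             # nothing survives, so the slug is empty
]

for sample in samples:
    print(repr(sample), '->', repr(slugStrip(sample)))
```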

The create_file_name function is used to convert a given string (in this case, a title) into a suitable file name for a JSON file. The steps it takes to do this are:

    Clean the Title: It uses the slugStrip function to remove special characters from the title, replace spaces with hyphens, and convert the result to lowercase.

    Replace Spaces with Underscores: It then replaces any remaining spaces with underscores using the replace method. Because slugStrip has already turned every space into a hyphen, this step normally changes nothing and acts only as a safeguard. Avoiding spaces is a common practice when creating file names because spaces can cause issues in file paths.

    Limit Length: To prevent the file name from being too long (which can cause issues with some file systems), it then limits the length of the title to 250 characters. This leaves room for the ".json" extension while still ensuring that the total length of the file name stays within the common maximum limit of 255 characters.

    Add Extension: Finally, it adds the ".json" extension to the title, completing the file name.

The function then returns the resulting file name. This function is especially useful when you want to create file names that reflect the contents of the files, and you are dealing with user-provided or otherwise unpredictable text for the titles.

In [5]:
def create_file_name(title):
    # Use the slugStrip function to clean the title
    title = slugStrip(title)

    # Replace any remaining spaces with underscores
    # (slugStrip already converts spaces to hyphens, so this is only a safeguard)
    title = title.replace(' ', '_')

    # Limit length to 250 characters (to allow for the .json extension)
    title = title[:250]
    
    # Append the .json extension
    title += '.json'

    return title
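
The sketch below exercises create_file_name on the title used in this notebook and on two hypothetical edge cases: a very long title, to show the 255-character total, and two titles that differ only in punctuation, which end up with the same file name. The second case is worth keeping in mind, because a later file would overwrite an earlier one in the same directory.

```python
import re

def slugStrip(instr):
    # Condensed version of the notebook function with the same steps
    outstr = re.sub('[^a-zA-Z0-9 ]', '', instr)
    while "  " in outstr:
        outstr = outstr.replace("  ", " ")
    outstr = outstr.replace(" ", "-").lower()
    if outstr.startswith("-"):
        outstr = outstr[1:]
    if outstr.endswith("-"):
        outstr = outstr[:-1]
    return outstr

def create_file_name(title):
    # Same steps as the notebook function
    title = slugStrip(title)
    title = title.replace(' ', '_')
    title = title[:250]
    return title + '.json'

name = create_file_name("QUASICRYSTALLINE SPIN FOAM WITH MATTER: DEFINITIONS AND EXAMPLES")
print(name)

# Hypothetical over-long title: the slug is cut at 250 characters before ".json" is added
long_name = create_file_name("word " * 100)
print(len(long_name))

# Hypothetical titles that differ only in punctuation collide
print(create_file_name("A+B") == create_file_name("AB"))
```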

## Extracting information from one file

The code in this section demonstrates how to load a single XML file and extract information from it.

Here's a step-by-step breakdown of the process:

    Specify the File Location: The location of the XML file is specified using the dir_path and file_name variables. These are then combined using the os.path.join() function to create the full path to the file.

    Load the File: The TextLoader class is used to load the XML file from the specified location. The load() method of the TextLoader instance is then called to read the content of the file.

    Extract Information: For each document loaded (in this case, there's just one), the extract_information_from_tei() function is called with the document's content as an argument. This function extracts various pieces of information from the document, such as the title, authors, abstract, keywords, and the document's main content, and stores them in a dictionary (latex_doc).

At this point, you could do whatever you like with the latex_doc dictionary. You might print it out to see the extracted information, or write it to a file for later use. For example, you could convert it to JSON and write it to a .json file, or use it to generate a LaTeX document.

In [9]:
# Load the file with TextLoader
dir_path = "data/xml/"
file_name = "arxiv QSF 2023.tei.xml"

full_path = os.path.join(dir_path, file_name)

loader = TextLoader(full_path)
documents = loader.load()

# Extract the information from each loaded document
for document in documents:
    latex_doc = extract_information_from_tei(document.page_content)
    # do something with latex_doc, for example print it or write it to a file

latex_doc

{'title': 'QUASICRYSTALLINE SPIN FOAM WITH MATTER: DEFINITIONS AND EXAMPLES',
 'authors': ['Marcelo Amaral', 'Richard Clawson', 'Klee Irwin'],
 'abstract': 'In this work, we define quasicrystalline spin networks as a subspace within the standard Hilbert space of loop quantum gravity, effectively constraining the states to coherent states that align with quasicrystal geometry structures. We introduce quasicrystalline spin foam amplitudes, a variation of the EPRL spin foam model, in which the internal spin labels are constrained to correspond to the boundary data of quasicrystalline spin networks. Within this framework, the quasicrystalline spin foam amplitudes encode the dynamics of quantum geometries that exhibit aperiodic structures. Additionally, we investigate the coupling of fermions within the quasicrystalline spin foam amplitudes. We present calculations for three-dimensional examples and then explore the 600-cell construction, which is a fundamental component of the four-dimensi

In this section, we demonstrate how to save the extracted data (stored in the latex_doc dictionary) to a JSON file, and then how to load it back into Python.

Here's a step-by-step explanation:

    Create File Name: The create_file_name() function is called with the title of the document as an argument. It removes special characters from the title, replaces spaces with hyphens, converts the result to lowercase, truncates it to 250 characters, and appends the '.json' extension. The output is the name of the JSON file where the document's data will be saved.

    Specify Full Path: The os.path.join() function is used to concatenate the directory path (dir_path) and the JSON file name (file_name) to create the full path to the JSON file.

    Save Data to JSON File: A JSON file is opened in write mode with the open() function. The json.dump() function is then used to write the latex_doc dictionary to the JSON file. The file is automatically closed at the end of the with block.

    Load Data from JSON File: The same JSON file is opened in read mode. The json.load() function is used to load the data from the JSON file into a new latex_doc dictionary.

After loading the data, you can access any of the information in the dictionary using its keys. In this case, the title of the document is accessed with latex_doc['title'], and the main content of the document (in LaTeX format) is accessed with latex_doc['latex_doc']. These are then printed to the console.

In [10]:
# testing saving a json file and loading
file_name = create_file_name( latex_doc['title'])
full_path = os.path.join(dir_path, file_name)


# Save the latex_doc dictionary to a JSON file
with open(full_path, 'w') as f:
    json.dump(latex_doc, f)
    
# Load the data from the JSON file
with open(full_path, 'r') as f:
    latex_doc = json.load(f)

# Now you can access the information
print(latex_doc['title'])
print(latex_doc['latex_doc'])

QUASICRYSTALLINE SPIN FOAM WITH MATTER: DEFINITIONS AND EXAMPLES
\title{QUASICRYSTALLINE SPIN FOAM WITH MATTER: DEFINITIONS AND EXAMPLES}
\author{Marcelo Amaral}
\author{Richard Clawson}
\author{Klee Irwin}
\begin{abstract}
In this work, we define quasicrystalline spin networks as a subspace within the standard Hilbert space of loop quantum gravity, effectively constraining the states to coherent states that align with quasicrystal geometry structures. We introduce quasicrystalline spin foam amplitudes, a variation of the EPRL spin foam model, in which the internal spin labels are constrained to correspond to the boundary data of quasicrystalline spin networks. Within this framework, the quasicrystalline spin foam amplitudes encode the dynamics of quantum geometries that exhibit aperiodic structures. Additionally, we investigate the coupling of fermions within the quasicrystalline spin foam amplitudes. We present calculations for three-dimensional examples and then explore the 600-cell
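
One optional variation on the save and load step, shown here as a sketch rather than as part of the notebook's pipeline: opening the file with an explicit UTF-8 encoding and passing ensure_ascii=False to json.dump keeps accented characters readable in the file instead of writing them as \u escapes. The record and the file name below are hypothetical, and a temporary directory is used so that nothing is written into data/xml/.

```python
import json
import os
import tempfile

# Hypothetical record shaped like the dictionary returned by extract_information_from_tei
record = {
    'title': 'Étude of spin foams',
    'authors': ['José García'],
    'latex_doc': '\\title{Étude of spin foams}\n\\section{Introduction}\nText',
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.json')

    # Write with an explicit encoding and without ASCII escaping
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(record, f, ensure_ascii=False, indent=2)

    # Read the raw text to confirm the accented characters were stored as-is
    with open(path, 'r', encoding='utf-8') as f:
        raw = f.read()

    # Load the data back and compare with the original dictionary
    with open(path, 'r', encoding='utf-8') as f:
        restored = json.load(f)

print('Étude' in raw)
print(restored == record)
```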

## Getting the information for a full directory

This section of the code automates the process of extracting data from multiple .tei.xml files located in a directory structure, and saving the extracted data to corresponding .json files in a parallel directory structure.

Here's a step-by-step explanation:

    Walk Through Directory Structure: The os.walk() function is used to traverse through the directory structure of the source directory (dir_source_path). For each directory it encounters, it returns the directory path, the names of any subdirectories, and the names of any files.

    Process Each .tei.xml File: For each file in the current directory, if the file name ends with .tei.xml, the code processes that file. It creates the full source file path by joining the directory path and the file name.

    Load Document and Extract Information: The TextLoader class is used to load the document from the .tei.xml file. The extract_information_from_tei() function is then used to extract data from the document. This function returns a dictionary, latex_doc, containing the extracted data. If there's an error during the extraction process (for example, if the document doesn't have a title), the code prints an error message and skips to the next document.

    Create Destination Directory Path: The relative path of the current directory is computed with respect to the source directory path. The destination directory path is then created by joining the destination directory path (dir_destination_path) and the relative directory path. If the destination directory doesn't already exist, it's created with os.makedirs().

    Create Destination File Name: The create_file_name() function is called with the title of the document as an argument to create the name of the .json file where the document's data will be saved.

    Create Full Destination File Path: The full destination file path is created by joining the destination directory path and the .json file name.

    Save Data to JSON File: Finally, the latex_doc dictionary is saved to the .json file using the json.dump() function.

By running this code, you can automate the process of extracting data from a large number of .tei.xml files and saving that data to .json files.

In [11]:
dir_source_path = "data/xml/papers_xml/"
dir_destination_path = "data/xml/papers_json/"

# Walk through the source directory structure
for dirpath, dirnames, filenames in os.walk(dir_source_path):
    # Process each .tei.xml file in the current directory
    for filename in filenames:
        if filename.endswith('.tei.xml'):
            # Create the full source file path
            full_source_path = os.path.join(dirpath, filename)

            loader = TextLoader(full_source_path)
            documents = loader.load()

            # Extract the information from each loaded document
            for document in documents:
                try:
                    # Extract once; skip files that lack a title or contain malformed XML
                    latex_doc = extract_information_from_tei(document.page_content)
                except (ValueError, ET.ParseError) as e:
                    print(f"Skipping {full_source_path} due to error: {e}")
                    continue

                # Create the destination directory path
                relative_dirpath = os.path.relpath(dirpath, dir_source_path)
                full_destination_dirpath = os.path.join(dir_destination_path, relative_dirpath)

                # Create the destination directory if it does not exist
                os.makedirs(full_destination_dirpath, exist_ok=True)

                # Create the destination file name
                file_destination_name = create_file_name(latex_doc['title'])

                # Create the full destination file path
                full_destination_path = os.path.join(full_destination_dirpath, file_destination_name)

                # Save the latex_doc dictionary to a JSON file
                with open(full_destination_path, 'w') as f:
                    json.dump(latex_doc, f)
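
The path arithmetic in the loop above can be checked in isolation. The helper below is a hypothetical function, not part of the notebook, that repeats the relpath and join steps and adds os.path.normpath so that the top-level case does not end in "/.". It only computes paths and does not touch the file system.

```python
import os

def mirrored_destination(dirpath, source_root, destination_root):
    """Map a directory under source_root to the matching directory under destination_root."""
    relative_dirpath = os.path.relpath(dirpath, source_root)
    return os.path.normpath(os.path.join(destination_root, relative_dirpath))

source_root = "data/xml/papers_xml/"
destination_root = "data/xml/papers_json/"

# A nested directory keeps its relative position
nested = mirrored_destination("data/xml/papers_xml/physics/2023", source_root, destination_root)
print(nested)

# The source root itself maps to the destination root
top = mirrored_destination(source_root, source_root, destination_root)
print(top)
```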


## Final Notes

With this process complete, we have transformed the .tei.xml files, which GROBID produced from scientific articles, into structured JSON files. This structured format is easier to access and manage, and it can be processed directly with standard Python tools for data manipulation and analysis.

These JSON files contain valuable information such as the title, authors, publication date, abstract, keywords, and the content of the article in LaTeX format. This provides a rich dataset for various natural language processing (NLP) tasks.

For instance, we can load these JSON files into pandas DataFrames to carry out exploratory data analysis. In addition, it is possible to preprocess the text data by performing operations like tokenization, stemming or lemmatization, and removal of stop words.

Once the data is preprocessed, we can convert the text into numerical features using a variety of techniques, including Bag-of-Words, TF-IDF, or word embeddings. These numerical features can then be used as input for machine learning models. Depending on the task at hand, these models could be used for text classification, information retrieval, topic modelling, text generation, or even more complex tasks like machine translation or question answering.

Furthermore, the structured format of the JSON files allows for easy metadata analysis. For example, one could analyze trends in authorship or study the usage and popularity of certain keywords over time.
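
As a minimal sketch of such a metadata analysis, the example below counts keywords and authors across a few hypothetical records shaped like the dictionaries saved by this notebook. In practice the records would come from json.load on the generated files. The 'date' and 'keywords' keys are read with get because the extraction function only sets them when the corresponding elements exist.

```python
from collections import Counter

# Hypothetical records; real ones would be loaded from the generated JSON files
records = [
    {'title': 'Paper A', 'date': '2021', 'authors': ['Ada Lovelace'],
     'keywords': ['spin foam', 'quasicrystal']},
    {'title': 'Paper B', 'date': '2023', 'authors': ['Ada Lovelace', 'Alan Turing'],
     'keywords': ['spin foam']},
    {'title': 'Paper C', 'authors': ['Alan Turing']},  # no date or keywords extracted
]

# Count how often each keyword and each author appears
keyword_counts = Counter(kw for record in records for kw in record.get('keywords', []))
author_counts = Counter(author for record in records for author in record.get('authors', []))

# Count records that have no date, as a simple data-quality check
missing_dates = sum(1 for record in records if 'date' not in record)

print(keyword_counts.most_common(1))
print(dict(author_counts))
print(missing_dates)
```

The same list of dictionaries could also be passed to pandas.DataFrame for the exploratory analysis mentioned earlier, assuming pandas is installed.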

It's important to note that the quality of data fed into the machine learning models significantly influences the quality of the output. Therefore, it's critical to ensure that the data is as clean and well-structured as possible. The methods and techniques outlined in this notebook are aimed at facilitating this process.

This notebook serves as a pre-processing step, preparing the ground for more sophisticated NLP and machine learning workflows. With the structured data in hand, we are now ready to delve deeper into the world of data science and machine learning, leveraging the information contained in these scientific articles to extract valuable insights or make predictions.

## Appendix: Using GROBID to Extract Information from Nested Directories with Scientific PDF Files

In this appendix, we will provide step-by-step instructions on how to use GROBID, a machine learning library for extracting structured data from scientific literature, including PDF files. We will explain how to install and run GROBID, how to use the GROBID Python client, and how to process PDF files in nested directories. We will also demonstrate how to replicate the directory structure from one location to another.

## Download and Install GROBID

Download the GROBID source code:

```
wget https://github.com/kermitt2/grobid/archive/0.7.3.zip
```

Unzip the downloaded file:
```
unzip 0.7.3.zip
```

Navigate to the GROBID directory:
```
cd grobid-0.7.3/
```

Build and install GROBID:
```
./gradlew clean install
```

## Configure GROBID

You can customize the configuration of GROBID to suit your needs:

```
nano grobid-home/config/grobid.yaml
```

After configuring, run GROBID:
```
./gradlew run
```

## Install the GROBID Python Client

Navigate back to the parent directory:
```
cd .. 
```

Clone the GROBID Python client repository:
```
git clone https://github.com/kermitt2/grobid_client_python
```

Navigate to the GROBID Python client directory:
```
cd grobid_client_python/
```

Install the GROBID Python client:
```
python3 setup.py install
```


## Run GROBID on Nested Directories

Create a Bash script, pdfextractionGrobid.sh, to process all PDF files in nested directories:

```
#!/bin/bash

# Base directory ($HOME is used because "~" is not expanded inside double quotes)
base_dir="$HOME/install/grobid_client_python/pdf"

# Find all directories in the base directory and its subdirectories
find "${base_dir}" -type d | while IFS= read -r dir; do
    # Process all the PDFs in the directory with GROBID
    grobid_client --input "${dir}" --output "${dir}" processFulltextDocument
done
```

Make the script executable and run it:
```
chmod +x pdfextractionGrobid.sh
./pdfextractionGrobid.sh
```

## Replicate Directory Structure

To replicate the directory structure from one location to another (for example, from data/xml/ to data/partitions/), you can use the rsync command:

```
rsync -av -f"+ */" -f"- *" data/xml/ data/partitions/
```

This command will create an identical directory structure in data/partitions/ without copying the files.
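
If rsync is not available, the same result can be approximated from Python. The following is a sketch under that assumption: replicate_directory_structure is a hypothetical helper, and the example runs inside a temporary directory instead of data/xml/ so that it does not modify any real data.

```python
import os
import tempfile

def replicate_directory_structure(source_root, destination_root):
    """Create every directory found under source_root below destination_root, without files."""
    for dirpath, dirnames, filenames in os.walk(source_root):
        relative_dirpath = os.path.relpath(dirpath, source_root)
        target = os.path.normpath(os.path.join(destination_root, relative_dirpath))
        os.makedirs(target, exist_ok=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, 'xml')
    destination = os.path.join(tmp_dir, 'partitions')

    # Build a small source tree containing one file
    os.makedirs(os.path.join(source, 'physics', '2023'))
    with open(os.path.join(source, 'physics', 'paper.tei.xml'), 'w') as f:
        f.write('<TEI/>')

    replicate_directory_structure(source, destination)

    # The directories exist in the destination, the file does not
    copied_dirs = os.path.isdir(os.path.join(destination, 'physics', '2023'))
    copied_files = os.path.exists(os.path.join(destination, 'physics', 'paper.tei.xml'))

print(copied_dirs, copied_files)
```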

By following these steps, you can use GROBID to automatically extract structured data from scientific PDFs in nested directories. This data can then be processed and analyzed further, as demonstrated in the main part of this notebook.