## Document cleaning

Most of the downloaded files include sections that are irrelevant for data extraction, such as the bibliography and the acknowledgements, so the next step is to remove them. One way to do this is to use regex patterns to identify and delete these parts. In this example, the pattern `[MISSING_PAGE_FAIL:x]` and the sections 'Acknowledgements' and 'References' are detected and deleted.

```{admonition} Regex
:class: tip

Regular expressions (regex) are a powerful tool for pattern matching, allowing for complex searches, substitutions, and data extraction based on specific string patterns. For instance, the first regular expression, `r'\[MISSING_PAGE_FAIL:\d+\]'`, removes any text that consists of `[MISSING_PAGE_FAIL:` followed by one or more digits and a closing bracket.
```

In [1]:
import llmstructdata

import re


def clean_text(text):
    # Delete the pattern [MISSING_PAGE_FAIL:x]
    cleaned_text = re.sub(r"\[MISSING_PAGE_FAIL:\d+\]", "", text)

    # Delete the acknowledgements section
    cleaned_text = re.sub(
        r"## Acknowledgements.*?(?=##|$)", "", cleaned_text, flags=re.S
    )

    # delete the references section
    cleaned_text = re.sub(r"## References.*", "", cleaned_text, flags=re.S)

    return cleaned_text


input_file = "./markdown_files/10.26434_chemrxiv-2024-1l0sn.mmd"

with open(input_file, "r", encoding="utf-8") as f:
    content = f.read()

# clean the text
cleaned_text = clean_text(content)
print(cleaned_text)



These types of connectivity have also been converted to extended \(\pi\)-systems by oxidative follow-up reactions, allowing a higher level of conjugation and hence strong bathochromic shifts.[8] The installation of heteroatoms has however been a challenge for some time. In 2014, Shinokubo et al. presented linearly connect monomers through an azo-bridge at the \(\beta\)-position (Figure 1A (d)).[10] Linear connectivity at the \(\alpha\)-position using heteroatoms such as sulfur has been achieved through a similarly iterative process by the groups of Hao and Jiao (Figure 1A (e)).[7] Furthermore, cyclic amine-linked oligo-BODIPYs have already been synthesized in a one-pot reaction in 2022 by Song et al., utilizing Buchwald-Hartwig conditions (Figure 1A (f)).[10]

We present a novel type of BODIPY oligomers, connected via _N_-bridges in a linear fashion (Figure 1B). Utilizing both symmetric and unsymmetric BODIPY monomers as building blocks has paved the way to selectively synthesize oli
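
The patterns above match one exact spelling and heading level. As a minimal sketch, assuming the converted files use Markdown `#` headings, the following variant tolerates both spellings ('Acknowledgments' and 'Acknowledgements'), any heading level, and different capitalization. The names `SECTION_PATTERN` and `drop_sections` are hypothetical and only serve this illustration.

```python
import re

# Matches a heading line with an unwanted title and everything up to the
# next heading (or the end of the text).
SECTION_PATTERN = re.compile(
    r"^#{1,6}[ \t]*(?:Acknowledge?ments?|References|Bibliography)[ \t]*$"
    r".*?(?=^#{1,6}\s|\Z)",
    flags=re.S | re.M | re.I,
)


def drop_sections(text):
    """Remove acknowledgement and reference sections from Markdown text."""
    return SECTION_PATTERN.sub("", text)


sample = (
    "# Title\n\nBody text.\n\n"
    "## Acknowledgments\n\nThanks to all.\n\n"
    "## Conclusion\n\nDone.\n\n"
    "## References\n\n[1] Someone.\n"
)
print(drop_sections(sample))
```

Unlike the `## References.*` pattern in `clean_text`, this sketch stops at the next heading, so a section that follows the references (for example an appendix) would be kept.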

### Harmonizing XML files

Many APIs return articles directly in machine-readable XML format. However, the XML formats differ considerably between publishers, which can make this kind of cleanup tedious. Packages such as [Pub2TEI](https://github.com/kermitt2/Pub2TEI) help to streamline this process by converting the different publisher formats into a common TEI format.

```{note}

Execute the following lines in a bash terminal.

    docker run --rm --gpus all --init --ulimit core=0 -p 8060:8060 grobid/pub2tei:0.2
    git clone https://github.com/kermitt2/Pub2TEI
    cd Pub2TEI/client
    pip install requests

The first command starts the Pub2TEI service with Docker; the remaining ones clone the repository and install the `requests` package.
```
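
The container may need some time before it accepts requests. A minimal sketch for waiting until the port is reachable is given below; it only checks that a TCP connection to port 8060 can be opened, not that the service itself is fully initialized. The helper name `wait_for_port` and its default values are assumptions made for this illustration.

```python
import socket
import time


def wait_for_port(host="localhost", port=8060, timeout=60.0, interval=1.0):
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Try to open (and immediately close) a TCP connection
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: give up or wait and retry
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` before sending the first file would avoid spending the retries of the processing function on a server that is still starting.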

In [13]:
import os
import requests
import time

# Define the input directory containing XML files and the output directory for TEI files
input_dir = "./XML_files"
output_dir = "./XML_files_cleaned"
os.makedirs(output_dir, exist_ok=True)

# Define the Pub2TEI server URL
server_url = "http://localhost:8060/service/processXML"


# Function to process a single XML file and write the TEI result to output_path
def process_xml_file(xml_file, output_path):
    # Read the file once so that every retry sends the complete content
    with open(xml_file, "rb") as f:
        xml_bytes = f.read()

    for attempt in range(5):  # Retry up to 5 times
        files = {
            "input": (os.path.basename(xml_file), xml_bytes),
            # Optional, set to '1' for sentence segmentation
            "segmentSentences": (None, "1"),
            # Optional, set to '1' for refining with Grobid
            "grobidRefine": (None, "1"),
        }
        try:
            response = requests.post(server_url, files=files)
        except requests.exceptions.ConnectionError as e:
            # requests raises its own ConnectionError, not the built-in one
            print(f"Connection error: {e}. Retrying in 5 seconds...")
            time.sleep(5)
            continue

        if response.status_code == 200:
            with open(output_path, "wb") as f:
                f.write(response.content)
            print(f"Processed {xml_file} successfully.")
            return output_path

        # The server answered with an error: retrying will not help
        print(f"Failed to process {xml_file}. Status code: {response.status_code}")
        break

    return None


# Process all XML files in the input directory
for filename in os.listdir(input_dir):
    if filename.endswith(".xml"):
        input_file = os.path.join(input_dir, filename)
        print(input_file)
        output_file = os.path.join(output_dir, filename.replace(".xml", ".tei.xml"))
        process_xml_file(input_file, output_file)

./XML_files/ao0c01342.xml
Processed ./XML_files/ao0c01342.xml successfully.


In [14]:
with open(output_file, "r", encoding="utf-8") as f:
    content = f.read()
    print(f"Content of {output_file}:\n")
    print(content)

Content of ./XML_files_cleaned/ao0c01342.tei.xml:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title level="a" type="main">Synthesis and Biological Evaluation of Three New Chitosan
Schiff Base Derivatives</title>
         </titleStmt>
         <publicationStmt>
            <publisher>American Chemical Society</publisher>
            <availability>
               <p>
                  <s>American Chemical Society</s>
               </p>
            </availability>
            <date type="e-published" when="2020-06-01">2020</date>
            <date when="2020-06-16">2020</date>
            <date type="Copyright" when="2020">2020</date>
         </publicationStmt>
         <notesStmt>
            <note type="cont

One could use this tool to first convert the files downloaded from different publishers into a common format and then automatically remove the irrelevant sections of these articles, as shown at the beginning of this section.
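
As a rough sketch of this combination, the snippet below removes unwanted sections from a TEI document with the standard library. It assumes that sections are encoded as `<div>` elements with a `<head>` title and that the bibliography is a `<listBibl>` element; the actual Pub2TEI output should be inspected before relying on this. The function name `strip_sections` and the list of unwanted titles are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # keep the default namespace when serializing

UNWANTED = ("acknowledgments", "acknowledgements", "references")


def strip_sections(tei_xml, unwanted=UNWANTED):
    """Return the TEI document without unwanted <div> sections and <listBibl>."""
    root = ET.fromstring(tei_xml)
    # ElementTree has no parent pointers, so build a child -> parent map
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Remove sections whose heading matches one of the unwanted titles
    for div in list(root.iter(f"{{{TEI_NS}}}div")):
        head = div.find(f"{{{TEI_NS}}}head")
        title = "".join(head.itertext()).strip().lower() if head is not None else ""
        parent = parent_of.get(div)
        if title in unwanted and parent is not None:
            parent.remove(div)

    # Remove bibliography lists (assumed to be encoded as <listBibl>)
    for bibl in list(root.iter(f"{{{TEI_NS}}}listBibl")):
        parent = parent_of.get(bibl)
        if parent is not None:
            parent.remove(bibl)

    return ET.tostring(root, encoding="unicode")
```

Working on the parsed XML tree instead of the raw text avoids the heading-format assumptions that the regex approach has to make.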