# PDF Text Extraction

Text extraction from scientific-paper PDFs using GROBID:
- GitHub: https://github.com/kermitt2/grobid
- Documentation: https://grobid.readthedocs.io/en/latest/

Command used to run the full GROBID model in Docker:

`docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.2`

Took around 20 minutes
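
Before sending PDFs it can help to confirm the container is actually serving requests. The sketch below polls GROBID's `/api/isalive` endpoint using only the standard library. The function names, the base URL default, and the wait and poll intervals are assumptions made for illustration, not part of the original notebook.

```python
import time
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070"):
    """Return True if GET <base_url>/api/isalive answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


def wait_for_grobid(check=grobid_is_alive, max_wait_s=120, poll_s=2, sleep=time.sleep):
    """Poll `check` until it returns True or roughly `max_wait_s` seconds pass."""
    waited = 0
    while waited <= max_wait_s:
        if check():
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```

Calling `wait_for_grobid()` once before the first conversion would then block until the service responds, or return `False` after about two minutes.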

In [3]:
import os
import requests

def convert_pdf_to_xml(input_pdf_path, output_xml_path):
    """Send one PDF to the local GROBID service and save the returned TEI XML."""
    # Skip files that were already converted in an earlier run
    if os.path.exists(output_xml_path):
        print(f"Already exists: {output_xml_path}")
        return

    url = "http://localhost:8070/api/processFulltextDocument"
    with open(input_pdf_path, 'rb') as pdf_file:
        # GROBID expects the PDF in the multipart form field named "input"
        files = {'input': pdf_file}
        response = requests.post(url, files=files)

    if response.status_code == 200:
        with open(output_xml_path, 'wb') as xml_file:
            xml_file.write(response.content)
        print(f"Successfully converted: {output_xml_path}")
    else:
        print(f"Failed to convert {input_pdf_path}. Status code: {response.status_code}")
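
The GROBID service answers with HTTP 503 when all of its worker threads are busy, and the client is expected to wait and re-send the request. A minimal retry sketch under that assumption follows. `post_with_retry`, its parameters, and the default wait time are hypothetical names and values chosen for illustration.

```python
import time


def post_with_retry(post, max_attempts=5, wait_s=5, sleep=time.sleep):
    """Call `post()` until it returns a response whose status is not 503.

    `post` is any zero-argument callable returning an object with a
    `status_code` attribute, e.g. a lambda wrapping `requests.post(...)`.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = post()
        if response.status_code != 503:
            return response
        if attempt < max_attempts:
            sleep(wait_s)  # server busy: wait before re-sending
    return response  # still 503 after the last attempt


# Hypothetical usage with the notebook's URL; the PDF is read into memory
# once so that every retry re-sends the full content:
#   with open(input_pdf_path, "rb") as f:
#       pdf_bytes = f.read()
#   response = post_with_retry(lambda: requests.post(url, files={"input": pdf_bytes}))
```

Passing the request as a callable keeps the retry logic independent of `requests`, so the same helper works with any HTTP client.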

In [5]:
# Example usage
pdf_name = "r001_A Fault Analysis Method for Three-Phase Induction Motors Based on Spiking Neural P Systems"
input_pdf_path = f"../data/papers/{pdf_name}.pdf"
output_xml_path = f"../data/extractions/full_model/{pdf_name}.xml"
convert_pdf_to_xml(input_pdf_path, output_xml_path)

Successfully converted: ../data/extractions/full_model/r001_A Fault Analysis Method for Three-Phase Induction Motors Based on Spiking Neural P Systems.xml


In [6]:
# Directory containing the PDF files
pdf_directory = "../data/papers"

# Iterate through all the files in the directory
for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf") and filename[0] == "r":
        pdf_name = filename[:-4]  # Remove the .pdf extension
        input_pdf_path = os.path.join(pdf_directory, filename)
        output_xml_path = f"../data/extractions/full_model/{pdf_name}.xml"
        convert_pdf_to_xml(input_pdf_path, output_xml_path)

Successfully converted: ../data/extractions/full_model/r069_Development of a pH indicator composed of high moisture-absorbing materials for real-time monitoring.xml
Successfully converted: ../data/extractions/full_model/r214_Magnetic assembly and field-tuning of ellipsoidal-nanoparticle-based colloidal photonic crystals.xml
Successfully converted: ../data/extractions/full_model/r023_Doppler audio signal analysis as an additional tool in Evaluation of umbilical artery circulation.xml
Successfully converted: ../data/extractions/full_model/r155_Experimental detection of a Majorana mode in the core of a magnetic vortex inside a topological insu.xml
Already exists: ../data/extractions/full_model/r001_A Fault Analysis Method for Three-Phase Induction Motors Based on Spiking Neural P Systems.xml
Successfully converted: ../data/extractions/full_model/r131_MBE deserves a place in the history books.xml
Successfully converted: ../data/extractions/full_model/r121_Yttrium hydride nanoantennas for a

## Extract TEI text from XMLs

In [6]:
from lxml import etree

def is_well_formatted(file_path):
    """Return True if the file parses as well-formed XML."""
    try:
        etree.parse(file_path)
        return True
    except etree.XMLSyntaxError as e:
        print("XML Syntax Error:", e)
        return False

In [7]:
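def get_body(element):
    """Depth-first search for the first element whose tag ends with 'body'."""
    # Comments and processing instructions have a non-string tag in lxml
    if isinstance(element.tag, str) and element.tag.endswith('body'):
        return element

    for child in element:
        body = get_body(child)
        if body is not None:
            return body
    return None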
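
As an alternative to matching on the tag suffix, the body can be looked up through the TEI namespace (`http://www.tei-c.org/ns/1.0`). The sketch below uses the standard library's `xml.etree.ElementTree` on a small hand-written sample so it runs without extra dependencies. The sample document and the name `find_body` are made up for illustration, and the same `find` call form should also work on lxml elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample that mimics the overall shape of a TEI document
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Example</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def find_body(root):
    """Return the first TEI <body> element below `root`, or None if absent."""
    return root.find(".//tei:body", TEI_NS)


root = ET.fromstring(SAMPLE_TEI)
body = find_body(root)
print(body.tag)  # {http://www.tei-c.org/ns/1.0}body
```

Unlike a suffix match, the namespaced path cannot accidentally select an element from another vocabulary whose name merely ends in `body`.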

In [8]:
def collect_text_from_divs(body):
    collected_text = []
    for div in body.findall(".//{*}div"):
        # Skip divs nested inside a figure
        if any(parent.tag.endswith("figure") for parent in div.iterancestors()):
            continue
        itered_texts = list(div.itertext())
        if not itered_texts:
            continue  # empty div: nothing to collect
        # The first text node is treated as the section heading, the rest as body text
        collected_text.append(f"{itered_texts[0]}\n{' '.join(itered_texts[1:])}")
    return "\n\n".join(collected_text)
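
Treating the first text node of a div as its heading only holds when the div starts with a `<head>` element. A possible variant reads `<head>` and `<p>` elements explicitly, sketched here with the standard library's `xml.etree.ElementTree` on a hand-written sample. `sections_from_body` and the sample are hypothetical, and the sketch assumes sections are direct `<div>` children of `<body>`, so nested divs are not covered.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def sections_from_body(body):
    """Return (heading, text) pairs for each direct <div> child of `body`."""
    sections = []
    for div in body.findall("tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        heading = "".join(head.itertext()).strip() if head is not None else ""
        # Join each paragraph's text nodes, then normalise internal whitespace
        paragraphs = [
            " ".join("".join(p.itertext()).split())
            for p in div.findall("tei:p", TEI_NS)
        ]
        sections.append((heading, "\n".join(paragraphs)))
    return sections


sample = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div><head>Introduction</head><p>Motors fail <ref>[1]</ref> often.</p><p>Second paragraph.</p></div>
  <div><p>No heading here.</p></div>
  <figure><head>Figure 1</head><figDesc>A plot.</figDesc></figure>
</body>"""

for heading, text in sections_from_body(ET.fromstring(sample)):
    print(repr(heading), repr(text))
```

Because only `<div>` children are visited, the `<figure>` in the sample is skipped without an ancestor check, and a div without a heading yields an empty heading string instead of promoting its first paragraph.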

In [9]:
from lxml import etree

def extract_paper_text(file_path):
    print(file_path)
    tree = etree.parse(file_path)
    body = get_body(tree.getroot())
    if body is None:
        # No <body> element found, so there is no text to extract
        return ""
    text = collect_text_from_divs(body)
    return text

In [10]:
import os

def convert_xml_to_text(input_xml_path, output_text_path):
    if os.path.exists(output_text_path):
        print(f"Already exists: {output_text_path}")
        return
    
    if not is_well_formatted(input_xml_path):
        return
    
    text = extract_paper_text(input_xml_path)
    # Write UTF-8 explicitly: paper text often contains non-ASCII characters
    with open(output_text_path, 'w', encoding='utf-8') as text_file:
        text_file.write(text)

In [11]:
# Directory containing the XML files
xml_directory = "../data/extractions/full_model"

# Iterate through all the files in the directory
for filename in os.listdir(xml_directory):
    if filename.endswith(".xml") and filename[0] == "r":
        xml_name = filename[:-4]  # Remove the .xml extension
        print(xml_name)
        input_xml_path = f"{xml_directory}/{xml_name}.xml"
        output_text_path = f"../data/extractions/full_model_texts/{xml_name}.txt"
        convert_xml_to_text(input_xml_path, output_text_path)

r133_The 2008 revision of the World Health Organization (WHO) classification of myeloid neoplasms and acu
Already exists: ../data/extractions/full_model_texts/r133_The 2008 revision of the World Health Organization (WHO) classification of myeloid neoplasms and acu.txt
r101_CTFFIND4 - Fast and accurate defocus estimation from electron micrographs
Already exists: ../data/extractions/full_model_texts/r101_CTFFIND4 - Fast and accurate defocus estimation from electron micrographs.txt
r040_Ionic liquids as an efficient medium for the mechanochemical synthesis of
Already exists: ../data/extractions/full_model_texts/r040_Ionic liquids as an efficient medium for the mechanochemical synthesis of.txt
r062_Numerical simulation of composition B high explosive charge desensitization in gap test assembly aft
Already exists: ../data/extractions/full_model_texts/r062_Numerical simulation of composition B high explosive charge desensitization in gap test assembly aft.txt
r049_Recent Advances and Perspec