# PDF Text Extraction

Text is extracted from scientific-paper PDFs using GROBID:
- GitHub: https://github.com/kermitt2/grobid
- Documentation: https://grobid.readthedocs.io/en/latest/
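The cells in this notebook assume a GROBID server is already running locally on port 8070. Before converting anything, it can help to confirm the service is reachable. The sketch below is one way to do that, using GROBID's `/api/isalive` endpoint and only the standard library. The function name `grobid_is_alive` and its `opener` parameter are illustrative assumptions, not part of GROBID or of this notebook.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0,
                    opener=urllib.request.urlopen):
    """Return True if GET {base_url}/api/isalive answers with HTTP 200.

    `opener` is a hypothetical hook so the check can be exercised
    without a running server; by default it is urllib's urlopen.
    """
    url = f"{base_url.rstrip('/')}/api/isalive"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `grobid_is_alive()` before the conversion loop gives an early, readable failure instead of a connection error on the first PDF.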

In [8]:
import os
import requests

def convert_pdf_to_xml(input_pdf_path, output_xml_path):
    """Send one PDF to a local GROBID server and save the returned TEI XML."""
    # Skip files that were already converted in an earlier run.
    if os.path.exists(output_xml_path):
        print(f"Already exists: {output_xml_path}")
        return

    # Full-text endpoint of a GROBID server assumed to be listening on port 8070.
    url = "http://localhost:8070/api/processFulltextDocument"
    with open(input_pdf_path, 'rb') as pdf_file:
        # GROBID expects the PDF in a multipart form field named "input".
        files = {'input': pdf_file}
        response = requests.post(url, files=files)

    if response.status_code == 200:
        with open(output_xml_path, 'wb') as xml_file:
            xml_file.write(response.content)
        print(f"Successfully converted: {output_xml_path}")
    else:
        print(f"Failed to convert {input_pdf_path}. Status code: {response.status_code}")
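The function above reports any non-200 status as a failure. GROBID's documentation describes status 503 as meaning the service is temporarily busy, so a caller may prefer to wait and try again. The helper below is a minimal sketch of that idea, not the notebook's own method: `call_with_retry`, `send`, and the default wait values are all hypothetical names and numbers chosen for illustration.

```python
import time


def call_with_retry(send, max_attempts=3, wait_seconds=5.0, sleep=time.sleep):
    """Call send() until it returns a status other than 503 or attempts run out.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute (for example, a function wrapping requests.post).
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code != 503:
            break
        # Server busy: pause before the next attempt, but not after the last one.
        if attempt < max_attempts:
            sleep(wait_seconds)
    return response
```

If this were wired into `convert_pdf_to_xml`, `send` would need to reopen or rewind the PDF before each attempt, because a file handle that has already been uploaded is positioned at its end.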

In [7]:
# Example usage
pdf_name = "c001_Heating a residential building using the heat generated in the lithium ion battery pack by the elect"
input_pdf_path = f"../data/papers/{pdf_name}.pdf"
output_xml_path = f"../data/extractions/{pdf_name}.xml"
convert_pdf_to_xml(input_pdf_path, output_xml_path)

Already exists: ../data/extractions/c001_Heating a residential building using the heat generated in the lithium ion battery pack by the elect.xml
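GROBID writes its result as TEI XML, so the saved `.xml` files still need to be parsed to get at the text. The sketch below shows one way to pull out the title and body paragraphs with the standard library. The helper name `read_tei` and the sample document are assumptions made for illustration; the sample is hand-written and far simpler than real GROBID output.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def _clean(element):
    """Join all nested text of an element and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def read_tei(xml_text):
    """Return the title and body paragraphs found in a TEI XML string or bytes."""
    root = ET.fromstring(xml_text)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    body = root.find(".//tei:body", TEI_NS)
    paragraphs = []
    if body is not None:
        # Walk every <p> below <body>, at any depth.
        paragraphs = [_clean(p) for p in body.iter(f"{{{TEI_URI}}}p")]
    return {
        "title": _clean(title) if title is not None else None,
        "paragraphs": paragraphs,
    }


# Tiny hand-written sample, not real GROBID output.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample title</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><head>Intro</head>
    <p>First <ref>[1]</ref> paragraph.</p><p>Second.</p>
  </div></body></text>
</TEI>"""

result = read_tei(sample)
print(result["title"])       # Sample title
print(result["paragraphs"])  # ['First [1] paragraph.', 'Second.']
```

To use it on a converted paper, read one of the files in `../data/extractions` in binary mode and pass the bytes to `read_tei`.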


In [9]:
# Directory containing the PDF files
pdf_directory = "../data/papers"

# Convert every PDF in the directory, in sorted order so runs are reproducible
for filename in sorted(os.listdir(pdf_directory)):
    if filename.endswith(".pdf"):
        pdf_name = os.path.splitext(filename)[0]  # Remove the .pdf extension
        input_pdf_path = os.path.join(pdf_directory, filename)
        output_xml_path = f"../data/extractions/{pdf_name}.xml"
        convert_pdf_to_xml(input_pdf_path, output_xml_path)

Already exists: ../data/extractions/c001_Heating a residential building using the heat generated in the lithium ion battery pack by the elect.xml
Successfully converted: ../data/extractions/c002_Oxidative Potential and Nanoantioxidant Activity of Flavonoids and Phenolic Acids in Sophora flavesc.xml
Successfully converted: ../data/extractions/c003_The Choice of Anesthetic Drugs in Outpatient Hysteroscopic Surgery A.xml
Successfully converted: ../data/extractions/c004_A Fault-Tolerant Structure for Nano-Power Communication Based on the Multidimensional Crossbar Switc.xml
Successfully converted: ../data/extractions/c005_A Secure Routing Protocol for Wireless Sensor Energy Network Based on Trust Management.xml
Successfully converted: ../data/extractions/c006_A Study on the Impact of International Translation Levels Based on Multiple Correlation Analysis.xml
Successfully converted: ../data/extractions/c007_An Effective Hybrid Multiobjective Flexible Job Shop Scheduling Problem Based on Impr