# Extract metadata from PDF files using a fine-tuned GPT-3 language model

This notebook will demonstrate how we can extract Dublin Core style metadata about PDF documents, in this case doctoral theses from four Finnish universities (Åbo Akademi, University of Turku, University of Vaasa and Lappeenranta University of Technology), using only the raw text from the first few pages of the PDF.

The set of 192 documents will be split into two subsets (train: 149, test: 43). From each PDF we will extract the text of up to about 5 pages, aiming for 500 to 700 words. The corresponding metadata, which has been exported from the DSpace repositories of the universities, is represented in a simple textual "key: value" format, which should be easy enough for a language model to handle. The train set is used to create a data set for fine-tuning a GPT model. The fine-tuned model can then be used to generate similar metadata for unseen documents from the test set.
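To make the "key: value" format concrete, here is a minimal sketch of how a list of (field, value) pairs could be rendered as text and parsed back. The helper names `to_text` and `from_text` and the sample record are illustrative assumptions; the values are made up and do not come from the DSpace exports.

```python
# Illustrative sketch only: helper names and the sample record are hypothetical.

def to_text(metadata):
    """Render a list of (field, value) tuples as "key: value" lines."""
    return "\n".join(f"{field}: {value}" for field, value in metadata)

def from_text(text):
    """Parse "key: value" lines back into a list of (field, value) tuples."""
    pairs = []
    for line in text.splitlines():
        # split on the first ": " only, because values may contain colons
        field, sep, value = line.partition(": ")
        if sep:  # ignore lines that do not look like "key: value"
            pairs.append((field, value))
    return pairs

sample = [
    ("title", "An example thesis: with a subtitle"),
    ("contributor/author", "Doe, Jane"),
    ("date/issued", "2021"),
]
rendered = to_text(sample)
print(rendered)
assert from_text(rendered) == sample
```

A line-based format like this only round-trips cleanly when values contain no line breaks, which is one reason to keep the metadata fields short and simple.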

For this experiment, an OpenAI API access key is required. It can be generated after registering a user account (the same account can be used for e.g. ChatGPT). The API key has to be stored in the environment variable `OPENAI_API_KEY` before starting this notebook. Fine-tuning will cost around \$5 and generating new metadata with the API also has a small cost, but at the time of writing every account gets a free \$18 credit from OpenAI, which is plenty for this experiment even with a few iterations.
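A missing environment variable otherwise surfaces as a bare `KeyError`. As a minimal sketch, a small check like the following could give a more readable error; the helper name `require_api_key` is an assumption for illustration and is not used elsewhere in this notebook.

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the given mapping or raise a readable error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before starting the notebook")
    return key

# demonstrate with a plain dictionary instead of the real environment
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Passing the mapping as a parameter keeps the check easy to try out without touching the real environment.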

This notebook depends on a few Python libraries, which are listed in `requirements.txt`. See the README for details.

## Test the connection and API key

Make sure it's possible to use the OpenAI API.

In [1]:
import openai
import os

# read the OpenAI API key from an environment variable
openai.api_key = os.environ['OPENAI_API_KEY']

# test the API connection by making a simple request
response = openai.completions.create(model="davinci-002", prompt="Say this is a test", temperature=0, max_tokens=7)
print(response)
print(response.choices[0].text)

Completion(id='cmpl-8NfzPEOGuGfXo95HaW7XaE7lw9dP7', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=". I'm gonna say this is")], created=1700653643, model='davinci-002', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=7, prompt_tokens=5, total_tokens=12))
. I'm gonna say this is


## Prepare the data set

Extract metadata and PDF text
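The page selection in this step follows a simple word budget: keep adding pages until the minimum word count is reached, and skip any page that would push the total past the upper limit. The sketch below illustrates that idea on plain strings, with no PDF involved. The function name `select_pages` and its default values are assumptions chosen to mirror the settings of this notebook.

```python
def select_pages(page_texts, max_pages=5, text_min=500, text_limit=700):
    """Pick leading pages until the word count reaches text_min, staying below text_limit."""
    selected = []
    total_words = 0
    for text in page_texts:
        n_words = len(text.split())
        if total_words + n_words < text_limit:
            selected.append(text)
            total_words += n_words
        # a page that would push the total to the limit or beyond is skipped
        if len(selected) >= max_pages or total_words >= text_min:
            break
    return selected, total_words

# four fake "pages" of 100, 650, 300 and 200 words
pages = ["w " * 100, "w " * 650, "w " * 300, "w " * 200]
selected, total = select_pages(pages)
print(len(selected), total)
```

In this example the 650-word page is skipped because it would exceed the limit, and the two pages after it fill the budget instead.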

In [6]:
# Define some settings for the metadata extraction

import glob

MAXPAGES = 5  # how many pages of text to extract (maximum)
MARGIN = 2  # how many more pages to look at, in case we can't find text from the first ones
TEXT_MIN = 500  # how many words to aim for (minimum)
TEXT_LIMIT = 700  # upper limit on # of words

# files containing metadata about doctoral theses documents, exported from DSpace repositories
METADATAFILES = glob.glob("dspace/*-doctheses.xml")

# metadata fields we are interested in (corresponding to fields used in DSpace)
# syntax: "fieldname" or "fieldname/qualifier"
METADATAFIELDS = """
title
title/alternative
contributor/faculty
contributor/author
contributor/organization
contributor/opponent
contributor/supervisor
contributor/reviewer
publisher
date/issued
relation/issn
relation/isbn
relation/ispartofseries
relation/numberinseries
""".strip().split()

# identifiers of documents that will form the test set
# these have been selected to correspond with a demo application for the same purpose
TEST_SET_IDS = """
handle_10024_181378
handle_10024_181284
handle_10024_181280
handle_10024_181229
handle_10024_181227
handle_10024_181210
handle_10024_181206
handle_10024_181139
handle_10024_181073
handle_10024_181025
handle_10024_181001
handle_10024_163335
handle_10024_163304
handle_10024_163298
handle_10024_163277
handle_10024_163276
handle_10024_163263
handle_10024_163258
handle_10024_163257
handle_10024_163057
handle_10024_163056
handle_10024_162878
handle_10024_11364
handle_10024_11363
handle_10024_11348
handle_10024_11207
handle_10024_10928
handle_10024_10620
handle_10024_10614
handle_10024_10443
handle_10024_10432
handle_10024_10254
handle_10024_152922
handle_10024_152903
handle_10024_152904
handle_10024_152860
handle_10024_152862
handle_10024_152853
handle_10024_152854
handle_10024_152855
handle_10024_152852
handle_10024_152846
handle_10024_152836
""".strip().split()

In [7]:
#%%time

from lxml import etree
import requests
import os.path
from pypdf import PdfReader
import glob

# train set: document identifiers, text (x) and metadata (y)
train_ids = []
train_x = []
train_y = []

# test set: document identifiers, text (x) and metadata (y)
test_ids = []
test_x = []
test_y = []

def extract_metadata(doc_item):
    """extract the metadata as a list of (key, value) tuples from an etree element representing a document"""
    metadata = []
    for fldname in METADATAFIELDS:
        if '/' in fldname:
            fld, qualifier = fldname.split('/')
            for val in doc_item.findall(f"metadata[@element='{fld}'][@qualifier='{qualifier}']"):
                if fld == 'date':
                    metadata.append((fldname, val.text[:4]))  # only the year
                else:
                    metadata.append((fldname, val.text))
        else:
            for val in doc_item.findall(f"metadata[@element='{fldname}'][@qualifier='']"):
                metadata.append((fldname, val.text))
    return metadata

def id_to_fn(identifier):
    """convert a URI identifier to a simpler string we can use as a filename for the PDF"""
    return 'docs/' + identifier.replace('https://', '').replace('/','_') + ".pdf"

def download(file_url, identifier):
    """download a PDF file, with the given identifier, from the given URL (unless this was done already)
    and return a path to the PDF file"""
    path = id_to_fn(identifier)
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return path

    response = requests.get(file_url)
    response.raise_for_status()  # do not save an HTTP error page as if it were a PDF
    with open(path, "wb") as f:
        f.write(response.content)
        print(f"wrote {file_url} as {path}")
    return path

def extract_text(fn):
    """extract and return the first few pages of text from the given PDF file"""
    reader = PdfReader(fn)
    texts = []
    extracted_pages = 0
    extracted_length = 0
    for idx, page in enumerate(reader.pages[:MAXPAGES + MARGIN]):
        text = page.extract_text()
        text_length = len(text.strip().split())
        if extracted_length + text_length < TEXT_LIMIT:
            texts.append(text)
            extracted_length += text_length
            extracted_pages += 1
        else:
            print(f"skipping page {idx+1} of {fn}: text would become too long")
        if extracted_pages >= MAXPAGES or extracted_length >= TEXT_MIN:
            break
    return '\n'.join(texts)

def is_test_doc(identifier):
    """return True iff the given identifier belongs to the test set"""
    shortid = 'handle' + identifier.split('handle')[1].replace('/', '_')
    return shortid in TEST_SET_IDS

# Read all the metadata files, extract the DSpace metadata, download the PDFs and extract text from them
# into the train_* and test_* lists
for fn in METADATAFILES:
    tree = etree.parse(fn)
    for item in tree.findall('item'):
        try:
            identifier = item.find("metadata[@element='identifier'][@qualifier='uri']").text
        except AttributeError:
            print("no identifier found, skipping")
            continue
        try:
            file_url = item.find('file').text
        except AttributeError:
            print(f"no file element found (id: {identifier}), skipping")
            continue
        path = download(file_url, identifier)
        text = extract_text(path)
        metadata = extract_metadata(item)
        if is_test_doc(identifier):
            test_ids.append(identifier)
            test_x.append(text)
            test_y.append(metadata)
        else:
            train_ids.append(identifier)
            train_x.append(text)
            train_y.append(metadata)

print(f"train set size: {len(train_ids)}")
print(f"test set size: {len(test_ids)}")

skipping page 5 of docs/www.utupub.fi_handle_10024_153232.pdf: text would become too long
skipping page 5 of docs/www.utupub.fi_handle_10024_153200.pdf: text would become too long
skipping page 5 of docs/www.doria.fi_handle_10024_182724.pdf: text would become too long
skipping page 6 of docs/www.doria.fi_handle_10024_182724.pdf: text would become too long
skipping page 5 of docs/www.doria.fi_handle_10024_182159.pdf: text would become too long
skipping page 6 of docs/www.doria.fi_handle_10024_182159.pdf: text would become too long
skipping page 5 of docs/www.doria.fi_handle_10024_181975.pdf: text would become too long
skipping page 6 of docs/www.doria.fi_handle_10024_181975.pdf: text would become too long
skipping page 5 of docs/www.doria.fi_handle_10024_181902.pdf: text would become too long
skipping page 6 of docs/www.doria.fi_handle_10024_181902.pdf: text would become too long
skipping page 7 of docs/www.doria.fi_handle_10024_181902.pdf: text would become too long
skipping page 5 of 

## Fine-tuning

Prepare a fine-tuning dataset and use it to fine-tune a GPT-3 model.

In [10]:
# prepare fine-tuning dataset
import json

PROMPT_SUFFIX = '\n\n###\n\n'
COMPLETION_STOP = '\n###'
TRAINFILE = 'fine-tune.jsonl'

def metadata_to_text(metadata):
    """convert a list of (key, value) metadata tuples to text with one "key: value" pair per line"""
    return "\n".join([f"{fld}: {val}" for fld, val in metadata])

def create_sample(text, metadata):
    """create a fine-tuning sample from text and metadata about a single document"""
    return {'prompt': text + PROMPT_SUFFIX,
            'completion': metadata_to_text(metadata) + COMPLETION_STOP}

with open(TRAINFILE, 'w') as outf:
    for text, metadata in zip(train_x, train_y):
        sample = create_sample(text, metadata)
        print(json.dumps(sample), file=outf)

print(f"wrote fine-tuning data set into file {TRAINFILE}")

wrote fine-tuning data set into file fine-tune.jsonl
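Before uploading, the file can also be sanity-checked locally. The following is a minimal sketch, assuming the same prompt separator and stop sequence as in the cell above; the function name `check_samples` is hypothetical, and the demonstration runs on a small temporary file rather than on the real training data.

```python
import json
import os
import tempfile

def check_samples(path, prompt_suffix="\n\n###\n\n", completion_stop="\n###"):
    """Return the number of samples, raising ValueError on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            sample = json.loads(line)
            if set(sample) != {"prompt", "completion"}:
                raise ValueError(f"line {lineno}: unexpected keys {sorted(sample)}")
            if not sample["prompt"].endswith(prompt_suffix):
                raise ValueError(f"line {lineno}: prompt lacks the separator")
            if not sample["completion"].endswith(completion_stop):
                raise ValueError(f"line {lineno}: completion lacks the stop sequence")
            count += 1
    return count

# try it on a small temporary file with one made-up sample
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        demo = {"prompt": "some page text\n\n###\n\n", "completion": "title: Example\n###"}
        print(json.dumps(demo), file=f)
    n_checked = check_samples(demo_path)
print(n_checked)
```

Calling `check_samples` on the real training file should report as many samples as there are documents in the train set.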


In [None]:
# Optional:
# Check that the fine-tuning data set is OK using the prepare_data tool.
# It will complain that all completions start with the same "title:" prefix; this can be ignored.
# NOTE: The command has to be interrupted by pressing the stop button in Jupyter.
!openai tools fine_tunes.prepare_data -f fine-tune.jsonl

Analyzing...

- Your file contains 149 prompt-completion pairs
- All prompts end with suffix `\n\n###\n\n`
- All completions start with prefix `title: `. Most of the time you should only add the output data into the completion, without any prefix
- All completions end with suffix `\n###`
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove prefix `title: ` from all completions [Y/n]: ^C



In [25]:
# Upload training data

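# open the file in a with-block so that the handle is closed after the upload
with open(TRAINFILE, "rb") as trainfile:
    upload_response = openai.files.create(
        file=trainfile,
        purpose="fine-tune"
    )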
trainfile_id = upload_response.id
upload_response

FileObject(id='file-7xlI2o70vyBn0pmu5qr9VpJn', bytes=643486, created_at=1700655246, filename='fine-tune.jsonl', object='file', purpose='fine-tune', status='uploaded', status_details=None)

In [27]:
# Perform the actual fine-tuning via the API. This can take a while, as there may be a long queue.

openai.fine_tuning.jobs.create(
    training_file=trainfile_id,
    model="davinci-002"
)

FineTuningJob(id='ftjob-03qWkJDNvGiWPkUS8gKj6Po6', created_at=1700655410, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='davinci-002', object='fine_tuning.job', organization_id='org-5QEUW2DacClOLTNQvTEKMHdV', result_files=[], status='validating_files', trained_tokens=None, training_file='file-7xlI2o70vyBn0pmu5qr9VpJn', validation_file=None)
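Since the job runs asynchronously, its status has to be checked repeatedly until it finishes. The sketch below shows one way to do that with a generic polling helper. The name `wait_for_job`, the polling parameters and the set of terminal status names are assumptions; only `validating_files` appears in the output above. The demonstration uses a canned status sequence instead of a real API call.

```python
import time

# assumed terminal status names; verify against the API documentation
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=30, max_polls=120):
    """Call get_status() repeatedly until it reports a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish within the polling budget")

# demonstration with a canned status sequence instead of a real API call
fake_statuses = iter(["validating_files", "queued", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

With the real API, the callable could be something like `lambda: openai.fine_tuning.jobs.retrieve(job_id).status`, where `job_id` is the identifier of the job created above.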

In [47]:
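# take the first job returned by the API (assumed to be the one created above) and list its latest events
fine_tuning_job_id = openai.fine_tuning.jobs.list(limit=10).data[0].id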
openai.fine_tuning.jobs.list_events(fine_tuning_job_id=fine_tuning_job_id, limit=20).data

[FineTuningJobEvent(id='ftevent-DRZk9aSjRgegzl4HSqlX336k', created_at=1700655698, level='info', message='The job has successfully completed', object='fine_tuning.job.event', data={}, type='message'),
 FineTuningJobEvent(id='ftevent-ou3WAE3Cv5WKKcGk57tnEWlU', created_at=1700655695, level='info', message='New fine-tuned model created: ft:davinci-002:personal::8NgWUbLM', object='fine_tuning.job.event', data={}, type='message'),
 FineTuningJobEvent(id='ftevent-4rBvsUGvACde4EwYDBsopslF', created_at=1700655685, level='info', message='Step 441/447: training loss=0.03', object='fine_tuning.job.event', data={'step': 441, 'train_loss': 0.028055638074874878, 'train_mean_token_accuracy': 0.9852216839790344}, type='metrics'),
 FineTuningJobEvent(id='ftevent-X5TytJwlt7mLKBfS11MiZDLA', created_at=1700655682, level='info', message='Step 431/447: training loss=0.03', object='fine_tuning.job.event', data={'step': 431, 'train_loss': 0.031480707228183746, 'train_mean_token_accuracy': 0.9888476133346558}, 

In [52]:
# store the name of the model produced by the fine-tuning job above

model_name = openai.fine_tuning.jobs.retrieve(fine_tuning_job_id).fine_tuned_model
model_name

'ft:davinci-002:personal::8NgWUbLM'

## Test the fine-tuned model

Give the model some documents from the test set that it has never seen before and see what kind of metadata it can extract. Compare that to the manually created metadata of the same documents, extracted from DSpace systems.
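Comparing the two versions by eye works for a handful of documents. As a rough sketch of how the comparison could be automated, the following parses both texts into sets of (field, value) pairs and reports precision and recall based on exact matches. The function names and the two sample texts are hypothetical, and exact string matching is a deliberately strict assumption: small spelling differences count as errors.

```python
def parse_metadata(text):
    """Parse "key: value" lines into a set of (field, value) pairs."""
    pairs = set()
    for line in text.splitlines():
        field, sep, value = line.partition(": ")
        if sep:  # ignore lines without a "key: value" structure
            pairs.add((field, value.strip()))
    return pairs

def score(reference_text, generated_text):
    """Return (precision, recall) of the generated pairs against the reference pairs."""
    ref = parse_metadata(reference_text)
    gen = parse_metadata(generated_text)
    matched = ref & gen
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall

# made-up example: the generated year differs from the reference
reference = "title: Example thesis\ncontributor/author: Doe, Jane\ndate/issued: 2021"
generated = "title: Example thesis\ncontributor/author: Doe, Jane\ndate/issued: 2022"
precision, recall = score(reference, generated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Using sets means that repeated identical pairs are counted only once, which is a simplification for fields that can occur several times.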

In [53]:
def get_completions(text):
    """send the document text to the fine-tuned model and return the generated metadata text"""
    response = openai.completions.create(model=model_name,
                                         prompt=text + PROMPT_SUFFIX,
                                         temperature=0,  # no fooling around!
                                         max_tokens=500,  # should be plenty
                                         stop=[COMPLETION_STOP])  # stop at ###
    return response.choices[0].text

# test it with some sample documents from the test set
for idx in (3,8,13,18,23,28,33,38):
    identifier = test_ids[idx]
    text = test_x[idx]
    metadata = test_y[idx]
    print(identifier)
    print("---")
    print("DSpace metadata:")
    print(metadata_to_text(metadata))
    print("---")
    print("Generated metadata:")
    gen_metadata = get_completions(text).strip()
    print(gen_metadata)
    print()



https://www.utupub.fi/handle/10024/152860
---
DSpace metadata:
title: Essays on income inequality and financial incentives to work
contributor/faculty: fi=Turun kauppakorkeakoulu|en=Turku School of Economics|
contributor/author: Ollonqvist, Joonas
publisher: fi=Turun yliopisto. Turun kauppakorkeakoulu|en=University of Turku, Turku School of Economics|
date/issued: 2021
relation/issn: 2343-3167
relation/ispartofseries: Turun yliopiston julkaisuja - Annales Universitatis Turkuensis, Ser E: Oeconomica
relation/numberinseries: 82
---
Generated metadata:
title: Essays on income inequality and financial incentives to work
contributor/faculty: fi=Turun kauppakorkeakoulu|en=Turku School of Economics|
contributor/author: Ollonqvist, Joonas
publisher: fi=Turun yliopisto. Turun kauppakorkeakoulu|en=University of Turku, Turku School of Economics|
date/issued: 2022
relation/issn: 2343-3167
relation/ispartofseries: Turun yliopiston julkaisua - Annales Universitatis Turkuensis, Ser. E: Oeconomica
rel