This is a POC (proof of concept) intended to show that existing LLMs (large language models) can be used to extract data from a PDF **without a need to develop specialized models**.

## POC Description
This POC will:
1. Extract text from a PDF document
2. Formulate a query to OpenAI's `text-davinci-003` model via an API call, using the text extracted from the PDF
3. Parse the API response into a Python dictionary

As an example, I will use a PDF downloaded from the [Renewed Materials website](http://www.renewedmaterials.com/alkemi-copper).

## Goal
Prove that existing LLMs like GPT-3 can extract data from natural language documents (PDFs, web pages, etc.) and return structured data for further manipulation.
The last step will show that the data is parsed and available for further programmatic use, such as storing it in a database or further processing.

## How to use this

This is a [Jupyter Notebook](https://jupyter.org/), a tool used mostly for rapid prototyping in data science. To run it as-is, read more on the [Jupyter website](https://jupyter.org/).

*Alternatively*, the code from the notebook can be copied to a regular Python file and executed from there.

### Requirements

1. This code has a few dependencies (listed in `requirements.txt`), which can be installed via pip or any other Python dependency manager you like
2. It requires an OpenAI API key (I won't be sharing mine ;) ) set in the environment variable `OPENAI_API_KEY`. Read more about OpenAI's API [here](https://beta.openai.com/docs/introduction)
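
A missing key otherwise only surfaces when the request is sent, so it can help to check for it up front. The following is a minimal sketch; the helper name `get_api_key` is an assumption and not part of the notebook.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    `get_api_key` is a hypothetical helper, not something the notebook defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "set it before running the notebook."
        )
    return key
```

Calling `get_api_key()` at the top of the notebook fails fast, before any text extraction has been done.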


# Step 1: Extracting text from a PDF document

## Disclaimer
Because it did not make sense to spend a lot of time looking for appropriate documents, and I would not know what those documents are anyway, I picked a random one and tried to find some non-trivial data in it.
## Parameters
The following parameters let you experiment with the existing model. You can use a document that makes more sense to you and change what to look for:
1. `searched_data`: the type of data you are looking for
2. `searched_parameters`: the parameters you are explicitly interested in
3. `file_path`: the path of the file to work on


In [1]:
searched_data = "construction material testing information"
searched_parameters = "material name, weight, and water absorption"
file_path = "Renewed-Materials-LLC177-026855-1.pdf"

## Reading PDF file

In [2]:
import re

from PyPDF2 import PdfReader

reader = PdfReader(file_path)
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
# Clean up special characters. The more documents you process,
# the more cleanup rules like these you will need.
text = text.replace("\xa0", " ")  # non-breaking space -> regular space
text = re.sub(r"(\n +){2,}", "", text)  # drop runs of lines that contain only spaces

print(text)

 
This document is issued by the Company subject to its General Conditions of Service printed overleaf, available on request or accessible at 
www.sgs.com/terms_and_conditions.htm  and, for electronic format documents, subject to Terms and Conditions for Electronic Documents at 
www.sgs.com/terms_e -document.htm .  Attention is drawn to the limitation of liability, indemnification and jurisdiction issues defined the rein.  Any 
holder of this document is advised that information contained hereon reflects the Company’s findings at the time of its intervention only and 
within the limits of Client’s instructions, if any.   The Company’s sole responsibility is to its Client and this document does not exonerate parties to 
a transaction from exercising all their rights and obligations under the transaction documents.  This docum ent cannot be reproduced except in 
full, without prior written approval of the Company.  Any unauthorized alteration, forgery or falsification of the content or a
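
As more documents are processed, the cleanup rules tend to grow, so it may be worth collecting them in one function. The sketch below is one possible shape, under assumptions: the name `clean_extracted_text` and the specific rules are illustrative and are not taken from the notebook. Note that NFKC normalization also folds characters such as ligatures and superscripts, which may not be wanted for documents where `m²` must stay distinct from `m2`.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Normalize whitespace in text extracted from a PDF (illustrative rules only)."""
    # NFKC folds compatibility characters, e.g. non-breaking space -> space.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove spaces that surround a line break.
    text = re.sub(r" *\n *", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```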

# Step 2: Formulate query
The goal of the query is to give the model a directive and the text to analyze in a single request. For more examples, see the [docs](https://beta.openai.com/examples).

In [3]:
query = f"Extract {searched_data} from the document. Collect only {searched_parameters} parameters. Provide the result as a JSON:\n{text}"

# The model has a limit of roughly 4k tokens, shared between the prompt and the completion,
# where a "token" is "roughly 4 characters in plain English"
char_limit = 4000 * 4
limited_query = query[:char_limit]

print(limited_query)

Extract construction material testing information from the document. Collect only material name, weight, and water absorption parameters. Provide the result as a JSON:
 
This document is issued by the Company subject to its General Conditions of Service printed overleaf, available on request or accessible at 
www.sgs.com/terms_and_conditions.htm  and, for electronic format documents, subject to Terms and Conditions for Electronic Documents at 
www.sgs.com/terms_e -document.htm .  Attention is drawn to the limitation of liability, indemnification and jurisdiction issues defined the rein.  Any 
holder of this document is advised that information contained hereon reflects the Company’s findings at the time of its intervention only and 
within the limits of Client’s instructions, if any.   The Company’s sole responsibility is to its Client and this document does not exonerate parties to 
a transaction from exercising all their rights and obligations under the transaction documents.  This d
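
The cell above cuts the whole query at `4000 * 4` characters, which leaves no room for the completion. A slightly more careful variant reserves part of the budget for the answer and truncates only the document text, so the instruction always survives intact. This is a sketch under assumptions: `build_query`, `CHARS_PER_TOKEN`, and the default numbers are illustrative, and the 4-characters-per-token rule is only the rough estimate quoted above, not an exact tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb, not an exact tokenizer


def build_query(instruction: str, document: str,
                context_tokens: int = 4000, completion_tokens: int = 256) -> str:
    """Join instruction and document, truncating the document to fit the budget.

    Hypothetical helper: the budget is estimated in characters, so a real
    request can still exceed the limit for text that tokenizes poorly.
    """
    budget_chars = (context_tokens - completion_tokens) * CHARS_PER_TOKEN
    # Reserve space for the instruction and the newline that separates it from the text.
    room = budget_chars - len(instruction) - 1
    if room <= 0:
        raise ValueError("instruction alone exceeds the estimated token budget")
    return f"{instruction}\n{document[:room]}"
```

With the defaults, the returned query is never longer than `(4000 - 256) * 4` characters.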

# Step 3: Parse the API response into a Python dictionary

The first step is to send the request and get the response. The code is intentionally split into two cells because sending the request costs money: OpenAI bills every request (by the number of tokens processed).

In [18]:
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
  model="text-davinci-003",
  prompt=limited_query,  # send the truncated query so the request stays within the token limit
  temperature=0,
  max_tokens=256,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.0
)
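
Since every request costs money, one option while experimenting is to cache completions on disk and only call the API on a cache miss. The sketch below assumes a hypothetical `send` callable that takes the prompt and returns the completion text; the function name `cached_completion` and the cache directory are also assumptions, not part of the notebook.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_completion(prompt: str, send: Callable[[str], str],
                      cache_dir: str = ".llm_cache") -> str:
    """Return the completion for `prompt`, calling `send` only on a cache miss."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # The prompt hash is the cache key, so any change to the prompt triggers a new call.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = folder / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["text"]
    text = send(prompt)
    path.write_text(json.dumps({"prompt": prompt, "text": text}), encoding="utf-8")
    return text
```

In this notebook, `send` could wrap the `openai.Completion.create(...)` call above and return `response.choices[0].text`. Caching is a reasonable fit here because `temperature=0` makes the output close to deterministic for a given prompt.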

In [15]:
response_text = response.choices[0].text
print(response_text)


{
  "Material Name": "Alkemi -Acrylic, Color Beluga 600",
  "Weight": {
    "Initial Weight": [29.8713, 31.7711, 31.5311],
    "Final Weight": [29.9305, 31.8230, 31.5797]
  },
  "Water Absorption": {
    "Water Absorption %": [0.198, 0.163, 0.154],
    "Average": 0.172
  }
}
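
The response shown here is clean JSON, but nothing guarantees that: the model may wrap the object in extra text or whitespace. A defensive parser can cut out the outermost `{...}` span before decoding. This is a minimal sketch; `extract_json_object` is a hypothetical helper, and it does not handle responses that contain several separate JSON objects.

```python
import json


def extract_json_object(response_text: str) -> dict:
    """Parse the outermost {...} span found in a model response.

    Raises ValueError if no braces are found; json.JSONDecodeError
    (a subclass of ValueError) if the span is not valid JSON.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the model response")
    return json.loads(response_text[start:end + 1])
```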


## Extracting structured data

With the data parsed, you can now easily store, process, or display it in any way you want.

In [17]:
import json
data = json.loads(response_text)
print(data["Material Name"])
print(data["Weight"]["Initial Weight"])

Alkemi -Acrylic, Color Beluga 600
[29.8713, 31.7711, 31.5311]
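
To illustrate the "storing in a database" part of the goal, the parsed dictionary can be written to SQLite using only the standard library. This is a sketch under assumptions: the table name `extractions`, its columns, and the `store_result` helper are invented for illustration, and the sample dictionary reuses a subset of the values printed above.

```python
import json
import sqlite3


def store_result(conn: sqlite3.Connection, source_file: str, data: dict) -> int:
    """Insert one extraction result and return its row id (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions ("
        "id INTEGER PRIMARY KEY, "
        "source_file TEXT NOT NULL, "
        "material_name TEXT, "
        "payload TEXT NOT NULL)"
    )
    cur = conn.execute(
        "INSERT INTO extractions (source_file, material_name, payload) VALUES (?, ?, ?)",
        # The full dictionary is kept as JSON so nested values are not lost.
        (source_file, data.get("Material Name"), json.dumps(data)),
    )
    conn.commit()
    return cur.lastrowid


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
sample = {
    "Material Name": "Alkemi -Acrylic, Color Beluga 600",
    "Water Absorption": {"Average": 0.172},
}
row_id = store_result(conn, "Renewed-Materials-LLC177-026855-1.pdf", sample)
```

Keeping the raw JSON in a `payload` column is a deliberate choice: the keys returned by the model can vary between documents, so a fixed column per parameter would be brittle.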
