<a href="https://colab.research.google.com/github/lmarszalek-suffolk/ctl/blob/main/Copy_of_Data_Extraction_with_RegEx_and_LLMs_LM_Project(public).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sample Colab for Extracting Data from OCRed PDFs Using Regex and LLMs

One can use this notebook to build a pipeline to parse and extract data from OCRed PDF files. _**Warning:** When using LLMs for entity extraction, be sure to perform extensive quality control. They are very susceptible to distracting language (latching on to text that sounds "kind of like" what you're looking for) and missing language (making up content to fill any holes), and importantly, they do **NOT** give any hint when they may be wrong. You need to make sure random audits are part of your workflow!_ The workflow below uses regular expressions and LLMs. It was originally worked out for parsing zoning board orders and is applied here to reports on conditions in Uganda; the process is generalizable.

1. Collect a set of PDFs
2. Place OCRed PDFs into a folder
3. Write regular expressions to pull out data
4. Write LLM prompts to pull out data
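
Step 3 is not shown in code later in this notebook, so here is a minimal sketch of what a regex pass could look like. The helper name `find_mentions` and the sample text are made up for illustration. The pattern looks for a phrase, case-insensitively, and returns the sentence-like chunk around each hit.

```python
import re

def find_mentions(text, phrase):
    """Return the sentence-like chunks of `text` that contain `phrase` (case-insensitive)."""
    # OCR output often breaks lines mid-sentence, so collapse all whitespace first.
    flat = re.sub(r"\s+", " ", text)
    # Split after ., ! or ? followed by whitespace; crude, but enough for a first pass.
    chunks = re.split(r"(?<=[.!?])\s", flat)
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [chunk.strip() for chunk in chunks if pattern.search(chunk)]

sample = "The report covers land rights. Domestic violence\nremains widespread. No data on trade."
print(find_mentions(sample, "domestic violence"))
```

A regex pass like this is cheap and deterministic, which makes it a useful cross-check on the LLM's yes/no answers.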


# Load Libraries


First we load the libraries we need. Note: if you try to run the cell and get something like `ModuleNotFoundError: No module named 'mod_name'`, you'll need to install the module. You can do this by uncommenting the line below that reads `#!pip install mod_name`, if it's listed. If it isn't, you can probably install it with a similarly formatted command.
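
If you'd rather check what's missing before running the imports, something like the following could work. It relies on the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is just for illustration. Note that `os` and `re` ship with Python and never need installing.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# These are import names, which are not always the same as pip package names.
print(missing_modules(["os", "re", "PyPDF2", "pandas", "numpy"]))
```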

In [None]:
# os and re are part of Python's standard library and never need installing
!pip install PyPDF2
#!pip install pandas
#!pip install numpy

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
!pip install transformers
!pip install openai==0.28
!pip install tiktoken

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
Successfully installed openai-0.28.0
Collecting tiktoken
  Downloading tiktoken-0.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: tiktoken
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the 

In [None]:
import os
from os import walk, path
import PyPDF2
import re
import pandas as pd
import numpy as np
import random

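def read_pdf(file):
    """Return the text of every page in a PDF, or "" if the file can't be read."""
    try:
        with open(file, "rb") as handle:  # the with-block closes the file when done
            pdfFile = PyPDF2.PdfReader(handle, strict=False)
            text = ""
            for page in pdfFile.pages:
                text += " " + page.extract_text()
        return text
    except Exception as e:
        # Report the failure instead of silently returning an empty string
        print("Could not read {}: {}".format(file, e))
        return ""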

In [None]:
# Test Audio call
# Only works on Mac. If you aren't using a Mac, you should disable such calls below.
#tmp = os.system( "say Testing, testing, one, two, three.")
#del(tmp)

In [None]:
import json

from nltk.tokenize import word_tokenize, sent_tokenize

import openai
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

import tiktoken
ENCODING = "gpt2"
encoding = tiktoken.get_encoding(ENCODING)

import time

def complete_text(prompt,temp=0,trys=0,clean=False):
    #time.sleep(23)
    global tokens_used

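    model="text-davinci-003" # OpenAI has since retired this model; swap in a completions model your account can use
    model_token_limit = 4097 # context window of the model above; update it if you change the model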

    token_count = len(encoding.encode(prompt))
    max_tokens= model_token_limit-round(token_count+5)

    #try:
    response = openai.Completion.create(
      model=model,
      prompt=prompt,
      temperature=temp,
      max_tokens=max_tokens,
      top_p=1.0,
      frequency_penalty=0.0,
      presence_penalty=0.0
    )
    output = str(response["choices"][0]["text"].strip())
    #except:
    #    print("Problem with API call!")
    #    output = """{"output":"error"}"""

    #output=prompt

    tokens_used += token_count+len(encoding.encode(output))



    if clean:
        return clean_pseudo_json(output,temp=0,trys=trys)
    else:
        return output

def clean_pseudo_json(string,temp=0,key="output",trys=0,ask_for_help=1):
    try:
        output = json.loads(string)[key]
    except:
        try:
            string_4_json = re.findall(r"\{.*\}",re.sub("\n","",string))[0]
            output = json.loads(string_4_json)[key]
        except:
            try:
                string = "{"+string+"}"
                string_4_json = re.findall(r"\{.*\}",re.sub("\n","",string))[0]
                output = json.loads(string_4_json)[key]
            except Exception as e:
                prompt = "I tried to parse some json and got this error, '{}'. This was the would-be json.\n\n{}\n\nReformat it to fix the error.".format(e,string)
                if trys <= 3:
                    if trys == 0:
                        warm_up = 0
                    else:
                        warm_up = 0.25
                    output = complete_text(prompt,temp=0+warm_up,trys=trys+1)
                    print("\n"+str(output)+"\n")
                elif ask_for_help==1:
                    print(prompt+"\nReformatting FAILED!!!")
                    #try:
                    #    os.system( "say hey! I need some help. A little help please?")
                    #except:
                    #    print("'say' not supported.\n\n")
                    output = input("Let's see if we can avoid being derailed. Examine the above output and construct your own output text. Then enter it below. If the output needs to be something other than a string, e.g., a list or json, start it with `EVAL: `. If you're typing that, be very sure there's no malicious code in the output.\n")
                    if output[:6]=="EVAL: ":
                        output = eval(output[6:])
                else:
                    output = "There was an error getting a response!"

    return output
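
To make the fallback in `clean_pseudo_json` concrete, here is a stripped-down sketch of its first two steps: try `json.loads` directly, then fall back to pulling the `{...}` span out of any surrounding chatter. The function name `extract_json_value` and the sample strings are illustrative, and the retry-with-the-LLM and ask-a-human branches are left out.

```python
import json
import re

def extract_json_value(string, key="output"):
    """Parse `string` as JSON and return the value under `key`, tolerating text around the object."""
    try:
        return json.loads(string)[key]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Drop newlines, then grab everything from the first "{" to the last "}".
        match = re.search(r"\{.*\}", string.replace("\n", ""))
        if match is None:
            raise ValueError("no JSON object found in model output")
        return json.loads(match.group(0))[key]

print(extract_json_value('{"output": "yes"}'))
print(extract_json_value('Sure! Here you go:\n{"output": "no"}\nHope that helps.'))
```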

# Input OpenAI API Key & LLM settings

You'll need an API key to use an LLM. After creating an OpenAI account, you can create an API key here: https://platform.openai.com/account/api-keys

Enter your key between the quotation marks next to `openai.api_key =` below, and run that cell.
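
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One alternative, sketched here, reads the key from an environment variable and falls back to a hidden prompt. The variable name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions for this example, not something the notebook requires.

```python
import os
from getpass import getpass

def load_api_key(env=os.environ, ask=getpass):
    """Return the key from the OPENAI_API_KEY variable, or prompt for it without echoing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed characters, so the key never shows up in cell output.
        key = ask("OpenAI API key: ").strip()
    return key
```

You could then write `openai.api_key = load_api_key()` in the settings cell instead of pasting the key into the notebook.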

In [None]:
# Toggle LLM usage on or off
use_LLM = True

llm_temperature = 0 # I strongly suggest keeping the LLM's temp at zero to avoid it making things up.

openai.api_key = "YOUR-API-KEY-HERE" # <<--- REPLACE WITH YOUR KEY (never share a notebook with a real key in it)
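
One thing to watch before running the extraction loop: `complete_text` sets `max_tokens` to `model_token_limit` minus the prompt's token count, so a PDF whose text approaches the limit leaves no room for an answer and the request is likely to fail. A minimal sketch of one way to guard against that is to trim the document text to a token budget first. The helper name `truncate_to_budget`, the `reserve` default, and the whitespace "tokenizer" used in the demo are all assumptions for illustration.

```python
def truncate_to_budget(text, encode, decode, limit=4097, reserve=500):
    """Trim `text` so that at least `reserve` tokens of `limit` stay free for everything else."""
    tokens = encode(text)
    budget = limit - reserve
    if len(tokens) <= budget:
        return text
    return decode(tokens[:budget])

# Stand-in tokenizer for the demo: one token per whitespace-separated word.
words_encode = str.split
words_decode = " ".join

long_text = "word " * 5000
short = truncate_to_budget(long_text, words_encode, words_decode)
print(len(words_encode(short)))
```

In this notebook you would pass `encoding.encode` and `encoding.decode` instead of the stand-ins, and pick a `reserve` large enough to cover the prompt wording around the text as well as the answer. Truncation throws away the end of the document, so anything that only appears there will be missed.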

# Load and parse files
Next, place a bunch of OCRed PDF files in the right folder (here, the `/content/gdrive/entity_extraction_sample_data/boston/` folder). FWIW, you can use Adobe Pro to OCR in batch. Note: to make your files visible at a location like that above, you'll need to add them to your Google Drive. E.g., you would need to copy https://drive.google.com/drive/folders/1H3bMgxzNxwxNL2YK6eMWt3nX985oBqVS?usp=sharing to your GDrive and name it `entity_extraction_sample_data` for it to be accessible at `/content/gdrive/entity_extraction_sample_data/`. Whatever folder you use, set `filepath` in the code below to match it.
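
Since a wrong folder path simply yields an empty file list, it may help to confirm what the notebook can actually see. This sketch lists only the `.pdf` files directly inside a folder and complains if the folder is missing; the helper name `list_pdfs` is illustrative, and the demo uses a temporary folder rather than Google Drive.

```python
import os
import tempfile

def list_pdfs(folder):
    """Return the sorted names of .pdf files directly inside `folder` (no subfolders)."""
    if not os.path.isdir(folder):
        raise FileNotFoundError("Folder not found: {}".format(folder))
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".pdf") and os.path.isfile(os.path.join(folder, name))
    )

# Demo on a throwaway folder containing two PDFs and one text file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.pdf", "notes.txt", "B.PDF"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_pdfs(tmp)
print(found)
```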

In [None]:
# this mounts your google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
df = pd.DataFrame() #this will create an empty dataframe

# list the files in the drive
filepath = "/content/gdrive/MyDrive/DataExtraction/" # this is where we'll be looking for files
f = []
for (dirpath, dirnames, filenames) in walk(filepath): # create a list of file names
    f.extend(filenames)
    break

f #show list

['4 page_COPY_Uganda-2018.pdf',
 '4 page_Uganda_ Violence against women unabated despite laws and policies _ Africa Renewal.pdf',
 '5 page_Domestic Violence and the Death Penalty in Uganda _ Oxford Law Blogs.pdf',
 '3 page_Making health services a safe place for women_ Uganda steps up to support women subjected to violence.pdf',
 '5 page_Refworld _ Uganda_ Domestic violence, including legislation, statistics and attitudes toward domestic violence; the availability of protection and support services.pdf',
 '7 page_World Report 2023_ Uganda _ Human Rights Watch.pdf',
 '2 page_Uganda’s violence against women survey heralds legislative and policy changes _ UN Women Data Hub.pdf',
 '7 page_Uganda_ Freedom in the World 2023 Country Report _ Freedom House.pdf',
 '7 page_Uganda Policy Hub – None In Three.pdf',
 '7 page excerpt_Intimate Parter Violence against Women in Uganda - Ballard Brief.pdf']

In [None]:
#f=['ENTER FILE NAME'] # to test on just one or a few documents, uncomment this line and put the file name(s), each in single quotes, inside the brackets


token_counts = []
for file in f: # for each file in the list of file names, do some stuff

    tokens_used = 0

    column_names = ["file"]
    column_values = [file]

    fileloc = filepath+file
    text = read_pdf(fileloc)
    #print("text here: ", text)
    words = len(text.split())

    print("Parsing ~{} words ({} tokens) from: \"{}\"\n".format(words,len(encoding.encode(text)),fileloc))






    #############################################################
    # Here's where you use GPT to pull out some specific content. To adapt this for other country condition reports, change the topic and/or the country (or anything else you'd like) in the prompt strings (shown in red in Colab). Each prompt block starts with `#try:` and ends with `#column_values.append("NA")`. The name passed to `column_names.append` becomes the column label in the exported CSV file.

    #
    # Note: You should consider combining multiple prompts into a single prompt
    # to avoid making unnecessary api calls. See e.g. Reasoning & Decision below
    #

    if use_LLM:
#prompt #1: Domestic Violence - Mention

      #try:
        # ---------------------------------------------------------
        # does the report mention domestic violence? (yes/no)
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find _if it mentions domestic violence_. That is, whether or not the report mentions domestic violence.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find _if it mentions domestic violence_ in the text of the above, answer simply yes or no.""". format(text)
        #print(prompt_text)
        mentionsDV = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("mentionsDV")
        column_values.append(mentionsDV)
      #except:
        #column_names.append("mentionsDV")
        #column_values.append("NA")

#prompt #2: Domestic Violence - Two Sentences
      #try:
        # ---------------------------------------------------------
        # up to two sentences that reference domestic violence
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find two sentences, if applicable, from the articles that reference domestic violence. That is, whether or not the report has sentences including the words "domestic violence" or "DV", or describes a husband's or partner's violence against their wife or women.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find a reference to domestic violence in the text of the above, answer "NA - No quotes about Domestic Violence". """. format(text)
        #print(prompt_text)
        DVQuote = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("DVQuote")
        column_values.append(DVQuote)
      #except:
        #column_names.append("DVQuote")
        #column_values.append("NA")

#prompt #3: Violence Against Women - Mention
      #try:
        # ---------------------------------------------------------
        # does the report mention violence against women? (yes/no)
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find _if it mentions violence against women_. That is, whether or not the report mentions violence against women.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find _if it mentions violence against women_ in the text of the above, answer simply yes or no.""". format(text)
        #print(prompt_text)
        mentionsVAW = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("mentionsVAW")
        column_values.append(mentionsVAW)
      #except:
        #column_names.append("mentionsVAW")
        #column_values.append("NA")

#prompt #4: Violence Against Women - Two Sentences
      #try:
        # ---------------------------------------------------------
        # up to two sentences that reference violence against women
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find two sentences, if applicable, from the articles that reference violence against women. That is, whether or not the report has sentences including the words "violence against women" or "VAWG".

    Here's the text of the order.

    {}

    ---

   Return your answer below. If you can't find a reference to violence against women in the text of the above, answer "NA - No quotes about Violence Against Women". """. format(text)
        #print(prompt_text)
        VAWQuote = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("VAWQuote")
        column_values.append(VAWQuote)
      #except:
        #column_names.append("VAWQuote")
        #column_values.append("NA")

#prompt #5: Sexual Violence - Mention
      #try:
        # ---------------------------------------------------------
        # does the report mention sexual violence? (yes/no)
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find _if it mentions sexual violence_. That is, whether or not the report mentions sexual violence.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find _if it mentions sexual violence_ in the text of the above, answer simply yes or no.""". format(text)
        #print(prompt_text)
        mentionsSV = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("mentionsSV")
        column_values.append(mentionsSV)
      #except:
        #column_names.append("mentionsSV")
        #column_values.append("NA")

#prompt #6: Sexual Violence - Two Sentences
      #try:
        # ---------------------------------------------------------
        # up to two sentences that reference sexual violence
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find two sentences, if applicable, from the articles that reference sexual violence. That is, whether or not the report has sentences including the words "sexual violence".

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find a reference to sexual violence in the text of the above, answer "NA - No quotes about Sexual Violence". """. format(text)
        #print(prompt_text)
        SVQuote = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("SVQuote")
        column_values.append(SVQuote)
      #except:
        #column_names.append("SVQuote")
        #column_values.append("NA")

#prompt #7: Gender-Based Violence - Mention
      #try:
        # ---------------------------------------------------------
        # does the report mention gender-based violence? (yes/no)
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find _if it mentions gender-based violence_. That is, whether or not the report mentions gender-based violence.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find _if it mentions gender-based violence_ in the text of the above, answer simply yes or no.""". format(text)
        #print(prompt_text)
        mentionsGBV = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("mentionsGBV")
        column_values.append(mentionsGBV)
      #except:
        #column_names.append("mentionsGBV")
        #column_values.append("NA")

#prompt #8: Gender-Based Violence - Two Sentences
      #try:
        # ---------------------------------------------------------
        # up to two sentences that reference gender-based violence
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find two sentences, if applicable, from the articles that reference gender-based violence. That is, whether or not the report has sentences including the words "gender-based violence" or describes men's violence against women, including husbands or partners. The sentences do not need to be in quotation marks.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find a reference to gender-based violence in the text of the above, answer "NA - No quotes about Gender-Based Violence". """. format(text)
        #print(prompt_text)
        GBVQuote = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("GBVQuote")
        column_values.append(GBVQuote)
      #except:
        #column_names.append("GBVQuote")
        #column_values.append("NA")

#prompt #9: Intimate Partner Violence - Mention
      #try:
        # ---------------------------------------------------------
        # does the report mention intimate partner violence? (yes/no)
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find _if it mentions intimate partner violence_. That is, whether or not the report mentions intimate partner violence.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find _if it mentions intimate partner violence_ in the text of the above, answer simply yes or no.""". format(text)
        #print(prompt_text)
        mentionsIPV = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("mentionsIPV")
        column_values.append(mentionsIPV)
      #except:
        #column_names.append("mentionsIPV")
        #column_values.append("NA")

#prompt #10: Intimate Partner Violence - Two Sentences
      #try:
        # ---------------------------------------------------------
        # up to two sentences that reference intimate partner violence
        # ---------------------------------------------------------
        prompt_text = """Below you will be provided with the text of a report on conditions in Uganda. You're looking to find two sentences, if applicable, from the articles that reference intimate partner violence. That is, whether or not the report has sentences including the words "intimate partner violence" or IPV. The sentences do not need to be in quotation marks.

    Here's the text of the order.

    {}

    ---

    Return your answer below. If you can't find a reference to intimate partner violence in the text of the above, answer "NA - No quotes about Intimate Partner Violence". """. format(text)
        #print(prompt_text)
        IPVQuote = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("IPVQuote")
        column_values.append(IPVQuote)
      #except:
        #column_names.append("IPVQuote")
        #column_values.append("NA")

    #############################################################

    # After testing or when working with large numbers, you may want to comment this next bit out

    # Show your work
    for name, datum in zip(column_names, column_values):
        print("{}: {}\n".format(name.upper(),datum))


    # Show cost per run
    if use_LLM:
        print("Tokens used (approx.): {} (API Cost ~${})\n".format(tokens_used,tokens_used*(0.002/1000))) # See https://openai.com/pricing
        token_counts.append(tokens_used)

    print("================================================\n")

    df = pd.concat([df,pd.DataFrame([column_values],columns=column_names)], ignore_index=True,sort=False)

print("Average approx. tokens used per item {} (API Cost ~${})\n".format(np.array(token_counts).mean(),np.array(token_counts).mean()*(0.002/1000))) # See https://openai.com/pricing

display(df)

Parsing ~1411 words (2205 tokens) from: "/content/gdrive/MyDrive/DataExtraction/4 page_COPY_Uganda-2018.pdf"

FILE: 4 page_COPY_Uganda-2018.pdf

MENTIONSDV: No

DVQUOTE: "On May 16, the UPDF spokesperson denied the killing and insisted that the eviction was peaceful."
"The African Center for Treatment and Rehabilitation of Torture Victims (ACTV) reported that through July, it had registered 63 allegations of torture committed by the UPF, seven by the Flying Squad Unit of the UPF, 12 by the UPDF, and three by the Chieftaincy of Military Intelligence (CMI)."

MENTIONSVAW: No

VAWQUOTE: "The African Center for Treatment and Rehabilitation of Torture Victims (ACTV) reported that through July, it had registered 63 allegations of torture committed by the UPF, seven by the Flying Squad Unit of the UPF, 12 by the UPDF, and three by the Chieftaincy of Military Intelligence (CMI)."

"Authorities did not effectively enforce labor laws, due to insufficient resources for monitoring. Local NGOs repo

Unnamed: 0,file,mentionsDV,DVQuote,mentionsVAW,VAWQuote,mentionsSV,SVQuote,mentionsGBV,GBVQuote,mentionsIPV,IPVQuote
0,4 page_COPY_Uganda-2018.pdf,No,"""On May 16, the UPDF spokesperson denied the k...",No,"""The African Center for Treatment and Rehabili...",No,NA - No quotes about Sexual Violence,No,"""The Anti-Torture Act stipulates that any pers...",No,NA - No quotes about Intimate Partner Violence
1,4 page_Uganda_ Violence against women unabated...,Yes,"""In 2014 Desire Luzinda, a celebrated Ugandan ...",Yes,"""Violence against women is on the increase in ...",Yes,"""The 2016 Uganda Demographic and Health Survey...",Yes,"""The 2016 Uganda Demographic and Health Survey...",No,"""In 2014 Desire Luzinda, a celebrated Ugandan ..."
2,5 page_Domestic Violence and the Death Penalty...,Yes,"""Uganda has the tenth highest lifetime prevale...",Yes,"""Uganda has the tenth highest lifetime prevale...",No,NA - No quotes about Sexual Violence,Yes,"""Domestic violence against women in Uganda is ...",Yes,"""Uganda has the tenth highest lifetime prevale..."
3,3 page_Making health services a safe place for...,Yes,"""More than half of all women have experienced ...",Yes,"""Violence against women is a global public hea...",Yes,"""Women from all parts of society experience re...",Yes,"""Violence against women is a global public hea...",Yes,"""More than half of all women have experienced ..."
4,"5 page_Refworld _ Uganda_ Domestic violence, i...",Yes,"""Most women do not report cases of domestic vi...",Yes,"""marital rape is not recognized under the Pena...",No,NA - No quotes about Sexual Violence,Yes,"""Marital rape is not recognized under the Pena...",Yes,"""68 percent of ever-married women aged 15 to 4..."
5,7 page_World Report 2023_ Uganda _ Human Right...,No,"""On May 25, soldiers raided the offices of the...",No,"""On August 3, Uganda’s National Bureau for NGO...",No,NA - No quotes about Sexual Violence,No,Ugandan riot police surround veteran Ugandan o...,No,Police later charged Rukirabashaija with “offe...
6,2 page_Uganda’s violence against women survey ...,No,"""Among the shocking survey findings were that ...",Yes,"""Almost all Ugandan women and girls (95%) had ...",Yes,"""Almost all Ugandan women and girls (95%) had ...",Yes,"""Almost all Ugandan women and girls (95%) had ...",Yes,"""Almost all Ugandan women and girls (95%) had ..."
7,7 page_Uganda_ Freedom in the World 2023 Count...,No,"""Domestic violence is widespread; more than 60...",Yes,"""Rape, extrajudicial violence, and torture and...",No,NA - No quotes about Sexual Violence.,Yes,"""Domestic violence is widespread; more than 60...",No,"""Domestic violence is widespread; more than 60..."
8,7 page_Uganda Policy Hub – None In Three.pdf,Yes,"""56% of women in Uganda aged 15-49 report havi...",Yes,"""56% of women in Uganda aged 15-49 report havi...",Yes,"""56% of women in Uganda aged 15-49 report havi...",Yes,56% of women in Uganda aged 15-49 report havin...,Yes,56% of women in Uganda aged 15-49 report havin...
9,7 page excerpt_Intimate Parter Violence agains...,Yes,"""Sixty-five percent of women in Uganda report ...",Yes,"""Sixty-five percent of women in Uganda report ...",Yes,"""intimate partner sexual violence"" and ""Sexual...",Yes,"""Incorrect attitudes about violence, controlli...",Yes,Sixty-five percent of women in Uganda report e...
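
The note inside the loop suggests combining prompts to cut down on API calls. A minimal sketch of that idea is to ask for every yes/no field in one request and have the model answer in JSON, which can then be split into columns. The function names `build_combined_prompt` and `parse_combined_answer`, the topic dictionary, and the exact wording are illustrative assumptions. This has not been run against the documents above, and any such prompt needs the same auditing as the originals.

```python
import json

# Column name -> topic to ask about (mirrors the yes/no columns used above).
TOPICS = {
    "mentionsDV": "domestic violence",
    "mentionsVAW": "violence against women",
    "mentionsSV": "sexual violence",
    "mentionsGBV": "gender-based violence",
    "mentionsIPV": "intimate partner violence",
}

def build_combined_prompt(text, topics=TOPICS):
    """Build one prompt that asks about every topic and requests a JSON object in reply."""
    questions = "\n".join(
        '- "{}": does the report mention {}?'.format(key, topic)
        for key, topic in topics.items()
    )
    return (
        "Below you will be provided with the text of a report on conditions in Uganda.\n\n"
        "{}\n\n---\n\n"
        "Answer each question with \"yes\" or \"no\":\n{}\n\n"
        "Return a single JSON object using exactly these keys."
    ).format(text, questions)

def parse_combined_answer(raw, topics=TOPICS):
    """Turn the model's JSON reply into column values, using "NA" for any missing key."""
    answers = json.loads(raw)
    return [str(answers.get(key, "NA")) for key in topics]

prompt = build_combined_prompt("Sample report text.")
print(prompt)
print(parse_combined_answer('{"mentionsDV": "yes", "mentionsSV": "no"}'))
```

One request per document instead of five also means the document text is only sent (and billed) once for these fields.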


In [None]:
# If you're happy with the stuff you pulled out above, you can write the df to a csv file
# make sure the path is placing it where you want it!

df.to_csv("/content/gdrive/MyDrive/DataExtraction/CSV Export of Coding Results/Coding of Uganda Domestic Violence.csv", index=False, encoding="utf-8")
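
Given the warning at the top about random audits, one simple way to build that into the workflow is to draw a reproducible random sample of coded documents and check each against its PDF by hand. This sketch uses only the standard library; the helper name `pick_audit_sample` and the 20% default are arbitrary choices for illustration.

```python
import math
import random

def pick_audit_sample(items, fraction=0.2, seed=0):
    """Return a reproducible random subset of `items` (at least one, if any exist)."""
    items = list(items)
    if not items:
        return []
    k = max(1, math.ceil(len(items) * fraction))
    # A dedicated Random instance keeps the sample stable without touching global state.
    return random.Random(seed).sample(items, k)

files = ["report_{}.pdf".format(i) for i in range(10)]
print(pick_audit_sample(files))
```

With the DataFrame built above, `df.sample(n=3, random_state=0)` does the same job on rows.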