<a href="https://colab.research.google.com/github/higherbar-ai/ai-workflows/blob/main/src/example-doc-extraction.ipynb" target="_parent"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>

# About this notebook

This notebook provides an example of how the `ai-workflows` package can be used to extract structured data from an arbitrary number of unstructured documents. It is set up to extract questions from survey forms, but it is designed to be modified to extract any kind of structured data from any kind of document.

If you'd rather not edit the code in this notebook to adapt it to your needs, you can use [the generic version of this notebook that extracts structured data based on an Excel template](https://github.com/higherbar-ai/ai-workflows/blob/main/src/example-doc-extraction-templated.ipynb).

## Configuration

This notebook requires different settings depending on which AI service provider you want to use. If you're running in Google Colab, configure these settings as "secrets": click the key icon in the left sidebar and, once you create a secret, click the toggle to give the notebook access to it. If you're running this notebook in a different environment, set these settings in a `.env` file; the first time you run the notebook, it writes out a template `.env` file for you to fill in and tells you where to find it. The settings listed below are the same in both environments.

If you don't have an API key for an AI provider yet, [see here to learn what that is and how to get one](https://www.linkedin.com/pulse/those-genai-api-keys-christopher-robert-l5rie/).

### OpenAI (direct)

To use OpenAI directly:

* `openai_api_key` - your OpenAI API key (get one from [the OpenAI API key page](https://platform.openai.com/api-keys), and be sure to fund your platform account with at least $5 to allow GPT-4o model access)
* `openai_model` (optional) - the model to use (defaults to `gpt-4o`)

### OpenAI (via Microsoft Azure)

To use OpenAI via Microsoft Azure:

* `azure_api_key` - your Azure API key
* `azure_api_base` - the base URL for the Azure API
* `azure_api_engine` - the engine to use (a.k.a. the "deployment")
* `azure_api_version` - the API version to use

### Anthropic (direct)

To use Anthropic directly:

* `anthropic_api_key` - your Anthropic API key
* `anthropic_model` - the model to use

### LangSmith (for tracing)

Optionally, you can add [LangSmith tracing](https://langchain.com/langsmith):

* `langsmith_api_key` - your LangSmith API key
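
The provider requirements above come down to a simple rule: an OpenAI key on its own, a complete set of all four Azure settings, or an Anthropic key plus a model. The sketch below illustrates that rule. Note that `configured_providers` is a hypothetical helper written for this example, not part of `ai-workflows`, and it works on a plain dictionary rather than reading your real settings.

```python
def configured_providers(settings: dict) -> list:
    """Return the providers that have every setting they need (hypothetical helper)."""
    required = {
        "openai": ["openai_api_key"],
        "azure": ["azure_api_key", "azure_api_base", "azure_api_engine", "azure_api_version"],
        "anthropic": ["anthropic_api_key", "anthropic_model"],
    }
    # a provider counts as configured only if all of its settings are non-empty
    return [name for name, keys in required.items() if all(settings.get(k) for k in keys)]


# placeholder values only: one complete provider (OpenAI) and one incomplete one (Azure)
example = {"openai_api_key": "sk-placeholder", "azure_api_key": "only-one-azure-setting"}
print(configured_providers(example))
```

If the list comes back empty, the initialization step later in this notebook will refuse to run.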

## Setting up the runtime environment

This next code block installs all necessary Python and system packages into the current environment.

**If you're running in Google Colab and it prompts you to restart the notebook in the middle of the installation steps, just click CANCEL.**

In [None]:
# install Google Colab Support and ai_workflows package
%pip install colab-or-not py-ai-workflows[docs]

# download NLTK data
import nltk
nltk.download('punkt', force=True)

# set up our notebook environment (including LibreOffice)
from colab_or_not import NotebookBridge
notebook_env = NotebookBridge(
    system_packages=["libreoffice"],
    config_path="~/.hbai/ai-workflows.env",
    config_template={
        "openai_api_key": "",
        "openai_model": "",
        "azure_api_key": "",
        "azure_api_base": "",
        "azure_api_engine": "",
        "azure_api_version": "",
        "anthropic_api_key": "",
        "anthropic_model": "",
        "langsmith_api_key": "",
    }
)
notebook_env.setup_environment()

## Configuring the document extraction job

This next code block provides the configuration necessary to guide the document extraction process.

**You'll want to modify this section to meet your specific needs.**

In [None]:
# TO-DO: edit the context below to provide a description for each document that you will be processing
json_context = """The file contains the following:

A survey instrument or digital form"""

# TO-DO: edit the job text below to provide a description of each row that you want in your extracted data output
json_job = """Your job is to extract a series of row objects from the file's content, and to return them all in a specific JSON format. Each row object should represent the following:

An individual question or field included in the survey or digital form (with each translation of a single question or field being treated as a separate row)"""

# TO-DO: edit the specification below to customize the list of fields that you want included in each row (these will be columns in your output data)
json_output_spec = """Return JSON with the following fields (and only the following fields):

* `rows` (list): The list of rows extracted, or an empty list if none found. Each row should contain the following fields:

  * `question_id` (string): The numeric or alphanumeric identifier or short variable name identifying the question (if any), usually located just before or at the beginning of the question. "" if none found.

  * `question` (string): The exact text of the question or form field, including any introductory text that provides context or explanation. Often follows a unique question ID of some sort, like "2.01." or "gender:". Should not include response options, which should be included in the 'options' field, or extra enumerator or interviewer instructions (including interview probes), which should be included in the 'instructions' field. Be careful: the same question might be asked in multiple languages, and each translation should be included as a separate row. Never translate between languages or otherwise alter the question text in any way.

  * `instructions` (string): Instructions or other guidance about how to ask or answer the question (if any), including enumerator or interviewer instructions. If the question includes a list of specific response options, do NOT include those in the instructions. However, if there is guidance as to how to fill out an open-ended numeric or text response, or guidance about how to choose among the options, include that guidance here. "" if none found.

  * `options` (string): The list of specific response options for multiple-choice questions in a single string, including both the label and the internal value (if specified) for each option. For example, a 'Male' label might be coupled with an internal value of '1', 'M', or even 'male'. Separate response options with a space, three pipe symbols ('|||'), and another space, and, if there is an internal value, add a space, three # symbols ('###'), and the internal value at the end of the label. For example: 'Male ### 1 ||| Female ### 2' (codes included) or 'Male ||| Female' (no codes); 'Yes ### yes ||| No ### no', 'Yes ### 1 ||| No ### 0', 'Yes ### y ||| No ### n', or 'YES ||| NO'. Do NOT include fill-in-the-blank content here, only multiple-choice options. "" if the question is open-ended (i.e., does not include specific multiple-choice options)."""

# TO-DO (ADVANCED): if information that should be extracted into a single row never spans multiple pages, you can change the following to False in order to process PDF files page-by-page rather than converting them to Markdown format first
markdown_first = True
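
To make the adaptation concrete, here is a sketch of how the three strings above might look for a different kind of document. The invoice scenario and its field names (`description`, `quantity`, `amount`) are illustrative assumptions, not something the `ai-workflows` package defines, and the variables carry an `example_` prefix so that running this block does not overwrite the survey configuration.

```python
# Hypothetical adaptation: extracting line items from invoices instead of survey questions.
# All field names below are illustrative assumptions.
example_json_context = """The file contains the following:

An invoice issued by a vendor"""

example_json_job = """Your job is to extract a series of row objects from the file's content, and to return them all in a specific JSON format. Each row object should represent the following:

An individual line item on the invoice"""

example_json_output_spec = """Return JSON with the following fields (and only the following fields):

* `rows` (list): The list of rows extracted, or an empty list if none found. Each row should contain the following fields:

  * `description` (string): The exact text describing the item or service. "" if none found.

  * `quantity` (string): The quantity billed, exactly as written. "" if none found.

  * `amount` (string): The line total, including any currency symbol, exactly as written. "" if none found."""
```

To use a configuration like this, assign the strings to `json_context`, `json_job`, and `json_output_spec` instead. Keep the top-level `rows` list in the specification, because the extraction step later in this notebook reads the `rows` key of the merged response.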

## Initializing for AI workflows

The next code block initializes the notebook by loading settings and initializing the LLM interface.

In [None]:
from ai_workflows.llm_utilities import LLMInterface
from ai_workflows.document_utilities import DocumentInterface

# read all supported settings
openai_api_key = notebook_env.get_setting('openai_api_key')
openai_model = notebook_env.get_setting('openai_model', 'gpt-4o')
azure_api_key = notebook_env.get_setting('azure_api_key')
azure_api_base = notebook_env.get_setting('azure_api_base')
azure_api_engine = notebook_env.get_setting('azure_api_engine')
azure_api_version = notebook_env.get_setting('azure_api_version')
anthropic_api_key = notebook_env.get_setting("anthropic_api_key")
anthropic_model = notebook_env.get_setting("anthropic_model")
langsmith_api_key = notebook_env.get_setting('langsmith_api_key')

# complain if we don't have the bare minimum to run
if (not openai_api_key
        and not (azure_api_key and azure_api_base and azure_api_engine and azure_api_version)
        and not (anthropic_api_key and anthropic_model)):
    raise Exception('We need settings set for OpenAI access (direct or via Azure) or for Anthropic access (direct). See the instructions above for more details.')

# initialize LLM interface
llm = LLMInterface(
    openai_api_key=openai_api_key,
    openai_model=openai_model,
    azure_api_key=azure_api_key,
    azure_api_base=azure_api_base,
    azure_api_engine=azure_api_engine,
    azure_api_version=azure_api_version,
    temperature=0.0,
    total_response_timeout_seconds=600,
    number_of_retries=2,
    seconds_between_retries=5,
    langsmith_api_key=langsmith_api_key,
    anthropic_api_key=anthropic_api_key,
    anthropic_model=anthropic_model,
)

# initialize our document processor
doc_interface = DocumentInterface(llm_interface=llm)

# report success
print("Initialization successful.")

## Prompting for input files

This next code block prompts you to upload or select the document(s) you want to process. If you only want to process a single document, just upload or select that. If you want to process multiple documents, either upload or select them all individually, or compress them into a single `.zip` file and upload or select that.

In [None]:
# prompt for one or more files to process
files_to_process = notebook_env.get_input_files("Document(s) to process (.zip file for multiple):")

# report out on the files we plan to process
for file_to_process in files_to_process:
    if file_to_process.lower().endswith('.zip'):
        print(f'Will process all files within: {file_to_process}')
    else:
        print(f'Will process: {file_to_process}')

## Extracting data

This next code block processes each file one by one, unzipping `.zip` files into a temporary directory as needed, and aggregates all the results into a single list of rows.

In [None]:
import tempfile
import zipfile
import os

# next, process files, with zip files unzipped into a temporary directory
all_rows = []
with tempfile.TemporaryDirectory() as temp_dir:
    # tally up all files, unzipping as needed
    all_files = []
    for file_to_process in files_to_process:
        if file_to_process.lower().endswith('.zip'):
            # if it's a .zip file, unzip it into the temporary directory
            print(f'Unzipping {file_to_process}')
            with zipfile.ZipFile(file_to_process, 'r') as zip_ref:
                zip_ref.extractall(temp_dir)
        else:
            # just add the file to the list of files to process
            all_files.append(file_to_process)
    # add all unzipped files to the list of files to process (ignoring hidden files)
    for root, dirs, files in os.walk(temp_dir):
        for unzipped_file in files:
            unzipped_file_path = os.path.join(root, unzipped_file)
            if not unzipped_file.startswith('.'):
                all_files.append(unzipped_file_path)

    # process each file
    for file_to_process in all_files:
        filename = os.path.basename(file_to_process)
        print(f'Processing {filename}...')

        # process the file
        all_responses = doc_interface.convert_to_json(file_to_process, json_context, json_job, json_output_spec, markdown_first=markdown_first)

        # combine all responses into a single list of rows
        merged_responses = doc_interface.merge_dicts(all_responses)
        rows = merged_responses.get('rows', [])  # treat a missing 'rows' key as no rows extracted

        # output and save results
        print(f"  Extracted {len(rows)} row{'s' if len(rows) != 1 else ''}")
        all_rows.append((filename, rows))
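
As written, the loop above stops at the first file that raises an exception, leaving any later files unprocessed. One possible way to keep a large batch going is to isolate failures per file. The sketch below shows the pattern with a stand-in function; `process_each` and `fake_extract` are hypothetical names for this example, and in the notebook the function passed in would wrap the real extraction call.

```python
def process_each(paths, process_fn):
    """Apply process_fn to each path, collecting results and failures separately."""
    results, failures = [], []
    for path in paths:
        try:
            results.append((path, process_fn(path)))
        except Exception as exc:  # deliberately broad: one bad file should not stop the batch
            failures.append((path, str(exc)))
    return results, failures


def fake_extract(path):
    # stand-in for the real extraction call, failing on one kind of file
    if path.endswith(".bad"):
        raise ValueError("unreadable file")
    return [{"question": "example"}]


results, failures = process_each(["a.pdf", "b.bad"], fake_extract)
print(f"{len(results)} succeeded, {len(failures)} failed")
```

Reporting the failures at the end makes it easy to re-run only the files that did not go through.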

## Outputting extracted data

This final code block outputs the extracted data to `extracted_data.csv`, with the filename in column A and the extracted data columns thereafter.

If you're running in Google Colab, this `.csv` file will be saved into the content folder. Find, view, or download it by clicking on the folder icon in the left sidebar.

If you're running elsewhere, it will be saved into an `ai-workflows` subdirectory of your home directory.

In [None]:
import csv

# output files to ~/ai-workflows directory if local, otherwise /content if Google Colab
output_path_prefix = notebook_env.get_output_dir(not_colab_dir="~/ai-workflows", colab_subdir="")

# assemble all unique keys in all_rows into a single list
data_columns = []
for _, rows in all_rows:
    for row in rows:
        for k in row.keys():
            if k not in data_columns:
                data_columns.append(k)

# output .csv file with extracted data, with filename in column A and the output columns thereafter
output_csv_path = os.path.join(output_path_prefix, 'extracted_data.csv')
with open(output_csv_path, 'w', newline='', encoding='utf-8') as csvfile:  # explicit UTF-8 so non-English question text is written reliably
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['filename'] + data_columns)
    for filename, rows in all_rows:
        for row in rows:
            csvwriter.writerow([filename] + [row.get(k, '') for k in data_columns])

# report out on the output file
print(f'Extracted data saved to: {output_csv_path}')
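
With the default survey configuration, the `options` column packs each question's response options into one string, using `|||` between options and `###` between a label and its internal value. The sketch below shows one way to split such a string back into pairs for further analysis. `parse_options` is a hypothetical helper, and it assumes the model followed the requested format exactly, which is worth spot-checking in real output.

```python
def parse_options(options: str) -> list:
    """Split an `options` cell into (label, value) pairs; value is None when no code is given."""
    parsed = []
    for chunk in options.split("|||"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, including a fully empty cell
        label, sep, value = chunk.partition("###")
        parsed.append((label.strip(), value.strip() if sep else None))
    return parsed


print(parse_options("Male ### 1 ||| Female ### 2"))
print(parse_options("YES ||| NO"))
```

An open-ended question, whose `options` cell is empty, yields an empty list.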