<a href="https://colab.research.google.com/github/higherbar-ai/ai-workflows/blob/main/src/example-qual-analysis.ipynb" target="_parent"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>

# About this example-qual-analysis-1 notebook

This notebook provides a simple example of an automated AI workflow, designed to run in [Google Colab](https://colab.research.google.com) or a local environment. It uses [the ai-workflows package](https://github.com/higherbar-ai/ai-workflows) to summarize and perform some basic qualitative analysis on a set of documents. The documents might contain interview transcripts, customer service tickets, or any other text-based data. The notebook will:

1. Prompt you to upload or select a `.zip` file with the documents to summarize and analyze

2. Extract the text from each document

3. Summarize each document

4. Identify the top 10 themes present in the documents

5. Code each document with the themes identified

6. Output the results in a series of `.csv` files

See [the ai-workflows GitHub repo](https://github.com/higherbar-ai/ai-workflows) for more details on the `ai_workflows` package.

## Configuration

This notebook requires different settings depending on which AI service provider you want to use. If you're running in Google Colab, configure these settings as "secrets": click the key icon in the left sidebar and, once you create a secret, click its toggle to give the notebook access to it. If you're running the notebook in a different environment, set these settings in a `.env` file; the first time you run the notebook, it writes out a template `.env` file for you to fill in and directs you to its location.

The settings below are the same in either environment.

### OpenAI (direct)

To use OpenAI directly:

* `openai_api_key` - your OpenAI API key (get one from [the OpenAI API key page](https://platform.openai.com/api-keys), and be sure to fund your platform account with at least $5 to allow GPT-4o model access)
* `openai_model` (optional) - the model to use (defaults to `gpt-4o`)

### OpenAI (via Microsoft Azure)

To use OpenAI via Microsoft Azure:

* `azure_api_key` - your Azure API key
* `azure_api_base` - the base URL for the Azure API
* `azure_api_engine` - the engine to use (a.k.a. the "deployment")
* `azure_api_version` - the API version to use

### Anthropic (direct)

To use Anthropic directly:

* `anthropic_api_key` - your Anthropic API key
* `anthropic_model` - the model to use

### LangSmith (for tracing)

Optionally, you can add [LangSmith tracing](https://langchain.com/langsmith):

* `langsmith_api_key` - your LangSmith API key
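
To see which providers a given set of settings would enable, the requirements listed in this section can be sketched as a small check. This is only an illustration: `configured_providers` is a hypothetical helper, not part of `ai_workflows` or `colab-or-not`, and the example key values are placeholders.

```python
def configured_providers(settings: dict) -> list:
    """Return the providers whose required settings are all non-empty."""
    # required setting names per provider, as listed in the Configuration section
    required = {
        "openai": ["openai_api_key"],
        "azure": ["azure_api_key", "azure_api_base", "azure_api_engine", "azure_api_version"],
        "anthropic": ["anthropic_api_key", "anthropic_model"],
    }
    return [
        name
        for name, keys in required.items()
        if all(settings.get(key) for key in keys)
    ]


# example: Azure is only partly configured, so only OpenAI is reported
example_settings = {
    "openai_api_key": "sk-example",
    "azure_api_key": "abc123",
    "azure_api_base": "",
}
print(configured_providers(example_settings))  # ['openai']
```

At least one provider needs to be fully configured for the notebook to run; LangSmith tracing is optional and does not count as a provider.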

## Your documents

The notebook will prompt you to select or upload a `.zip` file. It should contain all the documents you would like to summarize and analyze.

## Setting up the runtime environment

This next code block installs all necessary Python and system packages into the current environment.

**If you're running in Google Colab and it prompts you to restart the notebook in the middle of the installation steps, just click CANCEL.**

In [None]:
# install Google Colab support and ai_workflows package
%pip install colab-or-not py-ai-workflows[docs]

# download NLTK data
import nltk
nltk.download('punkt', force=True)

# set up our notebook environment (including LibreOffice)
from colab_or_not import NotebookBridge
notebook_env = NotebookBridge(
    system_packages=["libreoffice"],
    config_path="~/.hbai/ai-workflows.env",
    config_template={
        "openai_api_key": "",
        "openai_model": "",
        "azure_api_key": "",
        "azure_api_base": "",
        "azure_api_engine": "",
        "azure_api_version": "",
        "anthropic_api_key": "",
        "anthropic_model": "",
        "langsmith_api_key": "",
    }
)
notebook_env.setup_environment()
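
If you're running locally and want to confirm that LibreOffice is reachable after setup, a quick check along these lines may help. It is a rough sketch: `find_libreoffice` is a hypothetical helper, and it assumes the executable is on your `PATH` under one of the names `soffice` or `libreoffice`, which may not hold on every system.

```python
import shutil


def find_libreoffice():
    """Return the path of a LibreOffice executable found on PATH, or None."""
    # assumed executable names; your installation may use a different one
    for candidate in ("soffice", "libreoffice"):
        path = shutil.which(candidate)
        if path:
            return path
    return None


libreoffice_path = find_libreoffice()
print(libreoffice_path if libreoffice_path else "LibreOffice executable not found on PATH")
```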

## Initializing for AI workflows

The next code block loads your settings, then initializes the LLM interface and the document processors.

In [None]:
from ai_workflows.llm_utilities import LLMInterface
from ai_workflows.document_utilities import DocumentInterface

# read all supported secrets
openai_api_key = notebook_env.get_setting('openai_api_key')
openai_model = notebook_env.get_setting('openai_model', 'gpt-4o')
azure_api_key = notebook_env.get_setting('azure_api_key')
azure_api_base = notebook_env.get_setting('azure_api_base')
azure_api_engine = notebook_env.get_setting('azure_api_engine')
azure_api_version = notebook_env.get_setting('azure_api_version')
anthropic_api_key = notebook_env.get_setting("anthropic_api_key")
anthropic_model = notebook_env.get_setting("anthropic_model")
langsmith_api_key = notebook_env.get_setting('langsmith_api_key')

# complain if we don't have the bare minimum to run
if (not openai_api_key
        and not (azure_api_key and azure_api_base and azure_api_engine and azure_api_version)
        and not (anthropic_api_key and anthropic_model)):
    raise Exception('We need settings set for OpenAI access (direct or via Azure) or for Anthropic access (direct). See the instructions above for more details.')

# initialize LLM interface
llm = LLMInterface(
    openai_api_key=openai_api_key,
    openai_model=openai_model,
    azure_api_key=azure_api_key,
    azure_api_base=azure_api_base,
    azure_api_engine=azure_api_engine,
    azure_api_version=azure_api_version,
    temperature=0.0,
    total_response_timeout_seconds=600,
    number_of_retries=2,
    seconds_between_retries=5,
    langsmith_api_key=langsmith_api_key,
    anthropic_api_key=anthropic_api_key,
    anthropic_model=anthropic_model,
)

# initialize two document processors, one with an LLM and one without
doc_processor = DocumentInterface()
doc_processor_llm = DocumentInterface(llm_interface=llm)

# set max tokens to consider from each document (120,000 tokens is about 90,000 words or 180 pages)
max_doc_tokens = 120000

# report success
print("Initialization successful.")

## Prompting for your documents

This next code block prompts you to upload or select a `.zip` file with the documents to summarize and analyze.

If you don't have any documents handy, you can use [this set of example interview transcripts](https://github.com/higherbar-ai/ai-workflows/blob/main/resources/sample_orda_interviews.zip). These come from the *Fostering cultures of open qualitative research* project and were originally retrieved [from the ORDA repository here](https://orda.shef.ac.uk/articles/dataset/Fostering_cultures_of_open_qualitative_research_Dataset_2_Interview_Transcripts/23567223).

In [None]:
# prompt for a single .zip file and keep prompting till we get one
file_path = ""
while True:
    # prompt for a .zip file
    selected_files = notebook_env.get_input_files(".zip file with documents:")

    # complain if we didn't get just a single file
    if len(selected_files) != 1 or not selected_files[0].lower().endswith('.zip'):
        print()
        print('Please upload a single .zip file to continue.')
        print()
    else:
        # fetch the path of the uploaded file
        file_path = selected_files[0]
        
        # break from loop
        break

# report results
print()
print(f"Will process this .zip file: {file_path}")
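
Before spending LLM calls, it can be worth confirming that the selected file really is a `.zip` archive and seeing which documents it contains. The sketch below uses only the standard library; `list_zip_documents` is a hypothetical helper, and it applies the same hidden-file rule (names starting with `.`) that the summarization step uses.

```python
import os
import tempfile
import zipfile


def list_zip_documents(zip_path: str) -> list:
    """Return the non-hidden file names inside a .zip archive, or raise ValueError."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid .zip archive: {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        return [
            name
            for name in zip_ref.namelist()
            # skip directory entries and hidden files (names starting with '.')
            if not name.endswith("/") and not os.path.basename(name).startswith(".")
        ]


# small demonstration with a throwaway archive
with tempfile.TemporaryDirectory() as demo_dir:
    demo_zip = os.path.join(demo_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("interview1.txt", "Hello")
        zf.writestr("notes/.hidden", "ignore me")
    demo_names = list_zip_documents(demo_zip)

print(demo_names)  # ['interview1.txt']
```

With your own archive, you could call `list_zip_documents(file_path)` to preview the documents that will be processed.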

## Summarizing the documents

The next code block runs through each document, using the LLM to summarize it. Feel free to adjust the instructions used to guide the summarization to meet your needs.

What we summarize here is the raw text extracted from each document. If your documents include figures, images, or complex layouts, you may want to have an LLM read each document instead, which gives higher-quality results but is slower and more expensive. To do this, change `doc_processor.convert_to_markdown(unzipped_file_path)` to `doc_processor_llm.convert_to_markdown(unzipped_file_path)` in the code block below.


In [None]:
import tempfile
import zipfile
import os

# create a list to store document details
documents = []

# create a temporary directory
with tempfile.TemporaryDirectory() as temp_dir:
    # unzip the .zip file into the temporary directory
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(temp_dir)

    # loop through each file in the temporary directory
    for root, dirs, files in os.walk(temp_dir):
        for unzipped_file in files:
            unzipped_file_path = os.path.join(root, unzipped_file)
            if unzipped_file.startswith('.'):
                # skip hidden files
                continue

            print()
            print(f"Reading {unzipped_file}...")
            # to have an LLM read the file instead (higher quality, but slower and more expensive), use the following line:
            # doc_text = doc_processor_llm.convert_to_markdown(unzipped_file_path)
            doc_text = doc_processor.convert_to_markdown(unzipped_file_path)

            print(f"Summarizing {unzipped_file}...")

            # provide context so that the LLM knows what it's looking at
            json_context = "The file contains a document being processed for qualitative analysis."

            # provide a summary of the job to be done (in this case, extracting a title and summarizing the document for qualitative analysis)
            json_job = """Your job is to:

1. Extract or create an appropriate title for the document

2. Summarize the document's contents in a short, concise form that is between one and three paragraphs in length. The summarized form should retain those details from the original file necessary to inform the qualitative analysis (subjects discussed, themes, key points, etc.).

Only give a truthful and faithful title and summary. Never invent any details or expand upon the file's contents in any way (except in creating a title if the document doesn't begin with an appropriate title)."""

            # provide the exact JSON format expected in the output
            json_output_spec = """Return JSON with the following fields (and only the following fields):

* `title` (string): A suitable title for the document. If a suitable title appears near the beginning of the document, use that exactly. Otherwise, do your best to create an appropriate title based on the content of the document.

* `summary` (string): A short, concise summary of the document's contents, between one and three paragraphs in length. The summary should retain those details from the original file necessary to inform the qualitative analysis (subjects discussed, themes, key points, etc.)."""

            # process the file
            all_responses = doc_processor_llm.markdown_to_json(
                markdown=doc_text,
                json_context=json_context,
                json_job=json_job,
                json_output_spec=json_output_spec,
                max_chunk_size=max_doc_tokens,
            )
            response = all_responses[0]
            if len(all_responses) > 1:
                # if we had to split the document to summarize it in pieces, we're just going to go with the first piece for simplicity
                # (if we wanted, we could use the first title and use the LLM to combine the separate summaries into a single summary)
                total_doc_tokens = llm.count_tokens(doc_text)
                print(f"  Warning: only summarized first {max_doc_tokens} of {total_doc_tokens} tokens in document")

            # save results
            documents.append({
                "file": unzipped_file,
                "text": doc_text,
                "title": response['title'],
                "summary": response['summary']
            })

print()
print(f"Completed summarization of {len(documents)} documents.")
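
The code above keeps only the first piece when a document is too long to summarize in one pass. If you would rather keep every piece, one simple option is to take the first title and join the per-piece summaries. The sketch below assumes each response is a dict with `title` and `summary` fields, as requested in the output spec; `merge_chunk_responses` is a hypothetical helper, not part of `ai_workflows`.

```python
def merge_chunk_responses(all_responses: list) -> dict:
    """Combine per-chunk {'title', 'summary'} dicts into one title and summary."""
    if not all_responses:
        return {"title": "", "summary": ""}
    # keep the first chunk's title, since a document's title usually appears near its start
    title = all_responses[0].get("title", "")
    # join the non-empty chunk summaries in document order, separated by blank lines
    summaries = [response.get("summary", "").strip() for response in all_responses]
    return {"title": title, "summary": "\n\n".join(s for s in summaries if s)}


example = merge_chunk_responses([
    {"title": "Interview 12", "summary": "Part one."},
    {"title": "Untitled", "summary": "Part two."},
])
print(example["title"])    # Interview 12
print(example["summary"])  # Part one., then a blank line, then Part two.
```

A joined summary can run longer than the one to three paragraphs requested per piece, so you may still want to ask the LLM to condense it as a further step.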

## Identifying the top 10 subjects or themes

This next code block will use the document titles and summaries to identify the top 10 subjects or themes present in the documents. Feel free to adjust the instructions used to guide the theme identification to meet your needs.

If the total number of tokens in the titles and summaries exceeds our limit (120,000 tokens, which is about 90,000 words or 180 pages), we'll truncate the text to fit within the limit. We could also use the LLM to combine the results from multiple runs if we wanted to analyze the full text, but the first 180 pages of document summaries are likely far more than enough to identify the top themes.
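
The word and page figures quoted for the limit are rough conversions. The sketch below reproduces the arithmetic using the ratios those figures imply (about 0.75 words per token and 500 words per page); both ratios are approximations that vary with language and formatting, and `estimate_words_and_pages` is a hypothetical helper.

```python
WORDS_PER_TOKEN = 0.75  # implied by 120,000 tokens being about 90,000 words
WORDS_PER_PAGE = 500    # implied by 90,000 words being about 180 pages


def estimate_words_and_pages(tokens: int) -> tuple:
    """Convert a token count into approximate word and page counts."""
    words = tokens * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    return round(words), round(pages)


print(estimate_words_and_pages(120000))  # (90000, 180)
```

If you change `max_doc_tokens`, the same function gives a rough sense of how much text the new limit covers.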

In [None]:
# assemble all titles and summaries into a single text block
all_summaries = ""
for document in documents:
    all_summaries += f"**{document['title']}**\n{document['summary']}\n\n"

# truncate the summaries if needed (to avoid overflowing LLM context window)
summaries_tokens = llm.count_tokens(all_summaries)
if summaries_tokens > max_doc_tokens:
    all_summaries = llm.enforce_max_tokens(all_summaries, max_doc_tokens)
    print()
    print(f"  Warning: only considering first {max_doc_tokens} of {summaries_tokens} tokens in list of document titles and summaries")
    print()

print()
print("Identifying themes...")

# describe the job we want the LLM to do
job_description = """I'm performing a qualitative analysis on a set of documents. All of the document titles and summaries are included below. Your job is to identify the top 10 themes or subjects discussed in the documents, returning them in the exact JSON format described below. The top 10 you choose should be based on the frequency with which they appear in the documents summarized below. Perform the analysis as if you were an expert in qualitative analysis trained in a top graduate program."""

# describe the exact JSON format we expect back from the LLM
json_output_spec = """Return JSON with the following fields (and only the following fields):

* `themes` (list of objects): The list of top themes or subjects discussed in the documents, each of which should be an object with the following fields:

    * `id` (string): A short, concise identifier for the theme or subject. This should be alphanumeric and contain no spaces or special characters.

    * `description` (string): A short description of the theme or subject. This can be as short as a single sentence fragment or as long as a short paragraph, depending on the complexity of the subject or theme. The description should be sufficient for a human qualitative coder to be able to code individual documents."""

# assemble the overall prompt
json_prompt = f"""{job_description}

{json_output_spec}

All document titles and summaries enclosed by |@| delimiters:

|@|{all_summaries}|@|

Your JSON response precisely following the instructions given above the titles and summaries:"""

# execute the LLM query, with automatic JSON validation+retry
parsed_response, raw_response, error = llm.get_json_response(prompt=json_prompt, json_validation_desc=json_output_spec)

# save and report results
if error:
    print()
    print(f"Error: {error}")

    themes = []
else:
    themes = parsed_response['themes']
    print()
    print(f"Identified {len(themes)} themes:")
    for theme in themes:
        print()
        print(f"  {theme['id']}: {theme['description']}")
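
The JSON validation checks the response format, but it can still be useful to sanity-check the themes themselves before using them as column names later on. The sketch below checks the count, the alphanumeric `id` requirement, duplicate ids, and empty descriptions; `check_themes` is a hypothetical helper, and it assumes each theme is a dict with string `id` and `description` fields.

```python
def check_themes(themes: list, expected_count: int = 10) -> list:
    """Return a list of human-readable problems found in the themes list."""
    problems = []
    if len(themes) != expected_count:
        problems.append(f"expected {expected_count} themes, got {len(themes)}")
    seen_ids = set()
    for index, theme in enumerate(themes):
        theme_id = theme.get("id", "")
        # str.isalnum() is False for empty strings and for ids with spaces or punctuation
        if not theme_id.isalnum():
            problems.append(f"theme {index}: id {theme_id!r} is not alphanumeric")
        if theme_id in seen_ids:
            problems.append(f"theme {index}: duplicate id {theme_id!r}")
        seen_ids.add(theme_id)
        if not theme.get("description", "").strip():
            problems.append(f"theme {index}: empty description")
    return problems


example_themes = [
    {"id": "DataSharing", "description": "Attitudes toward sharing qualitative data."},
    {"id": "data sharing", "description": ""},
]
example_problems = check_themes(example_themes, expected_count=2)
for problem in example_problems:
    print(problem)
```

An empty result means the list passed all four checks; with the notebook's own results you could call `check_themes(themes)`.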

## Coding each document with the identified themes, extracting example quotes

This next code block will code each document with the themes identified and extract one example quote from each document. Feel free to adjust the instructions used to guide the coding and quote extraction to meet your needs.

If we wanted this to be faster and cheaper, we could code documents based on their summaries rather than going back to their full text. However, going back to the full text is more thorough, and it lets us extract example quotes from the original text.

In [None]:
import json

# if no themes identified, raise an error
if not themes:
    raise Exception("No themes identified, so we can't code the documents")

# serialize the themes as JSON for the prompt (rather than relying on Python's repr of the list)
themes_json = json.dumps(themes, indent=2, ensure_ascii=False)

# loop through each document
error_count = 0
last_error = ""
for document in documents:
    print()
    print(f"Coding {document['file']}...")

    # get document text and truncate if needed
    doc_text = document['text']
    doc_tokens = llm.count_tokens(doc_text)
    if doc_tokens > max_doc_tokens:
        doc_text = llm.enforce_max_tokens(doc_text, max_doc_tokens)
        print(f"  Warning: only considering first {max_doc_tokens} of {doc_tokens} tokens in document")

    # describe the job we want the LLM to do
    job_description = f"""I'm performing a qualitative analysis on a set of documents. I'm going to give you the text from one of the documents, and I need you to do two things:

1. Identify which of the themes or subjects (if any) is present in the document, based on the JSON list I provide. Be sure to include all themes or subjects that are present in the document, even if they are only mentioned once.

2. If you identified at least one theme or subject present in the document, provide an exact quote from the document that exemplifies or supports one of the themes or subjects that you identified. This quote should be suitable for use in the final analysis, as an example of how the theme or subject appears in the document's text.

Perform the analysis as if you were an expert in qualitative analysis trained in a top graduate program.

This is the list of themes or subjects in JSON format:

```
{themes_json}
```"""

    # describe the exact JSON format we expect back from the LLM
    json_output_spec = """Return JSON with the following fields (and only the following fields):

* `themes` (list of objects): The list of ALL themes or subjects discussed in the document (be sure not to miss any), each of which should be an object with the following fields:

    * `id` (string): The short identifier for the theme or subject (from the list provided).

* `exact_quote` (string): An exact quote from the document that exemplifies or supports one of the themes or subjects identified in the `themes` list (or "" if no themes identified). Try to choose a quote that does a good job of illustrating the theme or subject, if possible.

* `exact_quote_theme_id` (string): The short identifier for the theme or subject that the `exact_quote` exemplifies or supports (must match an `id` supplied in the `themes` list, or "" if no themes identified)."""

    # assemble the overall prompt
    json_prompt = f"""{job_description}

{json_output_spec}

The full document text enclosed by |@| delimiters:

|@|{doc_text}|@|

Your JSON response precisely following the instructions given above the document text:"""

    # execute the LLM query, with automatic JSON validation+retry
    parsed_response, raw_response, error = llm.get_json_response(prompt=json_prompt, json_validation_desc=json_output_spec)

    # save and report results
    if error:
        document['themes'] = []
        document['example_quote'] = ""
        document['example_quote_theme_id'] = ""

        print(f"  Error during coding: {error}")

        error_count += 1
        last_error = error
    else:
        document['themes'] = parsed_response['themes']
        document['example_quote'] = parsed_response['exact_quote']
        document['example_quote_theme_id'] = parsed_response['exact_quote_theme_id']

        print(f"  Coded with themes: {', '.join([theme['id'] for theme in parsed_response['themes']])}")

# report overall results
if error_count > 0:
    print()
    print("Some documents could not be coded due to errors.")
    print(f"  Total errors encountered: {error_count}")
    print(f"  Last error encountered: {last_error}")
else:
    print()
    print(f"All {len(documents)} documents coded successfully.")
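
The prompt asks for an exact quote, but an LLM may occasionally paraphrase or stitch text together. A simple check is to confirm that each quote actually appears in the document text. The sketch below ignores differences in case and whitespace; `quote_appears_in` is a hypothetical helper, and it will not catch differences such as straight versus curly quotation marks.

```python
import re


def _normalize(text: str) -> str:
    """Collapse runs of whitespace to single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears_in(document_text: str, quote: str) -> bool:
    """Return True if the quote occurs in the document, ignoring case and whitespace."""
    normalized_quote = _normalize(quote)
    # an empty quote never counts as a match
    return bool(normalized_quote) and normalized_quote in _normalize(document_text)


sample_text = "We were   worried\nabout consent, but sharing helped."
print(quote_appears_in(sample_text, "worried about consent"))  # True
print(quote_appears_in(sample_text, "worried about privacy"))  # False
```

With the notebook's own results, you could call `quote_appears_in(document['text'], document['example_quote'])` for each document and review any that return `False`.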

## Organizing and outputting the results

This final code block organizes the results and saves them in two `.csv` files: `example-qual-analysis-1-themes.csv` (the list of themes) and `example-qual-analysis-1-documents.csv` (one row per document, with its title, summary, example quote, and a 0/1 column for each theme).

If you're running in Google Colab, these files will be saved into the content folder. Find, view, and download them by clicking on the folder icon in the left sidebar.

If you're running elsewhere, they will be saved into an `ai-workflows` subdirectory of your user home directory.

In [None]:
import os
import csv

# output files to ~/ai-workflows directory if local, otherwise /content if Google Colab
output_path_prefix = notebook_env.get_output_dir(not_colab_dir="~/ai-workflows", colab_subdir="")

# output theme list to UTF-8 .csv file
themes_output_file = os.path.join(output_path_prefix, "example-qual-analysis-1-themes.csv")
with open(themes_output_file, "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["id", "description"])
    for theme in themes:
        writer.writerow([theme['id'], theme['description']])

# output document list to UTF-8 .csv file in wide format
docs_output_file = os.path.join(output_path_prefix, "example-qual-analysis-1-documents.csv")
theme_ids = [theme['id'] for theme in themes]
with open(docs_output_file, "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)

    # write header
    writer.writerow(["file", "title", "summary", "example_quote", "example_quote_theme_id"] + theme_ids)

    # write document data
    for document in documents:
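        # collect this document's theme ids once, then mark each known theme as present (1) or absent (0)
        document_theme_ids = {t['id'] for t in document['themes']}
        theme_presence = [1 if theme_id in document_theme_ids else 0 for theme_id in theme_ids]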
        writer.writerow([
            document['file'],
            document['title'],
            document['summary'],
            document['example_quote'],
            document['example_quote_theme_id']
        ] + theme_presence)

print()
print(f"Themes saved to {themes_output_file}")
print(f"Document summaries, codes, and example quotes saved to {docs_output_file}")
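
Because the documents file is in wide format, with one 0/1 column per theme, counting how many documents mention each theme only requires summing those columns. The sketch below does this with the standard `csv` module on a small in-memory sample; `count_theme_columns` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io

# the non-theme columns written by the code block above
FIXED_COLUMNS = ["file", "title", "summary", "example_quote", "example_quote_theme_id"]


def count_theme_columns(csv_file, fixed_columns) -> dict:
    """Sum the 0/1 theme columns of a wide-format documents CSV."""
    reader = csv.DictReader(csv_file)
    # every column that is not one of the fixed columns is treated as a theme
    theme_columns = [name for name in (reader.fieldnames or []) if name not in fixed_columns]
    counts = {name: 0 for name in theme_columns}
    for row in reader:
        for name in theme_columns:
            counts[name] += int(row[name] or 0)
    return counts


sample_csv = io.StringIO(
    "file,title,summary,example_quote,example_quote_theme_id,Consent,Trust\n"
    "a.txt,A,Summary A,quote,Consent,1,0\n"
    "b.txt,B,Summary B,quote,Trust,1,1\n"
)
sample_counts = count_theme_columns(sample_csv, FIXED_COLUMNS)
print(sample_counts)  # {'Consent': 2, 'Trust': 1}
```

To run this on the real output, open `docs_output_file` with `encoding="utf-8"` and `newline=""` and pass the file object in place of `sample_csv`.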