<a href="https://colab.research.google.com/github/higherbar-ai/ai-workflows/blob/main/src/example-qual-analysis-1.ipynb" target="_parent"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>

# About this example-qual-analysis-1 notebook

This notebook provides a simple example of an automated AI workflow, designed to run in [Google Colab](https://colab.research.google.com) or a local environment. It uses [the ai-workflows package](https://github.com/higherbar-ai/ai-workflows) to perform some basic qualitative analysis on a set of interview transcripts. The notebook will:

1. Prompt you to upload or select a `.zip` file with one document per interview transcript

2. Extract the text from each transcript

3. Summarize each transcript

4. Develop and refine a codebook for coding the transcripts

5. Code each document

6. Output the results in two `.csv` files

You can easily adapt this workflow to other kinds of document-based qualitative analyses. Just adjust the text that guides the LLM in summarizing the documents, developing the codebook, and coding the documents.

See [the ai-workflows GitHub repo](https://github.com/higherbar-ai/ai-workflows) for more details on the `ai_workflows` package.

## Configuration

This notebook needs different settings depending on which AI service provider you want to use. If you're running in Google Colab, configure these settings as "secrets": click the key icon in the left sidebar and, once you create a secret, click the toggle to give the notebook access to it. If you're running this notebook in a different environment, put the settings in a `.env` file. The first time you run the notebook, it writes out a template `.env` file for you to fill in and tells you where to find it.

The settings are the same in every environment.

### OpenAI (direct)

To use OpenAI directly:

* `openai_api_key` - your OpenAI API key (get one from [the OpenAI API key page](https://platform.openai.com/api-keys), and be sure to fund your platform account with at least $5 to allow GPT-4o model access)
* `openai_model` (optional) - the model to use (defaults to `gpt-4o`)

### OpenAI (via Microsoft Azure)

To use OpenAI via Microsoft Azure:

* `azure_api_key` - your Azure API key
* `azure_api_base` - the base URL for the Azure API
* `azure_api_engine` - the engine to use (a.k.a. the "deployment")
* `azure_api_version` - the API version to use

### Anthropic (direct)

To use Anthropic directly:

* `anthropic_api_key` - your Anthropic API key
* `anthropic_model` - the model to use

### LangSmith (for tracing)

Optionally, you can add [LangSmith tracing](https://langchain.com/langsmith):

* `langsmith_api_key` - your LangSmith API key
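
Putting the provider settings above together: the notebook needs at least one fully configured provider before it can run. The following is a minimal sketch of that rule, assuming the settings are available as a plain dict. The `configured_providers` helper and `REQUIRED_SETTINGS` table are hypothetical illustrations, not part of the `ai_workflows` package.

```python
# Hypothetical helper (not part of ai_workflows): report which providers
# have every setting they need, mirroring the notebook's own minimum check.
REQUIRED_SETTINGS = {
    "openai": ["openai_api_key"],  # openai_model is optional
    "azure": ["azure_api_key", "azure_api_base", "azure_api_engine", "azure_api_version"],
    "anthropic": ["anthropic_api_key", "anthropic_model"],
}


def configured_providers(settings: dict) -> list:
    """Return the names of providers whose required settings are all non-empty."""
    return [
        provider
        for provider, keys in REQUIRED_SETTINGS.items()
        if all(settings.get(key) for key in keys)
    ]


# placeholder values for illustration only
example_settings = {
    "openai_api_key": "",
    "azure_api_key": "placeholder-key",
    "azure_api_base": "",  # missing, so Azure is incomplete
    "anthropic_api_key": "placeholder-key",
    "anthropic_model": "placeholder-model",
}
print(configured_providers(example_settings))
```

An empty result means the notebook would stop with an error during initialization.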

## Your documents

The notebook will prompt you to select or upload a `.zip` file. It should contain all the transcripts you would like to summarize and analyze.

## Setting up the runtime environment

This next code block installs all necessary Python and system packages into the current environment.

**If you're running in Google Colab and it prompts you to restart the notebook in the middle of the installation steps, just click CANCEL.**

In [None]:
# install Google Colab support and ai_workflows package
%pip install colab-or-not py-ai-workflows[docs]

# download NLTK data
import nltk
nltk.download('punkt', force=True)

# set up our notebook environment (including LibreOffice)
from colab_or_not import NotebookBridge
notebook_env = NotebookBridge(
    system_packages=["libreoffice"],
    config_path="~/.hbai/ai-workflows.env",
    config_template={
        "openai_api_key": "",
        "openai_model": "",
        "azure_api_key": "",
        "azure_api_base": "",
        "azure_api_engine": "",
        "azure_api_version": "",
        "anthropic_api_key": "",
        "anthropic_model": "",
        "langsmith_api_key": "",
    }
)
notebook_env.setup_environment()

## Initializing for AI workflows

The next code block initializes the notebook by loading settings and initializing the LLM interface.

In [None]:
from ai_workflows.llm_utilities import LLMInterface
from ai_workflows.document_utilities import DocumentInterface

# read all supported secrets
openai_api_key = notebook_env.get_setting('openai_api_key')
openai_model = notebook_env.get_setting('openai_model', 'gpt-4o')
azure_api_key = notebook_env.get_setting('azure_api_key')
azure_api_base = notebook_env.get_setting('azure_api_base')
azure_api_engine = notebook_env.get_setting('azure_api_engine')
azure_api_version = notebook_env.get_setting('azure_api_version')
anthropic_api_key = notebook_env.get_setting("anthropic_api_key")
anthropic_model = notebook_env.get_setting("anthropic_model")
langsmith_api_key = notebook_env.get_setting('langsmith_api_key')

# complain if we don't have the bare minimum to run
if (not openai_api_key
        and not (azure_api_key and azure_api_base and azure_api_engine and azure_api_version)
        and not (anthropic_api_key and anthropic_model)):
    raise Exception('We need settings set for OpenAI access (direct or via Azure) or for Anthropic access (direct). See the instructions above for more details.')

# initialize LLM interface
llm = LLMInterface(
    openai_api_key=openai_api_key,
    openai_model=openai_model,
    azure_api_key=azure_api_key,
    azure_api_base=azure_api_base,
    azure_api_engine=azure_api_engine,
    azure_api_version=azure_api_version,
    anthropic_api_key=anthropic_api_key,
    anthropic_model=anthropic_model,
    langsmith_api_key=langsmith_api_key,
    temperature=0.0,
    total_response_timeout_seconds=600,
    number_of_retries=2,
    seconds_between_retries=5,
)

# initialize two document processors, one with an LLM and one without
doc_processor = DocumentInterface()
doc_processor_llm = DocumentInterface(llm_interface=llm)

# set max tokens to consider from each document (120,000 tokens is about 90,000 words or 180 pages)
max_doc_tokens = 120000

# report success
print("Initialization successful.")

## Prompting for your transcripts

This next code block prompts you to upload or select a `.zip` file with the transcripts to summarize and analyze (assumes one transcript per document).

If you don't have any transcripts handy, you can use [this set of example interview transcripts](https://github.com/higherbar-ai/ai-workflows/blob/main/resources/sample_orda_interviews.zip). These come from the *Fostering cultures of open qualitative research* project and were originally retrieved [from the ORDA repository here](https://orda.shef.ac.uk/articles/dataset/Fostering_cultures_of_open_qualitative_research_Dataset_2_Interview_Transcripts/23567223).
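
The code block below accepts a selection only if it is a single file whose name ends in `.zip`. If you want a stricter check, one option is to also confirm that the file really is a zip archive. The sketch below uses only the Python standard library; the `is_single_zip` helper is a hypothetical name introduced here for illustration.

```python
import os
import tempfile
import zipfile


def is_single_zip(selected_files: list) -> bool:
    """True if exactly one file was selected and it is a readable .zip archive."""
    if len(selected_files) != 1:
        return False
    path = selected_files[0]
    # check the extension (case-insensitively) and the file's actual contents
    return path.lower().endswith(".zip") and zipfile.is_zipfile(path)


# demonstrate with a throwaway archive and a throwaway text file
with tempfile.TemporaryDirectory() as demo_dir:
    zip_path = os.path.join(demo_dir, "transcripts.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("interview-1.txt", "Interviewer: Hello...")
    txt_path = os.path.join(demo_dir, "notes.txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("not an archive")

    zip_ok = is_single_zip([zip_path])
    txt_ok = is_single_zip([txt_path])
    both_ok = is_single_zip([zip_path, txt_path])

print(zip_ok, txt_ok, both_ok)
```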

In [None]:
# prompt for a single .zip file and keep prompting till we get one
file_path = ""
while True:
    # prompt for a .zip file
    selected_files = notebook_env.get_input_files(".zip file with transcripts:")

    # complain if we didn't get just a single file
    if len(selected_files) != 1 or not selected_files[0].lower().endswith('.zip'):
        print()
        print('Please upload a single .zip file with your transcripts to continue.')
        print()
    else:
        # fetch the path of the uploaded file
        file_path = selected_files[0]
        
        # break from loop
        break

# report results
print()
print(f"Will process transcripts in this .zip file: {file_path}")

## Summarizing the transcripts

The next code block runs through each transcript, using the LLM to summarize it. Feel free to adjust the instructions used to guide the summarization to meet your needs.

*Note:* What we summarize here is the raw text extracted from each document. If your documents include figures, images, or complex layouts, you may want to have an LLM read them instead, which gives higher-quality results but is slower and more expensive. To do this, change `doc_processor.convert_to_markdown(unzipped_file_path)` to `doc_processor_llm.convert_to_markdown(unzipped_file_path)` in the code block below.
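
One limitation of the code block below: when a transcript is too long to process in one pass, it is split into pieces and only the first piece's summary is kept. A minimal sketch of one alternative, joining the per-piece summaries with part labels, might look like this. The `combine_chunk_summaries` helper is hypothetical, and it assumes that every element of `all_responses` is a dict with a `summary` field, as the first element is in the notebook.

```python
def combine_chunk_summaries(all_responses: list) -> str:
    """Join per-piece summaries into one labeled summary string."""
    summaries = [r.get("summary", "").strip() for r in all_responses]
    summaries = [s for s in summaries if s]  # drop empty summaries
    if len(summaries) <= 1:
        # nothing to combine: return the lone summary (or an empty string)
        return summaries[0] if summaries else ""
    total = len(summaries)
    return "\n\n".join(
        f"[Part {i} of {total}] {s}" for i, s in enumerate(summaries, start=1)
    )


# made-up responses standing in for the output of a split document
demo_responses = [
    {"summary": "The participant describes their role as a data manager."},
    {"summary": "They raise concerns about consent and anonymization."},
]
combined = combine_chunk_summaries(demo_responses)
print(combined)
```

A simple join keeps every piece but can be repetitive; asking the LLM to merge the pieces into one summary is another option, at the cost of an extra call per long document.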


In [None]:
import tempfile
import zipfile
import os

# create a list to store document details
documents = []

# create a temporary directory
with tempfile.TemporaryDirectory() as temp_dir:
    # unzip the .zip file into the temporary directory
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(temp_dir)

    # loop through each file in the temporary directory
    for root, dirs, files in os.walk(temp_dir):
        for unzipped_file in files:
            unzipped_file_path = os.path.join(root, unzipped_file)
            if unzipped_file.startswith('.'):
                # skip hidden files
                continue

            print()
            print(f"Reading {unzipped_file}...")
            # if you want to use an LLM to read the file in a higher-quality (but slower and more-expensive) manner, you can use the following line instead:
            # doc_text = doc_processor_llm.convert_to_markdown(unzipped_file_path)
            doc_text = doc_processor.convert_to_markdown(unzipped_file_path)

            print(f"Summarizing {unzipped_file}...")

            # provide context so that the LLM knows what it's looking at
            json_context = "The file contains a single interview transcript."

            # provide a summary of the job to be done
            json_job = f"""Your job is to read the transcript carefully and then produce a concise summary that highlights the main topics, recurring ideas, significant opinions, and emotional tones expressed by the participant(s). Please include key details such as the participant’s role or background if evident, the issues or themes they discuss, any problems or challenges mentioned, any proposed solutions or actions, and any notable conflicts or contradictions. If the participant shows particular feelings (e.g., frustration, excitement, uncertainty), describe those. If the participant refers to external events, organizational structures, or specific stakeholders, include that information. The goal is for the summary to capture enough richness and detail so that a subsequent analysis could generate meaningful qualitative codes from it."""

            # provide the exact JSON format expected in the output
            json_output_spec = f"""Return JSON with the following fields (and only the following fields):

* `summary` (string): Your summary of the transcript, following the instructions above."""

            # process the file
            all_responses = doc_processor_llm.markdown_to_json(markdown=doc_text, json_context=json_context, json_job=json_job, json_output_spec=json_output_spec, max_chunk_size=max_doc_tokens)
            response = all_responses[0]
            if len(all_responses) > 1:
                # if we had to split the document to summarize it in pieces, we just go with the first piece for simplicity
                # (if we wanted, we could use the LLM to combine the separate summaries into a single summary)
                total_doc_tokens = llm.count_tokens(doc_text)
                print(f"  Warning: only summarized first {max_doc_tokens} of {total_doc_tokens} tokens in document")

            # save results
            documents.append({
                "file": unzipped_file,
                "text": doc_text,
                "summary": response['summary']
            })

print()
print(f"Completed summarization of {len(documents)} transcripts.")

## Developing the codebook

This next code block will use the document summaries to develop a codebook for coding the transcripts, then it will refine that codebook. Feel free to adjust the instructions used to guide the codebook generation to meet your needs.

For each codebook draft, we randomize the order of summaries to mitigate any bias that might arise from the order of the transcripts.
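
As a small illustration of that randomization, the sketch below shuffles a copy of the summaries and numbers them in their shuffled order. It uses made-up summaries, and the `build_summary_block` name is an assumption introduced here; the notebook's own version also enforces a token limit.

```python
import random


def build_summary_block(summaries: list, rng: random.Random) -> str:
    """Shuffle a copy of the summaries and number them in their new order."""
    shuffled = rng.sample(summaries, len(summaries))  # leaves the input list untouched
    return "".join(
        f"### Transcript {number}\n{summary}\n\n"
        for number, summary in enumerate(shuffled, start=1)
    )


demo_summaries = ["Summary A", "Summary B", "Summary C"]
# build one block per codebook draft, each with its own random order
first_draft_block = build_summary_block(demo_summaries, random.Random())
refinement_block = build_summary_block(demo_summaries, random.Random())
print(first_draft_block)
```

Because the numbers are assigned after shuffling, "Transcript 2" in one draft is not necessarily the same document as "Transcript 2" in the next.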

If the total number of tokens in the summaries exceeds our limit (120,000 tokens, which is about 90,000 words or 180 pages), we'll truncate the text to fit within the limit. (We could also use the LLM to combine the results from multiple runs if we wanted to include all summaries in the analysis, but likely the first 180 pages of document summaries is more than enough to identify appropriate codes.)
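
The size estimate above implies rough conversion ratios of 0.75 words per token and 500 words per page. The sketch below applies those same ratios to other token counts. Treat the results as ballpark figures only: actual ratios depend on the tokenizer, the language, and the page layout, and the `estimate_size` helper is a hypothetical name.

```python
# Ratios implied by the notebook's own estimate
# (120,000 tokens is about 90,000 words or 180 pages).
WORDS_PER_TOKEN = 90_000 / 120_000  # 0.75
WORDS_PER_PAGE = 90_000 / 180       # 500.0


def estimate_size(tokens: int) -> tuple:
    """Return (approximate words, approximate pages) for a token count."""
    words = tokens * WORDS_PER_TOKEN
    return round(words), round(words / WORDS_PER_PAGE)


print(estimate_size(120_000))  # the notebook's limit
print(estimate_size(8_000))    # a much shorter document
```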

### Using your own codebook

If you prefer, you can replace this next code block with a much simpler one that just sets up your own codebook for the next step. Here's the format to follow:

```
codes = [
    {
        "id": "code1",
        "label": "Label for Code 1",
        "definition": "Definition for Code 1",
        "example": "Example for Code 1"
    },
    {
        "id": "code2",
        "label": "Label for Code 2",
        "definition": "Definition for Code 2",
        "example": "Example for Code 2"
    }
]
```
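
If you do supply your own codebook, a quick sanity check before coding can catch missing fields or duplicate IDs. The following is a minimal sketch; the `validate_codebook` helper is hypothetical, and the alphanumeric-ID rule is an assumption borrowed from the instructions the notebook gives the LLM for generated codebooks.

```python
REQUIRED_FIELDS = ("id", "label", "definition", "example")


def validate_codebook(codes: list) -> list:
    """Return a list of problems; an empty list means the codebook looks usable."""
    problems = []
    seen_ids = set()
    for position, code in enumerate(codes, start=1):
        # every code needs all four fields, each a non-empty string
        for field in REQUIRED_FIELDS:
            value = code.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"code #{position}: missing or empty '{field}'")
        code_id = code.get("id")
        if isinstance(code_id, str) and code_id:
            if not code_id.isalnum():
                problems.append(f"code #{position}: id '{code_id}' is not alphanumeric")
            if code_id in seen_ids:
                problems.append(f"code #{position}: duplicate id '{code_id}'")
            seen_ids.add(code_id)
    return problems


demo_codes = [
    {"id": "code1", "label": "Label for Code 1",
     "definition": "Definition for Code 1", "example": "Example for Code 1"},
    {"id": "code1", "label": "Label for Code 2",
     "definition": "", "example": "Example for Code 2"},
]
for problem in validate_codebook(demo_codes):
    print(problem)
```

Unique IDs matter because each ID becomes a column name in the final documents `.csv` file.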

In [None]:
import random, json

# assemble all summaries into a single text block (in random order)
def get_all_summaries(docs) -> str:
    # randomize the order of the documents
    random.seed()
    docs = random.sample(docs, len(docs))

    # assemble all the summaries
    retval = ""
    doc_number = 0
    for doc in docs:
        doc_number += 1
        retval += f"### Transcript {doc_number}\n{doc['summary']}\n\n"

    # truncate the summaries if needed (to avoid overflowing LLM context window)
    summaries_tokens = llm.count_tokens(retval)
    if summaries_tokens > max_doc_tokens:
        retval = llm.enforce_max_tokens(retval, max_doc_tokens)
        print()
        print(f"  Warning: only considering first {max_doc_tokens} of {summaries_tokens} tokens in list of summaries")
        print()

    return retval

# get a randomized list of transcript summaries
all_summaries = get_all_summaries(documents)

print()
print(f"Creating first-draft codebook...")

# describe the job we want the LLM to do
job_description = """Below are summaries of multiple interview transcripts. Each summary encapsulates key topics, opinions, emotions, contexts, and dynamics present in the original data. Please read these summaries carefully and then propose an initial codebook for qualitative analysis.

Your codebook should:

1. Identify key recurring concepts, patterns, and themes that emerge across the summaries.
2. Provide a short, unique identifier for each code.
3. Provide a short, descriptive label for each code.
4. Include a concise definition of what the code represents (i.e., when and how it should be applied).
5. Offer an illustrative example (excerpt or paraphrase) from the summaries that demonstrates the use of each code.

The goal is to produce a practical, clearly defined set of codes that could guide a researcher or analyst in systematically coding the full dataset."""

# describe the exact JSON format we expect back from the LLM
json_output_spec = """Return JSON with the following fields (and only the following fields):

* `codes` (list of objects): The list of proposed codes, each of which should be an object with the following fields:

    * `id` (string): A short, concise identifier for the code. This should be alphanumeric and contain no spaces or special characters.

    * `label` (string): A short, descriptive label for the code. This should be in the form of a descriptive sentence fragment, as might be used in a bulleted list of codes.

    * `definition` (string): A concise definition of what the code represents (i.e., when and how it should be applied). This can be between one sentence and one paragraph in length.

    * `example` (string): An illustrative example (excerpt or paraphrase) from the summaries that demonstrates the use of the code. This should help coders to better understand when to use the code."""

# assemble the overall prompt
json_prompt = f"""{job_description}

{json_output_spec}

All transcript summaries enclosed in a code block:

```
{all_summaries}
```

Your JSON response precisely following the instructions given above the summaries:"""

# execute the LLM query, with automatic JSON validation+retry
parsed_response, raw_response, error = llm.get_json_response(prompt=json_prompt, json_validation_desc=json_output_spec)

# save and report results
if error:
    print()
    print(f"Error: {error}")

    codes = []
else:
    codes = parsed_response['codes']
    print()
    print(f"Proposed {len(codes)} codes:")
    for code in codes:
        print()
        print(f"  * {code['id']}: {code['label']}")

# next, if we have a proposed set of codes, try to refine them
if codes:
    # re-randomize the list of transcript summaries
    all_summaries = get_all_summaries(documents)

    print()
    print("Refining codebook...")

    # describe the job we want the LLM to do
    job_description = """I have a draft codebook derived from a set of interview summaries, and your job is to refine and finalize that codebook. Each code currently includes a unique ID, a label, a definition, and an example excerpt. I'd like you to refine this codebook based on the content and patterns described in these summaries. Specifically:

1. Review each code and confirm that it is clearly defined and relevant to the summarized data.
2. If any codes overlap significantly or appear redundant, merge or rename them to reduce redundancy.
3. If any codes are too broad, vague, or ambiguous, modify them as appropriate (e.g., split them into more specific codes or clarify definitions).
4. Identify any gaps (areas where a code might be missing because the summaries hint at concepts or themes not yet represented). Add new codes if needed.
5. Update the examples to ensure they are representative and clearly illustrate when to apply the code.

Please return a revised codebook that best represents the recurring themes and nuances of the summaries — and be sure to keep the final codebook that you return anchored in the data as presented by the summaries. Do not introduce concepts not supported by the summaries."""

    # (we'll use the same json_output_spec from earlier since we want the same format back from the LLM)

    # assemble the overall prompt
    json_prompt = f"""{job_description}

{json_output_spec}

**Transcript summaries:**
All transcript summaries enclosed in a code block:

```
{all_summaries}
```

**Draft codebook:**
The draft codebook, in JSON format, enclosed in a code block:
```
{json.dumps(parsed_response, indent=2)}
```

Your JSON response precisely following the instructions given above the summaries and draft:"""

    # execute the LLM query, with automatic JSON validation+retry
    parsed_response, raw_response, error = llm.get_json_response(prompt=json_prompt, json_validation_desc=json_output_spec)

    # save and report results
    if error:
        print()
        print(f"Error: {error}")

        codes = []
    else:
        codes = parsed_response['codes']
        print()
        print(f"Final draft includes {len(codes)} codes:")
        for code in codes:
            print()
            print(f"  * {code['id']}: {code['label']}")

## Coding each document, extracting excerpts

This next code block will code each document according to the codebook, identifying a relevant excerpt to go with each code. Feel free to adjust the instructions used to guide the coding and excerpt extraction to meet your needs.

If we wanted this to be faster and cheaper, we could code documents based on their summaries rather than going back to their full text. However, going back to the full text allows us to be more thorough — and it allows us to extract actual excerpts from the original text.
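
Because an LLM can occasionally paraphrase an excerpt or return an ID that isn't in the codebook, you may want to spot-check the coding results afterward. The sketch below shows one way to do that with plain string matching. The `check_coding` helper is hypothetical and not part of the notebook; it assumes each applied code is a dict with `id` and `excerpt` fields, as requested in the coding instructions.

```python
def check_coding(applied_codes: list, codebook: list, transcript: str) -> list:
    """Return a list of problems found in one document's coding results."""
    valid_ids = {code["id"] for code in codebook}
    # collapse runs of whitespace so line wrapping alone does not cause a mismatch
    normalized_transcript = " ".join(transcript.split())
    problems = []
    for applied in applied_codes:
        if applied["id"] not in valid_ids:
            problems.append(f"unknown code id: {applied['id']}")
        excerpt = " ".join(applied["excerpt"].split())
        if excerpt not in normalized_transcript:
            problems.append(f"excerpt for {applied['id']} not found verbatim in transcript")
    return problems


# made-up data for illustration
demo_codebook = [{"id": "trust"}, {"id": "workload"}]
demo_transcript = "I just don't have\nthe time to document everything. I do trust my team."
demo_applied = [
    {"id": "workload", "excerpt": "I just don't have the time to document everything."},
    {"id": "trust", "excerpt": "I fully trust my colleagues."},
    {"id": "funding", "excerpt": "I do trust my team."},
]
demo_problems = check_coding(demo_applied, demo_codebook, demo_transcript)
print(demo_problems)
```

Exact matching is deliberately strict, so it may also flag harmless differences such as changed quotation marks; treat flagged excerpts as items to review rather than as definite errors.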

In [None]:
# if no codes identified, raise an error
if not codes:
    raise Exception("No codes identified, so we can't code the documents")

# loop through each document
error_count = 0
last_error = ""
for document in documents:
    print()
    print(f"Coding {document['file']}...")

    # get document text and truncate if needed
    doc_text = document['text']
    doc_tokens = llm.count_tokens(doc_text)
    if doc_tokens > max_doc_tokens:
        doc_text = llm.enforce_max_tokens(doc_text, max_doc_tokens)
        print(f"  Warning: only considering first {max_doc_tokens} of {doc_tokens} tokens in document")

    # describe the job we want the LLM to do
    job_description = f"""You are given a single interview transcript and a codebook (defined below). Your task is to read the transcript carefully and then apply those codes from the codebook that apply to the transcript. For each code that you apply to the transcript, you should:

1. Supply the unique `id` for the code
2. Supply an `excerpt` from the transcript that best supports the application of the code"""

    # describe the exact JSON format we expect back from the LLM
    json_output_spec = """Return JSON with the following fields (and only the following fields):

* `codes` (list of objects): The list of ALL codes from the codebook that apply to the transcript (be sure not to miss any), each of which should be an object with the following fields:

    * `id` (string): The short identifier for code from the codebook. Must exactly match an `id` from the codebook.

    * `excerpt` (string): An excerpt from the transcript that best supports the application of the code. This should be a short, relevant snippet of text that clearly demonstrates why the code applies to the transcript. Do not paraphrase or alter the text in any way, except as necessary to format in proper JSON format."""

    # assemble the overall prompt
    json_prompt = f"""{job_description}

{json_output_spec}

**Codebook:**
The codebook in JSON format, enclosed in a code block:
```
{json.dumps(codes, indent=2)}
```

**Interview transcript:**
The full interview transcript, enclosed in a code block:
```
{doc_text}
```

Your JSON response precisely following the instructions given above the transcript and codebook:"""

    # execute the LLM query, with automatic JSON validation+retry
    parsed_response, raw_response, error = llm.get_json_response(prompt=json_prompt, json_validation_desc=json_output_spec)

    # save and report results
    if error:
        document['codes'] = []

        print(f"  Error during coding: {error}")

        error_count += 1
        last_error = error
    else:
        document['codes'] = parsed_response['codes']

        print(f"  Coded with: {', '.join([code['id'] for code in parsed_response['codes']])}")

# report overall results
if error_count > 0:
    print()
    print(f"Some documents could not be coded due to errors.")
    print(f"  * Total errors encountered: {error_count}")
    print(f"  * Last error encountered: {last_error}")
else:
    print()
    print(f"All {len(documents)} documents coded successfully.")

## Organizing and outputting the results

This final code block organizes and outputs final results, saving them in two `.csv` files:

1. `example-qual-analysis-1-codebook.csv` - The codebook used to code the transcripts
2. `example-qual-analysis-1-documents.csv` - A wide-format `.csv` file with dummy variables and excerpts for all codes

If you're running in Google Colab, these files are saved to the content folder. To find, view, or download them, click the folder icon in the left sidebar.

If you're running elsewhere, they are saved to an `ai-workflows` subdirectory of your home directory.
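
Once the files are written, the wide format is easy to analyze further. As a rough sketch, the code below counts how many documents each code was applied to, using only the standard `csv` module and a small made-up sample in the same column layout (`file`, `summary`, then an `<id>` and `<id>_excerpt` column per code). The `code_frequencies` helper is a hypothetical name.

```python
import csv
import io


def code_frequencies(csv_file) -> dict:
    """Count how many documents each code was applied to in a wide-format file."""
    reader = csv.DictReader(csv_file)
    # code columns are those that have a matching "<id>_excerpt" column
    code_ids = [
        name for name in reader.fieldnames
        if f"{name}_excerpt" in reader.fieldnames
    ]
    counts = {code_id: 0 for code_id in code_ids}
    for row in reader:
        for code_id in code_ids:
            counts[code_id] += int(row[code_id])
    return counts


# made-up sample in the same layout as the documents file
sample = io.StringIO(
    "file,summary,trust,trust_excerpt,workload,workload_excerpt\n"
    "a.docx,Summary A,1,I do trust my team.,0,\n"
    "b.docx,Summary B,1,We rely on each other.,1,There is never enough time.\n"
)
frequencies = code_frequencies(sample)
print(frequencies)
```

To run this on real output, pass a file opened with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.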

In [None]:
import os
import csv

# output files to ~/ai-workflows directory if local, otherwise /content if Google Colab
output_path_prefix = notebook_env.get_output_dir(not_colab_dir="~/ai-workflows", colab_subdir="")

# output codebook to UTF-8 .csv file
codebook_output_file = os.path.join(output_path_prefix, "example-qual-analysis-1-codebook.csv")
with open(codebook_output_file, "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["id", "label", "definition", "example"])
    for code in codes:
        writer.writerow([code['id'], code['label'], code['definition'], code['example']])

# output document list to UTF-8 .csv file in wide format
docs_output_file = os.path.join(output_path_prefix, "example-qual-analysis-1-documents.csv")
# output two columns for each code (one for presence, one for excerpt)
code_columns = []
for code in codes:
    code_columns.append(f"{code['id']}")
    code_columns.append(f"{code['id']}_excerpt")

with open(docs_output_file, "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)

    # write header
    writer.writerow(["file", "summary"] + code_columns)

    # write document data
    for document in documents:
        # assemble values for code columns
        code_values = []
        for code in codes:
            code_id = code['id']
            code_present = 1 if code_id in [c['id'] for c in document['codes']] else 0
            code_excerpt = next((c['excerpt'] for c in document['codes'] if c['id'] == code_id), "")
            code_values.extend([code_present, code_excerpt])

        # write row for document
        writer.writerow([
            document['file'],
            document['summary'],
        ] + code_values)

print()
print(f"Codebook saved to {codebook_output_file}")
print(f"Transcript summaries, codes, and excerpts saved to {docs_output_file}")