<a href="https://colab.research.google.com/github/higherbar-ai/ai-workflows/blob/main/src/example-surveyeval-lite.ipynb" target="_parent"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>

# About this surveyeval-lite notebook

This notebook provides a simple example of an automated AI workflow. It's a much-simplified version of [the surveyeval toolkit available on GitHub](https://github.com/higherbar-ai/survey-eval), designed to run in [Google Colab](https://colab.research.google.com) or a local environment. This version, self-contained in this single notebook, uses [the ai-workflows package](https://github.com/higherbar-ai/ai-workflows), along with an OpenAI or Anthropic LLM, to:

1. Parse a survey file into a series of questions

2. Loop through each question to:

    1. Evaluate the question for potential phrasing issues
    2. Evaluate the question for potential bias issues

3. Assemble and output all findings and recommendations

See [the ai-workflows GitHub repo](https://github.com/higherbar-ai/ai-workflows) for more details.
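
The three steps above amount to a nested loop: every question goes through every review, and all findings end up in one list. As a rough sketch of that shape only, with toy rule-based reviewers standing in for the LLM calls (`run_review`, `phrasing_stub`, and `bias_stub` are hypothetical names for illustration, not part of the ai-workflows package):

```python
from typing import Callable


def run_review(
    questions: list[dict],
    reviewers: list[Callable[[dict], list[dict]]],
) -> list[dict]:
    """Run every reviewer over every question and collect all findings."""
    findings = []
    for question in questions:      # step 2: loop through each question
        for review in reviewers:    # steps 2.1 and 2.2: one reviewer per check
            findings.extend(review(question))
    return findings                 # step 3: all findings in one list


def phrasing_stub(question: dict) -> list[dict]:
    """Toy stand-in for the phrasing review: flag ' and ' as possibly double-barreled."""
    if " and " in question["question"]:
        return [{"question_id": question["question_id"],
                 "issue": "possibly double-barreled"}]
    return []


def bias_stub(question: dict) -> list[dict]:
    """Toy stand-in for the bias review: never flags anything."""
    return []


# made-up sample questions (step 1 would normally parse these from a file)
questions = [
    {"question_id": "Q1", "question": "How old are you?"},
    {"question_id": "Q2", "question": "Do you own and use a phone?"},
]
findings = run_review(questions, [phrasing_stub, bias_stub])
print(findings)
```

In the notebook itself, each reviewer is an LLM prompt rather than a hard-coded rule, but the control flow is the same.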

## Configuration

This notebook requires different settings depending on which AI service provider you want to use. If you're running in Google Colab, configure these settings as "secrets": click the key icon in the left sidebar and, once you create a secret, click the toggle to give the notebook access to it. If you're running this notebook in a different environment, set them in a `.env` file; on the first run, the notebook writes out a template `.env` file for you to fill in and tells you where to find it.

The settings are the same in either environment.

### OpenAI (direct)

To use OpenAI directly:

* `openai_api_key` - your OpenAI API key (get one from [the OpenAI API key page](https://platform.openai.com/api-keys), and be sure to fund your platform account with at least $5 to allow GPT-4o model access)
* `openai_model` (optional) - the model to use (defaults to `gpt-4o`)

### OpenAI (via Microsoft Azure)

To use OpenAI via Microsoft Azure:

* `azure_api_key` - your Azure API key
* `azure_api_base` - the base URL for the Azure API
* `azure_api_engine` - the engine to use (a.k.a. the "deployment")
* `azure_api_version` - the API version to use

### Anthropic (direct)

To use Anthropic directly:

* `anthropic_api_key` - your Anthropic API key
* `anthropic_model` - the model to use

### LangSmith (for tracing)

Optionally, you can add [LangSmith tracing](https://langchain.com/langsmith):

* `langsmith_api_key` - your LangSmith API key
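
For a local run, a filled-in settings file might look like the sample in the sketch below. The simple `key=value` layout is an assumption about the template the notebook writes, the key values are placeholders, and `parse_env` is a hypothetical helper shown only to illustrate how blank optional settings fall back to defaults; the notebook itself reads settings through `notebook_env.get_setting`.

```python
# Illustrative .env contents (assumed key=value layout; values are placeholders)
SAMPLE_ENV = """\
# OpenAI (direct)
openai_api_key=sk-your-key-here
openai_model=

# LangSmith (optional)
langsmith_api_key=
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


settings = parse_env(SAMPLE_ENV)

# a blank openai_model falls back to the documented default
model = settings.get("openai_model") or "gpt-4o"
print(model)
# → gpt-4o
```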

## Your survey file

The notebook will prompt you to select or upload a survey file. It will then parse that file into a series of questions and evaluate each question for phrasing and bias issues. The simpler the file's formatting, the better the results will be.

## Setting up the runtime environment

This next code block installs all necessary Python and system packages into the current environment.

**If you're running in Google Colab and it prompts you to restart the notebook in the middle of the installation steps, just click CANCEL.**

In [None]:
# install Google Colab Support and ai_workflows package
%pip install colab-or-not py-ai-workflows[docs]

# download NLTK data
import nltk
nltk.download('punkt', force=True)

# set up our notebook environment (including LibreOffice)
from colab_or_not import NotebookBridge
notebook_env = NotebookBridge(
    system_packages=["libreoffice"],
    config_path="~/.hbai/ai-workflows.env",
    config_template={
        "openai_api_key": "",
        "openai_model": "",
        "azure_api_key": "",
        "azure_api_base": "",
        "azure_api_engine": "",
        "azure_api_version": "",
        "anthropic_api_key": "",
        "anthropic_model": "",
        "langsmith_api_key": "",
    }
)
notebook_env.setup_environment()

## Initializing for AI workflows

The next code block initializes the notebook by loading settings and initializing the LLM interface.

In [None]:
from ai_workflows.llm_utilities import LLMInterface
from ai_workflows.document_utilities import DocumentInterface

# read all supported secrets
openai_api_key = notebook_env.get_setting('openai_api_key')
openai_model = notebook_env.get_setting('openai_model', 'gpt-4o')
azure_api_key = notebook_env.get_setting('azure_api_key')
azure_api_base = notebook_env.get_setting('azure_api_base')
azure_api_engine = notebook_env.get_setting('azure_api_engine')
azure_api_version = notebook_env.get_setting('azure_api_version')
anthropic_api_key = notebook_env.get_setting("anthropic_api_key")
anthropic_model = notebook_env.get_setting("anthropic_model")
langsmith_api_key = notebook_env.get_setting('langsmith_api_key')

# complain if we don't have the bare minimum to run
if (not openai_api_key
        and not (azure_api_key and azure_api_base and azure_api_engine and azure_api_version)
        and not (anthropic_api_key and anthropic_model)):
    raise Exception('We need settings set for OpenAI access (direct or via Azure) or for Anthropic access (direct). See the instructions above for more details.')

# initialize LLM interface
llm = LLMInterface(
    openai_api_key=openai_api_key,
    openai_model=openai_model,
    azure_api_key=azure_api_key,
    azure_api_base=azure_api_base,
    azure_api_engine=azure_api_engine,
    azure_api_version=azure_api_version,
    anthropic_api_key=anthropic_api_key,
    anthropic_model=anthropic_model,
    langsmith_api_key=langsmith_api_key,
    temperature=0.0,
    total_response_timeout_seconds=600,
    number_of_retries=2,
    seconds_between_retries=5,
)

# initialize our document processor
doc_processor = DocumentInterface(llm_interface=llm)

# report success
print("Initialization successful.")

## Prompting for your survey file

This next code block prompts you to upload or select a single survey file for evaluation.

If you don't have a survey file handy, you can use [this short excerpt from the DHS](https://github.com/higherbar-ai/ai-workflows/blob/main/resources/sample_dhs_questions.txt).

In [None]:
# prompt for a single file and keep prompting till we get one
file_path = ""
while True:
    # prompt for a survey file
    survey_files = notebook_env.get_input_files("Survey file to evaluate")

    # complain if we didn't get just a single file
    if len(survey_files) != 1:
        print()
        print('Please upload a single survey file.')
        print()
    else:
        # fetch the path of the uploaded file
        file_path = survey_files[0]
        
        # break from loop
        break

# report results
print(f"Will process this survey file: {file_path}")

## Extracting and parsing survey text

The next code block first extracts all text from the survey file without LLM assistance, then uses an LLM to extract all survey questions from the file. For PDF, Word, and PowerPoint files, it extracts questions page by page, which generally produces the best results because the LLM is used to understand the formatting of the questions, their instructions, and their response options. However, this approach doesn't work well when questions span pages. In that case, you can change these lines:

```python
doc_processor_no_llm = DocumentInterface()
survey_text = doc_processor_no_llm.convert_to_markdown(file_path)
```

To this:
    
```python
survey_text = doc_processor.convert_to_markdown(file_path)
```

And also this line:

```python
all_responses = doc_processor.convert_to_json(file_path, json_context, json_job, json_output_spec, markdown_first=False)
```

To this:
    
```python
all_responses = doc_processor.markdown_to_json(survey_text, json_context, json_job, json_output_spec)
```

Either way, the JSON with all survey questions will be output for you to verify.
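
The extraction prompt in the code block below asks the LLM to pack each question's response options into a single string, with `|||` between options and `###` between an option's label and its internal value. If you want to work with the options individually, a small helper along these lines could split them back apart (`parse_options` is a hypothetical helper, not part of the ai-workflows package, and the sample question is made up):

```python
from typing import Optional


def parse_options(options: str) -> list[tuple[str, Optional[str]]]:
    """Split an options string into (label, value) pairs; value is None if absent."""
    parsed = []
    for item in options.split("|||"):
        item = item.strip()
        if not item:
            continue  # open-ended questions have an empty options string
        label, separator, value = item.partition("###")
        parsed.append((label.strip(), value.strip() if separator else None))
    return parsed


# made-up question in the shape the extraction prompt requests
sample_question = {
    "question_id": "Q1",
    "question": "Sex of respondent",
    "instructions": "",
    "options": "Male ### 1 ||| Female ### 2",
}

print(parse_options(sample_question["options"]))
# → [('Male', '1'), ('Female', '2')]
print(parse_options("YES ||| NO"))
# → [('YES', None), ('NO', None)]
print(parse_options(""))
# → []
```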


In [None]:
import json

# first, extract all text from the survey file without LLM assistance
doc_processor_no_llm = DocumentInterface()
survey_text = doc_processor_no_llm.convert_to_markdown(file_path)

# next, extract all survey questions from the file using LLM assistance

json_context = "The file contains a survey instrument or digital form."

json_job = """Your job is to extract survey questions or form fields from the file's content, including question IDs, instructions, and multiple-choice options, and to return it all in a specific JSON format. More specifically:

* **Your job is to extract verbatim text:** In the JSON you return, only ever include text content, directly quoted without modification, from the survey text you are supplied (i.e., never add or invent any text and never revise or rephrase any text).

* **Only respond with valid JSON that precisely follows the format specified below:** Your response should only include valid JSON and nothing else; if you cannot find any questions to return, simply return an empty questions list.

* **Treat translations as separate questions:** If you see one or more translated versions of a question, include them as separate questions in the JSON you return."""

json_output_spec = """Return JSON with the following fields (and only the following fields):

* `questions` (list): The list of questions extracted, or an empty list if none found. Each question should be a dictionary with the following keys:

  * `question_id` (string): The numeric or alphanumeric identifier or short variable name identifying the question (if any), usually located just before or at the beginning of the question. "" if none found.

  * `question` (string): The exact text of the question or form field, including any introductory text that provides context or explanation. Often follows a unique question ID of some sort, like "2.01." or "gender:". Should not include response options, which should be included in the 'options' field, or extra enumerator or interviewer instructions (including interview probes), which should be included in the 'instructions' field. Be careful: the same question might be asked in multiple languages, and each translation should be included as a separate question. Never translate between languages or otherwise alter the question text in any way.

  * `instructions` (string): Instructions or other guidance about how to ask or answer the question (if any), including enumerator or interviewer instructions. If the question includes a list of specific response options, do NOT include those in the instructions. However, if there is guidance as to how to fill out an open-ended numeric or text response, or guidance about how to choose among the options, include that guidance here. "" if none found.

  * `options` (string): The list of specific response options for multiple-choice questions in a single string, including both the label and the internal value (if specified) for each option. For example, a 'Male' label might be coupled with an internal value of '1', 'M', or even 'male'. Separate response options with a space, three pipe symbols ('|||'), and another space, and, if there is an internal value, add a space, three # symbols ('###'), and the internal value at the end of the label. For example: 'Male ### 1 ||| Female ### 2' (codes included) or 'Male ||| Female' (no codes); 'Yes ### yes ||| No ### no', 'Yes ### 1 ||| No ### 0', 'Yes ### y ||| No ### n', or 'YES ||| NO'. Do NOT include fill-in-the-blank content here, only multiple-choice options. "" if the question is open-ended (i.e., does not include specific multiple-choice options)."""

# process the file
all_responses = doc_processor.convert_to_json(file_path, json_context, json_job, json_output_spec, markdown_first=False)

# combine all responses into a single list of questions
merged_responses = doc_processor.merge_dicts(all_responses)
# fall back to an empty list if no questions came back at all
questions = merged_responses.get('questions', [])

# output results
if questions:
    # output summary of results
    num_questions = len(questions)
    # count the questions that have a non-empty value in each field
    num_question_ids = sum(1 for q in questions if q.get('question_id'))
    num_instructions = sum(1 for q in questions if q.get('instructions'))
    num_options = sum(1 for q in questions if q.get('options'))
    print(f"Parsed {num_questions} questions ({num_question_ids} with IDs, {num_instructions} with instructions, and {num_options} with multiple-choice options)")
    print()
    print(json.dumps(questions, indent=2))
else:
    print("Failed to parse any questions from file.")

## Reviewing the survey questions

This next code block reviews each question in the survey, asking the LLM for advice on question phrasing as well as on potentially biased or stereotypical language.
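
Both review prompts ask the LLM to respond with four parallel lists (`Phrases`, `Recommendations`, `Explanations`, `Severities`) plus a `Number of phrases` count. LLM output does not always follow such a format exactly, so you may want to check a response before relying on it. The sketch below shows one possible check; `is_well_formed` is a hypothetical helper, not part of the ai-workflows package, and the sample responses are made up.

```python
REQUIRED_LISTS = ("Phrases", "Recommendations", "Explanations", "Severities")


def is_well_formed(response: dict) -> bool:
    """Return True if all four lists exist, share one length, and match the count."""
    lists = [response.get(key) for key in REQUIRED_LISTS]
    if not all(isinstance(item, list) for item in lists):
        return False
    lengths = {len(item) for item in lists}
    return len(lengths) == 1 and response.get("Number of phrases") == len(lists[0])


# made-up responses in the shape the prompts request
good_response = {
    "Phrases": ["head of household"],
    "Number of phrases": 1,
    "Recommendations": ["main decision-maker in the household"],
    "Explanations": ["The term may be understood differently across respondents."],
    "Severities": [3],
}
bad_response = {**good_response, "Severities": []}  # one list is too short

print(is_well_formed(good_response))
# → True
print(is_well_formed(bad_response))
# → False
```

Skipping responses that fail a check like this would keep mismatched lists out of the final report, at the cost of dropping some partially usable findings.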

In [None]:
# loop through every question, reviewing them and saving results as we go
all_results = []
for question in questions:
    # format our question for the LLM
    question_text = f"""* Question ID: {question['question_id']}
* Instructions: {question['instructions']}
* Question: {question['question']}
* Options: {question['options']}"""

    # set up our phrasing-review prompt for the LLM
    phrasing_prompt = f"""You are an AI designed to evaluate questionnaires and other survey instruments used by researchers and M&E professionals. You are an expert in survey methodology with training equivalent to a member of the American Association for Public Opinion Research (AAPOR) with a Ph.D. in survey methodology from University of Michigan’s Institute for Social Research. You consider primarily the content, context, and questions provided to you, and then content and methods from the most widely-cited academic publications and public and nonprofit research organizations.

You always give truthful, factual answers. When asked to give your response in a specific format, you always give your answer in the exact format requested. You never give offensive responses. If you don’t know the answer to a question, you truthfully say you don’t know.

You will be given the raw text from a questionnaire or survey instrument between |!| and |!| delimiters. You will also be given a specific question from that text to evaluate between |@| and |@| delimiters. The question will be supplied in the following format:

* Question ID: ID (if any)
* Instructions: Instructions (if any)
* Question: Question text
* Options: Multiple-choice options (if any), with each separated by three pipe symbols (|||) and option values (if any) separated from option labels by three hash symbols (###)

Evaluate the question only, but also consider its context within the larger survey.

Assume that this survey will be administered by a trained enumerator who asks each question and reads each prompt or instruction as indicated in the excerpt. Your job is to anticipate the phrasing or translation issues that would be identified in a rigorous process of pre-testing (with cognitive interviewing) and piloting.

When evaluating the question, DO:

1. Ensure that the question will be understandable by substantially all respondents.

2. Consider the question in the context of the excerpt, including any instructions, related questions, or prompts that precede it.

3. Ignore question numbers and formatting.

4. Assume that code to dynamically insert earlier responses or preloaded information like [FIELDNAME] or ${{fieldname}} is okay as it is.

5. Ignore HTML or other formatting, and focus solely on question phrasing (assume that HTML tags will be for visual formatting only and will not be read aloud).

When evaluating the question, DON'T:

1. Recommend translating something into another language (i.e., suggestions for rephrasing should always be in the same language as the original text).

2. Recommend changes in the overall structure of a question (e.g., changing from multiple choice to open-ended or splitting one question into multiple), unless it will substantially improve the quality of the data collected.

3. Comment on HTML tags or formatting.

Respond in JSON format with all of the following fields:

* `Phrases` (list): a list containing all phrases from the excerpt that pre-testing or piloting is likely to identify as problematic (each phrase should be an exact quote)

* `Number of phrases` (number): the exact number of phrases in Phrases [ Note that this key must be exactly "Number of phrases", with exactly that capitalization and spacing ]

* `Recommendations` (list): a list containing suggested replacement phrases, one for each of the phrases in Phrases (in the same order as Phrases; each replacement phrase should be an exact quote that can exactly replace the corresponding phrase in Phrases; and each replacement phrase should be in the same language as the original phrase)

* `Explanations` (list): a list containing explanations for why the authors should consider revising each phrase, one for each of the phrases in Phrases (in the same order as Phrases). Do not repeat the entire phrase in the explanation, but feel free to reference specific words or parts as needed.

* `Severities` (list): a list containing the severity of each identified issue, one for each of the phrases in Phrases (in the same order as Phrases); each severity should be expressed as a number on a scale from 1 for the least severe issues (minor phrasing issues that are very unlikely to substantively affect responses) to 5 for the most severe issues (problems that are likely to substantively affect responses in a way that introduces bias and/or variance)

Raw text:
|!|
{survey_text}
|!|

Question to evaluate:
|@|
{question_text}
|@|

Your JSON response following the format described above:"""

    # call out to the LLM
    print()
    print(f"Evaluating question for phrasing: {question['question']}")
    response_dict, response_text, error = llm.get_json_response(phrasing_prompt)

    # save and output results
    if error:
        print(f"  Failed to get a valid response. Error: {error}")
    elif 'Number of phrases' in response_dict and response_dict['Number of phrases'] > 0:
        print(f"  Identified {response_dict['Number of phrases']} issue(s)")
        all_results += [response_dict]
    else:
        print("  No issues identified")

    # set up our bias-review prompt for the LLM
    # Note that this prompt was inspired by the example in this blog post:
    # https://www.linkedin.com/pulse/using-chatgpt-counter-bias-prejudice-discrimination-johannes-schunter/
    bias_prompt = f"""You are an AI designed to evaluate questionnaires and other survey instruments used by researchers and M&E professionals. You are an expert in survey methodology with training equivalent to a member of the American Association for Public Opinion Research (AAPOR) with a Ph.D. in survey methodology from University of Michigan’s Institute for Social Research. You are also an expert in the areas of gender equality, discrimination, anti-racism, and anti-colonialism. You consider primarily the content, context, and questions provided to you, and then content and methods from the most widely-cited academic publications and public and nonprofit research organizations.

You always give truthful, factual answers. When asked to give your response in a specific format, you always give your answer in the exact format requested. You never give offensive responses. If you don’t know the answer to a question, you truthfully say you don’t know.

You will be given the raw text from a questionnaire or survey instrument between |!| and |!| delimiters. You will also be given a specific question from that text to evaluate between |@| and |@| delimiters. The question will be supplied in the following format:

* Question ID: ID (if any)
* Instructions: Instructions (if any)
* Question: Question text
* Options: Multiple-choice options (if any), with each separated by three pipe symbols (|||) and option values (if any) separated from option labels by three hash symbols (###)

Evaluate the question only, but also consider its context within the larger survey.

Assume that this survey will be administered by a trained enumerator who asks each question and reads each prompt or instruction as indicated in the excerpt. Your job is to review the question for:

a. Stereotypical representations of gender, ethnicity, origin, religion, or other social categories.

b. Distorted or biased representations of events, topics, groups, or individuals.

c. Use of discriminatory or insensitive language towards certain groups or topics.

d. Implicit or explicit assumptions made in the text or unquestioningly adopted that could be based on prejudices.

e. Prejudiced descriptions or evaluations of abilities, characteristics, or behaviors.

Respond in JSON format with all of the following fields:

* `Phrases`: a list containing all problematic phrases from the excerpt that you found in your review (each phrase should be an exact quote from the excerpt)

* `Number of phrases`: the exact number of phrases in Phrases [ Note that this key must be exactly "Number of phrases", with exactly that capitalization and spacing ]

* `Recommendations`: a list containing suggested replacement phrases, one for each of the phrases in Phrases (in the same order as Phrases; each replacement phrase should be an exact quote that can exactly replace the corresponding phrase in Phrases)

* `Explanations`: a list containing explanations for why the phrases are problematic, one for each of the phrases in Phrases (in the same order as Phrases)

* `Severities`: a list containing the severity of each identified issue, one for each of the phrases in Phrases (in the same order as Phrases); each severity should be expressed as a number on a scale from 1 for the least severe issues (minor phrasing issues that are very unlikely to offend respondents or substantively affect their responses) to 5 for the most severe issues (problems that are very likely to offend respondents or substantively affect responses in a way that introduces bias and/or variance)

Raw text:
|!|
{survey_text}
|!|

Question to evaluate:
|@|
{question_text}
|@|

Your JSON response following the format described above:"""

    # call out to the LLM
    print()
    print(f"Evaluating question for bias: {question['question']}")
    response_dict, response_text, error = llm.get_json_response(bias_prompt)

    # save and output results
    if error:
        print(f"  Failed to get a valid response. Error: {error}")
    elif 'Number of phrases' in response_dict and response_dict['Number of phrases'] > 0:
        print(f"  Identified {response_dict['Number of phrases']} issue(s)")
        all_results += [response_dict]
    else:
        print("  No issues identified")

## Organizing and outputting the results

This final code block organizes and outputs final results, saving them in a file named `survey-review-results.txt`.

If you're running in Google Colab, this file will be saved into the content folder. Find, view, and download it by clicking on the folder icon in the left sidebar.

If you're running elsewhere, it will be saved into an `ai-workflows` subdirectory of your home directory.
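
The report lists findings in the order they were produced. If you would rather see the most severe issues first, one option is to flatten the parallel lists into individual findings and sort them by severity. The sketch below is an optional variation, not what the code block below does; `flatten_findings` is a hypothetical helper, the sample data is made up, and it assumes every severity is a number.

```python
def flatten_findings(all_results: list[dict]) -> list[dict]:
    """Turn parallel-list results into one finding per phrase, most severe first."""
    findings = []
    for result in all_results:
        rows = zip(
            result.get("Phrases", []),
            result.get("Recommendations", []),
            result.get("Explanations", []),
            result.get("Severities", []),
        )
        for phrase, recommendation, explanation, severity in rows:
            findings.append({
                "phrase": phrase,
                "recommendation": recommendation,
                "explanation": explanation,
                "severity": severity,
            })
    # sorted() is stable, so findings with equal severity keep their original order
    return sorted(findings, key=lambda finding: finding["severity"], reverse=True)


# made-up results in the shape collected by the review loop
sample_results = [
    {"Phrases": ["a", "b"], "Number of phrases": 2,
     "Recommendations": ["A", "B"], "Explanations": ["why a", "why b"],
     "Severities": [2, 4]},
    {"Phrases": ["c"], "Number of phrases": 1,
     "Recommendations": ["C"], "Explanations": ["why c"],
     "Severities": [3]},
]

ordered = flatten_findings(sample_results)
print([(finding["phrase"], finding["severity"]) for finding in ordered])
# → [('b', 4), ('c', 3), ('a', 2)]
```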

In [None]:
import os

# output files to ~/ai-workflows directory if local, otherwise /content if Google Colab
output_path_prefix = notebook_env.get_output_dir(not_colab_dir="~/ai-workflows", colab_subdir="")

# generate report
if len(all_results) == 0:
    report = "No results to report"
else:
    report = "Survey review results:\n"
    for result in all_results:
        if 'Phrases' in result and result['Number of phrases'] > 0:
            # loop through all recommendations, treating lists as parallel arrays
            for phrase, recommendation, explanation, severity in zip(result['Phrases'], result['Recommendations'], result['Explanations'], result['Severities']):
                report += f"\n---\n\nSuggest replacing this: {phrase}\n\nWith this: {recommendation}\n\n{explanation}\n\nImportance: {severity} out of 5\n"

# save the report to file
output_file = os.path.join(output_path_prefix, "survey-review-results.txt")
# write as UTF-8 so non-ASCII survey text is saved correctly on every platform
with open(output_file, "w", encoding="utf-8") as f:
    f.write(report)

print(f"All recommendations saved to {output_file}")