<a href="https://colab.research.google.com/github/higherbar-ai/survey-eval-lite/blob/main/survey_eval_lite.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About this survey-eval-lite notebook

This notebook provides a simple example of an automated AI workflow. It's a much-simplified version of [the surveyeval toolkit on GitHub](https://github.com/higherbar-ai/survey-eval), designed to run in [Google Colab](https://colab.research.google.com). Self-contained in a single notebook, this version uses the OpenAI API to:

1. Parse a text-format survey into a series of questions

2. Loop through each question to:

    1. Evaluate the question for potential phrasing issues
    2. Evaluate the question for potential bias

3. Assemble and output all findings and recommendations
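
The three steps above can be sketched as a small loop. This is only an illustration of the control flow, not the notebook's actual code: `review_survey` and the `fake_*` functions are hypothetical stand-ins that let the sketch run without an API key.

```python
from typing import Callable


def review_survey(
    survey_text: str,
    parse: Callable[[str], list],
    reviewers: list,
) -> list:
    """Parse the survey, run every reviewer on every question, collect findings."""
    findings = []
    for question in parse(survey_text):        # step 1: parse into questions
        for reviewer in reviewers:             # step 2: phrasing review, then bias review
            result = reviewer(survey_text, question)
            if result.get("Number of phrases", 0) > 0:
                findings.append(result)
    return findings                            # step 3: assemble all findings


# Hypothetical stand-ins for the LLM calls (assumptions for illustration only)
def fake_parse(text: str) -> list:
    return [{"question": line} for line in text.splitlines() if line.strip()]


def fake_phrasing_review(text: str, question: dict) -> dict:
    flagged = ["regularly"] if "regularly" in question["question"] else []
    return {"Phrases": flagged, "Number of phrases": len(flagged)}


def fake_bias_review(text: str, question: dict) -> dict:
    return {"Phrases": [], "Number of phrases": 0}


demo_findings = review_survey(
    "Do you regularly exercise?\nHow old are you?",
    fake_parse,
    [fake_phrasing_review, fake_bias_review],
)
print(demo_findings)
```

In the notebook itself, the parsing and review steps are LLM calls, and only responses that report at least one phrase are kept.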

## Configuration

This notebook uses secrets, which you can configure by clicking on the key icon in Google Colab's left sidebar.

To use OpenAI directly, configure the `openai_api_key` secret to contain your API key. (Get a key from [the OpenAI API key page](https://platform.openai.com/api-keys), and be sure to fund your platform account with at least $5 to allow GPT-4o model access.)

Alternatively, you can use OpenAI via Microsoft Azure by configuring the following secrets:

1. `azure_api_key`
2. `azure_api_base`
3. `azure_api_engine`
4. `azure_api_version`

Finally, you can override the default model of `gpt-4o` by setting the `openai_model` secret to your preferred model, and you can optionally add [LangSmith tracing](https://langchain.com/langsmith) by setting the `langsmith_api_key` secret.
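
As a rough sketch of how these secrets determine which service is used: the notebook requires either `openai_api_key` or all four Azure secrets, and goes through Azure whenever `azure_api_key` is set. The `choose_backend` helper and its plain-dictionary input are hypothetical, used here only to make the rule concrete.

```python
AZURE_SECRETS = ("azure_api_key", "azure_api_base", "azure_api_engine", "azure_api_version")


def choose_backend(secrets: dict) -> str:
    """Return "azure" or "openai" based on which secrets are configured."""
    has_openai = bool(secrets.get("openai_api_key"))
    has_azure = all(secrets.get(name) for name in AZURE_SECRETS)
    if not has_openai and not has_azure:
        raise ValueError("Set openai_api_key, or all four azure_* secrets.")
    # Azure takes precedence whenever an Azure key is present
    return "azure" if secrets.get("azure_api_key") else "openai"


print(choose_backend({"openai_api_key": "sk-example"}))
```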

## Your survey text

This notebook will prompt you to upload a survey `.txt` file in plain text format (like [this short excerpt from the DHS](https://github.com/higherbar-ai/survey-eval-lite/blob/main/sample_dhs_questions.txt)). Tips on creating `.txt` versions from other formats:

1. `.docx`: In Microsoft Word, click _Save as_ and then choose `Plain Text (.txt)` as the format.

2. `.pdf`: Upload to [ChatGPT](https://chatgpt.com/) and ask it to give you the survey in plain text format, then copy and paste into a `.txt` file.

3. `.xlsx`: Print a preview to `.pdf` format and then use ChatGPT as in #2 above.

While it's certainly possible to read `.pdf` and other formats, it's a fair bit more complicated. For example, see the code and discussion in [the surveyeval toolkit](https://github.com/higherbar-ai/survey-eval).

## Installing prerequisites

This next code block installs all necessary Python packages into the current environment.

In [None]:
!pip install google-colab
!pip install 'langchain>=0.2.0,<0.3'
!pip install 'langchain-openai==0.1.19'
!pip install 'langchain-community>=0.2.0,<0.3'
!pip install 'langsmith>=0.1.63,<0.2'
!pip install 'tiktoken>=0.7.0,<1.0.0'
!pip install 'openai==1.37.1'
!pip install tenacity

## Initializing LLM and defining support functions

This next code block uses your configured secrets to initialize LangChain for OpenAI access (possibly via Microsoft Azure, if secrets are configured for that).

It also includes a set of handy support functions to facilitate AI workflows.
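
One of those support functions wraps each LLM call in a timeout and retries if the call takes too long. The notebook uses the `tenacity` package for this; the sketch below shows the same idea with only the standard library. The name `call_with_timeout_and_retry` and its parameters are assumptions for illustration.

```python
import concurrent.futures
import time


def call_with_timeout_and_retry(func, *args, timeout_seconds=600, attempts=2, wait_seconds=5):
    """Run func in a worker thread; try again if it exceeds the timeout."""
    last_error = None
    for attempt in range(attempts):
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        try:
            return executor.submit(func, *args).result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(wait_seconds)  # pause before the next attempt
        finally:
            # don't block on a worker that may still be running
            executor.shutdown(wait=False)
    raise last_error


# Demo: the first call is too slow and times out, the second succeeds
calls = []


def flaky(value):
    calls.append(value)
    if len(calls) == 1:
        time.sleep(0.5)
    return value.upper()


demo_result = call_with_timeout_and_retry(
    flaky, "ok", timeout_seconds=0.2, attempts=2, wait_seconds=0.5
)
print(demo_result)
```

Note that a timed-out call is abandoned rather than cancelled: the worker thread keeps running in the background until the underlying request returns.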

In [None]:
from google.colab import userdata
from langchain_openai.chat_models.base import _AllReturnType, ChatOpenAI
from langchain_openai.chat_models.azure import _AllReturnType, AzureChatOpenAI
from langchain_core.messages import BaseMessage
import concurrent.futures
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception_type
import json
import os

# utility function: get secret from Google Colab, with support for a default
def get_secret_with_default(secretName, defaultValue=None):
  try:
    return userdata.get(secretName)
  except Exception:
    # secret not set, or Notebook Access not enabled for it
    return defaultValue


# read all supported secrets
openai_api_key = get_secret_with_default('openai_api_key')
openai_model = get_secret_with_default('openai_model', 'gpt-4o')
azure_api_key = get_secret_with_default('azure_api_key')
azure_api_base = get_secret_with_default('azure_api_base')
azure_api_engine = get_secret_with_default('azure_api_engine')
azure_api_version = get_secret_with_default('azure_api_version')
langsmith_api_key = get_secret_with_default('langsmith_api_key')

# complain if we don't have the bare minimum to run
if not openai_api_key and not (azure_api_key
                               and azure_api_base
                               and azure_api_engine
                               and azure_api_version):
  raise Exception('Set either the openai_api_key secret, or all of azure_api_key, azure_api_base, azure_api_engine, and azure_api_version to use Azure instead. Also be sure to enable Notebook Access for the secret(s).')

# initialize LangSmith API (if key specified)
if langsmith_api_key:
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = "local"
    os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    os.environ["LANGCHAIN_API_KEY"] = langsmith_api_key

# configure LLM settings
temperature = 0.1

# configure retry and timeout settings for calls to the LLM
total_response_timeout_seconds = 600
number_of_retries = 2
seconds_between_retries = 5

# initialize LangChain LLM access
if azure_api_key:
    llm = AzureChatOpenAI(openai_api_key=azure_api_key, temperature=temperature, deployment_name=azure_api_engine, azure_endpoint=azure_api_base,
                          openai_api_version=azure_api_version, openai_api_type="azure")
else:
    llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=temperature, model_name=openai_model)
json_llm = llm.with_structured_output(method="json_mode", include_raw=True)

# report success
print("Initialization successful.")


# utility function: call out to LLM for structured JSON response
def llm_json_response(prompt) -> _AllReturnType | _AllReturnType | dict[str, BaseMessage]:
    # execute LLM evaluation, but catch and return any exceptions
    try:
        result = json_llm.invoke(prompt)
    except Exception as caught_e:
        # format error result like success result
        result = {"raw": BaseMessage(type="ERROR", content=f"{caught_e}")}
    return result

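# internal helper: run the LLM call in a worker thread, retrying if it times out
# (the timeout must propagate out of this function for tenacity to see it and retry)
@retry(stop=stop_after_attempt(number_of_retries + 1), wait=wait_fixed(seconds_between_retries),
       retry=retry_if_exception_type(concurrent.futures.TimeoutError), reraise=True)
def _llm_json_response_timed(prompt):
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(llm_json_response, prompt)
        return future.result(timeout=total_response_timeout_seconds)
    finally:
        # don't block on a worker thread that may still be waiting on the LLM
        executor.shutdown(wait=False)

# utility function: call out to LLM for structured JSON response, w/ automatic timeout and retry
def llm_json_response_with_timeout(prompt) -> _AllReturnType | _AllReturnType | dict[str, BaseMessage]:
    try:
        result = _llm_json_response_timed(prompt)
    except Exception as caught_e:
        # format error result like success result (including a final timeout after all retries)
        result = {"raw": BaseMessage(type="ERROR", content=f"{type(caught_e).__name__}: {caught_e}")}
    return result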

# utility function: process JSON response and return as raw response and parsed dictionary from JSON
def process_json_response(response)-> tuple[str, dict]:
    final_response = ""
    parsed_response = None
    if response['raw'].type == "ERROR":
        # if we caught an error, report and save that error, then move on
        final_response = response['raw'].content
        print(f"Error from LLM: {final_response}")
    elif 'parsed' in response and response['parsed'] is not None:
        # if we got a parsed version, save the JSON version of that
        final_response = json.dumps(response['parsed'])
        parsed_response = response['parsed']
    elif 'parsing_error' in response and response['parsing_error'] is not None:
        # if there was a parsing error, report and save that error, then move on
        final_response = str(response['parsing_error'])
        print(f"Parsing error: {final_response}")
    else:
        final_response = ""
        print("Unknown response from LLM")

    # return response in both raw and parsed formats
    return final_response, parsed_response

## Uploading your survey file

When you run this next code cell, it will prompt you to upload a `.txt` file with the plain text of your survey. It will then output the contents of that file so that you can confirm it was read correctly.

If you don't have a `.txt` file handy, you can use [this short excerpt from the DHS](https://github.com/higherbar-ai/survey-eval-lite/blob/main/sample_dhs_questions.txt).
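
The cell below decodes the uploaded bytes as UTF-8. Plain-text files exported from other tools are not always UTF-8, so if the upload fails with a decoding error, a fallback along these lines may help. This is a sketch under assumptions: the `decode_survey_bytes` helper and the choice of Windows-1252 as the fallback encoding are not part of the notebook.

```python
def decode_survey_bytes(data: bytes, encodings=("utf-8-sig", "cp1252")) -> str:
    """Try each encoding in turn; fall back to UTF-8 with replacement characters."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")


print(decode_survey_bytes("Café".encode("cp1252")))
```

Trying `utf-8-sig` first also strips a leading byte-order mark, which some editors add to UTF-8 files.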

In [None]:
from google.colab import files
import io, os

# prompt for a .txt file and keep prompting till we get one
print('Upload a .txt file with your survey text:')
print()
survey_text = None
while True:
  # prompt for upload
  uploaded = files.upload()

  # complain if we didn't get just a single file
  if len(uploaded.items()) != 1:
    print()
    print('Please upload a single .txt file.')
    print()
    continue

  filename, data = uploaded.popitem()
  if not filename.endswith('.txt'):
    # clean up the unsupported file from local storage
    os.remove(filename)
    print()
    print("Invalid file type. Please upload a .txt file.")
    print()
    continue

  # read the file (decode the uploaded bytes as UTF-8 text)
  survey_text = data.decode('utf-8')
  # clean up the read file from local storage
  os.remove(filename)
  # break from our re-prompt loop
  break

# output the result
print()
print(f'Survey text read from {filename}:')
print()
print(survey_text)

## Parsing the survey text

The next code block will use the LLM to parse the survey text into a list of questions.
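
The parsing prompt in this section asks the LLM to pack each question's response options into a single string, with options separated by `|||` and optional internal values separated from labels by `###`. The notebook passes that string along as-is, but if you want to work with individual options, a helper like the following could unpack it. The `split_options` name is hypothetical.

```python
from typing import List, Optional, Tuple


def split_options(options: str) -> List[Tuple[str, Optional[str]]]:
    """Split 'Label ### value ||| Label ### value' into (label, value) pairs."""
    parsed = []
    for item in options.split("|||"):
        item = item.strip()
        if not item:
            continue  # skip empty segments (e.g., an open-ended question)
        label, separator, value = item.partition("###")
        parsed.append((label.strip(), value.strip() if separator else None))
    return parsed


print(split_options("Male ### 1 ||| Female ### 2"))
print(split_options("YES ||| NO"))
```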

In [None]:
# set up our survey-parsing prompt for the LLM
parsing_prompt = f"""You are an expert survey questionnaire and form parser. Given the plain text of a survey or digital form, you can parse into a well-structured JSON list of questions. Important instructions:

* **Your job is to extract exact text from the supplied survey text:** In the JSON you return, only ever include text content, directly quoted without modification, from the survey text you are supplied (i.e., never add or invent any text and never revise or rephrase any text).

* **Only respond with valid JSON that precisely follows the format specified below:** Your response should only include valid JSON and nothing else; if you cannot find any questions to return, simply return an empty questions list.

* **Treat translations as separate questions:** If you see one or more translated versions of a question, include them as separate questions in the JSON you return.

The JSON you return should include fields as follows:

* `questions` (list): The list of questions extracted, or an empty list if none found. Each question should be an object with the following fields:

  * `question_id` (string): The numeric or alphanumeric identifier or short variable name identifying the question (if any), usually located just before or at the beginning of the question. "" if none found.

  * `question` (string): The exact text of the question or form field, including any introductory text that provides context or explanation. Often follows a unique question ID of some sort, like "2.01." or "gender:". Should not include response options, which should be included in the 'options' field, or extra enumerator or interviewer instructions (including interview probes), which should be included in the 'instructions' field. Be careful: the same question might be asked in multiple languages, and each translation should be included as a separate question. Never translate between languages or otherwise alter the question text in any way.

  * `instructions` (string): Instructions or other guidance about how to ask or answer the question (if any), including enumerator or interviewer instructions. If the question includes a list of specific response options, do NOT include those in the instructions. However, if there is guidance as to how to fill out an open-ended numeric or text response, or guidance about how to choose among the options, include that guidance here. "" if none found.

  * `options` (string): The list of specific response options for multiple-choice questions in a single string, including both the label and the internal value (if specified) for each option. For example, a 'Male' label might be coupled with an internal value of '1', 'M', or even 'male'. Separate response options with a space, three pipe symbols ('|||'), and another space, and, if there is an internal value, add a space, three # symbols ('###'), and the internal value at the end of the label. For example: 'Male ### 1 ||| Female ### 2' (codes included) or 'Male ||| Female' (no codes); 'Yes ### yes ||| No ### no', 'Yes ### 1 ||| No ### 0', 'Yes ### y ||| No ### n', or 'YES ||| NO'. Do NOT include fill-in-the-blank content here, only multiple-choice options. "" if the question is open-ended (i.e., does not include specific multiple-choice options).

Here is the survey text for you to parse, delimited by triple backticks:

```
{survey_text}
```

Return your JSON list of questions, each with `question_id`, `question`, `instructions`, and `options` strings:
"""

# call out to the LLM and process the returned JSON
response_text, response_dict = process_json_response(llm_json_response_with_timeout(parsing_prompt))

# output results
if response_dict is not None:
  # get list of questions from the response dictionary
  questions = response_dict.get('questions', [])
  # output summary of results, counting only non-empty fields
  num_questions = len(questions)
  num_question_ids = sum(1 for q in questions if q.get('question_id'))
  num_instructions = sum(1 for q in questions if q.get('instructions'))
  num_options = sum(1 for q in questions if q.get('options'))
  print(f"Parsed {num_questions} questions ({num_question_ids} with IDs, {num_instructions} with instructions, and {num_options} with multiple-choice options)")
else:
  # leave an empty list so that the review cell below has nothing to loop over
  questions = []
  print(f"Failed to parse any questions. Response text: {response_text}")

## Reviewing the survey questions

This next code block will review each question in the survey, asking the LLM for advice on question phrasing as well as on potentially biased or stereotypical language.

In [None]:
# loop through every question, reviewing them and saving results as we go
all_results = []
for question in questions:
  # format our question for the LLM
  question_text = f"""* Question ID: {question['question_id']}
* Instructions: {question['instructions']}
* Question: {question['question']}
* Options: {question['options']}"""

  # set up our phrasing-review prompt for the LLM
  phrasing_prompt = f"""You are an AI designed to evaluate questionnaires and other survey instruments used by researchers and M&E professionals. You are an expert in survey methodology with training equivalent to a member of the American Association for Public Opinion Research (AAPOR) with a Ph.D. in survey methodology from University of Michigan’s Institute for Social Research. You consider primarily the content, context, and questions provided to you, and then content and methods from the most widely-cited academic publications and public and nonprofit research organizations.

You always give truthful, factual answers. When asked to give your response in a specific format, you always give your answer in the exact format requested. You never give offensive responses. If you don’t know the answer to a question, you truthfully say you don’t know.

You will be given the raw text from a questionnaire or survey instrument between |!| and |!| delimiters. You will also be given a specific question from that text to evaluate between |@| and |@| delimiters. The question will be supplied in the following format:

* Question ID: ID (if any)
* Instructions: Instructions (if any)
* Question: Question text
* Options: Multiple-choice options (if any), with each separated by three pipe symbols (|||) and option values (if any) separated from option labels by three hash symbols (###)

Evaluate the question only, but also consider its context within the larger survey.

Assume that this survey will be administered by a trained enumerator who asks each question and reads each prompt or instruction as indicated in the excerpt. Your job is to anticipate the phrasing or translation issues that would be identified in a rigorous process of pre-testing (with cognitive interviewing) and piloting.

When evaluating the question, DO:

1. Ensure that the question will be understandable by substantially all respondents.

2. Consider the question in the context of the excerpt, including any instructions, related questions, or prompts that precede it.

3. Ignore question numbers and formatting.

4. Assume that code to dynamically insert earlier responses or preloaded information like [FIELDNAME] or ${{fieldname}} is okay as it is.

5. Ignore HTML or other formatting, and focus solely on question phrasing (assume that HTML tags will be for visual formatting only and will not be read aloud).

When evaluating the question, DON'T:

1. Recommend translating something into another language (i.e., suggestions for rephrasing should always be in the same language as the original text).

2. Recommend changes in the overall structure of a question (e.g., changing from multiple choice to open-ended or splitting one question into multiple), unless it will substantially improve the quality of the data collected.

3. Comment on HTML tags or formatting.

Respond in JSON format with all of the following fields:

* `Phrases` (list): a list containing all phrases from the excerpt that pre-testing or piloting is likely to identify as problematic (each phrase should be an exact quote)

* `Number of phrases` (number): the exact number of phrases in Phrases [ Note that this key must be exactly "Number of phrases", with exactly that capitalization and spacing ]

* `Recommendations` (list): a list containing suggested replacement phrases, one for each of the phrases in Phrases (in the same order as Phrases; each replacement phrase should be an exact quote that can exactly replace the corresponding phrase in Phrases; and each replacement phrase should be in the same language as the original phrase)

* `Explanations` (list): a list containing explanations for why the authors should consider revising each phrase, one for each of the phrases in Phrases (in the same order as Phrases). Do not repeat the entire phrase in the explanation, but feel free to reference specific words or parts as needed.

* `Severities` (list): a list containing the severity of each identified issue, one for each of the phrases in Phrases (in the same order as Phrases); each severity should be expressed as a number on a scale from 1 for the least severe issues (minor phrasing issues that are very unlikely to substantively affect responses) to 5 for the most severe issues (problems that are likely to substantively affect responses in a way that introduces bias and/or variance)

Raw text:
|!|
{survey_text}
|!|

Question to evaluate:
|@|
{question_text}
|@|

Your JSON response following the format described above:"""

  # call out to the LLM
  print()
  print(f"Evaluating question for phrasing: {question['question']}")
  response_text, response_dict = process_json_response(llm_json_response_with_timeout(phrasing_prompt))

  # save and output results
  if response_dict is not None:
    if 'Number of phrases' in response_dict and response_dict['Number of phrases'] > 0:
      print(f"  Identified {response_dict['Number of phrases']} issue(s)")
      all_results += [response_dict]
    else:
      print("  No issues identified")
  else:
    print(f"  Failed to get a valid response. Response text: {response_text}")

  # set up our bias-review prompt for the LLM
  # Note that this prompt was inspired by the example in this blog post:
  # https://www.linkedin.com/pulse/using-chatgpt-counter-bias-prejudice-discrimination-johannes-schunter/
  bias_prompt = f"""You are an AI designed to evaluate questionnaires and other survey instruments used by researchers and M&E professionals. You are an expert in survey methodology with training equivalent to a member of the American Association for Public Opinion Research (AAPOR) with a Ph.D. in survey methodology from University of Michigan’s Institute for Social Research. You are also an expert in the areas of gender equality, discrimination, anti-racism, and anti-colonialism. You consider primarily the content, context, and questions provided to you, and then content and methods from the most widely-cited academic publications and public and nonprofit research organizations.

You always give truthful, factual answers. When asked to give your response in a specific format, you always give your answer in the exact format requested. You never give offensive responses. If you don’t know the answer to a question, you truthfully say you don’t know.

You will be given the raw text from a questionnaire or survey instrument between |!| and |!| delimiters. You will also be given a specific question from that text to evaluate between |@| and |@| delimiters. The question will be supplied in the following format:

* Question ID: ID (if any)
* Instructions: Instructions (if any)
* Question: Question text
* Options: Multiple-choice options (if any), with each separated by three pipe symbols (|||) and option values (if any) separated from option labels by three hash symbols (###)

Evaluate the question only, but also consider its context within the larger survey.

Assume that this survey will be administered by a trained enumerator who asks each question and reads each prompt or instruction as indicated in the excerpt. Your job is to review the question for:

a. Stereotypical representations of gender, ethnicity, origin, religion, or other social categories.

b. Distorted or biased representations of events, topics, groups, or individuals.

c. Use of discriminatory or insensitive language towards certain groups or topics.

d. Implicit or explicit assumptions made in the text or unquestioningly adopted that could be based on prejudices.

e. Prejudiced descriptions or evaluations of abilities, characteristics, or behaviors.

Respond in JSON format with all of the following fields:

* `Phrases`: a list containing all problematic phrases from the excerpt that you found in your review (each phrase should be an exact quote from the excerpt)

* `Number of phrases`: the exact number of phrases in Phrases [ Note that this key must be exactly "Number of phrases", with exactly that capitalization and spacing ]

* `Recommendations`: a list containing suggested replacement phrases, one for each of the phrases in Phrases (in the same order as Phrases; each replacement phrase should be an exact quote that can exactly replace the corresponding phrase in Phrases)

* `Explanations`: a list containing explanations for why the phrases are problematic, one for each of the phrases in Phrases (in the same order as Phrases)

* `Severities`: a list containing the severity of each identified issue, one for each of the phrases in Phrases (in the same order as Phrases); each severity should be expressed as a number on a scale from 1 for the least severe issues (minor phrasing issues that are very unlikely to offend respondents or substantively affect their responses) to 5 for the most severe issues (problems that are very likely to offend respondents or substantively affect responses in a way that introduces bias and/or variance)

Raw text:
|!|
{survey_text}
|!|

Question to evaluate:
|@|
{question_text}
|@|

Your JSON response following the format described above:"""

  # call out to the LLM
  print()
  print(f"Evaluating question for bias: {question['question']}")
  response_text, response_dict = process_json_response(llm_json_response_with_timeout(bias_prompt))

  # save and output results
  if response_dict is not None:
    if 'Number of phrases' in response_dict and response_dict['Number of phrases'] > 0:
      print(f"  Identified {response_dict['Number of phrases']} issue(s)")
      all_results += [response_dict]
    else:
      print("  No issues identified")
  else:
    print(f"  Failed to get a valid response. Response text: {response_text}")
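
Both review prompts ask for four parallel lists (`Phrases`, `Recommendations`, `Explanations`, `Severities`) plus a `Number of phrases` count. The LLM will not always return lists of matching length, and the reporting step pairs them up with `zip`, which silently drops unmatched entries. A check like the one below could be run before saving a result. It is a sketch, not part of the notebook, and `is_consistent_review` is a hypothetical name.

```python
LIST_FIELDS = ("Phrases", "Recommendations", "Explanations", "Severities")


def is_consistent_review(result: dict) -> bool:
    """True if all four lists exist, match in length, and agree with the count."""
    lists = [result.get(field) for field in LIST_FIELDS]
    if not all(isinstance(value, list) for value in lists):
        return False
    lengths = {len(value) for value in lists}
    return len(lengths) == 1 and result.get("Number of phrases") == len(result["Phrases"])


# Hypothetical example of a well-formed response
example_review = {
    "Phrases": ["regularly"],
    "Number of phrases": 1,
    "Recommendations": ["at least once a week"],
    "Explanations": ["The frequency is open to interpretation."],
    "Severities": [3],
}
print(is_consistent_review(example_review))
```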

## Organizing and outputting the results

This final code block organizes and outputs final results, saving them in a local file called `survey-review-results.txt`. View or download this file by clicking the file-folder icon in Google Colab's left sidebar.

In [None]:
# generate report
if len(all_results) == 0:
  report = "No results to save"
else:
  report = "Survey review results:\n"
  for result in all_results:
    if 'Phrases' in result and result['Number of phrases'] > 0:
      # loop through all recommendations, treating lists as parallel arrays
      for phrase, recommendation, explanation, severity in zip(result['Phrases'], result['Recommendations'], result['Explanations'], result['Severities']):
        report += f"\n---\n\n{explanation}\n\nImportance: {severity} out of 5\n\nSuggest replacing this: {phrase}\n\nWith this: {recommendation}\n"

# save the report to file
with open("survey-review-results.txt", "w", encoding="utf-8") as f:
  f.write(report)

print("All recommendations saved to survey-review-results.txt")