<a href="https://colab.research.google.com/github/higherbar-ai/survey-eval/blob/main/src/file-evaluation-example.ipynb" target="_parent"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>

# Using this file evaluation example

This notebook will use [the `surveyeval` package](https://github.com/higherbar-ai/survey-eval) to evaluate a single survey file at a time. It is broken down into the following steps:

1. **Set up** the runtime environment (you only have to do this once).
2. **Initialize** to load credentials and configuration.
3. **Prompt** for your survey file and contextual details.
4. **Read** the survey file to be evaluated.
5. **Parse** the survey file to extract the survey modules, questions, and translations.
6. **Evaluate** the extracted survey content.
7. **Output** the evaluation results.

The notebook uses the `surveyeval` package's `survey_parser` module to read and parse the survey file and the `evaluation_engine` module to conduct the evaluation. If you're running this for the first time, you should run each code cell one at a time so that you can verify the results at each stage.

See [the survey-eval GitHub repo](https://github.com/higherbar-ai/survey-eval) for more details.

## Configuration

This notebook requires different settings depending on which AI service provider you want to use. If you're running in Google Colab, configure these settings as "secrets": click the key icon in the left sidebar and, once you create a secret, click its toggle to give the notebook access to it. If you're running this notebook in a different environment, set them in a `.env` file; the first time you run the notebook, it writes out a template `.env` file for you to fill in and tells you where to find it.

The settings are the same in either environment.

### OpenAI (direct)

To use OpenAI directly:

* `openai_api_key` - your OpenAI API key (get one from [the OpenAI API key page](https://platform.openai.com/api-keys), and be sure to fund your platform account with at least $5 to allow GPT-4o model access)
* `openai_model` (optional) - the model to use (defaults to `gpt-4o`)

### OpenAI (via Microsoft Azure)

To use OpenAI via Microsoft Azure:

* `azure_api_key` - your Azure API key
* `azure_api_base` - the base URL for the Azure API
* `azure_api_engine` - the engine to use (a.k.a. the "deployment")
* `azure_api_version` - the API version to use

### Anthropic (direct)

To use Anthropic directly:

* `anthropic_api_key` - your Anthropic API key
* `anthropic_model` - the model to use

### LangSmith (for tracing)

Optionally, you can add [LangSmith tracing](https://langchain.com/langsmith):

* `langsmith_api_key` - your LangSmith API key
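
To see which providers a given set of values would enable, the provider settings above can be sketched as a simple check. This is an illustrative helper only: `configured_providers` and `REQUIRED_SETTINGS` are hypothetical names, not part of `surveyeval`, and the sketch assumes a provider counts as configured only when all of its non-optional settings are filled in. LangSmith is left out because it is optional tracing rather than a provider.

```python
# Hypothetical helper, not part of surveyeval: list the providers that a
# settings dictionary fully configures, based on the settings described above.
REQUIRED_SETTINGS = {
    "openai": ["openai_api_key"],  # openai_model is optional
    "azure": ["azure_api_key", "azure_api_base", "azure_api_engine", "azure_api_version"],
    "anthropic": ["anthropic_api_key", "anthropic_model"],
}


def configured_providers(settings: dict) -> list:
    """Return the providers whose required settings are all non-empty."""
    return [
        provider
        for provider, keys in REQUIRED_SETTINGS.items()
        if all(settings.get(key) for key in keys)
    ]


example_settings = {
    "openai_api_key": "sk-placeholder",  # placeholder value, not a real key
    "azure_api_key": "placeholder",      # Azure is incomplete: no base, engine, or version
}
print(configured_providers(example_settings))
```

With these placeholder values, only OpenAI counts as configured, because the Azure entry is missing three of its four settings.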

## Your survey file

The notebook will prompt you to select or upload a survey file. The simpler the file's formatting, the better the results will be.

## Setting up the runtime environment

This next code block installs all necessary Python and system packages into the current environment.

**If you're running in Google Colab and it prompts you to restart the notebook in the middle of the installation steps, just click CANCEL.**

In [None]:
# install Google Colab Support and surveyeval package
%pip install colab-or-not surveyeval

# download NLTK data
import nltk
nltk.download('punkt', force=True)

# set up our notebook environment (including LibreOffice)
from colab_or_not import NotebookBridge
notebook_env = NotebookBridge(
    system_packages=["libreoffice"],
    config_path="~/.hbai/surveyeval.env",
    config_template={
        "openai_api_key": "",
        "openai_model": "",
        "azure_api_key": "",
        "azure_api_base": "",
        "azure_api_engine": "",
        "azure_api_version": "",
        "anthropic_api_key": "",
        "anthropic_model": "",
        "langsmith_api_key": "",
    }
)
notebook_env.setup_environment()

## Initializing for survey evaluation

The next code block initializes the notebook by loading settings and initializing for survey evaluation.

In [None]:
# read all supported secrets
openai_api_key = notebook_env.get_setting('openai_api_key')
openai_model = notebook_env.get_setting('openai_model', 'gpt-4o')
azure_api_key = notebook_env.get_setting('azure_api_key')
azure_api_base = notebook_env.get_setting('azure_api_base')
azure_api_engine = notebook_env.get_setting('azure_api_engine')
azure_api_version = notebook_env.get_setting('azure_api_version')
anthropic_api_key = notebook_env.get_setting("anthropic_api_key")
anthropic_model = notebook_env.get_setting("anthropic_model")
langsmith_api_key = notebook_env.get_setting('langsmith_api_key')

# complain if we don't have the bare minimum to run
if (not openai_api_key
        and not (azure_api_key and azure_api_base and azure_api_engine and azure_api_version)
        and not (anthropic_api_key and anthropic_model)):
    raise Exception('We need settings set for OpenAI access (direct or via Azure) or for Anthropic access (direct). See the instructions above for more details.')

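# set LLM provider in preference order, choosing only a fully configured provider
if azure_api_key and azure_api_base and azure_api_engine and azure_api_version:
    api_provider = "azure"
elif anthropic_api_key and anthropic_model:
    api_provider = "anthropic"
else:
    api_provider = "openai"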

# configure logging to output all messages to stdout, initialize logger
import logging, sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# configure fixed output location
output_dir = notebook_env.get_output_dir(not_colab_dir="~/surveyeval", colab_subdir="")

## Prompting for your survey file, plus contextual details

This next code block prompts you to upload or select a single survey file for evaluation. If you don't have a survey file handy, you can use [this short excerpt from the DHS](https://github.com/higherbar-ai/ai-workflows/blob/main/resources/sample_dhs_questions.txt).

It then prompts you for three parameters that allow you to contextualize the evaluation:

1. **Evaluation context** - a brief description of the survey's purpose and target audience
2. **Evaluation locations** - a list of locations where the survey will be conducted, including which languages are used in which settings (if applicable)
3. **Extra evaluation instructions** - any additional instructions you want to provide to the evaluator

You don't have to supply any of these details (you can leave each prompt blank), but the more details you provide, the more accurate the evaluation will be.

In [None]:
# prompt for a single file and keep prompting till we get one
input_path = ""
while True:
    # prompt for a survey file
    survey_files = notebook_env.get_input_files("Survey file to evaluate")

    # complain if we didn't get just a single file
    if len(survey_files) != 1:
        print()
        print('Please upload a single survey file.')
        print()
    else:
        # fetch the path of the uploaded file
        input_path = survey_files[0]

        # break from loop
        break

# prompt for parameters
evaluation_context = input("Evaluation context: ").strip()
evaluation_locations = input("Evaluation locations and languages: ").strip()
evaluation_extra_instructions = input("Extra evaluation instructions (if any): ").strip()

# report results
print(f"Survey to evaluate: {input_path}")
print(f"Evaluation context: {evaluation_context}")
print(f"Evaluation locations: {evaluation_locations}")
print(f"Extra evaluation instructions: {evaluation_extra_instructions}")

# Reading the survey file

The first substantive step is to read the survey file to be evaluated. This step uses the `SurveyInterface` class in the `survey_parser` module to read the survey file and extract the raw content. That content is then written to the notebook's output directory, as either Markdown text in `raw_data.md` or JSON text in `raw_data.json`. Review the raw content to ensure that it looks reasonable before continuing.

If you're running in Google Colab, files will be saved into the content folder. Find, view, and download them by clicking on the folder icon in the left sidebar.

If you're running elsewhere, they will be saved into a `surveyeval` subdirectory created off of your user home directory.

## Data from REDCap and XLSForm (SurveyCTO, ODK, etc.) saved as JSON

The parser can read structured data directly from REDCap and XLSForm files (used by SurveyCTO, ODK, and other platforms). In those cases, the data will be written to `raw_data.json` in JSON format (rather than the default Markdown format), and parsing will be quick and painless (as no additional work will be required to parse its structure).

## Other file formats saved as Markdown

For some file formats (e.g., .pdf or .docx files), the parser will use the LLM to help convert documents into Markdown format, but that can be slow and a little costly (about $0.015/page). To disable this feature, set `use_llm` to `False` in the `read_survey_contents()` call in the code cell below.
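
As a rough guide to what that per-page figure means for a whole document, the arithmetic can be sketched as follows. The constant simply restates the approximate $0.015/page quoted above; actual costs depend on the model and the document, and `estimate_conversion_cost` is a hypothetical helper, not part of `surveyeval`.

```python
# Rough cost estimate for LLM-assisted conversion, using the approximate
# per-page figure quoted above (an estimate, not a published price).
APPROX_COST_PER_PAGE_USD = 0.015


def estimate_conversion_cost(page_count: int) -> float:
    """Return the approximate conversion cost in US dollars, rounded to cents."""
    return round(page_count * APPROX_COST_PER_PAGE_USD, 2)


print(f"40 pages: about ${estimate_conversion_cost(40):.2f}")
```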

## Beware garbage-in-garbage-out

Depending on the format of your survey file, the parser might do a poor job reading the content. In that case, it's not worth continuing to parse and then evaluate the survey, because you'll only get garbage out. If the raw text content looks bad, you should either fix the survey file or use a different survey file.

### Fixing the survey file

The closer you get to a simple Word file with minimal formatting, the easier it will be for the parser to parse the content. Try to:

1. Use headings for each module
2. Label each question with a unique number or ID
3. Keep different translations for each question together, with the same unique number or ID
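
When the raw content is long, a quick summary can help you judge whether the file follows the three guidelines above. The following is a minimal sketch, not part of `surveyeval`: `summarize_raw_markdown` is a hypothetical helper, and its pattern for question labels (such as `1.`, `Q12)`, or `A3:` at the start of a line) is an assumption you would adapt to your own numbering scheme.

```python
import re


# Hypothetical helper, not part of surveyeval: summarize raw Markdown text so
# that obviously bad extractions (no headings, no labeled questions) stand out.
def summarize_raw_markdown(text: str) -> dict:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {
        "non_empty_lines": len(lines),
        # Markdown headings start with one to six '#' characters and a space
        "headings": sum(1 for line in lines if re.match(r"#{1,6}\s", line)),
        # assumption: question labels look like "1." / "Q12)" / "A3:" at line start
        "labeled_questions": sum(
            1 for line in lines if re.match(r"[A-Za-z]{0,3}\d+[.):]\s", line)
        ),
    }


sample = """# MODULE 1

Q1. What is your age?
Q2. Do you own a phone?
"""
print(summarize_raw_markdown(sample))
```

Once the cell below has written `raw_data.md`, you could pass that file's text to the same function. A result with zero headings or very few labeled questions suggests the survey file needs cleanup before parsing.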

In [None]:
import json, os

# initialize for file reading and parsing
from surveyeval.survey_parser import SurveyInterface
if api_provider == "openai":
    survey_interface = SurveyInterface(openai_api_key=openai_api_key, openai_model=openai_model, langsmith_api_key=langsmith_api_key)
elif api_provider == "azure":
    survey_interface = SurveyInterface(azure_api_key=azure_api_key, azure_api_engine=azure_api_engine, azure_api_base=azure_api_base, azure_api_version=azure_api_version, langsmith_api_key=langsmith_api_key)
elif api_provider == "anthropic":
    survey_interface = SurveyInterface(anthropic_api_key=anthropic_api_key, anthropic_model=anthropic_model, langsmith_api_key=langsmith_api_key)
else:
    raise Exception('Unsupported API provider.')

# read survey file contents: set use_llm=False to disable LLM-based parsing
survey_contents = survey_interface.read_survey_contents(os.path.expanduser(input_path), use_llm=True)

# output read contents to file in output path
already_parsed = isinstance(survey_contents, dict)
raw_data_output_path = os.path.join(os.path.expanduser(output_dir), "raw_data.json" if already_parsed else "raw_data.md")
with open(raw_data_output_path, 'w', encoding='utf-8') as f:
    if already_parsed:
        # write structured data to file as JSON
        json.dump(survey_contents, f, indent=2)

        print()
        print(f"JSON data written to {raw_data_output_path}. Because file was read as structured data to begin with, the parsing step will be quick and easy.")
    else:
        # write raw data to file as Markdown text
        f.write(survey_contents)

        print()
        print(f"Raw data written to {raw_data_output_path}. Review it to ensure that it looks reasonable before continuing on to parsing.")

# Parsing the survey file

The next step is to parse the raw data into a series of modules and questions. If the original input file was read as structured data (e.g., an XLSForm file), this step is quick, easy, and generally error-free. Otherwise, the parser uses AI assistance to make sense of the raw data. The parsed survey data is then written to `parsed_data.json` for your review. Be sure that it looks reasonable before continuing on to the next step.

The parsed data should look something like this:

```
{
    "MODULE 1": {
        "module_name": "MODULE 1",
        "module_title": "",
        "module_intro": "",
        "questions": {
            "question_id_1": [
                {
                    "question": "Question 1 in English",
                    "language": "English",
                    "options": [
                        {
                            "label": "YES",
                            "value": "1"
                        },
                        {
                            "label": "NO",
                            "value": "2"
                        }
                    ],
                    "instructions": "Instructions, if any"
                },
                {
                    "question": "Question 1 in Spanish",
                    "language": "Spanish",
                    "options": [
                        {
                            "label": "YES",
                            "value": "1"
                        },
                        {
                            "label": "NO",
                            "value": "2"
                        }
                    ],
                    "instructions": "Instructions, if any"
                }
            ],
            "question_id_2": [
                {
                    "question": "Question 2 in English",
                    "language": "English",
                    "options": [],
                    "instructions": "Instructions, if any"
                },
                {
                    "question": "Question 2 in Spanish",
                    "language": "Spanish",
                    "options": [],
                    "instructions": "Instructions, if any"
                }
            ]
        }
    },
    "MODULE 2": {
        "module_name": "MODULE 2",
        "module_title": "",
        "module_intro": "",
        "questions": {
            "question_id_3": [
                {
                    "question": "Question 3 in English",
                    "language": "English",
                    "options": [
                        {
                            "label": "YES",
                            "value": "1"
                        },
                        {
                            "label": "NO",
                            "value": "2"
                        }
                    ],
                    "instructions": "Instructions, if any"
                },
                {
                    "question": "Question 3 in Spanish",
                    "language": "Spanish",
                    "options": [
                        {
                            "label": "YES",
                            "value": "1"
                        },
                        {
                            "label": "NO",
                            "value": "2"
                        }
                    ],
                    "instructions": "Instructions, if any"
                }
            ],
            "question_id_4": [
                {
                    "question": "Question 4 in English",
                    "language": "English",
                    "options": [],
                    "instructions": "Instructions, if any"
                },
                {
                    "question": "Question 4 in Spanish",
                    "language": "Spanish",
                    "options": [],
                    "instructions": "Instructions, if any"
                }
            ]
        }
    }
}
```
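
Because `parsed_data.json` can be long, a quick structural summary can support the review. The sketch below walks the structure shown above and counts modules, questions, and languages. `summarize_parsed_data` is a hypothetical helper, not part of `surveyeval`, and its warning about missing languages assumes that every question is meant to appear in every language used in the survey, which may not hold for your instrument.

```python
# Hypothetical helper, not part of surveyeval: summarize parsed survey data in
# the structure shown above and flag questions with an uneven set of languages.
def summarize_parsed_data(parsed: dict) -> dict:
    summary = {"modules": len(parsed), "questions": 0, "languages": set(), "warnings": []}
    # first pass: count questions and collect every language seen
    for module_key, module in parsed.items():
        for question_id, versions in module.get("questions", {}).items():
            summary["questions"] += 1
            summary["languages"].update(v.get("language", "") for v in versions)
            if not any(v.get("question") for v in versions):
                summary["warnings"].append(f"{module_key}/{question_id}: no question text")
    # second pass (assumption): every question should cover every language seen
    for module_key, module in parsed.items():
        for question_id, versions in module.get("questions", {}).items():
            missing = summary["languages"] - {v.get("language", "") for v in versions}
            if missing:
                summary["warnings"].append(f"{module_key}/{question_id}: missing {sorted(missing)}")
    return summary


sample = {
    "MODULE 1": {
        "module_name": "MODULE 1",
        "module_title": "",
        "module_intro": "",
        "questions": {
            "q1": [
                {"question": "Q1 in English", "language": "English", "options": [], "instructions": ""},
                {"question": "Q1 in Spanish", "language": "Spanish", "options": [], "instructions": ""},
            ],
            "q2": [
                {"question": "Q2 in English", "language": "English", "options": [], "instructions": ""},
            ],
        },
    }
}
result = summarize_parsed_data(sample)
print(result["modules"], result["questions"], sorted(result["languages"]))
for warning in result["warnings"]:
    print(warning)
```

After the parsing cell below has run, calling `summarize_parsed_data(data)` would apply the same check to your own survey.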

## Beware garbage-in-garbage-out

Depending on the format of your survey file, the survey parser might do a poor job parsing the content. In that case, it's not worth continuing to evaluate the survey, because you'll only get garbage out. If the parsed content looks bad, you should either fix the survey file or use a different survey file.

### Fixing the survey file

The closer you get to a simple Word file with minimal formatting, the easier it will be for the parser to parse the content. Try to:
 
1. Use headings for each module
2. Label each question with a unique number or ID
3. Keep different translations for each question together, with the same unique number or ID  


In [None]:
# parse file
data = survey_interface.parse_survey_contents(survey_contents=survey_contents, survey_context=evaluation_context)

# output parsed data to file so that the parsing step can itself be evaluated
parsed_data_output_path = os.path.join(os.path.expanduser(output_dir), "parsed_data.json")
with open(parsed_data_output_path, 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2)

print()
print(f"Parsed data written to {parsed_data_output_path}. Review it to ensure that it looks reasonable before continuing.")

## Optional: Outputting parsed data to XLSForm

If you want to output the parsed data to an XLSForm file, you can use the `output_parsed_data_to_xlsform()` function. This function will write the parsed data to an XLSForm file that you can use to create a survey in a tool like SurveyCTO or ODK.

In [None]:
# OUTPUT PARSED DATA TO XLSFORM

import os
import re

# construct filename for XLSForm output
input_filename = os.path.basename(input_path)
input_filename_without_ext = os.path.splitext(input_filename)[0]
new_filename = f"{input_filename_without_ext}_xlsform.xlsx"
xlsform_output_path = os.path.join(os.path.expanduser(output_dir), new_filename)

# construct a safe form ID from the input filename
safe_form_id = re.sub(r'\W+', '_', input_filename_without_ext).lower()
# if it doesn't begin with a letter, put a 'f_' on the front
if not safe_form_id[0].isalpha():
    safe_form_id = 'f_' + safe_form_id

# output parsed data to XLSForm file
survey_interface.output_parsed_data_to_xlsform(data=data, form_id=safe_form_id, form_title="Automated form output", output_file=xlsform_output_path)

print(f"XLSForm written to {xlsform_output_path}.")

# Running the evaluation

The final step is to run the evaluation. This step uses the `evaluation_engine` module to evaluate the parsed survey data.

The evaluation process has been parallelized to run more quickly, but it can still take a long time. You can monitor progress by watching the log output. To stop the evaluation, interrupt the kernel in your Jupyter notebook or press Ctrl+C in the console.

Once you see the results, you might want to adjust the evaluation. If so, you can re-run the notebook and use the opportunity to provide extra instructions when prompted to do so. For example:

    * When making recommendations, use conversational language and phrasing appropriate to
      informal household interviews.
     
    * When evaluating language and phrasing for bias, stereotypes, and prejudice, don't be
      more sensitive than is appropriate in the survey locations considered (e.g., don't
      impose Western values in non-Western settings).

## Rate limit errors

If you encounter rate limit errors, you can change the following line to lower the `chunk_size` (which is the number of parallel requests made to the LLM):

```python
await process_task_chunks(tasks, chunk_size=5, delay=1)
```
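
To illustrate why a smaller `chunk_size` reduces pressure on rate limits, here is a minimal, self-contained sketch of chunked concurrency. It is not the notebook's implementation: `fake_request` is a hypothetical stand-in for an LLM call and `run_in_chunks` is an illustrative name. The sketch records the highest number of requests in flight at once, which never exceeds `chunk_size`.

```python
import asyncio


# Self-contained sketch: fake_request is a hypothetical stand-in for an LLM call.
async def fake_request(task_id: int, active: list, peak: list) -> int:
    active[0] += 1
    peak[0] = max(peak[0], active[0])  # track the highest concurrency seen
    await asyncio.sleep(0.01)          # simulate network latency
    active[0] -= 1
    return task_id


async def run_in_chunks(task_ids: list, chunk_size: int, delay: float) -> tuple:
    active, peak, results = [0], [0], []
    for start in range(0, len(task_ids), chunk_size):
        chunk = task_ids[start:start + chunk_size]
        # at most chunk_size requests are in flight at once
        results.extend(await asyncio.gather(*(fake_request(t, active, peak) for t in chunk)))
        await asyncio.sleep(delay)  # pause before starting the next chunk
    return results, peak[0]


results, peak = asyncio.run(run_in_chunks(list(range(12)), chunk_size=5, delay=0.01))
print(f"{len(results)} tasks completed, peak concurrency {peak}")
```

The sketch uses `asyncio.run()` so that it works as a standalone script; inside a notebook, where an event loop is already running, you would use `await run_in_chunks(...)` instead. Raising `delay` is another way to slow the request rate.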

In [None]:
# initialize evaluation engine
from surveyeval import EvaluationEngine, PhrasingEvaluationLens, ValidatedInstrumentEvaluationLens, BiasEvaluationLens, TranslationEvaluationLens

# set model based on provider
if api_provider == "openai":
    eval_model = openai_model
elif api_provider == "azure":
    eval_model = azure_api_engine
elif api_provider == "anthropic":
    eval_model = anthropic_model
else:
    raise Exception('Unsupported LLM provider.')

evaluation_engine = EvaluationEngine(
    evaluation_model=eval_model,
    evaluation_provider=api_provider,
    openai_api_key=openai_api_key,
    azure_api_key=azure_api_key,
    azure_api_base=azure_api_base,
    azure_api_version=azure_api_version,
    anthropic_api_key=anthropic_api_key,
    temperature=0,
    max_retries=3,
    logger=logger,
    extra_evaluation_instructions=evaluation_extra_instructions,
    langsmith_api_key=langsmith_api_key
)

# initialize evaluation lenses
phrasing_lens = PhrasingEvaluationLens(evaluation_engine)
validated_instrument_lens = ValidatedInstrumentEvaluationLens(evaluation_engine)
bias_lens = BiasEvaluationLens(evaluation_engine)
translation_lens = TranslationEvaluationLens(evaluation_engine)

# initialize a list to store results data
results_data = []

# prepare for parallel processing of evaluation tasks
import asyncio
async def yield_chunked_tasks(task_list: list, chunk_size: int):
    """
    Yield successive chunk_size-sized chunks from tasks.

    :param task_list: List of tasks to be chunked.
    :type task_list: list
    :param chunk_size: Size of each chunk.
    :type chunk_size: int
    :return: Yields chunks of tasks.
    """

    for i in range(0, len(task_list), chunk_size):
        yield task_list[i:i + chunk_size]

async def process_task_chunks(task_list: list, chunk_size: int, delay: int):
    """
    Process chunks of tasks with a specified delay.

    :param task_list: List of tasks to process.
    :type task_list: list
    :param chunk_size: Size of each task chunk.
    :type chunk_size: int
    :param delay: Delay between processing each chunk (in seconds).
    :type delay: int
    """

    async def task_wrapper(task):
        evaluation_type, text_excerpt, eval_lens, func = task
        result = await func()
        return evaluation_type, text_excerpt, eval_lens, result

    chunk_index = 0
    async for chunk in yield_chunked_tasks(task_list, chunk_size):
        # process each task in the current chunk
        wrapped_chunk = [task_wrapper(task) for task in chunk]
        results_chunk = await asyncio.gather(*wrapped_chunk)
        results_data.extend(results_chunk)
        await asyncio.sleep(delay)
        chunk_index += 1

# initialize list of tasks
tasks = []

# run through the parsed questionnaire data and generate list of async review tasks
for module_key, module in data.items():
    # run through all questions in the module, assembling questions and translations for review
    module_questions = []
    full_module_text = f"{module_key}\n\n"
    if module['module_intro']:
        full_module_text += f"{module['module_intro']}\n\n"
    for question in module['questions'].values():
        # consider each translation of the question
        languages = []
        translations = []
        for translation in question:
            option_labels = ' | '.join(option['label'] for option in translation['options'])
            full_question = f"Question: {translation['question']}\nOptions: {option_labels}\nInstructions: {translation['instructions']}\n"

            # conduct bias review on each translation independently
            # (bind the excerpt as a default argument so each task keeps its own text)
            tasks.append(("bias", full_question, bias_lens, lambda excerpt=full_question: bias_lens.a_evaluate(
                survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=excerpt)))

            # save first language only for full phrasing and module-level review
            if not languages:
                module_questions.append(full_question)
                full_module_text += full_question
                full_module_text += "\n"

            # remember this translation for translation review
            languages.append(translation['language'])
            translations.append(full_question)

        # conduct translation review whenever there are multiple translations
        if len(languages) > 1:
            primary_language_text = ""
            for idx, language in enumerate(languages):
                if not language:
                    language = "Unknown-language"

                if idx == 0:
                    primary_language_text = f"Primary language ({language}):\n\n{translations[idx]}\n\n"
                else:
                    questions_by_language = primary_language_text + f"Translated language ({language}):\n\n{translations[idx]}\n\n"

                    # conduct translation review in pairwise fashion, with each translation against the first
                    # (primary) language; bind the excerpt as a default argument so each task keeps its own text
                    tasks.append(("translation", questions_by_language, translation_lens, lambda excerpt=questions_by_language: translation_lens.a_evaluate(survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=excerpt)))

    # execute phrasing evaluation on each module question (first language only), binding the loop values as
    # default arguments so each task keeps its own module text and question
    for module_question in module_questions:
        tasks.append(("phrasing", module_question, phrasing_lens, lambda excerpt=full_module_text, q=module_question: phrasing_lens.a_evaluate(survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=excerpt, survey_question=q)))

    # execute module-level evaluation lenses (binding the module text for the same reason)
    tasks.append(("validated_instrument", full_module_text, validated_instrument_lens, lambda excerpt=full_module_text: validated_instrument_lens.a_evaluate(survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=excerpt)))

# process tasks in chunks, with a delay between chunks
await process_task_chunks(tasks, chunk_size=5, delay=1)

print()
print(f"Completed evaluation of {len(results_data)} tasks. Execute next cell to output results.")

## Organizing and outputting the results

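This final code cell organizes the results and writes them to `aggregated_results.txt`, with a separate `aggregated_results_<type>.txt` file for each evaluation type that produced results (`phrasing`, `bias`, `translation`, and `validated_instrument`). It also writes a `logs.txt` file containing a detailed log of the evaluation process.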

If you're running in Google Colab, files will be saved into the content folder. Find, view, and download them by clicking on the folder icon in the left sidebar.

If you're running elsewhere, they will be saved into a `surveyeval` subdirectory created off of your user home directory.

In [None]:
# output results, including a detailed processing log
logs = []
results = []
results_by_type = {
    "phrasing": [],
    "bias": [],
    "translation": [],
    "validated_instrument": []
}

def add_log(new_history):
    for his in new_history:
        logs.append("PROMPT:")
        logs.append(his[0])
        logs.append("RESPONSE:")
        logs.append(his[1])

for output in results_data:
    eval_type = output[0]
    excerpt = output[1]
    lens = output[2]
    result_dict = output[3]

    # retrieve and format result
    if result_dict["result"] == "success":
        formatted_result = lens.format_result(result=result_dict["response"])
    else:
        formatted_result = f"Error: {result_dict['error']}"

    # determine appropriate evaluation header
    eval_headers = {
        "phrasing": "Phrasing evaluation for excerpt:",
        "bias": "Bias evaluation for excerpt:",
        "translation": "Translation evaluation for excerpt:",
        "validated_instrument": "Module evaluation for excerpt:"
    }
    eval_header = eval_headers.get(eval_type, "UNKNOWN EVALUATION")

    # report result
    if formatted_result:
        # report first for aggregate results output
        results.append(eval_header)
        results.append(excerpt.strip())
        results.append(formatted_result.strip() + "\n")
        # report also for type-specific results output
        type_results = results_by_type.get(eval_type, [])
        type_results.append("Excerpt considered:")
        type_results.append(excerpt.strip())
        type_results.append(formatted_result.strip() + "\n")
    
    # log full exchange+result for transparency
    logs.append(eval_header)
    if result_dict["history"]:
        add_log(result_dict["history"])
    logs.append(formatted_result or "")  # guard against an empty (None) result, which would break the join below

# write results to files in the output directory
print()
logs_output_path = os.path.join(os.path.expanduser(output_dir), "logs.txt")
with open(logs_output_path, 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(logs))
print(f"Detailed logs written to {logs_output_path}.")

results_output_path = os.path.join(os.path.expanduser(output_dir), "aggregated_results.txt")
with open(results_output_path, 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(results))
print(f"All results written to {results_output_path}.")

for eval_type, eval_results in results_by_type.items():
    if eval_results:
        type_results_output_path = os.path.join(os.path.expanduser(output_dir), f"aggregated_results_{eval_type}.txt")
        with open(type_results_output_path, 'w', encoding='utf-8') as f:
            f.write('\n\n'.join(eval_results))
        print(f"{eval_type} results written to {type_results_output_path}.")