# Using the file evaluation example

Use this workbook to evaluate a single survey file at a time. Once you install dependencies and configure settings, you can run this workbook to perform the evaluation in four steps:

1. **Initialize** to load credentials and configuration.
2. **Read** the survey file to be evaluated.
3. **Parse** the survey file to extract the survey modules, questions, and translations.
4. **Evaluate** the extracted survey content.  

In the process, this workbook will use the `evaluation_engine` module to conduct the evaluation and the `questionnaire_file_reader` and `questionnaire_file_parser` modules to read and parse the survey file.

## Preparing to run this workbook

### Installing dependencies

1. Install Python dependencies: `pip install -r requirements.txt`
2. Install poppler: `brew install poppler` (macOS) or [see here for other platforms](https://pdf2image.readthedocs.io/en/latest/installation.html)
3. Install tesseract: `brew install tesseract` (macOS) or [see here for other platforms](https://tesseract-ocr.github.io/tessdoc/Installation.html)
4. Download an English pipeline for spaCy: `python -m spacy download en_core_web_sm`
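
Before moving on, it can help to confirm that the installs above are actually visible from the Python environment that will run this workbook. The following is a minimal sketch, not part of the repository: `find_missing_dependencies` is a hypothetical helper, and it assumes that poppler is detectable through its `pdftoppm` executable and tesseract through an executable of the same name.

```python
import importlib.util
import shutil


def find_missing_dependencies() -> list[str]:
    """Return readable names of dependencies that could not be found (hypothetical helper)."""
    missing = []
    # poppler ships the pdftoppm executable that pdf2image relies on
    if shutil.which("pdftoppm") is None:
        missing.append("poppler (pdftoppm not on PATH)")
    # tesseract installs an executable of the same name
    if shutil.which("tesseract") is None:
        missing.append("tesseract (not on PATH)")
    # spaCy pipelines are installed as regular Python packages
    if importlib.util.find_spec("en_core_web_sm") is None:
        missing.append("spaCy pipeline en_core_web_sm")
    return missing


missing_dependencies = find_missing_dependencies()
if missing_dependencies:
    print("Missing:", ", ".join(missing_dependencies))
else:
    print("All checked dependencies were found.")
```

This only checks that the tools can be located; it does not verify versions or that every package in `requirements.txt` is installed.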

### Configuring settings

This workbook begins by loading credentials and configuration from an `.ini` file stored in `~/.hbai/file-evaluation-example.ini`. The `~` in the path refers to the current user's home directory, and the `.ini` file contents are as follows:

    [openai]
    summarize-model=gpt-3.5-turbo
    summarize-model-max-tokens=4000
    openai-api-key=sk-keyhere
    azure-api-key=keyhere
    azure-api-base=https://almitra-azure-1.openai.azure.com/
    azure-api-version=2023-05-15

    [parsing]
    extraction-model=gpt-4-turbo
    splitter_chunk_size=6000
    splitter_overlap_size=500

    [evaluation]
    evaluation-model=gpt-4-turbo
    tiktoken-model=gpt-4-turbo
    evaluation-context=Baseline survey on consumption, savings, and financial well-being
    evaluation-locations=United States (English), Mexico (Spanish)
    evaluation-extra-instructions=
    evaluation-input-path=~/Files/evaluation/example_baseline_survey.docx
    evaluation-output-path=~/Files/evaluation
    
    [langsmith]
    langsmith-api-key=optional_key_here

The easiest way to get started is to take the `file-evaluation-example.ini` in the root of this repository, copy it to `~/.hbai/file-evaluation-example.ini`, and customize as follows:

1. Configure for OpenAI access:
    1. If you want to use OpenAI directly: replace `openai-api-key` with your OpenAI API key.
    2. If you want to use Azure's OpenAI service: replace `azure-api-key` with your Azure API key, and replace `azure-api-base` and `azure-api-version` with the base URL and version of the Azure API you want to use.
2. Set the `evaluation-context` and `evaluation-locations` with short descriptions to help the AI system understand the relevant context and setting. For each location in `evaluation-locations`, include both the location and the language, so that the system knows which languages are used in which settings.  
3. Set the `evaluation-input-path` to the path of the survey file you want to evaluate.
4. Set the `evaluation-output-path` to the path of the directory where you want to write the evaluation results.
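
The settings described above can be sanity-checked before running the workbook. The sketch below is illustrative only: `check_config` and `REQUIRED_KEYS` are hypothetical names, the key list is taken from the example `.ini` shown earlier, and the location check assumes entries are separated by commas and written as `Location (Language)`.

```python
import configparser
import os
import re

# Sections and keys taken from the example .ini file above
REQUIRED_KEYS = {
    "openai": ["summarize-model", "summarize-model-max-tokens", "openai-api-key",
               "azure-api-key", "azure-api-base", "azure-api-version"],
    "parsing": ["extraction-model", "splitter_chunk_size", "splitter_overlap_size"],
    "evaluation": ["evaluation-model", "tiktoken-model", "evaluation-context",
                   "evaluation-locations", "evaluation-extra-instructions",
                   "evaluation-input-path", "evaluation-output-path"],
    "langsmith": ["langsmith-api-key"],
}


def check_config(parser: configparser.RawConfigParser) -> list[str]:
    """Return a list of problems found in the loaded configuration (hypothetical helper)."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        if not parser.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        for key in keys:
            if not parser.has_option(section, key):
                problems.append(f"missing key {key} in [{section}]")

    # each entry in evaluation-locations should look like "Location (Language)"
    if parser.has_option("evaluation", "evaluation-locations"):
        locations = parser.get("evaluation", "evaluation-locations")
        for location in (part.strip() for part in locations.split(",")):
            if location and not re.fullmatch(r".+\(.+\)", location):
                problems.append(f"location '{location}' does not name a language in parentheses")
    return problems


parser = configparser.RawConfigParser()
parser.read(os.path.expanduser("~/.hbai/file-evaluation-example.ini"))
for problem in check_config(parser):
    print(problem)
```

An empty value (such as a blank `evaluation-extra-instructions=`) still counts as present; only keys that are absent altogether are reported.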

## Running this workbook

Once you've installed dependencies and configured settings, you can run this workbook to evaluate a survey file. Run each cell in order, and follow the instructions in the output to review the raw and parsed data before continuing. All results (including intermediate outputs and log files) will be written to the `evaluation-output-path` you configured in the `.ini` file.
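
Judging only from the cells shown in this workbook, the output files are opened directly under the `evaluation-output-path` directory without creating it first, so it may be worth making sure the directory exists. A minimal sketch, assuming a hypothetical helper named `ensure_output_dir`:

```python
from pathlib import Path


def ensure_output_dir(path_setting: str) -> Path:
    """Expand ~ in the configured path and create the directory if needed (hypothetical helper)."""
    output_dir = Path(path_setting).expanduser()
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


# Example use after the initialization cell has run:
# ensure_output_dir(evaluation_output_path)
```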

In [None]:
### INITIALIZATION

# for convenience, auto-reload modules when they've changed
%load_ext autoreload
%autoreload 2

import configparser
import os
import logging
import sys

# configure logging to output all messages to stdout, initialize logger
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# load credentials and other configuration from a local ini file
inifile_location = os.path.expanduser("~/.hbai/file-evaluation-example.ini")
inifile = configparser.RawConfigParser()
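# read() silently skips missing files, so fail early with a clear message
if not inifile.read(inifile_location):
    raise FileNotFoundError(f"Could not read configuration file: {inifile_location}")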

# load configuration
openai_api_key = inifile.get("openai", "openai-api-key")
azure_api_key = inifile.get("openai", "azure-api-key")
azure_api_base = inifile.get("openai", "azure-api-base")
azure_api_version = inifile.get("openai", "azure-api-version")
if openai_api_key and not azure_api_key:
    api_provider = "openai"
else:
    api_provider = "azure"
max_combined_tokens = int(inifile.get("openai", "summarize-model-max-tokens"))
summarize_model = inifile.get("openai", "summarize-model")
parsing_model = inifile.get("parsing", "extraction-model")
splitter_chunk_size = int(inifile.get("parsing", "splitter_chunk_size"))
splitter_overlap_size = int(inifile.get("parsing", "splitter_overlap_size"))
gpt_model = inifile.get("evaluation", "evaluation-model")
tiktoken_model = inifile.get("evaluation", "tiktoken-model")
evaluation_context = inifile.get("evaluation", "evaluation-context")
if not evaluation_context:
    evaluation_context = "Unknown"
evaluation_locations = inifile.get("evaluation", "evaluation-locations")
if not evaluation_locations:
    evaluation_locations = "Unknown"
evaluation_extra_instructions = inifile.get("evaluation", "evaluation-extra-instructions")
if not evaluation_extra_instructions:
    evaluation_extra_instructions = ""
evaluation_input_path = inifile.get("evaluation", "evaluation-input-path")
evaluation_output_path = inifile.get("evaluation", "evaluation-output-path")
langsmith_api_key = inifile.get("langsmith", "langsmith-api-key")

# support LangSmith for debugging (optional)
if langsmith_api_key:
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = "survey-eval"
    os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    os.environ["LANGCHAIN_API_KEY"] = langsmith_api_key

# Reading the survey file

The first step is to read the survey file to be evaluated. This step uses the `questionnaire_file_reader` module to read the survey file and extract the raw text content of each page. The raw text content is then written to `raw_data.txt` in the `evaluation-output-path` you configured in the `.ini` file. Review the raw text content to ensure that it looks reasonable before continuing.

## Beware garbage-in-garbage-out

Depending on the format of your survey file, the `questionnaire_file_reader` module might do a bad job reading the content. In that case, it's not worth continuing to parse and then evaluate the survey, because you'll only get garbage out. If the raw text content looks bad, you should either fix the survey file or use a different survey file.

## Fixing the survey file

The closer you get to a simple Word file with minimal formatting, the easier it will be for the `questionnaire_file_reader` module to read the content.
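
Reading `raw_data.txt` by eye is the real check, but a rough number can flag obviously broken extractions. The sketch below is a crude heuristic and not part of the `questionnaire_file_reader` module: `readable_ratio` is a hypothetical helper, and both the punctuation set and any threshold you compare against are assumptions to adjust for your languages.

```python
# Punctuation treated as "normal" survey text; extend this for your languages
ALLOWED_PUNCTUATION = set(".,;:!?()[]'\"-/%¿¡")


def readable_ratio(text: str) -> float:
    """Share of characters that are letters, digits, whitespace, or common punctuation."""
    if not text:
        return 0.0
    readable = sum(
        1 for ch in text
        if ch.isalnum() or ch.isspace() or ch in ALLOWED_PUNCTUATION
    )
    return readable / len(text)


sample_good = "1. What is your age?\n2. ¿Cuál es su edad?"
sample_bad = "\x00\x01\x02#~^|"
print(f"clean sample:   {readable_ratio(sample_good):.2f}")
print(f"garbled sample: {readable_ratio(sample_bad):.2f}")
```

A low ratio on your own `raw_data.txt` suggests the file was read poorly; a high ratio does not prove the text is in the right order or complete.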
 

In [None]:
# initialize for file reading and parsing
from surveyeval.survey_parser import set_langchain_splits, create_schema, generate_extractor_chain, extract_data_from_file
set_langchain_splits(splitter_chunk_size, splitter_overlap_size)
schema, extraction_validator = create_schema()
extractor_chain = generate_extractor_chain(parsing_model, azure_api_base, azure_api_key, azure_api_version, schema)

In [None]:
# read raw data from the survey file (get_data_from_url is called here with a local path)
from surveyeval.survey_parser import get_data_from_url
raw_data = get_data_from_url(os.path.expanduser(evaluation_input_path))
# output raw data to file in output path
raw_data_output_path = os.path.join(os.path.expanduser(evaluation_output_path), "raw_data.txt")
with open(raw_data_output_path, 'w', encoding='utf-8') as f:
    for item in [doc.page_content if not isinstance(doc, str) else doc for doc in raw_data]:
        f.write(f"{item}\n")

print(f"Raw data written to {raw_data_output_path}. Review it to ensure that it looks reasonable before continuing.")

# Parsing the survey file

The next step is to parse the raw data into a series of modules and questions. This step uses the `questionnaire_file_parser` module — as well as AI assistance — to make sense of the raw data. The parsed survey data is then written to `parsed_data.json` in the `evaluation-output-path` you configured in the `.ini` file. Review the parsed content to ensure that it looks reasonable before continuing.

The parsed data should look like this:

```
{
  "MODULE 1": {
    "question_id_1": [
      {
        "question": "Question 1 in English",
        "language": "English",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      },
      {
        "question": "Question 1 in Spanish",
        "language": "Spanish",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      }
    ],
    "question_id_2": [
      {
        "question": "Question 2 in English",
        "language": "English",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      },
      {
        "question": "Question 2 in Spanish",
        "language": "Spanish",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      }
    ]
  },
  "MODULE 2": {
    "question_id_3": [
      {
        "question": "Question 3 in English",
        "language": "English",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      },
      {
        "question": "Question 3 in Spanish",
        "language": "Spanish",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      }
    ],
    "question_id_4": [
      {
        "question": "Question 4 in English",
        "language": "English",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      },
      {
        "question": "Question 4 in Spanish",
        "language": "Spanish",
        "options": "Options, if any",
        "instructions": "Instructions, if any"
      }
    ]
  }
}
```
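
Alongside a manual review, the shape shown above can be checked mechanically. This is a minimal sketch under the assumption that `parsed_data.json` follows exactly that layout (modules, then question IDs, then a list of translations with four fields); `find_structure_problems` is a hypothetical helper, not part of the parser module.

```python
EXPECTED_FIELDS = {"question", "language", "options", "instructions"}


def find_structure_problems(parsed: dict) -> list[str]:
    """Return descriptions of entries that do not match the expected layout."""
    problems = []
    for module_name, questions in parsed.items():
        if not isinstance(questions, dict):
            problems.append(f"{module_name}: expected an object of questions")
            continue
        for question_id, translations in questions.items():
            if not isinstance(translations, list) or not translations:
                problems.append(f"{module_name}/{question_id}: expected a non-empty list")
                continue
            for index, translation in enumerate(translations):
                if isinstance(translation, dict):
                    missing = EXPECTED_FIELDS - set(translation)
                else:
                    missing = EXPECTED_FIELDS
                if missing:
                    problems.append(
                        f"{module_name}/{question_id}[{index}]: missing {sorted(missing)}"
                    )
    return problems


sample = {
    "MODULE 1": {
        "question_id_1": [
            {"question": "Question 1 in English", "language": "English",
             "options": "", "instructions": ""},
            {"question": "Question 1 in Spanish", "language": "Spanish"},
        ]
    }
}
for problem in find_structure_problems(sample):
    print(problem)
```

A structurally valid file can still contain badly parsed questions, so this complements the manual review rather than replacing it.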

## Beware garbage-in-garbage-out

Depending on the format of your survey file, the `questionnaire_file_parser` module might do a bad job parsing the content. In that case, it's not worth continuing to evaluate the survey, because you'll only get garbage out. If the parsed content looks bad, you should either fix the survey file or use a different survey file. You can also try to adjust the `splitter_chunk_size` and `splitter_overlap_size` settings in the `.ini` file to see if that helps.
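
To build intuition for what the two splitter settings control, here is a toy illustration of splitting text into overlapping chunks. It is only a sketch: `split_with_overlap` is a hypothetical function that counts characters, whereas the splitter actually used by the parser may measure size differently and prefer to break at natural boundaries.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be smaller than it")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


for chunk in split_with_overlap("abcdefghij", chunk_size=4, overlap=1):
    print(chunk)
```

Larger chunks keep more of a module together in each request, while the overlap reduces the chance that a question cut at a chunk boundary is lost or parsed only in part.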

## Fixing the survey file

The closer you get to a simple Word file with minimal formatting, the easier it will be for the `questionnaire_file_parser` module to parse the content. Try to:
 
1. Make module headings obvious
2. Label each question with a unique number or ID
3. Keep different translations for each question together, with the same unique number or ID  
 

In [None]:
# PARSE FILE FOR EVALUATION

# output path to file we'll process
import json
print(evaluation_input_path)

# parse file
data = await extract_data_from_file(os.path.expanduser(evaluation_input_path), extractor_chain)

# output parsed data to file so that the parsing step can itself be evaluated
parsed_data_output_path = os.path.join(os.path.expanduser(evaluation_output_path), "parsed_data.json")
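with open(parsed_data_output_path, 'w', encoding='utf-8') as f:
    # keep non-ASCII text (e.g., Spanish translations) readable in the output file
    json.dump(data, f, indent=2, ensure_ascii=False)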

print(f"Parsed data written to {parsed_data_output_path}. Review it to ensure that it looks reasonable before continuing.")

# Running the evaluation

The final step is to run the evaluation. This step uses the `evaluation_engine` module to evaluate the parsed survey data. The evaluation results are then written to `aggregated_results.txt` in the `evaluation-output-path` you configured in the `.ini` file, along with a `logs.txt` file containing a detailed log of the evaluation process.

The evaluation process has been parallelized to run more quickly, but it can still take a long time. You can monitor progress by watching the log output. If you want to stop the evaluation process, you can interrupt the kernel in your Jupyter notebook or press Ctrl+C in the console.

Once you see the results, you might want to adjust the evaluation. If so, you can use the `evaluation-extra-instructions` setting in the `.ini` file to provide additional instructions to the evaluation lenses. For example:

    evaluation-extra-instructions=* When making recommendations, use conversational language and phrasing 
     appropriate to informal household interviews.
     
     * When evaluating language and phrasing for bias, stereotypes, and prejudice, don't be more sensitive than 
     is appropriate in the survey locations considered (e.g., don't impose Western values in non-Western settings).

Note that if you have more than one line of extra instructions, each line from the second one on should be indented by one space.
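
The indentation rule comes from how Python's `configparser` treats continuation lines: a line indented more deeply than its key is appended to that key's value. A small sketch with shortened, made-up instruction text shows how such a value is read back:

```python
import configparser

# A shortened stand-in for the multi-line setting; continuation lines start with a space
ini_text = (
    "[evaluation]\n"
    "evaluation-extra-instructions=* Use conversational language\n"
    " appropriate to informal interviews.\n"
    " \n"
    " * Do not impose outside values.\n"
)

parser = configparser.RawConfigParser()
parser.read_string(ini_text)
value = parser.get("evaluation", "evaluation-extra-instructions")

# each continuation line becomes its own line in the value, with indentation removed
for line in value.split("\n"):
    print(repr(line))
```

If a continuation line is not indented, the parser treats it as a new key and may raise a parsing error instead of extending the value.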

In [None]:
# initialize evaluation engine
if 'evaluation_engine' in sys.modules:
    # if it's already been loaded, force-reload to pick up any changes
    del sys.modules['evaluation_engine']
from surveyeval import EvaluationEngine, PhrasingEvaluationLens, ValidatedInstrumentEvaluationLens, BiasEvaluationLens, TranslationEvaluationLens
evaluation_engine = EvaluationEngine(
    summarize_model=summarize_model,
    summarize_provider=api_provider,
    evaluation_model=gpt_model,
    evaluation_provider=api_provider,
    openai_api_key=openai_api_key,
    azure_api_key=azure_api_key,
    azure_api_base=azure_api_base,
    azure_api_version=azure_api_version,
    temperature=0.01,
    tiktoken_model_name=tiktoken_model,
    max_retries=3,
    logger=logger,
    extra_evaluation_instructions=evaluation_extra_instructions
)

# initialize evaluation lenses
phrasing_lens = PhrasingEvaluationLens(evaluation_engine)
validated_instrument_lens = ValidatedInstrumentEvaluationLens(evaluation_engine)
bias_lens = BiasEvaluationLens(evaluation_engine)
translation_lens = TranslationEvaluationLens(evaluation_engine)

In [None]:
# CONDUCT EVALUATION

# initialize a list to store results data
results_data = []

# prepare for parallel processing of evaluation tasks
import asyncio
async def yield_chunked_tasks(task_list: list, chunk_size: int):
    """
    Yield successive chunk_size-sized chunks from tasks.

    :param task_list: List of tasks to be chunked.
    :type task_list: list
    :param chunk_size: Size of each chunk.
    :type chunk_size: int
    :return: Yields chunks of tasks.
    """
    
    for i in range(0, len(task_list), chunk_size):
        yield task_list[i:i + chunk_size]

async def process_task_chunks(task_list: list, chunk_size: int, delay: int):
    """
    Process chunks of tasks with a specified delay.

    :param task_list: List of tasks to process.
    :type task_list: list
    :param chunk_size: Size of each task chunk.
    :type chunk_size: int
    :param delay: Delay between processing each chunk (in seconds).
    :type delay: int
    """

    async def task_wrapper(task):
        # each task is a tuple of (evaluation type, text excerpt, un-awaited coroutine)
        evaluation_type, text_excerpt, coroutine = task
        result = await coroutine
        return evaluation_type, text_excerpt, result

    async for chunk in yield_chunked_tasks(task_list, chunk_size):
        # run all tasks in the current chunk concurrently
        wrapped_chunk = [task_wrapper(task) for task in chunk]
        results_chunk = await asyncio.gather(*wrapped_chunk)
        results_data.extend(results_chunk)
        # pause before starting the next chunk
        await asyncio.sleep(delay)

# initialize variables for tasks and results
results = []
tasks = []

# run through the parsed questionnaire data and generate list of async review tasks
for module in data.values():
    # run through all questions in the module, assembling questions and translations for review
    module_questions = []
    full_module_text = ""
    for question in module.values():
        # consider each translation of the question
        languages = []
        translations = []
        for translation in question:
            full_question = f"Question: {translation['question']}\nOptions: {translation['options']}\nInstructions: {translation['instructions']}\n"
            
            # conduct bias review on each translation independently
            tasks.append(("bias", full_question, bias_lens.a_evaluate(
                survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=full_question)))
            
            # save first language only for full phrasing and module-level review
            if not languages:
                module_questions.append(full_question)
                full_module_text += full_question
                full_module_text += "\n"
                
            # remember this translation for translation review
            languages.append(translation['language'])
            translations.append(full_question)
        
        # conduct translation review whenever there are multiple translations
        if len(languages) > 1:
            primary_language_text = ""
            for idx, language in enumerate(languages):
                if not language:
                    language = "Unknown-language"
                
                if idx == 0:
                    primary_language_text = f"Primary language ({language}):\n\n{translations[idx]}\n\n"
                else:
                    questions_by_language = primary_language_text + f"Translated language ({language}):\n\n{translations[idx]}\n\n"
                    
                    # conduct translation review in pairwise fashion, with each translation against the first
                    # (primary) language
                    tasks.append(("translation", questions_by_language, translation_lens.a_evaluate(survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=questions_by_language)))

    # execute phrasing evaluation on each module question (first language only)
    for module_question in module_questions:
        tasks.append(("phrasing", module_question, phrasing_lens.a_evaluate(survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=full_module_text, survey_question=module_question)))

    # execute module-level evaluation lenses
    tasks.append(("validated_instrument", full_module_text, validated_instrument_lens.a_evaluate(
        survey_context=evaluation_context, survey_locations=evaluation_locations, survey_excerpt=full_module_text)))

# process tasks in chunks, with a delay between chunks
await process_task_chunks(tasks, chunk_size=20, delay=5)

# output results, including a detailed processing log
logs = []

def add_log(new_history):
    for his in new_history:
        logs.append("PROMPT:")
        logs.append(his[0])
        logs.append("RESPONSE:")
        logs.append(his[1])

for output in results_data:
    eval_type = output[0]
    excerpt = output[1]
    res, history = output[2]

    if eval_type == "phrasing":
        # retrieve formatted result
        formatted_result = phrasing_lens.format_result(res)
        eval_header = "Phrasing evaluation:"
    elif eval_type == "bias":
        # retrieve formatted result
        formatted_result = bias_lens.format_result(res)
        eval_header = "Bias evaluation:"
    elif eval_type == "translation":
        # retrieve formatted result
        formatted_result = translation_lens.format_result(res)
        eval_header = "Translation evaluation:"
    elif eval_type == "validated_instrument":
        # retrieve formatted result
        formatted_result = validated_instrument_lens.format_result(res)
        eval_header = "Module evaluation:"
    else:
        # flag unexpected evaluation types
        eval_header = "UNKNOWN EVALUATION"
        formatted_result = res

    # log exchange+result
    logs.append(eval_header)
    add_log(history)
    logs.append(formatted_result)

    # report result
    if formatted_result:
        results.append(eval_header)
        results.append(excerpt)
        results.append(formatted_result)

# write results to file in evaluation_output_path

results_output_path = os.path.join(os.path.expanduser(evaluation_output_path), "aggregated_results.txt")
logs_output_path = os.path.join(os.path.expanduser(evaluation_output_path), "logs.txt")
with open(results_output_path, 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(results))
with open(logs_output_path, 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(logs))

print(f"Results written to {results_output_path}.")
print(f"Detailed logs written to {logs_output_path}.")