# Q+A with or without context and ground truth (input CSV, outputs CSV and JSONL)

This notebook helps you build an input file (`JSONL`) for Azure OpenAI evaluations that require data in `question` and `answer` pairs. It is not designed to generate large volumes of test data, but it should handle spreadsheets of a few hundred rows (give or take).

### Prerequisites

#### Input file

A `csv` file with one required column (with a header of `question`). Save the input file in the `input_files` folder.

Optionally, the input file can also contain a `ground_truth` column.
 
| question    | 
| -------- | 
| What color is green?  |
| Where am I?           |
| Do dogs like cats?    | 
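
If you do not have an input file yet, one way to create it is sketched below. This is only an illustration: the helper `write_input_csv` and the file name `sample_questions.csv` are hypothetical names, not part of this notebook. It writes the `question` column and adds `ground_truth` only when at least one row supplies it.

```python
import csv
import os

def write_input_csv(rows, folder="input_files", filename="sample_questions.csv"):
    """Write rows (dicts with a 'question' key) to folder/filename as CSV."""
    fieldnames = ["question"]
    # Only add the optional column when at least one row provides it
    if any("ground_truth" in row for row in rows):
        fieldnames.append("ground_truth")
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path

# The example questions from the table above
sample_rows = [
    {"question": "What color is green?"},
    {"question": "Where am I?"},
    {"question": "Do dogs like cats?"},
]
```

Calling `write_input_csv(sample_rows)` creates `input_files/sample_questions.csv`, which you can then use as the input file name later in the notebook.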



#### Environment Variables

First, create a `.env` file in the root of this directory containing values for `DC_CHAT_URL` and `DC_API_TOKEN`, as shown in the `env.example` file.

- To obtain the `DC_API_TOKEN` 
   - log into the Digital Collections website (either staging or production will work)
   - then visit the corresponding staging or production API route and copy the token into the `.env` file
     - Production: `https://api.dc.library.northwestern.edu/api/v2/auth/token`
     - Staging: `https://dcapi.rdc-staging.library.northwestern.edu/api/v2/auth/token`
   - Note that these tokens are good for 1 day by default, so you'll need to repeat these steps once your token expires
   - After updating your `.env` file you will need to **restart the kernel**
- `DC_CHAT_URL`: Decide whether you want to hit the production or staging endpoint and use one of these values:
  - Staging: `https://pimtkveo5ev4ld3ihe4qytadxe0jvcuz.lambda-url.us-east-1.on.aws`
  - Production: `https://hdtl6p2qzfxszvbhdb7dyunuxe0dgexo.lambda-url.us-east-1.on.aws`
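
A quick way to confirm that both variables are visible to the kernel is sketched below. The helper `missing_env_vars` is a hypothetical name introduced here for illustration. It only checks that the values are present and non-empty; it does not check whether the token is still valid. Run it after `load_dotenv()` has been called so the `.env` values are loaded.

```python
import os

REQUIRED_VARS = ("DC_CHAT_URL", "DC_API_TOKEN")

def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If anything is reported as missing, update the `.env` file and restart the kernel before continuing.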

## Output

A `JSONL` file (plus a matching `CSV` file) containing records with `question` and `answer` fields (and optionally `context`).



| question              | answer         |
| --------              | -------        |
| What color is green?  | Green.         |
| Where am I?           | Here.          |
| Do dogs like cats?    | No.            | 
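
For readers unfamiliar with `JSONL`, the format is one JSON object per line. The sketch below uses the example rows from the table above to show what such a file contains and how it can be parsed back; it leaves out the optional `context` field, and the variable names are illustrative only.

```python
import json

records = [
    {"question": "What color is green?", "answer": "Green."},
    {"question": "Where am I?", "answer": "Here."},
    {"question": "Do dogs like cats?", "answer": "No."},
]

# JSONL: each record is serialized as a JSON object on its own line
jsonl_text = "\n".join(json.dumps(record) for record in records)
print(jsonl_text)

# Reading it back: parse every non-empty line independently
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```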




## Set up the Environment

Start by installing and importing the libraries we need:

In [None]:
# install required packages
%pip install pandas
%pip install requests
%pip install python-dotenv

In [None]:
# import required packages
import json
import os
import random
from datetime import datetime

import pandas as pd
import requests
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

DC_CHAT_URL = os.getenv('DC_CHAT_URL')
DC_API_TOKEN = os.getenv('DC_API_TOKEN')


In [108]:
# Setup: Functions to get an answer to the question

def format_answer(response, with_context=False):
    # with context, return a Series so the two values can fill two columns
    if with_context:
        return pd.Series([response['answer'], response['context']])
    else:
        return response['answer']

def format_error(with_context=False):
    # same shape as format_answer, but filled with an error marker
    if with_context:
        return pd.Series(["--ERROR--", "--ERROR--"])
    else:
        return "--ERROR--"

def get_answer(question, with_context=False):
    url = DC_CHAT_URL
    headers = {'Content-Type': 'application/json'}

    body = {
        'message': 'chat',
        'auth': DC_API_TOKEN,
        # random suffix so each request carries a distinct ref
        'ref': 'DEV-TEAM-TEST-' + str(random.random()),
        'question': question
    }
    # f-string so a non-string cell (e.g. an empty row) does not raise here
    print(f"Asking question: {question}")

    try:
        response = requests.post(url, data=json.dumps(body), headers=headers)
        print(f"Response: {response.status_code}")
        # treat anything other than 200 as a failure
        if response.status_code != 200:
            print('Status:', response.status_code, response.reason)
            return format_error(with_context)
        return format_answer(response.json(), with_context)
    except Exception as err:
        # network errors, invalid JSON, or missing keys in the response
        print(f"Error occurred: {err}")
        return format_error(with_context)

def get_answers(questions, with_context=False):
    # adds the new column(s) to the DataFrame in place and returns it
    if with_context:
        questions[['answer', 'context']] = questions['question'].apply(lambda x: get_answer(x, with_context))
    else:
        questions['answer'] = questions['question'].apply(lambda x: get_answer(x, with_context))
    return questions
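
The `with_context` branch relies on a pandas behavior that is easy to miss: when the function passed to `Series.apply` returns a `pd.Series`, the result is a DataFrame, which can then be assigned to two columns at once. The sketch below shows this with a hypothetical stand-in, `fake_answer`, in place of the real API call, so it runs without credentials.

```python
import pandas as pd

def fake_answer(question):
    # hypothetical stand-in for get_answer(question, with_context=True)
    return pd.Series([f"answer to {question}", f"context for {question}"])

demo_questions = pd.DataFrame({"question": ["Where am I?", "Do dogs like cats?"]})

# apply() returns a two-column DataFrame, assigned positionally to the new columns
demo_questions[["answer", "context"]] = demo_questions["question"].apply(fake_answer)
print(demo_questions.columns.tolist())
```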

## Configure your input file and load data

Set the input file name and make sure the file is readable.

In [None]:
# put your input file inside the `input_files` folder
# put your input filename here
input_filename = '40_realistic_with_ground_truth.csv'

# read the input file
questions = pd.read_csv(os.path.join('input_files', input_filename))

# preview the input file
questions.head()

## Generate the answers from DCAPI

Set `with_context` according to whether you want to fetch context along with the answers.

Then run the cell below to fetch the answers (and optionally the context).

In [None]:
# Run to generate answers (will take some time)

# Set with_context to True if you want to get the context column along with the answer
# (Needed for some of the Azure evaluations)
with_context = True

# get answers
result = get_answers(questions, with_context)

# preview answers
result.head()
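
Failed requests are recorded with the `--ERROR--` marker instead of raising, so it can be worth checking for them before writing the output. The sketch below is one possible way to do that; `split_failed_rows` is a hypothetical helper and the `demo` DataFrame is made-up stand-in data, not real API output.

```python
import pandas as pd

ERROR_MARKER = "--ERROR--"  # same marker that format_error() returns

def split_failed_rows(df, answer_column="answer"):
    """Return (ok_rows, failed_rows) based on the error marker."""
    failed_mask = df[answer_column] == ERROR_MARKER
    return df[~failed_mask], df[failed_mask]

# Illustrative stand-in for the DataFrame returned by get_answers()
demo = pd.DataFrame({
    "question": ["What color is green?", "Where am I?"],
    "answer": ["Green.", ERROR_MARKER],
})

ok_rows, failed_rows = split_failed_rows(demo)
print(f"{len(failed_rows)} of {len(demo)} questions failed")
```

In the notebook you would pass `result` instead of `demo`. An expired `DC_API_TOKEN` is one possible cause of failed rows, since tokens are only valid for about a day.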


## Write the results to file

This writes both `CSV` and `JSONL` files. (`JSONL` seems to be a little less buggy in Azure, but YMMV.)

In [None]:
# write the output files into a timestamped folder
timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
output_base_path = os.path.join('output_files', timestamp)
os.makedirs(output_base_path, exist_ok=True)

# reuse the input file's base name for both outputs
base_name = os.path.splitext(os.path.basename(input_filename))[0]

# JSONL: one JSON record per line
jsonl_filename = os.path.join(output_base_path, f"{base_name}.jsonl")
outJson = questions.to_json(orient="records", lines=True)
with open(jsonl_filename, 'w') as outfile:
    outfile.write(outJson)

# CSV: same data, without the index column
csv_filename = os.path.join(output_base_path, f"{base_name}.csv")
questions.to_csv(csv_filename, index=False)

print(f"Output files saved to: {jsonl_filename} and {csv_filename}")