### This notebook shows how to codify **open question** responses
This is so we can better group respondents together using facets.

*In this version, we assign exactly 1 code to each answer, rather than looking at all of a respondent's answers to the question and coming up with a few codes.*

This notebook covers the process for 1 field to codify; you can repeat it for other fields.
It will create the codes, store them in a file, and then use them to code EACH response.

The output file is a CSV that includes the unique respondent identifier and the new coded field,
so that it can be used with a typical `VLOOKUP` in Sheets or `merge` in pandas if you want to use Python.
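
The pandas route could look like the minimal sketch below. It assumes the lookup file has the columns `RespondentId` and `CountryHelpQuestions_Coded_1` (names taken from the settings used later in this notebook). The two small in-memory frames are made-up stand-ins for the real CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the survey export and the lookup file.
survey = pd.DataFrame({
    "RespondentId": [101, 102, 103],
    "CountryHelpQuestions": ["How do I budget?", None, "Where to find spare parts?"],
})
lookup = pd.DataFrame({
    "RespondentId": [101, 103],
    "CountryHelpQuestions_Coded_1": ["Budgeting And Finance Questions", "Spare Parts Supply Chain"],
})

# A left merge keeps every survey row; respondents without a code get NaN.
merged = survey.merge(lookup, on="RespondentId", how="left")
print(merged)
```

When reading the real files with `pd.read_csv`, make sure the ID column has the same dtype in both frames, otherwise the merge will find no matches.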

### 1. Setup the AI side

In [1]:
# Define the path for the data file (csv) as exported from Survey Monkey
data_folder = '../survey_data/'
# Define the survey data file
# here we use the shortTitles version of the survey data (see data preparation)
csv_survey = data_folder + 'survey_data_shortTitles.csv'  # Replace with the path to your survey data file

# The AI language model
model = "gpt-4-0125-preview"  # or "gpt-3.5-turbo-1106"

# get the API key from the environment variable
import os
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

chat = ChatOpenAI(model=model)

### 2. Create codes for each of those *open* questions

In our survey, a few open questions need to be codified.
So let's use AI to create a JSON file with the list of codes and their descriptions, by showing the AI all the responses to that question and, optionally, some "meta data" about each respondent to help it create better codes.

For our example we are going to codify the "Key responsibilities" question, i.e. what the key responsibilities of the role are.
We use the Job Titles, Job Title, and Role in Rural water sector fields as the meta data to help the AI come up with a good contextualized list.

So for example we will inject into the prompt

```
Key Responsibilities: .....
Job Titles: Project Manager
Job Title: WASH & NUT Field Manager
Role in Rural water sector: Project Manager
----
(repeat for all responses)
```

We will later use those codes to apply to each of the rows in the survey data.

You can set `SAMPLE_ONLY` if you only want to test this with a few responses, otherwise it will use all the responses.
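
The injected format shown above can be sketched as a small helper. This is an illustration only: `build_data_string`, `fields_to_help`, and `sample_size` are hypothetical names, and the example frame is made up rather than taken from the real survey.

```python
import pandas as pd

def build_data_string(df, field_to_code, fields_to_help=(), sample_size=None):
    """Format each non-empty response as 'field: value' lines ending with '----'."""
    rows = df[df[field_to_code].notna()]
    if sample_size is not None:
        rows = rows.head(sample_size)
    blocks = []
    for _, row in rows.iterrows():
        lines = [f"{field_to_code}: {row[field_to_code]}"]
        # Add the optional meta data fields, skipping empty values.
        for field in fields_to_help:
            if pd.notna(row[field]):
                lines.append(f"{field}: {row[field]}")
        blocks.append("\n".join(lines) + "\n----\n")
    return "".join(blocks)

# Hypothetical example data, not taken from the real survey.
example = pd.DataFrame({
    "Key Responsibilities": ["Manage field teams", None],
    "Job Title": ["WASH & NUT Field Manager", "Engineer"],
})
data = build_data_string(example, "Key Responsibilities", ["Job Title"])
print(data)
```

Rows with no answer in the field to code are dropped, so the second row above does not appear in the output.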

In [None]:
import pandas as pd
# ====================================
# USE SAMPLE ONLY 
# Set to true to only do a sample of the responses.. to test 
SAMPLE_ONLY = False
# ==========================================


# ==================================================================================================
# Change these to the fields you want to use 
#
field_to_code = "CountryHelpQuestions" # this must exist as a column in the survey data
coded_field = "CountryHelpQuestions_Coded" # this is a new column that will be added to the survey data
unique_id = "RespondentId" # unique id for each response in the survey
number_of_codes = 15
# provide some context
context = "In the context of a Survey of WASH practitioners responding to a survey about capacity building needs, this concerns questions that team members ask the respondent"  # no trailing period: the prompt templates add one
#==================================================================================================

# make sure all those fields exist in the survey data and raise an error if not
df = pd.read_csv(csv_survey)
for field in [field_to_code, unique_id]:
    if field not in df.columns:
        print(f"ERROR: field '{field}' not found in the survey data")
        raise ValueError(f"ERROR: field '{field}' not found in the survey data. Maybe the column names changed or you are using the wrong csv file")


# for each response, create a string with "ID: value" and "field: value" lines for field_to_code
data_string = ""

# df_with_data only contains the responses that have data in field_to_code
df_with_data = df[~pd.isna(df[field_to_code])]

# How many responses with data
print(f"Number of responses with data: {len(df_with_data)}")

count = 0
for index,row in df_with_data.iterrows():
    # check if we are only doing a sample
    if SAMPLE_ONLY and count > 10:
        break
    count += 1
    data_string += f"ID: {row[unique_id]}\n"
    data_string += f"{field_to_code}: {row[field_to_code]}\n"
    data_string += "-----\n"


print("\n\n\n",f"Number of responses with data used: {len(data_string.split('-----'))-1}","\n\n\n")

# print(data_string[:4000])


# define the prompt
tpl = """ 
You are an expert in analyzing data. {context}. You will be given all the answers to the question "{field_to_code}".
Come up with {number_of_codes} short labels for the answers to the question "{field_to_code}" based on all the other answers in the survey.

List of answers to {field_to_code} in context: 
===

{data_string}

===
The codes should be a descriptive word group of 4 to 6 words each.
ONLY return the {number_of_codes} codes, one per line, and add a description for it in 40 words, nothing else. For example:

Word Group Code: This is a description of the code in 40 words.
Another Code Here: This is another description of the code in 40 words.

The codes must be normal words separated by spaces, i.e. Water Management, not WaterManagement. NEVER USE snake_case or camelCase for the codes.
"""

# Create a prompt with the template
prompt_tpl = PromptTemplate.from_template(tpl)
content = prompt_tpl.format(field_to_code=field_to_code, data_string=data_string, context=context, number_of_codes=number_of_codes)

print(content)


Now let's send the request to the AI to create the codes,
and save the codes to a file (JSON).
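
To make the target JSON shape concrete, here is a minimal sketch of turning `Code: Description` lines into a list of `{code, description}` entries. The helper name `parse_codes` and the sample response text are made up for illustration; a real model response may need more defensive handling.

```python
import json
import re

def parse_codes(text):
    """Turn 'Code: Description' lines into a list of {code, description} dicts."""
    codes = []
    for line in text.splitlines():
        if ": " not in line:
            continue  # skip blank lines and anything not in the expected form
        # Split on the first ": " only, so descriptions may contain colons.
        code, description = line.split(": ", 1)
        code = re.sub(r"^\d+\.\s*", "", code).strip()  # drop numbering like "1. "
        codes.append({"code": code, "description": description.strip()})
    return codes

# Made-up response text for illustration only.
sample_response = """1. Water Point Maintenance Questions: Questions about repairing pumps.

Budget Planning Support: Questions on costs: tariffs and budgets."""

codes_example = parse_codes(sample_response)
print(json.dumps(codes_example, indent=2))
```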

In [None]:
import json
import re

# send to the LLM (a single call; the response is an AIMessage)
messages = [HumanMessage(content=content)]
resp = chat.invoke(messages)
print(resp.content)

# split each "Code: Description" line of the response into { code, description }
codes = []
for line in resp.content.split("\n"):
    # skip blank lines and any line that is not in "Code: Description" form
    if ": " not in line:
        continue
    # split on the first ": " only, so descriptions may contain colons
    code, description = line.split(": ", 1)
    # remove any leading numbering such as "1. " and trim
    code = re.sub(r"^\d+\.\s*", "", code).strip()
    codes.append({"code": code, "description": description.strip()})

with open(f"{data_folder}{field_to_code}_codes.json", "w") as f:
    json.dump(codes, f, indent=2)


### 3. Apply the codes to each of the responses

Now we need to apply the codes to each of the responses. We will also use AI for that.

To save time and tokens we should apply the codes in batches so we don't repeat all the code options for each call.

We will have it return just the unique id of the response and the code that was applied to it.

Then we will save that to a file (CSV) so we can use it in sheets or pandas to merge it with the original data.

The filename starts with `lookup_`.
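
The batching and the `ID; Code1; Code2;` reply format described above can be sketched with two small helpers. The names `make_batches` and `parse_coded_line` are hypothetical, and the values are made up for illustration.

```python
def make_batches(items, per_batch):
    """Split a list into consecutive chunks of at most per_batch items."""
    return [items[i:i + per_batch] for i in range(0, len(items), per_batch)]

def parse_coded_line(line):
    """Split 'ID; Code1; Code2;' into ['ID', 'Code1', 'Code2']."""
    # Drop a trailing semicolon first so it does not create an empty last part.
    return [part.strip() for part in line.strip().rstrip(";").split(";")]

# Made-up values for illustration only.
batches = make_batches(["r1", "r2", "r3", "r4", "r5", "r6", "r7"], per_batch=5)
print(batches)
print(parse_coded_line("54; Coded something; Coded something else;"))
```

Sending the list of codes once per batch, rather than once per response, is what saves the tokens.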


In [None]:
import json
# ====================================
# USE SAMPLE ONLY 
# Set to true to only do a sample of the responses.. to test 
SAMPLE_ONLY = False
# ==========================================


# we are reusing the codes from the previous step, read the file
with open(f"{data_folder}{field_to_code}_codes.json", "r") as f:
    codes = json.load(f)
          
print(f"Number of codes: {len(codes)}")

# in batches of 5 responses, collate the data that is used by the LLM to guess what the best code is for each response
# reuse the data_string from above, dropping the empty piece after the last separator
responses = [r.strip() + "\n" for r in data_string.split("-----") if r.strip()]
print(f"Number of responses: {len(responses)}")

# create an AI process to generate the code
# create the prompt
tpl = """ 
You are an expert in coding data. {context}. You will be given codes to choose from, to code open text answers separated by semi-colons, e.g.

{field_to_code}: something; something else; more here; and again; 

You will be given a batch of answers, each with an ID line.
You can ONLY choose from the codes provided to assign to each part of the answer. Assign a code for each part. So if you get 1, only give 1, if you get 2, then only give 2, and so on. Repeat the code if necessary.
Return the codes you think best describe each answer, one answer per line, with the codes separated by semi-colons in the format ID; Code1; Code2; Code3; etc. Without the description. For example: 

54; Coded something; Coded something else; ...

Ensure the coded answers are representative of the answers given. If you are not sure, leave it blank. 
List of Answers:
{batch_of_answers}

List of Codes and their descriptions you have to choose from:
{codes}

Never ever make up a code, only use the codes provided.
"""

# Create a prompt with the template
prompt_tpl = PromptTemplate.from_template(tpl)

# create a function we can use for each batch


def get_codes_for_this_batch(batch_of_answers):
    # get codes as 1 per line with Code: Description from codes
    codes_list = "\n".join(
        [f"{code['code']}: {code['description']}" for code in codes])
    content = prompt_tpl.format(
        batch_of_answers=batch_of_answers, codes=codes_list, field_to_code=field_to_code, context=context)
    print(content)
    # send to the LLM (a single call; the response is an AIMessage)
    messages = [HumanMessage(content=content)]
    resp = chat.invoke(messages)
    print(resp.content)
    # split each "ID; Code1; Code2;" line into a list of trimmed parts and return an array of arrays
    return [[part.strip() for part in line.strip().rstrip(";").split(";")]
            for line in resp.content.split("\n") if line.strip() != ""]


# split the responses into batches and process each batch through the LLM
all_reponses = []
per_batch = 5
# in sample mode, only do the first 3 batches; otherwise do all responses
limit = min(3 * per_batch, len(responses)) if SAMPLE_ONLY else len(responses)
if SAMPLE_ONLY:
    print(f"Doing sample only of 3 batches or {3 * per_batch} responses")

for i in range(0, limit, per_batch):
    batch = responses[i:i + per_batch]
    batch_of_answers = "\n".join(batch)
    print(f"Batch {i // per_batch + 1}: {len(batch)}")
    this_batch = get_codes_for_this_batch(batch_of_answers)
    # merge into all_reponses
    all_reponses += this_batch

# print(all_reponses)



In [None]:
print(all_reponses)
# now create a CSV file from all_reponses, mapping the response id to the unique_id field and each code to a coded_field column
# given an array of arrays, we can use pandas to create a dataframe

# In case a response has more than 1 code, add a new column for each code, named coded_field with the suffix _1, _2, _3, etc.
# check the column count, then create the dataframe
columns = [unique_id]

# iterate over the responses to find the one with the most columns
columns_in_responses = 0
for response in all_reponses:
    columns_in_responses = max(columns_in_responses, len(response))

print(f"Number of columns in responses: {columns_in_responses}")

for i in range(columns_in_responses-1):
    columns.append(f"{coded_field}_{i+1}")

df_codes = pd.DataFrame(all_reponses, columns=columns)
# export as csv
df_codes.to_csv(f"{data_folder}lookup_{field_to_code}_codes.csv", index=False)
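
Because the prompt tells the model never to make up a code, it can be worth checking the result before using the lookup file. The sketch below is one possible check, not part of the original workflow: `find_unknown_codes` is a hypothetical helper and the example frame is made up.

```python
import pandas as pd

def find_unknown_codes(df_codes, id_column, allowed_codes):
    """Return (id, value) pairs whose value is not one of the allowed codes."""
    allowed = set(allowed_codes)
    unknown = []
    for column in df_codes.columns:
        if column == id_column:
            continue
        for row_id, value in zip(df_codes[id_column], df_codes[column]):
            # Empty cells are fine: the prompt allows leaving a code blank.
            if pd.isna(value) or value == "":
                continue
            if value not in allowed:
                unknown.append((row_id, value))
    return unknown

# Made-up example data for illustration only.
example_codes = pd.DataFrame({
    "RespondentId": ["54", "55"],
    "CountryHelpQuestions_Coded_1": ["Budget Planning Support", "Invented Code"],
    "CountryHelpQuestions_Coded_2": [None, "Budget Planning Support"],
})
problems = find_unknown_codes(example_codes, "RespondentId", ["Budget Planning Support"])
print(problems)
```

With the real data, the allowed list would be `[c["code"] for c in codes]` from the JSON file created in step 2.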
