## Using Notebook Environments
1. To run a cell, press `Shift + Enter`. The notebook executes the code in the cell and moves to the next cell. If the cell is a markdown (text-only) cell, the notebook renders the markdown and moves to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point lose track of the state of your notebook. If this happens, you can always restart the kernel by clicking `Runtime` in the menu bar (if you're using Colab) and selecting `Restart runtime`. This will clear all variables and outputs.
3. The value of the last expression in a cell is displayed below the cell. If you want to display multiple values, use the `print()` function as usual.
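
As a small illustration of point 3 (a sketch with made-up variable names and values), the cell below prints two values explicitly and relies on the notebook to display the final expression:

```python
# made-up values for illustration
n_reports = 12
n_problems = 4

# print() is needed for anything that is not the last line of the cell
print(n_reports)
print(n_problems)

# in a notebook, the value of the last expression is displayed automatically
n_reports / n_problems
```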

Notebook environments support code cells and markdown (text) cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with `#`).

## Environment Setup
The code below allows the notebook to access your Google Drive files, so that it can read files and write files to specified locations.
Once you run it, you'll have to grant the notebook access to the Google Drive of your chosen Google account.

In [None]:
# mount google drive
from google.colab import drive
drive.mount("/content/drive")

We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the `import <name> as <nickname>` syntax (e.g. `import pandas as pd`). These are usually standardized nicknames.

We'll use the following packages:

1. `sys` and `os` for managing package installations and directories
2. `re` for string matching
3. `textwrap` and `IPython` to make the display of some of the tables more readable
4. `pandas`, a popular package for manipulating data frames
5. `huggingface_hub` for making API calls and prompting the LLMs

In [None]:
import sys
import os
import re
import textwrap
from IPython.display import display, HTML
import pandas as pd

# install huggingface_hub first if we are running in Google Colab,
# so that the import below does not fail when the package is missing
if 'google.colab' in sys.modules:
    !pip install huggingface_hub &> /dev/null

from huggingface_hub import InferenceClient

# set the working directory to the exercises folder
os.chdir('/content/drive/My Drive/llms_egproc/exercises/')
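
If the `os.chdir` call above fails, the most likely cause is that the exercises folder is not where the notebook expects it. The following is a minimal sketch of a check that reports a missing folder instead of raising an error. The helper name `change_to_folder` is hypothetical and not part of the workshop code.

```python
import os

def change_to_folder(path):
    """Change the working directory to `path`.

    Returns True on success and False if the folder does not exist.
    """
    if not os.path.isdir(path):
        print('Folder not found: ' + path)
        return False
    os.chdir(path)
    return True

# same path as in the cell above
ok = change_to_folder('/content/drive/My Drive/llms_egproc/exercises/')
print(ok)
```

If the function returns `False`, check that the drive was mounted and that the folder name matches the one on your drive.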

# Identifying decision reasons
In this exercise, we will explore the capabilities of LLMs to identify decision reasons in verbal reports using the Hugging Face (HF) ecosystem.

By the end of this exercise, you will have learned how to:
- Design a zero-shot prompt
- Have large models on the Hugging Face servers evaluate your prompts  
- Validate the model output

## Getting access to a Large Language Model
Powerful LLMs like Llama-3-70B, and even its smaller siblings, are typically too large to download and run on your local machine. An alternative is to access LLMs through APIs that run the models on remote servers. In this exercise, we will use the API provided by Hugging Face, a central actor in the open language model sphere.

In addition to offering hundreds of thousands of LLMs, including the most powerful open models developed by Meta, Google, and others, Hugging Face offers a collection of Python libraries to facilitate the use of open LLMs. We will make use of the `huggingface_hub` library to interact with the models running on Hugging Face's servers.  

We start by setting up the API `InferenceClient`, which we loaded from the `huggingface_hub` library in the previous chunk. It takes two arguments:
* `model`: the model name (see, e.g., https://huggingface.co/meta-llama for a collection of Llama models)
* `token`: your personal access token obtained through your Hugging Face account.
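
Because the token is personal, it can be safer to keep it out of the notebook text. The following is a minimal sketch that first looks for the token in an environment variable and otherwise asks for it interactively. The helper name `load_token` and the variable name `HF_TOKEN` are assumptions made for this illustration.

```python
import os
from getpass import getpass

def load_token(env_var='HF_TOKEN'):
    """Return the access token from an environment variable, or prompt for it."""
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed text, so the token does not end up in the cell output
        token = getpass('Hugging Face token: ')
    return token

# usage (uncomment to try it):
# API_TOKEN = load_token()
```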


In [None]:
# paste your token here (keep it private and do not share notebooks that contain it)
API_TOKEN = 'YOUR_HF_TOKEN'

# we'll use a small, recent Llama model: Llama 3.1 8B Instruct
LLAMA_version = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# pass model version and the token to the InferenceClient function and save the output under some name, e.g., LLAMA
LLAMA = InferenceClient(model = LLAMA_version, token = API_TOKEN)

## Using the InferenceClient
Now you can use the `.text_generation` method of the `LLAMA` object to prompt the model. It takes the prompt and a `max_new_tokens` argument, which caps the length of the generated text.
1. Run the chunk and investigate the output.
2. Try other prompts.

In [None]:
# let's create some prompt
prompt = 'Once upon a time'

# get a response from Meta-Llama-3.1-8B-Instruct
response = LLAMA.text_generation(prompt = prompt, max_new_tokens = 300)
print(response)

Being an assistant model, Llama distinguishes between system and user messages of the prompt. The two are separated using a collection of special tokens. The code below shows an example with special tokens. Run the prompt as is and study the result. Then change the system role to one of the following and run again:
*   Priest
*   Economist
*   Generic grandpa

In [None]:
# system message sets the role to a decision scientist
prompt = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are an expert decision scientist
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the best way to make financial decisions?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

# run prompt
response = LLAMA.text_generation(prompt, max_new_tokens = 200)
print(response)

## Piecing the prompt together
In this section, we will build a prompt for reason identification. Let's first take a look at our `prompt_template`. Read it from your Google Drive and study its individual components. Which prompting techniques do you see at work?

In [None]:
# path to the prompt template file
prompt_path = 'prompts/prompt_v1.txt'

# Open the file and read its contents
with open(prompt_path, 'r') as file:
    prompt_template = file.read()

print(prompt_template)

To complete the prompt, three placeholders need to be replaced with the following pieces of information for each trial. This information will come from files on your drive.

1. The DECISION REASON
2. The DECISION PROBLEM
3. The VERBAL REPORT

The code below reads the files with this information and specifies a function that will help us fill in this information efficiently.

In [None]:
# read in decision problems, decision reasons, and verbal reports
decision_problems = pd.read_csv('data/decision_problems.csv', encoding = 'utf-8')
decision_reasons = pd.read_csv('data/decision_reasons.csv', encoding = 'utf-8')
verbal_reports = pd.read_csv('data/verbal_reports.csv', encoding = 'utf-8')

# merge verbal reports with decision problems
problems_reports = pd.merge(decision_problems, verbal_reports, on = 'problem_id')

# function for constructing the full prompt
def generate_prompt(prompt_template, decision_problem, decision_reason, verbal_report):

    # Replace placeholders with actual values
    full_prompt = prompt_template.replace("DECISION_PROBLEM", decision_problem)
    full_prompt = full_prompt.replace("DECISION_REASON", decision_reason)
    full_prompt = full_prompt.replace("VERBAL_REPORT", verbal_report)

    return full_prompt
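
To see what the function does, here is a minimal sketch that applies it to a toy template. The template text and the three example values are made up for illustration; the real template is the one read from `prompts/prompt_v1.txt`. The function is repeated so that the example runs on its own.

```python
# same function as in the cell above
def generate_prompt(prompt_template, decision_problem, decision_reason, verbal_report):
    full_prompt = prompt_template.replace("DECISION_PROBLEM", decision_problem)
    full_prompt = full_prompt.replace("DECISION_REASON", decision_reason)
    full_prompt = full_prompt.replace("VERBAL_REPORT", verbal_report)
    return full_prompt

# toy template with the three placeholders (made up for this example)
toy_template = "Problem: DECISION_PROBLEM\nReason: DECISION_REASON\nReport: VERBAL_REPORT"

toy_prompt = generate_prompt(
    toy_template,
    "Option A or option B?",
    "Chooses the option with the highest possible outcome.",
    "I picked A because it can pay more."
)
print(toy_prompt)

# check that no placeholder was left unfilled
placeholders = ("DECISION_PROBLEM", "DECISION_REASON", "VERBAL_REPORT")
leftover = [p for p in placeholders if p in toy_prompt]
print(leftover)
```

An empty `leftover` list means that every placeholder in the template was replaced.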

Now, let's complete the prompts. Run the code below to create the set of prompts for the **maximum outcome** decision reason, one prompt per verbal report.

In [None]:
# set maximum outcome as value for the selected_reason variable
selected_reason = 'maximum outcome'

# get the description of the reason from the decision_reasons data frame
selected_description = decision_reasons.loc[decision_reasons['decision reason name'] == selected_reason, 'decision reason description'].values[0]

# Create a list for storing prompts for the maximum outcome reason
maximum_outcome_prompts = []

# Generate prompts for the specific decision reason
# this loops over each row of data with verbal reports and corresponding decision problems
for _, row in problems_reports.iterrows():

    # here we are using the generate prompt function to create prompts for all verbal reports and the maximum outcome reason
    prompt = generate_prompt(
        prompt_template,
        row['decision_problem'],
        selected_description,  # Use the selected description
        row['verbal_report']
    )
    maximum_outcome_prompts.append(prompt)

Have a look at the first full prompt. You can change the index to inspect prompts for other problems and reports.

In [None]:
# you can investigate other prompts by changing the number from 0 to some other value
print(maximum_outcome_prompts[0])

## Testing the identification of reasons

In this section, we will try out identifying reasons using the LLM. The code below sends the first maximum outcome prompt to the LLM. Run it and make sense of the output relative to the prompt. You can again try other prompts by changing the index.

In [None]:
# pass the first prompt to LLAMA and save the output
response = LLAMA.text_generation(maximum_outcome_prompts[0], max_new_tokens = 1000)

# print the llama evaluation
print(response)

You may have noticed that our prompt instructs the model to return its confidence in the presence of the reason in a particular format. We do this so that it is easy to extract the confidence value from the output text. The code below defines a simple function to extract the confidence and applies it to a few examples, including the response produced for the first full prompt.

In [None]:
# Function for extracting confidence assessments
def extract_confidence(s):

    # Regular expression to match patterns like @number@ or @@number@@
    pattern = r'@+(\s*\d+\s*)@+'

    # Search for the pattern in the string
    match = re.search(pattern, s)

    if match:
        # Extract the number and convert it to an integer
        number_str = match.group(1).strip()
        return int(number_str)

    return 999 # if match was not found, the function will return 999

# test the extract_confidence function
print('test 1: ' + str(extract_confidence('something else @10 @ xxx')))
print('test 2: ' + str(extract_confidence('something else @@13@@ xxx')))
print('test 3: ' + str(extract_confidence(response)))
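
The function above accepts any number between the `@` markers. The following is a sketch of a stricter variant that also treats values outside the 0 to 100 range as failed extractions. The name `extract_confidence_checked` and its extra parameters are hypothetical and not part of the workshop code.

```python
import re

def extract_confidence_checked(s, low=0, high=100, missing=999):
    """Extract a number written as @number@ and check that it lies in [low, high]."""
    match = re.search(r'@+\s*(\d+)\s*@+', s)
    if match is None:
        return missing  # no marker found
    value = int(match.group(1))
    if low <= value <= high:
        return value
    return missing  # a number was found, but it is outside the allowed range

print(extract_confidence_checked('something else @10 @ xxx'))
print(extract_confidence_checked('something else @@13@@ xxx'))
print(extract_confidence_checked('confidence: @250@'))
print(extract_confidence_checked('no marker here'))
```

Whether out-of-range values should count as failures is a design choice; the variant above simply folds them into the same `999` code that is used for missing values.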

## Putting everything together
Now, you'll run the analysis on the entire data set, i.e., the verbal reports for each decision problem. In each case, the model will assess whether the verbal report reflects the use of a selected reason and return a confidence value.  

We have already created the prompts for each verbal report in the chunks above. These prompts, stored in the `maximum_outcome_prompts` object, were set up to identify the **maximum outcome** reason.

The code below runs these prompts through Llama and stores the results in different ways. The code also defines a few functions that will help us inspect the results.

Running this code will take some time. To keep you updated, the code prints an index showing how many of the verbal reports have been processed. Run the code.


In [None]:
# list for storing the output from the LLAMA model
maximum_outcome_eval = []

# analyzed reason
selected_reason = 'maximum outcome'

# new column in the problems_reports data set for storing the confidence assessments
# recall that selected_reason was set to 'maximum outcome'
problems_reports[selected_reason] = None

# Iterate over the list of prompts, get responses, and extract numerical estimates and add them to the data set with problems and reports
for i, prompt in enumerate(maximum_outcome_prompts):

    # response from LLAMA
    response = LLAMA.text_generation(prompt, max_new_tokens = 1000)
    maximum_outcome_eval.append(response) # save the response to the maximum_outcome_eval list

    # extract the confidence value from the response
    confidence_assesment = extract_confidence(response)

    # write the confidence value into the data
    problems_reports.at[i, selected_reason] = confidence_assesment

    # monitor progress (number of processed reports / total number of reports)
    print(str(i + 1) + '/' + str(problems_reports.shape[0]))

# Function to wrap text
def wrap_text(text, width=100):
    return "<br>".join(textwrap.wrap(text, width))

# display data frames in HTML
def disp_tab(dd):
    dd = dd.map(lambda x: wrap_text(str(x), width=40))
    dd = dd.to_html(escape=False)
    return display(HTML(dd))

# Function to show verbal reports with a confidence assessment at or below min_threshold, or at or above max_treshold
def show_verbal_reports_in_range(data, reason, min_threshold = 100, max_treshold = 0, show_na = False):

    # if True, display the responses for which the confidence assessment couldn't be extracted
    if show_na:
        filtered_data = data[data[reason] == 999]
    else:
        # the outer parentheses make sure that failed extractions (999) are always excluded
        filtered_data = data[((data[reason] <= min_threshold) | (data[reason] >= max_treshold)) & (data[reason] != 999)]

    # work on an explicit copy to avoid pandas' SettingWithCopyWarning
    filtered_data = filtered_data.copy()

    # wrap the text for nicer display
    filtered_data['verbal_report'] = filtered_data['verbal_report'].apply(wrap_text)
    filtered_data['decision_problem'] = filtered_data['decision_problem'].apply(lambda x: wrap_text(x, width=40))

    # select only the columns with the report and the confidence assessment
    filtered_data = filtered_data[['decision_problem', 'verbal_report', 'choice', reason]]
    filtered_data = filtered_data.to_html(escape=False) # to html

    return display(HTML(filtered_data))
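
The loop above stops at the first failed call to the remote model. Assuming that such failures can be temporary, the following sketch shows one way to repeat a call a few times before giving up. The helper `call_with_retries` and its parameters are hypothetical and not part of `huggingface_hub`.

```python
import time

def call_with_retries(call, max_attempts=3, wait_seconds=2.0):
    """Run `call()` and repeat it after a pause if it raises an error."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as error:
            last_error = error
            print('attempt ' + str(attempt) + ' failed: ' + str(error))
            # wait before the next attempt, but not after the last one
            if attempt < max_attempts:
                time.sleep(wait_seconds)
    # all attempts failed, so pass the last error on
    raise last_error

# usage inside the loop (assumes LLAMA and prompt are defined as above):
# response = call_with_retries(lambda: LLAMA.text_generation(prompt, max_new_tokens = 1000))
```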

With the `show_verbal_reports_in_range` function, you can display verbal reports for which the LLM indicated either a very low or a very high confidence that the **maximum outcome** reason was used. Inspect the verbal reports. Do you agree with the LLM's assessments?

In [None]:
# Show verbal reports for which the maximum outcome reason was assessed with low confidence (0 to 20) or high confidence (80 to 100)
show_verbal_reports_in_range(problems_reports, 'maximum outcome', min_threshold = 20, max_treshold = 80)

Using the same function, you can view the entire data set, at least for those cases where the LLM provided a confidence value that could be properly extracted (i.e., an assessment between 0 and 100). Take a look at the indices and see whether any entries are missing.

In [None]:
# Show all results with correctly extracted confidence assessments
show_verbal_reports_in_range(problems_reports, 'maximum outcome', min_threshold = 100, max_treshold = 0)

If there are entries missing, it can be worthwhile to inspect the complete text output that the LLM provided. This lets you assess why the extraction of the confidence value may have failed.

To display cases for which confidence extraction didn't work, run the following line of code:


In [None]:
# Show all results for which the confidence assessment could not be extracted
show_verbal_reports_in_range(problems_reports, 'maximum outcome', show_na = True)

Inspecting the full output can also be valuable for other reasons. For instance, the text output may reveal whether the LLM understands the concepts in your prompt and closely follows your instructions.

The complete LLM output was saved in the `maximum_outcome_eval` object. You can access it by printing individual elements. You can select them by their index.

Take a look at the full outputs of iterations in which the confidence assessment couldn't be correctly extracted. To do this, pick an index from the table above and replace the `1` in the line below.

In [None]:
# Print complete LLM output for index 1
print(maximum_outcome_eval[1])

Finally, you can use the `disp_tab` function to display the entire data set.

In [None]:
disp_tab(problems_reports)
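
Besides looking at the table, it can help to count how the confidence values are distributed. The following sketch counts low, high, middle, and failed assessments in a list of values. The function name `summarize_confidences` is hypothetical and the example values are made up; the thresholds 20 and 80 and the code `999` follow the cells above.

```python
def summarize_confidences(values, low=20, high=80, missing=999):
    """Count low, high, middle, and failed confidence assessments."""
    summary = {'low': 0, 'high': 0, 'middle': 0, 'failed': 0}
    for value in values:
        if value == missing:
            summary['failed'] += 1   # confidence could not be extracted
        elif value <= low:
            summary['low'] += 1
        elif value >= high:
            summary['high'] += 1
        else:
            summary['middle'] += 1
    return summary

# made-up confidence values for illustration
example_values = [5, 95, 50, 999, 80, 20]
print(summarize_confidences(example_values))
# prints {'low': 2, 'high': 2, 'middle': 1, 'failed': 1}
```

Once all reports have been processed, the same function can be applied to a column of the data set, for example `summarize_confidences(problems_reports['maximum outcome'])`.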

# Sure outcome
Run the analysis for the **sure outcome** decision reason. The code below implements the entire pipeline. Run it and see whether the LLM identifies cases with low or high confidence. Again, this may take a moment to finish.

In [None]:
# Select the description
analyzed_reason = "sure outcome"
selected_description = decision_reasons.loc[decision_reasons['decision reason name'] == analyzed_reason, 'decision reason description'].values[0]
print(selected_description)

# Create a list for storing prompts for the sure outcome reason
sure_outcome_prompts = []

# Generate prompts for the specific decision reason
for _, row in problems_reports.iterrows():

    # here we are using the generate prompt function to create prompts for all verbal reports and the sure outcome reason
    prompt = generate_prompt(
        prompt_template,
        row['decision_problem'],
        selected_description,  # Use the selected description
        row['verbal_report']
    )
    sure_outcome_prompts.append(prompt)

# Run the LLM
# list for storing the output from the LLAMA model
sure_outcome_eval = []

# new column in the problems_reports data set for storing the confidence assessments
# recall that analyzed_reason was set to 'sure outcome'
problems_reports[analyzed_reason] = None

# Iterate over the list of prompts, get responses, and extract numerical estimates and add them to the data set with problems and reports
for i, prompt in enumerate(sure_outcome_prompts):

    # response from LLAMA
    llama_response = LLAMA.text_generation(prompt, max_new_tokens = 4000)
    sure_outcome_eval.append(llama_response) # save the response to the sure_outcome_eval list

    # extract the confidence value from the response
    confidence_assesment = extract_confidence(llama_response)

    # confidence value into the data
    problems_reports.at[i, analyzed_reason] = confidence_assesment

    # monitor progress
    print(str(i) + '/' + str(problems_reports.shape[0]-1))

# show cases with a confidence of 20 or lower, or 80 or higher
show_verbal_reports_in_range(problems_reports, 'sure outcome', 20, 80)

# Run other reasons
If you have the time, run the code below to review the list of reasons we prepared for you. Select a reason that you like (or don't like) and try to recreate the analysis using this reason. To do this, simply replace

Good luck!

In [None]:
# display decision reasons
disp_tab(decision_reasons)
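
When repeating the analysis for several reasons, the evaluation loop can be wrapped in a function so that it does not have to be copied each time. The following is only a sketch under stated assumptions: the name `evaluate_prompts` is hypothetical, and the model call and the extraction step are passed in as functions. The example uses made-up stand-ins so that it runs without access to the model.

```python
def evaluate_prompts(prompts, generate, extract):
    """Send each prompt to `generate` and extract a confidence value from each response."""
    responses = []
    confidences = []
    for i, prompt in enumerate(prompts):
        response = generate(prompt)
        responses.append(response)
        confidences.append(extract(response))
        # monitor progress
        print(str(i + 1) + '/' + str(len(prompts)))
    return responses, confidences

# made-up stand-ins for the model call and the extraction function
def fake_generate(prompt):
    return 'Confidence: @' + str(len(prompt)) + '@'

def fake_extract(text):
    return int(text.split('@')[1])

responses, confidences = evaluate_prompts(['abc', 'abcde'], fake_generate, fake_extract)
print(confidences)
```

In the notebook, `generate` could be `lambda p: LLAMA.text_generation(p, max_new_tokens = 1000)` and `extract` could be the `extract_confidence` function defined earlier.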