## LLM API call using OpenAI API

The purpose of this notebook is to demonstrate how LLM API calls can make the outputs of XAI methods easier to understand in the context of CNN-based skin cancer prediction. In each API call, the LLM is supplied with a set of detailed instructions aimed at making its outputs more accessible to non-technical audiences.

**Prerequisites:**

This LLM pipeline uses data produced by the ``04_xai_pipeline.ipynb`` notebook, so make sure those files are available in the relevant folders.
For the analysis, you also need to supply your own sample, stored in the ``user_inputs`` folder.

### Tools

In [None]:
from openai import OpenAI, APIError, APIConnectionError, RateLimitError
from pathlib import Path
import os
from dotenv import load_dotenv
import base64
import pandas as pd
import numpy as np
from pydantic import BaseModel, Field
from typing import Literal, Optional, Annotated, List
import matplotlib.pyplot as plt
from PIL import Image
import csv

### Constants

In [None]:
# Define paths to the relevant folders
root_dir = Path.cwd().parent
user_inputs_dir = root_dir / 'user_inputs'
results_dir = root_dir / 'results'
xai_output_dir = results_dir / 'xai_output'

# Define relevant input data paths
sample_path = user_inputs_dir / 'user_sample1.jpg'
sample_probs_path = xai_output_dir / 'model_output.csv'
xai_gradcam_output_path = xai_output_dir / 'user_sample1_xai_gradcam.png'
xai_shap_output_path = xai_output_dir / 'user_sample1_xai_shap.png'
xai_influence_output_path = xai_output_dir / 'user_sample1_influence_function.csv'

### OpenAI client setup

To use the OpenAI API you need an API key. Create a ``.env`` file in the project root containing your own key, e.g.:

``OPENAI_API_KEY="YOUR_OPENAI_API_KEY"``

In [None]:
# Define .env file path
env_path = root_dir / '.env'

# Load the variables from the .env file (API key)
load_dotenv(dotenv_path=env_path, 
            override=True) # Returns True if at least one variable was loaded

In [None]:
# Access the loaded OpenAI API key
key = os.getenv("OPENAI_API_KEY")

# Construct client instance using the key
client = OpenAI(api_key=key)
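
A quick check right after loading the `.env` file can surface a missing key with a clear message before any request is sent. The following is a minimal sketch; `require_env` is a hypothetical helper introduced here for illustration and is not part of the notebook or of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the .env file exists and contains it."
        )
    return value
```

In this notebook it could replace the plain lookup, e.g. `key = require_env("OPENAI_API_KEY")`.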

### Loading the data

In this step we load the outputs from the ``04_xai_pipeline.ipynb`` notebook and encode the images as base64 strings, the format in which the OpenAI API accepts inline images.

In [None]:
# Function to encode the images to base64 byte objects in string format
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# encode the xai output images
xai_gradcam_enc = encode_image(xai_gradcam_output_path)
xai_shap_enc = encode_image(xai_shap_output_path)

# encode the user sample
user_sample_enc = encode_image(sample_path)

# Read in the sample probabilities output by the model
with open(sample_probs_path, "r",  encoding="utf-8") as f:
    sample_probs = f.read()
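
When an encoded image is sent to the API it is wrapped in a data URL, which carries a MIME type (for example `image/png` or `image/jpeg`) next to the base64 payload. The sketch below derives that type from the file extension using the standard library; `to_data_url` is a hypothetical helper name and not something the notebook defines.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: Path) -> str:
    """Encode an image file as a base64 data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(str(image_path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

A `.jpg` file is reported as `image/jpeg`, which is the registered MIME type for JPEG images.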

### Instructions

When constructing the instructions for the LLM, we followed OpenAI's prompting best practices, available [here](https://cookbook.openai.com/examples/gpt4-1_prompting_guide).

Among other techniques, few-shot prompting was used so that the model can pick up the patterns present in the expected outputs.

The instructions are stored in the ``data/llm`` folder.

In [None]:
# Read in the instructions
with open(root_dir / "data" / "llm" / "llm_instructions.md", "r", encoding="utf-8") as f:
    instructions = f.read()

### Influence Functions Statistics

LLMs are prone to inaccuracy when analysing datasets, which raises particular concerns in clinical contexts. In our project, one of the XAI method outputs, Influence Functions, is supplied as a .csv file. To reduce the risk of such errors, the function below computes summary statistics from the dataset directly, and these statistics, rather than the raw file, are passed to the LLM in the subsequent step.

In [None]:

def influence_functions_stats(filepath : Path, predicted_class : str) -> tuple[float, float, float | None]:
    """
    Calculate statistics for the Influence Functions data. In particular, the function calculates:
    - percentage of the influential training cases that share their ground truth with the CNN-predicted class
    - percentage of the influential training cases that don't share their ground truth with the CNN-predicted class
    - percentage of the ground-truth-aligned cases that were misclassified during training
      (None if no case is aligned)

    Arguments:
    filepath (pathlib.Path): path to the CSV dataset containing the output of the Influence Functions method
    predicted_class (str): CNN-predicted class for a user sample

    Returns:
    A tuple containing the three statistics, in the order listed above.
    """

    # Read in the influence functions data
    influence_data = pd.read_csv(filepath)

    # Filter for influential training cases that share ground truth with predicted class
    aligned_groundtruth = influence_data[influence_data['ground_truth'] == predicted_class]

    # Set default values (used when no influential case matches the predicted class)
    groundtruth_alignment_percentage, groundtruth_misalignment_percentage, misclassified_percentage = 0, 100, None

    # Check whether there are any aligned cases
    if len(aligned_groundtruth) > 0:
        # Calculate the percentage of influential training cases that share ground truth with predicted class
        groundtruth_alignment_percentage = (len(aligned_groundtruth) / len(influence_data)) * 100

        # Calculate the percentage of the aligned cases that were misclassified during training
        misclassified = aligned_groundtruth[aligned_groundtruth["ground_truth"] != aligned_groundtruth["prediction"]]
        misclassified_percentage = round((len(misclassified) / len(aligned_groundtruth)) * 100, 2)

    # Calculate the percentage of influential training cases whose ground truth does NOT match the predicted class
    groundtruth_misalignment_percentage = 100 - groundtruth_alignment_percentage

    return groundtruth_alignment_percentage, groundtruth_misalignment_percentage, misclassified_percentage

# Establish the CNN-predicted class
sample_data = pd.read_csv(sample_probs_path)
predicted_class = str(sample_data.loc[sample_data['confidence'].idxmax(), 'class'])

# Compute the Influence Functions stats
influence_stats = influence_functions_stats(filepath=xai_influence_output_path, 
                                            predicted_class=predicted_class)

print(
    f"Prediction: {predicted_class}", 
    f"\nGround-truth-aligned cases: {influence_stats[0]}%", 
    f"\nGround-truth misalignment percentage: {influence_stats[1]}%",
    f"\nMisclassified percentage: {influence_stats[2]}%"
    )
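
To make the three statistics concrete, here is a small worked example of the same arithmetic on made-up data. It uses plain Python instead of pandas so that it runs on its own; the four `(ground_truth, prediction)` pairs and the name `toy_influence_stats` are hypothetical and only illustrate the calculation.

```python
def toy_influence_stats(cases, predicted_class):
    """Compute the three influence statistics for a list of (ground_truth, prediction) pairs."""
    aligned = [(gt, pred) for gt, pred in cases if gt == predicted_class]
    if not aligned:
        # No influential case shares the predicted class
        return 0, 100, None
    alignment_pct = len(aligned) / len(cases) * 100
    misclassified = [(gt, pred) for gt, pred in aligned if gt != pred]
    misclassified_pct = round(len(misclassified) / len(aligned) * 100, 2)
    return alignment_pct, 100 - alignment_pct, misclassified_pct


# Hypothetical influential training cases: (ground_truth, prediction during training)
toy_cases = [
    ("Malignant", "Malignant"),
    ("Malignant", "Benign"),     # aligned with the predicted class, but misclassified
    ("Malignant", "Malignant"),
    ("Benign", "Benign"),
]

print(toy_influence_stats(toy_cases, "Malignant"))  # (75.0, 25.0, 33.33)
```

Three of the four cases share the predicted class (75%), one does not (25%), and one of the three aligned cases was misclassified during training (33.33%).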

### LLM API call

We use GPT-4.1, a flagship model from OpenAI (model snapshot: 2025-04-14). According to the company's documentation, it is highly capable at complex tasks and adheres closely to instructions.

In [None]:
try:
    response = client.responses.create(
        model="gpt-4.1-2025-04-14",
        input=[
            {
                "role": "developer",
                "content" : instructions
            },
            {
                "role": "user",
                "content": [
                    { 
                        "type": "input_text",
                        "text": str(sample_probs) }, # sample probabilities
                    { 
                        "type": "input_text",
                        "text": str(influence_stats)}, # influence functions stats
                        
                    {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{xai_gradcam_enc}", # GradCAM output, base64-encoded PNG file
                        "detail": "auto"
                    },
                    {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{xai_shap_enc}", # SHAP output, base64-encoded PNG file
                        "detail": "auto"
                    },
                    {
                        "type": "input_image",
                        "image_url": f"data:image/jpeg;base64,{user_sample_enc}", # Original user sample, base64-encoded JPG file (MIME type image/jpeg)
                        "detail": "auto"
                    },

                ],
            }
        ],
        temperature=0.0
    )

    # Print the LLM interpretation
    print(response.output_text)

# Handle potential errors. The specific OpenAI exceptions come first because
# APIError is the base class of APIConnectionError and RateLimitError.
except RateLimitError as e:
    print(f"Rate limit exceeded: {e}")

except APIConnectionError as e:
    print(f"Connection error: {e}")

except APIError as e:
    print(f"OpenAI API error: {e}")

except TimeoutError as e:
    print(f"The LLM API call encountered TimeoutError: {e}")

except Exception as e:
    print(f"Unexpected error: {e}")
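
Rate-limit and connection errors are often transient, so one option is to retry the call a few times before giving up. The sketch below shows a generic retry loop with exponential backoff; `call_with_retry` and its parameters are hypothetical names introduced here, and the notebook itself does not retry. The OpenAI Python client also has its own `max_retries` option, so a wrapper like this would be an extra layer, not a requirement.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call `func`, retrying on the given exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Under these assumptions, the API call could be wrapped as `call_with_retry(lambda: client.responses.create(...), retry_on=(RateLimitError, APIConnectionError))`, with the remaining `except` clauses handling whatever is raised after the last attempt.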

In [None]:
# Load and display xai images for context
img1 = Image.open(sample_path)
img2 = Image.open(xai_gradcam_output_path)
img3 = Image.open(xai_shap_output_path)


fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(img1)
axes[0].axis('off')
axes[0].set_title('Original Sample')
axes[1].imshow(img2)
axes[1].axis('off')
axes[1].set_title('GradCAM Visualisation')
axes[2].imshow(img3)
axes[2].axis('off')
axes[2].set_title('SHAP Visualisation')

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()


### Write the LLM interpretation to a .txt file

In [None]:
# Define path for the LLM output
llm_output_path = xai_output_dir / 'llm_output.txt'

with open(file=llm_output_path, mode="w", encoding="utf-8") as f:
    f.write(response.output_text)

### Parse LLM Output for Quantitative Analysis

The call below parses the LLM interpretation and extracts the crucial insights to enable a quantitative analysis against the original CNN prediction.

In [None]:
class Prediction(BaseModel):
    prediction: Literal['Benign', 'Malignant'] = Field(
        description="'Benign' for when the AI analysis suggests low or moderately low concern for malignancy; 'Malignant' for when the AI analysis indicates high or moderately high concern for malignancy. " \
        "For borderline predictions, extract the class that is mentioned in the Confidence Level section."
        )
    confidence: Annotated[float, Field(
        description="Model confidence, as indicated in the Confidence Level section. 2 decimal places."
        )]
    influential_cases_percentage: Annotated[int, Field(
        description="Influence Functions: Percentage of the most influential training cases that were diagnosed with the same class as the class predicted by the model. " \
        "For borderline predictions, the class predicted by the model is indicated in the Confidence Level section."
    )]

try:
    extraction = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[
            {
                "role": "system",
                "content": "You are an expert at structured data extraction. You will be given unstructured text from AI analysis and you should convert it into the given structure.",
            },
            {
                "role": "user", 
                "content": response.output_text
                },
        ],
        text_format=Prediction,
    )

    extracted_data = extraction.output_parsed.model_dump()

# Handle potential errors. The specific OpenAI exceptions come first because
# APIError is the base class of APIConnectionError and RateLimitError.
except RateLimitError as e:
    print(f"Rate limit exceeded: {e}")

except APIConnectionError as e:
    print(f"Connection error: {e}")

except APIError as e:
    print(f"OpenAI API error: {e}")

except TimeoutError as e:
    print(f"The LLM API call encountered TimeoutError: {e}")

except Exception as e:
    print(f"Unexpected error: {e}")

The code below searches the LLM interpretation for keywords indicating a 'borderline' prediction. According to ``llm_instructions.md``, we expect the LLM to interpret as *borderline* all predictions output by the CNN that fall in the range `>= 0.5 and < 0.6`.

In [None]:
# Define custom function for extracting the borderline prediction status
def borderline_parser(llm_output : str, key_words: List[str]) -> bool:
    """
    Parse the LLM output and search for the key words to confirm if the given prediction 
    was interpreted as "borderline". 

    Arguments:
    llm_output (str): original LLM interpretation of XAI methods in the skin cancer prediction
    key_words (List[str]): key words to match against the LLM output

    Returns:
    A bool value for the presence or absence of the "borderline" prediction status.
    """

    # Narrow down the parsing focus to the text preceding the heading, if the exact heading is present in the LLM output
    if '**Confidence Level**' in llm_output:
        start_indx = 0
        end_indx = llm_output.find("**Confidence Level**")
        llm_output = llm_output[start_indx : end_indx].lower()
    else:
        llm_output = llm_output.lower()

    borderline = False
    key_words = [word.lower() for word in key_words]

    # Parse for the key words (with lowered case)
    for word in key_words:
        if word in llm_output:
            borderline = True
            break

    return borderline

# Implement the function and update extracted_data with the function's finding
key_words = ["borderline"]
borderline_status = borderline_parser(llm_output=response.output_text, 
                                      key_words=key_words)
extracted_data["borderline"] = borderline_status
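
The keyword search records what the LLM wrote, while the threshold rule itself can be evaluated directly on the CNN output, which allows the two to be compared. The sketch below assumes the CNN confidence for the predicted class is on a 0 to 1 scale; `is_borderline` and `borderline_flags_agree` are hypothetical helpers that are not part of the notebook.

```python
def is_borderline(confidence: float, lower: float = 0.5, upper: float = 0.6) -> bool:
    """Return True if `confidence` falls in the half-open interval [lower, upper)."""
    return lower <= confidence < upper


def borderline_flags_agree(confidence: float, llm_flag: bool) -> bool:
    """Return True if the LLM's borderline flag matches the threshold rule."""
    return is_borderline(confidence) == llm_flag
```

For example, a confidence of 0.55 counts as borderline, so an LLM interpretation without the keyword would be flagged as a disagreement.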


### Write the parsed data to the .csv file

In [None]:
# Define path for the extracted data 
parsed_llm_output_path = xai_output_dir / 'parsed_llm_output.csv'

# Write extracted_data to a csv file for quantitative analysis
with open(file=parsed_llm_output_path, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(extracted_data.keys())
    writer.writerow(extracted_data.values())
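
One detail matters for the later quantitative analysis: the `csv` module writes every value as text, so numbers and booleans come back as strings and need converting when the file is read again. The round trip below illustrates this with an in-memory buffer; the row values are made up, and only the column names follow the fields extracted above.

```python
import csv
import io

# Hypothetical row with the same columns as the parsed LLM output
row = {
    "prediction": "Malignant",
    "confidence": 0.87,
    "influential_cases_percentage": 70,
    "borderline": False,
}

# Write the header and one data row to an in-memory file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read the row back: every value is now a string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))

# Convert each column back to its intended type
parsed = {
    "prediction": loaded["prediction"],
    "confidence": float(loaded["confidence"]),
    "influential_cases_percentage": int(loaded["influential_cases_percentage"]),
    "borderline": loaded["borderline"] == "True",
}
```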