# Tutorial: Zero-Shot Learning for Beverage Classification in Images

In this tutorial, we implement Zero-Shot Learning (ZSL) using two popular deep learning models: CLIP (Contrastive Language-Image Pre-training) and LLaVA (Large Language and Vision Assistant). This notebook is available in two forms:

1.   [Online (Google Colab)](https://colab.research.google.com/github/ltu-capr/zsl-image-tutorial/blob/master/ZSL_for_image_beverage_classification.ipynb): For experimenting on Google's free platform without installing anything on your computer.
2.   [Offline (Jupyter Notebook)](https://github.com/ltu-capr/zsl-image-tutorial): For experimenting locally on your own computer. This takes some additional setup, but is the best option for working with sensitive data.

To run the code in a cell, either click inside it and press Ctrl + Enter, or click the 'play' button to the left of the cell.

*The ZSL models at the core of this notebook run much faster with graphics processing unit (GPU) acceleration. If you are in Google Colab, you can enable GPU acceleration in the settings by going to Runtime > Change runtime type > Hardware accelerator (select "GPU").*

*To support most machines, this notebook uses a quantised version of LLaVA, which reduces the memory requirements at a small cost to accuracy. Quantisation relies on GPU support, so a GPU is **required** to use LLaVA in this notebook.*

## Example scenario: Beverage Classification in Images

In this scenario, we want to classify the type of beverage depicted in an image from a set of predefined categories, and thus determine whether an alcoholic beverage is present.

The candidate beverage types we aim to classify are:

*    Beer
*    Wine
*    Water
*    Coffee
*    Tea
*    No Beverage

To approach this, we will use two popular deep learning models that operate in different ways:

* CLIP (Contrastive Language-Image Pre-training)
* LLaVA (Large Language and Vision Assistant)


### Cell 0.0: Install software package requirements


*   bitsandbytes and accelerate are used to quantise the LLaVA model, such that we can run it on machines with smaller amounts of available memory.
*   matplotlib is used for visualising the confusion matrix and examples in the dataset.
*   numpy is used for efficient numerical computations.
*   pandas is used to load and save data in CSV (comma-separated values) format.
*   PyTorch and Transformers are used to run the models (torchvision is used by Transformers to preprocess the images).
*   tqdm is used to show progress bars.
*   scikit-learn is used for metric computation.

In [None]:
!pip install accelerate==1.2.0 bitsandbytes==0.45.0 matplotlib numpy pandas transformers==4.46.0 tqdm scikit-learn
!pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu124

### Cell 0.1: Import essential modules

In [None]:
import os
from PIL import Image

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score
from tqdm.notebook import tqdm
from transformers import CLIPProcessor, CLIPModel, LlavaNextProcessor, LlavaNextForConditionalGeneration, BitsAndBytesConfig

In [None]:
if not torch.cuda.is_available():
    print('[WARNING] No GPU was detected. CLIP will run without a GPU, however a GPU is required to run the LLaVA model.\n' \
          'If using Google Colab, you can enable the GPU by selecting Runtime > Change runtime type > Hardware accelerator (select "GPU")')

### Cell 0.2: Define useful functions
Here we define some useful functions to help compute metrics we want to report for both models. These functions will be used in Cells 1.4 and 2.4, and in the Prompt Engineering (optional) section. The code here may look daunting, and it is okay if you don't fully understand it right away. If you prefer, you can simply run the code cell and move on to the next cell.

In [None]:
def per_class_sensitivity_specificity(ground_truth_labels, predicted_labels, label_names):
    """Computes per-class sensitivity (recall) and specificity metrics.

    Metrics returned in order of label_names.
    """
    sensitivity_scores, specificity_scores = [], []
    for label_name in label_names:
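        # Turn into a binary (one-vs-rest) problem. With average=None, recall_score returns
        # one recall per label in the order given by `labels`: False (specificity), then
        # True (sensitivity). Passing `labels` explicitly keeps both values present even
        # when a class never appears in the ground truth or the predictions.
        specificity, sensitivity = recall_score(
            np.array(ground_truth_labels) == label_name, np.array(predicted_labels) == label_name,
            labels=[False, True], pos_label=True, average=None, zero_division=np.nan)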
        sensitivity_scores.append(sensitivity)
        specificity_scores.append(specificity)
    return sensitivity_scores, specificity_scores


def generate_metric_report(ground_truth_labels, predicted_labels, label_names):
    """Computes and displays a report containing a collection of metrics.

    Per-class metrics include:
    - Precision
    - Recall (sensitivity)
    - Specificity
    - F1-Score (harmonic mean of precision and recall)

    Overall metrics include:
    - Accuracy
    - Unweighted Average Recall (UAR)
    """
    # Determine the length of the longest label name to format the table.
    len_longest_label = max(len(label_name) for label_name in label_names)

    # Compute per-class metrics.
    per_class_precision = precision_score(ground_truth_labels, predicted_labels, labels=label_names, average=None, zero_division=np.nan)
    per_class_recall, per_class_specificity = per_class_sensitivity_specificity(ground_truth_labels, predicted_labels, label_names=label_names)
    per_class_f1_score = f1_score(ground_truth_labels, predicted_labels, labels=label_names, average=None, zero_division=np.nan)

    # Display the per-class summary.
    print(f'{"":^{len_longest_label}s}    {"precision":>9s}    {"recall":>6s}    {"specificity":>11s}    {"f1-score":>9s}    {"support":>6s}\n')
    for label_idx, label_name in enumerate(label_names):
        print(f'{label_name:>{len_longest_label}s}    {per_class_precision[label_idx]:>9.2f}    {per_class_recall[label_idx]:>6.2f}    {per_class_specificity[label_idx]:>11.2f}    {per_class_f1_score[label_idx]:>9.2f}    {ground_truth_labels.count(label_name):>6d}')
    print('\n')

    # Compute overall metrics.
    accuracy = accuracy_score(ground_truth_labels, predicted_labels)
    uar = recall_score(ground_truth_labels, predicted_labels, labels=label_names, average='macro', zero_division=np.nan)

    # Display the overall metrics.
    print(f'{"Number of Examples":>35s}: {len(ground_truth_labels)}')
    print(f'{"Overall Accuracy":>35s}: {accuracy:.2%}')
    print(f'{"Unweighted Average Recall (UAR)":>35s}: {uar:.2%}')

### Cell 0.3: Load the "beverage" dataset

To evaluate model performance across different beverage types and scenes, we have constructed a dataset consisting of the following beverages:

**Alcoholic Beverages**
*   Beer bottle
*   Beer cup
*   Red wine glass
*   White wine glass


**Non-Alcoholic Beverages**
*   Coffee cup
*   Coffee plunger
*   Tea cup
*   Water bottle
*   Water cup


Multiple images of each beverage have been captured across five distinct scenes, both indoor and outdoor, with the beverage positioned at three different distances from the camera: foreground, midground, and background. The image below shows examples of the images contained in the dataset:

<div>
<img src="https://github.com/ltu-capr/zsl-image-tutorial/blob/main/dataset_example.jpg?raw=1" width="70%"/>
</div>

In total, each of the nine beverage types has been photographed in five different scenes and at three distances from the camera, resulting in 15 images per beverage. This leads to a complete dataset of 135 images (9 beverages $\times$ 15 images each).
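
The dataset size can be double-checked by enumerating every combination. This is a minimal sketch: the scene names are placeholders (the tutorial only states that there are five scenes), and only the counts matter.

```python
from itertools import product

beverages = [
    'beer bottle', 'beer cup', 'red wine glass', 'white wine glass',
    'coffee cup', 'coffee plunger', 'tea cup', 'water bottle', 'water cup',
]
# Placeholder scene names: the real scene names are not given in this tutorial.
scenes = [f'scene_{i}' for i in range(1, 6)]
positions = ['foreground', 'midground', 'background']

# One image per (beverage, scene, position) combination.
combinations = list(product(beverages, scenes, positions))
images_per_beverage = len(scenes) * len(positions)

print(images_per_beverage)  # 15
print(len(combinations))    # 135
```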

**Note for offline notebook version**: You may encounter an error while downloading the image dataset when running the code cell below. If so:
1. Manually download the dataset from [here](https://github.com/ltu-capr/zsl-image-tutorial/raw/refs/heads/main/Data/Images.zip).
2. Extract the downloaded ZIP folder and place the folder in the same location as this notebook. The name of the folder should be 'Images', and inside the folder should be 135 images.

In [None]:
# Download and extract the zipped image dataset.
!wget -O images.zip 'https://github.com/ltu-capr/zsl-image-tutorial/raw/refs/heads/main/Data/Images.zip'
!unzip -o -q images.zip -d ./Images

# Load the CSV file containing the labels for each image.
# Here we are giving the URL for a sample file that we've made publicly
# available on the Internet.
labels_location = 'https://raw.githubusercontent.com/ltu-capr/zsl-image-tutorial/main/Data/images_labelled.csv'
beverage_dataframe = pd.read_csv(labels_location)

### Cell 0.4: Visualise the "beverage" dataset
Here we visualise a few of the examples from the dataset.

In [None]:
# Define how many images to show.
rows, cols = 3, 6
num_images = rows * cols

# Create a grid (rows x cols) for displaying images.
fig, axes = plt.subplots(rows, cols, figsize=(18, 12))
axes = axes.flatten()

for index, sample in beverage_dataframe[:num_images].iterrows():
    # Load the image and create a small thumbnail.
    im = Image.open('Images/' + sample['image_name'] + '.jpg')
    im.thumbnail((256, 256), Image.Resampling.LANCZOS)

    # Display the image, setting the title as the beverage type and focal location.
    axes[index].imshow(im)
    axes[index].set_title(f'{sample["beverage"]}/{sample["position"]}')

    # Hide the axis markers.
    axes[index].tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)

# Display the plot.
plt.show()

## Example 1: ZSL Classification of Beverages in Images Using CLIP

CLIP takes an image and a collection of text descriptors, and generates an image-text similarity score for each image-descriptor pair. We can therefore give CLIP an image along with a set of candidate labels, and use the candidate label with the highest image-text similarity score as the prediction.
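
The scoring idea can be illustrated without loading CLIP at all. The sketch below uses invented three-dimensional vectors in place of real image and text embeddings, computes cosine similarities, and turns them into probabilities with a softmax. It is an illustration of the principle only: the real model produces much larger embeddings and also applies a learned scale to the similarities before the softmax, which this sketch omits.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Toy embeddings, invented for illustration (not real CLIP outputs).
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    'A photo containing beer': [0.8, 0.2, 0.1],
    'A photo containing wine': [0.1, 0.9, 0.3],
    'A photo containing water': [0.2, 0.1, 0.9],
}

labels = list(label_embeddings)
similarities = [cosine_similarity(image_embedding, label_embeddings[label]) for label in labels]
probabilities = softmax(similarities)

# The prediction is the label with the highest score.
predicted_label = labels[probabilities.index(max(probabilities))]
print(predicted_label)
```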

### Cell 1.0: Initialise the CLIP model

Initialise the CLIP model for use in zero-shot image classification. It may take a while for the model to download.

In [None]:
# Use the GPU if it's available.
clip_device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Load the pre-trained model.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(clip_device)

# Load an object used to prepare images in the way required by CLIP.
# This will resize, centre-crop, and normalise the images.
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

### Cell 1.1: Initialise classification labels

In order to perform classification, we must nominate candidate labels for the model to choose between. In this scenario we have six labels, but you can use as many labels as you need.

The wording of these labels can be engineered further, as explored in the Prompt Engineering (optional) section.

In [None]:
candidate_labels = [
    'A photo containing beer',
    'A photo containing wine',
    'A photo containing coffee',
    'A photo containing tea',
    'A photo containing water',
    'A photo containing no drinks'
]

### Cell 1.2: Make model predictions

Here we generate predictions for all images in our dataset.

In this example we also implement *batching*, enabling the model to process multiple images simultaneously. This is especially useful for larger datasets, as it can reduce computational overhead and speed up predictions.
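
Batching amounts to slicing the dataset into consecutive chunks. The sketch below shows the slicing pattern on a plain list; `iter_batches` is a hypothetical helper and the image names are placeholders. Note that the final batch can be smaller than the others.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder names standing in for the 135 images in the dataset.
image_names = [f'image_{i:03d}' for i in range(135)]
batches = list(iter_batches(image_names, batch_size=10))

print(len(batches))      # 14
print(len(batches[-1]))  # 5
```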

In [None]:
# Set the batch size (how many images to process at once).
clip_batch_size = 10

# Generate prediction results for all images in our dataset.
all_results = []
for index in tqdm(range(0, len(beverage_dataframe), clip_batch_size)):
    # Extract the next set of data to process.
    next_data = beverage_dataframe.iloc[index:index + clip_batch_size]

    # Open the images.
    images = [Image.open('Images/' + sample['image_name'] + '.jpg') for _, sample in next_data.iterrows()]

    # Preprocess the data to the correct format.
    inputs = clip_processor(
        text=candidate_labels,
        images=images, return_tensors='pt', padding=True
    ).to(clip_device)

    # Get the model outputs.
    with torch.no_grad():
        outputs = clip_model(**inputs)

    # Get the logits per-image (the image-text similarity score to each candidate label).
    image_logits = outputs.logits_per_image.cpu()

    # Convert logits per-image into probabilities with a softmax, then convert to a list.
    image_probs = image_logits.softmax(dim=1)
    image_probs = list(torch.unbind(image_probs, dim=0))

    # Store the probabilities.
    all_results.extend(image_probs)

    # Visualise last result in batch as model is running.
    last_image = images[-1]
    last_image_name = next_data.iloc[-1]['image_name']
    last_image_probs = all_results[-1]

    # Determine the index of the highest probability.
    highest_idx = torch.argmax(last_image_probs)

    # Prepare the image for displaying
    thumb_im = last_image.copy()
    thumb_im.thumbnail((256, 256), Image.Resampling.LANCZOS)
    display(thumb_im)

    print(f'Most likely label for {last_image_name}: {candidate_labels[highest_idx]} (probability: {100 * last_image_probs[highest_idx].item():.1f}%)')
    print('Per-class probabilities:')
    for lbl_idx, label in enumerate(candidate_labels):
        print(f'\t{label}: {100 * last_image_probs[lbl_idx].item():.1f}%')

### Cell 1.3: Save the model predictions

This code prepares an output CSV file containing model predictions which can be used for further analysis.

In [None]:
def save_clip_predictions(file_name, dataframe, all_results, candidate_labels, actual_label_column=None, label_class_map=None):
    # Arrange the results in tabular form with neat columns.
    rows = []
    for result, (index, sample) in zip(all_results, dataframe.iterrows()):
        scores_as_percentages = [round(score * 100, 2) for score in result.tolist()]
        row = {'image_name': sample['image_name'], **dict(zip(candidate_labels, scores_as_percentages))}
        rows.append(row)
    results_df = pd.DataFrame(rows, columns=['image_name', *candidate_labels])
    results_df['predicted_label'] = results_df[candidate_labels].idxmax(axis=1)

    if actual_label_column is not None and actual_label_column in dataframe.columns:
        if label_class_map is not None:
            results_df['predicted_class'] = results_df['predicted_label'].map(label_class_map)
        else:
            results_df['predicted_class'] = results_df['predicted_label'].str.replace('A photo containing ', '')
        results_df['actual_class'] = dataframe[actual_label_column]

    # Save output to a CSV file.
    os.makedirs('Outputs', exist_ok=True)
    output_file_name = os.path.join('Outputs', file_name)
    results_df.to_csv(output_file_name, index=False)

    try:
        # If we are on Google Colab, download the results.
        from google.colab import files
        files.download(output_file_name)
    except ModuleNotFoundError:
        # If we are not on Google Colab, show the output file location.
        print('Output file saved:')
        print(os.path.abspath(output_file_name))


save_clip_predictions('beverage_classification_clip_predictions.csv',
                      beverage_dataframe, all_results, candidate_labels,
                      'beverage_type')

### Cell 1.4: Measure the accuracy of model predictions (optional)

Zero-shot learning does not require hand-annotated labels to generate predictions, but they can be used to validate the model's accuracy. Here we compare the model's outputs with hand-annotated (ground truth) labels. If you don't have hand-annotated labels for your data, skip this step.

#### Cell 1.4a. Metric computation
In the code cell below, we will generate a table that summarises the precision, recall, specificity and F1 score for each beverage type individually, in addition to other statistics summarising the overall model performance, such as the overall accuracy and unweighted average recall (UAR).

In the generated report, *support* refers to the number of ground truth labels belonging to that class.
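
To make the per-class metrics concrete, here is a small dependency-free sketch that computes them for one class by treating the problem as "this class versus everything else". The labels are invented, and `one_vs_rest_metrics` is a hypothetical helper rather than the function used in this notebook.

```python
def one_vs_rest_metrics(ground_truth, predicted, target):
    """Compute binary metrics for `target` against all other classes.

    Assumes every denominator is non-zero, which holds for the toy data below.
    """
    pairs = list(zip(ground_truth, predicted))
    tp = sum(t == target and p == target for t, p in pairs)  # true positives
    fp = sum(t != target and p == target for t, p in pairs)  # false positives
    fn = sum(t == target and p != target for t, p in pairs)  # false negatives
    tn = sum(t != target and p != target for t, p in pairs)  # true negatives

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {'precision': precision, 'recall': recall,
            'specificity': specificity, 'f1': f1, 'support': tp + fn}


# Invented labels: 4 beer, 2 wine, and 2 tea images.
ground_truth = ['beer'] * 4 + ['wine'] * 2 + ['tea'] * 2
predicted = ['beer', 'beer', 'beer', 'wine', 'beer', 'beer', 'tea', 'tea']

beer_metrics = one_vs_rest_metrics(ground_truth, predicted, target='beer')
print(beer_metrics)
```

In this toy example, three of the four beer images are found (recall 0.75), but two wine images are also labelled as beer, which lowers precision to 0.60 and specificity to 0.50.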

In [None]:
# Put the predictions and ground truth into the right format.
predicted_labels = [candidate_labels[torch.argmax(result)] for result in all_results]
ground_truth_labels = list('A photo containing ' + beverage_dataframe['beverage_type'])

# Produce the metrics report.
generate_metric_report(ground_truth_labels, predicted_labels, label_names=candidate_labels)

These statistics show that CLIP achieves an overall accuracy of only 34.07%. Although there is no established baseline to compare against, this is a poor result. It suggests that the classification labels need further engineering, or that a different model may be more appropriate for this data.

#### Cell 1.4b. Confusion matrix

We can analyse the model's performance in more detail by visualising a confusion matrix. A confusion matrix shows how many dataset examples there are for each possible pair of true and predicted labels. Numbers which do not lie on the main diagonal of the matrix correspond to misclassifications. By inspecting the confusion matrix, we can quickly spot the classes on which the model performs poorly. For example, in this case we can see that the model is best at correctly identifying photos containing beer, wine, and coffee, but struggles with other beverage types, such as water.
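
As a rough illustration of what the matrix contains, the following sketch builds one by hand from invented labels. Rows correspond to true labels and columns to predicted labels, which matches the layout of the plot produced by scikit-learn.

```python
from collections import Counter

label_names = ['beer', 'wine', 'tea']

# Invented labels for six images.
ground_truth = ['beer', 'beer', 'wine', 'wine', 'tea', 'tea']
predicted = ['beer', 'wine', 'wine', 'wine', 'beer', 'tea']

# Count how often each (true label, predicted label) pair occurs.
pair_counts = Counter(zip(ground_truth, predicted))
confusion = [[pair_counts[(true_label, predicted_label)] for predicted_label in label_names]
             for true_label in label_names]

for true_label, row in zip(label_names, confusion):
    print(f'{true_label:>5s}', row)

# Entries on the main diagonal are correct predictions; all others are errors.
correct = sum(confusion[i][i] for i in range(len(label_names)))
misclassified = len(ground_truth) - correct
print(correct, misclassified)  # 4 2
```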

In [None]:
ConfusionMatrixDisplay.from_predictions(ground_truth_labels, predicted_labels, labels=candidate_labels, xticks_rotation='vertical', cmap='Blues', colorbar=False);

#### Cell 1.4c. Foreground/midground/background performance

We can take a closer look at what other factors might be causing this misclassification. For instance, we can look at the classification performance in foreground, midground, and background. In this case, we can see that the model is much better at making correct predictions when the beverage is in the foreground as opposed to the midground or background.
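
The grouping step behind this analysis can be sketched in a few lines of plain Python. The records below are invented purely to show the mechanics and are not results from the model.

```python
from collections import defaultdict

# Invented (position, true label, predicted label) records.
records = [
    ('foreground', 'beer', 'beer'),
    ('foreground', 'wine', 'wine'),
    ('midground', 'beer', 'beer'),
    ('midground', 'wine', 'coffee'),
    ('background', 'beer', 'water'),
    ('background', 'wine', 'coffee'),
]

# Collect, for each position, whether each prediction was correct.
outcomes = defaultdict(list)
for position, true_label, predicted_label in records:
    outcomes[position].append(true_label == predicted_label)

# Accuracy is the fraction of correct predictions within each group.
accuracy_by_position = {position: sum(hits) / len(hits) for position, hits in outcomes.items()}
for position, accuracy in accuracy_by_position.items():
    print(f'Accuracy for {position}: {accuracy:.2%}')
```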

In [None]:
for position in ('foreground', 'midground', 'background'):
    predicted_labels = []
    ground_truth_labels = []
    for result, (index, sample) in zip(all_results, beverage_dataframe.iterrows()):
        if sample['position'] == position:
            predicted_labels.append(candidate_labels[torch.argmax(result)])
            ground_truth_labels.append('A photo containing ' + sample['beverage_type'])

    accuracy = accuracy_score(ground_truth_labels, predicted_labels)
    print(f'Accuracy for {position}: {accuracy:.2%}', flush=True)

    ConfusionMatrixDisplay.from_predictions(ground_truth_labels, predicted_labels, labels=candidate_labels, xticks_rotation='vertical', cmap='Blues', colorbar=False);
    plt.show()

#### Cell 1.4d. Alcohol vs. not-alcohol evaluation

Finally, we can take a look at how well the model performed at correctly classifying whether an image contains an alcoholic beverage (beer or wine). If no beverage was detected, we classify this as no alcohol present.

In [None]:
# Put the predictions and ground truth into the right format.
predicted_labels = [candidate_labels[torch.argmax(result)] for result in all_results]
predicted_labels = ['alcohol' if 'beer' in label or 'wine' in label else 'not alcohol' for label in predicted_labels]
ground_truth_labels = list(beverage_dataframe['alcohol_notalcohol'])

# Produce the metrics report.
generate_metric_report(ground_truth_labels, predicted_labels, label_names=['alcohol', 'not alcohol'])

# Generate the confusion matrix.
ConfusionMatrixDisplay.from_predictions(ground_truth_labels, predicted_labels, labels=['alcohol', 'not alcohol'], xticks_rotation='vertical', cmap='Blues', colorbar=False);

## Example 2: ZSL Classification of Beverages in Images Using LLaVA

LLaVA is a large multimodal model that can accept both images and text as input. With LLaVA, we can provide an image along with a question, such as "What beverage is contained in the image?" or "Does the image contain an alcoholic beverage?". To make the answer suitable for classification, we can also ask the model to restrict its response to a set of candidate categories or a simple yes or no.
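
Even when asked to answer in one word, a generative model may add punctuation or extra words. The sketch below shows one possible way to map a free-text response onto a fixed set of categories. `normalise_response` and its fallback behaviour are hypothetical and are not part of this notebook, which uses the lower-cased response directly.

```python
import string

CANDIDATES = ('beer', 'wine', 'coffee', 'tea', 'water', 'none')


def normalise_response(response, candidates=CANDIDATES, default='none'):
    """Map a free-text model response to one candidate category."""
    cleaned = response.lower().strip(string.punctuation + string.whitespace)
    if cleaned in candidates:
        return cleaned

    # Otherwise, look for a candidate mentioned as a whole word, so that
    # 'steam' is not mistaken for 'tea'.
    words = [word.strip(string.punctuation) for word in cleaned.split()]
    for candidate in candidates:
        if candidate in words:
            return candidate

    # Assumption: anything unrecognised is treated as the default category.
    return default


print(normalise_response(' Beer. '))                # beer
print(normalise_response('It is a glass of wine'))  # wine
print(normalise_response('juice'))                  # none
```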

### Cell 2.0: Initialise the LLaVA model

Initialise the LLaVA model for use in zero-shot image classification. It will take a while for the model to download.

LLaVA is quite a large model, and by default it might not run on most machines due to memory constraints. To support most machines, we will use a quantised version of the model, which reduces the memory requirements at a small cost to accuracy.

The quantisation process itself needs a GPU, so a GPU is required to run the quantised version of LLaVA.
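
A back-of-the-envelope calculation shows why quantisation helps. The sketch below estimates the memory needed for the model weights alone, assuming roughly 7 billion parameters (as the "7b" in the model name used below suggests) and 16-bit weights as the unquantised baseline. Both figures are approximations, and real memory use is higher because of activations and other overhead.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the model weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Assumption: roughly 7 billion parameters.
num_parameters = 7e9

fp16_gb = weight_memory_gb(num_parameters, bits_per_parameter=16)
int4_gb = weight_memory_gb(num_parameters, bits_per_parameter=4)

print(f'16-bit weights: ~{fp16_gb:.1f} GB')  # ~14.0 GB
print(f' 4-bit weights: ~{int4_gb:.1f} GB')  # ~3.5 GB
```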

In [None]:
# Must use the GPU when running a quantised LLaVA.
if not torch.cuda.is_available():
    raise RuntimeError('Must have a GPU available to run a quantised version of LLaVA.')
llava_device = 'cuda:0'

# Specify which version of LLaVA we will use.
model_id = 'llava-hf/llava-v1.6-mistral-7b-hf'

# Set up parameters relating to model quantisation.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the pre-trained model.
llava_model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config)

# Load an object used to prepare data in the right way required by LLaVA.
llava_processor = LlavaNextProcessor.from_pretrained(
    model_id, patch_size=llava_model.config.vision_config.patch_size,
    vision_feature_select_strategy=llava_model.config.vision_feature_select_strategy,
)
llava_model.generation_config.pad_token_id = llava_processor.tokenizer.eos_token_id

### Cell 2.1: Initialise prompt

In order to perform classification we must first create a prompt for the LLaVA model. Here, we will ask the model to identify the presence of any specific beverages (beer, wine, coffee, tea, water) or none in the image. This approach allows us to handle the task as a multi-class classification problem, similar to what we did for CLIP.

We also limit the maximum number of tokens (words or word parts) that LLaVA can generate in its response. Here, we set it to a low value (10) as we only expect a single-word answer, though this choice is less critical as our prompt specifically instructs LLaVA to respond with only one word. Raising this limit would let LLaVA generate longer responses, which can be useful if a more detailed answer is needed.

In [None]:
# Here we specify the prompt to give to LLaVA.
prompt = 'What type of beverage does this image contain? Pick the most relevant one from this list: beer, wine, coffee, tea, water, none. Answer in one word.'

# Here we specify how many tokens (words or word parts) can be returned by LLaVA. We restrict this to a small number.
max_new_tokens = 10

### Cell 2.2: Make model predictions

Here we generate predictions for all images in our dataset.

In this example we also implement *batching*, enabling the model to process multiple images simultaneously. This is especially useful for larger datasets, as it can reduce computational overhead and speed up predictions.

Depending on the GPU used, a smaller batch size may be required due to memory constraints.

In [None]:
# Set the batch size (how many images to process at once).
llava_batch_size = 10

# Generate prediction results for all images in our dataset.
all_results = []
for index in tqdm(range(0, len(beverage_dataframe), llava_batch_size)):
    # Extract the next set of data to process.
    next_data = beverage_dataframe.iloc[index:index + llava_batch_size]

    # Open the images.
    images = [Image.open('Images/' + sample['image_name'] + '.jpg') for _, sample in next_data.iterrows()]

    # Create the formatted prompt for LLaVA.
    llava_prompt = f'[INST] <image>\n{prompt} [/INST]'

    # Prepare the inputs to pass to LLaVA.
    inputs = llava_processor(
        text=[llava_prompt] * len(images),
        images=images, return_tensors='pt'
    ).to(llava_device)

    # Get the model predictions.
    with torch.no_grad():
        outputs = llava_model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Process all predictions.
    for output in outputs:
        # Get the generated text and extract just the response.
        generated_text = llava_processor.decode(output.cpu(), skip_special_tokens=True)
        response = generated_text.split(f'{prompt} [/INST]')[1].strip().lower()

        # Store the response.
        all_results.append(response)

    # Visualise last result in batch as model is running.
    last_image = images[-1]
    last_image_name = next_data.iloc[-1]['image_name']
    last_image_response = all_results[-1]

    # Prepare the image for displaying
    thumb_im = last_image.copy()
    thumb_im.thumbnail((256, 256), Image.Resampling.LANCZOS)
    display(thumb_im)

    print(f'Image: {last_image_name}')
    print(f'Prompt: {prompt}')
    print(f'Response: {last_image_response}')

### Cell 2.3: Save the model predictions

This code prepares an output CSV file containing model predictions which can be used for further analysis.

In [None]:
def save_llava_predictions(file_name, dataframe, all_results, actual_label_column=None, response_class_map=None, default_class=None):
    # Arrange the results in tabular form with neat columns.
    results_df = pd.DataFrame([])
    results_df['image_name'] = dataframe['image_name']
    results_df['predicted_label'] = all_results

    if actual_label_column is not None and actual_label_column in dataframe.columns:
        if response_class_map is not None:
            results_df['predicted_class'] = results_df['predicted_label'].map(lambda x: response_class_map.get(x, default_class))
        else:
            results_df['predicted_class'] = results_df['predicted_label']
        results_df['actual_class'] = dataframe[actual_label_column]

    # Save output to a CSV file.
    os.makedirs('Outputs', exist_ok=True)
    output_file_name = os.path.join('Outputs', file_name)
    results_df.to_csv(output_file_name, index=False)

    try:
        # If we are on Google Colab, download the results.
        from google.colab import files
        files.download(output_file_name)
    except ModuleNotFoundError:
        # If we are not on Google Colab, show the output file location.
        print('Output file saved:')
        print(os.path.abspath(output_file_name))

save_llava_predictions('beverage_classification_llava_predictions.csv',
                       beverage_dataframe, all_results, 'beverage_type')

### Cell 2.4: Measure the accuracy of model predictions (optional)

Zero-shot learning does not require hand-annotated labels to generate predictions, but they can be used to validate the model's accuracy. Here we compare the model's outputs with hand-annotated (ground truth) labels. If you don't have hand-annotated labels for your data, skip this step.

#### Cell 2.4a. Metric computation

In the code cell below, we will generate a table that summarises the precision, recall, specificity and F1 score for each beverage type individually, in addition to other statistics summarising the overall model performance, such as the overall accuracy and unweighted average recall (UAR).

In the generated report, *support* refers to the number of ground truth labels belonging to that class.
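
Accuracy and UAR can differ when classes have different numbers of examples, as they do here (there are fewer tea images than images of the other beverages). The sketch below uses invented, deliberately imbalanced labels to show the difference; `recall_for` is a hypothetical helper.

```python
def recall_for(label, ground_truth, predicted):
    """Fraction of examples with true class `label` that were predicted correctly."""
    relevant = [p for t, p in zip(ground_truth, predicted) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


# Invented example with imbalanced classes: 8 'beer' images and 2 'tea' images.
ground_truth = ['beer'] * 8 + ['tea'] * 2
predicted = ['beer'] * 8 + ['beer', 'tea']

accuracy = sum(t == p for t, p in zip(ground_truth, predicted)) / len(ground_truth)
per_class_recall = {label: recall_for(label, ground_truth, predicted) for label in ('beer', 'tea')}

# UAR is the plain average of the per-class recalls, so each class counts equally.
uar = sum(per_class_recall.values()) / len(per_class_recall)

print(f'Accuracy: {accuracy:.2%}')  # 90.00%
print(f'UAR: {uar:.2%}')            # 75.00%
```

Accuracy looks high because the large class is handled well, whereas UAR exposes the weaker performance on the small class.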

In [None]:
# Put the predictions and ground truth into the right format.
predicted_labels = all_results
ground_truth_labels = list(beverage_dataframe['beverage_type'])

# Produce the metrics report.
generate_metric_report(ground_truth_labels, predicted_labels, label_names=('beer', 'wine', 'coffee', 'tea', 'water', 'none'))

These results look much better than for CLIP, with an overall accuracy of 79.26%.

#### Cell 2.4b. Confusion matrix

We can analyse accuracy in more detail by visualising a confusion matrix. Overall, the model does a really good job! It often misclassifies tea as coffee, although distinguishing between the two is very challenging.

In [None]:
ConfusionMatrixDisplay.from_predictions(ground_truth_labels, predicted_labels, labels=('beer', 'wine', 'coffee', 'tea', 'water', 'none'), xticks_rotation='vertical', cmap='Blues', colorbar=False);

#### Cell 2.4c. Foreground/midground/background performance

We can further analyse the classification performance in foreground, midground, and background. Similarly to CLIP, the model performs better when the beverage is contained in the foreground, and worse when in the background.

In [None]:
for position in ('foreground', 'midground', 'background'):
    predicted_labels = []
    ground_truth_labels = []
    for result, (index, sample) in zip(all_results, beverage_dataframe.iterrows()):
        if sample['position'] == position:
            predicted_labels.append(result)
            ground_truth_labels.append(sample['beverage_type'])

    accuracy = accuracy_score(ground_truth_labels, predicted_labels)
    print(f'Accuracy for {position}: {accuracy:.2%}', flush=True)

    ConfusionMatrixDisplay.from_predictions(ground_truth_labels, predicted_labels, labels=('beer', 'wine', 'coffee', 'tea', 'water', 'none'), xticks_rotation='vertical', cmap='Blues', colorbar=False);
    plt.show()

#### Cell 2.4d. Alcohol vs. not-alcohol evaluation

Finally, we can take a look at how well the model performed at correctly classifying whether an image contains an alcoholic beverage (beer or wine). If no beverage was detected, we classify this as no alcohol present. There were no instances where the model incorrectly labelled not alcohol as alcohol!

In [None]:
# Put the predictions and ground truth into the right format.
predicted_labels = ['alcohol' if label in {'beer', 'wine'} else 'not alcohol' for label in all_results]
ground_truth_labels = list(beverage_dataframe['alcohol_notalcohol'])

# Produce the metrics report.
generate_metric_report(ground_truth_labels, predicted_labels, label_names=['alcohol', 'not alcohol'])

# Generate the confusion matrix.
ConfusionMatrixDisplay.from_predictions(ground_truth_labels, predicted_labels, labels=['alcohol', 'not alcohol'], xticks_rotation='vertical', cmap='Blues', colorbar=False);

## Prompt Engineering (optional)

In this section, we present a more comprehensive analysis of our CLIP and LLaVA results by testing and evaluating different prompts. While this section is optional, it provides deeper insight into the prompt engineering process and demonstrates how different prompts can be compared.

### CLIP

This section evaluates several sets of candidate labels for the CLIP model to compare how various sets of prompts impact model performance.

Please make sure to run Cell 1.0 and Cell 1.3 before running this section.

#### Candidate Labels

The code cell below defines a list of different candidate label sets that will be passed to CLIP to make predictions on our validation dataset.

Each item in the list is a dictionary with:
* `name`: An identifier for the candidate label set. This name appears when summarising model performance and is used as the filename when storing predictions in a CSV file.
* `labels`: A list of `(label, class)` pairs:
  * `label` is the prompt given to CLIP.
  * `class` is the ground truth class associated with that label.

An example item is shown below:
```
{
    'name': 'A photo containing',
    'labels': [
        ('A photo containing beer', 'beer'),
        ('A photo containing wine', 'wine'),
        ('A photo containing coffee', 'coffee'),
        ('A photo containing tea', 'tea'),
        ('A photo containing water', 'water'),
        ('A photo containing no drinks', 'no drinks'),
    ]
}
```

If you would like to explore further prompt engineering on our validation dataset, feel free to add your own candidate label sets below or modify the existing ones as you see fit.
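
Because each label carries its own class, several labels can point to the same class. The sketch below shows one plausible way to turn a label set into the list of prompts and a label-to-class lookup of the kind accepted by the `label_class_map` argument of `save_clip_predictions`. The shortened label set and the probabilities are invented for illustration.

```python
# A shortened, illustrative label set in the same format as above.
label_set = {
    'name': 'Example',
    'labels': [
        ('beer bottle', 'beer'),
        ('beer glass', 'beer'),
        ('wine glass', 'wine'),
        ('no drinks', 'no drinks'),
    ],
}

# The prompts given to the model, and a lookup from prompt to class.
candidate_labels = [label for label, _ in label_set['labels']]
label_class_map = dict(label_set['labels'])

# Invented per-label probabilities for a single image.
probabilities = [0.15, 0.55, 0.20, 0.10]

# Pick the highest-scoring label, then map it back to its class.
predicted_label = candidate_labels[probabilities.index(max(probabilities))]
predicted_class = label_class_map[predicted_label]

print(predicted_label)  # beer glass
print(predicted_class)  # beer
```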

In [None]:
# Sets of candidate labels to be evaluated
clip_label_set = [
    # Candidate label set 1: Prefixing beverages by 'A photo containing'
    {
        'name': 'A photo containing',
        'labels': [
            ('A photo containing beer', 'beer'),
            ('A photo containing wine', 'wine'),
            ('A photo containing coffee', 'coffee'),
            ('A photo containing tea', 'tea'),
            ('A photo containing water', 'water'),
            ('A photo containing no drinks', 'no drinks'),
        ]
    },

    # Candidate label set 2: Prefixing beverages by 'An image of'
    {
        'name': 'An image of',
        'labels': [
            ('An image of beer', 'beer'),
            ('An image of wine', 'wine'),
            ('An image of coffee', 'coffee'),
            ('An image of tea', 'tea'),
            ('An image of water', 'water'),
            ('An image of no drinks', 'no drinks'),
        ]
    },

    # Candidate label set 3: Just listing the beverage type
    {
        'name': 'No Prefix',
        'labels': [
            ('beer', 'beer'),
            ('wine', 'wine'),
            ('coffee', 'coffee'),
            ('tea', 'tea'),
            ('water', 'water'),
            ('no drinks', 'no drinks'),
        ]
    },

    # Candidate label set 4: Adding descriptions of the beverage receptacle
    {
        'name': 'Receptacle Description',
        'labels': [
            ('beer bottle or glass', 'beer'),
            ('wine glass', 'wine'),
            ('coffee mug or plunger', 'coffee'),
            ('teacup', 'tea'),
            ('water bottle or glass', 'water'),
            ('no drinks', 'no drinks'),
        ]
    },

    # Candidate label set 5: Separate labels for each beverage receptacle
    {
        'name': 'Separate Receptacle Description',
        'labels': [
            ('beer bottle', 'beer'),
            ('beer glass', 'beer'),
            ('wine glass', 'wine'),
            ('coffee mug', 'coffee'),
            ('coffee plunger', 'coffee'),
            ('teacup', 'tea'),
            ('water bottle', 'water'),
            ('water glass', 'water'),
            ('no drinks', 'no drinks'),
        ]
    },

    # Candidate label set 6: Combining candidate label sets 3 and 5
    {
        'name': 'Beverage and Separate Receptacle Description',
        'labels': [
            ('beer', 'beer'),
            ('beer bottle', 'beer'),
            ('beer glass', 'beer'),
            ('wine', 'wine'),
            ('wine glass', 'wine'),
            ('coffee', 'coffee'),
            ('coffee mug', 'coffee'),
            ('coffee plunger', 'coffee'),
            ('tea', 'tea'),
            ('teacup', 'tea'),
            ('water', 'water'),
            ('water bottle', 'water'),
            ('water glass', 'water'),
            ('no drinks', 'no drinks'),
        ]
    },

    # Candidate label set 7: Candidate label set 6, with added labels
    #                        containing 'person holding' prefixes
    {
        'name': 'Descriptive Phrases',
        'labels': [
            ('beer', 'beer'),
            ('person holding beer', 'beer'),
            ('beer bottle', 'beer'),
            ('person holding beer bottle', 'beer'),
            ('beer glass', 'beer'),
            ('person holding beer glass', 'beer'),
            ('wine', 'wine'),
            ('person holding wine', 'wine'),
            ('wine glass', 'wine'),
            ('person holding wine glass', 'wine'),
            ('coffee', 'coffee'),
            ('person holding coffee', 'coffee'),
            ('coffee mug', 'coffee'),
            ('person holding coffee mug', 'coffee'),
            ('coffee plunger', 'coffee'),
            ('person holding coffee plunger', 'coffee'),
            ('tea', 'tea'),
            ('teacup', 'tea'),
            ('person holding teacup', 'tea'),
            ('water', 'water'),
            ('person holding water', 'water'),
            ('water bottle', 'water'),
            ('person holding water bottle', 'water'),
            ('water glass', 'water'),
            ('person holding water glass', 'water'),
            ('no drinks', 'no drinks'),
        ]
    },
]

#### Model Inference

The code cell below runs the CLIP model on **ALL** sets of candidate labels and automatically downloads separate CSV files with predictions for each set.

Note: This cell may take some time to complete.
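
The inference cell turns CLIP's per-image scores (logits) into probabilities with a softmax over the candidate labels. The sketch below shows that step in plain Python so it can be followed without a GPU. The labels and logit values are hypothetical, and `softmax` here is a hand-written stand-in for the tensor method used in the cell.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability.
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for one image against three candidate labels.
example_labels = ['beer', 'wine', 'no drinks']
example_logits = [24.0, 21.0, 19.5]

example_probs = softmax(example_logits)

# The predicted label is the one with the highest probability.
best_label = example_labels[example_probs.index(max(example_probs))]
print(best_label, [round(p, 3) for p in example_probs])
```

Because the softmax is taken over the candidate labels, the probabilities depend on which labels are in the set. Adding or rewording a label can change the probability assigned to every other label.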

In [None]:
# Set the batch size
clip_batch_size = 10

# Run inference on each label type
clip_all_label_results = []
for candidate_label_set in clip_label_set:
    # Extract information about these candidate labels (including class mapping)
    label_set_name = candidate_label_set['name']
    candidate_labels, candidate_classes = [list(dat) for dat in zip(*candidate_label_set['labels'])]
    unique_candidate_classes = list(dict.fromkeys(candidate_classes))
    candidate_label_map = {label: class_name for label, class_name in candidate_label_set['labels']}

    # Generate prediction results for all images in the dataset.
    current_results = []
    progress = tqdm(range(0, len(beverage_dataframe), clip_batch_size))
    for index in progress:
        progress.set_description(f'Processing "{label_set_name}" labels...')
        next_data = beverage_dataframe.iloc[index:index + clip_batch_size]
        images = [Image.open('Images/' + sample['image_name'] + '.jpg') for _, sample in next_data.iterrows()]
        inputs = clip_processor(text=candidate_labels, images=images, return_tensors='pt', padding=True).to(clip_device)
        with torch.no_grad():
            outputs = clip_model(**inputs)
        image_logits = outputs.logits_per_image.cpu()
        image_probs = list(torch.unbind(image_logits.softmax(dim=1), dim=0))
        current_results.extend(image_probs)

    # Save output to a CSV file.
    save_clip_predictions(f'clip_predictions_{label_set_name.lower()}.csv',
                          beverage_dataframe, current_results, candidate_labels,
                          'beverage_type', label_class_map=candidate_label_map)

    # Store a summary of the results
    clip_all_label_results.append(current_results)

#### Evaluation

The code cell below evaluates performance on the validation dataset for each set of candidate labels (similar to Cell 1.4). For every candidate label set, the following will be generated:

* A table summarising key per-class metrics, along with metrics aggregated across the entire validation dataset (similar to Cell 1.4a).
* A confusion matrix across all beverage classes (similar to Cell 1.4b).

Finally, a summary table will be produced showing the overall accuracy and unweighted average recall for each candidate label set, sorted by accuracy. This provides a quick comparison between sets and can help identify which ones may be best suited for application to larger datasets.
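
To illustrate why both metrics are reported, here is a small worked example on hypothetical, imbalanced toy data. The helper functions are hand-written for this sketch rather than the scikit-learn functions used in the cell, and they only consider classes that appear in the ground truth.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with every class weighted equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical toy data: eight 'no drinks' images and two 'beer' images.
toy_truth = ['no drinks'] * 8 + ['beer'] * 2
toy_pred = ['no drinks'] * 8 + ['beer', 'no drinks']

print(f'Accuracy: {accuracy(toy_truth, toy_pred):.2%}')
print(f'UAR:      {unweighted_average_recall(toy_truth, toy_pred):.2%}')
```

In this toy case the accuracy is 90% but the UAR is 75%, because half of the rare `beer` class is missed. UAR exposes weak performance on small classes that accuracy can hide.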

In [None]:
# Store for summary statistics (name, accuracy, UAR)
clip_summary_statistics = []

for candidate_label_set, results in zip(clip_label_set, clip_all_label_results):
    label_set_name = candidate_label_set['name']
    print(f'Metrics for label set: "{label_set_name}"\n')

    # Extract information
    candidate_labels, candidate_classes = [list(dat) for dat in zip(*candidate_label_set['labels'])]
    unique_candidate_classes = list(dict.fromkeys(candidate_classes))
    candidate_label_map = {label: class_name for label, class_name in candidate_label_set['labels']}

    # Extract labels
    predicted_labels = [candidate_labels[torch.argmax(result)] for result in results]
    ground_truth_class = list(beverage_dataframe['beverage_type'])

    # Map labels to class
    predicted_class = [candidate_label_map[label] for label in predicted_labels]

    # Produce the metrics report
    generate_metric_report(ground_truth_class, predicted_class, label_names=unique_candidate_classes)

    # Display the confusion matrix
    ConfusionMatrixDisplay.from_predictions(ground_truth_class, predicted_class, labels=unique_candidate_classes,
                                            xticks_rotation='vertical', cmap='Blues', colorbar=False)
    plt.show()

    # Store information for final summary
    clip_summary_statistics.append({
        'name': label_set_name,
        'accuracy': accuracy_score(ground_truth_class, predicted_class),
        'UAR': recall_score(ground_truth_class, predicted_class, labels=unique_candidate_classes,
                            average='macro', zero_division=np.nan),
    })

    # Produce some spacing for the next label set
    print('\n' * 2)
    print('-' * 70)


# Print final summary ranked by accuracy then UAR
clip_summary_statistics.sort(key=lambda x: (x['accuracy'], x['UAR']), reverse=True)
len_longest_name = len(max(clip_summary_statistics, key=lambda x: len(x['name']))['name'])
print('Final Summary Across Candidate Labels\n(Sorted by Highest Accuracy)\n')
print(f'{"Name":^{len_longest_name}s} | {"Accuracy"} | {"UAR":^6s}')
print('-' * len_longest_name + ' + ' + '-' * 8 + ' + ' + '-' * 6)
for summary_stat in clip_summary_statistics:
    print(f'{summary_stat["name"]:{len_longest_name}s} | {summary_stat["accuracy"]:^8.2%} | {summary_stat["UAR"]:6.2%}')

### LLaVA

This section evaluates several prompts for the LLaVA model to compare how different wordings affect model performance.

Please make sure to run Cell 2.0 and Cell 2.3 before running this section.

#### Prompts

The code cell below defines a list of different prompts that will be passed to LLaVA to make predictions on our validation dataset.

Each item in the list is a dictionary with:
* `name`: An identifier for the prompt. This name appears when summarising model performance and is used in the filename of the CSV that stores the predictions.
* `prompt`: The prompt given to LLaVA.
* `response_map`: A dictionary mapping LLaVA's responses to the corresponding ground truth beverage class.
* `default_class`: The class assigned when LLaVA's response is not found in `response_map`. This handles unexpected responses and is typically set to the negative class (e.g. `none`).
* `max_new_tokens`: The maximum number of tokens LLaVA is allowed to generate in its response.

An example item is shown below:
```
{
    'name': 'Beverage Receptacle',
    'prompt': 'What type of beverage does this image contain? Pick the most relevant one from this list: beer bottle, beer glass, wine glass, coffee mug, coffee plunger, teacup, water bottle, water glass, none. Respond with exactly one item from the list.',
    'response_map': {
        'beer bottle': 'beer',
        'beer glass': 'beer',
        'wine glass': 'wine',
        'coffee mug': 'coffee',
        'coffee plunger': 'coffee',
        'teacup': 'tea',
        'water bottle': 'water',
        'water glass': 'water',
        'none': 'none',
    },
    'default_class': 'none',
    'max_new_tokens': 10,
}
```
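
As a minimal sketch of how `response_map` and `default_class` work together, the snippet below maps a few raw responses to classes. The map is a shortened, illustrative copy and the responses are hypothetical.

```python
# Shortened, illustrative response map (not one of the prompts defined in this notebook).
example_response_map = {
    'beer bottle': 'beer',
    'wine glass': 'wine',
    'none': 'none',
}
example_default_class = 'none'

# Hypothetical raw responses, already stripped and lower-cased.
example_responses = ['beer bottle', 'wine glass', 'a glass of juice', 'none']

# Responses missing from the map fall back to the default class.
example_classes = [example_response_map.get(response, example_default_class)
                   for response in example_responses]
print(example_classes)
```

The lookup in this sketch requires an exact match, so a response with extra words or trailing punctuation also falls back to the default class. It can be worth inspecting the raw responses in the saved CSV before trusting the mapped classes.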

If you would like to explore further prompt engineering on our validation dataset, feel free to add your own prompts below or modify the existing ones as you see fit.

In [None]:
# Sets of prompts to be evaluated
llava_prompt_set = [
    # Prompt 1: Asking for a beverage name in a list
    {
        'name': 'Beverage name',
        'prompt': 'What type of beverage does this image contain? Pick the most relevant one from this list: beer, wine, coffee, tea, water, none. Answer in one word.',
        'response_map': {
            'beer': 'beer',
            'wine': 'wine',
            'coffee': 'coffee',
            'tea': 'tea',
            'water': 'water',
            'none': 'none',
        },
        'default_class': 'none',
        'max_new_tokens': 10,
    },

    # Prompt 2: Asking for conservative predictions on the alcohol classes
    {
        'name': 'Beverage name Alcohol Conservative',
        'prompt': 'What type of beverage does this image contain? Pick the most relevant one from this list: beer, wine, coffee, tea, water, none. Only predict beer or wine if you are completely confident. Answer in one word.',
        'response_map': {
            'beer': 'beer',
            'wine': 'wine',
            'coffee': 'coffee',
            'tea': 'tea',
            'water': 'water',
            'none': 'none',
        },
        'default_class': 'none',
        'max_new_tokens': 10,
    },

    # Prompt 3: Adding separate receptacle descriptions in the prompt
    {
        'name': 'Beverage Receptacle',
        'prompt': 'What type of beverage does this image contain? Pick the most relevant one from this list: beer bottle, beer glass, wine glass, coffee mug, coffee plunger, teacup, water bottle, water glass, none. Respond with exactly one item from the list.',
        'response_map': {
            'beer bottle': 'beer',
            'beer glass': 'beer',
            'wine glass': 'wine',
            'coffee mug': 'coffee',
            'coffee plunger': 'coffee',
            'teacup': 'tea',
            'water bottle': 'water',
            'water glass': 'water',
            'none': 'none',
        },
        'default_class': 'none',
        'max_new_tokens': 10,
    },

    # Prompt 4: Same as prompt 3, but also adding the beverage names (including red and white wine)
    {
        'name': 'Beverage Type and Receptacle',
        'prompt': 'What type of beverage does this image contain? Pick the most relevant one from this list: beer, beer bottle, beer glass, wine, red wine, white wine, wine glass, coffee, coffee mug, coffee plunger, tea, teacup, water, water bottle, water glass, none. Respond with exactly one item from the list.',
        'response_map': {
            'beer': 'beer',
            'beer bottle': 'beer',
            'beer glass': 'beer',
            'wine': 'wine',
            'red wine': 'wine',
            'white wine': 'wine',
            'wine glass': 'wine',
            'coffee': 'coffee',
            'coffee mug': 'coffee',
            'coffee plunger': 'coffee',
            'tea': 'tea',
            'teacup': 'tea',
            'water': 'water',
            'water bottle': 'water',
            'water glass': 'water',
            'none': 'none',
        },
        'default_class': 'none',
        'max_new_tokens': 10,
    },
]

#### Model Inference

The code cell below runs the LLaVA model on **ALL** prompts and automatically downloads separate CSV files with predictions for each prompt.

Note: This cell may take some time to complete.
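
The decoded output from LLaVA contains the prompt followed by the answer, so the cell below splits the text on the prompt and the closing `[/INST]` tag to isolate the response. The sketch below shows that string handling on its own. The prompt and the decoded text are hypothetical, and the exact layout of the decoded text is an assumption made for illustration.

```python
# Hypothetical prompt and decoded output, for illustration only.
example_prompt = 'What type of beverage does this image contain? Answer in one word.'
example_generated_text = f'[INST]  \n{example_prompt} [/INST] Beer '

# Everything after the prompt and closing tag is the model's answer.
# Stripping whitespace and lower-casing makes it comparable to the response map keys.
example_response = example_generated_text.split(f'{example_prompt} [/INST]')[1].strip().lower()
print(example_response)
```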

In [None]:
# Set the batch size
llava_batch_size = 10

# Run inference on each prompt set
llava_all_prompt_results = []
for prompt_set in llava_prompt_set:
    # Extract information about these prompts (including response mapping)
    prompt_set_name = prompt_set['name']
    current_llava_prompt = prompt_set['prompt']
    current_llava_max_tokens = prompt_set['max_new_tokens']
    response_map = {k.lower(): v.lower() for k, v in prompt_set['response_map'].items()}
    candidate_classes = list(prompt_set['response_map'].values()) + [prompt_set['default_class']]
    unique_candidate_classes = list(dict.fromkeys(candidate_classes))

    # Generate prediction results for all images in the dataset.
    current_results = []
    progress = tqdm(range(0, len(beverage_dataframe), llava_batch_size))
    for index in progress:
        progress.set_description(f'Processing "{prompt_set_name}" prompt...')
        next_data = beverage_dataframe.iloc[index:index + llava_batch_size]
        images = [Image.open('Images/' + sample['image_name'] + '.jpg') for _, sample in next_data.iterrows()]
        llava_prompt = f'[INST] <image>\n{current_llava_prompt} [/INST]'
        inputs = llava_processor(text=[llava_prompt] * len(images), images=images, return_tensors='pt').to(llava_device)
        with torch.no_grad():
            outputs = llava_model.generate(**inputs, max_new_tokens=current_llava_max_tokens)
        for output in outputs:
            generated_text = llava_processor.decode(output.cpu(), skip_special_tokens=True)
            response = generated_text.split(f'{current_llava_prompt} [/INST]')[1].strip().lower()
            current_results.append(response)

    # Save output to a CSV file.
    save_llava_predictions(f'llava_predictions_{prompt_set_name.lower()}.csv',
                           beverage_dataframe, current_results, 'beverage_type',
                           response_class_map=response_map,
                           default_class=prompt_set['default_class'])

    # Store a summary of the results
    llava_all_prompt_results.append(current_results)

#### Evaluation

The code cell below evaluates performance on the validation dataset for each prompt (similar to Cell 2.4). For every prompt, the following will be generated:

* A table summarising key per-class metrics, along with metrics aggregated across the entire validation dataset (similar to Cell 2.4a).
* A confusion matrix across all beverage classes (similar to Cell 2.4b).

Finally, a summary table will be produced showing the overall accuracy and unweighted average recall for each prompt, sorted by accuracy. This provides a quick comparison between prompts and can help identify which ones may be best suited for application to larger datasets.
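
The ranking in the summary table sorts on an `(accuracy, UAR)` tuple, so UAR only matters when two prompts have the same accuracy. A minimal sketch with hypothetical numbers:

```python
# Hypothetical summary statistics, for illustration only.
example_summary = [
    {'name': 'Prompt A', 'accuracy': 0.80, 'UAR': 0.70},
    {'name': 'Prompt B', 'accuracy': 0.85, 'UAR': 0.65},
    {'name': 'Prompt C', 'accuracy': 0.80, 'UAR': 0.75},
]

# Tuples compare element by element, so UAR only breaks ties in accuracy.
example_summary.sort(key=lambda stat: (stat['accuracy'], stat['UAR']), reverse=True)

for stat in example_summary:
    print(f'{stat["name"]:10s} | {stat["accuracy"]:^8.2%} | {stat["UAR"]:6.2%}')
```

Here `Prompt B` ranks first despite having the lowest UAR. If performance on the smaller classes matters most for your application, you may prefer to swap the order of the tuple.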

In [None]:
# Store for summary statistics (name, accuracy, UAR)
llava_summary_statistics = []

for prompt_set, results in zip(llava_prompt_set, llava_all_prompt_results):
    prompt_set_name = prompt_set['name']
    print(f'Metrics for prompt: "{prompt_set_name}"\n')

    # Extract information
    response_map = {k.lower(): v.lower() for k, v in prompt_set['response_map'].items()}
    candidate_classes = list(prompt_set['response_map'].values()) + [prompt_set['default_class']]
    unique_candidate_classes = list(dict.fromkeys(candidate_classes))

    # Extract labels
    predicted_labels = results
    ground_truth_class = list(beverage_dataframe['beverage_type'])

    # Map labels to class
    predicted_class = [response_map.get(label, prompt_set['default_class']) for label in predicted_labels]

    # Produce the metrics report
    generate_metric_report(ground_truth_class, predicted_class, label_names=unique_candidate_classes)

    # Display the confusion matrix
    ConfusionMatrixDisplay.from_predictions(ground_truth_class, predicted_class, labels=unique_candidate_classes,
                                            xticks_rotation='vertical', cmap='Blues', colorbar=False)
    plt.show()

    # Store information for final summary
    llava_summary_statistics.append({
        'name': prompt_set_name,
        'accuracy': accuracy_score(ground_truth_class, predicted_class),
        'UAR': recall_score(ground_truth_class, predicted_class, labels=unique_candidate_classes,
                            average='macro', zero_division=np.nan),
    })

    # Produce some spacing for the next prompt
    print('\n' * 2)
    print('-' * 70)


# Print final summary ranked by accuracy then UAR
llava_summary_statistics.sort(key=lambda x: (x['accuracy'], x['UAR']), reverse=True)
len_longest_name = len(max(llava_summary_statistics, key=lambda x: len(x['name']))['name'])
print('Final Summary Across Prompts\n(Sorted by Highest Accuracy)\n')
print(f'{"Name":^{len_longest_name}s} | {"Accuracy"} | {"UAR":^6s}')
print('-' * len_longest_name + ' + ' + '-' * 8 + ' + ' + '-' * 6)
for summary_stat in llava_summary_statistics:
    print(f'{summary_stat["name"]:{len_longest_name}s} | {summary_stat["accuracy"]:^8.2%} | {summary_stat["UAR"]:6.2%}')

## Try Your Own Analysis

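When you run the following cell, you will be asked to provide the image files you would like to classify. On Google Colab an upload widget appears directly below the cell. When running locally, you will be prompted to enter each file path in turn.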

In [None]:
import os  # Used below to check that each file exists.

try:
    # If we are on Google Colab, show an upload widget.
    from google.colab import files
    uploaded = files.upload()
    # Always produce a list so that the loops below work even if nothing was uploaded.
    file_locations = list(uploaded.keys()) if uploaded else []
except ModuleNotFoundError:
    # If we are not on Google Colab, ask for the names of all files.
    file_locations = []
    while True:
        next_location = input('Please enter the path to an image you would like to classify, '
                              'or type \\q to finish entering filepaths: ').strip()
        if next_location == '\\q':
            break
        # Record the path so that it is available to the cells below.
        if next_location:
            file_locations.append(next_location)

for file_path in file_locations:
    if not os.path.isfile(file_path):
        print(f'File not found: {file_path}.')
        print('Please run this cell again.')

### CLIP

*Ensure that you have run through all previous code cells in the "Example 1" section first, as this code makes use of the classifier we initialised in that part of the tutorial.*

For simplicity, the code below does *not* use batching.
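
If you have many images, batching could be reintroduced in the same way as in the earlier cells, by stepping through the file list in strides of the batch size. The sketch below shows only that chunking step. The names `example_paths` and `example_batch_size` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical file list and batch size, for illustration only.
example_paths = [f'image_{i}.jpg' for i in range(7)]
example_batch_size = 3

# Step through the list in strides of the batch size; the final batch may be smaller.
example_batches = [example_paths[start:start + example_batch_size]
                   for start in range(0, len(example_paths), example_batch_size)]

for batch in example_batches:
    print(len(batch), batch)
```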

In [None]:
# Here we specify the label options that the model will choose from.
# Make sure that you update these options to suit your data and experiment
# with different wordings.
candidate_labels = [
    'A photo containing beer',
    'A photo containing wine',
    'A photo containing water',
    'A photo containing coffee',
    'A photo containing tea',
    'A photo containing no drinks',
]

# Put into a dataframe for compatibility with the previous code.
custom_clip_dataframe = pd.DataFrame(file_locations, columns=['image_name'])

# Generate prediction results.
all_results = []
for index, sample in tqdm(custom_clip_dataframe.iterrows(), total=len(custom_clip_dataframe)):
    image = Image.open(sample['image_name'])
    inputs = clip_processor(text=candidate_labels, images=image, return_tensors='pt', padding=True).to(clip_device)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    image_logits = outputs.logits_per_image[0].cpu()
    image_probs = image_logits.softmax(dim=0)
    all_results.append(image_probs)

# Save output to a CSV file.
save_clip_predictions('clip_predictions.csv', custom_clip_dataframe, all_results, candidate_labels)

### LLaVA

*Ensure that you have run through all previous code cells in the "Example 2" section first, as this code makes use of the classifier we initialised in that part of the tutorial.*

For simplicity, the code below does *not* use batching.

In [None]:
# Here we specify the prompt for the model.
# Make sure that you update the prompt to suit your data and experiment with
# different wordings.
prompt = 'What type of beverage does this image contain? Pick the most relevant one from this list: beer, wine, coffee, tea, water, none. Answer in one word.'

# Here we specify how many tokens (words or word parts) can be returned by
# LLaVA. Increase this limit to enable the model to give a more detailed response,
# depending on the prompt.
max_new_tokens = 10

# Put into a dataframe for compatibility with the previous code.
custom_llava_dataframe = pd.DataFrame(file_locations, columns=['image_name'])

# Generate prediction results.
all_results = []
for index, sample in tqdm(custom_llava_dataframe.iterrows(), total=len(custom_llava_dataframe)):
    image = Image.open(sample['image_name'])
    llava_prompt = f'[INST] <image>\n{prompt} [/INST]'
    inputs = llava_processor(text=llava_prompt, images=image, return_tensors='pt').to(llava_device)
    with torch.no_grad():
        outputs = llava_model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_text = llava_processor.decode(outputs[0].cpu(), skip_special_tokens=True)
    response = generated_text.split(f'{prompt} [/INST]')[1].strip().lower()
    all_results.append(response)

# Save output to a CSV file.
save_llava_predictions('llava_predictions.csv', custom_llava_dataframe, all_results)