<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2023_07_31_GPT_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT Evaluation Notebook [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/badge/latestdoi/660157642)
![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is a part of the social-media-lab.net project, which is a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **GPT Evaluation** notebook calculates inter-rater agreement scores, such as Cohen's Kappa, between human annotations and annotations produced by GPT (or another LLM).

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2023). michaelachmann/social-media-lab: DOI Release (v0.0.1). Zenodo. https://doi.org/10.5281/zenodo.8199902
```

In [2]:
# @title Read the LabelStudio Annotations from file
# @markdown  Enter the location (e.g. on Google Drive) of your exported `json` file and run this cell. It loads the annotations in long format into the variable `all_annotations_df`.


human_annotations_json = '/content/drive/MyDrive/2023-06-28-GPT-Caption-Human-Labels.json' # @param {type:"string"}

import json

# Parse the export directly from the file; UTF-8 keeps umlauts in the captions intact.
with open(human_annotations_json, 'r', encoding='utf-8') as f:
  exported_annotations = json.load(f)

import pandas as pd
from tqdm.notebook import tqdm

def process_result(result, coder, md):
    value_type = result.get('type', "")
    metadata = {
        **md,
        "coder": coder,
        "from_name": result.get('from_name', "")
    }
    annotations = []
    if value_type == "choices":
        choices = result['value'].get('choices', [])
        for choice in choices:
            r = {**metadata}
            r['value'] = choice
            annotations.append(r)
    elif value_type == "taxonomy":
        taxonomies = result['value'].get('taxonomy', [])
        for taxonomy in taxonomies:
            if len(taxonomy) > 1:
                taxonomy = " > ".join(taxonomy)
            elif len(taxonomy) == 1:
                taxonomy = taxonomy[0]
            r = {**metadata}
            r['value'] = taxonomy
            annotations.append(r)
    return annotations

all_annotations = []

for data in tqdm(exported_annotations):
    annotations = data.get("annotations")
    metadata = {
        **data.get("data")
    }

    for annotation in annotations:
      coder = annotation['completed_by']['id']
      results = annotation.get("result")
      if results:
        for result in results:
          all_annotations.extend(process_result(result, coder, metadata))
      else:
        print("Skipped Missing Result")


all_annotations_df = pd.DataFrame(all_annotations)
shortcode_values = all_annotations_df['shortcode'].unique()


In [3]:
# Check the dataframe

all_annotations_df.head()

|   | caption | filename | username | shortcode | coder | from_name | value |
|---|---------|----------|----------|-----------|-------|-----------|-------|
| 0 | ich möchte eine klimaneutrale wirtschaft mit g... | ./olafscholz/2021-09-22_12-20-14_UTC.jpg | olafscholz | CUH6kOnqm7d | 10389 | content | Arbeit |
| 1 | ich möchte eine klimaneutrale wirtschaft mit g... | ./olafscholz/2021-09-22_12-20-14_UTC.jpg | olafscholz | CUH6kOnqm7d | 10389 | content | Umwelt |
| 2 | ich möchte eine klimaneutrale wirtschaft mit g... | ./olafscholz/2021-09-22_12-20-14_UTC.jpg | olafscholz | CUH6kOnqm7d | 10389 | content | Wirtschaft und Finanzen |
| 3 | ich möchte eine klimaneutrale wirtschaft mit g... | ./olafscholz/2021-09-22_12-20-14_UTC.jpg | olafscholz | CUH6kOnqm7d | 10389 | call | Nein |
| 4 | ich möchte eine klimaneutrale wirtschaft mit g... | ./olafscholz/2021-09-22_12-20-14_UTC.jpg | olafscholz | CUH6kOnqm7d | 10387 | content | Arbeit |
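
The long format shown above holds one row per post, coder, and selected label. As a minimal sketch of how a nested export is flattened into that shape, the snippet below uses a hypothetical, heavily shortened export. The shortcode, coder ids, and labels are invented for illustration; only the field names (`data`, `annotations`, `completed_by`, `result`, `from_name`, `value`, `choices`) mirror the ones accessed by the loading cell.

```python
import pandas as pd

# Hypothetical, heavily shortened export: one task annotated by two coders.
export = [{
    "data": {"shortcode": "abc123"},
    "annotations": [
        {"completed_by": {"id": 1},
         "result": [{"type": "choices", "from_name": "content",
                     "value": {"choices": ["Arbeit", "Umwelt"]}}]},
        {"completed_by": {"id": 2},
         "result": [{"type": "choices", "from_name": "content",
                     "value": {"choices": ["Umwelt"]}}]},
    ],
}]

# One row per (task, coder, selected choice); task metadata is repeated on each row.
rows = [
    {**task["data"],
     "coder": annotation["completed_by"]["id"],
     "from_name": result["from_name"],
     "value": choice}
    for task in export
    for annotation in task["annotations"]
    for result in annotation["result"]
    for choice in result["value"]["choices"]
]

long_df = pd.DataFrame(rows)
print(long_df)
```

Coder 1 contributes two rows and coder 2 one row, which is why the same caption appears several times in `all_annotations_df`.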


## Contingency Table & Majority Decision

One way to assess the quality of machine-labelled data is to compare the machine-generated labels with human-generated labels, which serve as the "gold standard". Measuring how well the two match is known as "label agreement" or "inter-rater agreement", and it is widely used in natural language processing, machine learning, and computational social science.
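
To make the idea of agreement concrete, here is a minimal pure-Python sketch of Cohen's Kappa for two raters. It corrects the observed agreement `p_o` for the agreement `p_e` expected by chance: `kappa = (p_o - p_e) / (1 - p_e)`. The helper `cohen_kappa` and the toy labels are hypothetical and only illustrate the formula; it is undefined when `p_e` equals 1 (both raters always use the same single label).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of both raters' label shares.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: True means the category was assigned to the post.
human = [True, True, False, False, True, False, False, False]
machine = [True, False, False, False, True, False, True, False]

kappa = cohen_kappa(human, machine)
print(round(kappa, 3))  # 0.467
```

Here the raters agree on 6 of 8 posts (`p_o = 0.75`), but about 0.53 of that agreement would be expected by chance, so Kappa is considerably lower than the raw agreement.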

We create a `majority_decision` column from the human annotations. We chose an odd number of annotators so that a majority exists for each label. First we build a contingency table (or matrix) of posts by coders, then we derive the majority decision from it.
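
The two steps can be sketched on toy data as follows. This is a simplified variant that counts ticks per post, not the exact implementation used in the notebook cell; the shortcodes `A` to `D`, the coder ids, and the single label `Umwelt` are hypothetical.

```python
import pandas as pd

# Hypothetical long-format annotations: one row per (post, coder) that ticked
# the label "Umwelt". A missing row means the coder did not tick the label.
annotations = pd.DataFrame({
    "shortcode": ["A", "A", "A", "B", "C", "C"],
    "coder": [1, 2, 3, 1, 2, 3],
    "value": ["Umwelt"] * 6,
})
all_shortcodes = ["A", "B", "C", "D"]

# Contingency table: posts as rows, coders as columns, True if the label was ticked.
table = pd.crosstab(annotations["shortcode"], annotations["coder"]).astype(bool)
# Add posts for which nobody ticked the label (here: "D").
table = table.reindex(all_shortcodes, fill_value=False)

# Majority decision: more than half of the coders ticked the label.
n_coders = table.shape[1]
table["majority_decision"] = table.sum(axis=1) > n_coders / 2
print(table)
```

With three coders, post `A` (three ticks) and post `C` (two ticks) receive the label, while `B` (one tick) and `D` (no ticks) do not.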

The next code blocks assume a multiple-choice (checkbox) variable, where a coder can assign several labels to the same post.

**Select one of the two following code blocks!** The first reads one file, the second reads multiple files following the same file-name patterns.

In [15]:
# @title Read a single annotated file
# @markdown Reading the GPT annotations. Enter the correct file path below

import pandas as pd
import numpy as np

gpt_path = '/content/drive/MyDrive/2023-07-31-GPT-Zero-Shot-Captions-V2-Außenpolitik.csv' # @param {type:"string"}
gpt_column = 'topic' # @param {type:"string"}
gpt_value = 'Außenpolitik' # @param {type:"string"}

gpt_values = [gpt_value]

gpt_tables = {
    gpt_value: pd.read_csv(gpt_path)
}

In [30]:
# @title Read multiple annotated files
# @markdown Reading the GPT annotations. Enter the file path pattern below; `{value}` is replaced by each entry of `gpt_values`.

import pandas as pd
import numpy as np

gpt_path = '/content/drive/MyDrive/2023-07-31-GPT-Zero-Shot-Captions-V2-{value}.csv' # @param {type:"string"}
gpt_column = 'topic' # @param {type:"string"}
gpt_values = ['Bildung und Ausbildung', 'Digitalisierung und Infrastruktur'] # @param {type:"raw"}

gpt_tables = {}
for value in gpt_values:
  gpt_tables[value] = pd.read_csv(gpt_path.replace("{value}", value))

In [31]:
# @markdown  Enter the `from_name` for the variable you're interested in at the moment. (Refer to your LabelStudio Interface for the right `from_name`).

import pandas as pd
import numpy as np

from_name = 'content' # @param {type:"string"}
# Keep only the human annotations for the selected variable
filtered_df = all_annotations_df[all_annotations_df['from_name'].str.contains(from_name, case=False)]


values = filtered_df['value'].unique()

tables = {}
for value in gpt_values:
    tmp = filtered_df[filtered_df['value'] == value]
    contingency_matrix = pd.crosstab(tmp['shortcode'], tmp['coder'], values=tmp['value'], aggfunc='first')
    contingency_matrix = contingency_matrix.reindex(shortcode_values)

    # Row-wise mode over the coder columns. NaN (label not selected) is ignored,
    # so the mode equals `value` whenever at least one coder selected it.
    decisions = contingency_matrix.mode(axis=1)
    majority_decisions = decisions.iloc[:, 0]

    # Keep the decision only if at most one coder did not select the label.
    # Note: this threshold corresponds to a majority among three coders.
    row_nan_counts = contingency_matrix.isnull().sum(axis=1)
    majority_decisions[row_nan_counts > 1] = np.nan

    contingency_matrix['majority_decision'] = majority_decisions
    contingency_matrix = contingency_matrix.fillna(False)
    # Convert the majority decision to a boolean: True if the label was assigned
    contingency_matrix['majority_decision'] = contingency_matrix['majority_decision'] == value
    gpt_df = gpt_tables[value]
    tables[value] = contingency_matrix.merge(gpt_df[['shortcode', 'gpt']], on='shortcode', how='left')


In [33]:
# @title Calculate Cohen's Kappa between Majority and GPT
# @markdown  Running this cell calculates Cohen's Kappa between the `majority_decision` and `gpt` columns. It loops over every category in `gpt_values`, so it works for a single *topic* as well as for several.

output = 'Formatted' # @param ["Formatted", "Markdown"]


from tabulate import tabulate
import itertools
from IPython.display import Markdown, display

import sklearn.metrics as metrics

table_rows = []
for value in tables.keys():
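    if value not in gpt_values:
        continue
    contingency_matrix = tables[value]
    try:
        # Copy the two rater columns so the mapping below does not modify `tables`
        contingency_matrix = contingency_matrix.loc[:, ['majority_decision', 'gpt']].copy()
    except KeyError:
        # Skip tables that lack one of the two columns
        continue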

    string_mapping = {}

    for column in contingency_matrix.columns:
        unique_strings = contingency_matrix[column].unique()
        for string in unique_strings:
            if string not in string_mapping:
                string_mapping[string] = len(string_mapping)

    for column in contingency_matrix.columns:
        contingency_matrix[column] = contingency_matrix[column].map(string_mapping)

    # Treat each column as one rater
    raters = contingency_matrix.columns.tolist()

    # Calculate pairwise Cohen's kappa
    combinations = list(itertools.combinations(raters, 2))

    vals = []

    for rater1, rater2 in combinations:
        filtered_matrix = contingency_matrix[[rater1, rater2]].dropna()
        kappa = metrics.cohen_kappa_score(filtered_matrix[rater1], filtered_matrix[rater2])
        vals.append(kappa)
        table_rows.append([value, kappa])

headers = ["Category", "Cohen's Kappa"]

# Generate the table in Markdown format
table = tabulate(table_rows, headers, tablefmt="github")

display(Markdown("### List of Kappas:"))

if output == "Formatted":
  display(Markdown(table))

else:
  print(table)


### List of Kappas:

| Category                          |   Cohen's Kappa |
|-----------------------------------|-----------------|
| Bildung und Ausbildung            |        0.792588 |
| Digitalisierung und Infrastruktur |        0.697924 |