# Part III – LLM Dialogue-Act Annotation with Qwen (Ollama)

This notebook:
- Loads the mini-corpus of utterances.
- Uses a local LLM (Qwen via Ollama) to annotate dialogue acts.
- Performs iterative prompt refinement (multiple prompt versions).
- Annotates the full dataset with the best prompt.
- Computes inter-annotator agreement (Cohen's κ) between humans and the LLM.

**Note:** This template assumes you have Ollama installed and a `qwen` model available.
You may need to adjust file paths and model names to match your setup.

In [None]:
import pandas as pd
import json
import ollama
from sklearn.metrics import cohen_kappa_score, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.max_colwidth', 200)


## 1. Load data

This cell loads the utterances exported from ELAN.

It assumes the file is a tab-delimited text file with the following columns:

1. tier speaker name
2. participant name
3. start time (string)
4. start time (seconds)
5. end time (string)
6. end time (seconds)
7. duration (string)
8. duration (seconds)
9. utterance text

If you later add human annotation labels (e.g. `human1_label`, `human2_label`), you can
merge them into this DataFrame or load a different file.
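
Because `names=` assigns columns by position, a row with a missing or extra tab shifts every field after it. The sketch below is one possible sanity check, not part of the original pipeline: it counts the tab-separated fields per line using only the standard library. The helper name `find_malformed_rows` and the sample rows are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = 9  # the nine fields listed above


def find_malformed_rows(lines, expected=EXPECTED_COLUMNS):
    """Return (line_number, field_count) for rows whose field count differs from `expected`."""
    # QUOTE_NONE keeps quotation marks inside utterances as ordinary characters.
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    return [(i, len(row)) for i, row in enumerate(reader, start=1) if len(row) != expected]


# Hypothetical sample rows, not taken from the real corpus.
sample = io.StringIO(
    "SpeakerA\tA\t00:00:01.000\t1.0\t00:00:02.500\t2.5\t00:00:01.500\t1.5\thello there\n"
    "SpeakerB\tB\t00:00:02.600\t2.6\t00:00:03.000\t3.0\t00:00:00.400\t0.4\n"
)
malformed = find_malformed_rows(sample)
print(malformed)  # -> [(2, 8)]  (the second row has no utterance field)
```

To run the same check on the real export, pass an open file handle instead of the `StringIO` object, for example `open(utter_path, newline='', encoding='utf-8')`; the encoding is an assumption about how ELAN wrote the file.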

In [None]:
utter_path = 'AttarA2_with_human_annotation.txt'

df = pd.read_csv(
    utter_path,
    sep='\t',
    header=None,
    names=['tier_speaker', 'speaker', 'start_str', 'start_sec', 'end_str', 'end_sec', 'dur_str', 'dur_sec', 'utterance'],
    engine='python'
)

print(df.head())
print('Total utterances:', len(df))

### Optional: Add human label columns

If you already have human annotations in a separate file, you can merge them here.
For now, we create placeholder columns `human1_label` and `human2_label` that you can
fill externally (e.g. in Excel) and re-load later.

In [None]:
if 'human1_label' not in df.columns:
    df['human1_label'] = pd.NA
if 'human2_label' not in df.columns:
    df['human2_label'] = pd.NA

df.head()

## 2. Define dialogue-act labels

These are the labels we will use for annotation, based on your scheme.

In [None]:
DA_LABELS = [
    'STATEMENT',
    'QUESTION',
    'ANSWER',
    'ACKNOWLEDGEMENT/BACKCHANNEL',
    'DIRECTIVE/REQUEST',
    'REPAIR/CLARIFICATION',
    'EXPRESSIVE/EMOTIVE',
    'APOLOGY',
    'GREETING',
    'GOODBYE/CLOSING',
    'OTHER'
]
DA_LABELS

## 3. Helper: call Qwen via Ollama

This function sends a batch of utterances to the model with a given prompt and expects
JSON output of the form:

```json
[{"index": 1, "label": "STATEMENT"}, ...]
```

In [None]:
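def annotate_batch_with_llm(utterances, prompt, temperature=0.3, model_name='qwen'):
    """Send one batch of utterances to the model and return its parsed JSON annotations.

    Utterances are numbered from 1 within the batch, so the returned 'index'
    values are batch-local, not DataFrame row positions.
    """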
    joined = '\n'.join([f"{i+1}. {u}" for i, u in enumerate(utterances)])

    full_prompt = f"""{prompt}

Here are the utterances to annotate:
{joined}

Return ONLY valid JSON as a list of objects:
[
  {{"index": 1, "label": "STATEMENT"}},
  ...
]
"""

    response = ollama.chat(
        model=model_name,
        messages=[{'role': 'user', 'content': full_prompt}],
        options={'temperature': temperature}
    )

    text = response['message']['content'].strip()

    try:
        annotations = json.loads(text)
    except json.JSONDecodeError as e:
        print('Failed to parse JSON from model output:')
        print(text)
        raise e

    return annotations
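
The function above passes the raw reply straight to `json.loads`, so it fails whenever the model wraps the list in a Markdown code fence or adds a sentence before it. The following is a minimal sketch of a more tolerant parser, assuming the reply contains exactly one JSON array. The helper name `extract_json_array` is hypothetical, and the bracket search is a heuristic: introductory text that itself contains `[` would defeat it.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker


def extract_json_array(text):
    """Parse the JSON array in `text`, tolerating code fences and surrounding prose."""
    # Drop fence markers such as an opening fence tagged "json" and the closing fence.
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", text).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the span between the first '[' and the last ']'.
        start, end = cleaned.find("["), cleaned.rfind("]")
        if start == -1 or end <= start:
            raise
        parsed = json.loads(cleaned[start:end + 1])
    if not isinstance(parsed, list):
        raise ValueError("Expected a JSON list of annotation objects")
    return parsed


# Hypothetical model reply with a preamble and a fenced JSON list.
wrapped = (
    "Here are the labels:\n" + FENCE + "json\n"
    + '[{"index": 1, "label": "QUESTION"}]' + "\n" + FENCE
)
parsed_reply = extract_json_array(wrapped)
print(parsed_reply)  # -> [{'index': 1, 'label': 'QUESTION'}]
```

If this approach suits your model's output, the call to `json.loads(text)` in `annotate_batch_with_llm` could be replaced with `extract_json_array(text)`.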

## 4. Prompt Version 1 – initial instructions

This is the base prompt that defines your labels and how the model should respond.
You will refine it over several versions.

In [None]:
prompt_v1 = '''
You are annotating dialogue acts in a conversation.

Use EXACTLY ONE label per utterance, chosen from:
STATEMENT, QUESTION, ANSWER, ACKNOWLEDGEMENT/BACKCHANNEL,
DIRECTIVE/REQUEST, REPAIR/CLARIFICATION,
EXPRESSIVE/EMOTIVE, APOLOGY, GREETING,
GOODBYE/CLOSING, OTHER.

Definitions:
- STATEMENT: provides information, description, or opinion.
- QUESTION: requests information or clarification.
- ANSWER: directly responds to a question or confirmation request.
- ACKNOWLEDGEMENT/BACKCHANNEL: signals attention, understanding, or agreement
  without adding new content (e.g. "yeah", "mm-hmm", "right").
- DIRECTIVE/REQUEST: asks the other speaker to do something or to clarify something.
- REPAIR/CLARIFICATION: corrects or reformulates previous speech, or asks for clarification
  (e.g. "No, I meant last week", "Wait, who was there?").
- EXPRESSIVE/EMOTIVE: conveys emotional stance, evaluation, or affective reaction
  (e.g. "That was amazing!", "Ugh, I hated that part").
- APOLOGY: acknowledges fault or expresses regret.
- GREETING: opens the interaction or marks social connection.
- GOODBYE/CLOSING: signals the end of the interaction or closing the topic.
- OTHER: does not fit the categories above (e.g. laughter, filler noises, unintelligible speech).

Rules:
- Assign EXACTLY ONE label per utterance.
- Only use labels from the set given above.
- If unsure, choose the best fitting category and do NOT invent new labels.
'''
print(prompt_v1)

## 5. Test Prompt V1 on a small subset

We start with 10 utterances to see how well the model follows the instructions.
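
"Following the instructions" can be checked mechanically before reading the labels by eye: every utterance should receive exactly one annotation, and every label should come from the allowed set. The sketch below shows one way to do that. The helper name `check_annotations` and the sample output are hypothetical, and the label set is copied from `DA_LABELS` so the block runs on its own.

```python
ALLOWED = {
    'STATEMENT', 'QUESTION', 'ANSWER', 'ACKNOWLEDGEMENT/BACKCHANNEL',
    'DIRECTIVE/REQUEST', 'REPAIR/CLARIFICATION', 'EXPRESSIVE/EMOTIVE',
    'APOLOGY', 'GREETING', 'GOODBYE/CLOSING', 'OTHER',
}


def check_annotations(annotations, n_expected, allowed=ALLOWED):
    """Return a list of human-readable problems found in one batch of model output."""
    problems = []
    seen = set()
    for ann in annotations:
        idx, label = ann.get('index'), ann.get('label')
        if label not in allowed:
            problems.append(f"index {idx}: label {label!r} is not in the label set")
        seen.add(idx)
    # Indices are 1-based within the batch, matching the numbering in the prompt.
    missing = sorted(set(range(1, n_expected + 1)) - seen)
    if missing:
        problems.append(f"missing indices: {missing}")
    return problems


# Hypothetical model output for a batch of three utterances.
sample_output = [
    {'index': 1, 'label': 'QUESTION'},
    {'index': 2, 'label': 'Backchannel'},  # invented label, not in the set
]
problems = check_annotations(sample_output, n_expected=3)
for p in problems:
    print(p)
```

An empty list means the batch is structurally clean; it says nothing about whether the labels are correct.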

In [None]:
test_df = df.iloc[:10].copy()
test_utterances = test_df['utterance'].tolist()

annotations_v1 = annotate_batch_with_llm(
    utterances=test_utterances,
    prompt=prompt_v1,
    temperature=0.3,
    model_name='qwen'
)

annotations_v1

In [None]:
# Attach V1 annotations to the test DataFrame for inspection
idx_to_label_v1 = {ann['index']: ann['label'] for ann in annotations_v1}

test_df['llm_v1_label'] = [idx_to_label_v1[i+1] for i in range(len(test_df))]
test_df[['utterance', 'human1_label', 'human2_label', 'llm_v1_label']]

## 6. Prompt Refinement Log (fill in as you iterate)

- **Version 1**: Basic definitions of all dialogue-act labels.
  - Problems observed: (e.g., short "yeah" labeled as STATEMENT).
- **Version 2**: ...
- **Version 3**: ...
- **Version 4**: ...
- **Version 5**: ...

You should create new prompt variables `prompt_v2`, `prompt_v3`, etc., and
re-run the same `test_utterances` to compare results.
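
One way to make the comparison less impressionistic is to score each prompt version against a reference labelling of the same test utterances. The sketch below uses raw agreement (the fraction of identical labels), which is simpler than κ and does not correct for chance. It assumes you have a reference column, for example `human1_label`; the function name and all sample labels are hypothetical.

```python
def agreement_rate(reference, predicted):
    """Fraction of positions where the two label lists agree."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("Label lists must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


# Hypothetical reference labels and model outputs for four test utterances.
reference = ['QUESTION', 'ANSWER', 'ACKNOWLEDGEMENT/BACKCHANNEL', 'STATEMENT']
versions = {
    'v1': ['QUESTION', 'STATEMENT', 'STATEMENT', 'STATEMENT'],
    'v2': ['QUESTION', 'STATEMENT', 'ACKNOWLEDGEMENT/BACKCHANNEL', 'STATEMENT'],
}

scores = {name: agreement_rate(reference, labels) for name, labels in versions.items()}
print(scores)  # -> {'v1': 0.5, 'v2': 0.75}
```

With only 10 test utterances a single label flips the score by 0.1, so differences between versions are best read alongside the individual disagreements rather than on their own.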

### Example: Prompt Version 2

Below is an example refinement that you can further adjust based on the errors you see.

In [None]:
prompt_v2 = prompt_v1 + '''

Additional rules:
- If an utterance consists only of very short tokens like "yeah", "mm-hmm", "uh-huh", "right",
  and does not add new information, ALWAYS label it as ACKNOWLEDGEMENT/BACKCHANNEL.
- If an utterance clearly asks for information or ends with a question mark,
  label it as QUESTION, not STATEMENT.
'''
print(prompt_v2)

In [None]:
annotations_v2 = annotate_batch_with_llm(
    utterances=test_utterances,
    prompt=prompt_v2,
    temperature=0.2,
    model_name='qwen'
)

idx_to_label_v2 = {ann['index']: ann['label'] for ann in annotations_v2}
test_df['llm_v2_label'] = [idx_to_label_v2[i+1] for i in range(len(test_df))]

test_df[['utterance', 'human1_label', 'human2_label', 'llm_v1_label', 'llm_v2_label']]

## 7. Choose a final prompt

After trying at least 5 prompt versions, assign your best-performing one to `final_prompt`.

In [None]:
# TODO: after refining up to prompt_v5 (for example), set your final prompt here
final_prompt = prompt_v2  # replace with your best prompt, e.g., prompt_v5
print(final_prompt[:500])

## 8. Annotate the full dataset with the final prompt

This function runs the LLM over the entire dataset in batches and stores
the predicted label in a new column `llm_label`.
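
The indices the model returns are 1-based and local to each batch, so they have to be mapped back to row order before the labels are stored. A direct dictionary lookup raises `KeyError` if the model skips an utterance or returns the index as a string. The sketch below is a more forgiving alternative under those assumptions; the helper name `labels_in_order` and the sample output are hypothetical.

```python
def labels_in_order(annotations, batch_len, missing=None):
    """Return a list of `batch_len` labels ordered by the 1-based batch-local index."""
    labels = [missing] * batch_len
    for ann in annotations:
        try:
            pos = int(ann['index']) - 1  # accept 2 as well as "2"
        except (KeyError, TypeError, ValueError):
            continue  # skip entries without a usable index
        if 0 <= pos < batch_len:
            labels[pos] = ann.get('label', missing)
    return labels


# Hypothetical output: out of order, one string index, one out-of-range index,
# and no entry at all for utterance 3.
sample_output = [
    {'index': '2', 'label': 'ANSWER'},
    {'index': 1, 'label': 'QUESTION'},
    {'index': 7, 'label': 'OTHER'},
]
ordered = labels_in_order(sample_output, batch_len=3)
print(ordered)  # -> ['QUESTION', 'ANSWER', None]
```

Missing positions stay `None`, so they are treated as missing values by the `notna()` filter used in the agreement section instead of stopping the whole run.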

In [None]:
def annotate_full_df_with_llm(df, prompt, batch_size=20, temperature=0.2, model_name='qwen'):
    """Annotate df['utterance'] in batches and return one label per row, in row order."""
    llm_labels = []
    n = len(df)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch_utts = df['utterance'].iloc[start:end].tolist()
        annotations = annotate_batch_with_llm(
            utterances=batch_utts,
            prompt=prompt,
            temperature=temperature,
            model_name=model_name
        )
        idx_to_label = {ann['index']: ann['label'] for ann in annotations}
        batch_labels = [idx_to_label[i+1] for i in range(len(batch_utts))]
        llm_labels.extend(batch_labels)
        print(f'Annotated {end}/{n} utterances')

    return llm_labels

df['llm_label'] = annotate_full_df_with_llm(
    df=df,
    prompt=final_prompt,
    batch_size=20,
    temperature=0.2,
    model_name='qwen'
)

df.head()

## 9. Save LLM-annotated data

Save the DataFrame with the new `llm_label` column for later analysis.

In [None]:
out_path = 'AttarA2_with_llm_labels.csv'
df.to_csv(out_path, index=False)
print('Saved:', out_path)

## 10. Compute Cohen's κ

This section computes agreement between:
- Human 1 vs Human 2
- Human 1 vs LLM
- Human 2 vs LLM

Make sure you have filled in `human1_label` and `human2_label` before running this.
Rows missing any of the three labels are excluded; if the human columns are still empty, no κ is computed.
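
For reference, unweighted Cohen's κ is (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The pure-Python sketch below spells that out on a toy example; it is meant to illustrate the formula, not to replace `cohen_kappa_score`. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if not a or len(a) != len(b):
        raise ValueError("Label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two annotators' marginal proportions, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1:
        return float('nan')  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, not from the corpus: the annotators agree on 3 of 4 items.
ann_a = ['QUESTION', 'STATEMENT', 'STATEMENT', 'ANSWER']
ann_b = ['QUESTION', 'STATEMENT', 'ANSWER', 'ANSWER']
kappa = cohen_kappa(ann_a, ann_b)
print(round(kappa, 4))  # -> 0.6364
```

Here p_o = 0.75 and p_e = 5/16, so κ = 7/11, noticeably lower than the raw agreement of 0.75.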

In [None]:
mask = df['human1_label'].notna() & df['human2_label'].notna() & df['llm_label'].notna()
eval_df = df[mask].copy()

if len(eval_df) == 0:
    print('No rows with all three labels present yet. Fill human1_label and human2_label first.')
else:
    h1 = eval_df['human1_label']
    h2 = eval_df['human2_label']
    llm_labs = eval_df['llm_label']

    kappa_h1_h2 = cohen_kappa_score(h1, h2)
    kappa_h1_llm = cohen_kappa_score(h1, llm_labs)
    kappa_h2_llm = cohen_kappa_score(h2, llm_labs)

    print("Cohen's κ (Human1 vs Human2):", kappa_h1_h2)
    print("Cohen's κ (Human1 vs LLM):   ", kappa_h1_llm)
    print("Cohen's κ (Human2 vs LLM):   ", kappa_h2_llm)

## 11. Confusion matrices

If you have enough labeled data, you can visualise where the LLM disagrees
with human annotators using confusion matrices.
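
When there are too few labelled rows for a plot to be readable, a plain tally of the off-diagonal cells gives the same information in text form. The sketch below counts (human label, LLM label) pairs that differ and lists the most frequent ones. The function name `top_disagreements` and the sample labels are hypothetical.

```python
from collections import Counter


def top_disagreements(y_true, y_pred, k=5):
    """Return the k most frequent (true, predicted) pairs where the labels differ."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)


# Hypothetical labels for four utterances.
human = ['ANSWER', 'ANSWER', 'QUESTION', 'STATEMENT']
llm = ['STATEMENT', 'STATEMENT', 'QUESTION', 'ANSWER']

confusions = top_disagreements(human, llm)
for (true_label, pred_label), count in confusions:
    print(f"{true_label} -> {pred_label}: {count}")
```

Pairs that recur are natural candidates for an extra rule in the next prompt version.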

In [None]:
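def plot_confusion_matrix(y_true, y_pred, labels, title):
    """Plot a confusion matrix with rows = y_true and columns = y_pred."""
    # Rows whose true or predicted label is not in `labels` are not counted.
    cm = confusion_matrix(y_true, y_pred, labels=labels)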
    fig, ax = plt.subplots(figsize=(8, 8))
    im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    ax.figure.colorbar(im, ax=ax)
    ax.set(
        xticks=np.arange(cm.shape[1]),
        yticks=np.arange(cm.shape[0]),
        xticklabels=labels,
        yticklabels=labels,
        ylabel='True label',
        xlabel='Predicted label',
        title=title
    )
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor')

    thresh = cm.max() / 2.0
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j],
                    ha='center', va='center',
                    color='white' if cm[i, j] > thresh else 'black')

    fig.tight_layout()
    plt.show()

if len(eval_df) > 0:
    plot_confusion_matrix(h1, h2, DA_LABELS, 'Human1 vs Human2')
    plot_confusion_matrix(h1, llm_labs, DA_LABELS, 'Human1 vs LLM')
    plot_confusion_matrix(h2, llm_labs, DA_LABELS, 'Human2 vs LLM')
else:
    print('Not enough labeled data to plot confusion matrices yet.')