<a href="https://colab.research.google.com/github/ninenine-9/legaldetainment/blob/main/binRemovalAttempt1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

💥 REMOVING UNCLASSIFIED_OTHER 💥

# 0️⃣ Schema

In [1]:
from pydantic import BaseModel
from typing import Optional, Literal

class CodedResponse(BaseModel):
    GuiltMoreLikely: Literal["Yes", "No"]
    GuiltLessLikely: Literal["Yes", "No"]
    NoInformation_Evidence: Literal["Yes", "No"]
    InnocentUntilProvenGuilty: Literal["Yes", "No"]
    Confound: Literal["Yes", "No"]
    # Unclassified_Other: Literal["Yes", "No"]

binary_columns = [
        "GuiltMoreLikely",
        "GuiltLessLikely",
        "NoInformation_Evidence",
        "InnocentUntilProvenGuilty",
        "Confound"
        # "Unclassified_Other"
    ]

binary_labels = ["Yes", "No"]

# Define which codes are in the mutually exclusive cluster
cluster_codes = ["GuiltMoreLikely", "GuiltLessLikely", "NoInformation_Evidence"]
independent_codes = ["InnocentUntilProvenGuilty", "Confound"]
all_codes = cluster_codes + independent_codes # + ["Unclassified_Other"]

print("✔️ Schema, Columns & Labels done")

✔️ Schema, Columns & Labels done


# 1️⃣ Imports

In [2]:
import pandas as pd
import numpy as np
import torch
import time
from google.colab import drive
from pydantic import BaseModel
from typing import Optional, Literal
import os
from transformers import pipeline
from sklearn.metrics import precision_score, recall_score, f1_score # 🆕 NEW

print("✔️ Imports done")

✔️ Imports done


# 2️⃣ Data loading

## Retrieval

In [3]:
%cd /content

!git clone https://github.com/ninenine-9/legaldetainment.git

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!cp -r /content/legaldetainment /content/drive/MyDrive/

print("✔️ Git Repository successfully cloned")

/content
Cloning into 'legaldetainment'...
remote: Enumerating objects: 61, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 61 (delta 21), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (61/61), 479.87 KiB | 11.42 MiB/s, done.
Resolving deltas: 100% (21/21), done.
Mounted at /content/drive
✔️ Git Repository successfully cloned


## Uploading
`/content/legaldetainment/DATA/legaldetainment_humandata.xlsx` contains our human-coded data. Below is an explanation of the spreadsheet's various sheets:
- `data` = the full dataset (minus `removed`)
- `removed` = participants I've removed because of inconclusive coding
- `ALL(10)` = a sample of 10 participants
- `30participants` = a sample of 30 participants

`/content/legaldetainment/DATA/legaldetainment_blankdata.xlsx` contains our blank data. It contains Pno, Qual and the relevant column headers for the codes (but the content of these columns is blank). Below is an explanation of the spreadsheet's various sheets:
- `ALL` = the full dataset (minus `removed`)
- `ALL(10)` = a sample of 10 participants
- `ALL(20)` = a sample of 20 participants
- `30participants` = a sample of 30 participants
- etc.

⚠️ YOU SHOULD ALWAYS HAVE MATCHING SHEETS BETWEEN THE BLANK DATASET AND THE HUMAN-CODED DATA!
For example, if you create a new tab/sheet for testing 40 participants, both the blank dataset and the human-coded data must contain that sheet, with the same participants (Pno) in each.


---


_example_

`legaldetainment_347p_ALL.xlsx, sheet = ALL(3)`
| Pno | Qual |
|:-----------|:------------:|
| 200 | Lorem ipsum  |
| 35 | dolor sit amet  |
| 298 | Duis aute irure  |

`legaldetainment_blankdata.xlsx, sheet = ALL(3)`
| Pno | Qual |
|:-----------|:------------:|
| 200 | Lorem ipsum  |
| 35 | dolor sit amet  |
| 298 | Duis aute irure  |

`Same Pnos in both datasets ✅`


❔ Still unsure? The last block of this section can check this for you.


### Materials

In [4]:
with open("/content/legaldetainment/INPUTS/legaldetainment_story.md", "r", encoding="utf-8") as f:
    study_context = f.read()


with open("/content/legaldetainment/INPUTS/legaldetainment_codinginstructions.md", "r", encoding="utf-8") as f:
    coding_manual = f.read()

### Blank dataset

In [5]:
data = pd.read_excel("/content/legaldetainment/DATA/legaldetainment_blankdata.xlsx", sheet_name = '30participants') # 📋 for cell H & G

data.columns = data.columns.str.strip()  # strip header whitespace before using column names
data = data.dropna(subset=["Pno"])

data = data[["Pno", "Qual", "GuiltMoreLikely",
        "GuiltLessLikely",
        "NoInformation_Evidence",
        "InnocentUntilProvenGuilty",
        "Confound"]]

print(f"Dataset size: {data.shape}")


Dataset size: (30, 7)


### Human-coded (comparison) dataset

In [6]:
humandata = pd.read_excel("/content/legaldetainment/DATA/legaldetainment_humandata.xlsx", sheet_name = '30participants') # 📋 for cell I

# Strip header whitespace first so the column names below match exactly
humandata.columns = humandata.columns.str.strip()
humandata = humandata.dropna(subset=["Pno"])
humandata = humandata.rename(columns={
    "NoInformation/Evidence": "NoInformation_Evidence",
    "Unclassified/Other": "Unclassified_Other"
})

humandata = humandata[["Pno", "Qual", "GuiltMoreLikely",
        "GuiltLessLikely",
        "NoInformation_Evidence",
        "InnocentUntilProvenGuilty",
        "Confound"]].copy()

# Recode 0/1 as "No"/"Yes" in the code columns only, so a Pno of 0 or 1 is left untouched
humandata[binary_columns] = humandata[binary_columns].replace({0: "No", 1: "Yes", "0": "No", "1": "Yes"})

print("✔️ Data loading done")

✔️ Data loading done
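
The matching-sheets rule described under "Uploading" can be checked programmatically. The block below is a minimal sketch and not part of the original pipeline: `compare_pnos` is a hypothetical helper name, and it assumes the `Pno` values in both sheets have the same type (for example, all integers).

```python
def compare_pnos(blank_pnos, human_pnos):
    """Compare the participant numbers (Pno) of two sheets.

    Hypothetical helper: returns the Pnos that appear in only one of the
    two datasets, so two empty lists mean the sheets match.
    """
    blank_set = set(blank_pnos)
    human_set = set(human_pnos)
    return {
        "only_in_blank": sorted(blank_set - human_set),
        "only_in_human": sorted(human_set - blank_set),
    }


# Toy example mirroring the ALL(3) sheets shown earlier
report = compare_pnos([200, 35, 298], [200, 35, 298])
print(report)  # both lists are empty when the Pnos match

# In this notebook it could be called as (assumes `data` and `humandata` are loaded):
# report = compare_pnos(data["Pno"], humandata["Pno"])
```

Any Pno listed under `only_in_blank` or `only_in_human` would be dropped silently by the inner merge in the evaluation step, so it is worth resolving before running the model.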


# 3️⃣ Model loading

In [7]:
device = 0 if torch.cuda.is_available() else -1

CONFIDENCE_THRESHOLD = 0.1 # ✏️ Higher = more conservative (more blanks). Lower = more guesses


# ℹ️ ✏️ model dependent (see line 20-21, must match)
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=device)
print("✔️ Model loading done")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


✔️ Model loading done


# 4️⃣ Model running

v2 problem

The threshold logic works for the other columns, but for Unclassified_Other the post-processing (fallback) rule dominates, which is why changing the threshold does not change the final counts for that code.

For threshold changes to affect Unclassified_Other, that exclusivity/post-processing rule would have to be modified or removed.
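
The effect can be reproduced without the model. The sketch below is a toy version of the decision rule used in the fallback cells further down (threshold first, then the fallback). `decide` and `OTHER_CODES` are hypothetical names, and the sketch assumes every code receives the same `Yes` score for a given response, which appears to be the case when the classifier call never mentions the code being scored.

```python
OTHER_CODES = ["GuiltMoreLikely", "GuiltLessLikely", "NoInformation_Evidence",
               "InnocentUntilProvenGuilty", "Confound"]


def decide(score_yes, threshold):
    """Toy decision rule: threshold each code, then apply the fallback.

    Assumption: all codes share the same Yes score for a given response.
    Cluster exclusivity is left out because it never changes Unclassified_Other.
    """
    labels = {c: "Yes" if score_yes >= threshold else "No" for c in OTHER_CODES}
    labels["Unclassified_Other"] = "Yes" if score_yes >= threshold else "No"
    # Fallback: if no other code fired, force Unclassified_Other to "Yes"
    if all(labels[c] == "No" for c in OTHER_CODES):
        labels["Unclassified_Other"] = "Yes"
    return labels


# Whatever the threshold, Unclassified_Other ends up "Yes"
for threshold in (0.05, 0.1, 0.5, 0.9):
    for score in (0.06, 0.24, 0.47):
        assert decide(score, threshold)["Unclassified_Other"] == "Yes"
```

Above the threshold the code is set to "Yes" directly; below it, every other code is "No" and the fallback sets it to "Yes". Under this assumption the count therefore equals the number of responses regardless of `CONFIDENCE_THRESHOLD`.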

In [8]:
results = []
start_time = time.time()

# ➋ Loop through participant responses
for i, row in data.iterrows():
    text = row["Qual"]
    result_row = row.to_dict()

    # Handle empty or missing text
    if not isinstance(text, str) or not text.strip():
        for col in binary_columns:
            result_row[col] = "No"
            result_row[f"{col}_score_Yes"] = 0.0
            result_row[f"{col}_score_No"] = 0.0
        results.append(result_row)
        continue

    scores_record = {}

    # Classify all columns except Unclassified_Other
    for col in cluster_codes + independent_codes:
        response = classifier(text, candidate_labels=["Yes", "No"])
        if isinstance(response, list):
            response = response[0]

        score_yes = response["scores"][response["labels"].index("Yes")]
        score_no  = response["scores"][response["labels"].index("No")]

        # Threshold-based assignment
        result_row[col] = "Yes" if score_yes >= CONFIDENCE_THRESHOLD else "No"

        # Store scores
        result_row[f"{col}_score_Yes"] = score_yes
        result_row[f"{col}_score_No"] = score_no
        scores_record[col] = score_yes

    # Enforce exclusivity within cluster codes
    yes_cluster = [c for c in cluster_codes if result_row[c] == "Yes"]
    if len(yes_cluster) > 1:
        best = max(yes_cluster, key=lambda c: scores_record[c])
        for c in cluster_codes:
            result_row[c] = "Yes" if c == best else "No"

    results.append(result_row)

# Summarise the Yes scores and counts for every code (not just the last loop variable)
for col in all_codes:
    scores = [r[f"{col}_score_Yes"] for r in results]
    print(min(scores), max(scores), np.mean(scores))
    sum_yes = sum(r[col] == "Yes" for r in results)
    print(f"{col}: {sum_yes} Yes at threshold {CONFIDENCE_THRESHOLD}")


# ➎ Timer end
end_time = time.time()
elapsed = end_time - start_time
avg_time = elapsed / len(data)
print(f"\n✅ Classification completed in {elapsed:.2f} seconds.")
print(f"Average time per entry: {avg_time:.3f} seconds.")

print("✔️ Running completed")


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipython-input-3109537771.py", line 22, in <cell line: 0>
    response = classifier(text, candidate_labels=["Yes", "No"])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/pipelines/zero_shot_classification.py", line 209, in __call__
    return super().__call__(sequences, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/pipelines/base.py", line 1459, in __call__
    return next(
           ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/pipelines/pt_utils.py", line 126, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/pipelines/pt_utils

TypeError: object of type 'NoneType' has no len()

Option 1 – Keep Unclassified_Other as a fallback but respect threshold

In [None]:
results = []
start_time = time.time()

# ➋ Loop through participant responses
for i, row in data.iterrows():
    text = row["Qual"]
    result_row = row.to_dict()

    # Handle empty or missing text
    if not isinstance(text, str) or not text.strip():
        for col in binary_columns:
            result_row[col] = "No"
            result_row[f"{col}_score_Yes"] = 0.0
            result_row[f"{col}_score_No"] = 0.0
        results.append(result_row)
        continue

    # Store Yes scores for exclusivity logic
    scores_record = {}

    # Classify all columns except Unclassified_Other first
    for col in binary_columns:
        if col == "Unclassified_Other":
            continue  # handle separately
        response = classifier(text, candidate_labels=["Yes", "No"])
        if isinstance(response, list):
            response = response[0]
        best_label = response["labels"][0]
        best_score = response["scores"][0]

        # Assign label based on threshold
        result_row[col] = best_label if best_score >= CONFIDENCE_THRESHOLD else "No"

        # Store individual scores
        result_row[f"{col}_score_Yes"] = response["scores"][response["labels"].index("Yes")]
        result_row[f"{col}_score_No"] = response["scores"][response["labels"].index("No")]
        scores_record[col] = result_row[f"{col}_score_Yes"]

    # Enforce exclusivity within cluster_codes
    yes_cluster = [c for c in cluster_codes if result_row[c] == "Yes"]
    if len(yes_cluster) > 1:
        # Keep only the one with highest confidence
        best = max(yes_cluster, key=lambda c: scores_record[c])
        for c in cluster_codes:
            result_row[c] = "Yes" if c == best else "No"

    # Classify Unclassified_Other with threshold
    col = "Unclassified_Other"
    response = classifier(text, candidate_labels=["Yes", "No"])
    if isinstance(response, list):
        response = response[0]
    score_yes = response["scores"][response["labels"].index("Yes")]
    score_no = response["scores"][response["labels"].index("No")]
    result_row[col] = "Yes" if score_yes >= CONFIDENCE_THRESHOLD else "No"
    result_row[f"{col}_score_Yes"] = score_yes
    result_row[f"{col}_score_No"] = score_no

    # Final fallback rule for Unclassified_Other
    if all(result_row[c] == "No" for c in cluster_codes + independent_codes + [col]):
        result_row[col] = "Yes"

    results.append(result_row)

scores = [r[f"{col}_score_Yes"] for r in results]
print(min(scores), max(scores), np.mean(scores))
sum_yes = sum(row[col] == "Yes" for row in results)
print(f"{col}: {sum_yes} Yes at threshold {CONFIDENCE_THRESHOLD}")


# ➎ Timer end
end_time = time.time()
elapsed = end_time - start_time
avg_time = elapsed / len(data)
print(f"\n✅ Classification completed in {elapsed:.2f} seconds.")
print(f"Average time per entry: {avg_time:.3f} seconds.")

print("✔️ Running completed")


0.0648532509803772 0.4666767120361328 0.24034789924820263
Unclassified_Other: 30 Yes at threshold 0.1

✅ Classification completed in 319.06 seconds.
Average time per entry: 10.635 seconds.
✔️ Running completed


Classifying cluster and independent codes first

In [None]:

results = []
start_time = time.time()

for i, row in data.iterrows():
    text = row["Qual"]
    result_row = row.to_dict()

    # Handle empty or missing text
    if not isinstance(text, str) or not text.strip():
        for col in binary_columns:
            result_row[col] = "No"
            result_row[f"{col}_score_Yes"] = 0.0
            result_row[f"{col}_score_No"] = 0.0
        results.append(result_row)
        continue

    scores_record = {}

    # 1️⃣ Classify cluster + independent codes first
    for col in cluster_codes + independent_codes:
        response = classifier(text, candidate_labels=["Yes", "No"])
        if isinstance(response, list):
            response = response[0]

        score_yes = response["scores"][response["labels"].index("Yes")]
        score_no  = response["scores"][response["labels"].index("No")]

        # Threshold-based assignment
        result_row[col] = "Yes" if score_yes >= CONFIDENCE_THRESHOLD else "No"

        # Store scores
        result_row[f"{col}_score_Yes"] = score_yes
        result_row[f"{col}_score_No"] = score_no
        scores_record[col] = score_yes

    # 2️⃣ Enforce exclusivity within cluster codes
    yes_cluster = [c for c in cluster_codes if result_row[c] == "Yes"]
    if len(yes_cluster) > 1:
        best = max(yes_cluster, key=lambda c: scores_record[c])
        for c in cluster_codes:
            result_row[c] = "Yes" if c == best else "No"

    # 3️⃣ Classify Unclassified_Other based on threshold first
    col = "Unclassified_Other"
    response = classifier(text, candidate_labels=["Yes", "No"])
    if isinstance(response, list):
        response = response[0]
    score_yes = response["scores"][response["labels"].index("Yes")]
    score_no  = response["scores"][response["labels"].index("No")]

    result_row[col] = "Yes" if score_yes >= CONFIDENCE_THRESHOLD else "No"
    result_row[f"{col}_score_Yes"] = score_yes
    result_row[f"{col}_score_No"] = score_no

    # 4️⃣ Final fallback: only if all other codes are "No" after thresholding
    if all(result_row[c] == "No" for c in cluster_codes + independent_codes):
        result_row[col] = "Yes"

    results.append(result_row)

    if (i + 1) % 10 == 0:
        print(f"Processed {i + 1}/{len(data)} responses...")
scores = [r[f"{col}_score_Yes"] for r in results]
print(min(scores), max(scores), np.mean(scores))
sum_yes = sum(row[col] == "Yes" for row in results)
print(f"{col}: {sum_yes} Yes at threshold {CONFIDENCE_THRESHOLD}")

# Timer end
end_time = time.time()
elapsed = end_time - start_time
avg_time = elapsed / len(data)
print(f"\n✅ Classification completed in {elapsed:.2f} seconds.")
print(f"Average time per entry: {avg_time:.3f} seconds.")
print("✔️ Running completed")

# Convert results to DataFrame if needed
results_df = pd.DataFrame(results)


Processed 10/30 responses...
Processed 20/30 responses...
Processed 30/30 responses...
0.0648532509803772 0.4666767120361328 0.24034789924820263
Unclassified_Other: 30 Yes at threshold 0.1

✅ Classification completed in 308.20 seconds.
Average time per entry: 10.273 seconds.
✔️ Running completed
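
In the cells above, every call is `classifier(text, candidate_labels=["Yes", "No"])`, so the code being classified is never passed to the model and each column appears to receive the same scores for a given response. One possible way around this, sketched below under stated assumptions, is to give each code its own candidate label and score the labels independently with `multi_label=True`. The entries in `CODE_DESCRIPTIONS` and the hypothesis template are hypothetical placeholders (the real wording should come from the coding manual), `classify_codes` is a hypothetical helper, and `fake_classifier` is only a stand-in so the sketch runs without downloading the model.

```python
# Hypothetical placeholder descriptions; replace with wording from the coding manual.
CODE_DESCRIPTIONS = {
    "GuiltMoreLikely": "the defendant is more likely to be guilty",
    "GuiltLessLikely": "the defendant is less likely to be guilty",
    "NoInformation_Evidence": "the decision gives no information about guilt",
    "InnocentUntilProvenGuilty": "people are innocent until proven guilty",
    "Confound": "the response misreads the scenario",
}


def classify_codes(text, classifier, descriptions, threshold):
    """Score each code separately so that the columns can differ.

    `classifier` is any callable with the zero-shot pipeline's call shape and
    a return value containing "labels" and "scores" lists.
    """
    codes = list(descriptions)
    response = classifier(
        text,
        candidate_labels=[descriptions[c] for c in codes],
        hypothesis_template="This response says that {}.",  # assumed wording
        multi_label=True,  # each label is scored independently of the others
    )
    score_by_label = dict(zip(response["labels"], response["scores"]))
    row = {}
    for code in codes:
        score = score_by_label[descriptions[code]]
        row[code] = "Yes" if score >= threshold else "No"
        row[f"{code}_score_Yes"] = score
    return row


def fake_classifier(text, candidate_labels, hypothesis_template, multi_label):
    # Stand-in for the pipeline: the first label scores high, the rest low
    scores = [0.9] + [0.05] * (len(candidate_labels) - 1)
    return {"sequence": text, "labels": list(candidate_labels), "scores": scores}


row = classify_codes("some response", fake_classifier, CODE_DESCRIPTIONS, threshold=0.5)
print(row["GuiltMoreLikely"], row["Confound"])
```

With the real pipeline, `classifier` from the model-loading cell would be passed in place of `fake_classifier`; the cluster-exclusivity step could then be applied to the returned row as before.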


# 5️⃣ Saving outputs

In [None]:
output_path = "/content/legaldetainment/RESULTS/30p_CF01_troubleshoot.csv" # ✏️ must rename for each run (+1 for example)
output_df = pd.DataFrame(results)
os.makedirs(os.path.dirname(output_path), exist_ok=True)
output_df.to_csv(output_path, index=False)
print(f"✔️ Results saved to {output_path}")

{'sequence': 'The judge decided not to detain x up until their trial therefore he may have already formed the opinion that he is not guilty.', 'labels': ['No', 'Yes'], 'scores': [0.7886896729469299, 0.21131031215190887]}
✔️ Results saved to /content/legaldetainment/RESULTS/30p_CF01.csv
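
Renaming the output for each run can also be automated. The following is a minimal sketch, assuming a `<stem>_<n>.csv` naming scheme, which is not the scheme currently used in this notebook; `next_run_path` is a hypothetical helper.

```python
import os
import tempfile


def next_run_path(directory, stem, ext=".csv"):
    """Return the first path of the form <stem>_<n><ext> that does not exist yet."""
    n = 1
    while os.path.exists(os.path.join(directory, f"{stem}_{n}{ext}")):
        n += 1
    return os.path.join(directory, f"{stem}_{n}{ext}")


# Toy demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    first = next_run_path(tmp, "30p_CF01")
    open(first, "w").close()  # pretend run 1 was saved
    second = next_run_path(tmp, "30p_CF01")
    print(os.path.basename(first), os.path.basename(second))
```

In the saving cell, `output_path` could then be set with `next_run_path("/content/legaldetainment/RESULTS", "30p_CF01")` so that earlier results are not overwritten.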


# 6️⃣ Evaluation

## 🤖 Of LM vs human

In [None]:
# humandata = humandata.replace({'0': 'No', '1': 'Yes'})

# # Clean up column names in output_df by stripping whitespace
# output_df.columns = output_df.columns.str.strip()

# # Add print statements to check column names before merge
# print("Columns in humandata before merge:")
# print(humandata.columns)
# print("\nColumns in output_df before merge:")
# print(output_df.columns)

# Compare human and model coding
eval_df = humandata[["Pno"] + all_codes].merge(
    output_df[["Pno"] + all_codes],
    on="Pno",
    how="inner"
)

# Boolean: all codes match for that Pno
eval_df["all_match"] = eval_df.apply(
    lambda row: all(row[code + "_x"] == row[code + "_y"] for code in all_codes),
    axis=1
)

# Count matches
total = len(eval_df)
matches = eval_df["all_match"].sum()
percentage = matches / total * 100 if total else 0

print(f"\nTotal Pnos compared: {total}")
print(f"Matching on all codes: {matches} ({percentage:.1f}%)") # 📋 for cell N & O

# Print Pnos that match on all codes
matching_pnos = eval_df.loc[eval_df["all_match"], "Pno"].tolist()
print(f"Pnos matching on all codes: {matching_pnos}") # 📋 for cell P


Total Pnos compared: 30
Matching on all codes: 14 (46.7%)
Pnos matching on all codes: [232, 247, 223, 212, 342, 18, 89, 259, 267, 253, 299, 327, 175, 31]


## 🚀 Of LM performance

In [None]:
import sys
from sklearn.metrics import precision_recall_fscore_support

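# 🔸 Redirect stdout to file; redirect_stdout restores it on exit, even if a metric call raises
from contextlib import redirect_stdout

log_file = "/content/legaldetainment/RESULTS/30p_CF01_log_troubleshoot.txt"
original_stdout = sys.stdout
with open(log_file, "w", encoding="utf-8") as f, redirect_stdout(f):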

    print("\n" + "="*60)
    print("Per-Code Metrics:")
    print("="*60)

    for code in all_codes:
        human_col = f"{code}_x"
        model_col = f"{code}_y"

        human_vals = eval_df[human_col]
        model_vals = eval_df[model_col]

        precision, recall, f1, support = precision_recall_fscore_support(
            human_vals, model_vals,
            labels=["Yes", "No"],
            average=None,
            zero_division=0
        )

        precision_yes = precision[0]
        recall_yes = recall[0]
        f1_yes = f1[0]
        support_yes = support[0]

        print(f"\n{code}:")
        print(f"  Precision: {precision_yes:.3f}")
        print(f"  Recall:    {recall_yes:.3f}")
        print(f"  F1-Score:  {f1_yes:.3f}")
        print(f"  Support:   {support_yes} actual 'Yes' cases")

    print("\n" + "="*60)
    print("Overall Metrics:")
    print("="*60)

    all_human = []
    all_model = []
    for code in all_codes:
        all_human.extend(eval_df[f"{code}_x"].tolist())
        all_model.extend(eval_df[f"{code}_y"].tolist())

    precision, recall, f1, support = precision_recall_fscore_support(
        all_human, all_model,
        labels=["Yes", "No"],
        average=None,
        zero_division=0
    )

    print(f"Overall Precision: {precision[0]:.3f}")
    print(f"Overall Recall:    {recall[0]:.3f}")
    print(f"Overall F1-Score:  {f1[0]:.3f}")

    # ✅ restore stdout explicitly before the with-block closes the file
    sys.stdout = original_stdout

print(f"Console output saved to: {log_file}")


Console output saved to: /content/legaldetainment/RESULTS/30p_CF01_log.txt
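
For reference, the per-code numbers written to the log are the precision, recall and F1 of the "Yes" class, with the human coding treated as ground truth. The block below is a minimal pure-Python sketch of those formulas, not the scikit-learn implementation; `yes_metrics` is a hypothetical helper, and it returns 0 where a denominator is zero, mirroring `zero_division=0`.

```python
def yes_metrics(human, model):
    """Precision, recall, F1 and support for the "Yes" class."""
    pairs = list(zip(human, model))
    tp = sum(h == "Yes" and m == "Yes" for h, m in pairs)  # both say Yes
    fp = sum(h != "Yes" and m == "Yes" for h, m in pairs)  # model Yes, human not
    fn = sum(h == "Yes" and m != "Yes" for h, m in pairs)  # human Yes, model not
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    support = tp + fn  # number of human-coded "Yes" cases
    return precision, recall, f1, support


# Toy example: 3 human "Yes" cases, of which the model finds 2, plus 1 false alarm
human = ["Yes", "Yes", "No", "No", "Yes"]
model = ["Yes", "No", "Yes", "No", "Yes"]
print(yes_metrics(human, model))
```

The "Overall Metrics" block applies the same calculation after pooling all codes into one long list, so codes with more "Yes" cases weigh more heavily in the overall figures.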
