## Quantitative evaluation of the results - Gemma 2 2B

This notebook examines the output of Gemma 2 2B after it has been adjusted to the gold standard.

Precision, recall, and F1 score are computed for the entire dataset.
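As a reference for the three metrics, the following minimal sketch shows how they follow from counts of true positives, false positives, and false negatives. The function name `precision_recall_f1` and the example counts are illustrative assumptions and are not taken from the evaluation itself.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the given counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 3 correct predictions, 1 spurious, 2 missed
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Precision penalizes spurious predictions, recall penalizes missed gold entities, and F1 is the harmonic mean of the two.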


---

In [1]:
import json
with open("../../data/NER/gemma2/gemma2_goldstandard_adjusted.json", "r", encoding="utf-8") as f:
    data = json.load(f)

#### Number of entities for which no genuine entity extraction took place
The model took the text entirely or in large part from the ground truth, and the text showed no connection to the entity being sought.

In [2]:
# Entities without genuine text extraction
from collections import defaultdict  # defaultdict(int) supplies a default value of 0 for keys that are not set yet


counts = defaultdict(int)

for item in data:
    for entity in item.get("entities", []):
        if entity.get("no_extraction") == True:
            label = entity.get("label")
            counts[label] += 1


print("No genuine entity extraction:")
for label in ["EVENT", "TOPIC", "DATE", "TIME", "LOC"]:
    print(f"{label}: {counts[label]}")

No genuine entity extraction:
EVENT: 21
TOPIC: 13
DATE: 1
TIME: 1
LOC: 12


#### Number of entities the model recognized correctly in content but with different wording
This includes larger adjustments, such as rewriting whole phrases and sentences (mainly for "EVENT" and "TOPIC"), and smaller adjustments, such as changing date and time formats (2. August -> 2.8.) and splitting a single longer "LOC" output into several "LOC" entities (LOC: Hauptstr. 2, 12345 Berlin -> LOC: Hauptstr. 2 and LOC: 12345 Berlin).

--> see also the notebook [06_prepare_json_for_evaluation](06_prepare_json_for_evaluation.ipynb)
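To make the two smaller kinds of adjustment concrete, here is a minimal sketch of what such a normalization could look like. It is only an illustration: the actual adjustments are documented in the referenced notebook and may well have been made differently or by hand. The helper names `normalize_date` and `split_loc`, the `MONTHS` mapping, and the comma-based splitting rule are all assumptions.

```python
import re

# Assumed mapping of German month names to month numbers
MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4,
    "Mai": 5, "Juni": 6, "Juli": 7, "August": 8,
    "September": 9, "Oktober": 10, "November": 11, "Dezember": 12,
}


def normalize_date(text):
    """Turn '2. August' into '2.8.'; return other strings stripped but unchanged."""
    match = re.fullmatch(r"\s*(\d{1,2})\.\s*(\w+)\s*", text)
    if match and match.group(2) in MONTHS:
        return f"{int(match.group(1))}.{MONTHS[match.group(2)]}."
    return text.strip()


def split_loc(text):
    """Split one long LOC string at commas into several LOC texts (assumed rule)."""
    return [part.strip() for part in text.split(",") if part.strip()]


print(normalize_date("2. August"))
print(split_loc("Hauptstr. 2, 12345 Berlin"))
```

Normalizing both sides to one surface form matters here because the evaluation compares entity texts by exact string equality.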

In [3]:
# Number of entities adjusted to the gold standard
from collections import defaultdict  # defaultdict(int) supplies a default value of 0 for keys that are not set yet


counts = defaultdict(int)

for item in data:
    for entity in item.get("entities", []):
        if entity.get("aligned_to_gold") == True:
            label = entity.get("label")
            counts[label] += 1

print("Number of entities subsequently adjusted to the gold standard:")
for label in ["EVENT", "TOPIC", "DATE", "TIME", "LOC"]:
    print(f"{label}: {counts[label]}")

Number of entities subsequently adjusted to the gold standard:
EVENT: 53
TOPIC: 46
DATE: 15
TIME: 35
LOC: 100


---
### Computing precision, recall, and F1 score

In [6]:
import json
from tqdm import tqdm
from collections import defaultdict
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd

# === Load files ===
with open("../../data/data_annotated.json", encoding="utf-8") as f:
    gold_data = json.load(f)

with open("../../data/NER/gemma2/gemma2_goldstandard_adjusted.json", encoding="utf-8") as f:
    gemma_data = json.load(f)

label_names = ["LOC", "DATE", "TIME", "EVENT", "TOPIC"]

def extract_entities_gold(entry):
    ent_dict = {}
    for ent in entry.get("entities", []):
        label = ent.get("label")
        text = ent.get("text", "").strip()
        if label and text:
            ent_dict.setdefault(label, []).append(text)
    return ent_dict

def extract_entities_pred(entry):
    ent_dict = {}
    for ent in entry.get("entities", []):
        label = ent.get("label")
        text = ent.get("standardized_text", "").strip()
        if label and text:
            ent_dict.setdefault(label, []).append(text)
    return ent_dict

# Mapping by file name for fast lookup
gemma_dict = {entry["file_name"]: entry for entry in gemma_data}

# Global evaluation data
y_true_all = []
y_pred_all = []
y_true_per_label = defaultdict(list)
y_pred_per_label = defaultdict(list)
label_stats = defaultdict(lambda: [0, 0, 0])
all_results = []

for gold_entry in tqdm(gold_data):
    file_name = gold_entry["file_name"]
    gold_ents = extract_entities_gold(gold_entry)
    pred_entry = gemma_dict.get(file_name)
    pred_ents = extract_entities_pred(pred_entry) if pred_entry else {}

    true_pos, false_pos, false_neg = [], [], []

    for label in label_names:
        gold_texts = set(t.strip() for t in gold_ents.get(label, []) if isinstance(t, str))
        pred_texts = set(t.strip() for t in pred_ents.get(label, []) if isinstance(t, str))

        # True Positives
        matched = gold_texts & pred_texts
        for text in matched:
            y_true_all.append(1)
            y_pred_all.append(1)
            y_true_per_label[label].append(1)
            y_pred_per_label[label].append(1)
            true_pos.append({"label": label, "text": text})

        # False Negatives
        for text in gold_texts - matched:
            y_true_all.append(1)
            y_pred_all.append(0)
            y_true_per_label[label].append(1)
            y_pred_per_label[label].append(0)
            false_neg.append({"label": label, "text": text})

        # False Positives
        for text in pred_texts - matched:
            y_true_all.append(0)
            y_pred_all.append(1)
            y_true_per_label[label].append(0)
            y_pred_per_label[label].append(1)
            false_pos.append({"label": label, "text": text})

        # Per-label counts: [true positives, false positives, false negatives]
        label_stats[label][0] += len(matched)
        label_stats[label][1] += len(pred_texts - matched)
        label_stats[label][2] += len(gold_texts - matched)

    # Metrics per file
    tp_count = len(true_pos)
    fp_count = len(false_pos)
    fn_count = len(false_neg)

    precision_local = tp_count / (tp_count + fp_count) if (tp_count + fp_count) > 0 else 0
    recall_local = tp_count / (tp_count + fn_count) if (tp_count + fn_count) > 0 else 0
    f1_local = 2 * precision_local * recall_local / (precision_local + recall_local) if (precision_local + recall_local) > 0 else 0

    result_entry = {
        "file_name": file_name,
        "precision": precision_local,
        "recall": recall_local,
        "f1": f1_local,
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg
    }
    all_results.append(result_entry)

# Save the per-file results
with open("../../data/NER/gemma2/results_gemma.json", "w", encoding="utf-8") as f:
    json.dump(all_results, f, ensure_ascii=False, indent=2)

# === Overall metrics (pooled over all files and labels) ===
tp_total = sum(1 for t, p in zip(y_true_all, y_pred_all) if t == 1 and p == 1)
precision = tp_total / sum(y_pred_all) if sum(y_pred_all) > 0 else 0
recall = tp_total / sum(y_true_all) if sum(y_true_all) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print("\n=== Overall evaluation ===")
print(f"Precision: {precision:.3f}")
print(f"Recall   : {recall:.3f}")
print(f"F1-Score : {f1:.3f}")

# === Evaluation per label ===
print("\n=== Evaluation per label ===")
for label, (tp_l, fp_l, fn_l) in label_stats.items():
    p = tp_l / (tp_l + fp_l) if (tp_l + fp_l) > 0 else 0
    r = tp_l / (tp_l + fn_l) if (tp_l + fn_l) > 0 else 0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0
    print(f"{label:<10} P: {p:.2f}  R: {r:.2f}  F1: {f:.2f}")


100%|██████████| 200/200 [00:00<00:00, 61826.42it/s]


=== Overall evaluation ===
Precision: 0.524
Recall   : 0.409
F1-Score : 0.459

=== Evaluation per label ===
LOC        P: 0.59  R: 0.41  F1: 0.48
DATE       P: 0.63  R: 0.43  F1: 0.51
TIME       P: 0.78  R: 0.69  F1: 0.73
EVENT      P: 0.37  R: 0.36  F1: 0.36
TOPIC      P: 0.27  R: 0.20  F1: 0.23
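The cell above counts a prediction as correct only if its text is identical to a gold text for the same file and label. The following toy example sketches that set-based matching for a single label of a single file; the entity texts are invented for illustration and do not come from the dataset.

```python
# Hypothetical LOC texts for one file
gold_texts = {"Hauptstr. 2", "12345 Berlin"}
pred_texts = {"12345 Berlin", "Rathaus"}

# Exact string equality decides what counts as a match
matched = gold_texts & pred_texts      # true positives
false_neg = gold_texts - matched       # gold entities the model missed
false_pos = pred_texts - matched       # predictions with no gold counterpart

print("TP:", sorted(matched))
print("FN:", sorted(false_neg))
print("FP:", sorted(false_pos))

# Any difference in spelling or case prevents a match
print("hauptstr. 2" in gold_texts)
```

Because sets are used, a text that occurs several times under the same label in one file is counted only once.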





In [17]:
import json
from collections import defaultdict

with open("../../data/data_annotated.json", encoding="utf-8") as f:
    gold = json.load(f)

with open("../../data/NER/gemma2_goldstandard_adjusted.json", encoding="utf-8") as f:
    ner = json.load(f)

# Build lookups by file_name for fast access
gold_by_file = {entry["file_name"]: entry for entry in gold}
ner_by_file = {entry["file_name"]: entry for entry in ner}

# Counters for exact matches per label
match_counts = defaultdict(int)
total_counts = defaultdict(int)

def get_entities_gold(entry):
    """Build a dict {label: set of lowercased gold texts} for one file."""
    entities = defaultdict(set)
    for e in entry.get("entities", []):
        label = e["label"]
        text = e["text"].strip().lower()
        entities[label].add(text)
    return entities

def get_entities_pred(entry):
    """Build a dict {label: set of lowercased standardized texts} for one file."""
    entities = defaultdict(set)
    for e in entry.get("entities", []):
        label = e["label"]
        text = e["standardized_text"].strip().lower()
        entities[label].add(text)
    return entities

# Iterate over all files in the gold standard
for file_name, gold_entry in gold_by_file.items():
    ner_entry = ner_by_file.get(file_name)
    if ner_entry is None:
        continue  # No output available for this file

    gold_entities = get_entities_gold(gold_entry)
    ner_entities = get_entities_pred(ner_entry)

    # Consider every label that occurs on either side
    all_labels = set(gold_entities.keys()) | set(ner_entities.keys())
    for label in all_labels:
        gold_set = gold_entities.get(label, set())
        ner_set = ner_entities.get(label, set())

        # Count the total number of gold entities
        total_counts[label] += len(gold_set)

        # Count exact matches only
        matches = gold_set & ner_set
        match_counts[label] += len(matches)

# Result as a dict
results = {label: {"matches": match_counts[label], "total": total_counts[label]} for label in total_counts}
results

{'DATE': {'matches': 124, 'total': 282},
 'LOC': {'matches': 165, 'total': 391},
 'TIME': {'matches': 151, 'total': 210},
 'EVENT': {'matches': 84, 'total': 227},
 'TOPIC': {'matches': 55, 'total': 259}}
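Dividing `matches` by `total` gives, per label, the share of gold texts that the model reproduced exactly under this cell's lowercase comparison. The sketch below computes that share from the dictionary printed above; the name `match_rate` is an assumption introduced for illustration.

```python
# Values copied from the output above
results = {
    "DATE": {"matches": 124, "total": 282},
    "LOC": {"matches": 165, "total": 391},
    "TIME": {"matches": 151, "total": 210},
    "EVENT": {"matches": 84, "total": 227},
    "TOPIC": {"matches": 55, "total": 259},
}

# Share of gold texts matched exactly, guarded against empty labels
match_rate = {
    label: (c["matches"] / c["total"] if c["total"] else 0.0)
    for label, c in results.items()
}

# Print labels from best to worst match rate
for label, rate in sorted(match_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<6} {rate:.3f}")
```

Note that these totals count distinct lowercased texts per file and skip files without model output, so they need not equal the gold counts used in the other cells.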

---
#### Computation excluding TOPIC, for later comparison with the model combination Flair + rule-based recognition

In [7]:
import json
from tqdm import tqdm
from collections import defaultdict
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd

# === Load files ===
with open("../../data/data_annotated.json", encoding="utf-8") as f:
    gold_data = json.load(f)

with open("../../data/NER/gemma2/gemma2_goldstandard_adjusted.json", encoding="utf-8") as f:
    gemma_data = json.load(f)

# === Relevant labels (without TOPIC) ===
label_names = ["LOC", "DATE", "TIME", "EVENT"]

# === Extract entities ===
def extract_entities_gold(entry):
    ent_dict = {}
    for ent in entry.get("entities", []):
        label = ent.get("label")
        text = ent.get("text", "").strip()
        if label in label_names and text:
            ent_dict.setdefault(label, []).append(text)
    return ent_dict

def extract_entities_pred(entry):
    ent_dict = {}
    for ent in entry.get("entities", []):
        label = ent.get("label")
        text = ent.get("standardized_text", "").strip()
        if label in label_names and text:
            ent_dict.setdefault(label, []).append(text)
    return ent_dict

# === Map predictions by file name for fast lookup ===
gemma_dict = {entry["file_name"]: entry for entry in gemma_data}

# === Prepare the comparison ===
y_true_all = []
y_pred_all = []
y_true_per_label = defaultdict(list)
y_pred_per_label = defaultdict(list)
label_stats = defaultdict(lambda: [0, 0, 0])
all_results = []

# === Comparison loop over all gold-standard files ===
for gold_entry in tqdm(gold_data):
    file_name = gold_entry["file_name"]
    gold_ents = extract_entities_gold(gold_entry)
    pred_entry = gemma_dict.get(file_name)
    pred_ents = extract_entities_pred(pred_entry) if pred_entry else {}

    true_pos, false_pos, false_neg = [], [], []

    for label in label_names:
        gold_texts = set(t.strip() for t in gold_ents.get(label, []) if isinstance(t, str))
        pred_texts = set(t.strip() for t in pred_ents.get(label, []) if isinstance(t, str))

        # True Positives
        matched = gold_texts & pred_texts
        for text in matched:
            y_true_all.append(1)
            y_pred_all.append(1)
            y_true_per_label[label].append(1)
            y_pred_per_label[label].append(1)
            true_pos.append({"label": label, "text": text})

        # False Negatives
        for text in gold_texts - matched:
            y_true_all.append(1)
            y_pred_all.append(0)
            y_true_per_label[label].append(1)
            y_pred_per_label[label].append(0)
            false_neg.append({"label": label, "text": text})

        # False Positives
        for text in pred_texts - matched:
            y_true_all.append(0)
            y_pred_all.append(1)
            y_true_per_label[label].append(0)
            y_pred_per_label[label].append(1)
            false_pos.append({"label": label, "text": text})

        # Per-label statistics: [true positives, false positives, false negatives]
        label_stats[label][0] += len(matched)
        label_stats[label][1] += len(pred_texts - matched)
        label_stats[label][2] += len(gold_texts - matched)

    # Compute the per-file metrics
    tp = len(true_pos)
    fp = len(false_pos)
    fn = len(false_neg)

    precision_local = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall_local = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1_local = 2 * precision_local * recall_local / (precision_local + recall_local) if (precision_local + recall_local) > 0 else 0

    all_results.append({
        "file_name": file_name,
        "precision": precision_local,
        "recall": recall_local,
        "f1": f1_local,
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg
    })

# === Save results ===
with open("../../data/NER/gemma2/results_gemma_without_topic.json", "w", encoding="utf-8") as f:
    json.dump(all_results, f, ensure_ascii=False, indent=2)

# === Overall metrics ===
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_all, y_pred_all, average="binary", zero_division=0
)

print("\n🔍 Overall result")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")

# === Per-label metrics ===
rows = []
for label in label_names:
    y_true = y_true_per_label[label]
    y_pred = y_pred_per_label[label]

    if not y_true and not y_pred:
        p = r = f = 0.0
    else:
        p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="binary", zero_division=0)

    rows.append({
        "Label": label,
        "Precision": round(p, 3),
        "Recall": round(r, 3),
        "F1-Score": round(f, 3),
        "Count in Gold": sum(y_true),
        "Count Predicted": sum(y_pred)
    })

df_metrics = pd.DataFrame(rows).sort_values("Label")
print("\n📊 Metrics per category:")
print(df_metrics.to_string(index=False))


100%|██████████| 200/200 [00:00<00:00, 66397.09it/s]


🔍 Overall result
Precision: 0.582
Recall:    0.455
F1-Score:  0.511

📊 Metrics per category:
Label  Precision  Recall  F1-Score  Count in Gold  Count Predicted
 DATE      0.626   0.428     0.508            290              198
EVENT      0.368   0.357     0.363            235              228
  LOC      0.589   0.407     0.482            405              280
 TIME      0.783   0.692     0.734            214              189
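The overall figures pool all entities across labels, so they can be cross-checked against the per-label table. The sketch below does this with the numbers printed above. One assumption is needed: the true-positive count per label is not printed, so it is reconstructed here by rounding precision times the predicted count.

```python
# Per-label figures copied from the table above:
# label -> (precision, count in gold, count predicted)
table = {
    "DATE": (0.626, 290, 198),
    "EVENT": (0.368, 235, 228),
    "LOC": (0.589, 405, 280),
    "TIME": (0.783, 214, 189),
}

# Assumption: recover true positives by rounding precision * predicted count
tp = {label: round(p * n_pred) for label, (p, _, n_pred) in table.items()}

tp_total = sum(tp.values())
gold_total = sum(n_gold for _, n_gold, _ in table.values())
pred_total = sum(n_pred for _, _, n_pred in table.values())

# Pooled (micro-averaged) metrics over the four labels
micro_precision = tp_total / pred_total
micro_recall = tp_total / gold_total
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(f"Precision: {micro_precision:.3f}")
print(f"Recall:    {micro_recall:.3f}")
print(f"F1-Score:  {micro_f1:.3f}")
```

Under that assumption the pooled values agree with the reported overall result of 0.582, 0.455, and 0.511, which indicates that labels with more entities, such as LOC, weigh more heavily in the overall score than smaller ones such as TIME.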



