# Evaluating Cross-Lingual Transfer and Inoculation Strategies for Low-Resource Hate Speech Detection in Swedish

Implement and evaluate:
1. mBERT & XLM-R trained on English -> Evaluate on Swedish (Zero-shot Baseline).
2. mBERT & XLM-R trained on English + All Other Non-Swedish Data -> Evaluate on Swedish.
3. Inoculation: Fine-tune models from (1) and (2) with a small, fixed amount of Swedish data -> Evaluate on Swedish.
4. KB-BERT trained on the same small amount of Swedish data -> Evaluate on Swedish.
5. KB-BERT trained on all available Swedish training data -> Evaluate on Swedish.
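
The five setups above can be enumerated as a small experiment grid, which makes it easier to check that no combination is missed. This is only a sketch: the `Experiment` class, its field names, and the string labels (`"english"`, `"small"`, and so on) are hypothetical names chosen for illustration, and "all other non-Swedish data" is assumed to mean the Danish and German datasets listed further down.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experiment:
    model: str               # model family label, not a checkpoint id
    train_sources: tuple     # non-Swedish training data used before any Swedish data
    swedish_finetune: str    # "none", "small" (fixed subset) or "all"

MULTILINGUAL_MODELS = ("mBERT", "XLM-R")
SOURCE_SETS = (
    ("english",),                      # setup 1
    ("english", "danish", "german"),   # setup 2 (assumed contents)
)

experiments = []
for model in MULTILINGUAL_MODELS:
    for sources in SOURCE_SETS:
        # Setups 1 and 2: evaluate on Swedish without seeing Swedish data
        experiments.append(Experiment(model, sources, "none"))
        # Setup 3: inoculate the same model with the small Swedish subset
        experiments.append(Experiment(model, sources, "small"))

# Setups 4 and 5: monolingual Swedish baselines
experiments.append(Experiment("KB-BERT", (), "small"))
experiments.append(Experiment("KB-BERT", (), "all"))

for exp in experiments:
    print(exp)
```

Under these assumptions the grid has ten runs: eight for the multilingual models and two for KB-BERT.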

## Environment Setup

Set up the environment: define the project paths on Google Drive and create the working directories.

In [None]:
import os
DRIVE_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/'
DATA_PATH = DRIVE_PATH + 'data/'
OUTPUT_PATH = DRIVE_PATH + 'models/'
RESULTS_PATH = DRIVE_PATH + 'results/'

# Create directories
os.makedirs(DATA_PATH, exist_ok=True)
os.makedirs(OUTPUT_PATH, exist_ok=True)
os.makedirs(RESULTS_PATH, exist_ok=True)

## Data Acquisition and Preparation

Datasets:
1. Swedish (Target): BiaSWE
2. English (Base Training): EACL 2021 + Urban Dictionary Misogyny (Combined)
3. Danish (Augmentation): DKhate
4. German (Augmentation): GeRMS-AT

In [None]:
import pandas as pd
from datasets import load_dataset, DatasetDict, Dataset
from huggingface_hub import login

# Authentication
try:
    login(token="INSERT TEXT") # Insert personal token here
    print("Hugging Face Hub login potentially successful (or token provided).")
except Exception as e:
    print(f"Hub login failed or token invalid: {e}")

print("\nLoading datasets from Hugging Face Hub...")

# Load DKHate
try:
    dk_dataset_dict = load_dataset("DDSC/dkhate")
    print("\nSuccessfully loaded DKhate:")
    print(dk_dataset_dict)
    # Expected output: DatasetDict with 'train' and 'test' splits
except Exception as e:
    print(f"Error loading DKhate: {e}")
    dk_dataset_dict = None # Set to None if loading fails

# Load BiaSWE
try:
    sw_dataset_dict = load_dataset("AI-Sweden-Models/BiaSWE")
    print("\nSuccessfully loaded BiaSWE:")
    print(sw_dataset_dict)
    # Expected output: DatasetDict with 'train', 'val', and 'test' splits
except Exception as e:
    print(f"Error loading BiaSWE: {e}")
    sw_dataset_dict = None # Set to None if loading fails

Hugging Face Hub login potentially successful (or token provided).

Loading datasets from Hugging Face Hub...



Successfully loaded DKhate:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2960
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 329
    })
})



Successfully loaded BiaSWE:
DatasetDict({
    train: Dataset({
        features: ['text', 'annotations'],
        num_rows: 150
    })
    val: Dataset({
        features: ['text', 'annotations'],
        num_rows: 150
    })
    test: Dataset({
        features: ['text', 'annotations'],
        num_rows: 150
    })
})


## DKhate Dataset

Upon inspection, the training split is imbalanced: roughly 87% of examples are class 0 (not offensive) and 13% are class 1 (offensive). In other words, neutral examples far outnumber offensive ones.

Things to consider/implement:
* Class weights during training
* Look at F1 score (weighted or macro), precision, and recall when evaluating the model
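
To make both points concrete, here is a minimal sketch that derives inverse-frequency class weights from the DKhate train counts (2576 vs. 384) and shows why accuracy alone is misleading on this split. The helper names `balanced_class_weights` and `macro_f1` are hypothetical, and the weighting formula `n / (k * count)` is one common choice (the same heuristic as scikit-learn's `class_weight="balanced"`), not necessarily the one used later in training.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in sorted(counts)}

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Label distribution of the DKhate train split after preprocessing
labels = [0] * 2576 + [1] * 384
weights = balanced_class_weights(labels)

# A trivial baseline that always predicts the majority class
always_neutral = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_neutral)) / len(labels)

print("class weights:", weights)
print("majority baseline accuracy:", round(accuracy, 3))
print("majority baseline macro F1:", round(macro_f1(labels, always_neutral), 3))
```

The majority baseline reaches about 0.87 accuracy but only about 0.465 macro F1, which is why macro F1 is the more informative metric here. The weights could then be passed to a weighted loss, for example the `weight` argument of `torch.nn.CrossEntropyLoss`.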

In [None]:
import datasets
from datasets import DatasetDict, Dataset

def preprocess_dkhate_dict(dataset_dict: DatasetDict) -> DatasetDict:
    """
    Preprocesses the DKhate DatasetDict loaded from Hugging Face Hub.

    - Lowercases the 'text' column.
    - Maps the original 'label' column (containing "NOT", "OFF")
      to numerical values (0, 1) in the same 'label' column.

    Args:
        dataset_dict: The raw DatasetDict object for DKhate (e.g., from load_dataset).

    Returns:
        A new DatasetDict with the processed data in all splits.
        Returns None if input is not a DatasetDict.
    """
    if not isinstance(dataset_dict, DatasetDict):
        print("Error: Input must be a datasets.DatasetDict")
        return None

    print("Preprocessing DKhate DatasetDict...")

    # Define the mapping for the labels
    label_mapping = {"NOT": 0, "OFF": 1}

    def map_labels_and_lowercase(example):
        """Internal function to process a single example dictionary."""
        # Lowercase text
        # Check if 'text' exists and is a string before lowercasing
        if 'text' in example and isinstance(example['text'], str):
            example['text'] = example['text'].lower()
        else:
            # Handle cases where 'text' might be missing or not a string if necessary
            example['text'] = "" # Assign empty string or handle as needed

        # Map labels
        original_label_key = 'label'
        if original_label_key in example:
            original_label_value = example[original_label_key]
            # Overwrite the 'label' field with numerical mapping
            # Use .get() with a default value (-1) for robustness against unexpected labels
            example['label'] = label_mapping.get(original_label_value, -1)
            # Warning for -1 values if they occur
            if example['label'] == -1:
                print(f"Warning: Unexpected label value '{original_label_value}' found in DKhate example: {example}")
        else:
            # Handle cases where the expected label column is missing
            print(f"Warning: Label key '{original_label_key}' not found in DKhate example: {example}")
            example['label'] = -1 # Assign default/error value

        return example

    # Use .map() method to apply the function to all examples in all splits
    processed_dict = dataset_dict.map(
        map_labels_and_lowercase,
        batched=False, # Process example by example
    )

    print("DKhate preprocessing complete.")

    # Verify the feature type for the 'label' column changed
    if 'train' in processed_dict: # Check a specific split if it exists
         print("Processed features (example from train split):")
         print(processed_dict['train'].features)
         # The 'label' feature should now show an integer type (e.g., int64)

    return processed_dict

processed_dk_dataset_dict = preprocess_dkhate_dict(dk_dataset_dict)
if processed_dk_dataset_dict:
    print("\nExample processed data (first train example):")
    print(processed_dk_dataset_dict['train'][0])

    # Check label distribution in processed training data
    # (kept inside the guard so a failed load does not raise on None)
    train_labels = processed_dk_dataset_dict['train']['label']
    label_counts = {}
    for label in train_labels:
        label_counts[label] = label_counts.get(label, 0) + 1
    print("\nValue counts for processed numerical label column (train split):")
    print(label_counts) # Should show counts for 0 and 1

Preprocessing DKhate DatasetDict...


DKhate preprocessing complete.
Processed features (example from train split):
{'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}

Example processed data (first train example):
{'text': 'jeg tror det vil være dejlig køligt, men jeg vil have det meget meget svært, såfremt personen i billedet skulle lægge softicen der. ', 'label': 0}

Value counts for processed numerical label column (train split):
{0: 2576, 1: 384}


## BiaSWE Dataset

450 data points, split evenly into train, val, and test sets (150 each).

The annotations contain hate-speech detection ("yes" or "no"), misogyny detection ("yes" or "no"), category detection ("Stereotype", "Erasure and minimization", "Violence against women", "Sexualization and objectification" and "Anti-feminism and denial of discrimination") and, finally, severity rating (on a scale of 1 to 10).

Structure of each data point:
{"text": "...", "annotations": {"annotator 1": {"hate_speech": "...", "misogyny": "...", "category": "...", "rating": "...", "comment": "..."}, "annotator 2": {...}, ...}}
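
Since each text carries several annotator judgements, they have to be collapsed into a single binary label. The sketch below shows one way to do that with a majority vote on the `misogyny` field, where ties and missing votes fall back to 0. The function name `majority_vote` and the sample record are made up for illustration. The preprocessing cell below applies the same idea with more defensive checks.

```python
def majority_vote(annotations, field="misogyny"):
    """Return 1 if 'yes' votes strictly outnumber 'no' votes, else 0."""
    votes = [
        str(a[field]).strip().lower()
        for a in annotations.values()
        # Skip annotators that are missing (None) or did not fill in the field
        if isinstance(a, dict) and a.get(field)
    ]
    yes, no = votes.count("yes"), votes.count("no")
    # Ties (including 0 vs. 0) resolve to the negative class
    return 1 if yes > no else 0

# Hypothetical record, shaped like the structure shown above
example = {
    "annotator 1": {"hate_speech": "yes", "misogyny": "yes"},
    "annotator 2": {"hate_speech": "yes", "misogyny": "no"},
    "annotator 3": {"hate_speech": "no", "misogyny": "yes"},
    "annotator 4": None,
}
tie = {
    "annotator 1": {"misogyny": "yes"},
    "annotator 2": {"misogyny": "no"},
}

print(majority_vote(example))  # two yes vs. one no
print(majority_vote(tie))      # tie falls back to 0
print(majority_vote({}))       # no votes falls back to 0
```

Breaking ties toward 0 keeps the positive class conservative, at the cost of possibly under-labelling borderline cases.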



In [None]:
import datasets
from datasets import DatasetDict, Dataset
import json

def preprocess_biaswe_dict(dataset_dict: DatasetDict) -> DatasetDict:
    """
    Preprocesses the BiaSWE DatasetDict.

    - Extracts text (assuming top-level 'text' key).
    - Lowercases text.
    - Aggregates misogyny annotations using majority vote (conservative tie-breaking).
    - Creates a final numerical 'label' column (0=Not Misogyny, 1=Misogyny).

    Args:
        dataset_dict: The raw DatasetDict object for BiaSWE.

    Returns:
        A new DatasetDict with processed data ('text', 'label').
        Returns None if input is not a DatasetDict.
    """
    if not isinstance(dataset_dict, DatasetDict):
        print("Error: Input must be a datasets.DatasetDict")
        return None

    print("Preprocessing BiaSWE DatasetDict...")

    annotator_keys = ["annotator 1", "annotator 2", "annotator 3", "annotator 4"]

    def get_final_label_and_text(example):
        """Processes a single example to extract text and aggregate labels."""
        processed_example = {}

        # Extract and lowercase text (assuming top-level 'text' key)
        text_key = 'text' # Verify this key exists in dataset structure
        if text_key in example and isinstance(example[text_key], str):
            processed_example['text'] = example[text_key].lower()
        else:
            processed_example['text'] = "" # Handle missing/invalid text

        # Aggregate Annotations
        annotations_key = 'annotations' # Verify this key
        yes_votes = 0
        no_votes = 0
        valid_votes = 0

        if annotations_key in example and example[annotations_key]:
            annotations = example[annotations_key]
            # Handle if annotations are stored as a JSON string
            if isinstance(annotations, str):
                try:
                    annotations = json.loads(annotations)
                except json.JSONDecodeError:
                    annotations = {} # Assign empty dict if JSON is invalid

            if isinstance(annotations, dict): # Check if it's a dictionary
                for key in annotator_keys:
                    if key in annotations and annotations[key]: # Check if annotator exists and is not null
                        annotator_data = annotations[key]
                        # Check if 'misogyny' key exists within the annotator's data
                        if isinstance(annotator_data, dict) and 'misogyny' in annotator_data:
                           misogyny_label = annotator_data['misogyny']
                           # Standardize potential labels (adjust based on actual values)
                           if isinstance(misogyny_label, str):
                               label_lower = misogyny_label.lower()
                               if label_lower in ['yes', 'misogyny', '1']: # Add other positive variants if needed
                                   yes_votes += 1
                                   valid_votes += 1
                               elif label_lower in ['no', 'not misogyny', '0']: # Add other negative variants
                                   no_votes += 1
                                   valid_votes += 1
                               # Ignore NaN or empty strings implicitly

        # Determine Final Label (Majority Vote, Conservative Tie-breaking)
        if yes_votes > no_votes:
            processed_example['label'] = 1
        elif no_votes > yes_votes:
            processed_example['label'] = 0
        else: # Tie situation (includes 0 vs 0 if no valid votes)
            processed_example['label'] = 0 # Conservative tie-breaking

        return processed_example

    # Use .map() to apply the function
    # Create 'text' and 'label' from scratch based on complex logic.
    processed_dict = dataset_dict.map(
        get_final_label_and_text,
        batched=False,
        remove_columns=dataset_dict['train'].column_names # Remove all original columns
    )

    print("BiaSWE preprocessing complete.")

    # Verify features
    if 'train' in processed_dict:
         print("Processed features (example from train split):")
         print(processed_dict['train'].features)
         # Should now only have 'text' (string) and 'label' (int64)

    return processed_dict

processed_sw_dataset_dict = preprocess_biaswe_dict(sw_dataset_dict)

if processed_sw_dataset_dict:
    print("\nExample processed data (first train example):")
    print(processed_sw_dataset_dict['train'][0])

    # Check label distribution in processed training data
    # (kept inside the guard so a failed load does not raise on None)
    train_labels = processed_sw_dataset_dict['train']['label']
    label_counts = {}
    for label in train_labels:
        label_counts[label] = label_counts.get(label, 0) + 1
    print("\nValue counts for processed numerical label column (train split):")
    print(label_counts)

Preprocessing BiaSWE DatasetDict...


BiaSWE preprocessing complete.
Processed features (example from train split):
{'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}

Example processed data (first train example):
{'text': 'ni som bor i hyreslägenhet! varför i helvete gör ni det? inte råd?: hej!  tycker de är allt för mycket folk som söker bostad och gnäller att det inte finns något.. köp en för helvete! vad gör ni av era pengar egentligen? så min fråga är varför köper inte fler personer lägenhet? varför super ni upp hela lönen istället för att spara till kontantinsats? eller trivs ni så bra i hyresghetton?', 'label': 0}

Value counts for processed numerical label column (train split):
{0: 80, 1: 70}


In [None]:
# Saving Pre-processed Hugging Face Datasets

DRIVE_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/'
PROCESSED_DATA_PATH = DRIVE_PATH + 'processed_data/'

# Create the base processed data directory if it doesn't exist
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)
print(f"Will save processed datasets to: {PROCESSED_DATA_PATH}")

# Define specific paths for each dataset
save_path_dk = os.path.join(PROCESSED_DATA_PATH, 'dkhate_processed_hf')
save_path_sw = os.path.join(PROCESSED_DATA_PATH, 'biaswe_processed_hf')

# Saving DKhate
# Check if the processed dataset variable exists and is a DatasetDict
if 'processed_dk_dataset_dict' in locals() and isinstance(processed_dk_dataset_dict, datasets.DatasetDict):
    print(f"\nSaving processed DKhate dataset to: {save_path_dk}")
    try:
        processed_dk_dataset_dict.save_to_disk(save_path_dk)
        print("-> DKhate dataset saved successfully.")
    except Exception as e:
        print(f"!! Error saving DKhate dataset: {e}")
else:
    print("\nVariable 'processed_dk_dataset_dict' not found or not a DatasetDict. Skipping save.")

# Saving BiaSWE
# Check if the processed dataset variable exists and is a DatasetDict
if 'processed_sw_dataset_dict' in locals() and isinstance(processed_sw_dataset_dict, datasets.DatasetDict):
    print(f"\nSaving processed BiaSWE dataset to: {save_path_sw}")
    try:
        processed_sw_dataset_dict.save_to_disk(save_path_sw)
        print("-> BiaSWE dataset saved successfully.")
    except Exception as e:
        print(f"!! Error saving BiaSWE dataset: {e}")
else:
    print("\nVariable 'processed_sw_dataset_dict' not found or not a DatasetDict. Skipping save.")

print("\n--- Saving Complete ---")
print("You should now find the folders 'dkhate_processed_hf' and 'biaswe_processed_hf'")
print(f"inside your '{PROCESSED_DATA_PATH}' directory on Google Drive.")

Will save processed datasets to: /content/drive/MyDrive/SwedishHateSpeechProject/processed_data/

Variable 'processed_dk_dataset_dict' not found or not a DatasetDict. Skipping save.

Variable 'processed_sw_dataset_dict' not found or not a DatasetDict. Skipping save.

--- Saving Complete ---
You should now find the folders 'dkhate_processed_hf' and 'biaswe_processed_hf'
inside your '/content/drive/MyDrive/SwedishHateSpeechProject/processed_data/' directory on Google Drive.


## English Datasets

Training uses two English datasets, which are concatenated to maximise the amount of training data.
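
As a rough illustration of the concatenation step, the sketch below stacks two toy frames that already share the `text`/`label` schema. The toy rows, the `source` column, and the de-duplication on `text` are assumptions added for illustration. They are not part of the processing cells that follow.

```python
import pandas as pd

# Toy stand-ins for the processed Urban Dictionary and EACL frames
ud = pd.DataFrame({"text": ["example a", "example b"], "label": [0, 1]})
eacl = pd.DataFrame({"text": ["example c", "example b"], "label": [0, 1]})

combined = (
    # Tag each row with its origin so per-dataset analysis stays possible
    pd.concat([ud.assign(source="ud"), eacl.assign(source="eacl")], ignore_index=True)
    # Optional: drop texts that appear in both datasets, keeping the first copy
    .drop_duplicates(subset="text", keep="first")
    .reset_index(drop=True)
)

print(combined)
print(combined["label"].value_counts())
```

Keeping a `source` column is cheap and makes it possible to check later whether one dataset dominates the label distribution.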

In [None]:
import pandas as pd
import datasets
from datasets import Dataset, DatasetDict
import os
import numpy as np

# Configuration
DRIVE_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/'
DATA_PATH = os.path.join(DRIVE_PATH, 'data/')
PROCESSED_DATA_PATH = os.path.join(DRIVE_PATH, 'processed_data/')
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)

# Define Input/Output Paths
ud_csv_path = os.path.join(DATA_PATH, "ManualTag_Misogyny.csv") # Urban Dictionary
eacl_csv_path = os.path.join(DATA_PATH, "final_labels.csv") # EACL

save_path_ud = os.path.join(PROCESSED_DATA_PATH, 'ud_misogyny_processed_hf')
save_path_eacl = os.path.join(PROCESSED_DATA_PATH, 'eacl_misogyny_processed_hf')

## Urban Dictionary Dataset

In [None]:
import pandas as pd

def process_ud_misogyny(csv_path: str) -> pd.DataFrame | None:
    """Loads and processes the simple UD misogyny dataset."""
    print(f"\nProcessing Urban Dictionary Misogyny from: {csv_path}")
    try:
        df = pd.read_csv(csv_path, encoding='latin-1')
        print(f"Loaded UD successfully. Shape: {df.shape}")
        print("Initial columns:", df.columns)


        # Verify 'Definition' and 'is_misogyny' columns exist and label is 0/1
        if 'Definition' not in df.columns or 'is_misogyny' not in df.columns: # Adjust column names if needed
             print("!! Error: Expected 'Definition' and 'is_misogyny' columns not found in UD data. Please inspect.")
             return None
        print("Label value counts:")
        print(df['is_misogyny'].value_counts()) # Check if it's already 0/1
        print(f"Data type of 'is_misogyny': {df['is_misogyny'].dtype}")
        print(f"Data type of 'Definition': {df['Definition'].dtype}")

        # Check for missing values
        if df['is_misogyny'].isnull().any():
          print("Warning: Found NaN values. Filling with 0.")
          df.fillna({'is_misogyny':0}, inplace=True) # Replace the NaN with 0

        # Ensure label is integer
        df['is_misogyny'] = df['is_misogyny'].astype(int)
        print(f"Updated label value counts: {df['is_misogyny'].value_counts()}")
        # Clean Text (ensure string type first)
        df['Definition'] = df['Definition'].astype(str).str.lower()

        # Rename columns
        df = df.rename(columns = {'Definition':'text', 'is_misogyny':'label'})
        # Select final columns
        final_df = df[['text', 'label']].copy()
        print(f"Processed UD successfully. Shape: {final_df.shape}")
        return final_df

    except FileNotFoundError:
        print(f"!! Error: File not found at {csv_path}")
        return None
    except Exception as e:
        print(f"!! Error processing UD: {e}")
        return None

# Process and inspect
ud_processed_df = process_ud_misogyny(ud_csv_path)
ud_processed_df.info()
ud_processed_df.head()


Processing Urban Dictionary Misogyny from: /content/drive/MyDrive/SwedishHateSpeechProject/data/ManualTag_Misogyny.csv
Loaded UD successfully. Shape: (2286, 2)
Initial columns: Index(['Definition', 'is_misogyny'], dtype='object')
Label value counts:
is_misogyny
0.0    1251
1.0    1034
Name: count, dtype: int64
Data type of 'is_misogyny': float64
Data type of 'Definition': object
Updated label value counts: is_misogyny
0    1252
1    1034
Name: count, dtype: int64
Processed UD successfully. Shape: (2286, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2286 entries, 0 to 2285
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2286 non-null   object
 1   label   2286 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 35.8+ KB


| | text | label |
|---|---|---|
| 0 | ur gonna die... queer | 0 |
| 1 | valuptuous man boobs. | 0 |
| 2 | variation of brother. | 0 |
| 3 | very impressive penis | 0 |
| 4 | what i call my penis. | 0 |


## Online Misogyny EACL 2021

In [None]:
def process_eacl_misogyny(csv_path: str) -> pd.DataFrame | None:
    """Loads and processes the complex EACL misogyny dataset."""
    print(f"\nProcessing EACL Misogyny from: {csv_path}")
    try:
        df = pd.read_csv(csv_path)
        print(f"Loaded EACL successfully. Shape: {df.shape}")
        # Keep only potentially relevant columns
        relevant_cols = ['body', 'level_1', 'split']
        if not all(col in df.columns for col in relevant_cols):
            print(f"!! Error: Expected columns ('body', 'level_1', 'split') not found in EACL data.")
            print("Available columns:", df.columns)
            return None

        df_subset = df[relevant_cols].copy()
        print("\nOriginal EACL level_1 label value counts:")
        print(df_subset['level_1'].value_counts())

        # Rename columns
        df_subset.rename(columns={'body': 'text', 'level_1': 'label_original', 'split': 'original_split'}, inplace=True)

        # Map Labels
        positive_label_str = 'Misogynistic'
        label_map = {positive_label_str: 1, 'Nonmisogynistic': 0}

        df_subset['label'] = df_subset['label_original'].map(label_map)

        # Check for mapping errors (NaNs)
        if df_subset['label'].isna().any():
            print("!! Warning: Found NaN values in EACL 'label' column after mapping.")
            print("Original labels that failed to map:")
            print(df_subset[df_subset['label'].isna()]['label_original'].value_counts())
            print("Dropping rows with mapping errors...")
            df_subset.dropna(subset=['label'], inplace=True)

        # Cast unconditionally so the label dtype is integer whether or not rows were dropped
        df_subset['label'] = df_subset['label'].astype(int)

        # Clean Text
        df_subset['text'] = df_subset['text'].astype(str).str.lower()

        # Select final columns (including the original split information)
        final_df = df_subset[['text', 'label', 'original_split']].copy()
        print(f"Processed EACL successfully. Shape: {final_df.shape}")
        print("Final numerical label value counts:")
        print(final_df['label'].value_counts())
        print("\nOriginal split distribution:")
        print(final_df['original_split'].value_counts())
        return final_df

    except FileNotFoundError:
        print(f"!! Error: File not found at {csv_path}")
        return None
    except Exception as e:
        print(f"!! Error processing EACL: {e}")
        return None

# Call function to create Dataset
eacl_processed_df = process_eacl_misogyny(eacl_csv_path)


Processing EACL Misogyny from: /content/drive/MyDrive/SwedishHateSpeechProject/data/final_labels.csv
Loaded EACL successfully. Shape: (6567, 18)

Original EACL level_1 label value counts:
level_1
Nonmisogynistic    5868
Misogynistic        699
Name: count, dtype: int64
Processed EACL successfully. Shape: (6567, 3)
Final numerical label value counts:
label
0    5868
1     699
Name: count, dtype: int64

Original split distribution:
original_split
train    5264
test     1303
Name: count, dtype: int64


## Combining and Processing English Datasets

In [None]:
import pandas as pd
import datasets
from datasets import Dataset, DatasetDict
import os
import numpy as np
import traceback

def convert_split_save_english(df: pd.DataFrame | None, save_path: str, dataset_name: str, use_original_splits: bool = False):
    """
    Converts DataFrame to Dataset, casts 'label' to ClassLabel,
    handles splits (predefined or new), and saves the final DatasetDict.
    """
    if df is None or df.empty:
        print(f"\nSkipping convert/split/save for {dataset_name}: Input DataFrame is None or empty.")
        return

    print(f"\nConverting, splitting, and saving {dataset_name}...")
    try:
        # Define the ClassLabel feature once
        # Use consistent names for easier concatenation later if needed elsewhere
        class_label_feature = datasets.ClassLabel(num_classes=2, names=['Not Problematic', 'Problematic'])

        # This variable will hold the final DatasetDict with train/validation splits
        final_dataset_dict = None

        # Using Predefined Splits (EACL)
        if use_original_splits and 'original_split' in df.columns:
            print(f"Using predefined splits for {dataset_name}...")
            dataset_dict_temp = {} # Temporary dictionary to hold casted splits
            for split_name in df['original_split'].unique():
                 valid_split_name = split_name.lower()
                 # Ensure standard split names
                 if valid_split_name not in ['train', 'validation', 'test']:
                     print(f"Warning: Mapping original split '{split_name}' to 'train'.")
                     valid_split_name = 'train' # Default unknown splits to train

                 # Create DataFrame for this specific split
                 split_df = df[df['original_split'] == split_name].copy()
                 split_df.drop(columns=['original_split'], inplace=True)

                 if split_df.empty:
                     print(f"Warning: Split '{valid_split_name}' is empty. Skipping.")
                     continue

                 # Convert DataFrame split to Dataset object
                 split_ds = datasets.Dataset.from_pandas(split_df, preserve_index=False)

                 # Cast the label column for this specific split Dataset
                 print(f"Casting 'label' to ClassLabel for split '{valid_split_name}'...")
                 try:
                     split_ds = split_ds.cast_column("label", class_label_feature)
                     dataset_dict_temp[valid_split_name] = split_ds # Store the casted split
                 except Exception as cast_error:
                      print(f"!! Error casting label for split '{valid_split_name}': {cast_error}")
                      # Skip split if invalid
                      print(f"!! Skipping split '{valid_split_name}' due to casting error.")


            if not dataset_dict_temp:
                 print(f"!! Error: No valid splits were processed for {dataset_name}. Cannot proceed.")
                 return

            # Create the final DatasetDict from the successfully processed splits
            final_dataset_dict = datasets.DatasetDict(dataset_dict_temp)
            print("Features after casting (example from first available split):", final_dataset_dict[list(final_dataset_dict.keys())[0]].features)

            # A 'train' split is required; a 'validation' split is then ensured below (for predefined splits)
            if 'train' not in final_dataset_dict:
                 print(f"!! Error: No 'train' split found after processing original splits for {dataset_name}. Cannot save.")
                 return # Cannot proceed without train split

            if 'validation' not in final_dataset_dict:
                 if 'test' in final_dataset_dict:
                     print("Warning: No 'validation' split found. Renaming 'test' split to 'validation'.")
                     final_dataset_dict['validation'] = final_dataset_dict.pop('test')
                 elif 'train' in final_dataset_dict:
                     print("Warning: No 'validation' or 'test' split found. Creating 10% validation split from train.")
                     try:
                         # Stratification requires ClassLabel
                         train_val_split = final_dataset_dict['train'].train_test_split(test_size=0.1, stratify_by_column='label', seed=42)
                         final_dataset_dict['train'] = train_val_split['train']
                         final_dataset_dict['validation'] = train_val_split['test']
                     except ValueError as e_stratify:
                          print(f"!! Error stratifying during validation split creation: {e_stratify}. Creating random split.")
                          train_val_split = final_dataset_dict['train'].train_test_split(test_size=0.1, seed=42)
                          final_dataset_dict['train'] = train_val_split['train']
                          final_dataset_dict['validation'] = train_val_split['test']
                 else:
                     # This case should be prevented by the 'train' check above
                     print("!! Error: Cannot create validation split as no 'train' split exists.")
                     return


        # Creating New Splits (UD)
        else:
            print(f"Creating new train/validation splits for {dataset_name}...")
            # Convert entire DataFrame to Dataset
            full_ds = datasets.Dataset.from_pandas(df, preserve_index=False) # Use a clear variable name 'full_ds'

            # Cast Value label to ClassLabel before splitting
            print("Casting 'label' column to ClassLabel for stratification...")
            try:
                full_ds = full_ds.cast_column("label", class_label_feature) # Cast the full dataset
                print(f"Features after casting: {full_ds.features}")
            except Exception as cast_error:
                 print(f"!! Error casting label for {dataset_name}: {cast_error}. Cannot proceed with stratification.")
                 return

            # Split the casted dataset into train/validation
            try:
                print("Splitting data (80% train / 20% validation) with stratification...")
                split_dict = full_ds.train_test_split(test_size=0.2, stratify_by_column='label', seed=42)
                split_dict['validation'] = split_dict.pop('test') # Rename 'test' from split to 'validation'
                final_dataset_dict = datasets.DatasetDict(split_dict)
            except ValueError as e_stratify:
                 print(f"!! Error stratifying during split: {e_stratify}. Creating random split.")
                 split_dict = full_ds.train_test_split(test_size=0.2, seed=42) # Fallback to random
                 split_dict['validation'] = split_dict.pop('test')
                 final_dataset_dict = datasets.DatasetDict(split_dict)

        # Saving
        if final_dataset_dict:
            print(f"\nFinal splits for {dataset_name}: {final_dataset_dict}")
            print(f"Saving processed {dataset_name} splits to: {save_path}")
            final_dataset_dict.save_to_disk(save_path)
            print(f"-> {dataset_name} saved successfully.")
        else:
            print(f"!! Skipping save for {dataset_name} as final DatasetDict was not created.")

    except Exception as e:
        print(f"!! An unexpected error occurred during convert/split/save for {dataset_name}: {e}")
        traceback.print_exc() # Print detailed traceback for debugging


print("--- Starting English Dataset Processing ---")
# Process UD (create new splits, will cast label inside function)
convert_split_save_english(ud_processed_df, save_path_ud, "UD Misogyny", use_original_splits=False)
# Process EACL (use predefined splits, will cast label inside function for each split)
convert_split_save_english(eacl_processed_df, save_path_eacl, "EACL Misogyny", use_original_splits=True)
print("\n--- English Dataset Processing Complete ---")

--- Starting English Dataset Processing ---

Converting, splitting, and saving UD Misogyny...
Creating new train/validation splits for UD Misogyny...
Casting 'label' column to ClassLabel for stratification...


Features after casting: {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['Not Problematic', 'Problematic'], id=None)}
Splitting data (80% train / 20% validation) with stratification...

Final splits for UD Misogyny: DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1828
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 458
    })
})
Saving processed UD Misogyny splits to: /content/drive/MyDrive/SwedishHateSpeechProject/processed_data/ud_misogyny_processed_hf


-> UD Misogyny saved successfully.

Converting, splitting, and saving EACL Misogyny...
Using predefined splits for EACL Misogyny...
Casting 'label' to ClassLabel for split 'train'...


Casting 'label' to ClassLabel for split 'test'...


Features after casting (example from first available split): {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['Not Problematic', 'Problematic'], id=None)}

Final splits for EACL Misogyny: DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5264
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1303
    })
})
Saving processed EACL Misogyny splits to: /content/drive/MyDrive/SwedishHateSpeechProject/processed_data/eacl_misogyny_processed_hf


-> EACL Misogyny saved successfully.

--- English Dataset Processing Complete ---
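
The log above reports an 80% / 20% split "with stratification". As a rough illustration of what stratification means, the sketch below splits indices per label so that each class keeps approximately the same share in both parts. It uses only the standard library and is a stand-in, not the algorithm that `datasets` uses internally; `stratified_split` and the toy labels are hypothetical names introduced for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) so that each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)                      # shuffle within the class only
        n_val = round(len(indices) * test_size)   # per-class validation quota
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy data (made up): 90 negatives and 10 positives
labels = [0] * 90 + [1] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))        # 80 20
print(sum(labels[i] for i in val_idx))     # 2 positives end up in validation
```

Without stratification, a random 20% sample of such an imbalanced set could by chance contain very few positives, which would make validation F1 unstable.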


## GeRMS-AT Dataset

In [None]:
import pandas as pd
import datasets
from datasets import Dataset, DatasetDict
import os
import numpy as np
import json

# Configuration and Paths
DRIVE_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/'
DATA_PATH = os.path.join(DRIVE_PATH, 'data/')
PROCESSED_DATA_PATH = os.path.join(DRIVE_PATH, 'processed_data/')
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)
germs_path = os.path.join(DATA_PATH, "germeval-competition-traindev.jsonl")
save_path_germs = os.path.join(PROCESSED_DATA_PATH, 'germs_at_processed_hf')


def load_jsonl_robustly(file_path, encoding='utf-8'):
    """Loads a JSONL file line by line, skipping lines with decode errors."""
    data_list = []
    print(f"Attempting robust load for: {file_path}")
    line_num = 0
    skipped_lines = 0
    try:
        with open(file_path, 'r', encoding=encoding) as f:
            for line in f:
                line_num += 1
                if not line.strip(): continue # Skip empty lines
                try:
                    data_list.append(json.loads(line))
                except json.JSONDecodeError as jde:
                    print(f"!! JSON Decode Error on line {line_num}: {jde}. Skipping line.")
                    skipped_lines += 1
                    continue
        print(f"Successfully parsed {len(data_list)} lines. Skipped {skipped_lines} lines.")
        if not data_list: return None
        return pd.DataFrame(data_list)
    except FileNotFoundError:
        print(f"!! Error: File not found at {file_path}")
        return None
    except Exception as e:
        print(f"!! Error during manual line-by-line reading: {e}")
        return None

In [None]:
# Processing Function for GeRMS-AT
def process_germs_jsonl_df(df: pd.DataFrame, df_name: str,
                           text_col: str = 'text', # Default text col name
                           annotation_col: str = 'annotations' # Default annotation col name
                           ) -> pd.DataFrame | None:
    """
    Processes a GeRMS-AT DataFrame (train or test) from JSONL.
    Handles different annotation column names.
    """
    if df is None or df.empty:
        print(f"Skipping processing for {df_name} as input DataFrame is None or empty.")
        return None

    print(f"\nProcessing {df_name} DataFrame (using text='{text_col}', annotations='{annotation_col}')...")
    try:
        # Verify columns exist using the passed arguments
        if text_col not in df.columns or annotation_col not in df.columns:
            print(f"!! Error: Expected columns '{text_col}' and '{annotation_col}' not found in {df_name}.")
            print("Available columns:", df.columns)
            return None

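        # Label Mapping: '0-Kein' counts as a "no" vote, every other listed level as a "yes" vote.
        # Label strings that are not keys of this mapping are ignored when counting votes.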
        vote_mapping = {
            '0-Kein': 0, '1-Implizit': 1, '2-Explizit': 1, '3-Verdeckt': 1, '4-Extrem': 1
        }

        processed_rows = []
        for index, row in df.iterrows():
            processed_row = {}
            processed_row['text'] = str(row[text_col]).lower() if pd.notna(row[text_col]) else ""

            # Aggregate Annotations - Using the passed annotation_col name
            annotations_data = row[annotation_col]
            yes_votes, no_votes, valid_votes = 0, 0, 0

            # Decode a JSON string representation first, so the structure check below sees a list
            if isinstance(annotations_data, str):
                try: annotations_data = json.loads(annotations_data)
                except json.JSONDecodeError: annotations_data = []

            # Check if the data has label info or just an annotator list
            # If the first item in annotations_data is dict -> likely has labels
            # If the first item is string -> likely just annotator IDs
            has_labels_in_test = False
            if isinstance(annotations_data, list) and annotations_data:
                 if isinstance(annotations_data[0], dict):
                      has_labels_in_test = True
                 # Specific check for an 'annotators' column that holds IDs only
                 elif annotation_col == 'annotators' and isinstance(annotations_data[0], str):
                      print(f"Info: '{annotation_col}' column in {df_name} appears to contain only annotator IDs, cannot derive label.")

            if has_labels_in_test: # Only process if label structures are present
                 if isinstance(annotations_data, list):
                     for annotation in annotations_data:
                         if isinstance(annotation, dict) and 'label' in annotation:
                             original_label = annotation['label']
                             vote = vote_mapping.get(original_label)
                             if vote == 1: yes_votes += 1; valid_votes += 1
                             elif vote == 0: no_votes += 1; valid_votes += 1

                 # Determine Final Label based on votes
                 if yes_votes > no_votes: processed_row['label'] = 1
                 else: processed_row['label'] = 0 # Default to 0 if tie or no labels found
            else:
                 # Assigning -1 is common for unlabeled test sets.
                 processed_row['label'] = -1 # Indicate missing label for test set

            processed_rows.append(processed_row)

        final_df = pd.DataFrame(processed_rows)

        if final_df.empty:
             print(f"Warning: Resulting DataFrame for {df_name} is empty.")
             return None

        print(f"\nProcessed {df_name} successfully. Final Shape: {final_df.shape}")
        print("Final numerical label value counts:")
        print(final_df['label'].value_counts()) # Check if -1 appears for test set
        return final_df

    except Exception as e:
        print(f"!! Error during processing {df_name}: {e}")
        return None


# Load Data
print("--- Loading GeRMS-AT Data ---")
df_train_germs = load_jsonl_robustly(germs_path)

# Apply Processing with Correct Column Names
print("\n--- Applying Processing Functions ---")
# For Train data, specify annotation_col='annotations'
processed_train_germs_df = process_germs_jsonl_df(df_train_germs, "GeRMS-AT Train",
                                                   text_col='text',
                                                   annotation_col='annotations')

# Combine into DatasetDict
print("\n--- Creating Final DatasetDict ---")
final_germs_dataset_dict = None

print("\n--- Applying Processing Function to GeRMS-AT Train Data ---")
processed_train_germs_df = process_germs_jsonl_df(df_train_germs, "GeRMS-AT Train",
                                                   text_col='text',
                                                   annotation_col='annotations')


# Split Processed Train Data into Train/Validation
print("\n--- Splitting Processed GeRMS-AT Train into Train/Validation ---")
final_germs_dataset_dict = None
if processed_train_germs_df is not None and not processed_train_germs_df.empty:
    try:
        # Convert full processed train DataFrame to Dataset
        full_train_ds = datasets.Dataset.from_pandas(processed_train_germs_df, preserve_index=False)

        # Cast label to ClassLabel FOR STRATIFICATION
        print("Casting label to ClassLabel...")
        full_train_ds = full_train_ds.cast_column("label", datasets.ClassLabel(num_classes=2, names=['Not Problematic', 'Problematic']))
        print(f"Features after casting: {full_train_ds.features}")

        # Split into final train and validation sets (e.g., 80/20 or 90/10)
        split_percentage = 0.2 # Use 20% for validation
        print(f"Splitting into train ({1-split_percentage:.0%}) / validation ({split_percentage:.0%})...")
        train_val_dict = full_train_ds.train_test_split(test_size=split_percentage, stratify_by_column='label', seed=42)

        final_germs_dataset_dict = datasets.DatasetDict({
            'train': train_val_dict['train'],
            'validation': train_val_dict['test'] # Rename 'test' split from train_test_split to 'validation'
        })
        print("Successfully created GeRMS-AT DatasetDict with Train/Validation splits:")
        print(final_germs_dataset_dict)

    except Exception as e:
        print(f"!! Error creating/splitting DatasetDict: {e}")
else:
    print("Skipping DatasetDict creation/splitting due to processing errors or empty DataFrame.")


# Save the NEW Train/Validation DatasetDict
print("\n--- Saving Processed GeRMS-AT Train/Validation Splits ---")
if final_germs_dataset_dict:
    print(f"Saving processed GeRMS-AT dataset to: {save_path_germs}")
    try:
        final_germs_dataset_dict.save_to_disk(save_path_germs)
        print("-> GeRMS-AT dataset saved successfully.")
    except Exception as e:
        print(f"!! Error saving GeRMS-AT dataset: {e}")
else:
    print("Skipping save as DatasetDict was not created.")

print("\n--- GeRMS-AT Dataset Processing Complete (Train/Validation Splits Created) ---")

--- Loading GeRMS-AT Data ---
Attempting robust load for: /content/drive/MyDrive/SwedishHateSpeechProject/data/germeval-competition-traindev.jsonl
Successfully parsed 5998 lines. Skipped 0 lines.

--- Applying Processing Functions ---

Processing GeRMS-AT Train DataFrame (using text='text', annotations='annotations')...

Processed GeRMS-AT Train successfully. Final Shape: (5998, 2)
Final numerical label value counts:
label
0    5675
1     323
Name: count, dtype: int64

--- Creating Final DatasetDict ---

--- Applying Processing Function to GeRMS-AT Train Data ---

Processing GeRMS-AT Train DataFrame (using text='text', annotations='annotations')...

Processed GeRMS-AT Train successfully. Final Shape: (5998, 2)
Final numerical label value counts:
label
0    5675
1     323
Name: count, dtype: int64

--- Splitting Processed GeRMS-AT Train into Train/Validation ---
Casting label to ClassLabel...


Features after casting: {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['Not Problematic', 'Problematic'], id=None)}
Splitting into train (80%) / validation (20%)...
Successfully created GeRMS-AT DatasetDict with Train/Validation splits:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4798
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
})

--- Saving Processed GeRMS-AT Train/Validation Splits ---
Saving processed GeRMS-AT dataset to: /content/drive/MyDrive/SwedishHateSpeechProject/processed_data/germs_at_processed_hf


-> GeRMS-AT dataset saved successfully.

--- GeRMS-AT Dataset Processing Complete (Train/Validation Splits Created) ---
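
The aggregation rule inside `process_germs_jsonl_df` can be looked at in isolation. The sketch below is a minimal re-statement of that rule under the same mapping: a text is labelled 1 only when "yes" votes strictly outnumber "no" votes, and label strings missing from the mapping contribute no vote at all. The helper names `majority_label` and `unmapped_labels`, the `'user'` key, and the sample rows are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter

# Mapping copied from the processing cell above.
VOTE_MAPPING = {
    '0-Kein': 0, '1-Implizit': 1, '2-Explizit': 1, '3-Verdeckt': 1, '4-Extrem': 1
}

def majority_label(annotations, mapping=VOTE_MAPPING):
    """Return 1 if mapped 'yes' votes outnumber 'no' votes, else 0 (ties and empty lists give 0)."""
    yes_votes = no_votes = 0
    for annotation in annotations:
        if isinstance(annotation, dict) and 'label' in annotation:
            vote = mapping.get(annotation['label'])  # None for unknown label strings
            if vote == 1:
                yes_votes += 1
            elif vote == 0:
                no_votes += 1
    return 1 if yes_votes > no_votes else 0

def unmapped_labels(rows, mapping=VOTE_MAPPING):
    """Count label strings that the mapping would silently ignore."""
    counts = Counter()
    for annotations in rows:
        for annotation in annotations:
            label = annotation.get('label')
            if label not in mapping:
                counts[label] += 1
    return counts

# Hypothetical rows; the 'user' key and all values are made up for illustration.
rows = [
    [{'user': 'A1', 'label': '0-Kein'}, {'user': 'A2', 'label': '4-Extrem'}, {'user': 'A3', 'label': '4-Extrem'}],
    [{'user': 'A1', 'label': '0-Kein'}, {'user': 'A2', 'label': '4-Extrem'}],  # tie -> 0
    [{'user': 'A1', 'label': 'not-in-mapping'}],                              # ignored vote -> 0
]
print([majority_label(r) for r in rows])  # [1, 0, 0]
print(unmapped_labels(rows))              # Counter({'not-in-mapping': 1})
```

Because ignored votes push a text toward label 0, running a check like `unmapped_labels` over the raw annotations is a cheap way to confirm that every label string in the file is actually covered by the mapping.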


## Model-Specific Tokenization

Each model requires its own tokenizer. The code below tokenizes the pre-processed dataset dictionaries with the tokenizer that matches each transformer model:

* bert-base-multilingual-cased (mBERT)
* xlm-roberta-base (XLM-R)
* KB/bert-base-swedish-cased (KB-BERT)

In [None]:
import datasets
import os

# Configuration
DRIVE_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/'
PROCESSED_DATA_PATH = DRIVE_PATH + 'processed_data/'

# Define specific paths
load_path_dk = os.path.join(PROCESSED_DATA_PATH, 'dkhate_processed_hf')
load_path_sw = os.path.join(PROCESSED_DATA_PATH, 'biaswe_processed_hf')
load_path_germs = os.path.join(PROCESSED_DATA_PATH, 'germs_at_processed_hf')
load_path_ud = os.path.join(PROCESSED_DATA_PATH, 'ud_misogyny_processed_hf')
load_path_eacl = os.path.join(PROCESSED_DATA_PATH, 'eacl_misogyny_processed_hf')

# Load Datasets
print("Loading processed datasets from disk...")

# Load DKhate
try:
    loaded_dk_dataset_dict = datasets.load_from_disk(load_path_dk)
    print("\nSuccessfully loaded processed DKhate dataset:")
    print(loaded_dk_dataset_dict)
except Exception as e:
    print(f"\n!! Error loading DKhate dataset from {load_path_dk}: {e}")
    loaded_dk_dataset_dict = None # Keep the name defined so later cells can skip it

# Load BiaSWE
try:
    loaded_sw_dataset_dict = datasets.load_from_disk(load_path_sw)
    print("\nSuccessfully loaded processed BiaSWE dataset:")
    print(loaded_sw_dataset_dict)
except Exception as e:
    print(f"\n!! Error loading BiaSWE dataset from {load_path_sw}: {e}")
    loaded_sw_dataset_dict = None # Keep the name defined so later cells can skip it

# Load & Combine English Datasets
# Variables to hold loaded data
loaded_ud_dataset_dict = None
loaded_eacl_dataset_dict = None
combined_eng_dataset_dict = None # Initialize combined dict

# Load the processed datasets
print("Loading processed English datasets...")
try:
    loaded_ud_dataset_dict = datasets.load_from_disk(load_path_ud)
    loaded_eacl_dataset_dict = datasets.load_from_disk(load_path_eacl)
    print("\nSuccessfully loaded processed English datasets:")
    print("UD:", loaded_ud_dataset_dict)
    print("EACL:", loaded_eacl_dataset_dict)
    print("\nUD Features:", loaded_ud_dataset_dict['train'].features)
    print("EACL Features:", loaded_eacl_dataset_dict['train'].features)

    # Concatenate Datasets
    print("\nAttempting to concatenate datasets...")
    # Ensure both dictionaries and required splits exist before concatenating
    if loaded_ud_dataset_dict and loaded_eacl_dataset_dict and \
       'train' in loaded_ud_dataset_dict and 'train' in loaded_eacl_dataset_dict and \
       'validation' in loaded_ud_dataset_dict and 'validation' in loaded_eacl_dataset_dict:

        combined_eng_dataset_dict = datasets.DatasetDict({
            'train': datasets.concatenate_datasets([loaded_ud_dataset_dict['train'], loaded_eacl_dataset_dict['train']]),
            'validation': datasets.concatenate_datasets([loaded_ud_dataset_dict['validation'], loaded_eacl_dataset_dict['validation']])
        })
        print("\nSuccessfully concatenated English datasets:")
        print(combined_eng_dataset_dict)
        print("\nCombined Train Features:", combined_eng_dataset_dict['train'].features)
    else:
        print("!! Cannot concatenate: One or both datasets/splits are missing.")

except FileNotFoundError:
    print(f"!! Error: Could not load one or both datasets.")
    print(f"  Check paths: {load_path_ud}, {load_path_eacl}")
except Exception as e:
    print(f"\n!! An error occurred during loading or casting: {e}")
    import traceback
    traceback.print_exc()

if combined_eng_dataset_dict:
     print("\nReady to tokenize combined_eng_dataset_dict.")
else:
     print("\nCombined English dataset was not created due to errors.")

# Load GeRMS-AT
try:
    loaded_germs_dataset_dict = datasets.load_from_disk(load_path_germs)
    print("\nSuccessfully loaded processed GeRMS-AT dataset:")
    print(loaded_germs_dataset_dict)
except Exception as e:
    print(f"\n!! Error loading GeRMS-AT dataset from {load_path_germs}: {e}")
    loaded_germs_dataset_dict = None # Keep the name defined so later cells can skip it

Loading processed datasets from disk...

Successfully loaded processed DKhate dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2960
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 329
    })
})

Successfully loaded processed BiaSWE dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 150
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 150
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 150
    })
})
Loading processed English datasets...

Successfully loaded processed English datasets:
UD: DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1828
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 458
    })
})
EACL: DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5264
    })
    

In [None]:
from transformers import AutoTokenizer
import datasets
import os

# Configuration
PROCESSED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/processed_data/'
TOKENIZED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/'
os.makedirs(TOKENIZED_DATA_PATH, exist_ok=True)

# Define model checkpoints
model_ckpt_mbert = "bert-base-multilingual-cased"
model_ckpt_xlmr = "xlm-roberta-base"
model_ckpt_kbbert = "KB/bert-base-swedish-cased"

MAX_LENGTH = 128 # Alternative 256

# Load Tokenizers
print("Loading tokenizers...")
tokenizer_mbert = AutoTokenizer.from_pretrained(model_ckpt_mbert)
tokenizer_xlmr = AutoTokenizer.from_pretrained(model_ckpt_xlmr)
tokenizer_kbbert = AutoTokenizer.from_pretrained(model_ckpt_kbbert)
print("Tokenizers loaded.")

# Define Tokenization Function
def tokenize_function(examples, tokenizer):
    """Applies tokenizer to text examples."""
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH)

# Define Function to Tokenize and Save a DatasetDict
def tokenize_and_save(dataset_dict, tokenizer, save_path, dataset_name, model_suffix):
    """Tokenizes all splits in a DatasetDict and saves the result."""
    if dataset_dict is None:
        print(f"Skipping tokenization for {dataset_name} ({model_suffix}) as input is None.")
        return None
    print(f"\nTokenizing {dataset_name} for {model_suffix}...")
    try:
        tokenized_dsd = dataset_dict.map(
            tokenize_function,
            batched=True,
            fn_kwargs={"tokenizer": tokenizer}, # Pass tokenizer to the function
            remove_columns=["text"] # Remove original text column
        )
        # Set format for PyTorch/TF if needed later, or do it upon loading
        # tokenized_dsd.set_format("torch", columns=["input_ids", "attention_mask", "token_type_ids", "label"]) # Adjust columns based on tokenizer type

        print(f"Tokenization complete for {dataset_name} ({model_suffix}).")
        print(f"Saving tokenized data to: {save_path}")
        tokenized_dsd.save_to_disk(save_path)
        print(f"-> Saved {dataset_name} ({model_suffix}) successfully.")
        return tokenized_dsd
    except Exception as e:
        print(f"!! Error tokenizing/saving {dataset_name} ({model_suffix}): {e}")
        import traceback
        traceback.print_exc()
        return None

# Tokenize Datasets for Each Model

# For mBERT
save_path_eng_mbert = os.path.join(TOKENIZED_DATA_PATH, 'combined_english_mbert')
save_path_dk_mbert = os.path.join(TOKENIZED_DATA_PATH, 'dkhate_mbert')
save_path_ge_mbert = os.path.join(TOKENIZED_DATA_PATH, 'germs_at_mbert')
save_path_sw_mbert = os.path.join(TOKENIZED_DATA_PATH, 'biaswe_mbert')

tokenized_eng_mbert = tokenize_and_save(combined_eng_dataset_dict, tokenizer_mbert, save_path_eng_mbert, "English Combined", "mBERT")
tokenized_dk_mbert = tokenize_and_save(loaded_dk_dataset_dict, tokenizer_mbert, save_path_dk_mbert, "DKhate", "mBERT")
tokenized_ge_mbert = tokenize_and_save(loaded_germs_dataset_dict, tokenizer_mbert, save_path_ge_mbert, "GeRMS-AT", "mBERT")
tokenized_sw_mbert = tokenize_and_save(loaded_sw_dataset_dict, tokenizer_mbert, save_path_sw_mbert, "BiaSWE", "mBERT")

# For XLM-R
save_path_eng_xlmr = os.path.join(TOKENIZED_DATA_PATH, 'combined_english_xlmr')
save_path_dk_xlmr = os.path.join(TOKENIZED_DATA_PATH, 'dkhate_xlmr')
save_path_ge_xlmr = os.path.join(TOKENIZED_DATA_PATH, 'germs_at_xlmr')
save_path_sw_xlmr = os.path.join(TOKENIZED_DATA_PATH, 'biaswe_xlmr')

tokenized_eng_xlmr = tokenize_and_save(combined_eng_dataset_dict, tokenizer_xlmr, save_path_eng_xlmr, "English Combined", "XLM-R")
tokenized_dk_xlmr = tokenize_and_save(loaded_dk_dataset_dict, tokenizer_xlmr, save_path_dk_xlmr, "DKhate", "XLM-R")
tokenized_ge_xlmr = tokenize_and_save(loaded_germs_dataset_dict, tokenizer_xlmr, save_path_ge_xlmr, "GeRMS-AT", "XLM-R")
tokenized_sw_xlmr = tokenize_and_save(loaded_sw_dataset_dict, tokenizer_xlmr, save_path_sw_xlmr, "BiaSWE", "XLM-R")

# For KB-BERT (Swedish Data Only)
save_path_sw_kbbert = os.path.join(TOKENIZED_DATA_PATH, 'biaswe_kbbert')
tokenized_sw_kbbert = tokenize_and_save(loaded_sw_dataset_dict, tokenizer_kbbert, save_path_sw_kbbert, "BiaSWE", "KB-BERT")

print("\n--- Tokenization Process Complete ---")

Loading tokenizers...


Tokenizers loaded.

Tokenizing English Combined for mBERT...
Tokenization complete for English Combined (mBERT).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/combined_english_mbert


-> Saved English Combined (mBERT) successfully.

Tokenizing DKhate for mBERT...
Tokenization complete for DKhate (mBERT).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/dkhate_mbert


-> Saved DKhate (mBERT) successfully.

Tokenizing GeRMS-AT for mBERT...


Tokenization complete for GeRMS-AT (mBERT).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/germs_at_mbert


-> Saved GeRMS-AT (mBERT) successfully.

Tokenizing BiaSWE for mBERT...
Tokenization complete for BiaSWE (mBERT).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_mbert


-> Saved BiaSWE (mBERT) successfully.

Tokenizing English Combined for XLM-R...
Tokenization complete for English Combined (XLM-R).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/combined_english_xlmr


-> Saved English Combined (XLM-R) successfully.

Tokenizing DKhate for XLM-R...


Tokenization complete for DKhate (XLM-R).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/dkhate_xlmr


-> Saved DKhate (XLM-R) successfully.

Tokenizing GeRMS-AT for XLM-R...


Tokenization complete for GeRMS-AT (XLM-R).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/germs_at_xlmr


-> Saved GeRMS-AT (XLM-R) successfully.

Tokenizing BiaSWE for XLM-R...
Tokenization complete for BiaSWE (XLM-R).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_xlmr


-> Saved BiaSWE (XLM-R) successfully.

Tokenizing BiaSWE for KB-BERT...


Tokenization complete for BiaSWE (KB-BERT).
Saving tokenized data to: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_kbbert


-> Saved BiaSWE (KB-BERT) successfully.

--- Tokenization Process Complete ---
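
The `tokenize_function` above calls the tokenizer with `truncation=True`, `padding="max_length"` and `max_length=MAX_LENGTH`, so every example ends up with exactly `MAX_LENGTH` positions. The sketch below mimics only that length handling on a list of token ids, using the standard library. It is a simplification: real tokenizers use their own padding id and keep the closing special token when truncating, and the ids shown here are made up. `pad_or_truncate` and `PAD_ID` are hypothetical names.

```python
MAX_LENGTH = 8  # small value for illustration; the notebook uses 128
PAD_ID = 0      # hypothetical padding id; each real tokenizer defines its own

def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Cut the sequence to max_length, then right-pad it and build the attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short_example = pad_or_truncate([101, 7592, 102])       # shorter than MAX_LENGTH
long_example = pad_or_truncate(list(range(1, 13)))      # 12 ids, longer than MAX_LENGTH

print(short_example["input_ids"])       # [101, 7592, 102, 0, 0, 0, 0, 0]
print(short_example["attention_mask"])  # [1, 1, 1, 0, 0, 0, 0, 0]
print(long_example["input_ids"])        # [1, 2, 3, 4, 5, 6, 7, 8]
```

Padding everything to a fixed length keeps the saved datasets simple to batch, at the cost of some wasted computation on short texts.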


## Training Setup using Hugging Face Trainer

In [None]:
import datasets
from datasets import load_from_disk, DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
import torch # Ensure PyTorch is available
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os

# Configuration
# Paths where TOKENIZED data is saved
TOKENIZED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/'
# Base path for saving model checkpoints and results for EACH experiment
RESULTS_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/results/'
os.makedirs(RESULTS_PATH, exist_ok=True)

# Model checkpoints
model_ckpt_mbert = "bert-base-multilingual-cased"
model_ckpt_xlmr = "xlm-roberta-base"
model_ckpt_kbbert = "KB/bert-base-swedish-cased" # Verify exact name

# Define Compute Metrics Function
def compute_metrics(p):
    """Computes classification metrics."""
    preds = np.argmax(p.predictions, axis=1)
    labels = p.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary'
    )
    acc = accuracy_score(labels, preds)
    # You can add more metrics here if needed
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Define Function to Load Tokenized Data
def load_tokenized_dataset(path: str) -> DatasetDict | None:
    """Loads a tokenized dataset saved to disk."""
    if not os.path.exists(path):
        print(f"!! Error: Dataset not found at {path}")
        return None
    try:
        loaded_dsd = load_from_disk(path)
        print(f"Successfully loaded dataset from {path}")
        # No explicit tensor format is set here; the Trainer is left to convert
        # the columns itself. If format issues arise, uncomment the line below.
        # loaded_dsd.set_format("torch") # Simple version
        print(loaded_dsd)
        return loaded_dsd
    except Exception as e:
        print(f"!! Error loading dataset from {path}: {e}")
        return None

## Experiment 1: English-Only Training (mBERT, XLM-R)

In [None]:
print("\n--- Starting Experiment 1: English-Only Training ---")

# Load Tokenized English Data
tokenized_eng_mbert_path = os.path.join(TOKENIZED_DATA_PATH, 'combined_english_mbert')
tokenized_eng_xlmr_path = os.path.join(TOKENIZED_DATA_PATH, 'combined_english_xlmr')

dsd_eng_mbert = load_tokenized_dataset(tokenized_eng_mbert_path)
dsd_eng_xlmr = load_tokenized_dataset(tokenized_eng_xlmr_path)

# Train Model Function
def train_model(model_checkpoint: str,
                tokenized_dataset: DatasetDict,
                output_dir_suffix: str,
                num_labels: int = 2,
                num_epochs: int = 3,
                learning_rate: float = 2e-5,
                batch_size: int = 16
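                ):
    """Fine-tunes `model_checkpoint` on the 'train' split, evaluating on 'validation' after every epoch.

    The checkpoint with the best validation F1 is reloaded at the end and saved
    under RESULTS_PATH. Returns the Trainer, or None if training fails.
    """
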
    run_name = f"{os.path.basename(model_checkpoint)}_{output_dir_suffix}"
    output_dir = os.path.join(RESULTS_PATH, run_name)
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,
        learning_rate=learning_rate,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        report_to="none",
        seed=42,
    )
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        compute_metrics=compute_metrics,
    )
    print(f"Starting training for {run_name}...")
    try:
        trainer.train()
        print("Training finished.")
        trainer.save_model(output_dir) # Save best model
        print(f"Best model saved to {output_dir}")
        return trainer
    except Exception as e:
        print(f"!! Error during training for {run_name}: {e}")
        import traceback
        traceback.print_exc()
        return None


--- Starting Experiment 1: English-Only Training ---
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/combined_english_mbert
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7092
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1761
    })
})
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/combined_english_xlmr
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 7092
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1761
    })
})


## Run Training for Exp 1

In [None]:
# Report whether a GPU is available (the Trainer uses it automatically when present)
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
print(f"Using {device} device")

# Train mBERT on English
trainer_mbert_eng = train_model(model_ckpt_mbert, dsd_eng_mbert, "mBERT_eng_only_lr2e-5_b16_ep6", num_epochs=6, learning_rate=2e-5, batch_size=16)
# Train XLM-R on English
trainer_xlmr_eng = train_model(model_ckpt_xlmr, dsd_eng_xlmr, "XLM-R_eng_only_BEST_lr2e-5_batch32_ep6", num_epochs=6, learning_rate=2e-5, batch_size=32)

print("\n--- Experiment 1 Training Calls Complete ---")

Using cuda device


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training for bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3243 | 0.244527 | 0.90971 | 0.725389 | 0.864198 | 0.625 |
| 2 | 0.2123 | 0.283324 | 0.914253 | 0.762205 | 0.809365 | 0.720238 |
| 3 | 0.1523 | 0.319539 | 0.911414 | 0.75625 | 0.796053 | 0.720238 |
| 4 | 0.0929 | 0.455421 | 0.897785 | 0.73913 | 0.720339 | 0.758929 |
| 5 | 0.0451 | 0.540042 | 0.906871 | 0.752266 | 0.763804 | 0.741071 |
| 6 | 0.0242 | 0.534121 | 0.916525 | 0.769231 | 0.813953 | 0.729167 |


Training finished.
Best model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6



Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training for xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4265 | 0.27199 | 0.898921 | 0.689895 | 0.831933 | 0.589286 |
| 2 | 0.2732 | 0.270798 | 0.898921 | 0.732733 | 0.739394 | 0.72619 |
| 3 | 0.2054 | 0.274108 | 0.903464 | 0.759887 | 0.723118 | 0.800595 |
| 4 | 0.1547 | 0.315028 | 0.911982 | 0.769688 | 0.768546 | 0.770833 |
| 5 | 0.118 | 0.363766 | 0.913685 | 0.772455 | 0.777108 | 0.767857 |
| 6 | 0.1013 | 0.350257 | 0.919364 | 0.780186 | 0.812903 | 0.75 |


Training finished.
Best model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6

--- Experiment 1 Training Calls Complete ---
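
The F1 column in the training logs is the harmonic mean of precision and recall for the positive class, and `load_best_model_at_end=True` with `metric_for_best_model="f1"` keeps the checkpoint whose validation F1 is highest. The sketch below recomputes F1 from the precision and recall values of the mBERT log above and picks the best epoch. The numbers are copied from that log; `f1_from_pr` is a hypothetical helper name, and this is only a consistency check, not the Trainer's own selection code.

```python
# (epoch, precision, recall, logged F1) copied from the mBERT training log above
MBERT_LOG = [
    (1, 0.864198, 0.625,    0.725389),
    (2, 0.809365, 0.720238, 0.762205),
    (3, 0.796053, 0.720238, 0.75625),
    (4, 0.720339, 0.758929, 0.73913),
    (5, 0.763804, 0.741071, 0.752266),
    (6, 0.813953, 0.729167, 0.769231),
]

def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Largest difference between the recomputed and the logged F1 values
max_gap = max(abs(f1_from_pr(p, r) - logged) for _, p, r, logged in MBERT_LOG)
print(f"max |recomputed - logged| = {max_gap:.1e}")

# Epoch with the highest validation F1, i.e. the checkpoint that is kept
best_epoch = max(MBERT_LOG, key=lambda row: row[3])[0]
print(best_epoch)  # 6
```

For this run the best validation F1 falls on the last epoch, even though validation loss rises steadily after epoch 1, so selecting by loss instead of F1 would have kept a different checkpoint.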


## Experiment 2 (Cross-lingual Data Impact)

In [None]:
import datasets
from datasets import load_from_disk, DatasetDict, concatenate_datasets, ClassLabel, Features, Value
import os
import traceback

#  Load Required Tokenized Datasets
print("--- Loading Tokenized Datasets for Cross-Lingual Training ---")
dsd_eng_mbert = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'combined_english_mbert'))
dsd_ge_mbert = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'germs_at_mbert'))
dsd_dk_mbert = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'dkhate_mbert'))

dsd_eng_xlmr = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'combined_english_xlmr'))
dsd_ge_xlmr = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'germs_at_xlmr'))
dsd_dk_xlmr = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'dkhate_xlmr'))

#  Combine Datasets for mBERT
print("\n--- Combining Datasets for mBERT (Eng+De+Da) ---")
combined_dsd_mbert = None
try:
    # Check that all datasets were successfully loaded and cast
    if dsd_eng_mbert and dsd_ge_mbert and dsd_dk_mbert and \
       'train' in dsd_eng_mbert and 'train' in dsd_ge_mbert and 'train' in dsd_dk_mbert and \
       'validation' in dsd_eng_mbert and 'validation' in dsd_ge_mbert: # Check validation splits exist

        mbert_train_splits = [dsd_eng_mbert['train'], dsd_ge_mbert['train'], dsd_dk_mbert['train']]
        mbert_val_splits = [dsd_eng_mbert['validation'], dsd_ge_mbert['validation']]
        # Add dkhate validation only if it exists and was processed
        # if dsd_dk_mbert and 'validation' in dsd_dk_mbert: mbert_val_splits.append(dsd_dk_mbert['validation'])

        combined_train_mbert = concatenate_datasets(mbert_train_splits)
        combined_val_mbert = concatenate_datasets(mbert_val_splits)
        combined_train_mbert = combined_train_mbert.shuffle(seed=42)

        combined_dsd_mbert = DatasetDict({'train': combined_train_mbert,'validation': combined_val_mbert})
        print("Combined mBERT DatasetDict (Eng+De+Da) created:")
        print(combined_dsd_mbert)
    else:
        print("!! Could not combine mBERT datasets: One or more source datasets/splits missing or failed casting.")

except Exception as e:
    print(f"!! Error combining datasets for mBERT: {e}")
    traceback.print_exc()


#  Combine Datasets for XLM-R
print("\n--- Combining Datasets for XLM-R (Eng+De+Da) ---")
combined_dsd_xlmr = None
try:
    # Check that all datasets were successfully loaded and cast
    if dsd_eng_xlmr and dsd_ge_xlmr and dsd_dk_xlmr and \
       'train' in dsd_eng_xlmr and 'train' in dsd_ge_xlmr and 'train' in dsd_dk_xlmr and \
       'validation' in dsd_eng_xlmr and 'validation' in dsd_ge_xlmr: # Check validation splits exist

        xlmr_train_splits = [dsd_eng_xlmr['train'], dsd_ge_xlmr['train'], dsd_dk_xlmr['train']]
        xlmr_val_splits = [dsd_eng_xlmr['validation'], dsd_ge_xlmr['validation']]
        # Add dkhate validation only if it exists and was processed
        # if dsd_dk_xlmr and 'validation' in dsd_dk_xlmr: xlmr_val_splits.append(dsd_dk_xlmr['validation'])

        combined_train_xlmr = concatenate_datasets(xlmr_train_splits)
        combined_val_xlmr = concatenate_datasets(xlmr_val_splits)
        combined_train_xlmr = combined_train_xlmr.shuffle(seed=42)

        combined_dsd_xlmr = DatasetDict({'train': combined_train_xlmr,'validation': combined_val_xlmr})
        print("Combined XLM-R DatasetDict (Eng+De+Da) created:")
        print(combined_dsd_xlmr)
    else:
        print("!! Could not combine XLM-R datasets: One or more source datasets/splits missing or failed casting.")

except Exception as e:
    print(f"!! Error combining datasets for XLM-R: {e}")
    traceback.print_exc()

--- Loading Tokenized Datasets for Cross-Lingual Training ---
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/combined_english_mbert
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7092
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1761
    })
})
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/germs_at_mbert
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4798
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1200
    })
})
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/dkhate_mbert
DatasetDict({
    train: Dataset({
    

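The combining cell reports only that "one or more" datasets or splits are missing. A minimal sketch of a more specific check follows, assuming each loaded object behaves like a mapping from split name to data (a `DatasetDict` is a `dict` subclass) and is `None` when loading failed. The helper name `missing_splits` and the demo dictionaries are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def missing_splits(
    sources: Mapping[str, Optional[Mapping[str, object]]],
    required: Mapping[str, Sequence[str]],
) -> list:
    """Return (source_name, split_name) pairs that are absent.

    A source that failed to load (None) is reported as '<not loaded>'.
    """
    problems = []
    for name, splits in required.items():
        dsd = sources.get(name)
        if dsd is None:
            problems.append((name, "<not loaded>"))
            continue
        for split in splits:
            if split not in dsd:
                problems.append((name, split))
    return problems


# Plain dicts stand in for DatasetDict objects in this demo.
required_splits = {
    "english": ("train", "validation"),
    "german": ("train", "validation"),
    "danish": ("train",),  # only the train split is used in the cell above
}
ok_sources = {
    "english": {"train": [], "validation": []},
    "german": {"train": [], "validation": []},
    "danish": {"train": []},
}
broken_sources = {"english": {"train": []}, "german": None, "danish": {"train": []}}

print(missing_splits(ok_sources, required_splits))      # []
print(missing_splits(broken_sources, required_splits))
```

An empty list means the three sources can be concatenated; otherwise the list names exactly which source or split to fix.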
## Cross-Lingual Training

In [None]:
# Define suffix for these runs
cross_lingual_suffix = "eng_de_da"

# Check whether a GPU is available; if so, run everything on it
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
print(f"Using {device} device")

# Train mBERT on Eng+De+Da
print("\n--- Training mBERT on Combined Data (Eng+De+Da) ---")
trainer_mbert_cross = train_model(
    model_checkpoint=model_ckpt_mbert,
    tokenized_dataset=combined_dsd_mbert, # Use the combined dataset
    output_dir_suffix=cross_lingual_suffix,
    num_epochs=6,
    learning_rate=2e-5,
    batch_size=16
)

# Train XLM-R on Eng+De+Da
print("\n--- Training XLM-R on Combined Data (Eng+De+Da) ---")
trainer_xlmr_cross = train_model(
    model_checkpoint=model_ckpt_xlmr,
    tokenized_dataset=combined_dsd_xlmr,
    output_dir_suffix=cross_lingual_suffix,
    num_epochs=6,
    learning_rate=2e-5,
    batch_size=32
)

print("\n--- Cross-Lingual Training (Experiment 2) Calls Complete ---")
print(f"Models saved in folders ending with '{cross_lingual_suffix}'")

Using cuda device

--- Training mBERT on Combined Data (Eng+De+Da) ---


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training for bert-base-multilingual-cased_eng_de_da...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.2943,0.242995,0.924012,0.672489,0.807692,0.57606
2,0.2268,0.291422,0.925701,0.659443,0.869388,0.531172
3,0.1746,0.324765,0.911516,0.664962,0.682415,0.648379
4,0.1151,0.356219,0.924688,0.6754,0.811189,0.578554
5,0.0704,0.463005,0.91388,0.668401,0.69837,0.640898
6,0.0412,0.505368,0.914556,0.660403,0.715116,0.613466


Training finished.
Best model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/bert-base-multilingual-cased_eng_de_da

--- Training XLM-R on Combined Data (Eng+De+Da) ---




Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training for xlm-roberta-base_eng_de_da...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3288,0.319904,0.880784,0.503516,0.577419,0.446384
2,0.2447,0.288613,0.921986,0.618182,0.916667,0.466334
3,0.1956,0.232557,0.921648,0.698701,0.728997,0.670823
4,0.1563,0.279862,0.925363,0.692629,0.783019,0.620948
5,0.1233,0.317121,0.914218,0.681704,0.685139,0.678304
6,0.1042,0.33718,0.919959,0.688568,0.727778,0.653367


Training finished.
Best model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/xlm-roberta-base_eng_de_da

--- Cross-Lingual Training (Experiment 2) Calls Complete ---
Models saved in folders ending with 'eng_de_da'
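
In the mBERT log above, validation loss rises steadily while F1 stays roughly flat, so the "best" epoch depends on which metric is used for selection. The sketch below parses the logged table and reports the best epoch under each criterion. The helper `best_epoch` is introduced here for illustration. Which criterion `train_model` actually uses is set where that function is defined and is not shown in this section.

```python
import csv
import io

# mBERT Eng+De+Da training log, copied from the output above.
MBERT_CROSS_LOG = """Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.2943,0.242995,0.924012,0.672489,0.807692,0.57606
2,0.2268,0.291422,0.925701,0.659443,0.869388,0.531172
3,0.1746,0.324765,0.911516,0.664962,0.682415,0.648379
4,0.1151,0.356219,0.924688,0.6754,0.811189,0.578554
5,0.0704,0.463005,0.91388,0.668401,0.69837,0.640898
6,0.0412,0.505368,0.914556,0.660403,0.715116,0.613466
"""


def best_epoch(log_text, column, higher_is_better=True):
    """Return (epoch, value) for the best row of `column` in a CSV-style log."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    pick = max if higher_is_better else min
    best = pick(rows, key=lambda row: float(row[column]))
    return int(best["Epoch"]), float(best[column])


print(best_epoch(MBERT_CROSS_LOG, "F1"))  # (4, 0.6754)
print(best_epoch(MBERT_CROSS_LOG, "Validation Loss", higher_is_better=False))  # (1, 0.242995)
```

Selecting on F1 picks epoch 4, while selecting on validation loss picks epoch 1, which is worth keeping in mind when comparing saved checkpoints across runs.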


## Experiment 3 (Inoculation)

In [None]:
import datasets
from datasets import load_from_disk, DatasetDict, Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os
import traceback

# Configuration
TOKENIZED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/'
RESULTS_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/results/'
INOCULATION_RESULTS_PATH = os.path.join(RESULTS_PATH, 'inoculation_models/') # Subfolder for clarity
os.makedirs(INOCULATION_RESULTS_PATH, exist_ok=True)

# Model checkpoints (base names)
model_ckpt_mbert = "bert-base-multilingual-cased"
model_ckpt_xlmr = "xlm-roberta-base"

# Paths to English only models
best_mbert_eng_only_dir = os.path.join(RESULTS_PATH, "bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6")
best_xlmr_eng_only_dir = os.path.join(RESULTS_PATH, "xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6")

# Load Tokenized Swedish Data
print("\n--- Loading Tokenized Swedish (BiaSWE) Data for Inoculation ---")
dsd_sw_mbert = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'biaswe_mbert'))
dsd_sw_xlmr = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'biaswe_xlmr'))

# Prepare Inoculation Subset
inoculation_size = 20 # Same as KB-BERT low-data size
dsd_sw_inoc_mbert = None
dsd_sw_inoc_xlmr = None

def create_inoculation_set(full_swedish_dsd, size, model_suffix):
    """Creates the small inoculation training set."""
    inoc_set = None
    if full_swedish_dsd and 'train' in full_swedish_dsd:
        train_set = full_swedish_dsd['train']
        if len(train_set) >= size:
            print(f"\nCreating inoculation training set ({size} samples) for {model_suffix}...")
            # Ensure ClassLabel - important if loading saved data that might not have it
            if not isinstance(train_set.features['label'], datasets.ClassLabel):
                 print(f"Casting Swedish train label to ClassLabel for {model_suffix} inoculation set...")
                 try:
                     cl_feature = datasets.ClassLabel(num_classes=2, names=['Not Misogyny', 'Misogyny'])
                     train_set = train_set.cast_column("label", cl_feature)
                 except Exception as e: print(f"Warning: Failed to cast label for {model_suffix} inoculation subset selection: {e}")

            # Attempt stratified selection
            try:
                 # Use test_size logic to get a small train set
                 # train_test_split needs at least 2 samples per class for stratification usually
                 if size >= 2 * train_set.features['label'].num_classes:
                      inoc_split = train_set.train_test_split(train_size=size, stratify_by_column='label', seed=42)
                      inoc_train_set = inoc_split['train']
                 else:
                      print("Size too small for stratification, selecting randomly...")
                      # Seeded generator so the fallback subset is reproducible
                      indices = np.random.default_rng(42).choice(len(train_set), size, replace=False)
                      inoc_train_set = train_set.select(indices)

            except ValueError: # Stratification might fail
                 print("Stratified selection failed. Selecting randomly...")
                 # Seeded generator so the fallback subset is reproducible
                 indices = np.random.default_rng(42).choice(len(train_set), size, replace=False)
                 inoc_train_set = train_set.select(indices)

            inoc_set = datasets.DatasetDict({'train': inoc_train_set}) # Only need train split
            print(f"Inoculation set created for {model_suffix}: {inoc_set}")
        else:
            print(f"Warning: Train set size ({len(train_set)}) is less than desired inoculation_size ({size}).")
    else:
        print(f"Could not create inoculation set for {model_suffix}: Original Swedish dataset or train split not loaded.")
    return inoc_set

dsd_sw_inoc_mbert = create_inoculation_set(dsd_sw_mbert, inoculation_size, "mBERT")
dsd_sw_inoc_xlmr = create_inoculation_set(dsd_sw_xlmr, inoculation_size, "XLM-R")


# Inoculation Training Function
def run_inoculation(base_model_path: str,
                    inoculation_dataset: DatasetDict,
                    output_dir_suffix: str, # Specific suffix for this inoculation run
                    num_labels: int = 2,
                    num_epochs: int = 2,        # Fewer Epochs
                    learning_rate: float = 1e-5 # Lower Learning Rate
                   ):
    """Loads a base model, fine-tunes it on the inoculation set."""
    if inoculation_dataset is None or 'train' not in inoculation_dataset:
        print(f"Skipping inoculation for {base_model_path} - inoculation dataset invalid.")
        return None
    if not os.path.exists(base_model_path):
         print(f"Skipping inoculation: Base model path not found at {base_model_path}")
         return None

    print(f"\n--- Inoculating model from {os.path.basename(base_model_path)} ---")
    print(f"    Suffix: {output_dir_suffix}, Epochs: {num_epochs}, LR: {learning_rate}")

    # Load the previously fine-tuned base model (the English-only checkpoint in this notebook)
    try:
        model = AutoModelForSequenceClassification.from_pretrained(base_model_path, num_labels=num_labels)
        # Also load the corresponding tokenizer
        # Determine original checkpoint from path if possible, otherwise use default
        if "bert-base-multilingual-cased" in base_model_path:
            tokenizer_checkpoint = "bert-base-multilingual-cased"
        elif "xlm-roberta-base" in base_model_path:
            tokenizer_checkpoint = "xlm-roberta-base"
        else:
            print("Warning: Could not determine base tokenizer from path, using mBERT default.")
            tokenizer_checkpoint = "bert-base-multilingual-cased" # Fallback
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
    except Exception as e:
        print(f"!! Error loading base model or tokenizer from {base_model_path}: {e}")
        return None

    # Define Training Arguments for inoculation
    run_name = f"{os.path.basename(base_model_path)}_inoc_{output_dir_suffix}"
    output_dir = os.path.join(INOCULATION_RESULTS_PATH, run_name)

    # Rough step budget for logging/saving on the tiny inoculation set.
    # Note: this counts samples x epochs and ignores the batch size, so it
    # over-estimates the number of optimizer steps (20 samples, 2 epochs -> 40).
    num_training_steps = len(inoculation_dataset['train']) * num_epochs
    logging_steps = max(1, num_training_steps // 5) # Log every ~1/5 of that estimate
    save_steps = num_training_steps + 10 # Beyond any real step count: no intermediate checkpoints

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=4, # Small batch for the tiny inoculation set
        learning_rate=learning_rate,
        weight_decay=0.01,
        # Evaluation is usually not needed during inoculation
        eval_strategy="no",
        save_strategy="steps", # Save based on steps
        save_steps=save_steps, # Save only at the end
        logging_strategy="steps", # Log based on steps
        logging_steps=logging_steps,
        save_total_limit=1,          # Only keep the final checkpoint
        load_best_model_at_end=False,# Not applicable without evaluation
        report_to="none",
        seed=42,
    )

    # Instantiate Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=inoculation_dataset["train"],
        # No eval_dataset needed
        # compute_metrics=compute_metrics, # Not evaluating during training
        tokenizer=tokenizer # Pass tokenizer for default collation
    )

    # Train (Inoculate)
    print("Starting inoculation fine-tuning...")
    try:
        trainer.train()
        print("Inoculation finished.")
        trainer.save_model(output_dir) # Save the final model state
        tokenizer.save_pretrained(output_dir) # Save tokenizer alongside model
        print(f"Inoculated model saved to {output_dir}")
        return trainer
    except Exception as e:
        print(f"!! Error during inoculation for {run_name}: {e}")
        traceback.print_exc()
        return None


--- Loading Tokenized Swedish (BiaSWE) Data for Inoculation ---
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_mbert
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150
    })
    val: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150
    })
})
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_xlmr
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 150
    })
    val: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 150
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        

## Run Inoculation Experiments

In [None]:
# Check whether a GPU is available; if so, run everything on it
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
print(f"Using {device} device")

# 1. Inoculate best English-Only mBERT
print("\n--- Running Inoculation for mBERT (from Eng-Only) ---")
inoc_mbert_suffix = f"inoc_{inoculation_size}samp"
trainer_mbert_inoc = run_inoculation(
    base_model_path=best_mbert_eng_only_dir, # Path to best Eng-only mBERT
    inoculation_dataset=dsd_sw_inoc_mbert,   # Swedish data tokenized for mBERT
    output_dir_suffix=inoc_mbert_suffix,
    num_epochs=2,
    learning_rate=1e-5
)

# 2. Inoculate best English-Only XLM-R
print("\n--- Running Inoculation for XLM-R (from Eng-Only) ---")
inoc_xlmr_suffix = f"inoc_{inoculation_size}samp"
trainer_xlmr_inoc = run_inoculation(
    base_model_path=best_xlmr_eng_only_dir,  # Path to best Eng-only XLM-R
    inoculation_dataset=dsd_sw_inoc_xlmr,    # Swedish data tokenized for XLM-R
    output_dir_suffix=inoc_xlmr_suffix,
    num_epochs=2,
    learning_rate=1e-5
)

print("\n--- Inoculation Training Calls Complete ---")

Using cuda device

--- Running Inoculation for mBERT (from Eng-Only) ---

--- Inoculating model from bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6 ---
    Suffix: inoc_20samp, Epochs: 2, LR: 1e-05




Starting inoculation fine-tuning...


Step,Training Loss
8,2.7254


Inoculation finished.
Inoculated model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/inoculation_models/bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6_inoc_inoc_20samp

--- Running Inoculation for XLM-R (from Eng-Only) ---

--- Inoculating model from xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6 ---
    Suffix: inoc_20samp, Epochs: 2, LR: 1e-05




Starting inoculation fine-tuning...


Step,Training Loss
8,1.749


Inoculation finished.
Inoculated model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/inoculation_models/xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6_inoc_inoc_20samp

--- Inoculation Training Calls Complete ---
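
Each inoculation run above logs a single line, at step 8. That follows from how `run_inoculation` derives `logging_steps` from a samples-times-epochs estimate rather than from the number of optimizer steps. The sketch below reproduces the arithmetic, assuming a single device and no gradient accumulation. `inoculation_schedule` is a name introduced here for illustration.

```python
import math


def inoculation_schedule(num_samples, batch_size, num_epochs):
    """Reproduce the step arithmetic used in run_inoculation."""
    # Real optimizer steps (assumes one device, no gradient accumulation).
    optimizer_steps = math.ceil(num_samples / batch_size) * num_epochs
    # Estimate used in run_inoculation: samples * epochs (ignores batch size).
    estimated_steps = num_samples * num_epochs
    logging_steps = max(1, estimated_steps // 5)
    save_steps = estimated_steps + 10
    # Steps at which a training-loss line is logged.
    logged_at = list(range(logging_steps, optimizer_steps + 1, logging_steps))
    return {
        "optimizer_steps": optimizer_steps,
        "logging_steps": logging_steps,
        "save_steps": save_steps,
        "logged_at": logged_at,
    }


schedule = inoculation_schedule(num_samples=20, batch_size=4, num_epochs=2)
print(schedule)
# {'optimizer_steps': 10, 'logging_steps': 8, 'save_steps': 50, 'logged_at': [8]}
```

With 20 samples, batch size 4, and 2 epochs there are only 10 optimizer steps, so one loss value is logged and `save_steps=50` is never reached. The final model is written by the explicit `trainer.save_model(output_dir)` call.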


## Experiment 4 (Monolingual KB-BERT vs. Multilingual Transformers)

In [None]:
import datasets
from datasets import load_from_disk, DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os

# Configuration
TOKENIZED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/'
RESULTS_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/results/'
os.makedirs(RESULTS_PATH, exist_ok=True)

# Model checkpoint for KB-BERT
model_ckpt_kbbert = "KB/bert-base-swedish-cased"

# Load Tokenized Swedish Data (KB-BERT version)
print("\n--- Loading Tokenized Swedish (BiaSWE) Data for KB-BERT ---")
tokenized_sw_kbbert_path = os.path.join(TOKENIZED_DATA_PATH, 'biaswe_kbbert')
dsd_sw_kbbert = load_tokenized_dataset(tokenized_sw_kbbert_path)

# Create Low-Data Subset
low_data_size = 20
dsd_sw_kbbert_low_data = None
if dsd_sw_kbbert and 'train' in dsd_sw_kbbert and 'val' in dsd_sw_kbbert:
    train_set_kb = dsd_sw_kbbert['train']
    if len(train_set_kb) >= low_data_size:
        print(f"\nCreating low-data training set ({low_data_size} samples)...")
        # Ensure ClassLabel for stratification
        if not isinstance(train_set_kb.features['label'], datasets.ClassLabel):
             print("Casting Swedish train label to ClassLabel for low-data selection...")
             try:
                 class_label_feature_sw = datasets.ClassLabel(num_classes=2, names=['Not Misogyny', 'Misogyny'])
                 # Important: Cast on a copy or reload if you need the original untyped later
                 train_set_kb = train_set_kb.cast_column("label", class_label_feature_sw)
             except Exception as e: print(f"Warning: Failed to cast label for subset selection: {e}")

        # Attempt stratified selection
        try:
             low_data_split = train_set_kb.train_test_split(train_size=low_data_size, stratify_by_column='label', seed=42)
             low_data_train_set = low_data_split['train']
        except ValueError: # Stratification might fail
             print("Stratified selection failed (maybe too few samples per class). Selecting randomly...")
             # Seeded generator so the fallback subset is reproducible
             indices = np.random.default_rng(42).choice(len(train_set_kb), low_data_size, replace=False)
             low_data_train_set = train_set_kb.select(indices)

        # Create the final DatasetDict for low-data run
        dsd_sw_kbbert_low_data = datasets.DatasetDict({
            'train': low_data_train_set,
            'validation': dsd_sw_kbbert['val'] # Use original full validation set
        })
        print("Low-data Swedish dataset created:")
        print(dsd_sw_kbbert_low_data)
    else:
        print(f"Warning: Train set size ({len(train_set_kb)}) is less than desired low_data_size ({low_data_size}). Cannot create low-data set.")
else:
    print("Could not create low-data set: Original Swedish dataset or splits not loaded correctly.")


--- Loading Tokenized Swedish (BiaSWE) Data for KB-BERT ---
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_kbbert
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150
    })
    val: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150
    })
})

Creating low-data training set (20 samples)...
Casting Swedish train label to ClassLabel for low-data selection...
Low-data Swedish dataset created:
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 20
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150
    })
})
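
To make the stratified selection concrete, here is a standard-library sketch of drawing a 20-sample subset whose class proportions roughly match the full set. It illustrates the idea only and is not the `datasets` implementation behind `train_test_split(..., stratify_by_column='label')`. The label vector is hypothetical, since the BiaSWE train class balance is not shown in this section.

```python
import random
from collections import defaultdict


def stratified_sample(labels, size, seed=42):
    """Pick `size` indices so class proportions roughly match `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Proportional quota per class, rounded down; leftover slots go to the
    # classes with the largest fractional remainders.
    quotas = {c: size * len(ix) / len(labels) for c, ix in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    chosen = []
    for c, ix in by_class.items():
        chosen.extend(rng.sample(ix, counts[c]))
    return sorted(chosen)


# Hypothetical labels: 150 examples, 80 positive and 70 negative.
demo_labels = [1] * 80 + [0] * 70
picked = stratified_sample(demo_labels, size=20)
picked_pos = sum(demo_labels[i] for i in picked)
print(len(picked), picked_pos)  # 20 11
```

With an 80/70 split, the 20-sample subset contains 11 positives and 9 negatives, so both classes are represented even at this small size.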
Usin

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at KB/bert-base-swedish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training for bert-base-swedish-cased_swe_high_data...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6913,0.617774,0.706667,0.694444,0.78125,0.625
2,0.5732,0.551822,0.753333,0.778443,0.747126,0.8125
3,0.4569,0.521943,0.753333,0.741259,0.84127,0.6625
4,0.3416,0.484443,0.733333,0.75,0.75,0.75
5,0.2286,0.501402,0.746667,0.736111,0.828125,0.6625
6,0.1852,0.494953,0.746667,0.739726,0.818182,0.675


Training finished.
Best model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/bert-base-swedish-cased_swe_high_data

--- KB-BERT Baseline Training Calls Complete ---


## Training KB-BERT

In [None]:
# Check whether a GPU is available; if so, run everything on it
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
print(f"Using {device} device")

# 1. Train KB-BERT on Low Swedish Data
print("\n--- Training KB-BERT on Low Swedish Data ---")
trainer_kbbert_low = train_model(
    model_checkpoint=model_ckpt_kbbert,
    tokenized_dataset=dsd_sw_kbbert_low_data,
    output_dir_suffix="swe_low_data",
    num_epochs=10,
    learning_rate=3e-5
)

# 2. Train KB-BERT on All Swedish Data
print("\n--- Training KB-BERT on High (All) Swedish Data ---")
# Use standard parameters for the full dataset run
trainer_kbbert_high = train_model(
    model_checkpoint=model_ckpt_kbbert,
    tokenized_dataset=dsd_sw_kbbert,
    output_dir_suffix="swe_high_data",
    num_epochs=6,
    learning_rate=2e-5,
    batch_size=16
)

print("\n--- KB-BERT Baseline Training Calls Complete ---")

## Evaluation on the Unseen Swedish Test Set

In [None]:
import datasets
from datasets import load_from_disk, DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os
import pandas as pd

# Configuration
TOKENIZED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/'
RESULTS_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/results/'
INOCULATION_RESULTS_PATH = os.path.join(RESULTS_PATH, 'inoculation_models/')

# Define Base Model Checkpoints
model_ckpt_mbert = "bert-base-multilingual-cased"
model_ckpt_xlmr = "xlm-roberta-base"
model_ckpt_kbbert = "KB/bert-base-swedish-cased"

# Paths to BEST Saved Model Checkpoints
best_mbert_eng_only_dir = os.path.join(RESULTS_PATH, "bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6")
best_xlmr_eng_only_dir = os.path.join(RESULTS_PATH, "xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6")
best_mbert_cross_dir = os.path.join(RESULTS_PATH, "mBERT_cross_eng+de+da_lr2e-5_b16_ep6")
best_xlmr_cross_dir = os.path.join(RESULTS_PATH, "XLM-R_cross_eng+de+da_lr2e-5_b32_ep6")
inoc_mbert_dir = os.path.join(INOCULATION_RESULTS_PATH, f"{os.path.basename(best_mbert_eng_only_dir)}_inoc_inoc_20samp")
inoc_xlmr_dir = os.path.join(INOCULATION_RESULTS_PATH, f"{os.path.basename(best_xlmr_eng_only_dir)}_inoc_inoc_20samp")
kbbert_low_dir = os.path.join(RESULTS_PATH, "KB-BERT_swe_low20samp_lr3e-5_ep10")
kbbert_high_dir = os.path.join(RESULTS_PATH, "KB-BERT_swe_high150samp_lr2e-5_b16_ep6")

# List of models/paths to evaluate
evaluation_runs = [
    {"name": "mBERT Eng-Only (Zero-Shot)", "model_path": best_mbert_eng_only_dir, "tokenizer_name": model_ckpt_mbert, "test_data_suffix": "mbert"},
    {"name": "XLM-R Eng-Only (Zero-Shot)", "model_path": best_xlmr_eng_only_dir, "tokenizer_name": model_ckpt_xlmr, "test_data_suffix": "xlmr"},
    {"name": "mBERT Cross (Eng+De+Da)",    "model_path": best_mbert_cross_dir, "tokenizer_name": model_ckpt_mbert, "test_data_suffix": "mbert"},
    {"name": "XLM-R Cross (Eng+De+Da)",    "model_path": best_xlmr_cross_dir, "tokenizer_name": model_ckpt_xlmr, "test_data_suffix": "xlmr"},
    {"name": "mBERT Inoc (Eng-Only Base)", "model_path": inoc_mbert_dir, "tokenizer_name": model_ckpt_mbert, "test_data_suffix": "mbert"},
    {"name": "XLM-R Inoc (Eng-Only Base)", "model_path": inoc_xlmr_dir, "tokenizer_name": model_ckpt_xlmr, "test_data_suffix": "xlmr"},
    {"name": "KB-BERT Low-Data (20 Swe)",  "model_path": kbbert_low_dir, "tokenizer_name": model_ckpt_kbbert, "test_data_suffix": "kbbert"},
    {"name": "KB-BERT High-Data (150 Swe)","model_path": kbbert_high_dir, "tokenizer_name": model_ckpt_kbbert, "test_data_suffix": "kbbert"},
]

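# Define the metrics function (accuracy plus binary precision, recall, and F1).
# Trainer.predict prefixes these keys with 'test_', e.g. 'eval_f1' -> 'test_eval_f1'.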
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    labels = p.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary', zero_division=0)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'eval_f1': f1, 'eval_precision': precision, 'eval_recall': recall}


# Evaluation Loop
print("\n--- Starting Final Evaluation on Swedish Test Set ---")
all_results = []

# Load base Swedish tokenized data path template
swedish_base_path = os.path.join(TOKENIZED_DATA_PATH, 'biaswe_{}') # Placeholder for suffix

for run in evaluation_runs:
    run_name = run["name"]
    model_path = run["model_path"]
    tokenizer_name = run["tokenizer_name"]
    test_data_suffix = run["test_data_suffix"]
    swedish_test_path = swedish_base_path.format(test_data_suffix)

    print(f"\n--- Evaluating: {run_name} ---")
    print(f"  Model Path: {model_path}")
    print(f"  Tokenizer: {tokenizer_name}")
    print(f"  Test Data: {swedish_test_path}")

    # Check if model path exists
    if not os.path.exists(model_path):
        print("!! Error: Model path not found. Skipping evaluation.")
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": "Model Not Found"})
        continue

    # Load Model and Tokenizer
    try:
        model = AutoModelForSequenceClassification.from_pretrained(model_path)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) # Use original tokenizer name
    except Exception as e:
        print(f"!! Error loading model or tokenizer: {e}")
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": f"Loading Failed: {e}"})
        continue

    # Load the correct tokenized Swedish Test Set
    try:
        swedish_test_data = load_from_disk(swedish_test_path)['test']
        # Optional: Set format if needed (Trainer often handles this)
        # swedish_test_data.set_format("torch")
    except Exception as e:
        print(f"!! Error loading Swedish test data from {swedish_test_path}: {e}")
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": f"Test Data Load Failed: {e}"})
        continue

    # Create dummy TrainingArguments (needed for Trainer init)
    # Output dir is not really used here but required
    dummy_output_dir = os.path.join(RESULTS_PATH, "temp_evaluation_output")
    training_args = TrainingArguments(
        output_dir=dummy_output_dir,
        per_device_eval_batch_size=32, # Use a reasonable eval batch size
        report_to="none",
    )

    # Instantiate Trainer for prediction
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        # tokenizer=tokenizer # Pass if using dynamic padding during eval (unlikely needed)
    )

    # Predict
    try:
        print("Running prediction on Swedish test set...")
        predictions = trainer.predict(swedish_test_data)
        # metrics dict keys will be like 'test_accuracy', 'test_eval_f1', etc.
        metrics = predictions.metrics
        print("Metrics:", metrics)

        # Extract core metrics (adjust keys based on compute_metrics output)
        result = {
            "Experiment": run_name,
            "F1": metrics.get('test_eval_f1', 'N/A'), # Prefixed with 'test_' by predict
            "Precision": metrics.get('test_eval_precision', 'N/A'),
            "Recall": metrics.get('test_eval_recall', 'N/A'),
            "Accuracy": metrics.get('test_accuracy', 'N/A'),
            "Error": None
        }
        all_results.append(result)

    except Exception as e:
        print(f"!! Error during prediction/evaluation for {run_name}: {e}")
        import traceback
        traceback.print_exc()
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": f"Prediction Failed: {e}"})

# Display Results
print("\n--- Final Evaluation Results on Swedish Test Set ---")
results_df = pd.DataFrame(all_results)
# Format metric columns to four decimals (non-numeric entries become 'nan')
for col in ['F1', 'Precision', 'Recall', 'Accuracy']:
    results_df[col] = pd.to_numeric(results_df[col], errors='coerce').map('{:.4f}'.format)

print(results_df.to_markdown(index=False)) # Print markdown table

# Save this DataFrame to CSV
results_df.to_csv(os.path.join(RESULTS_PATH, "final_evaluation_results.csv"), index=False)


--- Starting Final Evaluation on Swedish Test Set ---

--- Evaluating: mBERT Eng-Only (Zero-Shot) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6
  Tokenizer: bert-base-multilingual-cased
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_mbert
Running prediction on Swedish test set...


Metrics: {'test_loss': 2.999316930770874, 'test_model_preparation_time': 0.0029, 'test_accuracy': 0.54, 'test_eval_f1': 0.25806451612903225, 'test_eval_precision': 0.8571428571428571, 'test_eval_recall': 0.1518987341772152, 'test_runtime': 1.2297, 'test_samples_per_second': 121.978, 'test_steps_per_second': 4.066}

--- Evaluating: XLM-R Eng-Only (Zero-Shot) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6
  Tokenizer: xlm-roberta-base
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_xlmr
Running prediction on Swedish test set...


Metrics: {'test_loss': 2.121135950088501, 'test_model_preparation_time': 0.0031, 'test_accuracy': 0.5066666666666667, 'test_eval_f1': 0.11904761904761904, 'test_eval_precision': 1.0, 'test_eval_recall': 0.06329113924050633, 'test_runtime': 1.0834, 'test_samples_per_second': 138.451, 'test_steps_per_second': 4.615}

--- Evaluating: mBERT Cross (Eng+De+Da) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/mBERT_cross_eng+de+da_lr2e-5_b16_ep6
  Tokenizer: bert-base-multilingual-cased
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_mbert
Running prediction on Swedish test set...


Metrics: {'test_loss': 2.4500181674957275, 'test_model_preparation_time': 0.0051, 'test_accuracy': 0.47333333333333333, 'test_eval_f1': 0.07058823529411765, 'test_eval_precision': 0.5, 'test_eval_recall': 0.0379746835443038, 'test_runtime': 1.1381, 'test_samples_per_second': 131.796, 'test_steps_per_second': 4.393}

--- Evaluating: XLM-R Cross (Eng+De+Da) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/XLM-R_cross_eng+de+da_lr2e-5_b32_ep6
  Tokenizer: xlm-roberta-base
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_xlmr
Running prediction on Swedish test set...


Metrics: {'test_loss': 0.9394775629043579, 'test_model_preparation_time': 0.0073, 'test_accuracy': 0.5866666666666667, 'test_eval_f1': 0.40384615384615385, 'test_eval_precision': 0.84, 'test_eval_recall': 0.26582278481012656, 'test_runtime': 1.1642, 'test_samples_per_second': 128.848, 'test_steps_per_second': 4.295}

--- Evaluating: mBERT Inoc (Eng-Only Base) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/inoculation_models/bert-base-multilingual-cased_mBERT_eng_only_lr2e-5_b16_ep6_inoc_inoc_20samp
  Tokenizer: bert-base-multilingual-cased
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_mbert
Running prediction on Swedish test set...


Metrics: {'test_loss': 1.715613842010498, 'test_model_preparation_time': 0.0051, 'test_accuracy': 0.66, 'test_eval_f1': 0.5785123966942148, 'test_eval_precision': 0.8333333333333334, 'test_eval_recall': 0.4430379746835443, 'test_runtime': 1.1036, 'test_samples_per_second': 135.92, 'test_steps_per_second': 4.531}

--- Evaluating: XLM-R Inoc (Eng-Only Base) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/inoculation_models/xlm-roberta-base_XLM-R_eng_only_BEST_lr2e-5_batch32_ep6_inoc_inoc_20samp
  Tokenizer: xlm-roberta-base
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_xlmr
Running prediction on Swedish test set...


Metrics: {'test_loss': 0.9193404316902161, 'test_model_preparation_time': 0.0058, 'test_accuracy': 0.6866666666666666, 'test_eval_f1': 0.6356589147286822, 'test_eval_precision': 0.82, 'test_eval_recall': 0.5189873417721519, 'test_runtime': 1.1027, 'test_samples_per_second': 136.035, 'test_steps_per_second': 4.534}

--- Evaluating: KB-BERT Low-Data (20 Swe) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/KB-BERT_swe_low20samp_lr3e-5_ep10
  Tokenizer: KB/bert-base-swedish-cased
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_kbbert


Running prediction on Swedish test set...


Metrics: {'test_loss': 0.677925705909729, 'test_model_preparation_time': 0.0056, 'test_accuracy': 0.52, 'test_eval_f1': 0.6210526315789474, 'test_eval_precision': 0.5315315315315315, 'test_eval_recall': 0.7468354430379747, 'test_runtime': 1.1268, 'test_samples_per_second': 133.125, 'test_steps_per_second': 4.437}

--- Evaluating: KB-BERT High-Data (150 Swe) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/KB-BERT_swe_high150samp_lr2e-5_b16_ep6
  Tokenizer: KB/bert-base-swedish-cased
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_kbbert
Running prediction on Swedish test set...


Metrics: {'test_loss': 0.5915638208389282, 'test_model_preparation_time': 0.003, 'test_accuracy': 0.6733333333333333, 'test_eval_f1': 0.7167630057803468, 'test_eval_precision': 0.6595744680851063, 'test_eval_recall': 0.7848101265822784, 'test_runtime': 1.0938, 'test_samples_per_second': 137.136, 'test_steps_per_second': 4.571}

--- Final Evaluation Results on Swedish Test Set ---
| Experiment                  |     F1 |   Precision |   Recall |   Accuracy | Error   |
|:----------------------------|-------:|------------:|---------:|-----------:|:--------|
| mBERT Eng-Only (Zero-Shot)  | 0.2581 |      0.8571 |   0.1519 |     0.54   |         |
| XLM-R Eng-Only (Zero-Shot)  | 0.119  |      1      |   0.0633 |     0.5067 |         |
| mBERT Cross (Eng+De+Da)     | 0.0706 |      0.5    |   0.038  |     0.4733 |         |
| XLM-R Cross (Eng+De+Da)     | 0.4038 |      0.84   |   0.2658 |     0.5867 |         |
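| mBERT Inoc (Eng-Only Base)  | 0.5785 |      0.8333 |   0.443  |     0.66   |         |
| XLM-R Inoc (Eng-Only Base)  | 0.6357 |      0.82   |   0.519  |     0.6867 |         |
| KB-BERT Low-Data (20 Swe)   | 0.6211 |      0.5315 |   0.7468 |     0.52   |         |
| KB-BERT High-Data (150 Swe) | 0.7168 |      0.6596 |   0.7848 |     0.6733 |         |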
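
Several rows above pair very high precision with very low recall and still end up with a low F1. The sketch below recomputes F1 as the harmonic mean of precision and recall for three of the runs, using the values printed in the `Metrics:` lines above. The helper name `f1_from_precision_recall` is introduced here for illustration only and is not part of the notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the Metrics lines above.
reported = {
    "mBERT Eng-Only (Zero-Shot)": (0.8571428571428571, 0.1518987341772152, 0.25806451612903225),
    "XLM-R Eng-Only (Zero-Shot)": (1.0, 0.06329113924050633, 0.11904761904761904),
    "KB-BERT High-Data (150 Swe)": (0.6595744680851063, 0.7848101265822784, 0.7167630057803468),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_from_precision_recall(p, r)
    print(f"{name}: reported={f1:.4f} recomputed={recomputed:.4f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a zero-shot model that flags only a handful of examples scores a low F1 even when every flagged example is correct.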

## Experiment 5: Investigating if Scope Impacted Cross-Lingual Training

In [None]:
import datasets
from datasets import load_from_disk, DatasetDict, concatenate_datasets
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os
import traceback

# Configuration
TOKENIZED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/'
RESULTS_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/results/'
os.makedirs(RESULTS_PATH, exist_ok=True)

# Model checkpoints
model_ckpt_mbert = "bert-base-multilingual-cased"
model_ckpt_xlmr = "xlm-roberta-base"

# Helper functions (defined in earlier cells; run those cells first):
# compute_metrics, load_tokenized_dataset, train_model

# Load Required Tokenized Datasets for Eng+De
print("--- Loading Tokenized Datasets for Eng+De Experiment ---")

# mBERT versions
dsd_eng_mbert = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'combined_english_mbert'))
dsd_ge_mbert = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'germs_at_mbert'))

# XLM-R versions
dsd_eng_xlmr = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'combined_english_xlmr'))
dsd_ge_xlmr = load_tokenized_dataset(os.path.join(TOKENIZED_DATA_PATH, 'germs_at_xlmr'))


# Combine Datasets for mBERT (Eng+De)
print("\n--- Combining Datasets for mBERT (Eng+De) ---")
combined_dsd_mbert_eng_de = None
try:
    # Ensure both necessary datasets and splits were loaded
    if dsd_eng_mbert and dsd_ge_mbert and \
       'train' in dsd_eng_mbert and 'train' in dsd_ge_mbert and \
       'validation' in dsd_eng_mbert and 'validation' in dsd_ge_mbert:

        mbert_train_splits_ed = [dsd_eng_mbert['train'], dsd_ge_mbert['train']]
        mbert_val_splits_ed = [dsd_eng_mbert['validation'], dsd_ge_mbert['validation']]

        combined_train_mbert_ed = concatenate_datasets(mbert_train_splits_ed)
        combined_val_mbert_ed = concatenate_datasets(mbert_val_splits_ed)

        # Shuffle the combined training set
        combined_train_mbert_ed = combined_train_mbert_ed.shuffle(seed=42)

        combined_dsd_mbert_eng_de = DatasetDict({
            'train': combined_train_mbert_ed,
            'validation': combined_val_mbert_ed
        })
        print("Combined mBERT DatasetDict (Eng+De) created:")
        print(combined_dsd_mbert_eng_de)
    else:
        print("!! Could not combine mBERT datasets for Eng+De: One or more source datasets/splits missing.")

except Exception as e:
    print(f"!! Error combining datasets for mBERT (Eng+De): {e}")
    traceback.print_exc()


# Combine Datasets for XLM-R (Eng+De)
print("\n--- Combining Datasets for XLM-R (Eng+De) ---")
combined_dsd_xlmr_eng_de = None
try:
    # Ensure both necessary datasets and splits were loaded
    if dsd_eng_xlmr and dsd_ge_xlmr and \
       'train' in dsd_eng_xlmr and 'train' in dsd_ge_xlmr and \
       'validation' in dsd_eng_xlmr and 'validation' in dsd_ge_xlmr:

        xlmr_train_splits_ed = [dsd_eng_xlmr['train'], dsd_ge_xlmr['train']]
        xlmr_val_splits_ed = [dsd_eng_xlmr['validation'], dsd_ge_xlmr['validation']]

        combined_train_xlmr_ed = concatenate_datasets(xlmr_train_splits_ed)
        combined_val_xlmr_ed = concatenate_datasets(xlmr_val_splits_ed)

        # Shuffle the combined training set
        combined_train_xlmr_ed = combined_train_xlmr_ed.shuffle(seed=42)

        combined_dsd_xlmr_eng_de = DatasetDict({
            'train': combined_train_xlmr_ed,
            'validation': combined_val_xlmr_ed
        })
        print("Combined XLM-R DatasetDict (Eng+De) created:")
        print(combined_dsd_xlmr_eng_de)
    else:
        print("!! Could not combine XLM-R datasets for Eng+De: One or more source datasets/splits missing.")

except Exception as e:
    print(f"!! Error combining datasets for XLM-R (Eng+De): {e}")
    traceback.print_exc()

# Define suffix for these runs
eng_de_suffix = "eng_de_only"

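# Train mBERT on Eng+De (skipped if the combined dataset could not be built)
print("\n--- Training mBERT on Combined Data (Eng+De) ---")
trainer_mbert_eng_de = None
if combined_dsd_mbert_eng_de is not None:
    trainer_mbert_eng_de = train_model(
        model_checkpoint=model_ckpt_mbert,
        tokenized_dataset=combined_dsd_mbert_eng_de,
        output_dir_suffix=eng_de_suffix,
        num_epochs=6,
        learning_rate=2e-5,
        batch_size=16
    )
else:
    print("!! Skipping mBERT training: combined Eng+De dataset is unavailable.")

# Train XLM-R on Eng+De (skipped if the combined dataset could not be built)
print("\n--- Training XLM-R on Combined Data (Eng+De) ---")
trainer_xlmr_eng_de = None
if combined_dsd_xlmr_eng_de is not None:
    trainer_xlmr_eng_de = train_model(
        model_checkpoint=model_ckpt_xlmr,
        tokenized_dataset=combined_dsd_xlmr_eng_de,
        output_dir_suffix=eng_de_suffix,
        num_epochs=6,
        learning_rate=2e-5,
        batch_size=32
    )
else:
    print("!! Skipping XLM-R training: combined Eng+De dataset is unavailable.")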


print("\n--- Eng+De Training (Additional Experiment) Calls Complete ---")
print(f"Models saved in folders ending with '{eng_de_suffix}'")

--- Loading Tokenized Datasets for Eng+De Experiment ---
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/combined_english_mbert
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7092
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1761
    })
})
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/germs_at_mbert
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4798
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1200
    })
})
Successfully loaded dataset from /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/combined_english_xlmr
DatasetDict({
    train: Dataset({


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training for bert-base-multilingual-cased_eng_de_only...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.2913,0.262701,0.910165,0.672414,0.664234,0.680798
2,0.2247,0.303068,0.926714,0.663566,0.877049,0.533666
3,0.1745,0.303617,0.91253,0.65786,0.699438,0.620948
4,0.1157,0.39523,0.921648,0.670455,0.778878,0.588529
5,0.0707,0.425911,0.92131,0.675035,0.765823,0.603491
6,0.0436,0.493345,0.913205,0.661397,0.701117,0.625935


Training finished.
Best model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/bert-base-multilingual-cased_eng_de_only

--- Training XLM-R on Combined Data (Eng+De) ---


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training for xlm-roberta-base_eng_de_only...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3108,0.265963,0.906451,0.667467,0.643519,0.693267
2,0.2153,0.225714,0.930767,0.687023,0.885827,0.561097
3,0.1693,0.277913,0.920297,0.700508,0.713178,0.688279
4,0.1305,0.296866,0.934144,0.731034,0.817901,0.660848
5,0.1023,0.351796,0.925701,0.71867,0.737533,0.700748
6,0.0796,0.369574,0.924688,0.715198,0.732984,0.698254


Training finished.
Best model saved to /content/drive/MyDrive/SwedishHateSpeechProject/results/xlm-roberta-base_eng_de_only

--- Eng+De Training (Additional Experiment) Calls Complete ---
Models saved in folders ending with 'eng_de_only'
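
The training logs above report a "best model", but the output does not show which metric `train_model` uses to select it. The sketch below parses the XLM-R epoch table as printed and compares two common criteria, highest validation F1 and lowest validation loss. The names `XLMR_LOG` and `parse_epoch_log` are hypothetical and exist only for this illustration.

```python
import csv
import io

# Epoch table copied verbatim from the XLM-R (Eng+De) training output above.
XLMR_LOG = """Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3108,0.265963,0.906451,0.667467,0.643519,0.693267
2,0.2153,0.225714,0.930767,0.687023,0.885827,0.561097
3,0.1693,0.277913,0.920297,0.700508,0.713178,0.688279
4,0.1305,0.296866,0.934144,0.731034,0.817901,0.660848
5,0.1023,0.351796,0.925701,0.71867,0.737533,0.700748
6,0.0796,0.369574,0.924688,0.715198,0.732984,0.698254
"""


def parse_epoch_log(text: str) -> list[dict[str, float]]:
    """Turn the comma-separated epoch table into a list of float-valued rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


rows = parse_epoch_log(XLMR_LOG)
best_by_f1 = max(rows, key=lambda r: r["F1"])
best_by_loss = min(rows, key=lambda r: r["Validation Loss"])

print(f"Best validation F1:   epoch {int(best_by_f1['Epoch'])} ({best_by_f1['F1']:.4f})")
print(f"Best validation loss: epoch {int(best_by_loss['Epoch'])} ({best_by_loss['Validation Loss']:.4f})")
```

For this log the two criteria pick different epochs (4 by F1, 2 by validation loss), and validation loss rises after epoch 2 while training loss keeps falling, so it is worth confirming which criterion `train_model` applies before interpreting the saved checkpoint.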


## Evaluation

In [None]:
import datasets
from datasets import load_from_disk, DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os
import pandas as pd

# Configuration
TOKENIZED_DATA_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/'
RESULTS_PATH = '/content/drive/MyDrive/SwedishHateSpeechProject/results/'
INOCULATION_RESULTS_PATH = os.path.join(RESULTS_PATH, 'inoculation_models/')

# Define Base Model Checkpoints
model_ckpt_mbert = "bert-base-multilingual-cased"
model_ckpt_xlmr = "xlm-roberta-base"

# Paths to BEST Saved Model Checkpoints
#best_mbert_eng_de_dir = os.path.join(RESULTS_PATH, "bert-base-multilingual-cased_eng_de_only_lr2e-5_b16_ep6")
best_xlmr_eng_de_dir = os.path.join(RESULTS_PATH, "xlm-roberta-base_eng_de_only_lr2e-5_b32_ep6")

# List of models/paths to evaluate
evaluation_runs = [
    {"name": "XLM-R Cross (Eng+De)",    "model_path": best_xlmr_eng_de_dir, "tokenizer_name": model_ckpt_xlmr, "test_data_suffix": "xlmr"},
]

# Define the compute_metrics function (Trainer.predict prefixes each returned key with 'test_')
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    labels = p.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary', zero_division=0)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'eval_f1': f1, 'eval_precision': precision, 'eval_recall': recall}


# Evaluation Loop
print("\n--- Starting Final Evaluation on Swedish Test Set ---")
all_results = []

# Load base Swedish tokenized data path template
swedish_base_path = os.path.join(TOKENIZED_DATA_PATH, 'biaswe_{}') # Placeholder for suffix

for run in evaluation_runs:
    run_name = run["name"]
    model_path = run["model_path"]
    tokenizer_name = run["tokenizer_name"]
    test_data_suffix = run["test_data_suffix"]
    swedish_test_path = swedish_base_path.format(test_data_suffix)

    print(f"\n--- Evaluating: {run_name} ---")
    print(f"  Model Path: {model_path}")
    print(f"  Tokenizer: {tokenizer_name}")
    print(f"  Test Data: {swedish_test_path}")

    # Check if model path exists
    if not os.path.exists(model_path):
        print("!! Error: Model path not found. Skipping evaluation.")
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": "Model Not Found"})
        continue

    # Load Model and Tokenizer
    try:
        model = AutoModelForSequenceClassification.from_pretrained(model_path)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) # Use original tokenizer name
    except Exception as e:
        print(f"!! Error loading model or tokenizer: {e}")
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": f"Loading Failed: {e}"})
        continue

    # Load the correct tokenized Swedish Test Set
    try:
        swedish_test_data = load_from_disk(swedish_test_path)['test']
    except Exception as e:
        print(f"!! Error loading Swedish test data from {swedish_test_path}: {e}")
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": f"Test Data Load Failed: {e}"})
        continue

    # Create dummy TrainingArguments (needed for Trainer init)
    # Output dir is not really used here but required
    dummy_output_dir = os.path.join(RESULTS_PATH, "temp_evaluation_output")
    training_args = TrainingArguments(
        output_dir=dummy_output_dir,
        per_device_eval_batch_size=32, # Use a reasonable eval batch size
        report_to="none",
    )

    # Instantiate Trainer for prediction
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        # tokenizer=tokenizer # Pass if using dynamic padding during eval (unlikely needed)
    )

    # Predict
    try:
        print("Running prediction on Swedish test set...")
        predictions = trainer.predict(swedish_test_data)
        # metrics dict keys will be like 'test_accuracy', 'test_eval_f1', etc.
        metrics = predictions.metrics
        print("Metrics:", metrics)

        # Extract core metrics (adjust keys based on compute_metrics output)
        result = {
            "Experiment": run_name,
            "F1": metrics.get('test_eval_f1', 'N/A'), # Prefixed with 'test_' by predict
            "Precision": metrics.get('test_eval_precision', 'N/A'),
            "Recall": metrics.get('test_eval_recall', 'N/A'),
            "Accuracy": metrics.get('test_accuracy', 'N/A'),
            "Error": None
        }
        all_results.append(result)

    except Exception as e:
        print(f"!! Error during prediction/evaluation for {run_name}: {e}")
        import traceback
        traceback.print_exc()
        all_results.append({"Experiment": run_name, "F1": "N/A", "Precision": "N/A", "Recall": "N/A", "Accuracy": "N/A", "Error": f"Prediction Failed: {e}"})

# Display Results
print("\n--- Final Evaluation Results on Swedish Test Set ---")
results_df = pd.DataFrame(all_results)
# Format results nicely
results_df['F1'] = pd.to_numeric(results_df['F1'], errors='coerce').map('{:.4f}'.format)
results_df['Precision'] = pd.to_numeric(results_df['Precision'], errors='coerce').map('{:.4f}'.format)
results_df['Recall'] = pd.to_numeric(results_df['Recall'], errors='coerce').map('{:.4f}'.format)
results_df['Accuracy'] = pd.to_numeric(results_df['Accuracy'], errors='coerce').map('{:.4f}'.format)

print(results_df.to_markdown(index=False)) # Print markdown table

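# Save this DataFrame to CSV under its own filename so the earlier
# "final_evaluation_results.csv" (all eight runs) is not overwritten
results_df.to_csv(os.path.join(RESULTS_PATH, "eng_de_evaluation_results.csv"), index=False)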


--- Starting Final Evaluation on Swedish Test Set ---

--- Evaluating: XLM-R Cross (Eng+De) ---
  Model Path: /content/drive/MyDrive/SwedishHateSpeechProject/results/xlm-roberta-base_eng_de_only_lr2e-5_b32_ep6
  Tokenizer: xlm-roberta-base
  Test Data: /content/drive/MyDrive/SwedishHateSpeechProject/tokenized_data/biaswe_xlmr
Running prediction on Swedish test set...


Metrics: {'test_loss': 2.5675976276397705, 'test_model_preparation_time': 0.0029, 'test_accuracy': 0.5, 'test_eval_f1': 0.0963855421686747, 'test_eval_precision': 1.0, 'test_eval_recall': 0.05063291139240506, 'test_runtime': 1.0068, 'test_samples_per_second': 148.99, 'test_steps_per_second': 4.966}

--- Final Evaluation Results on Swedish Test Set ---
| Experiment           |     F1 |   Precision |   Recall |   Accuracy | Error   |
|:---------------------|-------:|------------:|---------:|-----------:|:--------|
| XLM-R Cross (Eng+De) | 0.0964 |           1 |   0.0506 |        0.5 |         |
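
To make the three XLM-R cross-lingual results easier to compare, the sketch below recovers approximate confusion-matrix counts from the reported precision and recall. It assumes a Swedish test set of 150 examples with 79 positives. Those two numbers are inferred from the logs (runtime times samples per second is about 150, and every reported recall is a multiple of 1/79), not stated in the notebook. The helper `confusion_from_metrics` is hypothetical.

```python
def confusion_from_metrics(precision: float, recall: float, n_pos: int, n_total: int) -> dict[str, int]:
    """Recover TP/FP/FN/TN from precision and recall, given class counts.

    Assumes precision > 0 whenever any positives are predicted; with
    precision == 0 the number of false positives cannot be recovered.
    """
    tp = round(recall * n_pos)
    predicted_pos = round(tp / precision) if precision > 0 else 0
    fp = predicted_pos - tp
    fn = n_pos - tp
    tn = n_total - n_pos - fp
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Inferred from the logs (assumption): 150 test examples, 79 of them positive.
N_TOTAL, N_POS = 150, 79

# (precision, recall) copied from the Metrics lines of the three XLM-R runs.
xlmr_runs = {
    "XLM-R Eng-Only (Zero-Shot)": (1.0, 0.06329113924050633),
    "XLM-R Cross (Eng+De)": (1.0, 0.05063291139240506),
    "XLM-R Cross (Eng+De+Da)": (0.84, 0.26582278481012656),
}

counts = {name: confusion_from_metrics(p, r, N_POS, N_TOTAL) for name, (p, r) in xlmr_runs.items()}
for name, c in counts.items():
    accuracy = (c["tp"] + c["tn"]) / N_TOTAL
    print(f"{name}: {c} accuracy={accuracy:.4f}")
```

Under these assumptions the recovered accuracies match the reported ones. The Eng+De model flags 4 of the 79 positive examples, the Eng-only model 5, and the Eng+De+Da model 21, which suggests the gain in the earlier cross-lingual run coincides with the Danish data being included.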
