# 📊 05_evaluation.ipynb

This notebook provides tools to explore, validate, and visualize the labels assigned to Bible verses during emotion and theme classification. It also compares the English labels with those of the Spanish version.

## 🧱 1. Setup Paths & Translation Maps

In [15]:
from pathlib import Path
import pandas as pd

BIBLE = "bible_kjv"
BIBLE_ES = "bible_rv60"

EN_DIR = Path("../data/labeled") / BIBLE / "emotion_theme"
ES_DIR = Path("../data/labeled") / BIBLE_ES / "emotion_theme"

EMOTION_MAP = {
    "joy": "Alegría",
    "sadness": "Tristeza",
    "anger": "Ira",
    "fear": "Miedo",
    "trust": "Confianza",
    "surprise": "Sorpresa"
}

THEME_MAP = {
    "love": "amor",
    "faith": "fe",
    "hope": "esperanza",
    "forgiveness": "perdón",
    "fear": "miedo"
}

# Invert for comparison
INV_EMOTION_MAP = {v.lower(): k for k, v in EMOTION_MAP.items()}
INV_THEME_MAP = {v.lower(): k for k, v in THEME_MAP.items()}
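
Because the inverse maps are built with a dict comprehension, two English labels that shared the same Spanish translation would silently collapse into a single entry. The following is a minimal sketch of a sanity check for that case. The helper name `check_invertible` is hypothetical, and the map is a shortened copy of `EMOTION_MAP` used only for illustration.

```python
def check_invertible(mapping):
    """Return the inverse of `mapping`, raising if two keys share a (lowercased) value."""
    inverse = {}
    for key, value in mapping.items():
        folded = value.lower()
        if folded in inverse:
            raise ValueError(f"{key!r} and {inverse[folded]!r} both map to {value!r}")
        inverse[folded] = key
    return inverse

# Shortened copy of EMOTION_MAP, for illustration only
emotion_map = {"joy": "Alegría", "sadness": "Tristeza", "fear": "Miedo"}
inv = check_invertible(emotion_map)
print(inv)
```

Raising on a collision, rather than keeping the last entry, makes an ambiguous translation visible before it affects the comparison.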


## 🧪 2. Load & Compare One Example Book (e.g., Genesis)

In [16]:
# Specify the book to analyze
book = "1_genesis"

# Define file paths for English and Spanish datasets
en_file = EN_DIR / f"{book}_emotion_theme.csv"
es_file = ES_DIR / f"{book}_emotion_theme.csv"

# Load the English and Spanish datasets into dataframes
df_en = pd.read_csv(en_file)
df_es = pd.read_csv(es_file)

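# Ensure both datasets have the same number of rows before comparing them position by position
assert len(df_en) == len(df_es), f"Row count mismatch: {len(df_en)} (EN) vs {len(df_es)} (ES)"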
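
Equal row counts do not guarantee that row *i* refers to the same verse in both files. A minimal sketch of a stricter check follows. It assumes both frames carry `chapter` and `verse` columns (the English frame does, as the debug view in Section 5 shows; the Spanish frame is assumed to match), and the helper name `misaligned_rows` is hypothetical.

```python
import pandas as pd

def misaligned_rows(df_a, df_b, keys=("chapter", "verse")):
    """Return positions where the key columns of two equally long frames differ."""
    a = df_a[list(keys)].reset_index(drop=True)
    b = df_b[list(keys)].reset_index(drop=True)
    differs = (a != b).any(axis=1)
    return differs[differs].index.tolist()

# Toy frames standing in for df_en and df_es
toy_en = pd.DataFrame({"chapter": [1, 1, 1], "verse": [1, 2, 3]})
toy_es = pd.DataFrame({"chapter": [1, 1, 1], "verse": [1, 3, 2]})
print(misaligned_rows(toy_en, toy_es))
```

An empty list means the two files line up verse by verse. The comparison expects frames of equal length, which the `assert` above already enforces.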


## 🧠 3. Compare Emotions

In [None]:
# Map Spanish emotions back to English using the inverse emotion map.
# Labels with no entry in INV_EMOTION_MAP become NaN and count as mismatches below.
df_es["emotion_en"] = df_es["emotion"].str.lower().map(INV_EMOTION_MAP)

# Compare English and Spanish emotions for exact matches
emotion_matches = df_en["emotion"].str.lower() == df_es["emotion_en"]

# Calculate the percentage of matching emotions
emotion_accuracy = emotion_matches.mean()

# Print the emotion agreement percentage
print(f"🎭 Emotion agreement: {emotion_accuracy:.2%}")


🎭 Emotion agreement: 35.23%


### 🌐 Cross-Language Emotion Agreement Analysis

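We compared the emotion labels of the English and Spanish versions of the same verses to assess consistency. In this project, emotion labels are assigned in two steps:

- The English verses are labeled with `j-hartmann/emotion-english-distilroberta-base`.
- The resulting labels are then **translated and transferred to the Spanish** verses via a direct mapping (`EMOTION_MAP` and `THEME_MAP`).

Using `1_genesis` as a test case, **only 35.23%** of the Spanish labels mapped back to the English label of the same verse. Because the Spanish labels are transferred rather than predicted, near-complete agreement would be expected, so this figure should not be read as model disagreement:

- No emotion model was run on the Spanish text directly, so cross-language nuance and model behavior cannot account for the gap.
- Label translation is deterministic, but `EMOTION_MAP` does not cover every label in the data. The debug view in Section 5 shows `neutral`/`Neutral` pairs among the mismatches, and `neutral` has no entry in the map.
- Any Spanish label missing from `INV_EMOTION_MAP` maps to `NaN` and is counted as a mismatch.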
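
To check how much of the disagreement comes from labels that the inverse map does not cover, the agreement can be split into mapped and unmapped rows. The sketch below uses toy data and a shortened stand-in for `INV_EMOTION_MAP`; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Shortened stand-in for INV_EMOTION_MAP (note: no entry for "neutral")
inv_map = {"alegría": "joy", "miedo": "fear", "tristeza": "sadness"}

en = pd.Series(["joy", "neutral", "fear", "neutral"])
es = pd.Series(["Alegría", "Neutral", "Tristeza", "Neutral"])

es_mapped = es.str.lower().map(inv_map)
unmapped = es_mapped.isna()            # label had no entry in the inverse map
matches = en.str.lower() == es_mapped  # NaN never equals a label, so unmapped rows are False

print(f"agreement (all rows):    {matches.mean():.2%}")
print(f"unmapped rows:           {unmapped.mean():.2%}")
print(f"agreement (mapped rows): {matches[~unmapped].mean():.2%}")
print(es[unmapped].value_counts())
```

If most mismatches fall into the unmapped group, extending the map is likely to raise the agreement figure more than any change to the model would.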

#### ✅ Strategic Decision

> We adopt the English emotion labels as the canonical source of truth  
> and use translated labels for the Spanish corpus, ensuring consistency and traceability.

This avoids discrepancies from multilingual model divergence and maintains alignment across the project.


## 🧩 4. Compare Themes (Multi-label, unordered)

In [18]:
def normalize_themes(series, inverse_map):
    # Function to normalize themes by mapping them using an inverse map
    def map_themes(row):
        if pd.isna(row): 
            return set()  # Return an empty set if the row is NaN
        # Map each theme in the row using the inverse map, or keep the original if no mapping exists
        return set(inverse_map.get(x.strip().lower(), x.strip().lower()) for x in row.split(";"))
    return series.apply(map_themes)  # Apply the mapping function to the entire series

# Normalize English themes without any mapping
en_themes = normalize_themes(df_en["theme"], {})

# Normalize Spanish themes using the inverse theme map
es_themes = normalize_themes(df_es["theme"], INV_THEME_MAP)

# Check for exact matches between English and Spanish themes
theme_match = (en_themes == es_themes)

# Calculate the overlap ratio for each pair of English and Spanish themes
theme_overlap = [
    len(en & es) / max(len(en | es), 1)  # Jaccard index; two empty sets score 0 because the denominator is clamped to 1
    for en, es in zip(en_themes, es_themes)
]

# Print the percentage of exact theme matches
print(f"🧠 Exact theme match: {theme_match.mean():.2%}")

# Print the average theme overlap percentage
print(f"🔁 Avg. theme overlap: {sum(theme_overlap)/len(theme_overlap):.2%}")


🧠 Exact theme match: 100.00%
🔁 Avg. theme overlap: 91.39%


### 🏷️ Cross-Language Theme Agreement Analysis

To assess the consistency of thematic labels between the English and Spanish versions of the Bible corpus, we compared the themes assigned to each verse.

Unlike emotions, themes may contain **multiple labels** separated by semicolons (e.g., `"faith;hope"`), making exact string comparison insufficient. We therefore performed:

#### 1. Normalization
- **English themes** were normalized to lowercase and split into sets.
- **Spanish themes** were translated back to English using `INV_THEME_MAP` for direct comparison.

#### 2. Evaluation Metrics
- **Exact match**: Percentage of verses where the theme sets matched *exactly*.
- **Theme overlap**: The Jaccard index (intersection over union) for each verse's theme set.
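
The two metrics can be illustrated on a few hand-made theme sets. This is a minimal sketch: the `jaccard` helper and its `empty_score` parameter are hypothetical names, added to show how a pair of empty sets is scored. With the default of `0.0` the helper reproduces the notebook's formula, where the denominator is clamped to 1.

```python
def jaccard(a, b, empty_score=0.0):
    """Jaccard index |a ∩ b| / |a ∪ b|; `empty_score` is returned when both sets are empty."""
    union = a | b
    if not union:
        return empty_score
    return len(a & b) / len(union)

pairs = [
    ({"faith", "hope"}, {"faith", "hope"}),  # identical sets
    ({"faith", "hope"}, {"faith"}),          # partial overlap
    (set(), set()),                          # no themes on either side
]
print([jaccard(en, es) for en, es in pairs])                   # empty pair scores 0, as in the notebook
print([jaccard(en, es, empty_score=1.0) for en, es in pairs])  # empty pair treated as agreement
```

Whether two empty sets should count as full agreement is a design choice. Scoring them as 0 lowers the average overlap even when every verse matches exactly.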

#### ✅ Results

- 🧠 **Exact match**: 100.00%  
- 🔁 **Average theme overlap**: 91.39%

#### 🧠 Interpretation

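- The exact match rate of **100%** confirms that the Spanish thematic labels are fully consistent with the English originals after translation.
- The average overlap (**91.39%**) is below 100% even though every verse matches exactly. Whitespace and label order cannot explain this, since normalization strips whitespace and sets are unordered. The gap comes from verses with no theme on either side: two empty sets score 0 because the overlap formula clamps the denominator to 1. This implies that roughly 8.6% of the verses in `1_genesis` carry no theme label.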

#### ✅ Conclusion

> Thematic labels were successfully and reliably transferred from English to Spanish.  
> The translation process preserves multi-label integrity, making the Spanish corpus valid for downstream use and visualization.

These results validate the use of the Spanish thematic annotations in the MVP.


## 📊 5. Show Mismatches (Optional Debug View)

In [24]:
# Filter rows where emotions do not match between English and Spanish datasets
mismatched = df_en[~emotion_matches].copy()

# Add a column for Spanish emotions corresponding to mismatched rows
mismatched["es_emotion"] = df_es.loc[~emotion_matches, "emotion"]

# Display the first 10 rows of relevant columns for inspection
mismatched[["chapter", "verse", "text", "emotion", "es_emotion"]].head(10)

Unnamed: 0,chapter,verse,text,emotion,es_emotion
0,1,1,In the beginning God created the heaven and th...,neutral,Neutral
2,1,3,"And God said, Let there be light: and there wa...",neutral,Neutral
3,1,4,"And God saw the light, that it was good: and G...",neutral,Neutral
4,1,5,"And God called the light Day, and the darkness...",neutral,Neutral
5,1,6,"And God said, Let there be a firmament in the ...",neutral,Neutral
6,1,7,"And God made the firmament, and divided the wa...",neutral,Neutral
7,1,8,And God called the firmament Heaven. And the e...,neutral,Neutral
8,1,9,"And God said, Let the waters under the heaven ...",neutral,Neutral
10,1,11,"And God said, Let the earth bring forth grass,...",neutral,Neutral
12,1,13,And the evening and the morning were the third...,neutral,Neutral
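
Scanning the first ten rows only hints at a pattern. Counting each (English, Spanish) label pair shows which combinations dominate the mismatches. The sketch below uses a toy frame in place of `mismatched`, with the same `emotion` and `es_emotion` column names.

```python
import pandas as pd

# Toy stand-in for the `mismatched` frame built above
toy = pd.DataFrame({
    "emotion": ["neutral", "neutral", "joy", "neutral"],
    "es_emotion": ["Neutral", "Neutral", "Tristeza", "Neutral"],
})

# Count each (English label, Spanish label) pair, most frequent first
pair_counts = (
    toy.groupby(["emotion", "es_emotion"])
       .size()
       .sort_values(ascending=False)
)
print(pair_counts)
```

A pair such as `neutral`/`Neutral` at the top of this list would point to a gap in the translation map rather than a real difference in labels.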


## 🧪 6. Manual Evaluation

This section evaluates the pretrained Hugging Face models on a small set of manually labeled examples. Each example includes an input sentence, an expected emotion, and an expected theme. The goal is to measure whether the models predict labels that align with human expectations.

This validation gauges how reliable the system is before it is used as a recommender.


In [None]:
import pandas as pd

# Load manually curated test cases from a CSV file
df_eval = pd.read_csv("../data/evaluation/eval_examples.csv", encoding="utf-8")

# Display the first few rows of the dataframe for inspection
df_eval.head()


Unnamed: 0,input_text,expected_emotion,expected_theme
0,I feel anger rising in me.,anger,Fear
1,My heart trembles in the dark.,fear,Fear
2,I know You are with me always.,trust,Faith
3,This is a day of blessings and happiness.,joy,Hope
4,My spirit is weary and sad.,sadness,Fear


In [None]:
from transformers import pipeline

# Initialize the emotion classification model pipeline
emotion_model = pipeline(
    "text-classification",  # Task type: text classification
    model="j-hartmann/emotion-english-distilroberta-base",  # Pretrained model for emotion classification
    top_k=None  # Return all predictions with their scores
)

# Initialize the thematic classification model pipeline
theme_model = pipeline(
    "zero-shot-classification",  # Task type: zero-shot classification
    model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"  # Pretrained model for zero-shot classification
)

# Define the list of candidate theme labels for classification
themes = ["Love", "Faith", "Hope", "Forgiveness", "Fear"]


Device set to use cuda:0
Device set to use cuda:0


In [None]:
def evaluate_row(row):
    # Extract the input text, expected emotion, and expected theme from the row
    text = row["input_text"]
    expected_emotion = row["expected_emotion"]
    expected_theme = row["expected_theme"]

    # Predict emotion using the emotion classification model
    emotion_preds = emotion_model(text)[0]  # Get the list of emotion predictions
    pred_emotion = max(emotion_preds, key=lambda x: x["score"])  # Select the emotion with the highest score
    emotion_label = pred_emotion["label"]  # Extract the predicted emotion label
    emotion_score = pred_emotion["score"]  # Extract the confidence score for the predicted emotion

    # Predict theme using the zero-shot classification model
    theme_preds = theme_model(text, candidate_labels=themes)  # Get the list of theme predictions
    theme_label = theme_preds["labels"][0]  # Select the theme with the highest score
    theme_score = theme_preds["scores"][0]  # Extract the confidence score for the predicted theme

    # Return a pandas Series with the predictions and evaluation metrics
    return pd.Series({
        "pred_emotion": emotion_label,  # Predicted emotion label
        "emotion_score": emotion_score,  # Confidence score for the predicted emotion
        "pred_theme": theme_label,  # Predicted theme label
        "theme_score": theme_score,  # Confidence score for the predicted theme
        "emotion_match": emotion_label == expected_emotion,  # Whether the predicted emotion matches the expected emotion
        "theme_match": theme_label == expected_theme  # Whether the predicted theme matches the expected theme
    })


In [None]:
# Apply the evaluation function to all rows in the dataframe
results = df_eval.join(df_eval.apply(evaluate_row, axis=1))

# Save the evaluation results to a CSV file for further analysis
results.to_csv("../data/evaluation/eval_results.csv", index=False)

# Display the first few rows of the results dataframe for inspection
results.head()


Unnamed: 0,input_text,expected_emotion,expected_theme,pred_emotion,emotion_score,pred_theme,theme_score,emotion_match,theme_match
0,I feel anger rising in me.,anger,Fear,anger,0.993894,Faith,0.261338,True,False
1,My heart trembles in the dark.,fear,Fear,fear,0.990187,Fear,0.880271,True,True
2,I know You are with me always.,trust,Faith,neutral,0.627286,Love,0.354137,False,False
3,This is a day of blessings and happiness.,joy,Hope,joy,0.932398,Hope,0.401017,True,True
4,My spirit is weary and sad.,sadness,Fear,sadness,0.983341,Fear,0.720932,True,True
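
The `theme_score` column makes it possible to see how accuracy changes when low-confidence theme predictions are set aside. The sketch below reuses the five preview rows shown above; the helper `accuracy_at` and the threshold values are illustrative assumptions, and five rows are far too few to choose a threshold from.

```python
import pandas as pd

# The five rows shown in the results preview
preview = pd.DataFrame({
    "pred_theme": ["Faith", "Fear", "Love", "Hope", "Fear"],
    "theme_score": [0.261338, 0.880271, 0.354137, 0.401017, 0.720932],
    "theme_match": [False, True, False, True, True],
})

def accuracy_at(df, threshold):
    """Return (coverage, accuracy) over rows whose theme_score is >= threshold."""
    kept = df[df["theme_score"] >= threshold]
    coverage = len(kept) / len(df)
    accuracy = kept["theme_match"].mean() if len(kept) else float("nan")
    return coverage, accuracy

for t in (0.0, 0.4, 0.8):  # illustrative thresholds, not tuned values
    coverage, accuracy = accuracy_at(preview, t)
    print(f"threshold={t:.1f}  coverage={coverage:.0%}  accuracy={accuracy:.0%}")
```

Running the same function on the full `results` frame would show the trade-off between how many examples receive a theme and how often that theme is right.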


In [10]:
from sklearn.metrics import classification_report

print("Emotion classification report:")
print(classification_report(results["expected_emotion"], results["pred_emotion"]))

print("\nTheme classification report:")
print(classification_report(results["expected_theme"], results["pred_theme"]))



Emotion classification report:
              precision    recall  f1-score   support

       anger       0.67      1.00      0.80         4
        fear       1.00      1.00      1.00        10
         joy       0.89      1.00      0.94         8
     neutral       0.00      0.00      0.00         0
     sadness       1.00      1.00      1.00        10
       trust       0.00      0.00      0.00         9

    accuracy                           0.78        41
   macro avg       0.59      0.67      0.62        41
weighted avg       0.73      0.78      0.75        41


Theme classification report:
              precision    recall  f1-score   support

       Faith       0.58      0.50      0.54        14
        Fear       0.74      0.93      0.82        15
 Forgiveness       0.00      0.00      0.00         5
        Hope       0.33      0.33      0.33         3
        Love       0.29      0.50      0.36         4

    accuracy                           0.59        41
   macro avg    



### 📋 Summary of Manual Evaluation (Section 6)

This evaluation tested 41 manually curated examples with expected emotion and theme labels.

#### 🧠 Emotion Classification
- **Accuracy**: 78%
- **Weighted F1-score**: 0.75
- **Observations**:
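  - Excellent performance on `fear`, `joy`, and `sadness` (F1 ≥ 0.94).
  - `trust` was never predicted (F1 = 0.00). The model's label set (`anger`, `disgust`, `fear`, `joy`, `neutral`, `sadness`, `surprise`) does not include `trust`, so this class cannot be matched without remapping labels or using a different model.
  - `neutral` appears in the predictions but not in the expected labels (support = 0), which suggests label filtering may be needed before evaluation.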

#### 🏷️ Theme Classification
- **Accuracy**: 59%
- **Weighted F1-score**: 0.55
- **Observations**:
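  - Strong detection of `Fear` (F1 = 0.82), suggesting the model is sensitive to explicit emotional cues.
  - Weak results for `Hope` (F1 = 0.33), `Love` (F1 = 0.36), and `Forgiveness` (F1 = 0.00), possibly due to subtler context or limitations of zero-shot learning without fine-tuning. These classes also have only 3 to 5 examples each, so the estimates are noisy.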

#### ✅ Conclusions
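- Emotion predictions are strong for `anger`, `fear`, `joy`, and `sadness` and usable in the MVP without additional training. `trust` is the exception, since the model never predicts it.
- Theme classification is functional but limited. Only the better-performing themes (e.g. `Fear` and, to a lesser extent, `Faith`) should be used in early recommendations.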
- Future improvements could involve:
  - Refining prompts or context for theme detection
  - Manual annotation + fine-tuning
  - Filtering unexpected model outputs (e.g. `neutral`, `disgust`) for better evaluation
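
The last point can be sketched with two existing parameters of `classification_report`: `labels` restricts the report to chosen classes, and `zero_division` sets the value reported for a class that is never predicted, which avoids the undefined-metric warnings. The toy labels below stand in for `results["expected_emotion"]` and `results["pred_emotion"]`.

```python
from sklearn.metrics import classification_report

# Toy labels standing in for the expected and predicted emotion columns
y_true = ["anger", "fear", "joy", "trust", "sadness", "trust"]
y_pred = ["anger", "fear", "joy", "neutral", "sadness", "joy"]

# Score only the classes that appear in the expected labels
expected_labels = sorted(set(y_true))

report = classification_report(
    y_true,
    y_pred,
    labels=expected_labels,  # leaves out prediction-only classes such as "neutral"
    zero_division=0,         # report 0 instead of warning when a class is never predicted
    output_dict=True,
)
for label in expected_labels:
    row = report[label]
    print(f"{label:>8}  precision={row['precision']:.2f}  recall={row['recall']:.2f}  f1={row['f1-score']:.2f}")
```

Restricting `labels` removes the `neutral` row from the report but does not drop any examples: a `trust` example predicted as `neutral` still counts against the recall of `trust`.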

This validation establishes a clear performance baseline for the MVP.
