# Analysis of the *Decretum Burchardi*: Topic Modeling, Linguistic Features, and Keyness

This notebook analyzes the *Decretum Burchardi* with an end-to-end automated pipeline 
that combines digital philology tools with modern transformer-based embeddings.

**Objectives:**
1.  **Linguistic Profiling:** Analyze morpho-syntactic features (POS ratios) using `flair`.
2.  **Topic Modeling:** Use **BERTopic** combined with pre-calculated **OpenAI Embeddings** to cluster chapters semantically.
3.  **Keyword Extraction:** Use **KeyBERT** to identify the most representative terms per chapter.
4.  **Dataset Enrichment:** Merge all analysis results back into the Hugging Face dataset.

**Models used:**
* Embeddings: OpenAI (`text-embedding-3-large`)
* Lemmatizer: `mschonhardt/latin-lemmatizer`
* POS Tagger: `mschonhardt/latin-pos-tagger`
* Keyword model (KeyBERT): `paraphrase-multilingual-MiniLM-L12-v2`


In [None]:
# !pip install datasets bertopic keybert flair transformers torch numpy pandas scikit-learn matplotlib


## 1. Setup and Configuration
Define models, stop words, and check for GPU availability.


In [None]:
import torch
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset, Dataset

# NLP Libraries
from transformers import pipeline
from flair.data import Sentence
from flair.models import SequenceTagger
from bertopic import BERTopic
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer

# Check for GPU
device = 0 if torch.cuda.is_available() else -1
device_name = torch.cuda.get_device_name(0) if device == 0 else "CPU"
print(f"Using device: {device_name}")

# Configuration
DATASET_ID = "mschonhardt/bdd-ep-openai-embeddings"
MODEL_LEMMATIZER = "mschonhardt/latin-lemmatizer"
MODEL_POS = "mschonhardt/latin-pos-tagger"

# Custom Latin Stopwords List (Critical for clean Topics)
LATIN_STOPWORDS = [
    "et", "in", "de", "ad", "non", "ut", "cum", "per", "a", "sed", "que", 
    "quia", "si", "ab", "ex", "unde", "sicut", "vel", "aut", "est", "sunt", 
    "esse", "fuit", "sua", "suo", "suum", "eius", "enim", "ergo", "tamen",
    "quod", "qui", "quae", "hoc", "haec", "illud", "se", "ipsum", "autem",
    "tunc", "ubi", "ibi", "nos", "vos", "me", "te", "id", "hic", "ille"
]


## 2. Load Data
Load the dataset containing the raw text and the pre-calculated OpenAI embeddings.
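
Because later steps pair each text with its embedding by position, it can help to check both columns for consistency right after loading. The following is a minimal sketch on made-up rows: `check_rows` is a hypothetical helper (not part of `datasets`), and it only assumes the `text` and `embedding` column names used in this notebook.

```python
def check_rows(rows):
    """Return the shared embedding dimension, or raise ValueError on inconsistent rows."""
    dims = set()
    for idx, row in enumerate(rows):
        if not row["text"].strip():
            raise ValueError(f"Row {idx} has empty text")
        dims.add(len(row["embedding"]))
    if len(dims) != 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()


# Toy rows standing in for dataset records (invented for illustration only)
toy_rows = [
    {"text": "De primatu Romanae ecclesiae", "embedding": [0.1, 0.2, 0.3]},
    {"text": "De ordinatione episcoporum", "embedding": [0.0, 0.5, 0.4]},
]
embedding_dim = check_rows(toy_rows)
print(f"All rows share embedding dimension {embedding_dim}")
```

On the real data, the same check could be run over the loaded `dataset` before converting the embeddings to a NumPy array.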


In [None]:
print(f"Loading dataset: {DATASET_ID}...")
dataset = load_dataset(DATASET_ID, split="train")

# Extract columns (list() yields a plain Python list regardless of the `datasets` version)
texts = list(dataset["text"])
# BERTopic requires numpy arrays for embeddings
embeddings = np.array(dataset["embedding"])

print(f"Successfully loaded {len(texts)} chapters.")
print(f"Embedding shape: {embeddings.shape}")


## 3. Preprocessing: Lemmatization
Use a Seq2Seq transformer model to lemmatize the Latin text.
The lemmatized texts are used for topic representation and keyword extraction, while the OpenAI embeddings are used for clustering.
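
The lemmatization loop processes the chapters in fixed-size slices. As a minimal sketch of that batching pattern, independent of any model, the hypothetical helper `iter_batches` below shows how a list is split and that the last batch may be shorter than the rest.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Seven placeholder texts split into batches of three
toy_texts = [f"capitulum {n}" for n in range(1, 8)]
batch_lengths = [len(batch) for batch in iter_batches(toy_texts, 3)]
print(batch_lengths)  # [3, 3, 1]

# Concatenating the batches restores the original order, so results stay aligned with the input
flattened = [text for batch in iter_batches(toy_texts, 3) for text in batch]
```

Keeping the output order identical to the input order matters here, because the lemmatized texts are later matched to embeddings and topics by position.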


In [None]:
print("Initializing Lemmatizer Pipeline...")
lemmatizer_pipe = pipeline("text2text-generation", model=MODEL_LEMMATIZER, device=device)

lemmatized_texts = []
batch_size = 32  # Adjust based on your VRAM

print("Starting Lemmatization (this may take a while)...")
for i in tqdm(range(0, len(texts), batch_size), desc="Lemmatizing"):
    batch = texts[i:i+batch_size]
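    # Truncation prevents crashes on extremely long chapters (text beyond the limit is not lemmatized);
    # batch_size is forwarded so the pipeline actually batches on the device
    results = lemmatizer_pipe(batch, max_length=512, truncation=True, batch_size=batch_size)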
    lemmatized_texts.extend([res['generated_text'] for res in results])

print("Sample Lemmatized Text:", lemmatized_texts[0][:100])


## 4. Linguistic Analysis: POS Tagging
Use `flair` to calculate the proportion of nouns, verbs, and adjectives among all tokens of each chapter.
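
The ratio computation itself does not depend on the tagger. A minimal sketch, assuming the tagger emits Universal POS labels such as `NOUN`, `VERB`, and `ADJ` (the hypothetical helper `pos_ratios` and the toy tag list are for illustration only):

```python
from collections import Counter


def pos_ratios(tags, wanted=("NOUN", "VERB", "ADJ")):
    """Return the share of each wanted tag among all tags (0.0 for an empty list)."""
    total = len(tags)
    counts = Counter(tags)
    return {
        f"{tag.lower()}_ratio": (counts[tag] / total if total else 0.0)
        for tag in wanted
    }


# Eight toy tags: 3 nouns, 2 verbs, 1 adjective, 1 adposition, 1 punctuation mark
toy_tags = ["NOUN", "VERB", "NOUN", "ADP", "ADJ", "NOUN", "PUNCT", "VERB"]
ratios = pos_ratios(toy_tags)
print(ratios)  # {'noun_ratio': 0.375, 'verb_ratio': 0.25, 'adj_ratio': 0.125}
```

Note that every token, including punctuation, counts toward the denominator in this sketch; filtering punctuation first would raise all three ratios.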


In [None]:
print("Loading POS Tagger...")
pos_tagger = SequenceTagger.load(MODEL_POS)

pos_stats = []

print("Analyzing Part-of-Speech ratios...")
for text in tqdm(texts, desc="POS Tagging"):
    sentence = Sentence(text)
    pos_tagger.predict(sentence)
    
    # Extract tags
    tags = [token.get_label('pos').value for token in sentence]
    total_tokens = len(tags)
    
    if total_tokens > 0:
        stats = {
            "noun_ratio": tags.count('NOUN') / total_tokens,
            "verb_ratio": tags.count('VERB') / total_tokens,
            "adj_ratio": tags.count('ADJ') / total_tokens
        }
    else:
        stats = {"noun_ratio": 0.0, "verb_ratio": 0.0, "adj_ratio": 0.0}
    
    pos_stats.append(stats)

# Convert to DataFrame for easy handling later
df_pos = pd.DataFrame(pos_stats)
print(df_pos.head())


## 5. Topic Modeling (BERTopic)
Cluster the chapters on the pre-calculated **OpenAI Embeddings**, then derive each topic's **Representation** (its descriptive words) from the **Lemmatized Texts**.
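
To illustrate how topic words can be derived from lemmatized texts once cluster labels exist, here is a simplified pure-Python sketch loosely modeled on class-based TF-IDF weighting. It is not BERTopic's implementation: the helper `class_tfidf_top_terms`, the toy documents, and the toy labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_tfidf_top_terms(docs, labels, stopwords=(), top_n=2):
    """Rank terms per cluster by (term share in cluster) * log(1 + avg cluster size / corpus frequency)."""
    stop = set(stopwords)
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(w for w in doc.lower().split() if w not in stop)

    # Corpus-wide term frequencies and the average number of words per cluster
    term_totals = Counter()
    for counts in class_counts.values():
        term_totals.update(counts)
    avg_words = sum(term_totals.values()) / len(class_counts)

    top_terms = {}
    for label, counts in class_counts.items():
        class_size = sum(counts.values())
        scores = {
            term: (freq / class_size) * math.log(1 + avg_words / term_totals[term])
            for term, freq in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for reproducibility
        top_terms[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return top_terms


toy_docs = [
    "episcopus et presbyter ordinare",
    "episcopus ecclesia consecrare",
    "homicidium poenitentia annus",
    "poenitentia ieiunare annus",
]
toy_labels = [0, 0, 1, 1]  # cluster assignments, e.g. obtained from embedding-based clustering
top_terms = class_tfidf_top_terms(toy_docs, toy_labels, stopwords=["et"])
print(top_terms)
```

The point of the sketch is the division of labor: the labels come from the embeddings, while the words that describe each cluster come from the lemmatized, stopword-filtered text.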


In [None]:
print("Initializing BERTopic...")

# Use the custom Latin stopwords list; ngram_range=(1, 1) keeps topic labels to single words.
# BERTopic's own n_gram_range is ignored when a custom vectorizer is passed, so it is set here.
vectorizer_model = CountVectorizer(stop_words=LATIN_STOPWORDS, ngram_range=(1, 1))

topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    verbose=True
)

print("Fitting BERTopic model...")
# Pass lemmatized text for labels, but embeddings for clustering
topics, probs = topic_model.fit_transform(lemmatized_texts, embeddings)

# Show top topics
print(topic_model.get_topic_info().head(10))

# Visualization (Interactive in Jupyter)
topic_model.visualize_topics()


## 6. Keyword Extraction (KeyBERT)
Extract the most characteristic terms (the "keyness") of each chapter using KeyBERT.
Use a multilingual sentence-embedding model, here assumed to cope adequately with Latin, to find the words in each lemmatized chapter that are semantically closest to the chapter as a whole.
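
The underlying idea can be sketched without any model: embed the document and each candidate word, then rank the candidates by cosine similarity to the document. The vectors below are invented three-dimensional stand-ins for real embeddings, and `rank_keywords` is a hypothetical helper, not KeyBERT's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_keywords(doc_vec, word_vecs, stopwords=(), top_n=2):
    """Return the top_n (word, similarity) pairs, skipping stopwords."""
    stop = set(stopwords)
    scored = [(w, cosine(doc_vec, v)) for w, v in word_vecs.items() if w not in stop]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented toy vectors; real embeddings have hundreds of dimensions
toy_doc_vec = [0.9, 0.1, 0.0]
toy_word_vecs = {
    "episcopus": [1.0, 0.0, 0.0],
    "ecclesia": [0.7, 0.7, 0.0],
    "annus": [0.0, 0.0, 1.0],
    "et": [0.9, 0.1, 0.0],  # identical to the document vector, but removed as a stopword
}
keywords = rank_keywords(toy_doc_vec, toy_word_vecs, stopwords=["et"])
print([word for word, _ in keywords])  # ['episcopus', 'ecclesia']
```

The stopword entry shows why the filter matters: a function word can sit very close to the document vector and would otherwise crowd out content words.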


In [None]:
print("Initializing KeyBERT...")
kw_model = KeyBERT(model='paraphrase-multilingual-MiniLM-L12-v2')

chapter_keywords = []

print("Extracting Keywords per chapter...")
for text in tqdm(lemmatized_texts, desc="KeyBERT"):
    kws = kw_model.extract_keywords(
        text, 
        keyphrase_ngram_range=(1, 1), 
        stop_words=LATIN_STOPWORDS, 
        top_n=5
    )
    # Store as comma-separated string
    chapter_keywords.append(", ".join([k[0] for k in kws]))

print(f"Sample Keywords for Chapter 0: {chapter_keywords[0]}")


## 7. Data Aggregation & Export
Combine all new features into a single Hugging Face Dataset object.
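
Adding columns only works if every new list has exactly one entry per chapter, so a length check before merging can surface mismatches early. The sketch below also shows one way to build a short topic label from `(word, score)` pairs; both `validate_columns` and `topic_label` are hypothetical helpers, and the toy values are invented.

```python
def validate_columns(new_features, n_rows):
    """Raise ValueError if any feature list does not have exactly n_rows entries."""
    bad = {name: len(values) for name, values in new_features.items() if len(values) != n_rows}
    if bad:
        raise ValueError(f"Column length mismatch (expected {n_rows}): {bad}")


def topic_label(topic_id, top_words, n_words=2):
    """Join the first n_words topic words with '_', or return 'outlier' for topic -1."""
    if topic_id == -1:
        return "outlier"
    return "_".join(word for word, _score in top_words[:n_words])


# Toy stand-ins: three chapters, one of them an outlier
toy_topic_words = {0: [("episcopus", 0.46), ("presbyter", 0.32), ("ordinare", 0.30)]}
toy_topics = [0, -1, 0]
toy_features = {
    "topic_id": toy_topics,
    "topic_name": [topic_label(t, toy_topic_words.get(t, [])) for t in toy_topics],
}
validate_columns(toy_features, n_rows=3)
print(toy_features["topic_name"])  # ['episcopus_presbyter', 'outlier', 'episcopus_presbyter']
```

Slicing with `top_words[:n_words]` also avoids an `IndexError` if a topic happens to have fewer words than requested.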


In [None]:
print("Aggregating data...")

# Create a dictionary of new features
new_features = {
    "lemmatized_text": lemmatized_texts,
    "topic_id": topics,
    # Map topic IDs to their string representation (top 2 words joined by "_")
    "topic_name": [topic_model.get_topic(t)[0][0] + "_" + topic_model.get_topic(t)[1][0] if t != -1 else "outlier" for t in topics],
    "keywords": chapter_keywords,
    "noun_ratio": df_pos["noun_ratio"].tolist(),
    "verb_ratio": df_pos["verb_ratio"].tolist(),
    "adj_ratio": df_pos["adj_ratio"].tolist()
}

# Add columns to the existing dataset
dataset_enriched = dataset
for name, data in new_features.items():
    dataset_enriched = dataset_enriched.add_column(name, data)

print("Enriched Dataset Structure:")
print(dataset_enriched)


### Saving the Data
Uncomment the lines below to save locally or push to the Hub.


In [None]:
# Option 1: Save locally
# dataset_enriched.save_to_disk("./burchard_enriched")

# Option 2: Push to Hugging Face Hub
# NOTE: You need to be logged in via `huggingface-cli login`
# dataset_enriched.push_to_hub("YOUR_USERNAME/bdd-ep-analysed")

print("Analysis Complete.")
