# Topic Modeling with BERTopic on Parliamentary Speeches - Google Colab Version

This notebook is optimized for Google Colab with GPU acceleration. It implements the pipeline from data loading to the embeddings that feed topic modeling:

1. **Data Loading** - Loads `AT_original_complete.pkl` from Google Drive
2. **Data Filtering** - Separates speeches flagged `Is_Too_Short` from the speeches used for segmentation
3. **Dual Embedding** - Speech-level and segment-level embeddings
4. **Semantic Segmentation** - Similarity-based boundary detection within each sitting

## Key Approach - Dual Embedding Strategy:
- **First embedding**: Individual speeches using raw text (for segmentation)
- **Second embedding**: Concatenated segment texts (for topic modeling)
- **Why twice?** Re-embedding captures full discourse coherence vs. averaging individual embeddings
- **Raw text used throughout** for better semantic capture
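
The two passes can be sketched in plain Python. This is only an illustration of the data flow: `toy_embed` is a hypothetical stand-in for the real sentence encoder (it returns normalized word counts), and the `segment` labels are assumed to come from the segmentation step.

```python
import math
from collections import Counter, defaultdict

def toy_embed(text, vocab):
    """Hypothetical stand-in for a sentence encoder: L2-normalized word counts."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Toy speeches; the segment labels are assumed to be the output of segmentation.
speeches = [
    {"id": 1, "text": "budget tax budget", "segment": "s0"},
    {"id": 2, "text": "tax reform", "segment": "s0"},
    {"id": 3, "text": "school teachers school", "segment": "s1"},
]
vocab = sorted({w for s in speeches for w in s["text"].split()})

# Pass 1: one embedding per speech (used only to find segment boundaries).
for s in speeches:
    s["speech_emb"] = toy_embed(s["text"], vocab)

# Pass 2: concatenate the raw text of each segment and embed the result once.
segment_text = defaultdict(list)
for s in speeches:
    segment_text[s["segment"]].append(s["text"])
segment_emb = {seg: toy_embed(" ".join(texts), vocab)
               for seg, texts in segment_text.items()}

# Every speech carries the embedding of the segment it belongs to.
for s in speeches:
    s["segment_emb"] = segment_emb[s["segment"]]
```

Speeches 1 and 2 end up sharing one segment embedding, while speech 3 gets a different one.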

In [None]:
# === GOOGLE COLAB SETUP ===
# Mount Google Drive to access your data
from google.colab import drive
drive.mount('/content/drive')

# Install required packages
!pip install sentence-transformers bertopic umap-learn hdbscan tqdm openai python-dotenv

# Check GPU availability and optimize for A100
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    # Optimize for A100
    torch.backends.cudnn.benchmark = True
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
else:
    print("No GPU detected - will use CPU (slower)")

# Import all required libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import os
import gc

# Set random seed for reproducibility
import random
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print("Setup complete! ✓")

ModuleNotFoundError: No module named 'google.colab'

## Data Loading and Processing

Load the original complete data from Google Drive and create the processed version for topic modeling.

In [None]:
# === DATA LOADING AND PROCESSING FOR GOOGLE COLAB ===

# Path to your file in Google Drive (update if needed)
data_folder = '/content/drive/MyDrive/data folder/data/'
data_path = f'{data_folder}AT_original_complete.pkl'

AT_original_df = pd.read_pickle(data_path)
print(f"✅ Loaded original complete data: {AT_original_df.shape}")
print(f"Columns: {list(AT_original_df.columns)}")

# Filter out short speeches for segmentation and embedding
long_df = AT_original_df[~AT_original_df['Is_Too_Short']].copy()
short_df = AT_original_df[AT_original_df['Is_Too_Short']].copy()
print(f"Long speeches for segmentation: {len(long_df):,}")
print(f"Short speeches to assign after segmentation: {len(short_df):,}")

✅ Loaded original complete data: (231752, 27)
Columns: ['Sitting_ID', 'Speech_ID', 'Title', 'Date', 'Body', 'Term', 'Session', 'Meeting', 'Sitting', 'Agenda', 'Subcorpus', 'Lang', 'Speaker_role', 'Speaker_MP', 'Speaker_minister', 'Speaker_party', 'Speaker_party_name', 'Party_status', 'Party_orientation', 'Speaker_ID', 'Speaker_name', 'Speaker_gender', 'Speaker_birth', 'Text', 'Word_Count', 'Is_Too_Short', 'Is_Filtered']

📈 Ready for topic modeling: 231,752 speeches



## Embedding and Segmentation Functions (GPU Optimized)

These functions are optimized for GPU acceleration and handle the dual-embedding approach:
1. **Speech-level embeddings** for similarity-based segmentation  
2. **Segment-level embeddings** for final topic modeling
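
As a rough illustration of similarity-based segmentation, the sketch below compares the mean embedding of a window before each gap with the mean of a window after it, then marks a boundary where the dissimilarity peaks. It is plain Python on made-up 2-d vectors. The notebook's own functions use `cosine_similarity` and `scipy.signal.find_peaks` with height, prominence, and distance thresholds, while the `boundaries` helper here is a simplified local-maximum rule assumed for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def gap_similarities(embs, window=2):
    """Similarity across each gap g (between item g and item g + 1)."""
    sims = []
    for g in range(len(embs) - 1):
        before = embs[max(0, g - window + 1): g + 1]
        after = embs[g + 1: g + 1 + window]
        sims.append(cosine(mean_vec(before), mean_vec(after)))
    return sims

def boundaries(sims, min_drop=0.3):
    """Gaps whose dissimilarity (1 - sim) is a strict local maximum >= min_drop."""
    d = [1.0 - s for s in sims]
    out = []
    for g, v in enumerate(d):
        left = d[g - 1] if g > 0 else float("-inf")
        right = d[g + 1] if g + 1 < len(d) else float("-inf")
        if v >= min_drop and v > left and v > right:
            out.append(g)
    return out

# Three speeches about one topic followed by three about another (toy vectors).
embs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
sims = gap_similarities(embs, window=2)
cuts = boundaries(sims)

# A boundary at gap g starts a new segment at speech g + 1.
segment_of = []
current = 0
for i in range(len(embs)):
    if i > 0 and (i - 1) in cuts:
        current += 1
    segment_of.append(current)
```

Here the only boundary falls at gap 2, so the first three speeches form segment 0 and the last three form segment 1.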

In [None]:
# === EMBEDDING FUNCTIONS (A100 OPTIMIZED) ===
from sklearn.metrics.pairwise import cosine_similarity
from scipy.signal import find_peaks
from sentence_transformers import SentenceTransformer
import torch
import time
import gc
from tqdm import tqdm
import pickle
import os

def load_embedding_model(model_name="BAAI/bge-m3", device=None):
    """Load embedding model optimized for A100 GPU."""
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'

    print(f"Loading embedding model: {model_name} on {device}")
    start_time = time.time()

    try:
        # Request FP16 weights only on GPU; half-precision inference on CPU is slow or unsupported
        model_kwargs = {'torch_dtype': torch.float16} if str(device).startswith('cuda') else {}
        model = SentenceTransformer(
            model_name,
            device=device,
            trust_remote_code=True,
            model_kwargs=model_kwargs
        )
        print(f"✓ Model loaded in {time.time() - start_time:.2f} seconds")
        return model
    except Exception as e:
        print(f"❌ Error loading {model_name}: {e}")
        raise

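def chunk_text_tokenwise(text, tokenizer, chunk_size=4096, overlap=1024):
    """Split text into overlapping token windows; returns (chunk_text, n_tokens) pairs."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(token_ids), chunk_size - overlap):
        end = min(start + chunk_size, len(token_ids))
        chunk_ids = token_ids[start:end]
        chunk_text = tokenizer.decode(chunk_ids, skip_special_tokens=True)
        chunks.append((chunk_text, len(chunk_ids)))
        if end == len(token_ids):
            break  # later starts would only repeat tokens that are already covered
    return chunks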

def weighted_mean(embeddings, weights):
    embeddings = np.stack(embeddings)
    weights = np.array(weights)
    weights = weights / weights.sum()
    return np.average(embeddings, axis=0, weights=weights)

def embed_text_bge(text, model, tokenizer):
    """Embed one text; texts over 8192 tokens are chunked and averaged, weighted by chunk length."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if len(token_ids) <= 8192:
        return model.encode(text, convert_to_tensor=False, show_progress_bar=False)
    else:
        chunks = chunk_text_tokenwise(text, tokenizer, chunk_size=4096, overlap=1024)
        chunk_texts, chunk_lengths = zip(*chunks)
        # Use larger batch size for A100 and suppress progress bar
        chunk_embeddings = model.encode(
            list(chunk_texts), 
            batch_size=128, 
            convert_to_tensor=False,
            show_progress_bar=False  # Suppress internal progress bar
        )
        return weighted_mean(chunk_embeddings, chunk_lengths)

def save_checkpoint(data, checkpoint_path):
    """Save checkpoint for resuming."""
    with open(checkpoint_path, 'wb') as f:
        pickle.dump(data, f)
    print(f"💾 Checkpoint saved: {checkpoint_path}")

def load_checkpoint(checkpoint_path):
    """Load checkpoint for resuming."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, 'rb') as f:
            data = pickle.load(f)
        print(f"📂 Checkpoint loaded: {checkpoint_path}")
        return data
    return None

def generate_speech_embeddings_for_segmentation(
    df, text_column='Text', model_name="BAAI/bge-m3", 
    batch_size=64, checkpoint_freq=5000
):
    """
    Generate BGE-m3 embeddings with A100 optimization and checkpointing.
    """
    print("=" * 60)
    print("SPEECH EMBEDDINGS: BGE-m3 optimized for A100")
    print("=" * 60)
    print(f"Processing {len(df)} speeches with batch_size={batch_size}")

    # Setup checkpointing
    checkpoint_dir = '/content/drive/MyDrive/checkpoints/'
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = f'{checkpoint_dir}speech_embeddings_checkpoint.pkl'
    
    # Try to load existing checkpoint
    checkpoint_data = load_checkpoint(checkpoint_path)
    if checkpoint_data:
        start_idx = checkpoint_data['last_processed_idx'] + 1
        embeddings = checkpoint_data['embeddings']
        print(f"🔄 Resuming from index {start_idx}")
    else:
        start_idx = 0
        embeddings = []

    model = SentenceTransformer(model_name, device="cuda" if torch.cuda.is_available() else "cpu")
    if torch.cuda.is_available():
        model.half()  # FP16 for A100
    tokenizer = model.tokenizer
    model.max_seq_length = 8192

    texts = df[text_column].astype(str).values
    
    # Process in batches with larger batch size for A100
    total_batches = (len(texts) - start_idx + batch_size - 1) // batch_size
    
    with tqdm(total=len(texts), initial=start_idx, desc="🚀 Embedding speeches", unit="speech") as pbar:
        for i in range(start_idx, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            
            # Process batch - handle long texts individually but in batches where possible
            batch_embeddings = []
            short_texts = []
            short_indices = []
            
            for j, text in enumerate(batch_texts):
                token_count = len(tokenizer.encode(text, add_special_tokens=False))
                if token_count <= 8192:
                    short_texts.append(text)
                    short_indices.append(j)
                else:
                    # Handle long text individually - suppress any internal progress
                    emb = embed_text_bge(text, model, tokenizer)
                    batch_embeddings.append((j, emb))
            
            # Batch process short texts with larger batch size
            if short_texts:
                short_embeddings = model.encode(
                    short_texts, 
                    batch_size=min(128, len(short_texts)),  # A100 optimized batch size
                    convert_to_tensor=False,
                    show_progress_bar=False  # Suppress internal progress bar
                )
                for idx, emb in zip(short_indices, short_embeddings):
                    batch_embeddings.append((idx, emb))
            
            # Sort by original order and add to results
            batch_embeddings.sort(key=lambda x: x[0])
            embeddings.extend([emb for _, emb in batch_embeddings])
            
            # Update progress (only once per batch)
            pbar.update(len(batch_texts))
            
            # Save checkpoint periodically. Comparing floor-divided counts (rather than
            # testing `% checkpoint_freq == 0`) also fires when batch_size does not
            # divide checkpoint_freq, e.g. batch_size=64 with checkpoint_freq=5000.
            processed = i + len(batch_texts)
            if processed // checkpoint_freq > i // checkpoint_freq:
                checkpoint_data = {
                    'embeddings': embeddings,
                    'last_processed_idx': processed - 1
                }
                with open(checkpoint_path, 'wb') as f:
                    pickle.dump(checkpoint_data, f)
                # Only show a checkpoint message on every 4th checkpoint
                if (processed // checkpoint_freq) % 4 == 0:
                    print(f"\n💾 Checkpoint: {processed:,}/{len(texts):,} speeches processed")

            # Clear GPU cache periodically (every 2nd checkpoint interval)
            if processed // (checkpoint_freq * 2) > i // (checkpoint_freq * 2):
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                gc.collect()

    # Clean up checkpoint
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
        print("🧹 Checkpoint cleaned up")

    df_with_embeddings = df.copy()
    df_with_embeddings['Speech_Embeddings'] = embeddings
    return df_with_embeddings

# Similarity and segmentation functions (CPU-only; they do not need GPU optimization)
def calculate_windowed_similarity(embeddings_list, window_size=3):
    """Calculate cosine similarity between windowed embeddings."""
    if len(embeddings_list) < 2:
        return np.array([])
    if window_size < 1:
        raise ValueError("Window size must be at least 1.")

    num_utterances = len(embeddings_list)
    similarities = []

    for g in range(num_utterances - 1):
        # Window before gap
        start_before = max(0, g - window_size + 1)
        end_before = g + 1
        window_before = embeddings_list[start_before:end_before]

        # Window after gap
        start_after = g + 1
        end_after = min(num_utterances, g + 1 + window_size)
        window_after = embeddings_list[start_after:end_after]

        if not window_before or not window_after:
            similarities.append(0)
            continue

        # Calculate mean embeddings and similarity
        mean_before = np.mean([np.asarray(e) for e in window_before], axis=0)
        mean_after = np.mean([np.asarray(e) for e in window_after], axis=0)
        
        sim = cosine_similarity(mean_before.reshape(1, -1), mean_after.reshape(1, -1))[0][0]
        similarities.append(sim)
        
    return np.array(similarities)

def find_topic_boundaries(similarities, height_threshold=0.3, prominence_threshold=0.2, distance_threshold=5):
    """Find topic boundaries using peak detection on inverted similarity scores."""
    if len(similarities) == 0:
        return np.array([])
    
    # Invert similarities to find valleys (topic boundaries)
    inverted_similarities = np.maximum(0, 1 - similarities)
    
    # Find peaks in inverted similarities
    peaks, _ = find_peaks(
        inverted_similarities,
        height=height_threshold,
        prominence=prominence_threshold,
        distance=distance_threshold
    )
    
    return peaks

def segment_speeches_by_similarity(df, window_size=3, height_threshold=0.3, 
                                   prominence_threshold=0.2, distance_threshold=5):
    """Segment speeches within each sitting based on semantic similarity."""
    print("🔍 Segmenting speeches using similarity-based approach")
    print(f"Parameters: window_size={window_size}, height_threshold={height_threshold}")
    print(f"           prominence_threshold={prominence_threshold}, distance_threshold={distance_threshold}")
    
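    # groupby() iterates sittings in sorted key order, so sort the rows the same way
    # (stable, to keep the speech order within each sitting); otherwise the positional
    # assignment of segment_ids below could attach IDs to the wrong rows.
    df_segmented = df.sort_values('Sitting_ID', kind='stable').copy()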
    segment_ids = []
    total_boundaries = 0
    
    # Process each sitting separately with progress bar
    sittings = list(df_segmented.groupby('Sitting_ID'))
    
    for sitting_id, group in tqdm(sittings, desc="🔪 Segmenting sittings", unit="sitting"):
        if len(group) < 2:
            segment_ids.extend([f"{sitting_id}_seg_0"] * len(group))
            continue
        
        # Use the speech-level embeddings for segmentation
        embeddings_list = group['Speech_Embeddings'].tolist()
        similarities = calculate_windowed_similarity(embeddings_list, window_size)
        
        if len(similarities) == 0:
            segment_ids.extend([f"{sitting_id}_seg_0"] * len(group))
            continue
        
        # Find boundaries
        boundaries = find_topic_boundaries(
            similarities, height_threshold, prominence_threshold, distance_threshold
        )
        total_boundaries += len(boundaries)
        
        # Assign segment IDs
        current_segment = 0
        sitting_segment_ids = []
        
        for i in range(len(group)):
            if i > 0 and (i - 1) in boundaries:
                current_segment += 1
            sitting_segment_ids.append(f"{sitting_id}_seg_{current_segment}")
        
        segment_ids.extend(sitting_segment_ids)
    
    df_segmented['Segment_ID'] = segment_ids
    
    # Print statistics
    total_segments = df_segmented['Segment_ID'].nunique()
    avg_segments_per_sitting = df_segmented.groupby('Sitting_ID')['Segment_ID'].nunique().mean()
    
    print("✓ Segmentation complete!")
    print(f"✓ Total boundaries detected: {total_boundaries}")
    print(f"✓ Total segments created: {total_segments}")
    print(f"✓ Average segments per sitting: {avg_segments_per_sitting:.2f}")
    
    return df_segmented

print("✓ Embedding and segmentation functions loaded")

✓ Embedding and segmentation functions loaded


## Segment Aggregation and Re-embedding Functions (GPU Optimized)

After creating segments, we aggregate the raw text and re-embed for better topic modeling representation.
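
Aggregated segment text can exceed the 8192-token limit set below, in which case the notebook splits it into overlapping windows (4096 tokens with a 1024-token overlap) and averages the chunk embeddings weighted by chunk length. The following is a minimal plain-Python sketch of that idea on a toy token list; the helper names `chunk_tokens` and `weighted_mean`, the small window sizes, and the chunk vectors are illustrative assumptions, not the notebook's actual values.

```python
def chunk_tokens(token_ids, chunk_size=4096, overlap=1024):
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        end = min(start + chunk_size, len(token_ids))
        chunks.append(token_ids[start:end])
        if end == len(token_ids):
            break  # the text is fully covered; skip a redundant tail chunk
    return chunks

def weighted_mean(vectors, weights):
    """Average vectors component-wise, weighting each by its chunk length."""
    total = float(sum(weights))
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

# Toy example: 10 "tokens", windows of 4 with an overlap of 1 (step of 3).
chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)

# Pretend each chunk was embedded into a 2-d vector (hypothetical values).
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = weighted_mean(chunk_vectors, [len(c) for c in chunks])
```

The toy input yields three windows of four tokens each, so all three chunk vectors contribute equally to `pooled`. With unequal chunk lengths the shorter final chunk would count for less.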

In [None]:
# === SEGMENT AGGREGATION AND RE-EMBEDDING FUNCTIONS (A100 OPTIMIZED) ===

BGE_MODEL_NAME = "BAAI/bge-m3"
TOKEN_LIMIT = 8192
CHUNK_SIZE = 4096
CHUNK_OVERLAP = 1024
BGE_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def generate_segment_embeddings(df, text_column='Text', segment_id_column='Segment_ID', batch_size=32):
    """
    Generate segment embeddings with A100 optimization and batching.
    """
    print(f"🔄 Generating segment embeddings for {df[segment_id_column].nunique()} segments...")
    
    model = SentenceTransformer(BGE_MODEL_NAME, device=BGE_DEVICE)
    if torch.cuda.is_available():
        model.half()  # FP16 for A100
    tokenizer = model.tokenizer
    model.max_seq_length = TOKEN_LIMIT

    # Aggregate texts by segment
    segment_texts = df.groupby(segment_id_column)[text_column].apply(
        lambda x: ' '.join(x.astype(str))
    ).to_dict()
    
    # Process segments in batches
    segment_ids = list(segment_texts.keys())
    segment_embeddings = {}
    
    with tqdm(total=len(segment_ids), desc="🚀 Embedding segments", unit="segment") as pbar:
        for i in range(0, len(segment_ids), batch_size):
            batch_ids = segment_ids[i:i+batch_size]
            batch_texts = [segment_texts[seg_id] for seg_id in batch_ids]
            
            # Check which texts need chunking
            short_texts = []
            short_ids = []
            long_ids = []
            
            for seg_id, text in zip(batch_ids, batch_texts):
                token_count = len(tokenizer.encode(text, add_special_tokens=False))
                if token_count <= TOKEN_LIMIT:
                    short_texts.append(text)
                    short_ids.append(seg_id)
                else:
                    long_ids.append(seg_id)
            
            # Batch process short texts
            if short_texts:
                batch_embeddings = model.encode(
                    short_texts,
                    batch_size=min(64, len(short_texts)),  # A100 optimized
                    convert_to_tensor=False,
                    show_progress_bar=False  # Suppress internal progress bar
                )
                for seg_id, emb in zip(short_ids, batch_embeddings):
                    segment_embeddings[seg_id] = emb
            
            # Process long texts individually
            for seg_id in long_ids:
                text = segment_texts[seg_id]
                emb = embed_text_bge(text, model, tokenizer)
                segment_embeddings[seg_id] = emb
            
            pbar.update(len(batch_ids))
            
            # Clear cache periodically
            if i % (batch_size * 10) == 0:
                torch.cuda.empty_cache()
                gc.collect()

    # Map embeddings back to dataframe
    df['Segment_Embeddings'] = df[segment_id_column].map(segment_embeddings)
    print("✓ Segment embeddings mapped to all speeches.")
    return df

print("✓ A100-optimized embedding functions loaded")

## Usage Summary

### What This Notebook Does:
1. **Loads one file** from Google Drive: `AT_original_complete.pkl`
2. **Processes the data** to create topic modeling version
3. **Runs dual embedding** approach with GPU acceleration:
   - Speech-level embeddings for segmentation (`Speech_Embeddings`)
   - Segment-level embeddings for each speech (`Segment_Embeddings`)
   - Segment ID for each speech (`Segment_ID`)
4. **Saves results** back to Google Drive as a single file

### What You Need to Upload:
- **Only one file**: `AT_original_complete.pkl`
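- **Upload location**: the Drive folder that `data_folder` points to (`MyDrive/data folder/data/` in the data loading cell; update that path if your file lives elsewhere, e.g. `MyDrive/thesis_data/`)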

### Performance on Colab GPU:
- **Speech embeddings**: ~30-100 speeches/second
- **Segment embeddings**: ~10-30 segments/second  
- **Total time**: ~10-30 minutes for large datasets (vs hours on CPU)

### Next Steps:
1. **Download results** to your local machine
2. **Use the final dataframe** for further analysis

### Output File:
- `AT_with_embeddings_final.pkl` - Original dataframe with three new columns:
  - `Speech_Embeddings`
  - `Segment_ID`
  - `Segment_Embeddings`
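
Because `Segment_Embeddings` is repeated on every speech of a segment, a topic model fitted on segments would typically need one document and one vector per segment. How the final dataframe is passed to BERTopic is not shown in this notebook, so the sketch below is only an assumption about that step. It uses plain dictionaries with made-up values in place of dataframe rows.

```python
# Rows mimic the final dataframe; all values are made up for illustration.
rows = [
    {"Speech_ID": "a", "Segment_ID": "x_seg_0", "Text": "first speech",
     "Segment_Embeddings": [0.1, 0.9]},
    {"Speech_ID": "b", "Segment_ID": "x_seg_0", "Text": "second speech",
     "Segment_Embeddings": [0.1, 0.9]},
    {"Speech_ID": "c", "Segment_ID": "x_seg_1", "Text": "third speech",
     "Segment_Embeddings": [0.8, 0.2]},
]

segment_docs = {}     # Segment_ID -> list of speech texts
segment_vectors = {}  # Segment_ID -> embedding (same for every speech in the segment)
for row in rows:
    segment_docs.setdefault(row["Segment_ID"], []).append(row["Text"])
    segment_vectors.setdefault(row["Segment_ID"], row["Segment_Embeddings"])

segment_ids = list(segment_docs)  # order of first appearance
docs = [" ".join(segment_docs[s]) for s in segment_ids]
embeddings = [segment_vectors[s] for s in segment_ids]  # aligned with docs
```

`docs` and `embeddings` are aligned by position. After converting `embeddings` to a NumPy array, the pair has the shape that BERTopic's `fit_transform(documents, embeddings)` accepts for precomputed embeddings.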

🎉 **Happy Embedding on Colab!**

In [None]:
# === RUN THE OPTIMIZED PIPELINE ===

print("🚀 Starting A100-optimized embedding pipeline...")
print(f"💻 Using: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print(f"📊 Processing {len(long_df)} speeches for segmentation")

# Clear any existing cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()

try:
    # STEP 1: Generate speech-level embeddings with A100 optimization
    print("\n🔄 Generating speech-level embeddings...")
    df_with_speech_embeddings = generate_speech_embeddings_for_segmentation(
        long_df, 
        text_column='Text',
        batch_size=64,  # A100 optimized batch size
        checkpoint_freq=10000  # Checkpoint every 10k speeches
    )
    print("✅ Speech-level embeddings generated!")
    
    # Clear cache before segmentation
    torch.cuda.empty_cache()
    gc.collect()

    # STEP 2: Segment speeches by similarity
    print("\n🔍 Segmenting speeches by similarity...")
    df_segmented = segment_speeches_by_similarity(
        df_with_speech_embeddings, window_size=3,
        height_threshold=0.3, prominence_threshold=0.2,
        distance_threshold=5
    )
    print("✅ Segmentation complete!")

    # STEP 3: Assign short speeches to a segment of their sitting
    print("\n🔄 Assigning short speeches to segments...")
    def assign_short_speeches(short_df, segmented_df):
        """Assign every short speech to the first segment of its sitting.

        This is a simple placeholder strategy: it ignores where the short speech
        occurred within the sitting and could be replaced with better logic.
        """
        # First segment of each sitting, in the row order of segmented_df
        first_segment = segmented_df.groupby('Sitting_ID')['Segment_ID'].first()

        short_df = short_df.copy()
        # Map by Sitting_ID so the assignment does not depend on groupby ordering
        short_df['Segment_ID'] = short_df['Sitting_ID'].map(first_segment)

        # Sittings without any long speech get a default segment
        missing = short_df['Segment_ID'].isna()
        short_df.loc[missing, 'Segment_ID'] = (
            short_df.loc[missing, 'Sitting_ID'].astype(str) + '_seg_0'
        )
        return short_df

    short_df_assigned = assign_short_speeches(short_df, df_segmented)
    df_all = pd.concat([df_segmented, short_df_assigned], ignore_index=True)

    # STEP 4: Generate segment-level embeddings with A100 optimization
    print("\n🔄 Generating segment-level embeddings...")
    df_final = generate_segment_embeddings(
        df_all, 
        text_column='Text', 
        segment_id_column='Segment_ID',
        batch_size=32  # A100 optimized for segments
    )
    print("✅ Segment-level embeddings mapped!")

    # STEP 5: Save final output
    output_path = f"{data_folder}AT_with_embeddings_final.pkl"
    df_final.to_pickle(output_path)
    print(f"\n💾 Saved final dataframe: {output_path}")
    print(f"📊 Final shape: {df_final.shape}")
    print(f"🎯 Segments created: {df_final['Segment_ID'].nunique()}")
    
    # Final cleanup
    torch.cuda.empty_cache()
    gc.collect()
    print("🧹 Memory cleaned up")

except Exception as e:
    print(f"❌ Error in pipeline: {e}")
    import traceback
    traceback.print_exc()