# Topic Modeling with BERTopic on Parliamentary Speeches - Google Colab Version

This notebook is optimized for Google Colab with GPU acceleration. It implements a complete pipeline from data loading to topic modeling:

1. **Data Loading** - Loads the AT_original_complete.pkl file and processes it
2. **Data Filtering** - Creates processed version for topic modeling
3. **Dual Embedding** - Speech-level and segment-level embeddings
4. **Semantic Segmentation** - Similarity-based boundary detection  
5. **Topic Discovery** - BERTopic with custom clustering
6. **Topic Naming** - LLM-generated readable names

## Key Approach - Dual Embedding Strategy:
- **First embedding**: Individual speeches using raw text (for segmentation)
- **Second embedding**: Concatenated segment texts (for topic modeling)
- **Why twice?** Re-embedding the concatenated text encodes each segment as one coherent passage, rather than approximating it by averaging the per-speech embeddings
- **Raw text used throughout** for better semantic capture

## Setup Instructions:
1. **Upload only one file** to Google Drive: `AT_original_complete.pkl`
   - Place it in a folder like: `MyDrive/thesis_data/AT_original_complete.pkl`
2. **Enable GPU**: Runtime → Change runtime type → GPU
3. **Run the setup cell** below to mount Drive and install packages
4. **Update data path** to match your Google Drive structure
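
Before running the pipeline, it can help to confirm that the path from step 4 actually resolves. The following is a minimal sketch: `resolve_data_path` is a hypothetical helper that is not part of this notebook, and the folder name is simply the one suggested in step 1.

```python
import os

def resolve_data_path(data_folder, filename="AT_original_complete.pkl"):
    """Return the full path to the data file, or raise if it is missing."""
    path = os.path.join(data_folder, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found. Check the folder name (for example "
            f"'thesis_data' vs 'thesis data') and that Drive is mounted."
        )
    return path

# Example (in Colab, after mounting Drive):
# data_path = resolve_data_path('/content/drive/MyDrive/thesis_data/')
```

Failing early with a clear message is usually easier to debug than a `FileNotFoundError` raised from inside `pd.read_pickle`.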

In [None]:
# === GOOGLE COLAB SETUP ===
# Mount Google Drive to access your data
from google.colab import drive
drive.mount('/content/drive')

# Install required packages (einops is needed by the nomic-embed remote model code)
!pip install sentence-transformers bertopic umap-learn hdbscan tqdm openai python-dotenv einops

# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("No GPU detected - will use CPU (slower)")

# Import all required libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
import os
warnings.filterwarnings('ignore')

print("Setup complete! ✓")

## Data Loading and Processing

Load the original complete data from Google Drive and create the processed version for topic modeling.
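
The filtering step can be illustrated on a toy table. This is only a sketch: the three example sentences are made up, while the column names and the 10-word threshold mirror the cell that follows.

```python
import pandas as pd

# Toy data; the real notebook reads these columns from AT_original_complete.pkl.
toy = pd.DataFrame({
    "Text": [
        "Thank you.",
        "I would like to speak about the proposed changes to the pension system today.",
        "Applause.",
    ]
})

min_word_count = 10  # same threshold as in the notebook cell
toy["Word_Count"] = toy["Text"].astype(str).str.split().str.len()
toy["Is_Too_Short"] = toy["Word_Count"] < min_word_count
toy["Is_Filtered"] = toy["Is_Too_Short"]

# Keep only rows that pass the filter, as AT_processed_df does.
processed = toy[~toy["Is_Filtered"]].copy()
print(processed[["Text", "Word_Count"]])
```

Only the second row survives, since the other two have fewer than 10 whitespace-separated tokens.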

In [None]:
# === DATA LOADING AND PROCESSING ===

# IMPORTANT: Update this path to match your Google Drive structure
data_folder = '/content/drive/MyDrive/thesis_data/'  # Update this path!
data_path = f'{data_folder}AT_original_complete.pkl'

print("🚀 Loading original complete data...")


# Load ORIGINAL complete data
AT_original_df = pd.read_pickle(data_path)
print(f"✅ Loaded original complete data: {AT_original_df.shape}")
    
# Display basic info about the loaded data
print(f"\n📋 Dataset Info:")
print(f"  📊 Total speeches: {len(AT_original_df):,}")
print(f"  📅 Date range: {AT_original_df['Date'].min()} to {AT_original_df['Date'].max()}")
print(f"  📝 Columns: {list(AT_original_df.columns)}")
    
# Check if filtering columns exist, if not create them
if 'Word_Count' not in AT_original_df.columns:
    print("🔄 Calculating word counts...")
    AT_original_df['Word_Count'] = AT_original_df['Text'].apply(lambda x: len(str(x).split()))
    
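if 'Is_Too_Short' not in AT_original_df.columns:
    print("🔄 Creating filtering flags...")
    min_word_count = 10
    AT_original_df['Is_Too_Short'] = AT_original_df['Word_Count'] < min_word_count

# Create Is_Filtered separately so a file that already has Is_Too_Short still works
if 'Is_Filtered' not in AT_original_df.columns:
    AT_original_df['Is_Filtered'] = AT_original_df['Is_Too_Short']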
    
# Create processed version for topic modeling
print("\n🔄 Creating processed version for topic modeling...")
AT_processed_df = AT_original_df[~AT_original_df['Is_Filtered']].copy()
    
# Sort by sitting and speech order for consistency
if 'Sitting_ID' in AT_processed_df.columns and 'Speech_ID' in AT_processed_df.columns:
    AT_processed_df = AT_processed_df.sort_values(['Sitting_ID', 'Speech_ID']).reset_index(drop=True)
    print("🔄 Sorted speeches by Sitting_ID and Speech_ID")
elif 'Text_ID' in AT_processed_df.columns and 'ID' in AT_processed_df.columns:
    AT_processed_df = AT_processed_df.sort_values(['Text_ID', 'ID']).reset_index(drop=True)
    print("🔄 Sorted speeches by Text_ID and ID")
    
# Standardize column names for topic modeling pipeline
if 'Text_ID' in AT_processed_df.columns and 'Sitting_ID' not in AT_processed_df.columns:
    AT_processed_df['Sitting_ID'] = AT_processed_df['Text_ID']
    print("🔄 Created Sitting_ID from Text_ID")

if 'ID' in AT_processed_df.columns and 'Speaker_ID' not in AT_processed_df.columns:
    AT_processed_df['Speaker_ID'] = AT_processed_df['ID']
    print("🔄 Created Speaker_ID from ID")
    
# Add tracking column
AT_processed_df['Used_For_Topic_Modeling'] = True
    
print(f"\n📈 Filtering summary:")
print(f"  📊 Original speeches: {len(AT_original_df):,}")
print(f"  📊 For topic modeling: {len(AT_processed_df):,}")
print(f"  🗑️  Too short speeches: {AT_original_df['Is_Too_Short'].sum():,}")
print(f"  📊 Filtered out: {len(AT_original_df) - len(AT_processed_df):,}")
    
# Save processed version for later use
AT_processed_df.to_pickle(f'{data_folder}AT_for_topic_modeling.pkl')
print(f"\n💾 Saved processed data to: {data_folder}AT_for_topic_modeling.pkl")
    
# Verify required columns for pipeline
required_cols = ['Text', 'Sitting_ID']
missing_cols = [col for col in required_cols if col not in AT_processed_df.columns]
if missing_cols:
    print(f"⚠️  Warning: Missing required columns: {missing_cols}")
    print(f"Available columns: {list(AT_processed_df.columns)}")
else:
    print(f"✅ All required columns present for topic modeling pipeline")

print("\n✅ Data loading and processing complete! Ready for topic modeling.")

## Embedding and Segmentation Functions (GPU Optimized)

These functions are optimized for GPU acceleration and handle the dual-embedding approach:
1. **Speech-level embeddings** for similarity-based segmentation  
2. **Segment-level embeddings** for final topic modeling
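
To make the similarity-based segmentation idea concrete, here is a small NumPy-only sketch on made-up 2-D vectors. `windowed_gap_similarity` is an illustrative stand-in for the notebook's `calculate_windowed_similarity`, not the same function, and real embeddings have hundreds of dimensions.

```python
import numpy as np

def windowed_gap_similarity(embeddings, window_size=2):
    """Cosine similarity between the mean window before and after each gap."""
    embeddings = np.asarray(embeddings, dtype=float)
    sims = []
    for g in range(len(embeddings) - 1):
        before = embeddings[max(0, g - window_size + 1): g + 1].mean(axis=0)
        after = embeddings[g + 1: g + 1 + window_size].mean(axis=0)
        denom = np.linalg.norm(before) * np.linalg.norm(after)
        sims.append(float(before @ after / denom) if denom > 0 else 0.0)
    return np.array(sims)

# Toy 2-D "embeddings": three speeches on one topic, then three on another.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
       [0.1, 1.0], [0.0, 1.0], [0.1, 0.9]]
sims = windowed_gap_similarity(toy, window_size=2)
print(np.round(sims, 2))
print("lowest similarity at gap", int(np.argmin(sims)))
```

The similarity drops sharply at gap 2, the gap between the third and fourth speech, which is where the toy topic changes. The segmentation step looks for exactly this kind of valley.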

In [None]:
# === EMBEDDING FUNCTIONS (GPU OPTIMIZED) ===
from sklearn.metrics.pairwise import cosine_similarity
from scipy.signal import find_peaks
from sentence_transformers import SentenceTransformer
import torch
import time
import gc
from tqdm import tqdm

def load_embedding_model(model_name="nomic-ai/nomic-embed-text-v1.5", device=None):
    """
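    Load a sentence embedding model optimized for Colab GPU.

    Note: the nomic-embed model card recommends task prefixes (for example
    "clustering: "); this notebook embeds the raw text without one.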
    """
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    print(f"Loading embedding model: {model_name} on {device}")
    start_time = time.time()
    
    try:
        if device == 'cpu':
            torch.set_num_threads(2)  # Colab CPU optimization
            model = SentenceTransformer(
                model_name, 
                device=device, 
                trust_remote_code=True,
                model_kwargs={'torch_dtype': torch.float32}
            )
        else:
            # GPU optimization for Colab
            model = SentenceTransformer(model_name, device=device, trust_remote_code=True)
        
        print(f"✓ Model loaded in {time.time() - start_time:.2f} seconds")
        return model
        
    except Exception as e:
        print(f"❌ Error loading {model_name}: {e}")
        raise e

def generate_speech_embeddings_for_segmentation(df, text_column='Text', model_name="nomic-ai/nomic-embed-text-v1.5", batch_size=16):
    """
    FIRST EMBEDDING: Generate embeddings for individual speeches for segmentation.
    Each speech is encoded on its own so that the chunking and truncation fallbacks
    can be applied per speech; the longest 0.1% of speeches receive zero vectors.
    """
    print("=" * 60)
    print("FIRST EMBEDDING: Individual speeches for segmentation")
    print("=" * 60)
    print(f"Generating embeddings for {len(df)} speeches using {model_name}")
    
    # Load model
    model = load_embedding_model(model_name)
    
    # Use FULL texts
    texts = df[text_column].astype(str).tolist()
    
    # Show text length statistics and identify extremely long texts
    text_lengths = [len(text) for text in texts]
    print(f"Text length statistics (characters):")
    print(f"  Min: {min(text_lengths)}, Max: {max(text_lengths)}, Mean: {np.mean(text_lengths):.0f}")
    
    # Calculate 99.9th percentile threshold to filter out top 0.1% longest speeches
    length_threshold = np.percentile(text_lengths, 99.9)
    extremely_long_mask = np.array(text_lengths) > length_threshold
    n_extremely_long = extremely_long_mask.sum()
    
    print(f"  99.9th percentile length: {length_threshold:.0f} characters")
    print(f"  Extremely long speeches (top 0.1%): {n_extremely_long}")
    print(f"  These will be assigned zero embeddings for memory safety")
    
    # Speeches are encoded one at a time (see embed_with_fallback), so the
    # batch_size argument does not affect this loop; only report the device.
    print(f"Encoding speeches individually on {'GPU' if torch.cuda.is_available() else 'CPU'}")
    
    def embed_with_fallback(text, speech_index):
        """Embed text with GPU-optimized fallback strategies."""
        text_len = len(text)
        
        # Skip extremely long texts (top 0.1%) - assign zero embedding
        if text_len > length_threshold:
            return np.zeros(model.get_sentence_embedding_dimension())
        
        try:
            # Try full text first with GPU-optimized settings
            embedding = model.encode(
                [text], 
                batch_size=1,  # Process individually for fallback
                convert_to_tensor=False,
                normalize_embeddings=True,
                show_progress_bar=False
            )[0]
            return embedding
            
        except Exception:
            # Fallback: chunking for very long texts
            if len(text) > 10000:
                try:
                    chunks = []
                    chunk_size = 8000
                    for i in range(0, len(text), chunk_size):
                        chunk = text[i:i + chunk_size]
                        if len(chunk.strip()) > 100:
                            chunks.append(chunk)
                        if len(chunks) >= 3:
                            break
                    
                    if chunks:
                        chunk_embeddings = model.encode(
                            chunks,
                            batch_size=min(len(chunks), 4),
                            convert_to_tensor=False,
                            normalize_embeddings=True,
                            show_progress_bar=False
                        )
                        return np.mean(chunk_embeddings, axis=0)
                except Exception:
                    pass
            
            # Final fallback: truncate
            try:
                truncated_text = text[:5000]
                embedding = model.encode(
                    [truncated_text],
                    batch_size=1,
                    convert_to_tensor=False,
                    normalize_embeddings=True,  # match the other encode calls
                    show_progress_bar=False
                )[0]
                return embedding
            except Exception:
                return np.zeros(model.get_sentence_embedding_dimension())
    
    # Generate embeddings with GPU-optimized progress tracking
    print("Generating speech-level embeddings...")
    start_time = time.time()
    embeddings = []
    
    # Create progress bar optimized for Colab
    progress_bar = tqdm(enumerate(texts), total=len(texts), desc="🧠 Embedding speeches", 
                       unit="speech", ncols=100, 
                       bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}]')
    
    for i, text in progress_bar:
        # GPU memory cleanup every 50 speeches (more frequent for GPU)
        if i % 50 == 0 and i > 0:
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        
        embedding = embed_with_fallback(text, i)
        embeddings.append(embedding)
        
        # Update progress bar with GPU-optimized info
        if i % 10 == 0:
            rate = i / (time.time() - start_time)
            progress_bar.set_postfix({
                'Rate': f'{rate:.1f}/s',
                'GPU': '✓' if torch.cuda.is_available() else '✗',
                'Filtered': n_extremely_long
            })
    
    progress_bar.close()
    
    total_time = time.time() - start_time
    rate = len(texts) / total_time
    print(f"✓ Speech embeddings completed in {total_time:.2f} seconds")
    print(f"✓ Average rate: {rate:.1f} speeches/second")
    print(f"✓ Embedding shape: {np.array(embeddings).shape}")
    print(f"✓ Filtered out {n_extremely_long} extremely long speeches")
    
    # Final GPU memory cleanup
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # Add to dataframe
    df_with_embeddings = df.copy()
    df_with_embeddings['Speech_Embeddings'] = embeddings
    df_with_embeddings['Is_Extremely_Long'] = extremely_long_mask
    
    return df_with_embeddings

# Similarity and segmentation helpers (NumPy/SciPy on CPU; no GPU needed)
def calculate_windowed_similarity(embeddings_list, window_size=3):
    """Calculate cosine similarity between windowed embeddings."""
    if len(embeddings_list) < 2:
        return np.array([])
    if window_size < 1:
        raise ValueError("Window size must be at least 1.")

    num_utterances = len(embeddings_list)
    similarities = []

    for g in range(num_utterances - 1):
        # Window before gap
        start_before = max(0, g - window_size + 1)
        end_before = g + 1
        window_before = embeddings_list[start_before:end_before]

        # Window after gap
        start_after = g + 1
        end_after = min(num_utterances, g + 1 + window_size)
        window_after = embeddings_list[start_after:end_after]

        if not window_before or not window_after:
            similarities.append(0)
            continue

        # Calculate mean embeddings and similarity
        mean_before = np.mean([np.asarray(e) for e in window_before], axis=0)
        mean_after = np.mean([np.asarray(e) for e in window_after], axis=0)
        
        sim = cosine_similarity(mean_before.reshape(1, -1), mean_after.reshape(1, -1))[0][0]
        similarities.append(sim)
        
    return np.array(similarities)

def find_topic_boundaries(similarities, height_threshold=0.25, prominence_threshold=0.15, distance_threshold=5):
    """Find topic boundaries using peak detection on inverted similarity scores."""
    if len(similarities) == 0:
        return np.array([])
    
    # Invert similarities to find valleys (topic boundaries)
    inverted_similarities = np.maximum(0, 1 - similarities)
    
    # Find peaks in inverted similarities
    peaks, _ = find_peaks(
        inverted_similarities,
        height=height_threshold,
        prominence=prominence_threshold,
        distance=distance_threshold
    )
    
    return peaks

def segment_speeches_by_similarity(df, window_size=3, height_threshold=0.25, 
                                   prominence_threshold=0.15, distance_threshold=5):
    """Segment speeches within each sitting based on semantic similarity."""
    print(f"🔍 Segmenting speeches using similarity-based approach")
    print(f"Parameters: window_size={window_size}, height_threshold={height_threshold}")
    print(f"           prominence_threshold={prominence_threshold}, distance_threshold={distance_threshold}")
    
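    # Stable sort so that row order matches the groupby order used below;
    # the order of speeches within each sitting is preserved.
    df_segmented = df.sort_values('Sitting_ID', kind='stable').reset_index(drop=True)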
    segment_ids = []
    total_boundaries = 0
    
    # Process each sitting separately with progress bar
    sittings = list(df_segmented.groupby('Sitting_ID'))
    
    for sitting_id, group in tqdm(sittings, desc="🔪 Segmenting sittings", unit="sitting"):
        if len(group) < 2:
            segment_ids.extend([f"{sitting_id}_seg_0"] * len(group))
            continue
        
        # Use the speech-level embeddings for segmentation
        embeddings_list = group['Speech_Embeddings'].tolist()
        similarities = calculate_windowed_similarity(embeddings_list, window_size)
        
        if len(similarities) == 0:
            segment_ids.extend([f"{sitting_id}_seg_0"] * len(group))
            continue
        
        # Find boundaries
        boundaries = find_topic_boundaries(
            similarities, height_threshold, prominence_threshold, distance_threshold
        )
        total_boundaries += len(boundaries)
        
        # Assign segment IDs
        current_segment = 0
        sitting_segment_ids = []
        
        for i in range(len(group)):
            if i > 0 and (i - 1) in boundaries:
                current_segment += 1
            sitting_segment_ids.append(f"{sitting_id}_seg_{current_segment}")
        
        segment_ids.extend(sitting_segment_ids)
    
    df_segmented['Segment_ID'] = segment_ids
    
    # Print statistics
    total_segments = df_segmented['Segment_ID'].nunique()
    avg_segments_per_sitting = df_segmented.groupby('Sitting_ID')['Segment_ID'].nunique().mean()
    
    print(f"✓ Segmentation complete!")
    print(f"✓ Total boundaries detected: {total_boundaries}")
    print(f"✓ Total segments created: {total_segments}")
    print(f"✓ Average segments per sitting: {avg_segments_per_sitting:.2f}")
    
    return df_segmented

print("✓ Embedding and segmentation functions loaded")

## Segment Aggregation and Re-embedding Functions (GPU Optimized)

After creating segments, we aggregate the raw text and re-embed for better topic modeling representation.
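
The aggregation step amounts to a group-by with a string join. A minimal sketch on toy rows follows; the texts are invented, and `N_Speeches` is an extra illustrative column that the notebook's `aggregate_segments` does not produce.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sitting_ID": ["s1", "s1", "s1", "s2"],
    "Segment_ID": ["s1_seg_0", "s1_seg_0", "s1_seg_1", "s2_seg_0"],
    "Text": ["Budget first.", "Budget again.", "Now health.", "Education."],
})

# One row per segment: concatenate the speeches, keep the sitting, count speeches.
segments = (
    toy.groupby("Segment_ID", sort=False)
       .agg(Combined_Text=("Text", " ".join),
            Sitting_ID=("Sitting_ID", "first"),
            N_Speeches=("Text", "count"))
       .reset_index()
)
print(segments)
```

The `Combined_Text` column is what gets passed to the second embedding pass, so each segment is encoded as a single passage.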

In [None]:
# === SEGMENT AGGREGATION AND RE-EMBEDDING FUNCTIONS (GPU OPTIMIZED) ===

def aggregate_segments(df, text_column='Text', segment_id_column='Segment_ID'):
    """
    Aggregate segments by concatenating their texts.
    
    Args:
        df (DataFrame): Input dataframe with segments
        text_column (str): Column name for the text data
        segment_id_column (str): Column name for the segment IDs
    
    Returns:
        DataFrame: Aggregated segments with combined texts
    """
    print(f"📦 Aggregating {len(df)} speeches into segments...")
    
    # Group by segment ID and concatenate texts
    aggregated_df = df.groupby(segment_id_column).agg({
        text_column: ' '.join,
        'Sitting_ID': 'first'  # Keep one sitting ID per segment
    }).reset_index()
    
    # Rename columns for clarity
    aggregated_df.rename(columns={text_column: 'Combined_Text'}, inplace=True)
    
    print(f"✓ Aggregated segments: {len(aggregated_df)}")
    return aggregated_df

def re_embed_segments(df, text_column='Combined_Text', model_name="nomic-ai/nomic-embed-text-v1.5"):
    """
    Re-embed aggregated segments using the segment-level embedding model.
    
    Args:
        df (DataFrame): Input dataframe with aggregated segments
        text_column (str): Column name for the text data
        model_name (str): SentenceTransformer model name
    
    Returns:
        DataFrame: Segments with updated embeddings
    """
    print(f"🔄 Re-embedding {len(df)} aggregated segments...")
    
    # Load embedding model
    model = load_embedding_model(model_name)
    
    # Use FULL texts for re-embedding
    texts = df[text_column].astype(str).tolist()
    
    # Generate embeddings with GPU-optimized settings
    embeddings = model.encode(
        texts, 
        batch_size=32,  # Optimal batch size for GPU
        convert_to_tensor=False,
        normalize_embeddings=True,
        show_progress_bar=True
    )
    
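    # Store one vector per row; pandas rejects a 2-D array assigned to a single column.
    # Work on a copy so the caller's dataframe is not modified.
    df = df.copy()
    df['Segment_Embeddings'] = list(embeddings)
    
    print(f"✓ Re-embedded segments: {len(df)}")
    return df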

print("✓ Segment aggregation and re-embedding functions loaded")

## Topic Modeling Functions (Colab Optimized)

These functions handle BERTopic training using the re-embedded segment representations, optimized for Colab environment.
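
When embeddings are stored one vector per dataframe row, they need to be stacked into a single 2-D array before being handed to BERTopic. The sketch below uses made-up 2-D vectors and only shows the conversion; the commented `fit_transform` call assumes a `topic_model` created as in the cell that follows.

```python
import numpy as np
import pandas as pd

# Toy segment table; real embeddings would come from the re-embedding step.
segments = pd.DataFrame({
    "Combined_Text": ["budget debate", "health reform", "school funding"],
    "Segment_Embeddings": [np.array([0.1, 0.9]),
                           np.array([0.8, 0.2]),
                           np.array([0.5, 0.5])],
})

docs = segments["Combined_Text"].astype(str).tolist()               # list of str
embeddings = np.vstack(segments["Segment_Embeddings"].to_numpy())   # (n_docs, dim)

print(len(docs), embeddings.shape)

# With BERTopic installed, these would be passed as:
# topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```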

In [None]:
# === TOPIC MODELING FUNCTIONS (COLAB OPTIMIZED) ===
from bertopic import BERTopic

def train_bertopic_model(df, text_column='Combined_Text', embeddings_column='Segment_Embeddings', 
                         min_topic_size=10, nr_topics='auto', language='english'):
    """
    Train a BERTopic model on the given data.
    
    Args:
        df (DataFrame): Input dataframe with documents
        text_column (str): Column name for the text data
        embeddings_column (str): Column name for the embeddings
        min_topic_size (int): Minimum size of topics
        nr_topics (int or str): Number of topics (auto or fixed number)
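        language (str): Passed to BERTopic's `language` option, which mainly selects a
            default embedding model, so it matters little with pre-computed embeddings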
    
    Returns:
        tuple: (topic_model, df_with_topics)
            - topic_model: Trained BERTopic model
            - df_with_topics: Input dataframe with topic assignments
    """
    print(f"🧠 Training BERTopic model on {len(df)} segments...")
    
    # Initialize BERTopic model
    topic_model = BERTopic(
        embedding_model=None,  # We will use pre-computed embeddings
        min_topic_size=min_topic_size,
        nr_topics=nr_topics,
        language=language,
        calculate_probabilities=True,
        verbose=True
    )
    
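    # Fit the model: BERTopic expects a list of strings and a 2-D array (n_docs, dim)
    docs = df[text_column].astype(str).tolist()
    embeddings = np.vstack(df[embeddings_column].to_numpy())
    topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)
    
    # Add topic assignments to a copy of the dataframe
    df = df.copy()
    df['Topic'] = topics
    
    # get_topic_info() includes the outlier topic (-1), so exclude it from the count
    n_topics = int((topic_model.get_topic_info()['Topic'] != -1).sum())
    print(f"✓ Discovered {n_topics} topics (excluding outliers)")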
    return topic_model, df

def fine_tune_bertopic_model(topic_model, df, text_column='Combined_Text', embeddings_column='Segment_Embeddings'):
    """
    Fine-tune a pre-trained BERTopic model on new data.
    
    Args:
        topic_model (BERTopic): Pre-trained BERTopic model
        df (DataFrame): New data for fine-tuning
        text_column (str): Column name for the text data
        embeddings_column (str): Column name for the embeddings
    """
    print(f"🔧 Fine-tuning BERTopic model on {len(df)} segments...")
    
    # Get current topics
    current_topics = topic_model.get_topic_info()
    current_topic_count = len(current_topics)
    
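    # Fit the model on new data.
    # partial_fit only works if the model was built from sub-models that support
    # incremental learning; BERTopic's default UMAP + HDBSCAN setup does not.
    docs = df[text_column].astype(str).tolist()
    embeddings = np.vstack(df[embeddings_column].to_numpy())
    topic_model.partial_fit(docs, embeddings=embeddings)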
    
    # Show updated topic info
    updated_topics = topic_model.get_topic_info()
    updated_topic_count = len(updated_topics)
    
    print(f"✓ Fine-tuned model: {current_topic_count} -> {updated_topic_count} topics")
    return topic_model

print("✓ Topic modeling functions loaded")

## Main Processing Pipeline (Colab Optimized)

This is the complete dual-embedding pipeline optimized for Google Colab with GPU acceleration.
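
The core of the segmentation step inside the pipeline is turning a series of gap similarities into segment labels. The sketch below uses an invented similarity series with the same `find_peaks` thresholds that the pipeline passes in; the cumulative-sum labelling is one possible way to derive segment indices and is not how the notebook's loop is written.

```python
import numpy as np
from scipy.signal import find_peaks

# Toy similarity scores for the 7 gaps between 8 consecutive speeches.
similarities = np.array([0.95, 0.90, 0.92, 0.30, 0.91, 0.93, 0.94])

# Boundaries are peaks of (1 - similarity), i.e. valleys of similarity.
inverted = np.maximum(0, 1 - similarities)
boundaries, _ = find_peaks(inverted, height=0.25, prominence=0.15, distance=5)

# A boundary at gap g means speech g + 1 starts a new segment.
starts = np.zeros(len(similarities) + 1, dtype=int)
starts[boundaries + 1] = 1
segment_index = np.cumsum(starts)

print(boundaries)
print(segment_index)
```

Here the only valley deep enough to pass the thresholds is at gap 3, so the first four speeches form segment 0 and the remaining four form segment 1.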

In [None]:
# === MAIN PROCESSING PIPELINE (COLAB OPTIMIZED) ===

def run_dual_embedding_topic_pipeline(df, save_intermediate=True, data_folder='/content/drive/MyDrive/thesis_data/'):
    """
    Run the complete dual-embedding topic modeling pipeline optimized for Colab.
    
    DUAL EMBEDDING STRATEGY:
    1. First embedding: Individual speeches using RAW TEXT (for segmentation)
    2. Second embedding: Aggregated segments using RAW TEXT (for topic modeling)
    """
    results = {}
    
    # Verify required columns (flexible column checking)
    required_cols = ['Text', 'Sitting_ID']
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        print(f"❌ Missing required columns: {missing_cols}")
        print(f"Available columns: {list(df.columns)}")
        
        # Try to fix common column name issues
        if 'Text_ID' in df.columns and 'Sitting_ID' not in df.columns:
            df['Sitting_ID'] = df['Text_ID']
            print("🔧 Fixed: Created Sitting_ID from Text_ID")
        if 'ID' in df.columns and 'Speaker_ID' not in df.columns:
            df['Speaker_ID'] = df['ID']
            print("🔧 Fixed: Created Speaker_ID from ID")
        
        # Check again
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Still missing required columns after fixes: {missing_cols}")
    
    print("🚀 DUAL EMBEDDING PIPELINE USING RAW TEXT (COLAB OPTIMIZED)")
    print("=" * 70)
    print(f"📊 Input data shape: {df.shape}")
    print(f"🔧 Using RAW text for both embeddings (better semantic quality)")
    print(f"💻 Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
    
    ### STEP 1: SPEECH-LEVEL EMBEDDINGS (FIRST EMBEDDING)
    print("\n🔄 Generating speech-level embeddings (first embedding)...")
    df_with_speech_embeddings = generate_speech_embeddings_for_segmentation(df, text_column='Text')
    
    ### STEP 2: SEGMENTATION BASED ON SIMILARITY
    print("\n🔍 Segmenting speeches by similarity...")
    df_segmented = segment_speeches_by_similarity(df_with_speech_embeddings, window_size=3, 
                                                  height_threshold=0.25, prominence_threshold=0.15, 
                                                  distance_threshold=5)
    
    ### STEP 3: AGGREGATE SEGMENTS AND RE-EMBEDDING (SECOND EMBEDDING)
    print("\n📦 Aggregating segments and re-embedding...")
    aggregated_segments = aggregate_segments(df_segmented, text_column='Text', segment_id_column='Segment_ID')
    df_with_segment_embeddings = re_embed_segments(aggregated_segments, text_column='Combined_Text')
    
    ### STEP 4: TOPIC MODELING WITH BERTOPIC
    print("\n🧠 Running topic modeling with BERTopic...")
    topic_model, df_with_topics = train_bertopic_model(df_with_segment_embeddings, 
                                                       text_column='Combined_Text', 
                                                       embeddings_column='Segment_Embeddings', 
                                                       min_topic_size=10, nr_topics='auto', language='english')
    
    ### OPTIONAL: FINE-TUNE MODEL (UNCOMMENT TO USE)
    # print("\n🔧 Fine-tuning BERTopic model...")
    # topic_model = fine_tune_bertopic_model(topic_model, df_with_segment_embeddings, 
    #                                         text_column='Combined_Text', 
    #                                         embeddings_column='Segment_Embeddings')
    
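    ### SAVE INTERMEDIATE RESULTS
    if save_intermediate:
        print(f"\n💾 Saving intermediate results...")
        df_segmented.to_pickle(f'{data_folder}AT_with_speech_embeddings_colab.pkl')
        df_with_segment_embeddings.to_pickle(f'{data_folder}AT_segments_with_embeddings_colab.pkl')
        df_with_topics.to_pickle(f'{data_folder}AT_with_topics_final_colab.pkl')
        topic_model.save(f'{data_folder}bertopic_model_colab')
        print(f"✓ Saved speech embeddings, segment embeddings, topic assignments and model")
    
    # Prepare results summary
    results['speeches_df'] = df_segmented
    results['df_with_topics'] = df_with_topics
    results['segments_df'] = df_with_segment_embeddings
    results['topic_model'] = topic_model
    results['topic_info'] = topic_model.get_topic_info()
    
    # Topic names: this notebook has no LLM naming step, so LLM_Name falls back
    # to BERTopic's keyword-based Name unless an LLM_Name column already exists.
    topic_names = results['topic_info'][['Topic', 'Count', 'Name']].copy()
    if 'LLM_Name' in df_with_topics.columns:
        llm_names = df_with_topics.groupby('Topic')['LLM_Name'].first()
        topic_names['LLM_Name'] = topic_names['Topic'].map(llm_names).fillna(topic_names['Name'])
    else:
        print("⚠️  No LLM_Name column found; using BERTopic's default topic names")
        topic_names['LLM_Name'] = topic_names['Name']
    results['topic_info_with_names'] = topic_names
    if save_intermediate:
        topic_names.to_csv(f'{data_folder}topic_info_with_names_colab.csv', index=False)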
    
    return results

print("✓ Main processing pipeline loaded and ready for Colab")

In [None]:
# === RUN THE PIPELINE ===
print("🚀 Starting topic modeling pipeline on Colab...")
print(f"💻 Using: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print(f"📊 Processing {len(AT_processed_df)} speeches")

try:
    # Run the complete pipeline
    results = run_dual_embedding_topic_pipeline(
        AT_processed_df, 
        save_intermediate=True, 
        data_folder=data_folder  # Use the data_folder variable set earlier
    )
    
    if results is not None:
        print("\n🎉 Pipeline completed successfully!")
        print("📁 Results saved to Google Drive")
        
        # Quick results summary
        print("\n📋 QUICK RESULTS SUMMARY:")
        print(f"✓ Speeches processed: {len(AT_processed_df)}")
        print(f"✓ Segments created: {len(results['segments_df'])}")
        print(f"✓ Topics discovered (incl. outlier topic -1, if any): {len(results['topic_info'])}")
        
        # Show topic names (fall back to BERTopic's default names if no LLM names exist)
        print(f"\n🏷️  DISCOVERED TOPICS:")
        topic_table = results.get('topic_info_with_names', results['topic_info'])
        name_col = 'LLM_Name' if 'LLM_Name' in topic_table.columns else 'Name'
        for _, row in topic_table.iterrows():
            if row['Topic'] != -1:
                print(f"  Topic {row['Topic']}: {row[name_col]} ({row.get('Count', 'n/a')} segments)")
    else:
        print("❌ Pipeline failed!")
        
except Exception as e:
    print(f"❌ Error in pipeline: {e}")
    import traceback
    traceback.print_exc()

## Download Results from Colab

After the pipeline completes successfully, you can download the results to your local machine:

### Option 1: Direct Download from Google Drive
1. Navigate to your Google Drive folder
2. Download the generated pickle and CSV files
3. Use them in your local analysis notebooks

### Option 2: Download via Colab
Run the cell below to download key results directly from Colab:

In [None]:
# === DOWNLOAD RESULTS ===
import os
import zipfile
from google.colab import files

# Result files to download (data_folder is defined in the data loading cell)
results_files = [
    f"{data_folder}AT_for_topic_modeling.pkl",
    f"{data_folder}AT_with_topics_final_colab.pkl",
    f"{data_folder}topic_info_with_names_colab.csv",
    f"{data_folder}AT_with_speech_embeddings_colab.pkl",
    f"{data_folder}AT_segments_with_embeddings_colab.pkl",
    f"{data_folder}bertopic_model_colab"
]

# Zip only the result files that exist, not the whole Drive folder.
# bertopic_model_colab is assumed to be a single file (BERTopic's default pickle save).
archive_path = '/content/results_backup.zip'
with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as zf:
    for path in results_files:
        if os.path.isfile(path):
            zf.write(path, arcname=os.path.basename(path))
        else:
            print(f"⚠️  Skipping missing file: {path}")

print("📦 Results archived. Starting download...")
files.download(archive_path)

## Usage Summary

### What This Notebook Does:
1. **Loads one file** from Google Drive: `AT_original_complete.pkl`
2. **Processes the data** to create topic modeling version
3. **Runs dual embedding** approach with GPU acceleration:
   - Speech-level embeddings for segmentation
   - Segment-level embeddings for topic modeling
4. **Discovers topics** using BERTopic with clustering
5. **Generates topic names** using LLM (optional)
6. **Saves results** back to Google Drive

### What You Need to Upload:
- **Only one file**: `AT_original_complete.pkl`
- **Upload location**: `MyDrive/thesis_data/AT_original_complete.pkl`

### Performance on Colab GPU:
- **Speech embeddings**: ~30-100 speeches/second
- **Segment embeddings**: ~10-30 segments/second  
- **Total time**: ~10-30 minutes for large datasets (vs hours on CPU); all figures here are rough estimates that vary with GPU type and speech length

### Next Steps:
1. **Download results** to your local machine
2. **Use the final dataframe** for further analysis
3. **Visualize topics** using the topic information
4. **Integrate with LIWC** analysis using the speech-topic mappings

### Files Generated:
- `AT_for_topic_modeling.pkl` - Processed data used for topic modeling
- `AT_with_topics_final_colab.pkl` - Final dataframe with topic assignments
- `topic_info_with_names_colab.csv` - Topic information and LLM-generated names
- `AT_with_speech_embeddings_colab.pkl` - Intermediate speech embeddings
- `AT_segments_with_embeddings_colab.pkl` - Intermediate segment embeddings
- `bertopic_model_colab` - Saved BERTopic model

🎉 **Happy Topic Modeling on Colab!**