# Apply Topic Model to New Documents

This notebook applies a previously trained LDA topic model to new documents by calling `scripts/apply_topic_model.py`, then summarizes and exports the resulting topic assignments.

**Use this when:**
- You have already trained a topic model
- You want to classify new documents using that model
- You want to see topic distributions for incoming data

**How to use:**
1. Edit and run the configuration cell in Section 1 (paths to your trained model and new data)
2. Run Section 2 once to install all prerequisites (one-time setup)
3. Run the remaining cells (Sections 3 to 9) in order
4. View topic assignments and export results

**First-time users:** Run Section 2 (Install Prerequisites) before Sections 3 to 9, which depend on the packages it installs.

## Section 1: Configuration

In [None]:
# ============================================================================
# TRAINED MODEL CONFIGURATION
# ============================================================================

# Path to your trained model directory
MODEL_DIR = 'results/topic_modeling_20240216_140530'  # Change this to your model directory

# Model files (usually you don't need to change these)
MODEL_FILE = 'lda_model'
DICTIONARY_FILE = 'lda_dictionary'
BIGRAM_FILE = 'lda_bigram.pkl'  # Optional

# ============================================================================
# NEW DOCUMENTS CONFIGURATION
# ============================================================================

# Choose input type: 'csv' or 'directory'
INPUT_TYPE = 'csv'

# For CSV input
INPUT_CSV = 'Data/new_documents.csv'
TEXT_COLUMN = 'content'

# For directory input
INPUT_DIRECTORY = 'Data/new_documents/'
FILE_PATTERN = '*.txt'

# ============================================================================
# OUTPUT CONFIGURATION
# ============================================================================

# Where to save topic assignments
OUTPUT_CSV = 'results/topic_assignments.csv'

# Include probabilities for ALL topics (not just the dominant one)
INCLUDE_ALL_TOPICS = True

# ============================================================================
# PREPROCESSING PARAMETERS
# ============================================================================
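# These should match the settings used when training the model; otherwise new
# documents are tokenized differently from the training data and the topic
# assignments become unreliable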

ALLOWED_POS = ['NOUN', 'ADJ', 'VERB', 'ADV']
CUSTOM_STOPWORDS = []  # Should match training configuration
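
The later cells branch on `INPUT_TYPE == 'csv'` and treat every other value as directory input, so a typo such as `'CSV'` would silently take the directory path. A minimal sketch of a check that could be run right after editing the configuration; `check_input_type` and `VALID_INPUT_TYPES` are hypothetical names, not part of the notebook or the script:

```python
VALID_INPUT_TYPES = ("csv", "directory")


def check_input_type(input_type: str) -> str:
    """Return the input type unchanged, or raise if it is not a supported value."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(
            f"INPUT_TYPE must be one of {VALID_INPUT_TYPES}, got {input_type!r}"
        )
    return input_type


# Example: the default configuration value passes through unchanged.
print(check_input_type("csv"))
```

In the notebook this would be called as `check_input_type(INPUT_TYPE)` at the end of the configuration cell.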

## Section 2: Install Prerequisites

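This section installs the required Python packages, the spaCy `en_core_web_lg` model, and the NLTK `stopwords` and `punkt` data. **Run this cell once** before the rest of the notebook, and again if you get "module not found" errors.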

In [None]:
import sys
import subprocess

print("Installing required packages...\n")
print("="*60)

# List of required packages
packages = [
    'gensim',
    'pandas',
    'numpy',
    'nltk',
    'spacy',
    'scikit-learn',
    'matplotlib'
]

# Install packages
for package in packages:
    print(f"Installing {package}...")
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package, '-q'])
        print(f"  ✓ {package} installed")
    except subprocess.CalledProcessError as e:
        print(f"  ✗ Error installing {package}: {e}")

print("\n" + "="*60)
print("Installing spaCy language model...")
print("="*60)

# Install spaCy model
try:
    import spacy
    try:
        nlp = spacy.load('en_core_web_lg')
        print("✓ en_core_web_lg already installed")
    except OSError:
        print("Downloading en_core_web_lg (this may take a few minutes)...")
        subprocess.check_call([sys.executable, '-m', 'spacy', 'download', 'en_core_web_lg'])
        print("✓ en_core_web_lg installed")
except Exception as e:
    print("Note: If this fails, you can use the smaller model.")
    print("Run: python -m spacy download en_core_web_sm")
    print(f"Error: {e}")

print("\n" + "="*60)
print("Downloading NLTK data...")
print("="*60)

# Download NLTK data
try:
    import nltk
    nltk.download('stopwords', quiet=True)
    nltk.download('punkt', quiet=True)
    print("✓ NLTK data downloaded")
except Exception as e:
    print(f"✗ Error downloading NLTK data: {e}")

print("\n" + "="*60)
print("Prerequisite setup finished. Check the output above for any ✗ errors.")
print("="*60)
print("\nIf there were no errors, you can proceed to the next sections.")
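
Re-running the installer on every session is slow. As a rough sketch, the standard library can report which of the packages listed above are already importable, so the install cell only needs to run when something is missing. `missing_packages` and `IMPORT_NAMES` are hypothetical names; the one non-obvious mapping is that `scikit-learn` is imported as `sklearn`:

```python
import importlib.util

# Map pip package names to the module names they are imported under.
IMPORT_NAMES = {
    "gensim": "gensim",
    "pandas": "pandas",
    "numpy": "numpy",
    "nltk": "nltk",
    "spacy": "spacy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
}


def missing_packages(import_names: dict) -> list:
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in import_names.items()
        if importlib.util.find_spec(module_name) is None
    ]


print("Missing packages:", missing_packages(IMPORT_NAMES))
```

This only checks that a module can be located; it does not verify versions or that the spaCy model and NLTK data are present.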

## Section 3: Validation

In [None]:
import sys
from pathlib import Path

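# Add project root to path
# Only step up one level when the notebook is run from the 'notebooks' folder itself;
# a substring test on the full path would also match unrelated parent directories
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(project_root))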

print("Checking configuration...\n")
print("="*60)

# Check model files
model_dir_path = project_root / MODEL_DIR
model_path = model_dir_path / MODEL_FILE
dict_path = model_dir_path / DICTIONARY_FILE
bigram_path = model_dir_path / BIGRAM_FILE

if model_path.exists():
    print(f"✓ Model found: {model_path}")
else:
    print(f"✗ Model not found: {model_path}")

if dict_path.exists():
    print(f"✓ Dictionary found: {dict_path}")
else:
    print(f"✗ Dictionary not found: {dict_path}")

if bigram_path.exists():
    print(f"✓ Bigram model found: {bigram_path}")
else:
    print(f"  Bigram model not found (optional): {bigram_path}")

# Check input data
print()
if INPUT_TYPE == 'csv':
    input_path = project_root / INPUT_CSV
    if input_path.exists():
        print(f"✓ Input CSV found: {input_path}")
    else:
        print(f"✗ Input CSV not found: {input_path}")
elif INPUT_TYPE == 'directory':
    input_dir = project_root / INPUT_DIRECTORY
    if input_dir.exists():
        files = list(input_dir.glob(FILE_PATTERN))
        print(f"✓ Input directory found: {input_dir}")
        print(f"  Found {len(files)} files matching '{FILE_PATTERN}'")
    else:
        print(f"✗ Input directory not found: {input_dir}")
else:
    print(f"✗ Unknown INPUT_TYPE {INPUT_TYPE!r}: expected 'csv' or 'directory'")

print("="*60)

## Section 4: Load Model Info

In [None]:
import json

# Try to load model configuration if available
config_path = model_dir_path / 'run_configuration.json'
if config_path.exists():
    with open(config_path, 'r') as f:
        model_config = json.load(f)
    print("Model training configuration:")
    print(json.dumps(model_config, indent=2))
else:
    print("No configuration file found")

# Load topics info
topics_json = model_dir_path / 'lda_topics.json'
if topics_json.exists():
    with open(topics_json, 'r') as f:
        topics_data = json.load(f)
    
    print(f"\n\nModel has {len(topics_data)} topics:\n")
    print("="*60)
    
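    # Sort by the numeric part of the key (keys are assumed to look like 'topic_<n>'),
    # so that topic_10 is listed after topic_2 rather than before it
    for topic_id, topic_info in sorted(topics_data.items(), key=lambda item: int(item[0].split('_')[1])):
        topic_num = topic_id.split('_')[1]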
        top_words = ", ".join([w['word'] for w in topic_info['words'][:5]])
        print(f"Topic {topic_num}: {top_words}")
    
    print("="*60)
else:
    print("\nTopics file not found")

## Section 5: Build and Run Command

In [None]:
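import subprocess
import sys

# Build command
script_path = project_root / 'scripts' / 'apply_topic_model.py'

# Use the notebook's own interpreter so the script sees the packages installed in Section 2
cmd = [sys.executable, str(script_path)]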

# Add model paths
cmd.extend(['--model', str(model_path)])
cmd.extend(['--dictionary', str(dict_path)])
if bigram_path.exists():
    cmd.extend(['--bigram', str(bigram_path)])

# Add input
if INPUT_TYPE == 'csv':
    cmd.extend(['--input', str(project_root / INPUT_CSV)])
    cmd.extend(['--text-col', TEXT_COLUMN])
else:
    cmd.extend(['--input-dir', str(project_root / INPUT_DIRECTORY)])
    cmd.extend(['--file-pattern', FILE_PATTERN])

# Add output
output_path = project_root / OUTPUT_CSV
cmd.extend(['--output', str(output_path)])

# Add options
if INCLUDE_ALL_TOPICS:
    cmd.append('--include-all-topics')

if ALLOWED_POS:
    cmd.extend(['--allowed-postags'] + ALLOWED_POS)

if CUSTOM_STOPWORDS:
    cmd.extend(['--custom-stopwords'] + CUSTOM_STOPWORDS)

# Display command (shlex.join quotes arguments that contain spaces, so it can be copy-pasted)
import shlex
print("Command:")
print(shlex.join(cmd))
print("\n" + "="*60)

# Run command
print("\nApplying model to new documents...\n")

process = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,  # modern alias for universal_newlines=True
    bufsize=1
)

for line in process.stdout:
    print(line, end='')

process.wait()

print("\n" + "="*60)
if process.returncode == 0:
    print("✓ Topic assignment completed successfully!")
else:
    print(f"✗ Error occurred (exit code: {process.returncode})")
print("="*60)
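
The streaming pattern used in this cell can be wrapped in a small reusable function. The sketch below is one way to do it; `run_and_stream` is a hypothetical helper, and the child command is a stand-in rather than the real `apply_topic_model.py` call. It launches the child with `sys.executable`, so the child runs under the same interpreter as the notebook kernel:

```python
import subprocess
import sys


def run_and_stream(cmd: list) -> int:
    """Run cmd, echo its combined stdout/stderr line by line, return the exit code."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so ordering is preserved
        text=True,
        bufsize=1,  # line-buffered in text mode
    ) as process:
        for line in process.stdout:
            print(line, end="")
    # Leaving the `with` block waits for the child, so returncode is set here.
    return process.returncode


# Stand-in child command; the notebook would pass its own `cmd` list instead.
exit_code = run_and_stream([sys.executable, "-c", "print('hello from child')"])
print("exit code:", exit_code)
```

Using `Popen` as a context manager closes the pipe and waits for the process even if printing raises, which the bare `process.wait()` call does not guarantee.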

## Section 6: View Results

In [None]:
import pandas as pd

if output_path.exists():
    results_df = pd.read_csv(output_path)
    
    print(f"\nLoaded {len(results_df)} document assignments\n")
    print("="*60)
    
    # Show first few rows
    print("\nFirst 10 documents:")
    display(results_df.head(10))
    
    # Show topic distribution
    print("\n\nTopic Distribution:")
    topic_counts = results_df['dominant_topic'].value_counts().sort_index()
    
    topic_dist = []
    for topic_id, count in topic_counts.items():
        if topic_id >= 0:  # Skip -1 (failed docs)
            keywords = results_df[results_df['dominant_topic'] == topic_id]['topic_keywords'].iloc[0]
            topic_dist.append({
                'Topic': topic_id,
                'Documents': count,
                'Percentage': f"{count/len(results_df)*100:.1f}%",
                'Keywords': keywords
            })
    
    topic_dist_df = pd.DataFrame(topic_dist)
    display(topic_dist_df)
    
else:
    print(f"Output file not found: {output_path}")
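
Note that the `Percentage` column above divides by all rows, including documents marked `-1`. A small standard-library sketch of the same counting logic, showing both possible denominators; the `labels` list is made-up sample data, not output from the model:

```python
from collections import Counter

# Hypothetical dominant-topic labels; -1 marks a document that could not be classified.
labels = [0, 2, 2, -1, 1, 2, 0, -1]

# Count only successfully classified documents, as the table above does.
counts = Counter(label for label in labels if label >= 0)
classified = sum(counts.values())

for topic_id in sorted(counts):
    count = counts[topic_id]
    share_of_all = count / len(labels) * 100        # denominator used in the table above
    share_of_classified = count / classified * 100  # excludes failed documents
    print(f"Topic {topic_id}: {count} docs, "
          f"{share_of_all:.1f}% of all, {share_of_classified:.1f}% of classified")
```

Which denominator is appropriate depends on whether failed documents should count against each topic's share.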

## Section 7: Visualize Topic Distribution

In [None]:
import matplotlib.pyplot as plt

if 'results_df' in locals():
    # Bar chart of topic distribution
    fig, ax = plt.subplots(figsize=(12, 6))
    
    topic_counts = results_df[results_df['dominant_topic'] >= 0]['dominant_topic'].value_counts().sort_index()
    
    ax.bar(topic_counts.index, topic_counts.values)
    ax.set_xlabel('Topic ID')
    ax.set_ylabel('Number of Documents')
    ax.set_title('Topic Distribution in New Documents')
    ax.set_xticks(topic_counts.index)
    
    plt.tight_layout()
    plt.show()
    
    # Pie chart
    if len(topic_counts) <= 20:  # Only show pie chart for reasonable number of topics
        fig, ax = plt.subplots(figsize=(10, 10))
        ax.pie(topic_counts.values, labels=[f'Topic {i}' for i in topic_counts.index], autopct='%1.1f%%')
        ax.set_title('Topic Distribution (Percentage)')
        plt.show()
else:
    print("No results to visualize")

## Section 8: Analyze Specific Documents

In [None]:
if 'results_df' in locals():
    # Find documents with highest topic probability
    print("Documents with highest topic confidence:\n")
    top_confident = results_df.nlargest(10, 'topic_prob')[['doc_id', 'dominant_topic', 'topic_prob', 'topic_keywords']]
    display(top_confident)
    
    # If original data is available, show the actual text
    if INPUT_TYPE == 'csv' and (project_root / INPUT_CSV).exists():
        original_df = pd.read_csv(project_root / INPUT_CSV)
        
        print("\n\nSample document from each topic:\n")
        print("="*80)
        
        for topic_id in sorted(results_df['dominant_topic'].unique()):
            if topic_id >= 0:
                # Get highest confidence document for this topic
                topic_docs = results_df[results_df['dominant_topic'] == topic_id]
                if len(topic_docs) > 0:
                    best_doc = topic_docs.nlargest(1, 'topic_prob')
                    doc_idx = best_doc['doc_id'].iloc[0]
                    prob = best_doc['topic_prob'].iloc[0]
                    keywords = best_doc['topic_keywords'].iloc[0]
                    
                    if doc_idx < len(original_df):
                        # str() guards against missing values (NaN), which cannot be sliced
                        text = str(original_df[TEXT_COLUMN].iloc[doc_idx])
                        text_preview = text[:200] + "..." if len(text) > 200 else text
                        
                        print(f"\nTopic {topic_id} ({keywords})")
                        print(f"Probability: {prob:.3f}")
                        print(f"Text: {text_preview}")
                        print("-" * 80)
else:
    print("No results available")

## Section 9: Export Summary Report

In [None]:
import json  # imported here too, so this cell works even if Section 4 was skipped
from datetime import datetime

if 'results_df' in locals():
    # Create summary report
    report = {
        'timestamp': datetime.now().isoformat(),
        'model_dir': str(MODEL_DIR),
        'input_source': str(INPUT_CSV if INPUT_TYPE == 'csv' else INPUT_DIRECTORY),
        'total_documents': len(results_df),
        'successfully_classified': len(results_df[results_df['dominant_topic'] >= 0]),
        'topic_distribution': topic_dist_df.to_dict('records') if 'topic_dist_df' in locals() else [],
    }
    
    # Save report
    report_path = output_path.parent / f"{output_path.stem}_report.json"
    with open(report_path, 'w', encoding='utf-8') as f:
        json.dump(report, f, indent=2)
    
    print(f"Report saved to: {report_path}")
    print("\nSummary:")
    print(json.dumps(report, indent=2))
else:
    print("No results to export")

---

## Summary

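Your topic assignments have been saved to the path set in `OUTPUT_CSV` (by default **`results/topic_assignments.csv`**, relative to the project root). The summary report from Section 9 is written next to it with the suffix `_report.json`.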

### Output columns:
- `doc_id`: Identifier of the document (Section 8 uses it as a row index into the input CSV)
- `dominant_topic`: The most likely topic for this document (`-1` marks a failed document)
- `topic_prob`: Probability of the dominant topic
- `topic_keywords`: Top keywords for the dominant topic
- `topic_X_prob`: Probability for each topic `X` (only if `INCLUDE_ALL_TOPICS = True`)
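
To illustrate how these columns relate to one another, here is a minimal sketch that builds one row from a per-document topic distribution. It assumes the distribution is a list of `(topic_id, probability)` pairs, which is the shape gensim's `LdaModel.get_document_topics` returns; whether `apply_topic_model.py` works this way is an assumption, as are the helper name `summarize_distribution` and the zero-based `topic_<n>_prob` naming:

```python
def summarize_distribution(distribution, num_topics):
    """Build one output row from a list of (topic_id, probability) pairs."""
    # Topics missing from the distribution get probability 0.0.
    row = {f"topic_{i}_prob": 0.0 for i in range(num_topics)}
    for topic_id, prob in distribution:
        row[f"topic_{topic_id}_prob"] = prob
    if distribution:
        best_id, best_prob = max(distribution, key=lambda pair: pair[1])
    else:
        best_id, best_prob = -1, 0.0  # no topics: mirror the -1 "failed" marker
    row["dominant_topic"] = best_id
    row["topic_prob"] = best_prob
    return row


row = summarize_distribution([(0, 0.15), (2, 0.80)], num_topics=3)
print(row)
```

The per-topic probabilities need not sum to 1 here, because gensim omits topics that fall below its minimum-probability threshold.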

### Next steps:
- Use the results CSV for further analysis
- Filter documents by topic
- Integrate topic assignments into your workflow
- Apply the model to additional new documents as needed
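
As a starting point for filtering documents by topic, the sketch below reads a results file with the standard `csv` module and keeps rows for one topic above a probability threshold. The in-memory sample data and the helper `filter_by_topic` are hypothetical; only the column names come from the list above:

```python
import csv
import io

# Stand-in for the results CSV; a real run would open the OUTPUT_CSV file instead.
sample_csv = io.StringIO(
    "doc_id,dominant_topic,topic_prob\n"
    "0,2,0.91\n"
    "1,0,0.40\n"
    "2,2,0.35\n"
    "3,-1,0.00\n"
)


def filter_by_topic(rows, topic_id, min_prob=0.0):
    """Keep rows whose dominant topic matches and whose probability meets the threshold."""
    return [
        row for row in rows
        if int(row["dominant_topic"]) == topic_id
        and float(row["topic_prob"]) >= min_prob
    ]


rows = list(csv.DictReader(sample_csv))
confident = filter_by_topic(rows, topic_id=2, min_prob=0.5)
print([row["doc_id"] for row in confident])
```

With pandas already loaded in the notebook, the equivalent is a boolean mask on `results_df` using the same two columns.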