# Batch Topic Modeling Runner

This notebook provides an easy interface to run topic modeling on large datasets using the batch processing scripts.

**Benefits:**
- Easy to configure parameters
- Save configurations for different projects
- View results directly in the notebook
- Reproducible analysis

**How to use:**
1. Edit the configuration in Section 1
2. **Run Section 2 first** to install all prerequisites (one-time setup)
3. Run all remaining cells (or Cell → Run All)
4. View results at the bottom

**First time users:** Make sure to run Section 2 (Install Prerequisites) before running the rest of the notebook!

## Section 1: Configuration

Edit these settings for your specific analysis.

In [1]:
# ============================================================================
# INPUT DATA CONFIGURATION
# ============================================================================

# Choose input type: 'csv', 'directory', or 'directory_with_metadata'
INPUT_TYPE = 'directory'  # Change this to match your data format

# For CSV input
CSV_PATH = 'Data/RottenTomatoes.csv'
TEXT_COLUMN = 'content'  # Column name containing text data

# For directory input
INPUT_DIRECTORY = 'Data/gutenberg-test/clean/'
FILE_PATTERN = '*.txt'

# For directory with metadata (like Gutenberg dataset)
METADATA_CSV = 'Data/gutenberg-test/metadata.csv'
PATH_COLUMN = 'local_path'  # Column containing file paths

# ============================================================================
# OUTPUT CONFIGURATION
# ============================================================================

from datetime import datetime

# Output directory (will be created if it doesn't exist)
# Using timestamp to avoid overwriting previous results
OUTPUT_DIR = f'results/topic_modeling_{datetime.now().strftime("%Y%m%d_%H%M%S")}'

# Or use a fixed name:
# OUTPUT_DIR = 'results/my_project_topics'

# ============================================================================
# MODEL PARAMETERS
# ============================================================================

# Number of topics to extract
NUM_TOPICS = 20

# Training parameters
PASSES = 20  # Number of passes through the corpus (more = better but slower)
ITERATIONS = 200  # Maximum iterations
CHUNKSIZE = 100  # Documents to process at once (larger = faster but more memory)

# Parallel processing (set to number of CPU cores, or 1 for single-threaded)
WORKERS = 4

# Random seed for reproducibility
RANDOM_STATE = 100

# ============================================================================
# PREPROCESSING PARAMETERS
# ============================================================================

# Parts of speech to keep (default: NOUN, ADJ, VERB, ADV)
ALLOWED_POS = ['NOUN', 'ADJ', 'VERB', 'ADV']

# Additional stopwords to remove (beyond standard English stopwords)
CUSTOM_STOPWORDS = []  # Example: ['movie', 'film', 'said']

# Bigram detection parameters
MIN_COUNT = 1  # Minimum count for bigrams
BIGRAM_THRESHOLD = 100  # Threshold for bigram detection

# ============================================================================
# ADDITIONAL OPTIONS
# ============================================================================

# Create visualization (set to False for very large datasets to save time)
CREATE_VISUALIZATION = True

# Save preprocessed corpus (useful if you want to re-train with different parameters)
SAVE_CORPUS = True

# Verbose output (more detailed logging)
VERBOSE = True

## Section 2: Install Prerequisites

This section installs all required packages. **Run this cell first**: skipping it is the usual cause of "module not found" errors in later sections.
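
A minimal sketch of a pre-flight check, assuming the batch scripts depend on `gensim`, `spacy`, `pandas`, and `pyLDAvis`. Only pyLDAvis is named in this notebook; the rest of the list is an assumption, so adjust it to match the actual scripts. The sketch reports which packages are missing and prints the `pip` command that would install them, without installing anything itself.

```python
import importlib.util
import sys

# Assumed package list (import name -> pip name); not confirmed by this notebook.
REQUIRED_PACKAGES = {
    "gensim": "gensim",
    "spacy": "spacy",
    "pandas": "pandas",
    "pyLDAvis": "pyLDAvis",
}


def find_missing(packages):
    """Return the pip names of packages whose import name cannot be located."""
    return [
        pip_name
        for import_name, pip_name in packages.items()
        if importlib.util.find_spec(import_name) is None
    ]


def pip_install_command(pip_names):
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", *pip_names]


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with:", " ".join(pip_install_command(missing)))
else:
    print("All required packages are available.")
```

Using `sys.executable -m pip` rather than a bare `pip` keeps the installation inside the same environment as the notebook kernel.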

## Section 3: Setup and Validation

This section checks that everything is configured correctly.
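
The kind of check this section performs could be sketched as follows. The function name `validate_config` and the individual rules are assumptions for illustration; the real validation cell may check more (or different) things.

```python
from pathlib import Path

VALID_INPUT_TYPES = {"csv", "directory", "directory_with_metadata"}


def validate_config(input_type, csv_path=None, input_directory=None,
                    metadata_csv=None, file_pattern="*.txt"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if input_type not in VALID_INPUT_TYPES:
        problems.append(f"Unknown INPUT_TYPE: {input_type!r}")
        return problems

    if input_type == "csv":
        if csv_path is None or not Path(csv_path).is_file():
            problems.append(f"CSV file not found: {csv_path}")
    elif input_type == "directory":
        if input_directory is None or not Path(input_directory).is_dir():
            problems.append(f"Input directory not found: {input_directory}")
        elif not any(Path(input_directory).glob(file_pattern)):
            problems.append(f"No files match {file_pattern!r} in {input_directory}")
    else:  # directory_with_metadata
        if metadata_csv is None or not Path(metadata_csv).is_file():
            problems.append(f"Metadata CSV not found: {metadata_csv}")
    return problems


# Example with the default directory from Section 1
for problem in validate_config("directory",
                               input_directory="Data/gutenberg-test/clean/"):
    print("Problem:", problem)
```

In the notebook, the call would take the Section 1 values directly, for example `validate_config(INPUT_TYPE, CSV_PATH, INPUT_DIRECTORY, METADATA_CSV, FILE_PATTERN)`.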

## Section 4: Build Command

This builds the command to run the batch processing script.
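
A rough sketch of how such a command could be assembled as an argument list. The script path `scripts/batch_topic_modeling.py` and every flag name below are hypothetical placeholders, because this notebook does not show the batch script's actual interface; replace them with the arguments the script really defines.

```python
import shlex
import sys


def build_command(script, input_type, input_path, output_dir,
                  num_topics, passes, workers,
                  custom_stopwords=(), visualize=True):
    """Assemble the command as a list, which avoids shell quoting problems."""
    cmd = [
        sys.executable, script,
        "--input-type", input_type,        # hypothetical flag
        "--input", str(input_path),        # hypothetical flag
        "--output-dir", str(output_dir),   # hypothetical flag
        "--num-topics", str(num_topics),   # hypothetical flag
        "--passes", str(passes),           # hypothetical flag
        "--workers", str(workers),         # hypothetical flag
    ]
    if custom_stopwords:
        cmd += ["--stopwords", *custom_stopwords]  # hypothetical flag
    if not visualize:
        cmd.append("--no-visualization")           # hypothetical flag
    return cmd


cmd = build_command(
    "scripts/batch_topic_modeling.py",  # hypothetical script name
    "directory", "Data/gutenberg-test/clean/", "results/my_project_topics",
    num_topics=20, passes=20, workers=4,
)
print(shlex.join(cmd))  # copy-pasteable form for a terminal
```

Keeping the command as a list and only joining it for display means paths with spaces survive intact when the list is later passed to `subprocess`.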

## Section 5: Run Topic Modeling

Execute the batch processing script. This may take several minutes to hours depending on your dataset size.
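
Because a run can last for hours, it helps to see progress as it happens instead of waiting for the process to finish. A minimal sketch using the standard `subprocess` module is shown below; the helper name `run_and_stream` is an illustration, and the demo command is a stand-in for the real batch command.

```python
import subprocess
import sys


def run_and_stream(cmd):
    """Run cmd, echo its output line by line, and return the exit code."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge errors into the same stream
        text=True,
        bufsize=1,                 # line-buffered
    ) as process:
        for line in process.stdout:
            print(line, end="")
    return process.returncode


# Stand-in command; in the notebook this would be the batch script command.
exit_code = run_and_stream([sys.executable, "-c", "print('demo run finished')"])
print("Exit code:", exit_code)
```

A non-zero exit code indicates that the script failed, in which case the later result-viewing sections will have nothing to load.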

## Section 6: View Results

Load and display the topics discovered by the model.
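
As an illustration of the display step, the sketch below formats topics given as `(topic_id, [(word, weight), ...])` pairs. That shape is what gensim's `show_topics(formatted=False)` returns; whether the batch script uses gensim is an assumption, and the sample values are placeholders rather than real model output.

```python
def format_topics(topics, top_n=5):
    """Render each topic as one line listing its top_n highest-weighted words."""
    lines = []
    for topic_id, word_weights in topics:
        top_words = ", ".join(
            f"{word} ({weight:.3f})" for word, weight in word_weights[:top_n]
        )
        lines.append(f"Topic {topic_id:>2}: {top_words}")
    return lines


# Placeholder data in the assumed shape; not output from an actual run.
example_topics = [
    (0, [("word_a", 0.050), ("word_b", 0.030), ("word_c", 0.020)]),
    (1, [("word_d", 0.040), ("word_e", 0.025), ("word_f", 0.010)]),
]

for line in format_topics(example_topics, top_n=2):
    print(line)
```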

## Section 7: View Visualization

Display the interactive pyLDAvis visualization if it was created.
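
One way to do this, sketched under the assumption that the visualization is saved as an `.html` file at the top level of the output directory (the actual filename is not given in this notebook, so the sketch searches for any HTML file):

```python
from pathlib import Path


def find_html_reports(output_dir):
    """Return all .html files directly inside output_dir, sorted by name."""
    root = Path(output_dir)
    return sorted(root.glob("*.html")) if root.is_dir() else []


def show_report(html_path, height=800):
    """Return an IFrame for the report; use as the last line of a cell."""
    # Imported lazily because IPython is only available inside Jupyter.
    from IPython.display import IFrame
    return IFrame(src=str(html_path), width="100%", height=height)


reports = find_html_reports("results/my_project_topics")
if reports:
    print("Found visualization:", reports[0])
else:
    print("No HTML visualization found; was CREATE_VISUALIZATION set to True?")
```

In a notebook cell, `show_report(reports[0])` would embed the page. The path generally needs to be reachable relative to the notebook's working directory for the frame to load.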

## Section 8: List All Output Files

Show all files generated by the topic modeling process.
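
A small sketch of such a listing using `pathlib`. The helper names are illustrative, and the `results` directory in the demo is simply the parent of the `OUTPUT_DIR` values shown in Section 1.

```python
from pathlib import Path


def list_output_files(output_dir):
    """Return (relative_path, size_in_bytes) for every file under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


def human_size(num_bytes):
    """Format a byte count with a binary unit suffix."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


for name, size in list_output_files("results"):
    print(f"{human_size(size):>10}  {name}")
```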

## Section 9: Save Configuration (Optional)

Save your configuration for future reference.

In [8]:
import json
from datetime import datetime
from pathlib import Path

# Collect every setting that affects the results, so the run can be reproduced
config = {
    'timestamp': datetime.now().isoformat(),
    'input_type': INPUT_TYPE,
    'input_path': str(CSV_PATH if INPUT_TYPE == 'csv' else INPUT_DIRECTORY),
    'output_dir': str(OUTPUT_DIR),
    'num_topics': NUM_TOPICS,
    'passes': PASSES,
    'iterations': ITERATIONS,
    'chunksize': CHUNKSIZE,
    'workers': WORKERS,
    'random_state': RANDOM_STATE,
    'allowed_pos': ALLOWED_POS,
    'custom_stopwords': CUSTOM_STOPWORDS,
    'min_count': MIN_COUNT,
    'bigram_threshold': BIGRAM_THRESHOLD,
}

config_path = Path(output_full_path) / 'run_configuration.json'

if Path(output_full_path).exists():
    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump(config, f, indent=2)
    print(f"Configuration saved to: {config_path}")
    print("\nConfiguration:")
    print(json.dumps(config, indent=2))
else:
    print("Cannot save configuration - output directory not found")

Cannot save configuration - output directory not found
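
To reuse a saved configuration later, the JSON file can be read back and compared against another run. The sketch below assumes the `run_configuration.json` filename used in the cell above; the helper names `load_run_configuration` and `changed_settings` are illustrative.

```python
import json
from pathlib import Path


def load_run_configuration(output_dir):
    """Load run_configuration.json from output_dir, or return None if absent."""
    config_path = Path(output_dir) / "run_configuration.json"
    if not config_path.is_file():
        return None
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)


def changed_settings(old, new, ignore=("timestamp", "output_dir")):
    """Map each differing key to its (old, new) values, skipping ignored keys."""
    keys = (set(old) | set(new)) - set(ignore)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


previous = load_run_configuration("results/my_project_topics")
if previous is None:
    print("No saved configuration found")
else:
    print(json.dumps(previous, indent=2))
```

Comparing two saved configurations with `changed_settings` makes it easy to see which parameters differed between two result directories.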


---

## Next Steps

### To apply this model to new documents:

Use an `Apply_Topic_Model.ipynb` notebook (create one if it does not exist yet), or run the script from the command line:

```bash
python scripts/apply_topic_model.py \
    --model results/topic_modeling_YYYYMMDD_HHMMSS/lda_model \
    --dictionary results/topic_modeling_YYYYMMDD_HHMMSS/lda_dictionary \
    --input new_documents.csv \
    --output topic_assignments.csv
```

### To experiment with different parameters:

1. Go back to Section 1
2. Change NUM_TOPICS, PASSES, or other parameters
3. Run all cells again
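
When comparing several settings, giving each combination its own output directory keeps the results from overwriting each other. A minimal sketch, where the directory naming scheme and the helper `parameter_grid` are assumptions for illustration:

```python
from itertools import product


def parameter_grid(num_topics_values, passes_values, base_dir="results"):
    """Return one settings dict per (num_topics, passes) combination."""
    runs = []
    for num_topics, passes in product(num_topics_values, passes_values):
        runs.append({
            "num_topics": num_topics,
            "passes": passes,
            # Hypothetical naming scheme: encode the parameters in the path
            "output_dir": f"{base_dir}/topics_k{num_topics}_p{passes}",
        })
    return runs


for run in parameter_grid([10, 20, 30], [10, 20]):
    print(run)
```

Each dict can then be copied into the Section 1 variables (`NUM_TOPICS`, `PASSES`, `OUTPUT_DIR`) before re-running the notebook.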

### To process a different dataset:

1. Change INPUT_TYPE and related paths in Section 1
2. Run all cells