# Topic Modeling: Organizing Unlabeled CVs with LDA

## Overview

This notebook demonstrates **Topic Modeling** using **Latent Dirichlet Allocation (LDA)** to organize unlabeled CVs (resumes) by automatically discovering hidden topics. Unlike supervised classification, topic modeling works with completely unlabeled data, making it ideal for organizing large document collections without manual labeling. You'll learn how to apply LDA to discover topics, interpret results, and organize documents based on their dominant topics.

> "The best way to find a needle in a haystack is to organize the haystack first."

**The Problem**: You have a folder full of CVs, unlabeled and unorganized. You need to find candidates for specific roles, but manually reading through hundreds of CVs is impossible.

## Objectives

- Understand what Topic Modeling is and why it's useful for unsupervised document organization
- Learn how LDA (Latent Dirichlet Allocation) discovers hidden topics in text collections
- Apply LDA to organize unlabeled documents automatically
- Interpret topic modeling results by examining top words and document-topic distributions
- Organize documents into folders based on their dominant topics

## Outline

1. **Introduction to Topic Modeling** - What it is and why it's useful
2. **What is LDA?** - Understanding Latent Dirichlet Allocation
3. **The Pipeline** - Complete workflow from data loading to organization
4. **Step 1: Loading Data** - Reading CVs from JSON files
5. **Step 2: Preprocessing** - Cleaning and preparing text
6. **Step 3: Vectorization** - Converting text to document-term matrix
7. **Step 4: Training LDA** - Discovering topics automatically
8. **Step 5: Analyzing Results** - Interpreting discovered topics
9. **Step 6: Organizing Documents** - Creating folders and organizing CVs by topic

## Topic Modeling

**Topic Modeling** is an **unsupervised learning** task that discovers hidden topics in a collection of unlabeled documents. Unlike classification (which requires labeled data), topic modeling finds patterns automatically.

**Example applications:**
- **Organizing unlabeled documents**: Group CVs by field (AI/ML, Data Analysis, etc.) without manual labeling
- **Understanding large text collections**: Discover what themes exist in news archives, research papers, or social media
- **Content recommendation**: Find documents similar to a given document based on topic similarity
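
As a rough illustration of the content-recommendation idea, the sketch below ranks documents by the cosine similarity of their topic mixtures. The `example_doc_topics` array is invented for illustration; in practice it would come from a fitted topic model.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical topic mixtures for four documents over three topics
# (each row sums to 1). These numbers are made up for illustration.
example_doc_topics = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.75, 0.20, 0.05],
    [0.05, 0.10, 0.85],
])

query_idx = 0
# Similarity of the query document to every document, itself included
sims = cosine_similarity(example_doc_topics[query_idx:query_idx + 1], example_doc_topics)[0]

# Rank the other documents from most to least similar, skipping the query itself
ranking = [int(i) for i in np.argsort(-sims) if i != query_idx]
print(ranking)  # [2, 1, 3]
```

Document 2 ranks first because its mixture is closest to the query's; cosine similarity is one reasonable choice here, not the only one.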

**Why it's useful:**
- No labels needed: works with completely unlabeled data
- Interpretable: topics are defined by their top words, making them understandable
- Scalable: can process large document collections
- Flexible: number of topics can be adjusted based on the corpus

## What is LDA?

**Latent Dirichlet Allocation (LDA)** is a probabilistic model that discovers hidden topics in a collection of documents.

**Key idea**: 
- Each document is a **mixture of topics** (e.g., 70% AI/ML, 20% Data Analysis, 10% Software Engineering)
- Each topic is a **distribution over words** (e.g., Topic 1: 30% "PyTorch", 25% "TensorFlow", 20% "NLP"...)
- LDA discovers these topics automatically by finding words that tend to co-occur in the same documents

**For our CVs**: LDA will discover topics like "AI/ML", "Data Analysis", "Big Data" by looking at which words appear together, then assign each CV to the most relevant topic(s).
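
One way to build intuition for the two bullet points above is to run LDA's assumed generative story forward: draw a topic mixture for a document, then pick a topic and a word for each position. The sketch below does this with NumPy. The vocabulary, the two topics, and the Dirichlet parameter are all invented for illustration; fitting LDA works in the opposite direction, inferring these distributions from observed word counts.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["pytorch", "tensorflow", "nlp", "tableau", "dashboard", "excel"]

# Two hypothetical topics, each a probability distribution over the vocabulary
topic_word = np.array([
    [0.35, 0.30, 0.25, 0.04, 0.03, 0.03],  # "AI/ML"-like topic
    [0.03, 0.03, 0.04, 0.35, 0.30, 0.25],  # "Data Analysis"-like topic
])

# Document-level topic mixture drawn from a Dirichlet prior (assumed alpha)
doc_topic = rng.dirichlet(alpha=[0.5, 0.5])

# Generate a 12-word document: pick a topic per word, then a word from that topic
words = []
for _ in range(12):
    z = rng.choice(len(topic_word), p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print("Topic mixture:", np.round(doc_topic, 2))
print("Document:", " ".join(words))
```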

**Reference**: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). [Latent Dirichlet allocation](https://dl.acm.org/doi/10.5555/944919.944937). *Journal of Machine Learning Research*, 3(Jan), 993-1022.

![Left: BoW. Right: LDA](../assets/lda.png)

## The Pipeline

1. **Load CVs**: Read all JSON files from topic folders using glob patterns and extract structured fields
2. **Preprocess**: Clean the text (remove URLs, emails, etc.)
3. **Vectorize**: Convert text to document-term matrix (Bag of Words)
4. **Train LDA**: Discover topics automatically
5. **Analyze Results**: See what topics were found and which CVs belong to each
6. **Organize**: Create folders and copy CVs based on their dominant topic
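
Steps 3 to 5 can be sketched end to end on a toy corpus before working with the real CVs. The example below chains `CountVectorizer` and `LatentDirichletAllocation` with scikit-learn's `make_pipeline`; the four short "CVs" and the choice of two topics are assumptions made purely for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the combined CV text
toy_cvs = [
    "pytorch tensorflow nlp deep learning models",
    "tensorflow nlp transformers pytorch research",
    "tableau dashboard excel reporting sql",
    "sql excel dashboard power bi reporting",
]

# Vectorize (Bag of Words) and fit LDA in a single object
toy_pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)

# Rows are documents, columns are topic probabilities
toy_doc_topic = toy_pipeline.fit_transform(toy_cvs)
toy_dominant = toy_doc_topic.argmax(axis=1)
print(toy_dominant)

# A new, unseen document goes through the same fitted steps
new_doc_topic = toy_pipeline.transform(["nlp research with pytorch"])
print(new_doc_topic.round(2))
```

Wrapping both steps in one pipeline keeps the vocabulary and the topic model together, so new documents are always vectorized the same way as the training data.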

## Step 1: Loading Data

In [None]:
# File > Open Folder > W5_NLP
# (VS Code root should be at W5_NLP)
# Then run: `uv sync`

In [None]:
# Standard library imports
import json
import re
import shutil
from pathlib import Path

# Third-party imports
import numpy as np
import pandas as pd

# Machine Learning
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

print("Libraries imported successfully!")

In [None]:
# Load CVs from JSON files in all topic folders
cv_dir = Path('../datasets/CVs')
# Use glob pattern to find all JSON files in Topic_* subdirectories, excluding English versions
cv_files = sorted([f for f in cv_dir.glob('Topic_*/*.json') if not f.name.endswith('_en.json')])

# Load and extract structured data from JSON
cvs_data = []
cv_names = []
cv_file_paths = []  # Store original file paths for later copying

for file in cv_files:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        cvs_data.append(data)
        cv_names.append(file.stem)
        cv_file_paths.append(file)  # Store the full path

print(f"Loaded {len(cvs_data)} CV files from {len(set(f.parent.name for f in cv_files))} topic folders:")
for i, name in enumerate(cv_names, 1):
    print(f"  {i}. {name}")

# Combine structured fields into text for each CV
def combine_cv_fields(cv_json):
    """Combine Heading, Skills, Projects, Experience, Education into a single text"""
    parts = []
    
    # Add heading
    if 'Heading' in cv_json:
        parts.append(cv_json['Heading'])
    
    # Add skills (join list items)
    if 'Skills' in cv_json:
        skills_text = ' '.join(cv_json['Skills']) if isinstance(cv_json['Skills'], list) else cv_json['Skills']
        parts.append(skills_text)
    
    # Add projects
    if 'Projects' in cv_json:
        projects_text = ' '.join(cv_json['Projects']) if isinstance(cv_json['Projects'], list) else cv_json['Projects']
        parts.append(projects_text)
    
    # Add experience
    if 'Experience' in cv_json:
        exp_text = ' '.join(cv_json['Experience']) if isinstance(cv_json['Experience'], list) else cv_json['Experience']
        parts.append(exp_text)
    
    # Add education
    if 'Education' in cv_json:
        edu_text = ' '.join(cv_json['Education']) if isinstance(cv_json['Education'], list) else cv_json['Education']
        parts.append(edu_text)
    
    return ' '.join(parts)

# Convert JSON data to text
cvs = [combine_cv_fields(cv_data) for cv_data in cvs_data]
print(f"\nCombined structured data into text for {len(cvs)} CVs")

## Step 2: Data Preprocessing

In [None]:
def preprocess_text(text):
    """Clean text: remove emails, URLs, and punctuation, then normalize whitespace"""
    # Remove emails and URLs
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http\S+', '', text)
    # Replace punctuation and symbols with spaces
    # (keeps letters, digits, underscores, and characters in the Arabic block)
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)
    # Normalize whitespace last so the substitutions above leave no repeated spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Preprocess all CVs
cvs_processed = [preprocess_text(cv) for cv in cvs]
print(f"Preprocessed {len(cvs_processed)} CVs")
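
To see what these cleaning steps do, here is a small standalone sketch on a made-up string. The helper `clean_text` and the sample text are illustrative only; this version collapses whitespace as the final step so that no repeated spaces remain after punctuation is replaced.

```python
import re

def clean_text(text):
    """Illustrative cleaner: drop emails and URLs, replace punctuation, collapse spaces."""
    text = re.sub(r'\S+@\S+', ' ', text)                 # emails
    text = re.sub(r'http\S+', ' ', text)                 # URLs
    text = re.sub(r'[^\w\s\u0600-\u06FF]', ' ', text)    # punctuation and symbols
    return re.sub(r'\s+', ' ', text).strip()             # whitespace, done last

sample = "Contact: sara@example.com | https://github.com/sara | Python, SQL (5 yrs)"
print(clean_text(sample))  # Contact Python SQL 5 yrs
```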

## Step 3: Prepare Data for LDA

Convert text to a document-term matrix (same as Bag of Words from classification).

In [None]:
# Create document-term matrix
vectorizer = CountVectorizer(
    max_features=1000,  # Top 1000 words
    min_df=2,           # Word must appear in at least 2 CVs
    max_df=0.8          # Ignore words in >80% of CVs
)

doc_term_matrix = vectorizer.fit_transform(cvs_processed)
feature_names = vectorizer.get_feature_names_out()

print(f"Document-Term Matrix: {doc_term_matrix.shape[0]} CVs × {doc_term_matrix.shape[1]} words")
print(f"Sparsity: {(1 - doc_term_matrix.nnz / (doc_term_matrix.shape[0] * doc_term_matrix.shape[1])) * 100:.1f}%")

## Step 4: Train LDA Model

In [None]:
# Train LDA model
n_topics = 3  # Number of topics to discover

lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)

print(f"Training LDA to discover {n_topics} topics...")
lda.fit(doc_term_matrix)
print("✓ Training complete!")

## Step 5: Analyze Results

Let's see what topics LDA discovered and which words define each topic.

In [None]:
# Display top words for each topic
def display_topics(model, feature_names, n_top_words=10):
    """Display top words for each topic"""
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        top_weights = [topic[i] for i in top_words_idx]
        
        print(f"\nTopic {topic_idx + 1}:")
        print("  Top words:", ", ".join(top_words))
        print("  Weights:", [f"{w:.3f}" for w in top_weights])

display_topics(lda, feature_names, n_top_words=10)

**Interpreting the topics**: Look at the top words for each topic. Can you guess what each topic represents? For example:
- Topic with "PyTorch", "TensorFlow", "NLP" → probably AI/ML
- Topic with "Tableau", "Power BI", "dashboard" → probably Data Analysis
- Topic with "Hadoop", "Spark", "Kafka" → probably Big Data

Now let's see which CV belongs to which topic:

In [None]:
# Get topic distribution for each CV
doc_topic_dist = lda.transform(doc_term_matrix)

# Find dominant topic for each CV
dominant_topics = doc_topic_dist.argmax(axis=1)

# Create a DataFrame to see results
df_results = pd.DataFrame({
    'CV': cv_names,
    'Dominant Topic': dominant_topics + 1,
    'Topic Probabilities': [dist for dist in doc_topic_dist]
})

# Show which CVs belong to which topic
print("CV Assignment to Topics:")
print("=" * 60)
for topic_id in range(n_topics):
    topic_cvs = df_results[df_results['Dominant Topic'] == topic_id + 1]
    print(f"\nTopic {topic_id + 1} ({len(topic_cvs)} CVs):")
    for idx, row in topic_cvs.iterrows():
        prob = row['Topic Probabilities'][topic_id]
        print(f"  - {row['CV']} ({prob:.1%})")
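
Taking only the `argmax` hides documents that are split between topics. A minimal sketch of an alternative, assuming a probability cutoff of 0.40 (an arbitrary value to tune for your corpus), assigns each document to every topic that reaches the cutoff. The `example_dist` array is invented for illustration.

```python
import numpy as np

# Hypothetical document-topic probabilities (rows sum to 1), made up for illustration
example_dist = np.array([
    [0.90, 0.05, 0.05],   # clearly one topic
    [0.48, 0.45, 0.07],   # split between two topics
    [0.34, 0.33, 0.33],   # no clear topic
])

threshold = 0.40  # assumed cutoff

# For each document, list the 1-based topic numbers whose probability reaches the cutoff
assignments = [
    [int(k) + 1 for k in np.flatnonzero(row >= threshold)]
    for row in example_dist
]
print(assignments)  # [[1], [1, 2], []]
```

An empty list flags a document with no confident topic, which could be routed to a manual-review folder instead of being forced into one.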

## Step 6: Organize CVs into Folders

Now comes the practical part: **automatically organize CVs into folders** based on their dominant topic!

In [None]:
# Create output directory structure
output_dir = Path('output/organized_cvs')
output_dir.mkdir(parents=True, exist_ok=True)

# Create a folder for each topic
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    topic_dir.mkdir(exist_ok=True)

# Copy each CV to its topic folder
for idx, (cv_name, topic_id, source_file) in enumerate(zip(cv_names, dominant_topics, cv_file_paths)):
    target_dir = output_dir / f"Topic_{topic_id + 1}"
    target_file = target_dir / f"{cv_name}.json"
    
    shutil.copy2(source_file, target_file)
    print(f"Copied {cv_name}.json → Topic_{topic_id + 1}/")

print(f"\n✓ Organization complete! CVs are now in: {output_dir}")

### Verify the Organization

Let's check what's in each folder:

In [None]:
# Show contents of each topic folder
for topic_id in range(n_topics):
    topic_dir = output_dir / f"Topic_{topic_id + 1}"
    files = list(topic_dir.glob('*.json'))
    print(f"\nTopic_{topic_id + 1}/ ({len(files)} CVs):")
    for f in sorted(files):
        print(f"  - {f.name}")

## **Student Exercise**: discover topics on a dataset of your choice

In [2]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

categories = ['sci.space', 'sci.med', 'comp.graphics', 'rec.sport.baseball']

dataset = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers', 'quotes')
)
documents = dataset.data

In [3]:
import re

def preprocess_text(text):
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

clean_docs = [preprocess_text(doc) for doc in documents]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

noise_words = [
    'just', 'don', 'like', 'think', 'know', 'people', 'time', 'good', 
    've', 'way', 'does', 'really', 'make', 'years', 'did', 'sure', 
    'work', 'say', 'things', 'point'
]
my_stop_words = list(ENGLISH_STOP_WORDS) + noise_words

vectorizer = CountVectorizer(
    max_features=1000,
    min_df=2,
    max_df=0.95,
    stop_words=my_stop_words 
)

doc_term_matrix = vectorizer.fit_transform(clean_docs)
feature_names = vectorizer.get_feature_names_out()

print(f"Matrix shape: {doc_term_matrix.shape}")

Matrix shape: (2368, 1000)


In [10]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=4,
    random_state=42,
    learning_method='online',
    max_iter=10
)

lda.fit(doc_term_matrix)


In [11]:
topic_keywords = {
    "Space": ["space", "nasa", "orbit", "moon", "launch", "earth"],
    "Sports": ["game", "team", "play", "baseball", "score", "win"],
    "Medicine": ["doctor", "health", "disease", "med", "pain", "medical"],
    "Graphics": ["image", "file", "jpeg", "graphics", "video", "format"]
}

def get_topic_label(top_words):
    best_label = "Unknown"
    max_match = 0
    
    for label, keywords in topic_keywords.items():
        matches = sum(1 for word in top_words if word in keywords)
        if matches > max_match:
            max_match = matches
            best_label = label
    return best_label

topic_names = {}
for topic_idx, topic in enumerate(lda.components_):
    # Indices of the 14 highest-weight words (the slice stops before position -15)
    top_indices = topic.argsort()[:-15:-1]
    top_words = [feature_names[i] for i in top_indices]
    
    label = get_topic_label(top_words)
    topic_names[topic_idx] = label
    print(f"Topic {topic_idx + 1}: {label}")

Topic 1: Sports
Topic 2: Space
Topic 3: Graphics
Topic 4: Medicine


In [7]:
import os
import shutil

output_dir = 'output/organized_newsgroups'
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)

doc_topic_dist = lda.transform(doc_term_matrix)
dominant_topics = doc_topic_dist.argmax(axis=1)

for idx, (text, topic_id) in enumerate(zip(documents, dominant_topics)):
    folder_name = topic_names[topic_id]
    
    folder_path = os.path.join(output_dir, folder_name)
    os.makedirs(folder_path, exist_ok=True)
    
    with open(os.path.join(folder_path, f"doc_{idx}.txt"), 'w', encoding='utf-8') as f:
        f.write(text)

## Summary

**What we accomplished**:
1. ✅ Loaded unlabeled CVs from a folder
2. ✅ Preprocessed the text data
3. ✅ Created a document-term matrix
4. ✅ Trained an LDA model to discover topics
5. ✅ Analyzed which CVs belong to which topic
6. ✅ **Automatically organized CVs into folders** based on discovered topics

**Key Takeaways**:
- **LDA discovers topics automatically** by finding words that tend to co-occur in the same documents
- **Each document is a mixture of topics** - LDA assigns probabilities
- **Topic modeling is unsupervised** - no labels needed!
- **Practical application**: Organize unlabeled documents automatically

**Next Steps**:
- Try different numbers of topics (`n_topics`) and see how results change
- Experiment with preprocessing (stemming, stop-word removal)
- Use topic probabilities to handle CVs that belong to multiple topics
- Visualize topics using tools like pyLDAvis
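
As a starting point for the first suggestion, the sketch below fits LDA with several topic counts on a made-up corpus and compares perplexity using scikit-learn's `perplexity` method. The corpus and the candidate values are assumptions for illustration. Perplexity measured on the training data is only a rough guide, so it is worth combining it with held-out data and a manual look at the top words.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Small invented corpus with three loose themes
toy_corpus = [
    "pytorch tensorflow nlp deep learning",
    "nlp transformers pytorch research",
    "tableau dashboard excel reporting",
    "excel dashboard power bi reporting",
    "hadoop spark kafka pipelines",
    "spark kafka streaming hadoop cluster",
]

toy_counts = CountVectorizer().fit_transform(toy_corpus)

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(toy_counts)
    # Lower perplexity means the model assigns higher likelihood to these counts
    perplexities[k] = model.perplexity(toy_counts)

for k, value in perplexities.items():
    print(f"n_components={k}: perplexity={value:.1f}")
```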

**References**:
- [Scikit-learn LDA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- [Topic modeling visualization guide](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)

---


## Module 1 Synthesis: The Complete Pipeline

Congratulations! You've completed **Module 1: Text Analysis with Statistical NLP**. Let's reflect on the journey and see how all the pieces fit together.

### The Circular Learning Experience

Remember the question chain we started with? Let's trace how we answered each question and built a complete NLP pipeline:

1. **"What is NLP?"** → We learned that NLP bridges computers and human language, with applications in understanding and generation.

2. **"How do we extract patterns from text?"** → We used **Regular Expressions** to find, match, and manipulate text patterns, which is essential for preprocessing.

3. **"How do we understand our data?"** → We performed **Exploratory Data Analysis (EDA)** on corpora to assess data quality, vocabulary characteristics, and preprocessing needs.

4. **"How do we prepare text for ML?"** → We applied **Preprocessing** techniques (cleaning, normalization, tokenization, stemming) to transform raw text into clean tokens.

5. **"How do we convert text to numbers?"** → We used **Vectorization** (BoW, TF-IDF) to convert text into numerical features that ML models can process.

6. **"How do we build classifiers?"** → We built **Text Classification** models (like sentiment analysis) using vectorized features and supervised learning.

7. **"How do we search documents?"** → We implemented **Information Retrieval** systems using TF-IDF and cosine similarity to find relevant documents.

8. **"How do we discover topics?"** → We applied **Topic Modeling** (LDA) to automatically organize unlabeled documents by discovering hidden topics.

### The Complete NLP Pipeline

Throughout this module, you've learned to build a complete NLP pipeline:

```
Raw Text
    ↓
[Regex: Pattern Extraction]
    ↓
[Corpus & EDA: Understanding Data]
    ↓
[Preprocessing: Cleaning & Normalization]
    ↓
[Vectorization: Text → Numbers]
    ↓
[Modeling: Classification / IR / Topic Modeling]
    ↓
Actionable Insights
```

### Key Skills You've Acquired

By completing this module, you can now:

✅ **Build supervised ML text classification pipelines**
- Preprocess Arabic and English text
- Vectorize text using BoW and TF-IDF
- Train and evaluate classifiers
- Interpret model results

✅ **Apply keyword-based information retrieval**
- Implement TF-IDF-based search engines
- Measure document similarity using cosine similarity
- Rank and retrieve relevant documents

✅ **Apply unsupervised ML for document organization**
- Discover hidden topics using LDA
- Organize unlabeled documents automatically
- Interpret topic modeling results

### The Foundation for What's Next

This module focused on **statistical NLP**: traditional methods that work well for many tasks. In **Module 2**, you'll learn about **Deep Learning approaches** (embeddings, transformers) that build on these foundations and often achieve better performance.

**What you learned here is still valuable:**
- Preprocessing techniques apply to both statistical and deep learning methods
- Understanding vectorization helps you understand embeddings
- EDA is always the first step, regardless of the approach
- The pipeline structure (preprocess → vectorize → model) remains the same

### Reflection Questions

Before moving to Module 2, consider:

1. **When would you use statistical NLP vs. deep learning?**
   - Statistical NLP: Fast, interpretable, works with small data
   - Deep Learning: Better accuracy, requires more data and computation

2. **What preprocessing steps are most important?**
   - Depends on your data and task, but EDA always guides the decision

3. **How does TF-IDF differ from BoW?**
   - BoW: Simple word counts
   - TF-IDF: Weighted counts that emphasize distinctive words

4. **When would you use topic modeling vs. classification?**
   - Classification: When you have labels and want to predict categories
   - Topic Modeling: When you have no labels and want to discover structure

### The Journey Continues

You've built a solid foundation in statistical NLP. The concepts you've learned (preprocessing, vectorization, classification, retrieval, and topic modeling) are the building blocks for more advanced techniques.

**Next Module Preview:**
- **Module 2** introduces **Deep Learning for NLP**:
  - Tokenization with modern tools (WordPiece, BPE)
  - Word embeddings (Word2Vec, GloVe, contextual embeddings)
  - Transformers and BERT
  - Fine-tuning pre-trained models

The journey from statistical NLP to deep learning is a natural progression: you'll see how embeddings generalize vectorization, how transformers improve on traditional methods, and how pre-trained models leverage the foundations you've built.

---

**Module 1 Complete! 🎉**

You now have the skills to work with text data using statistical methods. You understand the complete pipeline from raw text to actionable insights, and you're ready to explore the power of deep learning in Module 2.