Thematic Analysis

An LLM-powered tool for automated thematic analysis of qualitative data. Analyze PDFs, text files, or raw text to discover themes and patterns.

Built with the OpenHands Software Agent SDK.

Installation

pip install thematic-analysis

# Or from source
git clone https://github.com/neubig/thematic-analysis.git
cd thematic-analysis
pip install -e .

Quick Start

Analyze a Directory of PDFs

from thematic_analysis import ThematicAnalysisPipeline

pipeline = ThematicAnalysisPipeline()
result = pipeline.run_from_directory("/path/to/pdfs")

# View discovered themes
for theme in result.themes.themes:
    print(f"{theme.name}: {theme.description}")
    print(f"  Codes: {', '.join(theme.original_themes)}")

Analyze Text Directly

from thematic_analysis import ThematicAnalysisPipeline

texts = [
    "I feel overwhelmed by the amount of work...",
    "The team collaboration has been excellent...",
    "Deadlines are causing significant stress...",
]

pipeline = ThematicAnalysisPipeline()
result = pipeline.run_from_texts(texts)

Analyze a Single PDF

from thematic_analysis import ThematicAnalysisPipeline

pipeline = ThematicAnalysisPipeline()
result = pipeline.run_from_pdf("/path/to/document.pdf")

Configuration

Customize the analysis with PipelineConfig:

from thematic_analysis import ThematicAnalysisPipeline, PipelineConfig

config = PipelineConfig(
    # Number of parallel coders (more coders give more diverse perspectives)
    num_coders=3,
    
    # Number of theme developers  
    num_theme_coders=2,
    
    # LLM model to use
    model="anthropic/claude-sonnet-4-20250514",
    
    # Processing batch size
    batch_size=10,
)

pipeline = ThematicAnalysisPipeline(config=config)

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `num_coders` | 3 | Number of independent coders. More coders = more diverse code suggestions |
| `num_theme_coders` | 2 | Number of theme developers |
| `batch_size` | 10 | Segments processed per batch |
| `execution_mode` | `PARALLEL` | `PARALLEL` for speed, `SEQUENTIAL` for debugging |
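
To make `batch_size` and `execution_mode` more concrete, here is a minimal, self-contained sketch of how segments could be split into batches and then processed either in parallel or sequentially. The helpers `make_batches`, `process_batch`, and `run_batches` are hypothetical and are not part of the `thematic_analysis` API; the package's actual scheduling may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(segments, batch_size=10):
    """Split a list of segments into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [segments[i:i + batch_size] for i in range(0, len(segments), batch_size)]


def process_batch(batch):
    """Stand-in for the per-batch LLM call (hypothetical): returns segment lengths."""
    return [len(segment) for segment in batch]


def run_batches(segments, batch_size=10, parallel=True):
    """Process every batch, preserving input order in both modes."""
    batches = make_batches(segments, batch_size)
    if parallel:
        # executor.map yields results in the order the batches were submitted.
        with ThreadPoolExecutor(max_workers=4) as executor:
            results = list(executor.map(process_batch, batches))
    else:
        # Sequential execution is slower but easier to step through in a debugger.
        results = [process_batch(batch) for batch in batches]
    return [item for batch_result in results for item in batch_result]


segments = [f"segment {i}" for i in range(25)]
print([len(b) for b in make_batches(segments, batch_size=10)])  # prints [10, 10, 5]
```

Both modes return the same results in the same order, which is why a sequential run can be a convenient way to reproduce a problem first seen in a parallel run.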

Coder Configuration

from thematic_analysis import PipelineConfig
from thematic_analysis.agents import CoderConfig

coder_config = CoderConfig(
    model="anthropic/claude-sonnet-4-20250514",
    max_codes_per_segment=5,  # Max codes assigned to each text segment
    temperature=0.7,          # LLM temperature (higher = more creative)
)

config = PipelineConfig(coder_config=coder_config)

Using Identity Perspectives

Assign different analytical perspectives to coders for richer analysis:

from thematic_analysis import PipelineConfig

config = PipelineConfig(
    num_coders=3,
    coder_identities=[
        "a healthcare professional focused on patient outcomes",
        "an economist analyzing cost-effectiveness",
        "a patient advocate prioritizing accessibility",
    ],
)
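
As a rough illustration of what an identity perspective might amount to, the sketch below turns a list of identities into one system prompt per coder. The function `build_coder_prompts`, the prompt wording, and the fallback for coders without an identity are all assumptions made for this example; they do not describe how the package builds its prompts.

```python
def build_coder_prompts(num_coders, coder_identities=None):
    """Return one system prompt per coder (hypothetical prompt wording)."""
    identities = list(coder_identities or [])
    if len(identities) > num_coders:
        raise ValueError("more identities than coders")
    prompts = []
    for index in range(num_coders):
        if index < len(identities):
            prompts.append(
                f"You are {identities[index]}. Code the text from that perspective."
            )
        else:
            # Assumption: coders without an identity fall back to a neutral prompt.
            prompts.append("You are a qualitative researcher. Code the text.")
    return prompts


prompts = build_coder_prompts(
    num_coders=3,
    coder_identities=[
        "a healthcare professional focused on patient outcomes",
        "an economist analyzing cost-effectiveness",
    ],
)
for prompt in prompts:
    print(prompt)
```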

Output Format

The pipeline returns a PipelineResult with:

# Assumes `pipeline` and `texts` are defined as in Quick Start
result = pipeline.run_from_texts(texts)

# Themes discovered
result.themes.themes  # List of MergedTheme objects

# Each theme has:
# - name: str
# - description: str  
# - original_themes: list[str] (the codes that were merged into this theme)

# The codebook with all codes
result.codebook

# Export to JSON
result.to_json()
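
For readers who want to work with themes outside the pipeline, the following sketch mirrors the three documented theme fields in a plain dataclass and serializes them with the standard library. `ThemeRecord` and `themes_to_json` are hypothetical names, the sample theme is made up, and the JSON layout shown here is an assumption; the structure produced by `result.to_json()` may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ThemeRecord:
    """Hypothetical stand-in with the three documented theme fields."""
    name: str
    description: str
    original_themes: list[str] = field(default_factory=list)


def themes_to_json(themes):
    """Serialize a list of ThemeRecord objects to a JSON string."""
    return json.dumps({"themes": [asdict(theme) for theme in themes]}, indent=2)


# Made-up sample data for illustration only.
themes = [
    ThemeRecord(
        name="Workload pressure",
        description="Stress linked to deadlines and volume of work",
        original_themes=["overwhelm", "deadline stress"],
    ),
]
payload = themes_to_json(themes)
print(payload)
```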

Environment Variables

Set your LLM API credentials:

export LLM_API_KEY="your-api-key"
export LLM_MODEL="anthropic/claude-sonnet-4-20250514"  # or other supported model
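
If you want to check these variables from your own script before starting a long run, a small guard like the one below can fail early with a clear message. This is a sketch only: `load_llm_settings` is a hypothetical helper, and the fallback to a default model is an assumption rather than documented behavior of the package.

```python
import os

DEFAULT_MODEL = "anthropic/claude-sonnet-4-20250514"


def load_llm_settings(environ=None):
    """Read LLM_API_KEY and LLM_MODEL, failing early if the key is missing."""
    env = os.environ if environ is None else environ
    api_key = env.get("LLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set")
    # Assumption: fall back to the model name used in the examples above.
    model = env.get("LLM_MODEL", DEFAULT_MODEL)
    return {"api_key": api_key, "model": model}


# A plain dict is passed here so the example runs without touching the real environment.
settings = load_llm_settings({"LLM_API_KEY": "your-api-key"})
print(settings["model"])
```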

Supported File Formats

PDF (.pdf) - Automatic text extraction
Text (.txt) - Plain text files
Markdown (.md) - Markdown files
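
To preview which files in a directory fall under the formats listed above, a filter such as the following can help. It is a sketch under stated assumptions: `find_supported_files` is a hypothetical helper, and whether the package itself searches subdirectories or matches extensions case-insensitively is not stated here.

```python
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".txt", ".md"}


def find_supported_files(directory):
    """Return supported files under directory, sorted for a reproducible order."""
    root = Path(directory)
    return sorted(
        path for path in root.rglob("*")
        # Assumption: extensions are matched case-insensitively.
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in ["interview.txt", "notes.md", "report.PDF", "data.csv"]:
        (Path(tmp) / name).write_text("placeholder")
    found = [path.name for path in find_supported_files(tmp)]
    print(found)
```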

How It Works

The pipeline uses a multi-agent architecture:

Coders: Multiple independent agents analyze text segments and assign codes
Aggregator: Merges similar codes from different coders
Reviewer: Validates and refines the codebook
Theme Coders: Group codes into higher-level themes
Theme Aggregator: Produces final consolidated themes
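
The stages above can be pictured with a toy, LLM-free sketch in which keyword rules stand in for the coders and a fixed lookup stands in for the theme coders. Everything here (`CODER_RULES`, `THEME_OF`, and the four functions) is hypothetical and greatly simplified; it illustrates the data flow only and is not the package's implementation.

```python
from collections import Counter

# Toy keyword rules standing in for two LLM coders (hypothetical).
CODER_RULES = [
    {"stress": "stress", "overwhelmed": "workload"},
    {"stress": "Stress", "collaboration": "teamwork"},
]
# Toy code-to-theme lookup standing in for the theme coders (hypothetical).
THEME_OF = {"stress": "Pressure", "workload": "Pressure", "teamwork": "Collaboration"}


def run_coder(rules, segment):
    """Coder: assign every code whose keyword appears in the segment."""
    return [code for keyword, code in rules.items() if keyword in segment.lower()]


def aggregate(codes):
    """Aggregator: merge codes that differ only in case and count them."""
    return Counter(code.lower() for code in codes)


def review(counts, min_count=1):
    """Reviewer: keep codes seen at least min_count times, in sorted order."""
    return sorted(code for code, count in counts.items() if count >= min_count)


def develop_themes(codebook):
    """Theme coders and theme aggregator: group codes under themes."""
    themes = {}
    for code in codebook:
        themes.setdefault(THEME_OF.get(code, "Other"), []).append(code)
    return themes


segments = [
    "I feel overwhelmed by the amount of work...",
    "The team collaboration has been excellent...",
    "Deadlines are causing significant stress...",
]
all_codes = [c for rules in CODER_RULES for s in segments for c in run_coder(rules, s)]
codebook = review(aggregate(all_codes))
themes = develop_themes(codebook)
print(themes)
```

In this toy run the two coders label the same segment "stress" and "Stress"; the aggregation step folds them into one code before themes are formed, which is the role the Aggregator plays in the list above.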

License

MIT License - See LICENSE for details.

Acknowledgments

This project is inspired by and based on concepts from:

Qiao, T., Walker, C., Cunningham, C., & Koh, Y. S. (2025). Thematic-LM: A LLM-based Multi-agent System for Large-scale Thematic Analysis. In Proceedings of the ACM Web Conference 2025 (WWW '25). https://doi.org/10.1145/3696410.3714595

@inproceedings{qiao2025thematiclm,
  title={Thematic-LM: A LLM-based Multi-agent System for Large-scale Thematic Analysis},
  author={Qiao, Tingrui and Walker, Caroline and Cunningham, Chris and Koh, Yun Sing},
  booktitle={Proceedings of the ACM Web Conference 2025 (WWW '25)},
  year={2025},
  doi={10.1145/3696410.3714595}
}
