## Automated AI agent for topic discovery and topic assignment for a set of documents

This workflow automates two steps: discovering topics within a set of documents, and assigning those topics back to each document. It uses LLMs to analyze the content of the documents, extract relevant topics, and assign up to three topics to each document.

### Generating a synthetic dataset for testing

The `generate_synthetic_docs()` function generates a synthetic dataset of documents, which can be used to test the topic discovery and assignment workflow.

In [1]:
from docagent import synthetic as sn
import pandas as pd

pd.set_option("display.max_colwidth", 800)

# Generate synthetic documents for a specific industry and context
investment_bank_employee_survey = sn.generate_synthetic_docs(
    industry="investment banking",
    context="comments provided by employees on their work experience",
    num_docs=100,
)


    Generating 100 synthetic documents for the investment banking industry with the requested context using model claude-3-7-sonnet-latest. 
    This usually takes about one second per document....
              
Generated 100 synthetic documents successfully in 77 seconds.


Viewing a few of the documents in the synthetic dataset:

In [2]:
investment_bank_employee_survey.head()

Unnamed: 0,docID,content
0,1,"The fast-paced environment here challenges me daily. I've grown significantly in my analytical skills. However, the work-life balance remains a struggle, with late nights becoming routine rather than exception."
1,2,Working on complex transactions has sharpened my financial modeling abilities. The learning curve is steep but rewarding. Senior management could improve transparency about promotion criteria and career progression.
2,3,The collaborative deal teams foster knowledge sharing across experience levels. I appreciate the exposure to various industries through diverse client engagements. The hours are demanding but expected in this field.
3,4,"Client interactions provide valuable insights into corporate strategy implementation. The compensation structure adequately rewards performance, though bonus transparency could improve. Professional development opportunities are abundant if proactively sought."
4,5,"The technical training provided excellent foundation for financial analysis, though on-the-job learning proves most valuable. The pressure during active deals is intense but develops resilience and attention to detail."


### Running topic discovery

Topic discovery can be run using the `discover_topics()` function. This function takes a DataFrame of documents and returns a DataFrame of the topics discovered in them, with a name, description, keywords, example documents, and prevalence for each topic.

In [3]:
from docagent import topic_discovery as td

# Discover between 6 and 10 topics in the synthetic documents
investment_bank_employee_survey_topics = await td.discover_topics(
    corpus = investment_bank_employee_survey,
    min_topics = 6,
    max_topics = 10,
)

Starting topic discovery from 100 documents using model claude-opus-4-20250514...
Topic discovery complete. Discovered 7 topics in 27 seconds.
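
The cell above uses top-level `await`, which works in a Jupyter notebook because an event loop is already running. In a plain Python script the same kind of call would need to be wrapped in a coroutine and started with `asyncio.run()`. The sketch below shows the pattern only: `discover_topics_stub` is a hypothetical stand-in written for this example, not part of `docagent`, and it returns placeholder strings rather than real topics.

```python
import asyncio


async def discover_topics_stub(corpus, min_topics, max_topics):
    """Hypothetical stand-in for an async call such as td.discover_topics()."""
    await asyncio.sleep(0)  # placeholder for the real LLM request
    return [f"topic {i}" for i in range(1, min_topics + 1)]


async def main():
    # In a real script, the awaited call would be the library function itself.
    return await discover_topics_stub(
        corpus=["doc a", "doc b"],
        min_topics=6,
        max_topics=10,
    )


# asyncio.run() creates an event loop, runs main() to completion, and closes the loop.
topics = asyncio.run(main())
print(topics)
```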


We can view the details of the topics discovered in the synthetic dataset:

In [4]:
investment_bank_employee_survey_topics

Unnamed: 0,topicID,topicName,topicDescription,topicKeywords,topicExamples,topicPrevalence
0,2,Professional Development and Skills,"Covers the development of analytical, financial modeling, and technical skills through on-the-job learning and formal training programs.","[analytical skills, financial modeling, technical training, professional growth, learning curve, skill development]","[1, 2, 5, 10, 20, 25, 45, 64, 72, 88]",0.18
1,4,Deal Experience and Client Exposure,"Focuses on working with high-profile clients, exposure to complex transactions, and the satisfaction derived from contributing to significant corporate events.","[client interactions, deal closings, complex transactions, corporate strategy, executive teams, transformative events]","[4, 11, 13, 15, 19, 33, 40, 52, 68, 81]",0.17
2,3,Management and Organizational Issues,"Addresses concerns about management transparency, career progression clarity, feedback mechanisms, and organizational hierarchy challenges.","[management transparency, career progression, feedback, organizational hierarchy, promotion criteria, leadership communication]","[2, 6, 12, 16, 23, 35, 56, 78, 85, 94]",0.16
3,1,Work-Life Balance and Work Hours,"Discusses the challenging work hours, late nights, weekend work expectations, and the ongoing struggle to maintain work-life balance in investment banking.","[work-life balance, late nights, weekend work, demanding hours, personal relationships, work intensity]","[1, 3, 8, 14, 47, 49, 57, 66, 89, 93]",0.15
4,6,Team Dynamics and Collaboration,"Explores varying team dynamics, collaboration quality, mentorship experiences, and knowledge sharing across different departments.","[team collaboration, mentorship, knowledge sharing, team dynamics, cross-functional work, deal teams]","[3, 6, 10, 27, 34, 51, 65, 71, 91, 98]",0.14
5,5,Compensation and Recognition,"Discusses compensation structures, bonus transparency, performance rewards, and inconsistent recognition systems across teams.","[compensation, bonus transparency, performance rewards, recognition systems, market standards, reward structure]","[4, 8, 17, 24, 32, 55, 80, 82, 87]",0.1
6,7,Work Environment and Culture,"Examines the competitive atmosphere, cultural emphasis on precision and perfection, and the pressure-filled work environment.","[competitive atmosphere, workplace culture, pressure, precision, professional standards, work environment]","[10, 22, 26, 38, 58, 63, 73, 86, 90, 95]",0.1


### Assigning topics to documents

Once a topic DataFrame is available, we can assign topics to documents using the `assign_topics()` function. This function takes a DataFrame of topics and a DataFrame of documents, and returns a DataFrame of documents with assigned topics. Where possible, each document receives a primary, secondary and tertiary topic drawn from the topics discovered in the previous step.

Because discovery and assignment are separate steps, the topics can be edited by hand before assignment, if needed. Topics are assigned according to their relevance to the content of each document.
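
As an illustration of such an edit, the sketch below renames one topic and removes another using ordinary pandas operations. The small `topics` DataFrame is a cut-down stand-in built for this example, with only two of the columns shown earlier. Whether `assign_topics()` needs the remaining columns, or contiguous topic IDs, is not stated here, so treat that as an open assumption.

```python
import pandas as pd

# A cut-down stand-in for the DataFrame returned by discover_topics();
# only two of its columns are reproduced here.
topics = pd.DataFrame(
    {
        "topicID": [1, 5, 7],
        "topicName": [
            "Work-Life Balance and Work Hours",
            "Compensation and Recognition",
            "Work Environment and Culture",
        ],
    }
)

# Rename a topic in place.
topics.loc[topics["topicID"] == 5, "topicName"] = "Pay and Recognition"

# Drop a topic that should not be used for assignment.
edited_topics = topics[topics["topicID"] != 7].reset_index(drop=True)

print(edited_topics)
```

The edited DataFrame would then be passed as the `topics` argument in place of the original.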

The process sends the documents to the LLM in consecutive batches, and the LLM uses a tool to fetch each batch. It then assigns each document in the batch to the most relevant topics.
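
The batching step can be sketched as follows. This is a minimal illustration of splitting a corpus into consecutive chunks, not the library's actual implementation. `iter_batches` is a hypothetical helper, and the default batch size of 20 is taken from the log output shown in this walkthrough.

```python
from typing import Iterator, Sequence


def iter_batches(doc_ids: Sequence[int], batch_size: int = 20) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of doc_ids, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(doc_ids), batch_size):
        yield doc_ids[start:start + batch_size]


doc_ids = list(range(1, 101))  # docIDs 1..100, as in the synthetic dataset
for batch in iter_batches(doc_ids):
    # In the real workflow, each batch would be fetched by the LLM via a tool call.
    print(f"Assigning topics for docs {batch[0]} to {batch[-1]}...")
```

If the corpus size is not a multiple of the batch size, the final batch is simply shorter.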

In [6]:
from docagent import topic_assignment as ta

# Assign topics to the synthetic documents
synthetic_docs_with_topics = await ta.assign_topics(
    corpus_name = "investment_bank_employee_survey",
    corpus = investment_bank_employee_survey,
    topics = investment_bank_employee_survey_topics,
)

Commencing assignment of 100 documents to 7 topics. This will be done in chunks of 20 documents at a time...
Assigning topics for docs 1 to 20...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 21 to 40...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 41 to 60...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 61 to 80...
Document assignment 

We can view the assigned topics for the documents in the synthetic dataset:

In [7]:
synthetic_docs_with_topics.head()

Unnamed: 0,docID,docContent,primaryTopicID,primaryTopicName,secondaryTopicID,secondaryTopicName,tertiaryTopicID,tertiaryTopicName
0,1,"The fast-paced environment here challenges me daily. I've grown significantly in my analytical skills. However, the work-life balance remains a struggle, with late nights becoming routine rather than exception.",1,Work-Life Balance and Work Hours,2,Professional Development and Skills,0,No topic assigned
1,2,Working on complex transactions has sharpened my financial modeling abilities. The learning curve is steep but rewarding. Senior management could improve transparency about promotion criteria and career progression.,2,Professional Development and Skills,3,Management and Organizational Issues,4,Deal Experience and Client Exposure
2,3,The collaborative deal teams foster knowledge sharing across experience levels. I appreciate the exposure to various industries through diverse client engagements. The hours are demanding but expected in this field.,6,Team Dynamics and Collaboration,4,Deal Experience and Client Exposure,1,Work-Life Balance and Work Hours
3,4,"Client interactions provide valuable insights into corporate strategy implementation. The compensation structure adequately rewards performance, though bonus transparency could improve. Professional development opportunities are abundant if proactively sought.",4,Deal Experience and Client Exposure,5,Compensation and Recognition,2,Professional Development and Skills
4,5,"The technical training provided excellent foundation for financial analysis, though on-the-job learning proves most valuable. The pressure during active deals is intense but develops resilience and attention to detail.",2,Professional Development and Skills,7,Work Environment and Culture,4,Deal Experience and Client Exposure
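
In the output above, a topic ID of 0 marks a slot where no topic was assigned. A quick way to see how many topics each document received is to count the non-zero IDs per row. The sketch below rebuilds only the ID columns of the five rows shown, and the `numTopics` column is a name introduced for this example.

```python
import pandas as pd

# Topic IDs for the five documents shown above (content and name columns omitted).
assigned = pd.DataFrame(
    {
        "docID": [1, 2, 3, 4, 5],
        "primaryTopicID": [1, 2, 6, 4, 2],
        "secondaryTopicID": [2, 3, 4, 5, 7],
        "tertiaryTopicID": [0, 4, 1, 2, 4],
    }
)

id_columns = ["primaryTopicID", "secondaryTopicID", "tertiaryTopicID"]

# A topic ID of 0 is treated as "No topic assigned".
assigned["numTopics"] = (assigned[id_columns] != 0).sum(axis=1)

print(assigned[["docID", "numTopics"]])
```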


### Running the agent to fully automate the workflow

The agent is designed to automate all steps of the process: it takes a corpus of documents, discovers topics, and assigns these topics to the documents.

First, let's create a new synthetic dataset of documents to test the agent:


In [8]:
# Generate synthetic documents for a specific industry and context
hospital_patient_survey = sn.generate_synthetic_docs(
    industry="hospital emergency care",
    context="comments provided by patients on their experience in the emergency department",
    num_docs=200,
    min_doc_length=50,
    max_doc_length=200,
)

hospital_patient_survey.head()


    Generating 200 synthetic documents for the hospital emergency care industry with the requested context using model claude-3-7-sonnet-latest. 
    This usually takes about one second per document....
              
Generated 200 synthetic documents successfully in 215 seconds.


Unnamed: 0,docID,content
0,1,"I waited for almost 3 hours before anyone saw me. My pain was getting worse and I felt ignored. When the doctor finally came, they seemed rushed and didn't listen to my concerns. The nurses were nice though, especially when they brought me water and checked on me occasionally."
1,2,"The emergency department was very clean and organized. I appreciate how the staff kept me informed about what tests they were doing and why. Even though it was busy, the nurses checked on me regularly. The doctor explained my condition clearly and answered all my questions. Overall positive experience despite the circumstances."
2,3,"Terrible experience at the emergency room. Too crowded, had to wait over 4 hours with a high fever. Staff seemed overwhelmed and irritable. The doctor barely spent 5 minutes with me and seemed distracted. No clear explanation about my diagnosis or treatment plan. Would avoid if possible in the future."
3,4,"The emergency staff was professional and efficient. I was scared about my chest pain, but the triage nurse immediately took me back and started tests. The doctor was thorough and kind, explaining everything in terms I could understand. Even though it was a scary situation, they made me feel safe and cared for."
4,5,Nurses were fantastic but the wait time was ridiculous. Sat in the waiting room for 5 hours before being seen for what turned out to be a serious infection. The facilities were outdated and uncomfortable. Better communication about wait times would have helped manage expectations.


In [10]:
# Note: this rebinds the alias `ta`, which earlier referred to topic_assignment
from docagent import topic_agent as ta

# Perform full topic analysis on the hospital patient survey documents, based on 6-8 topics
discovered_topics, assigned_topics = await ta.full_topic_analysis(
    corpus = hospital_patient_survey,
    corpus_name = "hospital_patient_survey",
    min_topics = 6,
    max_topics = 8,
)

Starting topic discovery from 200 documents using model claude-opus-4-20250514...
Topic discovery complete. Discovered 8 topics in 32 seconds.
Commencing assignment of 200 documents to 8 topics. This will be done in chunks of 20 documents at a time...
Assigning topics for docs 1 to 20...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 21 to 40...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 41 to 60...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained suc

The agent produces two DataFrames: the first contains the topics discovered in the documents, and the second contains the documents with assigned topics.

In [11]:
discovered_topics

Unnamed: 0,topicID,topicName,topicDescription,topicKeywords,topicExamples,topicPrevalence
0,1,Wait Times and Communication,"Experiences centered around long wait times in the emergency department and the quality of communication about delays, updates, and expected wait periods.","[wait time, hours, updates, communication, delays, waiting room, transparency]","[1, 3, 5, 12, 17, 24, 30, 51, 83, 122, 141, 152, 172]",0.27
1,2,Staff Compassion and Patient-Centered Care,"Positive experiences highlighting compassionate, respectful treatment by emergency staff who took time to listen, explain procedures clearly, and involve patients in care decisions.","[compassion, respect, empathy, patient-centered, dignity, listened, explained, thorough]","[2, 4, 6, 8, 11, 16, 18, 21, 23, 36, 49, 58, 109, 131, 198]",0.25
2,4,System Failures and Understaffing,"Experiences highlighting dangerous understaffing levels, overcrowding, patients in hallways, and systemic issues that compromise patient safety and dignity.","[understaffed, overcrowded, hallways, system failure, dangerous, overwhelmed, patient safety]","[7, 10, 28, 32, 52, 60, 65, 77, 80, 100, 120, 140, 160, 180, 200]",0.23
3,3,Pediatric Emergency Care,"Experiences specifically related to children's emergency care, including specialized approaches to communication, pain management, and creating child-friendly environments.","[pediatric, child, children, age-appropriate, distraction techniques, parents, asthma, allergic reaction]","[14, 31, 50, 73, 90, 114, 134, 154, 176, 196]",0.15
4,5,Facility Cleanliness and Environment,"Concerns about unsanitary conditions, uncomfortable physical environments, lack of privacy, and how the physical space impacts patient comfort and healing.","[filthy, unsanitary, blood stains, trash, cold, privacy, curtains, environment]","[20, 44, 56, 79, 84, 104, 124, 144, 156, 164, 184]",0.12
5,6,Medical History and Chronic Conditions,"Experiences where emergency staff failed to consider patients' existing medical conditions, chronic illnesses, or medical history when making treatment decisions.","[chronic condition, medical history, underlying condition, medications, dismissed, fragmented care]","[51, 71, 91, 111, 117, 137, 157, 177, 197]",0.1
6,8,Discharge Planning and Follow-up,"Issues related to inadequate discharge instructions, poor follow-up planning, and confusion about next steps after leaving the emergency department.","[discharge, follow-up, instructions, confusion, next steps, medication, planning]","[22, 46, 57, 89, 92, 112, 132, 152, 172, 192]",0.1
7,7,Cultural Sensitivity and Accessibility,"Experiences highlighting cultural competence, language services, religious accommodations, and accessibility for patients with disabilities or special needs.","[cultural sensitivity, language, interpreter, religious, disability, accessibility, inclusive]","[27, 38, 62, 69, 82, 102, 122, 142, 162, 182]",0.08


In [12]:
assigned_topics.head()

Unnamed: 0,docID,docContent,primaryTopicID,primaryTopicName,secondaryTopicID,secondaryTopicName,tertiaryTopicID,tertiaryTopicName
0,1,"I waited for almost 3 hours before anyone saw me. My pain was getting worse and I felt ignored. When the doctor finally came, they seemed rushed and didn't listen to my concerns. The nurses were nice though, especially when they brought me water and checked on me occasionally.",1,Wait Times and Communication,2,Staff Compassion and Patient-Centered Care,0,No topic assigned
1,2,"The emergency department was very clean and organized. I appreciate how the staff kept me informed about what tests they were doing and why. Even though it was busy, the nurses checked on me regularly. The doctor explained my condition clearly and answered all my questions. Overall positive experience despite the circumstances.",2,Staff Compassion and Patient-Centered Care,1,Wait Times and Communication,5,Facility Cleanliness and Environment
2,3,"Terrible experience at the emergency room. Too crowded, had to wait over 4 hours with a high fever. Staff seemed overwhelmed and irritable. The doctor barely spent 5 minutes with me and seemed distracted. No clear explanation about my diagnosis or treatment plan. Would avoid if possible in the future.",1,Wait Times and Communication,4,System Failures and Understaffing,8,Discharge Planning and Follow-up
3,4,"The emergency staff was professional and efficient. I was scared about my chest pain, but the triage nurse immediately took me back and started tests. The doctor was thorough and kind, explaining everything in terms I could understand. Even though it was a scary situation, they made me feel safe and cared for.",2,Staff Compassion and Patient-Centered Care,0,No topic assigned,0,No topic assigned
4,5,Nurses were fantastic but the wait time was ridiculous. Sat in the waiting room for 5 hours before being seen for what turned out to be a serious infection. The facilities were outdated and uncomfortable. Better communication about wait times would have helped manage expectations.,1,Wait Times and Communication,5,Facility Cleanliness and Environment,2,Staff Compassion and Patient-Centered Care
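
One way to summarise the assignments is to count how many documents mention each topic in any of the three slots. The sketch below does this with pandas, using only the topic-name columns of the five rows shown above. The same steps should apply to the full `assigned_topics` DataFrame, assuming its column names match the output shown.

```python
import pandas as pd

# Topic names for the five documents shown above (other columns omitted).
assigned = pd.DataFrame(
    {
        "docID": [1, 2, 3, 4, 5],
        "primaryTopicName": [
            "Wait Times and Communication",
            "Staff Compassion and Patient-Centered Care",
            "Wait Times and Communication",
            "Staff Compassion and Patient-Centered Care",
            "Wait Times and Communication",
        ],
        "secondaryTopicName": [
            "Staff Compassion and Patient-Centered Care",
            "Wait Times and Communication",
            "System Failures and Understaffing",
            "No topic assigned",
            "Facility Cleanliness and Environment",
        ],
        "tertiaryTopicName": [
            "No topic assigned",
            "Facility Cleanliness and Environment",
            "Discharge Planning and Follow-up",
            "No topic assigned",
            "Staff Compassion and Patient-Centered Care",
        ],
    }
)

# Reshape to one row per (document, slot), then drop empty slots.
long_form = assigned.melt(id_vars="docID", var_name="slot", value_name="topicName")
long_form = long_form[long_form["topicName"] != "No topic assigned"]

# Number of distinct documents that mention each topic in any slot.
topic_counts = (
    long_form.groupby("topicName")["docID"].nunique().sort_values(ascending=False)
)
print(topic_counts)
```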
