## Automated AI agent for topic discovery and topic assignment for a set of documents

This workflow automates two steps: discovering topics within a set of documents, and assigning those topics back to the documents. It uses LLMs to analyze the document content, extract relevant topics, and assign up to three topics to each document.

### Generating a synthetic dataset for testing

The `generate_synthetic_docs()` function generates a synthetic dataset of documents, which can be used to test the topic discovery and assignment workflow.

In [1]:
from docagent import synthetic as sn
import pandas as pd

pd.set_option("display.max_colwidth", 800)

# Generate synthetic documents for a specific industry and context
investment_bank_employee_survey = sn.generate_synthetic_docs(
    industry="investment banking",
    context="comments provided by employees on their work experience",
    num_docs=100,
)


    Generating 100 synthetic documents for the investment banking industry with the requested context using model claude-3-7-sonnet-latest. 
    This usually takes about one second per document....
              
Generated 100 synthetic documents successfully in 68 seconds.


Viewing a few of the documents in the synthetic dataset:

In [2]:
investment_bank_employee_survey.head()

Unnamed: 0,docID,content
0,1,"The work environment is fast-paced but rewarding. Long hours during deal closings are challenging, but the team support makes it manageable. Compensation is competitive, though work-life balance could be improved."
1,2,"Challenging work with exposure to complex financial transactions. The learning curve is steep, but mentorship is available. Wish there were more formal training programs for new analysts."
2,3,"High pressure environment with significant demands on time. The experience gained is invaluable for career growth, but burnout is a real concern. More flexibility would be appreciated."
3,4,"Excellent exposure to senior bankers and clients. The hierarchical structure can be frustrating at times, but provides clear career progression paths. Bonus structure could be more transparent."
4,5,"The analytical skills developed here are unmatched. Working on high-profile deals is exciting, though the constant urgency can be draining. More recognition for junior staff would improve morale."


### Running topic discovery

Topic discovery is run with the `discover_topics()` function. It takes the corpus of documents (here, the DataFrame generated above) and returns a DataFrame of the topics discovered, with descriptive data for each topic: a description, keywords, example documents, and a prevalence score.

In [3]:
from docagent import topic_discovery as td

# Discover between 6 and 10 topics in the synthetic documents
investment_bank_employee_survey_topics = await td.discover_topics(
    corpus = investment_bank_employee_survey,
    min_topics = 6,
    max_topics = 10,
)

Starting topic discovery from 100 documents using model claude-opus-4-20250514...
Topic discovery complete. Discovered 8 topics in 28 seconds.
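
The cell above uses a top-level `await`, which works in Jupyter and IPython but not in a plain Python script. Outside a notebook, the call would need to be wrapped in a coroutine and started with `asyncio.run()`. A minimal sketch of that pattern follows; `fake_discover_topics` is a hypothetical stand-in for `td.discover_topics()` so that the example runs without `docagent` installed.

```python
import asyncio

import pandas as pd


async def fake_discover_topics(
    corpus: pd.DataFrame, min_topics: int, max_topics: int
) -> pd.DataFrame:
    """Hypothetical stand-in for td.discover_topics(); returns a fixed topic table."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return pd.DataFrame({"topicID": [1, 2], "topicName": ["Topic A", "Topic B"]})


async def main() -> pd.DataFrame:
    corpus = pd.DataFrame({"docID": [1, 2, 3], "content": ["a", "b", "c"]})
    # In real use, this is where `await td.discover_topics(...)` would go.
    return await fake_discover_topics(corpus=corpus, min_topics=2, max_topics=4)


# asyncio.run() creates an event loop, runs main() to completion, then closes the loop.
topics = asyncio.run(main())
print(topics)
```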


We can view the details of the topics discovered in the synthetic dataset:

In [4]:
investment_bank_employee_survey_topics

Unnamed: 0,topicID,topicName,topicDescription,topicKeywords,topicExamples,topicPrevalence
0,1,Work-Life Balance Challenges,"Concerns about long hours, unpredictable schedules, and the difficulty of maintaining personal relationships and commitments outside of work. The 'always-on' culture and expectation of 24/7 availability is a recurring theme.","[work-life balance, long hours, burnout, personal time, 24/7 availability, unpredictable schedule, lifestyle sacrifices]","[1, 3, 7, 10, 15, 21, 23, 27, 33, 40, 48, 52, 60, 73, 85]",0.95
1,2,Career Development and Learning,"Positive aspects of professional growth including exposure to complex deals, development of analytical and technical skills, and valuable experience for future career opportunities.","[career growth, learning opportunities, deal experience, skill development, professional development, exit opportunities, resume credentials]","[2, 5, 9, 11, 17, 20, 22, 31, 34, 42, 54, 62, 78, 89, 96]",0.9
2,4,Workplace Culture and Environment,"Observations about the hierarchical structure, competitive atmosphere, and cultural emphasis on face time over efficiency. Includes concerns about collaboration versus competition.","[culture, hierarchical structure, competitive atmosphere, face time, teamwork, collaboration, sink-or-swim]","[4, 6, 11, 13, 17, 24, 28, 32, 38, 41, 47, 61, 69, 77, 87]",0.85
3,8,Client and Deal Exposure,"Positive experiences working on high-profile transactions with sophisticated clients, though junior staff often have limited direct client interaction.","[client interaction, deal exposure, high-profile transactions, senior executives, complex deals, financial transactions]","[4, 8, 14, 19, 49, 57, 65, 75, 81, 93]",0.8
4,3,Compensation and Benefits,"Discussion of competitive compensation packages that reflect the demanding nature of the work, though often at the cost of personal time and well-being.","[compensation, competitive pay, bonus structure, salary, financial rewards, compensation package]","[1, 4, 8, 12, 23, 35, 43, 50, 58, 66, 74, 82, 90, 98]",0.75
5,6,Mental Health and Well-being,"Concerns about stress, anxiety, burnout, and the need for better mental health support and wellness initiatives in the workplace.","[mental health, stress, anxiety, burnout, wellness, well-being, recovery time]","[9, 10, 30, 35, 48, 56, 67, 71, 83, 91]",0.7
6,7,Process and Efficiency Improvements,"Suggestions for better planning, resource allocation, and process improvements to reduce unnecessary fire drills and improve workflow efficiency.","[efficiency, process improvement, resource planning, better planning, workflow, staffing levels, fire drills]","[15, 20, 25, 30, 36, 44, 59, 70, 79, 84]",0.65
7,5,Training and Mentorship,Mixed experiences with formal training programs and mentorship opportunities. Many describe a 'sink-or-swim' approach that can be stressful for new analysts.,"[training, mentorship, onboarding, learning curve, formal training, development programs, feedback]","[2, 13, 17, 26, 39, 54, 72, 76, 88, 95]",0.6
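
The `topicExamples` column appears to hold `docID` values of documents that illustrate each topic (an assumption based on the output above, not something the source states). Under that assumption, the example documents for a topic could be pulled from the corpus as sketched below, using toy data and a hypothetical helper `example_docs`.

```python
import pandas as pd

# Toy corpus and topic table using the column names shown above.
corpus = pd.DataFrame(
    {
        "docID": [1, 2, 3, 4],
        "content": ["doc one", "doc two", "doc three", "doc four"],
    }
)
topics = pd.DataFrame(
    {
        "topicID": [1, 2],
        "topicName": ["Topic A", "Topic B"],
        "topicExamples": [[1, 3], [2, 4]],  # assumed to be lists of docID values
    }
)


def example_docs(topic_id: int) -> pd.DataFrame:
    """Return the corpus rows listed in topicExamples for one topic (hypothetical helper)."""
    example_ids = topics.loc[topics["topicID"] == topic_id, "topicExamples"].iloc[0]
    return corpus[corpus["docID"].isin(example_ids)]


print(example_docs(1))
```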


### Assigning topics to documents

Once a topic DataFrame is available, we can assign topics to documents using the `assign_topics()` function. This function takes a DataFrame of topics and a DataFrame of documents, and returns a DataFrame of documents with assigned topics. Where possible, each document receives a primary, secondary, and tertiary topic drawn from the topics discovered in the previous step.

Because discovery and assignment are separate steps, the topics can be edited by hand before assignment, if needed. Topics are then assigned according to their relevance to the content of each document.
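
Since the topics are held in a pandas DataFrame, manual edits can be made with ordinary pandas operations. A minimal sketch on a toy topic table follows. The column names match the discovery output, but the specific edits are only illustrative, and the source does not say which columns `assign_topics()` requires.

```python
import pandas as pd

# Toy topic table with the same column names as the discovery output.
topics = pd.DataFrame(
    {
        "topicID": [1, 2, 5, 6],
        "topicName": [
            "Work-Life Balance Challenges",
            "Career Development and Learning",
            "Training and Mentorship",
            "Mental Health and Well-being",
        ],
    }
)

# Rename a topic, selecting it by its ID.
topics.loc[topics["topicID"] == 5, "topicName"] = "Training and Onboarding"

# Drop a topic that a reviewer considers redundant.
edited_topics = topics[topics["topicID"] != 6].reset_index(drop=True)

print(edited_topics)
```

The edited DataFrame would then be passed as the `topics` argument in place of the original.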

Assignment works by sending the documents to the LLM in consecutive batches; the LLM uses a tool to fetch each batch of documents. It then assigns each document in the batch to the most relevant topics.
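
The batching step can be pictured as slicing the corpus into consecutive chunks. The sketch below is not the library's implementation; `iter_batches` is a hypothetical helper, and the batch size of 20 is taken from the log output of the assignment run.

```python
import pandas as pd


def iter_batches(corpus: pd.DataFrame, batch_size: int = 20):
    """Yield consecutive slices of `corpus`, each with at most `batch_size` rows."""
    for start in range(0, len(corpus), batch_size):
        yield corpus.iloc[start : start + batch_size]


# Toy corpus of 100 documents with placeholder content.
corpus = pd.DataFrame(
    {"docID": range(1, 101), "content": ["placeholder text"] * 100}
)

batches = list(iter_batches(corpus, batch_size=20))
for batch in batches:
    print(f"docs {batch['docID'].iloc[0]} to {batch['docID'].iloc[-1]}")
```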

In [5]:
from docagent import topic_assignment as ta

# Assign topics to the synthetic documents
synthetic_docs_with_topics = await ta.assign_topics(
    corpus_name = "investment_bank_employee_survey",
    corpus = investment_bank_employee_survey,
    topics = investment_bank_employee_survey_topics,
)

Commencing assignment of 100 documents to 8 topics. This will be done in chunks of 20 documents at a time...
Assigning topics for docs 1 to 20...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 21 to 40...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 41 to 60...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 61 to 80...
Document assignment 

We can view the assigned topics for the documents in the synthetic dataset:

In [6]:
synthetic_docs_with_topics.head()

Unnamed: 0,docID,docContent,primaryTopicID,primaryTopicName,secondaryTopicID,secondaryTopicName,tertiaryTopicID,tertiaryTopicName
0,1,"The work environment is fast-paced but rewarding. Long hours during deal closings are challenging, but the team support makes it manageable. Compensation is competitive, though work-life balance could be improved.",1,Work-Life Balance Challenges,3,Compensation and Benefits,4,Workplace Culture and Environment
1,2,"Challenging work with exposure to complex financial transactions. The learning curve is steep, but mentorship is available. Wish there were more formal training programs for new analysts.",2,Career Development and Learning,5,Training and Mentorship,8,Client and Deal Exposure
2,3,"High pressure environment with significant demands on time. The experience gained is invaluable for career growth, but burnout is a real concern. More flexibility would be appreciated.",1,Work-Life Balance Challenges,2,Career Development and Learning,6,Mental Health and Well-being
3,4,"Excellent exposure to senior bankers and clients. The hierarchical structure can be frustrating at times, but provides clear career progression paths. Bonus structure could be more transparent.",8,Client and Deal Exposure,4,Workplace Culture and Environment,3,Compensation and Benefits
4,5,"The analytical skills developed here are unmatched. Working on high-profile deals is exciting, though the constant urgency can be draining. More recognition for junior staff would improve morale.",2,Career Development and Learning,8,Client and Deal Exposure,4,Workplace Culture and Environment


### Running the agent to fully automate the workflow

The agent is designed to automate all steps of the process: it takes a corpus of documents, discovers topics, and assigns these topics to the documents.

First, let's create a new synthetic dataset of documents to test the agent:


In [7]:
# Generate synthetic documents for a specific industry and context
hospital_patient_survey = sn.generate_synthetic_docs(
    industry="hospital emergency care",
    context="comments provided by patients on their experience in the emergency department",
    num_docs=200,
    min_doc_length=50,
    max_doc_length=200,
)

hospital_patient_survey.head()


    Generating 200 synthetic documents for the hospital emergency care industry with the requested context using model claude-3-7-sonnet-latest. 
    This usually takes about one second per document....
              
Generated 200 synthetic documents successfully in 155 seconds.


Unnamed: 0,docID,content
0,1,"The wait time in the emergency department was incredibly long. I sat there for over 3 hours before being seen by a doctor. The nurses seemed overwhelmed and understaffed. However, once I was finally treated, the care was thorough and professional. The doctor explained everything clearly and addressed my concerns."
1,2,"I was impressed by how quickly I was seen in the emergency room. The triage nurse was efficient and compassionate. The doctor who treated me was knowledgeable and took time to explain my condition. The facility was clean, though the waiting area was quite crowded. Overall, a positive experience during a stressful situation."
2,3,"The emergency staff seemed disorganized. I had to repeat my symptoms to multiple people, and there seemed to be poor communication between the nurses and doctors. The waiting room was uncomfortable with hard chairs and bright lights. It took forever to get my test results back. Not a good experience when you're already feeling terrible."
3,4,"I appreciated the kindness shown by the emergency department staff during my visit. The nurse who took my vitals was particularly gentle and reassuring. While the wait was longer than expected, I understand they were prioritizing more urgent cases. The doctor was thorough and didn't rush through my examination."
4,5,"The emergency room was freezing cold! I sat shivering for hours waiting to be seen. When I finally got treatment, the doctor seemed rushed and barely made eye contact. The nurse was nice though and brought me an extra blanket. The discharge instructions were confusing and I left not fully understanding my care plan."


In [9]:
# A distinct alias avoids rebinding `ta`, which In [5] uses for topic_assignment
from docagent import topic_agent as tg

# Perform full topic analysis on the hospital patient survey documents, based on 6-8 topics
discovered_topics, assigned_topics = await tg.full_topic_analysis(
    corpus = hospital_patient_survey,
    corpus_name = "hospital_patient_survey",
    min_topics = 6,
    max_topics = 8,
)

Starting topic discovery from 200 documents using model claude-opus-4-20250514...
Topic discovery complete. Discovered 8 topics in 39 seconds.
Commencing assignment of 200 documents to 8 topics. This will be done in chunks of 20 documents at a time...
Assigning topics for docs 1 to 20...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 21 to 40...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 41 to 60...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained suc

The agent produces two DataFrames: the first contains the topics discovered in the documents, and the second contains the documents with assigned topics.

In [10]:
discovered_topics

Unnamed: 0,topicID,topicName,topicDescription,topicKeywords,topicExamples,topicPrevalence
0,7,Efficient Triage and Process,"Positive experiences highlighting efficient triage systems, quick assessment, reasonable wait times, and well-organized processes. Patients appreciate being seen promptly and kept informed throughout their visit.","[efficient, triage, quickly assessed, prompt, reasonable time, within an hour, organized, informed, quick, efficient process]","[2, 6, 9, 12, 14, 18, 20, 22, 24, 26, 28, 30, 31, 34, 36, 38, 40, 42, 44, 46, 48, 50, 54, 56, 58, 62, 64, 66, 70, 72, 74, 78, 80, 82, 86, 88, 90, 94, 96, 98, 102, 104, 106, 110, 112, 114, 118, 120, 122, 126, 128, 130, 134, 136, 138, 142, 144, 146, 150, 152, 154, 158, 160, 162, 166, 168, 170, 174, 176, 178, 182, 184, 186, 190, 192, 194, 198, 200]",0.39
1,1,Long Wait Times,"Experiences focused on excessive waiting periods in the emergency department, often lasting 3-7 hours or more. Patients express frustration with the lengthy delays before receiving medical attention.","[wait time, hours, waiting, long wait, waited, 3 hours, 5 hours, 6 hours, 7 hours, excessive]","[1, 7, 13, 17, 27, 33, 37, 45, 53, 61, 67, 77, 85, 93, 101, 117, 125, 133, 141, 149, 157, 165, 173, 181, 189, 197]",0.32
2,2,Positive Staff Experience,"Patients reporting positive interactions with emergency department staff, highlighting professionalism, compassion, and effective communication. Staff are described as attentive, empathetic, and thorough.","[professional, compassionate, attentive, excellent care, empathetic, dignity, respect, thorough, listened, reassuring]","[2, 4, 6, 9, 12, 14, 16, 20, 22, 26, 28, 30, 36, 38, 42, 46, 50, 54, 58, 66, 70, 74, 78, 82, 86, 90, 94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138, 142, 146, 150, 154, 158, 162, 166, 170, 174, 178, 182, 186, 190, 194, 198]",0.265
3,6,Rushed Doctor Interactions,"Experiences where patients felt their interaction with the doctor was brief, rushed, or impersonal. Doctors are described as distracted, spending minimal time with patients, and not fully addressing concerns.","[rushed, brief, minimal time, distracted, impersonal, five minutes, didn't address concerns, rushed interaction, spent minimal time, barely made eye contact]","[3, 5, 8, 10, 15, 19, 21, 25, 32, 39, 43, 47, 52, 55, 60, 63, 68, 71, 76, 79, 84, 87, 92, 95, 100, 103, 108, 111, 116, 119, 124, 127, 132, 135, 140, 143, 148, 151, 156, 159, 164, 167, 172, 175, 180, 183, 188, 191, 196, 199]",0.25
4,4,Overcrowding and Understaffing,"Experiences highlighting severe overcrowding in emergency departments with patients waiting in hallways, minimal privacy, and staff appearing overwhelmed. These reviews often mention understaffing as a core issue.","[overcrowded, understaffed, hallways, overwhelmed, stressed, privacy, lined up, stretched thin, busy, chaotic]","[7, 13, 17, 23, 27, 29, 33, 35, 41, 49, 57, 65, 69, 73, 81, 89, 97, 105, 109, 113, 121, 129, 137, 145, 153, 161, 169, 177, 185, 193]",0.15
5,8,Staff Under Pressure,"Observations about staff working under difficult circumstances, doing their best despite being overwhelmed, stressed, or exhausted. Patients acknowledge the challenging conditions while expressing disappointment with the overall experience.","[doing their best, stressed, overwhelmed, exhausted, difficult circumstances, challenges, trying their best, understaffed, busy night, stretched thin]","[11, 13, 17, 23, 29, 35, 41, 49, 57, 65, 69, 73, 81, 89, 97, 105, 109, 113, 121, 129, 137, 145, 153, 161, 169, 177, 185, 193]",0.14
6,3,Facility and Environment Issues,"Concerns about the physical environment of the emergency department, including uncomfortable seating, poor lighting, noise levels, temperature issues, and outdated facilities. Patients describe the waiting areas as crowded and uncomfortable.","[uncomfortable, seating, crowded, noisy, television, cold, freezing, outdated, facility, uncomfortable chairs, bright lights, blaring]","[3, 5, 15, 19, 32, 39, 47, 51, 55, 63, 71, 79, 87, 95, 103, 111, 119, 127, 135, 143, 151, 159, 167, 175, 183, 191, 199]",0.135
7,5,Poor Communication and Organization,"Patient experiences describing disorganized processes, poor communication between staff, having to repeat information multiple times, and unclear discharge instructions. Patients feel frustrated by the lack of coordination.","[disorganized, repeat information, poor communication, confused, discharge instructions, unclear, multiple times, coordination, chaotic, confusing]","[3, 8, 21, 32, 43, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196]",0.12


In [11]:
assigned_topics.head()

Unnamed: 0,docID,docContent,primaryTopicID,primaryTopicName,secondaryTopicID,secondaryTopicName,tertiaryTopicID,tertiaryTopicName
0,1,"The wait time in the emergency department was incredibly long. I sat there for over 3 hours before being seen by a doctor. The nurses seemed overwhelmed and understaffed. However, once I was finally treated, the care was thorough and professional. The doctor explained everything clearly and addressed my concerns.",1,Long Wait Times,4,Overcrowding and Understaffing,2,Positive Staff Experience
1,2,"I was impressed by how quickly I was seen in the emergency room. The triage nurse was efficient and compassionate. The doctor who treated me was knowledgeable and took time to explain my condition. The facility was clean, though the waiting area was quite crowded. Overall, a positive experience during a stressful situation.",7,Efficient Triage and Process,2,Positive Staff Experience,0,No topic assigned
2,3,"The emergency staff seemed disorganized. I had to repeat my symptoms to multiple people, and there seemed to be poor communication between the nurses and doctors. The waiting room was uncomfortable with hard chairs and bright lights. It took forever to get my test results back. Not a good experience when you're already feeling terrible.",5,Poor Communication and Organization,3,Facility and Environment Issues,0,No topic assigned
3,4,"I appreciated the kindness shown by the emergency department staff during my visit. The nurse who took my vitals was particularly gentle and reassuring. While the wait was longer than expected, I understand they were prioritizing more urgent cases. The doctor was thorough and didn't rush through my examination.",2,Positive Staff Experience,0,No topic assigned,0,No topic assigned
4,5,"The emergency room was freezing cold! I sat shivering for hours waiting to be seen. When I finally got treatment, the doctor seemed rushed and barely made eye contact. The nurse was nice though and brought me an extra blanket. The discharge instructions were confusing and I left not fully understanding my care plan.",3,Facility and Environment Issues,6,Rushed Doctor Interactions,5,Poor Communication and Organization
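
In the output above, a topic ID of 0 with the name "No topic assigned" marks an unused secondary or tertiary slot. When summarizing assignments, these placeholder rows usually need to be excluded. A minimal sketch on toy data follows, reshaping the three topic columns into long form and counting documents per topic. The column names come from the output above; the data is illustrative.

```python
import pandas as pd

# Toy frame using the column names from the assignment output.
# Topic ID 0 is the "No topic assigned" placeholder.
assigned = pd.DataFrame(
    {
        "docID": [1, 2, 3],
        "primaryTopicID": [1, 7, 2],
        "secondaryTopicID": [4, 2, 0],
        "tertiaryTopicID": [2, 0, 0],
    }
)

# Reshape to one row per (document, rank) pair.
long_form = assigned.melt(
    id_vars="docID",
    value_vars=["primaryTopicID", "secondaryTopicID", "tertiaryTopicID"],
    var_name="rank",
    value_name="topicID",
)

# Drop the placeholder rows, then count how many documents carry each topic.
long_form = long_form[long_form["topicID"] != 0]
topic_counts = long_form.groupby("topicID")["docID"].nunique().sort_index()

print(topic_counts)
```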
