# Expand Data Sources & Grow Knowledge
This notebook demonstrates how easily a knowledge graph backed by a graph database can be expanded to include new data sources. We show how to add structured data from a Human Resource Information System (HRIS) and connect it to the people extracted from the unstructured resume documents in module 2.

This data source makes a different assumption from the resume data: several people can contribute to the same thing through their accomplishments. Because the data is internal, we can see who collaborated on which projects. Data model updates like this can force a refactor in an RDBMS, but a graph database absorbs them without issue.

![](img/data-model-updates-with-tables.png)
![](img/data-model-updates-with-graph.png)

In [1]:
import getpass
import os
from dotenv import load_dotenv

#get env setup
load_dotenv('nb.env', override=True)

if not os.environ.get('NEO4J_URI'):
    os.environ['NEO4J_URI'] = getpass.getpass('NEO4J_URI:\n')
if not os.environ.get('NEO4J_USERNAME'):
    os.environ['NEO4J_USERNAME'] = getpass.getpass('NEO4J_USERNAME:\n')
if not os.environ.get('NEO4J_PASSWORD'):
    os.environ['NEO4J_PASSWORD'] = getpass.getpass('NEO4J_PASSWORD:\n')

NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

## Graph Construction

In [2]:
import pandas as pd

# read structured hris data
df = pd.read_csv('hris-tables/project-assignments.csv')
# Some minor formatting
df['in_progress'] = df['end_date'].isna()
df['end_date'] = df['end_date'].fillna("present")
df['duration'] = df['start_date'] + " - " + df['end_date']
df['year'] = df['end_date'].apply(lambda x: x[:4] if x.lower() != "present" else None)
df.head()

Unnamed: 0,project_name,role_title,allocation_percentage,start_date,end_date,project_domain,project_type,accomplishment_type,person_id,in_progress,duration,year
0,Supply Chain Optimization Engine,AI Consultant,30.0,2018-03-01,2018-11-30,AI,PRODUCT,BUILT,xRPBlhk9,False,2018-03-01 - 2018-11-30,2018.0
1,Automated Content Moderation,Senior ML Engineer,60.0,2020-02-01,2021-08-31,AI,PRODUCT,BUILT,xRPBlhk9,False,2020-02-01 - 2021-08-31,2021.0
2,Edge Computing AI Platform,Principal Architect,80.0,2022-01-01,2023-12-31,AI,INFRASTRUCTURE,BUILT,xRPBlhk9,False,2022-01-01 - 2023-12-31,2023.0
3,Quantum-Classical Hybrid Research,Research Lead,40.0,2024-01-01,present,AI,RESEARCH,LED,xRPBlhk9,True,2024-01-01 - present,
4,Employee Self-Service Portal,Full-Stack Developer,90.0,2020-06-01,2021-12-31,WEB,PRODUCT,BUILT,kkkMTAId,False,2020-06-01 - 2021-12-31,2021.0
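
To make the derived columns concrete, here is a minimal sketch of the same three derivations applied to a single record in plain Python. The `derive_fields` helper is hypothetical and not part of the notebook, and it assumes a missing `end_date` arrives as `None` rather than as a pandas `NaN`.

```python
from typing import Optional


def derive_fields(start_date: str, end_date: Optional[str]) -> dict:
    """Mirror the notebook's formatting step for one assignment record."""
    # An assignment without an end date is treated as still in progress.
    in_progress = end_date is None
    end = end_date if end_date is not None else "present"
    return {
        "in_progress": in_progress,
        "end_date": end,
        "duration": f"{start_date} - {end}",
        # The year is the first four characters of the end date, or None if ongoing.
        "year": end[:4] if end.lower() != "present" else None,
    }


finished = derive_fields("2018-03-01", "2018-11-30")
ongoing = derive_fields("2024-01-01", None)
print(finished["duration"], finished["year"])
print(ongoing["duration"], ongoing["year"])
```

The `year` ends up on the relationship as `r.year`, so ongoing assignments carry no year at all rather than a placeholder value.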


In [3]:
from neo4j import GraphDatabase

# load into People nodes in Neo4j

#instantiate driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

#test neo4j connection
driver.execute_query("MATCH(n) RETURN count(n)")

EagerResult(records=[<Record count(n)=327>], summary=<neo4j._work.summary.ResultSummary object at 0x10bec4910>, keys=['count(n)'])

In [4]:
from neo4j import RoutingControl


#load structured data
def chunks(xs, n=10):
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]

for chunk in chunks(df.to_dict(orient='records')):
    records = driver.execute_query(
        """
        UNWIND $records AS rec

        //match people
        MATCH(person:Person {id:rec.person_id})

        //merge accomplishments
        MERGE(thing:Thing {name:rec.project_name})
        MERGE(person)-[r:$(rec.accomplishment_type)]->(thing)
        SET r.year = rec.year,
            r.role  = rec.role_title,
            r.duration = rec.duration,
            thing.in_progress = rec.in_progress,
            thing.internal_project=true

        //merge domain and work type
        MERGE(Domain:Domain {name:rec.project_domain})
        MERGE(thing)-[:IN]->(Domain)
        MERGE(WorkType:WorkType {name:rec.project_type})
        MERGE(thing)-[:OF]->(WorkType)

        RETURN count(rec) AS records_upserted
        """,
        #database_=DATABASE,
        routing_=RoutingControl.WRITE,
        result_transformer_= lambda r: r.data(),
        records = chunk
    )
    print(records)

[{'records_upserted': 10}]
[{'records_upserted': 10}]
[{'records_upserted': 10}]
[{'records_upserted': 10}]
[{'records_upserted': 10}]
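
The `$(rec.accomplishment_type)` expression in the query above sets the relationship type dynamically, which depends on dynamic-type support in the Neo4j version being used. Where that is unavailable, or where the type values come from a less trusted source, one option is to validate the types in Python and group the records per type before loading. The sketch below covers only the grouping and batching. The `ALLOWED_TYPES` set and the `group_by_type` helper are assumptions for illustration (the two values are the ones visible in the sample rows), not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed allow-list: only the values visible in the sample rows above.
ALLOWED_TYPES = {"BUILT", "LED"}


def chunks(xs: List[dict], n: int = 10) -> List[List[dict]]:
    """Split a list into batches of at most n items (same helper as the notebook)."""
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]


def group_by_type(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records by accomplishment_type, rejecting unexpected values."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        rel_type = rec["accomplishment_type"]
        if rel_type not in ALLOWED_TYPES:
            raise ValueError(f"unexpected relationship type: {rel_type!r}")
        grouped[rel_type].append(rec)
    return dict(grouped)


sample = [
    {"person_id": "xRPBlhk9", "project_name": "Supply Chain Optimization Engine",
     "accomplishment_type": "BUILT"},
    {"person_id": "xRPBlhk9", "project_name": "Quantum-Classical Hybrid Research",
     "accomplishment_type": "LED"},
    {"person_id": "kkkMTAId", "project_name": "Employee Self-Service Portal",
     "accomplishment_type": "BUILT"},
]

# One list of batches per relationship type.
batches = {t: chunks(recs, n=2) for t, recs in group_by_type(sample).items()}
for rel_type, type_batches in batches.items():
    print(rel_type, [len(b) for b in type_batches])
```

Each group could then be loaded with a query whose relationship type is fixed for that group, since every type has already been checked against the allow-list.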


## Agents and Retrieval

In [5]:
# same tools from notebook 2

def find_similar_people(person_id: str):
    """
    Returns people who are potentially similar to the given person, based on common skills and on the types and domains of their accomplishments. Use this as a starting point for similarity scores, then use follow-up tools and queries to collect more information.
    :param person_id: the id of the person to search for similarities for
    :return: a list of person ids for similar candidates, ordered by score, where the score is the count of common skills and of types and domains of accomplishments.
    """
    res = driver.execute_query(
        '''
        MATCH p=(p1:Person {id:$personId})--()
                 ((:!Person)--() ){0,3}
                 (p2:Person)
        RETURN count(*) AS score, p2.id AS person_id
        ORDER BY score DESC LIMIT $limit
         ''',
        personId=person_id,
        limit=5, #just hard code for now
        result_transformer_= lambda r: r.data())

    return res

def find_similarities_between_people(person1_id: str, person2_id: str):
    """
    This function will return potential similarities between people in the form of skill and accomplishment paths.  You can use this as a starting point to find similarities and query the graph further using the various name fields and person ids.
    :param person1_id: the id of the first person to compare
    :param person2_id: the id of the second person to compare
    :return: a list of paths between the two people, each path is a compact ascii string representation.  It should reflect the patterns in the graph schema
    """
    res = driver.execute_query(
        '''
        MATCH p=(p1:Person {id:$person1_id})--()
                 ((:!Person)--() ){0,3}
                 (p2:Person{id:$person2_id})
        WITH p, nodes(p) as path_nodes, relationships(p) as path_rels, p1, p2
        RETURN
          "(" + labels(path_nodes[0])[0] + " {name: \\"" + path_nodes[0].name + "\\" id: \\"" + path_nodes[0].id + "\\"})" +
          reduce(chain = "", i IN range(0, size(path_rels)-1) |
            chain +
            "-[" + type(path_rels[i]) + "]-" +
            "(" + labels(path_nodes[i+1])[0] + " {name: \\"" + path_nodes[i+1].name +
            CASE WHEN "Person" IN labels(path_nodes[i+1])
                 THEN "\\" id: \\"" + path_nodes[i+1].id +"\\""
                 ELSE "\\"" END + "})"
          ) as paths ORDER BY p1.id, p2.id
         ''',
        person1_id=person1_id,
        person2_id=person2_id,
        result_transformer_= lambda r: r.values())

    return res

def get_person_resume(person_id: str):
    """
    Gets the full resume of a person
    :param person_id: the id of the person
    :return: resume text and person name
    """
    res = driver.execute_query(
        '''
        MATCH (n:Person {id: $personId})
        RETURN n.text as resume, n.name AS name
         ''',
        personId=person_id,
        result_transformer_= lambda r: r.data())

    return res

def get_person_name(person_id: str):
    """
    Gets a person name given their id
    :param person_id: the unique id of the person
    :return: person name
    """
    res = driver.execute_query(
        '''
        MATCH (n:Person {id: $personId})
        RETURN n.name
         ''',
        personId=person_id,
        result_transformer_= lambda r: r.values())

    return res

def get_person_ids_from_name(person_name: str):
    """
    Gets all the unique people ids who have the provided name
    :param person_name: the name to look up person ids with
    :return: person ids that can be used for other tools and queries.  Note that names aren't guaranteed to be unique so you may get more than one person.
    """
    res = driver.execute_query(
        '''
        MATCH (n:Person {name: $personName})
        RETURN n.id
         ''',
        personName=person_name,
        result_transformer_= lambda r: r.values())

    return res
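
The string concatenation in `find_similarities_between_people` is easier to follow with an example of the text it produces. The sketch below rebuilds the same compact ASCII format in plain Python, reading the format off the Cypher `reduce` expression above. The `format_path` helper and the dictionary shape used for nodes are hypothetical. The names, ids, and relationship types come from the project data shown in this notebook.

```python
from typing import List


def format_node(node: dict) -> str:
    """Render one node; only Person nodes carry an id in the output."""
    text = f'({node["label"]} {{name: "{node["name"]}"'
    if node["label"] == "Person":
        text += f' id: "{node["id"]}"'
    return text + "})"


def format_path(nodes: List[dict], rel_types: List[str]) -> str:
    """Join nodes with undirected -[TYPE]- segments, as the Cypher query does."""
    out = format_node(nodes[0])
    for rel_type, node in zip(rel_types, nodes[1:]):
        out += f"-[{rel_type}]-" + format_node(node)
    return out


path = format_path(
    nodes=[
        {"label": "Person", "name": "Sarah Chen", "id": "xRPBlhk9"},
        {"label": "Thing", "name": "Supply Chain Optimization Engine"},
        {"label": "Person", "name": "Jennifer Park", "id": "JSneKsS4"},
    ],
    rel_types=["BUILT", "LED"],
)
print(path)
```

Note that the rendered path drops relationship direction, which matches the undirected `--` patterns used in the query.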

### Tool - Finding Collaborators
![](img/collaborators.png)

In [6]:
from person import Domain
from typing import List

# Additional tool for finding collaborators
def find_collaborators_in_domain(domains: List[str]):
    f"""
    Finds projects within a given set of domains where multiple people collaborated
    along with project details and individual contributions.

    It returns detailed information about each project, including the project name,
    domain, work type, and the list of collaborators (their names, IDs, and their contributions).

    Parameters:
        domains (List[Domain]): A list of domains (from the `Domain` enum) to filter
                                projects where collaborations occurred. Valid values are: {[i.value for i in Domain]}

    Returns:
        List[dict]: A list of dictionaries containing project and collaboration details.
        Each dictionary has the following structure:
        - `projectName` (str): Name of the project.
        - `domain` (str): The domain of the project (e.g., "AI").
        - `workType` (str): The type of work associated with the project (e.g., "PRODUCT").
        - `people` (List[dict]): A list of collaborators with:
            - `name` (str): Name of the person.
            - `id` (str): Unique ID of the person.
            - `contributions` (List[dict]): Contribution details, including:
                - `contribution` (str): The type of contribution (e.g., "BUILT IT").
                - `roleDetails` (str): Role of the person on the project (e.g., "AI Consultant").
                - `duration` (str): Time period of the contribution (e.g., "2022-01-01 - 2023-12-31").

    Example output:
        [
            {{
                "projectName": "Supply Chain Optimization Engine",
                "domain": "AI",
                "workType": "PRODUCT",
                "people": [
                    {{
                        "name": "John Doe",
                        "id": "xRPBlhk9",
                        "contributions": [
                            {{
                                "contribution": "BUILT IT",
                                "roleDetails": "AI Consultant",
                                "duration": "2018-03-01 - 2018-11-30"
                            }}
                        ]
                    }},
                    {{
                        "name": "Jane Smith",
                        "id": "kkkMTAId",
                        "contributions": [
                            {{
                                "contribution": "BUILT IT",
                                "roleDetails": "Full-Stack Developer",
                                "duration": "2020-06-01 - 2021-12-31"
                            }}
                        ]
                    }}
                ]
            }}
        ]

    """
    res = driver.execute_query(
        '''
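        // graph pattern: get collaborators in a set of domains
        // note: each (p, r) pair matches once per other collaborator on t,
        // so a contribution repeats in the collected output for teams of three or more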
        MATCH (p:Person)-[r]->(t:Thing)<-[]-(:Person)
        MATCH (t)-[:IN]->(d)
        MATCH (t)-[:OF]->(w)
        WHERE d.name IN $domains

        //format for response
        WITH t, d, w, p,
        collect({contribution: type(r) + " IT", roleDetails: r.role, duration:r.duration}) AS contributions
        WITH t.name AS projectName, t.internal_project AS is_internal, d.name AS domain, w.name AS workType, collect({name: p.name, id: p.id, contributions: contributions}) AS people

        //return
        RETURN projectName, domain, workType, people
        ''',
        domains=domains,
        result_transformer_= lambda r: r.data())

    return res

find_collaborators_in_domain([Domain.AI])

[{'projectName': 'Supply Chain Optimization Engine',
  'domain': 'AI',
  'workType': 'PRODUCT',
  'people': [{'id': 'MhzMrjwz',
    'name': 'Robert Johnson',
    'contributions': [{'duration': '2018-03-01 - 2018-11-30',
      'roleDetails': 'Security Consultant',
      'contribution': 'BUILT IT'},
     {'duration': '2018-03-01 - 2018-11-30',
      'roleDetails': 'Security Consultant',
      'contribution': 'BUILT IT'}]},
   {'id': 'JSneKsS4',
    'name': 'Jennifer Park',
    'contributions': [{'duration': '2018-03-01 - 2018-11-30',
      'roleDetails': 'Data Engineering Lead',
      'contribution': 'LED IT'},
     {'duration': '2018-03-01 - 2018-11-30',
      'roleDetails': 'Data Engineering Lead',
      'contribution': 'LED IT'}]},
   {'id': 'xRPBlhk9',
    'name': 'Sarah Chen',
    'contributions': [{'duration': '2018-03-01 - 2018-11-30',
      'roleDetails': 'AI Consultant',
      'contribution': 'BUILT IT'},
     {'duration': '2018-03-01 - 2018-11-30',
      'roleDetails': 'AI Cons
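
One caveat about the `f"""` string at the top of `find_collaborators_in_domain`: Python does not store an f-string as a function's docstring, so `__doc__` stays `None`, and a framework that reads tool descriptions from `__doc__` may not see that text. A minimal sketch of one workaround follows: write a plain docstring with a placeholder and fill it in after the function is defined. The `Domain` enum here is a hypothetical stand-in for `person.Domain`, and both tool functions are illustrative only.

```python
from enum import Enum


class Domain(str, Enum):
    """Hypothetical stand-in for person.Domain, with two values from this notebook."""
    AI = "AI"
    WEB = "WEB"


def tool_with_fstring(domains):
    f"""Valid values are: {[d.value for d in Domain]}"""
    return domains


def tool_with_docstring(domains):
    """Valid values are: {valid}"""
    return domains


# Fill the placeholder once, after definition, so __doc__ holds the final text.
tool_with_docstring.__doc__ = tool_with_docstring.__doc__.format(
    valid=[d.value for d in Domain]
)

print(tool_with_fstring.__doc__)
print(tool_with_docstring.__doc__)
```

With this approach, literal braces in the docstring (such as the JSON in the example output) still need to be doubled, exactly as they are in the f-string version.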

In [7]:
from AgentRunner import AgentRunner
# build adk agent with neo4j mcp
from person import Domain, WorkType, SkillName
from google.adk.models.lite_llm import LiteLlm
from google.adk.agents import Agent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters

talent_agent = Agent(
    name="talent_agent",
    # model="gemini-2.0-flash-exp",
    model=LiteLlm(model="openai/gpt-4.1"),
    # model=LiteLlm(model="anthropic/claude-sonnet-4-20250514"),
    description="""
    Knowledge assistant for skills analysis, search, and team formation
    """,
    instruction=f"""
    You are a human resources assistant who helps with skills analysis, talent search, and team formation at Cyberdyne Systems.

    Your tools retrieve data from internal knowledge on Cyberdyne Systems employees based on their resumes and profiles.

    Try to prioritize expert tools (those other than `read_neo4j_cypher`) as appropriate since they have expert-approved logic for accessing data. You may still need to access data directly afterwards to pull more details.


    When you need more flexible logic for aggregations, follow-ups, or anything else, you can access the knowledge (in a graph database) directly. ALWAYS get the schema first with `get_neo4j_schema` and keep it in memory. Only use node labels, relationship types, property names, and patterns from that schema to generate valid Cypher queries using the `read_neo4j_cypher` tool with proper parameter syntax ($parameter). If you get errors or empty results, check the schema and try again, up to 3 times.

    For domain knowledge, use these standard values:
    - Domains: {[i.value for i in Domain]}
    - Work Types: {[i.value for i in WorkType]}
    - Skills: {[i.value for i in SkillName]}

    Also never return embedding properties in Cypher queries. This will result in delays and errors.

    When responding to the user:
    - If your response includes people, include their names and IDs. Never just their IDs.
    - You must explain your retrieval logic and where the data came from. You must say exactly how relevance, similarity, etc. was inferred during search

    Use information from previous queries when possible instead of asking the user again.
    """,
    tools=[find_similar_people,
           find_similarities_between_people,
           get_person_name,
           get_person_resume,
           get_person_ids_from_name,
           find_collaborators_in_domain,
           MCPToolset(
        connection_params=StdioServerParameters(
            command='uvx',
            args=[
                "mcp-neo4j-cypher",
            ],
            env={ k: os.environ[k] for k in ["NEO4J_URI","NEO4J_USERNAME","NEO4J_PASSWORD"] }
        ),
        tool_filter=['get_neo4j_schema','read_neo4j_cypher']
    )]
)

talent_agent_runner = AgentRunner(app_name='talent_agent', user_id='Mr. Ed', agent=talent_agent)
await talent_agent_runner.start_session()

Session started successfully with ID: 1f14633c-432d-457c-a1fa-4495891790c3


True

In [8]:
res = await talent_agent_runner.run("Which individuals have collaborated with each other to deliver the most AI Things?")

None id='call_xmp0Y9NgmrlfUEzp62vwgErw' args={'domains': ['AI']} name='find_collaborators_in_domain' None
None None will_continue=None scheduling=None id='call_xmp0Y9NgmrlfUEzp62vwgErw' name='find_collaborators_in_domain' response={'result': [{'projectName': 'Supply Chain Optimization Engine', 'domain': 'AI', 'workType': 'PRODUCT', 'people': [{'id': 'MhzMrjwz', 'name': 'Robert Johnson', 'contributions': [{'duration': '2018-03-01 - 2018-11-30', 'roleDetails': 'Security Consultant', 'contribution': 'BUILT IT'}, {'duration': '2018-03-01 - 2018-11-30', 'roleDetails': 'Security Consultant', 'contribution': 'BUILT IT'}]}, {'id': 'JSneKsS4', 'name': 'Jennifer Park', 'contributions': [{'duration': '2018-03-01 - 2018-11-30', 'roleDetails': 'Data Engineering Lead', 'contribution': 'LED IT'}, {'duration': '2018-03-01 - 2018-11-30', 'roleDetails': 'Data Engineering Lead', 'contribution': 'LED IT'}]}, {'id': 'xRPBlhk9', 'name': 'Sarah Chen', 'contributions': [{'duration': '2018-03-01 - 2018-11-30',

In [9]:
from IPython.display import Markdown, display

display(Markdown(res))

To determine which individuals have collaborated most often to deliver AI-related projects ("AI Things"), I investigated collaborations within the "AI" domain using expert-driven internal logic that identifies project team compositions.

Here’s what the data reveals:

Most Frequent Collaborators on AI Projects:
- Sarah Chen (ID: xRPBlhk9) and Dr. Amanda Foster (ID: UhZn6uYW) have collaborated together on two major AI projects:
   1. Edge Computing AI Platform (Sarah as Principal Architect, Dr. Foster as Research Advisor, 2022–2023)
   2. Quantum-Classical Hybrid Research (Sarah as Research Lead, Dr. Foster as Research Collaborator, 2024–present)

Other Notable Collaborations:
- The Supply Chain Optimization Engine was delivered by Robert Johnson (ID: MhzMrjwz), Jennifer Park (ID: JSneKsS4), and Sarah Chen (ID: xRPBlhk9) working as a team.
- The AI Ethics Governance Framework involved Dr. Amanda Foster (ID: UhZn6uYW) and Lisa Wang (ID: ZIMWCRHs) working together.

Summary:
Sarah Chen and Dr. Amanda Foster are the top pair, having repeatedly teamed up for prominent AI projects across several years.

This conclusion was drawn by extracting project contributor data tied to the AI domain, tallying repeated collaborations, and prioritizing pairs with the most frequent shared project delivery records. The information comes directly from Cyberdyne System’s internal project and contributor relationship data.
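
The "tallying repeated collaborations" step the agent describes can be sketched in a few lines of Python over the structure returned by `find_collaborators_in_domain`. The `count_pairs` helper is hypothetical, and the sample below is transcribed from the agent's answer above (names and ids only), so it may not cover every AI project in the graph.

```python
from collections import Counter
from itertools import combinations
from typing import List

# Trimmed sample in the shape of the tool output (contributions omitted).
projects = [
    {"projectName": "Edge Computing AI Platform",
     "people": [{"name": "Sarah Chen", "id": "xRPBlhk9"},
                {"name": "Dr. Amanda Foster", "id": "UhZn6uYW"}]},
    {"projectName": "Quantum-Classical Hybrid Research",
     "people": [{"name": "Sarah Chen", "id": "xRPBlhk9"},
                {"name": "Dr. Amanda Foster", "id": "UhZn6uYW"}]},
    {"projectName": "Supply Chain Optimization Engine",
     "people": [{"name": "Robert Johnson", "id": "MhzMrjwz"},
                {"name": "Jennifer Park", "id": "JSneKsS4"},
                {"name": "Sarah Chen", "id": "xRPBlhk9"}]},
    {"projectName": "AI Ethics Governance Framework",
     "people": [{"name": "Dr. Amanda Foster", "id": "UhZn6uYW"},
                {"name": "Lisa Wang", "id": "ZIMWCRHs"}]},
]


def count_pairs(projects: List[dict]) -> Counter:
    """Count how many projects each unordered pair of person ids shares."""
    pairs: Counter = Counter()
    for project in projects:
        # A set avoids double counting if a person is listed twice on a project.
        ids = sorted({person["id"] for person in project["people"]})
        for pair in combinations(ids, 2):
            pairs[pair] += 1
    return pairs


pair_counts = count_pairs(projects)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Counting by id rather than by name keeps the tally correct even when two people share a name.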

In [10]:
await talent_agent_runner.end_session()

Session 1f14633c-432d-457c-a1fa-4495891790c3 ended successfully


True