# LangGraph and Knowledge Graphs Tutorial

## Learning Objectives 🎯

By the end of this tutorial, you will understand:

1. **Knowledge Graphs**: What they are and how they represent domain knowledge
2. **LangGraph**: How to build AI workflows with state management
3. **Biomedical Applications**: Real-world uses of AI + knowledge graphs
4. **Practical Implementation**: How to build your own AI agents

## Prerequisites 📚

- Basic Python programming
- Understanding of databases (helpful but not required)
- Interest in AI and biomedical applications

---

## Part 1: What are Knowledge Graphs? 🕸️

### The Problem with Traditional Data Storage

Imagine you're studying biology and want to answer: **"What genes are related to diabetes?"**

With traditional databases (tables), you might have:
- `genes` table
- `diseases` table  
- `gene_disease_associations` table

But what about complex questions like: **"What pathway connects Gene X to Drug Y through proteins and diseases?"**

In a relational schema, every hop in that chain is another JOIN, so the query grows longer and harder to maintain with each relationship you add!

### Knowledge Graphs: A Better Way

Knowledge graphs store information as **nodes** (entities) and **relationships** (edges):

```
Gene_Alpha --[ENCODES]--> Protein_Beta --[ASSOCIATED_WITH]--> Diabetes
                                                                  ^
                                                                  |
                                                              [TREATS]
                                                                  |
                                                             Drug_Gamma
```

This naturally represents how biological entities relate to each other!
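
To make the idea concrete, here is a minimal sketch that stores the diagram above as plain Python triples and searches for a connecting path. It is an illustration only: the `TRIPLES` list and the `find_path` helper are made up for this example and are not part of the tutorial's `agent` package or of Neo4j.

```python
from collections import deque

# Edges as (source, relationship, target) triples, mirroring the diagram above.
TRIPLES = [
    ("Gene_Alpha", "ENCODES", "Protein_Beta"),
    ("Protein_Beta", "ASSOCIATED_WITH", "Diabetes"),
    ("Drug_Gamma", "TREATS", "Diabetes"),
]


def build_adjacency(triples):
    """Index every edge from both endpoints so a path can cross it in either direction."""
    adjacency = {}
    for source, relation, target in triples:
        adjacency.setdefault(source, []).append((f"-[{relation}]->", target))
        adjacency.setdefault(target, []).append((f"<-[{relation}]-", source))
    return adjacency


def find_path(triples, start, goal):
    """Breadth-first search; returns the shortest path as a readable string, or None."""
    adjacency = build_adjacency(triples)
    queue = deque([(start, start)])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for arrow, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, f"{path} {arrow} {neighbor}"))
    return None


print(find_path(TRIPLES, "Gene_Alpha", "Drug_Gamma"))
# Gene_Alpha -[ENCODES]-> Protein_Beta -[ASSOCIATED_WITH]-> Diabetes <-[TREATS]- Drug_Gamma
```

The "Gene X to Drug Y" question becomes a single traversal instead of a chain of table joins. A graph database does the same kind of traversal for you, with indexing and a query language on top.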

## Part 2: Our Biomedical Knowledge Graph 🧬

### Graph Schema

Our knowledge graph contains:

**Nodes (Entities):**
- 🧬 **Gene**: Genetic sequences (e.g., GENE_ALPHA, BRCA1)
- 🧪 **Protein**: Proteins encoded by genes (e.g., PROT_BETA, insulin)
- 🏥 **Disease**: Medical conditions (e.g., diabetes, cancer)
- 💊 **Drug**: Medications and treatments (e.g., aspirin, AlphaCure)

**Relationships (Edges):**
- Gene `--[ENCODES]-->` Protein
- Gene `--[LINKED_TO]-->` Disease
- Protein `--[ASSOCIATED_WITH]-->` Disease  
- Drug `--[TREATS]-->` Disease
- Drug `--[TARGETS]-->` Protein

### Why This Matters

This structure mirrors how biologists think about molecular relationships!
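
One way to see what the schema buys you is to write it down as data and check candidate edges against it. The sketch below does that in plain Python. The `SCHEMA` set is copied from the relationship list above, while the helper names `is_valid_edge` and `relationships_from` are hypothetical and exist only for this illustration.

```python
# Allowed (source label, relationship type, target label) combinations,
# copied from the relationship list in this section.
SCHEMA = {
    ("Gene", "ENCODES", "Protein"),
    ("Gene", "LINKED_TO", "Disease"),
    ("Protein", "ASSOCIATED_WITH", "Disease"),
    ("Drug", "TREATS", "Disease"),
    ("Drug", "TARGETS", "Protein"),
}


def is_valid_edge(source_label, relationship, target_label):
    """Return True when the schema permits this relationship between these node types."""
    return (source_label, relationship, target_label) in SCHEMA


def relationships_from(label):
    """List the relationship types that may start at a node of the given type."""
    return sorted(rel for source, rel, _ in SCHEMA if source == label)


print(is_valid_edge("Drug", "TREATS", "Disease"))  # True: the schema allows this edge
print(is_valid_edge("Disease", "TREATS", "Drug"))  # False: direction matters
print(relationships_from("Gene"))                  # ['ENCODES', 'LINKED_TO']
```

Because edges are directed, a query pattern has to follow the same direction as the schema, which is worth keeping in mind when you write Cypher later on.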

In [None]:
# Let's connect to our knowledge graph and explore it!
import sys
sys.path.append('..')

import os
from dotenv import load_dotenv
from agent.graph_interface import GraphInterface

# Load environment variables
load_dotenv()

# Connect to the graph database
uri = os.getenv("NEO4J_URI", "bolt://localhost:7687")
user = os.getenv("NEO4J_USER", "neo4j")
password = os.getenv("NEO4J_PASSWORD")

if not password:
    print("⚠️ Please set NEO4J_PASSWORD in your .env file")
else:
    graph_db = GraphInterface(uri, user, password)
    print("✅ Connected to knowledge graph!")

In [None]:
# Explore our graph schema
schema = graph_db.get_schema_info()

print("🏗️ Knowledge Graph Schema:")
print("=" * 40)
print(f"Node Types: {schema['node_labels']}")
print(f"Relationship Types: {schema['relationship_types']}")
print("\n📊 Node Properties:")
for node_type, properties in schema['node_properties'].items():
    print(f"  {node_type}: {properties}")

In [None]:
# Let's see some actual data!
# Get a few examples of each node type

print("🧬 Sample Genes:")
genes = graph_db.execute_query("MATCH (g:Gene) RETURN g.gene_name, g.function LIMIT 3")
for gene in genes:
    print(f"  • {gene['g.gene_name']}: {gene['g.function']}")

print("\n🧪 Sample Proteins:")
proteins = graph_db.execute_query("MATCH (p:Protein) RETURN p.protein_name, p.molecular_weight LIMIT 3")
for protein in proteins:
    print(f"  • {protein['p.protein_name']}: {protein['p.molecular_weight']} kDa")

print("\n🏥 Sample Diseases:")
diseases = graph_db.execute_query("MATCH (d:Disease) RETURN d.disease_name, d.category LIMIT 3")
for disease in diseases:
    print(f"  • {disease['d.disease_name']}: {disease['d.category']}")

print("\n💊 Sample Drugs:")
drugs = graph_db.execute_query("MATCH (dr:Drug) RETURN dr.drug_name, dr.type LIMIT 3")
for drug in drugs:
    print(f"  • {drug['dr.drug_name']}: {drug['dr.type']}")

## Part 3: Graph Queries with Cypher 🔍

Neo4j uses **Cypher** as its query language. Think of it like SQL, but for graphs!

### Basic Cypher Patterns

1. **MATCH**: Find patterns in the graph
2. **WHERE**: Filter results
3. **RETURN**: What to give back
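
The three clauses usually appear together. As a rough sketch, the helper below assembles one query that uses all of them and keeps the user-supplied value in a `$term` parameter instead of pasting it into the query text. The function name `genes_linked_to` is made up for this example. The property names follow the cells in this notebook, and `LINKED_TO` comes from the schema in Part 2.

```python
def genes_linked_to(disease_term, limit=5):
    """Build a MATCH / WHERE / RETURN query plus the parameters it needs."""
    query = (
        "MATCH (g:Gene)-[:LINKED_TO]->(d:Disease)\n"      # MATCH: the pattern to find
        "WHERE toLower(d.disease_name) CONTAINS $term\n"  # WHERE: filter the matches
        "RETURN g.gene_name, d.disease_name\n"            # RETURN: columns to give back
        "LIMIT $limit"
    )
    params = {"term": disease_term.lower(), "limit": int(limit)}
    return query, params


query, params = genes_linked_to("Diabetes")
print(query)
print(params)
```

Whether `graph_db.execute_query` accepts a parameter dictionary is not shown in this notebook, so treat that part as an assumption to check in `agent/graph_interface.py`. The official Neo4j Python driver does accept parameters alongside the query text, for example `session.run(query, params)`.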

### Example Queries

In [None]:
# Simple query: Find all genes
query1 = "MATCH (g:Gene) RETURN g.gene_name LIMIT 5"
result1 = graph_db.execute_query(query1)

print("🔍 Simple Query: All genes")
print(f"Query: {query1}")
print("Results:")
for row in result1:
    print(f"  • {row['g.gene_name']}")

In [None]:
# Relationship query: Find what proteins are encoded by genes
query2 = """
MATCH (g:Gene)-[:ENCODES]->(p:Protein) 
RETURN g.gene_name, p.protein_name 
LIMIT 5
"""

result2 = graph_db.execute_query(query2)

print("🔗 Relationship Query: Gene encodes Protein")
print(f"Query: {query2.strip()}")
print("Results:")
for row in result2:
    print(f"  • Gene {row['g.gene_name']} encodes Protein {row['p.protein_name']}")

In [None]:
# Complex query: Find complete pathway from gene to treatment
query3 = """
MATCH (g:Gene)-[:ENCODES]->(p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)<-[:TREATS]-(dr:Drug)
RETURN g.gene_name, p.protein_name, d.disease_name, dr.drug_name
LIMIT 3
"""

result3 = graph_db.execute_query(query3)

print("🛤️ Complex Query: Complete pathway Gene → Protein → Disease ← Drug")
print(f"Query: {query3.strip()}")
print("Results:")
for row in result3:
    print(f"  • {row['g.gene_name']} → {row['p.protein_name']} → {row['d.disease_name']} ← {row['dr.drug_name']}")

### 🎯 Exercise 1: Write Your Own Query

Try writing a query to find drugs that treat diabetes!

**Hint**: Use the pattern `(dr:Drug)-[:TREATS]->(d:Disease)` and filter where disease name contains "diabetes"

In [None]:
# Your turn! Write a query to find drugs that treat diabetes
your_query = """
// Write your query here! (Cypher comments start with //, not #)
// Hint: MATCH (dr:Drug)-[:TREATS]->(d:Disease)
//       WHERE toLower(d.disease_name) CONTAINS 'diabetes'
//       RETURN dr.drug_name, d.disease_name
"""

# Uncomment and run when ready:
# result = graph_db.execute_query(your_query)
# for row in result:
#     print(f"Drug {row['dr.drug_name']} treats {row['d.disease_name']}")

## Part 4: What is LangGraph? 🌊

### The Challenge: Complex AI Workflows

Imagine you want to build an AI that can:
1. Understand a natural language question
2. Extract important entities
3. Generate a database query
4. Execute the query
5. Format the results

Each step depends on the previous ones, and you need to manage **state** (information) flowing between steps.

### LangGraph: AI Workflow Engine

LangGraph helps you build **multi-step AI workflows** with:
- **Nodes**: Individual processing steps
- **Edges**: How steps connect
- **State**: Information that flows between steps

### Visual Representation

```
Question → [Classify] → [Extract] → [Generate] → [Execute] → [Format] → Answer
               ↓            ↓           ↓            ↓          ↓
             State        State       State        State      State
```

## Part 5: Building Your First LangGraph Agent 🤖

Let's use our simplified educational agent to understand how LangGraph works!

In [None]:
# Import our educational agent
from agent.educational_agent import EducationalAgent, demonstrate_workflow_steps

# First, let's see what steps our workflow has
demonstrate_workflow_steps()

In [None]:
# Create our educational agent
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if not anthropic_key:
    print("⚠️ Please set ANTHROPIC_API_KEY in your .env file")
else:
    # Initialize the educational agent
    edu_agent = EducationalAgent(graph_db, anthropic_key)
    print("✅ Educational agent ready!")

### Understanding the Workflow State

Before we run the agent, let's understand what information flows through our workflow:

```python
class LearningState(TypedDict):
    user_question: str              # Original question from student
    question_type: Optional[str]     # What kind of biomedical question?
    entities: Optional[List[str]]    # Important terms we found
    cypher_query: Optional[str]      # The database query we generated
    results: Optional[List[Dict]]    # What we found in the database
    final_answer: Optional[str]      # Human-readable response
    error: Optional[str]             # If something went wrong
```

Each step reads this state, does its work, and updates the state for the next step!

In [None]:
# Let's ask our agent a question and see the complete workflow!
question = "What genes are associated with diabetes?"

print("🎓 Running Educational LangGraph Agent")
print("=" * 50)

result = edu_agent.answer_question(question)

print("\n📋 Complete Workflow Results:")
print("=" * 50)
print(f"❓ Original Question: {question}")
print(f"🏷️ Question Type: {result['question_type']}")
print(f"🧬 Entities Found: {result['entities']}")
print(f"🔧 Generated Query: {result['cypher_query']}")
print(f"📊 Results Count: {result['results_count']}")
print(f"✅ Final Answer: {result['answer']}")

if result['error']:
    print(f"❌ Error: {result['error']}")

### 🎯 Exercise 2: Try Different Questions

Try asking different types of biomedical questions to see how the agent handles them:

In [None]:
# Try different questions!
questions_to_try = [
    "What drugs treat hypertension?",
    "What protein does GENE_ALPHA encode?",
    "What diseases is PROT_BETA associated with?",
    "What are the targets of AlphaCure?"
]

# Pick one and try it:
test_question = questions_to_try[0]  # Change the index to try different questions

print(f"Testing: {test_question}")
print("-" * 50)

result = edu_agent.answer_question(test_question)
print(f"Answer: {result['answer']}")
print(f"Query used: {result['cypher_query']}")

## Part 6: Understanding How LangGraph Works 🔧

Let's dive deeper into the code to understand how our workflow is built:

In [None]:
# Let's examine the workflow creation code
import inspect

# Look at how the workflow is created
print("🏗️ How the LangGraph Workflow is Built:")
print("=" * 50)

workflow_code = inspect.getsource(edu_agent._create_learning_workflow)
print(workflow_code)

### Key LangGraph Concepts:

1. **StateGraph**: The main workflow container
2. **add_node()**: Adds processing steps
3. **add_edge()**: Connects steps in sequence
4. **set_entry_point()**: Where to start
5. **compile()**: Makes the workflow executable

### State Management

Each node function:
- Receives the current state
- Does its processing
- Returns the updated state

LangGraph then passes that state to the next node automatically.
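
To show what those five calls do mechanically, here is a toy stand-in written in plain Python. It is not the LangGraph library: `ToyStateGraph` is a hypothetical class that only imitates the method names for a strictly sequential workflow, and its `compile()` returns a plain function, whereas a compiled LangGraph workflow is an object you call with `.invoke(state)`. The two node functions are also invented for the example.

```python
from typing import Callable, Dict, Optional

END = "__end__"  # toy sentinel marking the end of the workflow


class ToyStateGraph:
    """Plain-Python imitation of a sequential state graph; NOT the LangGraph library."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[dict], dict]] = {}
        self.edges: Dict[str, str] = {}
        self.entry: Optional[str] = None

    def add_node(self, name, func):
        self.nodes[name] = func

    def add_edge(self, source, target):
        self.edges[source] = target

    def set_entry_point(self, name):
        self.entry = name

    def compile(self):
        """Check the wiring once, then return a function that runs the workflow."""
        if self.entry not in self.nodes:
            raise ValueError("entry point must be a registered node")
        for source, target in self.edges.items():
            if source not in self.nodes or (target != END and target not in self.nodes):
                raise ValueError(f"edge {source} -> {target} uses an unknown node")

        def run(state: dict) -> dict:
            current = self.entry
            while current != END:
                update = self.nodes[current](state)  # node receives the current state
                state = {**state, **update}          # merge its update into the state
                current = self.edges.get(current, END)
            return state

        return run


def classify(state):
    # Hypothetical step: a real node would ask the LLM to classify the question.
    return {"question_type": "gene_disease"}


def extract(state):
    # Hypothetical step: keep only the word "diabetes" as an entity.
    words = state["user_question"].rstrip("?").split()
    return {"entities": [w for w in words if w.lower() == "diabetes"]}


graph = ToyStateGraph()
graph.add_node("classify", classify)
graph.add_node("extract", extract)
graph.add_edge("classify", "extract")
graph.add_edge("extract", END)
graph.set_entry_point("classify")

app = graph.compile()
final_state = app({"user_question": "What genes are associated with diabetes?"})
print(final_state)
```

Each node returns only the keys it changed, and the runner merges them into the shared state before moving along the edge to the next node. The toy version has no protection against cycles or branching, so use it to read the real workflow code, not to replace it.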

In [None]:
# Let's look at one of the workflow steps in detail
print("🔍 Example: The Entity Extraction Step")
print("=" * 50)

extract_code = inspect.getsource(edu_agent.extract_entities)
print(extract_code)

## Part 7: Comparing Approaches 📊

Let's compare our educational agent with the simple template-based agent to understand the trade-offs:

In [None]:
# Import the simple agent for comparison
import time

from agent.simple_agent import SimpleAgent

simple_agent = SimpleAgent(graph_db)

print("⚡ Speed Comparison: Simple vs LangGraph Agent")
print("=" * 60)

# Test the same question with both approaches
test_question = "diabetes"

# Simple agent (template-based); perf_counter is the clock meant for measuring intervals
start = time.perf_counter()
simple_result = simple_agent.get_genes_for_disease(test_question)
simple_time = time.perf_counter() - start

print("Simple Agent:")
print(f"  Time: {simple_time:.2f} seconds")
print(f"  Results: {len(simple_result)} genes found")
print("  Approach: Pre-written query template")

# LangGraph agent (AI-powered)
start = time.perf_counter()
ai_result = edu_agent.answer_question(f"What genes are associated with {test_question}?")
ai_time = time.perf_counter() - start

print("\nLangGraph Agent:")
print(f"  Time: {ai_time:.2f} seconds")
print(f"  Results: {ai_result['results_count']} results found")
print("  Approach: AI-generated query")

# Avoid dividing by zero if the template query returned almost instantly
if simple_time > 0:
    print(f"\n📈 Speed Difference: {ai_time / simple_time:.1f}x slower (but more flexible!)")
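
A single timed run can be noisy, since network latency and caching both affect it. As a hedged sketch, the helper below repeats a call several times and reports the median. `median_runtime` and `slowdown` are made-up names, and the two `fake_*` functions are stand-ins so the example runs without a database or an API key.

```python
import statistics
import time


def median_runtime(func, *args, repeats=5):
    """Call func(*args) several times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def slowdown(fast_seconds, slow_seconds):
    """Ratio slow/fast, or None when the fast timing is too small to divide by."""
    if fast_seconds <= 0:
        return None
    return slow_seconds / fast_seconds


# Stand-in workloads (assumptions for illustration only).
def fake_template_lookup(term):
    return [term.upper()]


def fake_llm_pipeline(term):
    time.sleep(0.01)  # pretend to wait for a model response
    return [term.upper()]


fast = median_runtime(fake_template_lookup, "diabetes")
slow = median_runtime(fake_llm_pipeline, "diabetes")
print(f"template: {fast:.6f}s, pipeline: {slow:.6f}s, ratio: {slowdown(fast, slow)}")
```

To use it with the real agents, you could pass `simple_agent.get_genes_for_disease` and `edu_agent.answer_question` in place of the stand-ins. Keep in mind that every repeat of the AI agent makes another API call.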

## Part 8: When to Use Each Approach? 🤔

### Simple Template Agent 🚀
**Best for:**
- Fast, predictable queries
- Limited set of question types
- Production systems needing reliability
- When you know exactly what queries you need

### LangGraph AI Agent 🧠
**Best for:**
- Flexible natural language input
- Exploratory research questions
- Complex multi-step reasoning
- When users ask questions in many different ways

### The Trade-offs
- **Speed vs Flexibility**: Templates are faster, AI is more flexible
- **Cost vs Capability**: Templates add no per-question cost, while the AI agent makes paid LLM API calls
- **Predictability vs Adaptability**: Templates run the same query every time, while the AI agent can handle new phrasings but may generate an incorrect query
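
The two approaches can also be combined: try a template first and fall back to the AI agent only when nothing matches. The sketch below shows one possible shape for that routing step. The `TEMPLATES` list and the `route` function are hypothetical and not part of the tutorial's code. The Cypher pattern is the one used in Exercise 1.

```python
import re

# Hypothetical template table: one regex per supported phrasing, mapped to a fixed query.
TEMPLATES = [
    (
        re.compile(r"^what drugs treat (?P<term>[\w\s-]+?)\??$", re.IGNORECASE),
        "MATCH (dr:Drug)-[:TREATS]->(d:Disease) "
        "WHERE toLower(d.disease_name) CONTAINS $term "
        "RETURN dr.drug_name, d.disease_name",
    ),
]


def route(question):
    """Return ('template', query, params) on a template match, else ('llm', None, None)."""
    for pattern, query in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            return "template", query, {"term": match.group("term").strip().lower()}
    return "llm", None, None


print(route("What drugs treat hypertension?"))
print(route("Which proteins might explain why AlphaCure helps?"))
```

Common questions then stay fast and predictable, and only the unusual ones pay the latency and API cost of the AI workflow.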

## Part 9: Hands-on Exercises 🏋️‍♀️

Now it's your turn to experiment!

### 🎯 Exercise 3: Modify the Workflow

Try adding a new step to the workflow. For example, add a validation step that checks if the generated query is safe to run.

**Challenge**: Add a node that validates the Cypher query before execution.

In [None]:
# Your challenge: Create a modified workflow with validation
# Hint: You can use graph_db.validate_query() to check if a query is valid

from langgraph.graph import StateGraph, END
from agent.educational_agent import LearningState

class ImprovedAgent:
    def __init__(self, graph_interface, anthropic_api_key):
        # Your code here!
        pass
    
    def validate_query(self, state: LearningState) -> LearningState:
        """Add your validation logic here!"""
        # Hint: Check if state['cypher_query'] is valid
        # If not valid, set state['error'] = "Invalid query"
        pass

# Try implementing your improved agent!
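
If you want a starting point for the validation logic, here is one possible heuristic: reject any query that contains a write clause or has no `RETURN`. It is a sketch under stated assumptions, not a complete safety check. The names `check_read_only` and `validate_query_node` are made up, the keyword list is an assumption about what "unsafe" means for this tutorial, and simple keyword matching can be fooled (for example by procedures invoked with `CALL`).

```python
import re

# Clauses that modify data; a read-only tutorial agent should not need them (assumption).
WRITE_KEYWORDS = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP", "LOAD CSV")


def check_read_only(cypher):
    """Return an error message if the query looks unsafe, or None if it passes."""
    if not cypher or not cypher.strip():
        return "Empty query"
    # Blank out quoted strings so a value such as 'data set' is not mistaken for SET.
    without_strings = re.sub(r"'[^']*'|\"[^\"]*\"", "''", cypher)
    for keyword in WRITE_KEYWORDS:
        pattern = r"\b" + keyword.replace(" ", r"\s+") + r"\b"
        if re.search(pattern, without_strings, re.IGNORECASE):
            return f"Write clause not allowed: {keyword}"
    if not re.search(r"\bRETURN\b", without_strings, re.IGNORECASE):
        return "Query has no RETURN clause"
    return None


def validate_query_node(state):
    """Workflow-style node: leave the state alone if the query passes, else record an error."""
    problem = check_read_only(state.get("cypher_query"))
    if problem is None:
        return state
    return {**state, "error": problem}


print(check_read_only("MATCH (g:Gene) RETURN g.gene_name LIMIT 5"))
print(check_read_only("MATCH (n) DETACH DELETE n"))
```

Inside `ImprovedAgent`, a node like this would sit between the query-generation step and the execution step, and the execution step would need to skip its work when `state['error']` is set.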

### 🎯 Exercise 4: Create Custom Queries

Write Cypher queries for these biomedical questions:

In [None]:
# Exercise 4: Write custom queries
exercises = {
    "a": "Find all proteins that are associated with neurological diseases",
    "b": "Find drugs that target proteins with high molecular weight (>50 kDa)",
    "c": "Find the most common disease categories in our database",
    "d": "Find complete pathways: Gene → Protein → Disease, where the gene is on chromosome 1"
}

print("✏️ Query Writing Exercises:")
for key, exercise in exercises.items():
    print(f"{key}) {exercise}")

# Try writing queries for each exercise:

# Exercise A:
query_a = """
// Your query here! (Cypher comments start with //, not #)
"""

# Exercise B:
query_b = """
// Your query here!
"""

# Uncomment to test your queries:
# result_a = graph_db.execute_query(query_a)
# print(f"Exercise A results: {len(result_a)} found")

## Part 10: Real-World Applications 🌍

### Where are Knowledge Graphs + AI Used?

1. **Drug Discovery** 💊
   - Find new drug targets
   - Predict drug side effects
   - Repurpose existing drugs

2. **Personalized Medicine** 🧬
   - Match patients to treatments based on genetics
   - Predict disease risk
   - Optimize treatment plans

3. **Research Acceleration** 🔬
   - Literature mining and synthesis
   - Hypothesis generation
   - Cross-domain connections

4. **Clinical Decision Support** 🏥
   - Diagnostic assistance
   - Treatment recommendations
   - Drug interaction checking

### Industry Examples
- **Google**: Knowledge Graph for search
- **Amazon**: Product recommendations
- **Facebook**: Social graph analysis
- **Pharmaceutical companies**: Drug discovery pipelines

## Part 11: Next Steps and Advanced Topics 🚀

### Immediate Next Steps
1. **Experiment** with the Streamlit web interface
2. **Try** different question types and see how the agent handles them
3. **Modify** the agent code to add new features
4. **Write** your own Cypher queries for complex biomedical questions

### Advanced Topics to Explore
1. **Graph Algorithms**: PageRank, community detection, shortest paths
2. **Advanced LangGraph**: Conditional edges, parallel processing, human-in-the-loop
3. **Graph Neural Networks**: AI models that work directly on graph structure
4. **Real-time Updates**: Streaming data into knowledge graphs
5. **Large-scale Graphs**: Handling millions/billions of nodes

### Learning Resources
- **Neo4j Documentation**: https://neo4j.com/docs/
- **LangGraph Documentation**: https://langchain-ai.github.io/langgraph/
- **Graph Theory Courses**: edX, Coursera, Khan Academy
- **Biomedical Databases**: PubMed, UniProt, STRING

## Summary and Reflection 🎯

### What You've Learned
✅ **Knowledge Graphs**: How to represent complex domain relationships as nodes and edges

✅ **Cypher Queries**: How to extract information from graph databases

✅ **LangGraph Workflows**: How to build multi-step AI agents with state management

✅ **Biomedical Applications**: Real-world uses of AI + knowledge graphs in life sciences

✅ **Practical Implementation**: Hands-on experience building and modifying AI agents

### Key Insights
1. **Graphs are powerful** for representing relationships in complex domains
2. **LangGraph enables** sophisticated AI workflows with proper state management
3. **Different approaches** (templates vs AI) have different trade-offs
4. **Biomedical data** is naturally graph-structured and benefits from graph-based approaches

### Your Next Challenge
Pick a domain you're interested in (sports, movies, finance, etc.) and design a knowledge graph structure for it. What nodes and relationships would you include? How would you query it?

---

## 🎉 Congratulations!

You've completed the LangGraph and Knowledge Graphs tutorial. You now have the foundation to build your own AI-powered graph applications!

Keep experimenting, keep learning, and remember: the best way to understand these concepts is to build something with them! 🚀