[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/guerinjeanmarc/Neo4j-GraphRAG-Python-Workshop/blob/main/workshop-notebooks/01_quickstart_text2cypher.ipynb)

# 🚀 Quickstart: Knowledge Graph from PDFs + Text2Cypher GraphRAG

**What you'll learn:**
- Build a knowledge graph from pharmaceutical PDFs using the Neo4j GraphRAG Python package
- Query your graph using the Text2Cypher retriever
- Run complete GraphRAG pipelines

**Prerequisites:**
- A Neo4j database instance (see setup options below)
- OpenAI API key (provided by the instructor)
- Basic Python knowledge

---

## 🔧 Neo4j Setup Options

Choose **ONE** of the following options:

### Option 1: Neo4j Aura Free (Recommended for this workshop)
- Free cloud database, no installation needed
- [Create account](https://console-preview.neo4j.io/)
- Save your credentials from the download

### Option 2: Neo4j Sandbox
- Temporary database (3-10 days), pre-configured
- [Launch sandbox](https://sandbox.neo4j.com/)
- Select "Blank Sandbox"

### Option 3: Neo4j Desktop
- Local installation, full control
- [Download](https://neo4j.com/download/)
- Create a new project and database

**💡 Tip:** Copy your `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` - you'll need them below!


## 📦 Step 1: Install Dependencies

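This will take ~2 minutes. We're installing:
- `neo4j-graphrag[openai]` - Official Neo4j GraphRAG package, with the OpenAI extra
- `python-dotenv` - Loads credentials from a `.env` file
- `pypdf` and `langchain-text-splitters` - Additional utilities for PDF processing
- `langchain-google-genai` - Gemini client for LangChain (not used in this notebook)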


In [None]:
%%capture
%pip install neo4j-graphrag python-dotenv langchain-text-splitters pypdf langchain-google-genai "neo4j-graphrag[openai]"


## 🔐 Step 2: Configure Credentials

**Two options for entering credentials:**

### Option A: Using .env file (local or Colab with Drive)
Upload a `.env` file with your credentials, or create one.
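
A `.env` file is plain text with one `KEY=value` pair per line. As a minimal sketch, the snippet below prints a template you could save as `.env`. The variable names are the ones read by the next cell; `render_env` is a hypothetical helper, and the values are placeholders to replace with your own.

```python
# Placeholder values only: replace them with your real credentials.
ENV_TEMPLATE = {
    "NEO4J_URI": "<your-neo4j-uri>",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "<your-password>",
    "NEO4J_DATABASE": "neo4j",
    "OPENAI_API_KEY": "<your-openai-api-key>",
}

def render_env(values):
    """Render a dict as KEY=value lines (hypothetical helper, not part of python-dotenv)."""
    return "".join(f"{key}={value}\n" for key, value in values.items())

# Print the template so it can be copied into a file named .env
print(render_env(ENV_TEMPLATE))
```

Save the printed lines in a file named `.env` in the notebook's working directory so that `load_dotenv()` can pick them up.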

### Option B: Direct input (Recommended for Colab)
Replace the placeholder values below with your actual credentials.


In [None]:
import os
from dotenv import load_dotenv

# Try loading from .env file
load_dotenv()

# Configure your credentials here (Option B - recommended for Colab)
# Replace these with your actual values!
NEO4J_URI = os.getenv('NEO4J_URI', '<your-neo4j-uri>')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME', 'neo4j')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD', '<your-password>')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE', 'neo4j')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', '<your-openai-api-key>')


# Set environment variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

# Verify we have credentials
if any('your-' in value for value in (NEO4J_URI, NEO4J_PASSWORD, OPENAI_API_KEY)):
    print("⚠️  WARNING: Please update the credentials above with your actual values!")
else:
    print("✓ Credentials configured")


## ✅ Step 3: Test Neo4j Connection


In [None]:
from neo4j import GraphDatabase

# Connect to Neo4j
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Verify connectivity
try:
    driver.verify_connectivity()
    print(f"✓ Connected to Neo4j at {NEO4J_URI}")
    print(f"  Database: {NEO4J_DATABASE}")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\nTroubleshooting:")
    print("1. Check your NEO4J_URI format (should include neo4j+s:// or bolt://)")
    print("2. Verify username and password")
    print("3. Make sure your database is running")
    raise
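
If the connection fails, the URI format is a common culprit. The sketch below checks a URI against the schemes accepted by the Neo4j Python driver before any network call is made. `check_neo4j_uri` is a hypothetical helper written for this notebook, not a driver function.

```python
from urllib.parse import urlparse

# URI schemes accepted by the Neo4j Python driver
VALID_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}

def check_neo4j_uri(uri):
    """Return an error message for a malformed URI, or None if it looks fine."""
    parsed = urlparse(uri)
    if parsed.scheme not in VALID_SCHEMES:
        return f"Unexpected scheme {parsed.scheme!r}; expected one of {sorted(VALID_SCHEMES)}"
    if not parsed.hostname:
        return "URI has no host name"
    return None

# Example with the placeholder value from Step 2
print(check_neo4j_uri("<your-neo4j-uri>"))
```

A return value of `None` only means the URI is well formed; `driver.verify_connectivity()` is still what confirms the database is reachable.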


## 📄 Step 4: Download Sample PDFs

We'll use pharmaceutical pipeline reports. In Colab, we need to download them first.


In [None]:
import os
import urllib.request

# Create data directory
os.makedirs('workshop-data', exist_ok=True)

# GitHub raw URL base (update this with your GitHub username/repo)
GITHUB_BASE = "https://raw.githubusercontent.com/guerinjeanmarc/Neo4j-GraphRAG-Python-Workshop/main/workshop-data/"

# PDF files to download
pdf_files = [
    "AbbVie Long-Term Guidance and Pipeline Update.pdf",
    "BMY-2024-Q1-Results-Investor-Presentation-with-Appendix.pdf",
    "JNJ-Pipeline-2Q2024.pdf",
    "ph-rd-pipeline-2025-07-24-update-20250725.pdf"
]

# Download files
print("Downloading PDFs...")
for pdf_file in pdf_files:
    local_path = f"workshop-data/{pdf_file}"
    if not os.path.exists(local_path):
        try:
            url = GITHUB_BASE + pdf_file.replace(" ", "%20")
            urllib.request.urlretrieve(url, local_path)
            print(f"  ✓ {pdf_file}")
        except Exception as e:
            print(f"  ⚠️  Could not download {pdf_file}: {e}")
            print(f"     You can manually upload this file to the workshop-data/ folder")
    else:
        print(f"  ✓ {pdf_file} (already exists)")

print("\n✓ PDF setup complete")
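
The download cell only replaces spaces in the file names, which is enough for the four PDFs listed. If you add files whose names contain other special characters, `urllib.parse.quote` from the standard library encodes those too. A small sketch (`raw_url` is a hypothetical helper name):

```python
from urllib.parse import quote

def raw_url(base, filename):
    """Join a base URL and a file name, percent-encoding the file name."""
    return base.rstrip("/") + "/" + quote(filename)

# Spaces become %20, and characters such as '#' are encoded as well
print(raw_url("https://example.com/data/", "Pipeline Update #2.pdf"))
```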


---

# 🏗️ Part 1: Building the Knowledge Graph

We'll use the `SimpleKGPipeline` to automatically extract entities and relationships from PDFs.


## 🎯 Define Schema

A schema guides the LLM to extract only the relevant entities and relationships.

**Pharmaceutical Knowledge Graph Schema:**
- **Entities:** Molecule, Company, Target, Disease
- **Relationships:** TREATS, TARGETS, ASSOCIATED, IN_PIPELINE


In [None]:
node_labels = ["Molecule", "Company", "Target", "Disease"]

rel_types = ["TREATS", "TARGETS", "ASSOCIATED", "IN_PIPELINE"]

patterns = [
    ("Molecule", "TREATS", "Disease"),
    ("Company", "IN_PIPELINE", "Molecule"),
    ("Molecule", "TARGETS", "Target"),
    ("Disease", "ASSOCIATED", "Target"),
]
        
print("✓ Schema defined:")
print(f"  Entities: {', '.join(node_labels)}")
print(f"  Relationships: {', '.join(rel_types)}")
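
When you customize the schema, it is easy to add a pattern that mentions a label or relationship type you forgot to declare. The sketch below is a plain-Python consistency check. `validate_patterns` is a hypothetical helper and not part of `neo4j-graphrag`.

```python
def validate_patterns(node_labels, rel_types, patterns):
    """Return a list of problems; an empty list means the schema is consistent."""
    labels, rels = set(node_labels), set(rel_types)
    problems = []
    for source, rel, target in patterns:
        if source not in labels:
            problems.append(f"unknown source label {source!r}")
        if rel not in rels:
            problems.append(f"unknown relationship type {rel!r}")
        if target not in labels:
            problems.append(f"unknown target label {target!r}")
    # Relationship types that no pattern uses are worth flagging too
    used = {rel for _, rel, _ in patterns}
    for rel in sorted(rels - used):
        problems.append(f"relationship type {rel!r} is not used in any pattern")
    return problems

# Same schema as the cell above, repeated so this snippet runs on its own
schema_labels = ["Molecule", "Company", "Target", "Disease"]
schema_rels = ["TREATS", "TARGETS", "ASSOCIATED", "IN_PIPELINE"]
schema_patterns = [
    ("Molecule", "TREATS", "Disease"),
    ("Company", "IN_PIPELINE", "Molecule"),
    ("Molecule", "TARGETS", "Target"),
    ("Disease", "ASSOCIATED", "Target"),
]
print(validate_patterns(schema_labels, schema_rels, schema_patterns))
```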


## 💬 Create Custom Extraction Prompt

This prompt guides the LLM on how to extract information from pharmaceutical documents.


In [None]:
prompt_template = '''
You are a medical researcher tasked with extracting information from pharmaceutical documents
and structuring it in a property graph to inform drug development and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node.

Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}

Guidelines:
- Use only the information from the Input text. Do not add any additional information.
- If the input text is empty, return empty JSON.
- Create as many nodes and relationships as needed to offer rich pharmaceutical context.
- Entity types should be fairly general (Molecule, Disease, Company, Target).
- Focus on: drug names, indications, targets, clinical phases, companies.

Use only the following nodes and relationships:
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Respect the source and target node types for relationships and the relationship direction.

Do not return any additional information other than the JSON.

Examples:
{examples}

Input text:

{text}
'''

print("✓ Extraction prompt created")
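
Note the doubled braces around the JSON example in the prompt. Assuming the template is filled in with Python's `str.format` (an assumption about the library's internals), `{{` and `}}` are escapes that produce literal braces, while `{schema}`, `{examples}`, and `{text}` are placeholders. A tiny standalone illustration:

```python
# A shortened stand-in for the real prompt, for illustration only
mini_template = 'Return JSON like {{"nodes": []}}.\nSchema: {schema}\nInput text: {text}'

filled = mini_template.format(schema="Molecule, Disease", text="Drug X treats condition Y.")
print(filled)
```

If you edit the prompt and add more literal JSON, keep the braces doubled, or formatting will fail on the unescaped ones.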


## 🤖 Initialize LLM 

We use `gpt-4o-mini` for entity extraction and `text-embedding-3-small` for embeddings.


In [None]:
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

ex_llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"},  # use json_object formatting for best results
        "temperature": 0,  # turn the temperature down for more deterministic results
    }
)

# Create the text embedder
embedder = OpenAIEmbeddings(model="text-embedding-3-small")



## 🔧 Configure SimpleKGPipeline

This pipeline handles:
1. PDF text extraction
2. Text chunking
3. LLM-based entity/relationship extraction
4. Neo4j storage
5. Vector embedding creation


In [None]:
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

# Create the KG builder pipeline
kg_builder = SimpleKGPipeline(
    llm=ex_llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(
        chunk_size=500,      # Characters per chunk
        chunk_overlap=100    # Overlap between chunks
    ),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    potential_schema=patterns,
    prompt_template=prompt_template,
    from_pdf=True,           # Enable PDF processing
    neo4j_database=NEO4J_DATABASE,
)

print("✓ SimpleKGPipeline configured")
print("  PDF processing: enabled")
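
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified character-based splitter. It is a sketch of the idea only, not the library's `FixedSizeSplitter` implementation, and `fixed_size_chunks` is a hypothetical name.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into windows of chunk_size characters.

    Each chunk repeats the last chunk_overlap characters of the previous one.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# With the settings used above, 1,200 characters yield three chunks
demo_chunks = fixed_size_chunks("x" * 1200, chunk_size=500, chunk_overlap=100)
print([len(chunk) for chunk in demo_chunks])
```

The overlap means an entity mentioned near a chunk boundary still appears with some surrounding context in the next chunk, at the cost of more LLM calls.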


## 🚀 Run Pipeline on PDFs

**Expected runtime: ~5 minutes per file**

Four pharmaceutical PDFs are available. The cell below processes only the first one by default; uncomment the other paths to add them to the knowledge graph.


In [None]:
import asyncio
from datetime import datetime

# List of PDF files to process. Paths are relative to the working directory,
# matching the download folder from Step 4 (works in Colab and locally).
pdf_paths = [
    'workshop-data/AbbVie Long-Term Guidance and Pipeline Update.pdf',
    # 'workshop-data/BMY-2024-Q1-Results-Investor-Presentation-with-Appendix.pdf',
    # 'workshop-data/JNJ-Pipeline-2Q2024.pdf',
    # 'workshop-data/ph-rd-pipeline-2025-07-24-update-20250725.pdf'
]

# Process each PDF
print(f"🏗️  Building Knowledge Graph from {len(pdf_paths)} PDFs...")
print(f"   Started at: {datetime.now().strftime('%H:%M:%S')}\n")

results = []
for i, pdf_path in enumerate(pdf_paths, 1):
    print(f"[{i}/{len(pdf_paths)}] Processing: {pdf_path.split('/')[-1]}")
    try:
        result = await kg_builder.run_async(file_path=pdf_path)
        results.append(result)
        
        # Extract stats from result
        if result.result and 'resolver' in result.result:
            stats = result.result['resolver']
            nodes_resolved = stats.get('number_of_nodes_to_resolve', 0)
            nodes_created = stats.get('number_of_created_nodes', 0)
            print(f"    ✓ Resolved: {nodes_resolved} nodes, Created: {nodes_created} new nodes")
        else:
            print(f"    ✓ Completed")
    except Exception as e:
        print(f"    ❌ Error: {e}")
        
print(f"\n✅ Knowledge Graph construction complete!")
print(f"   Finished at: {datetime.now().strftime('%H:%M:%S')}")


## 📊 Check Graph Statistics

Let's see what we built!


In [None]:
def run_query(query):
    """Helper function to run Cypher queries"""
    with driver.session(database=NEO4J_DATABASE) as session:
        result = session.run(query)
        return result.data()

# Count nodes by type
print("📊 Graph Statistics:\n")
print("Nodes by type:")
for label in node_labels:
    count = run_query(f"MATCH (n:{label}) RETURN count(n) as count")[0]['count']
    print(f"  {label}: {count}")

# Count relationships by type
print("\nRelationships by type:")
for rel_type in rel_types:
    count = run_query(f"MATCH ()-[r:{rel_type}]->() RETURN count(r) as count")[0]['count']
    print(f"  {rel_type}: {count}")

# Count chunks
chunk_count = run_query("MATCH (c:Chunk) RETURN count(c) as count")[0]['count']
print(f"\nText chunks: {chunk_count}")

print("\n✓ Graph ready for querying!")
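
The statistics cell builds its Cypher with f-strings because labels and relationship types traditionally cannot be supplied as ordinary query parameters. That is fine for the hard-coded lists in this notebook. If the names ever come from user input, a check along these lines keeps unexpected characters out of the query text (a sketch with a hypothetical helper name and a deliberately strict naming rule):

```python
import re

# Accept only simple identifiers: letters, digits, underscore, not starting with a digit
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def count_nodes_query(label):
    """Build a node-count query, rejecting labels that are not plain identifiers."""
    if not _IDENTIFIER.fullmatch(label):
        raise ValueError(f"Unsafe label: {label!r}")
    return f"MATCH (n:{label}) RETURN count(n) AS count"

print(count_nodes_query("Molecule"))
```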


---

# 🔍 Part 2: Querying with Text2Cypher

Now that we have a knowledge graph, let's query it using natural language!

The `Text2CypherRetriever` converts questions into Cypher queries automatically.


## 🗺️ Define Neo4j Schema for Text2Cypher

The LLM needs to understand the graph structure to generate correct Cypher queries.


In [None]:
neo4j_schema = """
Node properties:
- Molecule {name: STRING}
- Company {name: STRING}
- Target {name: STRING}
- Disease {name: STRING}
- Document {path: STRING}
- Chunk {text: STRING, page_number: INTEGER, index: INTEGER}

Relationships:
- (:Molecule)-[:TREATS]->(:Disease)
- (:Molecule)-[:TARGETS]->(:Target)
- (:Disease)-[:ASSOCIATED]->(:Target)
- (:Company)-[:IN_PIPELINE]->(:Molecule)
- (:Document)-[:HAS_CHUNK]->(:Chunk)
- (:Chunk)-[:NEXT_CHUNK]->(:Chunk)
- (:Molecule)-[:MENTIONED_IN]->(:Chunk)
- (:Disease)-[:MENTIONED_IN]->(:Chunk)
- (:Company)-[:MENTIONED_IN]->(:Chunk)
- (:Target)-[:MENTIONED_IN]->(:Chunk)
"""

print("✓ Schema defined for Text2Cypher")
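
If you change the extraction schema in Part 1, the hand-written schema text can drift out of sync. One option is to generate the entity part of it from the same lists. The sketch below uses a hypothetical helper, `build_schema_text`, and assumes every entity has a single `name` property. The `Document` and `Chunk` lines would still be added by hand.

```python
def build_schema_text(node_labels, patterns):
    """Render labels and (source, REL, target) patterns as schema text for the LLM."""
    lines = ["Node properties:"]
    # Doubled braces inside an f-string produce single literal braces
    lines += [f"- {label} {{name: STRING}}" for label in node_labels]
    lines += ["", "Relationships:"]
    lines += [f"- (:{src})-[:{rel}]->(:{dst})" for src, rel, dst in patterns]
    return "\n".join(lines)

print(build_schema_text(
    ["Molecule", "Disease"],
    [("Molecule", "TREATS", "Disease")],
))
```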


## 💡 Provide Example Queries

Examples help the LLM generate better Cypher queries.


In [None]:
examples = [
    "USER INPUT: 'How many molecules are in the database?' CYPHER: MATCH (m:Molecule) RETURN count(m) as molecule_count",
    "USER INPUT: 'Which diseases are being targeted?' CYPHER: MATCH (m:Molecule)-[:TREATS]->(d:Disease) RETURN DISTINCT d.name as disease ORDER BY disease",
    "USER INPUT: 'List molecules and their target diseases' CYPHER: MATCH (m:Molecule)-[:TREATS]->(d:Disease) RETURN m.name as molecule, d.name as disease LIMIT 20",
    "USER INPUT: 'Which companies have the most molecules in their pipeline?' CYPHER: MATCH (c:Company)-[:IN_PIPELINE]->(m:Molecule) RETURN c.name as company, count(m) as molecule_count ORDER BY molecule_count DESC LIMIT 10",
    "USER INPUT: 'What targets are associated with cancer?' CYPHER: MATCH (d:Disease)-[:ASSOCIATED]->(t:Target) WHERE toLower(d.name) CONTAINS 'cancer' RETURN DISTINCT t.name as target",
]

print(f"✓ {len(examples)} example queries provided")


## 🔧 Initialize Text2CypherRetriever


In [None]:
llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "temperature": 0 # turning temperature down for more deterministic results
    }
)

In [None]:
from neo4j_graphrag.retrievers import Text2CypherRetriever

# Create Text2Cypher retriever
text2cypher_retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,
    neo4j_schema=neo4j_schema,
    examples=examples,  # few-shot examples defined above; comment out to compare results without them
)

print("✓ Text2CypherRetriever initialized")
print("  Ready to convert natural language to Cypher!")


## 🎯 Try Some Queries!

Let's ask questions about our pharmaceutical knowledge graph.


In [None]:
from neo4j_graphrag.generation import GraphRAG

rag = GraphRAG(retriever=text2cypher_retriever, llm=llm)

def query_graph(question):
    """Helper function to query the graph and display results"""
    print(f"Question: {question}")
    print("="*60)

    response = rag.search(query_text=question, return_context=True)
    print(f"Generated Cypher:\n  {response.retriever_result.metadata.get('cypher', 'N/A')}")
    print(f"\nAnswer: {response.answer}")

In [None]:
query_graph("How many molecules do I have in the database?")

In [None]:
query_graph("What diseases are being targeted the most?")

In [None]:
query_graph("Which companies have the most molecules in their pipeline?")


In [None]:
query_graph("Show me molecules that target cancer-related diseases")


In [None]:
query_graph("What are the main therapeutic areas being targeted across all companies?")


In [None]:
query_graph("Which company has the most diverse pipeline in terms of disease areas?")

---

# 🎓 What You've Learned

Congratulations! You've just:

✅ Built a knowledge graph from pharmaceutical PDFs  
✅ Used LLMs for automatic entity extraction  
✅ Queried graphs using Text2Cypher (natural language → Cypher)  
✅ Built a complete GraphRAG pipeline  

## 🚨 Limitations of This Approach

This simple approach works, but has limitations:

1. **PDF Text Extraction** - We only extracted plain text, missing:
   - Tables and structured data
   - Page layout and visual structure
   - Logos and branding (company identification)

2. **One-shot RAG** - The LLM has to generate the correct Cypher query on the first attempt:
   - It cannot test the query and retry
   - Very complex questions are difficult to answer
   - LLM output is non-deterministic
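
To make the "cannot test and retry" point concrete, here is a sketch of what a retry loop could look like. It is not something `Text2CypherRetriever` does in this notebook: `generate_with_retry` is a hypothetical helper, and the two stub functions stand in for an LLM call and a query check (for example, running the query with `EXPLAIN`).

```python
def generate_with_retry(question, generate, validate, max_attempts=3):
    """Call generate(question, feedback) until validate(cypher) returns None."""
    feedback = None
    error = None
    for attempt in range(1, max_attempts + 1):
        cypher = generate(question, feedback)
        error = validate(cypher)
        if error is None:
            return cypher, attempt
        # Pass the failure back so the next attempt can correct it
        feedback = f"Previous query failed: {error}"
    raise RuntimeError(f"No valid query after {max_attempts} attempts: {error}")

# Stub generator: returns a broken query first, a fixed one once it gets feedback
def fake_generate(question, feedback):
    if feedback is None:
        return "MATCH (m:Molecule RETURN count(m)"
    return "MATCH (m:Molecule) RETURN count(m)"

# Stub validator: only checks that parentheses are balanced
def fake_validate(cypher):
    return None if cypher.count("(") == cypher.count(")") else "unbalanced parentheses"

cypher, attempts_used = generate_with_retry("How many molecules?", fake_generate, fake_validate)
print(attempts_used, cypher)
```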

**💡 Next:** We'll explore custom extractors that handle these challenges!

---

# 🎯 Try It Yourself!

**Experiment with:**
- Different questions
- Custom schemas (modify `node_labels`, `rel_types`, `patterns`)
- Different prompt templates
- Your own PDFs

**Questions to explore:**
- What molecules target the same disease?
- Which targets are most popular?
- What's the relationship between companies and diseases?


---

# 🧹 Cleanup (Optional)

If you want to clear your database and start fresh:


In [None]:
# Uncomment to delete all data
# WARNING: This will delete everything in your database!

# with driver.session(database=NEO4J_DATABASE) as session:
#     session.run("MATCH (n) DETACH DELETE n")
#     print("✓ Database cleared")

# Close driver connection
driver.close()
print("✓ Connection closed")


---

# 📚 Resources

- [Neo4j GraphRAG Python Documentation](https://neo4j.com/docs/neo4j-graphrag-python)
- [Neo4j Cypher Manual](https://neo4j.com/docs/cypher-manual/current/)
- [Neo4j GraphAcademy](https://graphacademy.neo4j.com/) - Free courses

**Next Notebook:** Custom Extractors for Complex PDFs →
