# Query London Transport Network with Natural Language

Ask questions about the London Transport Network in plain English and get answers through automatically generated Cypher queries.

## What This Notebook Does

1. **Connects to Neo4j** - Uses the same London Transport graph from the ETL notebook
2. **Retrieves Graph Schema** - Automatically gets the structure of nodes and relationships
3. **Converts Questions to Cypher** - Uses an LLM to generate valid Cypher queries
4. **Executes Queries** - Runs the generated Cypher and displays results
5. **Provides Examples** - Includes sample questions you can try

## Example Questions

- "How many stations are in zone 1?"
- "Which stations does the Bakerloo line connect?"
- "What tube lines go through Baker Street?"
- "Which stations have the most connections?"
- "Find a path between King's Cross and Victoria"

---

## Prerequisites

Before running this notebook:

1. **London Transport data loaded** - Run `load_london_transport.ipynb` first
2. **Python libraries installed** on cluster:
   - `langchain`
   - `langchain-neo4j`
   - `langchain-openai`
   - `neo4j`
3. **Neo4j connection configured** - Same as ETL notebook; the password is read from the Databricks secret scope `neo4j`, key `password`
4. **Databricks cluster** with access to Foundation Models

---

## Section 1: Configuration

Install the required packages, then configure the Neo4j connection and the model endpoint using widgets.

In [None]:
# Install required Python packages
# Using latest versions as of November 2025

%pip install --quiet \
    langchain==1.0.8 \
    langchain-neo4j==0.6.0 \
    langchain-openai==1.0.3 \
    neo4j==6.0.3

print("✓ Dependencies installed successfully")
print("\nInstalled packages:")
print("  - langchain 1.0.8")
print("  - langchain-neo4j 0.6.0")
print("  - langchain-openai 1.0.3")
print("  - neo4j 6.0.3")

# Restart Python last so the newly installed packages are loaded;
# statements placed after the restart in this cell may not run
dbutils.library.restartPython()

In [None]:
# Remove existing widgets
dbutils.widgets.removeAll()

# Neo4j connection widgets
dbutils.widgets.text("neo4j_url", "bolt://localhost:7687", "Neo4j URL")
dbutils.widgets.text("neo4j_username", "neo4j", "Neo4j Username")
dbutils.widgets.text("neo4j_database", "neo4j", "Neo4j Database")

# Databricks Foundation Model endpoint
dbutils.widgets.text("databricks_endpoint", "<REPLACE_WITH_SERVING_ENDPOINT>", "Databricks Endpoint")
dbutils.widgets.text("model_name", "databricks-claude-sonnet-4-5", "Model Name")

print("✓ Widgets created successfully")
print("\nConfigure the widgets above, then run the next cell.")

In [None]:
# Get configuration from widgets
NEO4J_URL = dbutils.widgets.get("neo4j_url")
NEO4J_USER = dbutils.widgets.get("neo4j_username")
NEO4J_DB = dbutils.widgets.get("neo4j_database")
NEO4J_PASS = dbutils.secrets.get(scope="neo4j", key="password")

DATABRICKS_ENDPOINT = dbutils.widgets.get("databricks_endpoint")
MODEL_NAME = dbutils.widgets.get("model_name")
DATABRICKS_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

print("Configuration loaded:")
print(f"✓ Neo4j URL: {NEO4J_URL}")
print(f"✓ Neo4j User: {NEO4J_USER}")
print(f"✓ Neo4j Database: {NEO4J_DB}")
print(f"✓ Model: {MODEL_NAME}")
print(f"✓ Endpoint: {DATABRICKS_ENDPOINT}")

---

## Section 2: Connect to Neo4j and Validate Data

Verify the London Transport graph is loaded and accessible.

---

In [None]:
from langchain_neo4j import Neo4jGraph

# Connect to Neo4j
graph = Neo4jGraph(
    url=NEO4J_URL,
    username=NEO4J_USER,
    password=NEO4J_PASS,
    database=NEO4J_DB
)

print("✓ Connected to Neo4j")

In [None]:
# Validate that London Transport data exists
# OPTIONAL MATCH keeps the result row even when stations exist but no relationships do
validation_query = """
MATCH (s:Station)
WITH count(s) AS station_count
OPTIONAL MATCH ()-[r]->()
RETURN station_count, count(r) AS relationship_count
"""

result = graph.query(validation_query)

if result and result[0]['station_count'] > 0:
    print("✓ London Transport data found:")
    print(f"  - Stations: {result[0]['station_count']}")
    print(f"  - Connections: {result[0]['relationship_count']}")
else:
    print("✗ No data found. Please run load_london_transport.ipynb first.")

---

## Section 3: Retrieve and Display Graph Schema

Get the structure of the graph to inform Cypher generation.

---

In [None]:
# Get the graph schema
schema = graph.schema

print("="*80)
print("LONDON TRANSPORT GRAPH SCHEMA")
print("="*80)
print(schema)
print("="*80)

---

## Section 4: Configure LLM for Cypher Generation

Set up the language model with a specialized prompt for generating Cypher queries.

---

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_neo4j import GraphCypherQAChain

# Initialize LLM for Cypher generation with temperature 0.0 for consistency
cypher_llm = ChatOpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_ENDPOINT,
    model=MODEL_NAME,
    temperature=0.0
)

print("✓ LLM configured for Cypher generation")

In [None]:
# Create Cypher generation prompt template
cypher_template = """Task: Generate Cypher statement to query the London Transport Network graph database.

Instructions:
- Use only the provided relationship types and properties in the schema
- Do not use any other relationship types or properties that are not provided
- Use `WHERE toLower(node.name) CONTAINS toLower('name')` for case-insensitive name matching
- Treat relationships as bidirectional - match them without an arrow, e.g. `(a)-[]-(b)`, so they can be traversed in either direction
- For counting patterns, use modern COUNT{{}} subquery syntax
- Each tube line has its own relationship type (e.g., :BAKERLOO, :CENTRAL, :CIRCLE)
- Station properties include: station_id, name, zone, latitude, longitude, postcode
- To find busy stations, count connections: `count{{(s)-[]-()}}`
- To find paths, use shortest path: `shortestPath((from)-[*]-(to))`

Schema:
{schema}

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that ask anything other than generating a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}
"""

cypher_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=cypher_template
)

print("✓ Cypher prompt template created")
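The doubled braces in `COUNT{{}}` and `count{{(s)-[]-()}}` are escapes: the template is filled using Python format-string rules, so `{{` and `}}` render as literal braces while `{schema}` and `{question}` are substituted. A minimal sketch, using plain `str.format` as a stand-in for the prompt template and a made-up schema string rather than the real graph schema:

```python
# Shortened stand-in for the notebook's template; doubled braces are escapes.
template = (
    "- For counting patterns, use modern COUNT{{}} subquery syntax\n"
    "- To find busy stations, count connections: count{{(s)-[]-()}}\n"
    "Schema:\n{schema}\n"
    "The question is:\n{question}\n"
)

rendered = template.format(
    schema="(:Station {name: STRING, zone: STRING})",  # hypothetical schema text
    question="How many stations are in zone 1?",
)

# The LLM sees single braces; the placeholders are replaced by their values.
print(rendered)
```

Forgetting to double a literal brace in the template would typically cause a formatting error when the prompt is built, so this is worth checking whenever a new instruction containing Cypher braces is added.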

In [None]:
# Create the Cypher QA chain
cypher_chain = GraphCypherQAChain.from_llm(
    graph=graph,
    llm=cypher_llm,
    cypher_llm=cypher_llm,
    cypher_prompt=cypher_prompt,
    allow_dangerous_requests=True,
    return_direct=True,
    verbose=True
)

print("✓ Cypher QA chain created")
print("\n⚠️  allow_dangerous_requests=True: generated Cypher runs without review. Use Neo4j credentials limited to read access.")

---

## Section 5: Query Interface

Ask questions in natural language and get answers.

---

In [None]:
def ask_question(question: str):
    """
    Ask a question about the London Transport Network.
    
    The LLM will generate a Cypher query, execute it, and return the results.
    Generated Cypher will be displayed due to verbose=True.
    """
    print("="*80)
    print(f"QUESTION: {question}")
    print("="*80)
    
    result = cypher_chain.invoke({"query": question})
    
    print("\n" + "="*80)
    print("RESULT:")
    print("="*80)
    
    return result

### Try Your Own Question

Modify the question below and run the cell:

In [None]:
# Modify this question and run the cell
question = "How many stations are in zone 1?"

result = ask_question(question)
print(result)
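Printing the returned dictionary works, but longer results can be hard to scan. Assuming the rows arrive as a list of dictionaries under a `result` key (worth confirming by printing one result first), a small helper such as the hypothetical `format_rows` below could lay them out as a plain-text table. The sample rows are made up for illustration and are not real query output.

```python
def format_rows(rows):
    """Render a list of row dictionaries as a plain-text table."""
    if not rows:
        return "(no rows)"
    columns = list(rows[0].keys())
    # Column width = widest of the header and every value in that column
    widths = {
        col: max(len(col), *(len(str(row.get(col, ""))) for row in rows))
        for col in columns
    }
    header = " | ".join(col.ljust(widths[col]) for col in columns)
    divider = "-+-".join("-" * widths[col] for col in columns)
    body = [
        " | ".join(str(row.get(col, "")).ljust(widths[col]) for col in columns)
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Made-up rows standing in for result["result"]
sample = [{"zone": "1", "stations": 10}, {"zone": "2", "stations": 25}]
print(format_rows(sample))
```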

---

## Example Questions

Run the cells below to try different types of questions.

---

### Basic Counting Questions

In [None]:
result = ask_question("How many stations are there in total?")
print(result)

In [None]:
result = ask_question("How many stations are in zone 2?")
print(result)

In [None]:
result = ask_question("Count the stations in each zone")
print(result)

### Station Information Questions

In [None]:
result = ask_question("Show me all stations in zone 1")
print(result)

In [None]:
result = ask_question("What zone is King's Cross St. Pancras in?")
print(result)

In [None]:
result = ask_question("Show me 10 stations with their zones and postcodes")
print(result)

### Tube Line Questions

In [None]:
result = ask_question("Which stations does the Bakerloo line connect?")
print(result)

In [None]:
result = ask_question("What tube lines go through Baker Street?")
print(result)

In [None]:
result = ask_question("Show all Central line connections")
print(result)

### Connection and Traffic Questions

In [None]:
result = ask_question("Which stations have the most connections?")
print(result)

In [None]:
result = ask_question("Show me the top 10 busiest interchange stations")
print(result)

In [None]:
result = ask_question("Which stations have fewer than 4 connections?")
print(result)

In [None]:
result = ask_question("How many connections does Oxford Circus have?")
print(result)

### London Travel and Navigation Questions

In [None]:
result = ask_question("Find a path between King's Cross St. Pancras and Victoria")
print(result)

In [None]:
result = ask_question("What's a route from Paddington to Liverpool Street?")
print(result)

In [None]:
result = ask_question("Show me stations I should avoid during rush hour based on connection counts")
print(result)

In [None]:
result = ask_question("Which quieter stations could I use as alternatives to busy interchanges?")
print(result)

---

## How It Works

This notebook uses a text-to-Cypher pipeline:

1. **Natural Language Question** - You ask a question in plain English
2. **Schema Context** - The graph schema is provided to the LLM
3. **Cypher Generation** - The LLM generates a Cypher query (displayed in output)
4. **Query Execution** - The Cypher is executed against Neo4j
5. **Results Display** - Because the chain sets `return_direct=True`, the raw query rows are returned without an LLM-written summary
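The five steps above can be sketched in plain Python. This is not how `GraphCypherQAChain` is implemented; `answer_question`, `fake_llm`, and `fake_graph` are hypothetical names, and the two stand-in functions replace the real model and database so the flow can be followed without either.

```python
from typing import Callable


def answer_question(
    question: str,
    schema: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], list],
) -> dict:
    # Steps 1-2: combine the question with the graph schema
    prompt = f"Schema:\n{schema}\n\nThe question is:\n{question}"
    # Step 3: ask the model for a Cypher statement and show it
    cypher = generate_cypher(prompt).strip()
    print(f"Generated Cypher: {cypher}")
    # Step 4: execute the statement
    rows = run_query(cypher)
    # Step 5: return the rows as they are
    return {"query": question, "result": rows}


def fake_llm(prompt: str) -> str:
    """Stand-in for the model: always returns the same query."""
    return "MATCH (s:Station) RETURN count(s) AS stations"


def fake_graph(cypher: str) -> list:
    """Stand-in for Neo4j: returns a placeholder row, not real data."""
    return [{"stations": 0}]


response = answer_question(
    "How many stations are there in total?", "(:Station)", fake_llm, fake_graph
)
print(response)
```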

### Why Temperature 0.0?

The Cypher generation LLM uses `temperature=0.0` to make query generation as consistent as possible. The same question will usually produce the same Cypher query, although hosted models do not guarantee identical output on every run.

### Modern Cypher Syntax

The prompt instructs the LLM to use Neo4j 5.x syntax and a few fixed patterns:
- `COUNT{}` subqueries for counting patterns, such as a station's connections
- Case-insensitive name matching with `toLower()` and `CONTAINS`
- `shortestPath()` for path finding
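To make the syntax points above concrete, here are two hand-written queries of the kind the prompt is steering the LLM toward. They are illustrative sketches rather than output captured from the model; the `Station` label and the `name` and `zone` properties come from the prompt, while the constant names are arbitrary.

```python
# Count each station's connections with a COUNT {} subquery (Neo4j 5.x).
BUSIEST_STATIONS = """
MATCH (s:Station)
RETURN s.name AS station, COUNT { (s)-[]-() } AS connections
ORDER BY connections DESC
LIMIT 10
"""

# Case-insensitive name match; $name is supplied as a query parameter.
FIND_STATION = """
MATCH (s:Station)
WHERE toLower(s.name) CONTAINS toLower($name)
RETURN s.name AS station, s.zone AS zone
"""

print(BUSIEST_STATIONS)
print(FIND_STATION)
```

In the notebook these could be run directly against the graph, for example `graph.query(FIND_STATION, params={"name": "baker street"})`, which is a handy way to check what a correct answer looks like before comparing it with the generated Cypher.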

---

## Limitations

This is a basic text-to-Cypher agent. It has limitations:

**What it does well:**
- Simple counting queries ("How many...")
- Station lookups ("Which stations...")
- Relationship queries ("What lines connect...")
- Basic path finding ("Find a route...")

**What it cannot do:**
- Complex multi-hop reasoning
- Optimal journey planning with transfers
- Real-time timetable information
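- Updates or modifications to data (intended as read-only; because the chain does not block write queries, enforce this with a read-only Neo4j user)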
- Ambiguous questions without clear intent
- Questions about data not in the graph

**Best practices:**
- Be specific in your questions
- Use station names as they appear in the data
- Review the generated Cypher to understand results
- For complex questions, break them into simpler parts
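Since reviewing the generated Cypher is recommended and the agent is meant to be read-only, one cautious option is to screen each query for write clauses before it is executed. The sketch below is a keyword heuristic only: `looks_read_only` is a hypothetical helper, the keyword list is not exhaustive (procedures invoked with `CALL` can also write), and it does not replace database-level permissions.

```python
import re

# Clauses that change data or schema. Heuristic list, not exhaustive.
WRITE_KEYWORDS = (
    "CREATE", "MERGE", "DELETE", "DETACH", "SET",
    "REMOVE", "DROP", "LOAD CSV", "FOREACH",
)
_WRITE_PATTERN = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in WRITE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)
# Single- or double-quoted string literals, allowing backslash escapes
_STRING_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"")


def looks_read_only(cypher: str) -> bool:
    """Return True when no write keyword appears outside string literals."""
    without_strings = _STRING_LITERAL.sub("''", cypher)
    return _WRITE_PATTERN.search(without_strings) is None


print(looks_read_only("MATCH (s:Station) RETURN count(s)"))
print(looks_read_only("MATCH (s) DETACH DELETE s"))
```

String literals are blanked out first so that a station name containing a word such as "set" does not trigger a false alarm.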

---

## Troubleshooting

### No Results Returned
- Check that station names are spelled correctly
- Review the generated Cypher (shown in output)
- Try a simpler version of your question
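When a question returns nothing because of a spelling difference, it can help to compare the name you typed with the names stored in the graph. The sketch below uses the standard library's `difflib` for this; `suggest_station_names` is a hypothetical helper and the short list is only a stand-in for the full set of names.

```python
import difflib


def suggest_station_names(typed: str, known_names: list[str], limit: int = 3) -> list[str]:
    """Return the known station names that most resemble what was typed."""
    # Compare in lower case, but hand back the original spelling
    lowered = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        typed.lower(), list(lowered), n=limit, cutoff=0.6
    )
    return [lowered[m] for m in matches]


# Small illustrative list; in the notebook the names could come from
# graph.query("MATCH (s:Station) RETURN s.name AS name")
names = [
    "King's Cross St. Pancras", "Victoria", "Baker Street",
    "Oxford Circus", "Paddington", "Liverpool Street",
]
print(suggest_station_names("Kings Cross St Pancras", names))
```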

### Invalid Cypher Generated
- The question may be too ambiguous
- Try rephrasing with more specific details
- Check that you're asking about data that exists in the graph

### Connection Errors
- Verify Neo4j is running and accessible
- Check that London Transport data is loaded
- Confirm connection parameters in widgets
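Before digging into driver settings, a quick network-level check can show whether the host and port in the `neo4j_url` widget are reachable at all. This is a rough sketch using only the standard library; `parse_host_port` and `can_reach` are hypothetical helpers, and the fallback to port 7687 assumes the default Bolt port is in use when the URL does not name one.

```python
import socket
from urllib.parse import urlparse

DEFAULT_BOLT_PORT = 7687  # Neo4j's default Bolt port


def parse_host_port(neo4j_url: str) -> tuple[str, int]:
    """Extract host and port from a bolt:// or neo4j:// style URL."""
    parsed = urlparse(neo4j_url)
    if not parsed.hostname:
        raise ValueError(f"Could not find a host name in {neo4j_url!r}")
    return parsed.hostname, parsed.port or DEFAULT_BOLT_PORT


def can_reach(neo4j_url: str, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to the host and port succeeds."""
    host, port = parse_host_port(neo4j_url)
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


print(parse_host_port("bolt://localhost:7687"))
```

A successful TCP connection only shows that something is listening; authentication and database name problems would still surface when `Neo4jGraph` connects.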

### LLM Errors
- Verify Databricks endpoint URL is correct
- Check that the model name is available
- Ensure cluster has access to Foundation Models
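A common cause of LLM errors is leaving the `databricks_endpoint` widget at its placeholder value. The sketch below checks for that and for an obviously malformed URL before the model is called. `check_endpoint` is a hypothetical helper, the requirement for an `https` scheme is an assumption about how the endpoint is exposed, and the example host name is invented.

```python
from urllib.parse import urlparse

PLACEHOLDER = "<REPLACE_WITH_SERVING_ENDPOINT>"  # widget default in this notebook


def check_endpoint(endpoint: str) -> list[str]:
    """Return a list of problems found with the endpoint value (empty if none)."""
    value = endpoint.strip()
    if not value or value == PLACEHOLDER:
        return ["The endpoint widget still holds its placeholder value."]
    problems = []
    parsed = urlparse(value)
    if parsed.scheme != "https":
        problems.append("The endpoint is expected to start with https://.")
    if not parsed.netloc:
        problems.append("The endpoint has no host name.")
    return problems


print(check_endpoint(PLACEHOLDER))
# Invented host name, for illustration only
print(check_endpoint("https://example.cloud.databricks.com/serving-endpoints"))
```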

---