# Project 3, Part 3, Create a graph database in Neo4j for the BART system

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Students in the group:
* Aris Chalini
* Jack Galvin
* Matt Lauritzen

Year: 2022

Semester: Spring

Section: 09


# Included Modules and Packages

Code cell containing your includes for modules and packages

Some starter code is provided

You may change the starter code as needed

You may add as much code and/or as many code cells as you need

In [1]:
import neo4j

import csv

import math
import numpy as np
import pandas as pd

import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

Some starter code is provided

You may change the starter code as needed

You may add as much code and/or as many code cells as you need

In [2]:
driver = neo4j.GraphDatabase.driver(uri="neo4j://neo4j:7687", auth=("neo4j","w205"))

In [3]:
session = driver.session(database="neo4j")

In [4]:
def my_neo4j_wipe_out_database():
    "wipe out database by deleting all nodes and relationships"
    
    query = "match (node)-[relationship]->() delete node, relationship"
    session.run(query)
    
    query = "match (node) delete node"
    session.run(query)

In [5]:
def my_neo4j_run_query_pandas(query, **kwargs):
    "run a query and return the results in a pandas dataframe"
    
    result = session.run(query, **kwargs)
    
    df = pd.DataFrame([r.values() for r in result], columns=result.keys())
    
    return df

In [6]:
def my_neo4j_number_nodes_relationships():
    "print the number of nodes and relationships"
   
    
    query = """
        match (n) 
        return n.name as node_name, labels(n) as labels
        order by n.name
    """
    
    df = my_neo4j_run_query_pandas(query)
    
    number_nodes = df.shape[0]
    
    
    query = """
        match (n1)-[r]->(n2) 
        return n1.name as node_name_1, labels(n1) as node_1_labels, 
            type(r) as relationship_type, n2.name as node_name_2, labels(n2) as node_2_labels
        order by node_name_1, node_name_2
    """
    
    df = my_neo4j_run_query_pandas(query)
    
    number_relationships = df.shape[0]
    
    print("-------------------------")
    print("  Nodes:", number_nodes)
    print("  Relationships:", number_relationships)
    print("-------------------------")


In [7]:
def my_neo4j_create_node(station_name):
    "create a node with label Station"
    
    query = """
    
    CREATE (:Station {name: $station_name})
    
    """
    
    session.run(query, station_name=station_name)
    

In [8]:
def my_neo4j_create_relationship_one_way(from_station, to_station, weight):
    "create a relationship one way between two stations with a weight"
    
    query = """
    
    MATCH (from:Station), 
          (to:Station)
    WHERE from.name = $from_station and to.name = $to_station
    CREATE (from)-[:LINK {weight: $weight}]->(to)
    
    """
    
    session.run(query, from_station=from_station, to_station=to_station, weight=weight)
    

In [9]:
def my_neo4j_create_relationship_two_way(from_station, to_station, weight):
    "create relationships two way between two stations with a weight"
    
    query = """
    
    MATCH (from:Station), 
          (to:Station)
    WHERE from.name = $from_station and to.name = $to_station
    CREATE (from)-[:LINK {weight: $weight}]->(to),
           (to)-[:LINK {weight: $weight}]->(from)
    
    """
    
    session.run(query, from_station=from_station, to_station=to_station, weight=weight)
    

In [10]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [11]:
cursor = connection.cursor()

# Introduction 

We will now go step by step through the process of creating a graph database in Neo4j for the BART system.

We will use some of the queries from 3.2 to pull the needed data in the right format to create nodes and relationships in our Neo4j graph database.

We will use the functions created above to create the nodes and relationships:
* my_neo4j_create_node() - creates a node with label Station
* my_neo4j_create_relationship_one_way() - creates a relationship one way between two stations with a weight
* my_neo4j_create_relationship_two_way() - creates relationships two way between two stations with a weight

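The design might seem a bit strange at first: each station is split into a depart node, an arrive node, and one node per line it serves, and every relationship carries a weight in seconds. However, we want to be able to use the canned Neo4j Graph Data Science algorithms, such as weighted shortest path, and this design allows us to do so.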
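To see why the split into depart, line, and arrive nodes helps, here is a minimal in-memory sketch that is not part of the required notebook code. It builds a toy graph in plain Python from the West Oakland figures quoted in the examples later in this notebook (283 seconds to transfer, 300, 360, and 420 seconds for the adjoining segments) and runs a textbook Dijkstra search over it. The names `edges`, `add_link`, and `shortest_time` are illustrative, and only a handful of nodes are included, so the results describe this toy graph rather than the real BART network.

```python
import heapq
from collections import defaultdict

# Adjacency list: node name -> list of (neighbor, weight in seconds)
edges = defaultdict(list)

def add_link(from_node, to_node, weight, two_way=False):
    """Hypothetical in-memory stand-in for a LINK relationship."""
    edges[from_node].append((to_node, weight))
    if two_way:
        edges[to_node].append((from_node, weight))

# Boarding and alighting cost nothing (weight 0)
add_link("depart 12th Street", "red 12th Street", 0)
add_link("red Embarcadero", "arrive Embarcadero", 0)
add_link("blue Lake Merritt", "arrive Lake Merritt", 0)

# Riding a segment costs the travel time, in both directions
add_link("red 12th Street", "red West Oakland", 300, two_way=True)
add_link("red West Oakland", "red Embarcadero", 420, two_way=True)
add_link("blue Lake Merritt", "blue West Oakland", 360, two_way=True)

# Changing lines inside a station costs the transfer time
add_link("red West Oakland", "blue West Oakland", 283)
add_link("blue West Oakland", "red West Oakland", 283)

def shortest_time(start, goal):
    """Dijkstra's algorithm; returns total seconds or None if unreachable."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in edges[node]:
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return None

# Staying on the red line passes through West Oakland without a transfer penalty
same_line = shortest_time("depart 12th Street", "arrive Embarcadero")

# Switching from red to blue at West Oakland pays the 283-second transfer
with_transfer = shortest_time("depart 12th Street", "arrive Lake Merritt")

print(same_line, with_transfer)
```

Because the transfer cost sits on a relationship between two line nodes, a rider who stays on one line never pays it, while a rider who changes lines always does. A single node per station could not express that distinction with relationship weights alone.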

# 3.3.1 Wipe out the Neo4j database

Call the function my_neo4j_wipe_out_database() to wipe out the Neo4j database

In [12]:
my_neo4j_wipe_out_database()

# 3.3.2 Verify the number of nodes and relationships

Call the function my_neo4j_number_nodes_relationships() to verify the number of nodes and relationships is 0 for both

In [13]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 0
  Relationships: 0
-------------------------


# 3.3.3 Query the list of stations and create the departure and arrival nodes in the graph

Use the query from 3.2.1 "Query the list of stations"

For each station X, create two nodes:
* depart X
* arrive X

Use the function my_neo4j_create_node() defined above

For example, West Oakland:
* my_neo4j_create_node('depart West Oakland')
* my_neo4j_create_node('arrive West Oakland')



## Since this is the first one, a solution code cell is provided for you to execute and then pattern the rest after



In [14]:
connection.rollback()

query = """

select station
from stations
order by station

"""

cursor.execute(query)

connection.rollback()

rows = cursor.fetchall()

for row in rows:
    
    station = row[0]
    
    my_neo4j_create_node('depart ' + station)
    my_neo4j_create_node('arrive ' + station)
    

# 3.3.4 Verify the number of nodes and relationships

Call the function my_neo4j_number_nodes_relationships() to verify the number of nodes is 100 and the number of relationships is 0

In [15]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 100
  Relationships: 0
-------------------------


# 3.3.5 Query the list of stations and the lines they serve, create line nodes, and create relationships between the line nodes and the departure and arrival nodes with weight 0

Use the query from 3.2.3 "Query the list of stations and the lines they serve"

For each station X and each line Y that the station serves:
* Create a line node
* Create a relationship from the departure node to the line node with weight 0
* Create a relationship from the line node to the arrival node with weight 0

Use the function my_neo4j_create_relationship_one_way() defined above to create the relationships

For example, West Oakland should create the following line nodes:
* my_neo4j_create_node('blue West Oakland')
* my_neo4j_create_node('green West Oakland')
* my_neo4j_create_node('red West Oakland')
* my_neo4j_create_node('yellow West Oakland')

And the following relationships between line nodes and departure and arrival nodes:
* my_neo4j_create_relationship_one_way('depart West Oakland','blue West Oakland',0)
* my_neo4j_create_relationship_one_way('blue West Oakland','arrive West Oakland',0)
* my_neo4j_create_relationship_one_way('depart West Oakland','green West Oakland',0)
* my_neo4j_create_relationship_one_way('green West Oakland','arrive West Oakland',0)
* my_neo4j_create_relationship_one_way('depart West Oakland','red West Oakland',0)
* my_neo4j_create_relationship_one_way('red West Oakland','arrive West Oakland',0)
* my_neo4j_create_relationship_one_way('depart West Oakland','yellow West Oakland',0)
* my_neo4j_create_relationship_one_way('yellow West Oakland','arrive West Oakland',0)
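Before writing to Neo4j, it can help to dry-run the naming logic on a few rows. The sketch below is an assumption-laden illustration rather than the required solution: `plan_line_nodes` is a hypothetical helper that takes `(station, line)` tuples shaped like the rows of the 3.2.3 query and returns the node names and relationship tuples that would be passed to `my_neo4j_create_node()` and `my_neo4j_create_relationship_one_way()`.

```python
def plan_line_nodes(rows):
    """Return (nodes, links) for an iterable of (station, line) rows.

    nodes: line node names such as 'blue West Oakland'
    links: (from_node, to_node, weight) tuples, all with weight 0
    """
    nodes, links = [], []
    for station, line in rows:
        line_node = f"{line} {station}"
        nodes.append(line_node)
        # Boarding: departure node -> line node
        links.append((f"depart {station}", line_node, 0))
        # Alighting: line node -> arrival node
        links.append((line_node, f"arrive {station}", 0))
    return nodes, links

# Sample rows for West Oakland, taken from the example above
sample_rows = [
    ("West Oakland", "blue"),
    ("West Oakland", "green"),
    ("West Oakland", "red"),
    ("West Oakland", "yellow"),
]

nodes, links = plan_line_nodes(sample_rows)

for from_node, to_node, weight in links:
    print(from_node, "->", to_node, weight)
```

Each row yields one node and two relationships, so the four West Oakland rows produce 4 line nodes and 8 relationships, matching the lists above.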

In [16]:
connection.rollback()

query = """

select station, line
from lines
order by 1, 2

"""

cursor.execute(query)

connection.rollback()

rows = cursor.fetchall()

for row in rows:
    
    station = row[0]
    line = row[1]
    
    my_neo4j_create_node(line + " " + station)
    my_neo4j_create_relationship_one_way('depart ' + station, line + " " + station, 0)
    my_neo4j_create_relationship_one_way(line + " " + station, 'arrive ' + station, 0)
    

# 3.3.6 Verify the number of nodes and relationships

Call the function my_neo4j_number_nodes_relationships() to verify the number of nodes is 214 and the number of relationships is 228

In [17]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 214
  Relationships: 228
-------------------------


# 3.3.7 Query the list of all possible line transfers and the transfer times, create a relationship for each transfer with the transfer time as the weight

Use the query from 3.2.5 "Query the list of all possible line transfers and the transfer times"

For each station X, from line Y, to line Z, create a relationship from Y's line node to Z's line node with the weight set to the transfer time

For example, West Oakland should create the following relationships between line nodes for transfers:

* my_neo4j_create_relationship_one_way('blue West Oakland','green West Oakland',283)
* my_neo4j_create_relationship_one_way('blue West Oakland','red West Oakland',283)
* my_neo4j_create_relationship_one_way('blue West Oakland','yellow West Oakland',283)
* my_neo4j_create_relationship_one_way('green West Oakland','blue West Oakland',283)
* my_neo4j_create_relationship_one_way('green West Oakland','red West Oakland',283)
* my_neo4j_create_relationship_one_way('green West Oakland','yellow West Oakland',283)
* my_neo4j_create_relationship_one_way('red West Oakland','blue West Oakland',283)
* my_neo4j_create_relationship_one_way('red West Oakland','green West Oakland',283)
* my_neo4j_create_relationship_one_way('red West Oakland','yellow West Oakland',283)
* my_neo4j_create_relationship_one_way('yellow West Oakland','blue West Oakland',283)
* my_neo4j_create_relationship_one_way('yellow West Oakland','green West Oakland',283)
* my_neo4j_create_relationship_one_way('yellow West Oakland','red West Oakland',283)
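The twelve relationships above are the ordered pairs of distinct lines at the station, which is what the self-join with `a.line != b.line` produces. As a small illustration outside the database (the variable names are assumptions for this sketch), `itertools.permutations` generates the same pairs for West Oakland:

```python
from itertools import permutations

station = "West Oakland"
lines = ["blue", "green", "red", "yellow"]
transfer_time = 283  # seconds, from the example above

# Every ordered pair of distinct lines is one possible transfer
transfers = [
    (f"{from_line} {station}", f"{to_line} {station}", transfer_time)
    for from_line, to_line in permutations(lines, 2)
]

for from_node, to_node, weight in transfers:
    print(from_node, "->", to_node, weight)
```

A station served by n lines yields n * (n - 1) transfer relationships, so a station served by a single line contributes none.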


In [18]:
connection.rollback()

query = """

select a.station, a.line as from_line, b.line as to_line, s.transfer_time
from lines as a
join lines as b
    on a.station = b.station
join stations as s
    on a.station = s.station
where a.line != b.line
order by 1, 2, 3

"""

cursor.execute(query)

connection.rollback()

rows = cursor.fetchall()

for row in rows:
    
    station = row[0]
    from_line = row[1]
    to_line = row[2]
    transfer_time = int(row[3])
    
    my_neo4j_create_relationship_one_way(from_line + " " + station, to_line + " " + station, transfer_time)


# 3.3.8 Verify the number of nodes and relationships

Call the function my_neo4j_number_nodes_relationships() to verify the number of nodes is 214 and the number of relationships is 436

In [19]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 214
  Relationships: 436
-------------------------


# 3.3.9 Query the list of all segments between each station and its adjoining stations, create a relationship for each segment both ways

Use the query from 3.2.7 "Query the list of all segments between each station and its adjoining stations"

For each segment from station X to station Y on line Z, create two relationships:
* From X's line node to Y's line node with the travel time as the weight
* From Y's line node to X's line node with the travel time as the weight

Use the function my_neo4j_create_relationship_two_way() defined above which will create both relationships 

For example, West Oakland should create the following relationships between line nodes:

* my_neo4j_create_relationship_two_way('blue Lake Merritt','blue West Oakland',360)
* my_neo4j_create_relationship_two_way('blue West Oakland','blue Embarcadero',420)
* my_neo4j_create_relationship_two_way('green Lake Merritt','green West Oakland',360)
* my_neo4j_create_relationship_two_way('green West Oakland','green Embarcadero',420)
* my_neo4j_create_relationship_two_way('red 12th Street','red West Oakland',300)
* my_neo4j_create_relationship_two_way('red West Oakland','red Embarcadero',420)
* my_neo4j_create_relationship_two_way('yellow 12th Street','yellow West Oakland',300)
* my_neo4j_create_relationship_two_way('yellow West Oakland','yellow Embarcadero',420)
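The segment logic can be sketched in memory: consecutive stations on a line form a segment, and each segment becomes two directed relationships with the same weight. The snippet below is a hedged illustration only. It uses a three-station slice of the blue line with the travel times from the example above, and `blue_slice`, `travel_time`, and `lookup_travel_time` are hypothetical names.

```python
# Partial blue line, in sequence order (only the stations from the example)
blue_slice = ["Lake Merritt", "West Oakland", "Embarcadero"]

# Travel times in seconds, stored once per station pair
travel_time = {
    ("Lake Merritt", "West Oakland"): 360,
    ("West Oakland", "Embarcadero"): 420,
}

def lookup_travel_time(station_a, station_b):
    """Find the travel time regardless of the order the pair was stored in."""
    if (station_a, station_b) in travel_time:
        return travel_time[(station_a, station_b)]
    return travel_time[(station_b, station_a)]

directed = []
# zip pairs each station with the next one in sequence
for from_station, to_station in zip(blue_slice, blue_slice[1:]):
    seconds = lookup_travel_time(from_station, to_station)
    directed.append((f"blue {from_station}", f"blue {to_station}", seconds))
    directed.append((f"blue {to_station}", f"blue {from_station}", seconds))

for from_node, to_node, weight in directed:
    print(from_node, "->", to_node, weight)
```

The two-order lookup plays the same role as the `or` condition in the join against `travel_times`: a pair of stations is stored once, but the line may traverse it in either order.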


In [20]:
connection.rollback()

query = """

select a.line, 
    a.station as "from station", 
    b.station as "to station", 
    t.travel_time as "travel time in seconds"
from lines as a
join lines as b
    on a.line = b.line 
        and a.sequence = b.sequence - 1
join travel_times as t
    on (a.station = t.station_1 and b.station = t.station_2)
        or (a.station = t.station_2 and b.station = t.station_1)
order by 1, 2, 3

"""

cursor.execute(query)

connection.rollback()

rows = cursor.fetchall()

for row in rows:
    
    line = row[0]
    from_station = row[1]
    to_station = row[2]
    travel_time = int(row[3])

    my_neo4j_create_relationship_two_way(line + " " + from_station, line + " " + to_station, travel_time)
    

# 3.3.10 Verify the number of nodes and relationships

Call the function my_neo4j_number_nodes_relationships() to verify the number of nodes is 214 and the number of relationships is 652

In [21]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 214
  Relationships: 652
-------------------------
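As a closing sanity check, the counts printed by the verification cells can be reconciled with what each step is supposed to add. The arithmetic below uses only the numbers reported in this notebook. The variable names are illustrative, and the last line assumes every line is a simple chain of stations with a travel time for each consecutive pair.

```python
# Counts printed by the verification cells
nodes_after_stations = 100    # 3.3.4
nodes_after_lines = 214       # 3.3.6
rels_after_lines = 228        # 3.3.6
rels_after_transfers = 436    # 3.3.8
rels_after_segments = 652     # 3.3.10

# Two nodes (depart, arrive) per station
stations = nodes_after_stations // 2

# One line node per (station, line) row
line_nodes = nodes_after_lines - nodes_after_stations

# Each line node gets one depart link and one arrive link
assert rels_after_lines == 2 * line_nodes

# One relationship per ordered pair of distinct lines at a station
transfer_rels = rels_after_transfers - rels_after_lines

# Two relationships (one each way) per segment
segment_rels = rels_after_segments - rels_after_transfers
segments = segment_rels // 2

# Assumption: a chain of k stations has k - 1 segments, so the gap is the line count
implied_lines = line_nodes - segments

print("stations:", stations)
print("line nodes:", line_nodes)
print("transfer relationships:", transfer_rels)
print("segments:", segments)
print("implied lines:", implied_lines)
```

If any step had been run twice or skipped, one of these differences would stop matching its expected multiple, which makes the tally a quick way to catch a graph that needs to be wiped and rebuilt.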
