## Importing BigQuery data into Neo4j

Importing from a relational database can be straightforward: table names become node labels, foreign keys become relationships, and join tables that model many-to-many relationships become relationships, with any extra columns stored as relationship properties.

However, it is important to first define the problem you are trying to solve, so that the data model can be optimised for the questions you want to answer.

In Neo4j, nouns that describe _things_ become node labels, and verbs become relationship types.  For example, an order may contain many products, which could be modelled as `(:Order)-[:CONTAINS]->(:Product)`.
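
As a minimal sketch of that mapping, assume three hypothetical relational tables (`orders`, `products`, and an `order_items` join table) held as lists of dicts. None of these names or values come from the dataset used later in this notebook; they only illustrate how rows could be reshaped into node and relationship descriptions before an import.

```python
# Hypothetical relational rows; table and column names are assumptions for illustration.
orders = [{"order_id": 1, "customer": "Alice"}]
products = [{"product_id": 10, "name": "Helmet"}, {"product_id": 11, "name": "Lock"}]
order_items = [
    {"order_id": 1, "product_id": 10, "quantity": 2},
    {"order_id": 1, "product_id": 11, "quantity": 1},
]


def to_nodes(label, rows, key):
    """Describe each table row as a node: the table name becomes the label."""
    return [{"label": label, "id": row[key], "properties": dict(row)} for row in rows]


def to_relationships(rel_type, rows, start_key, end_key):
    """Describe each join-table row as a relationship.

    The two foreign keys identify the start and end nodes; every other
    column becomes a relationship property.
    """
    return [
        {
            "type": rel_type,
            "start": row[start_key],
            "end": row[end_key],
            "properties": {
                k: v for k, v in row.items() if k not in (start_key, end_key)
            },
        }
        for row in rows
    ]


order_nodes = to_nodes("Order", orders, "order_id")
product_nodes = to_nodes("Product", products, "product_id")
contains = to_relationships("CONTAINS", order_items, "order_id", "product_id")

print(contains[0])
```

The join table produces one `CONTAINS` relationship per row, and `quantity` ends up as a property on the relationship rather than on either node.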

You can find modeling tips and tricks in the [Modeling Fundamentals course on GraphAcademy](https://graphacademy.neo4j.com/courses/importing-fundamentals/).


### Prerequisites

To import from BigQuery, you will need to set up authentication with Google Cloud, with appropriate billing and permissions.  This notebook assumes that this has been set up and that application default credentials are configured on the machine.

For example, using Homebrew:

* `brew install gcloud-cli`
* `gcloud auth application-default login`

### Dependencies

You will need to install the following libraries:


* `neo4j`
* `google-cloud-bigquery`
* `pandas`

You will also need `ipykernel` installed to use this Jupyter notebook.



### Connect to Neo4j

To connect to Neo4j, call the `GraphDatabase.driver()` method with the database URI, and authenticate with your username and password.

In [9]:
from neo4j import GraphDatabase

uri = 'neo4j://localhost:7687'
user = 'neo4j'
password = 'neoneoneo'

driver = GraphDatabase.driver(
    uri,
    auth=(user, password)
)

# Verify the connection details are correct.  If not, an error will be thrown.
driver.verify_connectivity()

### Run a Cypher statement

Use the `driver.execute_query()` method, which takes the Cypher statement as its first positional argument.  Any keyword arguments whose names do not end with an underscore are passed to the query as parameters and can be referenced in Cypher with a `$` prefix.


In [10]:
records, summary, keys = driver.execute_query(
    "MATCH (n) RETURN count(*) AS count, $foo AS parameter",
    foo="bar"
)

for record in records:
    for key in keys:
        print(f"{key}: {record[key]}")
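
The difference between the two kinds of keyword argument can be sketched as follows. The function name `count_nodes` and its arguments are hypothetical, `driver` is assumed to be the `neo4j.Driver` created earlier, and the database name `"neo4j"` is an assumption about your setup.

```python
def count_nodes(driver, value, database="neo4j"):
    """Count all nodes and echo a query parameter back.

    `driver` is assumed to be a neo4j.Driver; the function name and the
    default database name are assumptions made for this sketch.
    """
    records, summary, keys = driver.execute_query(
        "MATCH (n) RETURN count(*) AS count, $foo AS parameter",
        foo=value,           # no trailing underscore: sent as the $foo query parameter
        database_=database,  # trailing underscore: driver configuration, not a parameter
    )
    first = records[0]
    return first["count"], first["parameter"]
```

Calling `count_nodes(driver, "bar")` should return the node count together with the string `"bar"`, while `database_` never reaches the Cypher statement.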

### Transform results into a Pandas dataframe

The `result_transformer_` parameter accepts a function that transforms the result before it is returned.  The `Result.to_df` method turns the result into a pandas DataFrame.



In [11]:
from neo4j import Result

driver.execute_query("""
UNWIND range(1, 10) as id
RETURN id, id * 2 as double, randomUuid() AS uuid, rand() AS random
""", result_transformer_=Result.to_df)

Unnamed: 0,id,double,uuid,random
0,1,2,0b169914-a361-44a6-baf6-f07c6fab4d94,0.037557
1,2,4,518163b5-d5bc-4027-9921-d043e59a72fc,0.425075
2,3,6,afcbec10-4f8e-4b8f-9e0b-e3273dfc227f,0.177764
3,4,8,124913ea-f1ad-4f60-ac0f-2a3e7f88efdf,0.874264
4,5,10,99236844-d48e-4b37-878b-1e4f86c7a092,0.148921
5,6,12,53cc3ae6-dfe9-4beb-bf7b-a0cda4948606,0.127653
6,7,14,d990d9d1-49ba-4660-8ee6-23458e7ed5f6,0.541341
7,8,16,d7fe3d41-b2a9-4a5e-9aea-af254dea4388,0.847347
8,9,18,8630ed85-da41-4129-8f74-367790372ad8,0.42115
9,10,20,ef81f791-b24e-4a5f-aa9a-dc9fd00563fe,0.901266
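
A transformer does not have to be a method of `Result`; any function that accepts the result can be passed. As a minimal sketch, the hypothetical `to_dicts` function below collects each record as a plain dictionary using `Record.data()`. The names `to_dicts` and `fetch_ids` are illustrative, and `driver` is assumed to be the driver created earlier.

```python
def to_dicts(result):
    """Hypothetical transformer: return every record as a plain dict.

    The driver passes the result to this function; iterating it yields
    records, and Record.data() returns a record as a dictionary.
    """
    return [record.data() for record in result]


def fetch_ids(driver):
    # `driver` is assumed to be the neo4j.Driver created earlier in the notebook.
    return driver.execute_query(
        "UNWIND range(1, 3) AS id RETURN id, id * 2 AS double",
        result_transformer_=to_dicts,
    )
```

When a transformer is supplied, `execute_query()` returns whatever the transformer returns, so `fetch_ids(driver)` would give a list of dictionaries rather than the usual `(records, summary, keys)` tuple.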


## Connect to BigQuery

This example assumes that default application credentials have been set for the machine.  For example:

```
gcloud auth application-default login
```

Use the client's `query()` method to execute an SQL statement.  The query can return normalised or denormalised data.


In [12]:
from google.cloud import bigquery

bq_client = bigquery.Client(project="neo4j-graphacademy")

In [13]:
# execute an SQL query
res = bq_client.query("SELECT * FROM `bigquery-public-data.new_york_citibike.citibike_trips` WHERE tripduration > 100 LIMIT 1000")
rows = [ dict(row) for row in res ]

In [14]:
# convert the data into a pandas DataFrame.
import pandas as pd

df = pd.DataFrame(rows)

df.head()

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender,customer_plan
0,59253,2016-03-09 13:27:57,2016-03-10 05:55:31,469,Broadway & W 53 St,40.763441,-73.982681,3240,NYCBS Depot BAL - DYR,0.0,0.0,20112,Customer,,unknown,
1,447,2016-03-11 18:43:51,2016-03-11 18:51:18,442,W 27 St & 7 Ave,40.746647,-73.993915,3240,NYCBS Depot BAL - DYR,0.0,0.0,14707,Subscriber,1970.0,male,
2,220,2016-03-17 22:50:31,2016-03-17 22:54:11,311,Norfolk St & Broome St,40.717227,-73.988021,3019,NYCBS Depot - DEL,40.716633,-73.981933,23565,Subscriber,1977.0,male,
3,213,2015-10-27 23:04:37,2015-10-27 23:08:10,311,Norfolk St & Broome St,40.717227,-73.988021,3019,NYCBS Depot - DEL,40.716633,-73.981933,15248,Subscriber,1969.0,male,
4,179,2015-11-10 10:14:52,2015-11-10 10:17:51,311,Norfolk St & Broome St,40.717227,-73.988021,3019,NYCBS Depot - DEL,40.716633,-73.981933,15479,Subscriber,1978.0,male,


## Import the list of dicts into Neo4j

Each row contains the following nodes: 

* Station (id, name, latitude, longitude)
* Bike (id)
* Ride (starttime, stoptime, duration)

And the following relationships:

* (:Ride)-[:STARTS_AT]->(:Station)
* (:Ride)-[:ENDS_AT]->(:Station)
* (:Ride)-[:USES]->(:Bike)

Each of these should be imported into the database sequentially, first by creating the nodes, then by creating the relationships.  Each node will have a unique identifier, so it's a good idea to first create a unique constraint to avoid duplicate nodes.

### Create Unique Constraints

In [15]:
with driver.session() as session:
    session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (s:Station) REQUIRE s.id IS UNIQUE")
    session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (b:Bike) REQUIRE b.id IS UNIQUE")
    session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (r:Ride) REQUIRE r.id IS UNIQUE")


## Import the data

When you import data into Neo4j, it is sensible to split the result set into smaller batches.  Sending fewer rows per query helps to avoid out-of-memory (OOM) errors.









In [16]:
batch_size = 100

def batch(cypher, iterable, batch_size=100):
    """
    Utility function to incrementally import batches of data into Neo4j.
    This avoids out-of-memory errors caused by large payloads exceeding the Java heap size.
    """
    while len(iterable) > 0:
        current_batch = iterable[:batch_size]
        driver.execute_query(cypher, rows=current_batch)
        iterable = iterable[batch_size:]
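
The slicing approach above needs a sequence that supports `len()` and copies the remaining rows on every pass. As an alternative sketch, the hypothetical `chunks` generator below uses `itertools.islice` so that any iterable can be consumed lazily. The names `chunks` and `batch_import` are illustrative, and `driver` is assumed to be the driver created earlier.

```python
from itertools import islice


def chunks(iterable, size=100):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def batch_import(driver, cypher, iterable, size=100):
    # Hypothetical variant of batch(); `driver` is assumed to be a neo4j.Driver.
    for chunk in chunks(iterable, size):
        driver.execute_query(cypher, rows=chunk)


print(list(chunks(range(5), 2)))
```

Each chunk is sent as the `$rows` parameter, exactly as in `batch()`, so the same Cypher statements can be reused.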


### Importing nodes

Start by importing nodes.


In [17]:
# Import the stations
stations = [
    dict(id=id, name=name, latitude=lat, longitude=lon)
    for id, name, lat, lon in {
        (row[k+'_station_id'], row[k+'_station_name'], row[k+'_station_latitude'], row[k+'_station_longitude'])
        for row in rows for k in ('start', 'end')
    }
]

batch(r"""
UNWIND $rows as row
MERGE (s:Station {id: row.id})
ON CREATE SET
    s.name = row.name,
    s.location = point({latitude: row.latitude, longitude: row.longitude})
""", stations, batch_size=100)


# Import the bikes
bikes = [
    dict(id=id)
    for id in set(row['bikeid'] for row in rows)
]

batch(r"""
UNWIND $rows as row
MERGE (b:Bike {id: row.id})
""", bikes, batch_size=100)

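To see what the set comprehension for stations does, here is a small sketch that applies the same idea to two rows taken from the `df.head()` output shown earlier, reduced to the station columns. The helper `station_fields` and the variable names are hypothetical.

```python
# Two rows from the df.head() output, reduced to the station columns.
sample_rows = [
    {
        "start_station_id": 311, "start_station_name": "Norfolk St & Broome St",
        "start_station_latitude": 40.717227, "start_station_longitude": -73.988021,
        "end_station_id": 3019, "end_station_name": "NYCBS Depot - DEL",
        "end_station_latitude": 40.716633, "end_station_longitude": -73.981933,
    },
    {
        "start_station_id": 311, "start_station_name": "Norfolk St & Broome St",
        "start_station_latitude": 40.717227, "start_station_longitude": -73.988021,
        "end_station_id": 3019, "end_station_name": "NYCBS Depot - DEL",
        "end_station_latitude": 40.716633, "end_station_longitude": -73.981933,
    },
]


def station_fields(row, prefix):
    """Return the (id, name, latitude, longitude) tuple for 'start' or 'end'."""
    return (
        row[f"{prefix}_station_id"],
        row[f"{prefix}_station_name"],
        row[f"{prefix}_station_latitude"],
        row[f"{prefix}_station_longitude"],
    )


# Tuples are hashable, so a set keeps one copy of each distinct station.
unique = {station_fields(row, k) for row in sample_rows for k in ("start", "end")}

sample_stations = [
    dict(id=i, name=name, latitude=lat, longitude=lon)
    for i, name, lat, lon in sorted(unique)
]

print(len(sample_stations))
```

The four station mentions in the two rows collapse to two distinct stations, which is why the import sends far fewer station rows than ride rows.
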
### Importing relationships

Then use the unique identifier columns to find the nodes, and create the relationships.


In [18]:
# (:Ride)-[:USES]->(:Bike)
# (:Ride)-[:STARTS_AT]->(:Station)
# (:Ride)-[:ENDS_AT]->(:Station)
batch(r"""
UNWIND $rows as row
MERGE (r:Ride {id: row['bikeid'] + toString(row['starttime']) + toString(row['stoptime'])})
SET r.tripduration = row.tripduration,
    r.starttime = localdatetime(row.starttime),
    r.stoptime = localdatetime(row.stoptime),
    r.usertype = row.usertype,
    r.birth_year = row.birth_year,
    r.gender = row.gender,
    r.customer_plan = row.customer_plan

MERGE (s:Station {id: row.start_station_id})
MERGE (e:Station {id: row.end_station_id})
MERGE (b:Bike {id: row.bikeid})
MERGE (r)-[:STARTS_AT]->(s)
MERGE (r)-[:ENDS_AT]->(e)
MERGE (r)-[:USES]->(b)
""", rows, batch_size=100)

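Because the `Ride` node is merged on an identifier built from `bikeid`, `starttime` and `stoptime`, rows that share all three values end up as a single node. A rough way to check how many `Ride` nodes to expect is to count the distinct combinations in Python before importing. This is a sketch only: `ride_key` is a hypothetical helper, the first two rows are taken from the `df.head()` output, and the third is a deliberate duplicate added for illustration.

```python
from datetime import datetime


def ride_key(row):
    """Return the three fields that the Cypher statement combines into Ride.id."""
    return (row["bikeid"], row["starttime"], row["stoptime"])


sample = [
    {
        "bikeid": 23565,
        "starttime": datetime(2016, 3, 17, 22, 50, 31),
        "stoptime": datetime(2016, 3, 17, 22, 54, 11),
    },
    {
        "bikeid": 15248,
        "starttime": datetime(2015, 10, 27, 23, 4, 37),
        "stoptime": datetime(2015, 10, 27, 23, 8, 10),
    },
]
sample.append(dict(sample[0]))  # deliberate duplicate, for illustration only

distinct_rides = {ride_key(row) for row in sample}

print(len(sample), "rows,", len(distinct_rides), "distinct rides")
```

Applying `ride_key` to the full `rows` list in the same way gives the number of `Ride` nodes the import should create.
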
## Querying the data

In [22]:
driver.execute_query("""
MATCH (s:Station)
RETURN s.name AS name, s.location AS location LIMIT 10
""", result_transformer_=Result.to_df)

Unnamed: 0,name,location
0,W 25 St & 6 Ave,"(-73.99144871, 40.74395411)"
1,E 71 St & 2 Ave,"(-73.95910263061523, 40.76817546742245)"
2,Broadway & W 36 St,"(-73.98765428, 40.75097711)"
3,W 27 St & 10 Ave,"(-74.00218427181244, 40.75018156325683)"
4,E 48 St & 5 Ave,"(-73.97805914282799, 40.75724567911726)"
5,Allen St & Hester St,"(-73.99190759, 40.71605866)"
6,Broadway & W 49 St,"(-73.98442659, 40.76064679)"
7,W 43 St & 10 Ave,"(-73.99461843, 40.76009437)"
8,5 Ave & E 78 St,"(-73.96427392959595, 40.77632142182271)"
9,E 55 St & 2 Ave,"(-73.96603308, 40.75797322)"


In [None]:
# Most used bikes, ranked by total ride duration
driver.execute_query(r"""
MATCH (b:Bike)<-[:USES]-(r:Ride)
RETURN b.id AS id,
    count(*) AS rides,
    sum(r.tripduration) AS duration,
    round(1.0 * sum(r.tripduration) / count(*), 2) AS average_duration
ORDER BY duration DESC LIMIT 10
""", result_transformer_=Result.to_df)

Unnamed: 0,id,rides,duration,average_duration
0,19257,1,280154,280154.0
1,22196,1,63455,63455.0
2,20112,1,59253,59253.0
3,16905,3,40493,13497.67
4,22658,1,38158,38158.0
5,18060,1,31870,31870.0
6,20615,1,19613,19613.0
7,19884,1,14955,14955.0
8,23768,1,8193,8193.0
9,20693,3,7415,2471.67


## Visualising the data

The data can be quickly visualised within a notebook using the [yFiles Jupyter notebook plugin](https://github.com/yWorks/yfiles-jupyter-graphs-for-neo4j).

`pip install yfiles_jupyter_graphs_for_neo4j`



In [21]:
from yfiles_jupyter_graphs_for_neo4j import Neo4jGraphWidget

g = Neo4jGraphWidget(driver)

g.show_cypher("MATCH (s)-[r]->(t) RETURN s,r,t LIMIT 20")


GraphWidget(layout=Layout(height='700px', width='100%'))

Available `layout` parameter options for the `show_cypher` method: 

* circular
* hierarchic
* organic
* interactive_organic
* orthogonal
* radial
* tree
* map
* orthogonal_edge_router
* organic_edge_router