# BigQuery + Spark + Neo4j AGA


## Setup

We need to do a little setup before we can run this notebook.
To allow the Spark workers to connect to our session, we need to create a new `NAT` network router that routes the workers' traffic to the internet.

```shell
# 1) Cloud Router
gcloud compute routers create nat-router --network=YOUR_VPC --region=REGION

# 2) (Optional) reserve a static egress IP to allowlist at the third-party
gcloud compute addresses create spark-egress --region=REGION

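# 3) Cloud NAT config (auto-allocates egress IPs; if you reserved a static IP in step 2,
#    replace --auto-allocate-nat-external-ips with --nat-external-ip-pool=spark-egress)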
gcloud compute routers nats create spark-nat \
  --router=nat-router --router-region=REGION \
  --nat-all-subnet-ip-ranges \
  --auto-allocate-nat-external-ips
```

In [None]:
%pip install graphdatascience==1.16

## Create a Spark session

In [None]:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

session = Session()
session.environment_config.execution_config.subnetwork_uri = "projects/team-graph-analytics/regions/europe-west2/subnetworks/default"
spark = DataprocSparkSession.builder.dataprocSessionConfig(session).getOrCreate()
spark.addArtifacts("graphdatascience==1.16", pypi=True)

## Load data

Connect to the BigQuery dataset and make it accessible to PySpark.

In [None]:
# Load data from BigQuery
trips_table = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data.new_york.citibike_trips') \
  .load()
trips_table.createOrReplaceTempView('trips')

## Creating a GDS session

In [None]:
from graphdatascience.session import AuraAPICredentials, GdsSessions, CloudLocation, SessionMemory
from datetime import timedelta

# you can also use AuraAPICredentials.from_env() to load credentials from environment variables
api_credentials = AuraAPICredentials(
    client_id="",
    client_secret="",
    # If your account is a member of several projects, you must also specify the project ID to use
    project_id="",
)

sessions = GdsSessions(api_credentials=api_credentials)

# Create a GDS session!
gds = sessions.get_or_create(
    session_name="trips",
    memory=SessionMemory.m_16GB,
    ttl=timedelta(minutes=30),
    cloud_location=CloudLocation("gcp", "europe-west1"),
)
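
To keep secrets out of the notebook, the credentials can be read from environment variables before building `AuraAPICredentials`. The sketch below uses only the standard library. The variable names `AURA_CLIENT_ID`, `AURA_CLIENT_SECRET`, and `AURA_PROJECT_ID` and the helper `read_aura_credentials` are hypothetical placeholders, and are not necessarily the names that `AuraAPICredentials.from_env()` expects.

```python
import os


def read_aura_credentials(env=os.environ):
    """Collect Aura API credentials from environment variables.

    The variable names are placeholders chosen for this sketch.
    """
    required = ("AURA_CLIENT_ID", "AURA_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["AURA_CLIENT_ID"],
        "client_secret": env["AURA_CLIENT_SECRET"],
        # Only needed when the account belongs to several projects
        "project_id": env.get("AURA_PROJECT_ID") or None,
    }
```

The result could then be passed on as `AuraAPICredentials(**read_aura_credentials())`.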

## Graph projections

In [None]:
# Note: this reaches into private attributes of the client, which may change between versions
arrow_client = gds._query_runner._query_runner._gds_arrow_client
arrow_client.create_graph_from_triplets("trips", "neo4j")


In [None]:
import pyarrow
def upload_batch(iterator):
  for batch in iterator:
    arrow_client.upload_triplets("trips", [batch])
    yield pyarrow.RecordBatch.from_pydict({})
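
`mapInArrow` hands `upload_batch` an iterator of Arrow record batches and expects an iterator back, so the upload happens as a side effect while the batches stream through. The plain-Python sketch below mimics that shape without Spark or a GDS session: `FakeArrowClient` and `make_uploader` are hypothetical stand-ins, and lists of tuples stand in for record batches. It also shows that a generator does nothing until it is consumed, which is presumably why the Spark cell that follows ends with an action such as `.show()`.

```python
class FakeArrowClient:
    """Hypothetical stand-in that records what would be uploaded."""

    def __init__(self):
        self.uploaded = []

    def upload_triplets(self, graph_name, batches):
        for batch in batches:
            self.uploaded.append((graph_name, batch))


def make_uploader(client, graph_name):
    """Build a generator function with the same shape as upload_batch."""
    def upload_batch(iterator):
        for batch in iterator:
            client.upload_triplets(graph_name, [batch])
            # Yield an empty placeholder: the output carries no data
            yield []
    return upload_batch


client = FakeArrowClient()
upload_batch = make_uploader(client, "trips")
batches = [[(72, 79), (79, 82)], [(82, 72)]]  # made-up station ID pairs

lazy_result = upload_batch(iter(batches))
uploads_before = len(client.uploaded)  # generator not consumed yet
outputs = list(lazy_result)            # consuming it triggers the uploads
uploads_after = len(client.uploaded)
```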

In [None]:
# Select start/end station pairs as source/target nodes and upload them to the GDS session
spark.sql("""
  SELECT start_station_id AS sourceNode, end_station_id AS targetNode FROM trips LIMIT 10000000
""").mapInArrow(upload_batch, "").show()

In [None]:
arrow_client.triplet_load_done("trips")

## Running an algorithm

In [None]:
from graphdatascience import Graph
G = gds.graph.get("trips")
gds.degree.stream(G)
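
Degree centrality scores each node by the number of relationships attached to it. As a rough illustration on a toy edge list, the sketch below counts outgoing relationships per source node, which we assume matches the default orientation used by GDS. The station IDs are made up and the helper `out_degree` is hypothetical.

```python
from collections import Counter


def out_degree(edges):
    """Count outgoing relationships per node, including nodes with none."""
    counts = Counter(source for source, _ in edges)
    nodes = {node for edge in edges for node in edge}
    return {node: counts.get(node, 0) for node in sorted(nodes)}


# Made-up (sourceNode, targetNode) pairs; repeated trips count as parallel relationships
toy_trips = [(72, 79), (72, 82), (79, 82), (72, 79)]
print(out_degree(toy_trips))  # {72: 3, 79: 1, 82: 0}
```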

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [shut down the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete the GDS session, then stop the Spark session to release its resources
sessions.delete(session_name="trips")
spark.stop()
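
If deleting the GDS session fails, the Spark session would otherwise be left running. One possible way to guard against that is a `try`/`finally` wrapper, sketched here with plain callables so it runs without either service; the helper name `clean_up` is hypothetical.

```python
def clean_up(delete_gds_session, stop_spark):
    """Run both teardown steps; Spark is stopped even if the first step raises."""
    try:
        delete_gds_session()
    finally:
        stop_spark()
```

In this notebook it could be called as `clean_up(lambda: sessions.delete(session_name="trips"), spark.stop)`.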