# Graph Data Science Demo

Now that we've created our *huge* (1.7B relationships! 244M nodes!) graph projections, let's do some data science.

The point of this demo is to show that enterprise graph data science is simple, fast, and easy using GDS. We're going to take our citation network dataset and build up a quick recommendations workflow by (1) paring it down to the relevant data, (2) calculating a graph embedding to encode all the relevant topological data for each node in our graph, and then (3) building up a nearest neighbors graph - based on those embeddings - so we can find out which papers are similar based on the structure of the graph.

In the real world, you might use that similarity graph as an alternative to traditional collaborative filtering methods. It's more scalable and flexible, and can look beyond one-hop relationships. For this demo, we'll build up our graph and then take a peek at the results in Bloom.

### Set up & Initialization

In [1]:
%%capture
pip install graphdatascience==1.1.0rc1 ipywidgets jupyter

In [2]:
# Client import
from graphdatascience import GraphDataScience

# Replace with the actual URI, username and password
CONNECTION_URI = "neo4j+s://demo2.graphconnect.app:7687"
USERNAME = "neo4j"
with open('pass.txt', mode='r') as f:
    PASSWORD = f.readline().strip()

# Client instantiation
gds = GraphDataScience(
    CONNECTION_URI,
    auth=(USERNAME, PASSWORD)
)

### Bind the graph projection to a graph object 
The GDS Python Client works with graph objects in Python. If we were constructing the graph from a Neo4j database (or a pandas DataFrame), the projection call would automatically return a graph object. Since we're using the graph that we just created with custom Arrow import code, we need to bind it to a graph object by name using `get`.

In [3]:
G=gds.graph.get("gcdemo")

### Filter the demo graph down to a smaller graph 

Since this is a **10-minute live demo**, we want to keep it interesting and not have everyone sitting around watching code run. We're going to start with a workflow to cut down our full graph into something more manageable.

We'll start by adding degree centrality to our graph, so we know how many relationships each node has. We can then use that to filter out nodes that are too densely connected, as well as orphan nodes.

In [4]:
result = gds.degree.mutate(
    G, 
    concurrency=224,
    mutateProperty="degree"
)

result

nodePropertiesWritten                                             244160499
centralityDistribution    {'p99': 78.00048065185547, 'min': 0.0, 'max': ...
mutateMillis                                                              0
postProcessingMillis                                                   1377
preProcessingMillis                                                       0
computeMillis                                                             1
configuration             {'jobId': 'a898239d-cfe9-45c9-8461-f2421b850c9...
Name: 0, dtype: object
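To make the `degree` property concrete, here is a minimal pure-Python sketch of the same idea on a toy edge list. The node names and edges are made up, and the sketch assumes outgoing relationships are what gets counted; it is an illustration, not how GDS computes degree internally.

```python
from collections import Counter

# Hypothetical toy data: each pair means "source cites target".
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
nodes = ["a", "b", "c", "d", "e"]  # "e" has no relationships at all


def out_degree(nodes, edges):
    """Count outgoing relationships per node; nodes with none get 0.0."""
    counts = Counter(source for source, _ in edges)
    return {node: float(counts.get(node, 0)) for node in nodes}


degree = out_degree(nodes, edges)
print(degree)
```

Nodes that end up with a degree of `0.0` are the ones a `degree > 0` filter would discard.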

Now we can use a subgraph projection to create a new analysis graph that contains only the `Paper` nodes and `CITES` relationships, and removes dense nodes (degree > 25) and orphans (degree = 0) that won't be informative for our analysis. We're also going to drop unlabeled papers, because they're missing information.

In [5]:
analysis_graph, res = gds.beta.graph.project.subgraph(
  'analysis_graph',
  G,
  'n:Paper AND (n.degree < 26.0 AND n.degree > 0.0) AND (n.flag >= 0)',
  'r:CITES',
  concurrency=224
)

res

fromGraphName                                                    gcdemo
nodeFilter            n:Paper AND (n.degree < 26.0 AND n.degree > 0....
relationshipFilter                                              r:CITES
graphName                                                analysis_graph
nodeCount                                                        655087
relationshipCount                                                439868
projectMillis                                                     12608
Name: 0, dtype: object
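The node filter reads like a boolean predicate over each node, so it can help to see it as plain Python. The sketch below applies the same conditions to a handful of made-up node records. It assumes a negative `flag` marks an unlabeled paper, and that a relationship is kept only when both of its endpoints pass the node filter.

```python
# Hypothetical node records mirroring the properties used in the filter.
papers = [
    {"id": 1, "labels": {"Paper"}, "degree": 3.0, "flag": 1},
    {"id": 2, "labels": {"Paper"}, "degree": 0.0, "flag": 0},    # orphan
    {"id": 3, "labels": {"Paper"}, "degree": 40.0, "flag": 1},   # too dense
    {"id": 4, "labels": {"Author"}, "degree": 5.0, "flag": 1},   # not a Paper
    {"id": 5, "labels": {"Paper"}, "degree": 25.0, "flag": -1},  # negative flag
    {"id": 6, "labels": {"Paper"}, "degree": 2.0, "flag": 0},
]
# Hypothetical CITES relationships as (source id, target id) pairs.
cites = [(1, 6), (1, 3), (6, 2)]


def keep(node):
    """Python equivalent of the node filter string used in the projection."""
    return (
        "Paper" in node["labels"]
        and 0.0 < node["degree"] < 26.0
        and node["flag"] >= 0
    )


kept_ids = {node["id"] for node in papers if keep(node)}
# Assumption: a relationship survives only if both endpoints survive.
kept_edges = [(s, t) for s, t in cites if s in kept_ids and t in kept_ids]
print(sorted(kept_ids), kept_edges)
```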

### Recommendation Recipe

The point of this demo is that it's easy to do graph data science, at scale, and solve real tasks. What we're going to do here is build a simple recommendation engine, based on the topology of our citation network. The first step is to calculate a graph embedding, which encodes the structural information around each node as a fixed-length vector of numbers. Then we'll build a nearest neighbors graph using KNN. The nearest neighbors graph connects papers that are similar to each other, based on the structure of the citation network. KNN is an *approximate* method, so it scales very well.

In the real world, you would use your new `SIMILAR` relationships to power recommendations: when someone reads a paper, you could recommend a similar paper, based on your calculations. In this demo, we'll export our data and take a look at the results in Bloom.

In [6]:
res=gds.fastRP.mutate(
    analysis_graph,
    embeddingDimension=25,
    concurrency=224,
    mutateProperty="fastRP_embedding"
)

res

nodePropertiesWritten                                               655087
mutateMillis                                                             0
nodeCount                                                           655087
preProcessingMillis                                                      0
computeMillis                                                          307
configuration            {'nodeSelfInfluence': 0, 'relationshipWeightPr...
Name: 0, dtype: object
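For intuition about what a random-projection embedding does, here is a heavily simplified pure-Python sketch: give every node a sparse random vector, repeatedly replace it with the average of its neighbours' vectors, and sum the weighted results. The toy adjacency list, the dimension, and the iteration weights are all illustrative assumptions, and the normalisation steps of the real algorithm are left out, so this is not the GDS implementation.

```python
import random


def random_projection_embedding(adjacency, dim=4,
                                iteration_weights=(0.0, 1.0, 1.0), seed=42):
    """Simplified random-projection embedding over an adjacency dict."""
    rng = random.Random(seed)
    # Step 1: a sparse random starting vector per node (entries in -1, 0, +1).
    current = {
        node: [rng.choice((-1.0, 0.0, 0.0, 1.0)) for _ in range(dim)]
        for node in adjacency
    }
    result = {node: [0.0] * dim for node in adjacency}
    for weight in iteration_weights:
        # Step 2: each node takes the mean of its neighbours' current vectors.
        averaged = {}
        for node, neighbours in adjacency.items():
            if neighbours:
                averaged[node] = [
                    sum(current[n][i] for n in neighbours) / len(neighbours)
                    for i in range(dim)
                ]
            else:
                averaged[node] = [0.0] * dim
        current = averaged
        # Step 3: add this iteration's vectors, scaled by its weight.
        for node in adjacency:
            result[node] = [
                r + weight * c for r, c in zip(result[node], current[node])
            ]
    return result


# Hypothetical undirected toy graph: node -> list of neighbours.
adjacency = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
embedding = random_projection_embedding(adjacency)
print(embedding["a"])
```

Each extra iteration mixes in information from one hop further away, which is how an embedding can capture more than direct neighbours.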

In [7]:
res=gds.knn.mutate(
    analysis_graph,
    topK=3,
    similarityCutoff=0.6,
    sampleRate=0.25,
    randomJoins=5,
    deltaThreshold=0.01,
    initialSampler="randomWalk",
    concurrency=224,
    nodeProperties="fastRP_embedding",
    mutateProperty="score",
    mutateRelationshipType='SIMILAR'
)

res

ranIterations                                                            20
nodePairsConsidered                                                84256942
didConverge                                                            True
preProcessingMillis                                                       0
computeMillis                                                         14205
mutateMillis                                                            637
postProcessingMillis                                                     -1
nodesCompared                                                        655087
relationshipsWritten                                                 436276
similarityDistribution    {'p1': 0.6199989318847656, 'max': 1.0000038146...
configuration             {'topK': 3, 'maxIterations': 100, 'randomJoins...
Name: 0, dtype: object

That's **84 million** comparisons!
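To see what the approximate KNN run is standing in for, here is a minimal exact (brute-force) version on toy vectors: score every ordered pair, drop pairs below the cutoff, and keep the best `top_k` per node. The embeddings are made up, and plain cosine similarity is used as a stand-in metric; the metric GDS applies to the embedding property may differ in detail.

```python
import math


def cosine(u, v):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def exact_top_k(embeddings, top_k=3, similarity_cutoff=0.6):
    """Return (source, target, score) for each node's best-scoring neighbours."""
    pairs = []
    for source, vector in embeddings.items():
        scored = [
            (cosine(vector, other), target)
            for target, other in embeddings.items()
            if target != source
        ]
        # Drop weak matches, then sort best-first (ties broken by node id).
        scored = [(s, t) for s, t in scored if s >= similarity_cutoff]
        scored.sort(key=lambda st: (-st[0], st[1]))
        pairs.extend((source, target, score) for score, target in scored[:top_k])
    return pairs


# Hypothetical 2-dimensional embeddings for four papers.
toy_embeddings = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.0, 1.0],
    "p4": [-1.0, 0.0],
}
similar_pairs = exact_top_k(toy_embeddings)
print(similar_pairs)
```

For scale: exhaustively scoring all 655,087 nodes against each other would mean roughly 429 billion ordered pairs, versus the 84 million pairs the run above considered.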

We can now persist the analysis graph, including the new `SIMILAR` relationships, as a new Neo4j database using `gds.graph.export`.

In [8]:
from time import time

res=gds.graph.export(
    analysis_graph,
    dbName=f"demo.Database.{int(time())}",
    writeConcurrency=224
)

res

dbName                       demo.database.1654534431
graphName                              analysis_graph
nodeCount                                      655087
relationshipCount                              876144
relationshipTypeCount                               2
nodePropertyCount                             2620348
relationshipPropertyCount                      436276
writeMillis                                      4164
Name: 0, dtype: object
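Once the `SIMILAR` relationships exist, serving a recommendation is essentially a lookup. The sketch below shows that step in plain Python over made-up `(source, target, score)` rows. The paper ids, scores, and helper names are hypothetical, and in practice you would query the exported database rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical rows shaped like SIMILAR relationships: (source, target, score).
similar = [
    ("p1", "p3", 0.71),
    ("p1", "p2", 0.93),
    ("p2", "p1", 0.93),
    ("p3", "p4", 0.65),
]


def build_index(rows):
    """Group similar papers by source, ordered from most to least similar."""
    index = defaultdict(list)
    for source, target, score in rows:
        index[source].append((target, score))
    for neighbours in index.values():
        neighbours.sort(key=lambda ts: -ts[1])
    return index


def recommend(index, paper_id, limit=3):
    """Return up to `limit` similar paper ids; empty list for unknown papers."""
    return [target for target, _ in index.get(paper_id, [])[:limit]]


index = build_index(similar)
print(recommend(index, "p1"))
```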