# Neo4j Graph Data Science Starter Kit
This notebook acts as a simple starter kit for using the Neo4j GDS library from Python.
It contains code fragments to do the following:
1. Set up a connection to Neo4j and read/write data.
2. Create graph projections to run your algorithm on.
3. Run algorithms, then stream the results or write them back to Neo4j.
4. Use the algorithm results as new features for your predictive models.

This example uses the Game of Thrones dataset that ships with the Neo4j Graph Data Science sandbox. You can create your own sandbox for free here:
https://sandbox.neo4j.com/login?usecase=graph-data-science

## 1. Setting up the Neo4j Driver
Enter your own Neo4j credentials here:

In [1]:
url = "bolt://34.201.68.240:33513"
user = "neo4j"
password = "chicken-nuggets" 

In [2]:
from neo4j import GraphDatabase
driver = GraphDatabase.driver(url, auth=(user, password))
neo4j = driver.session()
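
The session opened above stays open for the rest of the notebook. For longer-lived code, one option is to open a short-lived session per query and let a `with` block close it. The helper below is a minimal sketch of that pattern: `fetch_rows` is a hypothetical name (not part of the driver API), and it assumes it is given a driver object like the one created above.

```python
def fetch_rows(driver, query, **params):
    """Run one Cypher query and return its rows as a list of dicts.

    Hypothetical helper for illustration: it opens a session, reads the
    full result, and lets the `with` block close the session afterwards.
    """
    with driver.session() as session:
        result = session.run(query, **params)
        return result.data()

# Usage, assuming `driver` was created as in the cell above:
# rows = fetch_rows(driver, "MATCH (n:Person) RETURN n.name AS name LIMIT $limit", limit=10)
# df = pd.DataFrame(rows)
```

Reading the whole result inside the `with` block matters here, because the result can no longer be consumed once its session is closed.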

### Example - Reading Neo4j results using the driver

In [3]:
import pandas as pd
result = neo4j.run('MATCH (n:Person) RETURN n.name AS name, n.age as age LIMIT 10')
df = pd.DataFrame(result.data())
print(df)

                    name   age
0    Gunthor son of Gurn   NaN
1  High Septon (fat_one)   NaN
2        Jaime Lannister  39.0
3         Gregor Clegane  35.0
4            Andros Brax   NaN
5           Roose Bolton  45.0
6         Wylis Manderly  53.0
7          Medger Cerwyn   NaN
8       Harrion Karstark   NaN
9         Halys Hornwood   NaN


## 2. Creating a graph projection
As an example, we want to analyze which people are most influential using the PageRank algorithm.

First, create a graph projection `interactions` that contains only the pattern we are interested in: `(:Person)-[:INTERACTS]->(:Person)`. 

Then, go through the following steps:
- Check if we have enough memory to generate it.
- Check if the graph projection already exists, and delete it if so.
- Create the graph projection.


### Estimating the required size of the projection

In [4]:
# Run the Cypher query
result = neo4j.run("""
CALL gds.graph.create.cypher.estimate(
    'MATCH (p) WHERE p:Person RETURN id(p) as id',
    'MATCH (p)-[:INTERACTS]->(p2:Person) RETURN id(p) AS source, id(p2) AS target')
""")

# Print the results
row = result.single()
print("Estimates for creating this graph projection:")
print(row['requiredMemory'], " memory")
print(row['nodeCount']," nodes")
print(row['relationshipCount']," rels")

Estimates for creating this graph projection:
333 KiB  memory
2642  nodes
16747  rels
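
The estimate reports `requiredMemory` as a human-readable string such as `333 KiB`. To compare it against a memory budget in code, it helps to convert it to bytes first. The sketch below assumes the string holds a single number followed by a binary unit, as in the output above; `to_bytes` and the 512 MiB budget are hypothetical, and any other string shape raises a `ValueError` rather than being guessed at.

```python
import re

# Binary unit multipliers; assumed to cover the units that appear in the estimate.
_UNITS = {"Bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def to_bytes(text):
    """Convert a string like '333 KiB' to a number of bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized memory string: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])

budget = 512 * 1024**2  # assumed budget of 512 MiB, for illustration only
required = to_bytes("333 KiB")
print(required, "bytes required; fits in budget:", required <= budget)
```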


### Clear existing in-memory graph (if it exists)


In [5]:
# If the projected graph exists, this query drops it; otherwise it returns no rows and result.single() is None.
result = neo4j.run("""
CALL gds.graph.exists($name) YIELD exists
WHERE exists
CALL gds.graph.drop($name) YIELD graphName
RETURN graphName + " was dropped." as message
""", name = 'interactions')

# Print the results
print(result.single())

<Record message='interactions was dropped.'>


### Creating the new graph projection

In [6]:
# Create a weighted Cypher projection graph of (:Person)-[:INTERACTS]->(:Person)
result = neo4j.run("""
CALL gds.graph.create.cypher(
    'interactions',
    'MATCH (p) WHERE p:Person RETURN id(p) as id',
    'MATCH (p)-[i:INTERACTS]->(p2:Person) RETURN id(p) AS source, i.weight as weight, id(p2) AS target')
""")

# Print the results
row = result.single()
print(row['nodeCount'], " nodes projected.")
print(row['relationshipCount'], " rels projected.")
print(row['createMillis']," ms to create the projection.")

2166  nodes projected.
3907  rels projected.
12  ms to create the projection.


## 3. Running graph algorithms
Now that we have our graph projection, we're ready to run the algorithm!

As always, it is best practice to first check that we have enough memory to run the algorithm.

In [7]:
result = neo4j.run("""
CALL gds.pageRank.stream.estimate('interactions',  { relationshipWeightProperty: 'weight' })
""")

print(result.single()['requiredMemory'], ' memory required to run the algorithm.')

84 KiB  memory required to run the algorithm.


### Run the algorithm (stream mode)
First, use 'stream' mode to inspect the results:

In [8]:
result = neo4j.run("""
CALL gds.pageRank.stream('interactions', { relationshipWeightProperty: 'weight'}) 
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as character, score 
ORDER BY score DESC
""")

df = pd.DataFrame(result.data())
print(df)

              character      score
0      Tyrion Lannister  16.006423
1       Tywin Lannister   8.975134
2                 Varys   8.487427
3     Stannis Baratheon   8.152841
4         Theon Greyjoy   5.456874
...                 ...        ...
2161     Malaquo Maegyr   0.150000
2162              Morgo   0.150000
2163      Old Bill Bone   0.150000
2164               Scar   0.150000
2165      Shrouded Lord   0.150000

[2166 rows x 2 columns]
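
The floor of 0.150000 at the bottom of the table is consistent with the non-normalized PageRank formulation, in which every node starts from a base score of `1 - dampingFactor` (0.15 for a damping factor of 0.85) and nodes without incoming relationships receive nothing on top of it. The pure-Python sketch below illustrates that formulation on a hypothetical toy graph. It is a simplified illustration under that assumption, not the GDS implementation, and the node names and weights are made up.

```python
def weighted_pagerank(edges, damping=0.85, max_iterations=20, tolerance=1e-7):
    """Iterate score(v) = (1 - d) + d * sum over u->v of score(u) * w(u, v) / out_weight(u)."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})

    # Total outgoing weight per node, used to split a node's score across its edges.
    out_weight = {n: 0.0 for n in nodes}
    for src, _, weight in edges:
        out_weight[src] += weight

    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(max_iterations):
        new_scores = {n: 1.0 - damping for n in nodes}
        for src, dst, weight in edges:
            new_scores[dst] += damping * scores[src] * weight / out_weight[src]
        delta = max(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tolerance:  # stop once no score moves by more than the tolerance
            break
    return scores

# Hypothetical toy graph: (source, target, weight).
edges = [("A", "C", 2.0), ("B", "C", 1.0), ("C", "D", 1.0)]
for name, score in sorted(weighted_pagerank(edges).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.4f}")
```

Nodes `A` and `B` have no incoming edges, so they stay at the base score, mirroring the characters at the bottom of the table above.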


### Run the algorithm (write mode)
Then, use 'write' mode to write the results back to the Neo4j database.

In [9]:
import pprint 

result = neo4j.run("""
CALL gds.pageRank.write('interactions', { writeProperty: 'pageRank', relationshipWeightProperty: 'weight' })
""")

pprint.pprint(result.data())

[{'computeMillis': 48,
  'configuration': {'cacheWeights': False,
                    'concurrency': 4,
                    'dampingFactor': 0.85,
                    'maxIterations': 20,
                    'nodeLabels': ['*'],
                    'relationshipTypes': ['*'],
                    'relationshipWeightProperty': 'weight',
                    'sourceNodes': [],
                    'sudo': False,
                    'tolerance': 1e-07,
                    'writeConcurrency': 4,
                    'writeProperty': 'pageRank'},
  'createMillis': 0,
  'didConverge': True,
  'nodePropertiesWritten': 2166,
  'ranIterations': 15,
  'writeMillis': 3}]


## 4. Using graph features in ML
Now that we have the PageRank of each of the people in the Game of Thrones dataset, let's use it as a predictor variable.

We're going to predict whether a character will die based on `age`, `gender`, `is_knight`, `house` and `pageRank` of the corresponding node:

In [10]:
# Consider only the characters that have a defined age.
result = neo4j.run(
"""
MATCH (p:Person)
WHERE p.age >= 0
MATCH (p)-[:BELONGS_TO]->(h:House) 
RETURN p.name as name, p.age as age, p.gender as gender,(p:Knight) as is_knight,
       collect(h.name)[0] as house, p.pageRank as pagerank, (p:Dead) as is_dead 
ORDER BY pagerank DESC
""")

df = pd.DataFrame(result.data())
print(df)

                 name  age  gender  is_knight       house   pagerank  is_dead
0    Tyrion Lannister   32    male      False   Lannister  16.006423    False
1     Tywin Lannister   58    male      False   Lannister   8.975134     True
2       Theon Greyjoy   27    male      False     Greyjoy   5.456874    False
3         Sansa Stark   19  female      False       Stark   4.504770    False
4         Walder Frey   97    male      False        Frey   4.160120    False
..                ...  ...     ...        ...         ...        ...      ...
407  Walder Goodbrook   15    male      False   Goodbrook   0.150000    False
408           Nettles  100  female      False      Blacks   0.150000     True
409  Humfrey Wagstaff   75  female       True    Wagstaff   0.150000    False
410  Lucos Chyttering   22    male      False  Chyttering   0.150000    False
411      Criston Cole   48    male       True        Cole   0.150000     True

[412 rows x 7 columns]


### Process the data into a format the model expects

In [11]:
# Factorize non-numeric columns
from sklearn.preprocessing import MinMaxScaler

genders = pd.DataFrame(pd.factorize(df['gender'])[0])
knight = pd.DataFrame(pd.factorize(df['is_knight'])[0])
houses = pd.DataFrame(pd.factorize(df['house'])[0])
dead  = pd.DataFrame(pd.factorize(df['is_dead'])[0])

# Combine the encoded columns and scale each one to the range [0, 1].
data = pd.concat([houses, genders, df['age'], knight, df['pagerank'], dead], axis=1) 
data.columns = ['house', 'gender', 'age', 'is_knight', 'pagerank', 'is_dead']
scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(data) 
data.loc[:,:] = scaled_values

print(data)

        house  gender   age  is_knight  pagerank  is_dead
0    0.000000     0.0  0.32        0.0  1.000000      0.0
1    0.000000     0.0  0.58        0.0  0.556565      1.0
2    0.015385     0.0  0.27        0.0  0.334683      0.0
3    0.030769     1.0  0.19        0.0  0.274638      0.0
4    0.046154     0.0  0.97        0.0  0.252902      0.0
..        ...     ...   ...        ...       ...      ...
407  0.938462     0.0  0.15        0.0  0.000000      0.0
408  0.953846     1.0  1.00        0.0  0.000000      1.0
409  0.969231     1.0  0.75        1.0  0.000000      0.0
410  0.984615     0.0  0.22        0.0  0.000000      0.0
411  1.000000     0.0  0.48        1.0  0.000000      1.0

[412 rows x 6 columns]
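
The preprocessing above does two things: `pd.factorize` replaces each distinct value with an integer code in order of first appearance, and `MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. The pure-Python sketch below reproduces both steps on a few values taken from the tables above. The helper names are hypothetical, and the code illustrates the arithmetic only, not the pandas or scikit-learn internals.

```python
def factorize(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes, seen = [], {}
    for value in values:
        codes.append(seen.setdefault(value, len(seen)))
    return codes

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# House names of the first five rows in the query result.
houses = ["Lannister", "Lannister", "Greyjoy", "Stark", "Frey"]
print(factorize(houses))

# The highest three PageRank scores and the lowest one.
pageranks = [16.006423, 8.975134, 5.456874, 0.15]
print([round(x, 6) for x in min_max_scale(pageranks)])
```

Note that the house codes are arbitrary integers, so the scaled `house` column implies an ordering between houses that carries no real meaning.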


### Training the model
Let's see how much we can boost the model accuracy by using `pagerank` as a feature.

We train two models and compare their AUC scores:
1. A RandomForestClassifier that uses `[gender, house, age, is_knight]` to predict `is_dead`.
2. A RandomForestClassifier that uses `[gender, house, age, is_knight, pagerank]` to predict `is_dead`.

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score

X_old = data[['gender', 'house', 'age', 'is_knight']]
X_new = data[['gender', 'house', 'age', 'is_knight', 'pagerank']]
y = data['is_dead']

clf = RandomForestClassifier(random_state=0, class_weight="balanced", n_estimators=10)
cv = StratifiedShuffleSplit(random_state=0)

print("auc score without pagerank: ", cross_val_score(clf, X_old, y, cv=cv, scoring='roc_auc').mean())
print("auc score with pagerank: ", cross_val_score(clf, X_new, y, cv=cv, scoring='roc_auc').mean())

auc score without pagerank:  0.7049278846153847
auc score with pagerank:  0.7664663461538461
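
The `roc_auc` score can be read as the probability that a randomly chosen positive example (here, a dead character) gets a higher predicted score than a randomly chosen negative one, with ties counting as half. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one. The sketch below computes AUC by counting pairs directly; the labels and scores are made up for illustration, and this quadratic loop is meant to show the definition rather than to replace the scikit-learn scorer.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = positive) and predicted scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.4]
print(roc_auc(labels, scores))
```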


We see an improvement in AUC (from about 0.70 to about 0.77) when adding our new feature. However, we would need a lot more data to get a good model.

Keep in mind that this dataset is tiny, so the result is naturally very susceptible to randomness in the classifier and in the choice of train/test split.

## What's next?
To learn more about the different execution modes of algorithms, see:
https://neo4j.com/docs/graph-data-science/current/common-usage/running-algos/

To speed up your process, consider looking into native projections:
https://neo4j.com/docs/graph-data-science/current/management-ops/native-projection/

Read the docs on other algorithms, tips for modeling your data, and algo configurations:
https://neo4j.com/docs/graph-data-science/current/introduction/
