# K-Nearest Neighbor (KNN) Similarity
This notebook demonstrates running FastRP embeddings and KNN on the H&M dataset of customer purchases. These commands could be wrapped in a service or batch job that reruns on a schedule as the dataset changes over time.

## Setup

In [14]:
from graphdatascience import GraphDataScience
from dotenv import load_dotenv
import os
import pandas as pd

In [15]:
pd.set_option('display.max_rows', 12)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.width', 0)

In [16]:
load_dotenv('.env', override=True)

# Use Neo4j URI and credentials according to our setup
gds = GraphDataScience(
    os.getenv('NEO4J_URI'),
    auth=(os.getenv('NEO4J_USERNAME'),
          os.getenv('NEO4J_PASSWORD')),
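    # Parse the flag explicitly instead of calling eval() on an environment variable
    aura_ds=os.getenv('AURA_DS', 'false').strip().lower() == 'true')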

# Necessary if you enabled Arrow on the db - this is true for AuraDS
gds.set_database("neo4j")

In [17]:
def clear_all_graphs():
    """Drop every in-memory graph projection currently in the GDS graph catalog."""
    g_names = gds.graph.list().graphName.tolist()
    for g_name in g_names:
        g = gds.graph.get(g_name)
        g.drop()

## Clear Past Analysis

In [18]:
clear_all_graphs()

In [19]:
gds.run_cypher('MATCH(:Article)-[r:CUSTOMERS_ALSO_PURCHASED]->() DELETE r')

## Apply GDS FastRP Node Embeddings and K-Nearest Neighbor (KNN) Similarity


First, apply a graph projection to structure the portion of the graph we need in an optimized in-memory format for graph ML.

In [20]:
%%time
# graph projection
g, _ = gds.graph.project('proj',['Customer', 'Article', 'Product'],
                         {'PURCHASED':{'orientation':'UNDIRECTED'}, 'VARIANT_OF':{'orientation':'UNDIRECTED'}},
                         readConcurrency=20)

CPU times: user 7.58 ms, sys: 5.03 ms, total: 12.6 ms
Wall time: 248 ms


Next, we will generate node embeddings for the similarity calculation. In this case, we will use FastRP (Fast Random Projection), a fast, scalable, and robust embedding algorithm. FastRP starts from sparse random vectors and combines them across each node's neighborhood using linear algebra, so nodes with similar neighborhoods end up with similar embeddings.

In [21]:
%%time
# embeddings (writing back Article embeddings in case we want to introspect later)
gds.fastRP.mutate(g, mutateProperty='embedding', embeddingDimension=256, randomSeed=7474, concurrency=20)
gds.graph.writeNodeProperties(g, ['embedding'], ['Article'])

CPU times: user 11 ms, sys: 4.95 ms, total: 16 ms
Wall time: 559 ms


writeMillis                  105
graphName                   proj
nodeProperties       [embedding]
propertiesWritten          21596
Name: 0, dtype: object
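To build intuition for what the FastRP step produces, here is a minimal NumPy sketch of the general idea: sparse random vectors that are repeatedly averaged over neighbors. It is a simplified illustration, not the GDS implementation. The function name `fastrp_sketch`, the dense adjacency matrix, and the iteration weights are assumptions made for this example, and GDS options such as degree normalization are left out.

```python
import numpy as np


def fastrp_sketch(adj, dim=8, weights=(0.0, 1.0, 1.0), seed=7474):
    """Toy FastRP-style embedding for a small, dense adjacency matrix.

    Hypothetical helper for illustration only; not the GDS implementation.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    rng = np.random.default_rng(seed)

    # Very sparse random projection: most entries are zero, the rest are +/- sqrt(3).
    r = rng.choice([-1.0, 0.0, 1.0], size=(n, dim), p=[1 / 6, 2 / 3, 1 / 6]) * np.sqrt(3.0)

    # Row-normalize the adjacency matrix so each step averages over neighbors.
    deg = adj.sum(axis=1, keepdims=True)
    transition = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)

    current = r
    embedding = np.zeros((n, dim))
    for w in weights:
        # Propagate one hop further, then L2-normalize each row.
        current = transition @ current
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        current = np.divide(current, norms, out=np.zeros_like(current), where=norms > 0)
        # Each iteration contributes to the final embedding with its own weight.
        embedding += w * current
    return embedding


# Nodes 0 and 1 share exactly the same neighbors (2 and 3),
# so this sketch assigns them identical embeddings.
toy_adj = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
])
toy_embeddings = fastrp_sketch(toy_adj, dim=8)
print(toy_embeddings.shape)
```

In the projection above, two articles bought by overlapping sets of customers play the role of nodes 0 and 1: their neighborhoods overlap, so their embeddings land close together.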

This is what the node embeddings look like:

In [22]:
gds.run_cypher('MATCH(n:Article) RETURN n.articleId, n.embedding LIMIT 3')

,n.articleId,n.embedding
0,108775015,"[0.05234747380018234, 0.13151785731315613, -0.05253056809306145, -0.041636157780885696, 0.11699235439300537, -0.10946889221668243, -0.07382769137620926, -0.020323529839515686, 0.013979077339172363, 0.1326017528772354, -0.06144546717405319, 0.16738006472587585, -0.09897221624851227, 0.0878610908985138, -0.020181160420179367, 0.03383782505989075, 0.03157186508178711, 0.06264729052782059, -0.061635278165340424, -0.04274582862854004, -0.05333463102579117, 0.008594872429966927, -0.006356884725391..."
1,108775044,"[0.055843550711870193, 0.06728993356227875, -0.032038092613220215, -0.06382659077644348, 0.15777777135372162, -0.004036028869450092, 0.1188691258430481, -0.0810934454202652, 0.07403451949357986, 0.02506580390036106, -0.06376611441373825, 0.09702759981155396, -0.14350834488868713, 0.10597515106201172, -0.08877047896385193, -0.017673328518867493, -0.06682626903057098, 0.10495038330554962, 0.023164458572864532, -0.07564680278301239, 0.05063466727733612, 0.14934001863002777, 0.003125160932540893..."
2,110065001,"[-0.029060423374176025, 0.05402792617678642, -0.015207819640636444, 0.060010239481925964, -0.09379316866397858, -0.05362923443317413, -0.056747082620859146, 0.03353670984506607, 0.0766131803393364, -0.06844352185726166, -0.06377026438713074, 0.0030390238389372826, 0.007028941065073013, -0.16843706369400024, -0.02670716866850853, 0.12507864832878113, -0.11475594341754913, -0.007393300533294678, 0.06577442586421967, -0.03280405327677727, 0.009825348854064941, 0.017182160168886185, 0.0515425056..."


Finally, we can run similarity inference with K-Nearest Neighbor (KNN) and write the results back to the graph.
We will use a fairly low similarity cutoff of 0.75 to widen the result set for exploration. A higher cutoff can be applied at query time if needed.
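As a rough illustration of how a similarity cutoff filters neighbors, the sketch below runs a brute-force cosine KNN over a few toy vectors. It rests on several assumptions: the function name `knn_cosine` is hypothetical, it compares every pair exactly (GDS KNN uses an approximate, iterative search), and it uses raw cosine similarity, so its scores may not be on the same scale as the `score` values GDS writes.

```python
import numpy as np


def knn_cosine(embeddings, top_k=10, similarity_cutoff=0.75):
    """Brute-force cosine KNN with a similarity cutoff (hypothetical helper)."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = np.divide(x, norms, out=np.zeros_like(x), where=norms > 0)

    # Pairwise cosine similarity; exclude self-matches on the diagonal.
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)

    pairs = []
    for i in range(sim.shape[0]):
        # Take the top_k most similar candidates, then drop those below the cutoff.
        for j in np.argsort(-sim[i])[:top_k]:
            if sim[i, j] >= similarity_cutoff:
                pairs.append((i, int(j), float(sim[i, j])))
    return pairs


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for source, target, score in knn_cosine(toy_vectors, top_k=2, similarity_cutoff=0.75):
    print(source, target, round(score, 3))
```

Only the first two vectors point in nearly the same direction, so they are the only pair that survives the cutoff; the third vector gets no neighbors at all. A lower cutoff would keep more, weaker pairs.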

In [23]:
%%time
# KNN
gds.knn.write(g, nodeProperties=['embedding'], nodeLabels=['Article'],
              writeRelationshipType='CUSTOMERS_ALSO_PURCHASED', writeProperty='score',
              sampleRate=1.0, maxIterations=1000, similarityCutoff=0.75, concurrency=20);

CPU times: user 178 ms, sys: 23.9 ms, total: 202 ms
Wall time: 3.4 s
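Because the relationships were written with a 0.75 cutoff, a stricter threshold can be applied when reading them back. The sketch below builds one possible parameterized Cypher query for that. The helper name `similar_articles_query`, the default threshold of 0.9, and the returned column aliases are assumptions for illustration; the relationship type, `score` property, and `articleId` property come from the cells above.

```python
def similar_articles_query(min_score=0.9, limit=10):
    """Build a Cypher query and parameters that apply a stricter cutoff at query time.

    Hypothetical helper; the threshold and aliases are illustrative choices.
    """
    query = (
        "MATCH (a:Article)-[r:CUSTOMERS_ALSO_PURCHASED]->(b:Article) "
        "WHERE r.score >= $minScore "
        "RETURN a.articleId AS articleId, b.articleId AS similarArticleId, r.score AS score "
        "ORDER BY score DESC LIMIT $limit"
    )
    params = {"minScore": min_score, "limit": limit}
    return query, params


query, params = similar_articles_query(min_score=0.9, limit=5)
print(query)
print(params)

# With the connection from the Setup section, this could then be run as:
# gds.run_cypher(query, params=params)
```

Passing the threshold as a query parameter keeps the query text constant, so the same statement can be reused with different cutoffs.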


## Clean Up

In [11]:
g.drop()

graphName    proj
database