# K-Nearest Neighbor (KNN) Similarity
This notebook demonstrates running FastRP Embeddings and KNN across the H&M dataset of customer purchases.  Thes commands could be wrapped in a service or batch job to run & update on a recurring basis as the dataset is updated over time.

## Setup

In [1]:
from graphdatascience import GraphDataScience
from dotenv import load_dotenv
import os
import pandas as pd

In [2]:
load_dotenv('.env', override=True)

# Use Neo4j URI and credentials according to our setup
gds = GraphDataScience(
    os.getenv('NEO4J_URI'),
    auth=(os.getenv('NEO4J_USERNAME'),
          os.getenv('NEO4J_PASSWORD')),
    aura_ds=eval(os.getenv('AURA_DS').title()))

# Necessary if you enabled Arrow on the db - this is true for AuraDS
gds.set_database("neo4j")

In [3]:
def clear_all_graphs():
    g_names = gds.graph.list().graphName.tolist()
    for g_name in g_names:
        g = gds.graph.get(g_name)
        g.drop()

## Clear Past Analysis

In [4]:
clear_all_graphs()

In [5]:
gds.run_cypher('MATCH(:Article)-[r:CUSTOMERS_ALSO_PURCHASED]->() DELETE r')

## Apply GDS FastRP Node Embeddings and K-Nearest Neighbor (KNN) Similarity


First apply a graph projection to structure the portion of the graph we need in an optimized in-memory format for graph ML.

In [None]:
%%time
# graph projection
g, _ = gds.graph.project('proj',['Customer', 'Article', 'Product'],
                         {'PURCHASED':{'orientation':'UNDIRECTED'}, 'VARIANT_OF':{'orientation':'UNDIRECTED'}},
                         readConcurrency=20)

Next, we will generate node embeddings for similarity calculation.  In this case we will use FastRP (Fast Random Projection) which is a fast, scalable, and robust embedding algorithm. FastRP calculates embeddings using probabilistic sampling and linear algebra.

In [None]:
# embeddings (writing back Article embeddings in case we want to introspect later)
gds.fastRP.mutate(g, mutateProperty='embedding', embeddingDimension=256, randomSeed=7474, concurrency=20)
gds.graph.writeNodeProperties(g, ['embedding'], ['Article'])

This is what the node embeddings look like:

In [8]:
gds.run_cypher('MATCH(n:Article) RETURN n.articleId, n.embedding LIMIT 3')

Unnamed: 0,n.articleId,n.embedding
0,108775015,"[0.05234747380018234, 0.13151785731315613, -0...."
1,108775044,"[0.055843550711870193, 0.06728993356227875, -0..."
2,110065001,"[-0.029060423374176025, 0.05402792617678642, -..."


Finally, we can do our Similarity inference with KNN and write back to the graph.
We will use a slightly low cutoff of 0.75 similarity score to extend the result size for exploration.  We can provide a higher cutoff at query time if needed.

In [9]:
%%time
# graph projection
g, _ = gds.graph.project('proj',['Customer', 'Article', 'Product'],
                         {'PURCHASED':{'orientation':'UNDIRECTED'}, 'VARIANT_OF':{'orientation':'UNDIRECTED'}},
                         readConcurrency=20)

# embeddings (writing back Article embeddings in case we want to introspect later)
gds.fastRP.mutate(g, mutateProperty='embedding', embeddingDimension=256, randomSeed=7474, concurrency=20)
gds.graph.writeNodeProperties(g, ['embedding'], ['Article'])

# KNN
gds.knn.write(g, nodeProperties=['embedding'], nodeLabels=['Article'],
                  writeRelationshipType='CUSTOMERS_ALSO_PURCHASED', writeProperty='score', 
                  sampleRate=1.0,maxIterations=1000, similarityCutoff=0.75, concurrency=20);

Knn:   0%|          | 0/100 [00:00<?, ?%/s]

CPU times: user 108 ms, sys: 15.2 ms, total: 123 ms
Wall time: 4.6 s


## Clean Up

In [7]:
g.drop()

graphName                                                             proj
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                            35654
relationshipCount                                                   125412
configuration            {'relationshipProjection': {'PURCHASED': {'agg...
density                                                           0.000099
creationTime                           2023-10-26T17:36:31.990377882+00:00
modificationTime                       2023-10-26T17:36:33.358528963+00:00
schema                   {'graphProperties': {}, 'nodes': {'Customer': ...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'Customer': ...
Name: 0, dtype: object