# K-Nearest Neighbor (KNN) Similarity
This notebook demonstrates running FastRP embeddings and KNN on the H&M dataset of customer purchases. These commands could be wrapped in a service or batch job that reruns on a recurring basis as the dataset is updated over time.

In [30]:
import pandas as pd
import configparser

### Neo4j Settings
The `neo4j.ini` file holds the Neo4j connection properties that this notebook uses to connect to your Neo4j instance and load data. The ini file should be formatted as follows:

```
[NEO4J]
PASSWORD=<password>
USERNAME=<username, is 'neo4j' by default>
HOST=<host uri>
```
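
As a quick check of the format, the sketch below parses an example ini from a string instead of from disk and fails early if a required key is missing. The helper name `load_neo4j_settings` and all values in `SAMPLE_INI` are placeholders for illustration, not real connection details.

```python
import configparser

# Placeholder values for illustration only; substitute your own connection details.
SAMPLE_INI = """
[NEO4J]
PASSWORD=example-password
USERNAME=neo4j
HOST=neo4j+s://example.databases.neo4j.io
"""

REQUIRED_KEYS = ("HOST", "USERNAME", "PASSWORD")


def load_neo4j_settings(ini_text):
    """Parse ini text and return the NEO4J section as a dict (hypothetical helper)."""
    parser = configparser.RawConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("NEO4J"):
        raise KeyError("missing [NEO4J] section")
    section = parser["NEO4J"]
    # Option names are matched case-insensitively by configparser.
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError(f"missing keys in [NEO4J]: {missing}")
    return {key: section[key] for key in REQUIRED_KEYS}


settings = load_neo4j_settings(SAMPLE_INI)
print(settings["USERNAME"])
```

Validating the keys up front gives a clearer error than a failed connection attempt later in the notebook.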

In [31]:
config = configparser.RawConfigParser()
config.read('neo4j.ini')
HOST = config['NEO4J']['HOST']
USERNAME = config['NEO4J']['USERNAME']
PASSWORD = config['NEO4J']['PASSWORD']

### Connect to Graph Data Science (GDS)

In [32]:
from graphdatascience import GraphDataScience

# Use Neo4j URI and credentials according to your setup
gds = GraphDataScience(HOST, auth=(USERNAME, PASSWORD), aura_ds=True)

In [33]:
def clear_all_graphs():
    """Drop every graph projection currently held in the GDS graph catalog."""
    g_names = gds.graph.list().graphName.tolist()
    for g_name in g_names:
        g = gds.graph.get(g_name)
        gds.graph.drop(g)

## Clear Past Analysis

In [34]:
clear_all_graphs()

In [35]:
gds.run_cypher('MATCH(:Article)-[r:CUSTOMERS_ALSO_PURCHASED]->() DELETE r')

In [36]:
gds.run_cypher('MATCH(n:EstCustomer) REMOVE n:EstCustomer')
gds.run_cypher('MATCH(n:EstArticle) REMOVE n:EstArticle')
gds.run_cypher('MATCH(n:EstProduct) REMOVE n:EstProduct')

## Label Entities in Main Component
More than 99% of the data sits in the single largest graph component. We will label the nodes in this largest, or "main", component so they are easier to select for downstream analytics. The small minority of customers and articles that lie outside the main component are essentially cold starters. While out of scope for this demo, we can provide recommendations to that minority differently, e.g. by recommending the overall most popular articles to new customers and using content-based recommendations for new articles.
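
To make the idea of a "main" component concrete, here is a minimal pure-Python sketch on a hypothetical toy graph. It uses a simple union-find, which is only an illustration of the concept and not how GDS WCC is implemented; the node names are made up.

```python
from collections import Counter


def connected_components(nodes, edges):
    """Return {node: component_root} for an undirected graph using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {n: find(n) for n in nodes}


# Hypothetical toy data: c* are customers, a* are articles.
nodes = ["c1", "c2", "c3", "a1", "a2", "a3"]
edges = [("c1", "a1"), ("c2", "a1"), ("c2", "a2")]  # c3 and a3 have no purchases

components = connected_components(nodes, edges)
sizes = Counter(components.values())
main_root, main_size = sizes.most_common(1)[0]

main = sorted(n for n in nodes if components[n] == main_root)
cold_start = sorted(n for n in nodes if components[n] != main_root)
print(main)        # ['a1', 'a2', 'c1', 'c2']
print(cold_start)  # ['a3', 'c3']
```

Nodes in `cold_start` correspond to the customers and articles that would not receive the extra label and would need a fallback recommendation strategy.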

In [37]:
g, _ = gds.graph.project('proj',['Customer', 'Article', 'Product'],
                         {'PURCHASED':{'orientation':'UNDIRECTED'}, 'IS_PRODUCT':{'orientation':'UNDIRECTED'}}, 
                         readConcurrency=20)

In [38]:
gds.wcc.write(g, writeProperty='component').componentDistribution

{'p99': 2,
 'min': 1,
 'max': 1514151,
 'mean': 150.86009696250125,
 'p90': 1,
 'p50': 1,
 'p999': 3,
 'p95': 1,
 'p75': 1}

In [39]:
largest_component_id = gds.run_cypher('''
    MATCH(n) 
    RETURN n.component AS largestComponentId, count(n) AS numberOfEntities ORDER BY numberOfEntities DESC LIMIT 1
''').largestComponentId[0]

In [40]:
res = gds.run_cypher('''
    MATCH(n:Customer) WHERE n.component = $largestComponentId
    SET n:EstCustomer
    RETURN count(n)
''', params = {'largestComponentId':int(largest_component_id)})
print('-- Labeled Customers in Main Component ---')
print(res)
      
res = gds.run_cypher('''
    MATCH(n:Article) WHERE n.component = $largestComponentId
    SET n:EstArticle
    RETURN count(n)
''', params = {'largestComponentId':int(largest_component_id)})
print('-- Labeled Articles in Main Component ---')
print(res)
      
res = gds.run_cypher('''
    MATCH(n:Product) WHERE n.component = $largestComponentId
    SET n:EstProduct
    RETURN count(n)
''', params = {'largestComponentId':int(largest_component_id)})
print('-- Labeled Products in Main Component ---')
print(res)    

-- Labeled Customers in Main Component ---
   count(n)
0   1362264
-- Labeled Articles in Main Component ---
   count(n)
0    105071
-- Labeled Products in Main Component ---
   count(n)
0     46816


In [41]:
g.drop()

## Apply GDS FastRP Node Embeddings and K-Nearest Neighbor (KNN) Similarity
We will use a slightly lower similarity cutoff of 0.75 to extend the result size for exploration in the demo. A higher cutoff can be applied at query time if needed.
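
One way to apply a stricter cutoff at query time is to filter on the relationship's `score` property. The sketch below only builds a parameterized Cypher query and its parameters. The helper name `similar_articles_query` and the `articleId` property are assumptions for illustration and may not match your data model; the relationship type and `score` property are the ones written by the KNN step in this notebook.

```python
WRITE_TIME_CUTOFF = 0.75  # similarityCutoff used when the relationships are written


def similar_articles_query(article_id, min_score=0.9, limit=10):
    """Build a query for similar articles above a query-time cutoff (hypothetical helper)."""
    if min_score < WRITE_TIME_CUTOFF:
        # Relationships below the write-time cutoff were never stored,
        # so a lower query-time cutoff cannot return more results.
        raise ValueError(f"min_score must be at least {WRITE_TIME_CUTOFF}")
    query = """
    MATCH (a:Article {articleId: $articleId})-[r:CUSTOMERS_ALSO_PURCHASED]->(rec:Article)
    WHERE r.score >= $minScore
    RETURN rec.articleId AS articleId, r.score AS score
    ORDER BY score DESC
    LIMIT $limit
    """
    params = {"articleId": article_id, "minScore": min_score, "limit": limit}
    return query, params


# "example-id" is a placeholder, not a real article identifier.
query, params = similar_articles_query("example-id", min_score=0.9, limit=5)
print(params)
```

Once the relationships have been written, the result could be passed to `gds.run_cypher(query, params=params)` in the same way as the other queries in this notebook.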

In [42]:
%%time
# graph projection
g, _ = gds.graph.project('proj',['EstCustomer', 'EstArticle', 'EstProduct'],
                         {'PURCHASED':{'orientation':'UNDIRECTED'}, 'IS_PRODUCT':{'orientation':'UNDIRECTED'}}, 
                         readConcurrency=20)

# embeddings (writing back Article embeddings in case we want to introspect later)
gds.fastRP.mutate(g, mutateProperty='embedding', embeddingDimension=256, randomSeed=7474, concurrency=20)
gds.graph.writeNodeProperties(g, ['embedding'], ['EstArticle'])

# KNN
gds.knn.write(g, nodeProperties=['embedding'], nodeLabels=['EstArticle'],
              writeRelationshipType='CUSTOMERS_ALSO_PURCHASED', writeProperty='score',
              sampleRate=1.0, maxIterations=1000, similarityCutoff=0.75, concurrency=20);

CPU times: user 25.6 ms, sys: 982 µs, total: 26.6 ms
Wall time: 1min 46s


ranIterations                                                           116
didConverge                                                            True
nodePairsConsidered                                               220618923
preProcessingMillis                                                       0
computeMillis                                                         65887
writeMillis                                                            8160
postProcessingMillis                                                     -1
nodesCompared                                                        105071
relationshipsWritten                                                 427230
similarityDistribution    {'p1': 0.750579833984375, 'max': 1.00000381469...
configuration             {'topK': 10, 'maxIterations': 1000, 'writeConc...
Name: 0, dtype: object

In [29]:
g.drop()