# K-Nearest Neighbor (KNN) Similarity Evaluation
This notebook demonstrates evaluating FastRP embeddings and KNN on the H&M dataset of customer purchases. These commands could be wrapped in a service or batch job that runs on a recurring basis, refreshing the results and testing for changes in performance as the dataset is updated over time.

In [37]:
import pandas as pd
import configparser
from datetime import datetime, timedelta

In [39]:
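def clear_all_graphs():
    """Drop every graph projection currently in the GDS graph catalog."""
    # Relies on the `gds` client created in the connection cell further down
    g_names = gds.graph.list().graphName.tolist()
    for g_name in g_names:
        g = gds.graph.get(g_name)
        gds.graph.drop(g)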

### Neo4j Settings
The `neo4j.ini` file holds the connection properties this notebook needs to connect to your Neo4j instance and load data. The ini file should be formatted as follows:

```
[NEO4J]
PASSWORD=<password>
USERNAME=<username, is 'neo4j' by default>
HOST=<host uri>
```

In [38]:
config = configparser.RawConfigParser()
config.read('neo4j.ini')
HOST = config['NEO4J']['HOST']
USERNAME = config['NEO4J']['USERNAME']
PASSWORD = config['NEO4J']['PASSWORD']
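
If `neo4j.ini` is missing or lacks a key, the lookups above fail with a bare `KeyError`. Below is a minimal sketch of an up-front check, assuming the three keys described in this section. `load_neo4j_settings` is a hypothetical helper (not part of the notebook or of any library), and the values in `example_ini` are placeholders.

```python
import configparser

REQUIRED_KEYS = ('HOST', 'USERNAME', 'PASSWORD')

def load_neo4j_settings(ini_text):
    """Parse ini text and return the [NEO4J] settings as a dict (hypothetical helper)."""
    config = configparser.RawConfigParser()
    config.read_string(ini_text)
    if not config.has_section('NEO4J'):
        raise ValueError('missing [NEO4J] section')
    section = config['NEO4J']
    # configparser matches option names case-insensitively
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise ValueError('missing keys in [NEO4J]: ' + ', '.join(missing))
    return {key: section[key] for key in REQUIRED_KEYS}

# Placeholder values for illustration only
example_ini = """
[NEO4J]
PASSWORD=secret
USERNAME=neo4j
HOST=neo4j+s://example.invalid
"""
settings = load_neo4j_settings(example_ini)
print(settings['USERNAME'])  # neo4j
```

To check a real file, the same function could be fed `open('neo4j.ini').read()`.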

### Connect to Graph Data Science (GDS)

In [41]:
from graphdatascience import GraphDataScience

# Use Neo4j URI and credentials according to your setup
gds = GraphDataScience(HOST, auth=(USERNAME, PASSWORD), aura_ds=True)

## Merge Historic Relationships for Evaluation
We will split the dataset by mirroring each `PURCHASED` relationship as either a `RECENTLY_PURCHASED` or a `HISTORICALLY_PURCHASED` relationship. The last week of purchases is reserved for evaluation and the rest is used for calculating KNN similarity.

In [42]:
gds.run_cypher('CREATE INDEX purchase_date IF NOT EXISTS FOR ()-[r:PURCHASED]-() ON (r.transactionDate)')

In [43]:
max_purchase_date = gds.run_cypher('MATCH(:Customer)-[r:PURCHASED]->() RETURN max(r.transactionDate) AS maxDate')['maxDate'][0]
cutoff_date = datetime(year=max_purchase_date.year, month=max_purchase_date.month, day=max_purchase_date.day) - timedelta(days=7)
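
The split rule can be illustrated outside the database. The sketch below applies the same comparison as the Cypher in the next cells (`>=` cutoff is recent, `<` cutoff is historic) to a handful of made-up purchases. `split_by_cutoff` and the toy data are assumptions for illustration only.

```python
from datetime import date, timedelta

def split_by_cutoff(purchases, holdout_days=7):
    """Split (customer, article, date) tuples into historic and recent lists."""
    max_date = max(d for _, _, d in purchases)
    cutoff = max_date - timedelta(days=holdout_days)
    historic = [p for p in purchases if p[2] < cutoff]   # used for embeddings and KNN
    recent = [p for p in purchases if p[2] >= cutoff]    # held out for evaluation
    return cutoff, historic, recent

# Made-up purchases, not taken from the H&M data
purchases = [
    ('c1', 'a1', date(2020, 9, 1)),
    ('c1', 'a2', date(2020, 9, 15)),
    ('c2', 'a1', date(2020, 9, 20)),
    ('c2', 'a3', date(2020, 9, 22)),
]
cutoff, historic, recent = split_by_cutoff(purchases)
print(cutoff)         # 2020-09-15
print(len(historic))  # 1
print(len(recent))    # 3
```

Because the cutoff itself counts as recent, the held-out window spans eight calendar dates (the cutoff date through the maximum date, inclusive).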

In [291]:
gds.run_cypher('''
    MATCH(c:Customer)-[r:PURCHASED]->(a) 
    WHERE r.transactionDate >= date($cutoffDate)
    WITH c, a, r
    CALL {
        WITH c, a, r
        MERGE(c)-[h:RECENTLY_PURCHASED {transactionDate:r.transactionDate, transactionId:r.transactionId}]->(a) 
    }  IN TRANSACTIONS OF 100000 ROWS
    RETURN count(*)
''', params={'cutoffDate':str(cutoff_date)[:10]})

   count(*)
0    266364


In [292]:
%%time
gds.run_cypher('''
    MATCH(c:Customer)-[r:PURCHASED]->(a) 
    WHERE r.transactionDate < date($cutoffDate)
    WITH c, a, r
    CALL {
        WITH c, a, r
        MERGE(c)-[h:HISTORICALLY_PURCHASED {transactionDate:r.transactionDate, transactionId:r.transactionId}]->(a)
    }  IN TRANSACTIONS OF 100000 ROWS
    RETURN count(*)
''', params={'cutoffDate':str(cutoff_date)[:10]})

CPU times: user 47.6 ms, sys: 19.4 ms, total: 67 ms
Wall time: 19min 15s


   count(*)
0  31521960


## Label Entities in Main Component
Roughly 99% of the nodes fall in the single largest graph component (1,507,255 of 1,524,746 nodes in the WCC output below). We will label the nodes in this largest, or "main", component so they are easier to select for downstream analytics. The minority of customers and articles that lie outside of the main component are essentially cold starts. While out of scope for this demo, we could serve that minority differently, e.g. by recommending the overall most popular articles to new customers and using content-based recommendations for new articles.

In [44]:
g, _ = gds.graph.project('proj',['Customer', 'Article', 'Product'],{
    'HISTORICALLY_PURCHASED':{'orientation':'UNDIRECTED'},
    'IS_PRODUCT':{'orientation':'UNDIRECTED'},
})

In [45]:
gds.wcc.write(g, writeProperty='histComponent')

writeMillis                                                            354
nodePropertiesWritten                                              1524746
componentCount                                                       16572
componentDistribution    {'p99': 3, 'min': 1, 'max': 1507255, 'mean': 9...
postProcessingMillis                                                    27
preProcessingMillis                                                      0
computeMillis                                                          155
configuration            {'writeConcurrency': 4, 'seedProperty': None, ...
Name: 0, dtype: object
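
To make the idea of a "main" component concrete, here is a small pure-Python sketch that finds weakly connected components with union-find and picks the largest one. It is an illustration on a toy graph, not how GDS implements WCC. The function name and node ids are made up.

```python
from collections import Counter

def weakly_connected_components(nodes, edges):
    """Return {node: component_root} using union-find with path halving."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:  # edge direction is ignored, as with an UNDIRECTED projection
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {n: find(n) for n in nodes}

nodes = ['c1', 'c2', 'c3', 'a1', 'a2', 'a3']
edges = [('c1', 'a1'), ('c2', 'a1'), ('c2', 'a2')]  # c3 and a3 are isolated
components = weakly_connected_components(nodes, edges)

# The largest component plays the role of the "main" component
main_component, main_size = Counter(components.values()).most_common(1)[0]
main_nodes = sorted(n for n, comp in components.items() if comp == main_component)
print(main_nodes)  # ['a1', 'a2', 'c1', 'c2']
```

In this toy graph `c3` and `a3` are the cold starts: they have no purchase history linking them to anything else.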

In [46]:
g.drop()

In [47]:
gds.run_cypher('''
    MATCH(n) 
    WITH n.histComponent AS maxComponent, count(n) AS cnt ORDER BY cnt DESC LIMIT 1
    MATCH(n:Customer) WHERE n.histComponent = maxComponent
    SET n:HistEstCustomer
    RETURN count(n)
''')

   count(n)
0   1356117


In [48]:
gds.run_cypher('''
    MATCH(n) 
    WITH n.histComponent AS maxComponent, count(n) AS cnt ORDER BY cnt DESC LIMIT 1
    MATCH(n:Article) WHERE n.histComponent = maxComponent
    SET n:HistEstArticle
    RETURN count(n)
''')

   count(n)
0    104632


In [49]:
gds.run_cypher('''
    MATCH(n) 
    WITH n.histComponent AS maxComponent, count(n) AS cnt ORDER BY cnt DESC LIMIT 1
    MATCH(n:Product) WHERE n.histComponent = maxComponent
    SET n:HistEstProduct
    RETURN count(n)
''')

   count(n)
0     46500


## Apply GDS FastRP Node Embeddings and K-Nearest Neighbor (KNN) Similarity

In [52]:
%%time
# graph projection
g, _ = gds.graph.project('proj',['HistEstCustomer', 'HistEstArticle', 'HistEstProduct'],{
    'HISTORICALLY_PURCHASED':{'orientation':'UNDIRECTED'},
    'IS_PRODUCT':{'orientation':'UNDIRECTED'}}, readConcurrency=20)

# embeddings
gds.fastRP.mutate(g, mutateProperty='embedding', embeddingDimension=256, randomSeed=7474, concurrency=20)

# write Article embeddings back to the database in case we want to introspect later
gds.graph.writeNodeProperties(g, ['embedding'], ['HistEstArticle'])

# KNN over Article embeddings; only pairs scoring at least 0.82 are written as relationships
knn_stats = gds.knn.write(g, nodeProperties=['embedding'], nodeLabels=['HistEstArticle'],
                  writeRelationshipType='HIST_CUSTOMERS_ALSO_PURCHASED', writeProperty='score', similarityCutoff=0.82,
                  sampleRate=1.0, maxIterations=1000, concurrency=20)

CPU times: user 49.2 ms, sys: 5.34 ms, total: 54.6 ms
Wall time: 1min 42s


In [53]:
g.drop()
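
For intuition about what the KNN step produces, the sketch below does an exhaustive nearest-neighbor search over a few toy embeddings and keeps only pairs above a similarity cutoff. It assumes cosine similarity and a `top_k` limit. The function names and vectors are made up, and GDS uses an approximate algorithm rather than this brute-force loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def knn_pairs(embeddings, top_k=10, similarity_cutoff=0.82):
    """Return {node: [(neighbor, score), ...]} sorted by descending score."""
    result = {}
    for node, vec in embeddings.items():
        scored = [(other, cosine(vec, other_vec))
                  for other, other_vec in embeddings.items() if other != node]
        # Drop weak pairs before truncating to the top_k strongest
        scored = [(o, s) for o, s in scored if s >= similarity_cutoff]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[node] = scored[:top_k]
    return result

# Toy 2-dimensional embeddings (the notebook uses 256 dimensions)
toy = {'a1': [1.0, 0.0], 'a2': [0.9, 0.1], 'a3': [0.0, 1.0]}
neighbors = knn_pairs(toy, top_k=2)
print([n for n, _ in neighbors['a1']])  # ['a2']
print(neighbors['a3'])                  # []
```

An article such as `a3`, with no neighbor above the cutoff, ends up with no outgoing similarity relationships and so never contributes recommendations.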

## Evaluate KNN Performance
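Select ground truth purchases and predictions (recommendations) to calculate Mean Average Precision (MAP). Predictions come from the KNN neighbors of the articles each customer purchased in the six weeks before the cutoff, ranked by summed similarity score. This will provide a rough idea of how this method may perform in real life.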

In [54]:
%%time
pred_df = gds.run_cypher('''
    MATCH(c:HistEstCustomer)-[:RECENTLY_PURCHASED]->()
    WITH DISTINCT c
    MATCH(c)-[r:HISTORICALLY_PURCHASED]->(a0) WHERE r.transactionDate > date($cutOffDate)
    WITH c, a0, r
    MATCH(a0)-[s:HIST_CUSTOMERS_ALSO_PURCHASED]->(a)
    RETURN c.customerId AS customerId, a.articleId AS articleId, sum(s.score) AS aggScore, max(r.transactionDate)
    ORDER BY customerId, aggScore DESC
''', params = {'cutOffDate':str(cutoff_date - timedelta(days=42))[:10]})

CPU times: user 5.37 s, sys: 23.3 ms, total: 5.39 s
Wall time: 28.6 s
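
The ranking logic of the query above can be sketched in plain Python: every article in a customer's recent history votes for its KNN neighbors, and the votes are summed per candidate. `recommend`, the `limit` parameter, and the toy scores are assumptions for illustration. The query itself returns all candidates and leaves truncation to the evaluation step.

```python
from collections import defaultdict

def recommend(recent_history, similar, limit=12):
    """Rank candidate articles by summed similarity to articles in the history."""
    agg = defaultdict(float)
    for article in recent_history:  # one entry per purchase, so repeats add weight
        for neighbor, score in similar.get(article, []):
            agg[neighbor] += score
    ranked = sorted(agg.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]

# Toy similarity lists standing in for HIST_CUSTOMERS_ALSO_PURCHASED relationships
similar = {
    'a1': [('a2', 0.9), ('a3', 0.85)],
    'a4': [('a3', 0.95)],
}
ranked = recommend(['a1', 'a4'], similar)
print([article for article, _ in ranked])  # ['a3', 'a2']
```

Here `a3` outranks `a2` because it is a neighbor of both purchased articles, even though neither of its individual scores is far above that of `a2`.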


In [56]:
%%time
obs_df = gds.run_cypher('''
    MATCH(c:HistEstCustomer)-[r:RECENTLY_PURCHASED]->(a)
    WITH c.customerId AS customerId, a.articleId AS articleId
    RETURN customerId, articleId
''')

CPU times: user 7.75 s, sys: 85.2 ms, total: 7.84 s
Wall time: 40.5 s


In [57]:
obs_eval_df = obs_df.groupby('customerId').agg({'articleId': lambda x: x.tolist()}).reset_index().rename(columns={'articleId':'observedPurchases'})

In [58]:
pred_eval_df = pred_df.groupby('customerId').agg({'articleId': lambda x: x.tolist()}).reset_index().rename(columns={'articleId':'predictedPurchases'})

In [59]:
eval_df = obs_eval_df.merge(pred_eval_df, on='customerId', how='inner')
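
Note that the inner merge keeps only customers who have both observed purchases and at least one prediction. With `how='left'`, customers without predictions would carry a missing value, which the `isinstance` check in `average_precision` scores as 0.0. The sketch below uses plain dictionaries rather than pandas to show how much that choice can move the mean. The function name and scores are made up.

```python
def mean_score(scores_by_customer, observed_customers, count_missing_as_zero):
    """Average per-customer scores, optionally counting customers with no predictions as 0."""
    if count_missing_as_zero:
        # analogous to a left merge: every observed customer contributes
        values = [scores_by_customer.get(c, 0.0) for c in observed_customers]
    else:
        # analogous to an inner merge: only customers with predictions contribute
        values = [scores_by_customer[c] for c in observed_customers if c in scores_by_customer]
    return sum(values) / len(values)

observed_customers = ['c1', 'c2', 'c3', 'c4']
scores_by_customer = {'c1': 0.5, 'c2': 0.0}  # c3 and c4 received no predictions

print(mean_score(scores_by_customer, observed_customers, False))  # 0.25
print(mean_score(scores_by_customer, observed_customers, True))   # 0.125
```

Which convention is appropriate depends on whether the metric should describe only the customers the method can serve or the whole evaluation population.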

In [60]:
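from collections import Counter

def average_precision(true_list, predicted_list, at_k):
    """Purchase-count-weighted average precision at `at_k` for one customer."""
    # Customers without predictions (e.g. NaN after a left merge) score zero
    if not isinstance(predicted_list, list):
        return 0.0
    true_counts = Counter(true_list)  # articleId -> times purchased in the evaluation week
    true_set = set(true_counts)
    if not true_set:
        return 0.0
    p = 0.0
    K = min(at_k, len(predicted_list))
    for k in range(1, K + 1):
        v = predicted_list[k - 1]
        if v in true_set:
            # precision at rank k, weighted by how often the article was bought
            p += true_counts[v] * len(true_set.intersection(predicted_list[:k])) / k
    return p / min(len(true_set), at_k)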
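
The function above weights each hit by the number of times the article was purchased in the evaluation week, so a customer with repeat purchases can score above 1. For comparison, here is a sketch of the unweighted textbook AP@k on a worked example. `ap_at_k` is a hypothetical name, and this is not the notebook's metric.

```python
def ap_at_k(actual, predicted, k=12):
    """Textbook AP@k: sum of precision@i at each hit rank i, divided by min(#relevant, k)."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for i, item in enumerate(predicted[:k], start=1):
        if item in actual_set and item not in seen:
            hits += 1
            score += hits / i  # precision at rank i
        seen.add(item)
    return score / min(len(actual_set), k)

# Hits at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(ap_at_k(['a1', 'a3'], ['a1', 'a2', 'a3'], k=3), 4))  # 0.8333
```

When a customer has no repeated purchases and the prediction list has no duplicates, the two definitions should agree. They diverge only when an article was bought more than once in the evaluation week.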

In [61]:
eval_df['averagePrecisions'] = eval_df.apply(lambda row: average_precision(row.observedPurchases, row.predictedPurchases, 12), axis=1)

In [62]:
print('Mean Average Precision of KNN on Last Week of Purchases: {:.2f} %'.format(100*eval_df['averagePrecisions'].sum()/eval_df.shape[0]))

Mean Average Precision of KNN on Last Week of Purchases: 0.67 %


Could be better, but certainly not too bad for a first round without tuning!

In [63]:
#eval_df[eval_df['averagePrecisions'] > 0.0]