# Data Science with Neo4j Using Yelp Data

### Module 2: Segmentation and Community Detection (Work in Progress)

Goal: Find communities of users based on the categories of restaurants they review

Algorithm: Label Propagation

High-Level Approach:

- Subset the data to restaurants in Toronto. Consequently, only users who reviewed restaurants in Toronto are considered.
- Population: 1251 users, 1707 businesses, 81 categories
- Relationships: user REVIEWED business, business IN_CATEGORY category, user REVIEWED_CATEGORY category
- Create weights between users to form tighter communities (Jaccard index between each pair of users, computed over the sets of restaurant categories they have reviewed)
- Draw parallels with traditional data science clustering approaches (e.g., k-means, kNN)
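
The Jaccard index used for the user-to-user weights is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch, assuming each user is represented by the set of categories they have reviewed (the user names and categories below are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical users and the categories they reviewed.
alice = {"Sushi", "Ramen", "Pizza"}
bob = {"Sushi", "Pizza", "Burgers", "Tacos"}

print(jaccard(alice, bob))  # 2 shared / 5 in the union = 0.4
```

A value of 1.0 means two users reviewed exactly the same categories; 0.0 means they have none in common.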

In [17]:
from neo4j.v1 import GraphDatabase, basic_auth
import pandas as pd
import matplotlib.pyplot as plt
import os

In [18]:
uri  = os.getenv('NEO4J_URI',  'bolt://localhost:7687')
user = os.getenv('NEO4J_USER', 'neo4j')
pwd  = os.getenv('NEO4J_PWD',  'neo4j')

driver = GraphDatabase.driver(uri, auth=basic_auth(user, pwd))

#### Part 1: Compute Jaccard index between user pairs

In [109]:
%%time

# first, create relationship between users and categories.
query = """
        MATCH (u:User)-[:REVIEWED]->(b:Business)-[:IN_CATEGORY]->(c:Category)
        WITH u, c, COUNT(DISTINCT b) as num_business
        CREATE (u)-[:REVIEWED_CATEGORY {num_reviewed: num_business}]->(c)        
        """     

with driver.session() as session:
    result = session.run(query)

CPU times: user 2 ms, sys: 2.35 ms, total: 4.35 ms
Wall time: 1.07 s
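
For comparison with a tabular workflow, the aggregation in the query above amounts to a group-by on (user, category) with a distinct count of businesses. A rough pure-Python equivalent, using a few invented rows rather than the Yelp data:

```python
from collections import defaultdict

# Hypothetical (user, business, category) rows. A business in two categories
# appears once per category, as it would after traversing IN_CATEGORY.
rows = [
    ("u1", "b1", "Sushi"),
    ("u1", "b1", "Japanese"),
    ("u1", "b2", "Sushi"),
    ("u1", "b2", "Sushi"),  # a repeat review of the same business
    ("u2", "b1", "Sushi"),
]

# Collect the distinct businesses per (user, category) pair.
businesses = defaultdict(set)
for user, business, category in rows:
    businesses[(user, category)].add(business)

# num_reviewed mirrors the property stored on REVIEWED_CATEGORY.
num_reviewed = {pair: len(bs) for pair, bs in businesses.items()}
print(num_reviewed)
```

Using a set per pair plays the role of `COUNT(DISTINCT b)`: the repeat review of `b2` is counted once.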


In [19]:
%%time

# count number of REVIEWED_CATEGORY relationships
query = """
        MATCH ()-[r:REVIEWED_CATEGORY]->()
        RETURN COUNT(r)
        """     

with driver.session() as session:
    result = session.run(query)
    
result_df = [dict(record) for record in result]
display(result_df)

[{'COUNT(r)': 69186}]

CPU times: user 6.09 ms, sys: 2.15 ms, total: 8.24 ms
Wall time: 38.6 ms


In [161]:
%%time

# compute intersection, union and Jaccard index for each user pair, store the index
# on a SIMILAR_TO relationship, and return the counts for inspection
query = """
        MATCH (u1:User)-[:REVIEWED_CATEGORY]->(c:Category)<-[:REVIEWED_CATEGORY]-(u2:User)
        WHERE id(u1) < id(u2)
        WITH  u1, u2, COUNT(DISTINCT c) as intersection_count,
                SIZE((u1)-[:REVIEWED_CATEGORY]->()) as cat1,
                SIZE((u2)-[:REVIEWED_CATEGORY]->()) as cat2
        WITH  u1, u2, cat1, cat2, intersection_count,
                (intersection_count * 1.0) / (cat1 + cat2 - intersection_count) as jaccard_index
        CREATE (u1)-[:SIMILAR_TO {similarity: jaccard_index}]->(u2)
        RETURN u1.id, u2.id, cat1, cat2, intersection_count, jaccard_index
        """

with driver.session() as session:
    result = session.run(query)

CPU times: user 7.26 ms, sys: 5.7 ms, total: 13 ms
Wall time: 2min 22s


In [158]:
jaccard_df = pd.DataFrame([dict(record) for record in result])
    
display(jaccard_df.head())
display(jaccard_df.shape)

| | cat1 | cat2 | intersection_count | jaccard_index | u1.id | u2.id |
|---|---|---|---|---|---|---|
| 0 | 38 | 41 | 20 | 0.338983 | q6AMn2HPGYVsD31NB1K9xg | Tc3GAQdAfOW542ROdyCZPg |
| 1 | 43 | 38 | 14 | 0.208955 | o5hk57cqhWnV1sULPvq1jw | Nr2uHirba5WNcG0vOXoVDA |
| 2 | 41 | 46 | 20 | 0.298507 | rKEJfzCIV0AXKo0kdzPBgQ | 0BaJ43WuBnP-G6fmstEmNQ |
| 3 | 51 | 52 | 23 | 0.2875 | FSzxEJHeDjEB6Lsotqc1Qg | 8-GTQbes8cfy5QRpjVy7bg |
| 4 | 48 | 33 | 22 | 0.372881 | XuCbLgo9j1q5dDh9251vkg | bIXj8nZWd9f3vEHzPUJ4lg |


(781875, 6)
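
As a sanity check, the `jaccard_index` column can be recomputed from the three count columns, since the size of the union equals `cat1 + cat2 - intersection_count`. The sketch below uses the five rows displayed above. It also checks that 781,875 is the number of unordered pairs among 1,251 users, which suggests that every pair of users shares at least one category:

```python
from math import comb, isclose

# (cat1, cat2, intersection_count, jaccard_index) copied from the rows above.
rows = [
    (38, 41, 20, 0.338983),
    (43, 38, 14, 0.208955),
    (41, 46, 20, 0.298507),
    (51, 52, 23, 0.2875),
    (48, 33, 22, 0.372881),
]

for cat1, cat2, inter, shown in rows:
    recomputed = inter / (cat1 + cat2 - inter)
    # The displayed values are rounded to six decimals.
    assert isclose(recomputed, shown, abs_tol=1e-6)

# Unordered pairs among 1251 users: n * (n - 1) / 2.
print(comb(1251, 2))  # 781875
```

Because the graph of `SIMILAR_TO` relationships is then complete, the weights carry all of the community structure; the mere presence of a relationship says nothing.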

In [21]:
%%time
# count the number of SIMILAR_TO relationships
query = """
        MATCH ()-[r:SIMILAR_TO]->()
        RETURN COUNT(r)       
        """     

with driver.session() as session:
    result = session.run(query)

CPU times: user 1.69 ms, sys: 2.08 ms, total: 3.76 ms
Wall time: 35.5 ms


In [22]:
similarto_df = pd.DataFrame([dict(record) for record in result])
    
display(similarto_df.head())

| | COUNT(r) |
|---|---|
| 0 | 781875 |


#### Part 2: Label Propagation

In [194]:
%%time

# run label propagation, using the Jaccard index as the relationship weight
query = """
        CALL algo.labelPropagation('User', 'SIMILAR_TO','OUTGOING',
            {iterations:2, partitionProperty:'cluster', weightProperty:'similarity', write: true})
        YIELD nodes, iterations, didConverge, loadMillis, computeMillis, writeMillis, write, partitionProperty;
        """

with driver.session() as session:
    result = session.run(query)
    for row in result:
        print(row)

<Record nodes=1251 iterations=2 didConverge=False loadMillis=390 computeMillis=86 writeMillis=2 write=True partitionProperty='cluster'>
CPU times: user 2.24 ms, sys: 2.09 ms, total: 4.33 ms
Wall time: 565 ms
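
To make the procedure concrete, here is a minimal pure-Python sketch of weighted label propagation. It is not the `algo.labelPropagation` implementation, only an illustration of the idea: every node starts with its own label and, on each pass, adopts the label with the largest total edge weight among its neighbours. The toy graph, the node names and the tie-breaking rule (smallest label wins) are assumptions made for this example.

```python
from collections import defaultdict


def label_propagation(edges, iterations=10):
    """Weighted label propagation on an undirected graph.

    edges: iterable of (node_a, node_b, weight) tuples.
    Returns a dict mapping each node to its final label.
    """
    neighbours = defaultdict(dict)
    for a, b, w in edges:
        neighbours[a][b] = w
        neighbours[b][a] = w

    # Every node starts in its own community.
    labels = {node: node for node in neighbours}

    for _ in range(iterations):
        changed = False
        for node in sorted(neighbours):
            # Sum the edge weights per neighbouring label.
            score = defaultdict(float)
            for other, w in neighbours[node].items():
                score[labels[other]] += w
            # Highest total weight wins; ties go to the smallest label.
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:  # converged: no label changed in a full pass
            break
    return labels


# Two tightly connected triangles joined by one weak edge.
edges = [
    ("a", "b", 0.9), ("a", "c", 0.8), ("b", "c", 0.9),
    ("d", "e", 0.9), ("d", "f", 0.8), ("e", "f", 0.9),
    ("c", "d", 0.1),
]
print(label_propagation(edges))
```

On this toy graph the two triangles end up with two different labels, because the weak `c`-`d` edge is outweighed by the edges inside each triangle. Note that the call above reports `didConverge=False` after `iterations:2`, so a higher iteration limit may give more stable labels.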


In [23]:
%%time

# list the distinct cluster labels written by label propagation
query = """
        MATCH (u:User)
        RETURN distinct(u.cluster) 
        """     

with driver.session() as session:
    result = session.run(query)

for row in result:
    print(row)

<Record (u.cluster)=1249>
<Record (u.cluster)=1234>
<Record (u.cluster)=1250>
<Record (u.cluster)=1011>
<Record (u.cluster)=1242>
<Record (u.cluster)=1245>
<Record (u.cluster)=1247>
CPU times: user 2.16 ms, sys: 2.4 ms, total: 4.55 ms
Wall time: 22.9 ms
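
Seven distinct labels come back, so the 1,251 users fall into seven communities. A natural next step is to look at how large each community is. A small sketch, assuming the `cluster` property has been fetched into a list of (user id, cluster) pairs; the pairs below are invented placeholders, not query results:

```python
from collections import Counter

# Placeholder (user_id, cluster) pairs standing in for real query results.
assignments = [
    ("user_a", 1249), ("user_b", 1249), ("user_c", 1234),
    ("user_d", 1250), ("user_e", 1249), ("user_f", 1234),
]

# Count users per cluster label, largest community first.
sizes = Counter(cluster for _, cluster in assignments)
for cluster, n_users in sizes.most_common():
    print(cluster, n_users)
```

A very skewed size distribution (one large community and several tiny ones) would be a hint to revisit the weights or the number of iterations.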
