# Vivino users
> ZHAW CAS Machine Intelligence - Big Data Module - Sansar Choinyambuu, Gustavo Martinez

This notebook analyzes the follower graph of Vivino users with Apache Spark's GraphFrames library.

https://www.vivino.com/ 
Vivino is an online wine marketplace powered by a community of over 40 million users. The users can rate and write reviews for wine and follow each other.

## Data scraping
The data was obtained from vivino.com using a self-written scraper, available at:
https://github.com/sansar-choinyambuu/vivino-users/blob/main/scrape_top_ranked.py

For this analysis, the top 1000 users from each of the following 10 countries were scraped, along with their followers and followings:
["fr", "it", "es", "us", "ch", "de", "ru", "gb", "au", "ca"]
For some users, Vivino does not provide details unless the API user is authenticated, so fewer than 10,000 users were crawled. The final dataset contains 4949 users.

***

Vivino exposes APIs for user information and follower relationships:
- HTTP POST https://www.vivino.com/users/x/country_rankings - top ranked users for a country
  {
      "page": 1,
      "country_code": "ca"
  }
- HTTP GET http://app.vivino.com/api/users/mikhail-mikhail20 - user information
- HTTP GET http://app.vivino.com/api/users/mikhail-mikhail20/followers?start_from=0&limit=10 - followers of user
- HTTP GET http://app.vivino.com/api/users/mikhail-mikhail20/following?start_from=0&limit=10 - following of user
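
The `start_from` and `limit` query parameters suggest offset-based pagination. The sketch below shows one way such an endpoint could be walked page by page. It is not taken from the scraper: `iter_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, the response is assumed to be a plain list, and the stop rule (a page shorter than `limit` ends the list) is an assumption about the API. An in-memory stand-in replaces the HTTP call so the example runs offline.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int, int], List[dict]],
               limit: int = 10) -> Iterator[dict]:
    """Yield items from an offset-paginated endpoint.

    fetch_page(start_from, limit) is a hypothetical callable returning one
    page as a list; in a real scraper it would wrap the HTTP GET request.
    Assumption: a page shorter than `limit` marks the end of the list.
    """
    start_from = 0
    while True:
        page = fetch_page(start_from, limit)
        yield from page
        if len(page) < limit:
            break
        start_from += limit


# Stand-in for the remote API: 23 fake followers served from memory.
_fake_followers = [{"id": i} for i in range(23)]


def fake_fetch(start_from: int, limit: int) -> List[dict]:
    return _fake_followers[start_from:start_from + limit]


follower_ids = [item["id"] for item in iter_pages(fake_fetch, limit=10)]
```

Passing the fetch function in as an argument keeps the paging logic independent of the HTTP client and of any authentication the real API may require.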

## Read and prepare data

In [0]:
import pandas as pd
# data is available at https://github.com/sansar-choinyambuu/vivino-users
users_df = pd.read_pickle("/dbfs/FileStore/shared_uploads/choinsa1@students.zhaw.ch/vivino_top_ranked.pkl")
users_df["country"] = users_df["address"].map(lambda a: a["country"])
users_df["avatar"] = users_df["image"].map(lambda i: i["location"])
users_df["ratings"] = users_df["statistics"].map(lambda s: s["ratings_count"])
users_df["reviews"] = users_df["statistics"].map(lambda s: s["reviews_count"])
users_df["stories"] = users_df["statistics"].map(lambda s: s["activity_stories_count"])
users_df.head(3)

Unnamed: 0,address,alias,background_image,bio,followers,following,id,image,is_featured,seo_name,statistics,visibility,website,country,avatar,ratings,reviews,stories
0,"{'title': None, 'name': None, 'street': None, ...",Josean M,{'location': '//images.vivino.com/users/backgr...,,[],[],30610918,{'location': '//images.vivino.com/avatars/defa...,False,josean.m1,"{'followers_count': 0, 'followings_count': 0, ...",all,,es,//images.vivino.com/avatars/default_user.png,353,30,0
1,"{'title': None, 'name': None, 'street': None, ...",Jenna Eddie,{'location': '//images.vivino.com/users/backgr...,WSET Level 2,"[27729961, 23030936, 39078028, 26626339, 43496...","[19618445, 4866859, 23030936, 14061140, 506762...",9938486,{'location': '//images.vivino.com/avatars/m49f...,False,jenna.ed,"{'followers_count': 37, 'followings_count': 21...",all,,gb,//images.vivino.com/avatars/m49fMjIDT06e2C8bxL...,1287,240,0
2,"{'title': None, 'name': None, 'street': None, ...",Beth VonVino,{'location': '//images.vivino.com/users/backgr...,A Texan in Hessen,"[7288914, 3015082, 3506287, 5082774, 1556134, ...","[3015082, 1556134, 2500201, 575520, 3837455, 7...",7032149,{'location': '//images.vivino.com/avatars/0046...,False,bebe.v,"{'followers_count': 393, 'followings_count': 1...",all,,de,//images.vivino.com/avatars/0046q1hyae05926065...,352,286,0


In [0]:
users = users_df[["id", "seo_name", "alias", "country", "bio", "avatar", "ratings", "reviews", "stories"]]
followers = users_df[["id", "followers"]].explode("followers").rename(columns={"followers": "src", "id":"dst"})
following = users_df[["id", "following"]].explode("following").rename(columns={"id": "src", "following":"dst"})
# DataFrame.append was removed in pandas 2.0; pd.concat aligns the src/dst columns by name
followership = pd.concat([followers, following], ignore_index=True)
followership = followership.drop_duplicates()

# keep only followerships whose source and destination ids both appear in the users dataframe
users_ids = users["id"].to_numpy()
followership_filtered = followership[followership["src"].isin(users_ids) & followership["dst"].isin(users_ids)]

In [0]:
print(f"There are {len(users)} users and {len(followership_filtered)} followership connections")
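
The edge-list construction above can be traced on a toy frame. The sketch below uses made-up ids and the same `explode`/`rename` steps, plus an explicit `dropna()` for users whose lists are empty (the notebook removes those rows implicitly through the `isin` filter). The variable `toy` and the id `99` are illustrative only.

```python
import pandas as pd

# Toy stand-in for users_df: user 1 is followed by 2 and 3, and follows 2.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "followers": [[2, 3], [1], []],
    "following": [[2], [1], [1, 99]],   # 99 is not in the scraped set
})

# A follower f of user u becomes the edge f -> u.
followers = (toy[["id", "followers"]].explode("followers")
             .rename(columns={"followers": "src", "id": "dst"}))
# A user u following v becomes the edge u -> v.
following = (toy[["id", "following"]].explode("following")
             .rename(columns={"id": "src", "following": "dst"}))

# concat aligns columns by name, so the differing column order is harmless.
edges = pd.concat([followers, following], ignore_index=True)[["src", "dst"]]
edges = edges.dropna().drop_duplicates()

# Drop edges that point outside the set of known users.
known = toy["id"]
edges = edges[edges["src"].isin(known) & edges["dst"].isin(known)]
edge_set = {(int(s), int(d)) for s, d in zip(edges["src"], edges["dst"])}
```

Both lists describe the same relationship from opposite ends, which is why the same edge can appear twice and `drop_duplicates()` is needed after concatenation.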

## Create graph

In [0]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = spark._jvm.org.apache.log4j
logging.getLogger("py4j").setLevel(logging.ERROR)

In [0]:
from graphframes import GraphFrame

vertices = sqlContext.createDataFrame(users)
edges = sqlContext.createDataFrame(followership_filtered)

g = GraphFrame(vertices, edges)

In [0]:
print(f"Graph has {g.vertices.count()} vertices and {g.edges.count()} edges")

## Analyze graph

### Mutual followers

In [0]:
# How many pairs of top-ranked Vivino users follow each other mutually?
# The motif matches every pair twice, as (a, b) and (b, a), so keep one orientation only.
mutually_follow = g.find("(a)-[e1]->(b); (b)-[e2]->(a)").filter("a.id < b.id").dropDuplicates()
print(f"Among {g.vertices.count()} top-ranked Vivino users there are {mutually_follow.count()} mutual following relationships")
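
A motif such as `(a)-[e1]->(b); (b)-[e2]->(a)` matches each reciprocal pair in both orientations, so a raw match count is twice the number of unordered pairs. The following plain-Python sketch illustrates this on a made-up edge set; it does not use Spark, and the ids are hypothetical.

```python
# Toy directed edge set (src follows dst); the ids are made up.
edges = {(1, 2), (2, 1), (2, 3), (3, 1), (1, 3)}

# Ordered matches, analogous to the motif result: both (a, b) and (b, a).
ordered_matches = {(a, b) for (a, b) in edges if (b, a) in edges}

# Unordered mutual pairs: keep a single orientation per pair.
mutual_pairs = {(a, b) for (a, b) in ordered_matches if a < b}
```

Here users 1 and 2 and users 1 and 3 follow each other, while the edge from 2 to 3 is one-directional and drops out.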

### Degrees

In [0]:
from pyspark.sql import functions as F

# In degree statistics
g.inDegrees.agg(F.min(g.inDegrees.inDegree),
                F.max(g.inDegrees.inDegree),
                F.avg(g.inDegrees.inDegree),
                F.expr('percentile(inDegree, array(0.25))')[0].alias('%25'),
                F.expr('percentile(inDegree, array(0.50))')[0].alias('%50'),
                F.expr('percentile(inDegree, array(0.75))')[0].alias('%75'),
                F.expr('percentile(inDegree, array(0.90))')[0].alias('%90')).show()

# Out degree statistics
g.outDegrees.agg(F.min(g.outDegrees.outDegree),
                 F.max(g.outDegrees.outDegree),
                 F.avg(g.outDegrees.outDegree),
                 F.expr('percentile(outDegree, array(0.25))')[0].alias('%25'),
                 F.expr('percentile(outDegree, array(0.50))')[0].alias('%50'),
                 F.expr('percentile(outDegree, array(0.75))')[0].alias('%75'),
                 F.expr('percentile(outDegree, array(0.90))')[0].alias('%90')).show()
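
GraphFrames' `inDegrees` and `outDegrees` list only vertices that have at least one such edge, so the minimum reported above cannot be 0 and the averages exclude users without in-graph followers or followings. The pandas sketch below, on made-up data, shows how the statistics shift once zero-degree users are included; the toy `vertices` and `edges` frames are illustrative, not the notebook's data.

```python
import pandas as pd

vertices = pd.DataFrame({"id": [1, 2, 3, 4]})
edges = pd.DataFrame({"src": [2, 3, 1], "dst": [1, 1, 2]})

# Counting edges per destination only sees users with at least one follower.
in_deg = edges.groupby("dst").size().rename("inDegree")

# Reindexing over all vertex ids fills the missing users with 0.
in_deg_all = in_deg.reindex(vertices["id"].tolist(), fill_value=0)

percentiles = [0.25, 0.5, 0.75, 0.9]
stats_nonzero = in_deg.describe(percentiles=percentiles)
stats_all = in_deg_all.describe(percentiles=percentiles)
```

In Spark, a comparable effect could be obtained by left-joining `g.vertices` with `g.inDegrees` and filling nulls with 0 before aggregating.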

### Users by countries

In [0]:
g.vertices.groupBy("country").count().show()

### Connected components

In [0]:
# Connected components
sc.setCheckpointDir("/FileStore/shared_uploads/choinsa1@students.zhaw.ch/project/checkpoints")
connected = g.connectedComponents()

In [0]:
connected.select("id", "component").groupBy("component").count().orderBy(F.desc("count")).show(5)

*Among the top-ranked users there is one connected component containing 4265 of the 4949 users.*

In a connected component of the graph, any two users are joined by a path when the direction of the follow relationship is ignored.

### Strongly connected components

In [0]:
strongly_connected = g.stronglyConnectedComponents(maxIter=10)

In [0]:
strongly_connected.select("id", "component").groupBy("component").count().orderBy(F.desc("count")).show(5)

*Among the top-ranked users there is one strongly connected component containing 3637 of the 4949 users.*

In a strongly connected component of the graph, there is a path in each direction between each pair of users.
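
The difference between the two notions can be seen on a small directed graph. The sketch below is plain Python applied directly from the definitions, not the distributed algorithm GraphFrames runs: a connected component is found by ignoring edge direction, and a strongly connected component as the intersection of forward and backward reachability. The helper names and the toy graph are hypothetical, and the approach only suits small graphs.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return all nodes reachable from `start` by following `adjacency`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def components(nodes, edges, strong):
    """Group nodes into (strongly) connected components, in node order."""
    fwd, bwd, both = defaultdict(set), defaultdict(set), defaultdict(set)
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
        both[src].add(dst)   # undirected view: add the edge both ways
        both[dst].add(src)
    result, assigned = [], set()
    for n in nodes:
        if n in assigned:
            continue
        if strong:
            # Nodes that n can reach AND that can reach n.
            comp = reachable(n, fwd) & reachable(n, bwd)
        else:
            comp = reachable(n, both)
        result.append(comp)
        assigned |= comp
    return result


nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 1), (2, 3), (4, 5)]
weak = components(nodes, edges, strong=False)
strong = components(nodes, edges, strong=True)
```

User 3 belongs to the same connected component as users 1 and 2 but forms a strongly connected component of its own, because no path leads from 3 back to 1 or 2. This is why the largest strongly connected component above is smaller than the largest connected component.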

### Community detection

## Visualizations

In [0]:
import networkx as nx
import matplotlib.pyplot as plt

def plot_graph(edge_df, max_edges=100):
    """Draw at most max_edges edges of a Spark edge DataFrame as an undirected graph."""
    sample = nx.Graph()
    # take() returns an arbitrary subset of edges, not a representative sample
    for row in edge_df.select("src", "dst").take(max_edges):
        sample.add_edge(row["src"], row["dst"])

    plt.figure(figsize=(10, 10))
    pos = nx.kamada_kawai_layout(sample, scale=2)
    nx.draw(sample, pos=pos, with_labels=True)
    plt.show()

plot_graph(g.edges)