# GraphFrames Basics

Examples of how to use GraphFrames for basic queries, motif finding, and general graph algorithms.

## Basic setup

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

To use this API, we first need to import the Python modules. This assumes the GraphFrames Spark package has already been configured correctly to run with this notebook.

In [None]:
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import *
from graphframes import *

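# Reuse the notebook's Spark session, or create one if none exists yet
spark = SparkSession.builder.getOrCreate()
# Some graph algorithms (e.g. connectedComponents) need a checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp")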

## Preparing the data

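A graph consists of a set of vertices and a set of edges. Both have to be provided as data frames: the vertex data frame needs an `id` column, and the edge data frame needs `src` and `dst` columns.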

In [None]:
n_schema = StructType([StructField("id", LongType()), StructField("sex", StringType())])
n_nodes = [[0, "m"], [1, "f"], [2, "m"], [3, "m"], [4, "m"], [5, "f"], [6, "m"], [7, "f"], [8, "m"], [9, "f"]]

e_schema = StructType([StructField("src", LongType()), StructField("dst", LongType())])
n_edges = [[1,2], [2,3], [3,1], [3,4], [4,2], [2,4], [9, 1], [8, 1], [7, 1], [6, 5], [9, 0], [0,1]]

nodes_df = spark.createDataFrame(n_nodes, schema=n_schema)
print("nodes:")
nodes_df.show()

edges_df = spark.createDataFrame(n_edges, schema=e_schema)
print("edges:")
edges_df.show()

## Constructing the graph

Now we can create the graph frame from the node and edge data frames. It's handy to cache the result to avoid recomputing it.

In [None]:
g1 = GraphFrame(nodes_df, edges_df).cache()

## Creating another graph

If we don't care about the schema, there is also an easier way to create a simple graph frame:

In [None]:
# Vertex DataFrame
v = spark.createDataFrame([
  (1, "Alice", 62),
  (2, "Bob", 12),
  (3, "Charlie", 55),
  (4, "David", 29),
  (5, "Esther", 32),
  (6, "Fanny", 14),
  (7, "Gabby", 60),
  (0, "Henry", 51)
], ["id", "name", "age"])

# Edge DataFrame
e = spark.createDataFrame([
  (1, 2, "friend"),
  (2, 3, "follow"),
  (3, 2, "follow"),
  (6, 3, "follow"),
  (5, 6, "follow"),
  (0, 2, "follow"),
  (0, 6, "follow"),
  (5, 4, "friend"),
  (4, 1, "friend"),
  (2, 0, "friend"),
  (1, 5, "friend")
], ["src", "dst", "relationship"])

# Create a GraphFrame
g2 = GraphFrame(v, e).cache()

We now have two different graphs, but we can merge them, provided the ids match. First, we merge the nodes:

In [None]:
print("g1 nodes:")
g1.vertices.show()

print("g2 nodes:")
g2.vertices.show()

print("merged nodes:")
merged_nodes = g1.vertices.join(g2.vertices, 'id', 'fullouter')
merged_nodes.show()
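
To illustrate what the full outer join does to the node attributes, here is a minimal plain-Python sketch on a small subset of the two vertex sets. The dictionaries and the helper `full_outer_join` are hypothetical names used only for this illustration; the notebook itself relies on Spark's `join`.

```python
# Subsets of the two vertex sets, keyed by id (illustrative data only)
g1_nodes = {0: "m", 1: "f", 8: "m"}                 # id -> sex
g2_nodes = {0: ("Henry", 51), 1: ("Alice", 62)}     # id -> (name, age)

def full_outer_join(left, right):
    """Keep every id from either side; missing attributes become None."""
    merged = {}
    for key in sorted(left.keys() | right.keys()):
        sex = left.get(key)
        name, age = right.get(key, (None, None))
        merged[key] = (sex, name, age)
    return merged

merged = full_outer_join(g1_nodes, g2_nodes)
for node_id, attributes in merged.items():
    print(node_id, attributes)
```

Ids that exist in only one graph (such as 8 here) survive the join, with `None` standing in for the attributes the other graph would have supplied.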

Do the same for the edges. Here we need to join on both 'src' and 'dst'.

In [None]:
print("g1 edges:")
g1.edges.show()

print("g2 edges:")
g2.edges.show()

print("merged edges:")
merged_edges_raw = g1.edges.join(g2.edges, ['src', 'dst'], 'fullouter')
merged_edges_raw.show()

Replace `null` values in the string columns (here `relationship`) with the word 'other'.

In [None]:
merged_edges = merged_edges_raw.na.fill('other')
merged_edges.show()

In [None]:
g = GraphFrame(merged_nodes, merged_edges)
g.cache()

## Simple algorithms

GraphFrames provides the same suite of standard graph algorithms as GraphX, plus some new ones. See the [API docs](https://graphframes.github.io/api/python/index.html) for more details.

### Vertex degrees

In [None]:
vertexInDegrees = g.inDegrees
vertexInDegrees.show()

vertexOutDegrees = g.outDegrees
vertexOutDegrees.show()

# node with the highest in degree
highest_in = vertexInDegrees.join(g.vertices, 'id') \
                            .orderBy("inDegree", ascending=False) \
                            .head()
print("highest in degree: " + str(highest_in))

# node with the highest out degree
highest_out = vertexOutDegrees.join(g.vertices, 'id') \
                              .orderBy("outDegree", ascending=False) \
                              .head()
print("highest out degree: " + str(highest_out))
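
To make explicit what the degree columns contain, here is a minimal plain-Python sketch that counts in- and out-degrees over the edge list of `g1` only (not the merged graph). The names `g1_edges`, `in_degree` and `out_degree` are introduced just for this illustration.

```python
from collections import Counter

# Edge list of g1, copied from the "Preparing the data" cell
g1_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 2), (2, 4),
            (9, 1), (8, 1), (7, 1), (6, 5), (9, 0), (0, 1)]

# In-degree: how often a node appears as a destination
in_degree = Counter(dst for _, dst in g1_edges)
# Out-degree: how often a node appears as a source
out_degree = Counter(src for src, _ in g1_edges)

print("in-degrees:", dict(in_degree))
print("out-degrees:", dict(out_degree))
```

Nodes that never appear as a destination (or source) are simply absent from the respective counter, which mirrors the fact that `inDegrees` and `outDegrees` only list vertices with at least one matching edge.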

### Motif queries
Motif finding uses ASCII-art-like patterns to search for structures in the graph. The general form looks like:
```
g.find("(a)-[e]->(b)") 
 .filter(...)
 .groupBy(...)
 .
```

Find all people that follow someone, but are not followed back.

In [None]:
motifs = g.find("(a)-[e]->(b); !(b)-[]->(a)") \
          .filter("e.relationship = 'follow'")
motifs.show()
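
As a rough illustration of what this motif expresses, the following plain-Python sketch applies the same logic to the edge list of `g2` alone. Because it ignores the edges contributed by `g1`, its result can differ from what the merged graph returns. The function name `follows_without_follow_back` is hypothetical.

```python
# Edge list of g2, copied from the "Creating another graph" cell
g2_edges = [
    (1, 2, "friend"), (2, 3, "follow"), (3, 2, "follow"), (6, 3, "follow"),
    (5, 6, "follow"), (0, 2, "follow"), (0, 6, "follow"), (5, 4, "friend"),
    (4, 1, "friend"), (2, 0, "friend"), (1, 5, "friend"),
]

def follows_without_follow_back(edges):
    """Return (a, b) pairs where a follows b and no edge of any kind goes from b to a."""
    connected = {(src, dst) for src, dst, _ in edges}
    return [(src, dst) for src, dst, rel in edges
            if rel == "follow" and (dst, src) not in connected]

one_way_follows = follows_without_follow_back(g2_edges)
print(one_way_follows)
```

The `(dst, src) not in connected` test plays the role of the negated term `!(b)-[]->(a)`: it rejects a pair as soon as any reverse edge exists, regardless of its relationship.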

Find all people older than 40 that follow at least two people of age under 15.

In [None]:
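# name both edges so that we can restrict them to 'follow' relationships
candidates = g.find("(a)-[e1]->(b); (a)-[e2]->(c)") \
              .filter("e1.relationship = 'follow' AND e2.relationship = 'follow'") \
              .filter("b.id != c.id") \
              .filter("a.age > 40") \
              .filter("b.age < 15") \
              .filter("c.age < 15")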
candidates.show()

### Label Propagation
Real-world networks tend to have community structure. [Label propagation](https://en.wikipedia.org/wiki/Label_Propagation_Algorithm) is an algorithm for finding communities.

In [None]:
labels = g.labelPropagation(maxIter=1)
labels.show()
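
To give an intuition for the algorithm, here is a minimal plain-Python sketch of synchronous label propagation. It rests on several assumptions that may not match the GraphFrames implementation: edges are treated as undirected, ties are broken by picking the smallest label, and iteration stops early once the labels no longer change. The function `label_propagation` and its parameters are hypothetical.

```python
from collections import Counter, defaultdict

def label_propagation(edges, max_iter=5):
    """Each node repeatedly adopts the most frequent label among its neighbours."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    # Every node starts in its own community
    labels = {node: node for node in neighbours}
    for _ in range(max_iter):
        new_labels = {}
        for node, nbrs in neighbours.items():
            counts = Counter(labels[n] for n in nbrs)
            best = max(counts.values())
            # Assumption: break ties deterministically with the smallest label
            new_labels[node] = min(l for l, c in counts.items() if c == best)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Two disjoint triangles should end up as two communities
communities = label_propagation([(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4)])
print(communities)
```

With too few iterations the labels may not have settled yet, which is worth keeping in mind when choosing `maxIter` in the cell above.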

### Triangles
Strong communities exhibit a large number of triangles. `triangleCount()` assigns each node the number of triangles it is part of.

In [None]:
triangles = g.triangleCount()
triangles.show()
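
The counting itself can be sketched in plain Python. This is only an illustration on the edge list of `g1`, assuming that edge direction is ignored and that duplicate edges and self-loops do not count; the helper `triangle_counts` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

# Edge list of g1, copied from the "Preparing the data" cell
g1_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 2), (2, 4),
            (9, 1), (8, 1), (7, 1), (6, 5), (9, 0), (0, 1)]

def triangle_counts(edges):
    """Count, for every node, the triangles it takes part in (undirected view)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        if src != dst:                      # ignore self-loops
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    counts = {node: 0 for node in neighbours}
    for node, nbrs in neighbours.items():
        # A triangle through `node` is a pair of its neighbours that are adjacent
        for a, b in combinations(sorted(nbrs), 2):
            if b in neighbours[a]:
                counts[node] += 1
    return counts

g1_triangles = triangle_counts(g1_edges)
print(g1_triangles)
```

In this view `g1` contains three triangles (1-2-3, 2-3-4 and 0-1-9), so nodes 1, 2 and 3 each take part in two of them.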

### Others

In [None]:
# bfs - finds shortest paths from nodes satisfying the first condition to nodes satisfying the second condition
paths = g.bfs("age > 40 and sex = 'm'", "age < 40 and sex = 'f'", maxPathLength=3)
paths.show()

# connected components
components = g.connectedComponents()
components.show()

# page rank
# ranks = g.pageRank(resetProbability=0.15, tol=0.01)
# ranks.vertices.orderBy('pagerank', ascending=False).show()

# foo = g.shortestPaths(landmarks=[1, 2])
# foo.show()
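
For readers unfamiliar with breadth-first search, here is a minimal plain-Python sketch of the idea behind `bfs`, run on the directed edge list of `g1`. Unlike the GraphFrames call, this simplified version takes two concrete node ids instead of filter expressions and returns a single shortest path; `shortest_path` is a hypothetical helper.

```python
from collections import defaultdict, deque

# Edge list of g1, copied from the "Preparing the data" cell
g1_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 2), (2, 4),
            (9, 1), (8, 1), (7, 1), (6, 5), (9, 0), (0, 1)]

def shortest_path(edges, start, goal):
    """Return one shortest directed path from start to goal, or None."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

print(shortest_path(g1_edges, 9, 4))
print(shortest_path(g1_edges, 5, 6))
```

Because nodes are explored level by level, the first time the goal is reached the path to it is guaranteed to use the fewest edges; the second call returns `None` because node 5 has no outgoing edges.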

## Visualization of a graph

Unfortunately, the Jupyter notebook has no built-in way to draw a graph, but we can use an external JavaScript library for this purpose. To do that, we need to convert the data into a format the library understands.

In [None]:
import random

def node_to_dict(r):
    # r is a vertex row of the merged graph: (id, sex, name, age)
    return {
        'id': r[0],
        'label': r[2],
        'x': random.uniform(0,1),
        'y': random.uniform(0,1),
        'size': 0.5,
        'color': '#f00' if r[1] == 'f' else '#00f'
    }

def edge_to_dict(i, r):
    return {
        'id': i,
        'source': r[0],
        'target': r[1],
        'label' : r[2],
        'type': 'arrow',
        'color': '#999'
    }

# build a list (not a lazy map object) so that json.dumps can serialize it
nodes_dict = [node_to_dict(r) for r in g.vertices.collect()]
edges_dict = [edge_to_dict(i, r) for i, r in enumerate(g.edges.collect())]

Now we are ready to show the data using the [sigmajs](http://sigmajs.org) library.

In [None]:
%%javascript
require.config({
    paths: {
        sigmajs: 'https://cdnjs.cloudflare.com/ajax/libs/sigma.js/1.2.0/sigma.min'
    }
});

require(['sigmajs']);

In [None]:
from IPython.display import display, HTML
from string import Template
import json

# Load the JavaScript template and close the file once it has been read
with open('js/sigma-graph.js', 'r') as f:
    js_text_template = Template(f.read())

graph_data = { 'nodes': nodes_dict, 'edges': edges_dict }

js_text = js_text_template.substitute({'graph_data': json.dumps(graph_data),
                                       'container': 'graph-div'})

html_template = Template('''
<div id="graph-div" style="height:400px"></div>
<script> $js_text </script>
''')

HTML(html_template.substitute({'js_text': js_text}))