## Import useful Python packages

In [0]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

## Add `GraphFrames` to `SparkContext`

[`GraphFrames`](https://graphframes.github.io/graphframes/docs/_site/index.html) is a DataFrame-based graph processing package for Apache Spark that aims to provide the functionality of [Spark's GraphX API](https://spark.apache.org/graphx/), which unfortunately does not natively support Python at the moment. To make `GraphFrames` work, we need to add it to our runtime environment (i.e., the cluster), so that it is available to our current `SparkContext`.<br>
**NOTE:** To correctly set up your cluster, please follow the steps below:
-  Select `7.6 (includes Apache Spark 3.0.1, Scala 2.12)` as your runtime environment
-  On the Cluster menu, go to the `Libraries` tab and install the following two dependencies (once the cluster is up and running), clicking on the `Install New` button:
    - `graphframes` via `PyPI`
    - `graphframes:graphframes:0.8.1-spark3.0-s_2.12` via `Maven`

### Import all packages from `graphframes`

In [0]:
from graphframes import *

In [0]:
spark

In [0]:
sc._conf.getAll()

# **Google Web Graph**

In this notebook, we will use a dataset from the [SNAP Group](https://snap.stanford.edu/data/web-Google.html)*, which was released by Google in 2002 as part of the [Google Programming Contest](http://www.google.com/programming-contest/).

More specifically, this dataset contains a snapshot of the **Google Web Graph** consisting of **875,713** nodes (i.e., web pages) connected by **5,105,039** edges (i.e., hyperlinks).

The original dataset comes as a single file; to make it easier to use with the `GraphFrames` API, it has been split into **2 sources**:
-  `web-pages.csv.bz2` containing only the nodes of the graph;
-  `web-links.csv.bz2` containing the links between nodes.

This split reflects how `GraphFrames` builds a graph: from **2 PySpark DataFrame objects** representing the set of vertices and the set of edges, respectively. Both DataFrames must follow a naming convention: the vertex DataFrame must contain at least a column named `id` (the node identifier), whilst the edge DataFrame must contain at least two columns named `src` and `dst`, holding the identifiers of the *source* and *destination* nodes of each (directed) edge.
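
As a rough illustration of this naming convention, the plain-Python sketch below splits a raw edge list into vertex rows keyed by `id` and edge rows keyed by `src` and `dst`. It is not the script used to prepare the two files; the function name `split_edge_list` and the toy edges are hypothetical.

```python
def split_edge_list(raw_edges):
    """Split (source, destination) pairs into vertex rows and edge rows.

    Vertex rows carry an `id` key; edge rows carry `src` and `dst` keys,
    mirroring the column names GraphFrames expects.
    """
    # Every node that appears as a source or a destination becomes a vertex.
    vertex_ids = sorted({node for edge in raw_edges for node in edge})
    vertices = [{"id": node} for node in vertex_ids]
    edges = [{"src": src, "dst": dst} for src, dst in raw_edges]
    return vertices, edges


# Toy data (hypothetical): three pages connected by three hyperlinks.
toy_vertices, toy_edges = split_edge_list([(0, 1), (0, 2), (2, 1)])
print(toy_vertices)
print(toy_edges)
```

Note that deriving vertices from the edge list alone would miss any isolated page (one with no incoming or outgoing links), which is one reason to keep a separate vertex source.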

For a deeper understanding of the `GraphFrames` package, please refer to the [online user guide](https://graphframes.github.io/graphframes/docs/_site/user-guide.html).

[\*SNAP is the acronym for the Stanford Network Analysis Project, led by Prof. Jure Leskovec.]

## **1. Data Acquisition**

This is the first step we need to accomplish before going any further. As usual, the dataset is first downloaded to the driver node and then moved to DBFS.

### Download the dataset to the local driver node's ```/tmp``` folder using ```wget```

### **Retrieve the dataset containing nodes (i.e., web pages)**

In [0]:
%sh wget -P /tmp https://github.com/gtolomei/big-data-computing/raw/master/datasets/web-pages.csv.bz2

### **Retrieve the dataset containing edges (i.e., hyperlinks)**

In [0]:
%sh wget -P /tmp https://github.com/gtolomei/big-data-computing/raw/master/datasets/web-links.csv.bz2

In [0]:
%fs ls file:/tmp/

### Move both files from local driver node's file system to DBFS

In [0]:
dbutils.fs.mv("file:/tmp/web-pages.csv.bz2", "dbfs:/bdc-2020-21/datasets/web-pages.csv.bz2")

In [0]:
dbutils.fs.mv("file:/tmp/web-links.csv.bz2", "dbfs:/bdc-2020-21/datasets/web-links.csv.bz2")

In [0]:
%fs ls /bdc-2020-21/datasets/

### **Read both dataset files (nodes and links) into two Spark Dataframes**

In [0]:
nodes_df = spark.read.load("dbfs:/bdc-2020-21/datasets/web-pages.csv.bz2", 
                         format="csv", 
                         sep=",", 
                         inferSchema="true", 
                         header="true"
                         )

In [0]:
links_df = spark.read.load("dbfs:/bdc-2020-21/datasets/web-links.csv.bz2", 
                         format="csv", 
                         sep=",", 
                         inferSchema="true", 
                         header="true"
                         )

## **2. Construct the `GraphFrames` graph object**

In [0]:
web_graph = GraphFrame(nodes_df, links_df) # Create the GraphFrame object from the 2 DataFrames

In [0]:
## Take a look at the DataFrames
web_graph.vertices.show(5, truncate=False)
web_graph.edges.show(5, truncate=False)

In [0]:
## Check the number of edges of each vertex
web_graph.degrees.show(10)
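
The `degrees` DataFrame reports, for each vertex, how many edges touch it regardless of direction; `GraphFrames` also exposes `inDegrees` and `outDegrees` for the directed counts. The plain-Python sketch below shows the same counting on a toy edge list. The function name `degree_counts` and the sample edges are hypothetical and not part of the dataset.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree, degree) counters for (src, dst) edges."""
    in_degree = Counter(dst for _, dst in edges)   # edges arriving at a node
    out_degree = Counter(src for src, _ in edges)  # edges leaving a node
    degree = in_degree + out_degree                # edges touching a node
    return in_degree, out_degree, degree


# Toy edges (hypothetical): 0 -> 1, 0 -> 2, 2 -> 1
in_deg, out_deg, deg = degree_counts([(0, 1), (0, 2), (2, 1)])
print(dict(in_deg), dict(out_deg), dict(deg))
```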

## **3. Execute PageRank on this graph**

The `GraphFrames` package has a nice and easy-to-use API for executing PageRank over a graph. There is a method called `pageRank` whose main parameters are:
-  `resetProbability`, which is the complement of the damping factor _d_ (i.e., _1-d_);
-  `tol`, which is the tolerance threshold used to determine the convergence of the PageRank vector;
-  `maxIter`, which specifies the maximum number of iterations to run (an alternative to `tol`: set one or the other, not both).
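
To make the roles of these parameters concrete, here is a toy power-iteration PageRank in plain Python. It is a sketch under stated assumptions, not the `GraphFrames` implementation: the function name `toy_pagerank`, the unnormalized update rule (every score starts at 1.0), and the stopping rule based on the largest per-node change are all choices made for illustration.

```python
def toy_pagerank(edges, reset_probability=0.15, tol=None, max_iter=None):
    """Toy PageRank over a list of (src, dst) edges.

    Exactly one of `tol` (stop when the largest score change falls below it)
    or `max_iter` (stop after that many iterations) must be given.
    """
    if (tol is None) == (max_iter is None):
        raise ValueError("set exactly one of tol or max_iter")

    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {node: 1.0 for node in nodes}  # assumption: start every score at 1.0
    iteration = 0
    while True:
        # Each node splits its current score evenly among its out-links.
        # Nodes without out-links pass nothing on in this simplified version.
        incoming = {node: 0.0 for node in nodes}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]

        new_ranks = {
            node: reset_probability + (1.0 - reset_probability) * incoming[node]
            for node in nodes
        }
        delta = max(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        iteration += 1

        if max_iter is not None and iteration >= max_iter:
            break
        if tol is not None and delta < tol:
            break
    return ranks


# Toy graph (hypothetical): pages "a" and "c" both link to page "b".
toy_ranks = toy_pagerank([("a", "b"), ("c", "b")], reset_probability=0.15, tol=0.01)
print(toy_ranks)
```

A smaller `tol` or a larger `max_iter` trades running time for a more settled score vector; a larger `reset_probability` pulls all scores closer together.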

In [0]:
pr = web_graph.pageRank(resetProbability=0.15, tol=0.01)

In [0]:
# Look at the pagerank score for every vertex
pr.vertices.show(10)

In [0]:
# Sorting nodes by their value of PageRank (from the highest to the lowest)
pr.vertices.sort("pagerank", ascending=False).show(10)

In [0]:
# Scaling PageRank values so that PageRank vector entries sum up to 1
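# Sum only the pagerank column: a bare sum() aggregates every numeric column, so [0][0] could pick up the sum of `id` instead
pr_sum = pr.vertices.groupBy().sum("pagerank").collect()[0][0]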

pr_norm = pr.vertices.withColumn("pagerank_norm", pr.vertices.pagerank/pr_sum)

In [0]:
pr_norm.sort("pagerank", ascending=False).show(10, truncate=False)
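
The scaling step above amounts to dividing every score by the total, so that the entries can be read as a probability distribution. A minimal plain-Python sketch of the same idea follows; the function name `normalize_scores` and the sample scores are hypothetical.

```python
def normalize_scores(scores):
    """Rescale a {node: score} mapping so that the values sum to 1."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("cannot normalize scores that sum to zero")
    return {node: score / total for node, score in scores.items()}


# Hypothetical raw scores for three pages.
raw_scores = {"a": 0.15, "b": 0.405, "c": 0.15}
norm_scores = normalize_scores(raw_scores)
print(norm_scores)
```

Normalization preserves the ranking, which is why sorting by `pagerank` or by `pagerank_norm` gives the same order.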

# **`GraphFrames` supports many other graph-based algorithms**

In addition to PageRank, `GraphFrames` provides support for the following graph-based algorithms:
-  Breadth-first search (BFS)
-  Connected components
-  Strongly connected components
-  Label Propagation Algorithm (LPA)
-  Shortest paths
-  Triangle count
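
To give a flavour of what one of these algorithms computes, here is a plain-Python sketch of breadth-first search over a directed edge list, returning the number of hops from a start node to every reachable node. It does not use the `GraphFrames` API; the function name `bfs_hops` and the toy edges are hypothetical.

```python
from collections import deque


def bfs_hops(edges, start):
    """Return {node: hop count from start} for nodes reachable via (src, dst) edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:  # first visit is the shortest hop count
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Toy edges (hypothetical); node 5 links to 0 but cannot be reached from 0.
toy_hops = bfs_hops([(0, 1), (1, 2), (0, 3), (3, 2), (2, 4), (5, 0)], start=0)
print(toy_hops)
```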

For any further information, please refer to the [online user guide](https://graphframes.github.io/graphframes/docs/_site/user-guide.html).