# Graph Mining Common Crawl URL Data with PySpark 
This notebook derives heavily from this excellent [Medium blog post](https://towardsdatascience.com/large-scale-graph-mining-with-spark-part-2-2c3d9ed15bb5) by Win Suen on the same topic. Our goal is to extend the steps upstream and downstream of the graph analysis in Win's [original notebook](https://github.com/wsuen/pygotham2018_graphmining/blob/master/notebooks/Graphframes_demo.ipynb) (the ETL that feeds it and the visualization that follows it), and to document any interesting observations we make. All our ETL and visualization code is published in this repository.

In addition to analyzing GraphFrame objects in PySpark, we aim to visualize the detected communities with tools such as D3.

## Import the necessary modules

In [1]:
import os
# Set before the SparkContext starts so that the GraphFrames package is fetched at launch
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 pyspark-shell'
import pyspark
from pyspark.sql import * 
from pyspark.sql.functions import udf, col, desc
import hashlib

In [2]:
sc = pyspark.SparkContext("local[*]")
spark = SparkSession.builder.appName('notebook').getOrCreate()

In [3]:
from graphframes import *

## Read in data
Our PySpark ETL code extracts the relevant URLs from the Common Crawl WARC files and writes them to Parquet files. We now read those files back for community clustering analysis.

In [4]:
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("./bootstrap/spark-warehouse/sep_2017")
# Drop duplicate rows and cap the working set at 10,000 rows
df = df.distinct().limit(10000)

### Visualize a subset of the data
We keep only the rows whose child URL contains a common substring (here `twitter`) and preview the first ten of them below.

In [5]:
focus = 'twitter'
df = df.where(df.child.contains(focus))
df.show(10)

+--------------------+--------------------+-----------+--------------------+
|              parent|           parentTLD|   childTLD|               child|
+--------------------+--------------------+-----------+--------------------+
|http://aspireholi...|aspireholidays.co.uk|twitter.com|http://twitter.co...|
|http://aftertheco...|   afterthecorps.com|twitter.com|http://twitter.co...|
|http://archive.bo...|  archive.boston.com|twitter.com|https://twitter.c...|
|http://allabouthe...|   allaboutherbs.org|twitter.com|https://twitter.c...|
|http://archive.fo...| archive.fortune.com|twitter.com|https://twitter.c...|
|http://animecons.com|       animecons.com|twitter.com|https://twitter.c...|
|http://asrar7days...|      asrar7days.com|twitter.com|http://twitter.co...|
|http://articles.c...|articles.chicagot...|twitter.com|http://twitter.co...|
| http://bitex.com.vn|        bitex.com.vn|twitter.com|  http://twitter.com|
|http://archive.co...|archive.constantc...|twitter.com|http://twitter.co...|
+--------------------+--------------------+-----------+--------------------+

In [6]:
# Collect the distinct parent and child TLDs (the graph nodes) so that each one can be assigned an id.
assignID = df.select("parentTLD","childTLD").rdd.flatMap(lambda x: x).distinct()
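
The `flatMap(lambda x: x)` call above unpacks every two-column row into its individual values, and `distinct()` then removes repeats. A minimal plain-Python sketch of the same idea, using three rows taken from the preview above in place of the DataFrame (the variable names are illustrative only):

```python
# (parentTLD, childTLD) pairs copied from the preview table; stand-ins for the DataFrame rows
rows = [
    ("aspireholidays.co.uk", "twitter.com"),
    ("afterthecorps.com", "twitter.com"),
    ("archive.boston.com", "twitter.com"),
]

# Equivalent of flatMap(lambda x: x): unpack each row into its two values
flattened = [domain for row in rows for domain in row]

# Equivalent of distinct(): keep each node name once (sorted only for stable display)
nodes = sorted(set(flattened))

print(len(flattened), len(nodes))
print(nodes)
```

Six flattened values collapse to four nodes because `twitter.com` appears as the child in every row.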

### Assign unique hashkeys to each item in the nodelist

In [7]:
def hashnode(x):
    # Derive an 8-hex-digit key from the SHA-1 digest (32 bits: collisions are unlikely but possible)
    return hashlib.sha1(x.encode("UTF-8")).hexdigest()[:8]

hashnode_udf = udf(hashnode)
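
Because the key keeps only the first 32 bits of the SHA-1 digest, it is deterministic but not guaranteed to be unique; by the birthday bound, collisions become likely once the node list reaches the high tens of thousands of names. The sketch below, in plain Python without Spark, shows one way to check a node list for collisions before building the graph. The helper `find_collisions` and the sample list are hypothetical and not part of the notebook.

```python
import hashlib
from collections import defaultdict

def hashnode(x):
    # Same truncation as the notebook: first 8 hex digits of the SHA-1 digest
    return hashlib.sha1(x.encode("UTF-8")).hexdigest()[:8]

def find_collisions(names):
    # Hypothetical helper: group distinct names by key and
    # report every key that is shared by more than one name
    buckets = defaultdict(set)
    for name in names:
        buckets[hashnode(name)].add(name)
    return {key: sorted(group) for key, group in buckets.items() if len(group) > 1}

sample = ["twitter.com", "aspireholidays.co.uk", "afterthecorps.com", "twitter.com"]
keys = [hashnode(name) for name in sample]
collisions = find_collisions(sample)

print(keys[0] == keys[3])  # the same name always maps to the same key
print(collisions)          # an empty dict means no two distinct names share a key
```

If a collision were ever reported, lengthening the slice (for example `[:16]`) would be one simple remedy.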

### Define graph vertices

In [8]:
vertices = assignID.map(lambda x: (hashnode(x), x)).toDF(["id","name"])
vertices.show(5)

+--------+--------------------+
|      id|                name|
+--------+--------------------+
|1ca72648|aspireholidays.co.uk|
|465806fb|         twitter.com|
|9858ddd3|   afterthecorps.com|
|3a2c956e|  archive.boston.com|
|e39256d7|   allaboutherbs.org|
+--------+--------------------+
only showing top 5 rows



### Define graph edges 

In [9]:
edges = df.select("parentTLD","childTLD")\
    .withColumn("src", hashnode_udf("parentTLD"))\
    .withColumn("dst", hashnode_udf("childTLD"))\
    .select("src","dst")
edges.show(5)

+--------+--------+
|     src|     dst|
+--------+--------+
|1ca72648|465806fb|
|9858ddd3|465806fb|
|3a2c956e|465806fb|
|e39256d7|465806fb|
|f9691dc2|465806fb|
+--------+--------+
only showing top 5 rows



### Create GraphFrame in PySpark

In [10]:
graph = GraphFrame(vertices, edges)

## Run Label-Propagation Analysis (LPA)
LPA is a community detection algorithm for large-scale networks ([original paper](https://arxiv.org/pdf/0709.2938.pdf)). Its main benefit is that it needs no labeled data before community detection: every node starts with its own unique label, and in each iteration a node adopts the label that is most common among its neighbors, so communities emerge from the structure of the graph itself.
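
To make the update rule concrete, here is a minimal plain-Python sketch of synchronous label propagation on a toy graph. It is not the GraphFrames implementation: the function `label_propagation`, the smallest-label tie-break, the early stop on convergence, and the toy edges are all assumptions made for illustration.

```python
from collections import Counter

def label_propagation(edges, max_iter=5):
    # Build an undirected adjacency map from (src, dst) pairs
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    # Every node starts in its own community, labeled with its own name
    labels = {node: node for node in neighbors}

    for _ in range(max_iter):
        new_labels = {}
        for node, nbrs in neighbors.items():
            # Count the labels currently held by this node's neighbors
            counts = Counter(labels[n] for n in nbrs)
            top = max(counts.values())
            # Tie-break (assumption): smallest label among the most frequent ones
            new_labels[node] = min(l for l, c in counts.items() if c == top)
        if new_labels == labels:
            break  # no label changed, so further iterations would be identical
        labels = new_labels
    return labels

# Two triangles (a, b, c) and (x, y, z) joined by a single bridge edge c-x
toy_edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("x", "y"), ("x", "z"), ("y", "z"),
    ("c", "x"),
]
toy_labels = label_propagation(toy_edges, max_iter=5)
print(toy_labels)
```

On this toy graph the nodes of each triangle end up sharing one label, and the two triangles end up with different labels. A label is only an identifier, so a community may carry the name of a node that belongs to another community.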

GraphFrames implements LPA; the cell below runs it for five iterations.

In [11]:
communities = graph.labelPropagation(maxIter=5)
communities.persist().show(10)

+--------+--------------------+------------+
|      id|                name|       label|
+--------+--------------------+------------+
|22f34613| biznes-bulgaria.com|429496729600|
|9dd923ac|          7dniv.info|429496729600|
|daaaa8f6|archive.constantc...|429496729600|
|804ff334|         acidcow.com|429496729600|
|6c4680a3|    antigo.chuza.org|429496729600|
|f8a6bd9e|  anunciamano.com.ar|429496729600|
|d2c250b7|      asrar7days.com|429496729600|
|27f1947d|     autos.mlive.com|429496729600|
|9858ddd3|   afterthecorps.com|429496729600|
|82c418b5|   balidiscovery.com|429496729600|
+--------+--------------------+------------+
only showing top 10 rows



## Run PageRank

In [12]:
results = graph.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank")\
    .join(vertices, on="id").orderBy("pagerank", ascending=False)\
    .show(5)

+--------+------------------+-------------------+
|      id|          pagerank|               name|
+--------+------------------+-------------------+
|465806fb|26.200405836312694|        twitter.com|
|2d723324|0.6811286114600442|           nola.com|
|407eca8a|0.5121267755338679|      animecons.com|
|9858ddd3|0.5121267755338679|  afterthecorps.com|
|f9691dc2|0.5121267755338679|archive.fortune.com|
+--------+------------------+-------------------+
only showing top 5 rows
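
As a rough illustration of what PageRank computes, the following plain-Python sketch runs a textbook power iteration on a small star-shaped graph. It is not GraphFrames' implementation: the function `pagerank`, its default `reset_probability`, the handling of nodes without outgoing links, and the toy edges are all assumptions. It also normalizes ranks to sum to 1, whereas the scores in the output above are not on that scale (twitter.com scores about 26.2).

```python
def pagerank(edges, reset_probability=0.15, max_iter=20):
    # Collect the nodes and each node's outgoing links
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread evenly (assumption)
        dangling = sum(ranks[node] for node in nodes if not out_links[node])
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_ranks = {node: base for node in nodes}
        # Every node passes the rest of its rank along its outgoing links
        for src, targets in out_links.items():
            for dst in targets:
                new_ranks[dst] += (1 - reset_probability) * ranks[src] / len(targets)
        ranks = new_ranks
    return ranks

# Star-shaped toy graph: every leaf links to one hub
star_edges = [("a", "hub"), ("b", "hub"), ("c", "hub")]
star_ranks = pagerank(star_edges, reset_probability=0.01, max_iter=20)
print(star_ranks)
```

In this toy graph the hub collects rank from every leaf and ends up with the highest score, while the leaves tie with each other. The notebook's output shows the same qualitative pattern: twitter.com far ahead and several parent domains sharing an identical score.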



## Run TriangleCount

In [13]:
trg_count = graph.triangleCount()  # keep the DataFrame; show() itself returns None
trg_count.show(5)

+-----+--------+--------------------+
|count|      id|                name|
+-----+--------+--------------------+
|    0|9dd923ac|          7dniv.info|
|    0|3eaf8394|      autos.nola.com|
|    0|407eca8a|       animecons.com|
|    0|f9691dc2| archive.fortune.com|
|    0|546cb1d0|autoteile-online.biz|
+-----+--------+--------------------+
only showing top 5 rows
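
The zero counts are plausibly a consequence of the `twitter` filter applied earlier: if almost every edge points at the same node, no three nodes are mutually linked. The plain-Python sketch below counts triangles on a toy star graph and on the same graph with one extra edge. The function `triangle_counts` and the toy edges are hypothetical and are not how GraphFrames computes the result.

```python
from itertools import combinations

def triangle_counts(edges):
    # Treat edges as undirected and ignore self-loops
    neighbors = {}
    for src, dst in edges:
        if src == dst:
            continue
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    counts = {}
    for node, nbrs in neighbors.items():
        # A triangle through `node` is a pair of its neighbors that are linked to each other
        counts[node] = sum(
            1 for u, v in combinations(sorted(nbrs), 2) if v in neighbors[u]
        )
    return counts

# Star graph: every leaf links only to the hub, so no triangle exists
star = [("a", "hub"), ("b", "hub"), ("c", "hub")]
# One extra edge between two leaves closes the triangle a-b-hub
closed = star + [("a", "b")]

star_counts = triangle_counts(star)
closed_counts = triangle_counts(closed)
print(star_counts)
print(closed_counts)
```

Every node in the star has a count of zero; adding the single edge between `a` and `b` gives `a`, `b`, and `hub` a count of one each, while `c` stays at zero.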

