## MSBX 5420 Assignment 4
This assignment has two parts: (1) graph analysis with Spark GraphFrames (Tasks 1 and 2), and (2) loading data from a MySQL database for a simple analysis (Task 3). Tasks 1 and 2 use two datasets: Facebook social networks (Task 1) and Reddit community links (Task 2). Task 3 continues our class exercise with the `employees` database.

### Task 1 - Graph Analysis on Facebook Networks

The data comes from Facebook circles. Social network data often looks simple, even boring: to protect privacy, only a (recoded) user id is available, and each row is a connection (friendship) from one user to another.

Let's first load the GraphFrames package and build the graph.

In [None]:
#in case you need to download the graphframes package
#!wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.jar
#if you encounter issues on MyBinder, you may reinstall pyspark package
#!pip install pyspark

In [None]:
#for cluster, switch kernel to PySpark, and use yarn or spark://spark-master:7077 in master()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[4]') \
                    .config("spark.executor.memory", "1g") \
                    .config("spark.driver.memory", "2g") \
                    .config("spark.jars", "mysql-connector-java-8.0.23.jar, graphframes-0.8.1-spark3.0-s_2.12.jar") \
                    .config("spark.packages", "graphframes:graphframes:0.8.1-spark3.0-s_2.12") \
                    .appName('spark_graph').getOrCreate()

In [None]:
#make sure graphframes-0.8.1-spark3.0-s_2.12.jar is under same directory
sc = spark.sparkContext
sc.addPyFile('graphframes-0.8.1-spark3.0-s_2.12.jar')
#if on the cluster
#sc = spark.sparkContext
#sc.addPyFile('s3://msbx5420-spr21/zhiyiwang/graphframes-0.8.1-spark3.0-s_2.12.jar')

In [None]:
#first read the dataset
import pyspark.sql.functions as fn

#this is a txt file without header so after reading data we use .toDF() to add column names
#for cluster, read files under s3://msbx5420-spr21/zhiyiwang/
fb_connection = spark.read.csv('./facebook_combined.txt.gz', sep=' ').toDF('from', 'to')
fb_connection.show()

Create vertices and edges dataframes, with `id` for vertices, and `src` / `dst` for edges.

In [None]:
#create vertices dataframe
fb_vertices = fb_connection.select(fn.col('from').alias('id')).union(fb_connection.select(fn.col('to').alias('id'))).distinct()
fb_vertices.count()

Because GraphFrames treats every graph as a directed multigraph and has no notion of an undirected graph, we need to duplicate each edge in the opposite direction, so that every pair of friends is connected by two edges that together capture their friendship on Facebook.
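
The mirroring idea can be sketched in plain Python, independent of Spark. This is only a toy illustration: the `friendships` list and the helper name `mirror_edges` are made up for the example and are not part of the assignment data or of GraphFrames.

```python
# Toy friendship list (hypothetical data): each pair appears once, in one direction.
friendships = [("0", "1"), ("0", "2"), ("1", "2"), ("2", "3")]

def mirror_edges(pairs):
    """Return a sorted, de-duplicated edge list holding both directions of every pair."""
    directed = set()
    for src, dst in pairs:
        directed.add((src, dst))  # original direction
        directed.add((dst, src))  # mirrored direction
    return sorted(directed)

directed_edges = mirror_edges(friendships)
print(len(directed_edges))  # 8: twice the number of friendships
```

After mirroring, every node has as many incoming as outgoing edges, which is why in-degree and out-degree coincide on a graph built this way.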

In [None]:
#create edges dataframe
fb_edges = fb_connection.union(fb_connection.select(fn.col('to').alias('from'),fn.col('from').alias('to'))) \
                        .withColumnRenamed('from', 'src').withColumnRenamed('to', 'dst').distinct()
fb_edges.show()
fb_edges.count()

In total the data contains 4,039 users and 176,468 edges (bi-directional friendship), consistent with the data description. Then we can build the graph with the two dataframes.

In [None]:
from graphframes import GraphFrame
#build graph
fb_graph = GraphFrame(fb_vertices, fb_edges)
print(fb_graph)

Let's first get degree centrality. Because a friendship tie on Facebook is essentially undirected (bi-directional in our data setup), inDegree and outDegree are the same here.

In [None]:
#because this is an undirected graph (Facebook only has friendship, not following / followed), inDegree and outDegree are same here
fb_graph.inDegrees.sort(fn.desc("inDegree")).show()
fb_graph.outDegrees.sort(fn.desc("outDegree")).show()

Now let's calculate PageRank to see which users are the most important in the network.
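
As background, the idea behind PageRank can be sketched as a power iteration on a tiny graph. This is a minimal pure-Python sketch, not the GraphFrames implementation: the function name `pagerank`, the toy edge list, and the choices of a 0.85 damping factor and 50 iterations are assumptions made for illustration, and a library may scale its scores differently.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed edge list; scores sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the same "random jump" share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: three nodes point at "c", and "c" points back at "a".
toy_edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
toy_rank = pagerank(toy_edges)
top_node = max(toy_rank, key=toy_rank.get)
print(top_node)  # c
```

Node `c` ranks highest because it collects links from every other node, and `a` comes second because its only incoming link is from the highly ranked `c`.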

In [None]:
#[Your Code] to calculate pagerank on the graph and display nodes with top pageranks



Shortest paths are useful in many cases. Note that the `shortestPaths()` function in GraphFrames actually calculates the shortest distances (number of edges) from each node in the graph to every node specified in `landmarks`. Here we want the shortest distances from all users to two sample users with `id` of `0` and `25`, and then the distribution of those distances. So we first need to calculate shortest paths on the graph and extract the distance information.
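
To make "distance from every node to a landmark" concrete, here is a small breadth-first search sketch in plain Python. It is a toy under stated assumptions: the helper `distances_to_landmark` and the four-node path graph are invented for the example, and the search walks edges backwards from the landmark so that each result is the hop count from a node to the landmark.

```python
from collections import Counter, deque

def distances_to_landmark(edges, landmark):
    """Hop counts from every node that can reach `landmark` along directed edges."""
    incoming = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(src)
    dist = {landmark: 0}
    queue = deque([landmark])
    while queue:
        node = queue.popleft()
        # Walk edges backwards: anything pointing at `node` is one hop further away.
        for prev in incoming.get(node, []):
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist

# Hypothetical path graph 0 - 1 - 2 - 3, stored with both directions of each edge.
path = [("0", "1"), ("1", "2"), ("2", "3")]
toy_edges = path + [(dst, src) for src, dst in path]

to_two = distances_to_landmark(toy_edges, "2")
histogram = Counter(to_two.values())
print(sorted(histogram.items()))  # [(0, 1), (1, 2), (2, 1)]
```

The histogram at the end plays the same role as the distance distribution asked for in this task: one node at distance 0 (the landmark itself), two at distance 1, and one at distance 2.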

In [None]:
#[Your Code] to calculate shortest paths from all nodes to node id 0 and 25


Then we check the distribution of distances from all nodes to node 0 and 25.

In [None]:
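#check the distribution of distances to node 0 and 25
#look up each landmark by its key: the order returned by map_values() is not guaranteed to match the order of the landmarks
#(this assumes the keys are strings, because the ids were read from a text file without a schema)
shortest_path.select(fn.col('distances')['0'].alias('distance')).groupBy('distance').count().orderBy('distance').show()
shortest_path.select(fn.col('distances')['25'].alias('distance')).groupBy('distance').count().orderBy('distance').show()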

Next we want to know the structure of this network, so we can get the clusters. We use label propagation to identify clusters, and show the number of clusters as well as size of clusters in the end.
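
The intuition behind label propagation can be shown on a toy graph. The sketch below is a simplified, deterministic variant written for illustration only: every node starts with its own id as label and repeatedly adopts the most frequent label among its neighbors, with ties broken by the smallest label. The function name, the tie-breaking rule, and the two-triangle graph are assumptions, and a real implementation may break ties differently and return different clusters.

```python
from collections import Counter

def label_propagation(edges, max_iter=5):
    """Synchronous label propagation on an undirected edge list (each pair listed once)."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)
    labels = {node: node for node in neighbors}  # every node starts in its own cluster
    for _ in range(max_iter):
        new_labels = {}
        for node, nbrs in neighbors.items():
            counts = Counter(labels[n] for n in nbrs)
            best = max(counts.values())
            # Tie-break on the smallest label so the toy result is reproducible.
            new_labels[node] = min(lab for lab, c in counts.items() if c == best)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical graph: two disconnected triangles.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
toy_labels = label_propagation(toy_edges)
cluster_sizes = Counter(toy_labels.values())
print(len(cluster_sizes), sorted(cluster_sizes.values()))  # 2 [3, 3]
```

Counting how many nodes share each final label gives both the number of clusters and their sizes, which is the summary this task asks for.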

In [None]:
#[Your Code] to use label propagation to identify clusters in the network; then show the total number of clusters you get and the size of each cluster.



### Task 2 - Graph Analysis on Reddit Communities
In Task 2 we work on a different graph dataset, from Reddit. Reddit is a large community for discussing different topics, organized into subreddits for specific topics. One community (subreddit) links to another when one of its posts refers to a post in the other community. The data here therefore contains the posts from 2014 to 2017 that include a hyperlink to a different subreddit. It comes in two parts: hyperlinks in the body of Reddit posts, and hyperlinks in the title of Reddit posts.

In [None]:
import pyspark.sql.functions as fn

#read data, two data files in total; for cluster read files under s3://msbx5420-spr21/zhiyiwang/
reddit_link = spark.read.csv('./reddit_hyperlinks.csv.gz', header=True, inferSchema=True, sep=',')
reddit_link.show()
reddit_link_title = spark.read.csv('./reddit_hyperlinks_title.csv.gz', header=True, inferSchema=True, sep=',')
reddit_link_title.show()

Here we union the two dataframes first and then create the vertices/edges dataframes.

In [None]:
reddit_link_all = reddit_link.union(reddit_link_title)

In [None]:
#create vertices dataframe
reddit_vertices = reddit_link_all.select(fn.col('SOURCE_SUBREDDIT').alias('id')) \
                                 .union(reddit_link_all.select(fn.col('TARGET_SUBREDDIT').alias('id'))).distinct()
reddit_vertices.count()

In [None]:
#create edges dataframe
reddit_edges = reddit_link_all.withColumnRenamed('SOURCE_SUBREDDIT', 'src').withColumnRenamed('TARGET_SUBREDDIT', 'dst')
reddit_edges.show()
reddit_edges.count()

Now build the graph with the two dataframes.

In [None]:
from graphframes import GraphFrame
#build graph
reddit_graph = GraphFrame(reddit_vertices, reddit_edges)
print(reddit_graph)

Let's start with degree centrality again. Here the importance of a community is better approximated by the links *to* the community (its posts were referenced in other communities), so we use inDegree centrality.
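
As a quick illustration of in-degree on a graph with repeated links, the following plain-Python sketch counts how often each node appears as a link target. The edge list and the helper name `in_degrees` are made up for the example.

```python
from collections import Counter

def in_degrees(edges):
    """Count incoming links per node; repeated links between the same pair all count."""
    return Counter(dst for _, dst in edges)

# Hypothetical links: "a" links to "c" twice, as two separate posts would.
toy_edges = [("a", "c"), ("b", "c"), ("a", "c"), ("c", "a")]
toy_in = in_degrees(toy_edges)
print(toy_in.most_common(1))  # [('c', 3)]
```

Because each hyperlink is its own edge, a community that is referenced many times by the same source still accumulates a high in-degree.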

In [None]:
reddit_graph.inDegrees.sort(fn.desc("inDegree")).show()

Now let's use PageRank to determine the importance of each community and show the top ones.

In [None]:
#[Your Code] to use pagerank to identify the most important communities based on the hyperlinks and display the top ones



One column in the data is the sentiment of the post that links one subreddit to another, so we can tell whether a post refers to the other subreddit positively or negatively. Negative posts about other subreddits suggest that some communities may be in conflict. Can you identify which pairs of communities are more likely to have conflicts?

To do this, we can perform a query on the edges in the graph. Basically, we can obtain the average sentiment (`LINK_SENTIMENT` column) from one subreddit to another. To make sure this is not random, we should ONLY consider those pairs of communities with *at least 10 hyperlinks from one to another*.
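
The aggregation described above can be sketched on a few made-up rows in plain Python. Everything here is hypothetical: the community names, the helper `average_sentiment`, and the assumption that sentiment is coded as `-1` or `1`. It only illustrates grouping by (source, target), filtering on the number of links, and sorting by average sentiment.

```python
from collections import defaultdict

def average_sentiment(links, min_links=10):
    """Average sentiment per (source, target) pair, most negative first."""
    totals = defaultdict(lambda: [0, 0])  # (src, dst) -> [link count, sentiment sum]
    for src, dst, sentiment in links:
        totals[(src, dst)][0] += 1
        totals[(src, dst)][1] += sentiment
    rows = [(pair, total / count, count)
            for pair, (count, total) in totals.items()
            if count >= min_links]  # drop pairs with too few links to be meaningful
    return sorted(rows, key=lambda row: row[1])

# Hypothetical links: (source, target, sentiment).
toy_links = ([("a", "b", -1)] * 8 + [("a", "b", 1)] * 2   # 10 links, mostly negative
             + [("c", "d", 1)] * 12                       # 12 links, all positive
             + [("e", "f", -1)] * 3)                      # only 3 links, filtered out
toy_result = average_sentiment(toy_links)
print(toy_result[0][0])  # ('a', 'b')
```

The pair `("e", "f")` is entirely negative but is dropped because three links are too few, which is exactly the purpose of the minimum-count condition.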

In [None]:
#[Your Code] to identify the communities with significant conflicts

Next let's perform some searches on the graph. Assume you are a random walker in Reddit communities: you browse posts at random without targeting any particular community. Now assume you start your browsing trip in the `leagueoflegends` community (League of Legends is a Multiplayer Online Battle Arena (MOBA) e-sports video game). We want to know whether (and by which route) you can reach other communities through the hyperlinks between communities. To do this, we can use breadth-first search or motif finding.

Note that this is not likely to be a real action in practice and it is also not the actual role of those hyperlinks. We just use it as a simulated case of graph search. Now let's first see if you can reach `politics` community from `leagueoflegends` community directly.

In [None]:
paths = reddit_graph.bfs(fromExpr = "id = 'leagueoflegends'", toExpr = "id = 'politics'", maxPathLength = 1)
paths.show(truncate=False)

There seem to be no direct hyperlinks from the `leagueoflegends` subreddit to the `politics` subreddit. Therefore, we should check whether there are paths of length 2, so that we may still reach the `politics` community through one intermediate community. Can you identify those paths through *both* breadth-first search and motif finding?
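
Both approaches look for the same structure: a start node, one intermediate node, and a goal node joined by two directed edges. A minimal plain-Python sketch of that pattern follows; the node names and the helper `two_hop_paths` are invented for illustration and say nothing about the real Reddit data.

```python
def two_hop_paths(edges, start, goal):
    """All (start, mid, goal) triples where start -> mid and mid -> goal are edges."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
    return sorted(
        (start, mid, goal)
        for mid in successors.get(start, set())
        if goal in successors.get(mid, set())
    )

# Hypothetical graph: "s" reaches "t" through "m1" and "m2", but not through "x".
toy_edges = [("s", "m1"), ("s", "m2"), ("s", "x"),
             ("m1", "t"), ("m2", "t"), ("t", "s")]
toy_paths = two_hop_paths(toy_edges, "s", "t")
print(toy_paths)  # [('s', 'm1', 't'), ('s', 'm2', 't')]
```

On a multigraph such as the Reddit links, a pattern match over edges may return the same triple of communities several times (once per combination of parallel links), so de-duplicating on the three node ids is usually worthwhile.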

In [None]:
#[Your Code] to use breadth-first search to find possible shortest paths from leagueoflegends to politics



In [None]:
#[Your Code] to use motif finding to find possible shortest paths from leagueoflegends to politics



### Task 3 - Read Data from MySQL
As the last task in all assignments, you will see no existing code and you will do a simple task on your own. 

The task is to read data from a MySQL database. Follow the steps and commands in the class exercise to have a MySQL database available in Docker and import the `employees` database. Then read data from the `dept_emp` table (employee-department table).

After reading the data into a Spark dataframe, we want to know which employees have been in more than one department. As we mentioned in class, employees who connect multiple groups may bridge structural holes and gain a comparative advantage. Even though the social network of employees is not available in this database, we can approximate the concept by identifying the employees who have worked in more than one department. This is just a rough calculation.

In the table, there are columns `from_date` and `to_date`. If `to_date` is `9999-01-01`, the employee was still at the company at the time of data collection (current employee). Therefore, we want to keep the employees with a `to_date` of `9999-01-01` who have more than one record in the `dept_emp` table. That is the result you should obtain in the end. You can use either dataframe operations or SQL, and use `.show()` to display the results.
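
The filtering logic can be checked on a handful of made-up rows before running it against the database. The sketch below is plain Python on hypothetical tuples shaped like `(emp_no, dept_no, from_date, to_date)`; the rows and the helper name are invented, and it follows one reading of the task: an employee qualifies if they have a record ending in `9999-01-01` and more than one record overall.

```python
from collections import Counter

def current_multi_dept(rows, open_end="9999-01-01"):
    """Employee numbers with an open-ended record and more than one record in total."""
    record_counts = Counter(emp_no for emp_no, _, _, _ in rows)
    current = {emp_no for emp_no, _, _, to_date in rows if to_date == open_end}
    return sorted(emp for emp in current if record_counts[emp] > 1)

# Hypothetical rows, not taken from the real employees database.
toy_rows = [
    (1, "d001", "1990-01-01", "9999-01-01"),  # current, single department
    (2, "d001", "1991-01-01", "1995-01-01"),
    (2, "d002", "1995-01-01", "9999-01-01"),  # current, two departments
    (3, "d003", "1992-01-01", "1996-01-01"),
    (3, "d001", "1996-01-01", "1999-01-01"),  # two departments, but has left
]
toy_result = current_multi_dept(toy_rows)
print(toy_result)  # [2]
```

Employee 3 shows why the order of the two conditions matters: counting records only after filtering on `to_date` would leave a single row per current employee and hide earlier departments.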

In [None]:
#[Your Code] to read data from dept_emp table and filter current employees who have worked at more than one department
#for cluster, MySQL is at ip-172-16-0-110:3306 rather than localhost:3306; the username and password are the same as in the class exercise


