#Colab 5: Learn to use Spark's GraphFrames

## Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!pip install graphframes

!curl -L -o "/usr/local/lib/python3.7/dist-packages/pyspark/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.jar

openjdk-8-jdk-headless is already the newest version (8u282-b08-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  242k  100  242k    0     0   596k      0 --:--:-- --:--:-- --:--:--  596k


Now we authenticate a Google Drive client to download files. Please follow the instructions to enter the authorization code.


In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Now download the files we will be processing in our Spark job.

In [None]:
id='1MnfrYQJtV1p0Iv2xl5qV8QoJY7y_Ozj5'
downloaded = drive.CreateFile({'id': id}) 
downloaded.GetContentFile('metro.csv') 

id='19UgCueFvH4agly8ks0TPzIxIhOWDmh4m'
downloaded = drive.CreateFile({'id': id}) 
downloaded.GetContentFile('country.csv') 

id='1pBOY2eVrFFI2FXQY4ctajYVr1ubfFWoZ'
downloaded = drive.CreateFile({'id': id}) 
downloaded.GetContentFile('continent.csv') 

id='1OkJa_O3G6KgcMq8uCuy9hvoTidjkMCLx'
downloaded = drive.CreateFile({'id': id}) 
downloaded.GetContentFile('metro_country.csv') 

id='1MCrV5XNjlr4X9TJXZV_6RkqGZ9yFZY7w'
downloaded = drive.CreateFile({'id': id}) 
downloaded.GetContentFile('country_continent.csv') 

Import libraries.

In [None]:
# import pandas as pd
# import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

from graphframes import *


Initialize the Spark context.


In [None]:
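# configure Spark
conf = SparkConf().set("spark.ui.port", "4050")

# stop any context left over from a previous run, then create a fresh one
try:
    sc.stop()
except NameError:
    pass  # no existing context in a fresh runtime
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()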


##Load the data## 
The contents of these five input files are straightforward. `metro.csv` holds the city (metro) information, `country.csv` the country information, and `continent.csv` the continent information. `metro_country.csv` links cities to countries, and `country_continent.csv` links countries to continents.

In [None]:
metro = spark.read.csv("metro.csv", header='true').withColumnRenamed("name","metro_name")
country = spark.read.csv("country.csv", header='true').withColumnRenamed("name","country_name")
continent = spark.read.csv("continent.csv", header='true').withColumnRenamed("name","continent_name")
metro_country = spark.read.csv("metro_country.csv", header='true')
country_continent = spark.read.csv("country_continent.csv", header='true')




In [None]:
# look at the heads

[print(row) for row in metro.head(3)]
[print(row) for row in country.head(3)]
[print(row) for row in continent.head(3)]
[print(row) for row in metro_country.head(3)]
[print(row) for row in country_continent.head(3)]


Row(metro_id='1', metro_name='Tokyo', population='36923000')
Row(metro_id='2', metro_name='Seoul', population='25620000')
Row(metro_id='3', metro_name='Shanghai', population='24750000')
Row(country_id='1', country_name='Japan')
Row(country_id='2', country_name='South Korea')
Row(country_id='3', country_name='China')
Row(continent_id='1', continent_name='Asia')
Row(continent_id='2', continent_name='Africa')
Row(continent_id='3', continent_name='North America')
Row(metro_id='1', country_id='1')
Row(metro_id='2', country_id='2')
Row(metro_id='3', country_id='3')
Row(country_id='1', continent_id='1')
Row(country_id='2', continent_id='1')
Row(country_id='3', continent_id='1')


[None, None, None]

##Task 1##

The goal of this assignment is to learn to use Spark's GraphFrames to build graphs and run some simple queries on them. Here are some references you may use:

*   https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html
*   https://towardsdatascience.com/graphframes-in-jupyter-a-practical-guide-9b3b346cebc5
*   https://www.baeldung.com/spark-graph-graphframes

The first task of this assignment is to build a graph that shows the relationship between countries and metro cities. Use a single graph containing all the countries and all the metro cities in each country.
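A GraphFrame is built from two DataFrames: a vertex table with an `id` column and an edge table with `src` and `dst` columns. Note that in the heads printed above, `metro_id` and `country_id` both start at `'1'`, so the raw IDs would clash if used directly as vertex IDs. The following plain-Python sketch shows the shape of the two tables, using only the three sample rows printed above and keying vertices by name (names could still collide for a city that shares its country's name, so treat this as one possible choice). The names `vertex_rows` and `edge_rows` are illustrative; in the notebook the same result would come from DataFrame joins and selects rather than dictionaries.

```python
# Sample rows copied from the heads printed earlier in this notebook.
metro_rows = [("1", "Tokyo"), ("2", "Seoul"), ("3", "Shanghai")]
country_rows = [("1", "Japan"), ("2", "South Korea"), ("3", "China")]
metro_country_rows = [("1", "1"), ("2", "2"), ("3", "3")]  # (metro_id, country_id)

metro_name = dict(metro_rows)      # metro_id -> metro_name
country_name = dict(country_rows)  # country_id -> country_name

# Vertex table: one row per country and one per metro, keyed by name
# because metro_id '1' and country_id '1' would otherwise clash.
vertex_rows = [{"id": name} for name in country_name.values()]
vertex_rows += [{"id": name} for name in metro_name.values()]

# Edge table: directed from the country (src) to the metro (dst).
edge_rows = [
    {"src": country_name[c_id], "dst": metro_name[m_id]}
    for m_id, c_id in metro_country_rows
]

for row in edge_rows:
    print(row)
```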


In [None]:
import numpy as np


# Your code goes here: build the vertex and edge DataFrames used by the next cell.
# mc_edges = metro_country.join(country_continent, on='country_id') # these are the edges
# mc_vertices =


In [None]:




mc = GraphFrame(mc_vertices, mc_edges)

Let's verify the graph above by checking the out-degrees of its vertices. I defined my graph as a directed graph, i.e., a country node has outgoing links to its cities. Since cities do not have any outgoing links, they are not listed when we check the out-degrees.

In [None]:
out_degrees=mc.outDegrees
out_degrees.show(100)
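As a plain-Python illustration of why cities are missing from the out-degree listing, the sketch below counts outgoing edges on a toy graph. The vertex names (`A`, `a1`, and so on) are hypothetical placeholders, not rows from the data files: uppercase letters stand for countries and lowercase names for their cities.

```python
from collections import Counter

# Toy directed edges (country -> city); names are hypothetical placeholders.
edges = [("A", "a1"), ("A", "a2"), ("B", "b1")]

# Count how many edges leave each source vertex.
out_degree = Counter(src for src, _dst in edges)

# Only vertices that appear as a source get an entry, so the cities
# (which have no outgoing edges) are absent from the result.
for vertex, degree in sorted(out_degree.items()):
    print(vertex, degree)
```

A vertex with zero outgoing edges never appears as a `src`, so it is simply not counted, which matches what `mc.outDegrees` shows.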

Once you have successfully created a GraphFrame, you can use `networkx` to display the graph. The code below assumes `mc` is the GraphFrame you created. In this graph, the vertices include all the countries and all the metro cities, and there is an edge between a city and a country if the city is in that country. Overall the graph is a little messy: there are so many vertices that it is hard to tell which country includes which cities.

In [None]:
import networkx as nx

mc_gp = nx.from_pandas_edgelist(mc.edges.toPandas(),'src','dst')
nx.draw(mc_gp, with_labels = True, node_size = 10, font_size = 10)

To be more focused, let's retrieve a subgraph from the graph above. In this subgraph we only want the cities in the USA. This time the graph is easier to read because it has far fewer vertices.

In [None]:
# Your code goes here (5 points)
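The idea behind extracting such a subgraph can be sketched in plain Python: keep only the edges that leave the chosen country, then keep only the vertices those edges touch. This is a toy illustration with hypothetical placeholder names (`A` stands for the country of interest) and a hypothetical helper `subgraph_from`. It is not the notebook's solution; with GraphFrames the same two steps would be expressed as filters on `mc.edges` and `mc.vertices`.

```python
# Toy graph; names are hypothetical placeholders.
vertices = ["A", "B", "a1", "a2", "b1"]
edges = [("A", "a1"), ("A", "a2"), ("B", "b1")]


def subgraph_from(source, vertices, edges):
    """Return the vertices and edges reachable in one hop from `source`."""
    # Step 1: keep only edges whose source is the chosen vertex.
    kept_edges = [(s, d) for s, d in edges if s == source]
    # Step 2: keep only vertices that still have an edge attached.
    touched = {v for edge in kept_edges for v in edge}
    kept_vertices = [v for v in vertices if v in touched]
    return kept_vertices, kept_edges


sub_vertices, sub_edges = subgraph_from("A", vertices, edges)
print(sub_vertices)
print(sub_edges)
```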


##Task 2##
This task is similar to Task 1. The difference is that we want to find the relationship between countries and continents. Overall, you should use one graph that includes all the continents, with each country connected to the continent it belongs to.

In [None]:
# Your code goes here (5 points)
# cc_vertices = 
# cc_edges = 

In [None]:
cc = GraphFrame(cc_vertices, cc_edges)

Let's display the graph.

In [None]:
cc_gp = nx.from_pandas_edgelist(cc.edges.toPandas(),'src','dst')
nx.draw(cc_gp, with_labels = True, node_size = 20, font_size = 10)

Again, let's focus only on the continent "North America" to make the graph easier to read.

In [None]:
# Your code goes here (5 points)



##Task 3##

Put it all together. Now you should build a graph that shows all the continents, all the countries, and all the metro cities. Link them according to their geographical locations. This graph is even messier because it has so many vertices.

In [None]:
# Your code goes here (5 points)
# mcc_vertices = 
# mcc_edges = 
# mcc = 
# display the graph

The graph you built above has several connected components (i.e., continents). Now let's use GraphFrames' `connectedComponents()` to find each individual component.

In [None]:
sc.setCheckpointDir("/tmp/graphframes-example-connected-components")
components = mcc.connectedComponents()

In [None]:
components.show(100)
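To make the notion of a connected component concrete, here is a small plain-Python sketch that labels components with a union-find structure, ignoring edge direction. It illustrates the concept only and is not how GraphFrames computes components internally. The function name `connected_components` and the toy vertex names are hypothetical (`X` and `Y` stand for continents, `A` and `B` for countries, `a1` and `b1` for cities).

```python
def connected_components(vertices, edges):
    """Map every vertex to a representative vertex of its component."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_s, root_d = find(src), find(dst)
        if root_s != root_d:
            # Merge the two components; direction of the edge is ignored.
            parent[max(root_s, root_d)] = min(root_s, root_d)

    return {v: find(v) for v in vertices}


# Toy graph with two separate continent -> country -> city chains.
vertices = ["X", "A", "a1", "Y", "B", "b1"]
edges = [("X", "A"), ("A", "a1"), ("Y", "B"), ("B", "b1")]

labels = connected_components(vertices, edges)
for vertex, label in labels.items():
    print(vertex, label)
```

Every vertex in the same chain ends up with the same label, which plays the role of the `component` column in the DataFrame returned by `connectedComponents()`.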

Once we have this information, we want to display only one connected component, the continent "North America", i.e., all the countries in North America and all the metro cities in these countries.

In [None]:
# Your code goes here (5 points)
# northamerica_vertices = 
# northamerica_edges = 
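One way to think about this step, sketched in plain Python on toy data: look up the component label of the continent vertex, keep the vertices carrying that label, and keep the edges whose endpoints both carry it. The labels and vertex names below are hypothetical placeholders (`X` stands for the continent of interest). In the notebook the equivalent would be filters on the `components` DataFrame and on `mcc.edges`.

```python
# Toy component labels, shaped like the (id, component) pairs that
# connectedComponents() returns; all values are hypothetical.
labels = {"X": 0, "A": 0, "a1": 0, "Y": 1, "B": 1, "b1": 1}
edges = [("X", "A"), ("A", "a1"), ("Y", "B"), ("B", "b1")]

# Component that contains the continent we care about.
target = labels["X"]

# Vertices in that component.
component_vertices = [v for v, c in labels.items() if c == target]

# Edges whose endpoints both lie inside that component.
component_edges = [
    (s, d) for s, d in edges if labels[s] == target and labels[d] == target
]

print(component_vertices)
print(component_edges)
```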

In [None]:
northamerica = GraphFrame(northamerica_vertices, northamerica_edges)

In [None]:
northamerica_gp = nx.from_pandas_edgelist(northamerica.edges.toPandas(),'src','dst')
nx.draw(northamerica_gp, with_labels = True, node_size = 40, font_size = 10, edge_color = "red")