# MIDS - w261 Machine Learning At Scale
__Course Lead:__ Dr James G. Shanahan (__email__ Jimi via  James.Shanahan _AT_ gmail.com)

## Final Exam Exercise

---
__Name:__   Megan Jasek  <br /> 
__Class:__ MIDS w261 (Summer 2016, Section ?) <br /> 
__Email:__ meganjasek@ischool.berkeley.edu <br /> 
__Week:__  14 <br />

In [None]:
# purpose of cell: download and view test data set from DropBox for use in subsequent cells

#!wget https://www.dropbox.com/sh/2c0k5adwz36lkcw/AADxzBgNxNF5Q6-eanjnK64qa/PageRank-test.txt
#!cat PageRank-test.txt

# *Spark* implementation of basic PageRank

-----

- Per Jimi, translating the MapReduce concepts to Spark will get our juices flowing and should set us up nicely for the Final Exam.
- We can run locally since the Final Exam is expected to be similar in format to the Midterm Exam (e.g., it will not require AWS or SoftLayer).
- He encouraged us to share notebook(s) on Google Groups since that might help others in the class.

-----

**The remaining text below is verbatim from HW9.1, except for the first sentence which replaces 'MRJob' with 'Spark'.**

As we had written for HW9.1 (basic MRJob implementation), now write a basic Spark implementation of the iterative PageRank algorithm that takes sparse adjacency lists as input.

Make sure that your implementation utilizes teleportation ((1 - damping) / the number of nodes in the network), and further, distributes the mass of dangling nodes with each iteration so that the output of each iteration is correctly normalized (sums to 1).

[NOTE: The PageRank algorithm assumes that a random surfer (walker), starting from a random web page, chooses the next page to which it will move by clicking at random, with probability d, one of the hyperlinks in the current page. This probability is represented by a so-called ‘damping factor’ d, where d ∈ (0, 1). Otherwise, with probability (1 − d), the surfer jumps to any web page in the network. If a page is a dangling end, meaning it has no outgoing hyperlinks, the random surfer selects an arbitrary web page from a uniform distribution and “teleports” to that page.]
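
The note above can be condensed into a single update rule. The notation here is an assumption introduced for illustration and does not appear in the assignment: $N$ is the number of nodes, $P_t(n)$ is the PageRank of node $n$ after iteration $t$, $\mathrm{in}(n)$ is the set of nodes linking to $n$, $|\mathrm{out}(j)|$ is the number of outgoing links of node $j$, and $m_t$ is the total mass currently held by dangling nodes. One common way to write the update is:

$$
m_t = \sum_{j \,:\, |\mathrm{out}(j)| = 0} P_t(j),
\qquad
P_{t+1}(n) = \frac{1-d}{N} + d\left(\frac{m_t}{N} + \sum_{j \in \mathrm{in}(n)} \frac{P_t(j)}{|\mathrm{out}(j)|}\right)
$$

Summing the right-hand side over all $N$ nodes gives $(1-d) + d\,(m_t + (1 - m_t)) = 1$, so if $P_t$ sums to 1, $P_{t+1}$ does too. This is the normalization the assignment asks for.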

As you build your code, use the test data
s3://ucb-mids-mls-networks/PageRank-test.txt

Or under the Data Subfolder for HW7 on Dropbox with the same file name. 
(On Dropbox https://www.dropbox.com/sh/2c0k5adwz36lkcw/AAAAKsjQfF9uHfv-X9mCqr9wa?dl=0)

with the teleportation parameter set to 0.15 (1-d, where d, the damping factor, is set to 0.85), and crosscheck
your work with the true result, displayed in the first image in the Wikipedia article: https://en.wikipedia.org/wiki/PageRank

and here for reference are the corresponding PageRank probabilities:

A,0.033 <br />
B,0.384 <br />
C,0.343 <br />
D,0.039 <br />
E,0.081 <br />
F,0.039 <br />
G,0.016 <br />
H,0.016 <br />
I,0.016 <br />
J,0.016 <br />
K,0.016 <br />
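
Before moving to Spark, it can help to have a small single-machine reference to crosscheck against. The following is a minimal pure-Python sketch, not the Spark solution: the `graph` dict is copied by hand from `PageRank-test.txt`, and the function name `pagerank` and the choice of 100 iterations are assumptions made for this illustration.

```python
# Adjacency lists of the test graph, copied from PageRank-test.txt.
# Node A has no outgoing links, so it never appears as a key.
graph = {
    'B': ['C'],
    'C': ['B'],
    'D': ['A', 'B'],
    'E': ['D', 'B', 'F'],
    'F': ['B', 'E'],
    'G': ['B', 'E'],
    'H': ['B', 'E'],
    'I': ['B', 'E'],
    'J': ['E'],
    'K': ['E'],
}

def pagerank(graph, d=0.85, iterations=100):
    """Power iteration with teleportation and dangling-mass redistribution."""
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution.
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        incoming = {node: 0.0 for node in nodes}
        dangling_mass = 0.0
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this node's rank evenly among its out-links.
                share = ranks[node] / len(targets)
                for target in targets:
                    incoming[target] += share
            else:
                # Dangling node: its rank is spread uniformly over all nodes.
                dangling_mass += ranks[node]
        ranks = {
            node: (1.0 - d) / n + d * (dangling_mass / n + incoming[node])
            for node in nodes
        }
    return ranks

ranks = pagerank(graph)
for node in sorted(ranks):
    print("%s,%.3f" % (node, ranks[node]))
```

The printed values should agree with the reference probabilities listed above to within rounding, and the ranks should sum to 1 after every iteration.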

-----

### Create a Spark Context to use throughout this homework

In [1]:
import pyspark
from pyspark.sql import SQLContext

# We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# In this case, it is local multicore execution with "local[*]"
app_name = "HW11"
master = "local[*]"
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)

# Stop any context left over from an earlier run; skip this if none exists yet
try:
    sc.stop()
except NameError:
    pass

sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)

print sc
print sqlContext

<pyspark.context.SparkContext object at 0x7f0dc55bc8d0>
<pyspark.sql.context.SQLContext object at 0x7f0dc55bc910>


### Read the data into an RDD and cache it

In [4]:
!cat PageRank-test.txt

B	{'C': 1}
C	{'B': 1}
D	{'A': 1, 'B': 1}
E	{'D': 1, 'B': 1, 'F': 1}
F	{'B': 1, 'E': 1}
G	{'B': 1, 'E': 1}
H	{'B': 1, 'E': 1}
I	{'B': 1, 'E': 1}
J	{'E': 1}
K	{'E': 1}

In [4]:
import ast
from collections import namedtuple

Node = namedtuple('Node', 'name adj_list page_rank')

def parsePoint(line):
    # Emit the source node with its adjacency dict, plus an empty entry for
    # every link target so that dangling nodes (e.g. 'A') are not lost.
    node_name, adj_dict = line.strip().split('\t')
    adj_dict = ast.literal_eval(adj_dict)
    all_nodes = [(str(node_name), adj_dict)]
    for node in adj_dict:
        all_nodes.append((node, {}))
    return all_nodes

def aggNodes(x, y):
    # Keep whichever adjacency dict is non-empty; two empty dicts yield {}
    return x if x else y

def initNode(node, n):
    # Start every node with a uniform PageRank of 1/n
    return Node(node[0], node[1], 1.0/n)

fileName = 'PageRank-test.txt'

# Cache the parsed graph: it is reused by count(), take() and the map below
rawNodes = sc.textFile(fileName).flatMap(parsePoint).reduceByKey(aggNodes).cache()
num_nodes = rawNodes.count()
print num_nodes
print rawNodes.take(num_nodes)

# Initialize Page Ranks
nodes = rawNodes.map(lambda node: initNode(node, num_nodes))

11
[('A', {}), ('C', {'B': 1}), ('E', {'B': 1, 'D': 1, 'F': 1}), ('G', {'B': 1, 'E': 1}), ('I', {'B': 1, 'E': 1}), ('K', {'E': 1}), ('H', {'B': 1, 'E': 1}), ('J', {'E': 1}), ('B', {'C': 1}), ('D', {'A': 1, 'B': 1}), ('F', {'B': 1, 'E': 1})]
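
The file has 10 lines but the count is 11 because node A only ever appears as a link target. The sketch below mimics the `flatMap` + `reduceByKey` step in plain Python, without Spark, to show how A is recovered with an empty adjacency dict. The names `lines`, `parse_line` and `merged` are hypothetical and used only for this illustration; the input is the file content shown earlier.

```python
import ast

# The ten lines of PageRank-test.txt (tab-separated: node, adjacency dict).
lines = [
    "B\t{'C': 1}",
    "C\t{'B': 1}",
    "D\t{'A': 1, 'B': 1}",
    "E\t{'D': 1, 'B': 1, 'F': 1}",
    "F\t{'B': 1, 'E': 1}",
    "G\t{'B': 1, 'E': 1}",
    "H\t{'B': 1, 'E': 1}",
    "I\t{'B': 1, 'E': 1}",
    "J\t{'E': 1}",
    "K\t{'E': 1}",
]

def parse_line(line):
    """Return (node, adjacency) for the source plus (target, {}) per link."""
    name, adj = line.strip().split('\t')
    adj = ast.literal_eval(adj)
    pairs = [(name, adj)]
    pairs.extend((target, {}) for target in adj)
    return pairs

# Stand-in for reduceByKey: keep whichever adjacency dict is non-empty.
merged = {}
for line in lines:
    for name, adj in parse_line(line):
        merged[name] = merged.get(name) or adj

# Nodes whose adjacency dict stayed empty are the dangling nodes.
dangling = sorted(name for name, adj in merged.items() if not adj)
print(len(merged))
print(dangling)
```

Emitting an empty placeholder for every target is what guarantees that each node appears as a key, which in turn makes `num_nodes` the right denominator for the initial rank of 1/n.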
