#DATASCI W261, Machine Learning at Scale
--------
####Assignment:  week \#7
####[Lei Yang](mailto:leiyang@berkeley.edu) | [Michael Kennedy](mailto:mkennedy@ischool.berkeley.edu) | [Natarajan Krishnaswami](mailto:natarajan@krishnaswami.org)
####Due: 2016-03-10, 8AM PST

###General Description
In this assignment you will explore networks and develop MRJob code for 
finding shortest path graph distances. To build up to large data 
you will develop your code on some very simple, toy networks.
After this you will take your developed code forward and modify it and 
apply it to two larger datasets (performing EDA along the way).


####Undirected toy network dataset

In an undirected network all links are symmetric, 
i.e., for a pair of nodes 'A' and 'B,' both of the links:

A -> B and B -> A

will exist. 

The toy data are available in a sparse (stripes) representation:

(node) \t (dictionary of links)

on AWS/Dropbox via the url:

s3://ucb-mids-mls-networks/undirected_toy.txt

Or under the Data subfolder for HW7 on Dropbox, with the same file name.

In the dictionary, target nodes are keys, link weights are values 
(here, all weights are 1, i.e., the network is unweighted).
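
A minimal sketch of reading one record in this stripes format, assuming the dictionary is written as a Python literal (which is what the mapper later in this notebook expects). The function name `parse_stripe` and the sample record are illustrative and are not taken from the data file.

```python
from ast import literal_eval


def parse_stripe(line):
    """Split a 'node<TAB>dict' record into (node_id, {neighbor: weight})."""
    node_id, links = line.strip().split('\t', 1)
    return node_id, literal_eval(links)


# Hypothetical record, made up for illustration
node_id, links = parse_stripe("1\t{'2': 1, '5': 1}\n")
print('%s links to %s' % (node_id, sorted(links)))
```

Node ids stay strings throughout, which matches how the `--source` option is compared against `nid` in the MRJob code.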


####Directed toy network dataset

In a directed network links are not necessarily symmetric, 
i.e., for a pair of nodes 'A' and 'B,' it is possible for only one of:

A -> B or B -> A

to exist. 

These toy data are available in a sparse (stripes) representation:

(node) \t (dictionary of links)

on AWS/Dropbox via the url:

s3://ucb-mids-mls-networks/directed_toy.txt

Or under the Data subfolder for HW7 on Dropbox, with the same file name.

In the dictionary, target nodes are keys, link weights are values 
(here, all weights are 1, i.e., the network is unweighted).


###HW 7.0: Shortest path graph distances (toy networks)

In this part of your assignment you will develop the base of your code for the week.

Write MRJob classes to find shortest path graph distances, 
as described in the lectures. In addition to finding the distances, 
your code should also output a distance-minimizing path between the source and target.
Work locally for this part of the assignment, and use 
both of the undirected and directed toy networks.

To prove that your code works, run the following jobs:

- shortest path in the undirected network from node 1 to node 4

Solution: 1,5,4 

- shortest path in the directed network from node 1 to node 5

Solution: 1,2,4,5

and report your output---make sure it is correct!
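
One way to check the MapReduce output on small inputs is to compare it against a plain in-memory breadth-first search. The sketch below assumes an unweighted graph held as a dictionary of adjacency dictionaries; `bfs_shortest_path` and the example graph are hypothetical and are not the course toy files.

```python
from collections import deque


def bfs_shortest_path(adj, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # walk the parent links back to the source
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        # nodes without an adjacency record have no out-links
        for neighbor in adj.get(node, {}):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical directed graph, made up for illustration
toy = {'a': {'b': 1, 'c': 1}, 'b': {'d': 1}, 'c': {}, 'd': {'e': 1}}
path = bfs_shortest_path(toy, 'a', 'e')
print('path: %s, distance: %d' % (' -> '.join(path), len(path) - 1))
```

When several shortest paths exist, this reference and the MRJob code may return different ones, so compare distances rather than exact paths.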

###One iteration MRJob
<img src="ShortestPathMapReduce.png" alt="Drawing" style="width: 550px;"/>

In [96]:
%%writefile ShortestPathIter.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from ast import literal_eval


class ShortestPathIter(MRJob):
    DEFAULT_PROTOCOL = 'json'
    
    def __init__(self, *args, **kwargs):
        super(ShortestPathIter, self).__init__(*args, **kwargs)

    def configure_options(self):
        super(ShortestPathIter, self).configure_options()
        # --source and --destination are passed through to every task
        self.add_passthrough_option(
            '--source', dest='source', default='1', type='string',
            help='source: source node (default 1)')
        self.add_passthrough_option(
            '--destination', dest='destination', default='1', type='string',
            help='destination: destination node (default 1)')

    def mapper(self, _, line):
        nid, dic = line.strip().split('\t', 1)
        # parse the dictionary literal without exec
        node = literal_eval(dic)
        # first pass: the record is a bare adjacency dict, so wrap it
        # in the full node structure (adjacency, distance, path)
        if 'dist' not in node:
            node = {'adj':node, 'path':[]}
            # distance is 0 at the source and -1 (unvisited) elsewhere
            node['dist'] = 0 if self.options.source==nid else -1
            
        
        # emit node
        yield nid, (node, None)
        
        # emit distances to reachable nodes
        if node['dist'] >= 0:
            for m in node['adj']:                
                yield m, (node['adj'][m] + node['dist'], node['path']+[nid])
                
    def reducer(self, nid, value):
        dmin = float('inf')
        path = node = None
        # loop through all arrivals
        for d, pre in value:            
            if not pre:
                node = d
            elif d < dmin:
                dmin = d
                path = pre
        # handle dangling node
        if not node:
            node = {'adj':{}, 'dist':dmin, 'path':path}
        # update distance and path
        if (node['dist'] == -1 and path) or dmin < node['dist']:
            node['dist'] = dmin
            node['path'] = path
        # emit for next iteration
        yield nid, node
        
    def steps(self):
        return [MRStep(mapper=self.mapper
                       #, reducer_init=self.reducer_init
                       , reducer=self.reducer
                       #, reducer_final=self.reducer_final
                       #, jobconf = jc
                      )
               ]

if __name__ == '__main__':
    ShortestPathIter.run()

Overwriting ShortestPathIter.py
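
The data flow of a single pass can be traced without Hadoop or mrjob. The sketch below mirrors the mapper and reducer logic above in plain Python and repeats the pass until nothing changes. The helper names (`map_node`, `reduce_node`, `one_iteration`) and the two-record graph are hypothetical, and the in-memory grouping only stands in for the shuffle step.

```python
from collections import defaultdict


def map_node(nid, node):
    """Emit the node itself plus a tentative distance to each neighbor."""
    yield nid, (node, None)
    if node['dist'] >= 0:
        for m, w in node['adj'].items():
            yield m, (node['dist'] + w, node['path'] + [nid])


def reduce_node(nid, values):
    """Keep the node record and the smallest arriving distance with its path."""
    dmin, path, node = float('inf'), None, None
    for d, pre in values:
        if pre is None:
            node = d              # the node structure itself
        elif d < dmin:
            dmin, path = d, pre   # a shorter tentative distance
    if node is None:              # dangling node: no record of its own
        node = {'adj': {}, 'dist': -1, 'path': []}
    if path is not None and (node['dist'] == -1 or dmin < node['dist']):
        node = dict(node, dist=dmin, path=path)
    return node


def one_iteration(graph):
    """Run map, group by key, and reduce once, all in memory."""
    groups = defaultdict(list)
    for nid, node in graph.items():
        for key, value in map_node(nid, node):
            groups[key].append(value)
    return {nid: reduce_node(nid, vals) for nid, vals in groups.items()}


# Hypothetical graph: source 'a' starts at distance 0, 'c' has no record
graph = {
    'a': {'adj': {'b': 1}, 'dist': 0, 'path': []},
    'b': {'adj': {'c': 1}, 'dist': -1, 'path': []},
}
while True:
    updated = one_iteration(graph)
    if updated == graph:
        break
    graph = updated
print(graph['c'])
```

Node `c` has no record of its own in the input and only appears once the frontier reaches it, which is the dangling-node case the reducer handles.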


In [94]:
#!python ShortestPathIter.py directed_toy.txt --source 1

###Driver

In [103]:
from ShortestPathIter import ShortestPathIter

# some helper variables
graph = 'directed_toy.txt'
source, destination = '1', '5'

# create BFS jobs: the first reads the raw graph, the second re-reads
# the 'graph' file that each pass rewrites
init_job = ShortestPathIter(args=[graph, '--source', source, '--destination', destination]) #, '-r', 'hadoop'])
iter_job = ShortestPathIter(args=['graph', '--source', source, '--destination', destination]) #, '-r', 'hadoop'])

# run initialization job
with init_job.make_runner() as runner:
    runner.run()
    dist_old = {}
    # save our graph file for iteration
    with open('graph', 'w') as f:
        for line in runner.stream_output():
            # value is nid and node object
            nid, node = init_job.parse_output_line(line)
            # record distance for each node
            dist_old[nid] = node['dist']
            # write graph file 
            f.write('%s\t%s\n' %(nid, node))

# run BFS iteratively
i = 1
while(1):    
    print 'iteration %s' %i
    dist = {}
    with iter_job.make_runner() as runner: 
        runner.run()
        # stream_output: get access of the output    
        with open('graph', 'w') as f:
            for line in runner.stream_output():
                # value is nid and node object
                nid, node = iter_job.parse_output_line(line)
                dist[nid] = node['dist']
                f.write('%s\t%s\n' %(nid, str(node)))
            
    # check whether any node's distance changed; a node reached for the
    # first time has no entry in dist_old, which also counts as a change
    toBreak = True
    for n in dist:
        if dist_old.get(n) != dist[n]:
            toBreak = False
            break  
    
    if toBreak:
        break
    
    # save dist for next iteration comparison
    dist_old = dist
    i += 1
        
print "\nTraining completes!\n"

# show path between source and destination
from ast import literal_eval

with open('graph', 'r') as f:
    for line in f:
        nid, node = line.split('\t', 1)
        if nid == destination:
            # parse the node literal without exec
            node = literal_eval(node.strip())
            if node['path']:
                print 'shortest distance between %s and %s: %s' %(source, destination, node['dist'])
                print 'path: %s' %' -> '.join(node['path']+[destination])
            else:
                print 'no path found from %s to %s!' %(source, destination)
            break



iteration 1
iteration 2
iteration 3





Training completes!

shortest distance between 1 and 5: 3
path: 1 -> 2 -> 4 -> 5


###Main dataset 1: NLTK synonyms

In the next part of this assignment you will explore a network derived from
the NLTK synonym database used for evaluation in HW 5. At a high level, this
network is undirected, defined so that there exists a link between two nodes/words 
if the pair of words are synonyms. These data may be found at the location:

- s3://ucb-mids-mls-networks/synNet/synNet.txt
- s3://ucb-mids-mls-networks/synNet/indices.txt

Or under the Data subfolder for HW7 on Dropbox, with the same file names,

where synNet.txt contains a sparse representation of the network:

(index) \t (dictionary of links)

in indexed form, and indices.txt contains a lookup list

(word) \t (index)

of indices and words. This network is small enough for you to explore and run
scripts locally, but will also be good for a systems test (for later) on AWS.

In the dictionary, target nodes are keys, link weights are values 
(here, all weights are 1, i.e., the network is unweighted).
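
Since the network is stored by index, a path found on it has to be translated back to words. A minimal sketch, assuming each record in indices.txt is exactly `word<TAB>index`; the function name `load_index` and the sample records are made up for illustration.

```python
def load_index(lines):
    """Build {index: word} from 'word<TAB>index' records."""
    lookup = {}
    for line in lines:
        word, index = line.rstrip('\n').split('\t')
        lookup[index] = word
    return lookup


# Hypothetical records in the same layout as indices.txt
sample = ["alpha\t1\n", "beta\t2\n", "gamma\t3\n"]
index_to_word = load_index(sample)

# Translate a path of indices (as strings) back into words
path = ['1', '3', '2']
words = ' -> '.join(index_to_word[i] for i in path)
print(words)
```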

###HW 7.1: Exploratory data analysis (NLTK synonyms)

Using MRJob, explore the synonyms network data.
Consider plotting the degree distribution (does it follow a power law?),
and determine some of the key features, like:

- number of nodes, 
- number of links,
- average degree (i.e., the average number of links per node),
- etc...
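
The quantities listed above can be sketched in plain Python before porting the logic to MRJob. The sketch assumes an undirected network in the stripes format, so every link is listed once from each endpoint. The function name `degree_stats` and the three-node sample are hypothetical.

```python
from ast import literal_eval
from collections import Counter


def degree_stats(lines):
    """Return (nodes, links, average degree, degree histogram) for an
    undirected network stored as 'node<TAB>{neighbor: weight}' records."""
    degrees = {}
    for line in lines:
        node, links = line.strip().split('\t', 1)
        degrees[node] = len(literal_eval(links))
    n_nodes = len(degrees)
    total_degree = sum(degrees.values())
    n_links = total_degree // 2           # each link is listed twice
    avg_degree = float(total_degree) / n_nodes
    histogram = Counter(degrees.values())  # degree -> number of nodes
    return n_nodes, n_links, avg_degree, histogram


# Hypothetical triangle network, made up for illustration
sample = ["1\t{'2': 1, '3': 1}", "2\t{'1': 1, '3': 1}", "3\t{'1': 1, '2': 1}"]
n_nodes, n_links, avg_degree, histogram = degree_stats(sample)
print('nodes=%d links=%d average degree=%.2f' % (n_nodes, n_links, avg_degree))
```

In an MRJob version the mapper would emit each node's degree and the reducer would aggregate counts. The histogram is what you would plot on log-log axes to look for a power law.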

As you develop your code, please be sure to run it locally first (though on the whole dataset). 
Once you have gotten your code to run locally, deploy it on AWS as a systems test
in preparation for our next dataset (which will require AWS).

###HW 7.2: Shortest path graph distances (NLTK synonyms)

Write an MRJob class (reusing your code from 7.0) to find shortest path graph distances, 
and apply it to the NLTK synonyms network dataset. 

Prove that your code works by running the job:

- shortest path starting at "walk" (index=7827) and ending at "make" (index=536),

and showing your code's output. Once again, your output should include the path and the distance.

As you develop your code, please be sure to run it locally first (though on the whole dataset). 
Once you have gotten your code to run locally, deploy it on AWS as a systems test
in preparation for our next dataset (which will require AWS).

###Main dataset 2: English Wikipedia

For the remainder of this assignment you will explore the English Wikipedia hyperlink network.
The dataset is built from the Sept. 2015 XML snapshot of English Wikipedia.
For this directed network, a link between articles: 

A -> B

is defined by the existence of a hyperlink in A pointing to B.
This network also exists in the indexed format:

Data: s3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-out.txt
Data: s3://ucb-mids-mls-networks/wikipedia/all-pages-indexed-in.txt
Data: s3://ucb-mids-mls-networks/wikipedia/indices.txt
Or under the Data subfolder for HW7 on Dropbox, with the same file names,

but has an index with more detailed data:

(article name) \t (index) \t (in degree) \t (out degree)

In the dictionary, target nodes are keys, link weights are values.
Here, a weight indicates the number of times a page links to another.
However, for the sake of this assignment, treat this as an unweighted network,
and set all weights to 1 upon data input.
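
Resetting the weights can be done as each record is parsed. A minimal sketch, with a hypothetical helper name `unweight` and a made-up adjacency dictionary:

```python
def unweight(links):
    """Return a copy of the adjacency dict with every weight set to 1."""
    return {target: 1 for target in links}


# Hypothetical weighted links: the source page links to '7' three times
weighted = {'7': 3, '9': 1}
unweighted = unweight(weighted)
print(sorted(unweighted.items()))
```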

###HW 7.3: Exploratory data analysis (Wikipedia)

Using MRJob, explore the Wikipedia network data on the AWS cloud. Reuse your code from HW 7.1---does it scale well? 
Be cautioned that Wikipedia is a directed network, where links are not symmetric. 
So, even though a node may be linked to, it will not appear as a primary record itself if it has no out-links. 
This means that you may have to ADJUST your code (depending on its design). 
To be sure of your code's functionality in this context, run a systems test on the directed_toy.txt network.
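
The adjustment amounts to counting nodes that appear only as link targets. The sketch below collects node ids from both sides of each record; `all_nodes` and the sample records are hypothetical. An in-memory set like this is only meant for the toy networks. At Wikipedia scale the same idea would more likely be expressed by emitting every source and target id as a key and deduplicating in the reducer.

```python
from ast import literal_eval


def all_nodes(lines):
    """Collect every node id that appears as a source or as a link target."""
    nodes = set()
    for line in lines:
        node, links = line.strip().split('\t', 1)
        nodes.add(node)                    # primary record
        nodes.update(literal_eval(links))  # targets, which may have no record
    return nodes


# Hypothetical directed records: node '3' has no out-links and no record
sample = ["1\t{'2': 1}", "2\t{'3': 1}"]
nodes = all_nodes(sample)
print('%d primary records, %d nodes in total' % (len(sample), len(nodes)))
```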


###HW 7.4: Shortest path graph distances (Wikipedia)

Using MRJob, find shortest path graph distances in the Wikipedia network on the AWS cloud.
Reuse your code from 7.2, but once again be warned of Wikipedia being a directed network.
To be sure of your code's functionality in this context, run a systems test on the directed_toy.txt network.

When running your code on the Wikipedia network, prove that it works by running the job:

- shortest path from "Ireland" (index=6176135) to "University of California, Berkeley" (index=13466359),

and show your code's output.

Once your code is running, find some other shortest paths and report your results.

###HW 7.5: Conceptual exercise: Largest single-source network distances

Suppose you wanted to find the largest network distance from a single source,
i.e., a node that is the furthest (but still reachable) from a single source.

How would you implement this task? 
How is this different from finding the shortest path graph distances?

Is this task more difficult to implement than the shortest path distance?

As you respond, please comment on program structure, runtimes, iterations, general system requirements, etc...

###HW 7.6 (optional): Computational exercise: Largest single-source network distances 

Using MRJob, write code to find the largest graph distance and the distance-maximizing nodes from a single source.
Test your code first on the toy networks and the synonyms network to prove that it works.
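
One possible approach, offered as a sketch and not as the expected solution, is to run the single-source shortest path job to convergence and then take the maximum over the reachable nodes in a final pass. The code below shows that final step in plain Python, assuming distances are held in a dictionary and that -1 marks unreachable nodes, as in the node structure used earlier in this notebook. The name `farthest_nodes` and the sample distances are hypothetical.

```python
def farthest_nodes(dist):
    """Given {node: distance} with -1 for unreachable nodes, return
    (largest distance, sorted list of nodes at that distance)."""
    reachable = {n: d for n, d in dist.items() if d >= 0}
    if not reachable:
        return None, []
    dmax = max(reachable.values())
    return dmax, sorted(n for n, d in reachable.items() if d == dmax)


# Hypothetical converged distances from source 'a'; 'e' is unreachable
dist = {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': -1}
dmax, nodes = farthest_nodes(dist)
print('largest distance %s at nodes %s' % (dmax, nodes))
```

In an MRJob version, this step would be one extra reduce over the final graph file.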

###stop yarn, hdfs, and job history

In [245]:
!/usr/local/Cellar/hadoop/2*/sbin/stop-yarn.sh
!/usr/local/Cellar/hadoop/2*/sbin/stop-dfs.sh
!/usr/local/Cellar/hadoop/2*/sbin/mr-jobhistory-daemon.sh --config /usr/local/Cellar/hadoop/2*/libexec/etc/hadoop/ stop historyserver 

stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping historyserver


###start yarn, hdfs, and job history

In [10]:
!/usr/local/Cellar/hadoop/2*/sbin/start-yarn.sh
!/usr/local/Cellar/hadoop/2*/sbin/start-dfs.sh
!/usr/local/Cellar/hadoop/2*/sbin/mr-jobhistory-daemon.sh --config /usr/local/Cellar/hadoop/2*/libexec/etc/hadoop/ start historyserver 

starting yarn daemons
starting resourcemanager, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/yarn-leiyang-resourcemanager-Leis-MacBook-Pro.local.out
localhost: starting nodemanager, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/yarn-leiyang-nodemanager-Leis-MacBook-Pro.local.out
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-leiyang-namenode-Leis-MacBook-Pro.local.out
localhost: starting datanode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-leiyang-datanode-Leis-MacBook-Pro.local.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-leiyang-secondarynamenode-Leis-MacBook-Pro.local.out
starting historyserver, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/mapred-leiyang-historyserver-Leis-MacBook-Pro.local.out
16/02/26 17:44:11 INFO hs.JobHistoryServer: STARTUP_MSG: 
/**********