# DATASCI W261: Machine Learning at Scale

**Nick Hamlin** (nickhamlin@gmail.com)  
**Tigi Thomas** (tgthomas@berkeley.edu)  
**Rock Baek** (rockb1017@gmail.com)  
**Hussein Danish** (husseindanish@gmail.com)  
  
Time of Submission: 10:30 AM EST, Saturday, Feb 27, 2016  
W261-3, Spring 2016  
Week 7 Homework

### Submission Notes:
- For each problem, we've included a summary of the question as posed in the instructions. In many cases, we have left out the full text to keep the final submission as uncluttered as possible. For reference, we've included a link to the original instructions under "Useful References" below.
- Some parts of this notebook don't render nicely in PDF form. In these situations, please refer to [the complete rendered notebook on Github](https://github.com/nickhamlin/mids_261_homework/blob/master/HW7/MIDS-W261-2015-HWK-Week07-Hamlin-Thomas-Baek-Danish.ipynb).


### Useful References:
- **[Original Assignment Instructions](https://www.dropbox.com/s/26ejqhkzqdidzwj/HW7-Questions.txt?dl=0)**


In [6]:
#Use this to make sure we reload the MrJob code when we make changes
%load_ext autoreload
%autoreload 2
#Render matplotlib charts in notebook
%matplotlib inline

#Import some modules we know we'll use frequently
import numpy as np
import pylab as plt

## HW 7.0

### HW 7.0 - Problem Statement

In this part of your assignment you will develop the base of your code for the week.

Write MRJob classes to find shortest path graph distances, 
as described in the lectures. In addition to finding the distances, 
your code should also output a distance-minimizing path between the source and target.
Work locally for this part of the assignment, and use 
both of the undirected and directed toy networks.

To prove your code's function, run the following jobs:

- shortest path in the undirected network from node 1 to node 4
Solution: 1,5,4 

- shortest path in the directed network from node 1 to node 5
Solution: 1,2,4,5

and report your output---make sure it is correct!
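
As a sanity check that is independent of MRJob, the expected answers can be reproduced with an ordinary in-memory breadth-first search. This is only a sketch and not part of the submitted solution: the function name `bfs_shortest_path` is hypothetical, and the two dictionaries are hand-copied from `undirected_toy.txt` and `directed_toy.txt` with the edge weights dropped.

```python
from collections import deque

# Adjacency lists copied by hand from the two toy files (weights dropped)
UNDIRECTED = {'1': ['2', '5'], '2': ['1', '3', '4', '5'], '3': ['2', '4'],
              '4': ['2', '3', '5'], '5': ['1', '2', '4']}
DIRECTED = {'1': ['2', '6'], '2': ['1', '3', '4'], '3': ['2', '4'],
            '4': ['2', '5'], '5': ['1', '2', '4']}

def bfs_shortest_path(graph, source, target):
    """Return one shortest path from source to target as a list, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        # Nodes that only appear as link targets (e.g. '6') have no entry
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

print(bfs_shortest_path(DIRECTED, '1', '5'))
print(bfs_shortest_path(UNDIRECTED, '1', '4'))
```

The directed query has a single shortest path, 1,2,4,5. In the undirected graph, nodes 2 and 5 are both adjacent to nodes 1 and 4, so 1,2,4 and 1,5,4 are equally short and a correct implementation may return either one.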


In [7]:
!cat undirected_toy.txt

1	{'2': 1,'5': 1}
2	{'1': 1,'3': 1,'4': 1,'5': 1}
3	{'2': 1, '4': 1}
4	{'2': 1,'3': 1,'5': 1}
5	{'1': 1, '2': 1, '4': 1}


### HW 7.0 - Initialization

In [190]:
%%writefile ssp_init.py
import sys

from mrjob.job import MRJob
from mrjob.step import MRStep

class SSP_Init(MRJob):        
                                                 
    def configure_options(self):
        """
        Set up infrastructure to enable us to pass important parameters into the job
        when we call it
        """
        super(SSP_Init, self).configure_options()
        
        #Integer ID of our starting node
        self.add_passthrough_option(
            '--source', dest='source', default=1, type='int',
            help='source: identifier of source node')

    
    def mapper(self, _, node):
        """
        Broadcast source node and attach statuses to each node
        """
        node_id,links=node.split('\t')
        links=eval(links)
        #Treat the graph as unweighted: keep only the neighbor IDs
        list_of_links=links.keys() #TODO: pass edge weights through as well for weighted graphs
        if node_id==str(self.options.source):
            yield node_id,(list_of_links,0,'Q',[node_id])
        else:
            yield node_id,(list_of_links,sys.maxint,'U',[]) 
       
    def steps(self):
        return [MRStep(
                    mapper=self.mapper
                      )]

if __name__ == '__main__':
    SSP_Init.run()

Overwriting ssp_init.py


In [224]:
from ssp_init import SSP_Init

def init_graph(source,data):
    mr_job = SSP_Init(args=[data,'--no-strict-protocols','--source',str(source)])
    with open('graph.txt','w+') as f:
        with mr_job.make_runner() as runner: 
            runner.run()
            for line in runner.stream_output():
                f.write(line)
                
init_graph(1,'directed_toy.txt')

In [225]:
!cat graph.txt

"1"	[["2", "6"], 0, "Q", ["1"]]
"2"	[["1", "3", "4"], 9223372036854775807, "U", []]
"3"	[["2", "4"], 9223372036854775807, "U", []]
"4"	[["2", "5"], 9223372036854775807, "U", []]
"5"	[["1", "2", "4"], 9223372036854775807, "U", []]


### HW 7.0 - Run iterative jobs

In [228]:
%%writefile ssp.py
import sys

from mrjob.job import MRJob
from mrjob.step import MRStep

class SSP(MRJob):
    
    def mapper(self, _, node):
        """
        Expand frontier node
        """
        node_id,data=node.split('\t')
        node_id=eval(node_id)
        data=eval(data)
        list_of_links,dist,status,path=data[:]
        if status=='V':
            #Already visited: pass the node through unchanged
            yield node_id,(list_of_links,dist,status,path)

        elif status=='Q': #node is in the frontier
            #emit the node we visited
            yield node_id,(list_of_links,dist,'V',path)

            #Emit adjacent nodes
            for link in list_of_links:
                new_path=path[:]
                new_path.append(link)
                yield link,(None,dist+1,'Q',new_path) 

        else:
            #Unvisited: pass the node through with an "infinite" distance
            yield node_id,(list_of_links,sys.maxint,'U',path)
        
    def reducer(self, node_id, data):  
        """
        Aggregate data for each node and create a consolidated version to pass to the next iteration
        """
        min_dist=sys.maxint
        adjacent_nodes=[]
        count=0
        for i in data:
            list_of_links,distance,status,path=i[:]
            if list_of_links:
                adjacent_nodes=list_of_links
            if status in ['Q','V']:
                if distance<min_dist:
                    min_dist=distance
                    if node_id not in path:
                        path.append(node_id)
            if status=='V':
                break
            count+=1
            #print list_of_links, distance,status
            #print i
        if count>1:
            status='Q'
        yield node_id, (adjacent_nodes,min_dist,status,path)
        
       
    def steps(self):
        return [MRStep(
                    mapper=self.mapper
                    ,reducer=self.reducer
                      )]

if __name__ == '__main__':
    SSP.run()

Overwriting ssp.py


In [229]:
#SINGLE ITERATION
init_graph(1,'directed_toy.txt')
from ssp import SSP
mr_job = SSP(args=['graph.txt','--no-strict-protocols'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: get access to the output 
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)

('1', [['2', '6'], 0, 'V', ['1']])
('2', [['1', '3', '4'], 1, 'Q', ['1', '2']])
('3', [['2', '4'], 9223372036854775807, 'U', []])
('4', [['2', '5'], 9223372036854775807, 'U', []])
('5', [['1', '2', '4'], 9223372036854775807, 'U', []])
('6', [[], 1, 'Q', ['1', '6']])


In [231]:
#MULTIPLE ITERATIONS
import os
from ssp import SSP
mr_job = SSP(args=['graph.txt','--no-strict-protocols'])

init_graph(1,'undirected_toy.txt')

persist=True
count=0
while persist:
    print "Now on iteration: "+str(count)
    persist=False
    with open('graph_new.txt','w+') as f:
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access to the output 
            for line in runner.stream_output():
                output=mr_job.parse_output_line(line)
                print output
                node,data=output
                f.write(line)
                #If we encounter any unvisited nodes, continue iterating
                if data[2]!='V':
                    persist=True
                
    count+=1
    os.rename('graph_new.txt','graph.txt')
    print ""
print "Graph has been fully traversed after {0} iterations!".format(str(count))

Now on iteration: 0
('1', [['2', '5'], 0, 'V', ['1']])
('2', [['1', '3', '5', '4'], 1, 'Q', ['1', '2']])
('3', [['2', '4'], 9223372036854775807, 'U', []])
('4', [['3', '2', '5'], 9223372036854775807, 'U', []])
('5', [['1', '2', '4'], 1, 'Q', ['1', '5']])

Now on iteration: 1
('1', [['2', '5'], 0, 'V', ['1']])
('2', [['1', '3', '5', '4'], 1, 'V', ['1', '2']])
('3', [['2', '4'], 2, 'Q', ['1', '2', '3']])
('4', [['3', '2', '5'], 2, 'Q', ['1', '5', '4']])
('5', [['1', '2', '4'], 1, 'V', ['1', '5']])

Now on iteration: 2
('1', [['2', '5'], 0, 'V', ['1']])
('2', [['1', '3', '5', '4'], 1, 'V', ['1', '2']])
('3', [['2', '4'], 2, 'V', ['1', '2', '3']])
('4', [['3', '2', '5'], 2, 'V', ['1', '5', '4']])
('5', [['1', '2', '4'], 1, 'V', ['1', '5']])

Graph has been fully traversed after 3 iterations!


This matches what we'd expect to see based on the slides from class.  We also confirm that the shortest path from node 1 to 4 is [1,5,4]. Now, let's see what we get using the directed graph.

In [219]:
!cat directed_toy.txt

1	{'2': 1, '6': 1}
2	{'1': 1, '3': 1, '4': 1}
3	{'2': 1, '4': 1}
4	{'2': 1, '5': 1}
5	{'1': 1, '2': 1, '4': 1}


In [220]:
!cat undirected_toy.txt

1	{'2': 1,'5': 1}
2	{'1': 1,'3': 1,'4': 1,'5': 1}
3	{'2': 1, '4': 1}
4	{'2': 1,'3': 1,'5': 1}
5	{'1': 1, '2': 1, '4': 1}


In [232]:
#MULTIPLE ITERATIONS
import os
from ssp import SSP
mr_job = SSP(args=['graph.txt','--no-strict-protocols'])

init_graph(1,'directed_toy.txt')

persist=True
count=0
while persist:
    print "Now on iteration: "+str(count)
    persist=False
    with open('graph_new.txt','w+') as f:
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access to the output 
            for line in runner.stream_output():
                output=mr_job.parse_output_line(line)
                print output
                node,data=output
                f.write(line)
                #If we encounter any unvisited nodes, continue iterating
                if data[2]!='V':
                    persist=True
                
    count+=1
    os.rename('graph_new.txt','graph.txt')
    print ""
print "Graph has been fully traversed after {0} iterations!".format(str(count))

Now on iteration: 0
('1', [['2', '6'], 0, 'V', ['1']])
('2', [['1', '3', '4'], 1, 'Q', ['1', '2']])
('3', [['2', '4'], 9223372036854775807, 'U', []])
('4', [['2', '5'], 9223372036854775807, 'U', []])
('5', [['1', '2', '4'], 9223372036854775807, 'U', []])
('6', [[], 1, 'Q', ['1', '6']])

Now on iteration: 1
('1', [['2', '6'], 0, 'V', ['1']])
('2', [['1', '3', '4'], 1, 'V', ['1', '2']])
('3', [['2', '4'], 2, 'Q', ['1', '2', '3']])
('4', [['2', '5'], 2, 'Q', ['1', '2', '4']])
('5', [['1', '2', '4'], 9223372036854775807, 'U', []])
('6', [[], 1, 'V', ['1', '6']])

Now on iteration: 2
('1', [['2', '6'], 0, 'V', ['1']])
('2', [['1', '3', '4'], 1, 'V', ['1', '2']])
('3', [['2', '4'], 2, 'V', ['1', '2', '3']])
('4', [['2', '5'], 2, 'V', ['1', '2', '4']])
('5', [['1', '2', '4'], 3, 'Q', ['1', '2', '4', '5']])
('6', [[], 1, 'V', ['1', '6']])

Now on iteration: 3
('1', [['2', '6'], 0, 'V', ['1']])
('2', [['1', '3', '4'], 1, 'V', ['1', '2']])
('3', [['2', '4'], 2, 'V', ['1', '2', '3']])
('4', [['2'

Sure enough, we confirm that the shortest path from node 1 to node 5 using the directed graph is [1,2,4,5].
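
The driver loops above keep iterating until every node is marked `'V'`, which works for the toy graphs because every node is reachable from node 1. One possible alternative is to stop as soon as no node is left in the `'Q'` state, which also terminates when some nodes can never be reached. The sketch below simulates the same U/Q/V bookkeeping in memory rather than through MRJob; `expand_frontier`, `run_until_frontier_empty`, and the extra node `'7'` are hypothetical and not part of `ssp.py` or the toy files.

```python
import sys

MAX_DIST = sys.maxsize  # stand-in for the "infinite" distance of unvisited nodes

def expand_frontier(state):
    """One iteration: visit every 'Q' node and queue its unvisited neighbors."""
    new_state = dict(state)
    for node, (links, dist, status, path) in state.items():
        if status != 'Q':
            continue
        new_state[node] = (links, dist, 'V', path)
        for link in links:
            # A node seen only as a link target has no record yet
            old = new_state.get(link, ([], MAX_DIST, 'U', []))
            if old[2] == 'U':
                new_state[link] = (old[0], dist + 1, 'Q', path + [link])
    return new_state

def run_until_frontier_empty(graph, source):
    """Iterate until no node is in the 'Q' state; return (state, iterations)."""
    state = {n: (links, MAX_DIST, 'U', []) for n, links in graph.items()}
    state[source] = (graph.get(source, []), 0, 'Q', [source])
    iterations = 0
    while any(rec[2] == 'Q' for rec in state.values()):
        state = expand_frontier(state)
        iterations += 1
    return state, iterations

# directed_toy.txt plus a hypothetical node '7' that no other node links to
graph = {'1': ['2', '6'], '2': ['1', '3', '4'], '3': ['2', '4'],
         '4': ['2', '5'], '5': ['1', '2', '4'], '7': ['1']}
final_state, n_iter = run_until_frontier_empty(graph, '1')
print(final_state['5'])
print(final_state['7'])
```

With this stopping rule the unreachable node `'7'` simply stays in the `'U'` state, whereas a loop that waits for every node to become `'V'` would never finish on this input.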

## HW 7.1

### HW 7.1 Problem Statement

Using MRJob, explore the synonyms network data.
Consider plotting the degree distribution (does it follow a power law?),
and determine some of the key features, like:

- number of nodes,
- number of links,
- or the average degree (i.e., the average number of links per node),
- etc...

As you develop your code, please be sure to run it locally first (though on the whole dataset). 
Once you have gotten your code to run locally, deploy it on AWS as a systems test
in preparation for our next dataset (which will require AWS).
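
To make the requested statistics concrete, here is a minimal local sketch in plain Python rather than MRJob. It assumes the synonyms file uses the same `node<TAB>{neighbor: weight}` layout as the toy files and that the network is undirected, so each link is stored twice; the function name `graph_stats` is hypothetical, and the sample lines are copied from `undirected_toy.txt`.

```python
import ast
from collections import Counter

def graph_stats(lines):
    """Return (node count, link count, average degree, degree histogram)."""
    degrees = {}
    for line in lines:
        node_id, links = line.strip().split('\t')
        # literal_eval parses the dictionary literal without running arbitrary code
        degrees[node_id] = len(ast.literal_eval(links))
    n_nodes = len(degrees)
    total_degree = sum(degrees.values())
    # Assumption: undirected adjacency list, so every link is counted twice
    n_links = total_degree // 2
    avg_degree = float(total_degree) / n_nodes
    histogram = Counter(degrees.values())
    return n_nodes, n_links, avg_degree, histogram

sample = [
    "1\t{'2': 1,'5': 1}",
    "2\t{'1': 1,'3': 1,'4': 1,'5': 1}",
    "3\t{'2': 1, '4': 1}",
    "4\t{'2': 1,'3': 1,'5': 1}",
    "5\t{'1': 1, '2': 1, '4': 1}",
]
n_nodes, n_links, avg_degree, histogram = graph_stats(sample)
print(n_nodes)
print(n_links)
print(avg_degree)
print(sorted(histogram.items()))
```

The histogram maps each degree to the number of nodes with that degree. Plotting it on log-log axes is a common first check for a power law, since a power-law distribution appears roughly as a straight line there.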

## HW 7.2

### HW 7.2 Problem Statement

Write (reuse your code from 7.0) an MRJob class to find shortest path graph distances, 
and apply it to the NLTK synonyms network dataset. 

Prove your code's function by running the job:

- shortest path starting at "walk" (index=7827) and ending at "make" (index=536),

and showing your code's output. Once again, your output should include the path and the distance.

As you develop your code, please be sure to run it locally first (though on the whole dataset). Once you have gotten your code to run locally, deploy it on AWS as a systems test
in preparation for our next dataset (which will require AWS).

## HW 7.3

### HW 7.3 Problem Statement
Using MRJob, explore the Wikipedia network data on the AWS cloud. Reuse your code from HW 7.1---does it scale well?
Be cautioned that Wikipedia is a directed network, where links are not symmetric. 
So, even though a node may be linked to, it will not appear as a primary record itself if it has no out-links. 
This means that you may have to ADJUST your code (depending on its design). 
To be sure of your code's functionality in this context, run a systems test on the directed_toy.txt network.
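
The caveat about nodes without out-links can be seen directly in `directed_toy.txt`, where node 6 is linked to by node 1 but has no record of its own. The sketch below is a local plain-Python illustration, not an MRJob job; the function name `all_node_ids` is hypothetical, and the sample lines are copied from `directed_toy.txt`.

```python
import ast

def all_node_ids(lines):
    """Return (all node IDs, IDs that appear only as link targets)."""
    sources, targets = set(), set()
    for line in lines:
        node_id, links = line.strip().split('\t')
        sources.add(node_id)
        targets.update(ast.literal_eval(links).keys())
    return sources | targets, targets - sources

directed_toy = [
    "1\t{'2': 1, '6': 1}",
    "2\t{'1': 1, '3': 1, '4': 1}",
    "3\t{'2': 1, '4': 1}",
    "4\t{'2': 1, '5': 1}",
    "5\t{'1': 1, '2': 1, '4': 1}",
]
nodes, dangling = all_node_ids(directed_toy)
print(sorted(nodes))
print(sorted(dangling))
```

Counting only the record keys would report five nodes here, while the graph actually has six. In a MapReduce setting, one way to get the same effect is for the mapper to emit every link target as a key in addition to the record's own ID, so that the reducer sees each node at least once.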


##HW 7.4

### HW 7.4 - Problem Statement

Using MRJob, find shortest path graph distances in the Wikipedia network on the AWS cloud.
Reuse your code from 7.2, but once again be warned of Wikipedia being a directed network.
To be sure of your code's functionality in this context, run a systems test on the directed_toy.txt network.

When running your code on the Wikipedia network, prove its function by running the job:

- shortest path from "Ireland" (index=6176135) to "University of California, Berkeley" (index=13466359),

and show your code's output.

Once your code is running, find some other shortest paths and report your results.

##HW 7.5

### HW 7.5 Problem Statement
Suppose you wanted to find the largest network distance from a single source,
i.e., a node that is the furthest (but still reachable) from a single source.

How would you implement this task? 
How is this different from finding the shortest path graph distances?

Is this task more difficult to implement than the shortest path distance?

As you respond, please comment on program structure, runtimes, iterations, general system requirements, etc...
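
One possible way to think about this question for an unweighted graph is that the same breadth-first traversal can run until the frontier is empty instead of stopping at a target, after which the largest finite distance identifies the furthest reachable nodes. The sketch below shows the idea in memory only; it is not a reference answer, the function name `furthest_reachable` is hypothetical, and the graph is copied from `directed_toy.txt`.

```python
from collections import deque

def furthest_reachable(graph, source):
    """Return (max distance, sorted list of nodes at that distance) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Nodes without out-links have no entry in the adjacency dictionary
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    max_dist = max(dist.values())
    return max_dist, sorted(n for n, d in dist.items() if d == max_dist)

directed = {'1': ['2', '6'], '2': ['1', '3', '4'], '3': ['2', '4'],
            '4': ['2', '5'], '5': ['1', '2', '4']}
print(furthest_reachable(directed, '1'))
```

Unlike a single source-to-target query, this version cannot stop early: every reachable node has to be visited, so the number of iterations equals the largest distance found plus a final pass that confirms the frontier is empty.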

## End of Submission