Name: Patrick Ng  
Class: W261-2  
Date: Mar 10, 2016  
HW07

In [81]:
%load_ext autoreload
%autoreload 2



## HW 7.0: Shortest path graph distances (toy networks)

```
In this part of your assignment you will develop the base of your code for the week.

Write MRJob classes to find shortest path graph distances, 
as described in the lectures. In addition to finding the distances, 
your code should also output a distance-minimizing path between the source and target.
Work locally for this part of the assignment, and use 
both of the undirected and directed toy networks.

To prove that your code works, run the following jobs

- shortest path in the undirected network from node 1 to node 4
Solution: 1,5,4 

- shortest path in the directed network from node 1 to node 5
Solution: 1,2,4,5

and report your output---make sure it is correct!
```

#### Init function: convert an input file into an SSSP file
SSSP File format:  
```
JSON representation of:
node id \t [ {adj-list dict}, [path from src], cost from src, status ]

e.g.
3       [{"2": 1, "4": 1}, [], 9223372036854775807, "U"]
```
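
A minimal sketch of reading and writing one record in this format, assuming only Python's standard `json` module. The helper names `parse_sssp_line` and `format_sssp_line` are hypothetical and are not part of the homework code.

```python
import json

def parse_sssp_line(line):
    """Split one SSSP line into (node_id, adj_list, path, cost, status)."""
    node_id, payload = line.rstrip("\n").split("\t", 1)
    adj_list, path, cost, status = json.loads(payload)
    return node_id, adj_list, path, cost, status

def format_sssp_line(node_id, adj_list, path, cost, status):
    """Inverse of parse_sssp_line: build the tab-separated JSON record."""
    return "%s\t%s" % (node_id, json.dumps([adj_list, path, cost, status]))

# The example record from the format description above.
record = '3\t[{"2": 1, "4": 1}, [], 9223372036854775807, "U"]'
node_id, adj_list, path, cost, status = parse_sssp_line(record)
print("%s %s %s %s %s" % (node_id, sorted(adj_list), path, cost, status))
```

The status field uses the codes that appear later in the job: `U` (unvisited), `Q` (queued, on the frontier) and `V` (visited).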

In [126]:
%%writefile InitSssp.py
import sys
import json

def initSomeValues(isSourceNode, src):
    if isSourceNode:
        # This is the source node
        status = 'Q'
        sps = [int(src)] # shortest path from source
        costFromSrc = 0
    else:
        # Other nodes
        status = 'U'
        sps = [] # Unknown
        costFromSrc = sys.maxint # Unknown
        
    return (status, sps, costFromSrc)

class InitSssp:
    @staticmethod
    def run(fn, src):
        seenNodes = set()
        emittedNodes = set()
                
        # Input format:
        # 4       {'2': 1, '5': 1}
        with open('sssp.txt', 'w') as output:
            with open(fn, "r") as f:
                for line in f:
                    fields = line.strip().split('\t')

                    node = fields[0]

                    # JSON uses double quote for string
                    adjList = json.loads(fields[1].replace("'", '"'))

                    status, sps, costFromSrc = initSomeValues(src == fields[0], src)

                    values = [adjList, sps, costFromSrc, status]
                    output.write("%s\t%s\n" % (node, json.dumps(values)))

                    # Remember all the nodes we've seen in the adjacency list,
                    # and all the nodes we've emitted
                    seenNodes.update(adjList.keys())
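                    # add() stores the id as one element; update() would split
                    # a multi-character id such as "12" into "1" and "2"
                    emittedNodes.add(node)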

            # Some nodes may appear only in other nodes' adjacency lists and have no
            # input record of their own.  This can happen in a directed graph.
            # We still need to write a record (with an empty adjacency list) for each of them.
            for node in seenNodes - emittedNodes:
                status, sps, costFromSrc = initSomeValues(src == node, src)
                adjList = {}
                values = [adjList, sps, costFromSrc, status]
                output.write("%s\t%s\n" % (node, json.dumps(values)))

if __name__ == '__main__':
    fn = sys.argv[1]
    src = sys.argv[2]

    InitSssp.run(fn, src)

Overwriting InitSssp.py
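
The handling of nodes that only ever appear as a neighbor can be sketched in isolation. This is an illustrative sketch, not the homework code: the function name `nodes_without_record` is hypothetical, and the adjacency data below is an assumed directed input in which node 6 has no line of its own.

```python
def nodes_without_record(adjacency):
    """Return node ids that appear as a neighbor but never as a record key."""
    emitted = set()
    seen = set()
    for node, neighbors in adjacency.items():
        emitted.add(node)        # add() keeps a multi-character id whole
        seen.update(neighbors)   # update() adds every neighbor id of this node
    return seen - emitted

# Assumed directed input: node 6 is pointed to by node 1 but has no record.
adjacency = {
    "1": {"2": 1, "6": 1},
    "2": {"1": 1, "3": 1, "4": 1},
    "3": {"2": 1, "4": 1},
    "4": {"2": 1, "5": 1},
    "5": {"1": 1, "2": 1, "4": 1},
}
print(sorted(nodes_without_record(adjacency)))  # ['6']
```

Each id returned this way needs a record with an empty adjacency list, so that the reducer always finds the node's own structure.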


### MRJob for finding the shortest path

In [130]:
%%writefile MrSssp_hw70.py
from mrjob.job import MRJob, MRStep
import mrjob.protocol  # provides JSONProtocol, used as the input protocol
import sys

class MrSssp_hw70(MRJob):
    def configure_options(self):
        super(MrSssp_hw70, self).configure_options()
        self.add_passthrough_option(
            '-k', type='int', default=4, help='The number of centroids.')
        
    def steps(self):
        return [
            MRStep(
                   mapper=self.mapper,
                   #combiner = self.combiner,
                   reducer=self.reducer
            )
               ]

    INPUT_PROTOCOL = mrjob.protocol.JSONProtocol
    
    def mapper(self, nodeId, data):
        # data format:
        # Json object:
        # [ {adj-list dict}, [path from src], cost from src, status ]
        
        #print >> sys.stderr, "input data:", data
        adjList, pathFromSrc, costFromSrc, status = data
        
        if status == "Q":
            # The node is in Queued mode.  It's a frontier node.
            # Need to process its neighbours
            for neighbor, weight in adjList.items():
                neighbor = int(neighbor)
                neighborPathFromSrc = pathFromSrc + [neighbor]
                neighborCostFromSrc = costFromSrc + weight
                yield neighbor, (None, neighborPathFromSrc, neighborCostFromSrc, "Q")
        
            # Lastly, change its own status to visited
            status = "V"
            
        yield nodeId, (adjList, pathFromSrc, costFromSrc, status)
        
    def reducer(self, key, values):
        newStatus = None
        minCostFromSrc = sys.maxint
        
        for data in values:
            adjList, pathFromSrc, costFromSrc, status = data
            
            if adjList is None:
                # It is a record emitted from a neighbor
                if costFromSrc < minCostFromSrc:
                    minCostFromSrc = costFromSrc
                    minPathFromSrc = pathFromSrc                    
                    assert status == 'Q', "status must be Q for record emitted by neighbor"
                    newStatus = status
            else:
                realAdjList = adjList
                latestCostFromSrc = costFromSrc
                latestPathFromSrc = pathFromSrc
                latestStatus = status
        
        if minCostFromSrc < latestCostFromSrc:
            # Its 'cost from Src' becomes smaller.  Put it to the frontier.
            latestStatus = newStatus
            assert latestStatus == 'Q'
            latestCostFromSrc = minCostFromSrc
            latestPathFromSrc = minPathFromSrc            
            
        yield key, (realAdjList, latestPathFromSrc, latestCostFromSrc, latestStatus)
            
              
if __name__ == '__main__':
    MrSssp_hw70.run()

Overwriting MrSssp_hw70.py
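
To make the frontier expansion easier to follow, here is a minimal in-memory sketch of one map/shuffle/reduce pass, written without mrjob. It is an approximation under stated assumptions: the names `map_node`, `reduce_node` and `one_iteration` are hypothetical, `float("inf")` stands in for `sys.maxint`, and the starting state mirrors the undirected toy network with node 1 as the source.

```python
from collections import defaultdict

INF = float("inf")  # stands in for sys.maxint in this sketch

def map_node(node_id, record):
    """Mimic the mapper: a queued node proposes a path to each neighbor."""
    adj, path, cost, status = record
    if status == "Q":
        for neighbor, weight in adj.items():
            n = int(neighbor)
            yield n, (None, path + [n], cost + weight, "Q")
        status = "V"  # the frontier node itself becomes visited
    yield node_id, (adj, path, cost, status)

def reduce_node(values):
    """Mimic the reducer: keep the own record unless a proposal is cheaper."""
    own, best = None, None
    for value in values:
        if value[0] is None:  # proposal from a neighbor (no adjacency list)
            if best is None or value[2] < best[2]:
                best = value
        else:                 # the node's own record
            own = value
    adj, path, cost, status = own
    if best is not None and best[2] < cost:
        path, cost, status = best[1], best[2], "Q"
    return (adj, path, cost, status)

def one_iteration(state):
    grouped = defaultdict(list)  # plays the role of the shuffle phase
    for node_id, record in state.items():
        for key, value in map_node(node_id, record):
            grouped[key].append(value)
    return {key: reduce_node(values) for key, values in grouped.items()}

state = {
    1: ({"2": 1, "5": 1}, [1], 0, "Q"),
    2: ({"1": 1, "3": 1, "5": 1, "4": 1}, [], INF, "U"),
    3: ({"2": 1, "4": 1}, [], INF, "U"),
    4: ({"3": 1, "2": 1, "5": 1}, [], INF, "U"),
    5: ({"1": 1, "2": 1, "4": 1}, [], INF, "U"),
}
state = one_iteration(state)
for node_id in sorted(state):
    print("%s %s" % (node_id, state[node_id][1:]))
```

After this single pass, node 1 is visited and nodes 2 and 5 are queued with cost 1, while nodes 3 and 4 remain unvisited.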


### Driver for finding shortest path

In [134]:
%%writefile Driver_Hw70.py
from MrSssp_hw70 import MrSssp_hw70
from InitSssp import InitSssp
import json

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--inputFile", type=str)
parser.add_argument("--srcNode", type=str)
args = parser.parse_args()

mr_job = MrSssp_hw70(args=['sssp.txt',
                        '--no-strict-protocols',
                        '-r', 'inline'])

# Initialize sssp.txt
InitSssp.run(args.inputFile, args.srcNode)

i = 1
hasQueuedNode = True

# Loop as long as we still have at least one queued node
while(hasQueuedNode):
    print "iteration"+str(i)+":"
    
    hasQueuedNode = False # Init it back to False
    with mr_job.make_runner() as runner: 
        runner.run()

        # Generate the new sssp.txt based on job's output
        with open('sssp.txt', 'w') as f:
            for line in runner.stream_output():
                key, value =  mr_job.parse_output_line(line)
                f.write("%s\t%s\n" % (key,json.dumps(value)))
                print "Result: %s\t%s" % (key,value)
                
                if not hasQueuedNode:
                    # Check if we've seen at least one Queued node
                    hasQueuedNode = value[-1] == 'Q'
                        
    print
    print "------------------------"
    print
    i += 1
        

Overwriting Driver_Hw70.py


### Run it for undirected_toy.txt

In [135]:
!python Driver_Hw70.py --inputFile undirected_toy.txt --srcNode 1

iteration1:
Result: 1	[{'2': 1, '5': 1}, [1], 0, 'V']
Result: 2	[{'1': 1, '3': 1, '5': 1, '4': 1}, [1, 2], 1, 'Q']
Result: 3	[{'2': 1, '4': 1}, [], 9223372036854775807, 'U']
Result: 4	[{'3': 1, '2': 1, '5': 1}, [], 9223372036854775807, 'U']
Result: 5	[{'1': 1, '2': 1, '4': 1}, [1, 5], 1, 'Q']

------------------------

iteration2:
Result: 1	[{'2': 1, '5': 1}, [1], 0, 'V']
Result: 2	[{'1': 1, '3': 1, '5': 1, '4': 1}, [1, 2], 1, 'V']
Result: 3	[{'2': 1, '4': 1}, [1, 2, 3], 2, 'Q']
Result: 4	[{'3': 1, '2': 1, '5': 1}, [1, 2, 4], 2, 'Q']
Result: 5	[{'1': 1, '2': 1, '4': 1}, [1, 5], 1, 'V']

------------------------

iteration3:
Result: 1	[{'2': 1, '5': 1}, [1], 0, 'V']
Result: 2	[{'1': 1, '3': 1, '5': 1, '4': 1}, [1, 2], 1, 'V']
Result: 3	[{'2': 1, '4': 1}, [1, 2, 3], 2, 'V']
Result: 4	[{'3': 1, '2': 1, '5': 1}, [1, 2, 4], 2, 'V']
Result: 5	[{'1': 1, '2': 1, '4': 1}, [1, 5], 1, 'V']

------------------------



**Answer:**  
From the output, the shortest path found from node 1 to node 4 is: 1, 2, 4  
  
(This differs from the reference solution 1, 5, 4, but it is equally valid: both paths have cost 2, so there is more than one shortest path from 1 to 4.)
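
The existence of more than one shortest path can be checked with a small breadth-first enumeration. This is a sketch for the toy graph only, assuming unit edge weights. The function name `all_shortest_paths` is hypothetical, and the adjacency lists are copied from the job output above.

```python
from collections import deque

def all_shortest_paths(graph, source, target):
    """Breadth-first search that keeps every minimum-length path (unit weights)."""
    best_len = None
    found = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if best_len is not None and len(path) > best_len:
            break                     # every remaining path is longer
        node = path[-1]
        if node == target:
            best_len = len(path)
            found.append(path)
            continue
        for neighbor in sorted(graph[node]):
            if neighbor not in path:  # do not revisit a node on this path
                queue.append(path + [neighbor])
    return found

graph = {1: [2, 5], 2: [1, 3, 4, 5], 3: [2, 4], 4: [2, 3, 5], 5: [1, 2, 4]}
print(all_shortest_paths(graph, 1, 4))  # [[1, 2, 4], [1, 5, 4]]
```

Which of the two paths the MapReduce job reports depends on the order in which the reducer receives the candidate records, since it only replaces the current best on a strictly smaller cost.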

### Run it for directed_toy.txt

In [133]:
!python Driver_Hw70.py --inputFile directed_toy.txt --srcNode 1

iteration1:
Result: 1	[{'2': 1, '6': 1}, [1], 0, 'V']
Result: 2	[{'1': 1, '3': 1, '4': 1}, [1, 2], 1, 'Q']
Result: 3	[{'2': 1, '4': 1}, [], 9223372036854775807, 'U']
Result: 4	[{'2': 1, '5': 1}, [], 9223372036854775807, 'U']
Result: 5	[{'1': 1, '2': 1, '4': 1}, [], 9223372036854775807, 'U']
Result: 6	[{}, [1, 6], 1, 'Q']

------------------------

iteration2:
Result: 1	[{'2': 1, '6': 1}, [1], 0, 'V']
Result: 2	[{'1': 1, '3': 1, '4': 1}, [1, 2], 1, 'V']
Result: 3	[{'2': 1, '4': 1}, [1, 2, 3], 2, 'Q']
Result: 4	[{'2': 1, '5': 1}, [1, 2, 4], 2, 'Q']
Result: 5	[{'1': 1, '2': 1, '4': 1}, [], 9223372036854775807, 'U']
Result: 6	[{}, [1, 6], 1, 'V']

------------------------

iteration3:
Result: 1	[{'2': 1, '6': 1}, [1], 0, 'V']
Result: 2	[{'1': 1, '3': 1, '4': 1}, [1, 2], 1, 'V']
Result: 3	[{'2': 1, '4': 1}, [1, 2, 3], 2, 'V']
Result: 4	[{'2': 1, '5': 1}, [1, 2, 4], 2, 'V']
Result: 5	[{'1': 1, '2': 1, '4': 1}, [1, 2, 4, 5], 3, 'Q']
Result: 6	[{}, [1, 6], 1, 'V']

------------------------


**Answer:**  
From the output, the shortest path found from node 1 to node 5 is: 1, 2, 4, 5