# Connected Components

The purpose of this assignment is to familiarize yourself with the handling of graph data structures. You will implement depth-first search for identifying the connected components of an undirected graph, implementing procedure Search as a subroutine along the way.

You will use the [NetworkX](https://networkx.github.io/) Python package to represent and manipulate graphs. You should first familiarize yourself with its functionality by going through the brief [tutorial](http://networkx.github.io/documentation/networkx-1.9.1/tutorial/index.html). For this homework, you may only use the basic undirected graph methods listed [here](http://networkx.github.io/documentation/networkx-1.9.1/reference/classes.graph.html).

As a use case, we will work with a dataset recording the interactions between characters in Homer's *Iliad*.

In [1]:
import networkx
import urllib2
homer = urllib2.urlopen('http://people.sc.fsu.edu/~jburkardt/datasets/sgb/homer.dat')

The format of the data is straightforward. After some comment lines (beginning with \*), the file lists a codename for each character (i.e., node of the graph), followed by a description. The file then lists the groups of characters that interact in each chapter, from which you will form the edges. For instance, the first line has the form:

```1:CH,AG,ME,GS;AP,CH;HE,AC;AC,AG,CA;HE,AT;AT,AC;AT,OG;NE,AG,AC;CS,OD```

This means that CH,AG,ME,GS interacted, so there are edges for all pairs of these nodes. Groups of characters that interacted are separated by semicolons. The lines start with chapter information of the form `1:` or `&:`, which can be ignored for this problem.
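
As a small illustration of this format, the sketch below splits the sample line into its interaction groups and expands each group into pairwise edges. It uses `itertools.combinations` purely for brevity, and the helper name `groups_and_pairs` is made up for this example rather than being part of the assignment.

```python
from itertools import combinations

SAMPLE = '1:CH,AG,ME,GS;AP,CH;HE,AC;AC,AG,CA;HE,AT;AT,AC;AT,OG;NE,AG,AC;CS,OD'

def groups_and_pairs(line):
    """Hypothetical helper: return (groups, pairs) for one chapter line."""
    # Drop the chapter prefix ('1:' or '&:') and split the rest on ';'
    body = line.split(':', 1)[1].strip()
    groups = [group.split(',') for group in body.split(';')]
    # Every pair of characters inside a group becomes one edge
    pairs = [pair for group in groups for pair in combinations(group, 2)]
    return groups, pairs

groups, pairs = groups_and_pairs(SAMPLE)
# The first group of four characters contributes 4 choose 2 = 6 edges
first_group_pairs = list(combinations(groups[0], 2))
# ('AC', 'AG') and ('AG', 'AC') describe the same undirected edge
distinct_edges = set(frozenset(pair) for pair in pairs)
```

An undirected `nx.Graph` stores a repeated pair as a single edge, so duplicates such as `AC,AG` appearing in two groups are harmless.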

First implement a function to read in the nodes from the input file. You may implement any auxiliary functions as needed, and are encouraged to use small functions with specific purposes to keep your code readable. Any function you implement should be clearly commented.

Next implement a function to read in the edges from the input file.

In [2]:
def is_comment(line):
    """Returns: True if a line is considered a comment, i.e. starts with an asterisk"""
    return line.startswith('*')

def is_blank(line):
    """Returns: True if a line is blank, i.e. does not contain any characters except white space"""
    return len(line.strip()) == 0

def character_id(line):
    """Returns: character ID, which is the first word in the line of character description"""
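    chunks = line.split(' ') # split on single spaces; the ID is the first chunk
    node_id = chunks[0].strip()
    # split() always returns at least one chunk, so check the ID itself
    assert node_id, "Invalid character description"
    return node_id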

def read_nodes(gfile):
    """Reads in the nodes of the graph from the input file.
    
    Args:
        gfile: A handle for the file containing the graph data, starting at the top.
        
    Returns:
        A generator over the nodes in the graph, yielding one ID at a time:
            'CH', 'AG', 'ME', ...
    """
    # NOTE: read_nodes MUST be the first reader called after urllib2.urlopen.
    # 'for line in gfile' is an idiomatic lazy iterator over the lines of a file
    # (here, the content of the HTTP response). The function stops at the first blank line
    # and does not read the remaining lines of the file.
    for line in gfile:
        # Skip over comment lines
        if is_comment(line): continue
        # Stop iteration once a blank line is encountered
        if is_blank(line): break
        # Otherwise yield a character id
        yield character_id(line)

In [3]:
# This is a reimplementation of itertools.combinations(items, 2), included here only because of the
# requirement not to import additional modules.
def combinations(items):
    """
    Args: items - a list of items (must be unique items)
    Returns: generator of tuples (X, Y) where each pair of X, Y is an n choose 2 combination from the items list.
    """
    for first_item_idx in range(len(items)):
        for second_item_idx in range(first_item_idx+1, len(items)):
            yield (items[first_item_idx], items[second_item_idx])
    
def is_chapter_interactions(line):
    """Check if the line conforms to the character interaction line format '##: ...' or '&: ...'
    Returns: True or False
    """
    # The logic is to split by : and check that it results in only 2 chunks,
    # with the first chunk being & char or a digit
    chunks = line.split(':')
    return len(chunks) == 2 and (chunks[0] == '&' or chunks[0].isdigit())

def each_interaction_group(line):
    """Parses a chapter interaction line
    Returns: generator of character interaction groups (arrays).
    Example: For a line '&:TH,ZE,HE;ZE,HE;HE,HP;HP,OG;HP,ZE;OG' will yield the following sequence:
        [TH,ZE,HE], [ZE,HE], [HE,HP], [HP,OG], [HP,ZE], [OG]
    """
    groups = line.split(':')[1].strip()
    for group in groups.split(';'):
        yield group.split(',')

def read_edges(gfile):
    """Reads in the edges of the graph from the input file.
    
    Args:
        gfile: A handle for the file containing the graph data, starting at the top 
            of the edges section.
            
    Returns:
        A generator over the edges in the graph, yielding one pair at a time:
            ('CH', 'AG'), ('CH', 'ME'), ...
    """
    for line in gfile:
        # skip lines that are not chapter lines
        if not is_chapter_interactions(line): continue
        for group in each_interaction_group(line):
            # group here is an array of interacting characters
            # Loop through all combinations in the group, since an edge should be defined 
            # for each pair of interactions
            for interaction_pair in combinations(group):
                yield interaction_pair

The following code should now correctly create the graph.

In [4]:
import networkx as nx
G = nx.Graph()
G.add_nodes_from(read_nodes(homer))
G.add_edges_from(read_edges(homer))
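
Note that both calls share the single `homer` handle: `add_nodes_from` drains the `read_nodes` generator, which stops at the blank line, so `read_edges` continues from the line after it. A minimal sketch of that hand-off follows, using an invented miniature file. The sample lines and the helper name `take_nodes` are assumptions for illustration, not real content from `homer.dat`.

```python
# Invented miniature input laid out like the data file: comments, nodes,
# a blank separator line, then chapter lines.
sample_lines = iter([
    "* a comment line\n",
    "AA first character\n",
    "BB second character\n",
    "CC third character\n",
    "\n",
    "* another comment\n",
    "1:AA,BB;CC\n",
])

def take_nodes(lines):
    """Hypothetical stand-in for read_nodes: yield IDs until a blank line."""
    for line in lines:
        if line.startswith('*'):
            continue
        if not line.strip():
            break  # the blank line itself is consumed here
        yield line.split(' ')[0]

# Draining the generator advances the shared iterator past the blank line
nodes = list(take_nodes(sample_lines))
# Whatever is left is what an edge reader would see next
remaining = list(sample_lines)
```

Because the iterator is shared, swapping the order of the two calls would leave the node reader with nothing to read.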

Next implement procedure Search. The function takes in a graph and a root node, and returns a list of the nodes visited during the search. The nodes should appear in the order in which they were *first visited*. The neighbors of a node should be processed in *alphabetical order*, where numbers come before letters. This will ensure that the output of your function is uniquely defined, given any input node.
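
To make the ordering rule concrete, here is a minimal sketch on an invented toy graph, stored as a plain dict of neighbor sets instead of a NetworkX graph so that it runs without dependencies. The helper name `preorder` and its optional `visited` argument are assumptions for illustration. Python's default string ordering already places digits before uppercase letters, which matches the rule above for node names like those in this dataset.

```python
def preorder(adjacency, root, visited=None):
    """Hypothetical helper: nodes reachable from root, in first-visit order."""
    if visited is None:
        visited = set()
    visited.add(root)
    order = [root]
    # sorted() on strings puts digits before uppercase letters
    for neighbor in sorted(adjacency[root]):
        if neighbor not in visited:
            order.extend(preorder(adjacency, neighbor, visited))
    return order

# Invented toy graph: node -> set of neighbors (undirected, so symmetric)
toy = {
    'A': {'C', 'B', '1'},
    'B': {'A', 'C'},
    'C': {'A', 'B'},
    '1': {'A'},
    'Z': set(),
}
order_from_a = preorder(toy, 'A')
```

Starting from `'A'`, the search visits `'1'` first, then `'B'`, and reaches `'C'` through `'B'`; the isolated node `'Z'` is never reached.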

In [5]:
def unexplored_dict(graph):
    """
    Returns: dictionary with graph nodes as keys and False as values, i.e. all nodes start unexplored
    """
    return {node: False for node in graph.nodes()}

explored = unexplored_dict(G)

def Search(graph, root):
    """Runs Search from vertex root in a graph. Neighboring nodes are processed in alphabetical order.
    
    Args:
        graph: the given graph, with nodes encoded as strings.
        root: the node from which to start the search.
        
    Returns:
        A list of nodes in the order in which they were first visited.
    """
    # Not the best practice to use globals, but given the fixed parameters and return value of Search,
    # this is a simple way to share state between connected_components and multiple Search invocations.
    global explored 
    # previsit(root)
    explored[root] = True  
    visited_nodes = [root]
    for node in sorted(graph.adj[root]):
        if not explored[node]:
            visited_nodes = visited_nodes + Search(graph, node)
    # postvisit(root)
    return visited_nodes

We will check the correctness of your code by verifying that it correctly computes the DFS tree starting at Ulysses (node `OD`).

In [6]:
ulysses = Search(G, 'OD')

Next implement DFS to find the connected components of the character graph. When choosing roots for your components, always pick the *smallest unvisited node* according to alphabetical ordering. Combined with your Search routine, this will ensure that the output is again uniquely defined.
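
The root-selection rule can be sketched on an invented toy graph as follows. This version swaps the recursion for an explicit stack and again uses a plain dict of neighbor sets in place of a NetworkX graph; the helper name `components` is hypothetical. Pushing neighbors in reverse sorted order is intended to reproduce the first-visit order of the recursive search.

```python
def components(adjacency):
    """Hypothetical helper: list of components, each in first-visit order."""
    visited = set()
    result = []
    # Scanning roots in sorted order picks the smallest unvisited node each time
    for root in sorted(adjacency):
        if root in visited:
            continue
        component = []
        stack = [root]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            component.append(node)
            # Push neighbors in reverse order so the smallest is popped first
            for neighbor in sorted(adjacency[node], reverse=True):
                if neighbor not in visited:
                    stack.append(neighbor)
        result.append(component)
    return result

# Invented toy graph with three components, one of them an isolated node
toy = {
    'A': {'B', 'C'},
    'B': {'A'},
    'C': {'A'},
    'D': {'E'},
    'E': {'D'},
    'F': set(),
}
toy_components = components(toy)
```

Each component is listed under its smallest node, so the components themselves come out in alphabetical order of their roots.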

In [7]:
def connected_components(graph):
    """Computes the connected components of the given graph.
    
    Args: 
        graph: the given graph, with nodes encoded as strings.
        
    Returns:
        The connected components of the graph. Components are listed in
        alphabetical order of their root nodes.
    """
    global explored
    # Reset shared state for the graph passed in (not the module-level G)
    explored = unexplored_dict(graph) # this is O(n)
    components = []
    # Sorting list of nodes to be deterministic in output
    for node in sorted(graph.nodes()): # O(n log n)
        if not explored[node]:
            components.append(Search(graph, node))
    return components

We will check correctness of your code by verifying that your output list is identical to our solution.

In [8]:
character_interactions = connected_components(G)

As a preliminary check, you should find that the following statements are all true.

In [9]:
component_sizes = [len(c) for c in character_interactions]
print "There are 12 connected components in the Iliad:", len(component_sizes) == 12
print "The giant component has size 542:", max(component_sizes) == 542
print "There are 5 isolated characters:", len([c for c in component_sizes if c == 1]) == 5

There are 12 connected components in the Iliad: True
The giant component has size 542: True
There are 5 isolated characters: True
