# Explanation - Learning


## Classifying Congress

During Obama's visit to MIT, you got a chance to impress him with your analytical thinking. Now, he has hired you to do some political modeling for him. He seems to surround himself with smart people that way.

He takes a moment out of his busy day to explain what you need to do. "I need a better way to tell which of my plans are going to be supported by Congress," he explains. "Do you think we can get a model of Democrats and Republicans in Congress, and which votes separate them the most?"

"Yes, we can!" You answer.

## The Data

You acquire the data on how everyone in the previous Senate and House of Representatives voted on every issue. (These data are available in machine-readable form via voteview.com. We've included them in the laboratory directory, in the files beginning with `H110` and `S110`.)

The Data Reader code contains functions for reading data in this format.

`read_congress_data("FILENAME.ord")` reads a specially-formatted file that gives information about each Congressperson and the votes they cast. It returns a list of dictionaries, one for each member of Congress, including the following items:
* 'name': The name of the Congressperson.
* 'state': The state they represent.
* 'party': The party that they were elected under.
* 'votes': The votes that they cast, as a list of numbers. 1 represents a "yea" vote, -1 represents "nay", and 0 represents either that they abstained, were absent, or were not a member of Congress at the time.
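
As a rough illustration of this structure, the sketch below builds one record by hand and tallies its votes. The record's contents are invented for the example and are not taken from the data files.

```python
from collections import Counter

# A hand-written record in the shape described above (all values are invented).
example_person = {
    'name': 'Doe, J',
    'state': 'Ohio',
    'party': 'Democrat',
    'votes': [1, -1, 0, 1, 1],
}

# Map each numeric code to a readable label and count how often each occurs.
labels = {1: 'yea', -1: 'nay', 0: 'no vote'}
tally = Counter(labels[v] for v in example_person['votes'])
print(tally['yea'], tally['nay'], tally['no vote'])
```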

To make sense of the votes, you will also need information about what they were voting on. This is provided by `read_vote_data("FILENAME.csv")`, which returns a list of votes in the same order that they appear in the Congresspeople's entries. Each vote is represented as a dictionary of information, which you can convert into a readable string by running `vote_info(vote)`.

The Data Reader code reads in the provided data, storing them in the variables `senate_people`, `senate_votes`, `house_people`, and `house_votes`.

## Nearest Neighbors

You decide to start by making a nearest-neighbors classifier that can tell Democrats apart from Republicans in the Senate.

We've provided a `nearest_neighbors` function that classifies data based on training data and a distance function. In particular, this is a third-order function:
* First, call `nearest_neighbors(distance, k)`, with distance being the distance function you wish to use and `k` being the number of neighbors to check. This returns a *classifier factory*.
* A classifier factory is a function that makes classifiers. You call it with some training data as an argument, and it returns a classifier.
* Finally, you call the classifier with a data point (here, a Congressperson) and it returns the classification as a string.
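
The three-step calling pattern can be sketched with nested closures. The names `make_factory`, `count_differences`, and `toy_train` are hypothetical stand-ins written for this example. This is a simplified illustration of the idea, not the lab's `nearest_neighbors`.

```python
def make_factory(distance, k):
    """Step 1: fix the distance function and k; return a classifier factory."""
    def factory(train):
        """Step 2: fix the training data; return a classifier."""
        def classify(query):
            """Step 3: label one data point by majority among its k nearest."""
            ordered = sorted(train,
                             key=lambda p: distance(query['votes'], p['votes']))
            nearest = [p['party'] for p in ordered[:k]]
            return max(set(nearest), key=nearest.count)
        return classify
    return factory

def count_differences(a, b):
    # Number of positions at which the two vote lists differ.
    return sum(1 for x, y in zip(a, b) if x != y)

# Invented training data: two legislators with three votes each.
toy_train = [
    {'party': 'Democrat', 'votes': [1, 1, -1]},
    {'party': 'Republican', 'votes': [-1, -1, 1]},
]

factory = make_factory(count_differences, 1)   # step 1
classifier = factory(toy_train)                # step 2
result = classifier({'votes': [1, 1, 1]})      # step 3
print(result)  # Democrat
```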

Much of this is handled by the `evaluate(factory, group1, group2)` function, which you can use to test the effectiveness of a classification strategy. You give it a classifier factory (as defined above) and two sets of data. It will train a classifier on one data set and test the results against the other, and then it will switch them and test again.

Given a list of data such as `senate_people`, you can divide it arbitrarily into two groups using the `crosscheck_groups(data)` function.
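
To see what kind of split this produces, the sketch below applies the same slicing pattern that `crosscheck_groups` uses in the Classify code to a short list of integers. The helper name `split_like_crosscheck` is made up for this example.

```python
def split_like_crosscheck(items):
    # Within every block of four items, positions 0 and 3 go to group 1
    # and positions 1 and 2 go to group 2.
    group1 = items[0::4] + items[3::4]
    group2 = items[1::4] + items[2::4]
    return group1, group2

g1, g2 = split_like_crosscheck(list(range(8)))
print(g1)  # [0, 4, 3, 7]
print(g2)  # [1, 5, 2, 6]
```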

One way to measure the "distance" between Congresspeople is with the *Hamming distance*: the number of entries that differ. This function is provided as `hamming_distance`.
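
As a small worked example, the sketch below counts the differing entries between two invented vote lists. The function `count_mismatches` is a stand-in written for this example. The lab's own version is `hamming_distance`.

```python
def count_mismatches(votes_a, votes_b):
    """Hamming distance: the number of positions where the two lists differ."""
    return sum(1 for a, b in zip(votes_a, votes_b) if a != b)

# Invented vote lists (1 = yea, -1 = nay, 0 = no vote).
senator_a = [1, 1, -1, 0, 1]
senator_b = [1, -1, -1, 1, 1]

d = count_mismatches(senator_a, senator_b)
print(d)  # 2 (the second and fourth entries differ)
```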

An example of putting this all together is provided in the laboratory code:

    senate_group1, senate_group2 = crosscheck_groups(senate_people)
    evaluate(nearest_neighbors(hamming_distance, 1), senate_group1, senate_group2, verbose=1)

Examine the results of this evaluation. In addition to the problems caused by independents, it classifies Senator Johnson from South Dakota as a Republican instead of a Democrat, mainly because he missed a lot of votes while he was recovering from a brain hemorrhage. This is a problem with the distance function: Hamming distance counts every mismatch equally, but when one Senator votes yes and another is absent, that is less of a "disagreement" than when one votes yes and the other votes no.

You should address this. Euclidean distance is a reasonable measure for the distance between lists of discrete numeric features, and is the alternative to Hamming distance that you decide to try. Recall that the formula for Euclidean distance is:

*[(x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2] ^ (1/2)*

* Make a distance function called `euclidean_distance` that treats the votes as high-dimensional vectors, and returns the Euclidean distance between them.
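
To make the earlier point about absences concrete, the sketch below compares the two measures on invented four-vote records. The helper names `hamming` and `euclid` are written for this example only. A yes/absent pair contributes (1 - 0)^2 = 1 to the sum under the square root, while a yes/no pair contributes (1 - (-1))^2 = 4.

```python
import math

def hamming(a, b):
    # Every mismatch counts as 1, whatever the two values are.
    return sum(1 for x, y in zip(a, b) if x != y)

def euclid(a, b):
    # Square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

voted_yes = [1, 1, 1, 1]
was_absent = [0, 0, 0, 0]
voted_no = [-1, -1, -1, -1]

print(hamming(voted_yes, was_absent), hamming(voted_yes, voted_no))  # 4 4
print(euclid(voted_yes, was_absent), euclid(voted_yes, voted_no))    # 2.0 4.0
```

Under the Hamming measure the absent legislator looks exactly as far from the yes voter as the no voter does. Under the Euclidean measure the absent legislator is only half as far.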

When you evaluate using `euclidean_distance`, you should get better results, except that some people are being classified as Independents. Given that there are only 2 Independents in the Senate, you want to avoid classifying someone as an Independent just because they vote similarly to one of them.
* Make a simple change to the parameters of `nearest_neighbors` that accomplishes this, and call the classifier factory it outputs `my_classifier`.
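
The effect of looking at more than one neighbor can be sketched as a majority vote over the nearest labels. The list of neighbor parties below is invented, and `majority_label` is a hypothetical helper rather than part of the lab code. Which value of `k` works best on the real Senate data is left for you to find.

```python
from collections import Counter

def majority_label(sorted_labels, k):
    """Return the most common label among the first k entries."""
    return Counter(sorted_labels[:k]).most_common(1)[0][0]

# Hypothetical parties of a query's neighbors, listed nearest first.
neighbor_parties = ['Independent', 'Democrat', 'Democrat',
                    'Republican', 'Democrat']

one_neighbor = majority_label(neighbor_parties, 1)
three_neighbors = majority_label(neighbor_parties, 3)
print(one_neighbor)     # Independent
print(three_neighbors)  # Democrat
```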

## ID Trees

So far you've classified Democrats and Republicans, but you haven't created a model of which votes distinguish them. You want to make a classifier that explains the distinctions it makes, so you decide to use an ID-tree classifier.

`idtree_maker(votes, disorder_metric)` is a third-order function similar to `nearest_neighbors`. You initialize it by giving it a list of vote information (such as `senate_votes` or `house_votes`) and a function for calculating the disorder of two classes. It returns a classifier factory that will produce instances of the `CongressIDTree` class, defined in the Classify code, to distinguish legislators based on their votes.

The possible decision boundaries used by `CongressIDTree` are, for each vote:
* Did this legislator vote YES on this vote, or not?
* Did this legislator vote NO on this vote, or not?

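(These are different because it is possible for a legislator to abstain or be absent. The provided `CongressIDTree` code also tries "abstained or was absent" as a third possible criterion for each vote.)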
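
The difference between the two boundaries shows up as soon as someone has not voted. In the sketch below, the dictionary `one_vote` and the helper `split_on` are invented for illustration. The lab's own splitting function is `partition` in the Classify code.

```python
def split_on(votes_by_name, vote_value):
    """Split names into those whose vote equals vote_value and the rest."""
    matched = [n for n, v in votes_by_name.items() if v == vote_value]
    unmatched = [n for n, v in votes_by_name.items() if v != vote_value]
    return matched, unmatched

# Invented votes on a single bill: 1 = yes, -1 = no, 0 = abstained or absent.
one_vote = {'A': 1, 'B': -1, 'C': 0, 'D': 1}

yes_side, not_yes = split_on(one_vote, 1)
no_side, not_no = split_on(one_vote, -1)

print(yes_side, not_yes)  # ['A', 'D'] ['B', 'C']
print(no_side, not_no)    # ['B'] ['A', 'C', 'D']
```

Legislator C, who did not vote, lands on the "not yes" side of the first split and on the "not no" side of the second, so the two criteria divide the group differently.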

You can also use `CongressIDTree` directly to make an ID tree over the entire data set.

If you `print` a `CongressIDTree`, then you get a text representation of the tree. Each level of the ID tree shows the minimum disorder it found, the criterion that gives this minimum disorder, and (marked with a +) the decision it makes for legislators who match the criterion, and (marked with a -) the decision for legislators who don't. The decisions are either a party name or another ID tree. An example is shown in the section below.

## An ID tree for the entire Senate

You start by making an ID tree for the entire Senate. This doesn't leave you anything to test it on, but it will show you the votes that distinguish Republicans from Democrats the most quickly overall. You run this (which you can uncomment in your lab file):

`print(CongressIDTree(senate_people, senate_votes, homogeneous_disorder))`

The ID tree you get here is:

    Disorder: -49
    Yes on S.Con.Res. 21: Kyl Amdt. No. 583; To reform the death tax by setting the
    exemption at 5 million dollars per estate, indexed for inflation, and the top death
    tax rate at no more than 35% beginning in 2010; to avoid subjecting an
    estimated 119,200 families, family businesses, and family farms to the death
    tax each and every year; to promote continued economic growth and job creation;
    and to make the enhanced teacher deduction permanent.:
    + Republican
    - Disorder: -44
      Yes on H.R. 1585: Feingold Amdt. No. 2924; To safely redeploy United States
      troops from Iraq.:
      + Democrat
      - Disorder: -3
        No on H.R. 1495: Coburn Amdt. No. 1089; To prioritize Federal spending to
        ensure the needs of Louisiana residents who lost their homes as a result of
        Hurricane Katrina and Rita are met before spending money to design or
        construct a nonessential visitors center.:
        + Democrat
        - Disorder: -2
          Yes on S.Res. 19: S. Res. 19; A resolution honoring President Gerald
          Rudolph Ford.:
          + Disorder: -4
            Yes on H.R. 6: Motion to Waive C.B.A. re: Inhofe Amdt. No. 1666; To
            ensure agricultural equity with respect to the renewable fuels standard.:
            + Democrat
            - Independent
          - Republican

Some things that you can observe from these results are:
* Senators like to write bills with very long-winded titles that make political points.
* The key issue that most clearly divided Democrats and Republicans was the issue that Democrats call the "estate tax" and Republicans call the "death tax", with 49 Republicans voting to reform it.
* The next key issue involved 44 Democrats voting to redeploy troops from Iraq.
* The issues below that serve only to peel off homogeneous groups of 2 to 4 people.
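
The negative disorder values in the tree come from the size of the homogeneous group that a criterion splits off. The sketch below reimplements that idea under hypothetical names (`is_homogeneous`, `homogeneous_score`). The 49-member yes side mirrors the top of the tree, while the no side is invented.

```python
def is_homogeneous(labels):
    # True when every label in the list is the same.
    return len(set(labels)) == 1

def homogeneous_score(yes, no):
    """Subtract the size of each side of the split that holds one party only."""
    score = 0
    if is_homogeneous(yes):
        score -= len(yes)
    if is_homogeneous(no):
        score -= len(no)
    return score

yes_side = ['Republican'] * 49
no_side = ['Democrat'] * 5 + ['Republican'] * 1 + ['Independent'] * 2  # invented

s = homogeneous_score(yes_side, no_side)
print(s)  # -49
```

Because lower scores are preferred, this metric favors whichever criterion peels off the largest single-party group, even if the remaining group stays thoroughly mixed.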

## Implementing a better disorder metric

You should be able to reduce the depth and complexity of the tree, by changing the disorder metric from the one that looks for the largest homogeneous group to the information-theoretical metric.
* Write the `information_disorder(group1, group2)` function to replace `homogeneous_disorder`. This function takes in the lists of classifications that fall on each side of the decision boundary, and returns the information-theoretical disorder.

Example:

    information_disorder(["Democrat", "Democrat", "Democrat"], ["Republican", "Republican"])
      => 0.0
    information_disorder(["Democrat", "Republican"], ["Republican", "Democrat"])
      => 1.0

Once this is written, you can try making a new `CongressIDTree` with it. (If you're having trouble, keep in mind that the function should return a float or a similar numeric value.)
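
For a case between the two extremes above, here is a worked sketch. It assumes the usual definition: the entropy of each side in bits, weighted by the fraction of items on that side. The names `entropy` and `average_disorder` are chosen for this example.

```python
from math import log2

def entropy(labels):
    """Entropy in bits of the class distribution in one list of labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def average_disorder(group1, group2):
    """Entropy of each side, weighted by that side's share of all items."""
    n = len(group1) + len(group2)
    return (len(group1) / n * entropy(group1)
            + len(group2) / n * entropy(group2))

mixed = average_disorder(["Democrat", "Democrat", "Republican"], ["Republican"])
print(round(mixed, 3))  # 0.689
```

The first side has entropy of about 0.918 bits and holds 3 of the 4 items. The second side is pure and contributes nothing, giving roughly 0.75 * 0.918 = 0.689.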

## Evaluating over the House of Representatives

Now, you decide to evaluate how well ID trees do in the wild, weird world of the House of Representatives.

You can try running an ID tree on the entire House and all of its votes. It's disappointing. The 110th House began with a vote on the rules of order, where everyone present voted along straight party lines.

It's not a very informative result to observe that Democrats think Democrats should make the rules and Republicans think Republicans should make the rules.

Anyway, since your task was to make a tool for classifying the newly-elected Congress, you'd like it to work after a relatively small number of votes. We've provided a function, `limited_house_classifier`, which evaluates an ID tree classifier that uses only the most recent N votes in the House of Representatives. You just need to find a good value of *N*.
* Using `limited_house_classifier`, find a good number *N_1* of votes to take into account, so that the resulting ID trees classify at least 430 Congresspeople correctly. Similarly, find the number *N_2* of training examples (previous votes) needed to classify at least 90 senators correctly, and the number *N_3* needed to classify at least 95 senators from the previous Senate correctly. **To pass the tests**, you will need to find close to the minimum such values for *N_1*, *N_2*, and *N_3*, so keep guessing until you get close to the minimum that passes the offline tests. Do the values surprise you? Is the House more unpredictable than the Senate, or is it just bigger?
* Which is better at predicting the Senate: 200 training samples, or 2000? Why?

The total number of Congresspeople in the evaluation may change, as people who didn't vote in the last *N* votes (perhaps because they're not in office anymore) aren't included.
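
Instead of guessing by hand, the search can be automated with a simple linear scan. In the sketch below, `smallest_n` is a hypothetical helper and `fake_scores` holds invented numbers standing in for real evaluation results. In the lab you might pass something like `lambda n: limited_house_classifier(house_people, house_votes, n)` instead.

```python
def smallest_n(score_for_n, target, max_n):
    """Return the first n in 1..max_n whose score reaches target, else None."""
    for n in range(1, max_n + 1):
        if score_for_n(n) >= target:
            return n
    return None

# Stand-in scores with invented values, for illustration only.
fake_scores = {1: 380, 2: 410, 3: 425, 4: 431, 5: 429}

found = smallest_n(lambda n: fake_scores[n], 430, 5)
print(found)  # 4
```

A linear scan is used rather than a binary search because the score need not increase steadily with *N*. In the invented numbers above it dips again at 5.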


# Data Reader

In [None]:
"""
A set of utility functions for reading in the data format used by Keith T.
Poole on voteview.com.

You can download additional data that will work with these functions, for
any Congress going back to the 1st, on that site.
"""
import csv
from copy import deepcopy

def legislator_info(legislator):
    district = ''
    if legislator['district'] > 0: district = '-%s' % legislator['district']
    return "%s (%s%s)" % (legislator['name'], legislator['state'], district)

def vote_info(vote):
    if not vote['name']: return vote['number']
    return "%s: %s" % (vote['number'], vote['name'])

def is_interesting(vote):
    return (vote['name'] != '')

def title_case(str):
    chars = list(str)
    chars[0] = chars[0].upper()
    for i in range(1, len(chars)):
        if chars[i-1] not in ' -': chars[i] = chars[i].lower()
    return ''.join(chars)

state_codes = {}
f = open('states.dat')
for line in f:
    state_codes[int(line[0:2])] = title_case(line[6:].strip())
f.close()

party_codes = {}
f = open('party3.dat')
for line in f:
    party_codes[int(line[2:6])] = line[8:].strip()
f.close()

def vote_meaning(n):
    if n in [1, 2, 3]: return 1
    elif n in [4, 5, 6]: return -1
    else: return 0

def read_congress_data(filename):
    """
    Reads a database of Congressional information in the format that comes
    from Keith T. Poole's voteview.com.
    """
    f = open(filename)
    legislators = []
    for line in f:
        line = line.rstrip()
        person = {}
        person['state'] = state_codes[int(line[8:10])]
        person['district'] = int(line[10:12])
        person['party'] = party_codes[int(line[19:23])]
        name = line[25:36].strip()
        person['name'] = title_case(name.replace("  ", ", "))
        person['votes'] = [vote_meaning(int(x)) for x in line[36:]]
        legislators.append(person)
    f.close()
    return legislators

def read_vote_data(filename):
    """
    Reads a CSV file of data on the votes that were taken.
    """
    f = open(filename, encoding='utf-8')
    csv_reader = csv.reader(f)
    votes = []
    for row in csv_reader:
        vote = {}
        vote['date'] = row[0]
        vote['number'] = row[3]
        vote['motion'] = row[4]
        vote['name'] = row[6]
        vote['result'] = row[5]
        votes.append(vote)
    f.close()
    return votes

def limit_votes(legislators, votes, n):
    indices = [i for i in range(len(legislators[0]['votes'])-1, -1, -1) if
    is_interesting(votes[i])][:n]

    newleg = []
    for leg in legislators:
        leg = deepcopy(leg)
        leg['votes'] = [leg['votes'][i] for i in indices]
        found_any_votes = False
        for vote in leg['votes']:
            if vote != 0:
                found_any_votes = True
                break
        if found_any_votes: newleg.append(leg)
    newvotes = [votes[i] for i in indices]
    return newleg, newvotes

## Classify

In [None]:
try:
    set()
except:
    from sets import Set as set

import math
import random

INFINITY = 1e100
def crosscheck_groups(people):
    # Split up the data in an interesting way.
    group1 = people[0::4] + people[3::4]
    group2 = people[1::4] + people[2::4]
    return group1, group2

def random_split_groups(people):
    # Split up the data randomly, into two groups of ~equal size
    group1 = random.sample(people, len(people) // 2)
    group2 = [ x for x in people if not x in group1 ]
    return (group1, group2)
    
def evaluate(factory, group1, group2, verbose=0):
    score = 0
    for (test, train) in ((group1, group2), (group2, group1)):
        classifier = factory(train)
        for legislator in test:
            gold_standard = legislator['party']
            predicted = classifier(legislator)
            if gold_standard == predicted:
                score += 1
                if verbose >= 2:
                    print("%s: %s (correct)" % (legislator_info(legislator), predicted))
            else:
                if verbose >= 1:
                    print("* %s: got %s, actually %s" %\
                    (legislator_info(legislator), predicted, gold_standard))
    if verbose >= 1:
        print("Accuracy: %d/%d" % (score, len(group1) + len(group2)))
    return score

def hamming_distance(list1, list2):
    """ Calculate the Hamming distance between two lists """
    # Make sure we're working with lists
    # Sorry, no other iterables are permitted
    assert isinstance(list1, list)
    assert isinstance(list2, list)

    dist = 0

    # 'zip' is a Python builtin, documented at
    # <http://www.python.org/doc/lib/built-in-funcs.html>
    for item1, item2 in zip(list1, list2):
        if item1 != item2: dist += 1
    return dist

edit_distance = hamming_distance

def nearest_neighbors(distance, k=1):
    def nearest_neighbors_classifier(train):
        def classify_value(query):
            best_distance = INFINITY
            ordered = sorted(train, key=lambda x: distance(query['votes'],
            x['votes']))
            nearest = [x['party'] for x in ordered[:k]]
            best_class = None
            best_count = 0
            for party in nearest:
                count = nearest.count(party)
                if count > best_count:
                    best_count = count
                    best_class = party
            return best_class
        return classify_value
    return nearest_neighbors_classifier

def homogeneous_disorder(yes, no):
    result = 0
    if homogeneous_value(yes): result -= len(yes)
    if homogeneous_value(no): result -= len(no)
    return result

def partition(legislators, vote_index, vote_value):
    # Find the people who voted a particular way, and the people who didn't.
    # Yes, No, and Abstain/Absent count as three different options here.
    matched = []
    unmatched = []
    for leg in legislators:
        if leg['votes'][vote_index] == vote_value:
            matched.append(leg)
        else:
            unmatched.append(leg)
    return matched, unmatched

def homogeneous_value(lst):
    """If this list contains just a single value, return it."""
    assert isinstance(lst[0], str)
    for item in lst[1:]:
        if item != lst[0]: return None
    return lst[0]

class CongressIDTree(object):
    def __init__(self, legislators, vote_meanings, disorder_func=None):
        if disorder_func is None: disorder_func = homogeneous_disorder
        self.vote_meanings = vote_meanings
        homog_test = homogeneous_value([leg['party'] for leg in legislators])
        if homog_test:
            self.leaf_value = homog_test
        else:
            self.leaf_value = None
            best_disorder = INFINITY
            best_criterion = None
            for vote_index in range(len(legislators[0]['votes'])):
                for vote_value in [1, 0, -1]:
                    yes, no = partition(legislators, vote_index, vote_value)
                    if len(yes) == 0 or len(no) == 0: continue
                    disord = disorder_func([y['party'] for y in yes],
                                           [n['party'] for n in no])
                    if disord < best_disorder:
                        best_disorder = disord
                        best_criterion = (vote_index, vote_value)

            if best_criterion is None:
                # No reasonable criteria left, so give up
                self.leaf_value = 'Unknown'
                return
            vote_index, vote_value = best_criterion
            self.criterion = best_criterion
            self.disorder = best_disorder
            yes_class, no_class = partition(legislators, vote_index, vote_value)
            yes_values = [y['party'] for y in yes_class]
            no_values = [n['party'] for n in no_class]
            self.yes_branch = CongressIDTree(yes_class, vote_meanings,
            disorder_func)
            self.no_branch = CongressIDTree(no_class, vote_meanings,
            disorder_func)
    
    def classify(self, legislator):
        if self.leaf_value: return self.leaf_value
        vote_index, vote_value = self.criterion
        if legislator['votes'][vote_index] == vote_value:
            return self.yes_branch.classify(legislator)
        else:
            return self.no_branch.classify(legislator)

    def __str__(self):
        return '\n'+self._str(0)

    def _str(self, indent):
        if self.leaf_value:
            return str(self.leaf_value)
        
        vote_index, vote_value = self.criterion
        value_name = 'Abstain/Absent'
        if vote_value == -1: value_name = 'No'
        elif vote_value == 1: value_name = 'Yes'
        vote_name = vote_info(self.vote_meanings[vote_index])
        indentation = ' '*indent
        disord_string = 'Disorder: %s' % self.disorder
        yes_string = indentation+'+ '+self.yes_branch._str(indent+2)
        no_string = indentation+'- '+self.no_branch._str(indent+2)
        return ("%(disord_string)s\n%(indentation)s%(value_name)s on %(vote_name)s:"
                "\n%(yes_string)s\n%(no_string)s") % locals()


def idtree_maker(vote_meanings, disorder_func):
    def train_classifier(train):
        idtree = CongressIDTree(train, vote_meanings, disorder_func)
        def classify_value(query):
            return idtree.classify(query)
        return classify_value
    return train_classifier


# **To be implemented code**

In [None]:
import math

### Data sets for the lab
## You will be classifying data from these sets.
senate_people = read_congress_data('S110.ord')
senate_votes = read_vote_data('S110desc.csv')

house_people = read_congress_data('H110.ord')
house_votes = read_vote_data('H110desc.csv')

last_senate_people = read_congress_data('S109.ord')
last_senate_votes = read_vote_data('S109desc.csv')


### Part 1: Nearest Neighbors
## An example of evaluating a nearest-neighbors classifier.
senate_group1, senate_group2 = crosscheck_groups(senate_people)
#evaluate(nearest_neighbors(hamming_distance, 1), senate_group1, senate_group2, verbose=1)

## Write the euclidean_distance function.
## This function should take two lists of integers and
## find the Euclidean distance between them.
## See 'hamming_distance()' in classify.py for an example that
## computes Hamming distances.

def euclidean_distance(list1, list2):
    # this is not the right solution!
    # return hamming_distance(list1, list2)
    dist = 0

    for i1, i2 in zip(list1, list2):
        dist += (i1 - i2)**2

    return dist**0.5

#Once you have implemented euclidean_distance, you can check the results:
#evaluate(nearest_neighbors(euclidean_distance, 1), senate_group1, senate_group2)

## By changing the parameters you used, you can get a classifier factory that
## deals better with independents. Make a classifier that makes at most 3
## errors on the Senate.

my_classifier = nearest_neighbors(hamming_distance, 1)
#evaluate(my_classifier, senate_group1, senate_group2, verbose=1)

### Part 2: ID Trees
#print(CongressIDTree(senate_people, senate_votes, homogeneous_disorder))

## Now write an information_disorder function to replace homogeneous_disorder,
## which should lead to simpler trees.

    # For democrat, republican, independent

def information_disorder(yes, no):
   ## return homogeneous_disorder(yes, no)
    from math import log

    def disorder(branch):
        total = 0
        for unique in set(branch):
            nbc = float(branch.count(unique))
            nb = float(len(branch))
            ratio = nbc/nb
            total += -ratio * log(ratio, 2)
        return total

    floaty = float(len(yes))
    floatn = float(len(no))
    yn = floaty + floatn

    yes_branch_disorder = floaty/yn * disorder(yes)
    no_branch_disorder = floatn/yn * disorder(no)

    average_disorder = yes_branch_disorder + no_branch_disorder

    return average_disorder

#print(CongressIDTree(senate_people, senate_votes, information_disorder))
#evaluate(idtree_maker(senate_votes, homogeneous_disorder), senate_group1, senate_group2)

## Now try it on the House of Representatives. However, do it over a data set
## that only includes the most recent n votes, to show that it is possible to
## classify politicians without ludicrous amounts of information.

def limited_house_classifier(house_people, house_votes, n, verbose = False):
    house_limited, house_limited_votes = limit_votes(house_people,
    house_votes, n)
    house_limited_group1, house_limited_group2 = crosscheck_groups(house_limited)

    if verbose:
        print("ID tree for first group:")
        print(CongressIDTree(house_limited_group1, house_limited_votes,
                             information_disorder))
        print()
        print("ID tree for second group:")
        print(CongressIDTree(house_limited_group2, house_limited_votes,
                             information_disorder))
        print()
        
    return evaluate(idtree_maker(house_limited_votes, information_disorder),
                    house_limited_group1, house_limited_group2)

                                   
## Find a value of n that classifies at least 430 representatives correctly.
## Hint: It's not 10.
N_1 = 44
#31 for k_nearest and 44 for ID_tree
rep_classified = limited_house_classifier(house_people, house_votes, N_1)

## Find a value of n that classifies at least 90 senators correctly.
N_2 = 67
#71 /67 for IDtree and 11 for k_nearest!
senator_classified = limited_house_classifier(senate_people, senate_votes, N_2)

## Now, find a value of n that classifies at least 95 of last year's senators correctly.
N_3 = 23
old_senator_classified = limited_house_classifier(last_senate_people, last_senate_votes, N_3)
#22 for k_nearest and 23 for IDtree


## This function is used by the tester, please don't modify it!
def eval_test(eval_fn, group1, group2, verbose = 0):
    """ Find eval_fn in globals(), then execute evaluate() on it """
    # Only allow known-safe eval_fn's
    if eval_fn in [ 'my_classifier' ]:
        return evaluate(globals()[eval_fn], group1, group2, verbose)
    else:
        raise Exception("Error: Tester tried to use an invalid evaluation function: '%s'" % eval_fn)

# Tester

In [None]:
from xmlrpc import client
import traceback
import sys
import os
import tarfile

try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


# This is a skeleton for what the tester should do. Ideally, this module
# would be imported in the pset and run as its main function. 

# We need the following rpc functions. (They generally take username and
# password, but you could adjust this for whatever security system.)
#
# tester.submit_code(username, password, pset, studentcode)
#   'pset' is a string such as 'ps0'. studentcode is a string containing
#   the contents of the corresponding file, ps0.py. This stores the code on
#   the server so we can check it later for cheating, and is a prerequisite
#   to the tester returning a grade.
#
# tester.get_tests(pset)
#   returns a list of tuples of the form (INDEX, TYPE, NAME, ARGS):
#     INDEX is a unique integer that identifies the test.
#     TYPE should be one of either 'VALUE' or 'FUNCTION'.
#     If TYPE is 'VALUE', ARGS is ignored, and NAME is the name of a
#     variable to return for this test.  The variable must be an attribute
#     of the lab module.
#     If TYPE is 'FUNCTION', NAME is the name of a function in the lab module
#     whose return value should be the answer to this test, and ARGS is a
#     tuple containing arguments for the function.
#
# tester.send_answer(username, password, pset, index, answer)
#   Sends <answer> as the answer to test case <index> (0-numbered) in the pset
#   named <pset>. Returns whether the answer was correct, and an expected
#   value.
#
# tester.status(username, password, pset)
#   A string that includes the official score for this user on this pset.
#   If a part is missing (like the code), it should say so.

# Because I haven't written anything on the server side, test_online has never
# been tested.

def test_summary(dispindex, ntests, testname):
    return "Test %d/%d (%s)" % (dispindex, ntests, testname)
  
tests = []

def show_result(testsummary, testcode, correct, got, expected, verbosity):
    """ Pretty-print test results """
    if correct:
        if verbosity > 0:
            print("%s: Correct." % testsummary)
        if verbosity > 1:
            print('\t', testcode)
            print()
    else:
        print("%s: Incorrect." % testsummary)
        print('\t', testcode)
        print("Got:     ", got)
        print("Expected:", expected)

def show_exception(testsummary, testcode):
    """ Pretty-print exceptions (including tracebacks) """
    print("%s: Error." % testsummary)
    print("While running the following test case:")
    print('\t', testcode)
    print("Your code encountered the following error:")
    traceback.print_exc()
    print()


def get_lab_module():
    # Try the easy way first
    try:
        from tests import lab_number
    except ImportError:
        lab_number = None
        
    if lab_number != None:
        lab = __import__('lab%s' % lab_number)
        return lab
        
    lab = None

    for labnum in range(10):
        try:
            lab = __import__('lab%s' % labnum)
            break  # stop at the first lab found so that labnum matches it
        except ImportError:
            pass

    if lab == None:
        raise ImportError("Cannot find your lab; or, error importing it.  Try loading it by running 'python labN.py' (for the appropriate value of 'N').")

    if not hasattr(lab, "LAB_NUMBER"):
        lab.LAB_NUMBER = labnum
    
    return lab

def type_decode(arg, lab):
    """
    XMLRPC can only pass a very limited collection of types.
    Frequently, we want to pass a subclass of 'list' in as a test argument.
    We do that by converting the sub-type into a regular list of the form:
    [ 'TYPE', (data) ] (ie., AND(['x','y','z']) becomes ['AND','x','y','z']).
    This function assumes that TYPE is a valid attr of 'lab' and that TYPE's
    constructor takes a list as an argument; it uses that to reconstruct the
    original data type.
    """
    if isinstance(arg, list) and len(arg) >= 1: # We'll leave tuples reserved for some other future magic
        try:
            mytype = arg[0]
            data = arg[1:]
            return getattr(lab, mytype)([ type_decode(x, lab) for x in data ])
        except AttributeError:
            return [ type_decode(x, lab) for x in arg ]
        except TypeError:
            return [ type_decode(x, lab) for x in arg ]
    else:
        return arg

    
def type_encode(arg):
    """
    Encode trees as lists in a way that can be decoded by 'type_decode'
    """
    if isinstance(arg, list) and not type(arg) in (list,tuple):
        return [ arg.__class__.__name__ ] + [ type_encode(x) for x in arg ]
    elif hasattr(arg, '__class__') and arg.__class__.__name__ == 'IF':
        return [ 'IF', type_encode(arg._conditional), type_encode(arg._action), type_encode(arg._delete_clause) ]
    else:
        return arg

    
def run_test(test, lab):
    """
    Takes a 'test' tuple as provided by the online tester
    (or generated by the offline tester) and executes that test,
    returning whatever output is expected (the variable that's being
    queried, the output of the function being called, etc)

    'lab' (the argument) is the module containing the lab code.
    
    'test' tuples are in the following format:
      'id': A unique integer identifying the test
      'type': One of 'VALUE', 'FUNCTION', 'MULTIFUNCTION', or 'FUNCTION_ENCODED_ARGS'
      'attr_name': The name of the attribute in the 'lab' module
      'args': a list of the arguments to be passed to the function; [] if no args.
      For 'MULTIFUNCTION's, a list of lists of arguments to be passed in
    """
    id, mytype, attr_name, args = test

    attr = getattr(lab, attr_name)

    if mytype == 'VALUE':
        return attr
    elif mytype == 'FUNCTION':
        return attr(*args)
    elif mytype == 'MULTIFUNCTION':
        return [ run_test( (id, 'FUNCTION', attr_name, FN), lab) for FN in args ]
    elif mytype == 'FUNCTION_ENCODED_ARGS':
        return run_test( (id, 'FUNCTION', attr_name, type_decode(args, lab)), lab )
    else:
        raise Exception("Test Error: Unknown TYPE '%s'.  Please make sure you have downloaded the latest version of the tester script.  If you continue to see this error, contact a TA." % mytype)


def test_offline(verbosity=1):
    """ Run the unit tests in 'tests.py' """
#    import tests as tests_module
    
#    tests = [ (x[:-8],
#               getattr(tests_module, x),
#               getattr(tests_module, "%s_testanswer" % x[:-8]),
#               getattr(tests_module, "%s_expected" % x[:-8]),
#               "_".join(x[:-8].split('_')[:-1]))
#              for x in tests_module.__dict__.keys() if x[-8:] == "_getargs" ]

#    tests = tests_module.get_tests()
    global tests

    ntests = len(tests)
    ncorrect = 0
    
    for index, (testname, getargs, testanswer, expected, fn_name, type) in enumerate(tests):
        dispindex = index+1
        summary = test_summary(dispindex, ntests, fn_name)
        
        try:
            if callable(getargs):
                getargs = getargs()
            
            if type == 'FUNCTION':
                answer = fn_name(*getargs)
            elif type == 'VALUE':
                answer = fn_name
            else:
                answer = [ fn_name(*args) for args in getargs ]  # MULTIFUNCTION: one call per argument list
        except NotImplementedError:
            print("%d: (%s: Function not yet implemented, NotImplementedError raised)" % (dispindex, testname))
            continue
        except Exception:
            show_exception(summary, testname)
            continue
        
        correct = testanswer(answer, original_val = getargs)
        show_result(summary, testname, correct, answer, expected, verbosity)
        if correct: ncorrect += 1
    
    print("Passed %d of %d tests." % (ncorrect, ntests))
    tests = []
    return ncorrect == ntests

def get_target_upload_filedir():
    """ Get, via user prompting, the directory containing the current lab """
    cwd = os.getcwd() # Get current directory.  Play nice with Unicode pathnames, just in case.
        
    print("Please specify the directory containing your lab.")
    print("Note that all files from this directory will be uploaded!")
    print("Labs should not contain large amounts of data; very-large")
    print("files will fail to upload.")
    print()
    print("The default path is '%s'" % cwd)
    target_dir = input("[%s] >>> " % cwd)

    target_dir = target_dir.strip()
    if target_dir == '':
        target_dir = cwd

    print("Ok, using '%s'." % target_dir)

    return target_dir

def get_tarball_data(target_dir, filename):
    """ Return a binary String containing the binary data for a tarball of the specified directory """
    data = StringIO()
    file = tarfile.open(filename, "w|bz2", data)

    print("Preparing the lab directory for transmission...")
            
    file.add(target_dir+"/lab4.py")
    
    print("Done.")
    print()
    print("The following files have been added:")
    
    for f in file.getmembers():
        print(f.name)
            
    file.close()

    return data.getvalue()
    

def test_online(verbosity=1):
    """ Run online unit tests.  Run them against the 6.034 server via XMLRPC. """
    lab = get_lab_module()

    try:
        server = xmlrpclib.Server(server_url, allow_none=True)
        #print("Getting tests:", (username, password, lab.__name__))
        tests = server.get_tests(username, password, lab.__name__)
        #print("*** TESTS:")
        #print(tests)

    except NotImplementedError: # Solaris Athena doesn't seem to support HTTPS
        print("Your version of Python doesn't seem to support HTTPS, for")
        print("secure test submission.  Would you like to downgrade to HTTP?")
        print("(note that this could theoretically allow a hacker with access")
        print("to your local network to find your 6.034 password)")
        answer = input("(Y/n) >>> ")
        if len(answer) == 0 or answer[0] in "Yy":
            server = xmlrpclib.Server(server_url.replace("https", "http"))
            tests = server.get_tests(username, password, lab.__name__)
        else:
            print("Ok, not running your tests.")
            print("Please try again on another computer.")
            print("Linux Athena computers are known to support HTTPS,")
            print("if you use the version of Python in the 'python' locker.")
            sys.exit(0)
            
    ntests = len(tests)
    ncorrect = 0

    lab = get_lab_module()
    
    target_dir = get_target_upload_filedir()

    tarball_data = get_tarball_data(target_dir, "lab%s.tar.bz2" % lab.LAB_NUMBER)
            
    print("Submitting to the 6.034 Webserver...")

    server.submit_code(username, password, lab.__name__, xmlrpclib.Binary(tarball_data))

    print("Done submitting code.")
    print("Running test cases...")
    
    for index, testcode in enumerate(tests):
        dispindex = index+1
        summary = test_summary(dispindex, ntests, testcode)

        try:
            answer = run_test(testcode, get_lab_module())
        except Exception:
            show_exception(summary, testcode)
            continue

        correct, expected = server.send_answer(username, password, lab.__name__, testcode[0], type_encode(answer))
        show_result(summary, testcode, correct, answer, expected, verbosity)
        if correct: ncorrect += 1
    
    response = server.status(username, password, lab.__name__)
    print(response)



#if __name__ == '__main__':
#    test_offline()
    
def make_test_counter_decorator():
    #tests = []
    def make_test(getargs, testanswer, expected_val, name = None, type = 'FUNCTION'):
        if name is not None:
            getargs_name = name
        elif not callable(getargs):
            getargs_name = "_".join(getargs[:-8].split('_')[:-1])
            getargs = lambda value=getargs: value  # bind the current value; a bare closure would return the lambda itself
        else:
            getargs_name = "_".join(getargs.__name__[:-8].split('_')[:-1])
            
        tests.append( ( getargs_name,
                        getargs,
                        testanswer,
                        expected_val,
                        getargs_name,
                        type ) )

    def get_tests():
        return tests

    return make_test, get_tests


make_test, get_tests = make_test_counter_decorator()


# Tests
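
Each test in the cell below is registered with `make_test`, which stores a tuple holding the argument source, an answer checker, the expected value, and the function or value under test; `test_offline` later walks that list. The following stand-alone sketch mimics that register-then-run flow under simplifying assumptions. The names `registry`, `register`, `run_all`, and `distance` are hypothetical stand-ins and are not part of the tester.

```python
import math

# Hypothetical stand-in for the tester's global `tests` list.
registry = []

def register(getargs, testanswer, expected_val, target):
    """Store one test as a tuple, much as make_test appends to `tests`."""
    registry.append((getargs, testanswer, expected_val, target))

def run_all():
    """Call each target on its arguments and count the accepted answers."""
    passed = 0
    for getargs, testanswer, expected_val, target in registry:
        # Arguments may be given directly or produced by a callable.
        args = getargs() if callable(getargs) else getargs
        answer = target(*args)
        if testanswer(answer, original_val=args):
            passed += 1
    return passed, len(registry)

def distance(p, q):
    """Hypothetical function under test: straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

register(getargs=lambda: [[0, 0], [3, 4]],
         testanswer=lambda val, original_val=None: abs(val - 5) < 1e-5,
         expected_val=5,
         target=distance)

print("Passed %d of %d tests." % run_all())
```

The checker receives the computed answer instead of comparing it for strict equality, which is why the floating-point tests can accept any value within a small tolerance.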

In [None]:
# This code implements some very rudimentary matrix-like and vector-like operations
# It is used by the tester.
# You are welcome to use these functions for the laboratory implementation as well,
# though this isn't expected to be necessary.

import random
import operator
import math

def test_code():
    try:
        all([])
    except NameError:
        # The all() function was introduced in Python 2.5.
        # Provide our own implementation if it's not available here.
        def all(iterable):
            for element in iterable:
                if not element:
                    return False
            return True


    def unit_vector(vec1, vec2):
        """ Return a unit vector pointing from vec1 to vec2 """
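        # Materialize the differences as a list: under Python 3, map() returns a
        # one-shot iterator that would already be exhausted by the sum() below.
        diff_vector = list(map(operator.sub, vec2, vec1))
        
        scale_factor = math.sqrt( sum( x**2 for x in diff_vector ) )
        if scale_factor == 0:
            scale_factor = 1 # We don't have an actual vector, it has zero length
        return [ x/scale_factor for x in diff_vector ]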

    def vector_compare(vec1, vec2, delta):
        """ Compare two vectors
        Confirm that no two corresponding fields differ by more than delta """
        return all( map(lambda x,y: (abs(x-y) < delta), vec1, vec2) )
        
    def validate_euclidean_distance(list1, list2, dist):
        """
        Confirm that the given distance is the Euclidean distance
        between list1 and list2 by establishing a unit vector between
        the two lists and seeing if vec * dist + list1 == list2
        """

        vec = unit_vector(list1, list2)
        target = map(lambda jmp, base: jmp * dist + base, vec, list1)
        return vector_compare(target, list2, 0.01)

    def random_list(length):
        return [ random.randint(1,100) for x in range(length) ]


    # KNN tests

    def euclidean_distance_1_getargs():
        return [ [1,2,3], [4,5,6] ]

    def euclidean_distance_1_testanswer(val, original_val = None):
        return ( abs(val - math.sqrt(27)) < 0.00001 )

    make_test(type = 'FUNCTION',
            getargs = euclidean_distance_1_getargs,
            testanswer = euclidean_distance_1_testanswer,
            expected_val = math.sqrt(27),
            name = euclidean_distance
            )


    senate_people = read_congress_data('S110.ord')
    senate_votes = read_vote_data('S110desc.csv')

    house_people = read_congress_data('H110.ord')
    house_votes = read_vote_data('H110desc.csv')

    last_senate_people = read_congress_data('S109.ord')
    last_senate_votes = read_vote_data('S109desc.csv')


    def euclidean_distance_1_getargs():
        return [ [1,2,3], [4,5,6] ]

    def euclidean_distance_1_testanswer(val, original_val = None):
        return ( abs(val - math.sqrt(27)) < 0.00001 )

    make_test(type = 'FUNCTION',
            getargs = euclidean_distance_1_getargs,
            testanswer = euclidean_distance_1_testanswer,
            expected_val = math.sqrt(27),
            name = euclidean_distance
            )

    def euclidean_distance_2_getargs():
        return [ [0,0], [3,4] ]

    def euclidean_distance_2_testanswer(val, original_val = None):
        return ( abs(val - math.sqrt(25)) < 0.00001 )

    make_test(type = 'FUNCTION',
            getargs = euclidean_distance_2_getargs,
            testanswer = euclidean_distance_2_testanswer,
            expected_val = 5,
            name = euclidean_distance
            )

    def disorder_1_getargs():
        return [ ['Democrat','Democrat','Democrat'], ['Republican',"Republican"]]
    def disorder_1_testanswer(val, original_val=None):
        return (abs(val - 0) < 0.0001)
    make_test(type="FUNCTION",
            getargs = disorder_1_getargs,
            testanswer = disorder_1_testanswer,
            expected_val = 0,
            name=information_disorder)

    def disorder_2_getargs():
        return [ ['Democrat','Republican'], ['Democrat',"Republican"]]
    def disorder_2_testanswer(val, original_val=None):
        return (abs(val - 1) < 0.0001)
    make_test(type="FUNCTION",
            getargs = disorder_2_getargs,
            testanswer = disorder_2_testanswer,
            expected_val = 1,
            name=information_disorder)

    def disorder_3_getargs():
        return [ ['Democrat','Democrat','Democrat','Republican'],
                ['Democrat','Republican',"Republican"]]
    def disorder_3_testanswer(val, original_val=None):
        return (abs(val - 0.8571428) < 0.0001)
    make_test(type="FUNCTION",
            getargs = disorder_3_getargs,
            testanswer = disorder_3_testanswer,
            expected_val = 0.8571428,
            name=information_disorder)


    def eval_test_1_getargs():
        senate_group1, senate_group2 = crosscheck_groups(senate_people)
        return [ 'my_classifier', senate_group1, senate_group2 ]

    def eval_test_1_testanswer(val, original_val = None):
        return ( val >= 97 )

    make_test(type = 'FUNCTION',
            getargs = eval_test_1_getargs,
            testanswer = eval_test_1_testanswer,
            expected_val = "Less than or equal to 3 miscategorizations",
            name = eval_test
            )

    rep_classified_getargs = "rep_classified"

    def rep_classified_testanswer(val, original_val = None):
        return ( val >= 430 )

    make_test(type = 'VALUE',
            getargs = rep_classified_getargs,
            testanswer = rep_classified_testanswer,
            expected_val = "430 or larger",
            name = rep_classified
            )

    senator_classified_getargs = "senator_classified"

    def senator_classified_testanswer(val, original_val = None):
        return ( val >= 90 )

    make_test(type = 'VALUE',
            getargs = senator_classified_getargs,
            testanswer = senator_classified_testanswer,
            expected_val = "90 or larger",
            name = senator_classified
            )

    old_senator_classified_getargs = "old_senator_classified"

    def old_senator_classified_testanswer(val, original_val = None):
        return ( val >= 95 )

    make_test(type = 'VALUE',
            getargs = old_senator_classified_getargs,
            testanswer = old_senator_classified_testanswer,
            expected_val = "95 or larger",
            name = old_senator_classified
            )

    test_offline()

#**Test your code**

In [None]:
test_code()

Test 1/10 (<function euclidean_distance at 0x7ff5e473d200>): Correct.
Test 2/10 (<function euclidean_distance at 0x7ff5e473d200>): Correct.
Test 3/10 (<function euclidean_distance at 0x7ff5e473d200>): Correct.
Test 4/10 (<function information_disorder at 0x7ff5e473d4d0>): Correct.
Test 5/10 (<function information_disorder at 0x7ff5e473d4d0>): Correct.
Test 6/10 (<function information_disorder at 0x7ff5e473d4d0>): Correct.
Test 7/10 (<function eval_test at 0x7ff5e463ce60>): Correct.
Test 8/10 (431): Correct.
Test 9/10 (91): Correct.
Test 10/10 (96): Correct.
Passed 10 of 10 tests.
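
The expected values in the three `information_disorder` tests (0, 1, and 0.8571428) are consistent with a size-weighted average of each group's Shannon entropy, measured in bits. The sketch below assumes that definition and Python 3 division; `entropy` and `disorder_sketch` are illustrative names, not the lab's reference implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    # p * log2(1/p) summed over the classes that actually occur.
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())

def disorder_sketch(group1, group2):
    """Average the two groups' entropies, weighted by group size."""
    total = len(group1) + len(group2)
    return sum(len(g) / total * entropy(g) for g in (group1, group2))

# The three argument sets used by the disorder tests.
pure = disorder_sketch(['Democrat'] * 3, ['Republican'] * 2)
even = disorder_sketch(['Democrat', 'Republican'],
                       ['Democrat', 'Republican'])
mixed = disorder_sketch(['Democrat', 'Democrat', 'Democrat', 'Republican'],
                        ['Democrat', 'Republican', 'Republican'])

print(pure, even, mixed)
```

For the third case the logarithm terms cancel, so under this definition the result is exactly 6/7, which agrees with the 0.8571428 in the test to well within its 0.0001 tolerance.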
