## Before submitting
1. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

2. Make sure that no assertions fail and no exceptions occur; otherwise, points will be subtracted.

3. After you submit the notebook, more tests will be run on your code. The fact that no assertions fail locally on your computer does not guarantee that you completed the exercise correctly.

4. Please submit only the `*.ipynb` file.

5. Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Edit only between `YOUR CODE HERE` and `END YOUR CODE`.

6. Make sure to use Python 3, not Python 2.

Fill in your group name and collaborators below:

In [1]:
GROUPNAME = ""
COLLABORATORS = ""

---

# Exercise Sheet 1: Python Basics

This first exercise sheet tests the basic functionality of the Python programming language in the context of a simple prediction task. We consider the problem of predicting the health risk of subjects from personal data and habits. For this task, we first use a decision tree

![](tree.png)

adapted from the webpage http://www.refactorthis.net/post/2013/04/10/Machine-Learning-tutorial-How-to-create-a-decision-tree-in-RapidMiner-using-the-Titanic-passenger-data-set.aspx. For this exercise sheet, you are required to use only pure Python, and to not import any module, including numpy. In exercise sheet 2, the nearest neighbor part of this exercise sheet will be revisited with numpy.

## Classifying a single instance (15 P)

* In this sheet we will represent patient info as a tuple.
* Create a function that takes as input a tuple containing values for the attributes (smoker, age, diet) and computes the output of the decision tree. It should return `"less"` or `"more"`.
* Test your function on the tuple `('yes', 31, 'good')`.

In [2]:
def decision(x):
    '''
    This function implements the decision tree represented above. As input the function 
    receives a tuple with three values that represent some information about a patient.
    Args:
        x (tuple): Input tuple containing exactly three values
        val1 (string): If a patient is a smoker this value will be 'yes'. All other values 
            represent that the patient is not a smoker
        val2 (int): The age of a patient as an integer
        val3 (string): If a patient has a good diet this string will be 'good'. All other
            values represent that the patient has a poor diet.
    Returns:
        string: A string that has either the value 'more' or 'less'. No other return value is valid.
                        
    '''
    # >>>>> YOUR CODE HERE
    if x[0] == 'yes':
        if x[1] < 29.5:
            return 'less'
        else:
            return 'more'
    elif x[2] == 'good':
        return 'less'
    else:
        return 'more'
    # <<<<< END YOUR CODE

In [3]:
# Test
x = ('yes', 31, 'good')
assert decision(x) == 'more'


## Reading a dataset from a text file (10 P)

The file `health-test.txt` contains several fictitious records of personal data and habits.

* Read the file automatically using the methods introduced during the lecture.
* Represent the dataset as a list of tuples. Make sure that the tuples have the same format as above, e.g. `('yes', 31, 'good')`.
* Make sure that you close the file after you have opened it and read its content.

**Notes**: 
* Values read from files are always strings.
* Each line contains a newline character `\n` at the end.
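
As a minimal sketch of the two notes above, the snippet below parses a single raw line without touching the file system. The sample line and the names `raw_line`, `fields` and `record` are made up for illustration and are not taken from `health-test.txt`.

```python
# Hypothetical line, shaped like one record of the test file
raw_line = 'no,45,poor\n'

# rstrip('\n') drops the trailing newline; split(',') yields a list of strings
fields = raw_line.rstrip('\n').split(',')

# The age is still a string at this point and has to be converted explicitly
record = (fields[0], int(fields[1]), fields[2])
print(record)
```

Without the `int(...)` conversion, a later comparison such as `age < 29.5` would raise a `TypeError`, because Python 3 does not compare strings with numbers.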

In [4]:
def parse_line_test(line: str):
    '''
    Takes a line from the file and parses it into a patient tuple
    
    Args:
        line (str): A line from the `health-test.txt` file
    Returns:
        tuple: A tuple representing a patient 
    '''
    # >>>>> YOUR CODE HERE
    tmp = line.rstrip().split(",")
    return (tmp[0], int(tmp[1]), tmp[2])
    # <<<<< END YOUR CODE

def gettest():
    '''
    Opens the `health-test.txt` file and parses it into
    a list of patient tuples. You can use the `parse_line_test` function
    but it is not necessary to do so.
    
    Returns:
        list: A list of patient tuples
    '''
    # >>>>> YOUR CODE HERE
    with open('health-test.txt','r') as f:
        result = list()
        for line in f:    
            tmp = parse_line_test(line)
            result.append(tmp)
        return result
    # <<<<< END YOUR CODE

In [5]:
parsed_line = parse_line_test('yes,23,good\n')
assert isinstance(parsed_line, tuple)
assert isinstance(parsed_line[1], int)


In [6]:
testset = gettest()
assert isinstance(testset, list)
assert isinstance(testset[0], tuple)

## Applying the decision tree to the dataset (15 P)

* Apply the decision tree to all points in the dataset, and return the ratio of points that are classified as `"more"`.
* A ratio is a value in the interval [0, 1]. For example, if 15 out of 50 data points return `"more"`, the value that should be returned is `0.3`.
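
The arithmetic of this example can be sketched as follows. The list `predictions` is made up to match the numbers above and stands in for the labels that the decision tree would return.

```python
# Made-up predictions: 15 times 'more' and 35 times 'less'
predictions = ['more'] * 15 + ['less'] * 35

# In Python 3 the / operator is true division, so the result is a float
ratio = predictions.count('more') / len(predictions)
print(ratio)
```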

In [7]:
def evaluate_testset():
    '''
    Every time this function is called the dataset is loaded. 
    
    Returns:
        float: The ratio (a value in [0, 1]) of data points for which your implemented decision tree returns `'more'`
    '''
    # >>>>> YOUR CODE HERE
    data = gettest()
    # Count the data points that the decision tree classifies as 'more'
    num_more = 0
    for x in data:
        if decision(x) == 'more':
            num_more += 1
    return num_more / len(data)
    # <<<<< END YOUR CODE

In [8]:
ratio = evaluate_testset()
assert isinstance(ratio, float)
assert 0.0 <= ratio <= 1.0


## Learning from examples (10 P)

Suppose that instead of relying on a fixed decision tree, we would like to use a data-driven approach where data points are classified based on a set of training observations manually labeled by experts. Such a labeled dataset is available in the file `health-train.txt`. The first three columns have the same meaning as in `health-test.txt`, and the last column contains the labels.

* Write a function that reads this file and converts it into a list of pairs. The first element of each pair is a triplet of attributes, and the second element is the label.

**Note**: A triplet is a tuple that contains exactly three values; a pair is a tuple that contains exactly two values.
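
To make the nesting concrete, here is a small sketch of one such pair and how it can be unpacked. The record is a made-up example and is not necessarily contained in `health-train.txt`.

```python
# Hypothetical training record: (triplet of attributes, label)
pair = (('yes', 67, 'poor'), 'more')

# First unpack the pair, then the triplet inside it
attributes, label = pair
smoker, age, diet = attributes
print(len(pair), len(attributes), label)
```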

In [9]:
def parse_line_train(line: str):
    '''
    This function works similarly to the `parse_line_test` function.
    It parses a line of the `health-train.txt` file into a tuple that 
    contains a patient tuple and a label.
    
    Args:
        line (str): A line from the `health-train.txt`
    
    Returns: 
        tuple: A tuple that contains a patient tuple and a label as a string
    '''
    # >>>>> YOUR CODE HERE
    # split() method returns a list of strings after breaking the given string by the specified separator.
    # rstrip() method returns a copy of the string with trailing characters removed
    tmp = line.rstrip().split(",")
    return ((tmp[0], int(tmp[1]), tmp[2]), tmp[3])
    # <<<<< END YOUR CODE

def gettrain():
    # >>>>> YOUR CODE HERE
    with open('health-train.txt','r') as f:
        result = list()
        for line in f:    
            tmp = parse_line_train(line)
            result.append(tmp)
        return result
    # <<<<< END YOUR CODE


In [10]:
parsed_line = parse_line_train('yes,67,poor,more\n')
assert isinstance(parsed_line, tuple)
assert isinstance(parsed_line[0], tuple)
assert isinstance(parsed_line[1], str)
assert parsed_line[0][1] == 67


In [11]:
trainset = gettrain()
assert isinstance(trainset, list)
assert isinstance(trainset[0], tuple)
assert isinstance(trainset[0][0], tuple)
assert isinstance(trainset[0][1], str)

## Nearest neighbor classifier (25 P)

We consider the nearest neighbor algorithm that classifies test points following the label of the nearest neighbor in the training data. For this, we need to define a distance function between data points. We define it to be

`distance(a, b) = (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])`

where `a` and `b` are two tuples corresponding to the attributes of two data points.

* Write a function that retrieves for a test point the nearest neighbor in the training set, and classifies the test point accordingly (i.e. returns the label of the nearest data point).
* Test your function on the tuple `('yes', 31, 'good')`

**Note**: You can obtain the special floating-point value infinity with `float('inf')`.
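
One possible way to use `float('inf')` is as the starting value of a running minimum. The sketch below illustrates the idea on three made-up training pairs; `toy_distance` and `toy_train` are hypothetical names used only here, and the loop is one of several valid ways to find the nearest neighbor.

```python
def toy_distance(a, b):
    # Same formula as above; a boolean counts as 0 or 1 when added to a number
    return (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])

# Made-up training pairs, for illustration only
toy_train = [
    (('no', 30, 'good'), 'less'),
    (('yes', 35, 'good'), 'more'),
    (('yes', 60, 'poor'), 'more'),
]

x = ('yes', 31, 'good')
best_dist, best_label = float('inf'), None
for attributes, label in toy_train:
    d = toy_distance(x, attributes)
    if d < best_dist:  # any finite distance is smaller than infinity
        best_dist, best_label = d, label
print(best_label, best_dist)
```

Here the second pair wins: the categorical attributes match, so only the age term `((31 - 35) / 50.0) ** 2` contributes to the distance.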

In [12]:
def distance(a: tuple, b: tuple):
    '''
    Calculates the distance between two data points (patient tuples)
    Args:
        a, b (tuple): Two patient tuples for which we want to calculate the distance
    Returns:
        float: The distance between a, b according to the above formula
    '''
    # >>>>> YOUR CODE HERE
    return (a[0] != b[0]) + ((float(a[1]) - float(b[1])) / 50.0) ** 2 + (a[2] != b[2])
    # <<<<< END YOUR CODE

def neighbor(x: tuple, trainset: list):
    '''
    Returns the label of the nearest data point in trainset to x.
    So if x is `('no', 30, 'good')` and the nearest data point is `('no', 31, 'good')` with 
    label `'less'` then `'less'` will be returned 
    
    Args: 
        x (tuple): The unknown data point for which we want to find the nearest neighbor
        trainset (list): A list of tuples with patient tuples and a label
        
    Returns: 
        str: The label of the nearest data point in the trainset. Can only be 'more' or 'less'
    '''
    # >>>>> YOUR CODE HERE
    result = list()
    for t in trainset:
        #print(x, t[0])
        dist = distance(x, t[0])
        result.append(dist)
    idx = result.index(min(result))
    #print(trainset[idx])
    return str(trainset[idx][1])
    # <<<<< END YOUR CODE

In [13]:
# Test distance
import math
assert math.isclose(distance(('yes', 34, 'poor'), ('yes', 51, 'good')), 1.1156)


In [14]:
# Test neighbor
x = ('yes', 31, 'good')
assert neighbor(x, gettrain()) == 'more'


* Apply both the decision tree and the nearest neighbor classifier to the test set, and return the list of data point(s) for which the two classifiers disagree, together with the fraction of test points for which this happens.
* A data point should look like above, e.g. `('yes', 31, 'good')`.
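
The comparison can be sketched on made-up predictions as follows. The lists `points`, `tree_labels` and `nn_labels` are hypothetical stand-ins for the test set and the outputs of the two classifiers. Note that the data points themselves are collected, not their labels.

```python
# Made-up test points and made-up predictions of the two classifiers
points = [('yes', 31, 'good'), ('no', 50, 'poor'), ('no', 22, 'good'), ('yes', 25, 'poor')]
tree_labels = ['more', 'more', 'less', 'less']
nn_labels = ['more', 'less', 'less', 'less']

# Keep the data points on which the two label lists differ
disagree = [p for p, a, b in zip(points, tree_labels, nn_labels) if a != b]
fraction = len(disagree) / len(points)
print(disagree, fraction)
```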

In [15]:
def compare_classifiers():
    '''
    This function compares the two classification methods by finding all the datapoints for which 
    the methods disagree
    
    This function doesn't receive any arguments so the test-dataset will be loaded from
    within the function
    
    Returns:
        list: A list containing all the data points which yield different results for the two
            classification methods.
        float: The fraction (a value in [0, 1]) of data points for which the two methods disagree.
    
    '''
    # >>>>> YOUR CODE HERE
    testset = gettest()
    trainset = gettrain()

    # Keep every test point on which the two classifiers disagree
    Xdisagree = list()
    for x in testset:
        if decision(x) != neighbor(x, trainset):
            Xdisagree.append(x)

    probability = len(Xdisagree) / len(testset)
    # <<<<< END YOUR CODE
    return Xdisagree, probability

In [16]:
Xdisagree, probability = compare_classifiers()
assert isinstance(Xdisagree, list)
assert 0.0 <= probability <= 1.0

One problem of simple nearest neighbors is that one needs to compare the point to predict to all data points in the training set. This can be slow for datasets of thousands of points or more. Alternatively, some classifiers train a model first, and then use it to classify the data.

## Nearest mean classifier (25 P)

We consider one such trainable model, which operates in two steps:

(1) Compute the average point for each class, (2) classify new points to be of the class whose average point is nearest to the point to predict.

For this classifier, we convert the attributes smoker and diet to real values (for smoker: yes=1.0 and no=0.0, and for diet: good=0.0 and poor=1.0), and use the modified distance function:

`distance(a,b) = (a[0] - b[0]) ** 2 + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] - b[2]) ** 2`

From now on, age is also represented as a `float`. The new data points will be referred to as numerical patient tuples.
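
The two steps can be illustrated on a tiny made-up class, as a sketch only. The helper names `mean_point` and `distance_num` and the list `toy_more` are assumptions introduced for this example; the exercise itself asks for a class-based implementation.

```python
def mean_point(points):
    # Component-wise average of a non-empty list of numerical triplets
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def distance_num(a, b):
    # Modified distance from the formula above
    return (a[0] - b[0]) ** 2 + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] - b[2]) ** 2

# Made-up numerical patient tuples that all belong to one class
toy_more = [(1.0, 40.0, 1.0), (0.0, 60.0, 1.0)]
center = mean_point(toy_more)
print(center)
print(distance_num((1.0, 50.0, 0.0), center))
```

Because a class mean such as `0.5` is generally not equal to `0.0` or `1.0`, the earlier distance based on `!=` would count nearly every categorical attribute as a mismatch. This is why the squared differences are used here.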

We adopt an object-oriented approach for building this classifier.

* Implement the methods `train` and `predict` of the class `NearestMeanClassifier`.

In [17]:
class NearestMeanClassifier:
    '''
    This class represents a NearestMeanClassifier.
    When an instance is trained a dataset is provided and the mean for each class is calculated.
    During prediction the class compares the datapoint to each class mean (not to all datapoints)
        and returns the label of the class mean to which the datapoint is closest.
    Attributes:
        more (tuple): A tuple representing the average of every 'more' instance in the dataset
        less (tuple): A tuple representing the average of every 'less' instance in the dataset
    '''
    def train(self, dataset: list):
        '''
        Calculates the class means for a given dataset and stores them in class attributes. 
        Args:
            dataset (list): A list of tuples each of them containing a numerical patient tuple and its label
        Returns:
            None
        '''
        # >>>>> YOUR CODE HERE
        l0, l1, l2 = 0, 0, 0
        m0, m1, m2 = 0, 0, 0
        num_less = 0
        num_more = 0
        
        for x, l in dataset:
            if l == 'less':
                l0 += x[0]
                l1 += x[1]
                l2 += x[2]
                num_less += 1
                
            elif l == 'more':
                m0 += x[0]
                m1 += x[1]
                m2 += x[2]
                num_more += 1
        
        self.less = (l0/num_less, l1/num_less, l2/num_less)
        self.more = (m0/num_more, m1/num_more, m2/num_more)
        # <<<<< END YOUR CODE

    def predict(self, x: tuple):
        '''
        Returns a prediction/label for patient tuple x. 
        The classifier compares the given data point to the mean class tuples of each class
        and returns the label of the class to which x is closest (according to the modified distance function).
        
        Args: 
            x (tuple): A numerical patient tuple for which we want a prediction
        '''
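        # >>>>> YOUR CODE HERE
        def dist(a, b):
            # Modified distance for numerical patient tuples; the earlier
            # `distance` function compares with != and does not apply to class means
            return (a[0] - b[0]) ** 2 + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] - b[2]) ** 2

        if dist(x, self.more) < dist(x, self.less):
            prediction = 'more'
        else:
            prediction = 'less'
        # <<<<< END YOUR CODE
        return prediction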

* Implement a function that will load the training dataset from the `health-train.txt` file and parse each line to a numerical patient tuple. You can still follow the same structure that we used before (i.e. using a `parse_line` function), however, it is not required for this exercise. 
* Build an object of class `NearestMeanClassifier`, train it on the training data, and return the mean class numerical tuple for each class.

In [18]:
def gettrain_num():
    '''
    Parses the `health-train.txt` file into numerical patient tuples
    
    Returns: 
        list: A list of tuples containing numerical patient tuples and their labels
    '''
    # >>>>> YOUR CODE HERE
    r = list()
    
    for v, l in gettrain():        
        f1 = 1.0 if v[0] == 'yes' else 0.0
        f2 = float(v[1])
        f3 = 0.0 if v[2] == 'good' else 1.0
        r.append(((f1, f2, f3), l))
        
    return r   
    # <<<<< END YOUR CODE
    
def build_and_train():
    '''
    Instantiates the `NearestMeanClassifier`, trains on the training set and returns a dictionary 
    containing the numerical mean class tuple for each class.
    
    Returns:
        dict: A dict with two keys (less, more). For each key the mean class numerical tuple is returned
    '''
    # >>>>> YOUR CODE HERE
    cls = NearestMeanClassifier()
    cls.train(gettrain_num())
    return {"less" : cls.less, "more" : cls.more}
    # <<<<< END YOUR CODE


In [19]:
trainset_num = gettrain_num()
assert isinstance(trainset_num, list)
assert isinstance(trainset_num[0][0][0], float)


In [20]:
train_dict = build_and_train()
assert isinstance(train_dict['less'], tuple)
assert isinstance(train_dict['more'], tuple)

* Load the test dataset into memory as a list of numerical patient tuples
* Predict the test data using the nearest mean classifier and print all test examples for which all three classifiers (decision tree, nearest neighbor and nearest mean) agree.

**Note**: Be careful that the `NearestMeanClassifier` expects the dataset in a different form, compared to the other two methods.
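
A minimal sketch of the conversion between the two forms, using the encoding given earlier (smoker: yes=1.0, no=0.0; diet: good=0.0, poor=1.0). The helper name `to_numerical` is hypothetical and not part of the exercise template.

```python
def to_numerical(patient):
    # Map a patient tuple such as ('yes', 31, 'good') to three floats
    smoker, age, diet = patient
    return (float(smoker == 'yes'), float(age), float(diet == 'poor'))

print(to_numerical(('yes', 31, 'good')))
```

The decision tree and the nearest neighbor classifier expect the original form, whereas `NearestMeanClassifier` expects the converted one, so both versions of each test point are needed when the three classifiers are compared.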

In [21]:
def gettest_num():
    '''
    Parses the `health-test.txt` file into numerical patient tuples
    
    Returns: 
        list: A list containing numerical patient tuples, loaded from `health-test.txt`
    '''
    def cast(smoker, age, diet):
        return (float(smoker == 'yes'), float(age), float(diet == 'poor'))

    # Use a context manager so that the file is closed after reading
    with open('health-test.txt', 'r') as f:
        return [cast(*line.rstrip().split(',')) for line in f]

def predict_test():
    '''
    Classifies the test set using all the methods that were developed in this sheet.
    
    Returns:
        list: a list of tuples containing all the datapoints that were classified the same by all methods, 
            as well as the predicted labels
            
    Example:
    >>> predict_test()
    [(('yes', 22, 'poor'), 'less'),
     (('yes', 21, 'poor'), 'less'),
     (('no', 31, 'good'), 'more')]
     
    This example only shows what the output should look like. The values in the tuples are completely made up
    '''
    # >>>>> YOUR CODE HERE
    testset = gettest()
    testset_num = gettest_num()
    trainset = gettrain()

    model = NearestMeanClassifier()
    model.train(gettrain_num())

    agreed_samples = list()
    # Both test sets are read from the same file, so they are in the same order
    for x, x_num in zip(testset, testset_num):
        # The tree and nearest neighbor use the original tuple, nearest mean the numerical one
        label = decision(x)
        if label == neighbor(x, trainset) and label == model.predict(x_num):
            agreed_samples.append((x, label))
    print(agreed_samples)
    # <<<<< END YOUR CODE
    return agreed_samples


In [22]:
testset_num = gettest_num()
assert isinstance(testset_num, list)
assert isinstance(testset_num[0], tuple)

In [23]:
same_predictions = predict_test()
assert isinstance(same_predictions[0], tuple)
assert isinstance(same_predictions[0][0], tuple)
assert isinstance(same_predictions[0][1], str)
