## Before submitting
1. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel\Restart) and then **run all cells** (in the menubar, select Cell\Run All).

2. Make sure that no assertions fail or exceptions occur, otherwise points will be subtracted.\n",

3. After you submit the notebook more tests will be run on your code. The fact that no assertions fail on your computer localy does not guarantee that completed the exercise correctly.

4. Please submit only the `*.ipynb` file.

5. Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\". Edit only between `YOUR CODE HERE` and `END YOUR CODE`.

6. Make sure to use Python 3.6 at least.

In [1]:
import sys

if sys.version_info < (3, 6):
    print("You are not using a modern enough version of Python. ")

In [2]:
# This cell is for grading. DO NOT remove it

# Use unittest asserts
import unittest

t = unittest.TestCase()
from pprint import pprint

# Helper assert function
def assert_percentage(val):
    t.assertGreaterEqual(val, 0.0, f"Percentage ({val}) cannot be < 0")
    t.assertLessEqual(val, 1.0, f"Percentage ({val}) cannot be > 1")



# Warm Ups

Before starting the homework sheet we recommend you finish these warm-up tasks. They won't bring any points but should help you to get familiar with Python code.

### Function and types (0 P)

Write a function using list comprehension that returns the types of list elements.

* The function should be called `types_of`
* The function expects a list as an input argument.
* The function should return a list with the types of the given list elements.
* Read the testing cell to understand how `types_of` is supposed to work.

In [3]:
# YOUR CODE HERE

def types_of(a:list):
    list1=[type(x) for x in a]
    return list1
# YOUR CODE HERE


In [4]:
# Test type_of_two function
types = types_of([7, 0.7, "hello", True, (2, "s")])

assert isinstance(types, list)
t.assertEqual(types[0], int)
t.assertEqual(types[1], float)
t.assertEqual(types[2], str)
t.assertEqual(types[3], bool)
t.assertEqual(types[-1], tuple)

### Concatenation and enumerate (0 P)


Concatenate the strings from the array 'animals' into one string.

* Use: `counting +=` and string formatting.
* Use `enumerate` to get the `i`th index.
* The result should look as follows: `'0: mouse | 1: rabbit | 2: cat | 3: dog | '`

In [5]:
animals = ["mouse", "rabbit", "cat", "dog"]

In [6]:
counting = "|"
for i, animal in enumerate(animals):
    # YOUR CODE HERE
    counting+=f'{i}: {animal}'
    counting+=" |"
    # YOUR CODE HERE
    

print(counting)

|0: mouse |1: rabbit |2: cat |3: dog |


In [7]:
# Test of the enumeration loop
t.assertEqual(counting, "|0: mouse |1: rabbit |2: cat |3: dog |")

### String formating (0 P)

What does the following string formating result in?
* Write the result of the string formating into the variables result1, result2, result3.
* Example: `string0 = "This is a {} string.".format("test")`
* Example solution: `result0 = "This is a test string"`

In [8]:
# first string
string1 = "The sky is {}. {} words in front of {} random words create {} random sentence.".format(
    "clear", "Random", "other", 1
)

# second string
a = "irony"
b = "anyone"
c = "room"

string2 = f"The {a} of the situation wasn't lost on {b} in the {c}."

# third string
string3 = f"{7*10} * {9/3} with three digits after the floating point looks like this: {70*3 :.3f}."

# fourth string
string4 = "   Hello World.   ".strip()

In [9]:
print(string4)

Hello World.


In [10]:
# YOUR CODE HERE
result1 = "The sky is clear. Random words in front of other random words create 1 random sentence."
result2 = "The irony of the situation wasn't lost on anyone in the room."
result3 = "70 * 3.0 with three digits after the floating point looks like this: 210.000."
result4 = "Hello World."
string5 = "hello"
result5 = "hello"
# YOUR CODE HERE


In [11]:
# Test the string results
t.assertEqual(string1, result1)
t.assertEqual(string2, result2)
t.assertEqual(string3, result3)
t.assertEqual(string4, result4)
t.assertEqual(string5, result5)

# Exercise Sheet 1: Python Basics

This first  exercise sheet tests the basic functionalities of the Python programming language in the context of a simple prediction task. We consider the problem of predicting health risk of subjects from personal data and habits. We first use for this task a decision tree.

![](tree.png)

Make sure that you have downloaded the `tree.png` file from ISIS. For this exercise sheet, you are required to use only pure Python, and to not import any module, including `Numpy`. Next week are going to implement the nearest neighbor part of this exercise sheet using `Numpy` 😉.

## Classifying a single instance (15 P)

* In this sheet we will represent patient info as a tuple.
* Implement the function `decision` that takes as input a tuple containing values for attributes (smoker,age,diet), and computes the output of the decision tree. Should return either `'less'` or `'more'`. No other outputs are valid.

In [12]:
def decision(x: tuple) -> str:
    """
    This function implements the decision tree represented in the above image. As input the function 
    receives a tuple with three values that represent some information about a patient.
    Args:
        x (tuple): Input tuple containing exactly three values. The first element represents 
        a patient is a smoker this value will be 'yes'. All other values represent that 
        the patient is not a smoker. The second element represents the age of a patient
        in years as an integer. The last element represents the diet of a patient.
        If a patient has a good diet this string will be 'good'. All other
            values represent that the patient has a poor diet.
    Returns:
        string: A string that has either the value 'more' or 'less'. 
        No other return value is valid.
                        
    """
    # YOUR CODE HERE
    if x[0] == "yes":
        if x[1] < 29.5:
            return "less"
        else:
            return "more"
    else:
        if x[2] == "good":
            return "less"
        else:
            return "more"
    # YOUR CODE HERE
    

In [13]:
# Test decision function

# Test expected 'more'
x = ("yes", 31, "good")
output = decision(x)
print(f"decision({x}) --> {output}")
t.assertIsInstance(output, str)
t.assertEqual(output, "more")

# Test expected 'less'
x = ("yes", 29, "poor")
output = decision(x)
print(f"decision({x}) --> {output}")
t.assertIsInstance(output, str)
t.assertEqual(output, "less")


decision(('yes', 31, 'good')) --> more
decision(('yes', 29, 'poor')) --> less


In [14]:
# This cell is for grading. DO NOT remove it

## Reading a dataset from a text file (10 P)

The file `health-test.txt` contains several fictious records of personal data and habits. We split this task into two parts. In the first part, we assume that we have read a line from the file and can now process it. In the second function we load the file and process each line.

* Read the file automatically using the methods introduced during the lecture.
* Represent the dataset as a list of tuples. Make sure that the tuples have the same format as in the previous task, e.g. `('yes', 31, 'good')`.
* Make sure that you close the file after you have opened it and read its content. If you use a `with` statement then you don't have to worry about closing the file.

**Notes**: 
* Values read from files are always strings.
* Each line contains a newline `\n` character at the end
* If you are using Windows as your operating system, refrain from opening any text files using Notepad. It will remove any linebreaks `\n`. You should inspect the files using the Jupyter text editor or any other modern text editor.

In [15]:
def parse_line_test(line: str) -> tuple:
    """
    Takes a line from the file, including a newline, and parses it into a patient tuple
    
    Args:
        line (str): A line from the `health-test.txt` file
    Returns:
        tuple: A tuple representing a patient 
    """
    # YOUR CODE HERE
    line = line.rstrip()
    patient = (line.split(","))
    patient = (patient[0],int(patient[1]),patient[2])
    return patient
    
    # YOUR CODE HERE
    

In [16]:
x = "yes,23,good\n"
parsed_line = parse_line_test(x)
print(parsed_line)
t.assertIsInstance(parsed_line, tuple)
t.assertEqual(len(parsed_line), 3)
t.assertIsInstance(parsed_line[1], int)
t.assertNotIn("\n", parsed_line[-1], "Are you handling line breaks correctly?")
t.assertEqual(parsed_line[-1], "good")


('yes', 23, 'good')


In [17]:
# This cell is for grading. DO NOT remove it

In [18]:
def gettest() -> list:
    """
    Opens the `health-test.txt` file and parses it 
    into a list of patient tuples. You are encouraged to use 
    the `parse_line_test` function but it is not necessary to do so.
    
    Returns:
        list: A list of patient tuples
    """
    # YOUR CODE HERE
    data = []
    with open('./health-test.txt', 'r') as f:
        for line in f:
            parsed_line = parse_line_test(line)

            data.append(tuple(parsed_line))
    return data
    # YOUR CODE HERE
    

In [19]:
testset = gettest()
pprint(testset)
t.assertIsInstance(testset, list)
t.assertEqual(len(testset), 8)
t.assertIsInstance(testset[0], tuple)


[('yes', 21, 'poor'),
 ('no', 50, 'good'),
 ('no', 23, 'good'),
 ('yes', 45, 'poor'),
 ('yes', 51, 'good'),
 ('no', 60, 'good'),
 ('no', 15, 'poor'),
 ('no', 18, 'good')]


In [20]:
# This cell is for grading. DO NOT remove it

## Applying the decision tree to the dataset (15 P)

* Apply the decision tree to all points in the dataset, and return the ratio of them that are classified as "more".
* A ratio is a value in [0-1]. So if out of 50 data points 15 return `"more"` the value that should be returned is `0.3`

In [21]:
def evaluate_testset(dataset: list) -> float:
    """
    Calculates the percentage of datapoints for which the
    decision function evaluates to `'more'` for a given dataset
    
    Args:
        dataset (list): A list of patient tuples
    
    Returns:
        float: The percentage of data points which are evaluated to `'more'`
    """
    # YOUR CODE HERE
    j=0
    for i in dataset:
        if decision(i) == "more":
            j+=1
        ratio = j/len(dataset)
    return ratio
    # YOUR CODE HERE
    

In [22]:
ratio = evaluate_testset(gettest())
print(f"ratio --> {ratio}")
t.assertIsInstance(ratio, float)
assert_percentage(ratio)


ratio --> 0.375


## Learning from examples (10 P)

Suppose that instead of relying on a fixed decision tree, we would like to use a data-driven approach where data points are classified based on a set of training observations manually labeled by experts. Such labeled dataset is available in the file `health-train.txt`. The first three columns have the same meaning than for `health-test.txt`, and the last column corresponds to the labels.

* Read the `health-train.txt` file and convert it into a list of pairs. The first element of each pair is a triplet of attributes, and the second element is the label.
* Similarlly to the previous exercise we split the task into two parts. The first involves processing each line individually. The second handles opening the file and processing all lines of the file

**Note**: A triplet is a tuple that contains exactly three values, a pair is a tuple that contains exactly two values

In [23]:
def parse_line_train(line: str) -> tuple:
    """
    This function works similarly to the `parse_line_test` function.
    It parses a line of the `health-train.txt` file into a tuple that 
    contains a patient tuple and a label.
    
    Args:
        line (str): A line from the `health-train.txt`
    
    Returns: 
        tuple: A tuple that contains a patient tuple and a label as a string
    """
    # YOUR CODE HERE
    line = line.rstrip()
    patient = (line.split(","))
    pair1 = (patient[0], int(patient[1]), patient[2])
    pair = (pair1,patient[3])
    return pair
    # YOUR CODE HERE
    

In [24]:
x = "yes,67,poor,more\n"
parsed_line = parse_line_train(x)
print(parsed_line)

t.assertIsInstance(parsed_line, tuple)
t.assertEqual(len(parsed_line), 2)

data, label = parsed_line

t.assertIsInstance(data, tuple)
t.assertEqual(len(data), 3)
t.assertEqual(data[1], 67)

t.assertIsInstance(label, str)
t.assertNotIn("\n", label, "Are you handling line breaks correctly?")
t.assertEqual(label, "more")


(('yes', 67, 'poor'), 'more')


In [25]:
# This cell is for grading. DO NOT remove it

In [26]:
def gettrain() -> list:
    """
    Opens the `health-train.txt` file and parses it into 
    a list of patient tuples accompanied by their respective label. 
    
    Returns:
        list: A list of tuples comprised of a patient tuple and a label
    """
    # YOUR CODE HERE
    data_train = []
    with open('./health-train.txt', 'r') as f:
        for line in f:
            parsed_line = parse_line_train(line)
            data_train.append(tuple(parsed_line))
    return data_train
    # YOUR CODE HERE
    

In [27]:
trainset = gettrain()
pprint(trainset)
t.assertIsInstance(trainset, list)
t.assertEqual(len(trainset), 16)
first_datapoint = trainset[0]
t.assertIsInstance(first_datapoint, tuple)
t.assertIsInstance(first_datapoint[0], tuple)
t.assertIsInstance(first_datapoint[1], str)

[(('yes', 54, 'good'), 'less'),
 (('no', 55, 'good'), 'less'),
 (('no', 26, 'good'), 'less'),
 (('yes', 40, 'good'), 'more'),
 (('yes', 25, 'poor'), 'less'),
 (('no', 13, 'poor'), 'more'),
 (('no', 15, 'good'), 'less'),
 (('no', 50, 'poor'), 'more'),
 (('yes', 33, 'good'), 'more'),
 (('no', 35, 'good'), 'less'),
 (('no', 41, 'good'), 'less'),
 (('yes', 30, 'poor'), 'more'),
 (('no', 39, 'poor'), 'more'),
 (('no', 20, 'good'), 'less'),
 (('yes', 18, 'poor'), 'less'),
 (('yes', 55, 'good'), 'more')]


In [28]:
# This cell is for grading. DO NOT remove it

## Nearest neighbor classifier (25 P)

We consider the nearest neighbor algorithm that classifies test points following the label of the nearest neighbor in the training data. You can read more about Nearest neighbor classifiers [here](http://www.robots.ox.ac.uk/~dclaus/digits/neighbour.htm). For this, we need to define a distance function between data points. We define it to be

`distance(a, b) = (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])`

^this is not valid python code

where `a` and `b` are two tuples corrsponding to the attributes of two data points.

* Implement the distance function.
* Implement the function that retrieves for a test point the nearest neighbor in the training set, and classifies the test point accordingly (i.e. returns the label of the nearest data point).

**Hint**: You can use the special `infinity` floating point value with `float('inf')`

In [29]:
def distance(a: tuple, b: tuple) -> float:
    """
    Calculates the distance between two data points (patient tuples)
    Args:
        a, b (tuple): Two patient tuples for which we want to calculate the distance
    Returns:
        float: The distance between a, b according to the above formula
    """
    # YOUR CODE HERE
    Distance = (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])
    return Distance
    # YOUR CODE HERE
    

In [30]:
# Test distance
x1 = ("yes", 34, "poor")
x2 = ("yes", 51, "good")
dist = distance(x1, x2)
print(f"distance({x1}, {x2}) --> {dist}")
expected_dist = 1.1156
t.assertAlmostEqual(dist, expected_dist)


distance(('yes', 34, 'poor'), ('yes', 51, 'good')) --> 1.1156


In [31]:
# This cell is for grading. DO NOT remove it

In [32]:
def neighbor(x: tuple, trainset: list) -> str:
    """
    Returns the label of the nearest data point in trainset to x.
    If x is `('no', 30, 'good')` and the nearest data point in trainset
    is `('no', 31, 'good')` with label `'less'` then `'less'` will be returned 
    
    Args: 
        x (tuple): The data point for which we want to find the nearest neighbor
        trainset (list): A list of tuples with patient tuples and a label
        
    Returns: 
        str: The label of the nearest data point in the trainset. Can only be 'more' or 'less'
    """
    # YOUR CODE HERE
    a = []
    for b in trainset:
        a.append(distance(x,b[0]))
        c = min(a)
    print(a)
    print(c)
    for idx, n in enumerate(trainset):
        if c == distance(x, n[0]):
            pre = n[1]
            print(n)
    return pre
    # YOUR CODE HERE
    

In [33]:
# Test neighbor
x = ("yes", 31, "good")
prediction = neighbor(x, gettrain())
print(f"prediction --> {prediction}")
expected = "more"
t.assertEqual(prediction, expected)


[0.2116, 1.2304, 1.01, 0.0324, 1.0144, 2.1296, 1.1024, 2.1444, 0.0016, 1.0064, 1.04, 1.0004, 2.0256, 1.0484, 1.0676, 0.2304]
0.0016
(('yes', 33, 'good'), 'more')
prediction --> more


In [34]:
# This cell is for grading. DO NOT remove it

* Apply both the decision tree and nearest neighbor classifiers on the test set, and return the list of data point(s) for which the two classifiers disagree, and with which probability it happens.

In [35]:
def compare_classifiers(trainset: list, testset: list) -> float:
    """
    This function compares the two classification methods by finding all the datapoints for which 
    the methods disagree.
    
    Args:
        trainset (list): The training set used in the nearest neighbour classfier.
        testset (list): Contains the elements which will be used to compare the 
            decision tree and nearest neighbor classification methods.
    
    Returns:
        list: A list containing all the data points which yield different results for the two
            classification methods.
        float: The percentage of data points for which the two methods disagree.
    
    """
    # YOUR CODE HERE
    decisionset = testset
    neighborset = trainset
    count = 0
    disagree = []
    for patient in decisionset:
        if decision(patient) != neighbor(patient,neighborset):
            #print(patient)
            disagree.append(patient)
            #print(disagree)
            count += 1
            #print(count)
    percentage = count / len(testset)
    # YOUR CODE HERE
    
    return disagree, percentage

In [36]:
# Test compare_classifiers
disagree, ratio = compare_classifiers(gettrain(), gettest())
t.assertIsInstance(disagree, list)
t.assertIsInstance(disagree[0], tuple)
assert_percentage(ratio)

[1.4356, 2.4624, 2.01, 1.1444, 0.0064, 1.0256, 2.0144, 1.3364, 1.0576, 2.0784000000000002, 2.16, 0.0324, 1.1296, 2.0004, 0.0036, 1.4624000000000001]
0.0036
(('yes', 18, 'poor'), 'less')
[1.0064, 0.010000000000000002, 0.2304, 1.04, 2.25, 1.5476, 0.48999999999999994, 1.0, 1.1156, 0.09, 0.0324, 2.16, 1.0484, 0.36, 2.4096, 1.01]
0.010000000000000002
(('no', 55, 'good'), 'less')
[1.3844, 0.4096, 0.0036, 1.1156, 2.0016, 1.04, 0.0256, 1.2916, 1.04, 0.0576, 0.1296, 2.0196, 1.1024, 0.0036, 2.01, 1.4096]
0.0036
(('no', 26, 'good'), 'less')
(('no', 20, 'good'), 'less')
[1.0324, 2.04, 2.1444, 1.01, 0.16000000000000003, 1.4096, 2.36, 1.01, 1.0576, 2.04, 2.0064, 0.09, 1.0144, 2.25, 0.2916, 1.04]
0.09
(('yes', 30, 'poor'), 'more')
[0.0036, 1.0064, 1.25, 0.0484, 1.2704, 2.5776, 1.5184, 2.0004, 0.1296, 1.1024, 1.04, 1.1764, 2.0576, 1.3844, 1.4356, 0.0064]
0.0036
(('yes', 54, 'good'), 'less')
[1.0144, 0.010000000000000002, 0.4624000000000001, 1.1600000000000001, 2.49, 1.8836, 0.81, 1.04, 1.2916, 0.25, 0

One problem of simple nearest neighbors is that one needs to compare the point to predict to all data points in the training set. This can be slow for datasets of thousands of points or more. Alternatively, some classifiers train a model first, and then use it to classify the data.

## Nearest mean classifier (25 P)

We consider one such trainable model, which operates in two steps:

1. Compute the average point for each class
2. Classify new points to be of the class whose average point is nearest to the point to predict.

For this classifier, we convert the attributes smoker and diet to real values (for smoker: yes=1.0 and no=0.0, and for diet: good=0.0 and poor=1.0), and use the modified distance function:

`distance(a,b) = (a[0] - b[0]) ** 2 + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] - b[2]) ** 2`

Age will also from now on be represented as a `float`. The new data points will be referred to as numerical patient tuples. 

We adopt an object-oriented approach for building this classifier.

* Implement the `gettrain_num` function that will load the training dataset from the `health-train.txt` file and parse each line to a numerical patient tuple with its label. You can still follow the same structure that we used before (i.e. using a `parse_line_...` function), however, it is not required for this exercise. Only the `gettrain_num` function will be tested.


* Implement the new distance function.


* Implement the methods `train` and `predict` of the class `NearestMeanClassifier`.

In [37]:
def parse_line_train_num(line: str) -> tuple:
    """
    Takes a line from the file `health-train.txt`, including a newline, 
    and parses it into a numerical patient tuple
    
    Args:
        line (str): A line from the `health-test.txt` file
    Returns:
        tuple: A numerical patient
    """
    # YOUR CODE HERE
    line = line.rstrip()
    patient = (line.split(","))
    #print(patient)
    if patient[0] == "yes":
        patient[0] = 1.0
    elif patient[0] == "no":
        patient[0] = 0.0
    if patient[2] == "good":
        patient[2] = 0.0
    elif patient[2] == "poor":
        patient[2] = 1.0
    pair1 = (patient[0], float(patient[1]), patient[2])
    pair = (pair1, patient[3])
    return pair
    # YOUR CODE HERE
    


def gettrain_num() -> list:
    """
    Parses the `health-train.txt` file into numerical patient tuples
    
    Returns: 
        list: A list of tuples containing numerical patient tuples and their labels
    """
    # YOUR CODE HERE
    data_train = []
    with open('./health-train.txt', 'r') as f:
        for line in f:
            parsed_line = parse_line_train_num(line)
            data_train.append(tuple(parsed_line))
    return data_train
    # YOUR CODE HERE
    

In [38]:
# Test gettrain_num
trainset_num = gettrain_num()
t.assertIsInstance(trainset_num, list)
first_datapoint = trainset_num[0]
print(f"first_datapoint --> {first_datapoint}")
t.assertIsInstance(first_datapoint[0], tuple)
t.assertIsInstance(first_datapoint[0][0], float)
t.assertIsInstance(first_datapoint[0][1], float)
t.assertIsInstance(first_datapoint[0][2], float)

first_datapoint --> ((1.0, 54.0, 0.0), 'less')


In [39]:
# This cell is for grading. DO NOT remove it

In [40]:
def distance_num(a: tuple, b: tuple) -> float:
    """
    Calculates the distance between two data points (numerical patient tuples)
    Args:
        a, b (tuple): Two numerical patient tuples for which 
            we want to calculate the distance
    Returns:
        float: The distance between a, b according to the above formula
    """
    # YOUR CODE HERE
    Distance = (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])
    return Distance
    # YOUR CODE HERE
    

In [41]:
x1 = (1.0, 23.0, 0.0)
x2 = (0.0, 41.0, 1.0)
dist = distance_num(x1, x2)
print(f"dist --> {dist}")
t.assertIsInstance(dist, float)
expected_dist = 2.1296
t.assertAlmostEqual(dist, expected_dist)

dist --> 2.1296


In [42]:
# This cell is for grading. DO NOT remove it

In [43]:
class NearestMeanClassifier:
    """
    Represents a NearestMeanClassifier.
    
    When an instance is trained a dataset is provided and the mean for each class is calculated.
    During prediction the instance compares the datapoint to each class mean (not all datapoints) 
    and returns the label of the class mean to which the datapoint is closest to.
    
    Instance Attributes:
        more (tuple): A tuple representing the mean of every 'more' data-point in the dataset
        less (tuple): A tuple representing the mean of every 'less' data-point in the dataset
    """

    def __init__(self):
        self.more = None
        self.less = None

    def train(self, dataset: list):
        """
        Calculates the class means for a given dataset and stores 
        them in instance attributes more, less. 
        Args:
            dataset (list): A list of tuples each of them containing a numerical patient tuple and its label
        Returns:
            self
        """
        # YOUR CODE HERE
        a = 0
        b = 0
        c = 0
        d = 0
        e = 0
        f = 0
        m = 0
        n = 0
        for i in dataset:
            if i[1] == "more":
                a += i[0][0]
                b += i[0][1]
                c += i[0][2]
                m +=1
            elif i[1] == "less":
                d += i[0][0]
                e += i[0][1]
                f += i[0][2]
                n += 1
        tupm = [a/m,b/m,c/m]
        tupn = [d/n,e/n,f/n]
        self.more = tuple(tupm)
        self.less = tuple(tupn)
        # YOUR CODE HERE
        
        return self

    def predict(self, x: tuple) -> str:
        """
        Returns a prediction/label for numeric patient tuple x. 
        The classifier compares the given data point to the mean 
        class tuples of each class and returns the label of the
        class to which x is the closest to (according to our 
        distance function).
        
        Args: 
            x (tuple): A numerical patient tuple for which we want a prediction
            
        Returns:
            str: The predicted label
        """
        # YOUR CODE HERE
        Distancem = distance_num(x,self.more)
        Distancen = distance_num(x,self.less)
        #Distancem = (x[0] != self.more[0]) + ((x[1] - self.more[1]) / 50.0) ** 2 + (x[2] != self.more[2])
        #Distancen = (x[0] != self.less[0]) + ((x[1] - self.less[1]) / 50.0) ** 2 + (x[2] != self.less[2])
        '''
        if Distancem <= Distancen:
            label = "more"
        else:
            label = "less"
        return label
        '''
        a = min(Distancem,Distancen)
        if a == Distancem:
            label = "more"
        else:
            label = "less"
        return label
        # YOUR CODE HERE
        

    def __str__(self):
        return repr(self)

    def __repr__(self):
        more = tuple(round(m, 3) for m in self.more) if self.more else self.more
        less = tuple(round(l, 3) for l in self.less) if self.less else self.less
        return f"NearestMeanClassfier(more: {more}, less: {less})"

* Instantiate the `NearestMeanClassifier`, train it on the training data, and return it

In [44]:
def build_and_train(trainset_num: list) -> NearestMeanClassifier:
    """
    Instantiates the `NearestMeanClassifier`, trains it on the
    `trainset_num` dataset and returns it.
    
    Args: 
        trainset_num (list): A list of numerical patient tuples with their respective labels
    
    Returns:
        NearestMeanClassifier: A NearestMeanClassifier trained on `trainset_num`
    """
    # YOUR CODE HERE
    c = NearestMeanClassifier()
    c.train(trainset_num)
    return c
    # YOUR CODE HERE
    

In [45]:
# Test build_and_train
classifier = build_and_train(gettrain_num())
print(classifier)
t.assertIsInstance(classifier, NearestMeanClassifier)

t.assertIsNotNone(
    classifier.more,
    "Did you train the classifier? \
Did you store the mean vector for the 'more' class?",
)
t.assertIsNotNone(
    classifier.less,
    "Did you train the classifier? \
Did you store the mean vector for the 'less' class?",
)

t.assertIsInstance(classifier.more, tuple)
t.assertIsInstance(classifier.less, tuple)

t.assertEqual(round(classifier.more[1]), 37)
t.assertEqual(round(classifier.less[1]), 32)


NearestMeanClassfier(more: (0.571, 37.143, 0.571), less: (0.333, 32.111, 0.222))


In [46]:
# This cell is for grading. Do NOT remove it

* Load the test dataset into memory as a list of numerical patient tuples
* Predict the test data using the nearest mean classifier and return all test examples for which all three classifiers (decision tree, nearest neighbor and nearest mean) agree.

**Note**: Be careful that the `NearestMeanClassifier` expects the dataset in a different form, compared to the other two methods.

In [47]:
def gettest_num() -> list:
    """
    Parses the `health-test.txt` file into numerical patient tuples
    
    Returns: 
        list: A list containing numerical patient tuples, loaded from `health-test.txt`
    """
    # YOUR CODE HERE
    def parse_line_test_num(line: str) -> tuple:
        line = line.rstrip()
        patient = (line.split(","))
        if patient[0] == "yes":
            patient[0] = 1.0
        elif patient[0] == "no":
            patient[0] = 0.0
        if patient[2] == "good":
            patient[2] = 0.0
        elif patient[2] == "poor":
            patient[2] = 1.0
        pair1 = (patient[0], float(patient[1]), patient[2])
        return pair1
    data_train = []
    with open('./health-test.txt', 'r') as f:
        for line in f:
            parsed_line = parse_line_test_num(line)
            data_train.append(tuple(parsed_line))
    return data_train
    # YOUR CODE HERE
    

In [48]:
testset_num = gettest_num()
pprint(testset_num)
t.assertIsInstance(testset_num, list)
t.assertEqual(len(testset_num), 8)
t.assertIsInstance(testset_num[0], tuple)
t.assertEqual(len(testset_num[0]), 3)

[(1.0, 21.0, 1.0),
 (0.0, 50.0, 0.0),
 (0.0, 23.0, 0.0),
 (1.0, 45.0, 1.0),
 (1.0, 51.0, 0.0),
 (0.0, 60.0, 0.0),
 (0.0, 15.0, 1.0),
 (0.0, 18.0, 0.0)]


In [49]:
def predict_test() -> list:
    """
    Classifies the test set using all the methods that were developed in this exercise sheet,
    namely `decision`, `neighbor` and `NearestMeanClassifier`
    
    Returns:
        list: a list of patient tuples containing all the datapoints that were classfied 
            the same by all methods, as well as the predicted labels
            
    Example:
    >>> predict_test()
    [(('yes', 22, 'poor'), 'less'),
     (('yes', 21, 'poor'), 'less'),
     (('no', 31, 'good'), 'more')]
     
    This example only shows how the output should look like. The values in the tuples 
    are completely made up
    """
    # YOUR CODE HERE
    decisionset = gettest()
    neighborset = gettrain()
    classifier = build_and_train(gettrain_num())
    classfy = gettest_num()
    agreed_samples = []
    for i in range(len(decisionset)):
        if decision(decisionset[i]) == neighbor(decisionset[i],neighborset): #"more","less"
            prea = classifier.predict(decisionset[i])
            if decision(decisionset[i]) == prea:
                
                 prediction = (decisionset[i],decision(decisionset[i]))
                 agreed_samples.append(prediction)
    # YOUR CODE HERE
    
    return agreed_samples

In [50]:
same_predictions = predict_test()
pprint(same_predictions)
t.assertIsInstance(same_predictions, list)
t.assertEqual(len(same_predictions), 6)
t.assertIsInstance(same_predictions[0], tuple)
t.assertIsInstance(same_predictions[0][0], tuple)
t.assertIsInstance(same_predictions[0][0][0], str)
t.assertIsInstance(same_predictions[0][1], str)

[1.4356, 2.4624, 2.01, 1.1444, 0.0064, 1.0256, 2.0144, 1.3364, 1.0576, 2.0784000000000002, 2.16, 0.0324, 1.1296, 2.0004, 0.0036, 1.4624000000000001]
0.0036
(('yes', 18, 'poor'), 'less')
[1.0064, 0.010000000000000002, 0.2304, 1.04, 2.25, 1.5476, 0.48999999999999994, 1.0, 1.1156, 0.09, 0.0324, 2.16, 1.0484, 0.36, 2.4096, 1.01]
0.010000000000000002
(('no', 55, 'good'), 'less')
[1.3844, 0.4096, 0.0036, 1.1156, 2.0016, 1.04, 0.0256, 1.2916, 1.04, 0.0576, 0.1296, 2.0196, 1.1024, 0.0036, 2.01, 1.4096]
0.0036
(('no', 26, 'good'), 'less')
(('no', 20, 'good'), 'less')
[1.0324, 2.04, 2.1444, 1.01, 0.16000000000000003, 1.4096, 2.36, 1.01, 1.0576, 2.04, 2.0064, 0.09, 1.0144, 2.25, 0.2916, 1.04]
0.09
(('yes', 30, 'poor'), 'more')
[0.0036, 1.0064, 1.25, 0.0484, 1.2704, 2.5776, 1.5184, 2.0004, 0.1296, 1.1024, 1.04, 1.1764, 2.0576, 1.3844, 1.4356, 0.0064]
0.0036
(('yes', 54, 'good'), 'less')
[1.0144, 0.010000000000000002, 0.4624000000000001, 1.1600000000000001, 2.49, 1.8836, 0.81, 1.04, 1.2916, 0.25, 0

AssertionError: 4 != 6