# Lab 2
### Machine Learning 1

This assignment is to be done with a partner. Please only submit ONE .ipynb (not .py) file per pair!

**Enter your names and student numbers here:**

- Niek Horstman (9593977)
- Tobias Middelhoff (9676945)
- Joël van de Water (9145532)

**Total points: 8**

In Lab 2 you will get to know an easy but powerful Machine Learning method: the k-Nearest Neighbours algorithm (kNN). We will use the method on a classification problem, a case of supervised learning.

## The data

We will look at data originally from an Australian bank. Based on some information about a customer, the bank has to decide whether this customer's application for a credit card will be approved or not.
(The original data file can be found here: https://archive.ics.uci.edu/ml/datasets/Statlog+(Australian+Credit+Approval), but for convenience we have already provided two .txt files with this assignment.)
The data of each customer is described by 14 variables. Four of these are categorical variables: they assign a label to each customer. These labels are represented by integers. (To protect the privacy of the customers, there is no information about the meaning of the variables and thus not about the meaning of the different labels.) Four other variables are binary (0 or 1). The other six variables are real numbers. Together they constitute the input: the vector *x*.
Each customer also has their output variable *y*, which is 0 or 1. This indicates whether the customer's application (according to human experts) should be approved (*y=1*) or not (*y=0*).

We want to write an algorithm for this task that learns from the expert decisions it has seen and is able to make its own decisions for new customers.

## The algorithm

The Nearest Neighbour algorithm is based on a simple idea: we are going to base the decision for a new customer (approve credit card application or not) on the decision for a customer in the training data (the historical data). We base it on the customer who is most similar to the new customer.
To use this in an algorithm, we first need a measure which can be used to determine the similarity between two customers. We will use the *Euclidean* distance for this purpose.

If we consider a customer to be a point in a 14-dimensional space (one dimension for each variable), we can calculate the Euclidean distance between two customers/points *A* and *B* with:

$$\sqrt{(A_1-B_1)^2+(A_2-B_2)^2+...+(A_{14}-B_{14})^2}$$

with $A_1$ the first input variable of $A$, $A_2$ the second, etc.
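
As a small worked illustration of this formula, the sketch below computes the distance between two points in three dimensions instead of 14. The function name `euclidean` and the example points are made up for illustration; they are not part of the assignment code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of dimensions")
    # Sum the squared differences per dimension, then take the square root
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

# Hypothetical points: the differences are 3, 4 and 0, so the distance is sqrt(9 + 16 + 0) = 5
A = [1.0, 2.0, 0.0]
B = [4.0, 6.0, 0.0]
print(euclidean(A, B))  # 5.0
```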

The algorithm has the following simple steps, with step 3 being an evaluation of how well the algorithm is doing:

1. Read the training data from the file;
2. For every customer $c_{test}$ in the test file: determine which customer $c_{train}$ in the training data has the smallest Euclidean distance to the test-customer $c_{test}$, and assign the decision corresponding to $c_{train}$ to $c_{test}$;
3. Count the number of test-customers for which the decision made by the algorithm is the same as the decision made by human experts;
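
Steps 2 and 3 can be sketched on a toy example as follows. This is only an illustration under simplifying assumptions: customers are plain `(x, y)` tuples with a two-dimensional input instead of `Person` objects, the names `nearest_label` and `count_correct` are hypothetical, and `math.dist` requires Python 3.8 or newer.

```python
import math

def nearest_label(train, x):
    """Return the label y of the training point whose input is closest to x."""
    closest = min(train, key=lambda pair: math.dist(pair[0], x))
    return closest[1]

def count_correct(train, test):
    """Count the test points whose predicted label equals the expert label."""
    return sum(1 for x, y in test if nearest_label(train, x) == y)

# Made-up data: each entry is (input vector, expert decision)
train = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((5.0, 5.0), 1)]
test = [((0.2, 0.1), 0), ((4.0, 4.5), 1), ((0.9, 0.8), 0)]

print(count_correct(train, test), "of", len(test), "correct")  # 2 of 3 correct
```

The third test point lies closest to a training point with the opposite label, so it is counted as an error.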

Step 1 is already implemented below in the two classes **Data** and **Person**.
Person is a class containing all customer information: the input vector *x* and the output *y*. The inputs are divided in three different arrays: *xCategorical*, *xBinary* and *xReal*; the output is *y*.

In [None]:
class Person:
  def __init__(self, line_text):
    # integers
    self.xCategorical = [None]*4
    # integers
    self.xBinary = [None]*4
    # floats
    self.xReal = [None]*6

    # label (0 or 1)
    self.y = None
    # Initialize the features
    self.fromString(line_text)

  def fromString(self, line_text):
    self.xBinary[0] = int(line_text[0])
    self.xReal[0] = float(line_text[1])
    self.xReal[1] = float(line_text[2])
    self.xCategorical[0] = int(line_text[3])
    self.xCategorical[1] = int(line_text[4])
    self.xCategorical[2] = int(line_text[5])
    self.xReal[2] = float(line_text[6])
    self.xBinary[1] = int(line_text[7])
    self.xBinary[2] = int(line_text[8])
    self.xReal[3] = float(line_text[9])
    self.xBinary[3] = int(line_text[10])
    self.xCategorical[3] = int(line_text[11])
    self.xReal[4] = float(line_text[12])
    self.xReal[5] = float(line_text[13])

    self.y = int(line_text[14])

  def __str__(self):
    person_info = "x = (" + \
	    str(self.xBinary[0]) + "; " + \
	    str(self.xReal[0]) + "; " + \
	    str(self.xReal[1]) + "; " + \
	    str(self.xCategorical[0]) + "; " + \
	    str(self.xCategorical[1]) + "; " + \
	    str(self.xCategorical[2]) + "; " + \
	    str(self.xReal[2]) + "; " + \
	    str(self.xBinary[1]) + "; " + \
	    str(self.xBinary[2]) + "; " + \
	    str(self.xReal[3]) + "; " + \
	    str(self.xBinary[3]) + "; " + \
	    str(self.xCategorical[3]) + "; " + \
	    str(self.xReal[4]) + "; " + \
	    str(self.xReal[5]) + \
	    "); y = " + str(self.y)
    return person_info

The class Data contains a dataset in the format of *credit_train.txt* or *credit_test.txt* (training data and test data) and uses the Person class to represent the customers.  

In [None]:
class Data():
  def __init__(self, filename):
    self.data = []
    self.loadFile(filename)

  def loadFile(self,filename):
    with open(filename) as file:
      lines = file.readlines()
      for line in lines:
        person = Person(line.split())
        self.data.append(person)

  def printList(self):
    for person in self.data:
      print(person)

### A note on classes

You may not have encountered Python classes yet in Computational Linguistics, but they are similar to C# classes. Here, the classes simply read in the data and organise it into named attributes that you access through an instance of the class.

Suppose you have an instance of the class `Person` called `john`. If you want to know the label for `john`, you call:

`john.y`

If you have loaded the training data as an instance of the class `Data` called `trainData`, you would access the data by calling:

`trainData.data`

The same goes for methods (such as the method `printList` in `Data`). The method `__str__` in `Person` is not meant to be called by the user using that name. Rather, it is the method that gets called automatically by Python when you run:

`print(john)`

(assuming `john` is an instance of `Person`)
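
A minimal stand-alone illustration of these ideas, using a made-up class `Point` that is not part of the assignment:

```python
class Point:
    def __init__(self, x, y):
        # Attributes are stored on the instance and read back as p.x and p.y
        self.x = x
        self.y = y

    def __str__(self):
        # Called automatically by print(p) and str(p)
        return "(" + str(self.x) + "; " + str(self.y) + ")"

p = Point(1, 2)
print(p.y)  # 2
print(p)    # (1; 2)
```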

If you run the code below (the first time you have to run the cells above as well), you can get an idea of what the data looks like.

In [None]:
trainData = Data("credit_train.txt")
trainData.printList()

Complete the implementation of the algorithm.

**1. Start with writing a function computeDistance(person1, person2)** (1 point)

In [None]:
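# person1 and person2 are instances of Person, and the return value should be a float
def computeDistance(person1, person2):
    # Gather all 14 input variables of each person; the order does not matter
    # as long as it is the same for both persons
    x1 = person1.xCategorical + person1.xBinary + person1.xReal
    x2 = person2.xCategorical + person2.xBinary + person2.xReal
    # Sum the squared differences per variable (y is the label, so it is not used)
    squaredSum = 0.0
    for a, b in zip(x1, x2):
        squaredSum += (a - b) ** 2
    # Euclidean distance: square root of the summed squared differences
    return squaredSum ** 0.5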

In [None]:
def testComputeDistance():
    person1 = Person("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0".split())
    person2 = Person("0 22.67 7 2 8 4 0.165 0 0 0 0 2 160 1 0".split())
    distance = computeDistance(person1, person2)
    expected_distance = 1213.5008265757383
    assert abs(distance - expected_distance) / expected_distance < 1e-6

testComputeDistance()

**2. Load the test data as well** (0.5 points)

The function must return an instance of the class Data, containing the test data, which you read from a file.

In [None]:
def loadTestData() -> Data:
    # FILL THIS IN (see a couple of cells above)

testData = loadTestData()
testData.printList()

**3. Write the function giveCredit** (1 point)

Given a `Person`, giveCredit predicts a decision (to approve the credit card application or not), only based on the `Person` in the training data that is closest.

In [None]:
# Library we will use as part of the tests
import os

# This is the heart of the kNN algorithm. You pass in an instance of Person and an instance of Data
# The return value should be a bool (True or False)
def giveCredit(person, trainData):
    # FILL THIS IN

In [None]:
def testGiveCredit():
    # This is a simple test to check that the method above works well
    with open('test_give_credit.txt', 'w') as f:
        f.write("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    trainData = Data('test_give_credit.txt')
    testPerson = Person("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1".split())
    creditDecision = giveCredit(testPerson, trainData)
    expectedDecision = True
    assert creditDecision == expectedDecision
    # Remove the accessory file we created
    os.remove('test_give_credit.txt')

testGiveCredit()

**4. Write the function getErrorPercentage** (1 point)

getErrorPercentage determines the percentage of the test data that is wrongly classified. (Make sure that you use the *y* from the test data only to check whether the classification of the algorithm was correct, never to make the prediction itself.)

In [None]:
def getErrorPercentage(trainData: Data, testData: Data) -> float:
    # FILL THIS IN
    # Hint: you need to iterate over all test datapoints, and for each of them, call the giveCredit function

In [None]:
def testGetErrorPercentage():
    # A simple function to check that the function above works correctly
    with open('test_give_credit_train.txt', 'w') as f:
        f.write("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    trainData = Data('test_give_credit_train.txt')
    with open('test_give_credit_test.txt', 'w') as f:
        f.write("0 29.58 1.75 1 4 4 1.25 0 0 0 1 2 280 1 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    testData = Data('test_give_credit_test.txt')
    errorPercentage = getErrorPercentage(trainData, testData)
    assert errorPercentage == 50, errorPercentage
    # Remove the accessory files we created
    os.remove('test_give_credit_train.txt')
    os.remove('test_give_credit_test.txt')

testGetErrorPercentage()

What percentage is wrongly classified?

In [None]:
print("Wrongly classified:", getErrorPercentage(trainData, testData), "%")

## Normalising the data
In our training data, some of the variables have very large values (up to thousands), and others have very small ones (like the binary variables). When you interpret the data as a set of points, this makes the training data very 'stretched' in some directions and compressed in other directions. As a result, the variables with large values will dominate the decision, whereas the smaller variables will hardly play a part. This is undesirable: a variable being expressed in large numbers does not make it more important than the other variables.

This is why we are going to standardize or normalise the data: we divide all *real* variables by a number, such that the difference between the minimum and the maximum is 1. This way the real variables have the same scale as the binary variables (we leave the categorical variables for now).
We can do this by calculating the minimum $min_i$ and maximum $max_i$ for each variable $i$, and in the formula for the Euclidean distance, for each real valued variable *i*, we replace

$$ ... + (A_i - B_i)^2 + ... $$

with

$$  ... +\left(\frac{A_i - B_i}{max_i - min_i}\right)^2 + ...$$
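
To make the replacement concrete, here is a minimal sketch for a single real-valued variable. The helper names `bounds` and `scaled_sq_diff` and the numbers are made up for illustration, and the sketch assumes $max_i > min_i$ (a variable that is constant in the training data would cause a division by zero).

```python
def bounds(values):
    """Return (min, max) of one variable over the training data."""
    return (min(values), max(values))

def scaled_sq_diff(a_i, b_i, lo, hi):
    """Squared difference after dividing by the range hi - lo (assumes hi > lo)."""
    return ((a_i - b_i) / (hi - lo)) ** 2

# Hypothetical training values of one real-valued variable
column = [100.0, 160.0, 280.0, 100.0]
lo, hi = bounds(column)  # (100.0, 280.0)

# Unscaled, the two extremes differ by 180; scaled, by exactly 1
print((280.0 - 100.0) ** 2)                  # 32400.0
print(scaled_sq_diff(280.0, 100.0, lo, hi))  # 1.0
print(scaled_sq_diff(160.0, 100.0, lo, hi))  # (60/180)^2, roughly 0.111
```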


**5. Write a function that determines the minimum and maximum for each real valued variable** (1 point)

In [None]:
"""
There are many ways to do this, but the way I recommend is the following:
1) iterate over all the training data, and determine the maximum and minimum value for each real-valued variable
2) return those maximum and minimum values in a list, which can then be used by the functions below
"""
def getRealBounds(data: Data) -> list:
    # This takes a dataset as input (you will use trainData),
    # and outputs a list of bounds for the real-valued variables
    # Output: [(min, max), (min, max), ..., (min, max)] -> one 2-tuple for each real-valued variable
    # FILL THIS IN

In [None]:
def testGetRealBounds():
    # Check if the method above works as expected
    # First create some mock data
    with open('mock_data.txt', 'w') as f:
        f.write("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0\n")
        f.write("0 22.67 7 2 8 4 0.165 0 0 0 0 2 160 1 0\n")
        f.write("0 29.58 1.75 1 4 4 1.25 0 0 0 1 2 280 1 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    mockData = Data('mock_data.txt')
    # Note that the indices of the real-valued variables are 1, 2, 6, 9, 12, 13
    # We can read the minima and maxima off from the input text
    expected_bounds = [(15.83, 29.58), (0.585, 11.46), (0.165, 1.585), (0, 2), (100, 280), (1, 1213)]
    # Now use the method above to find the bounds
    real_bounds = getRealBounds(mockData)
    assert expected_bounds == real_bounds
    # Delete the accessory file we created
    os.remove('mock_data.txt')

testGetRealBounds()

**6. Write a new distance function _computeDistance2_** (0.5 points)

computeDistance2 scales the real valued variables before the distance is calculated.

In [None]:
# Hint: Start by copying your computeDistance function from above
# Rename the function as computeDistance2, then make the necessary changes
# Note that this function now NEEDS to know the real-variable bounds, so we need to add one more parameter
def computeDistance2(person1, person2, realBounds):
    # FILL THIS IN

In [None]:
def testComputeDistance2():
    # Check that the function above does as expected
    person1 = Person("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0".split())
    person2 = Person("0 22.67 7 2 8 4 0.165 0 0 0 0 2 160 1 0".split())
    bounds = [(15.83, 29.58), (0.585, 11.46), (0.165, 1.585), (0, 2), (100, 280), (1, 1213)]
    expected_distance = 4.50345939998121
    computed_distance = computeDistance2(person1, person2, bounds)
    assert abs(expected_distance - computed_distance) / expected_distance < 1e-4, (expected_distance, computed_distance)

testComputeDistance2()

**7. What percentage of the test data is wrongly classified using this distance function?** (0.5 points)

In [None]:
# Now we need to copy the function getErrorPercentage that we wrote above, but we need to use the new distance
# This means we also need to copy and modify giveCredit
# If we leave the signature (parameters) of giveCredit as they are, then we will compute the bounds for every test
# datapoint. That is a waste of time. So we will pass the (known) bounds to giveCredit
def giveCredit2(person, trainData, realBounds):
    # FILL THIS IN
    # Hint: this function has to call computeDistance2

def getErrorPercentage2(trainData: Data, testData: Data) -> float:
    # FILL THIS IN
    # Hint: this function has to call getRealBounds once and then giveCredit2 (which uses computeDistance2)
    # If you do not call getRealBounds first, and instead normalize every distance on the fly, your code
    # will be too slow

In [None]:
def testGiveCredit2():
    # This is a simple test to check that the method above works well
    with open('test_give_credit.txt', 'w') as f:
        f.write("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    trainData = Data('test_give_credit.txt')
    testPerson = Person("0 22.67 7 2 8 4 0.165 0 0 0 0 2 160 1 0".split())
    bounds = [(15.83, 29.58), (0.585, 11.46), (0.165, 1.585), (0, 2), (100, 280), (1, 1213)]
    creditDecision = giveCredit2(testPerson, trainData, bounds)
    expectedDecision = False
    assert creditDecision == expectedDecision
    # Remove the accessory file we created
    os.remove('test_give_credit.txt')

def testGetErrorPercentage2():
    # A simple function to check that the function above works correctly
    with open('test_give_credit_train.txt', 'w') as f:
        f.write("""1 39.58 13.915 2 9 4 8.625 1 1 6 1 2 70 1 1
1 23.17 0 2 13 4 0.085 1 0 0 0 2 0 1 1
1 63.33 0.54 2 8 4 0.585 1 1 3 1 2 180 1 0
1 23.75 0.415 1 8 4 0.04 0 1 2 0 2 128 7 0
""")
    trainData = Data('test_give_credit_train.txt')
    with open('test_give_credit_test.txt', 'w') as f:
        f.write("0 29.58 1.75 1 4 4 1.25 0 0 0 1 2 280 1 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    testData = Data('test_give_credit_test.txt')
    errorPercentage = getErrorPercentage2(trainData, testData)
    assert errorPercentage == 50, errorPercentage
    # Remove the accessory files we created
    os.remove('test_give_credit_train.txt')
    os.remove('test_give_credit_test.txt')

testGiveCredit2()
testGetErrorPercentage2()

In [None]:
print("Wrongly classified with real value normalization:", getErrorPercentage2(trainData, testData), "%")

For the categorical variables, the distance function now makes a different mistake. It assumes that points with labels 1 and 2 in a certain categorical variable are very similar but a label 14 in that variable is totally different. However, these numbers are only dummy values that represent certain labels. For two customers, we can only say whether they are in the same category or not. The data does not say anything about the degree of similarity between categories.

**8. Write a third distance function computeDistance3** (0.5 points)

computeDistance3 acts the same on binary and real values as _computeDistance2_, but simply uses 0 or 1 for categorical variables instead of $...+(A_i-B_i)^2+...$. That is, it should use $0$ if $A_i = B_i$, and $1$ otherwise.
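
The 0/1 term can be sketched in isolation as follows. The function name `categorical_term` and the label values are illustrative assumptions, not part of the provided code.

```python
def categorical_term(a_i, b_i):
    """Contribution of one categorical variable to the squared distance."""
    # Labels are only names: equal labels contribute 0, different labels contribute 1
    return 0 if a_i == b_i else 1

# Hypothetical label pairs: the numeric gap between labels no longer matters
print(categorical_term(1, 2))    # 1
print(categorical_term(1, 14))   # 1
print(categorical_term(8, 8))    # 0

# For comparison, the squared difference treats 1 vs 14 as far more different than 1 vs 2
print((1 - 2) ** 2, (1 - 14) ** 2)  # 1 169
```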

In [None]:
# Start by copying your computeDistance2 function from above
# Rename the function as computeDistance3, then make the necessary changes
def computeDistance3(person1, person2, realBounds):
    # FILL THIS IN

In [None]:
def testComputeDistance3():
    # Check that the function above does as expected
    person1 = Person("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0".split())
    person2 = Person("0 22.67 7 2 8 4 0.165 0 0 0 0 2 160 1 0".split())
    bounds = [(15.83, 29.58), (0.585, 11.46), (0.165, 1.585), (0, 2), (100, 280), (1, 1213)]
    expected_distance = 2.29807453475276
    computed_distance = computeDistance3(person1, person2, bounds)
    assert abs(expected_distance - computed_distance) / expected_distance < 1e-4

testComputeDistance3()

**9. Implement giveCredit3 and getErrorPercentage3** (0.5 points)

Calculate the percentage of the test data that is wrongly classified if the algorithm uses _computeDistance3_.

In [None]:
# Now we need to copy the function getErrorPercentage2 that we wrote above, but we need to use the new distance
# This means we also need to copy and modify giveCredit2
def giveCredit3(person, trainData, realBounds):
    # FILL THIS IN

def getErrorPercentage3(trainData: Data, testData: Data) -> float:
    # FILL THIS IN

In [None]:
def testGiveCredit3():
    # This is a simple test to check that the method above works well
    with open('test_give_credit.txt', 'w') as f:
        f.write("1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    trainData = Data('test_give_credit.txt')
    testPerson = Person("0 22.67 7 2 8 4 0.165 0 0 0 0 2 160 1 0".split())
    bounds = [(15.83, 29.58), (0.585, 11.46), (0.165, 1.585), (0, 2), (100, 280), (1, 1213)]
    creditDecision = giveCredit3(testPerson, trainData, bounds)
    expectedDecision = False
    assert creditDecision == expectedDecision
    # Remove the accessory file we created
    os.remove('test_give_credit.txt')

def testGetErrorPercentage3():
    # A simple function to check that the function above works correctly
    with open('test_give_credit_train.txt', 'w') as f:
        f.write("""1 39.58 13.915 2 9 4 8.625 1 1 6 1 2 70 1 1
1 23.17 0 2 13 4 0.085 1 0 0 0 2 0 1 1
1 63.33 0.54 2 8 4 0.585 1 1 3 1 2 180 1 0
1 23.75 0.415 1 8 4 0.04 0 1 2 0 2 128 7 0
""")
    trainData = Data('test_give_credit_train.txt')
    with open('test_give_credit_test.txt', 'w') as f:
        f.write("0 29.58 1.75 1 4 4 1.25 0 0 0 1 2 280 1 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    testData = Data('test_give_credit_test.txt')
    errorPercentage = getErrorPercentage3(trainData, testData)
    assert errorPercentage == 0, errorPercentage
    # Remove the accessory files we created
    os.remove('test_give_credit_train.txt')
    os.remove('test_give_credit_test.txt')

testGiveCredit3()
testGetErrorPercentage3()

In [None]:
print("Wrongly classified with real normalization and categorical binarization:",
      getErrorPercentage3(trainData, testData), "%")

## From NN to kNN
The algorithm as described above is called the 1-Nearest Neighbour algorithm. This is a special case of the *k*-Nearest Neighbours algorithm. In the *kNN* algorithm, we do not look at only 1 'neighbour', but at the *k* points in the training data that are closest to the point we are classifying. The prediction is the decision that holds for the majority of these *k* neighbours. Usually, odd numbers are used for *k*, so that with two possible decisions a tie cannot occur.
At *k=1* the boundary between the two areas where the algorithm approves/rejects is very uneven. By making *k* bigger the shape of the boundary is less determined by single points in the training data. This makes the boundary more smooth, which often leads to better predictions.
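
The majority vote can be sketched as follows, under the assumption that the distances from the new point to all training points have already been computed. The function name `majority_vote` and the data are made up for illustration.

```python
from collections import Counter

def majority_vote(distances_and_labels, k):
    """Return the most common label among the k entries with the smallest distance.

    distances_and_labels is a list of (distance, label) pairs.
    """
    # Sort by distance only, then keep the k nearest neighbours
    nearest = sorted(distances_and_labels, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # most_common(1) returns a list holding the (label, count) pair with the most votes
    return votes.most_common(1)[0][0]

# Hypothetical distances to five training points, with their labels
neighbours = [(0.4, 1), (0.5, 0), (0.7, 0), (1.5, 1), (2.0, 1)]

print(majority_vote(neighbours, 1))  # 1: the single nearest neighbour has label 1
print(majority_vote(neighbours, 3))  # 0: two of the three nearest have label 0
print(majority_vote(neighbours, 5))  # 1: three of all five have label 1
```

Note how the prediction for the same point changes with *k*, which is why the choice of *k* matters.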

**10. Adapt giveCredit and getErrorPercentage to k-Nearest Neighbours** (1 point)

Modify the classification function in such a way that it uses the k-NN algorithm. Make *k* an argument of the function. You can assume that $k \in \{1,3,5\}$. Use the last distance function *computeDistance3* from above.

In [None]:
# The distance did not change, so we will just modify giveCredit3 and getErrorPercentage3
# We will rename them as giveCreditk and getErrorPercentagek, to reflect the choice of neighbours
# We need to take in the new parameter k
def giveCreditk(person, trainData, realBounds, k):
    # FILL THIS IN

def getErrorPercentagek(trainData: Data, testData: Data, k: int) -> float:
    # FILL THIS IN

In [None]:
def testGiveCreditk():
    # This is a simple test to check that the method above works well
    with open('test_give_credit.txt', 'w') as f:
        f.write("""1 23.08 2.5 2 8 4 1.085 1 1 11 1 2 60 2185 1
1 27 0.75 2 8 8 4.25 1 1 3 1 2 312 151 1
0 20.42 10.5 1 14 8 0 0 0 0 1 2 154 33 0
1 52.33 1.375 1 8 8 9.46 1 0 0 1 2 200 101 0
""")
    trainData = Data('test_give_credit.txt')
    testPerson = Person("0 22.67 7 2 8 4 0.165 0 0 0 0 2 160 1 0".split())
    bounds = [(15.83, 29.58), (0.585, 11.46), (0.165, 1.585), (0, 2), (100, 280), (1, 1213)]
    creditDecision = giveCreditk(testPerson, trainData, bounds, 3)
    expectedDecision = True
    assert creditDecision == expectedDecision
    # Remove the accessory file we created
    os.remove('test_give_credit.txt')

def testGetErrorPercentagek():
    # A simple function to check that the function above works correctly
    with open('test_give_credit_train.txt', 'w') as f:
        f.write("""1 39.58 13.915 2 9 4 8.625 1 1 6 1 2 70 1 1
1 23.17 0 2 13 4 0.085 1 0 0 0 2 0 1 1
1 63.33 0.54 2 8 4 0.585 1 1 3 1 2 180 1 0
1 23.75 0.415 1 8 4 0.04 0 1 2 0 2 128 7 0
""")
    trainData = Data('test_give_credit_train.txt')
    with open('test_give_credit_test.txt', 'w') as f:
        f.write("0 29.58 1.75 1 4 4 1.25 0 0 0 1 2 280 1 0\n")
        f.write("0 15.83 0.585 2 8 8 1.5 1 1 2 0 2 100 1 1\n")
    testData = Data('test_give_credit_test.txt')
    errorPercentage = getErrorPercentagek(trainData, testData, 3)
    assert errorPercentage == 50, errorPercentage
    # Remove the accessory files we created
    os.remove('test_give_credit_train.txt')
    os.remove('test_give_credit_test.txt')

testGiveCreditk()
testGetErrorPercentagek()

**11. What are the test results for the different values of *k*?** (0.5 points)

In [None]:
def printTestResults():
    # FILL THIS IN
    # Hint: you need to iterate over the possible values of k: 1, 3 and 5. For each k, call getErrorPercentagek

printTestResults()

**Note:** In a realistic situation we would not first check which value of *k* leads to the best results on the test data; we have to choose the value of *k* (or let the computer choose) **before** we see the test data. Usually we do this on a separate dataset called the *development* dataset. This is an example of a general problem which we will encounter in all kinds of Machine Learning methods and to which we will pay a lot of attention in this course.
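
One common way to obtain such a development set is to hold out part of the training data. The sketch below shows only the split; the function name `split_train_dev`, the 20% fraction and the fixed seed are assumptions made for illustration, not requirements of this lab.

```python
import random

def split_train_dev(items, dev_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, dev) lists."""
    shuffled = list(items)
    # A seeded generator makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# Stand-in for the list of training customers (for example trainData.data)
customers = list(range(10))
train_part, dev_part = split_train_dev(customers)

print(len(train_part), len(dev_part))  # 8 2
```

The value of *k* would then be chosen by comparing error percentages on the development part, and the test data would be used only once at the end.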

# Programming practices

## Copy-pasted code

In this lab we have followed a very bad programming practice: copying and pasting code. For example, we copied computeDistance to then write computeDistance2, and then we copied computeDistance2 to write computeDistance3. In general, best practice is to avoid redundant copying and pasting, but instead to do one of the following:

* if you want to replace the old functionality with the new one, simply edit the original function to have the new functionality

* if you want to keep the old functionality along with the new one, edit the original function to be able to run in both modes
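
The second option can be sketched with an optional parameter. In this illustration, on plain lists of real values, passing `bounds=None` keeps the original behaviour and passing a list of `(min, max)` tuples switches on scaling. The function name `distance` and its signature are hypothetical, not the required solution.

```python
import math

def distance(a, b, bounds=None):
    """Euclidean distance, optionally scaling each difference by its variable's range."""
    total = 0.0
    for i, (a_i, b_i) in enumerate(zip(a, b)):
        diff = a_i - b_i
        if bounds is not None:
            lo, hi = bounds[i]
            diff /= (hi - lo)  # assumes hi > lo for every variable
        total += diff ** 2
    return math.sqrt(total)

a, b = [0.0, 100.0], [3.0, 104.0]
print(distance(a, b))                                       # 5.0 (old behaviour)
print(distance(a, b, bounds=[(0.0, 3.0), (100.0, 104.0)]))  # sqrt(1 + 1), roughly 1.414
```

With one function serving both modes, a later bug fix only has to be made in one place.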

When grading this lab, we will award a *bonus* of 0.5 points (on the whole lab) for avoiding copy-pasting. The maximum grade will be capped at 8 points (you cannot score more than 8 points on the entire lab).

## Explicitly stating the parameter types and return types

Python allows you to explicitly state the parameter types and the return types, i.e., you can write method signatures as:

`def get_some_boolean(param1):`

but you can also write it as:

`def get_some_boolean(param1: float) -> bool:`

Both of these are correct, and we will not base our grading on this. Whatever you prefer is fine. We just want you to be aware of both possibilities, to avoid confusion on your end when reading our code.

## Unit testing

A good practice when writing code is to write unit tests. These allow you to test the functionality of your code in very controlled scenarios. You can use the unit tests provided above. If one of these fails, it probably means there is a bug in your code. However, we also note that you might have used slightly different function signatures, which might mean that our unit tests are not precisely tailored to your code.

## Congratulations! You have finished the required portion of the assignment.
### If you have finished and want to challenge yourself a little further, you can attempt the following 2 extra questions.

**12. Try what happens for higher values of $k$. Discuss your results in a short paragraph** (0 points)

In [None]:
# FILL THIS IN
# Hint: most of the work is done by just running getErrorPercentagek with other values for k

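Can you think of alternative distance metrics? One possible metric is the *Manhattan distance*, also called the *city block* distance. It sums absolute differences instead of squared differences, the same kind of sum of absolute values that is used in *lasso*, a powerful method for improving the reliability of linear models.

$$dist_{MH} = \sum_{i=1}^{d} \lvert A_i - B_i \rvert$$

with $d$ the number of input variables (here 14).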

https://en.wikipedia.org/wiki/Taxicab_geometry
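
As a minimal sketch of this metric on plain lists (the function name `manhattan` and the example points are made up for illustration):

```python
def manhattan(a, b):
    """Manhattan (city block) distance: the sum of absolute differences per dimension."""
    return sum(abs(a_i - b_i) for a_i, b_i in zip(a, b))

# Hypothetical points: |1 - 4| + |2 - 6| = 3 + 4 = 7 (the Euclidean distance would be 5)
print(manhattan([1, 2], [4, 6]))  # 7
```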

**13. Implement the Manhattan distance metric. Is this a good metric for this problem? Discuss your results** (0 points)

In [None]:
# FILL THIS IN
"""
Hint: most of the work is done by changing the computeDistance3 function.
However, you will also need to change the giveCredit function and the getErrorPercentage function
"""