<a href="https://colab.research.google.com/github/hr-ge/Python-for-clinicians/blob/main/exam_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The purpose of this exam is to bring together some of the core concepts covered throughout the course. In particular, you will work on a small data science project. Your task is to perform classification on the Iris dataset (https://archive.ics.uci.edu/dataset/53/iris) using one of the simplest machine learning algorithms: k-Nearest Neighbours (k-NN). You are expected to tune a single hyperparameter, the number of neighbours $k$, and to apply a more advanced training technique, k-Fold cross-validation. You are NOT expected to write most of the code from scratch! Instead, you will fill in the blanks. Good luck!

Let's start with a basic task from week 1. Print out your first and last name along with your position. Put the relevant code in the code cell below:

Hristo Georgiev, PhD student


Let's now load a few familiar packages that we will need to use throughout the exam.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We will now fetch the Iris dataset. For your convenience, we do that entirely through ```Python```, meaning that you do not have to worry about downloading a ```CSV``` file, mounting your drive, etc.

In [None]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
from ucimlrepo import fetch_ucirepo

And now for the actual fetching:

In [None]:
iris = fetch_ucirepo(id=53)

X = iris.data.features
y = iris.data.targets

We now have a pair of variables called ```X``` and ```y``` which supposedly store the features and labels of the Iris dataset. Note, however, that we know nothing about their type. Are they ```Pandas``` dataframes? Are they ```NumPy``` arrays? Check the type of ```X``` and ```y``` by passing each of them as an argument to the ```type()``` function. Remember that arguments to a function are passed within the function's parentheses. In this particular case, we should make two separate calls to the ```type()``` function. Fill in the blank space after the commas in the code cell below:

In [None]:
print("The type of X is: ", type(X))
print("The type of y is: ", type(y))

The type of X is:  <class 'pandas.core.frame.DataFrame'>
The type of y is:  <class 'pandas.core.frame.DataFrame'>


If correct, your implementation should print out ```<class 'pandas.core.frame.DataFrame'>``` after the colon. We now know the type of ```X``` and ```y```. Unfortunately, this is not enough to understand the data. An experienced data scientist would always investigate further, as understanding the data is essential for solving the problem.
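
As a rough illustration of what "investigating further" can look like, here is a minimal sketch on a small made-up dataset. The frames ```toy_features``` and ```toy_targets``` and their column names are hypothetical and have nothing to do with Iris; the point is only to show a few quick checks.

```python
import pandas as pd

# Hypothetical toy data (NOT the Iris dataset): 4 samples, 2 features.
toy_features = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2, 177.8],
    "weight_kg": [65.0, 80.3, 58.9, 72.4],
})
toy_targets = pd.DataFrame({"group": ["A", "B", "A", "C"]})

# .shape gives (number of rows, number of columns).
n_samples, n_features = toy_features.shape

# .nunique() on a single column counts its distinct values.
n_classes = toy_targets["group"].nunique()

print("Samples:", n_samples)
print("Features:", n_features)
print("Classes:", n_classes)
```

The same kind of checks can be applied to ```X``` and ```y``` once you know they are dataframes.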

In [None]:
print(iris.variables)

           name     role         type demographic  \
0  sepal length  Feature   Continuous        None   
1   sepal width  Feature   Continuous        None   
2  petal length  Feature   Continuous        None   
3   petal width  Feature   Continuous        None   
4         class   Target  Categorical        None   

                                         description units missing_values  
0                                               None    cm             no  
1                                               None    cm             no  
2                                               None    cm             no  
3                                               None    cm             no  
4  class of iris plant: Iris Setosa, Iris Versico...  None             no  


Focus on the first table. What can you tell about the data? How many distinct features can you count (see the table)? How many distinct classes are there (hint: use ```y.nunique()```)?

Change # to the correct answer, for example, # -> 8:

**There are # unique features and # unique classes in the dataset.**

Now that we understand the data, it is time to do some basic preprocessing. Your task is to convert ```X``` and ```y``` from ```Pandas``` dataframes to ```NumPy``` arrays.
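
One detail worth knowing before converting: a dataframe with a single column becomes a two-dimensional array with one column, not a flat one-dimensional array. The sketch below shows this on a hypothetical toy frame (```toy_targets``` is an illustrative name, not part of the exam data).

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame of labels (NOT the Iris targets).
toy_targets = pd.DataFrame({"label": ["a", "b", "a"]})

# Converting keeps the column dimension: the result has shape (3, 1).
as_column = toy_targets.to_numpy()

# ravel() flattens it into a 1-D array of shape (3,).
as_flat = as_column.ravel()

print(type(as_column), as_column.shape)
print(type(as_flat), as_flat.shape)
```

Many scikit-learn estimators expect labels as a flat 1-D array, so the flattened form is often the more convenient one.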

In [None]:
X = X.to_numpy()
y = y.to_numpy()

In [None]:
print("The type of X is: ", type(X))
print("The type of y is: ", type(y))

The type of X is:  <class 'numpy.ndarray'>
The type of y is:  <class 'numpy.ndarray'>


As an aspiring data scientist, you should always be curious about the size of the data you are given. This is important mostly for computational reasons. Certain models are sufficiently good for smaller datasets, while others, such as neural networks (https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi), excel when given a large dataset.

In [None]:
print("The size of X is: ", len(X))
print("The size of y is: ", len(y))

The size of X is:  150
The size of y is:  150


Why did I print the size of both ```X``` and ```y```? Are they not expected to match? They are. We can safely assume that the sizes of ```X``` and ```y``` match in a well-established dataset such as Iris. However, we often work with data that has not been pre-processed and normalized. If a single ```X``` sample were unlabelled (the corresponding ```y``` were missing), the supervised-learning assumption that every sample is labelled would be violated. We would therefore need to drop the unlabelled sample.
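
To make the idea of dropping unlabelled samples concrete, here is a minimal sketch on made-up data. The arrays ```X_toy``` and ```y_toy``` are hypothetical, and a missing label is assumed to be encoded as ```None```; real datasets may encode missing values differently.

```python
import numpy as np

# Hypothetical toy data: 4 samples, the second one has no label.
X_toy = np.array([[5.1, 3.5],
                  [4.9, 3.0],
                  [6.2, 2.9],
                  [5.9, 3.0]])
y_toy = np.array(["setosa", None, "versicolor", "virginica"], dtype=object)

# Boolean mask: True where a label is present.
has_label = np.array([label is not None for label in y_toy])

# Keep only the labelled samples, in both X and y, so they stay aligned.
X_clean = X_toy[has_label]
y_clean = y_toy[has_label]

print("Before:", len(X_toy), "samples")
print("After: ", len(X_clean), "samples")
```

Applying the same mask to both arrays is what keeps each remaining sample paired with its own label.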

We shall now look into the training setting. We will perform basic k-Fold cross-validation, and we will set $k$ to be $3$, meaning that at each step, we will train on $100$ samples and evaluate on the remaining $50$ samples. We will try the number of neighbours $n \in \{3, 5, 7\}$ (we use $n$ to distinguish it from $k$). We will use accuracy as our performance measure. For each number of neighbours $n$, we will define a model and train it on $3$ distinct data splits. We will then calculate and store the average accuracy.
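
The $100$/$50$ arithmetic can be checked with a small hand-rolled sketch. This is only an illustration built with ```NumPy```; it is not how scikit-learn implements its splitter, and the shuffled order will not match the splits used later in the exam.

```python
import numpy as np

n_samples, n_folds = 150, 3

# Shuffle the sample indices, then cut them into 3 equally sized folds.
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_samples)
fold_indices = np.array_split(shuffled, n_folds)

split_sizes = []
for i, test_idx in enumerate(fold_indices):
    # Every fold except the i-th one is used for training.
    train_idx = np.concatenate(
        [fold for j, fold in enumerate(fold_indices) if j != i]
    )
    split_sizes.append((len(train_idx), len(test_idx)))
    print(f"Split {i + 1}: train on {len(train_idx)}, evaluate on {len(test_idx)}")
```

Each sample ends up in the evaluation set exactly once, which is why the average over the three splits uses all of the data.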

Let's now import the relevant machine learning tools:

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier # Import the k-NN classifier.
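
For intuition about what a k-NN classifier does, here is a minimal hand-rolled sketch: find the $n$ training points closest to a query point and take a majority vote over their labels. The function ```knn_predict``` and the toy data are hypothetical and use plain Euclidean distance; this is not scikit-learn's implementation, which is more general and more efficient.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, n_neighbours=3):
    """Predict the label of x_query by majority vote among its nearest neighbours."""
    # Euclidean distance from the query to every training sample.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the n_neighbours closest training samples.
    nearest = np.argsort(distances)[:n_neighbours]
    # Majority vote over the labels of those neighbours.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of points.
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_toy = np.array(["small", "small", "small", "large", "large", "large"])

prediction_near_origin = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]))
prediction_far_away = knn_predict(X_toy, y_toy, np.array([5.0, 4.9]))
print(prediction_near_origin, prediction_far_away)
```

Note that there is no real "training" step here: the model simply stores the training data and does all of its work at prediction time.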

Define the ```folds``` and ```neighbours``` variables accordingly.

In [None]:
folds      =
neighbours =

In [None]:
# Set up 3-fold cross-validation:
kf = KFold(n_splits=folds, shuffle=True, random_state=42)

# Store average accuracies:
average_accuracies = {}

# Evaluate each value of n_neighbors:
for n in neighbours:
    accuracies = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

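        # Create and train k-NN model. y_train has shape (n, 1), so ravel()
        # flattens it into the 1-D label array that scikit-learn expects:
        model = KNeighborsClassifier(n_neighbors=n)
        model.fit(X_train, y_train.ravel())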

        # Predict and evaluate:
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        accuracies.append(acc)

    # Store average accuracy for this n:
    average_accuracies[n] = np.mean(accuracies)

# Print results:
for n in neighbours:
    print(f"n_neighbors = {n}: Average Accuracy = {average_accuracies[n]:.4f}")

n_neighbors = 3: Average Accuracy = 0.9600
n_neighbors = 5: Average Accuracy = 0.9667
n_neighbors = 7: Average Accuracy = 0.9667



Briefly explain which of the three classifiers you would choose and why. Think not only about accuracy but about computational efficiency as well.