# Introduction to Machine Learning with Python
### eSTEAM Program
### Saturday, October 28, 2023

![sky](https://storage.googleapis.com/kaggle-datasets-images/29/33/default-backgrounds/dataset-cover.jpg)

### How to use Jupyter notebooks

Click on a cell using the mouse and hit the "play" or "Run" button above to execute the code in the cell. You can also hold down the "shift" key and then press "enter" to execute a cell.

To insert a cell, choose "Insert" in the menu then either "Insert Cell Above" or "Insert Cell Below". After inserting the cell, to type in the cell either click on it or hit the "enter/return" key.

Once a variable has been defined in a cell that you have run, it can be used in any cell that you run afterward.

### Distance between points

First, we load some Python modules to make certain functionality available below:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

Below we define the coordinates of two points:

In [None]:
x1, y1 = (0, 0)
x2, y2 = (1, 1)

Let's plot the points:

In [None]:
plt.plot([x1, x2], [y1, y2], marker="o", markersize=10, ls="dashed")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

The distance between the two points can be calculated as follows:

In [None]:
distance = ((x1 - x2)**2 + (y1 - y2)**2)**0.5
print(distance)

Does the number above seem familiar to you? Can you use geometry (the Pythagorean theorem) to confirm the result above?

Here we use [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). There are many other definitions of distance such as [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) and [Haversine distance](https://en.wikipedia.org/wiki/Haversine_formula).
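To make the difference between two of these definitions concrete, here is a small sketch that compares Euclidean and Manhattan distance for the two points plotted earlier. The helper names `euclidean` and `manhattan` are made up for this illustration and do not come from any library.

```python
def euclidean(p, q):
    """Straight-line distance between points p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Distance when you may only move parallel to the axes (like city blocks)."""
    return sum(abs(a - b) for a, b in zip(p, q))


# The same two points used in the plot above.
p = (0, 0)
q = (1, 1)

print("Euclidean:", euclidean(p, q))
print("Manhattan:", manhattan(p, q))
```

The Manhattan distance comes out larger here because it adds the horizontal and vertical steps instead of cutting across the diagonal.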

### Exercise 1

What is the distance between the points when x1=2, y1=7 and x2=6, y2=5?

In [None]:
# you write code here

Now that we have reminded ourselves about distance and how to calculate it using a computer, let's train a machine learning model to classify the cells of breast tissue as malignant (harmful) or benign (not harmful). To do this we will use a simple machine learning model called k-nearest neighbors.

# k-Nearest Neighbors for Classification

[This algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) predicts the class of a new sample from the class labels of the known samples that are closest to it. Consider the figure below showing two classes (blue squares and red triangles). You want to classify the green circle as one of the two classes. kNN finds the nearest neighbors of the green circle and makes a decision based on their classes.

![kNN](https://tigress-web.princeton.edu/~jdh4/KnnClassification.png)

Should the green circle be classified as a red triangle or a blue square? If we consider the 3 closest neighbors (solid circle) then it would be classified as a red triangle since there are 2 red triangles and only 1 blue square (majority wins). If we consider the 5 closest neighbors in making the decision (the dashed circle) then the green circle would be classified as a blue square since there are more neighbors of that class.
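The majority vote described above can be sketched in a few lines of plain Python. The coordinates below are made up so that they behave like the figure (they are not the actual positions in the image), and `knn_predict` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
from collections import Counter


def knn_predict(samples, query, k):
    """Return the majority label among the k samples closest to query."""
    def distance(point):
        return ((point[0] - query[0]) ** 2 + (point[1] - query[1]) ** 2) ** 0.5

    # Sort the labeled samples by distance to the query and keep the k closest.
    nearest = sorted(samples, key=lambda sample: distance(sample[0]))[:k]

    # Count the labels of those neighbors; the most common label wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up (point, label) pairs arranged around the green circle at (0, 0).
samples = [
    ((0.5, 0.0), "red triangle"),
    ((0.0, 0.6), "red triangle"),
    ((-0.7, 0.0), "blue square"),
    ((0.0, -1.0), "blue square"),
    ((1.1, 0.0), "blue square"),
    ((2.0, 2.0), "red triangle"),
]
green_circle = (0, 0)

print("k=3:", knn_predict(samples, green_circle, k=3))
print("k=5:", knn_predict(samples, green_circle, k=5))
```

With these points the answer changes from "red triangle" at k=3 to "blue square" at k=5, which is why the choice of k matters.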

### Exercise 2

Can you think of a case where k-nearest neighbors should work well to distinguish two classes?

For instance, if you had the height and weight of everyone at an elementary school, would you be able to train a k-NN model to distinguish teachers from students and then apply the model successfully at another elementary school? Would it work at a college?





# Medical Diagnosis: Wisconsin Breast Cancer Dataset

Our goal is to train a model to classify tissue samples as either malignant (positive) or benign (negative).

![positive_negative](https://tigress-web.princeton.edu/~jdh4/positive_negative.jpeg)

In the next line we read in the data and store it in df:

In [None]:
df = pd.read_csv("https://tigress-web.princeton.edu/~jdh4/wdbc.csv")

The [original data set](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) includes these measurements and more for each sample:

- ID number
- Diagnosis (M = malignant, B = benign).
- Radius (the mean of distances from the centre to points on the perimeter).
- Texture (the standard deviation of gray-scale values).
- Perimeter
- Area
- Smoothness (the local variation in radius lengths).
- Compactness (the perimeter^2 / area - 1.0).
- Concavity (the severity of concave portions of the contour).
- Concave points (the number of concave portions of the contour).
- Symmetry
- Fractal dimension ("coastline approximation" - 1).

In [None]:
df.tail(3).T

For simplicity we will use only the first two features, namely, radius and texture:

In [None]:
X_train = df[["radius", "texture"]].values
y_train = df["target"].values

The function below is used for making plots (you do not need to understand how it works).

In [None]:
def make_plot(radius=None, texture=None, zoom=False):
    plt.figure(figsize=(9,6))
    if zoom: plt.scatter([radius], [texture], color="w", edgecolor="k", marker="o", s=50000)
    plt.scatter(df[df.target == "Malignant"]["radius"].values, df[df.target == "Malignant"]["texture"].values, marker="s", label="Malignant")
    plt.scatter(df[df.target == "Benign"]["radius"].values, df[df.target == "Benign"]["texture"].values, marker="^", color="r", label="Benign")
    if radius is not None and texture is not None:  # also works when a value is 0
        plt.scatter([radius], [texture], color="lightgreen", edgecolor="k", marker="o", s=200)
        if zoom:
            d = 1
            plt.xlim(radius - d, radius + d)
            plt.ylim(texture - d, texture + d)
    plt.xlabel("Radius")
    plt.ylabel("Texture")
    plt.legend()
    plt.show()

In [None]:
make_plot()

Create an instance of the kNN classifier with 5 neighbors:

In [None]:
number_of_neighbors = 5
kNN = KNeighborsClassifier(n_neighbors=number_of_neighbors)

Fit the model using the training data:

In [None]:
kNN = kNN.fit(X_train, y_train)

We can now apply the model to new samples. For instance, what is the prediction for the following case:

In [None]:
radius = 20
texture = 20

In [None]:
make_plot(radius, texture)

We can call the predict function on the model to see the prediction for the given values of radius and texture:

In [None]:
print(kNN.predict([[radius, texture]]))

The prediction of "Malignant" agrees with the plot: the new point sits among the malignant samples.

### Exercise 3

Create a plot and generate a prediction for the following new case:

In [None]:
radius = 10
texture = 15

In [None]:
# you write code here

# Confidence and Correctness

We can be confident in the two predictions above because each test point is surrounded exclusively by samples of a single class. What about the following case?

In [None]:
radius = 13
texture = 20

### Exercise 4

Make a plot and a prediction for the new test case. Are you confident about the prediction?

In [None]:
# you write code here

Remove the "#" in the line below and run the code. You will get an output of the form [[P1 P2]] where P1 is the probability of being Benign and P2 is the probability of being Malignant. Each probability lies between 0 and 1. The higher the probability, the more confident one can be about the prediction.

In [None]:
#print(kNN.predict_proba([[radius, texture]]))

Remove the "#" character in the line below and run the code to zoom in on the test point so that we can see the 5 nearest neighbors:

In [None]:
#make_plot(radius, texture, zoom=True)

Looking at the figure above, do the predicted probabilities make sense?
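One way to check: with the default settings of `KNeighborsClassifier` (uniform weights), each probability is the fraction of the k nearest neighbors that belong to that class. The sketch below does that calculation by hand. The neighbor labels are made up for illustration and are not taken from the dataset, so replace them with what you see in your own zoomed-in plot.

```python
from collections import Counter

# Hypothetical labels of the 5 nearest neighbors of some test point.
neighbor_labels = ["Benign", "Malignant", "Benign", "Benign", "Malignant"]

counts = Counter(neighbor_labels)
k = len(neighbor_labels)

# List the classes in alphabetical order, matching the [[P1 P2]] output
# described above (Benign first, then Malignant).
probabilities = [counts[label] / k for label in ["Benign", "Malignant"]]
print(probabilities)
```

With 5 neighbors the probabilities can only take the values 0, 0.2, 0.4, 0.6, 0.8 or 1, so a result such as 0.6 versus 0.4 means the vote was close.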

What would happen if a patient with a malignant tumor was told that it was benign?

What would happen if a patient with a benign tumor was told that it was malignant?

What should a patient be told when the prediction of a machine learning model is not very confident?

What could be done to make the model more accurate?