# Lab Report - First Machine Learning Project

##### Nakul Pandit
##### 250919679

<hr style="border:2px solid gray">

## Index: <a id='index'></a>
1. [Building a Machine Learning Model](#model)
1. [The dataset](#dataset)
1. [Decision Trees](#DT)
1. [Expectations](#expectations)
1. [Nearest Neighbours](#neighbor)


<hr style="border:2px solid gray">

## Section 1: Building a Machine Learning Model  [^](#index) <a id='model'></a>

Steps:
1. **Problem formulation:** 
2. **Data collection:** 
3. **Data preparation and feature engineering:** 
4. **Model selection and training:** 
5. **Model evaluation:** 
6. **Model tuning:** 

## 1.1 Problem formulation

In the cell below, **define the problem that we want to solve**. Are we trying to predict? Or classify? Do you think we should use supervised learning or unsupervised learning for this task?

**"Iris Dataset"-** For the Iris dataset, the task is to classify each flower sample into one of three species (Setosa, Versicolor, Virginica). We will classify it based on four measured features.

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn import metrics


import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import colors

<hr style="border:2px solid gray">

## Section 2: The dataset [^](#index) <a id='dataset'></a>

### 2.1 Find the appropriate dataset
In the second step, we collect the data that we need to train our model. In this case we are lucky: the [Habitable Worlds Catalogue](https://phl.upr.edu/hwc) has already done this for us. It flags the potentially habitable worlds in a list of over five thousand known exoplanets, putting together information gathered by several observatories, including the Kepler and K2 missions and the ongoing Transiting Exoplanet Survey Satellite.

As this is our very first machine learning project, it may be daunting to look at a dataset of 5000+ entries with tens of features, so I made a smaller set for you, made of 18 instances, 3 features and our target labels. This is in the CSV file called `HabPlanets_simple.csv`.

### 2.2 Read the dataset
In week 1, you learned to use the `read_csv` function from the `pandas` module, so I left the cell below for you to complete:

In [None]:
LearningSet = pd.read_csv('hwc.csv')

Unnamed: 0,P_NAME,P_MASS,P_MASS_ERROR_MIN,P_MASS_ERROR_MAX,P_RADIUS,P_RADIUS_ERROR_MIN,P_RADIUS_ERROR_MAX,P_YEAR,P_UPDATED,P_PERIOD,...,S_ABIO_ZONE,S_TIDAL_LOCK,P_HABZONE_OPT,P_HABZONE_CON,P_TYPE_TEMP,P_HABITABLE,P_ESI,S_CONSTELLATION,S_CONSTELLATION_ABR,S_CONSTELLATION_ENG
0,OGLE-2016-BLG-1227L b,251.084120,-123.952920,413.176400,13.90040,0.00000,0.00000,2020,2020,0.000000,...,0.000046,0.000000,0,0,Cold,0,0.146639,Scorpius,Sco,Scorpion
1,Kepler-276 c,16.527056,-3.496108,4.449592,2.90339,-0.28025,1.26673,2013,2014,31.884000,...,2.097783,0.316980,0,0,Hot,0,0.271883,Cygnus,Cyg,Swan
2,Kepler-829 b,5.085248,0.000000,0.000000,2.10748,-0.17936,0.43719,2016,2016,6.883376,...,1.756317,0.459559,0,0,Hot,0,0.254888,Lyra,Lyr,Lyre
3,K2-283 b,12.172812,0.000000,0.000000,3.51994,-0.15694,0.15694,2018,2018,1.921036,...,0.568374,0.000000,0,0,Hot,0,0.193908,Pisces,Psc,Fishes
4,Kepler-477 b,4.926334,0.000000,0.000000,2.07385,-0.12331,0.17936,2016,2016,11.119907,...,0.768502,0.386150,0,0,Hot,0,0.276524,Lyra,Lyr,Lyre
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5561,Kepler-58 e,3.210063,0.000000,0.000000,1.69271,0.00000,0.00000,2023,2023,4.458149,...,3.550861,0.453699,0,0,Hot,0,0.273785,Cygnus,Cyg,Swan
5562,KMT-2019-BLG-1216L b,29.875832,-15.891400,15.891400,5.97493,0.00000,0.00000,2023,2023,0.000000,...,0.000072,0.000000,0,0,Cold,0,0.287181,Scorpius,Sco,Scorpion
5563,TOI-1694 b,26.100035,-2.199370,2.199370,5.43685,-0.17936,0.17936,2023,2023,3.770150,...,0.563688,0.000000,0,0,Hot,0,,Camelopardalis,Cam,Giraffe
5564,KMT-2022-BLG-0440L b,15.398767,-7.399036,9.598406,4.04681,0.00000,0.00000,2023,2023,0.000000,...,0.000080,0.000000,1,1,Warm,0,0.414763,Sagittarius,Sgr,Archer


### 2.3 Check that the dataset has been read correctly

Check that your dataset has been read correctly by exploring its structure (displaying the whole `LearningSet`, and using `head()` or `describe()`). **NB You shouldn't be plotting the dataset at this stage as you have not split it into the training and test set yet.**

### 2.4 Understand the features 
The dataset includes 3 features and one target. Looking at the website source we can see that the column features refer to:
 - S_MASS - star mass (solar units).
 - P_PERIOD - planet period (days).
 - P_DISTANCE - planet mean distance from the star (AU).
 - P_HABITABLE - boolean variable telling us if the planet is habitable or not.

You can change the column names to something handier or keep them as they are; the important thing is that you remember what they mean.

<div style="background-color:#C2F5DD">

## Exercise 1
When dealing with a new dataset it's useful to answer these questions.
1. What's the size of the dataset?
2. Are there any missing data? If yes, how should you handle them?
3. Are all the features in a similar numerical range, and is there anything unusual about the distribution of the numerical values?
4. Is the dataset imbalanced (i.e. one or more classes are much more heavily populated than others)?
5. Start developing some intuition on how well you expect the model to work: are these features meaningful? Do we have enough samples?
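
As a rough sketch, the first four questions can be explored with a handful of `pandas` calls. The small dataframe `toy` below is made up purely for illustration (its values are not taken from the catalogue); only the column names follow the ones listed above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the learning set; the values are illustrative only.
toy = pd.DataFrame({
    "S_MASS": [0.5, 1.0, np.nan, 0.8],
    "P_PERIOD": [10.0, 365.0, 4.5, 120.0],
    "P_HABITABLE": [0, 1, 0, 0],
})

n_rows, n_cols = toy.shape                            # 1. size of the dataset
missing_per_column = toy.isna().sum()                 # 2. missing values per column
feature_ranges = toy.describe().loc[["min", "max"]]   # 3. numerical range of each column
class_counts = toy["P_HABITABLE"].value_counts()      # 4. how many samples per class

print(n_rows, n_cols)
print(missing_per_column)
print(feature_ranges)
print(class_counts)
```

The same calls should work on `LearningSet` once it has been read in; question 5 still needs your own judgement.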

## 1.3 Data preparation and feature engineering

This step involves preparing the data for training, such as cleaning and transforming it. This may involve removing outliers, imputing missing values, and normalising the data. Select the features that are most important for the problem. This may involve creating new features or removing irrelevant features.

### 1.3.1 Splitting between the Train and Test set
The first thing we will want to do is split the dataset into training and test sets. Normally the train/test split is chosen at random, but for this notebook we will choose a specific split so that the results are reproducible.
Use the first 13 instances of the dataframe as the training set and the last 5 as the test set (*hint: you could use the `pandas` indexer `iloc` for this*).

In [None]:
TrainSet = ...
TestSet = ...

Now create `Xtrain` and `Xtest` sets which will not have the name and habitable columns (*hint: you can use `drop` to do this*). And create your label sets `ytrain` and `ytest` which will only include the habitable column.

In [None]:
Xtrain = ...
Xtest = ...
ytrain = ...
ytest  = ...

Verify that the `shape` of `Xtrain`, `Xtest`, `ytrain`, `ytest` is what you expect.

In [None]:
Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape
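
One possible way to carry out the split, shown on a made-up 18-row table rather than on the real file. The column name `P_NAME` for the planet name is an assumption, and all the values are invented for illustration.

```python
import pandas as pd

# Made-up 18-row table; column names are assumed to match the simplified dataset.
toy = pd.DataFrame({
    "P_NAME": [f"planet-{i}" for i in range(18)],
    "S_MASS": [0.1 * (i + 1) for i in range(18)],
    "P_PERIOD": [10.0 * (i + 1) for i in range(18)],
    "P_DISTANCE": [0.05 * (i + 1) for i in range(18)],
    "P_HABITABLE": [i % 2 for i in range(18)],
})

train = toy.iloc[:13]   # first 13 rows
test = toy.iloc[13:]    # remaining 5 rows

# Features: everything except the name and the target column.
X_train = train.drop(columns=["P_NAME", "P_HABITABLE"])
X_test = test.drop(columns=["P_NAME", "P_HABITABLE"])

# Labels: the target column only.
y_train = train["P_HABITABLE"]
y_test = test["P_HABITABLE"]

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Note that `iloc` selects by position, so the test rows keep their original row labels (13 to 17 here).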

<div style="background-color:#C2F5DD">

## Exercise 2
Plot the train and test set in a nice scatter graph. I added some bits of code which I know will make the plot prettier once you have correctly defined everything. We want the following features:
1. Plot a scatter graph of the `Xtrain` dataset (mass of the parent star on the x-axis and the orbital period on the y-axis), using `ytrain` as the `c` option for the colormap (`cmap` has already been defined for you below), `*` as markers, and an `alpha` of 0.5. Label this `Train`.
2. Add a scatter graph of the `Xtest` dataset (mass of the parent star on the x-axis and the orbital period on the y-axis), using `ytest` as the `c` option for the colormap, `o` as markers, and an `alpha` of 0.5. Label this `Test`.
3. Add descriptive axis labels (including units).
4. Use a logarithmic scale for the y-axis.
5. Plot the legend.
6. Using `plt.axhline` and `plt.axvline`, plot a horizontal line at 3.5 and a vertical line at 0.5.



In [None]:
plt.figure(figsize=(10,6))
cmap = colors.ListedColormap(['purple', 'green'])

...

purplepatch = mpatches.Patch(color='purple', label='Not Habitable')
greenpatch = mpatches.Patch(color='green', label='Habitable')

ax = plt.gca()
leg = ax.get_legend()
leg.legend_handles[0].set_color('k')
leg.legend_handles[1].set_color('k')
plt.legend(handles=[leg.legend_handles[0],leg.legend_handles[1], purplepatch, greenpatch])
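
For reference, here is a minimal sketch of the kind of plot described in the exercise, built on made-up points rather than the real train and test sets. The array names and values are hypothetical; passing `vmin=0, vmax=1` is one way to keep the colours consistent even if one of the sets happens to contain a single class.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors

# Made-up points: stellar mass (solar units), orbital period (days), label.
mass_train = np.array([0.2, 0.4, 0.9, 1.1])
period_train = np.array([5.0, 30.0, 200.0, 400.0])
label_train = np.array([0, 1, 1, 0])
mass_test = np.array([0.3, 1.0])
period_test = np.array([12.0, 350.0])
label_test = np.array([1, 0])

cmap = colors.ListedColormap(["purple", "green"])
fig, ax = plt.subplots(figsize=(10, 6))

# Train points as stars, test points as circles, coloured by label.
ax.scatter(mass_train, period_train, c=label_train, cmap=cmap, vmin=0, vmax=1,
           marker="*", alpha=0.5, label="Train")
ax.scatter(mass_test, period_test, c=label_test, cmap=cmap, vmin=0, vmax=1,
           marker="o", alpha=0.5, label="Test")

ax.set_xlabel("Stellar mass (solar masses)")
ax.set_ylabel("Orbital period (days)")
ax.set_yscale("log")

# Candidate split lines: vertical at 0.5, horizontal at 3.5.
ax.axvline(0.5, color="gray", linestyle="--")
ax.axhline(3.5, color="gray", linestyle="--")
ax.legend()
```

In a notebook the figure is displayed at the end of the cell; in a script you would add `plt.show()`.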


<hr style="border:2px solid gray">

## Section 3: Decision Trees  [^](#index) <a id='DT'></a>


A decision tree is a type of machine learning algorithm that uses a tree-like structure to make decisions or predictions. It's often used in supervised learning tasks, where you have a dataset with labeled examples and you want to learn a model that can predict the labels for new, unseen examples.

Here's how a decision tree works:
 - **Start at the root node.** This node represents the entire dataset.
 - **Ask a question about one of the features in the data.** The answer to the question will determine which branch of the tree to take.
 - **Continue asking questions and following branches until you reach a leaf node.** The leaf node represents a prediction or classification.

A good decision tree is characterised by efficient splits, i.e. splits that give the maximum information gain or the maximum decrease in impurity. A metric that is often used is the **Gini impurity**, defined as:
$$
 I_G = 1 - \sum_i f(i)^2
$$
where $f(i)$ is the fractional abundance of each class.

To decide whether a split is worthwhile, we need to perform 3 steps:
1. Calculate the Gini impurity of the current dataset.
2. Calculate the Gini impurity of the proposed split.
3. Calculate the difference between the two.

The split that gives the largest decrease in impurity is the preferable option. **NB. The Gini impurity of a proposed split is the sum of the impurities of the two resulting nodes, each weighted by the fraction of the parent node's samples that it contains.**
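
The three steps can be sketched in a few lines of Python. The helper names `gini` and `split_impurity` and the label lists are hypothetical, chosen only to illustrate the formula above; they are not part of `scikit-learn`.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_i f(i)^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: each child's impurity weighted by its share of the parent."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up parent node with 4 samples of each class, and one proposed split.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

parent_impurity = gini(parent)                    # step 1
proposed_impurity = split_impurity(left, right)   # step 2
decrease = parent_impurity - proposed_impurity    # step 3

print(parent_impurity, proposed_impurity, decrease)
```

For this example the parent impurity is 0.5, the split impurity is 0.375, and the decrease is 0.125. Comparing the decrease for two candidate splits tells you which one to make first.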

<div style="background-color:#C2F5DD">

## Exercise 3
Using the two lines defined in the scatter plot above and the definition of the Gini impurity, assess whether it is more convenient to split the **train** dataset vertically and then horizontally or the other way round.


## Train the model!
It's time to train our Decision Tree and see if our model finds the same results. The following cell does two things:
 - It defines our model as a decision tree classifier.
 - It then trains the model with our training set.

The `random_state` variable in this case is set to a specific value for reproducibility purposes.

In [None]:
model = DecisionTreeClassifier(random_state=3)
model.fit(Xtrain,ytrain)

#### Let's visualize the graph!

In [None]:
plot_tree(model, feature_names=['Stellar Mass (M*)', 'Orbital Period (d)', 'Distance (AU)'], 
          class_names=['Not Habitable','Habitable'], filled = True, rounded = True)
plt.show()

These numbers are a little bit different from what we found above. Can you guess why?

In [None]:
# Because we only looked at 2 features, whereas the DT is looking at 3 features.

### Let's take a look at some metrics.
Using the `model.predict` function, apply the model to `Xtest` and to `Xtrain` to obtain our predictions on the test set and on the training set.

In [None]:
ytestpred = ...
ytrainpred = ...

Using the `metrics` module you can calculate the `accuracy_score` for each set and compare the performance of the two.

In [None]:
test_accuracy  = ...
train_accuracy = ...
print("The accuracy of the test set is {:.3f}".format(test_accuracy))
print("The accuracy of the train set is {:.3f}".format(train_accuracy))
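
As an illustration of the predict-then-score pattern, here is a sketch on a made-up one-feature dataset. The data and the variable names are hypothetical; the last test point is deliberately placed on the "wrong" side of the boundary so that the test accuracy is lower than the train accuracy.

```python
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# Made-up one-feature data: class 0 at small values, class 1 at large values.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]
X_test = [[0.5], [2.5], [1.2]]
y_test = [0, 1, 1]  # the last point lies on the class-0 side of the learned threshold

toy_model = DecisionTreeClassifier(random_state=3)
toy_model.fit(X_train, y_train)

# Predictions on both sets, then the fraction of correct labels for each.
y_test_pred = toy_model.predict(X_test)
y_train_pred = toy_model.predict(X_train)

test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
train_accuracy = metrics.accuracy_score(y_train, y_train_pred)
print("The accuracy of the test set is {:.3f}".format(test_accuracy))
print("The accuracy of the train set is {:.3f}".format(train_accuracy))
```

A fully grown tree will usually score perfectly on the data it was trained on, so the test accuracy is the more informative number.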


The following cell makes a pretty Confusion Matrix and prints out the number of true negatives, true positives, false negatives and false positives.

In [None]:
cm = metrics.confusion_matrix(ytest,ytestpred, labels=model.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=['Not Habitable','Habitable'])
disp.plot()

print("Number of True Negatives: {}".format(cm[0,0]))
print("Number of True Positives: {}".format(cm[1,1]))
print("Number of False Negatives: {}".format(cm[1,0]))
print("Number of False Positives: {}".format(cm[0,1]))
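
To make the layout of the matrix explicit, here is a small sketch with made-up labels. In `scikit-learn` the rows of the confusion matrix are the true classes and the columns are the predicted classes, so for a binary problem with labels `[0, 1]` the flattened matrix reads true negatives, false positives, false negatives, true positives.

```python
from sklearn import metrics

# Made-up true and predicted labels (0 = not habitable, 1 = habitable).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

cm_toy = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])

# Row = true class, column = predicted class.
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(tn, fp, fn, tp)
```

Here two of the three negatives are classified correctly and one is a false positive, while the two positives give one true positive and one false negative.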

<div style="background-color:#C2F5DD">

## Exercise 4
Repeat the same exercise, but taking the first 5 instances of the `LearningSet` as our test set and the last 13 as our training set:
1. Plot a scatter graph of the new train set and test set.
2. Train the new model (using again `random_state=3` to have reproducibility)
3. Visualise the decision tree
4. Calculate and display the new accuracy
5. Discuss which training is better

<hr style="border:2px solid gray">

## Section 4: Nearest Neighbours  [^](#index) <a id='neighbor'></a>

The nearest neighbour method, also known as the k-nearest neighbours (k-NN) algorithm, is a simple yet powerful technique in machine learning used for both classification and regression tasks. It works on the fundamental assumption that similar data points are likely to have similar labels or values.
We can use it in a similar way just by calling the classifier from `scikit-learn`. In this case we use the `KNeighborsClassifier`.
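
A minimal sketch of the idea on made-up two-feature points (the data and the name `knn` are hypothetical): each new point takes the majority label of its 3 closest training points, and `kneighbors` reports which training points those are and how far away they lie.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two made-up clusters: class 0 near the origin, class 1 near (5, 5).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

X_new = [[0.5, 0.5], [5.5, 5.5]]
predictions = knn.predict(X_new)

# Distances to, and row positions of, the 3 nearest training points.
dist, ind = knn.kneighbors(X_new)
print(predictions)
print(ind)
```

By default the distances are Euclidean distances computed on the raw feature values, which matters when the features have very different numerical ranges.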


<div style="background-color:#C2F5DD">

## Exercise 5

1. Define the new model as the `KNeighborsClassifier` using the option `n_neighbors=3` (we use just 3 neighbours as this is a very small dataset; the default is 5 neighbours). Train the model using the `fit` method as you've done previously.
2. Use the `predict` method from the model to get the predictions and calculate the accuracy scores.
3. Plot the confusion matrix and print out the number of true positives, true negatives, false positives, false negatives.
4. What do you think about this classifier? Did it work well?
5. Plot the scatter graph of the test and train set again (yes, the usual one!) but without the logarithmic y-axis. Then, use the code below to plot 5 circles, each one reaching out to the 3 training instances closest to one of the 5 test points. (Yes, I am giving you the code for this one!) Does this explain the results of the training?

```
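# Distances to, and indices of, the 3 nearest training points for each test point
dist, ind = model.kneighbors(Xtest)

for index in range(5):
    # Positional indexing, so this works whatever the row labels of TestSet are
    x0 = TestSet['S_MASS'].iloc[index]
    y0 = TestSet['P_PERIOD'].iloc[index]
    r0 = dist[index].max()  # radius reaching the farthest of the 3 neighbours
    circle = plt.Circle((x0, y0), r0, color='r', fill=False)
    ax = plt.gca()
    ax.add_patch(circle)

plt.xlim(-10, 10)
plt.show()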
```


In [None]:
# The training performed better as the features now weigh a similar amount and the dataset is not skewed towards the features with highest numerical values.

### Preprocessing and Scaling

Hopefully you have now noticed that one of our features has much larger numerical values than the others, so it carries more weight in the distance calculation. Note that this was not a problem for the Decision Tree, as its decisions were made on one feature at a time.

There are a few different options to define a scaler, as you have seen in the notes. We will start with a `RobustScaler`: its `fit` method computes the median and quartiles of each feature in the set, which are then used to scale the set so that each feature has a median of 0 and, with the default settings, an interquartile range of 1.

In [None]:
scaler = preprocessing.RobustScaler()
scaler.fit(Xtrain)

To apply this transformation, i.e. to *scale* the training data, we use the `transform` method of the scaler. The `transform` method is used in `scikit-learn` whenever a model returns a new representation of the data.

In [None]:
scaledXtrain = ...

**Print the dataset properties** (median, 0.25, 0.75 quantiles) before and after the scaling.

The transformed data has the same shape as the original data - the features are simply shifted and scaled.

To apply the kNN to the scaled data we need to **apply** the same transformation to the test set as well. **It is important not to use the test set to fit the transformation, as we don't want to *see* the statistical properties of the test set.**

In [None]:
scaledXtest  = ...

Print the test set properties before and after the scaling.
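
The effect of the scaler can be checked on a small made-up array. The data and the helper `summary` below are hypothetical; the sketch fits the scaler on the training array only and then applies the same transformation to both arrays.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up data: the second feature is 100 times larger than the first.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0],
                    [4.0, 400.0], [5.0, 500.0]])
X_test = np.array([[7.0, 100.0]])

scaler = RobustScaler()
scaler.fit(X_train)                       # median and quartiles from the training data only
scaled_train = scaler.transform(X_train)
scaled_test = scaler.transform(X_test)    # same shift and scale, no refitting

def summary(X):
    """0.25, 0.5 and 0.75 quantiles of each column."""
    return np.percentile(X, [25, 50, 75], axis=0)

print(summary(X_train))       # before scaling
print(summary(scaled_train))  # after scaling
print(scaled_test)
```

After scaling, both training features have a median of 0 and quartiles at -0.5 and 0.5. The test point is shifted and scaled with the training statistics, so its values are not required to fall inside that range.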

<div style="background-color:#C2F5DD">

## Exercise 6

1. Retrain the neighbour classifier with your new scaled training set. 
2. Calculate the new accuracy.
3. Calculate the new confusion matrix and true positives/negatives, false positives/negatives.
4. Remake the scatter plot with the circles.
5. Write a short sentence with your thoughts on the performance.
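
To see why scaling can change the outcome, here is a sketch on made-up data that was deliberately constructed so that the second feature has much larger values but only the first feature separates the classes. Chaining the scaler and the classifier with `make_pipeline` is just one convenient option and is not required by the exercise; you can equally fit the classifier on `scaledXtrain` directly.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Made-up data: feature 1 separates the classes, feature 2 is large and uninformative.
X_train = [
    [0.0, 250.0], [0.1, 350.0], [0.2, 240.0], [0.3, 10.0],   # class 0
    [1.0, 290.0], [1.1, 300.0], [1.2, 310.0], [1.3, 900.0],  # class 1
]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_new = [[0.1, 300.0]]  # small first feature, so it sits with class 0

# Same classifier, with and without scaling in front of it.
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
scaled_knn = make_pipeline(
    RobustScaler(), KNeighborsClassifier(n_neighbors=3)
).fit(X_train, y_train)

raw_pred = raw_knn.predict(X_new)[0]
scaled_pred = scaled_knn.predict(X_new)[0]
print(raw_pred, scaled_pred)
```

Without scaling, the three nearest neighbours are picked almost entirely on the second feature and the point is assigned to class 1; with scaling, both features contribute comparably and it is assigned to class 0.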