In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Tarefa 4 - Decision Trees, Random Forest and K-Means
Fourth assessed coursework for the course: Técnicas e Algoritmos em Ciência de Dados

This assignment gives students the opportunity to put what they have learned in class into practice: using decision trees and random forests to solve a real-world classification problem, and exploring unsupervised learning by implementing the K-means algorithm. Students will also practise generating plots during training to analyse the models' behaviour.

## General guidelines:

* This work must be entirely original. You are allowed to research documentation for specific libraries, but copying solutions from the internet or your classmates is strictly prohibited. Any such actions will result in a deduction of points for the coursework.
* Before submitting your work, make sure to rename the file to the random number that you created for the previous coursework (for example, 289479.ipynb).
* Please try not to use any LLM-generated code. This coursework is designed for you to learn crucial concepts. Once you master them, using LLMs becomes much easier.

## Notebook Overview:

1. [Decision Trees](#Decision_Trees) (30%)
2. [Random Forest](#Random_Forest) (30%)
3. [K-Means](#K-Means) (40%)

### Decision_Trees
## Part 1 - Decision Trees for Classification (value: 30%)

In this exercise, you will implement a decision tree that classifies whether a person's income exceeds $50k/yr based on census data (adult_census_subset.csv in ECLASS). The splitting criterion will be the Information Gain, with the Gini Index as the impurity measure. The maximum depth and the minimum number of instances per leaf will be your stopping criteria. Be aware that some of the variables in this dataset are nominal (or categorical).

To complete this exercise, you will write code to build a decision tree for this problem: 

1. Dataset Splitting:
    - Load the provided dataset into your code.
	- Split the dataset into three sets: training, validation, and testing, with a 70/15/15 ratio, respectively. 
2. Implement a function to learn Decision Trees – the main conceptual steps are detailed below:
	- Initialize an empty decision tree.
	- Implement a recursive function to build the decision tree:
        - The stopping conditions for the recursive function are [note: satisfying only one of them is sufficient to stop the recursion]:
            - If the maximum depth is reached, stop growing the tree and create a leaf node with the frequency of the positive class for the remaining instances.
            - If the number of instances at a leaf node is less than the minimum number of instances per leaf, create a leaf node with the frequency of the positive class for those instances.
        - Your code will calculate the Information Gain (based on the Gini Index) for each possible value of each attribute and choose the attribute and value that maximizes the Information Gain (explanation below).
        - Your code will create a new internal node using the chosen attribute and value.
        - Your code will recursively call the build function on each subset of instances created by the split.
3. Implement a classification function that classifies new instances using the decision tree:
	- For each instance, traverse the decision tree by comparing its attribute values to the decision nodes and move down the tree based on the attribute values until a leaf node is reached.
	- Return the frequency of the positive class that is associated with the leaf node as the prediction for the instance.
4. Run your algorithm and evaluate its performance:
	- Call the build function with the training set to construct the decision tree. Vary the maximum depth and the minimum number of instances per leaf to observe their effect on performance: learn each tree on the training set and compute the Area Under the ROC Curve (AUROC) on the validation set to find the best parameters. Try only shallow trees, with a maximum depth of at most 10 and `min_instances` of at least 10; more extreme values may make training too slow.
	- Build a decision tree using the training + validation sets with the best combination of parameters.
	- Calculate the accuracy (threshold: 0.5) and AUROC of the decision tree in the testing set and report them.
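
The 70/15/15 split from step 1 could be produced along the following lines. This is only a sketch: the helper name `split_70_15_15`, the `seed` parameter, and the toy dataframe are assumptions made for illustration, and the real data would come from adult_census_subset.csv instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_70_15_15(df, seed=0):
    """Split a dataframe into train/validation/test parts (70/15/15).

    Hypothetical helper: first hold out 30% of the rows, then cut that
    held-out part in half to obtain the validation and test sets.
    """
    train, rest = train_test_split(df, test_size=0.30, random_state=seed)
    val, test = train_test_split(rest, test_size=0.50, random_state=seed)
    return train, val, test

# Toy stand-in for the census data (assumption, not the real dataset).
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})
train, val, test = split_70_15_15(toy)
print(len(train), len(val), len(test))
```

Fixing `random_state` keeps the three sets identical across runs, which makes it possible to reuse the same splits in Part 2.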

To select the best split at each node you will use the Information Gain based on the Gini Index. The Gini Index measures the impurity of a node in a decision tree. To calculate the Information Gain based on the Gini Index, follow these steps [note: the same is explained in the slides for the case of entropy]:
- For each potential split (feature and value):
	- Calculate the Gini Index for node m (before any splits) using the class distribution within the node, using the following formula:
        - $G_m=\sum_{k=1}^K \hat{p}_{mk} (1-\hat{p}_{mk})$, where $\hat{p}_{mk}$ represents the proportion of instances in the node $m$ that belong to class $k$.
	- Calculate the Gini Index for each possible outcome. This involves the following steps:
        - Split the data based on the attribute's possible outcomes.
        - Calculate the Gini index for each resulting subset using the same formula as for node $m$ above.
	- Calculate the weighted Gini index by summing up the Gini indexes of each subset, weighted by the proportion of instances it represents in the original node. The formula for the weighted Gini index ($W$) is as follows:
        - $W_V=\sum_{v=1}^{V} \frac{|S_v|}{|S|} G_{S_v}$, where $S_v$ is a child node produced by the split and the sum iterates over all $V$ children nodes; $|S_v|$ represents the cardinality of the child node and $|S|$ the cardinality of the node before splitting; $G_{S_v}$ represents the Gini index of the child node.
	- Calculate the information gain by subtracting the weighted Gini index obtained in the previous step from the Gini index of the current node. The formula is as follows:
        - $InformationGain=G_{node}-W_V$
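
As a small numerical check of the formulas above, the sketch below computes the Gini index and the information gain of one candidate split. It assumes a binary split described by a boolean mask (instances going to the left child); the function names `gini` and `information_gain` are illustrative and not prescribed by the coursework.

```python
import numpy as np

def gini(y):
    """Gini index G = sum_k p_k * (1 - p_k) of a label array."""
    if len(y) == 0:
        return 0.0  # convention chosen here for an empty child
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def information_gain(y, left_mask):
    """Gini of the parent minus the size-weighted Gini of the two children."""
    y = np.asarray(y)
    left_mask = np.asarray(left_mask, dtype=bool)
    left, right = y[left_mask], y[~left_mask]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - weighted

y = np.array([1, 1, 1, 0, 0, 0])
perfect_split = np.array([True, True, True, False, False, False])
mixed_split = np.array([True, False, True, False, True, False])

print(information_gain(y, perfect_split))  # 0.5
print(information_gain(y, mixed_split))
```

The parent node has $G = 0.5$; the perfect split produces two pure children, so the whole impurity is recovered as gain, whereas the mixed split leaves both children impure and yields a much smaller gain.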


In [1]:
## your code goes here:

## Random_Forest
## Part 2 - Random Forest for Classification (value: 30%)

In this exercise, you will expand on the previous exercise and implement Random Forests. You will build an ensemble of decision trees and use them for the same classification task from Part 1. The dataset used for this exercise will be the same as in the previous exercise. Your task is to write code to construct a Random Forest model, evaluate its performance, and compare it to the decision tree implementation. 

To complete this exercise, you will write code to implement Random Forest for this problem: 

1. Dataset Splitting: use the same splits you used for Part 1.
2. Implement a function to learn Random Forest – the main steps are detailed below:
	- Initialize an empty Random Forest.
	- Determine the number of decision trees to include in the forest (e.g. 20) and the number of random features to consider, generally `num_features` $\approx\sqrt{p}$, where $p$ is the total number of features.
	- Implement a loop to build the specified number of decision trees:
        - Generate a bootstrap sample from the training set (sampling with replacement).
        - Build a decision tree using the bootstrap sample, using your implementation from Part 1.
        - Add the constructed decision tree to the Random Forest.
3. Implement a classification function that classifies new instances using the Random Forest:
	- For each instance, pass it through all decision trees in the Random Forest and collect the predictions. Note that you should binarize the prediction of each decision tree, that is, use a threshold of 0.5 to determine the actual class label.
	- The prediction for the random forest will be the frequency of the positive class in the predictions collected by all the decision trees.
4. Run your algorithm and evaluate its performance:
	- Call the function to learn the Random Forest with your training set. Vary the parameters of the Random Forest to observe their effect on performance: learn each forest on the training set and compute the Area Under the ROC Curve (AUROC) on the validation set to find the best parameters. Again, keep your trees shallow and do not build many decision trees, as this could increase the training time considerably.
	- Build a Random Forest using the training + validation sets with the best combination of parameters.
	- Classify the instances of the testing set using the Random Forest, calculate the accuracy (threshold: 0.5) and Area Under the ROC Curve (AUROC) and report the results.
5. Experimentation: Compare the performance of Random Forests with the single decision tree implementation from the previous exercise reporting the performance on the test set in a table (either a dataframe or markdown). 
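
Two of the building blocks described above, bootstrap sampling and vote aggregation, can be sketched in isolation as follows. The helper names `bootstrap_indices` and `forest_predict` are hypothetical, the per-tree outputs are made-up numbers, and treating a leaf frequency of exactly 0.5 as the positive class (`>=`) is an assumption, since the text only states the threshold.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Row indices of a bootstrap sample: n draws with replacement."""
    return rng.integers(0, n, size=n)

def forest_predict(tree_probs, threshold=0.5):
    """Combine per-tree outputs into a forest prediction.

    tree_probs has shape (n_trees, n_instances) and holds the leaf
    frequencies returned by each tree. Each tree's output is binarized
    first, then the forest returns the fraction of positive votes.
    """
    votes = (np.asarray(tree_probs) >= threshold).astype(int)
    return votes.mean(axis=0)

rng = np.random.default_rng(0)
idx = bootstrap_indices(10, rng)  # would be used to index the training rows

# Made-up outputs of 4 trees for 3 instances (illustration only).
tree_probs = np.array([
    [0.9, 0.2, 0.6],
    [0.4, 0.1, 0.7],
    [0.8, 0.3, 0.2],
    [0.6, 0.6, 0.9],
])
print(forest_predict(tree_probs))  # [0.75 0.25 0.75]
```

Because the forest output is a fraction of votes, it can be passed directly to `roc_auc_score` and thresholded at 0.5 for the accuracy.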


In [2]:
## your code goes here:

## K-Means
## Part 3 – Clustering with K-means (value: 40%)

In this exercise, you will explore clustering by implementing the K-means algorithm. You will write code to perform K-means clustering while visualizing the movement of the centroids at each iteration. 

To complete this exercise, you will write code to implement K-means for clustering: 

1. Dataset Preparation: Run the cells provided in the notebook that generate the artificial data points for this exercise.
2. K-means Clustering:
	- Initialize K cluster centroids by selecting K points from your dataset at random.
	- Implement a loop to perform the following steps until convergence (or until a specified maximum number of iterations is reached, e.g., 150):
        - Assign each data point to the nearest centroid (you will have to calculate the Euclidean distance between the data point and each centroid).
        - Update each centroid by moving it to the mean of all data points assigned to it.
        - Check for convergence by comparing the new centroids with the previous centroids. If the difference is smaller than $\epsilon=10^{-4}$, exit the loop.
3. Centroid Movement Visualization:
	- At 5 different moments during training, plot a figure showing the centroids and the points. Figure 1 should show the situation at the beginning, before learning. Figure 5 should show the situation at the end of the learning. The remaining Figures 2-4 should show intermediary situations.
	- In each figure, represent each centroid by a large black cross and colour the points according to their respective cluster, using a different colour for each cluster.
4. Sum of squared distances:
	- Along with plotting the centroid movement, calculate the sum of squared distances at each iteration as follows:
        - $\sum_{j=1}^K \sum_{n \in S_j}d(x_n,\mu_j )^2$, where $K$ is the number of clusters, $x_n$ represents the $n^{th}$ datapoint, $n \in S_j$ ranges over the points that belong to cluster $S_j$, $\mu_j$ is the mean of the datapoints in $S_j$ and $d(x_n,\mu_j)$ indicates the Euclidean distance between $x_n$ and $\mu_j$.
	- Make a plot of the sum of squared distances at each iteration. 
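
The assignment step, the sum of squared distances, and the convergence test can each be written in a few lines of NumPy. The sketch below runs on a four-point toy array rather than the generated data. The function names are illustrative, and measuring the "difference" between old and new centroids as the largest Euclidean shift of any centroid is an assumption, since the text does not fix a particular measure.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    # Broadcasting gives an (n_points, n_centroids) distance matrix.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def sum_squared_distances(X, centroids, labels):
    """Sum over all points of the squared distance to the assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def has_converged(old, new, eps=1e-4):
    """True when no centroid moved by eps or more (assumed criterion)."""
    return bool(np.max(np.linalg.norm(new - old, axis=1)) < eps)

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

labels = assign(X_toy, centroids)
ssd = sum_squared_distances(X_toy, centroids, labels)
print(labels, ssd)  # [0 0 1 1] 4.0
```

Each toy point lies at distance 1 from its centroid, so the sum of squared distances is 4. Storing this value at every iteration gives the series to plot.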


In [None]:
# Generate artificial data points
np.random.seed(13)
num_samples = 200
num_features = 2
X = np.random.randn(num_samples, num_features) * 1.5 + np.array([[2, 2]])
X = np.concatenate([X, np.random.randn(num_samples, num_features) * 3 + np.array([[-5, -5]])])
X = np.concatenate([X, np.random.randn(num_samples, num_features) * 2 + np.array([[7, -5]])])

In [None]:
## your code goes here: