# **KogSys-ML-B Introduction to Machine Learning**
## **Ensembles and Evaluation**
---

To set up a conda environment suitable for this notebook, you can use the following console commands:

```bash
conda create -y -n ens-eval python=3.13
conda activate ens-eval
python -m pip install -r requirements.txt
```

**Note**: Conda can consume a lot of disk space when you keep many environments. Consider regularly deleting environments you no longer need and running ``conda clean --all`` to remove unused packages and cached files.

You can also install the requirements for this notebook into an existing environment by running the cell below:

In [1]:
!python -m pip install -q -r requirements.txt

In [2]:
from __future__ import annotations

import numpy as np
import pandas as pd
from numpy.typing import ArrayLike
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

### **Data Preprocessing**

Last time, we worked with a dataset which was already set up to be used with ``scikit-learn``. Today, we will work with a less favorable base and learn to work around it, "wrangling" our raw data into a shape we can work with.

The dataset we will be working with today is the Spotify tracks dataset, which is available on [Hugging Face](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset). Although this dataset is almost usable as-is, we will work with a modified version to practice some basic data transformations that will be helpful in any future Machine Learning task.

With this notebook, you downloaded four files: ``spotify-1.csv``, ``spotify-2.csv``, ``spotify-3.parquet``, and ``spotify-test.csv``:
- ``spotify-1.csv`` and ``spotify-2.csv`` contain the same rows, identified by the column ``"track_id"``, but different columns.
- ``spotify-3.parquet`` contains additional, complete rows, but is saved in a different file format and some column types don't match.
- ``spotify-test.csv`` contains the complete test data. No modifications are needed, but you are not allowed to use this for any purpose during training, only to evaluate the **final** model.

#### **Task: Load ``spotify-1.csv`` and ``spotify-2.csv`` and join them _on_ the column ``"track_id"``**

The resulting DataFrame should have the shape ``(43075, 20)``.
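
As a minimal sketch of such a join, the toy frames below (hypothetical values, not the real files) share a ``track_id`` key but hold different columns. The ``validate="one_to_one"`` argument is an optional safety check added here for illustration; whether the real files satisfy it is an assumption.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values)
left = pd.DataFrame({"track_id": ["a", "b", "c"], "energy": [0.1, 0.5, 0.9]})
right = pd.DataFrame({"track_id": ["c", "a", "b"], "tempo": [120.0, 90.0, 140.0]})

# Join on the shared key; validate raises if a key occurs more than once on either side
joined = pd.merge(left, right, on="track_id", how="inner", validate="one_to_one")
print(joined.shape)
```

The row order of the two inputs does not need to match, since rows are paired by key rather than by position.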

In [79]:
# your code here
spotify_1 = pd.read_csv('spotify-1.csv')
spotify_2 = pd.read_csv('spotify-2.csv')

spotify_data = pd.merge(spotify_1, spotify_2, on='track_id', how='left')
spotify_data.head(2)



| | track_id | danceability | energy | speechiness | acousticness | instrumentalness | liveness | valence | artists | album_name | track_name | popularity | duration_ms | explicit | key | loudness | mode | tempo | time_signature | track_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67hiQr5yocRHh4fheSEzsC | 0.358 | 0.363 | 0.0288 | 0.822 | 0.0 | 0.369 | 0.229 | Seiko Matsuda | My Prelude | いくつの夜明けを数えたら | 27 | 303626 | False | 3 | -7.236 | 1 | 134.649 | 4 | j-idol |
| 1 | 7aUuoq4oMfLxaLa5GVUDHi | 0.59 | 0.578 | 0.0528 | 0.612 | 0.000162 | 0.0837 | 0.264 | KALEO | Way down We Go | Way down We Go | 64 | 219560 | False | 10 | -5.798 | 0 | 81.663 | 4 | alt-rock |


#### **Task: Load ``spotify-3.parquet`` and combine it with your result from the previous task.**

Note that some columns of the new frame will load with the wrong datatype. To save you time on searching, the columns in question are ``"popularity"`` and ``"explicit"``. ``pandas`` will not raise an error when you combine the frames; instead, it silently upcasts the affected columns to a common supertype of both types. Make sure that you change the datatype of these columns to the most _expressive_ one. The resulting DataFrame should have the shape ``(71792, 20)``.

If you name your resulting frame ``df``, you can use the assertions in the cell below to check whether your solution worked.

**Note:** In this case, it doesn't matter whether you change the datatype of the columns _before_ or _after_ combining the two frames. However, it is better practice to do it beforehand and combine only DataFrames with matching types for all columns.
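
The silent upcasting described above can be reproduced on a small scale. The sketch below uses two toy frames with hypothetical values; how the columns are actually stored in ``spotify-3.parquet`` is an assumption here (strings for ``"popularity"``, integers for ``"explicit"``).

```python
import pandas as pd

# Toy frames (hypothetical values): same columns, but stored with different types
a = pd.DataFrame({"popularity": [27, 64], "explicit": [False, True]})
b = pd.DataFrame({"popularity": ["32", "25"], "explicit": [0, 1]})

# Concatenating as-is raises no error, but the columns lose their clean dtypes
mixed = pd.concat([a, b], ignore_index=True)
print(mixed.dtypes)

# Casting beforehand keeps the combined columns properly typed
b = b.astype({"popularity": int, "explicit": bool})
clean = pd.concat([a, b], ignore_index=True)
print(clean.dtypes)
```

Comparing the two printed dtype listings shows why casting before combining is the safer order.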

In [80]:
# your code here
spotify_3 = pd.read_parquet('spotify-3.parquet')
print(spotify_3.shape)
print(spotify_3.head(2))

# Ensure consistent dtypes
spotify_3['popularity'] = spotify_3['popularity'].astype(int)
spotify_3['explicit'] = spotify_3['explicit'].astype(bool)


(28717, 20)
                 track_id                       artists           album_name  \
0  0Yr1TfeacyGFyDe0aWDla9  Étienne Daho;Italoconnection              Virus X   
1  0oOZgg5OXE9ojXfldt4h4P                 Mickie Krause  Finger Im Po Mexiko   

                           track_name popularity  duration_ms  explicit  \
0  Virus X - SAGE Rework - radio edit         32       165386         0   
1             Laudato Si - DJ Version         25       216093         0   

   danceability  energy  key  loudness  mode  speechiness  acousticness  \
0         0.706   0.766    7    -6.055     0       0.0792       0.00513   
1         0.655   0.921    0    -4.415     1       0.0452       0.12500   

   instrumentalness  liveness  valence    tempo  time_signature track_genre  
0            0.0192     0.199    0.661  101.999               4      french  
1            0.0000     0.227    0.842  135.067               4       party  


In [81]:
# Stack the rows; ignore_index avoids duplicate index labels in the combined frame
spotify_data = pd.concat([spotify_data, spotify_3], axis=0, join='inner', ignore_index=True)
print(spotify_data.shape)

(71792, 20)


In [82]:
assert spotify_3["popularity"].dtype == int
assert spotify_3["explicit"].dtype == bool

#### **Task: Filter the columns to only columns which can sensibly contribute to decision-making without overfitting the data.**
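
One possible heuristic for spotting problematic columns, sketched on a toy frame with hypothetical values: string columns whose values are (almost) unique per row behave like identifiers, so a tree can memorize them instead of generalizing. This is only a rule of thumb for string-valued columns; continuous numeric features are often near-unique as well and are not a problem for that reason alone.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "track_id": ["a", "b", "c", "d"],
    "track_name": ["x", "y", "z", "w"],
    "energy": [0.2, 0.2, 0.8, 0.8],
    "track_genre": ["rock", "rock", "pop", "pop"],
})

# Share of distinct values per column; 1.0 means every row has its own value
uniqueness = toy.nunique() / len(toy)
print(uniqueness)

# Columns that are unique per row act like identifiers and invite memorization
id_like = uniqueness[uniqueness == 1.0].index.tolist()
print(id_like)
```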

In [None]:
# your code here
# Drop high-cardinality string columns that would likely skew the outcome of ID3
spotify_data = spotify_data.drop(columns=['artists', 'album_name', 'track_name'])



                 track_id  danceability  energy  speechiness  acousticness  \
0  67hiQr5yocRHh4fheSEzsC         0.358   0.363       0.0288         0.822   

   instrumentalness  liveness  valence  popularity  duration_ms  explicit  \
0               0.0     0.369    0.229          27       303626     False   

   key  loudness  mode    tempo  time_signature track_genre  
0    3    -7.236     1  134.649               4      j-idol  
bool
[False  True]
(71792, 17)


#### **Task: Performing a dataset split – manually**

In most cases, you will do just fine with using ``scikit-learn``'s ``train_test_split`` (or later: ``PyTorch``'s ``random_split``). However, there are some edge cases where you have to handle splitting yourself, so this task teaches you the basics of how to go about this: _Index Lists_.

Essentially, the goal is to list all indices of your data, shuffle that list, and then divide it into chunks whose sizes follow the requested fractions. In the next cell, write your own function which takes an ``ArrayLike`` object and a list of fractions (i.e. ``float``s) as input and returns a list of ``ArrayLike`` objects, one per fraction.

In [None]:
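# your code here
def split_array(data: ArrayLike, frac: list[float]) -> list[ArrayLike]:
    """Shuffle the row indices of ``data`` and cut them into chunks sized by ``frac``."""
    if not np.isclose(sum(frac), 1):
        raise ValueError("frac must sum to one (up to rounding error)")

    # Shuffled index list over all rows
    random_indices = np.random.permutation(len(data))

    # Cumulative cut points; the last chunk receives any rows left over by rounding
    cut_points = np.round(np.cumsum(frac)[:-1] * len(data)).astype(int)
    index_chunks = np.split(random_indices, cut_points)

    # Positional indexing: ``iloc`` for DataFrames/Series, plain indexing for arrays
    if hasattr(data, "iloc"):
        return [data.iloc[chunk] for chunk in index_chunks]
    return [np.asarray(data)[chunk] for chunk in index_chunks]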
        


In [None]:
# your code here

### **Learning an Ensemble**

#### **Task / Baseline: Use ``scikit-learn``'s ``RandomForestClassifier`` to train a random forest of 50 ID3 trees.**

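``scikit-learn``'s trees are CART-based, so setting ``criterion="entropy"`` is the closest match to ID3's information-gain splits. Once the forest is trained, load ``spotify-test.csv`` and pass it to the classifier's ``score`` method. You may need to reorder (``reindex``) the columns of the test DataFrame so that they match those of your training frame.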
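
A minimal sketch of this baseline on synthetic data (the column names, labels, and random seed below are hypothetical stand-ins, not the Spotify frames). It assumes that ``criterion="entropy"`` is an acceptable approximation of ID3, since ``scikit-learn`` does not ship an ID3 implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the training frame (hypothetical columns and labels)
X_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["energy", "tempo", "valence"])
y_train = (X_train["energy"] + X_train["tempo"] > 0).astype(int)

# Test frame with the same columns in a different order
X_test = pd.DataFrame(rng.normal(size=(50, 3)), columns=["valence", "energy", "tempo"])
y_test = (X_test["energy"] + X_test["tempo"] > 0).astype(int)

# 50 trees with entropy-based (information-gain) splits
forest = RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)
forest.fit(X_train, y_train)

# Align the test columns with the training columns before scoring
X_test = X_test.reindex(columns=X_train.columns)
test_accuracy = forest.score(X_test, y_test)
print(test_accuracy)
```

``score`` returns the mean accuracy on the given data, so the result is directly comparable to ``accuracy_score`` in the evaluation section.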

In [None]:
# your code here

#### **Task: Voting, DIY**

Your task is to create your own ensemble of trees, with a twist: each tree should be trained on a random subset (say, $80\%$) of the training data and validated on the rest. You do not need to implement a $k$-fold-like scheme; simply performing a new random split for each tree will suffice. Use the validation scores to create a weighted decision system which takes the validation performance of each individual tree into account. Do not implement the random feature-subspace sampling that is part of the original Random Forest algorithm.

You can use the following class skeleton to help get you started!

**Note:** Focus on the algorithm, not the performance. Trying to implement this decision algorithm while also trying to maximize performance is very difficult, and is not the goal for this course. It is okay if your implementation is both slower and less powerful than the ``scikit-learn`` base – you are just starting out, after all!

**Hint:** To get your bootstrap sample, you can use ``scikit-learn``'s ``resample`` function.

**Hint:** Add up all class predictions using the computed weights and return the class with the maximum score!

**Hint:** Common errors when working with ``numpy`` arrays arise from using no or incorrect ``dtype`` specifications when creating the arrays.
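
The weighted vote from the hints can be sketched in isolation, without any trees. The predictions and weights below are hypothetical; in the forest, the rows would come from each tree's ``predict`` call and the weights from the validation scores.

```python
import numpy as np

# Hypothetical predictions of three trees for four samples (rows = trees)
predictions = np.array([
    ["rock", "pop", "pop", "jazz"],
    ["rock", "rock", "pop", "pop"],
    ["pop", "rock", "jazz", "pop"],
])
# Hypothetical validation accuracies, used as voting weights
weights = np.array([0.9, 0.6, 0.5])

classes = np.unique(predictions)

# scores[c, i] = summed weight of all trees that predicted class c for sample i
scores = np.zeros((len(classes), predictions.shape[1]), dtype=float)
for c, label in enumerate(classes):
    scores[c] = weights @ (predictions == label).astype(float)

# The class with the highest weighted score wins
ensemble_prediction = classes[scores.argmax(axis=0)]
print(ensemble_prediction)  # ['rock' 'rock' 'pop' 'pop']
```

Note the explicit ``dtype=float`` for the score array: an integer array would silently truncate the weights, which is the kind of error the last hint warns about.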

In [None]:
class DIYForest(ClassifierMixin):
    """
    A DIY Random Forest Class. By using ``ClassifierMixin``, some methods are included automatically, as long as ``fit`` and ``predict`` are implemented. Set the following class attributes in the constructor:

    Attributes
    ----------

    M: np.ndarray
        An array of models, in this case DecisionTreeClassifiers. You may also use a list if you aren't comfortable with numpy arrays. Make sure to adjust the type hint in that case.
    w: np.ndarray
        An array of model weights, which will be filled with validation scores during training. If you use arrays, initialize an array of zeros of the same shape as M in the constructor. If you use lists, you can have this grow organically.
    val_size: float
        The fraction of the training data to use for validation, i.e. calculating model weights
    """

    M: np.ndarray
    w: np.ndarray
    val_size: float

    def __init__(self, n_trees: int, val_size: float = 0.2, **tree_params) -> None:
        """
        In the constructor, set the class attributes. Initialize tree objects at this point.

        Parameters
        ----------
        n_trees: int
            How many trees to include in the forest
        val_size: float
            The fraction of the training data to use for validation
        **tree_params: dict
            Parameters to pass on to the tree constructor, i.e. ``DecisionTreeClassifier(**tree_params)``.
        """

        # TODO: your code here

    def fit(self, X: ArrayLike, y: ArrayLike) -> DIYForest:
        """
        Fit each tree in the forest using decision attributes ``X`` and target attribute ``y``.

        Parameters
        ----------
        X: ArrayLike
            training examples (only decision attributes)
        y: ArrayLike
            labels

        Returns
        -------
        self
        """
        
        # TODO: your code here

    def predict(self, X: ArrayLike) -> np.ndarray:
        """
        The ensemble makes predictions for the labels of samples ``X``.

        Parameters
        ----------
        X: ArrayLike
            data samples

        Returns
        -------
        np.ndarray
            Array of shape [X.shape[0],] containing the predictions
        """
        
        # TODO: your code here


In [None]:
# your code here

### **Evaluation**

Finally, let's calculate some of the evaluation metrics for the ``scikit-learn`` model and our own model!

#### **Accuracy**

Accuracy, implemented in ``sklearn.metrics.accuracy_score``, is defined as $$\operatorname{Accuracy}=\frac{\text{TP}+\text{TN}}{|\text{TestSet}|}$$
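
As a quick check of the definition, the sketch below compares a manual computation with ``accuracy_score`` on hypothetical labels. In the multi-class case, the numerator is simply the number of correctly classified samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions
y_true = np.array(["rock", "pop", "pop", "jazz", "rock"])
y_pred = np.array(["rock", "pop", "jazz", "jazz", "pop"])

# Fraction of correct predictions, computed by hand and with scikit-learn
manual = np.mean(y_true == y_pred)
library = accuracy_score(y_true, y_pred)
print(manual, library)
```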

In [None]:
# your code here

#### **Precision**

Precision, implemented in ``sklearn.metrics.precision_score``, is defined as $$\operatorname{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$$
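
The formula can be verified on a small binary example with hypothetical labels. For a multi-class target such as ``"track_genre"``, ``precision_score`` additionally needs an ``average`` argument (for example ``average="macro"``), because its default only handles binary labels.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical binary labels and predictions (1 = positive class)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1])

# Count true and false positives by hand
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))

manual = tp / (tp + fp)
library = precision_score(y_true, y_pred)
print(manual, library)
```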

In [None]:
# your code here

#### **Recall**

Recall, implemented in ``sklearn.metrics.recall_score``, is defined as $$\operatorname{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
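
The same kind of check works for recall, again on hypothetical binary labels. Compared with precision, only the denominator changes: false negatives take the place of false positives.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical binary labels and predictions (1 = positive class)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1])

# Count true positives and false negatives by hand
tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))

manual = tp / (tp + fn)
library = recall_score(y_true, y_pred)
print(manual, library)
```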

In [None]:
# your code here

#### **F1-Score**

F1-Score, implemented in ``sklearn.metrics.f1_score``, is defined as $$\operatorname{F1}=\frac{2\times\text{TP}}{2\times\text{TP}+\text{FP}+\text{FN}}$$
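
For a multi-class target, one common choice is the macro average: the F1 formula is applied per class (treating that class as positive and all others as negative) and the results are averaged. The sketch below uses hypothetical labels; whether macro averaging is the intended choice for this exercise is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical multi-class labels and predictions
y_true = np.array(["rock", "pop", "pop", "jazz", "rock"])
y_pred = np.array(["rock", "pop", "jazz", "jazz", "pop"])

# One F1 value per class, ordered by sorted label: jazz, pop, rock
per_class = f1_score(y_true, y_pred, average=None)

# The macro average is the unweighted mean of the per-class values
macro = f1_score(y_true, y_pred, average="macro")
print(per_class, macro)
```

For example, ``"pop"`` has one true positive, one false positive, and one false negative, which gives $\frac{2}{2+1+1}=0.5$.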

In [None]:
# your code here