### **Tutorial: Explainable AI for Unsupervised Learning**


In this tutorial you will learn to implement the local Feature Importance Score ($lFIS$) of [`ClusterExplainR`](https://doi.org/10.1007/978-3-031-63797-1_3), developed by Amling et al. in 2024.


## **Preamble: Load a Dataset into a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)**

Load the dataset `mocked_dataset.csv` into a Pandas DataFrame using [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and display its contents.
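
As a rough sketch of the loading step: `mocked_dataset.csv` itself is not reproduced here, so the runnable part below reads a tiny in-memory stand-in instead. Its rows are invented for illustration; only the column names (`Group`, `Type`, `Subgroup`, `Cluster`) are taken from the later tasks, and `toy_dataset` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical stand-in for mocked_dataset.csv: the rows are invented,
# only the column names come from the tasks in this tutorial.
toy_csv = io.StringIO(
    "Group,Type,Subgroup,Cluster\n"
    "g1,t1,s1,A\n"
    "g1,t2,s2,A\n"
    "g2,t1,s1,B\n"
    "g2,t1,s3,B\n"
)

# pd.read_csv accepts a file path or any file-like object.
# For the exercise the call would be pd.read_csv("mocked_dataset.csv").
toy_dataset = pd.read_csv(toy_csv)

# head() shows the first rows; in a notebook, a bare `toy_dataset`
# on the last line of a cell renders the table as well.
print(toy_dataset.head())
```
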

In [1]:
import pandas as pd

tutorial_dataset = None

## **Task 1: Visualize Data**
Manually inspect the dataset to identify two important variables to plot on the x and y axes. Use the cluster column to assign colors to the points.

**Tip:** Use the [`sns.catplot`](https://seaborn.pydata.org/generated/seaborn.catplot.html) function.

In [2]:
import seaborn as sns 


## **Task 2: Calculate Shannon Entropy**

Calculate the [Shannon entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) of a given probability distribution $ P $, using the following formula:

$$
H_{\text{Shannon}} = -\sum_{i=1}^{M} P_i \, \log_2 \, P_i
$$

### Implementation Steps:
1. **Using the NumPy log function**:
   - Use the [`numpy.log2`](https://numpy.org/doc/2.1/reference/generated/numpy.log2.html) function for the logarithm and accumulate the sum yourself.

2. **Using the Power of NumPy**:
   - Perform the entire entropy calculation in vectorized form using [`numpy.sum`](https://numpy.org/doc/2.1/reference/generated/numpy.sum.html) and `numpy.log2`.

3. **Using the [`scipy`](https://docs.scipy.org/doc/scipy/index.html) Package**:
   - Use the entropy function from [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html#scipy.stats.entropy) for an alternative approach.

**Result**:  $$ H(\text{probabilities}) \approx 1.74 $$
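
To make the three approaches concrete without giving away the exercise, the sketch below applies them to a different, invented distribution whose entropy is easy to check by hand: $-(0.5 \cdot (-1) + 0.25 \cdot (-2) + 0.25 \cdot (-2)) = 1.5$ bits. The variable names are illustrative, and the sketch assumes strictly positive probabilities, since `np.log2(0)` is not finite.

```python
import numpy as np
from scipy.stats import entropy

# Invented example distribution (not the one used in the task).
p = [0.5, 0.25, 0.25]

# 1) Plain Python loop; only the logarithm comes from NumPy.
h_loop = 0.0
for p_i in p:
    h_loop -= p_i * np.log2(p_i)

# 2) Vectorized: a single array expression replaces the loop.
p_arr = np.asarray(p)
h_vec = -np.sum(p_arr * np.log2(p_arr))

# 3) SciPy: entropy() uses the natural logarithm by default,
#    so base=2 is required to obtain the result in bits.
h_scipy = entropy(p, base=2)

print(h_loop, h_vec, h_scipy)
```

All three values should agree with the hand calculation up to floating-point rounding.
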

In [None]:
import numpy as np
from scipy.stats import entropy

probabilities = [0.5, 0.25, 0.15, 0.1]


def compute_entropy_manual(probabilities):
    """Computes entropy manually, only using np.log2."""
    pass


def compute_entropy_numpy(probabilities):
    """Computes entropy using vectorized NumPy operations (np.sum and np.log2)."""
    pass


def compute_entropy_scipy(probabilities):
    """Computes entropy using scipy.stats.entropy (pass base=2 to get bits)."""
    pass


print(f"entropy manual : {compute_entropy_manual(probabilities)}")
print(f"entropy numpy : {compute_entropy_numpy(probabilities)}")
print(f"entropy scipy : {compute_entropy_scipy(probabilities)}")


## **Task 3: Manually Calculate the Local FIS Score**

For Cluster **B**, calculate the **local FIS (lFIS) values** for all three features: **Group**, **Type**, and **Subgroup**.

The formula to calculate lFIS is as follows:

$$
lFIS(X, B) = 1 - \min\left(H(X_B) \cdot H(X)^{-1}, 1\right)
$$

Where:
- $ H(X) $: Entropy of the feature $X$ over the entire dataset.
- $ H(X_B) $: Entropy of the feature $X$ restricted to the observations in cluster $B$.

**Solutions**:
- $lFIS (\text{'Group', 'B'}) \approx 0.51$
- $lFIS (\text{'Type', 'B'}) \approx 1$
- $lFIS (\text{'Subgroup', 'B'}) \approx 0.27$
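
The lFIS formula can also be tried out on small hand-made distributions before touching the dataset. The sketch below is only an illustration: `shannon_entropy` and `lfis_from_distributions` are hypothetical helper names, the distributions are invented, and the code assumes strictly positive probabilities and $H(X) > 0$.

```python
import numpy as np


def shannon_entropy(probabilities):
    """Shannon entropy in bits; assumes strictly positive probabilities."""
    p = np.asarray(probabilities, dtype=float)
    return float(-np.sum(p * np.log2(p)))


def lfis_from_distributions(p_cluster, p_population):
    """Apply lFIS = 1 - min(H(X_B) / H(X), 1); assumes H(X) > 0."""
    h_cluster = shannon_entropy(p_cluster)
    h_population = shannon_entropy(p_population)
    return 1.0 - min(h_cluster / h_population, 1.0)


# Case 1: the cluster is more concentrated than the population.
# H(X) = 1 bit, H(X_B) is roughly 0.469 bits, so lFIS is roughly 0.531.
lfis_concentrated = lfis_from_distributions([0.9, 0.1], [0.5, 0.5])

# Case 2: the cluster is more disordered than the population.
# The ratio exceeds 1, min(...) clips it to 1, and lFIS becomes 0.
lfis_clipped = lfis_from_distributions([0.5, 0.5], [0.8, 0.2])

print(lfis_concentrated, lfis_clipped)
```

The second case shows why the formula contains the $\min$: without it, a cluster with higher entropy than the population would produce a negative score.
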

## **Task 4: Implement a Function to Calculate Local FIS**

Write a Python function to calculate the **local FIS** for a given cluster and feature, using the following steps:

1. Compute the entropy of the entire population for the specified feature $ H(\text{population}) $.
2. Compute the entropy of the feature within the given cluster $ H(\text{cluster}) $.
3. Handle edge cases:
   - If $ H(\text{cluster}) = 0 $, return ?.
   - If $ H(\text{population}) = 0 $, return ?.

### Function Parameters:
- `clustered_data`: A Pandas DataFrame containing the dataset.
- `cluster_id`: The specific cluster ID to calculate the FIS for.
- `feature_name`: The name of the feature for which the FIS is calculated.
- `cluster_column_name` (default: `"Cluster"`): The name of the column that identifies clusters in the dataset.

**Tips**:
- To verify your results, compare them with the solutions given in **Task 3**.
- To calculate the value distributions, use [`pd.Series.value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html).
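
As a small illustration of the second tip, the sketch below derives the value distribution of one feature for the whole population and for a single cluster. The data is invented; only the column names mirror the tutorial dataset. `normalize=True` makes `value_counts` return relative frequencies instead of raw counts.

```python
import pandas as pd

# Invented toy data; only the column names mirror the tutorial dataset.
toy = pd.DataFrame(
    {
        "Type": ["t1", "t2", "t1", "t1", "t2", "t1"],
        "Cluster": ["A", "A", "A", "B", "B", "B"],
    }
)

# Relative frequency of each value over the whole population.
population_distribution = toy["Type"].value_counts(normalize=True)

# The same, restricted to the rows of one cluster via a boolean mask.
in_cluster_b = toy["Cluster"] == "B"
cluster_distribution = toy.loc[in_cluster_b, "Type"].value_counts(normalize=True)

print(population_distribution)
print(cluster_distribution)
```

Each resulting Series sums to 1 and can be passed on to an entropy function.
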

In [None]:
from typing import Any


def calculate_local_FIS(
    clustered_data: pd.DataFrame,
    cluster_id: Any,
    feature_name: str,
    cluster_column_name: str = "Cluster",
) -> float:
    """
    Calculates the local Feature Importance Score (lFIS) for a specific feature and cluster.

    The local FIS measures the importance of a feature to a given cluster by comparing
    the feature's entropy within the cluster to its entropy in the entire dataset.

    Args:
        clustered_data (pd.DataFrame): The dataset containing the features and cluster information.
        cluster_id (Any): The ID of the cluster to calculate the local FIS for.
        feature_name (str): The name of the feature to calculate the local FIS for.
        cluster_column_name (str, optional): The column name identifying clusters in the dataset.
            Defaults to "Cluster".

    Returns:
        float: The local FIS value for the specified feature and cluster. 
    """
    pass

print(f"Importance of `Type` to Cluster `B`: {calculate_local_FIS(tutorial_dataset, 'B', 'Type')}")  # 1
print(f"Importance of `Group` to Cluster `B`: {calculate_local_FIS(tutorial_dataset, 'B', 'Group')}")  # ~0.51
print(f"Importance of `Subgroup` to Cluster `B`: {calculate_local_FIS(tutorial_dataset, 'B', 'Subgroup')}")  # ~0.27

## **Task 5: Use the `calculate_local_FIS` Function**

Use the `calculate_local_FIS` function to compute the **local FIS values** for all features, given a specific cluster.

#### **Output Requirement**:
The result should be a `pd.DataFrame` with the following columns:
- `Feature`: The name of the feature for which the lFIS was calculated.
- `lFIS`: The computed local FIS value.
- `Cluster`: The cluster ID used in the calculation.
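
One possible way to assemble a table in this layout is to collect one record per feature and hand the list to `pd.DataFrame`. The sketch below is an assumption-laden illustration, not the reference solution: `score_all_features` and `dummy_score` are hypothetical names, `dummy_score` merely stands in for `calculate_local_FIS` so the block runs on its own, the data is invented, and the descending sort order is an assumption.

```python
import pandas as pd


def score_all_features(data, cluster_id, score_fn, cluster_column_name="Cluster"):
    """Apply score_fn to every non-cluster column and tabulate the results."""
    feature_names = [c for c in data.columns if c != cluster_column_name]
    records = [
        {
            "Feature": feature,
            "lFIS": score_fn(data, cluster_id, feature, cluster_column_name),
            "Cluster": cluster_id,
        }
        for feature in feature_names
    ]
    # Assumption: highest score first.
    return pd.DataFrame(records).sort_values(
        "lFIS", ascending=False, ignore_index=True
    )


def dummy_score(data, cluster_id, feature_name, cluster_column_name):
    """Hypothetical stand-in scorer: 1.0 if the feature is constant in the cluster."""
    in_cluster = data[cluster_column_name] == cluster_id
    return float(data.loc[in_cluster, feature_name].nunique() == 1)


# Invented toy data.
toy = pd.DataFrame(
    {
        "Group": ["g1", "g2", "g2", "g2"],
        "Type": ["t1", "t2", "t1", "t2"],
        "Cluster": ["A", "A", "B", "B"],
    }
)

result = score_all_features(toy, "B", dummy_score)
print(result)
```
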


In [5]:
from typing import Any


def calculate_local_FIS_for_all_features(
    clustered_data: pd.DataFrame, cluster_id: Any, cluster_column_name: str = "Cluster"
) -> pd.DataFrame:
    """
    Calculates the local FIS (Feature Importance Score) for all features within a given cluster.

    This function computes the lFIS for each feature in the dataset (excluding the cluster column)
    for the specified cluster, and returns a sorted DataFrame containing the results.

    Args:
        clustered_data (pd.DataFrame): The dataset containing the features and cluster information.
        cluster_id (Any): The ID of the cluster to calculate local FIS values for.
        cluster_column_name (str, optional): The column name identifying clusters in the dataset.
            Defaults to "Cluster".

    Returns:
        pd.DataFrame: A DataFrame containing the following columns:
            - Feature: The name of each feature.
            - lFIS: The local FIS value for each feature.
            - Cluster: The cluster ID used in the calculation.
    """
    pass

calculate_local_FIS_for_all_features(tutorial_dataset, 'B')

## **Task 6: Calculate the Global FIS Score**
Complete the function below to calculate the **global FIS score** ($gFIS$) for each feature in the dataset. The **global FIS score** aggregates a feature's local FIS values across all clusters.

#### **Output**:
The result should be a `pd.DataFrame` with the following columns:
- `Feature`: The name of the feature.
- `mean`: The mean local FIS score across all clusters ($gFIS$).
- `min`: The minimum local FIS score across all clusters.
- `max`: The maximum local FIS score across all clusters.

### **Steps**:
1. *Loop through all clusters*: Iterate over the unique cluster IDs.
2. *Calculate local FIS for all features*: For each cluster, use the `calculate_local_FIS_for_all_features` function.
3. *Combine the results*: Use [`pd.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) to merge the local FIS results from all clusters into a single DataFrame.
4. *Group and aggregate*: Use [`pd.DataFrame.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) and [`pd.DataFrame.agg`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) to calculate the mean ($gFIS$), minimum, and maximum of the local FIS values per feature.
5. *Sort the DataFrame*: Sort the resulting DataFrame by the $gFIS$ value.
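
Steps 3 to 5 can be sketched on invented per-cluster results, independently of how the local scores were obtained. The numbers below are made up for illustration, and sorting in descending order is an assumption, since the steps do not state a direction.

```python
import pandas as pd

# Invented per-cluster results in the Feature / lFIS / Cluster layout.
per_cluster = [
    pd.DataFrame({"Feature": ["Group", "Type"], "lFIS": [0.2, 1.0], "Cluster": "A"}),
    pd.DataFrame({"Feature": ["Group", "Type"], "lFIS": [0.6, 0.5], "Cluster": "B"}),
]

# Step 3: stack the per-cluster frames into one long table.
combined = pd.concat(per_cluster, ignore_index=True)

# Steps 4 and 5: one row per feature with the mean (gFIS), min and max
# of its lFIS values, sorted by the mean (descending is an assumption).
summary = (
    combined.groupby("Feature")["lFIS"]
    .agg(["mean", "min", "max"])
    .reset_index()
    .sort_values("mean", ascending=False, ignore_index=True)
)

print(summary)
```

Here `Type` ends up first with a mean of 0.75, followed by `Group` with a mean of 0.4.
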


In [6]:
def calculate_global_FIS(
    clustered_data: pd.DataFrame, cluster_column_name: str = "Cluster"
) -> pd.DataFrame:
    """
    Calculates the global FIS for all features across all clusters.

    This function computes the local FIS values for all features within each cluster,
    aggregates these values across clusters, and computes summary statistics (mean, min, max)
    for each feature.

    Args:
        clustered_data (pd.DataFrame): The dataset containing features and cluster information.
        cluster_column_name (str, optional): The column name identifying clusters in the dataset.
            Defaults to "Cluster".

    Returns:
        pd.DataFrame: A DataFrame containing the following columns:
            - Feature: The name of each feature.
            - mean: The mean local FIS value across all clusters.
            - min: The minimum local FIS value across all clusters.
            - max: The maximum local FIS value across all clusters.
    """
    pass


calculate_global_FIS(tutorial_dataset)