# Exercise - Analyzing Real-World Datasets

<div class="alert alert-block alert-success">
The following practical examples show how to detect and visualize outliers and how to compute feature correlations.

## <a id="sec_toc">Content</a>

[Learning Objectives](#sec_0)

[a) Computation of Basic Statistics and Z-score method to detect outliers](#sec_a)

[b) Analysis of Feature Correlation](#sec_b)

[c) Creation of Representative Training and Test Datasets and Different Similarity Measures](#sec_c)

### <a id="sec_0">Learning Objectives</a>

* Understand the Z-score method to detect outliers.
* Apply a threshold value and understand how it influences outlier detection.
* Interpret what a correlation matrix reveals about relationships in the data. 
* Know the difference between correlation and causality. 
* Understand k-means clustering.
* Apply the Silhouette score and similarity measures, such as Kullback-Leibler divergence.

In [None]:
# IMPORT THE NEEDED LIBRARIES - DO NOT CHANGE

import pandas as pd
from scipy.stats import zscore
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from scipy.stats import entropy
from sklearn.model_selection import train_test_split
from scipy.stats import norm
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import LabelEncoder
import warnings

warnings.filterwarnings("ignore")

## <a id="sec_a">a) Computation of Basic Statistics and Z-score method to detect outliers</a>

This chapter covers the computation of basic statistics and a common method for outlier detection. 

### Statistical Measures: Mean, Variance, Standard Deviation, and Median

For the following statistics, we assume that the underlying distributions are Gaussian.

Below are the formulas for some key statistical measures that help us understand the characteristics of a dataset.

1. **Mean**: The average value of a dataset, calculated by summing all values and dividing by the total count.

   $$
   \text{Mean} = \frac{1}{N} \sum_{i=1}^N x_i
   $$

2. **Variance**: A measure of how spread out the values in a dataset are, showing the average squared deviation from the mean.

   $$
   \text{Variance} = \frac{1}{N} \sum_{i=1}^N (x_i - \text{Mean})^2
   $$

3. **Standard Deviation**: The square root of the variance, providing a measure of spread in the same units as the data.

   $$
   \text{Standard Deviation} = {\sigma} = \sqrt{\text{Variance}} = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \text{Mean})^2}
   $$

4. **Median**: The middle value of a sorted dataset, which separates the dataset into two equal halves.

   - **If $ N $ (number of observations) is odd**:
  
     $$
     \text{Median} = x_{\left(\frac{N+1}{2}\right)}
     $$
     

   - **If $ N $ is even**:
  
     $$
     \text{Median} = \frac{x_{\left(\frac{N}{2}\right)} + x_{\left(\frac{N}{2} + 1\right)}}{2}
     $$
     

These measures provide insights into the central tendency (mean, median) and the spread (variance, standard deviation) of a dataset.
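
As a quick check of the formulas above, the following sketch evaluates them on a small made-up sample. One point worth flagging: NumPy's `var` and `std` divide by $N$ by default (`ddof=0`), matching the formulas above, whereas pandas' `var` and `std` default to `ddof=1` and divide by $N-1$, so the two libraries can report slightly different values for the same data.

```python
import numpy as np
import pandas as pd

# Made-up toy sample (not taken from the air-quality dataset)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x = np.array(values)

mean = x.mean()          # (1/N) * sum of x_i
variance = x.var()       # NumPy default ddof=0: divides by N
std_dev = x.std()        # square root of the variance above
median = np.median(x)    # N is even: average of the two middle values

# pandas defaults to ddof=1 (divides by N - 1); pass ddof=0 to match the formulas
sample_variance = pd.Series(values).var()
population_variance = pd.Series(values).var(ddof=0)

print(mean, variance, std_dev, median)
print(sample_variance, population_variance)
```

For this sample the mean is 5, the variance is 4, the standard deviation is 2, and the median is 4.5, while the pandas default variance is $32/7 \approx 4.57$.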

### Z-score Outlier Detection

Z-score outlier detection identifies data points that deviate significantly from the mean by measuring how many standard deviations away they are. The Z-score is calculated as:

$$
\text{Z-score} = \frac{x - \mu}{\sigma}
$$

A common threshold (e.g., $|Z| > 3$) flags points as outliers if they lie far from the mean, helping to detect anomalies in normally distributed data.

Note that the Z-score method can also flag data points that are genuine and important for keeping the dataset representative. Its purpose is to remove erroneous or anomalous data points, so flagged points should be inspected before they are discarded.
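
The effect of the threshold can be sketched on a small made-up sample. The data below are an assumption for illustration: most readings equal 10, a few deviate moderately, and one reading of 30 is clearly suspicious. `scipy.stats.zscore` uses the standard deviation with `ddof=0` by default, which matches the formula above.

```python
import numpy as np
from scipy.stats import zscore

# Made-up toy data: 15 readings of 10, four moderate deviations, one extreme value
x = np.array([10.0] * 15 + [5.0, 15.0, 8.0, 12.0, 30.0])

z = zscore(x)                          # (x - mean) / std, with ddof=0
z_manual = (x - x.mean()) / x.std()    # same computation written out

flagged_at_3 = x[np.abs(z) > 3]        # conventional threshold
flagged_at_1 = x[np.abs(z) > 1]        # much stricter threshold

print("Flagged at |Z| > 3:", flagged_at_3)
print("Flagged at |Z| > 1:", flagged_at_1)
```

With the threshold at 3 only the value 30 is flagged. Lowering the threshold to 1 additionally flags the value 5, which may well be a legitimate reading.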

In [None]:
# DO NOT CHANGE

def read_file_into_dataset(filename):
    # Read the CSV file with pandas, keeping headers
    dataset = pd.read_csv(filename, encoding="utf-8")
    return dataset

def print_all_cities_with_mapping(dataset):
    """
    Prints all unique city names from the dataset and assigns a unique number to each city.

    Parameters:
    - dataset: pandas DataFrame, the original dataset (no cleaning or processing applied).
    
    Returns:
    - city_mapping: dict, a mapping of city names to unique numbers.
    """
    if 'City' not in dataset.columns:
        raise ValueError("The dataset does not contain a 'City' column.")
    
    # Get the unique city names
    unique_cities = dataset['City'].unique()
    
    # Create a mapping of city names to unique numbers
    city_mapping = {city: idx for idx, city in enumerate(unique_cities)}
    
    print("List of all cities in the dataset and their mappings:")
    for city, number in city_mapping.items():
        print(f"{number}: {city}")
    
    return city_mapping

# Specific for "UrbanAirQualityandHealthImpactDataset.csv"
def clean_dataframe(dataframe, columns_to_remove):
    # Drop specified columns
    df_cleaned = dataframe.drop(columns=columns_to_remove, errors='ignore')
    
    # Convert boolean columns to 0 and 1
    boolean_columns = df_cleaned.select_dtypes(include='bool').columns
    df_cleaned[boolean_columns] = df_cleaned[boolean_columns].astype(int)
    
    # Encode City column to numerical values starting from 1
    city_codes, unique_cities = pd.factorize(df_cleaned['City'])
    df_cleaned['City'] = city_codes + 1  # Adjust index to start from 1
    
    # Create dictionary with city names as keys and their corresponding numerical values as values
    city_mapping = {unique_cities[i]: i + 1 for i in range(len(unique_cities))}
    
    return df_cleaned, city_mapping

def filter_numeric_columns_and_cities(dataset, cities_to_keep=None, city_mapping=None, keep_all_cities=False):
    """
    Filters the dataset to include only numeric columns and rows corresponding to specified cities, 
    or all cities if `keep_all_cities` is True. Always keeps the city labels as per the provided `city_mapping`.

    Parameters:
    - dataset: pandas DataFrame, the dataset to filter.
    - cities_to_keep: list, a list of city names to keep (ignored if `keep_all_cities` is True).
    - city_mapping: dict, mapping of city names to unique numeric labels.
    - keep_all_cities: bool, whether to keep all cities (default is False).

    Returns:
    - numeric_dataset: DataFrame, a numeric dataset with city labels processed as per `city_mapping`.
    """
    if city_mapping is None:
        raise ValueError("A valid city_mapping must be provided.")

    # Ensure the dataset has a valid 'City' column
    if 'City' not in dataset.columns:
        raise ValueError("The dataset does not contain a 'City' column.")

    dataset = dataset.copy()  # Explicitly create a copy to avoid warnings

    # Filter the rows first unless all cities should be kept
    if not keep_all_cities:
        if not cities_to_keep:
            raise ValueError("Please provide city names to keep when `keep_all_cities` is False.")
        dataset = dataset[dataset['City'].isin(cities_to_keep)].copy()

    # Map city names to numeric labels; check for unmapped names before overwriting the column,
    # so that the error message can report the original city names
    city_labels = dataset['City'].map(city_mapping)
    if city_labels.isna().any():
        unmapped_cities = dataset.loc[city_labels.isna(), 'City'].unique()
        raise ValueError(f"The following cities were not found in the city_mapping: {unmapped_cities}")
    dataset['City'] = city_labels

    # Attempt to convert all columns to numeric, forcing non-convertible values to NaN
    dataset_converted = dataset.apply(pd.to_numeric, errors='coerce')

    # Drop every column that contains at least one NaN (non-numeric entries or missing values)
    numeric_dataset = dataset_converted.dropna(axis=1)

    # Downcast remaining values to appropriate numeric types (float or int)
    numeric_dataset = numeric_dataset.apply(pd.to_numeric, downcast='float')

    return numeric_dataset



def compute_basic_statistics(dataset):
    # Note: pandas std() and var() default to ddof=1 (sample estimates, dividing by N - 1)
    stats = {
        "mean": dataset.mean(),
        "std_dev": dataset.std(),
        "variance": dataset.var(),
        "median": dataset.median()
    }
    # Convert to DataFrame and format with no scientific notation
    stats_df = pd.DataFrame(stats)
    pd.options.display.float_format = '{:,.2f}'.format  # Set global display option for floating-point numbers
    return stats_df

def reduce_to_selected_columns(dataset, columns_to_keep=None):
    """
    Reduces the dataset to only the specified columns. 
    If no columns are specified, it defaults to the first two columns.

    Parameters:
    - dataset: pandas DataFrame, the dataset to reduce.
    - columns_to_keep: list of str, the column headers to keep (default is None).

    Returns:
    - reduced_dataset: DataFrame, the dataset with only the specified columns.
    """
    if columns_to_keep is not None:
        # Keep only the columns specified in columns_to_keep
        reduced_dataset = dataset[columns_to_keep]
    else:
        # Default to the first two columns if no columns are specified
        reduced_dataset = dataset.iloc[:, :2]
    
    return reduced_dataset


def shuffle_dataset(dataset, seed=42):
    # Shuffle the dataset rows with a fixed random seed
    shuffled_dataset = dataset.sample(frac=1, random_state=seed).reset_index(drop=True)
    return shuffled_dataset

def detect_outliers_z_score(dataset, threshold=3):
    """
    Detects outliers in the dataset based on the Z-score method.

    Parameters:
    - dataset: pandas DataFrame, the dataset to analyze.
    - threshold: float, the Z-score threshold to identify outliers (default is 3).

    Returns:
    - outliers: DataFrame, a boolean DataFrame indicating outliers (True for outliers).
    """
    # Calculate Z-scores for each numeric column
    z_scores = dataset.apply(zscore)
    
    # Identify outliers where |Z| > threshold
    outliers = (z_scores.abs() > threshold)
    
    return outliers

def remove_outliers(dataset, outliers):
    """
    Removes rows from the dataset where any feature is marked as an outlier.

    Parameters:
    - dataset: pandas DataFrame, the dataset to remove outliers from.
    - outliers: pandas DataFrame, a boolean DataFrame indicating outliers (True for outliers).

    Returns:
    - cleaned_dataset: DataFrame, the dataset with outliers removed.
    """
    # Identify rows where any column is marked as an outlier
    rows_with_outliers = outliers.any(axis=1)
    
    # Remove these rows from the dataset
    cleaned_dataset = dataset[~rows_with_outliers].reset_index(drop=True)
    
    return cleaned_dataset

def visualize_data_with_outliers_histogram(dataset, outliers, selected_columns):
    # Create one histogram subplot per selected column (expects at least two columns)
    fig, axs = plt.subplots(1, len(selected_columns), figsize=(14, 6))
    fig.suptitle("Histogram of Data with Outliers Highlighted", fontsize=16)
    
    # Plot histograms for each selected column
    for i, column in enumerate(selected_columns):
        bins = 20
        data_min, data_max = dataset[column].min(), dataset[column].max()
        bin_edges = np.linspace(data_min, data_max, bins + 1)
        
        # Histogram for the full data (blue) and outliers (red)
        axs[i].hist(dataset[column], bins=bin_edges, color='blue', alpha=0.7, label='Data')
        axs[i].hist(dataset[column][outliers[column]], bins=bin_edges, color='red', alpha=0.7, label='Outliers')
        
        axs[i].set_xlabel(column)
        axs[i].set_ylabel("Frequency")
        axs[i].set_title(f"Histogram of {column} with Outliers Highlighted")
        axs[i].legend()

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjust layout to fit title and avoid overlapping
    plt.show()

def calculate_whis_for_zscore(dataset, column, threshold=3.0):
    """
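    Calculate a 'whis' multiplier for the boxplot such that whis * IQR equals threshold * standard deviation.
    Boxplot whiskers extend from the quartiles rather than from the mean, so the match with the
    Z-score rule is approximate.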

    Parameters:
        dataset (DataFrame): The dataset containing the data.
        column (str): The column for which to calculate the whis value.
        threshold (float): The Z-score threshold.

    Returns:
        float: The equivalent whis parameter for the boxplot.
    """
    q1 = dataset[column].quantile(0.25)
    q3 = dataset[column].quantile(0.75)
    iqr = q3 - q1
    std_dev = dataset[column].std()

    # Convert Z-score threshold to an IQR multiplier
    whis_equivalent = (threshold * std_dev) / iqr
    return float(whis_equivalent)

def visualize_data_with_outliers_boxplot_aligned(dataset, outliers, selected_columns, threshold=3.0):
    """
    Visualizes the dataset with boxplots for selected columns, highlighting outliers based on a Z-score threshold.

    Parameters:
        dataset (DataFrame): The dataset to visualize.
        outliers (DataFrame): Boolean DataFrame marking outliers per column (currently unused; the fliers follow from `whis`).
        selected_columns (list): Columns to include in the visualization.
        threshold (float): The Z-score threshold for outlier detection.
    """
    # Create a figure for boxplots of the selected columns
    fig, axs = plt.subplots(1, len(selected_columns), figsize=(12, 6))
    fig.suptitle("Boxplot of Data with Outliers Highlighted", fontsize=16)

    for i, column in enumerate(selected_columns):
        # Calculate the whis equivalent for the Z-score threshold
        whis = calculate_whis_for_zscore(dataset, column, threshold=threshold)
        
        # Plot the boxplot using the dynamically calculated whis
        sns.boxplot(data=dataset, x=column, ax=axs[i], color="lightblue", fliersize=5, whis=whis)
        
        # Customize plot aesthetics
        axs[i].set_title(f"Boxplot of {column} with Outliers (Threshold: |Z| > {threshold})")

    plt.tight_layout()
    plt.show()

def visualize_data_with_outliers_violin(dataset, outliers, selected_columns):
    # Create a figure for violin plots of the selected columns
    fig, axs = plt.subplots(1, len(selected_columns), figsize=(12, 6))
    fig.suptitle("Violin Plot of Data with Outliers Highlighted", fontsize=16)

    for i, column in enumerate(selected_columns):
        # Violin plot for the main data
        sns.violinplot(y=dataset[column], ax=axs[i], color="lightblue", inner="quartile")

        # Overlay the outliers as red crosses at x=0 (the violin's centre line), at their actual values
        outlier_values = dataset[column][outliers[column]]
        axs[i].scatter(np.zeros_like(outlier_values), outlier_values, color='red', marker='x', label='Outliers')

        axs[i].set_ylim(bottom=-1)  # Fix the lower y-limit at -1 (assumes the column has no values below -1)
        axs[i].set_title(f"Violin Plot of {column} with Outliers Highlighted")

        # Custom legend
        data_patch = mpatches.Patch(color='lightblue', label='Data')
        outlier_patch = mpatches.Patch(color='red', label='Outliers')
        axs[i].legend(handles=[data_patch, outlier_patch])

    plt.tight_layout()
    plt.show()

def check_value(x, name):
    if x is None:
        print(f"{name} is still None. Please enter a valid value for {name}.")
        raise Exception(f"Execution stopped because {name} is not defined.")


In [None]:
# DO NOT CHANGE
path = "urban_air_quality_and_health.csv"
dataset = pd.read_csv(path)

city_mapping = print_all_cities_with_mapping(dataset) # Print all cities

<u>TASK:</u> In the code cell below, choose two cities that will be used for some of the remaining computations in this notebook. \
If you want to change them later, always do so here.

In [None]:
city1 = None      # Change None to any of the printed city names
city2 = None      # Change None to any of the printed city names

## DO NOT CHANGE THE CODE BELOW
check_value(city1, "city1")
check_value(city2, "city2")
cities_to_keep = [city1, city2]

The next code block removes columns that are unsuitable for numeric computations (for example, string columns), then computes and prints some basic statistics.

In [None]:
# DO NOT CHANGE

# Remove non-relevant columns
columns_to_remove = [
    'datetime', 'preciptype', 'sunrise', 'sunset', 'conditions', 'description', 
    'icon', 'stations', 'source', 'Season', 'Day_of_Week', 'precip', 'precipprob', 
    'precipcover', 'moonphase'
]
dataset_cleaned = dataset.drop(columns=columns_to_remove, errors='ignore')

# Filter dataset to numeric columns and specified cities
allcities_numeric_dataset = filter_numeric_columns_and_cities(
    dataset_cleaned,
    keep_all_cities=True,
    city_mapping=city_mapping
)

numeric_dataset = filter_numeric_columns_and_cities(
    dataset_cleaned,
    cities_to_keep=cities_to_keep,
    city_mapping=city_mapping
)

# Compute basic statistics
stats = compute_basic_statistics(numeric_dataset)
print("Basic Statistics:")
print(stats)


<u>TASK:</u> In the code cell below, choose two columns for the plots and computations in the rest of the notebook. \
If you want to change them later, always do so here and rerun the notebook from this cell. \
The column 'City' is always kept, so set neither `col1` nor `col2` to 'City'.

In [None]:
col1 = None    # Change None to any column included in the rows of basic statistics, e.g. "tempmax", "tempmin", or "humidity"
col2 = None    # Change None to any column included in the rows of basic statistics, e.g. "pressure", "windgust", or "windspeed"

## DO NOT CHANGE THE CODE BELOW
check_value(col1, "col1")
check_value(col2, "col2")

<u>TASK:</u> In the code cell below, set the threshold for the Z-score outlier detection \
used in the remaining computations of chapters a) and b).

In [None]:
threshold = None  # CHANGE threshold from None to any positive number, default value is usually 3.0

## DO NOT CHANGE THE CODE BELOW
check_value(threshold, "threshold")

The next code block visualizes the distributions of the selected columns and their outliers.

In [None]:
# DO NOT CHANGE

# Reduce to selected columns
reduced_allcities_numeric_dataset = reduce_to_selected_columns(allcities_numeric_dataset, ['City', col1, col2])

# Shuffle dataset (not necessary)
shuffled_reduced_allcities_numeric_dataset = shuffle_dataset(reduced_allcities_numeric_dataset)

# Compute and print the outliers
outliers = detect_outliers_z_score(shuffled_reduced_allcities_numeric_dataset, threshold=threshold)
print("Outliers Detected:")
print(outliers)

# Plot the outliers in a histogram, boxplot, and violin plot
selected_columns = [col1, col2]
visualize_data_with_outliers_histogram(shuffled_reduced_allcities_numeric_dataset, outliers, selected_columns)



# Boxplot with whiskers scaled to the chosen Z-score threshold
visualize_data_with_outliers_boxplot_aligned(
    shuffled_reduced_allcities_numeric_dataset,
    outliers,
    selected_columns,
    threshold=threshold
)
visualize_data_with_outliers_violin(shuffled_reduced_allcities_numeric_dataset, outliers, selected_columns)

# Remove outliers
cleaned_numeric_dataset = remove_outliers(shuffled_reduced_allcities_numeric_dataset, outliers)

# Final dataset verification
print("Cleaned Numeric Dataset:")
print(cleaned_numeric_dataset)


[Back](#sec_toc)

## <a id="sec_b">b) Analysis of Feature Correlation</a>

In this chapter we look at how strongly the different features correlate. \
We will also group the data with the k-means clustering method.


### Correlation Matrix

**Correlation** is a statistical measure that quantifies the strength and direction of a linear relationship between two variables. It is expressed as a value between $-1$ and $+1$, where:
- $+1$: Perfect positive correlation, indicating that as one variable increases, the other increases proportionally.
- $0$: No linear relationship between the variables.
- $-1$: Perfect negative correlation, indicating that as one variable increases, the other decreases proportionally.

It is important to note that correlation does not imply causation. A correlation between two variables means that changes in one variable are statistically associated with changes in the other, but it does not mean that one variable directly causes the other. The association can instead stem from a common cause that influences both variables (a confounder), from causation in the opposite direction, or from chance.
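
The difference between correlation and causation can be sketched with simulated data. In the hypothetical setup below, a hidden factor drives two observed variables `x` and `y` that have no direct influence on each other. All names and numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Hypothetical setup: one hidden factor drives both observed variables
hidden = rng.normal(size=n)
x = hidden + 0.5 * rng.normal(size=n)
y = hidden + 0.5 * rng.normal(size=n)

# x and y are strongly correlated although neither causes the other
corr_xy = pd.Series(x).corr(pd.Series(y))

# After subtracting the hidden factor, only independent noise remains
corr_residual = pd.Series(x - hidden).corr(pd.Series(y - hidden))

print(f"corr(x, y) = {corr_xy:.2f}")
print(f"corr after removing the hidden factor = {corr_residual:.2f}")
```

The first correlation is large, while the second is close to zero: the association between `x` and `y` is produced entirely by the shared hidden factor.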

A **correlation matrix** is a table showing the correlation coefficients between variables in a dataset, providing a compact representation of their pairwise relationships. Each cell in the matrix contains the correlation value for a specific pair of variables, helping identify trends or dependencies within the data.
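
A minimal sketch of a correlation matrix on made-up data, using `DataFrame.corr()` (which computes the Pearson correlation by default). The column names are hypothetical. The `squared` column is fully determined by `x`, yet its correlation with `x` is zero, because the relationship is not linear.

```python
import numpy as np
import pandas as pd

x = np.arange(-5, 6, dtype=float)  # -5, -4, ..., 5 (symmetric around 0)

df = pd.DataFrame({
    "x": x,
    "linear_up": 2 * x + 1,    # perfectly linear, increasing
    "linear_down": -3 * x,     # perfectly linear, decreasing
    "squared": x ** 2,         # deterministic but non-linear
})

corr = df.corr()  # pairwise Pearson correlation coefficients
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so in practice only the values above or below the diagonal need to be read.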

### Silhouette Score for k-Means Clustering

The silhouette score measures how similar a data point is to its own cluster compared to other clusters, helping to assess the quality of clustering. It ranges from -1 to 1, where a high score (close to 1) indicates that the point is well-matched to its own cluster and poorly matched to neighboring clusters.

To choose the number of clusters $k$ for k-means clustering, we can calculate the silhouette score for each candidate $k$ and select the $k$ that yields the highest average silhouette score across all data points.
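
For a single point $i$, the silhouette value is commonly defined as $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$, where $a(i)$ is the mean distance to the other points in its own cluster and $b(i)$ is the mean distance to the points in the nearest other cluster. The selection of $k$ can be sketched as follows on synthetic data. The three blobs and all parameter values are assumptions chosen for illustration, not taken from the air-quality dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three well-separated 2-D blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean of s(i) over all points
    print(f"k={k}: silhouette score = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k:", best_k)
```

Because the blobs are clearly separated, the score peaks at the true number of clusters. On real data the maximum is often less pronounced and should be read together with domain knowledge.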

### Normalization for k-Means Clustering

Normalization is the process of rescaling features to a common scale so that all of them contribute comparably to an analysis. Two common variants are min-max scaling, which maps each feature to a fixed range such as $[0, 1]$, and Z-score standardization, which gives each feature a mean of 0 and a standard deviation of 1.

Normalization is important for k-means clustering because it ensures that each feature contributes equally to the distance calculations, preventing features with larger scales from dominating the clustering process. Without normalization, k-means may produce biased clusters that are heavily influenced by features with larger numerical ranges.

The Z-score variant uses the same formula as the Z-score in chapter a):

$$
  \text{normalized features} = \frac{{x - \mu}}{{\sigma}}
$$
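
The effect of scaling can be sketched on a tiny made-up feature matrix whose two columns live on very different scales. The values are assumptions for illustration. `StandardScaler` applies the Z-score formula per column, and `MinMaxScaler` maps each column to the range $[0, 1]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up toy data: column 0 in the thousands, column 1 in single digits
X = np.array([
    [1000.0, 1.0],
    [1010.0, 5.0],
    [1020.0, 9.0],
    [1030.0, 3.0],
])

X_std = StandardScaler().fit_transform(X)  # per column: (x - mean) / std
X_mm = MinMaxScaler().fit_transform(X)     # per column: (x - min) / (max - min)

print("Standardized means:", X_std.mean(axis=0))
print("Standardized stds: ", X_std.std(axis=0))
print("Min-max ranges:    ", X_mm.min(axis=0), X_mm.max(axis=0))
```

After either transformation, both columns span comparable ranges, so neither dominates the Euclidean distances that k-means relies on.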

In [None]:
# DO NOT CHANGE

def plot_correlation_matrix(dataset):
    """
    Generates a heatmap to visualize correlations between numeric features.
    """
    plt.figure(figsize=(10, 8))
    correlation_matrix = dataset.corr()
    sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", square=True)
    plt.title("Feature Correlation Matrix")
    plt.show()

def apply_kmeans_clustering(dataset, n_clusters=3, selected_columns=None, seed=42):
    """
    Applies K-means clustering on selected features and visualizes the clusters.
    
    Parameters:
    - dataset: pandas DataFrame, the dataset to perform clustering on.
    - n_clusters: int, the number of clusters to form.
    - selected_columns: list of str, the column names to use for clustering.
    - seed: int, the random seed for reproducibility.
    
    Returns:
    - dataset_with_clusters: DataFrame, the dataset with an added 'Cluster' column.
    """
    if selected_columns is None:
        # Default to using the first two numeric columns if none are provided
        selected_columns = dataset.select_dtypes(include=['float', 'int']).columns[:2].tolist()
    
    # Extract data for the selected columns
    data_to_cluster = dataset[selected_columns]
    
    # Initialize and fit the KMeans model
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed)
    clusters = kmeans.fit_predict(data_to_cluster)
    
    # Add cluster labels to the dataset
    dataset_with_clusters = dataset.copy()
    dataset_with_clusters['Cluster'] = clusters
    
    # Visualize the clusters
    plt.figure(figsize=(8, 6))
    plt.scatter(data_to_cluster.iloc[:, 0], data_to_cluster.iloc[:, 1], c=clusters, cmap='viridis', s=50, alpha=0.7)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='X', s=200, label='Centroids')
    plt.xlabel(selected_columns[0])
    plt.ylabel(selected_columns[1])
    plt.title(f'K-Means Clustering on {selected_columns[0]} and {selected_columns[1]}')
    plt.legend()
    plt.show()
    
    return dataset_with_clusters

def compute_silhouette_score_for_k(data, k, seed=42):
    """
    Computes the silhouette score for a given k.
    
    Parameters:
    - data: pandas DataFrame, data to cluster.
    - k: int, number of clusters.
    - seed: int, random seed for reproducibility.
    
    Returns:
    - silhouette score for the given k.
    """
    kmeans = KMeans(n_clusters=k, random_state=seed)
    labels = kmeans.fit_predict(data)
    score = silhouette_score(data, labels)
    return score

def find_optimal_k_silhouette(dataset, selected_columns=None, seed=42, k_range=(2, 10)):
    """
    Finds the optimal number of clusters using silhouette score.
    
    Parameters:
    - dataset: pandas DataFrame, the dataset to analyze.
    - selected_columns: list of str, columns to use for clustering.
    - seed: int, random seed for reproducibility.
    - k_range: tuple, the range of k values to consider (inclusive).
    
    Returns:
    - best_k: int, the optimal number of clusters.
    """
    if selected_columns is None:
        # Default to using the first two numeric columns if none are provided
        selected_columns = dataset.select_dtypes(include=['float', 'int']).columns[:2].tolist()
    
    data_to_cluster = dataset[selected_columns]
    
    best_k = k_range[0]
    best_score = -1
    
    for k in range(k_range[0], k_range[1] + 1):
        score = compute_silhouette_score_for_k(data_to_cluster, k, seed)
        print(f"Silhouette score for k={k}: {score:.4f}")
        
        if score > best_score:
            best_score = score
            best_k = k
            
    print(f"Optimal number of clusters: {best_k} with silhouette score: {best_score:.4f}")
    return best_k

def normalize_dataset(dataset, method='z-score'):
    """
    Normalizes a pandas DataFrame containing numeric values, excluding the 'City' column.

    Parameters:
    - dataset: pandas DataFrame, the dataset to normalize.
    - method: str, the normalization method to use ('z-score' or 'min-max').

    Returns:
    - normalized_dataset: DataFrame, the normalized dataset with the 'City' column unchanged.
    """
    # Initialize the scaler based on the selected method
    if method == 'z-score':
        scaler = StandardScaler()
    elif method == 'min-max':
        scaler = MinMaxScaler()
    else:
        raise ValueError("Method must be 'z-score' or 'min-max'")

    # Separate the 'City' column (if present) from the rest of the dataset
    if 'City' in dataset.columns:
        city_column = dataset['City']
        dataset_to_normalize = dataset.drop(columns=['City'])
    else:
        city_column = None
        dataset_to_normalize = dataset

    # Apply the scaler to the numeric columns
    normalized_values = scaler.fit_transform(dataset_to_normalize)
    
    # Convert the result back to a DataFrame with the original column names
    normalized_dataset = pd.DataFrame(normalized_values, columns=dataset_to_normalize.columns)
    
    # Reattach the 'City' column if it was separated
    if city_column is not None:
        normalized_dataset['City'] = city_column.reset_index(drop=True)

    return normalized_dataset


def plot_two_correlation_matrices(dataset1, dataset2, selected_columns=None, title1="Correlation Matrix 1", title2="Correlation Matrix 2"):
    """
    Plots two correlation matrices side by side, with the option to select specific columns.
    
    Parameters:
    - dataset1: pandas DataFrame, first dataset for the correlation matrix.
    - dataset2: pandas DataFrame, second dataset for the correlation matrix.
    - selected_columns: list of str, column names to include in the correlation matrix.
    - title1: str, title for the first correlation matrix.
    - title2: str, title for the second correlation matrix.
    """
    # If selected_columns is provided, filter datasets to include only those columns
    if selected_columns:
        dataset1 = dataset1[selected_columns]
        dataset2 = dataset2[selected_columns]
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Correlation matrix for the first dataset
    corr_matrix1 = dataset1.corr()
    sns.heatmap(corr_matrix1, annot=True, cmap="coolwarm", fmt=".2f", square=True, ax=axes[0])
    axes[0].set_title(title1)

    # Correlation matrix for the second dataset
    corr_matrix2 = dataset2.corr()
    sns.heatmap(corr_matrix2, annot=True, cmap="coolwarm", fmt=".2f", square=True, ax=axes[1])
    axes[1].set_title(title2)

    plt.tight_layout()
    plt.show()


def plot_two_kmeans_clusters(dataset1, dataset2, n_clusters1=3, n_clusters2=3, 
                             selected_columns1=None, selected_columns2=None, seed=42):
    """
    Plots two K-means clustering results side by side.
    
    Parameters:
    - dataset1: pandas DataFrame, first dataset for K-means clustering.
    - dataset2: pandas DataFrame, second dataset for K-means clustering.
    - n_clusters1: int, number of clusters for the first K-means plot.
    - n_clusters2: int, number of clusters for the second K-means plot.
    - selected_columns1: list of str, columns to use for clustering in the first plot.
    - selected_columns2: list of str, columns to use for clustering in the second plot.
    - seed: int, random seed for reproducibility.
    """
    # Default to first two numeric columns if not provided
    if selected_columns1 is None:
        selected_columns1 = dataset1.select_dtypes(include=['float', 'int']).columns[:2].tolist()
    if selected_columns2 is None:
        selected_columns2 = dataset2.select_dtypes(include=['float', 'int']).columns[:2].tolist()
    
    # Prepare data for clustering
    data1 = dataset1[selected_columns1]
    data2 = dataset2[selected_columns2]
    
    # Apply KMeans clustering on the first dataset
    kmeans1 = KMeans(n_clusters=n_clusters1, random_state=seed)
    clusters1 = kmeans1.fit_predict(data1)
    
    # Apply KMeans clustering on the second dataset
    kmeans2 = KMeans(n_clusters=n_clusters2, random_state=seed)
    clusters2 = kmeans2.fit_predict(data2)
    
    # Plotting side by side
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # First clustering plot
    axes[0].scatter(data1.iloc[:, 0], data1.iloc[:, 1], c=clusters1, cmap='viridis', s=50, alpha=0.7)
    axes[0].scatter(kmeans1.cluster_centers_[:, 0], kmeans1.cluster_centers_[:, 1], c='red', marker='X', s=200, label='Centroids')
    axes[0].set_xlabel(selected_columns1[0])
    axes[0].set_ylabel(selected_columns1[1])
    axes[0].set_title(f'K-Means Clustering on {selected_columns1[0]} and {selected_columns1[1]} with k={n_clusters1}')
    axes[0].legend()

    # Second clustering plot
    axes[1].scatter(data2.iloc[:, 0], data2.iloc[:, 1], c=clusters2, cmap='viridis', s=50, alpha=0.7)
    axes[1].scatter(kmeans2.cluster_centers_[:, 0], kmeans2.cluster_centers_[:, 1], c='red', marker='X', s=200, label='Centroids')
    axes[1].set_xlabel(selected_columns2[0])
    axes[1].set_ylabel(selected_columns2[1])
    axes[1].set_title(f'K-Means Clustering on {selected_columns2[0]} and {selected_columns2[1]} with k={n_clusters2}, without outliers')
    axes[1].legend()

    plt.tight_layout()
    plt.show()

def plot_custom_scatter(dataset1, dataset2, plot_columns, city_mapping, cmap='viridis_r'):
    """
    Plots two scatter plots side by side with specified columns for x, y, and city (color).

    Parameters:
    - dataset1: pandas DataFrame, first dataset to plot.
    - dataset2: pandas DataFrame, second dataset to plot.
    - plot_columns: list of str, [x, y] columns for the plots.
    - city_mapping: dict, mapping of city names to unique numeric labels.
    - cmap: str, color map to use for the scatter plots.
    """
    # Unpack column names for x and y axes
    x, y = plot_columns

    # Reverse the city mapping to get city names from numeric labels
    reverse_mapping = {v: k for k, v in city_mapping.items()}

    # Ensure that only cities in the dataset are included in the legend and colorbar
    unique_cities1 = dataset1['City'].unique()
    unique_cities2 = dataset2['City'].unique()
    all_unique_cities = sorted(set(unique_cities1).union(set(unique_cities2)))

    # Check that all unique cities are in the mapping
    valid_cities = [city for city in all_unique_cities if city in reverse_mapping]
    if not valid_cities:
        raise ValueError("No valid city mappings found for the provided datasets.")

    # Get the city names for the valid cities
    city_names = [reverse_mapping[city] for city in valid_cities]

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # First plot
    scatter1 = axes[0].scatter(
        dataset1[x], dataset1[y], c=dataset1['City'], cmap=cmap, s=50, alpha=0.7
    )
    axes[0].set_xlabel(x)
    axes[0].set_ylabel(y)
    axes[0].set_title(f'Scatter Plot of {x} vs {y} ({", ".join(city_names)})')
    colorbar1 = fig.colorbar(scatter1, ax=axes[0])
    colorbar1.set_ticks(valid_cities)
    colorbar1.set_ticklabels([reverse_mapping[city] for city in valid_cities])
    colorbar1.set_label('City')

    # Second plot
    scatter2 = axes[1].scatter(
        dataset2[x], dataset2[y], c=dataset2['City'], cmap=cmap, s=50, alpha=0.7
    )
    axes[1].set_xlabel(x)
    axes[1].set_ylabel(y)
    axes[1].set_title(f'Scatter Plot of {x} vs {y} ({", ".join(city_names)} - without outliers)')
    colorbar2 = fig.colorbar(scatter2, ax=axes[1])
    colorbar2.set_ticks(valid_cities)
    colorbar2.set_ticklabels([reverse_mapping[city] for city in valid_cities])
    colorbar2.set_label('City')

    plt.tight_layout()
    plt.show()


<u>TASK:</u> How does the correlation matrix change with different thresholds? \
Compare the correlation matrices for threshold = [1, 3]. \
Why is the effect strongest at threshold = 1? 


In [None]:
threshold = None  # CHANGE threshold from None to any positive number, default value is usually 3.0

## DO NOT CHANGE THE CODE BELOW
check_value(threshold, "threshold")

The following code block computes the correlation matrices, once with outliers and once without.
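
As a small illustration of why outlier removal can change a correlation coefficient, the sketch below uses synthetic data (not the weather dataset) with two injected extreme points. The filtering rule shown here, which keeps rows whose absolute Z-score is below the threshold in every column, is an assumption made for illustration and may differ from the notebook's own `detect_outliers_z_score` and `remove_outliers` helpers.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic, positively correlated data (illustrative only)
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = 0.8 * x + 0.6 * rng.standard_normal(200)

# Inject two extreme points that contradict the overall trend
x = np.append(x, [8.0, -8.0])
y = np.append(y, [-8.0, 8.0])
df = pd.DataFrame({"x": x, "y": y})

# Assumed rule: keep a row only if |z| < threshold in every column
threshold = 3.0
z = np.abs(zscore(df.to_numpy(), axis=0))
mask = (z < threshold).all(axis=1)
cleaned = df[mask]

corr_raw = df["x"].corr(df["y"])
corr_clean = cleaned["x"].corr(cleaned["y"])
print(f"Pearson correlation with outliers:    {corr_raw:.3f}")
print(f"Pearson correlation without outliers: {corr_clean:.3f}")
print(f"Rows removed: {len(df) - len(cleaned)}")
```

In this toy setting the two injected points are enough to mask most of the linear relationship, which reappears once they are filtered out.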

In [None]:
# DO NOT CHANGE

selected_columns_correlation_matrix = ['tempmax', 'tempmin', 'humidity', 'dew', 'pressure', 'windgust', 'windspeed', 'solarradiation', 'uvindex']
selected_columns = [col1, col2]  

# Detect and remove outliers, then plot the correlation matrices with and without them
outliers = detect_outliers_z_score(allcities_numeric_dataset, threshold=threshold)
cleaned_allcities_numeric_dataset = remove_outliers(allcities_numeric_dataset, outliers)

plot_two_correlation_matrices(
    allcities_numeric_dataset,
    cleaned_allcities_numeric_dataset,
    selected_columns=selected_columns_correlation_matrix,
    title1="All Cities with Outliers",
    title2="All Cities without Outliers"
)

Now we use k-means clustering to try to distinguish between the two cities. The Silhouette score is also computed to estimate the optimal number of clusters.
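
The idea of choosing the number of clusters by the Silhouette score can be sketched as follows. This is a minimal illustration on synthetic blobs, not the notebook's `find_optimal_k_silhouette` helper; the data, the range of k, and the variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated 2D blobs (not the weather data)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

# Fit k-means for each candidate k and record the mean Silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The candidate with the highest score is taken as the suggested k
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("k with the highest Silhouette score:", best_k)
```

The Silhouette score lies between -1 and 1; values close to 1 indicate compact, well-separated clusters. The suggested k is a heuristic and need not match the true number of groups in real data.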

In [None]:
# DO NOT CHANGE

# Reduce to selected columns
reduced_numeric_dataset = reduce_to_selected_columns(numeric_dataset, ['City', col1, col2])

# Shuffle dataset (not necessary)
shuffled_reduced_numeric_dataset = shuffle_dataset(reduced_numeric_dataset)

# Normalize the shuffled reduced numeric dataset, then compute and plot
# the k-means clustering results with and without outliers
normalized_shuffled_reduced_numeric_dataset = normalize_dataset(shuffled_reduced_numeric_dataset)

# Find optimal k for clustering before and after outlier removal
optimal_k = find_optimal_k_silhouette(
    normalized_shuffled_reduced_numeric_dataset,
    selected_columns=selected_columns,
    k_range=(2, 10)
)
outliers = detect_outliers_z_score(normalized_shuffled_reduced_numeric_dataset, threshold=threshold)
cleaned_normalized_shuffled_reduced_numeric_dataset = remove_outliers(normalized_shuffled_reduced_numeric_dataset, outliers)

cleaned_optimal_k = find_optimal_k_silhouette(
    cleaned_normalized_shuffled_reduced_numeric_dataset,
    selected_columns=selected_columns,
    k_range=(2, 10)
)

# Plot K-means clustering results
plot_two_kmeans_clusters(
    normalized_shuffled_reduced_numeric_dataset,
    cleaned_normalized_shuffled_reduced_numeric_dataset,
    n_clusters1=optimal_k,
    n_clusters2=cleaned_optimal_k,
    selected_columns1=selected_columns,
    selected_columns2=selected_columns
)

# Compare clustering results with fixed k=2
plot_two_kmeans_clusters(
    normalized_shuffled_reduced_numeric_dataset,
    cleaned_normalized_shuffled_reduced_numeric_dataset,
    n_clusters1=2,
    n_clusters2=2,
    selected_columns1=selected_columns,
    selected_columns2=selected_columns
)

# Convert City column to integers in both datasets
normalized_shuffled_reduced_numeric_dataset['City'] = normalized_shuffled_reduced_numeric_dataset['City'].astype(int)
cleaned_normalized_shuffled_reduced_numeric_dataset['City'] = cleaned_normalized_shuffled_reduced_numeric_dataset['City'].astype(int)

plot_columns = [col1, col2]  # Replace with the desired column names
plot_custom_scatter(
    normalized_shuffled_reduced_numeric_dataset,
    cleaned_normalized_shuffled_reduced_numeric_dataset,
    plot_columns=plot_columns,
    city_mapping=city_mapping  # Pass the correct city mapping
)

[Back](#sec_toc)

## <a id="sec_c">c) Creation of Representative Training and Test Datasets and Different Similarity Measures </a>

In this chapter we will see how representative training and test splits can be generated and assessed using quantitative measures.

### Kullback-Leibler (KL) Divergence


The Kullback-Leibler (KL) divergence measures how much one probability distribution differs from a reference distribution. It is often described as a "distance" between the two, although it is not a true distance metric because it is not symmetric. It is commonly used to compare two distributions: the smaller the KL divergence, the more closely they agree. In machine learning, KL divergence is especially useful for assessing how much the training and test data distributions differ.

Below the KL divergence is defined for discrete probability distributions. Discrete distributions are probability distributions defined over a finite or countable set of outcomes.

The KL divergence from a discrete distribution $P$ to a discrete distribution $Q$ is given by:

$$
D_{KL}(P || Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
$$

where:
- $P(i)$ is the probability of an event $i$ in the distribution $P$,
- $Q(i)$ is the probability of the same event $i$ in the distribution $Q$.

A KL divergence of 0 means the two distributions are identical, while higher values indicate greater divergence between them. 

The KL divergence is asymmetric, meaning that in general $D_{KL}(P || Q) \neq D_{KL}(Q || P)$. The result therefore depends on which distribution is taken as the reference: each term of the sum is weighted by $P(i)$, so differences between $P$ and $Q$ count most where $P$ assigns high probability.
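
The definition and the asymmetry can be checked on a small example. The sketch below uses two made-up distributions over three outcomes and compares a direct implementation of the sum with `scipy.stats.entropy`, which computes the same quantity (with the natural logarithm) when given two arguments. The helper name `kl_discrete` is hypothetical.

```python
import numpy as np
from scipy.stats import entropy

# Two made-up discrete distributions over the same three outcomes
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

def kl_discrete(p, q):
    """Direct implementation of sum_i P(i) * log(P(i) / Q(i)), natural log."""
    return float(np.sum(p * np.log(p / q)))

kl_pq = kl_discrete(p, q)
kl_qp = kl_discrete(q, p)

print(f"D_KL(P || Q) = {kl_pq:.4f}")
print(f"D_KL(Q || P) = {kl_qp:.4f}")
print(f"scipy entropy(p, q) = {entropy(p, q):.4f}")
print(f"D_KL(P || P) = {kl_discrete(p, p):.4f}")
```

The two directions give different values, and the divergence of a distribution from itself is zero. Note that this direct implementation assumes all probabilities are strictly positive.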


### Kullback-Leibler Divergence for Gaussian Distributions

When both distributions are Gaussian (bell-shaped curves), there's a specific formula we can use to calculate this difference more easily. This formula only needs the mean and standard deviation of each Gaussian distribution, making it faster and simpler than the general approach for other types of distributions.

#### KL Divergence for Gaussian Distributions

For two Gaussian distributions with means and variances:

$$
P \sim \mathcal{N}(\mu_P, \sigma_P^2) \quad \text{and} \quad Q \sim \mathcal{N}(\mu_Q, \sigma_Q^2)
$$

the KL divergence is given by:

$$
D_{KL}(P || Q) = \ln \frac{\sigma_Q}{\sigma_P} + \frac{\sigma_P^2 + (\mu_P - \mu_Q)^2}{2\sigma_Q^2} - \frac{1}{2}
$$
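
The closed-form expression above can be sanity-checked against the general definition of the KL divergence for continuous distributions, $\int p(x) \ln \frac{p(x)}{q(x)} \, dx$, evaluated numerically. The sketch below does this for one arbitrarily chosen pair of Gaussians; the parameter values and the helper name `kl_gaussian` are assumptions for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL divergence D_KL(P || Q) for two univariate Gaussians."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Arbitrary example parameters
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 2.0

def integrand(x):
    # p(x) * (ln p(x) - ln q(x)), using logpdf for numerical stability
    log_ratio = norm.logpdf(x, mu_p, sigma_p) - norm.logpdf(x, mu_q, sigma_q)
    return norm.pdf(x, mu_p, sigma_p) * log_ratio

kl_closed = kl_gaussian(mu_p, sigma_p, mu_q, sigma_q)
kl_numeric, _ = quad(integrand, -np.inf, np.inf)

print(f"Closed form:           {kl_closed:.6f}")
print(f"Numerical integration: {kl_numeric:.6f}")
```

Both values should agree up to the integration tolerance, and the formula returns zero when the two Gaussians have identical parameters.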




In [None]:
# DO NOT CHANGE

# Task 1: Split Dataset into Train and Test
def split_dataset(data, test_size=0.3, random_state=42):
    """
    Splits the dataset into train and test sets.
    
    Parameters:
    data (pd.DataFrame): The input dataset.
    test_size (float): Proportion of the dataset to include in the test split.
    random_state (int): Seed for reproducibility.
    
    Returns:
    train (pd.DataFrame): Training dataset.
    test (pd.DataFrame): Testing dataset.
    """
    train, test = train_test_split(data, test_size=test_size, random_state=random_state)
    return train, test

# Task 2: Create Histograms for Features
def plot_histograms(train, test, features, num_bins=20):
    """
    Creates histograms of the features for both train and test datasets, with one plot per feature.
    
    Parameters:
    train (pd.DataFrame): Training dataset.
    test (pd.DataFrame): Testing dataset.
    features (list): List of feature names to plot.
    num_bins (int): Number of bins to use for histogram comparison.
    """
    for feature in features:
        plt.figure(figsize=(8, 4))

        # Determine the range for bins based on both train and test data
        min_value = min(train[feature].min(), test[feature].min())
        max_value = max(train[feature].max(), test[feature].max())
        bins = np.linspace(min_value, max_value, num_bins + 1)

        # Plot histograms with the shared bins for exact alignment
        plt.hist(train[feature], bins=bins, alpha=0.5, label='Train', density=True, color='blue')
        plt.hist(test[feature], bins=bins, alpha=0.5, label='Test', density=True, color='orange')
        plt.title(f'Histogram of {feature}')
        plt.xlabel(feature)
        plt.ylabel('Density')
        plt.legend()
        plt.grid(True)
        plt.show()

# Task 3: Compute KL Divergence
def compute_kl_divergence(train, test, features, num_bins=20):
    """
    Computes the Kullback-Leibler (KL) divergence for each feature to compare the train and test distributions.
    
    Parameters:
    train (pd.DataFrame): Training dataset.
    test (pd.DataFrame): Testing dataset.
    features (list): List of feature names to compare.
    num_bins (int): Number of bins to use for histogram comparison.
    
    Returns:
    kl_divergences (dict): Dictionary of KL divergence values for each feature.
    """
    kl_divergences = {}
    for feature in features:
        # Create histograms for the train and test datasets for the feature
        min_value = min(train[feature].min(), test[feature].min())
        max_value = max(train[feature].max(), test[feature].max())
        bins = np.linspace(min_value, max_value, num_bins + 1)

        train_hist, _ = np.histogram(train[feature], bins=bins, density=True)
        test_hist, _ = np.histogram(test[feature], bins=bins, density=True)

        # Ensure no zero values in histograms (for stable KL divergence calculation)
        train_hist += 1e-10
        test_hist += 1e-10

        # Compute KL divergence
        kl_div = entropy(train_hist, test_hist)
        kl_divergences[feature] = kl_div
        print(f"KL Divergence for {feature}: {kl_div:.4f}")

    return kl_divergences

def plot_histogram_with_gaussian(train, test, features, num_bins=20):
    """
    Creates histograms of the features for both train and test datasets with an overlaid Gaussian distribution.
    
    Parameters:
    train (pd.DataFrame): Training dataset.
    test (pd.DataFrame): Testing dataset.
    features (list): List of feature names to plot.
    num_bins (int): Number of bins to use for the histogram.
    """
    for feature in features:
        plt.figure(figsize=(8, 4))

        # Determine the range for bins based on both train and test data
        min_value = min(train[feature].min(), test[feature].min())
        max_value = max(train[feature].max(), test[feature].max())
        bins = np.linspace(min_value, max_value, num_bins + 1)

        # Plot histograms with the shared bins for exact alignment
        plt.hist(train[feature], bins=bins, alpha=0.5, label='Train', density=True, color='blue')
        plt.hist(test[feature], bins=bins, alpha=0.5, label='Test', density=True, color='orange')

        # Fit and plot Gaussian for train data
        mean_train = train[feature].mean()
        std_train = train[feature].std()
        x_vals = np.linspace(min_value, max_value, 100)
        plt.plot(x_vals, norm.pdf(x_vals, mean_train, std_train), color='blue', linestyle='--', label='Gaussian (Train)')

        # Fit and plot Gaussian for test data
        mean_test = test[feature].mean()
        std_test = test[feature].std()
        plt.plot(x_vals, norm.pdf(x_vals, mean_test, std_test), color='orange', linestyle='--', label='Gaussian (Test)')

        plt.title(f'Histogram of {feature} with Gaussian Fit')
        plt.xlabel(feature)
        plt.ylabel('Density')
        plt.legend()
        plt.grid(True)
        plt.show()

def compute_kl_divergence_gaussian(mean_p, std_p, mean_q, std_q):
    """
    Computes the KL divergence between two Gaussian distributions P and Q.
    
    Parameters:
    mean_p (float): Mean of the first Gaussian distribution (P).
    std_p (float): Standard deviation of the first Gaussian distribution (P).
    mean_q (float): Mean of the second Gaussian distribution (Q).
    std_q (float): Standard deviation of the second Gaussian distribution (Q).
    
    Returns:
    float: KL divergence between the two Gaussian distributions.
    """
    kl_divergence = np.log(std_q / std_p) + (std_p**2 + (mean_p - mean_q)**2) / (2 * std_q**2) - 0.5
    return kl_divergence

def compute_kl_divergence_gaussian_multiple(train, test, features):
    """
    Computes the KL divergence between Gaussian approximations of multiple features in train and test sets.
    
    Parameters:
    train (pd.DataFrame): Training dataset.
    test (pd.DataFrame): Testing dataset.
    features (list): List of feature names to compute KL divergence for.
    
    Returns:
    kl_divergences (dict): Dictionary of KL divergence values for each feature.
    """
    kl_divergences = {}
    
    for feature in features:
        # Calculate mean and standard deviation for the feature in both train and test sets
        mean_train = train[feature].mean()
        std_train = train[feature].std()
        mean_test = test[feature].mean()
        std_test = test[feature].std()
        
        # Compute KL divergence for the current feature
        kl_divergence = compute_kl_divergence_gaussian(mean_train, std_train, mean_test, std_test)
        kl_divergences[feature] = kl_divergence
        print(f"Gaussian KL Divergence for {feature}: {kl_divergence:.4f}")
    
    return kl_divergences


<u>TASK:</u> In the code cell below you can change the variables test_size and random_state. \
A different test_size changes the relative sizes of the training and test splits. \
A different random_state selects different samples for the training and test splits.

When answering the Moodle questions, use 'test_size = 0.5' and 'random_state = 42'.
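
The effect of the two parameters can be seen on a toy DataFrame. This is a minimal sketch with made-up data, calling `train_test_split` directly rather than the notebook's `split_dataset` wrapper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset with ten rows (illustrative only)
toy = pd.DataFrame({"value": np.arange(10)})

# Same test_size and random_state twice: the split is reproducible
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42)

# Larger test_size: more rows go to the test set
train_c, test_c = train_test_split(toy, test_size=0.5, random_state=42)

print("test_size=0.3:", len(train_a), "train rows,", len(test_a), "test rows")
print("test_size=0.5:", len(train_c), "train rows,", len(test_c), "test rows")
print("Same seed gives same test rows:", test_a.index.equals(test_b.index))
```

Fixing random_state makes a split reproducible, which is why a specific value is prescribed for the graded questions.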

In [None]:
test_size = None    # CHANGE test_size TO ANY Float BETWEEN 0.0 AND 1.0 
                    # (e.g. 0.3 equals 30% of data is used for the test set, the rest for the train set)
random_state = None # CHANGE random_state TO ANY Integer FOR NEW RANDOM SPLIT

## DO NOT CHANGE THE CODE BELOW
check_value(test_size, "test_size")
check_value(random_state, "random_state")

<u>TASK:</u> In the code cell below you can change the threshold for the z-score outlier detection for the remaining computations in chapter c).

In [None]:
threshold = None  # CHANGE threshold from None to any positive number, default value is usually 3.0

## DO NOT CHANGE THE CODE BELOW
check_value(threshold, "threshold")

The following code block splits the dataset of all cities into training and test sets and plots the distributions of the chosen columns (col1, col2) for both splits.

In [None]:
# DO NOT CHANGE

# Split the dataset into train and test sets
numeric_train, numeric_test = split_dataset(allcities_numeric_dataset, test_size=test_size, random_state=random_state)

# Plot histograms and gaussian distributions of the features
features = [col1, col2]
plot_histogram_with_gaussian(numeric_train, numeric_test, features)


Below, the KL divergence between the training and test sets is computed twice: once from the histograms (discrete distributions) and once from the Gaussian approximations.

In [None]:
# DO NOT CHANGE

# Compute KL Divergence between train and test distributions for each feature
kl_divergences = compute_kl_divergence(numeric_train, numeric_test, features)
kl_divergences_gaussian = compute_kl_divergence_gaussian_multiple(numeric_train, numeric_test, features)


[Back](#sec_toc)