# Exercise - Data Organization & Machine Learning

<div class="alert alert-block alert-success">
This is an exercise on the basics of data organization techniques, focussing on data splitting and validation strategies in machine learning workflows.

## <a id="sec_toc">Content</a>

[Learning Objectives](#sec_0)

[a) Prepare and analyse the data](#sec_a)

[b) Simple Train-Test Split](#sec_b)

[c) Implementing K-Fold Cross-Validation](#sec_c)

[d) Stratified K-Fold Cross Validation](#sec_d)

[Bonus) Plot the data](#sec_e)

### <a id="sec_0">Learning Objectives</a>

* Identify and apply necessary pre-processing steps to prepare data for analysis. 
* Become familiar with evaluation metrics, such as Accuracy, Precision, Recall and F1-score.
* Understand and interpret these evaluation metrics.
* Understand the basics of train-test splits and k-fold cross validation.
* Determine when and why to use these splitting techniques.

In [None]:
%matplotlib widget

### Import the needed libraries

In [None]:
import warnings
import random
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn import linear_model
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    recall_score,
    precision_score,
    accuracy_score,
)
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.utils import shuffle
from sklearn.manifold import TSNE

from typing import Union, Dict, Literal, Sequence
import numpy.typing as npt

In [None]:
warnings.filterwarnings("ignore")

In [None]:
seed_value = 1
np.random.seed(seed_value)
random.seed(seed_value)

### Define all functions that are needed in the notebook
The code cell below defines all the functions used in the notebook. \
To keep the notebook readable, you can collapse the code cell by clicking on the blue bar on its left. \
This is only possible if you are using JupyterHub. \
If you are interested in the code, expand the code cell again by clicking on the blue bar on the left.

In [None]:
# DO NOT CHANGE THIS CODE
def binarize_labels(labels: pd.DataFrame, mapping: Dict[str, int]) -> pd.DataFrame:
    """Maps the label values to int-values, based on the mapping dictionary."""
    c_labels = labels.copy()
    c_labels[c_labels.columns[0]] = c_labels[c_labels.columns[0]].map(mapping)
    return c_labels


def binarize_labels_rev(labels: pd.DataFrame, mapping: Dict[str, int]) -> pd.DataFrame:
    """Maps the encoded label values back to their original string values."""
    reverse_mapping = {v: k for k, v in mapping.items()}  # Create reverse mapping
    c_labels = labels.copy()
    c_labels[c_labels.columns[0]] = c_labels[c_labels.columns[0]].map(reverse_mapping)
    return c_labels


def check_data_consistency(data, normalized_data, labels):
    if normalized_data.shape != data.shape:
        print(
            "Deleted columns (because of NaN-values): ",
            set(data.columns).symmetric_difference(set(normalized_data.columns)),
        )
    else:
        print("No columns contain any NaN values.")

    if normalized_data.isna().sum().sum() != 0:
        raise Exception("Error: normalized_data still contains NaN values.")

    # check if data and labels have the same number of rows
    if normalized_data.shape[0] != labels.shape[0]:
        raise Exception("Error: data and labels do not have the same number of rows.")
    

def check_dims(input_data: pd.DataFrame, labels: pd.DataFrame) -> None:
    """Takes two DataFrames and prints their shape."""
    print("dimensions of the data: ", input_data.shape, "type: ", type(input_data))
    print("dimension of the labels: ", labels.shape, "type: ", type(labels))


def check_factor(stratification_factor):
    if stratification_factor is None:
        print("stratification_factor is still None. Please enter a valid value for stratification_factor.")
        raise Exception("Execution stopped because stratification_factor is not defined.")
    
    if stratification_factor<0 or stratification_factor>=1:
        print("The entered value for the stratification_factor is invalid. Please enter a valid value for stratification_factor.")
        raise Exception("Execution stopped because stratification_factor is not valid.")
    

def check_k(k):
    if k is None:
        print("k is still None. Please enter a valid value for k.")
        raise Exception("Execution stopped because k is not defined.")
    
    if k<3 or k>10:
        print("The value for k is invalid. Please enter a valid value for k.")
        raise Exception("Execution stopped because the entered value for k is not valid.")


def check_plot_values(X, y):
    if X is None:
        print("X_plot is still None. Please enter a valid value for X_plot.")
        raise Exception("Execution stopped because X_plot is not defined.")

    if y is None:
        print("y_plot is still None. Please enter a valid value for y_plot.")
        raise Exception("Execution stopped because y_plot is not defined.")
    

def check_values(city1, city2):
    cities = ["Chicago","New York City","Dallas","Phoenix","Philadelphia","Los Angeles","San Diego","San Jose","Houston", "San Antonio"]
    if city1 is None:
        print("city1 is still None. Please enter a valid city name.")
        raise Exception("Execution stopped because city1 is not defined.")

    if city2 is None:
        print("city2 is still None. Please enter a valid city name.")
        raise Exception("Execution stopped because city2 is not defined.")
    
    if city1 not in cities:
        print("city1 is an invalid value. Please enter a valid city name. Valid cities are listed in the output of the cell above.")
        raise Exception("Execution stopped because city1 is not valid.")

    if city2 not in cities:
        print("city2 is an invalid value. Please enter a valid city name. Valid cities are listed in the output of the cell above.")
        raise Exception("Execution stopped because city2 is not valid.")
    
    return [city1, city2]


def check_value(random_seed):
    if random_seed is False:
        print("random_seed is still False. Please enter a valid value (e.g. True).")
        raise Exception("Execution stopped because random_seed is still False.")
    
    
def create_city_mapping(cities: list[str]) -> Dict[str, int]:
    """Takes a list of values, in this case cities and returns a dictionary,
    where each of these cities is assigned to an integer value.
    """
    mapping = {}
    for i, city in enumerate(cities):
        mapping.update({city: i})
    return mapping


def create_k_folds(
    data: pd.DataFrame,
    labels: pd.DataFrame,
    k: int = 5,
    fold: Union[KFold, StratifiedKFold] = KFold,
) -> tuple[pd.DataFrame]:
    """Divides a dataset into k different equivalent subsets and returns them."""
    data_shuffled, labels_shuffled = shuffle(data, labels, random_state=42)
    kf = fold(n_splits=k)
    folds_data = []
    folds_labels = []
    for _, val_index in kf.split(data_shuffled, labels_shuffled):
        fold_d = data_shuffled.iloc[val_index]
        fold_l = labels_shuffled.iloc[val_index]
        folds_data.append(fold_d)
        folds_labels.append(fold_l)
    return folds_data, folds_labels


def create_legend(mapping: Dict[str, int], y_test: pd.DataFrame) -> None:
    """Creates a legend for the scatter plot, based on the 'mapping' of the label values."""
    city_labels = {v: k for k, v in mapping.items()}
    k = len(mapping) - 1
    handles = [
        plt.Line2D(
            [0],
            [0],
            marker="o",
            color="k",
            markerfacecolor=plt.cm.Paired(i / k),
            markersize=10,
            label=city_labels[i],
        )
        for i in np.unique(y_test.values)
    ]
    plt.legend(
        handles=handles, title="Cities", bbox_to_anchor=(1.50, 1), loc="upper right"
    )
    plt.show()


def eval_cross_val(
    kf_data: list[pd.DataFrame], kf_labels: list[pd.DataFrame]
) -> tuple[float]:
    """Performs the model training and returns the different evaluation metrics."""
    accuracies = []
    precisions = []
    recalls = []
    F1_scores = []
    training_accuracies = []
    for i in range(len(kf_data)):
        kf_data_copy = kf_data.copy()
        kf_labels_copy = kf_labels.copy()
        X_test = kf_data_copy.pop(i)
        y_test = kf_labels_copy.pop(i).to_numpy().squeeze()
        X_train = pd.concat(kf_data_copy, ignore_index=True)
        y_train = pd.concat(kf_labels_copy, ignore_index=True).to_numpy().ravel()

        model = linear_model.LogisticRegression()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        y_train_pred = model.predict(X_train)
        training_acc = accuracy_score(y_train, y_train_pred)

        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        accuracies.append(accuracy_score(y_test, y_pred))
        precisions.append(precision)
        recalls.append(recall)
        F1_scores.append(f1_score(y_test, y_pred))
        training_accuracies.append(float(training_acc))
    return (
        np.array(accuracies),
        np.array(precisions),
        np.array(recalls),
        np.array(F1_scores),
        np.array(training_accuracies),
    )


def filter_label_values(
    data: pd.DataFrame, labels: pd.DataFrame, values: list[str]
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Filters the data based on the values in the 'labels' DataFrame.
    After applying this function, the dataset only contains rows with
    the label described in the 'values' list.
    """
    indices_to_keep = labels[labels[labels.columns[0]].isin(values)].index
    filtered_data = data.loc[indices_to_keep]
    filtered_labels = labels.loc[indices_to_keep]
    return filtered_data, filtered_labels


def logistic_reg_3d(X_train_3dims, y_plot, xx, yy):
    logistic_reg_plot = linear_model.LogisticRegression()
    model = logistic_reg_plot.fit(X_train_3dims, y_plot.to_numpy().ravel())
    coef = model.coef_[0]
    intercept = model.intercept_
    zz = -(coef[0] * xx + coef[1] * yy + intercept) / coef[2]
    return zz, model


def get_principal_components(data: pd.DataFrame, nr_cols: int) -> pd.DataFrame:
    """Takes a DataFrame and returns the reduced dataset based on principal component analysis."""
    U, s, VT = np.linalg.svd(data, full_matrices=False)
    # the rows of VT are the principal directions, so take the first nr_cols rows
    Vk = VT[:nr_cols].T
    P = np.dot(data, Vk)
    cols = [f"{i+1}. Principal Component" for i in range(nr_cols)]
    df = pd.DataFrame(P, columns=cols)
    print("variance kept [%]: ", (s**2 / sum(s**2)) * 100)
    return df


def get_results(data: pd.DataFrame, labels: pd.DataFrame, k_range: list[int], fold: Union[KFold, StratifiedKFold] = KFold) -> tuple[list[float]]:
    k_result_train = []
    k_result_test = []
    k_result_f1 = []
    for k in k_range:
        data_folds, label_folds = create_k_folds(data, labels, k, fold)
        acc_test, _, _, f1, acc_train = eval_cross_val(data_folds, label_folds)
        mean_acc_test = acc_test.mean()
        mean_acc_train = acc_train.mean()
        mean_f1_test = f1.mean()
        k_result_test.append(mean_acc_test)
        k_result_train.append(mean_acc_train)
        k_result_f1.append(mean_f1_test)
    return k_result_train, k_result_test, k_result_f1


def get_ymin_ymax_scores(scores: Union[Sequence[np.ndarray], np.ndarray]) -> tuple[float, float]:
    # index 0: validation accuracies, index 4: training accuracies (see eval_cross_val)
    selected_columns_val = np.concatenate([score[0] for score in scores])
    selected_columns_train = np.concatenate([score[4] for score in scores])
    return get_ymin_ymax(selected_columns_train, selected_columns_val)


def get_ymin_ymax(acc_train, acc_test) -> tuple[float, float]:
    cols = np.concatenate([acc_train, acc_test])
    ymin = np.min(cols)-0.01
    ymax = np.max(cols)+0.01
    return ymin, ymax


def minorize_label(
    data: pd.DataFrame, labels: pd.DataFrame, value: str, frac: float = 0.5
) -> tuple[pd.DataFrame]:
    """Randomly deletes 'frac'*100 percent of the data, where the label equals 'value'."""
    data_shuffled, labels_shuffled = shuffle(data, labels, random_state=42)
    label_indices = labels_shuffled[
        labels_shuffled[labels_shuffled.columns[0]] == value
    ].index
    num_rows_to_delete = int(len(label_indices) * frac)
    rows_to_delete = np.random.choice(label_indices, num_rows_to_delete, replace=False)
    data_min = data_shuffled.drop(rows_to_delete)
    labels_min = labels_shuffled.drop(rows_to_delete)
    return data_min, labels_min


def plot_acc_per_fold(
    scores: tuple[float], y_lim_min: float, y_lim_max: float, text: str = "Training and Validation Accuracy for each fold"
) -> None:
    """Plots the training and validation accuracy of the model for each fold."""
    val_accuracies, train_accuracies = scores[0], scores[4]
    k = len(val_accuracies)

    fig, ax = plt.subplots(figsize=(8, 6.5))
    ax.plot(range(1, k + 1), train_accuracies, label="Training Accuracy", marker="o")
    ax.plot(range(1, k + 1), val_accuracies, label="Validation Accuracy", marker="o")
    ax.set_ylim(y_lim_min, y_lim_max)
    ax.set_xlabel("Fold Number")
    ax.set_ylabel("Accuracy")
    ax.title.set_text(text)
    ax.legend(loc="best")
    ax.grid(True)
    plt.show()


def plot_data_3dim(
    data: pd.DataFrame,
    labels: pd.DataFrame,
    title: str,
    mode: Literal["orig axes", "pca", "tsne"] = "pca",
    column_names: list[str] = ["Severity_Score", "Health_Risk_Score", "severerisk"],
) -> None:
    """Performs PCA, TSNE or no dimensionality reduction on the data and plots it in a 3-dimensional plot."""
    if mode == "pca":
        data_3d = get_principal_components(data, 3)
    elif mode == "tsne":
        data_3d = TSNE(n_components=3).fit_transform(data)
        data_3d = pd.DataFrame(data_3d, columns=[f"{i}. dim tsne" for i in range(1, 4)])
    elif mode == "orig axes":
        data_3d = data[column_names].copy()

    first_col, second_col, third_col = (
        data_3d.iloc[:, 0].to_frame(),
        data_3d.iloc[:, 1].to_frame(),
        data_3d.iloc[:, 2].to_frame(),
    )

    cities = labels[labels.columns[0]].unique()

    mapping_all_cities = create_city_mapping(cities)
    all_labels_bin = binarize_labels(labels, mapping_all_cities)

    # Creating a 3D scatter plot
    fig = plt.figure(figsize=(10.5, 6.5))
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(
        first_col,
        second_col,
        third_col,
        c=all_labels_bin,
        edgecolors="k",
        marker="o",
        cmap=plt.cm.Paired,
    )
    ax.set_xlabel(first_col.columns[0])
    ax.set_ylabel(second_col.columns[0])
    ax.set_zlabel(third_col.columns[0])
    ax.title.set_text(title)
    create_legend(mapping_all_cities, all_labels_bin)


def plot_decision_boundary_2d(model, xx, yy, first_col, second_col, y_plot, mapping):
    # calculate decision boundary by inserting meshgrid values
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # plot the decision boundary and the datapoints
    title = "Logistic Regression Decision Boundary and test data points"
    scatter_plot(
        first_col,
        second_col,
        y_plot,
        title,
        contour=True,
        spec=[xx, yy, Z, 0.4, plt.cm.Paired],
    )
    create_legend(mapping, y_plot)


def plot_decision_boundary_3d(model, xx, yy, zz, y_plot, first_col, second_col, third_col, mapping):
    # Calculate decision boundary using meshgrid values
    Z = model.predict(np.c_[xx.ravel(), yy.ravel(), zz.ravel()])
    Z = Z.reshape(xx.shape)

    # Create a 3D plot with the decision boundary and data points
    fig = plt.figure(figsize=(10.5, 6.5))
    ax = fig.add_subplot(111, projection='3d')
    title = "Logistic Regression 3D Decision Boundary and Test Data Points"
    ax.set_title(title)

    y_plot_array = y_plot.to_numpy().ravel()

    # Plot the training points
    ax.scatter(first_col, second_col, third_col, c= y_plot_array, cmap=plt.cm.Paired, alpha=0.8)

    # Plot the decision boundary in 3D
    ax.plot_surface(xx, yy, zz, color='lightblue', alpha=0.5)

    # Label axes
    ax.set_xlabel("1. Principal Component")
    ax.set_ylabel("2. Principal Component")
    ax.set_zlabel("3. Principal Component")

    # Add legend
    create_legend(mapping, y_plot)


def prepare_data(X_plot):
    X_train_3dims = get_principal_components(X_plot, 3)
    first_col = X_train_3dims.iloc[:, 0].to_frame()
    second_col = X_train_3dims.iloc[:, 1].to_frame()
    third_col = X_train_3dims.iloc[:,2].to_frame()
    x_min, x_max = first_col.min().values[0] - 1, first_col.max().values[0] + 1
    y_min, y_max = second_col.min().values[0] - 1, second_col.max().values[0] + 1
    z_min, z_max = third_col.min().values[0] - 1, third_col.max().values[0] + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 
    return X_train_3dims, xx, yy, first_col, second_col, third_col

    
def prepare_4_plot(X_train: pd.DataFrame) -> tuple[Union[pd.DataFrame, npt.NDArray]]:
    """Returns the first two columns of 'X_train' and points within the plot spectrum as a meshgrid with a
    distance of 0.1 between the different points. This function is used as preprocessing for the plot of
    the decision areas of the logistic regression.
    """
    first_col = X_train.iloc[:, 0].to_frame()
    second_col = X_train.iloc[:, 1].to_frame()
    x_min, x_max = first_col.min().values[0] - 1, first_col.max().values[0] + 1
    y_min, y_max = second_col.min().values[0] - 1, second_col.max().values[0] + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
    return first_col, second_col, xx, yy


def print_cross_val_result(scores: list[float]) -> None:
    """Takes the list of scores, that is returned by the function
    `eval_cross_val` and prints the results.
    """
    score_names = [
        "accuracies",
        "precisions",
        "recalls",
        "F1_scores",
        "training accuracies",
    ]
    for score, name in zip(scores, score_names):
        print_score(name, score)


def print_score(metric: str, score: npt.NDArray) -> None:
    """Takes the 'metric' and the 'score' of the cross validation and prints
    the values, the mean and the standard deviation.
    """
    print(f"\033[4m{metric}\033[0m")
    print(f"\033[1mCross validation scores:\033[0m ", score)
    print(
        f"\033[1mMean\033[0m of the cross validation scores: ", round(score.mean(), 4)
    )
    print(
        f"\033[1mStandard deviation\033[0m of the cross validation scores: ",
        round(score.std(), 4),
        "\n",
    )


def scatter_plot(
    first_col: pd.DataFrame,
    second_col: pd.DataFrame,
    y_test: pd.DataFrame,
    title: str,
    contour: bool = False,
    spec: list = None,
) -> None:
    """Creates a scatter plot of the data points. This function also takes two optional parameters, ``contour`` and ``spec``
    these two inputs enable that the scatter plot also contains the decision boundary of the linear model.
    """
    fig = plt.figure(figsize=(8, 5))
    ax = fig.add_axes([0.1, 0.1, 0.6, 0.75])
    if contour:
        ax.contourf(spec[0], spec[1], spec[2], alpha=spec[3], cmap=spec[4])
    ax.scatter(
        first_col.values,
        second_col.values,
        c=y_test.values,
        edgecolors="k",
        marker="o",
        cmap=plt.cm.Paired,
    )
    ax.title.set_text(title)
    ax.set_xlabel(f"{first_col.columns[0]}")
    ax.set_ylabel(f"{second_col.columns[0]}")


def scatter_plot_train_test(filtered_data, X_train, X_test):
    """creates a scatter plot of the train and test data. 
    This is done to visualize that the test data is randomly drawn from the full dataset.
    """
    fig, ax = plt.subplots(figsize=(6, 4.5))
    U, s, VT = np.linalg.svd(filtered_data, full_matrices=False)
    Vk = VT[:2].T  # the rows of VT are the principal directions
    X_2dims_train = np.dot(X_train, Vk)
    X_2dims_test = np.dot(X_test, Vk)

    X_2dims_train_df = pd.DataFrame(X_2dims_train, columns=["1. Principal Component", "2. Principal Component"])
    X_2dims_train_df['Set'] = 'Train'

    X_2dims_test_df = pd.DataFrame(X_2dims_test, columns=["1. Principal Component", "2. Principal Component"])
    X_2dims_test_df['Set'] = 'Test'

    X_2dims_combined_df = pd.concat([X_2dims_train_df, X_2dims_test_df], ignore_index=True)

    # Plot the combined data with a single scatter plot
    sns.scatterplot(
        data=X_2dims_combined_df,
        x="1. Principal Component",
        y="2. Principal Component",
        hue="Set",
        alpha=0.8,
        legend = False
    )
    plt.show()


def train_plot_folds(
    k_range: list[int],
    k_results_train: list[float],
    k_result_test: list[float],
    k_result_f1: list[float], 
    y_lim_min: float, 
    y_lim_max: float,
    mode: Literal["Accuracy", "F1-Score"] = "Accuracy",
    text: str = "",
) -> None:
    fig, ax = plt.subplots(figsize=(8, 6.5))
    if mode == "Accuracy":
        ax.plot(
            k_range, k_result_test, label="Mean Test Accuracy", marker="o", color="b"
        )
        ax.plot(
            k_range, k_results_train, label="Mean Train Accuracy", marker="o", color="r"
        )
    else:
        ax.plot(k_range, k_result_f1, label="Mean Test F1-Score", marker="o", color="b")

    ax.set_xlabel("Number of Folds (k)")
    ax.set_ylabel(f"Mean {mode}")
    ax.set_ylim(y_lim_min, y_lim_max)
    ax.title.set_text(f"Mean {mode} vs. Number of Folds \n {text}")
    ax.legend()
    ax.grid(True)
    plt.show()


def train_test_split(
    X: pd.DataFrame, y: pd.DataFrame, test_size: float = 0.2, seed: bool = True
) -> tuple[pd.DataFrame]:
    """This function is splitting the input data and the corresponding labels into random train and test sets.

    Parameters:
    X: The input data
    y: The labels for the input data
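    test_size (float): The size ratio of the test set compared to all data points (0 <= test_size <= 1, default value: 0.2).
    seed (bool): If True, the NumPy random seed is set to a fixed value so that the split is reproducible (default value: True).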

    Returns:
    Returns a tuple with the following values:
    X_train: The input data, that is used for training.
    X_test: The input data, that is used for testing.
    y_train: The labels, that are used for training.
    y_test: The labels, that are used for testing.
    """
    if seed:
        np.random.seed(3)
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 100 - test_size * 100)

    X_train = X[split]
    X_test = X[~split]
    y_train = y[split]
    y_test = y[~split]

    return X_train, X_test, y_train, y_test

## <a id="sec_a">a) Prepare and analyse the data</a>
The dataset is stored in a CSV table whose columns represent different features of the data. \
The ``City`` column contains the label for each row of data. \
First, the .csv table is read into a DataFrame (DataFrame: an efficient and powerful data storage type in Python). \
The next step is to split the data columnwise into the input data columns and the label. 

### Load the dataset

In [None]:
# DO NOT CHANGE THIS CODE
# get the filepath and import data
orig_data = pd.read_csv("urban_air_quality_and_health.csv")

# display 5 rows of the dataset
orig_data.iloc[[5, 120, 250, 650, 907]].squeeze()

### Split dataset in data and labels
The next two code cells define the columns to be removed and display them. \
Afterwards, the dataset is split into labels and data, keeping only the columns needed for the experiment. 

<u>TASK:</u> Inspect the columns that have been removed (i.e. ``columns_to_drop``). \
What do the removed columns have in common? \
Use this table to solve the first question of the moodle test. 

In [None]:
# DO NOT CHANGE THIS CODE
# prints the columns, that should be removed

columns_to_drop = [
    "datetimeEpoch", "datetime", "Month", "preciptype", "sunrise", "sunriseEpoch", "sunset",
    "sunsetEpoch", "conditions", "description", "icon", "stations", "source", "Season",
    "Day_of_Week", "feelslikemin", "feelslikemax", "feelslike", "Condition_Code", "Is_Weekend",
]

extra_drop = [
    "tempmax", "tempmin", "temp", "dew", "humidity", "precip", "precipprob", "precipcover",
    "snow", "snowdepth", "windgust", "windspeed", "winddir", "pressure", "cloudcover", 
    "visibility", "solarradiation", "solarenergy", "uvindex", "moonphase", "City",
]

print("Columns to be removed from the dataset")
orig_data[columns_to_drop].iloc[:5, :]

In [None]:
# DO NOT CHANGE THIS CODE
# save the city column as labels
labels = orig_data["City"].to_frame()

# preprocess data, remove some data columns
data = orig_data.drop(columns=columns_to_drop)
data = data.drop(columns=extra_drop)

print("These are some rows of the dataset after these preprocessing steps: ")
data.iloc[[5, 120, 250, 650, 907]].squeeze()

The dataset currently consists of 5 columns that describe different features in the data.

The following code cell displays the dimension of the original data in the form ``(rows, columns)``. \
Below, all remaining column names are printed to give an overview. 


In [None]:
# DO NOT CHANGE THIS CODE
# compare the dimensions of the data and the labels
print("dimensions of the data before preprocessing: ", orig_data.shape)
print("dimension of the data after preprocessing: ", data.shape)
data.columns

### Normalize and clean data 
The code in the next cells normalises the data and removes columns containing NaN values. \
NaN means 'not a number'. These values cannot be processed, so it is important to remove them before using the data.



**Rule of thumb**

If all features share the same unit, e.g. all are measured in m/s, it is not necessary to divide by the standard deviation. \
In that case, normalise by making the data mean free:
$$
  \text{normalized feature} = {x - \mu}
$$

In this dataset, the features have different units, so it is useful to normalise the \
data to zero mean and unit variance (Z-score). 
$$
  \text{normalized feature} = \frac{{x - \mu}}{{\sigma}}
$$
If you are training machine learning models with enough data, this rule does not necessarily apply, but it is a good reference.
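
As a minimal sketch of the Z-score formula above, the following example normalises a small made-up table (the column names `a`, `b` and `const` are hypothetical and not part of the dataset). It also illustrates one way NaN values can arise: a constant column has a standard deviation of 0, so the division yields NaN and the column has to be dropped.

```python
import pandas as pd

# Toy data, assumed for illustration only.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0],
        "b": [10.0, 20.0, 60.0],
        "const": [5.0, 5.0, 5.0],  # zero variance
    }
)

# Z-score: subtract the column mean, then divide by the column standard deviation.
toy_norm = (toy - toy.mean()) / toy.std()

# The constant column became NaN (0 / 0) and is removed here.
toy_clean = toy_norm.dropna(axis=1)

print(toy_clean.columns.tolist())  # ['a', 'b']
print(toy_clean.std().tolist())
```

After normalisation, each remaining column has a mean of approximately 0 and a standard deviation of 1.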

In [None]:
# DO NOT CHANGE THIS CODE
# normalize the data
normalized_data = (data - data.mean()) / data.std()

In [None]:
# DO NOT CHANGE THIS CODE
# remove columns, containing nan values
normalized_data = normalized_data.dropna(axis=1)

check_data_consistency(data, normalized_data, labels)

### Preparation of the data
The task is to train a logistic regression model that predicts the city where the air quality data are collected. \
We are going to train a binary logistic regression model, so we need to prepare the data so that it only contains data from two cities.\
It is important to note that the cities are represented with different frequencies in the dataset.

In [None]:
# DO NOT CHANGE THIS CODE
frequency_counts = labels[labels.columns[0]].value_counts().reset_index()
frequency_counts.columns = ["City", "Frequency"]
frequency_counts

From all the cities in the data, two are selected to train the model on. \
After filtering, the dataset contains only the rows labelled with one of these two cities. \
The model is trained on this data, so it only decides between these two cities. \
The labels DataFrame currently contains city names. \
To enable the model to classify the cities, the city names are replaced by a binary coding. \
For example, all entries that were New York City become 0 and all entries that were Los Angeles become 1.
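
The filtering and encoding steps can be sketched on a tiny, made-up label table. The names `toy_labels`, `toy_cities` and `toy_mapping` are assumptions for illustration; the notebook itself uses the helper functions `filter_label_values`, `create_city_mapping` and `binarize_labels` for these steps.

```python
import pandas as pd

# Made-up labels; the real labels come from the "City" column of the dataset.
toy_labels = pd.DataFrame(
    {"City": ["New York City", "Dallas", "Los Angeles", "New York City"]}
)
toy_cities = ["New York City", "Los Angeles"]

# Keep only the rows whose label is one of the two selected cities.
toy_filtered = toy_labels[toy_labels["City"].isin(toy_cities)]

# Assign 0 to the first city and 1 to the second.
toy_mapping = {city: code for code, city in enumerate(toy_cities)}

# Replace the city names with their integer codes.
toy_encoded = toy_filtered["City"].map(toy_mapping)

print(toy_mapping)           # {'New York City': 0, 'Los Angeles': 1}
print(toy_encoded.tolist())  # [0, 1, 0]
```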

<u>TASK:</u> Please choose two different cities of the table above and replace the None values below with the city names. \
The city names have to be in quotation marks, which mark the value as a string (e.g. "Phoenix").

In [None]:
city1 = None # replace None with a city name from the table above
city2 = None # replace None with a city name from the table above


# DO NOT CHANGE THIS CODE
values = check_values(city1, city2)

In [None]:
# DO NOT CHANGE THIS CODE
# filters data, only rows with the cities above are left
filtered_data, filtered_labels = filter_label_values(normalized_data, labels, values)

# map the label values so that the labels contain only numbers
mapping = create_city_mapping(values)
print("\n mapping: ", mapping)
binarized_labels = binarize_labels(filtered_labels, mapping)

if filtered_data.shape[0] != binarized_labels.shape[0]:
    raise Exception("Error: data and labels do not have the same number of rows.")

Data preparation is now complete. \
With all this pre-processing done, let's print the labels and data.

In [None]:
# DO NOT CHANGE THIS CODE
print("labels:")
binarized_labels = binarized_labels.reset_index()['City'].to_frame()
binarized_labels

In [None]:
# DO NOT CHANGE THIS CODE
print("data:")
filtered_data = filtered_data.reset_index().drop(["index"], axis=1)
filtered_data

[Back](#sec_toc)

## <a id="sec_b">b) Simple Train-Test Split</a>

### 1. Load the dataset

The dataset is stored in a CSV table and consists of columns representing different features in the data. \
The ``City`` column contains the label for each row of data. \
First, the .csv table is read into a DataFrame (DataFrame: an efficient and powerful data storage type in Python). \
The next step is to split the data columnwise into the input data columns and the labels. 

<u>TASK:</u> To load and preprocess the dataset, please execute all the code cells in [Prepare and analyse the data](#sec_a).

### 2. Split the dataset into training and test sets using an 80-20 train-test ratio

To test whether the model generalises well to new, unseen data, the data must be split into separate subsets. \
In this case, we will split the dataset into two subsets to be used for training and testing the model. \
The train-test ratio should be 80-20, meaning that 80% of the data is in the training set and 20% is in the test set. \
This ratio is a common choice for train-test splits.
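
As a rough sketch of an 80-20 split, the example below uses scikit-learn's own `train_test_split` on made-up data. This is not the helper function defined at the top of this notebook, which is why it is imported under the alias `sk_train_test_split` (an assumed name) so that the notebook's helper is not overwritten.

```python
import numpy as np
from sklearn.model_selection import train_test_split as sk_train_test_split

# Toy data: 10 samples with 2 features each; the labels alternate between 0 and 1.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 places 20% of the rows in the test set; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = sk_train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Every row ends up in exactly one of the two subsets, so the model is never tested on rows it has seen during training.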

### Plot the distribution of the training and test dataset

<u>TASK:</u> Run the train-test split (following two code cells) several times and check the distribution of the data points. \
Without a fixed seed, each run draws a new random split, so the plot should change from run to run.

**Important**: The assignment of data points to the train and test splits is random. \
The seed ensures that the random split is the same for each run of the code cell. \
When you are finished checking the train-test split, please change the code to ``random_seed=True``.
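
The effect of a seed can be sketched in a few lines. The helper `draw_numbers` below is hypothetical and only mimics the random draw that a random split relies on. It uses `np.random.default_rng` instead of the global `np.random.seed`, an assumption made here so that the sketch does not alter the random state of the rest of the notebook.

```python
import numpy as np


def draw_numbers(n_rows, seed=None):
    """Hypothetical helper: draw one random number per row."""
    rng = np.random.default_rng(seed)  # seed=None starts from an unpredictable state
    return rng.random(n_rows)


first = draw_numbers(5, seed=3)
second = draw_numbers(5, seed=3)
unseeded = draw_numbers(5)

# The same seed reproduces exactly the same numbers.
print(np.array_equal(first, second))  # True
# Without a seed, the numbers differ from run to run.
print(unseeded)
```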

In [None]:
random_seed = False # change this value to True after executing the next two code cells a few times

In [None]:
# DO NOT CHANGE THIS CODE
X_train, X_test, y_train, y_test = train_test_split(
    filtered_data, binarized_labels, test_size=0.2, seed=random_seed
)

# Plot train and test datasets
scatter_plot_train_test(filtered_data, X_train, X_test)

In [None]:
# DO NOT CHANGE THIS CODE
check_value(random_seed)

### 3. Train a simple classification model on the training set

In this case, a logistic regression model is trained on the data. \
The model is taken from scikit-learn, an open source machine learning library. \
It is first trained on the training data and then used to make predictions on the test data.  

Logistic regression is designed for classification problems. It can be thought of as a 'decision line' \
or 'decision plane' that divides the data points into two or more classes. \
In our case, the cities are the classes we want to predict. \
The plots further down show this 'decision line'. 
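
To make the idea of a 'decision line' concrete, here is a minimal sketch on two made-up clusters of points (the data and all variable names are assumptions for illustration). It fits scikit-learn's `LogisticRegression` and checks that a point is assigned to class 1 exactly when the linear score `w1*x1 + w2*x2 + b` is positive, so the line where this score equals 0 separates the two classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated toy clusters in 2D (made-up data).
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X_toy = np.vstack([class0, class1])
y_toy = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X_toy, y_toy)

# Weights and bias of the fitted model.
w = clf.coef_[0]
b = clf.intercept_[0]

# Linear score of every point; the decision line is where the score is 0.
scores = X_toy @ w + b
manual_pred = (scores > 0).astype(int)

print(np.array_equal(manual_pred, clf.predict(X_toy)))  # True
```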

In [None]:
# DO NOT CHANGE THIS CODE
logistic_reg = linear_model.LogisticRegression()

# fit the logistic regression model on the training data
model = logistic_reg.fit(X_train, y_train.to_numpy().ravel())

# now the trained model predicts on the test data
y_pred = logistic_reg.predict(X_test)
print("predictions:", y_pred)

### Plot the decision boundary and the datapoints of the test dataset

This plot visualizes the decision boundary of the model. \
Since the dataset has 5 dimensions, plotting all dimensions directly isn't feasible. \
Instead, we used Principal Component Analysis (PCA) to reduce the data to a few key dimensions that capture the most significant patterns. \
The axes shown represent combinations of original features that contain the highest variance, \
helping to show the underlying structure of the data in a simpler, two-dimensional view.
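
As a rough sketch of what such a reduction does, the example below applies scikit-learn's `PCA` to random stand-in data with 5 features. The notebook itself uses its own `get_principal_components` helper; the data and the names `X_5d`, `X_2d` and `pca` here are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for a dataset with 5 features
X_5d = rng.normal(size=(200, 5))
# Make two of the features strongly correlated
X_5d[:, 1] = 2.0 * X_5d[:, 0] + 0.1 * X_5d[:, 1]

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_5d)

print("shape before:", X_5d.shape, "after:", X_2d.shape)
# Share of the total variance captured by each principal component
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The explained variance ratio indicates how much of the original spread in the data survives in the two plotted axes.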

<u>TASK:</u> Plot the test and train datasets with the logistic regression decision line. \
To do so, change the code below to use either ``X_train`` and ``y_train`` or ``X_test`` and ``y_test``. \
These are variable names, so please enter them without quotation marks.

In [None]:
# change these values:
X_plot = None # e.g. X_train or X_test
y_plot = None # e.g. y_train or y_test


# DO NOT CHANGE THIS CODE
check_plot_values(X_plot, y_plot)

In [None]:
# DO NOT CHANGE THIS CODE
X_train_2dims = get_principal_components(X_plot, 2)
first_col, second_col, xx, yy = prepare_4_plot(X_train_2dims)

logistic_reg_plot = linear_model.LogisticRegression()
model = logistic_reg_plot.fit(X_train_2dims, y_plot.to_numpy().ravel())

plot_decision_boundary_2d(model, xx, yy, first_col, second_col, y_plot, mapping)

In [None]:
# DO NOT CHANGE
X_train_3dims, xx, yy, first_col, second_col, third_col = prepare_data(X_plot)

# Fit a logistic regression model
zz, model = logistic_reg_3d(X_train_3dims, y_plot, xx, yy)
# plot the data and the decision boundary
plot_decision_boundary_3d(model, xx, yy, zz, y_plot, first_col, second_col, third_col, mapping)

### 4. Evaluate the performance

The predictions from the previous section are now used to evaluate the performance on the test set with commonly used metrics (accuracy, precision, recall and F1-score).

<u>Confusion Matrix</u> \
The confusion matrix sorts the predictions into four categories. The counts in these categories are later used to calculate the evaluation metrics.

![Confusion Matrix](https://miro.medium.com/v2/resize:fit:640/format:webp/1*Z54JgbS4DUwWSknhDCvNTQ.png)

- True Positives (TP): Cases where the model correctly predicts a positive class (e.g., predicts "Yes" when it is indeed "Yes").

- True Negatives (TN): Cases where the model correctly predicts a negative class (e.g., predicts "No" when it is indeed "No").

- False Positives (FP): Cases where the model incorrectly predicts the positive class (e.g., predicts "Yes" when it is actually "No"). 

- False Negatives (FN): Cases where the model incorrectly predicts the negative class (e.g., predicts "No" when it is actually "Yes").
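
The four categories can be counted by hand and compared with scikit-learn's `confusion_matrix`. The label arrays below are made-up example values, and the `_demo` names are only used in this sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Count each category directly from the definitions
tp_demo = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))
tn_demo = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))
fp_demo = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))
fn_demo = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))
print(f"TP={tp_demo} TN={tn_demo} FP={fp_demo} FN={fn_demo}")  # TP=3 TN=3 FP=1 FN=1

# For binary labels, scikit-learn returns the counts in the order tn, fp, fn, tp
tn_sk, fp_sk, fn_sk, tp_sk = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print("matches scikit-learn:", (tn_sk, fp_sk, fn_sk, tp_sk) == (tn_demo, fp_demo, fn_demo, tp_demo))
```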


**Criteria for balanced datasets**

<ins>Accuracy</ins>

Fraction of correct predictions from all predictions.
$$
  accuracy =\frac{tn + tp}{tn + fp + fn + tp}
$$


**Criteria for unbalanced datasets**

<ins>Precision</ins>

Fraction of correct positive predictions out of all positive predictions made by the model.
$$
  precision =\frac{tp}{tp + fp}
$$

<ins>Recall</ins>

Fraction of correct positive predictions out of all data points that are actually positive. 
$$
  recall =\frac{tp}{tp + fn}
$$

<ins>F1-score</ins>

The F1-score combines precision and recall into a single value (their harmonic mean). 
$$
  f1 =\frac{2 \cdot tp}{2 \cdot tp + fp + fn}
$$
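
A small worked example shows why accuracy alone can be misleading on unbalanced data. The counts below are invented for illustration (20 positive and 980 negative samples); the `_c` variable names are only used here.

```python
# Hypothetical counts for an unbalanced test set (20 positives, 980 negatives)
tp_c, fn_c, fp_c, tn_c = 5, 15, 5, 975

accuracy_c = (tn_c + tp_c) / (tn_c + fp_c + fn_c + tp_c)
precision_c = tp_c / (tp_c + fp_c)
recall_c = tp_c / (tp_c + fn_c)
f1_c = 2 * tp_c / (2 * tp_c + fp_c + fn_c)

# The F1-score equals the harmonic mean of precision and recall
f1_harmonic = 2 * precision_c * recall_c / (precision_c + recall_c)

print(f"accuracy={accuracy_c:.3f}")    # accuracy=0.980
print(f"precision={precision_c:.3f}")  # precision=0.500
print(f"recall={recall_c:.3f}")        # recall=0.250
print(f"f1={f1_c:.3f}")                # f1=0.333
```

Although 98 % of all predictions are correct, the model finds only a quarter of the positive samples, which the recall and F1-score reveal.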

In [None]:
# DO NOT CHANGE THIS CODE
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

precision = precision_score(y_test, y_pred)
print("Precision: ", precision)

recall = recall_score(y_test, y_pred)
print("Recall: ", recall)

F1_score = f1_score(y_test, y_pred)
print("F1-Score: ", F1_score)

[Back](#sec_toc)

## <a id="sec_c"> c) Implementing K-Fold Cross-Validation</a> 

### 1. Load the data
If you have run all the code cells above, the data is already loaded and you don't need to do anything. \
If not, use the code sections provided in [Prepare and analyse the data](#sec_a) to load the data and perform preprocessing.


### 2. Implement k-fold cross-validation

<u>Short introduction to k-fold cross-validation</u> \
K-fold cross-validation is a technique used to evaluate how well a machine learning model performs on new, unseen data. \
For k-fold cross-validation, only the training set is used. \
The test set is split off from the whole dataset beforehand and set aside until the final evaluation. \
The training set is divided into k parts of (nearly) equal size, called "folds". \
In each round, one fold is kept as the validation set, while the remaining k-1 folds are used to train the model. \
This process is repeated k times, so that each fold serves as the validation set exactly once. \
The model's performance is measured each time, and at the end the results are averaged to give a more reliable performance estimate. \
This makes the estimate less dependent on one particular split and helps to detect overfitting, giving a better indication of how the model will perform in real-world situations. 

![k-fold cross validation](https://miro.medium.com/v2/resize:fit:640/format:webp/1*PdwlCactbJf8F8C7sP-3gw.png)
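
The rotation of the validation fold can be sketched with scikit-learn's `KFold` on ten placeholder samples. The notebook uses its own `create_k_folds` helper; the names `X_cv`, `k_demo` and `val_seen` below are assumptions for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_cv = np.arange(10).reshape(10, 1)  # 10 hypothetical training samples
k_demo = 5

kf = KFold(n_splits=k_demo, shuffle=True, random_state=1)
val_seen = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_cv), start=1):
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
    val_seen.extend(val_idx.tolist())

# Every sample is used for validation exactly once across the k rounds
print("validation coverage:", sorted(val_seen))
```

With k = 5 each round trains on 8 samples and validates on the remaining 2.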

<u>TASK:</u> Try different values of k in the range [3, ..., 10] and evaluate the changes on the model results. \
Do this by changing the value of k and running the following three cells of code.


In [None]:
k = None # Change this parameter to a number in the range of [3, ..., 10]


# DO NOT CHANGE
check_k(k)
data_folds, label_folds = create_k_folds(X_train, y_train, k)

### 3. Train a simple classification model and evaluate it k-times

A logistic regression model is trained and evaluated once per fold, using the value of k chosen above. For example, k = 5 corresponds to an 80-20 split between training and validation data in each round.

In [None]:
# DO NOT CHANGE THE CODE
# train the logistic regression model for k times and evaluate
scores_k = eval_cross_val(data_folds, label_folds)
print_cross_val_result(scores_k)

### 4. Plot the training and validation accuracy for each fold
The plot below displays the training and validation accuracies for each fold from the most recent training and evaluation run. \
The x-axis represents the fold number (or count of training folds), while the y-axis shows the accuracy values for both training and validation sets. \
This visualization allows you to observe how the model's performance varies across folds.

In [None]:
# DO NOT CHANGE THE CODE
y_min, y_max = get_ymin_ymax_scores([scores_k])
plot_acc_per_fold(scores_k, y_min, y_max)

### 5. Plot the mean accuracy for different folds (k = 3, 5, 10)
The plot below shows the mean accuracy for different values of k (k = 3, 5, 10) from k-fold cross-validation. \
The x-axis represents the number of folds, while the y-axis indicates the mean accuracy achieved for each value of k. \
This plot provides insight into how the choice of k impacts the model's performance.


**Important:** Please note the scaling of the y-axis, which indicates accuracy!
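
As a rough illustration of such a comparison, the sketch below computes the mean validation accuracy for several values of k with scikit-learn's `cross_val_score`. It runs on synthetic data from `make_classification`, not on the notebook's dataset, and the names `X_syn`, `y_syn` and `mean_acc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with 5 features
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=1)

mean_acc = {}
for n_folds in [3, 5, 10]:
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    scores = cross_val_score(
        LogisticRegression(), X_syn, y_syn, cv=cv, scoring="accuracy"
    )
    mean_acc[n_folds] = scores.mean()
    print(f"k={n_folds:2d}: mean accuracy={scores.mean():.3f} (std={scores.std():.3f})")
```

The differences between the means are often small, which is why the scaling of the y-axis matters when reading the plot.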

In [None]:
# DO NOT CHANGE THE CODE
k_range = [3, 5, 10]

k_result_train, k_result_val, k_result_f1 = get_results(X_train, y_train, k_range)
y_min, y_max = get_ymin_ymax(k_result_train, k_result_val) 
train_plot_folds(k_range, k_result_train, k_result_val, k_result_f1, y_lim_min=y_min, y_lim_max=y_max)

[Back](#sec_toc)

## <a id="sec_d">d) Stratified K-Fold Cross-Validation</a>

<u>Short introduction to stratified k-fold cross-validation</u> \
Stratified k-fold cross-validation is a technique used to evaluate the performance of a model on unseen data, \
while maintaining the same distribution of target classes in each fold as in the original dataset. \
This is particularly important when the dataset is unbalanced. \
The dataset is divided into k folds of (nearly) equal size, ensuring that each fold has a similar proportion of classes. \
In each round, one fold is used as the validation set, while the remaining k-1 folds are used to train the model. \
This process is repeated k times, so that each fold is used as the validation set once. \
The model's performance is measured each time and the results are averaged to provide a more reliable estimate of accuracy. \
This technique gives more representative validation results where there is class imbalance, because no fold ends up with too few (or no) samples of the rare class.

![k-fold cross validation](https://miro.medium.com/v2/resize:fit:640/format:webp/1*PdwlCactbJf8F8C7sP-3gw.png)
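
The difference from regular k-fold can be sketched by counting the rare class in each validation fold. The labels below are invented (90 samples of class 0, 10 of class 1), and `y_unbal`, `X_unbal` and `minority_counts` are names used only in this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical unbalanced labels: 90 samples of class 0, 10 of class 1
y_unbal = np.array([0] * 90 + [1] * 10)
X_unbal = np.zeros((100, 1))  # the feature values do not influence the split


def minority_counts(splitter):
    """Number of class-1 samples in the validation part of each fold."""
    return [
        int(y_unbal[val_idx].sum())
        for _, val_idx in splitter.split(X_unbal, y_unbal)
    ]


plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=1))
strat = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=1))

print("KFold           class-1 per fold:", plain)
print("StratifiedKFold class-1 per fold:", strat)
```

With stratification every validation fold contains exactly 2 of the 10 rare samples, whereas regular k-fold can distribute them unevenly.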

### 1. Load the data
Stratified k-fold cross-validation is used for unbalanced datasets. \
The dataset contains the different labels (cities) at different frequencies. \
To create an unbalanced dataset, some randomly drawn data rows with one label value are deleted from the dataset.

<u>TASK:</u> Change the ``stratification_factor`` to a value between 0 and 1.

The data of city2 is then reduced by the fraction given by the stratification factor \
(for example, a factor of 0.8 removes 80 % of the city2 rows). \
This simulates an unbalanced dataset.
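
The idea of removing a fraction of one class can be sketched with pandas. This is not the notebook's `minorize_label` helper, only an illustration under the assumption of a balanced toy label column; `labels_demo`, `factor_demo` and `labels_reduced` are hypothetical names.

```python
import pandas as pd

# Hypothetical balanced labels for two cities
labels_demo = pd.DataFrame({"City": ["city1"] * 100 + ["city2"] * 100})
factor_demo = 0.8  # fraction of city2 rows to remove

# Randomly pick the given fraction of city2 rows and drop them
city2_rows = labels_demo[labels_demo["City"] == "city2"]
rows_to_drop = city2_rows.sample(frac=factor_demo, random_state=1).index
labels_reduced = labels_demo.drop(index=rows_to_drop)

print(labels_reduced["City"].value_counts())
```

After the reduction, 100 rows of city1 and 20 rows of city2 remain.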

In [None]:
stratification_factor = None # Please fill in a value between 0 and 1
# This value is used to delete stratification_factor * 100 % of the data that belongs to the city mapped to the value 1 (city2)


# DO NOT CHANGE 
check_factor(stratification_factor)

In [None]:
# DO NOT CHANGE THE CODE
# to create an imbalanced dataset, stratification_factor * 100 % of the data rows whose mapping is 1 (previously assigned city2) are dropped
y_train_cities = binarize_labels_rev(y_train, mapping)
normalized_data_min, labels_min = minorize_label(
    X_train, y_train_cities, values[1], stratification_factor
)


filtered_data_str, filtered_labels = filter_label_values(
    normalized_data_min, labels_min, values
)

print("\n mapping: ", mapping)
binarized_labels_str = binarize_labels(filtered_labels, mapping)

if filtered_data_str.shape[0] != binarized_labels_str.shape[0]:
    raise Exception("Error: data and labels do not have the same number of rows.")

### 2. Determine the number of classes and samples per class

In [None]:
# DO NOT CHANGE THE CODE
label_counts = filtered_labels[filtered_labels.columns[0]].value_counts().reset_index()
label_counts.columns = ["City", "Frequency"]
label_counts

### 3. Visualize the findings

In [None]:
# DO NOT CHANGE THE CODE
label_counts.plot(kind="bar", rot=0)
plt.title("Histogram of Data Labels")
plt.ylabel("Count")
plt.show()

### 4. Implement stratified K-fold cross-validation 
Stratified k-fold cross-validation ensures that each fold preserves the original class proportions. 

<u>TASK:</u> Compare the performance of the model when using different values of k.

In [None]:
k = None # Change k to a number in the range [3, ..., 10]


# DO NOT CHANGE
check_k(k)

In [None]:
# DO NOT CHANGE 
data_folds_k_str, label_folds_k_str = create_k_folds(
    filtered_data_str, binarized_labels_str, k, StratifiedKFold
)
scores_k_str = eval_cross_val(data_folds_k_str, label_folds_k_str)
print_cross_val_result(scores_k_str)

### 5. Plot the training and validation accuracy for each fold
The plot below displays the training and validation accuracies for each fold from the most recent training and evaluation run. \
The x-axis represents the fold number (or count of training folds), while the y-axis shows the accuracy values for both training and validation sets. \
This visualization allows you to observe how the model's performance varies across folds.

The first plot shows the imbalanced dataset with stratified k-fold cross-validation; the second plot uses regular k-fold cross-validation.

<u>TASK:</u> Compare both plots. What do you observe?

In [None]:
# DO NOT CHANGE THE CODE
data_folds_k_imbalanced, label_folds_k_imbalanced = create_k_folds(
    filtered_data_str, binarized_labels_str, k, KFold
)
scores_k_imbalanced = eval_cross_val(data_folds_k_imbalanced, label_folds_k_imbalanced)

y_min, y_max = get_ymin_ymax_scores([scores_k_str, scores_k_imbalanced])

text_str = "Imbalanced dataset, Stratified k-fold cross validation"
plot_acc_per_fold(scores_k_str, y_min, y_max, text_str)

text_kf = "Imbalanced dataset, basic k-fold cross validation"
plot_acc_per_fold(scores_k_imbalanced, y_min, y_max, text_kf)

### 6. Plot the mean accuracy for different folds (k = 3, 5, 7, 10, 20)
The plot below shows the mean accuracy for different values of k (k = 3, 5, 7, 10, 20) from k-fold cross-validation. \
The x-axis represents the number of folds, while the y-axis indicates the mean accuracy achieved for each value of k. \
This plot provides insight into how the choice of k impacts the model's performance.

The first plot shows the imbalanced dataset with stratified k-fold cross-validation; the second plot uses regular k-fold cross-validation.

**Important:** Please note the scaling of the y-axis, which indicates accuracy!

<u>TASK:</u> Compare the results with regular k-fold cross-validation and discuss the implications for performance metrics such as F1 score.

In [None]:
# DO NOT CHANGE THE CODE
k_range = [3, 5, 7, 10, 20]

k_result_train_strkfold, k_result_val_strkfold, k_result_f1_strkfold = get_results(filtered_data_str, binarized_labels_str, k_range, fold=StratifiedKFold)
k_result_train_kfold, k_result_val_kfold, k_result_f1_kfold = get_results(filtered_data_str, binarized_labels_str, k_range, fold=KFold)

y_min, y_max = get_ymin_ymax(k_result_train_strkfold, k_result_val_strkfold)

train_plot_folds(
    k_range, k_result_train_strkfold, k_result_val_strkfold, k_result_f1_strkfold, y_min, y_max, text=text_str
)

train_plot_folds(k_range, k_result_train_kfold, k_result_val_kfold, k_result_f1_kfold, y_min, y_max, text=text_kf)

### 7. Plot the mean F1-Score for different folds (k = 3, 5, 7, 10, 20)
The plot below shows the mean F1-Score for different values of k (k = 3, 5, 7, 10, 20) from k-fold cross-validation. \
The x-axis represents the number of folds, while the y-axis indicates the mean F1-Score achieved for each value of k. \
This plot provides insight into how the choice of k impacts the model's performance.
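
A comparable evaluation can be sketched with scikit-learn's `cross_validate`, which reports several metrics at once. The data below is synthetic and unbalanced by construction (an assumption for this example), and the names `X_imb`, `y_imb` and `results` are hypothetical; the notebook itself uses its own `get_results` helper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_validate

# Hypothetical unbalanced data: roughly 90 % class 0 and 10 % class 1
X_imb, y_imb = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=1
)

results = {}
for name, splitter_cls in [("KFold", KFold), ("StratifiedKFold", StratifiedKFold)]:
    cv = splitter_cls(n_splits=5, shuffle=True, random_state=1)
    out = cross_validate(
        LogisticRegression(), X_imb, y_imb, cv=cv, scoring=["accuracy", "f1"]
    )
    # Mean validation accuracy and mean validation F1-score over the folds
    results[name] = (out["test_accuracy"].mean(), out["test_f1"].mean())
    print(f"{name:16s} accuracy={results[name][0]:.3f}  F1={results[name][1]:.3f}")
```

On unbalanced data the F1-score is typically lower than the accuracy, because it is not inflated by the many correctly predicted samples of the majority class.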

In [None]:
# DO NOT CHANGE THE CODE
k_range = [3, 5, 7, 10, 20]

k_result_train_strkfold, k_result_val_strkfold, k_result_f1_strkfold = get_results(filtered_data_str, binarized_labels_str, k_range, fold=StratifiedKFold)
k_result_train_kfold, k_result_val_kfold, k_result_f1_kfold = get_results(filtered_data_str, binarized_labels_str, k_range, fold=KFold)

y_min, y_max = get_ymin_ymax(k_result_f1_strkfold, k_result_f1_kfold)

train_plot_folds(
    k_range, k_result_train_strkfold, k_result_val_strkfold, k_result_f1_strkfold, y_min, y_max, mode = 'F1-Score', text=text_str
)

train_plot_folds(k_range, k_result_train_kfold, k_result_val_kfold, k_result_f1_kfold, y_min, y_max, mode = 'F1-Score', text=text_kf)

[Back](#sec_toc)

## <a id="sec_e">Bonus) Plot the data</a>
A dataset often has multiple features (columns). \
Plotting the data directly in such a high-dimensional feature space (more than 3 dimensions) is not possible. \
For plotting, but also for data processing, it is sometimes beneficial to convert \
the dataset into a lower-dimensional feature space (reduce the number of columns). \
This can be done with different dimensionality reduction techniques.
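
One such technique is t-SNE. The following minimal sketch applies scikit-learn's `TSNE` to two artificial clusters in a 5-dimensional space. The data, the chosen `perplexity` and the names `X_high` and `X_low` are assumptions for illustration and do not reflect the notebook's dataset.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical data: two clusters in a 5-dimensional feature space
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(40, 5))
cluster_b = rng.normal(loc=6.0, scale=1.0, size=(40, 5))
X_high = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples
X_low = TSNE(n_components=2, perplexity=15, random_state=1).fit_transform(X_high)
print("shape before:", X_high.shape, "after:", X_low.shape)
```

Unlike PCA, the t-SNE axes have no direct interpretation in terms of the original features; the method mainly tries to keep neighbouring points close together.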

The next sections provide plots of the data using dimensionality reduction techniques. \
These tasks are optional; work on them if you are interested.

The data is plotted along two features. \
These features contain the most information for classifying the data. \
The different cities are marked with different colors. 

In this section we work with a reduced dataset, which only contains the following features:
- ``temp``
- ``dew``
- ``humidity``
- ``pressure``
- ``sunriseEpoch``


In [None]:
# DO NOT CHANGE THIS CODE
columns_to_drop = [
    "datetimeEpoch", "datetime", "Month", "preciptype", "Health_Risk_Score", "sunrise", "sunset",
    "sunsetEpoch", "conditions", "description", "icon", "stations", "source", "Season",
    "Day_of_Week", "feelslikemin", "feelslikemax", "feelslike", "Condition_Code", "Is_Weekend",
    "tempmax", "tempmin", "severerisk", "Temp_Range", "Heat_Index", "precip", "precipprob", "precipcover",
    "snow", "snowdepth", "windgust", "windspeed", "winddir", "Severity_Score", "cloudcover", 
    "visibility", "solarradiation", "solarenergy", "uvindex", "moonphase", "City",
]
labels = orig_data["City"].to_frame()
data = orig_data.drop(columns=columns_to_drop)
normalized_data = (data - data.mean()) / data.std()
normalized_data = normalized_data.dropna(axis=1)
data.columns

In [None]:
# DO NOT CHANGE
first_col = normalized_data["humidity"].to_frame()
second_col = normalized_data["temp"].to_frame()

cities = labels[labels.columns[0]].unique()

mapping_all_cities = create_city_mapping(cities)
all_labels_bin = binarize_labels(labels, mapping_all_cities)

title = "Data points along the two most important features"
scatter_plot(first_col, second_col, all_labels_bin, title)
create_legend(mapping_all_cities, all_labels_bin)

The data is plotted along the two most important features. \
These features contain the most information for classifying the data. \
The different cities are marked with different colors. 

In [None]:
# DO NOT CHANGE THIS CODE
# PCA

X_train_2dims = get_principal_components(normalized_data, 2)
first_col, second_col, _, _ = prepare_4_plot(X_train_2dims)

cities = labels[labels.columns[0]].unique()

mapping_all_cities = create_city_mapping(cities)
all_labels_bin = binarize_labels(labels, mapping_all_cities)

title = "Data points along the two most important features"
scatter_plot(first_col, second_col, all_labels_bin, title)
create_legend(mapping_all_cities, all_labels_bin)

In [None]:
# DO NOT CHANGE THIS CODE   
# TSNE

# t-SNE converts pairwise similarities into joint probabilities and minimizes the Kullback-Leibler
# divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data
X_train_2dims_tsne = TSNE(n_components=2).fit_transform(normalized_data)
X_train_2dims_tsne = pd.DataFrame(
    X_train_2dims_tsne, columns=["first dim tsne", "second dim tsne"]
)
first_col, second_col, _, _ = prepare_4_plot(X_train_2dims_tsne)

cities = labels[labels.columns[0]].unique()

mapping_all_cities = create_city_mapping(cities)
all_labels_bin = binarize_labels(labels, mapping_all_cities)

title = "Data points along the two most important features"
scatter_plot(first_col, second_col, all_labels_bin, title)
create_legend(mapping_all_cities, all_labels_bin)

Below is a 3-dimensional plot of the data. You can try out different plot modes such as PCA (which aligns the axes so that the plot shows as much of the variance in the data as possible), t-SNE, and the original axes.

<u>TASK:</u> Try different plot modes and dimensions of the data and inspect how the plot changes. \
For ``mode`` you can use the values "orig axes", "tsne", and "pca".

In [None]:
# change to these modes: "orig axes", "tsne", and "pca"
mode = "pca"


# DO NOT CHANGE THIS CODE
title = f"Data points along the three most important features - {mode}"
plot_data_3dim(normalized_data, labels, title, mode)

[Back](#sec_toc)

### Used images
- [https://miro.medium.com/v2/resize:fit:640/format:webp/1*Z54JgbS4DUwWSknhDCvNTQ.png](https://miro.medium.com/v2/resize:fit:640/format:webp/1*Z54JgbS4DUwWSknhDCvNTQ.png)
- [https://miro.medium.com/v2/resize:fit:640/format:webp/1*PdwlCactbJf8F8C7sP-3gw.png](https://miro.medium.com/v2/resize:fit:640/format:webp/1*PdwlCactbJf8F8C7sP-3gw.png)
