# Plot the distribution of separate variables in the dataset

In this notebook, you will implement code that generates building blocks for the main types of plots used to visualize separate variables in a dataset:
* Bar chart (or pie chart)
* Boxplot
* Histogram

You will check your work by visually comparing your results with graphs generated by the out-of-the-box `pandas`, `seaborn`, and `matplotlib` tools. 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

**NOTE:** Some of the cells below contain the magic command `%%writefile`, which saves the contents of a cell to a specified file. In the notebook provided, all such commands are commented out. After you have completed the subtasks, please make sure to follow the steps below:

1. Uncomment **all** `%%writefile` commands
2. Rerun the entire notebook

This is important because we will check the generated `.py` files.

### Task 1: Count unique values

Your task is to implement a function that counts the unique values in specified columns of a given dataframe.

The function should handle two cases:
- When a single column is specified
- When multiple columns are specified, in which case you should count the number of unique combinations of values in these columns.

For each unique value (or combination of values), report how often it occurs in both absolute and relative terms (i.e., as a share of the total number of rows in the dataframe), and store the results in the columns `"count"` and `"freq"` of the output dataframe.
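
As a rough illustration of the expected output, the sketch below builds the `"count"` and `"freq"` columns for a small toy dataframe. The dataframe `toy` and the use of `groupby(...).size()` are assumptions made for this example only. They are one possible way to obtain the numbers, not the required implementation.

```python
import pandas as pd

# Hypothetical toy data, not part of the assignment
toy = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["u", "u", "u", "v"]})

# Single column: absolute counts and their share of all rows
single = toy.groupby("a").size().to_frame("count")
single["freq"] = single["count"] / len(toy)

# Several columns: every unique combination of values is counted
combo = toy.groupby(["a", "b"]).size().to_frame("count")
combo["freq"] = combo["count"] / len(toy)

print(single)
print(combo)
```

In both cases the `"freq"` column sums to 1 because every row of `toy` belongs to exactly one group.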

Please use the template `count_frequencies` for the implementation.

In [None]:
%%writefile ./../solutions/count_frequencies.py
import pandas as pd
from typing import Union


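def count_frequencies(df: pd.DataFrame, col: Union[str, list[str]]) -> pd.DataFrame:
    # Treat a single column name as a one-element list
    if isinstance(col, str):
        col = [col]
    # Count each unique value (or combination of values) in the selected columns
    df_count = df[col].value_counts().reset_index()
    df_count.columns = col + ['count']
    # Relative frequency with respect to the total number of rows
    df_count['freq'] = df_count['count'] / len(df)
    return df_count.set_index(col)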


#### Self-Check

In [None]:
def compare_count_plots(df, col):
    ax = sns.countplot(df, color='tab:blue', x=col, width=0.5)

    count_frequencies(df, col)['count'].sort_index().plot.bar(
        ax=ax, width=0.25, align='edge', color='tab:orange', grid=True, rot=0
    )
    ax.set_xlim(-0.5 + df[col].min(), df[col].max() + 0.5)
    ax.legend(['Reference', 'Your result'])

The function above plots the reference counts of the values in the variable against the counts obtained using your function.

If your implementation is correct, the corresponding **blue** and **orange** bars should have **the same height**.

Check your implementation visually by applying it to the random dataframe below.

In [None]:
df = pd.DataFrame(data=np.random.randint([2, 3], size=(100, 2)), columns=['x', 'y'])

In [None]:
compare_count_plots(df, 'x')

In [None]:
compare_count_plots(df, 'y')

In [None]:
ax = sns.countplot(df, color='tab:blue', x='y', width=0.5)

count_frequencies(df, 'y')['count'].sort_index().plot.bar(
    ax=ax, width=0.25, align='edge', color='tab:orange'
)
ax.autoscale_view()
ax.set_xlim(-0.5 + df['y'].min(), df['y'].max() + 0.5)
ax.legend(['Reference', 'Your result'])

### Task 2: Describe numerical variables using quantiles

Your task is to implement a function that prepares the data needed to plot boxplots.

In other words, given a dataframe and a list of specified columns, for each column you should calculate five descriptive statistics:
1. Median
2. 0.25 quantile (`Q1`)
3. 0.75 quantile (`Q3`)
4. Lower whisker, or `Q1 - 1.5 * (Q3 - Q1)`
5. Upper whisker, or `Q3 + 1.5 * (Q3 - Q1)`
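
The five statistics above can be worked through by hand on a tiny sample. The sketch below uses `np.quantile` on a hypothetical array `values`, which is an assumption for illustration only and not part of the assignment data.

```python
import numpy as np

# Hypothetical sample, chosen so that the quartiles are easy to verify
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Quartiles with the default linear interpolation
q1, median, q3 = np.quantile(values, [0.25, 0.5, 0.75])

# The whiskers are derived from the interquartile range (IQR)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(lower, q1, median, q3, upper)  # -1.0 2.0 3.0 4.0 7.0
```

Note that the whiskers drawn by a `matplotlib` boxplot extend only to the most extreme data point inside these bounds, so they do not have to coincide with the values of `lower` and `upper` computed here.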

Please use the template `describe_numericals` for the implementation.

In [None]:
%%writefile ./../solutions/describe_numericals.py
import pandas as pd
from typing import Union


def describe_numericals(
    df: pd.DataFrame,
    col: Union[str, list[str]]
) -> pd.DataFrame:
    """
    Describe each specified numerical column with five statistics:
    - median
    - 0.25 quantile (Q1)
    - 0.75 quantile (Q3)
    - lower whisker (lower)
    - upper whisker (upper)

    Here:
        - The lower whisker is Q1 - 1.5 * (Q3 - Q1)
        - The upper whisker is Q3 + 1.5 * (Q3 - Q1)

    Args:
        df: pd.DataFrame, an input dataframe
        col: str or list[str], column(s) to describe
    Returns:
        pd.DataFrame
            A dataframe of the shape (5, len(col)) that contains
            the descriptive statistics mentioned above. Its
            index should be ["lower", "Q1", "median", "Q3", "upper"]
    """
    if isinstance(col, str):
        col = [col]

    descr = pd.DataFrame(index=["lower", "Q1", "median", "Q3", "upper"])

    for c in col:
        # Compute the quartiles once and derive the whiskers from the IQR
        q1 = df[c].quantile(0.25)
        q3 = df[c].quantile(0.75)
        iqr = q3 - q1
        descr[c] = [q1 - 1.5 * iqr, q1, df[c].median(), q3, q3 + 1.5 * iqr]

    return descr

#### Self-Check

Now, generate a random dataframe with three columns for debugging.

In [None]:
df = pd.DataFrame(
    data=np.random.uniform(low=[0, 0, 1], high=[1, 3, 2], size=(100, 3)),
    columns=['x', 'y', 'z']
)

In [None]:
descr_df = describe_numericals(df, ['x', 'y', 'z'])

The code below checks whether the lower and upper whiskers are computed correctly.

In [None]:
q1 = descr_df.loc['Q1', :]
q3 = descr_df.loc['Q3', :]
iqr = q3 - q1
assert np.allclose(descr_df.loc['lower', :], q1 - 1.5 * iqr)
assert np.allclose(descr_df.loc['upper', :], q3 + 1.5 * iqr)

The code below shows whether the 0.25 quantile, the 0.75 quantile, and the median match the boxplot.

If your implementation is correct, you should see:
- A blue point (Q1) exactly at the lower border of the boxplot
- An orange point (Q3) exactly at the upper border of the boxplot
- A green point (the median) exactly at the middle green line of the boxplot

In [None]:
ax = df.boxplot()
ax.scatter([1, 2, 3], descr_df.loc['Q1', :], marker='o', label='Q1')
ax.scatter([1, 2, 3], descr_df.loc['Q3', :], marker='o', label='Q3')
ax.scatter([1, 2, 3], descr_df.loc['median', :], marker='o', label='median')
ax.legend()

### Task 3: Standardize numerical variables

Your task is to implement a function that preprocesses a dataset by scaling the variables specified.

In this task, scaling a variable $x$ means applying the following transformation:

$x := \frac{x - \mu}{\sigma},$

where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$.
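
A minimal numeric sketch of this transformation, using a hypothetical array `x` and plain `numpy`, is shown below. It is an illustration under these assumptions rather than the required solution.

```python
import numpy as np

# Hypothetical sample, not part of the assignment data
x = np.array([2.0, 4.0, 6.0])

mu = x.mean()          # 4.0
sigma = x.std(ddof=0)  # population standard deviation

# Subtract the mean, then divide by the standard deviation
z = (x - mu) / sigma

print(z.mean(), z.std(ddof=0))
```

After the transformation, the mean is 0 and the standard deviation is 1 up to floating-point error. Keep in mind that `np.std` uses `ddof=0` by default, whereas `pandas.Series.std` uses `ddof=1`, so the parameter has to be passed explicitly when working with pandas.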

Please use the template `standardize_numericals` for the implementation.

In [None]:
%%writefile ./../solutions/standardize_numericals.py
import pandas as pd


def standardize_numericals(df: pd.DataFrame, col: list[str]) -> pd.DataFrame:
    """
    Standardize the specified columns in a dataframe by
    subtracting the mean and dividing the result by the standard
    deviation.

    NOTE: Please use the parameter ddof=0 for the std function
    in your implementation.

    Args:
        df: pd.DataFrame, an input dataframe
        col: list[str], columns to standardize
    Returns:
        pd.DataFrame
            An updated dataframe where the specified columns
            have a mean of 0 and a unit variance
    """
    df_copy = df.copy()
    for c in col:
        df_copy[c] = (df_copy[c] - df_copy[c].mean()) / df_copy[c].std(ddof=0)
    return df_copy

#### Self-Check

Now, generate a random dataframe with two columns for debugging.

The two variables `x` and `y` will have Gaussian distributions with different means and variances, so their histograms will have almost no overlap.

In [None]:
df = pd.DataFrame(data=np.random.normal(
    loc=[-1, 2],
    scale=[0.5, 1],
    size=(100, 2)
), columns=['x', 'y'])

See the plot below, which shows how separable the histograms for `x` and `y` are.

In [None]:
df.plot.hist(bins=25, alpha=0.5)

If your implementation is correct, the **histograms** of the standardized variables `x` and `y` will overlap **almost completely**.

Please see the plot below.

In [None]:
standardize_numericals(df.copy(), ['x', 'y']).plot.hist(bins=25, alpha=0.5)

### Task 4: Calculate histogram data

Your task is to implement a function that prepares data for building a histogram plot, possibly for multiple variables.

You should calculate the width of bins based on the extreme values of the variables specified and the number of bins passed as a parameter. Then for each variable, count how many values fall in each bin.
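
The binning procedure described above might look as follows on a small example. The dictionary `data`, the variable `n_bins`, and the use of `np.linspace` together with `np.histogram` are assumptions made for this sketch.

```python
import numpy as np

# Hypothetical variables that share one set of bins
data = {
    "x": np.array([0.0, 1.0, 2.0]),
    "y": np.array([2.0, 3.0, 4.0]),
}
n_bins = 4

# The extreme values across all variables define the common range
lo = min(v.min() for v in data.values())
hi = max(v.max() for v in data.values())

# n_bins + 1 equally spaced edges give n_bins bins of equal width
edges = np.linspace(lo, hi, n_bins + 1)

# Count how many values of each variable fall into each bin
counts = {name: np.histogram(v, bins=edges)[0] for name, v in data.items()}

print(edges)        # [0. 1. 2. 3. 4.]
print(counts["x"])  # [1 1 1 0]
print(counts["y"])  # [0 0 1 2]
```

Every bin is half-open on the right except the last one, which also includes its right edge. This is why the value `4.0` in `y` is counted in the final bin.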

Please use the template `calculate_histogram_data` for the implementation.

In [None]:
%%writefile ./../solutions/histogram_data.py
import pandas as pd
import numpy as np


def calculate_histogram_data(
    df: pd.DataFrame,
    col: list[str],
    bins: int = 10
) -> pd.DataFrame:
    """
    Calculates a dataframe with data needed to plot a histogram.
    # ... (rest of the docstring)
    """
    # Use the extreme values across all specified columns so that every
    # variable shares the same bins
    min_val = df[col].min().min()
    max_val = df[col].max().max()

    # bins + 1 equally spaced edges define `bins` intervals of equal width
    bin_edges = np.linspace(min_val, max_val, bins + 1)

    hist_data = pd.DataFrame({'bin_start': bin_edges[:-1], 'bin_end': bin_edges[1:]})

    for c in col:
        hist, _ = np.histogram(df[c], bins=bin_edges)
        hist_data[c] = hist

    return hist_data


#### Self-Check

Now, generate a random dataframe with three columns for debugging.

Its three variables (columns) have Gaussian distributions with different means and variances.

In [None]:
df = pd.DataFrame(data=np.random.normal(
    loc=[-1, 1, 2.5],
    scale=[0.5, 0.75, 1],
    size=(100, 3)
), columns=['x', 'y', 'z'])

Now, plot the histograms of these three variables.

In [None]:
df.plot.hist(bins=25, alpha=0.5)

Now, check your implementation.

The code below plots the border lines of histograms generated by your implementation for the same dataframe.

If your implementation is correct, these **border lines should exactly match** the **borders of the histograms** above.

In [None]:
ax = df.plot.hist(bins=25, alpha=0.5)
hist_df = calculate_histogram_data(df, ['x', 'y', 'z'], bins=25)
bin_edges = [hist_df['bin_start'].iloc[0]] + hist_df['bin_end'].tolist()

ax.stairs(hist_df['x'], bin_edges, color='tab:blue', linewidth=2)
ax.stairs(hist_df['y'], bin_edges, color='tab:orange', linewidth=2)
ax.stairs(hist_df['z'], bin_edges, color='tab:green', linewidth=2)