# Explore bivariate relations in a dataset

In this notebook, you will implement code that generates the building blocks of the main types of plots used to visualize bivariate relations in a dataset:
* Categorical-categorical (heatmap)
* Numerical-numerical (scatter plot)
* Numerical-categorical (box plot)

You will check your work by visually comparing your results with the reference plots.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

**NOTE:** Some of the cells below contain the magic command `%%writefile`, which saves the contents of a cell into a specified file. Here, all such commands are commented out. Please make sure to do the following when completing the subtasks:

1. Uncomment **all** `%%writefile` commands
2. Rerun the entire notebook

This is important because the generated `.py` files will be checked.

### Task 1: Categorical-categorical relations: Count the unique values

Your task is to implement a function that counts the unique pairs of values in specified columns of a given dataframe.

You should report how often each unique pair occurs, both in absolute terms and in relative terms (i.e., as a fraction of the total number of pairs), and put these values in the columns `"count"` and `"freq"` of the output dataframe.

Please use the template `count_frequencies` for the implementation.
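
To make the difference between `"count"` and `"freq"` concrete, here is a minimal sketch on a small flat dataframe. The data and the names `toy` and `counts` are made up for illustration and are not part of the task template.

```python
import pandas as pd

# Hypothetical toy data: four (x, y) pairs, one of which occurs twice
toy = pd.DataFrame({"x": ["a", "a", "b", "a"], "y": ["u", "v", "u", "u"]})

# Absolute count of every unique (x, y) pair
counts = toy.groupby(["x", "y"]).size().reset_index(name="count")

# Relative frequency: each count divided by the total number of pairs
counts["freq"] = counts["count"] / len(toy)

print(counts)
```

The pair `("a", "u")` occurs twice out of four rows, so its `"count"` is 2 and its `"freq"` is 0.5. The frequencies of all pairs add up to 1.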

In [2]:
%%writefile ./../solutions/count_frequencies.py
import pandas as pd

def count_frequencies(df: pd.DataFrame, col_x: str, col_y: str) -> pd.DataFrame:
    df = df.explode(col_x).explode(col_y)
    count_df = df.groupby([col_x, col_y]).size().reset_index(name='count')
    count_df['freq'] = count_df['count'] / len(df)
    count_df.columns = ['cat_x', 'cat_y', 'count', 'freq']  # Set the correct column names
    return count_df


#### Self-Check

In [3]:
df = pd.DataFrame(data=[[['cat', 'dog', 'chicken'], ['meat', 'egg', 'rice']],
                       [['cat', 'frog'], ['rice', 'corn']],
                       [['cat', 'pig'], ['meat', 'cereal', 'corn']],
                       [['cat', 'pig'], ['meat', 'cereal', 'corn']]], columns=['x', 'y'])
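
Each cell in this example holds a list, which is why the solution above calls `explode` once per column before counting. The sketch below shows the effect on a single hypothetical row. The names `toy` and `pairs` are illustrative only.

```python
import pandas as pd

# Hypothetical single row whose cells hold lists
toy = pd.DataFrame({"x": [["a", "b"]], "y": [["u", "v", "w"]]})

# Exploding both columns in turn yields one row per (x, y) combination
pairs = toy.explode("x").explode("y")

print(pairs)
```

A row with 2 values in `x` and 3 values in `y` therefore contributes 2 * 3 = 6 pairs to the counts.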

In [4]:
check_df = count_frequencies(df, 'x', 'y')

**To check yourself, run the cells below. You should see the same plots:**


![](docs/output_file_1_1.png)
![](docs/output_file_1_2.png)

In [4]:
pivot = check_df.pivot(index='cat_x', columns='cat_y', values='count')
ax = sns.heatmap(pivot, annot=True)
plt.show()

In [5]:
pivot = check_df.pivot(index='cat_x', columns='cat_y', values='freq')
ax = sns.heatmap(pivot, annot=True)
plt.show()

### Task 2: Numerical-numerical relations: Calculating correlations

Your task is to implement a function that prepares the data needed to plot a correlation matrix.

In other words, given a dataframe (with both numerical and categorical features), calculate the correlation for each pair of numerical columns. Then drop every column, together with its matching row, whose correlation with each of the other columns does not exceed the given threshold in absolute value. A column is kept if its absolute correlation with at least one other column is greater than the threshold. This filtering is useful when the number of numerical features is very large and you want to ignore uncorrelated pairs.
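
The filtering rule can be sketched in vectorized form as follows. This is an illustrative alternative, not the reference solution. The toy data, the threshold of 0.5, and the names `toy`, `off_diag`, `keep`, and `filtered` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: q is a multiple of p, r is uncorrelated with both
toy = pd.DataFrame({
    "p": [1, 2, 3, 4, 5],
    "q": [2, 4, 6, 8, 10],
    "r": [1, -1, 0, -1, 1],
})
threshold = 0.5  # assumed value for this sketch

corr = toy.corr()

# Mask the diagonal, because every column correlates perfectly with itself
off_diag = corr.abs().where(~np.eye(len(corr), dtype=bool))

# Keep a column if it exceeds the threshold against at least one other column
keep = off_diag.gt(threshold).any(axis=0)
filtered = corr.loc[keep, keep]

print(filtered)
```

Here `p` and `q` survive because they are strongly correlated with each other, while `r` is dropped from both the rows and the columns.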

Please use the template `calculate_correlation` for the implementation.

In [6]:
%%writefile ./../solutions/calculate_correlation.py
import pandas as pd
import numpy as np

def calculate_correlation(
    df: pd.DataFrame,
    threshold: float,
) -> pd.DataFrame:
    """
    Calculate the correlation matrix for a given dataframe, dropping columns and rows where 
    the correlation with any other variables isn't greater than `threshold` by modulus.

    Args:
        df: pd.DataFrame, an input dataframe
        threshold: float, an input threshold
    Returns:
        pd.DataFrame
            A dataframe that contains a submatrix of the correlation matrix
    """

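    # Restrict to numeric columns first: recent pandas versions raise an error
    # when corr() encounters string columns (such as 'c' in the self-check)
    corr_matrix = df.select_dtypes(include="number").corr()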
    cols_to_keep = []

    for col in corr_matrix:
        keep_col = False
        for row in corr_matrix:
            if col != row and abs(corr_matrix.loc[row, col]) > threshold:
                keep_col = True
                break
        if keep_col:
            cols_to_keep.append(col)

    return corr_matrix.loc[cols_to_keep, cols_to_keep]



#### Self-Check

Now, let's check correctness with the given example below:

In [7]:
df = pd.DataFrame({
    'a': [0, 1, 0, 1, 1, 1],
    'b': [0, 1, 0, 1, 1, 1],
    'c': [0, 0, 0, 0, 1, 1],
    'd': [1, 1, 1, 1, 0, 0],
    'e': [1, 0, 1, 1, 1, 0],
    'f': [1, 1, 1, 1, 1, 1],
    'g': [1, 0, 1, 1, 1, 1],
})

df['c'] = "I'm string"

In [8]:
corr_matrix = calculate_correlation(df, 0.4)

**For the example above, the code below should produce a heatmap like this one:**

![](docs/output_file_2.png)

In [9]:
ax = sns.heatmap(corr_matrix, annot=True, fmt=".2f")
plt.show()

### Task 3: Numerical-categorical relations: Describe a category with numerical statistics

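Your task is to implement a function that computes, for each category, the statistics needed to draw a box plot of a numerical column: the median, the quartiles `Q1` and `Q3`, and the whisker bounds `lower = Q1 - 1.5 * IQR` and `upper = Q3 + 1.5 * IQR`, where `IQR = Q3 - Q1`.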

Please use the template `describe_categoricals_by_numericals` for the implementation.
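
The box-plot statistics this task asks for can be sketched on a single series of numbers. The sample values below are made up for illustration. The whisker bounds follow the 1.5 * IQR rule that the self-check asserts later in this notebook.

```python
import pandas as pd

# Hypothetical sample of numerical values for one category
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
median = values.median()
iqr = q3 - q1                # interquartile range

# Whisker bounds: 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(lower, q1, median, q3, upper)
```

For this sample the quartiles are 3 and 7, so the interquartile range is 4 and the bounds land at -3 and 13.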

In [10]:
%%writefile ./../solutions/describe_categoricals_by_numericals.py
import pandas as pd


def describe_categoricals_by_numericals(
    df: pd.DataFrame, 
    col_cat: str, 
    col_num: str
) -> pd.DataFrame:
   
    categories = df[col_cat].unique()
    # Rows hold the box-plot statistics; columns hold the categories
    descr_df = pd.DataFrame(index=["lower", "Q1", "median", "Q3", "upper"], columns=categories, dtype=float)
    for cat in categories:
        values = df.loc[df[col_cat] == cat, col_num]
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        # Assign with .loc[row, column]; chained indexing may write to a temporary copy
        descr_df.loc["median", cat] = values.median()
        descr_df.loc["Q1", cat] = q1
        descr_df.loc["Q3", cat] = q3
        # Whisker bounds follow the 1.5 * IQR rule
        descr_df.loc["lower", cat] = q1 - 1.5 * iqr
        descr_df.loc["upper", cat] = q3 + 1.5 * iqr
    return descr_df

#### Self-Check

In [11]:
df = pd.DataFrame(data=[*zip(['cat', 'dog', 'pig', 'cat', 'dog', 'frog', 'frog', 'pig'], 
                             [2, 8, 25, 3, 6, 0.5, 1, 15])], columns=['animal', 'weight'])

In [12]:
descr_df = describe_categoricals_by_numericals(df, 'animal', 'weight')

**Please see the plot below. If you implemented everything correctly, the code below will produce the same picture.**

![](docs/output_file_3.png)

In [13]:
q1 = descr_df.loc['Q1', :]
q3 = descr_df.loc['Q3', :]
iqr = q3 - q1
assert np.allclose(descr_df.loc['lower', :], q1 - 1.5 * iqr)
assert np.allclose(descr_df.loc['upper', :], q3 + 1.5 * iqr)

In [14]:
fig, ax = plt.subplots(figsize=(5,4))  
ax.scatter(descr_df.columns.tolist(), descr_df.loc['Q1', :], marker='o', label='Q1')
ax.scatter(descr_df.columns.tolist(), descr_df.loc['Q3', :], marker='o', label='Q3')
ax.scatter(descr_df.columns.tolist(), descr_df.loc['median', :], marker='o', label='median')
ax.legend()
plt.show()