In [1]:
import pandas as pd
import numpy as np

# sum_and_round, mean_and_round compute the sum and mean of a groupby dataframe 
# and round the results to the first and third digit respectively

from src.funct import sum_and_round, mean_and_round

import matplotlib.pyplot as plt
import seaborn as sns

import altair as alt

In [2]:
data = pd.read_excel('./data/superstore_data.xlsx')

# Reinventing the Wheel: A DIY approach to Data Analysis and Data Science

The Reinventing the Wheel (RtW) approach starts from a small (sometimes not that small) topic, disassembles it into even smaller pieces and then reassembles them, trying to mimic the original form. This approach can help you better understand the underlying basic concepts and "own" them. Most of the time, when you dismantle an object and rebuild it, you end up with something clunkier or even ugly. But it is certainly something that you own.

In this notebook, I will discuss Box Plots, a method used in descriptive statistics to visually show the [locality, spread and skewness groups of numerical data through their quartiles](https://en.wikipedia.org/wiki/Box_plot). As a starting point, I will use the [`altair` boxplot](https://altair-viz.github.io/) of a dataset and drill down into concepts like median, quartiles, IQR and outliers, while illustrating Python and `altair` procedures often used in Data Analysis.

To illustrate all the concepts, I will use the [Super Store Sales sample data](https://public.tableau.com/app/learn/sample-data) provided by Tableau Public. Some of the ideas I used for this notebook come from the Coursera Course Data Analysis with Tableau, by Tableau Learning Partner.

Starting from an `altair` boxplot, we will discuss the following learning objectives:

- Median
- Quantile and Quartiles
- Whiskers and IQR
- Outliers

At the end, there is a Python appendix with custom functions and plots.

The aim of reinventing the wheel is to explain concepts and theoretical ideas, implementing functions from scratch where needed. In this notebook, the functions I implement are meant to clarify the main practical aspects and, in doing so, clarity and step-by-step passages are preferred over optimization for speed, performance or error handling (e.g. I will not check whether an input is greater than zero). Moreover, built-in library functions are sometimes used when they do not affect my aim (e.g. I am not re-implementing sorting!).

# Box Plots and Data Distributions

Box plots are a type of visualization that shows a statistical summary of selected data. While histograms give you a graphical understanding of how the data are distributed, helpfully indicating whether they are evenly distributed, normal or skewed, the advantage of Box Plots is that they provide a visual representation of some of the main characteristics of the data distribution, specifically median, quartiles and outliers.
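
As a minimal sketch of the kind of summary a box plot encodes, the cell below computes minimum, first quartile, median, third quartile and maximum of a small sample. The `toy` series is a hypothetical example, not part of the Super Store data, and the quartiles rely on the default linear interpolation of `pandas`.

```python
import pandas as pd

# Hypothetical toy sample, not taken from the Super Store data
toy = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 30])

# Min, Q1, median, Q3 and max in one call; pandas interpolates linearly by default
summary = toy.quantile([0, 0.25, 0.5, 0.75, 1])
summary.index = ['min', 'Q1', 'median', 'Q3', 'max']

print(summary)
```

With 9 sorted points, the three quartiles fall exactly on the 3<sup>rd</sup>, 5<sup>th</sup> and 7<sup>th</sup> values, so no interpolation is needed in this particular case.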

Let's start by creating an `altair` boxplot from a data distribution containing very few points, as this will help us better understand the concepts behind a boxplot. To this end, I will use the Super Store Sales sample data and calculate the sum of the sales for each Product Sub-Category.

In [3]:
# From the data, we create a dataframe containing, for each Sub Category, the sum of the sales

df_subcategory_sales = (data
 .groupby('Sub-Category', observed=True)
 .agg(
     sales = ('Sales', sum_and_round),
 )
).sort_values('sales').reset_index()

From the data, we create a dataframe containing, for each Sub-Category, the sum of the sales, and visualize the distribution using a boxplot.

In [4]:
box = (alt
       .Chart(df_subcategory_sales)
       .mark_boxplot(size = 25, color = 'silver')
       .encode(
           x = alt.X('sales:Q')
       )
      )

box.properties(
    title = 'Sales by Sub-category box plot',
    width = 620,
    height = 80
)

Hovering on the newly created Box Plot, Altair provides the following information:
- `Max` of sales: 330007.05
- `Q3` of sales: 203412.73
- `Median` of sales: 114880
- `Q1` of sales: 46673.53
- `Min` of sales: 3024.28

And
- `Upper Whisker` of sales: 330007.054
- `Lower Whisker` of sales: 3024.24

These values aim to show how the data are distributed. But what do they mean, and how can we use them in practice when analysing the data? `Min` and `Max` straightforwardly indicate the minimum and maximum values of the dataset, but what do `Median`, `Q1`, `Q3`, `Upper Whisker` and `Lower Whisker` mean, and how does `altair` calculate them?

Let's start by looking at the concepts of median, quartile and Interquartile Range (IQR) and understand their role in the statistical description of a dataset.

## Median, Quartile and Interquartile Range, Upper and Lower Whisker

When analysing a dataset, we deal with discrete quantities (samples) that might be drawn from continuous distributions, like the height of a person or the distance covered by a starship. Statistical measures of a dataset, like the median, are therefore always computed on a finite number of data points. One way to characterize a dataset is to divide its elements into parts, and one of the most common ways to do it is to divide it into four sets with the same number of elements each: these are called ***quartiles***.

### Median

> The ***median*** of a set of numbers is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution.

To compute the median of a set of numbers, we have to consider whether the number of its elements is odd or even. If the number of data points is odd, the median is the middle point of the sorted data; otherwise, the median is usually defined as the arithmetic mean of the two middle values.
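
As a quick sanity check of the odd/even rule, here is a minimal sketch on two small lists using the `median` function from the Python standard library. The lists `odd_sample` and `even_sample` are hypothetical examples chosen so that the result can be read off by eye.

```python
from statistics import median

odd_sample = [7, 1, 5, 3, 9]       # 5 values: sorted -> [1, 3, 5, 7, 9]
even_sample = [7, 1, 5, 3, 9, 11]  # 6 values: sorted -> [1, 3, 5, 7, 9, 11]

odd_median = median(odd_sample)    # middle value of the sorted list
even_median = median(even_sample)  # mean of the two middle values, 5 and 7

print(odd_median, even_median)
```

In the odd case the median is one of the data points (5), while in the even case it falls between two of them (6.0).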

In the following cell we define a `dataset_median` function that takes a pandas series as input, sorts it, finds the middle position and, depending on whether the number of elements is odd or even, computes the median of the dataset using the above definition.

In [None]:
def dataset_median(dataset):
    """
    input: dataset is a pandas series
    output: returns the median of the dataset
    """
    # sort once and work on the sorted values
    sorted_data = dataset.sort_values()
    n = len(sorted_data)
    midpoint = n // 2
    if n % 2 != 0:
        # if the length of the dataset is odd, return the middle point
        return sorted_data.iloc[midpoint]
    # otherwise, return the arithmetic mean of the two middle points
    return (sorted_data.iloc[midpoint - 1] + sorted_data.iloc[midpoint]) / 2

In our example, the sum of `Sales` per `Sub-Category` in the Super Store dataset, we have an odd number of elements (17). This means that the median `sales` is the 9<sup>th</sup> element (index 8) of the sorted `sales` series.

In [None]:
dataset = df_subcategory_sales['sales']

print("median using the diy function: {}".format(dataset_median(dataset)))
print("median using pandas median   : {}".format(dataset.median()))
## printing the Sub-Category
print("Sub-Category: {}".format(df_subcategory_sales[df_subcategory_sales['sales'] == dataset_median(df_subcategory_sales['sales'])]['Sub-Category'].values[0]))

The following is the altair box plot of the dataframe `sales` column. As we can see by hovering over the data points, the median corresponds to the `Bookcases` sub-category. When the dataset length is odd, the median is itself a data point: it is the $2$<sup>nd</sup> quartile, i.e. the $50$<sup>th</sup> percentile ($50\%$ of the data).

In [None]:
points = (alt
          .Chart(df_subcategory_sales)
          .mark_point(size = 50, filled=True, opacity=0.8, color = '#953f0a')
          .encode(
              x = alt.X('sales:Q'),
              tooltip = ['Sub-Category:N', 'sales:Q']                        
          )
         )

box = (alt
       .Chart(df_subcategory_sales)
       .mark_boxplot(size = 40)
       .encode(
           x = alt.X('sales:Q')
       )
      )
       

chart = (box + points).properties(
    title = 'Sales by Sub-category box plot',
    width = 620,
    height = 100
)

chart

When we have an odd number of samples in our dataset, the median also corresponds to a specific point of the dataset (in our case, Bookcases).

We now remove the median point from the dataset and compute the new median of the `sales` dataset, now consisting of an even number of elements.

In [None]:
df_subcategory_without_bookcases = df_subcategory_sales.drop(df_subcategory_sales[df_subcategory_sales['Sub-Category'] == 'Bookcases'].index).reset_index(drop = True)
df_subcategory_without_bookcases.shape[0]

Applying the same steps as before, we obtain:

In [None]:
print("median using the diy function: {}".format(dataset_median(df_subcategory_without_bookcases['sales'])))
print("median using pandas median   : {}".format(df_subcategory_without_bookcases['sales'].median()))

# As we have an even number of elements, there is no sub category that is the median of the dataset
# print("Sub-Category: {}".format(df_subcategory_without_bookcases[df_subcategory_without_bookcases['sales'] == dataset_median(df_subcategory_without_bookcases['sales'])]['Sub-Category'].values[0]))

Plotting the box plot, we can see that now the median is between the two middle points.

In [None]:
points = (alt
          .Chart(df_subcategory_without_bookcases)
          .mark_point(size = 50, filled=True, opacity=0.8, color = '#953f0a')
          .encode(
              x = alt.X('sales:Q'),
              tooltip = ['Sub-Category:N', 'sales:Q']                        
          )
         )

box = (alt
       .Chart(df_subcategory_without_bookcases)
       .mark_boxplot(size = 40)
       .encode(
           x = alt.X('sales:Q')
       )
      )
       

chart = (box + points).properties(
    title = 'Sales by Sub-category box plot',
    width = 620,
    height = 100
)

chart

### Quartile

Let's start from the definition of quantiles and quartiles. In statistics, ***quantiles*** are cut points dividing a sample into equally sized, adjacent subgroups. For example, the median is a quantile: half of the data lies below it and half above it. The median is also the 2<sup>nd</sup> quartile.

***Quartiles*** divide the distribution into four equal parts.

In our examples, the `altair` boxplot provides the `Q1` and `Q3` parameters, meaning the first and third quartile. The first quartile, `Q1`, is the point below which $25\%$ of the data lie. Similarly, the 3<sup>rd</sup> quartile, `Q3`, is the point below which $75\%$ of the data lie. In our examples, for the original dataset containing 17 points the boxplot reports:
- `Q1` = 46674.54
- `Q3` = 203412.73

In the second one, without the median point, we have:
- `Q1` = 41784.85
- `Q3` = 204300.93
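
The "25% below `Q1`, 75% below `Q3`" reading can be checked numerically. The sketch below uses a hypothetical toy array (the integers from 1 to 100, not the Super Store data) and `np.quantile` with its default linear method; on small samples the fractions are only approximately 25% and 75%.

```python
import numpy as np

# Hypothetical toy data: the integers 1..100
values = np.arange(1, 101)

q1, q3 = np.quantile(values, [0.25, 0.75])

# Fraction of points lying strictly below each quartile
frac_below_q1 = np.mean(values < q1)
frac_below_q3 = np.mean(values < q3)

print(q1, q3, frac_below_q1, frac_below_q3)
```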

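How did `altair` calculate these points? Let's start with the first quartile, i.e. the $25\%$ of the data. Breaking the calculation down into steps, we first need to sort the data and then find the position in the sorted array that leaves the first $25\%$ of the elements below it.

- Sort the array
- Compute the position $p \times (N-1)$, where $p = 0.25$ and $N$ is the number of elements
- If the position is an integer, take the value at that index; otherwise, interpolate linearly between the two nearest values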

In [None]:
0.25*16

In [None]:
def mquintile(data, p):
    """
    Compute the p-th quantile of a dataset using linear interpolation.
    
    Parameters:
    -----------
    data : array-like
        Array of observations (numpy array, list, or pandas series)
    p : float
        Percentile between 0 and 1 (e.g., 0.25 for Q1, 0.5 for median)
    
    Returns:
    --------
    float
        The computed quantile value
    
    Examples:
    --------
    >>> data = [1, 2, 3, 4, 5]
    >>> mquintile(data, 0.5)  # median
    3.0
    >>> mquintile(data, 0.25)  # first quartile
    2.0
    """
    # Input validation
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if len(data) == 0:
        raise ValueError("data cannot be empty")

    # Sort the data
    samples = np.sort(data)
    print("Sorted data:", samples)
    
    # Calculate position in sorted array
    n = p * (len(samples) - 1)
    print(f"Theoretical position (n) = {n}")
    
    # If n is effectively a whole number (allowing for floating point precision)
    if abs(n - round(n)) < 1e-10:
        result = samples[int(round(n))]
        print(f"Position is whole number, returning value at position {int(round(n))} that is {result}")
        return result
    
    # For positions between two samples, interpolate
    pos = int(n)  # floor of n
    # Get the two samples we'll interpolate between
    lower_sample = samples[pos]
    upper_sample = samples[pos + 1]
    print(f"Interpolating between: lower={lower_sample}, upper={upper_sample}")
    
    # Calculate interpolation fraction
    fraction = n - pos
    print(f"Interpolation fraction = {fraction}")
    
    # Compute interpolated value
    quantile = lower_sample + (fraction * (upper_sample - lower_sample))
    print(f"Final interpolated quantile = {quantile}")
    
    return quantile

In [None]:
test = np.array([32.6,5.4,12.7,54.2,65.1,28.6,54.2,76.65,24.12,90.09,28.3,-21,-34,11,54,2,15])*-1
data_set = 20*np.random.normal(size=22)

p = 0.5

mquintile(test, p)
np.quantile(test, p)

In [None]:
def mquintile(data, p):
    """
    data: np array, list, pandas series is an array of observations
    p: float between 0 and 1, is the percentage of samples you want to consider
    """

    samples = np.sort(data)
    # n is the position of the sorted array containing the samples in the desired quantile
    n = p*(len(samples)-1)
    if n == int(n):
        # if the position is a whole number, return the sample at that position
        print(samples[int(n)])
        return(samples[int(n)])
    else:
        # if the position is not a whole number, take its integer part as the
        # lower index into the sorted array
        pos = int(n)
        # get the adjacent samples to interpolate between to compute the quantile
        lower_sample = samples[pos]
        upper_sample = samples[pos+1]
        print("lower sample = {}, upper sample {}".format(lower_sample, upper_sample))
        # compute the fraction of sample to use in the interpolation
        f = n-pos
        print("fraction = {}".format(f))
        # Finally, calculate the interpolated point representing the quantile
        quantile = lower_sample+(f * (upper_sample-lower_sample))
        print("quantile value = {}".format(quantile))
        return(quantile)

In [None]:
np.array([32.6,5.4,12.7,54.2,65.1,28.6,54.2,76.65,24.12,90.09,28.3,-21,-34,11,54,2,15])*-1

In [None]:
data_set = 20*np.random.normal(size=22)

In [None]:
np.sort(data_set)

In [None]:
p = 0.5

mquintile(data_set, p)
np.quantile(data_set, p)

In [None]:
quantile = 0.75
data_set = df_subcategory_sales['sales']

samples = np.sort(data_set)
n = quantile*len(samples)
if n%2 == 0:
    print(samples[n-1])
else:
    print((samples[int(n)-1]+samples[int(n)])/2)

In [None]:
(samples[int(n)]+samples[int(n)+1])/2

In [None]:
np.quantile(samples, 0.25)

In [None]:
size = 9
test = np.round(100*np.random.random_sample(size),2)

In [None]:
np.sort(test)

In [None]:
q75, q25 = np.quantile(test, [0.75 ,0.25])
iqr = q75 - q25
iqr

In [None]:
q75

In [None]:
np.quantile(test, 0.75)

In [None]:
q75, q25 = np.quantile(np.arange(1,10), [0.75,0.25])
iqr = q75 - q25
iqr

In [None]:
0.25*8

In [None]:
data_set = np.arange(1,17)
samples = np.sort(data_set)

p = 0.75
n = p*(len(samples)-1)
print(n)
if n%2 == 0:
    print(samples[int(n)])
else:
    n = int(n)
    print(n)
    print(samples[n])
    print(samples[n+1])

In [None]:
np.quantile(samples, [0.25,0.75])

In [None]:
np.quantile(samples,0.5)

In [None]:
data_set = np.array([32.6,5.4,12.7,54.2,65.1,28.6,54.2,76.65,24.12,90.09,28.3])
samples = np.sort(data_set)

p = 0.25
n = p*(len(samples)-1)
print(n)
if n%2 == 0:
    print(samples[n-1])
else:
    n = int(n)
    print(n)
    print(samples[n])
    print(samples[n+1])

In [None]:
data_set = np.array([32.6,5.4,12.7,54.2,65.1,28.6,54.2,76.65,24.12,90.09,28.3,43.2,54,12.22, 87,54.6,112.3])
samples = np.sort(data_set)

p = 0.25
n = p*(len(samples)-1)
print(n)
if n%2 == 0:
    print(samples[int(n)])
else:
    n = int(n)
    print(n)
    print(samples[n])
    print(samples[n+1])

In [None]:
np.sort(data_set)

In [None]:
np.quantile(data_set, 0.25)

In [None]:
0.25*(len(samples)-1)

In [None]:
linear_interpolation(24.12,28.3, 0.5)

In [None]:
np.quantile(samples, 0.25)

In [None]:
data_set = np.array([13,25,41,90,6,74,57,113,82,121,226,105,5,38])
samples = np.sort(data_set)

p = 0.25
n = p*(len(samples)-1)
print(n)
if n%2 == 0:
    print(samples[n-1])
else:
    n = int(n)
    print(n)
    print(samples[n])
    print(samples[n+1])
    #print((samples[int(n)-1]+samples[int(n)])/2)

In [None]:
25+(38-25)*0.25

In [None]:
(90+(105-90)*0.75)

In [None]:
np.quantile(samples, 0.25)

In [None]:
(data
 .groupby(['Region', 'Sub-Category'])
 .agg(profit = ("Profit", "sum"))
).reset_index().pivot(index = 'Region', columns = 'Sub-Category', values = 'profit')

In [None]:
samples[10]

In [None]:
samples

In [None]:
np.quantile(samples, p)

In [None]:
compute_quantile(samples, 0.75)

In [None]:
samples[10]

In [None]:
samples[n2]

In [None]:
samples

In [None]:
np.quantile(samples, 0.75)

In [None]:
(52.9-4.3)/2

In [None]:
(144+72)/2

In [None]:
108418/12226

In [None]:
223843/(719047+836154+741999)

In [None]:
test = np.array([10,4,11,7,3,12,11,9,5,5,8])
np.median(test)

Linear interpolation: Depending on the percentile $p$, NumPy locates the position in the sorted array corresponding to  $p \times (N-1)$ where $N$ is the number of elements.
- If this position is an integer, it selects the value at that index.
- If it is not an integer, `NumPy` interpolates between the values at the two nearest indices.
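
To see both cases at work, here is a minimal sketch that computes the position $p \times (N-1)$ for arrays of 17 and 16 elements. The arrays `odd_len` and `even_len` are hypothetical, evenly spaced examples; only `np.quantile` with its default linear method is assumed.

```python
import numpy as np

p = 0.25

# Hypothetical toy arrays with 17 and 16 evenly spaced values
odd_len = np.arange(17) * 10.0   # 0, 10, ..., 160
even_len = np.arange(16) * 10.0  # 0, 10, ..., 150

pos_17 = p * (len(odd_len) - 1)   # 4.0  -> an exact index
pos_16 = p * (len(even_len) - 1)  # 3.75 -> falls between index 3 and index 4

q1_17 = np.quantile(odd_len, p)   # value stored at index 4
q1_16 = np.quantile(even_len, p)  # interpolated between the values at index 3 and 4

print(pos_17, q1_17)
print(pos_16, q1_16)
```

With 17 points the first quartile coincides with a data point, while with 16 points it lies three quarters of the way between two neighbouring points.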

The formula for linear interpolation between two points $x_{1}$ and $x_{2}$ is as follows:
$$ y = x_{1}+(x_{2}-x_{1}) \times f$$

Where:

- $x_{1}$ is the lower value at the lower index
- $x_{2}$ is the upper value at the upper index
- $f$ is the fractional part of the position

In the context of quantiles:
- $x_{1}$ is the value at the floor of the computed position (i.e., the lower index).
- $x_{2}$ is the value at the ceiling of the computed position (i.e., the upper index).
- $f = pos - \text{lower index}$, where $pos = p \times (N-1)$ is the (possibly fractional) position

Formula for quantile interpolation:
Given:
- $p$ as the percentile (quantile fraction, between 0 and 1),
- $N$ as the number of data points,
- Sorted data: $data[i]$,
1. Compute position: $pos = p \times (N-1)$
2. The lower index is $\text{lower index} = \lfloor pos \rfloor$
3. The upper index is $\text{upper index} = \text{lower index} + 1$
4. Fraction $f = pos - \text{lower index}$
5. Interpolate between $x_{1} = data[\text{lower index}]$ and $x_{2} = data[\text{upper index}]$

$$ 
q_{p} = x_{1} + f \times (x_{2} - x_{1})
$$

Where $q_{p}$ is the quantile value at the percentile $p$.
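
The numbered steps above can be sketched in a few lines and compared against `np.quantile`. The helper name `interpolated_quantile` is hypothetical, and clamping the upper index (so that $p = 1$ does not run past the end of the array) is my own assumption rather than part of the steps.

```python
import math
import numpy as np

def interpolated_quantile(values, p):
    """Hypothetical helper following the numbered steps above."""
    data = np.sort(np.asarray(values, dtype=float))
    pos = p * (len(data) - 1)                          # step 1: position
    lower_index = math.floor(pos)                      # step 2: lower index
    upper_index = min(lower_index + 1, len(data) - 1)  # step 3, clamped for p = 1
    f = pos - lower_index                              # step 4: fraction
    x1, x2 = data[lower_index], data[upper_index]      # step 5: neighbours
    return x1 + f * (x2 - x1)

sample = [3, 5, 21, 10, 4, 11, 7, 3, 12, 11, 9, 5, 5, 8]
mine = interpolated_quantile(sample, 0.75)
reference = np.quantile(sample, 0.75)

print(mine, reference)
```

For this sample the position is $0.75 \times 13 = 9.75$, so the result sits three quarters of the way between the sorted values at index 9 and index 10, i.e. between 10 and 11.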

In [None]:
data_set = np.array([3,5,21,10,4,11,7,3,12,11,9,5,5,8])
p = 0.75

samples = np.sort(data_set)
pos = p*(len(samples)-1)

n1 = int(np.floor(pos))
n2 = int(np.ceil(pos))
print("pos={}, n1={}, n2={}".format(pos,n1,n2))
(samples[n1]+samples[n2])/2

In [None]:
(samples[9]+samples[10])/2

In [None]:
np.quantile(samples, p)

In [None]:
import numpy as np

# Sample data (named `sample_data` so the Super Store dataframe `data` is not overwritten)
sample_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Compute the 25th, 50th (median), and 75th percentiles (quantiles)
q25 = np.quantile(sample_data, 0.25)
q50 = np.quantile(sample_data, 0.50)
q75 = np.quantile(sample_data, 0.75)

print(f"25th Percentile (Q1): {q25}")
print(f"50th Percentile (Median): {q50}")
print(f"75th Percentile (Q3): {q75}")


In [None]:
p = 0.25

samples = np.sort(sample_data)
pos = p*(len(samples)-1)

n1 = int(np.floor(pos))
n2 = int(np.ceil(pos))
print("pos={}, n1={}, n2={}".format(pos,n1,n2))
(samples[n1-1]+samples[n2-1])/2

In [None]:
def compute_quantile(data, p):
    """
    Compute the quantile of the given data for the given percentile p.
    
    Parameters:
    data (list or array-like): The data for which to compute the quantile.
    p (float): The quantile to compute (between 0 and 1).
    
    Returns:
    float: The computed quantile.
    """
    # Step 1: Sort the data
    sorted_data = sorted(data)
    
    # Step 2: Compute the position in the sorted array
    N = len(sorted_data)
    pos = p * (N - 1)
    
    # Step 3: Find the lower and upper indices
    lower_index = int(pos)
    upper_index = lower_index + 1

    print("pos={}, n1={}, n2={}".format(pos,lower_index,upper_index))
    # Step 4: Interpolation
    if upper_index < N:
        lower_value = sorted_data[lower_index]
        upper_value = sorted_data[upper_index]
        print("Lower Value = {}, Upper Value = {}".format(lower_value, upper_value))
        # Interpolate between the lower and upper values
        fraction = pos - lower_index
        return lower_value + fraction * (upper_value - lower_value)
    else:
        # If the position is exactly at the last index, return the last element
        return sorted_data[lower_index]

In [None]:
samples

In [None]:
samples = np.sort(data_set)

p = 0.75
pos = p*(len(samples)-1)

n1 = int(np.floor(pos))
n2 = int(np.ceil(pos))
print("pos={}, n1={}, n2={}".format(pos,n1,n2))
#samples[n2]

In [None]:
compute_quantile(samples, 0.75)

In [None]:
np.quantile(samples, 0.75)

In [None]:
data_set = np.array([3,5,21,10,4,11,7,3,12,11,9,5,5,8])
compute_quantile(data_set, 0.75)

In [None]:
np.quantile(data_set, 0.75)

In [None]:
np.sort(test)

In [None]:
(np.sort(test)[7]+np.sort(test)[8])/2

In [None]:
np.quantile(test, 0.75, method='linear')

In [None]:
np.sort(test)

In [None]:
test_frame = pd.DataFrame(
    {
        'x' : test
    }
)
test_frame

In [None]:
alt.Chart(test_frame).mark_boxplot().encode(
    x = 'x'
)

In [None]:
values =  [0, 3, 4.4, 4.5, 4.6, 5, 7]
df = pd.DataFrame({'x': values})

points = alt.Chart(df).mark_circle(color='black', size=120).encode(
    x=alt.X('x:Q', scale=alt.Scale(zero=False)),
)

boxplot = alt.Chart(df).mark_boxplot(ticks=True, extent=1.5, outliers=True).encode(
    x='x:Q',
)

iqr = alt.Chart(df).mark_rect(color='lime').encode(
    x='q1(x):Q',
    x2='q3(x):Q'
)

whiskers = alt.Chart(df).mark_rect(color='orange').transform_joinaggregate(
    q1='q1(x)',
    q3='q3(x)',
).transform_calculate(
    iqr='datum.q3 - datum.q1'
).transform_filter(
    # VL concatenates these strings so we can split
    # them on two lines to improve readability
    'datum.x < (datum.q3 + datum.iqr * 1.5)'
    '&& datum.x > (datum.q1 - datum.iqr * 1.5)'
).encode(
    x='min(x)',
    x2='max(x)',
)

minmax = alt.Chart(df).mark_rect(color='red').transform_aggregate(
    xmin='min(x)',
    xmax='max(x)'
).encode(
    x='xmin:Q',
    x2='xmax:Q',
).properties(width=1000)


((boxplot + points) & (minmax + whiskers + iqr + points)).resolve_scale(x='shared')

In this plot, the median value 

In [None]:
df_subcategory_without_bookcases['sales'].median()

In [None]:
df_subcategory_sales

In [None]:
dataset_median(new_dataset)

In [None]:
dataset.median()

In [None]:
df_subcategory_sales = (data
 .groupby('Sub-Category', observed=True)
 .agg(
     sales = ('Sales', 'sum'),
 ).reset_index()
)

dataset = df_subcategory_sales['sales']

n = len(dataset)
midpoint = n//2
if n%2 != 0:
    # odd number of elements: the median is the middle point
    print(dataset.sort_values().iloc[midpoint])
else:
    # even number of elements: mean of the two middle points
    print((dataset.sort_values().iloc[midpoint-1]+dataset.sort_values().iloc[midpoint])/2)

# dataset.sort_values()

In [None]:
new_dataset = dataset.drop(labels=dataset[dataset.values == 114879.9963].index)

In [None]:
new_dataset.median()

In [None]:
n = len(new_dataset)
midpoint = n//2
if n%2 != 0:
    # odd number of elements: the median is the middle point
    print(new_dataset.sort_values().iloc[midpoint])
else:
    # even number of elements: mean of the two middle points
    print((new_dataset.sort_values().iloc[midpoint-1]+new_dataset.sort_values().iloc[midpoint])/2)

In [None]:
dataset.sort_values().iloc[1]

In [None]:
median = df_subcategory_sales['sales'].median()
print("sales median as computed by pandas: {}".format(median))


df_subcategory_sales.iloc[8]

In [None]:
def median(dataset):
    len_data = df_subcategory_sales.shape[0]
    
    if (len_data % 2) != 0:
        mid_point_index = len_data //2
        print(df_subcategory_sales.iloc[mid_point_index])

In [None]:
dataset = df_subcategory_sales['sales']
dataset.sort_values()

In [None]:
df_subcategory_sales.sort_values('sales').reset_index(drop=True).iloc[3]

In statistics, ***quantiles*** are cut points dividing a sample into equally sized, adjacent subgroups. For example, the median is a quantile: half of the data lies below it and half above it.
***Quartiles*** divide the distribution into four equal parts.

In our example, we are considering the sales points from the different Sub-Categories. To determine which quartile each data point belongs to, we sort the values and divide the array into 4 equal parts:

The middle point of the `sales` column is Bookcases, 114879.9963: this value represents the median of the dataset. Now, we break it down further into 4 equal parts and obtain:
- 1st Quartile: `Fasteners`, `Labels`, `Envelopes` and `Art`
- 2nd Quartile: `Supplies`, `Paper`, `Furnishings` and `Appliances`
- Median: `Bookcases`
- 3rd Quartile: `Copiers`, `Accessories`, `Machines` and `Binders`
- 4th Quartile: `Tables`, `Storage`, `Chairs` and `Phones`

`pandas` and `numpy` both offer methods to easily compute quartiles, but to better understand them, let's reinvent the wheel and define a Python function that does the same. To do it, we will go back to the very definition of quartile and focus on the standard definition about the percentages (0.25, 0.5, 0.6

In [None]:
df_subcategory_sales[df_subcategory_sales['sales'] < df_subcategory_sales['sales'].quantile(0.25)]

In [None]:
df_subcategory_sales[(df_subcategory_sales['sales'] >= df_subcategory_sales['sales'].quantile(0.25)) & (df_subcategory_sales['sales'] < df_subcategory_sales['sales'].median())]

To make this clearer, let's plot the box plot and the associated points.

In [None]:
points = (alt
          .Chart(df_subcategory_sales)
          .mark_point(size = 50, filled=True, opacity=0.8, color = '#953f0a')
          .encode(
              x = alt.X('sales:Q'),
              tooltip = ['Sub-Category:N', 'sales:Q']                        
          )
          # .configure_mark(
          #     opacity=0.8,
          #     color='#953f0a'
          # )
         )

box = (alt
       .Chart(df_subcategory_sales)
       .mark_boxplot(size = 40)
       .encode(
           x = alt.X('sales:Q')
       )
      )
       

chart = (box + points).properties(
    title = 'Sales by Sub-category box plot',
    width = 620,
    height = 100
)

chart

In [None]:
df_test = df_subcategory_sales[df_subcategory_sales['Sub-Category'] != 'Bookcases'].reset_index(drop=True)
df_test

In [None]:
points = (alt
          .Chart(df_test)
          .mark_point(size = 50, filled=True, opacity=0.8, color = '#953f0a')
          .encode(
              x = alt.X('sales:Q'),
              tooltip = ['Sub-Category:N', 'sales:Q']                        
          )
         )

box = (alt
       .Chart(df_test)
       .mark_boxplot(size = 40)
       .encode(
           x = alt.X('sales:Q')
       )
      )
       

chart = (box + points).properties(
    title = 'Sales by Sub-category box plot',
    width = 620,
    height = 100
)

chart

## IQR - Interquartile Range

In our example, the dataset consists of points that are quite close to each other, meaning that they are not particularly spread out over the possible values. Let's now focus our analysis on the same Sales dataset, but considering only the sales in the `Central` region.

In [None]:
df_central_sales = (data[data['Region'] == 'Central']
                    .groupby('Sub-Category')
                    .agg(
                        sales = ('Sales','sum')
                    )
                   ).sort_values('sales').reset_index()


points = (alt
          .Chart(df_central_sales)
          .mark_point(size = 50, filled=True, opacity=0.8, color = '#953f0a')
          .encode(
              y = alt.Y('sales:Q'),
              tooltip = ['Sub-Category:N', 'sales:Q']                        
          )

         )

box = (alt
       .Chart(df_central_sales)
       .mark_boxplot(size = 40)
       .encode(
           y = alt.Y('sales:Q')
       )
      )
       

chart = (box + points).properties(
    title = 'Sales by Sub-category in Central Region box plot',
    width = 100,
    height = 420
)

chart

In [None]:
df_subcategory_profit = (data
 .groupby('Sub-Category', observed=True)
 .agg(
     profit = ('Profit', 'sum'),
 )
).reset_index().sort_values('profit', ascending = False)

box = (alt
       .Chart(df_subcategory_profit)
       .mark_boxplot(size = 40)
       .encode(
           y = alt.Y('profit:Q')
       )
      )

box.properties(
    title = 'Profit by Sub-category box plot',
    width = 100,
    height = 420
)

In [None]:
df_subcategory_profit

In [None]:
points = (alt
          .Chart(df_subcategory_profit)
          .mark_point(size = 50, filled=True, opacity=0.8, color = '#953f0a')
          .encode(
              y = alt.Y('profit:Q'),
              tooltip = ['Sub-Category:N', 'profit:Q']                        
          )
          # .configure_mark(
          #     opacity=0.8,
          #     color='#953f0a'
          # )
         )

box = (alt
       .Chart(df_subcategory_profit)
       .mark_boxplot(size = 40)
       .encode(
           y = alt.Y('profit:Q')
       )
      )
       

chart = (box + points).properties(
    title = 'Profit by Sub-category box plot',
    width = 100,
    height = 420
)

chart

The Interquartile Range is an important measure of statistical dispersion and its definition is quite simple: it is the difference between the third and first quartile.

$$
IQR = Q_{3} - Q_{1}
$$
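
As a minimal sketch of how the IQR is typically used, the cell below computes it on seven toy values and derives fences at $1.5 \times IQR$ from the box, the same factor as the `extent=1.5` setting of `mark_boxplot`. The quartiles come from `np.quantile` with its default linear method; I am assuming, not asserting, that `altair` arrives at the same numbers.

```python
import numpy as np

# Toy values, small enough to follow by hand
values = np.array([0, 3, 4.4, 4.5, 4.6, 5, 7])

q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR from the box; points outside are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme points still inside the fences
lower_whisker, upper_whisker = inside.min(), inside.max()

print(q1, q3, iqr)
print(lower_whisker, upper_whisker, outliers)
```

Here the two extreme values, 0 and 7, fall outside the fences, so the whiskers stop at 3 and 5 instead of reaching the minimum and maximum.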

In [None]:
df_subcategory_sales['sales'].quantile(0.75) - df_subcategory_sales['sales'].quantile(0.25)

The interquartile range plays an important role when we want to identify points in the dataset that are potentially outliers, i.e. points quite far away from the distribution. In a boxplot, 

In statistics, 

In [None]:
df_subcategory_sales['sales'].min()

In [None]:
base = (alt
 .Chart(df_subcategory_sales)
#  .mark_point()
 .encode(
#     x = 'Sub-Category:',
     y = 'sales'
 ).properties(
    width=200
 )
)

base.mark_point() + base.mark_boxplot() 

In [None]:
# Create the dataframe
dataX = {
    'Sub-Category': ['Phones', 'Chairs', 'Storage', 'Tables', 'Binders', 'Machines', 
                     'Accessories', 'Copiers', 'Bookcases', 'Appliances'],
    'Sales': [330007.0540, 328449.1030, 223843.6080, 206965.5320, 203412.7330, 
              189238.6310, 167380.3180, 149528.0300, 114879.9963, 107532.1610]
}

dfX = pd.DataFrame(dataX)

# Create the Altair scatter plot
chart = alt.Chart(dfX).mark_point().encode(
    x=alt.X('Sub-Category:N', sort='-y', title='Sub-Category'),
    y=alt.Y('Sales:Q', title='Sales'),
    tooltip=['Sub-Category:N', 'Sales:Q']  # Add tooltips
).properties(
    title='Sales by Sub-Category',
    width=500,
    height=300
)

chart.display()

In [None]:
np.round(df_subcategory_sales['sales'].median(),0)

In [None]:
np.quantile(df_subcategory_sales['sales'], 0.75)

In [None]:
data.columns

In [None]:
data.Region.unique()

In [None]:
df_region_subcat = (data
                    .groupby(['Region', 'Sub-Category'])
                    .agg(
                        sales = ('Sales','sum'),
                        profit = ('Profit', 'sum'),
                        quantity = ('Quantity', 'sum')
                    )
                   ).reset_index()

df_region_subcat.head(2)

In [None]:
(alt
 .Chart(df_region_subcat)
 .mark_boxplot(ticks=True, size=15)
 .encode(
     x = 'Region:O',
     y = 'sales:Q',
 )
 .properties(
    width=200
 )
 .configure_view(
    stroke=None
 )
)

In [None]:
x = [1,2,3,4,5] 
y = [1,4,9,16,25]

plt.plot(x, y)
plt.title('Square Numbers')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

In [None]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 4, 9, 16, 25]})

chart = (alt
         .Chart(df)
         .mark_line()
         .encode(
             x='x',
             y='y'
         ).properties(
             title='Square Numbers'
         )
        )
chart

### Linear Interpolation

In [None]:
test_even = np.array([32.6,5.4,12.7,54.2,65.1,28.6,54.2,76.65,24.12,90.09])
test_odd = np.array([32.6,5.4,12.7,54.2,65.1,28.6,54.2,76.65,24.12,90.09,28.3])
len(test_odd)

In [None]:
np.sort(test_odd)

In [None]:
p = 0.25
p*len(test_odd)-1

In [None]:
a = np.sort(test_odd)[1]
b = np.sort(test_odd)[2]

In [None]:
a

In [None]:
b

In [None]:
f = (p*len(test_odd))

In [None]:
linear_interpolation(24.12,28.3, 0.55)

In [None]:
diff = 28.3 - 24.12
.75*diff

In [None]:
24.12+3.135

In [None]:
linear_interpolation(a, b, f)

In [None]:
def linear_interpolation(a, b, p):
    return(a+(p*(b-a)))

In [None]:
np.quantile(test_odd, p)

In [None]:
np.sort(test_odd)

In [None]:
linear_interpolation(24.12, 28.3, f)

In [None]:
p*(len(test_odd))

In [None]:
46-10

In [None]:
print(p*(np.arange(10,20)))
print(p*(np.arange(10,20))-1)

In [None]:
p*3

In [None]:
p*11

In [None]:
mytest = np.array([0.2,0.3,0.4])

In [None]:
np.quantile(mytest, 0.05)