<h1 style='color:rgb(52, 152, 219)'; align=center><font size = 8> DIAMOND PRICES - ANALYSIS AND MODELING </font></h1>

<h2 style='color:rgb(52, 152, 219)'; align=left><font size = 6> NOTEBOOK 01-03: DESCRIPTIVE STATISTICS </font></h2>

# REVISION HISTORY

| REV | DESCRIPTION             | DATE         |  BY   | CHECK | APPROVE  |
|:---:|:-----------------------:|:------------:|:-----:|:-----:|:--------:|
| A0  | ISSUED FOR REVIEW (IFR) | 2024-APR-XX  |  IAC  |       |          |
|     |                         |              |       |       |          |

## DETAILED DESCRIPTION OF REVISIONS

> **REV A0** - HOLD

# INTRODUCTION

Descriptive statistics summarizes and organizes the characteristics of a dataset. 

## DESCRIPTIVE STATISTICS STEPS

Below are the descriptive statistics steps performed in this notebook:

1. [X] **Measures of Central Tendency:** Calculate descriptive statistics such as mean, median, and mode to understand the central tendency of the data.
2. [X] **Measures of Dispersion**: Compute measures like standard deviation, variance, and range to assess the spread or variability of the data.
3. [X] **Percentiles**: Calculate percentiles to identify specific data points' position within the dataset.
4. [ ] **Binning and Bucketing**: Analyze bin width and number of bins, and create a bin table.
5. [ ] **Skewness and Kurtosis**: Skewness measures the asymmetry of the distribution, indicating whether the data is skewed to the left or right. Kurtosis measures how heavy the tails of the distribution are relative to a normal distribution, which is often described informally as peakedness or flatness.
6. [ ] **Outlier Management**: Identify and manage outliers.
7. [ ] **Correlation Analysis**: Compute correlation coefficients to understand the relationships between pairs of variables.

# REQUIRED LIBRARIES

The following libraries are required to run this notebook.

In [1]:
# Library to create and handle a tabular dataset
import pandas as pd

In [2]:
# Plotly graphing libraries
import plotly.express as px
import plotly.graph_objs as go

In [3]:
# Use the scikit-learn Library to create categorical and numerical features
from sklearn.compose import make_column_selector as selector

In [4]:
# Library used to create interactive controls
import ipywidgets as widgets

In [5]:
# Custom library that allows you to calculate bin width and number of bins using different methods
import binMethods

# LOAD DATASET

The scripts to load this dataset are:

In [6]:
# Path to the dataset
filePath = "./../00_Data/00_Datasets/"

In [7]:
# Filename
diamondsCSVFilename = 'diamonds_DM.csv'

In [8]:
# Create a data frame
diamonds_df = pd.read_csv(
    filePath + diamondsCSVFilename,
    index_col = None
)

# Display the newly created data frame
diamonds_df

Unnamed: 0,Carat,Diamond Type,Cut,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Color,Color Grades,Clarity,Price,Price(2024)
0,0.23,Small Diamond,Ideal,3.95,3.98,2.43,55.0,61.5,E,Colorless,SI2,326,354.04
1,0.21,Small Diamond,Excellent,3.89,3.84,2.31,61.0,59.8,E,Colorless,SI1,326,354.04
2,0.23,Small Diamond,Good,4.05,4.07,2.31,65.0,56.9,E,Colorless,VS1,327,355.12
3,0.29,Small Diamond,Excellent,4.20,4.23,2.63,58.0,62.4,I,Near Colorless,VS2,334,362.72
4,0.31,Medium Diamond,Good,4.34,4.35,2.75,58.0,63.3,J,Near Colorless,SI2,335,363.81
...,...,...,...,...,...,...,...,...,...,...,...,...,...
53789,0.72,Medium Diamond,Ideal,5.75,5.76,3.50,57.0,60.8,D,Colorless,SI1,2757,2994.10
53790,0.72,Medium Diamond,Good,5.69,5.75,3.61,55.0,63.1,D,Colorless,SI1,2757,2994.10
53791,0.70,Medium Diamond,Very Good,5.66,5.68,3.56,60.0,62.8,D,Colorless,SI1,2757,2994.10
53792,0.86,Medium Diamond,Excellent,6.15,6.12,3.74,58.0,61.0,H,Near Colorless,SI2,2757,2994.10


# MEASURE OF CENTRAL TENDENCY

Measures of central tendency are values that represent the center point of a dataset. The center point can be thought of as the most typical value. In statistics, central tendency is described by three measures: the average value (`mean`), the middle value (`median`), and the most common value (`mode`). 

Choosing the right one depends on whether you're dealing with categorical or numerical values. Below are the differences between the two:

**Categorical Data**:
- **Mode**: In categorical data, the mode is often the most relevant measure of central tendency. It represents the most frequently occurring category or class within the dataset.

**Numerical Data**:
* **Mean**: The mean is a common measure of central tendency for numerical data. It is calculated by summing all the values in the dataset and then dividing by the total number of values. The mean is sensitive to extreme values, also known as outliers, and provides a balance point for the data distribution.
* **Median**: The median is another measure of central tendency for numerical data. It represents the middle value when the data is arranged in ascending or descending order. Unlike the mean, the median is not influenced by extreme values, making it more robust in the presence of outliers.
* **Mode**: Similar to categorical data, the mode can also be used for numerical data, indicating the most frequent value.
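
The sensitivity of the mean to outliers, compared with the median and mode, can be illustrated with a minimal sketch. The price values below are hypothetical and chosen only for illustration; they are not taken from the diamonds dataset.

```python
import pandas as pd

# Hypothetical prices (illustrative only, not taken from the diamonds dataset)
prices = pd.Series([300, 320, 320, 350, 400])

# The same prices with one extreme value appended
pricesWithOutlier = pd.concat([prices, pd.Series([18000])], ignore_index = True)

# Compare the three measures before and after adding the outlier
comparison_df = pd.DataFrame(
    {
        'Without outlier' : [prices.mean(), prices.median(), prices.mode()[0]],
        'With outlier' : [
            pricesWithOutlier.mean(),
            pricesWithOutlier.median(),
            pricesWithOutlier.mode()[0]
        ],
    },
    index = ['Mean', 'Median', 'Mode']
)

print(comparison_df)
```

In this sketch the mean moves from 338 to roughly 3281.67 once the extreme value is added, while the median only shifts from 320 to 335 and the mode stays at 320.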

<div class="alert alert-info" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p>In order to perform this analysis, we are going to need to split the dataset between categorical and numerical features. The process to split the dataset will be shown below.</p>
</div>

**Dataset splitting method**: There are many ways to split a dataset between categorical and numerical features. The method below uses scikit-learn's [`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) to create separate categorical and numerical datasets. This method was chosen because the same approach will be used to split the dataset between categorical and numerical features during the predictive analytics phase.

## CATEGORICAL DATA

For categorical data, the **mode** is the most common measure of central tendency, and it is the only one available for nominal data types. However, you might be able to determine a **median**, and in some cases even a **mean**, for ordinal data types, because their categories have a natural order. 
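
As a rough sketch of how a median could be obtained for an ordered (ordinal) feature, the labels can be converted to an ordered pandas `Categorical` and the median taken over the resulting integer codes. The sample values and the names `cutOrder_list` and `cuts` are assumptions made for illustration; the grade ordering reuses the five cut grades that appear in this dataset, listed from least to most valuable.

```python
import pandas as pd

# Cut grades ordered from least to most valuable (assumed ordering for this sketch)
cutOrder_list = ['Fair', 'Good', 'Very Good', 'Excellent', 'Ideal']

# Hypothetical sample of cut grades (illustrative only)
cuts = pd.Series(['Ideal', 'Good', 'Ideal', 'Fair', 'Very Good', 'Excellent', 'Ideal'])

# Convert the labels to an ordered categorical so each grade maps to an integer code
orderedCuts = pd.Categorical(cuts, categories = cutOrder_list, ordered = True)
codes = pd.Series(orderedCuts.codes)

# With an odd number of observations the median code is a whole number
medianCode = int(codes.median())
medianCut = cutOrder_list[medianCode]

# The mode works directly on the labels
modeCut = cuts.mode()[0]

print(f'Median: {medianCut}, Mode: {modeCut}')
```

With an even number of observations the median code may fall between two grades, in which case a rule for choosing the lower or upper grade would be needed.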

In [9]:
# Create an object that will output a list of categorical columns from a dataset
categoricalColumnsSlctr_obj = selector(
    dtype_include = object
)

In [10]:
# Create a list of categorical columns
categoricalColumns_list = categoricalColumnsSlctr_obj(diamonds_df)

In [11]:
# Create a categorical dataset of just categorical features
diamonds_cat_df = diamonds_df[categoricalColumns_list]

# Verify the data
diamonds_cat_df

Unnamed: 0,Diamond Type,Cut,Color,Color Grades,Clarity
0,Small Diamond,Ideal,E,Colorless,SI2
1,Small Diamond,Excellent,E,Colorless,SI1
2,Small Diamond,Good,E,Colorless,VS1
3,Small Diamond,Excellent,I,Near Colorless,VS2
4,Medium Diamond,Good,J,Near Colorless,SI2
...,...,...,...,...,...
53789,Medium Diamond,Ideal,D,Colorless,SI1
53790,Medium Diamond,Good,D,Colorless,SI1
53791,Medium Diamond,Very Good,D,Colorless,SI1
53792,Medium Diamond,Excellent,H,Near Colorless,SI2


### MEASURE OF CENTRAL TENDENCY

The categorical values found in this dataset are all ordinal values. They are all related to grading a diamond. Because these grades are stored as plain text labels rather than ordered numeric codes, the **mode** is the only measure of central tendency extracted in this notebook.

The following scripts graphically display the mode of each categorical feature. 

In [12]:
# Sort categories from most valuable to least.
# Helps in visualizing
sortOrder_dict = {
    'Cut' : ['Ideal', 'Excellent', 'Very Good', 'Good', 'Fair'],
    'Color' : sorted(diamonds_cat_df['Color'].unique().tolist()),
    'Clarity' : ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'],
}

In [13]:
# Create an interactive widget to help in graphing
categoricalFeatures_widget = widgets.Dropdown(
    options = categoricalColumns_list,
    value = categoricalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [14]:
# Create an interactive graph to help visualize categorical data
@widgets.interact(
    categoricalFeature = categoricalFeatures_widget
)
def categoryInspector(categoricalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_cat_df,
        x = categoricalFeature,
        category_orders = sortOrder_dict,
        color = categoricalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white"
    )

    # Extract the mode of the dataset
    mode =  diamonds_cat_df[categoricalFeature].mode()[0]

    # Create a graph title
    graphTitle = f"Mode analysis of {categoricalFeature}. <b>Mode</b> : {mode}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 800
    )

    # Display the results
    return fig.show()


### SUMMARY

The following scripts will create a summary table of the mode for each categorical feature:

In [15]:
# Create a list to store dictionary values
catSummary_list = []

# Create a summary of the mode values for categorical features
for category in categoricalColumns_list:
    # Extract the mode of the dataset
    mode =  diamonds_cat_df[category].mode()[0]
    # Extract the unique values found in the dataset
    data_list = diamonds_cat_df[category].unique().tolist()
    # Apply a criteria to help sort some values
    if category == 'Color' or category == 'Clarity':
        data_list = sorted(data_list)
    # Create the dictionary
    dictionary = {
        'Feature' : category,
        'Values' : data_list,
        'Mode' : mode
    }
    # Append the dictionary to the dictionary list
    catSummary_list.append(dictionary)

# Create a pandas dataframe to display the results as a table
categoricalSummary_df = pd.DataFrame(catSummary_list)

# Display the summary table
categoricalSummary_df

Unnamed: 0,Feature,Values,Mode
0,Diamond Type,"[Small Diamond, Medium Diamond, Large Diamond]",Medium Diamond
1,Cut,"[Ideal, Excellent, Good, Very Good, Fair]",Ideal
2,Color,"[D, E, F, G, H, I, J]",G
3,Color Grades,"[Colorless, Near Colorless]",Near Colorless
4,Clarity,"[I1, IF, SI1, SI2, VS1, VS2, VVS1, VVS2]",SI1


## NUMERICAL DATA

For continuous numerical variables, you can determine the **mean**, **median**, and **mode**. 

In [16]:
# Create an object that will output a list of numerical columns from a dataset
numericalColumnsSlctr_obj = selector(
    dtype_exclude = object
)

In [17]:
# Create a list of numerical columns
numericalColumns_list = numericalColumnsSlctr_obj(diamonds_df)

In [18]:
# Create a numerical dataset of just numerical features
diamonds_num_df = diamonds_df[numericalColumns_list]

# Verify the data
diamonds_num_df

Unnamed: 0,Carat,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Price,Price(2024)
0,0.23,3.95,3.98,2.43,55.0,61.5,326,354.04
1,0.21,3.89,3.84,2.31,61.0,59.8,326,354.04
2,0.23,4.05,4.07,2.31,65.0,56.9,327,355.12
3,0.29,4.20,4.23,2.63,58.0,62.4,334,362.72
4,0.31,4.34,4.35,2.75,58.0,63.3,335,363.81
...,...,...,...,...,...,...,...,...
53789,0.72,5.75,5.76,3.50,57.0,60.8,2757,2994.10
53790,0.72,5.69,5.75,3.61,55.0,63.1,2757,2994.10
53791,0.70,5.66,5.68,3.56,60.0,62.8,2757,2994.10
53792,0.86,6.15,6.12,3.74,58.0,61.0,2757,2994.10


### CENTRAL TENDENCY

The numerical values found in this dataset are all continuous in nature. Therefore, the mean, median, and mode can be extracted from each numerical feature.

These scripts will help visualize numerical features.

In [19]:
# Create an interactive widget to help in graphing
numericalFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [20]:
# Create a list of marginals
marginals_list = [
    None,
    'box',
    'violin',
    'rug'
]

In [21]:
# Create an interactive widget to help in graphing
marginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

In [22]:
# Create an interactive graph to help visualize numerical data
@widgets.interact(
    numericalFeature = numericalFeatures_widget,
    marginalFeature = marginals_widget
)
def numercialInspector(marginalFeature, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when using a marginal plot 
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0
    
    # Bundle values in tuples
    centralTendancyName_tuple = ('Mean', 'Median', 'Mode')
    # Calculate the measures of central tendency for numerical features and bundle in a tuple
    centralTendancy_tuple = (diamonds_num_df[numericalFeature].mean(), diamonds_num_df[numericalFeature].median(), diamonds_num_df[numericalFeature].mode()[0])
    
    # Generate vertical lines and annotations
    for name, calc in zip(centralTendancyName_tuple, centralTendancy_tuple):
        # Used to customize the annotations and vertical lines
        match name:
            case 'Mean':
                tendencyColor = 'black'
                yCord = 0.97
                textString = f'<b>{name}</b>: {calc:.2f}'
                lineType = 'dot'
            case 'Median':
                tendencyColor = 'green'
                yCord = 0.93
                textString = f'<b>{name}</b>: {calc}'
                lineType = 'dot'
            case 'Mode':
                tendencyColor = 'darkred'
                yCord = 0.89
                textString = f'<b>{name}</b>: {calc}'
                lineType = 'dash'
    
        # Add a vertical line to mark the mean, median, or mode on the histogram
        fig.add_vline(
            x = calc,
            line_width = 3,
            line_dash = lineType,
            line_color = tendencyColor,
        )
        # Add the annotation text using paper reference. See:
        # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
        fig.add_annotation(
            text = textString,
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': tendencyColor ,
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = yCord - yOffset, 
            showarrow = False
        )
    
    # Create a graph title
    graphTitle = f"Measures of central tendency analysis of: {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()    


### SUMMARY

The following script summarizes the mean, median, and mode of each numerical feature:

In [23]:
# Create a list to store dictionary values
numericalSummary_list = []

# Create a summary of the mean, median, and mode values for numerical features
for numerical in diamonds_num_df.columns.tolist():
    
    # Calculate the measures of central tendency for numerical features
    meanCalc = round(diamonds_num_df[numerical].mean(), 2)
    medianCalc = diamonds_num_df[numerical].median()
    modeCalc = diamonds_num_df[numerical].mode()[0]

    # Create the dictionary
    dictionary = {
        'Feature' : numerical,
        'Mean' : meanCalc,
        'Median' : medianCalc,
        'Mode' : modeCalc
    }
    
    # Append the dictionary to the dictionary list
    numericalSummary_list.append(dictionary)

# Create a pandas dataframe to display the results as a table
numericalSummary_df = pd.DataFrame(numericalSummary_list)

# Display the summary table
numericalSummary_df

Unnamed: 0,Feature,Mean,Median,Mode
0,Carat,0.8,0.7,0.3
1,Crown Height (mm),5.73,5.7,4.37
2,Girdle Diameter (mm),5.73,5.71,4.34
3,Pavillion Depth (mm),3.54,3.53,2.7
4,Table %,57.46,57.0,56.0
5,Total Depth %,61.75,61.8,62.0
6,Price,3933.07,2401.0,605.0
7,Price(2024),4271.31,2607.49,657.03


# MEASURES OF DISPERSION

The measures of dispersion are statistical metrics that quantify the spread or variability of a dataset. They help quantify how much the individual data points deviate from the central tendency (mean or median) of the dataset. The higher the dispersion, the further the data points are scattered from the center.

These measures will help us understand the shape and distribution of the data, identify outliers, and assess data quality. 

The measures of dispersion differ for categorical and numerical data:

**Categorical**:

1. **Frequency Distribution**: The total count of each category in a feature.
2. **Proportion**: The percentage break-down of each category in a feature.
3. **Diversity Index**:  The statistical measure that quantifies the evenness of the distribution across categories. A higher diversity index indicates a more balanced distribution, where no single category has a significantly higher frequency.

**Numerical**:

1. **Range**: The simplest measure of dispersion, calculated as the difference between the maximum and minimum values in the dataset. While easy to calculate, the range doesn't account for the distribution of values within the dataset.
2. **Variance**: The average of the squared differences between each data point and the mean of the dataset. Variance gives a more comprehensive understanding of the spread, but since it's in squared units, it may not be as interpretable as other measures.
3. **Standard Deviation**: The square root of the variance. Standard deviation is widely used due to its intuitive interpretation and its measurement in the same units as the original data.
4. **Interquartile Range (IQR)**: The difference between the third quartile (Q3) and the first quartile (Q1) of the dataset. IQR is robust against outliers and provides a measure of spread that focuses on the middle 50% of the data.
5. **Mean Absolute Deviation (MAD)**: The average of the absolute differences between each data point and the mean of the dataset. MAD is more robust to outliers compared to standard deviation, as it does not square the differences.
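
The numerical measures listed above can be sketched with pandas as follows. The sample values are hypothetical and chosen only so the arithmetic is easy to follow. The MAD is computed by hand here, on the assumption that a recent pandas version is in use where the older `Series.mad()` helper is no longer available.

```python
import pandas as pd

# Hypothetical sample (illustrative only, not taken from the diamonds dataset)
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Range: difference between the maximum and minimum values
rangeValue = values.max() - values.min()

# Variance and standard deviation (pandas uses the n - 1 denominator by default)
varianceValue = values.var()
stdValue = values.std()

# Interquartile range: Q3 - Q1
iqrValue = values.quantile(0.75) - values.quantile(0.25)

# Mean absolute deviation: average absolute distance from the mean
madValue = (values - values.mean()).abs().mean()

dispersionSummary_df = pd.DataFrame(
    {
        'Measure' : ['Range', 'Variance', 'Standard Deviation', 'IQR', 'MAD'],
        'Value' : [rangeValue, varianceValue, stdValue, iqrValue, madValue]
    }
)

print(dispersionSummary_df.round(2))
```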

## CATEGORICAL DATA

Since this dataset has already been split between categorical and numerical features, we can use the same categorical dataset that was previously created:

### FREQUENCY DISTRIBUTION

Frequency distribution involves understanding the total counts of each category in the dataset. The output is a chart that visually depicts these counts. 
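
For a tabular view, counts and percentage break-downs can be produced together with `value_counts`. This is a minimal sketch on a hypothetical sample; the names `cuts` and `frequency_df` are made up for illustration and are not part of this notebook.

```python
import pandas as pd

# Hypothetical sample of cut grades (illustrative only)
cuts = pd.Series(['Ideal', 'Ideal', 'Ideal', 'Good', 'Good', 'Fair'])

# Total count of each category, sorted from most to least frequent
counts = cuts.value_counts()

# Combine the counts with their percentage break-down
frequency_df = pd.DataFrame(
    {
        'Count' : counts,
        'Proportion (%)' : (counts / counts.sum() * 100).round(2)
    }
)

print(frequency_df)
```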

The following callback function will display the frequency counts of the categorical feature.

In [24]:
# Create an interactive widget to help in graphing
frequenceDistribution_widget = widgets.Dropdown(
    options = categoricalColumns_list,
    value = categoricalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [25]:
# Create an interactive graph to help visualize frequency counts
@widgets.interact(
    categoricalFeature = frequenceDistribution_widget
)
def categoryFrequencyInspector(categoricalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_cat_df,
        x = categoricalFeature,
        category_orders = sortOrder_dict,
        color = categoricalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
        text_auto = True
    )

    # Create a graph title
    graphTitle = f"Frequency analysis of: {categoricalFeature}"

    # Display the text on the bars on top
    fig.update_traces(
        textposition = 'outside'
    )
    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 800
    )

    # Display the results
    return fig.show()


### PROPORTION

Proportion analysis helps understand the dataset in terms of percent (%) break-downs. 

The following callback function will display the proportion of the categorical feature.

In [26]:
# Create an interactive widget to help in graphing
proportionDistribution_widget = widgets.Dropdown(
    options = categoricalColumns_list,
    value = categoricalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [27]:
@widgets.interact(
    feature = proportionDistribution_widget
)
def proportionFunc(feature):
    # Number of records in the dataset
    records = diamonds_cat_df.shape[0]
    
    # Create a dataset to inspect 
    inspector_pd_series = diamonds_cat_df[feature].value_counts()

    # Create lists to store values that will be used in graphing
    proportions_list = []
    colors_list = []

    # Extract values 
    for idx in range(len(inspector_pd_series.values.tolist())):
        # Extract the observation count (value) of the chosen feature
        value = inspector_pd_series.values.tolist()[idx]
        # Calculate the proportion
        proportion = round((value / records) * 100, 2)
        # Add to list
        proportions_list.append(proportion)
        # Assign a color value
        colorValue = px.colors.qualitative.Dark24[idx]
        # Add the color to the list
        colors_list.append(colorValue)
        
    # Create a figure object
    fig = go.Figure(
        data = [
            go.Bar(
                x = inspector_pd_series.keys().tolist(),
                y = proportions_list,
                texttemplate = '%{y}%',
                marker_color = colors_list,
            )
        ]
    )
    # Display the text on the bars on top
    fig.update_traces(
        textposition = 'outside'
    )
    # Update the layout and title
    fig.update_layout(
        title = f'% Break-Out analysis of: {feature}',
        template = "plotly_white",
        xaxis_title_text = f'{feature}', # xaxis label
        yaxis_title_text = 'Percentages %', # yaxis label
        xaxis = {
            'categoryorder' : 'total descending'
        },
        height = 600,
        width = 800,
        
    )
    
    return fig.show()


### DIVERSITY INDEX

Diversity indices are measures that reflect the degree of diversity or variety within a dataset. They allow you to quantify whether a dataset is balanced or imbalanced.

The two most commonly used diversity indices are:

1. **Shannon Diversity Index (H'):**

   - This index accounts for both richness and evenness.
   - It ranges from 0 (no diversity - a single category holds every observation) to a maximum of $\ln(k)$ for $k$ categories (high diversity - all categories equally abundant).
   - It's calculated using the formula: $H' = -\sum_{i} p_i \ln(p_i)$, where $p_i$ is the proportion of observations belonging to the $i$-th category.

2. **Simpson Diversity Index (D):**

   - This index focuses on the probability of picking two individuals from different categories.
   - It ranges from 0 (no diversity - a single category holds every observation) to a maximum of $1 - 1/k$ for $k$ categories, which approaches 1 as the number of equally abundant categories grows.
   - It's calculated using the formula: $D = 1 - \sum_{i} p_i^2$, where $p_i$ is the proportion of observations belonging to the $i$-th category.

**Choosing the Best Diversity Index**

There's no single "best" diversity index for categorical data.  The most suitable choice depends on the specific question you're trying to answer and the characteristics of your data. Here are some factors to consider:

* **Focus on richness vs evenness:** If you're primarily interested in the number of categories, Shannon's index might be more appropriate. However, if evenness is crucial, Simpson's index might be a better choice. 
* **Data interpretation:**  Consider how easily interpretable the index's output is for your audience. Shannon's index uses natural logarithms, while Simpson's index is easier to grasp intuitively (0-1 scale).
* **Dominant categories:** If you expect a high degree of dominance (one or few categories with significantly higher counts), Simpson's index might be more sensitive to this and provide a clearer picture.
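
Both indices can be computed from category proportions in a few lines. The sketch below is an illustration under stated assumptions: the helper names `shannonIndex` and `simpsonIndex` and the two sample features are hypothetical and are not part of this notebook.

```python
import numpy as np
import pandas as pd

def shannonIndex(feature_series):
    # Proportion of observations in each category (p_i)
    proportions = feature_series.value_counts(normalize = True)
    # H' = -sum(p_i * ln(p_i))
    return float(-(proportions * np.log(proportions)).sum())

def simpsonIndex(feature_series):
    # Proportion of observations in each category (p_i)
    proportions = feature_series.value_counts(normalize = True)
    # D = 1 - sum(p_i^2)
    return float(1 - (proportions ** 2).sum())

# Hypothetical features: one perfectly balanced, one dominated by a single category
balanced = pd.Series(['A', 'B', 'C', 'D'] * 25)
dominated = pd.Series(['A'] * 97 + ['B', 'C', 'D'])

for name, feature in (('Balanced', balanced), ('Dominated', dominated)):
    print(f'{name}: Shannon = {shannonIndex(feature):.3f}, Simpson = {simpsonIndex(feature):.3f}')
```

For the balanced feature with four categories, the Shannon index equals $\ln(4)$ and the Simpson index equals 0.75; both values drop sharply for the dominated feature.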

<div class="alert alert-warning" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p> There are only a few categories in this dataset. Evenness is not a major concern. Frequency and proportion analysis is adequate.</p>
</div>

## NUMERICAL DATA

Since this dataset has already been split between categorical and numerical features, we can use the same numerical dataset that was previously created:

### RANGE

The range helps us in the following way:

1. **Understand the spread of the data:** The range gives you a basic idea of how spread out the data points are for a particular numerical feature. It's simply the difference between the highest and lowest values. While it doesn't take into account the distribution of the data, it does give you a quick sense of the overall spread.

2. **Identify potential outliers:** The range can help identify potential outliers or anomalies in the data. If the range is very large, it may indicate the presence of outliers or errors in the data.

The following script will calculate the range for each feature found in this dataset:

In [28]:
# Create a list to store dictionary values
rangeSummary_list = []

# Create a summary of the min, max, and range values for numerical features
for numerical in numericalColumns_list:
    # Extract min / max values
    minValue = diamonds_num_df[numerical].min()
    maxValue = diamonds_num_df[numerical].max()

    # Calculate the range
    rangeValue = maxValue - minValue

    # Create the dictionary
    dictionary = {
        'Feature' : numerical,
        'Min' : minValue,
        'Max' : maxValue,
        'Range' : rangeValue
    }
    
    # Append the dictionary to the dictionary list
    rangeSummary_list.append(dictionary)

# Create a pandas dataframe to display the results as a table
rangeSummary_df = pd.DataFrame(rangeSummary_list)

# Display the summary table
rangeSummary_df

Unnamed: 0,Feature,Min,Max,Range
0,Carat,0.2,5.01,4.81
1,Crown Height (mm),0.0,10.74,10.74
2,Girdle Diameter (mm),0.0,58.9,58.9
3,Pavillion Depth (mm),0.0,31.8,31.8
4,Table %,43.0,95.0,52.0
5,Total Depth %,43.0,79.0,36.0
6,Price,326.0,18823.0,18497.0
7,Price(2024),354.04,20441.78,20087.74


In [29]:
# Create an interactive widget to help in graphing
rangeFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [30]:
# Create an interactive widget to help in graphing
rangeMarginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

The following callback function analyzes range data.

In [31]:
# Create an interactive graph to help analyze range values
@widgets.interact(
    numericalFeature = rangeFeatures_widget,
    marginalFeature = rangeMarginals_widget
)
def rangeInspector(marginalFeature, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when using a marginal plot 
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0
    
    # Extract raw data
    extractedData = rangeSummary_df[rangeSummary_df['Feature'] == numericalFeature]
    # Transform data
    rangeData = extractedData.iloc[0]
    # Extract min/max/range
    minValue = rangeData['Min']
    maxValue = rangeData['Max']
    rangeValue = round(rangeData['Range'], 2)
    
    # Bundle values in tuples
    rangeName_tuple = ('Min', 'Max', 'Range')
    range_tuple = (minValue, maxValue, rangeValue)
    
    # Generate vertical lines and annotations
    for name, calc in zip(rangeName_tuple, range_tuple):
        # Used to customize the annotations and vertical lines
        match name:
            case 'Min':
                tendencyColor = 'black'
                yCord = 0.97
                textString = f'<b>{name}</b>: {minValue}'
                lineType = 'dot'
            case 'Max':
                tendencyColor = 'green'
                yCord = 0.93
                textString = f'<b>{name}</b>: {maxValue}'
                lineType = 'dot'
            case 'Range':
                tendencyColor = 'darkred'
                yCord = 0.89
                textString = f'<b>{name}</b>: {rangeValue}'
                lineType = 'dash'
        
        # Add vertical lines to mark the min and max values on the histogram.
        # The range is a width rather than a position on the x-axis, so it is only annotated.
        if name != 'Range':
            fig.add_vline(
                x = calc,
                line_width = 3,
                line_dash = lineType,
                line_color = tendencyColor,
            )
        # Add the annotation text using paper reference. See:
        # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
        fig.add_annotation(
            text = textString,
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': tendencyColor ,
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = yCord - yOffset, 
            showarrow = False
        )
    
    # Create a graph title
    graphTitle = f"Range analysis of: {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()    


<div class="alert alert-danger" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p> There appear to be outliers in this dataset that need to be addressed.</p>
</div>

### VARIANCE ($\sigma^2$)

Variance quantifies the spread or dispersion of the data points around the mean. Variance is the square of the standard deviation.

> **High Variance**: Indicates that the data points are more spread out from the mean. This means higher variability in the data.
>
> **Low Variance**: Indicates that the data points are closer to the mean, suggesting lower variability in the data.

Variance $(\sigma^2)$ is calculated using the following formula:

$\sigma^2 = \frac{\sum(x_{i}-\bar{x})^2}{n-1}$

Where:
* **$\sigma^2$** = Variance
* **$x_{i}$** = Value of the $i$-th record in the dataset
* **$\bar{x}$** = Sample mean
* **$\sum$** = Sum
* **$n$** = Sample size
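
As a quick check of the formula, the sketch below computes the variance by hand and compares it with the pandas `var()` method, which uses the same $n-1$ denominator by default (`ddof = 1`). The sample values are hypothetical and chosen so the arithmetic is easy to verify.

```python
import pandas as pd

# Hypothetical sample (illustrative only)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Sample size and sample mean
n = len(values)
meanValue = values.mean()

# Sum of squared differences from the mean, divided by n - 1
manualVariance = ((values - meanValue) ** 2).sum() / (n - 1)

# pandas equivalent (ddof = 1 is the default)
pandasVariance = values.var()

# Population variance divides by n instead (ddof = 0)
populationVariance = values.var(ddof = 0)

print(manualVariance, pandasVariance, populationVariance)
```

Here the squared differences sum to 10, so the sample variance is 10 / 4 = 2.5, while the population variance is 10 / 5 = 2.0.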

In [32]:
# Create an interactive widget to help in graphing
varianceFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [33]:
# Create an interactive widget to help in graphing
varianceMarginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

In [34]:
# Create an interactive graph to help analyze variance
@widgets.interact(
    numericalFeature = varianceFeatures_widget,
    marginalFeature = varianceMarginals_widget
)
def varianceInspector(marginalFeature, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when using a marginal plot 
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0
    
    # Calculate the mean of the feature
    meanValue = round(diamonds_num_df[numericalFeature].mean(), 2)
    # Calculate the variance of the feature
    varianceValue = round(diamonds_num_df[numericalFeature].var(), 2)

    # Set the boundaries for a rectangle centered on the mean to show the variance
    x0Value = meanValue - varianceValue
    x1Value = meanValue + varianceValue
    
    # Section creates a vertical line and annotation on the graph, showing the dataset mean
    # Create a vertical line and annotation for the mean value
    fig.add_vline(
        x = meanValue,
        line_width = 2,
        line_dash = 'dot',
        line_color = 'darkred'
    )
    # Add the annotation text using paper reference. See:
    # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
    # Annotation to display the mean
    fig.add_annotation(
            text = f'<b>Mean</b>: {meanValue}',
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': 'darkred',
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = 0.97 - yOffset, 
            showarrow = False
    )

    # Section creates a rectangle and annotation on the graph, showing the dataset variance
    # Add a rectangle to show the variance relative to the mean
    fig.add_vrect(
        x0 = x0Value,
        x1 = x1Value,
        label = {
            'text' : '<b>Variance</b>',
            'textposition' : 'top center',
            'font' : {
                'size' : 13,
                'family' : 'Times New Roman',
                'color' : 'black',
            },
        },
        fillcolor = 'grey',
        opacity = 0.25,
        line_width = 0
    )
    # Annotation to display the variance
    fig.add_annotation(
            text = f'<b>Variance</b>: {varianceValue}',
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': 'black',
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = 0.93 - yOffset, 
            showarrow = False
    )
        

    # Create a graph title
    graphTitle = f"Variance analysis of: {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()  


### STANDARD DEVIATION ($\sigma$)

Another method of analyzing the variability of data is to measure the standard deviation. Just like variance, the standard deviation determines the amount of variability around the mean.

Variance is the square of the standard deviation. Therefore, if you know the variance, you can take its square root to get the standard deviation.

> **High Standard Deviation**: Indicates that the data points are spread out.
>
> **Low Standard Deviation**: Indicates that the data points are close to the mean.

In a normal distribution, about 68% of the data points fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations, and about 99.7% fall within 3 standard deviations. A data point within 1 standard deviation of the mean is considered relatively close, while data points more than 2 to 3 standard deviations from the mean can be considered outliers.

<img 
     src="../../00_Data/01_Assets/standardDeviation.png" 
     alt="Standard Deviation"
     style="width:1000px;height:300px;"
     >

<div class="alert alert-warning" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p> The above rule only applies to normal distributions. Non-normal distributions require other analysis methods to determine the spread of data.</p>
</div>
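
To illustrate the note above, the following sketch measures how much of a sample actually falls within 1, 2, and 3 standard deviations of the mean. It uses synthetic data generated with NumPy (a seeded normal sample and a right-skewed exponential sample); both samples and the helper `coverage_within` are assumptions made for this example and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def coverage_within(series, k):
    """Return the fraction of values within k standard deviations of the mean."""
    mean_value = series.mean()
    std_value = series.std()
    return series.between(mean_value - k * std_value, mean_value + k * std_value).mean()

# Synthetic samples (illustrative only)
rng = np.random.default_rng(seed=42)
normal_sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100_000))
skewed_sample = pd.Series(rng.exponential(scale=1.0, size=100_000))

# Compare the coverage of both samples for 1, 2, and 3 standard deviations
for k in (1, 2, 3):
    print(
        f"{k} std: normal = {coverage_within(normal_sample, k):.3f}, "
        f"skewed = {coverage_within(skewed_sample, k):.3f}"
    )
```

For the normal sample the fractions should land close to 0.68, 0.95, and 0.997, while the skewed sample deviates from these values, most visibly within 1 standard deviation.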

The relationship between variance and standard deviation is shown below:

$\sigma = \sqrt{\sigma^2}$

Where:

* **$\sigma$** = Standard deviation.
* **$\sigma^2$** = Variance.

The standard deviation $\sigma$ is calculated using the following formula:

$\sigma = \sqrt{\frac{\sum(x_{i}-\bar{x})^2}{n-1}}$

Where:
* **$\sigma$** = Standard deviation
* **$x_{i}$** = Value of the $i$-th record in the dataset
* **$\bar{x}$** = Sample mean
* **$\sum$** = Sum
* **$n$** = Sample size
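
A short check of the relationship above, assuming made-up values in `price_sample` (they are not taken from the diamonds dataset): the square root of pandas' sample variance should equal pandas' sample standard deviation, since both use $n-1$ in the denominator by default.

```python
import math
import pandas as pd

# Hypothetical sample values (illustrative only)
price_sample = pd.Series([326.0, 554.0, 2757.0, 4268.0, 18823.0])

# Sample variance and sample standard deviation (both use n - 1 by default)
variance_value = price_sample.var()
std_value = price_sample.std()

# The standard deviation is the square root of the variance
print(f"sqrt(variance): {math.sqrt(variance_value):.2f}")
print(f"std():          {std_value:.2f}")
```

Unlike the variance, the standard deviation is in the same units as the feature itself, so it can be compared directly against the mean.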

In [35]:
# Create an interactive widget to help in graphing
stdDeviationFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [36]:
# Create an interactive widget to help in graphing
stdDeviationMarginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

In [37]:
# Create an interactive graph to help analyze standard deviation
@widgets.interact(
    numericalFeature = stdDeviationFeatures_widget,
    marginalFeature = stdDeviationMarginals_widget
)
def stdDeviationInspector(marginalFeature, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when a marginal plot is being used
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0

    # Calculate the mean of the feature
    meanValue = round(diamonds_num_df[numericalFeature].mean(), 2)
    # Calculate the standard deviation of the feature
    stdDeviationValue = round(diamonds_num_df[numericalFeature].std(), 2)

    # Use a for loop to create rectangles and annotations showing the boundaries of the standard deviations: X1, X2, X3
    for i in range(3):
        # Use match / case to store the parameters for creating each rectangle and annotation
        # Cases are in reverse order to manage layers overlapping with each other in the generated graph
        match i:
            # Third standard deviation
            case 0:
                x0Value = meanValue - (3 * stdDeviationValue)
                x1Value = meanValue + (3 * stdDeviationValue)
                yValue = 0.85
                backgroundFillColor = 'silver'
                annotationTextValue = f'<b>3 Standard Deviations</b>: {stdDeviationValue * 3:.2f}'
            # Second standard deviation
            case 1:
                x0Value = meanValue - (2 * stdDeviationValue)
                x1Value = meanValue + (2 * stdDeviationValue)
                yValue = 0.89
                backgroundFillColor = 'darkorange'
                annotationTextValue = f'<b>2 Standard Deviations</b>: {stdDeviationValue * 2:.2f}'
            # First standard deviation
            case 2:
                x0Value = meanValue - stdDeviationValue
                x1Value = meanValue + stdDeviationValue
                yValue = 0.93
                backgroundFillColor = 'lime'
                annotationTextValue = f'<b>1 Standard Deviation</b>: {stdDeviationValue:.2f}'

        # Section creates the rectangles and annotations for each standard deviation
        # Add a rectangle to show the standard deviation band relative to the mean
        fig.add_vrect(
            x0 = x0Value,
            x1 = x1Value,
            fillcolor = backgroundFillColor,
            opacity = 0.25,
            line_width = 0
        )
        # Annotation to display the standard deviation value
        fig.add_annotation(
                text = annotationTextValue,
                font = {
                    'size' : 13,
                    'family' : 'Times New Roman',
                    'color': 'black',
                },
                bgcolor = backgroundFillColor,
                xref = "paper", 
                yref = "paper",
                x = 0.90, 
                y = yValue - yOffset, 
                showarrow = False
        )

    # Section creates a vertical line and annotation on the graph, showing the dataset mean
    # Create a vertical line and annotation for the mean value
    fig.add_vline(
        x = meanValue,
        line_width = 2,
        line_dash = 'dot',
        line_color = 'darkred'
    )
    # Add the annotation text using paper reference. See:
    # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
    # Annotation to display the mean
    fig.add_annotation(
            text = f'<b>Mean</b>: {meanValue}',
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': 'darkred',
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = 0.97 - yOffset, 
            showarrow = False
    )
        
    # Create a graph title
    graphTitle = f"Standard deviation analysis of: {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()  


### INTERQUARTILE RANGE (IQR)

The IQR provides a measure of spread that focuses on the central part of the data, with respect to the **median** value. It represents the range of values within which the middle 50% of the data points fall. IQR helps us analyze the following:

1. **Understanding spread for skewed data and outliers**:  The range (difference between highest and lowest values) is a common measure of spread, but it can be misleading for skewed data or data with outliers.  The IQR focuses on the middle 50% of the data, giving a better idea of how spread out the typical values are.  Since it's not influenced by extreme values, it's robust to outliers. 

2. **Identifying outliers**:  The IQR helps pinpoint outliers in the data.  By looking at values that fall outside a specific range based on the IQR (typically more than $1.5 \times IQR$ below the first quartile or above the third quartile), we can identify data points that differ significantly from the rest.  This can be crucial for further analysis, as outliers can sometimes skew results.

Analyzing the IQR provides a more robust picture of how spread out the "typical" data points are, even when dealing with skewed distributions or outliers. A larger IQR indicates more variability.

IQR is calculated using the following formula:

$IQR = Q3 - Q1$

Where:
* **$Q3$** = 75th Percentile
* **$Q1$** = 25th Percentile
* **$IQR$** = Interquartile range
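
The calculation and the 1.5 IQR outlier check described above can be sketched as follows. The values in `depth_sample` are made up for illustration, with one deliberately extreme value, and the names `lower_fence` and `upper_fence` are assumptions of this example rather than names used in the notebook.

```python
import pandas as pd

# Hypothetical sample values (illustrative only); 9.5 is a deliberately extreme value
depth_sample = pd.Series([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.5])

# 25th and 75th percentiles
q1 = depth_sample.quantile(0.25)
q3 = depth_sample.quantile(0.75)

# Interquartile range
iqr = q3 - q1

# Fences at 1.5 x IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = depth_sample[(depth_sample < lower_fence) | (depth_sample > upper_fence)]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Potential outliers: {outliers.tolist()}")
```

Only the extreme value falls outside the fences, and it has no effect on Q1, Q3, or the IQR itself.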

In [38]:
# Create an interactive widget to help in graphing
iqrFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [39]:
# Create an interactive widget to help in graphing
iqrMarginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

In [40]:
# Create an interactive graph to help analyze the IQR
@widgets.interact(
    numericalFeature = iqrFeatures_widget,
    marginalFeature = iqrMarginals_widget
)
def iqrInspector(marginalFeature, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when a marginal plot is being used
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0

    # Calculate the median of the feature
    medianValue = round(diamonds_num_df[numericalFeature].median(), 2)
    # Extract the 25% and 75% percentiles
    q1, q3 = diamonds_num_df[numericalFeature].quantile(
        q = [0.25, 0.75],
    )
    # Calculate the interquartile range of the data
    IQR = q3 - q1  

    # Section creates the rectangle and annotation for the IQR
    # Add a rectangle spanning Q1 to Q3, which covers the middle 50% of the data
    fig.add_vrect(
        x0 = q1,
        x1 = q3,
        fillcolor = 'silver',
        opacity = 0.25,
        line_width = 0
    )
    # Annotation to display the IQR value
    fig.add_annotation(
            text = f'<b>IQR</b>: {IQR:.2f}',
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': 'black',
            },
            bgcolor = 'silver',
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = 0.92 - yOffset, 
            showarrow = False
    )

    # Section creates a vertical line and annotation on the graph, showing the dataset median
    # Create a vertical line and annotation for the median value
    fig.add_vline(
        x = medianValue,
        line_width = 2,
        line_dash = 'dot',
        line_color = 'darkred'
    )
    # Add the annotation text using paper reference. See:
    # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
    # Annotation to display the median
    fig.add_annotation(
            text = f'<b>Median</b>: {medianValue}',
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': 'darkred',
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = 0.97 - yOffset, 
            showarrow = False
    )
        
    # Create a graph title
    graphTitle = f"IQR analysis of: {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()  


### MEAN ABSOLUTE DEVIATION (MAD)

MAD measures the average absolute deviation from the mean. Because it uses absolute rather than squared deviations, MAD is less sensitive to outliers than variance and standard deviation, which makes it a more robust measure of dispersion for datasets that may contain extreme values. Standard deviation and variance are the more commonly used measures of dispersion, but outliers can inflate them and make them misleading. MAD provides a complementary perspective, giving a picture of data spread that is less influenced by extreme values.

MAD is calculated using the following formula:

$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$

Where:

* **$n$** = Sample size 
* **$x_i$** = Value of the $i$-th record in the dataset
* **$\bar{x}$** = Sample mean
* **$\sum$** = Sum
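
A minimal sketch of the MAD formula above, compared against the sample standard deviation. The values in `table_sample` are made up for illustration, with one deliberately extreme value, and are not taken from the diamonds dataset.

```python
import pandas as pd

# Hypothetical sample values (illustrative only); 20.0 is a deliberately extreme value
table_sample = pd.Series([1.0, 2.0, 3.0, 4.0, 20.0])

# Sample mean
mean_value = table_sample.mean()

# Mean absolute deviation: average of |x_i - mean|
mad_value = (table_sample - mean_value).abs().mean()

# Sample standard deviation, for comparison
std_value = table_sample.std()

print(f"Mean: {mean_value:.2f}")
print(f"MAD:  {mad_value:.2f}")
print(f"Std:  {std_value:.2f}")
```

Because the deviations are not squared, the extreme value inflates the MAD less than it inflates the standard deviation in this sample.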

In [41]:
# Create an interactive widget to help in graphing
madFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [42]:
# Create an interactive widget to help in graphing
madMarginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

In [43]:
# Create an interactive graph to help analyze the MAD
@widgets.interact(
    numericalFeature = madFeatures_widget,
    marginalFeature = madMarginals_widget
)
def madInspector(marginalFeature, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when a marginal plot is being used
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0

    # Calculate the mean of the feature
    meanValue = round(diamonds_num_df[numericalFeature].mean(), 2)
    # Calculate the mad
    madValue = round((diamonds_num_df[numericalFeature] - diamonds_num_df[numericalFeature].mean()).abs().mean(), 2) 

    # Section creates the rectangle and annotation for the MAD
    # Add a rectangle centered on the mean to show the MAD
    fig.add_vrect(
        x0 = meanValue - madValue,
        x1 = meanValue + madValue,
        fillcolor = 'darkseagreen',
        opacity = 0.25,
        line_width = 0
    )
    # Annotation to display the MAD value
    fig.add_annotation(
            text = f'<b>MAD</b>: {madValue:.2f}',
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': 'black',
            },
            bgcolor = 'darkseagreen',
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = 0.92 - yOffset, 
            showarrow = False
    )

    # Section creates a vertical line and annotation on the graph, showing the dataset mean
    # Create a vertical line and annotation for the mean value
    fig.add_vline(
        x = meanValue,
        line_width = 2,
        line_dash = 'dot',
        line_color = 'darkred'
    )
    # Add the annotation text using paper reference. See:
    # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
    # Annotation to display the mean
    fig.add_annotation(
            text = f'<b>Mean</b>: {meanValue}',
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': 'darkred',
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = 0.97 - yOffset, 
            showarrow = False
    )
        
    # Create a graph title
    graphTitle = f"MAD analysis of: {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()  


# PERCENTILES

Percentiles are values that divide a dataset into 100 equal parts. Each percentile indicates the value below which a given percentage of observations fall. For instance, the 25th percentile (or the first quartile, Q1) is the value below which 25% of the data points lie.

Percentiles help identify the shape of the data distribution (for example, whether it is skewed, bimodal, or normal) and detect outliers and anomalies in the data.

The following are some interpretations of percentiles:

1. A large difference between percentiles indicates a wide spread in the data.
2. If the median and mean are not equal, the data is likely skewed.
3. If Q1 and Q3 are not equidistant from the median, the data is likely skewed.
4. Data points below the 10th percentile or above the 90th percentile may be considered outliers.

The key percentiles that will be analyzed are:

* **10th Percentile**: Marks the cutoff for the lowest 10% of the data.
* **25th Percentile (Q1)**: Marks the cutoff for the lowest 25% of the data.
* **50th Percentile (Median)**: Divides the data into 2 halves.
* **75th Percentile (Q3)**: Marks the cutoff for the lowest 75% of the data.
* **90th Percentile**: Marks the cutoff for the lowest 90% of the data.
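
The key percentiles listed above can be computed in a single call to `Series.quantile`, as in the sketch below. The values in `price_sample` are made up for illustration and are not taken from the diamonds dataset.

```python
import pandas as pd

# Hypothetical sample values (illustrative only)
price_sample = pd.Series([300, 450, 520, 700, 950, 1200, 1800, 2600, 4100, 9000, 15000])

# Key percentiles expressed as quantile levels between 0 and 1
levels = [0.10, 0.25, 0.50, 0.75, 0.90]
percentiles = price_sample.quantile(levels)

for level, value in percentiles.items():
    print(f"{int(round(level * 100))}th percentile: {value:.2f}")

# Comparing the distances from the median to Q1 and Q3 hints at skew
median_value = percentiles.loc[0.50]
lower_spread = median_value - percentiles.loc[0.25]
upper_spread = percentiles.loc[0.75] - median_value
print(f"Median - Q1 = {lower_spread:.2f}, Q3 - Median = {upper_spread:.2f}")
```

In this sample Q3 sits much farther from the median than Q1 does, which is the pattern expected for right-skewed data.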

<div class="alert alert-warning" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p> Percentile analysis can only be done on <b>numerical data</b> that is <b>continuous</b>. Percentile analysis is not applicable to discrete data.</p>
</div>

In [44]:
# Create an interactive widget to help in graphing
percentileFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [45]:
# Create an interactive widget to help in graphing
percentileMarginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

In [46]:
# Create an interactive widget to help choose a percentile to graph
percentileSelector_widget = widgets.RadioButtons(
    options = (
        None,
        '10th Percentile',
        '25th Percentile',
        '75th Percentile',
        '90th Percentile',
        '95th Percentile'
    ),
    value = None, 
    layout = {
        'width': 'max-content', # If the items' names are long
    }, 
    description = 'Percentile:',
    disabled = False
)

In [47]:
# Create an interactive graph to help analyze percentiles
@widgets.interact(
    numericalFeature = percentileFeatures_widget,
    marginalFeature = percentileMarginals_widget,
    percentileName = percentileSelector_widget
)
def percentileInspector(marginalFeature, percentileName, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when a marginal plot is being used
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0

    # Map each selectable percentile name to its quantile level
    quantileLevels_dict = {
        '10th Percentile' : 0.10,
        '25th Percentile' : 0.25,
        '75th Percentile' : 0.75,
        '90th Percentile' : 0.90,
        '95th Percentile' : 0.95,
    }

    # Set the graphing and annotation values to visualize the Percentiles
    if percentileName is None:
        # Used to draw a graph with no shapes or annotations
        graphType = 0
    else:
        # Used to draw a graph with shapes or annotations
        graphType = 1
        # Create a graph title
        graphTitle = f"{percentileName} analysis of: {numericalFeature}"
        # Calculate the percentile
        percentile = diamonds_num_df[numericalFeature].quantile(quantileLevels_dict[percentileName])

    # Format the graph based on the Percentile selected
    if graphType == 0:
        # Create a graph title
        graphTitle = f"Percentile analysis of: {numericalFeature}"
    # Used when a percentile has been selected
    else:
        # Add a rectangle spanning from the minimum value to the selected percentile
        fig.add_vrect(
            x0 = diamonds_num_df[numericalFeature].min(),
            x1 = percentile,
            fillcolor = 'darkorange',
            opacity = 0.25,
            line_width = 0
        )
        # Annotation to display the Percentile
        fig.add_annotation(
                text = f'<b>{percentileName}</b>: {percentile:.2f}',
                font = {
                    'size' : 13,
                    'family' : 'Times New Roman',
                    'color': 'black',
                },
                bgcolor = 'darkorange',
                xref = "paper", 
                yref = "paper",
                x = 0.90, 
                y = 0.89 - yOffset, 
                showarrow = False
        )  
        
    # Extract the mean and median
    meanCalc = diamonds_num_df[numericalFeature].mean()
    medianCalc = diamonds_num_df[numericalFeature].median()
    
    # Bundle values in tuples
    centralTendencyName_tuple = ('Mean', 'Median')
    centralTendency_tuple = (meanCalc, medianCalc)
    
    # Generate vertical lines and annotations
    for name, calc in zip(centralTendencyName_tuple, centralTendency_tuple):
        # Used to customize the annotations and vertical lines
        match name:
            case 'Mean':
                tendencyColor = 'black'
                yCord = 0.97
                textString = f'<b>{name}</b>: {meanCalc:.2f}'
                lineType = 'dot'
            case 'Median':
                tendencyColor = 'darkred'
                yCord = 0.93
                textString = f'<b>{name}</b>: {medianCalc:.2f}'
                lineType = 'dash'
    
        # Add a vertical line to show the mean or median value on the histogram
        fig.add_vline(
            x = calc,
            line_width = 3,
            line_dash = lineType,
            line_color = tendencyColor,
        )
        # Add the annotation text using paper reference. See:
        # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
        fig.add_annotation(
            text = textString,
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': tendencyColor ,
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = yCord - yOffset, 
            showarrow = False
        )

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()    


# FREQUENCY DISTRIBUTIONS

Frequency distribution for categorical data was already completed in the [Frequency Distribution](#FREQUENCY-DISTRIBUTION) section of this notebook. This section can be found in the [MEASURES OF DISPERSION](#MEASURES-OF-DISPERSION) chapter above.

Therefore, the focus of this chapter is as follows:

1. Analyze the bin width and number of bins of each feature.
2. Discretize continuous values to simplify frequency distributions, where applicable.
3. Create a bin table.

There are several methods to determine bin width and number of bins. The ones that will be explored in this notebook are:

1. **Sturges' Rule**:
   - **Formula**: $k = \lceil \log_2(n) + 1 \rceil$
   - **Description**: This method is based on the assumption that the data follows a normal distribution. It works well for smaller datasets.
   - **Usage**: Suitable for small to moderately sized datasets.
2. **Scott's Rule**:
   - **Formula**: $h = \frac{3.5 \cdot \sigma}{n^{1/3}}$
   - **Description**: This method calculates bin width based on the standard deviation of the data and the number of data points. It aims to minimize the mean integrated squared error.
   - **Usage**: Suitable for data that follows a normal distribution.
3. **Freedman-Diaconis Rule**:
   - **Formula**: $h = \frac{2 \cdot IQR}{n^{1/3}}$
   - **Description**: This method uses the interquartile range (IQR) to calculate the bin width, making it more robust to outliers compared to Scott's Rule.
   - **Usage**: Suitable for skewed distributions or data with outliers.
4. **Square Root Choice (Manual)**:
   - **Formula**: $k = \lceil \sqrt{n} \rceil$
   - **Description**: A simple method where the number of bins is the square root of the number of data points.
   - **Usage**: Provides a quick and easy estimate, often used in exploratory data analysis.
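
The four rules above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the notebook's `binMethods` module (whose source is not shown here): the function names are hypothetical, the data is a synthetic seeded sample, and for the width-based rules the number of bins is derived as $\lceil (\max - \min) / h \rceil$, which is an assumption of this example.

```python
import math
import numpy as np

def sturges_bins(values):
    """Number of bins from Sturges' Rule: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(len(values)) + 1)

def square_root_bins(values):
    """Number of bins from the square root choice: ceil(sqrt(n))."""
    return math.ceil(math.sqrt(len(values)))

def scott_width(values):
    """Bin width from Scott's Rule: 3.5 * sigma / n^(1/3)."""
    return 3.5 * np.std(values, ddof=1) / len(values) ** (1 / 3)

def freedman_diaconis_width(values):
    """Bin width from the Freedman-Diaconis Rule: 2 * IQR / n^(1/3)."""
    q1, q3 = np.percentile(values, [25, 75])
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

def bins_from_width(values, width):
    """Assumed conversion from bin width to bin count: ceil(range / width)."""
    return math.ceil((np.max(values) - np.min(values)) / width)

# Synthetic, seeded sample (illustrative only, not the diamonds dataset)
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=5.0, scale=1.0, size=1_000)

scott_h = scott_width(sample)
fd_h = freedman_diaconis_width(sample)

print("Sturges bins:          ", sturges_bins(sample))
print("Square root bins:      ", square_root_bins(sample))
print(f"Scott width:            {scott_h:.3f} -> {bins_from_width(sample, scott_h)} bins")
print(f"Freedman-Diaconis width: {fd_h:.3f} -> {bins_from_width(sample, fd_h)} bins")
```

Sturges' Rule and the square root choice depend only on the sample size, so they return the same number of bins for every feature of a dataset, whereas the two width-based rules adapt to each feature's spread.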

A bin table is created to compare these values:

In [48]:
# List that will store dictionaries that will be used to create a dataframe
calculatedBin_list = []
# Loop over every numerical feature in the dataframe
for feature in diamonds_num_df.columns.tolist():
    # Cycle through the 4 types of binning methods to calculate bin width and number of bins
    for i in range(4):
        match i:
            case 0:
                binningMethod = "Manual"
                binWidth, numBins = binMethods.manualMethod(diamonds_num_df[feature])
            case 1:
                binningMethod = "Sturges' Rule"
                binWidth, numBins = binMethods.sturgesRule(diamonds_num_df[feature])
            case 2:
                binningMethod = "Scott's Rule"
                binWidth, numBins = binMethods.scottsRule(diamonds_num_df[feature])
            case 3:
                binningMethod = "Freedman-Diaconis Rule"
                binWidth, numBins = binMethods.freedmanDiaconisRule(diamonds_num_df[feature])
        
        # Handles the cases where the binning method does not provide a bin width
        if binWidth is None:
            binWidth = '---'
        
        # Create a dictionary of values
        binValues_dict = {
            "Feature Name" : feature,
            "Binning Method" : binningMethod,
            "Bin Width" : binWidth,
            "Number of Bins" : numBins
        }
        
        # Add the dictionary to the list
        calculatedBin_list.append(binValues_dict)

# Create a pandas dataframe from the list, to display the results as a table
binTable_df = pd.DataFrame(calculatedBin_list)

# Display the summary table
binTable_df

Unnamed: 0,Feature Name,Binning Method,Bin Width,Number of Bins
0,Carat,Manual,---,232
1,Carat,Sturges' Rule,---,12
2,Carat,Scott's Rule,0.044,110
3,Carat,Freedman-Diaconis Rule,0.034,141
4,Crown Height (mm),Manual,---,232
5,Crown Height (mm),Sturges' Rule,---,12
6,Crown Height (mm),Scott's Rule,0.104,104
7,Crown Height (mm),Freedman-Diaconis Rule,0.097,110
8,Girdle Diameter (mm),Manual,---,232
9,Girdle Diameter (mm),Sturges' Rule,---,12
