<h1 style='color:rgb(52, 152, 219)'; align=center><font size = 8> DIAMOND PRICES - ANALYSIS AND MODELING </font></h1>

<h2 style='color:rgb(52, 152, 219)'; align=left><font size = 6> NOTEBOOK 01-03: DESCRIPTIVE STATISTICS </font></h2>

# REVISION HISTORY

| REV | DESCRIPTION             | DATE         |  BY   | CHECK | APPROVE  |
|:---:|:-----------------------:|:------------:|:-----:|:-----:|:--------:|
| A0  | ISSUED FOR REVIEW (IFR) | 2024-APR-XX  |  IAC  |       |          |
|     |                         |              |       |       |          |

## DETAILED DESCRIPTION OF REVISIONS

> **REV A0** - HOLD

# INTRODUCTION

Descriptive statistics summarizes and organizes the characteristics of a dataset. 

## DESCRIPTIVE STATISTICS STEPS

Below are the descriptive statistics steps performed in this notebook:

1. [X] **Measures of Central Tendency:** Calculate descriptive statistics such as mean, median, and mode to understand the central tendency of the data.
2. [ ] **Measures of Dispersion:** Compute measures like standard deviation, variance, and range to assess the spread or variability of the data.
3. [ ] **Percentiles:** Calculate percentiles to identify specific data points' position within the dataset, and calculate bin-width and bins.
4. [ ] **Frequency Distributions:** Create frequency tables or histograms to visualize the distribution of categorical and numerical variables.
5. [ ] **Skewness and Kurtosis:** Skewness measures the asymmetry of the distribution, indicating whether the data is skewed to the left or right. Kurtosis measures the peakedness or flatness of the distribution.
6. [ ] **Outlier Management:** Identify and manage outliers.
7. [ ] **Correlation Analysis:** Compute correlation coefficients to understand the relationships between pairs of variables.

# REQUIRED LIBRARIES

The following libraries are required to run this notebook.

In [1]:
# Library to create and handle a tabular dataset
import pandas as pd

In [2]:
# Plotly graphing libraries
import plotly.express as px
import plotly.graph_objs as go

In [3]:
# Use the scikit-learn Library to create categorical and numerical features
from sklearn.compose import make_column_selector as selector

In [4]:
# Library used to create interactive controls
import ipywidgets as widgets

# LOAD DATASET

The scripts to load this dataset are:

In [5]:
# Path to the dataset
filePath = "./../00_Data/00_Datasets/"

In [6]:
# Filename
diamondsCSVFilename = 'diamonds_DM.csv'

In [7]:
# Create a data frame
diamonds_df = pd.read_csv(
    filePath + diamondsCSVFilename,
    index_col = None
)

# Display the newly created data frame
diamonds_df

Unnamed: 0,Carat,Diamond Type,Cut,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Color,Color Grades,Clarity,Price,Price(2024)
0,0.23,Small Diamond,Ideal,3.95,3.98,2.43,55.0,61.5,E,Colorless,SI2,326,354.04
1,0.21,Small Diamond,Excellent,3.89,3.84,2.31,61.0,59.8,E,Colorless,SI1,326,354.04
2,0.23,Small Diamond,Good,4.05,4.07,2.31,65.0,56.9,E,Colorless,VS1,327,355.12
3,0.29,Small Diamond,Excellent,4.20,4.23,2.63,58.0,62.4,I,Near Colorless,VS2,334,362.72
4,0.31,Medium Diamond,Good,4.34,4.35,2.75,58.0,63.3,J,Near Colorless,SI2,335,363.81
...,...,...,...,...,...,...,...,...,...,...,...,...,...
53789,0.72,Medium Diamond,Ideal,5.75,5.76,3.50,57.0,60.8,D,Colorless,SI1,2757,2994.10
53790,0.72,Medium Diamond,Good,5.69,5.75,3.61,55.0,63.1,D,Colorless,SI1,2757,2994.10
53791,0.70,Medium Diamond,Very Good,5.66,5.68,3.56,60.0,62.8,D,Colorless,SI1,2757,2994.10
53792,0.86,Medium Diamond,Excellent,6.15,6.12,3.74,58.0,61.0,H,Near Colorless,SI2,2757,2994.10


# MEASURE OF CENTRAL TENDENCY

Measures of central tendency are values that represent the center point of a dataset, which can be thought of as its most typical value. In statistics, central tendency is usually described with three measures: the average value (`mean`), the middle value (`median`), and the most common value (`mode`).

Choosing the right one depends on whether you're dealing with categorical or numerical values. Below are the differences between the two:

**Categorical Data**:
- **Mode**: In categorical data, the mode is often the most relevant measure of central tendency. It represents the most frequently occurring category or class within the dataset.

**Numerical Data**:
* **Mean**: The mean is a common measure of central tendency for numerical data. It is calculated by summing all the values in the dataset and then dividing by the total number of values. The mean is sensitive to extreme values, also known as outliers, and provides a balance point for the data distribution.
* **Median**: The median is another measure of central tendency for numerical data. It represents the middle value when the data is arranged in ascending or descending order. Unlike the mean, the median is not influenced by extreme values, making it more robust in the presence of outliers.
* **Mode**: Similar to categorical data, the mode can also be used for numerical data, indicating the most frequent value.
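
As a minimal sketch of the three numerical measures above, the snippet below uses a small, hypothetical list of prices (not taken from the diamonds dataset) in which the last value is a deliberate outlier. It is only meant to illustrate how the mean reacts to an extreme value while the median and mode do not.

```python
import pandas as pd

# Hypothetical prices; the last value is a deliberate outlier
prices = pd.Series([300, 320, 320, 350, 400, 18000])

# Mean: sum of the values divided by the number of values
meanPrice = prices.mean()

# Median: middle of the sorted values (average of the two middle values here)
medianPrice = prices.median()

# Mode: most frequent value (mode() returns a Series, so take the first entry)
modePrice = prices.mode()[0]

print(f"Mean: {meanPrice:.2f} | Median: {medianPrice} | Mode: {modePrice}")
```

In this sample the single outlier pulls the mean far above the median and the mode, which both stay close to the bulk of the values.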

<div class="alert alert-info" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p>In order to perform this analysis, we are going to need to split the dataset between categorical and numerical features. The process to split the dataset will be shown below.</p>
</div>

**Dataset splitting method**: There are many ways to split a dataset between categorical and numerical features. The method below uses scikit-learn's [`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) helper to create separate categorical and numerical datasets. This method was chosen because the same method will be used to split the dataset between categorical and numerical features during the predictive analytics phase.

## CATEGORICAL DATA

For categorical data, the **mode** is the most common measure of central tendency, and it is the only one available for nominal (unordered) data types. However, for ordinal (ordered) data types you can also determine a **median**, and in some cases even a **mean** if the categories are mapped to a meaningful numeric scale.
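
The sketch below illustrates, under stated assumptions, how a mode and a median could be read from an ordered categorical feature. The sample values are hypothetical, and the grade order is assumed to run from most to least valuable. The median is taken as the lower middle element of the sorted category codes, which is one possible convention rather than the only one.

```python
import pandas as pd

# Assumed grade order, from most valuable to least valuable
cutOrder_list = ['Ideal', 'Excellent', 'Very Good', 'Good', 'Fair']

# Hypothetical sample of cut grades stored as an ordered categorical
cuts = pd.Series(
    pd.Categorical(
        ['Ideal', 'Good', 'Ideal', 'Very Good', 'Fair', 'Excellent', 'Ideal'],
        categories = cutOrder_list,
        ordered = True
    )
)

# The mode is defined for any categorical feature
modeCut = cuts.mode()[0]

# For an ordered feature, sort the integer category codes
# and take the (lower) middle one as the median grade
sortedCodes = cuts.cat.codes.sort_values().reset_index(drop = True)
medianCode = sortedCodes.iloc[(len(sortedCodes) - 1) // 2]
medianCut = cuts.cat.categories[medianCode]

print(f"Mode: {modeCut} | Median: {medianCut}")
```

A mean would additionally require assigning numeric scores to the grades, which is a modeling choice rather than a property of the data.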

In [8]:
# Create an object that will output a list of categorical columns from a dataset
categoricalColumnsSlctr_obj = selector(
    dtype_include = object
)

In [9]:
# Create a list of categorical columns
categoricalColumns_list = categoricalColumnsSlctr_obj(diamonds_df)

In [10]:
# Create a categorical dataset of just categorical features
diamonds_cat_df = diamonds_df[categoricalColumns_list]

# Verify the data
diamonds_cat_df

Unnamed: 0,Diamond Type,Cut,Color,Color Grades,Clarity
0,Small Diamond,Ideal,E,Colorless,SI2
1,Small Diamond,Excellent,E,Colorless,SI1
2,Small Diamond,Good,E,Colorless,VS1
3,Small Diamond,Excellent,I,Near Colorless,VS2
4,Medium Diamond,Good,J,Near Colorless,SI2
...,...,...,...,...,...
53789,Medium Diamond,Ideal,D,Colorless,SI1
53790,Medium Diamond,Good,D,Colorless,SI1
53791,Medium Diamond,Very Good,D,Colorless,SI1
53792,Medium Diamond,Excellent,H,Near Colorless,SI2


### MEASURE OF CENTRAL TENDENCY

The categorical values found in this dataset are all ordinal values: each one grades a diamond. Because the grades are stored as text labels rather than numbers, the measure of central tendency extracted here is the **mode**.

The following scripts graphically display the mode of each categorical feature. 

In [11]:
# Sort categories from most valuable to least.
# Helps in visualizing
sortOrder_dict = {
    'Cut' : ['Ideal', 'Excellent', 'Very Good', 'Good', 'Fair'],
    'Color' : sorted(diamonds_cat_df['Color'].unique().tolist()),
    'Clarity' : ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'],
}

In [12]:
# Create an interactive widget to help in graphing
categoricalFeatures_widget = widgets.Dropdown(
    options = categoricalColumns_list,
    value = categoricalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [13]:
# Create an interactive graph to help visualize categorical data
@widgets.interact(
    categoricalFeature = categoricalFeatures_widget
)
def categoryInspector(categoricalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_cat_df,
        x = categoricalFeature,
        category_orders = sortOrder_dict,
        color = categoricalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white"
    )

    # Extract the mode of the dataset
    mode =  diamonds_cat_df[categoricalFeature].mode()[0]

    # Create a graph title
    graphTitle = f"Mode: {mode}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 800
    )

    # Display the results
    return fig.show()


### SUMMARY

The following scripts will create a summary table of the mode for each categorical feature:

In [14]:
# Create a list to store dictionary values
catSummary_list = []

# Create a summary of the mode values for categorical features
for category in categoricalColumns_list:
    # Extract the mode of the dataset
    mode =  diamonds_cat_df[category].mode()[0]
    # Extract the unique values found in the dataset
    data_list = diamonds_cat_df[category].unique().tolist()
    # Apply a criteria to help sort some values
    if category == 'Color' or category == 'Clarity':
        data_list = sorted(data_list)
    # Create the dictionary
    dictionary = {
        'Feature' : category,
        'Values' : data_list,
        'Mode' : mode
    }
    # Append the dictionary to the dictionary list
    catSummary_list.append(dictionary)

# Create a pandas dataframe to display the results as a table
categoricalSummary_df = pd.DataFrame(catSummary_list)

# Display the summary table
categoricalSummary_df

Unnamed: 0,Feature,Values,Mode
0,Diamond Type,"[Small Diamond, Medium Diamond, Large Diamond]",Medium Diamond
1,Cut,"[Ideal, Excellent, Good, Very Good, Fair]",Ideal
2,Color,"[D, E, F, G, H, I, J]",G
3,Color Grades,"[Colorless, Near Colorless]",Near Colorless
4,Clarity,"[I1, IF, SI1, SI2, VS1, VS2, VVS1, VVS2]",SI1


## NUMERICAL DATA

For continuous numerical variables, you can determine the **mean**, **median**, and **mode**.

In [15]:
# Create an object that will output a list of numerical columns from a dataset
numericalColumnsSlctr_obj = selector(
    dtype_exclude = object
)

In [16]:
# Create a list of numerical columns
numericalColumns_list = numericalColumnsSlctr_obj(diamonds_df)

In [17]:
# Create a numerical dataset of just numerical features
diamonds_num_df = diamonds_df[numericalColumns_list]

# Verify the data
diamonds_num_df

Unnamed: 0,Carat,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Price,Price(2024)
0,0.23,3.95,3.98,2.43,55.0,61.5,326,354.04
1,0.21,3.89,3.84,2.31,61.0,59.8,326,354.04
2,0.23,4.05,4.07,2.31,65.0,56.9,327,355.12
3,0.29,4.20,4.23,2.63,58.0,62.4,334,362.72
4,0.31,4.34,4.35,2.75,58.0,63.3,335,363.81
...,...,...,...,...,...,...,...,...
53789,0.72,5.75,5.76,3.50,57.0,60.8,2757,2994.10
53790,0.72,5.69,5.75,3.61,55.0,63.1,2757,2994.10
53791,0.70,5.66,5.68,3.56,60.0,62.8,2757,2994.10
53792,0.86,6.15,6.12,3.74,58.0,61.0,2757,2994.10


### CENTRAL TENDENCY

The numerical values found in this dataset are all continuous in nature. Therefore, the mean, median, and mode can be extracted from each numerical feature.

These scripts will help visualize numerical features.

In [18]:
# Create an interactive widget to help in graphing
numericalFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [19]:
# Create a list of marginals
marginals_list = [
    None,
    'box',
    'violin',
    'rug'
]

In [20]:
# Create an interactive widget to help in graphing
marginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

In [21]:
# Create an interactive graph to help visualize numerical data
@widgets.interact(
    numericalFeature = numericalFeatures_widget,
    marginalFeature = marginals_widget
)
def numercialInspector(marginalFeature, numericalFeature):
    
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when using a marginal plot 
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0
    
    # Calculate the measures of central tendency for numerical features
    meanCalc = diamonds_num_df[numericalFeature].mean()
    medianCalc = diamonds_num_df[numericalFeature].median()
    modeCalc = diamonds_num_df[numericalFeature].mode()[0]
    
    # Bundle values in tuples
    centralTendancyName_tuple = ('Mean', 'Median', 'Mode')
    centralTendancy_tuple = (meanCalc, medianCalc, modeCalc)
    
    # Generate vertical lines and annotations
    for name, calc in zip(centralTendancyName_tuple, centralTendancy_tuple):
        
        # Used to customize the annotations and vertical lines
        match name:
            case 'Mean':
                tendencyColor = 'darkred'
                yCord = 0.97
                textString = f'<b>{name}</b>: {meanCalc:0.2f}'
            case 'Median':
                tendencyColor = 'darkgreen'
                yCord = 0.93
                textString = f'<b>{name}</b>: {medianCalc}'
            case 'Mode':
                tendencyColor = 'black'
                yCord = 0.89
                textString = f'<b>{name}</b>: {modeCalc}'
    
        # Add some vertical lines to show the mean, median, mode on the histogram
        fig.add_vline(
            x = calc,
            line_width = 2,
            line_dash = 'dot',
            line_color = tendencyColor,
        )

        # Add the annotation text using paper reference. See:
        # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
        fig.add_annotation(
            text = textString,
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': tendencyColor ,
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = yCord - yOffset, 
            showarrow = False
        )
    
    # Create a graph title
    graphTitle = f"Measures of central tendency for {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()    


### SUMMARY

The following script summarizes the mean, median, and mode of each numerical feature:

In [22]:
# Create a list to store dictionary values
numericalSummary_list = []

# Create a summary of the mean, median, and mode for numerical features
for numerical in numericalColumns_list:
    
    # Calculate the measures of central tendency for numerical features
    meanCalc = round(diamonds_num_df[numerical].mean(), 2)
    medianCalc = diamonds_num_df[numerical].median()
    modeCalc = diamonds_num_df[numerical].mode()[0]

    # Create the dictionary
    dictionary = {
        'Feature' : numerical,
        'Mean' : meanCalc,
        'Median' : medianCalc,
        'Mode' : modeCalc
    }
    
    # Append the dictionary to the dictionary list
    numericalSummary_list.append(dictionary)

# Create a pandas dataframe to display the results as a table
numericalSummary_df = pd.DataFrame(numericalSummary_list)

# Display the summary table
numericalSummary_df

Unnamed: 0,Feature,Mean,Median,Mode
0,Carat,0.8,0.7,0.3
1,Crown Height (mm),5.73,5.7,4.37
2,Girdle Diameter (mm),5.73,5.71,4.34
3,Pavillion Depth (mm),3.54,3.53,2.7
4,Table %,57.46,57.0,56.0
5,Total Depth %,61.75,61.8,62.0
6,Price,3933.07,2401.0,605.0
7,Price(2024),4271.31,2607.49,657.03


# MEASURES OF DISPERSION

The measures of dispersion are statistical metrics that quantify the spread or variability of a dataset. They describe how much the individual data points deviate from the central tendency (mean or median) of the dataset. The higher the dispersion, the further the data points are scattered from the center.

These measures will help us understand the shape and distribution of the data, identify outliers, and assess data quality. 

There are different measures of dispersion, for categorical vs. numerical data:

**Categorical**:

1. **Frequency Distribution**: The total count of each category in a feature.
2. **Proportion**: The percentage break-down of each category in a feature.
3. **Diversity Index**:  The statistical measure that quantifies the evenness of the distribution across categories. A higher diversity index indicates a more balanced distribution, where no single category has a significantly higher frequency.

**Numerical**:

1. **Range**: The simplest measure of dispersion, calculated as the difference between the maximum and minimum values in the dataset. While easy to calculate, the range doesn't account for the distribution of values within the dataset.
2. **Variance**: The average of the squared differences between each data point and the mean of the dataset. Variance gives a more comprehensive understanding of the spread, but since it's in squared units, it may not be as interpretable as other measures.
3. **Standard Deviation**: The square root of the variance. Standard deviation is widely used due to its intuitive interpretation and its measurement in the same units as the original data.
4. **Interquartile Range (IQR)**: The difference between the third quartile (Q3) and the first quartile (Q1) of the dataset. IQR is robust against outliers and provides a measure of spread that focuses on the middle 50% of the data.
5. **Mean Absolute Deviation (MAD)**: The average of the absolute differences between each data point and the mean of the dataset. MAD is more robust to outliers compared to standard deviation, as it does not square the differences.
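
As a rough illustration of the five numerical measures listed above, the sketch below computes each one with pandas on a small, hypothetical series of carat weights. Note that `var()` and `std()` use the sample formula (`ddof = 1`) by default, and that the mean absolute deviation is computed by hand here instead of relying on a dedicated pandas method.

```python
import pandas as pd

# Hypothetical carat weights (not taken from the diamonds dataset)
carats = pd.Series([0.2, 0.3, 0.3, 0.5, 0.7, 1.0, 1.2, 2.0])

# Range: maximum minus minimum
valueRange = carats.max() - carats.min()

# Variance and standard deviation (pandas defaults to the sample formula, ddof = 1)
variance = carats.var()
stdDev = carats.std()

# Interquartile range: Q3 minus Q1
q1, q3 = carats.quantile([0.25, 0.75])
iqr = q3 - q1

# Mean absolute deviation: average absolute distance from the mean
mad = (carats - carats.mean()).abs().mean()

# Collect the results in a single table-like Series
dispersionSummary = pd.Series({
    'Range' : valueRange,
    'Variance' : variance,
    'Std Dev' : stdDev,
    'IQR' : iqr,
    'MAD' : mad
})
print(dispersionSummary.round(4))
```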

## CATEGORICAL DATA

Since this dataset has already been split between categorical and numerical features, we can reuse the categorical dataset that was previously created (`diamonds_cat_df`).

### FREQUENCY DISTRIBUTION

Frequency distribution involves understanding the total counts of each category in the dataset. The output is a chart that visually depicts these counts.

The following callback function will display the frequency counts of the categorical feature.

In [23]:
# Create an interactive widget to help in graphing
frequenceDistribution_widget = widgets.Dropdown(
    options = categoricalColumns_list,
    value = categoricalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [24]:
# Create an interactive graph to help visualize frequency counts
@widgets.interact(
    categoricalFeature = frequenceDistribution_widget
)
def categoryFrequencyInspector(categoricalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_cat_df,
        x = categoricalFeature,
        category_orders = sortOrder_dict,
        color = categoricalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
        text_auto = True
    )

    # Create a graph title
    graphTitle = f"Frequency counts: {categoricalFeature}"

    # Display the text on the bars on top
    fig.update_traces(
        textposition = 'outside'
    )

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 800
    )

    # Display the results
    return fig.show()


### PROPORTION

Proportion analysis helps understand the dataset in terms of percent (%) break-downs. 
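
As a compact illustration of the idea, the sketch below computes percentage break-downs with `value_counts(normalize = True)` on a hypothetical sample of color grades. It is a shorthand for dividing each category count by the number of records, not a replacement for the plotting code in this notebook.

```python
import pandas as pd

# Hypothetical sample of color grades
colors = pd.Series(['G', 'E', 'G', 'F', 'G', 'E', 'D', 'G'])

# normalize = True returns fractions that sum to 1;
# multiply by 100 to express them as percentages
proportions = (colors.value_counts(normalize = True) * 100).round(2)

print(proportions)
```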

The following callback function will display the proportion of the categorical feature.

In [25]:
# Create an interactive widget to help in graphing
proportionDistribution_widget = widgets.Dropdown(
    options = categoricalColumns_list,
    value = categoricalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [26]:
@widgets.interact(
    feature = proportionDistribution_widget
)
def proportionFunc(feature):
    # Number of records in the dataset
    records = diamonds_cat_df.shape[0]
    
    # Create a dataset to inspect 
    inspector_pd_series = diamonds_cat_df[feature].value_counts()

    # Create lists to store values that will be used in graphing
    proportions_list = []
    colors_list = []

    # Extract values 
    for idx in range(len(inspector_pd_series.values.tolist())):
        # Extract the observation count (value) of the chosen feature
        value = inspector_pd_series.values.tolist()[idx]
        # Calculate the proportion
        proportion = round((value / records) * 100, 2)
        # Add to list
        proportions_list.append(proportion)
        # Assign a color value
        colorValue = px.colors.qualitative.Dark24[idx]
        # Add the color to the list
        colors_list.append(colorValue)

    # Create a figure object
    fig = go.Figure(
        data = [
            go.Bar(
                x = inspector_pd_series.keys().tolist(),
                y = proportions_list,
                texttemplate = '%{y}%',
                marker_color = colors_list,
            )
        ]
    )

    # Display the text on the bars on top
    fig.update_traces(
        textposition = 'outside'
    )

    # Update the layout and title
    fig.update_layout(
        title = f'% Break-Out of: {feature}',
        template = "plotly_white",
        xaxis_title_text = f'{feature}', # xaxis label
        yaxis_title_text = 'Percentages %', # yaxis label
        xaxis = {
            'categoryorder' : 'total descending'
        },
        height = 600,
        width = 800,
        
    )
    
    return fig.show()


### DIVERSITY INDEX

Diversity indices are measures that reflect the degree of diversity or variety within a dataset. They allow you to quantify whether a dataset is balanced or imbalanced.

The two most commonly used diversity indices are:

1. **Shannon Diversity Index (H'):**

   - This index accounts for both richness and evenness.
   - It ranges from 0 (low diversity - one dominant category) to a theoretical maximum of $\ln(k)$ for $k$ categories (high diversity - all categories equally abundant).
   - It's calculated using the formula: $H' = -\sum_{i} p_i \ln(p_i)$, where $p_i$ is the proportion of individuals belonging to the $i$th category.

2. **Simpson Diversity Index (D):**

   - This index focuses on the probability of picking two individuals from different categories.
   - It ranges from 0 (low diversity - one dominant category) toward 1 (high diversity - all categories equally abundant); with $k$ equally abundant categories it equals $1 - 1/k$.
   - It's calculated using the formula: $D = 1 - \sum_{i} p_i^2$, where $p_i$ is the proportion of individuals belonging to the $i$th category.
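
The two indices above can be sketched in a few lines. The helper name `diversityIndices` and both sample series are hypothetical and only serve to compare a balanced feature with an imbalanced one. Each index is computed from the proportion of records in each category.

```python
import numpy as np
import pandas as pd

def diversityIndices(series):
    """Return (Shannon H', Simpson D) for a categorical pandas Series."""
    # Proportion of records in each category
    p = series.value_counts(normalize = True).to_numpy()
    # Drop empty categories so that log(0) is never evaluated
    p = p[p > 0]
    shannon = -np.sum(p * np.log(p))
    simpson = 1 - np.sum(p ** 2)
    return shannon, simpson

# Hypothetical features: four equally common categories vs. one dominant category
balanced = pd.Series(['A', 'B', 'C', 'D'] * 25)
skewed = pd.Series(['A'] * 97 + ['B', 'C', 'D'])

shannonBalanced, simpsonBalanced = diversityIndices(balanced)
shannonSkewed, simpsonSkewed = diversityIndices(skewed)

print(f"Balanced: H' = {shannonBalanced:.3f}, D = {simpsonBalanced:.3f}")
print(f"Skewed:   H' = {shannonSkewed:.3f}, D = {simpsonSkewed:.3f}")
```

Both indices should come out noticeably lower for the skewed series than for the balanced one.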

**Choosing the Best Diversity Index**

There's no single "best" diversity index for categorical data.  The most suitable choice depends on the specific question you're trying to answer and the characteristics of your data. Here are some factors to consider:

* **Focus on richness vs evenness:** If you're primarily interested in the number of categories, Shannon's index might be more appropriate. However, if evenness is crucial, Simpson's index might be a better choice. 
* **Data interpretation:**  Consider how easily interpretable the index's output is for your audience. Shannon's index uses natural logarithms, while Simpson's index is easier to grasp intuitively (0-1 scale).
* **Dominant categories:** If you expect a high degree of dominance (one or few categories with significantly higher counts), Simpson's index might be more sensitive to this and provide a clearer picture.

<div class="alert alert-warning" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p> There are only a few categories in this dataset. Evenness is not a major concern. Frequency and proportion analysis is adequate.</p>
</div>

## NUMERICAL DATA

Since this dataset has already been split between categorical and numerical features, we can reuse the numerical dataset that was previously created (`diamonds_num_df`).

### RANGE

The range helps us in the following ways:

1. **Understand the spread of the data:** The range gives you a basic idea of how spread out the data points are for a particular numerical feature. It's simply the difference between the highest and lowest values. While it doesn't take into account the distribution of the data, it does give you a quick sense of the overall spread.

2. **Identify potential outliers:** The range can help identify potential outliers or anomalies in the data. If the range is very large, it may indicate the presence of outliers or errors in the data.

The following script will calculate the range for each feature found in this dataset:

In [27]:
# Create a list to store dictionary values
rangeSummary_list = []

# Create a summary of the min, max, and range values for numerical features
for numerical in numericalColumns_list:
    
    # Extract min / max values
    minValue = diamonds_num_df[numerical].min()
    maxValue = diamonds_num_df[numerical].max()

    # Calculate the range
    rangeValue = maxValue - minValue

    # Create the dictionary
    dictionary = {
        'Feature' : numerical,
        'Min' : minValue,
        'Max' : maxValue,
        'Range' : rangeValue
    }
    
    # Append the dictionary to the dictionary list
    rangeSummary_list.append(dictionary)

# Create a pandas dataframe to display the results as a table
rangeSummary_df = pd.DataFrame(rangeSummary_list)

# Display the summary table
rangeSummary_df

Unnamed: 0,Feature,Min,Max,Range
0,Carat,0.2,5.01,4.81
1,Crown Height (mm),0.0,10.74,10.74
2,Girdle Diameter (mm),0.0,58.9,58.9
3,Pavillion Depth (mm),0.0,31.8,31.8
4,Table %,43.0,95.0,52.0
5,Total Depth %,43.0,79.0,36.0
6,Price,326.0,18823.0,18497.0
7,Price(2024),354.04,20441.78,20087.74


In [28]:
# Create an interactive widget to help in graphing
rangeFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [29]:
# Create an interactive widget to help in graphing
rangeMarginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
    tooltips = ['Do not display any marginals', 'Display a box plot', 'Display a violin plot', 'Display a rug plot'],
)

The following callback function analyzes range data.

In [30]:
# Create an interactive graph to help visualize numerical data
@widgets.interact(
    numericalFeature = rangeFeatures_widget,
    marginalFeature = rangeMarginals_widget
)
def rangeInspector(marginalFeature, numericalFeature):
    
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when using a marginal plot 
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0
    
    # Extract raw data
    extractedData = rangeSummary_df[rangeSummary_df['Feature'] == numericalFeature]
    # Transform data
    rangeData = extractedData.iloc[0]
    # Extract min/max/range
    minValue = rangeData['Min']
    maxValue = rangeData['Max']
    rangeValue = round(rangeData['Range'], 2)
    
    # Bundle values in tuples
    rangeName_tuple = ('Min', 'Max', 'Range')
    range_tuple = (minValue, maxValue, rangeValue)
    
    # Generate vertical lines and annotations
    for name, calc in zip(rangeName_tuple, range_tuple):
        
        # Used to customize the annotations and vertical lines
        match name:
            case 'Min':
                tendencyColor = 'black'
                yCord = 0.97
                textString = f'<b>{name}</b>: {minValue}'
            case 'Max':
                tendencyColor = 'darkgreen'
                yCord = 0.93
                textString = f'<b>{name}</b>: {maxValue}'
            case 'Range':
                tendencyColor = 'darkred'
                yCord = 0.89
                textString = f'<b>{name}</b>: {rangeValue}'
    
        # Add vertical lines for the min and max values only; the range is a
        # width, not a position on the x-axis, so it is shown as text only
        if name != 'Range':
            fig.add_vline(
                x = calc,
                line_width = 2,
                line_dash = 'dot',
                line_color = tendencyColor,
            )

        # Add the annotation text using paper reference. See:
        # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
        fig.add_annotation(
            text = textString,
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': tendencyColor ,
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = yCord - yOffset, 
            showarrow = False
        )
    
    # Create a graph title
    graphTitle = f"Min, max, and range for {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()    


<div class="alert alert-warning" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p> There appear to be outliers in this dataset that need to be addressed.</p>
</div>

### VARIANCE

Variance quantifies the spread or dispersion of the data points around the mean. High variance indicates that the data points are spread out over a wider range of values, whereas low variance indicates that they are closer to the mean.
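
As a minimal sketch of the calculation, the snippet below derives the variance by hand from squared deviations and compares it with pandas. The price values are hypothetical. The distinction between the population formula (divide by `n`) and the sample formula (divide by `n - 1`) is worth noting, because pandas uses the sample formula by default.

```python
import math
import pandas as pd

# Hypothetical prices
prices = pd.Series([326, 334, 335, 2757, 2757])

# Squared difference between each data point and the mean
meanPrice = prices.mean()
squaredDeviations = (prices - meanPrice) ** 2

# Population variance divides by n; sample variance divides by n - 1
populationVariance = squaredDeviations.sum() / len(prices)
sampleVariance = squaredDeviations.sum() / (len(prices) - 1)

# pandas equivalents: ddof = 0 for population, ddof = 1 (default) for sample
print(f"Population variance: {populationVariance:.2f} (pandas: {prices.var(ddof = 0):.2f})")
print(f"Sample variance:     {sampleVariance:.2f} (pandas: {prices.var():.2f})")
```

Applied to the notebook's data, a call such as `diamonds_num_df.var()` would return the sample variance of every numerical feature at once.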

In [37]:
# Create an interactive graph to help visualize numerical data
@widgets.interact(
    numericalFeature = rangeFeatures_widget,
    marginalFeature = rangeMarginals_widget
)
def rangeInspector(marginalFeature, numericalFeature):
    
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
        template = "plotly_white",
    )

    # Used to adjust the annotation (text positioning) when using a marginal plot 
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0
    
    # Extract raw data
    extractedData = rangeSummary_df[rangeSummary_df['Feature'] == numericalFeature]
    # Transform data
    rangeData = extractedData.iloc[0]
    # Extract min/max/range
    minValue = rangeData['Min']
    maxValue = rangeData['Max']
    rangeValue = round(rangeData['Range'], 2)
    
    # Bundle values in tuples
    rangeName_tuple = ('Min', 'Max', 'Range')
    range_tuple = (minValue, maxValue, rangeValue)
    
    # Generate vertical lines and annotations
    for name, calc in zip(rangeName_tuple, range_tuple):
        
        # Used to customize the annotations and vertical lines
        match name:
            case 'Min':
                tendencyColor = 'black'
                yCord = 0.97
                textString = f'<b>{name}</b>: {minValue}'
            case 'Max':
                tendencyColor = 'darkgreen'
                yCord = 0.93
                textString = f'<b>{name}</b>: {maxValue}'
            case 'Range':
                tendencyColor = 'darkred'
                yCord = 0.89
                textString = f'<b>{name}</b>: {rangeValue}'
    
        # Add some vertical lines to show the min, max, range value on the histogram
        fig.add_vline(
            x = calc,
            line_width = 2,
            line_dash = 'dot',
            line_color = tendencyColor
        )
        
        # Add the annotation text using paper reference. See:
        # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
        fig.add_annotation(
            text = textString,
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': tendencyColor ,
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = yCord - yOffset, 
            showarrow = False
        )
    
    # Create a graph title
    graphTitle = f"Min, max, and range for {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    return fig.show()  

interactive(children=(ToggleButtons(description='Marginals:', options=(None, 'box', 'violin', 'rug'), tooltips…

In [31]:
fig = px.histogram(
    diamonds_num_df,
    x = 'Carat',    
)

In [32]:
fig.update_layout?

Signature: fig.update_layout(dict1=None, overwrite=False, **kwargs) -> 'Figure'
Docstring:
Update the properties of the figure's layout with a dict and/or with
keyword arguments.

This recursively updates the structure of the original
layout with the values in the input dict / keyword arguments.

Parameters
----------
dict1 : dict
    Dictionary of properties to be updated
overwrite: bool
    If True, overwrite existing properties. If False, apply updates
    to existing properties recursively, preserving existing
    properties that are not specified in the update operation.
kwargs :
    Keyword/value pair of properties to be updated

Returns
-------
BaseFigure
    The Figure object that the update_layout method was called on
File:      ~/.virtualenvs/diamon

In [None]:
px.line?

In [None]:
fig.update_layout_images

In [36]:
fig.add_annotation?

Signature:
fig.add_annotation(
    arg=None,
    align=None,
    arrowcolor=None,
    arrowhead=None,
    arrowside=None,
    arrowsize=None,
    arrowwidth=None,
    ax=None,
    axref=None,
    ay=None,
    ayref=None,
    bgcolor

In [None]:
fig.add_vline?

In [None]:
fig.add_shape?

In [None]:
# Extract the 25% and 75% percentiles
q1, q3 = diamonds_num_df['Carat'].quantile(
    q = [0.25, 0.75],
)

In [None]:
# Calculate the interquartile range of the data
iqr = q3 - q1

In [None]:
# Number of data points
n = diamonds_num_df.shape[0]

In [None]:
# Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3)
binWidth = 2 * iqr * n**(-1/3)

In [None]:
round(binWidth, 3)

In [None]:
# Manually override the bin width
binWidth = 0.02

In [None]:
# Number of whole bins of width binWidth that fit in the feature's range (int() truncates)
numBins = int((diamonds_num_df['Girdle Diameter (mm)'].max() - diamonds_num_df['Girdle Diameter (mm)'].min()) / binWidth)

In [None]:
numBins
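
The bin-width cells above can be collected into a single helper. The sketch below is one possible arrangement, not the notebook's final implementation: the function name `freedmanDiaconisBins` and the sample values are hypothetical, and it uses only the standard library so it runs without the diamonds dataframe. It assumes the data has a non-zero interquartile range.

```python
import math
import statistics

def freedmanDiaconisBins(values):
    """Return (binWidth, numBins) using the Freedman-Diaconis rule."""
    n = len(values)

    # Quartiles; the 'inclusive' method interpolates linearly between data
    # points, which should correspond to pandas' default quantile interpolation
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')

    # Interquartile range of the data
    iqr = q3 - q1

    # Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3)
    binWidth = 2 * iqr * n ** (-1 / 3)

    # Round up so that the last bin still covers the maximum value
    numBins = math.ceil((max(values) - min(values)) / binWidth)

    return binWidth, numBins

# Hypothetical carat values (not taken from the diamonds dataset)
sampleCarats = [0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 0.9, 1.0]
binWidth, numBins = freedmanDiaconisBins(sampleCarats)
print(round(binWidth, 3), numBins)
```

The sketch rounds the bin count up with `math.ceil`, whereas the cell above truncates with `int()`; truncating can leave the largest values outside the last bin, so which to use depends on how the bins are passed to the plotting call.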