<h1 style='color:rgb(52, 152, 219)' align=center><font size = 8> DIAMOND PRICES - ANALYSIS AND MODELING </font></h1>

<h2 style='color:rgb(52, 152, 219)' align=left><font size = 6> NOTEBOOK 01-03: DESCRIPTIVE STATISTICS </font></h2>

# REVISION HISTORY

| REV | DESCRIPTION             | DATE         |  BY   | CHECK | APPROVE  |
|:---:|:-----------------------:|:------------:|:-----:|:-----:|:--------:|
| A0  | ISSUED FOR REVIEW (IFR) | 2024-APR-XX  |  IAC  |       |          |
|     |                         |              |       |       |          |

## DETAILED DESCRIPTION OF REVISIONS

> **REV A0** - HOLD

# INTRODUCTION

Descriptive statistics summarizes and organizes the characteristics of a dataset. 

## DESCRIPTIVE STATISTICS STEPS

Below are the descriptive statistics steps performed in this notebook:

1. [X] **Measures of Central Tendency:** Calculate descriptive statistics such as mean, median, and mode to understand the central tendency of the data.
2. [ ] **Measures of Dispersion:** Compute measures like standard deviation, variance, and range to assess the spread or variability of the data.
3. [ ] **Frequency Distributions:** Create frequency tables or histograms to visualize the distribution of categorical and numerical variables.
4. [ ] **Percentiles:** Calculate percentiles to identify specific data points' position within the dataset.
5. [ ] **Skewness and Kurtosis:** Compute skewness to measure the asymmetry of the distribution, indicating whether the data is skewed to the left or right, and kurtosis to measure how heavy its tails are (often described as peakedness or flatness).
6. [ ] **Outlier Management:** Identify and manage outliers.
7. [ ] **Correlation Analysis:** Compute correlation coefficients to understand the relationships between pairs of variables.

# REQUIRED LIBRARIES

The following libraries are required to run this notebook.

In [1]:
# Library to create and handle a tabular dataset
import pandas as pd

In [2]:
# Library to plot graphs
import plotly.express as px

In [3]:
# Use the scikit-learn library to select categorical and numerical features
from sklearn.compose import make_column_selector as selector

In [4]:
# Library used to create interactive controls
import ipywidgets as widgets

# LOAD DATASET

The scripts to load this dataset are:

In [5]:
# Path to the dataset
filePath = "./../00_Data/00_Datasets/"

In [6]:
# Filename
diamondsCSVFilename = 'diamonds_DM.csv'

In [7]:
# Create a data frame
diamonds_df = pd.read_csv(
    filePath + diamondsCSVFilename,
    index_col = None
)

# Display the newly created data frame
diamonds_df

Unnamed: 0,Carat,Diamond Type,Cut,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Color,Color Grades,Clarity,Price,Price(2024)
0,0.23,Small Diamond,Ideal,3.95,3.98,2.43,55.0,61.5,E,Colorless,SI2,326,354.04
1,0.21,Small Diamond,Excellent,3.89,3.84,2.31,61.0,59.8,E,Colorless,SI1,326,354.04
2,0.23,Small Diamond,Good,4.05,4.07,2.31,65.0,56.9,E,Colorless,VS1,327,355.12
3,0.29,Small Diamond,Excellent,4.20,4.23,2.63,58.0,62.4,I,Near Colorless,VS2,334,362.72
4,0.31,Medium Diamond,Good,4.34,4.35,2.75,58.0,63.3,J,Near Colorless,SI2,335,363.81
...,...,...,...,...,...,...,...,...,...,...,...,...,...
53789,0.72,Medium Diamond,Ideal,5.75,5.76,3.50,57.0,60.8,D,Colorless,SI1,2757,2994.10
53790,0.72,Medium Diamond,Good,5.69,5.75,3.61,55.0,63.1,D,Colorless,SI1,2757,2994.10
53791,0.70,Medium Diamond,Very Good,5.66,5.68,3.56,60.0,62.8,D,Colorless,SI1,2757,2994.10
53792,0.86,Medium Diamond,Excellent,6.15,6.12,3.74,58.0,61.0,H,Near Colorless,SI2,2757,2994.10


# MEASURE OF CENTRAL TENDENCY

Measures of central tendency are values that represent the center point of a dataset. The center point can be thought of as the most typical value. In statistics, central tendency is described in terms of the average value (`mean`), the middle value (`median`), and the most common value (`mode`).

Choosing a measure of central tendency depends on whether you're dealing with categorical or numerical values. The measures that apply to each type are:

**Categorical Data**:
- **Mode**: In categorical data, the mode is often the most relevant measure of central tendency. It represents the most frequently occurring category or class within the dataset.

**Numerical Data**:
* **Mean**: The mean is a common measure of central tendency for numerical data. It is calculated by summing all the values in the dataset and then dividing by the total number of values. The mean is sensitive to extreme values, also known as outliers, and provides a balance point for the data distribution.
* **Median**: The median is another measure of central tendency for numerical data. It represents the middle value when the data is arranged in ascending or descending order. Unlike the mean, the median is barely influenced by extreme values, making it more robust in the presence of outliers.
* **Mode**: Similar to categorical data, the mode can also be used for numerical data, indicating the most frequent value.
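
The difference in outlier sensitivity described above can be illustrated with a minimal sketch. The price list below is made up for illustration and is not taken from the dataset:

```python
import pandas as pd

# Hypothetical prices; the last value is a deliberately extreme outlier
prices = pd.Series([300, 320, 320, 350, 400, 15000])

mean_price = prices.mean()      # pulled upward by the single extreme value
median_price = prices.median()  # middle of the sorted values
mode_price = prices.mode()[0]   # most frequent value

print(f"Mean: {mean_price:.2f}, Median: {median_price}, Mode: {mode_price}")
```

In this toy example the one extreme value pulls the mean to roughly 2781.67, while the median (335.0) and the mode (320) stay close to the bulk of the values.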

<div class="alert alert-info" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p>To perform this analysis, the dataset must be split between categorical and numerical features. The process to split the dataset is shown below.</p>
</div>

**Dataset splitting method**: There are many ways to split a dataset between categorical and numerical features. The method below uses the scikit-learn [`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function to create separate categorical and numerical datasets. This method was chosen because the same method will be used to split the dataset between categorical and numerical features during the predictive analytics phase.
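
As one example of the "many ways" mentioned above, the same split could be sketched with pandas alone through `DataFrame.select_dtypes`. This is an alternative for comparison, not the method used in this notebook, and `sample_df` is a hypothetical miniature frame:

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few columns of the dataset
sample_df = pd.DataFrame({
    "Carat": [0.23, 0.21, 0.31],
    "Cut": ["Ideal", "Excellent", "Good"],
    "Price": [326, 326, 335],
})

# Columns that are not numeric are treated as categorical here (an assumption)
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

# Columns with an integer or float dtype
numerical_cols = sample_df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

Treating every non-numeric column as categorical is a simplification that would need revisiting if the data contained dates or booleans.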

## CATEGORICAL DATA

For categorical data, the **mode** is the most common measure of central tendency, and it is the only valid one for nominal data types. However, you might be able to determine a **median**, and even a **mean** in some cases, for ordinal data types.
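
A median only makes sense when the categories have an order. The sketch below uses made-up cut grades and assumes a worst-to-best grade order expressed through an ordered pandas `Categorical`. It takes the middle element of the sorted values as the median:

```python
import pandas as pd

# Assumed grade order, from worst to best (hypothetical for this sketch)
grade_order = ["Fair", "Good", "Very Good", "Excellent", "Ideal"]

# Made-up cut grades stored as an ordered categorical
cuts = pd.Series(
    pd.Categorical(
        ["Good", "Ideal", "Very Good", "Ideal", "Fair"],
        categories=grade_order,
        ordered=True,
    )
)

# The mode is available for any categorical feature
mode_cut = cuts.mode()[0]

# For ordered categories, sort by grade and take the middle element
sorted_cuts = cuts.sort_values().reset_index(drop=True)
median_cut = sorted_cuts[len(sorted_cuts) // 2]

print(mode_cut, median_cut)
```

With an even number of values there are two middle elements, and this sketch would pick the upper one.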

In [8]:
# Create an object that will output a list of categorical columns from a dataset
categoricalColumnsSlctr_obj = selector(
    dtype_include = object
)

In [9]:
# Create a list of categorical columns
categoricalColumns_list = categoricalColumnsSlctr_obj(diamonds_df)

In [10]:
# Create a dataset containing only the categorical features
diamonds_cat_df = diamonds_df[categoricalColumns_list]

# Verify the data
diamonds_cat_df

Unnamed: 0,Diamond Type,Cut,Color,Color Grades,Clarity
0,Small Diamond,Ideal,E,Colorless,SI2
1,Small Diamond,Excellent,E,Colorless,SI1
2,Small Diamond,Good,E,Colorless,VS1
3,Small Diamond,Excellent,I,Near Colorless,VS2
4,Medium Diamond,Good,J,Near Colorless,SI2
...,...,...,...,...,...
53789,Medium Diamond,Ideal,D,Colorless,SI1
53790,Medium Diamond,Good,D,Colorless,SI1
53791,Medium Diamond,Very Good,D,Colorless,SI1
53792,Medium Diamond,Excellent,H,Near Colorless,SI2


### CENTRAL TENDENCY

The categorical features in this dataset are all ordinal: each one grades a diamond. A median could in principle be derived from the grade order, but the **mode** is the most direct measure of central tendency for these features and is the only one extracted here.

The following scripts graphically display the mode of each categorical feature. 

In [11]:
# Create an interactive widget to help in graphing
categoricalFeatures_widget = widgets.Dropdown(
    options = categoricalColumns_list,
    value = categoricalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [12]:
# Sort categories from most valuable to least
sortOrder_dict = {
    'Cut' : ['Ideal', 'Excellent', 'Very Good', 'Good', 'Fair'],
    'Color' : sorted(diamonds_cat_df['Color'].unique().tolist()),
    'Clarity' : ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'],
}

In [13]:
# Create an interactive graph to help visualize categorical data
@widgets.interact(categoricalFeature = categoricalFeatures_widget)
def categoryInspector(categoricalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_cat_df,
        x = categoricalFeature,
        category_orders = sortOrder_dict,
        color = categoricalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
    )

    # Extract the mode of the selected feature
    mode = diamonds_cat_df[categoricalFeature].mode()[0]

    # Create a graph title
    graphTitle = f"Mode: {mode}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 800
    )

    # Display the results
    fig.show()

interactive(children=(Dropdown(description='Feature:', options=('Diamond Type', 'Cut', 'Color', 'Color Grades'…

### SUMMARY

The following table summarizes the mode of each categorical feature:

| Feature        | Values                                       | Mode           |
|----------------|----------------------------------------------|----------------|
| Diamond Type   | Small Diamond, Medium Diamond, Large Diamond | Medium Diamond |
| Cut            | Ideal, Excellent, Very Good, Good, Fair      | Ideal          |
| Color          | D, E, F, G, H, I, J                          | G              |
| Color Grades   | Colorless, Near Colorless                    | Near Colorless |
| Clarity        | IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1       | SI1            |
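
A table like the one above could also be produced programmatically. The sketch below is a minimal illustration on a hypothetical stand-in for `diamonds_cat_df`, not the code used to build the table:

```python
import pandas as pd

# Hypothetical stand-in for diamonds_cat_df
sample_cat_df = pd.DataFrame({
    "Cut": ["Ideal", "Good", "Ideal", "Fair"],
    "Clarity": ["SI1", "SI1", "VS2", "SI1"],
})

# DataFrame.mode() returns one row per tied mode;
# the first row holds the first mode of every column
modes = sample_cat_df.mode().iloc[0]

print(modes)
```

If a feature has several equally frequent categories, only the first one is kept by `iloc[0]`.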

## NUMERICAL DATA

For numerical data with continuous variables, you can determine the **mean**, **median**, and **mode**.

In [14]:
# Create an object that will output a list of numerical columns from a dataset
numericalColumnsSlctr_obj = selector(
    dtype_exclude = object
)

In [15]:
# Create a list of numerical columns
numericalColumns_list = numericalColumnsSlctr_obj(diamonds_df)

In [16]:
# Create a dataset containing only the numerical features
diamonds_num_df = diamonds_df[numericalColumns_list]

# Verify the data
diamonds_num_df

Unnamed: 0,Carat,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Price,Price(2024)
0,0.23,3.95,3.98,2.43,55.0,61.5,326,354.04
1,0.21,3.89,3.84,2.31,61.0,59.8,326,354.04
2,0.23,4.05,4.07,2.31,65.0,56.9,327,355.12
3,0.29,4.20,4.23,2.63,58.0,62.4,334,362.72
4,0.31,4.34,4.35,2.75,58.0,63.3,335,363.81
...,...,...,...,...,...,...,...,...
53789,0.72,5.75,5.76,3.50,57.0,60.8,2757,2994.10
53790,0.72,5.69,5.75,3.61,55.0,63.1,2757,2994.10
53791,0.70,5.66,5.68,3.56,60.0,62.8,2757,2994.10
53792,0.86,6.15,6.12,3.74,58.0,61.0,2757,2994.10


### CENTRAL TENDENCY

The numerical values found in this dataset are all continuous in nature. Therefore, the mean, median, and mode can be extracted from each numerical feature.

These scripts will help visualize numerical features.

In [17]:
# Create an interactive widget to help in graphing
numericalFeatures_widget = widgets.Dropdown(
    options = numericalColumns_list,
    value = numericalColumns_list[0],
    description = 'Feature:',
    disabled = False,
)

In [18]:
# Create a list of marginals
marginals_list = [
    None,
    'box',
    'violin',
    'rug'
]

In [19]:
# Create an interactive widget to help in graphing
marginals_widget = widgets.ToggleButtons(
    options = marginals_list,
    value = marginals_list[0],
    button_style = '', # '', 'success', 'info', 'warning', 'danger'
    description = 'Marginals:',
    disabled = False,
)

In [25]:
# Create an interactive graph to help visualize numerical data
@widgets.interact(
    numericalFeature = numericalFeatures_widget,
    marginalFeature = marginals_widget
)
def numericalInspector(marginalFeature, numericalFeature):
    # Create a histogram object
    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature,
        marginal = marginalFeature 
    )

    # Used to adjust the annotation (text positioning) when using a marginal plot
    if marginalFeature is not None:
        yOffset = 0.18
    else:
        yOffset = 0
    
    # Calculate the measures of central tendency for numerical features
    meanCalc = diamonds_num_df[numericalFeature].mean()
    medianCalc = diamonds_num_df[numericalFeature].median()
    modeCalc = diamonds_num_df[numericalFeature].mode()[0]
    
    # Bundle values in tuples
    centralTendencyName_tuple = ('Mean', 'Median', 'Mode')
    centralTendency_tuple = (meanCalc, medianCalc, modeCalc)
    
    # Generate vertical lines and annotations
    for name, calc in zip(centralTendencyName_tuple, centralTendency_tuple):
        
        # Used to customize the annotations and vertical lines
        match name:
            case 'Mean':
                tendencyColor = 'red'
                yCord = 0.97
                textString = f'<b>{name}</b>: {meanCalc:0.2f}'
            case 'Median':
                tendencyColor = 'darkgreen'
                yCord = 0.93
                textString = f'<b>{name}</b>: {medianCalc}'
            case 'Mode':
                tendencyColor = 'black'
                yCord = 0.89
                textString = f'<b>{name}</b>: {modeCalc}'
    
        # Add some vertical lines to show the mean, median, mode on the histogram
        fig.add_vline(
            x = calc,
            line_width = 3,
            line_dash = 'dot',
            line_color = tendencyColor,
        )

        # Add the annotation text using paper reference. See:
        # https://stackoverflow.com/questions/76046269/how-to-align-annotation-to-the-edge-of-whole-figure-in-plotly
        fig.add_annotation(
            text = textString,
            font = {
                'size' : 13,
                'family' : 'Times New Roman',
                'color': tendencyColor ,
            },
            xref = "paper", 
            yref = "paper",
            x = 0.90, 
            y = yCord - yOffset, 
            showarrow = False
        )
    
    # Create a graph title
    graphTitle = f"Measures of central tendency for: {numericalFeature}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 1200
    )
    
    # Display the results
    fig.show()

interactive(children=(ToggleButtons(description='Marginals:', options=(None, 'box', 'violin', 'rug'), value=No…

In [None]:
# Extract the 25% and 75% percentiles
q1, q3 = diamonds_num_df['Carat'].quantile(
    q = [0.25, 0.75],
)

In [None]:
# Calculate the interquartile range of the data
iqr = q3 - q1

In [None]:
# Number of data points
n = diamonds_num_df.shape[0]

In [None]:
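# Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3)
binWidth = 2 * iqr * n**(-1/3)

In [None]:
# Inspect the suggested bin width
round(binWidth, 3)

In [None]:
# Override the suggested value with a manually chosen bin width
binWidth = 0.02

In [None]:
# Number of bins needed to span the Girdle Diameter (mm) range at the chosen bin width
numBins = int((diamonds_num_df['Girdle Diameter (mm)'].max() - diamonds_num_df['Girdle Diameter (mm)'].min()) / binWidth)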

In [None]:
numBins
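
The bin-width cells above follow the Freedman-Diaconis rule, $h = 2 \cdot IQR \cdot n^{-1/3}$. As a minimal sketch, the same steps could be wrapped in one helper. The function name `freedman_diaconis_bins` and the sample values are hypothetical, and the helper assumes the interquartile range is not zero:

```python
import math

import pandas as pd


def freedman_diaconis_bins(values: pd.Series) -> tuple[float, int]:
    """Return (bin width, number of bins) from the Freedman-Diaconis rule."""
    # Interquartile range of the feature
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1

    # Number of non-missing data points
    n = int(values.count())

    # Bin width: 2 * IQR * n^(-1/3); assumes iqr > 0
    bin_width = 2 * iqr * n ** (-1 / 3)

    # Round up so the last partial bin is kept
    num_bins = math.ceil((values.max() - values.min()) / bin_width)
    return float(bin_width), num_bins


# Hypothetical carat values
carats = pd.Series([0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 0.9, 1.0])
width, bins = freedman_diaconis_bins(carats)
print(round(width, 3), bins)
```

The helper computes the IQR, the count, and the range from the same feature. It rounds the bin count up, whereas the cell above truncates with `int()`; either choice is defensible for a histogram.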

### SUMMARY

The following table summarizes the measures of central tendency of each numerical feature:

| Feature               | Mean     | Median   |  Mode  | Outliers   |
|:----------------------|:--------:|:--------:|:------:|:----------:|
| Carat                 | 0.80     | 0.70     | 0.30   |   Yes      |
| Crown Height (mm)     | 5.73     | 5.70     | 4.37   |   Yes      |
| Girdle Diameter (mm)  | 5.73     | 5.71     | 4.34   |   Yes      |
| Pavillion Depth (mm)  | 3.54     | 3.53     | 2.70   |   Yes      |
| Table %               | 57.46    | 57.00    | 56.00  |   Yes      |
| Total Depth %         | 61.75    | 61.80    | 62.00  |   Yes      |
| Price                 | 3933.07  | 2401.00  | 605.00 |   Possible |
| Price(2024)           | 4271.31  | 2607.49  | 657.03 |   Possible |
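
The three measures in the table could be collected in a single frame rather than read off the graphs. The sketch below is a minimal illustration on a hypothetical stand-in for `diamonds_num_df`:

```python
import pandas as pd

# Hypothetical stand-in for diamonds_num_df
sample_num_df = pd.DataFrame({
    "Carat": [0.3, 0.3, 0.7, 1.0, 1.2],
    "Price": [400, 400, 2400, 5000, 9000],
})

# One row per feature, one column per measure of central tendency
summary_df = pd.DataFrame({
    "Mean": sample_num_df.mean(),
    "Median": sample_num_df.median(),
    "Mode": sample_num_df.mode().iloc[0],  # first mode if several are tied
}).round(2)

print(summary_df)
```

For continuous features the mode depends on how the values were rounded when recorded, so it is best read alongside the histogram.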


<div class="alert alert-warning" role="alert">
  <h3 class="alert-heading">NOTE:</h3>
  <p>There appear to be several outliers in this dataset that will need to be addressed.</p>
</div>
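
The notebook does not state how the outliers were flagged. One common convention, assumed here purely for illustration, is the 1.5 × IQR rule. The helper name `count_iqr_outliers` and the sample values are hypothetical:

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1

    # Lower and upper fences of the assumed rule
    lower = q1 - k * iqr
    upper = q3 + k * iqr

    # Count the points that fall outside the fences
    return int(((values < lower) | (values > upper)).sum())


# Hypothetical carat values with one extreme point
carats = pd.Series([0.3, 0.4, 0.5, 0.6, 0.7, 5.0])
n_outliers = count_iqr_outliers(carats)
print(n_outliers)
```

A count like this only indicates candidates. Whether a flagged value is a data error or a genuinely rare diamond is a separate judgment.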

# MEASURES OF DISPERSION