<h1 style='color:rgb(52, 152, 219)'; align=center><font size = 8> DIAMOND PRICES - ANALYSIS AND MODELING </font></h1>

<h2 style='color:rgb(52, 152, 219)'; align=left><font size = 6> NOTEBOOK 01-03: DESCRIPTIVE STATISTICS </font></h2>

# REVISION HISTORY

| REV | DESCRIPTION             | DATE         |  BY   | CHECK | APPROVE  |
|:---:|:-----------------------:|:------------:|:-----:|:-----:|:--------:|
| A0  | ISSUED FOR REVIEW (IFR) | 2024-APR-XX  |  IAC  |       |          |
|     |                         |              |       |       |          |

## DETAILED DESCRIPTION OF REVISIONS

> **REV A0** - HOLD

# INTRODUCTION

Descriptive statistics involves analyzing and summarizing the main characteristics of a dataset to gain insight into its patterns and trends.

## DESCRIPTIVE STATISTICS STEPS

Below are the descriptive statistics steps performed in this notebook:

1. [ ] **Measures of Central Tendency:** Calculate descriptive statistics such as mean, median, and mode to understand the central tendency of the data.
2. [ ] **Measures of Dispersion:** Compute measures like standard deviation, variance, and range to assess the spread or variability of the data.
3. [ ] **Frequency Distributions:** Create frequency tables or histograms to visualize the distribution of categorical and numerical variables.
4. [ ] **Percentiles:** Calculate percentiles to identify specific data points' position within the dataset.
5. [ ] **Skewness and Kurtosis:** Skewness measures the asymmetry of the distribution, indicating whether the data is skewed to the left or right. Kurtosis describes how heavy the tails of the distribution are compared to a normal distribution; it is often loosely described as peakedness or flatness.
6. [ ] **Correlation Analysis:** Compute correlation coefficients to understand the relationships between pairs of variables.
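
The six steps above can be sketched with plain pandas calls. This is a minimal illustration only: `toy_df` and its values are hypothetical and are not taken from the diamonds dataset.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not taken from the diamonds dataset
toy_df = pd.DataFrame({
    "carat": [0.2, 0.3, 0.3, 0.5, 0.7, 1.0, 1.5],
    "price": [300, 400, 420, 900, 2000, 4500, 9000],
    "cut": ["Good", "Ideal", "Ideal", "Fair", "Ideal", "Good", "Ideal"],
})

carat = toy_df["carat"]

# 1. Measures of central tendency
central = {
    "mean": carat.mean(),
    "median": carat.median(),
    "mode": carat.mode()[0],
}

# 2. Measures of dispersion
dispersion = {
    "std": carat.std(),
    "var": carat.var(),
    "range": carat.max() - carat.min(),
}

# 3. Frequency distribution of a categorical column
cut_counts = toy_df["cut"].value_counts()

# 4. Percentiles (here the quartiles)
quartiles = carat.quantile([0.25, 0.50, 0.75])

# 5. Skewness and kurtosis (pandas reports excess kurtosis)
shape = {"skew": carat.skew(), "kurtosis": carat.kurt()}

# 6. Correlation between pairs of numerical variables
corr = toy_df[["carat", "price"]].corr()

print(central, dispersion, shape, sep="\n")
print(cut_counts, quartiles, corr, sep="\n\n")
```
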

# REQUIRED LIBRARIES

The following libraries are required to run this notebook.

In [1]:
# Library to create and handle a tabular dataset
import pandas as pd

In [2]:
# Library to plot graphs
import seaborn as sns

In [3]:
# Library to plot graphs
import plotly.express as px
import plotly.figure_factory as ff

In [4]:
# Use the scikit-learn Library to create categorical and numerical features
from sklearn.compose import make_column_selector as selector

In [5]:
# Library used to create interactive controls
import ipywidgets as widgets

# LOAD DATASET

The scripts to load this dataset are:

In [6]:
# Path to the dataset
filePath = "./../00_Data/00_Datasets/"

In [7]:
# Filename
diamondsCSVFilename = 'diamonds_DM.csv'

In [8]:
# Create a data frame
diamonds_df = pd.read_csv(
    filePath + diamondsCSVFilename,
    index_col = None
)

# Display the newly created data frame
diamonds_df

Unnamed: 0,Carat,Diamond Type,Cut,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Color,Color Grades,Clarity,Price,Price(2024)
0,0.23,Small Diamond,Ideal,3.95,3.98,2.43,55.0,61.5,E,Colorless,SI2,326,354.04
1,0.21,Small Diamond,Excellent,3.89,3.84,2.31,61.0,59.8,E,Colorless,SI1,326,354.04
2,0.23,Small Diamond,Good,4.05,4.07,2.31,65.0,56.9,E,Colorless,VS1,327,355.12
3,0.29,Small Diamond,Excellent,4.20,4.23,2.63,58.0,62.4,I,Near Colorless,VS2,334,362.72
4,0.31,Medium Diamond,Good,4.34,4.35,2.75,58.0,63.3,J,Near Colorless,SI2,335,363.81
...,...,...,...,...,...,...,...,...,...,...,...,...,...
53789,0.72,Medium Diamond,Ideal,5.75,5.76,3.50,57.0,60.8,D,Colorless,SI1,2757,2994.10
53790,0.72,Medium Diamond,Good,5.69,5.75,3.61,55.0,63.1,D,Colorless,SI1,2757,2994.10
53791,0.70,Medium Diamond,Very Good,5.66,5.68,3.56,60.0,62.8,D,Colorless,SI1,2757,2994.10
53792,0.86,Medium Diamond,Excellent,6.15,6.12,3.74,58.0,61.0,H,Near Colorless,SI2,2757,2994.10


# MEASURES OF CENTRAL TENDENCY

Measures of central tendency are values that represent the center point of a dataset. The center point can be thought of as the most typical value. In statistics, central tendency is described in terms of the average value (mean), the middle value (median), and the most common value (mode).

The measure of central tendency that can be used depends on whether the data is categorical or numerical. Therefore, the first step in this process is to create two datasets:

1. Categorical Data
2. Numerical Data

The following scripts use the scikit-learn [`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) helper to create the categorical and numerical datasets:

## CATEGORICAL DATA

For categorical data, the **mode** is the most common measure of central tendency, and it is the only one available for nominal data types. However, for ordinal data types you can also determine a **median**, and in some cases even a **mean** if the categories are mapped to meaningful numeric ranks.
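
As a minimal sketch of how a median category could be found for ordinal data, the example below uses an ordered pandas categorical. The sample values, the `cut_order` list, and the variable names are hypothetical and serve only as an illustration; they are not part of this notebook's workflow.

```python
import pandas as pd

# Hypothetical cut grades, listed from lowest to highest quality
cut_order = ["Fair", "Good", "Very Good", "Excellent", "Ideal"]

# Hypothetical sample of graded diamonds stored as an ordered categorical
cuts = pd.Series(
    ["Ideal", "Good", "Ideal", "Very Good", "Fair", "Ideal", "Excellent"]
).astype(pd.CategoricalDtype(categories=cut_order, ordered=True))

# The mode works for any categorical data, nominal or ordinal
cut_mode = cuts.mode()[0]

# For ordered categories, sort the values and take the middle one.
# With an even number of rows this picks the lower of the two middle values.
sorted_cuts = cuts.sort_values().reset_index(drop=True)
cut_median = sorted_cuts.iloc[(len(sorted_cuts) - 1) // 2]

print(f"Mode: {cut_mode}, Median category: {cut_median}")
```
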

In [9]:
# Create an object that will filter categorical columns
categoricalColumnsSlctr_obj = selector(
    dtype_include = object
)

In [10]:
# Create a list of categorical columns
categoricalColumns_list = categoricalColumnsSlctr_obj(diamonds_df)

In [11]:
# Create a categorical dataset
diamonds_cat_df = diamonds_df[categoricalColumns_list]

# Verify the data
diamonds_cat_df

Unnamed: 0,Diamond Type,Cut,Color,Color Grades,Clarity
0,Small Diamond,Ideal,E,Colorless,SI2
1,Small Diamond,Excellent,E,Colorless,SI1
2,Small Diamond,Good,E,Colorless,VS1
3,Small Diamond,Excellent,I,Near Colorless,VS2
4,Medium Diamond,Good,J,Near Colorless,SI2
...,...,...,...,...,...
53789,Medium Diamond,Ideal,D,Colorless,SI1
53790,Medium Diamond,Good,D,Colorless,SI1
53791,Medium Diamond,Very Good,D,Colorless,SI1
53792,Medium Diamond,Excellent,H,Near Colorless,SI2
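
The same kind of split can also be sketched with pandas alone. The example below is an alternative for illustration, not the method used in this notebook: it relies on `DataFrame.select_dtypes`, and `toy_df` is a hypothetical stand-in for `diamonds_df`.

```python
import pandas as pd

# Hypothetical stand-in for diamonds_df with mixed column types
toy_df = pd.DataFrame({
    "Carat": [0.23, 0.21, 0.31],
    "Cut": ["Ideal", "Excellent", "Good"],
    "Price": [326, 326, 335],
})

# Numerical columns: every column with a numeric dtype
toy_num_df = toy_df.select_dtypes(include="number")

# Categorical columns: everything that is not numeric
toy_cat_df = toy_df.select_dtypes(exclude="number")

print(toy_num_df.columns.to_list())
print(toy_cat_df.columns.to_list())
```
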


### CENTRAL TENDENCY

The categorical values found in this dataset are all ordinal values, since they are all related to grading a diamond. A median category could in principle be defined for ordinal data, but the measure of central tendency extracted here is the **mode**.

The following scripts graphically display the mode of each categorical feature. 

In [12]:
# Create a list of categorical features
categoricalFeatures_list = diamonds_cat_df.columns.to_list()

categoricalFeatures_list

['Diamond Type', 'Cut', 'Color', 'Color Grades', 'Clarity']

In [13]:
# Create an interactive widget to help in graphing
categoricalFeatures_widget = widgets.Dropdown(
    options = categoricalFeatures_list,
    value = categoricalFeatures_list[0],
    description = 'Feature:',
    disabled = False,
)

In [14]:
# Sort categories from most valuable to least
sortOrder_dict = {
    'Cut' : ['Ideal', 'Excellent', 'Very Good', 'Good', 'Fair'],
    'Color' : sorted(diamonds_cat_df['Color'].unique().tolist()),
    'Clarity' : ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'],
}

In [15]:
# Create an interactive graph to help visualize categorical data
@widgets.interact(categoricalFeature = categoricalFeatures_widget)
def categoryInspector(categoricalFeature):
    fig = px.histogram(
        diamonds_cat_df,
        x = categoricalFeature,
        category_orders = sortOrder_dict,
        color = categoricalFeature,
        color_discrete_sequence = px.colors.qualitative.Dark24,
    )

    # Extract the mode of the dataset
    mode =  diamonds_cat_df[categoricalFeature].mode()[0]

    # Create a graph title
    graphTitle = f"Mode: {mode}"

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 800
    )

    # Display the results
    return fig.show()

interactive(children=(Dropdown(description='Feature:', options=('Diamond Type', 'Cut', 'Color', 'Color Grades'…

### SUMMARY

The following table summarizes the mode of each categorical feature:

| Feature        | Values                                       | Mode           |
|----------------|----------------------------------------------|----------------|
| Diamond Type   | Small Diamond, Medium Diamond, Large Diamond | Medium Diamond |
| Cut            | Ideal, Excellent, Very Good, Good, Fair      | Ideal          |
| Color          | D, E, F, G, H, I, J                          | G              |
| Color Grades   | Colorless, Near Colorless                    | Near Colorless |
| Clarity        | IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1       | SI1            |
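
A summary like the one above could also be built programmatically instead of being read off the interactive graph. The sketch below assumes a small hypothetical frame, `toy_cat_df`, in place of `diamonds_cat_df`; the `Share` column is an illustrative addition.

```python
import pandas as pd

# Hypothetical stand-in for diamonds_cat_df
toy_cat_df = pd.DataFrame({
    "Cut": ["Ideal", "Good", "Ideal", "Fair"],
    "Color": ["G", "E", "G", "G"],
    "Clarity": ["SI1", "SI1", "VS2", "SI1"],
})

# DataFrame.mode() returns one row per tied mode;
# the first row holds the first mode of each column
modes = toy_cat_df.mode().iloc[0]

# Share of rows that fall in the most frequent category of each column
mode_share = {
    column: toy_cat_df[column].value_counts(normalize=True).iloc[0]
    for column in toy_cat_df.columns
}

# Combine both results into one summary table indexed by feature name
summary = pd.DataFrame({"Mode": modes, "Share": pd.Series(mode_share)})

print(summary)
```
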


## NUMERICAL DATA

For continuous numerical data, you can determine the **mean**, **median**, and **mode**.

In [16]:
# Create an object that will filter numerical columns
numericalColumnsSlctr_obj = selector(
    dtype_exclude = object
)

In [17]:
# Create a list of numerical columns
numericalColumns_list = numericalColumnsSlctr_obj(diamonds_df)

In [18]:
# Create a numerical dataset
diamonds_num_df = diamonds_df[numericalColumns_list]

# Verify the data
diamonds_num_df

Unnamed: 0,Carat,Crown Height (mm),Girdle Diameter (mm),Pavillion Depth (mm),Table %,Total Depth %,Price,Price(2024)
0,0.23,3.95,3.98,2.43,55.0,61.5,326,354.04
1,0.21,3.89,3.84,2.31,61.0,59.8,326,354.04
2,0.23,4.05,4.07,2.31,65.0,56.9,327,355.12
3,0.29,4.20,4.23,2.63,58.0,62.4,334,362.72
4,0.31,4.34,4.35,2.75,58.0,63.3,335,363.81
...,...,...,...,...,...,...,...,...
53789,0.72,5.75,5.76,3.50,57.0,60.8,2757,2994.10
53790,0.72,5.69,5.75,3.61,55.0,63.1,2757,2994.10
53791,0.70,5.66,5.68,3.56,60.0,62.8,2757,2994.10
53792,0.86,6.15,6.12,3.74,58.0,61.0,2757,2994.10


### CENTRAL TENDENCY

The numerical values found in this dataset are all continuous in nature. Therefore, the mean, median, and mode can be extracted from each numerical feature.
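
As a minimal sketch, the three measures can be collected for every numerical column at once. `toy_num_df` and its values are hypothetical stand-ins for `diamonds_num_df`.

```python
import pandas as pd

# Hypothetical stand-in for diamonds_num_df
toy_num_df = pd.DataFrame({
    "Carat": [0.3, 0.3, 0.5, 0.7, 1.2],
    "Price": [400, 400, 900, 2000, 5000],
})

# Mean and median per column, transposed so each row is one feature
summary = toy_num_df.agg(["mean", "median"]).T

# DataFrame.mode() can return several rows when values tie;
# the first row holds the smallest mode of each column
summary["mode"] = toy_num_df.mode().iloc[0]

print(summary)
```

For continuous measurements with few repeated values the mode can be unstable, so it is usually read alongside the histogram rather than on its own.
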

In [25]:
# Create a list of numerical features
numericalFeatures_list = diamonds_num_df.columns.to_list()

numericalFeatures_list

['Carat',
 'Crown Height (mm)',
 'Girdle Diameter (mm)',
 'Pavillion Depth (mm)',
 'Table %',
 'Total Depth %',
 'Price',
 'Price(2024)']

In [26]:
# Create an interactive widget to help in graphing
numericalFeatures_widget = widgets.Dropdown(
    options = numericalFeatures_list,
    value = numericalFeatures_list[0],
    description = 'Feature:',
    disabled = False,
)

In [51]:
# Create an interactive graph to help visualize numerical data
@widgets.interact(numericalFeature = numericalFeatures_widget)
def numericalInspector(numericalFeature):

    fig = px.histogram(
        diamonds_num_df, 
        x = numericalFeature
    )

    # Extract the mean of the dataset
    mean =  diamonds_num_df[numericalFeature].mean()

    fig.add_vline(
        x = mean,
        line_width = 2,
        line_dash = 'solid',
        line_color = 'red',
        annotation_text = 'mean',
        annotation_position = 'top right'
    )

    # Extract the median of the dataset
    median =  diamonds_num_df[numericalFeature].median()

    fig.add_vline(
        x = median,
        line_width = 2,
        line_dash = 'dash',
        line_color = 'black',
        annotation_text = 'median',
        annotation_position = 'top left'
    )
    
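    # Extract the mode of the dataset
    mode = diamonds_num_df[numericalFeature].mode()[0]

    # Use a distinct style so the mode line is not confused with the median line
    fig.add_vline(
        x = mode,
        line_width = 2,
        line_dash = 'dot',
        line_color = 'green',
        annotation_text = 'mode',
        annotation_position = 'bottom right'
    )

    # Create a graph title that reports all three measures
    graphTitle = f"Mean: {mean:.2f} | Median: {median} | Mode: {mode}"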

    # Update the graph
    fig.update_layout(
        title_text = graphTitle,
        yaxis_title_text = 'Total Counts', # yaxis label
        height = 600,
        width = 800
    )

    # Display the results
    return fig.show()

interactive(children=(Dropdown(description='Feature:', options=('Carat', 'Crown Height (mm)', 'Girdle Diameter…

In [31]:
# Extract the mode of the dataset
mode =  diamonds_num_df['Carat'].mode()[0]

mode

0.3

In [32]:
# Extract the median of the dataset
median =  diamonds_num_df['Carat'].median()

median

0.7

In [33]:
# Extract the mean of the dataset
mean =  diamonds_num_df['Carat'].mean()

mean

0.7977800498196824
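
For `Carat`, the three values above are ordered mode < median < mean. This ordering is often associated with a right-skewed distribution, although it is only a rule of thumb. The sketch below shows one way to check it with the sample skewness; the `sample` values are hypothetical and are not drawn from the diamonds dataset.

```python
import pandas as pd

# Hypothetical right-skewed sample: many small values and a few large ones
sample = pd.Series([0.3, 0.3, 0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.5, 2.5])

sample_mode = sample.mode()[0]
sample_median = sample.median()
sample_mean = sample.mean()

# A positive skewness indicates a longer right tail
sample_skew = sample.skew()

print(f"mode={sample_mode}, median={sample_median}, mean={sample_mean}")
print(f"skewness={sample_skew}")
```
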

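In [None]:
# Data for the distribution plot
x = diamonds_num_df['Carat']

In [None]:
# create_distplot expects a list of datasets
hist_data = [x]

In [None]:
group_labels = ['Carat'] # name of the dataset

In [None]:
# Create a distribution plot (histogram, density curve, and rug plot)
fig = ff.create_distplot(
    hist_data, 
    group_labels,
    #bin_size = 6
)

In [None]:
# Mark the mean computed above on the distribution plot
fig.add_vline(
    x = mean,
    line_color = 'red',
    annotation_text = 'mean'
)

In [None]:
fig.show()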

### CENTRAL TENDENCY

###TODO: UPDATE TEXT BELOW FOR NUMERICAL

The categorical values found in this dataset are related to diamond grading. Therefore, these categories are all ordinal values. This implies that only the mode, and perhaps the median, can be extracted from each feature.

The following scripts will prepare the dataset so that 