<h1 style='color:rgb(52, 152, 219)' align='center'><font size = 8> DIAMOND PRICES - ANALYSIS AND MODELING </font></h1>

<h2 style='color:rgb(52, 152, 219)' align='left'><font size = 6> NOTEBOOK 3 - DATA MINING </font></h2>

# REVISION HISTORY

| REV | DESCRIPTION             | DATE         |  BY   | CHECK | APPROVE  |
|:---:|:-----------------------:|:------------:|:-----:|:-----:|:--------:|
| A0  | ISSUED FOR REVIEW (IFR) | 2024-APR-XX  |  IAC  |       |          |
|     |                         |              |       |       |          |

## DETAILED DESCRIPTION OF REVISIONS

> **REV A0** - HOLD

# INTRODUCTION

This notebook describes the steps and methods used in the data mining process. The goal of data mining, as used here, is to create a tabular dataset. The following steps are used in this notebook to create a high-quality dataset:

* [X] **Data Collection:** Gather raw data from various sources such as databases, APIs, files, or web scraping
* [X] **Data Selection:** Identify relevant datasets that contain information pertinent to the analysis objectives
* [X] **Data Integration:** Combine multiple datasets if necessary to create a unified dataset for analysis
* [X] **Data Cleaning:** Perform data cleaning processes to handle missing values, duplicates, and inconsistencies
* [X] **Data Transformation:** Transform the data into a tabular dataset suitable for analysis, including converting data types

# REQUIRED LIBRARIES

The following libraries are required to run this notebook.

In [1]:
# Library to create and handle a tabular dataset
import pandas as pd

In [2]:
# Library used to create interactive controls
import ipywidgets as widgets

# DATA COLLECTION

Data collection is the initial step in the data mining process. It involves identifying the types of data needed and gathering that data from sources such as databases, files, APIs, sensors, social media, or other data repositories. The collected data should align with the objectives of the analysis or research, because it forms the datasets used for analysis and modeling.

## COLLECTION SOURCES

The data needed for this project comes from a single source: the [Kaggle website](https://www.kaggle.com/). Kaggle is a popular platform for data science and machine learning competitions, datasets, and learning resources. It was founded in 2010 and acquired by Google in 2017.

The following link leads to the Kaggle page hosting the dataset:

> https://www.kaggle.com/datasets/shivam2503/diamonds

## COLLECTION METHOD

This dataset must be manually downloaded, un-archived (uncompressed), and a copy of the dataset placed into a working folder. 
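
The un-archiving step can also be scripted once the file has been downloaded. The sketch below is a minimal example, assuming the archive is named `archive.zip` and contains `diamonds.csv`, as stated in this notebook. The helper name `extract_dataset` and the `working` folder are hypothetical. To stay self-contained, the example builds a tiny stand-in archive in a temporary folder rather than touching the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir, member="diamonds.csv"):
    """Extract one member of a zip archive into target_dir and return its path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        return Path(archive.extract(member, path=target_dir))


# Demonstration with a stand-in archive (not the real Kaggle download)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "archive.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("diamonds.csv", "carat,price\n0.23,326\n")

    extracted = extract_dataset(demo_zip, Path(tmp) / "working")
    extracted_name = extracted.name
    extracted_text = extracted.read_text()
    print(extracted_name)
```

For the real dataset, the same helper could be pointed at the downloaded `archive.zip` and at the project's dataset folder.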

## DATA SOURCE LOCATION

The following link will download a copy of the dataset so that you can save it to a local folder:

> https://www.kaggle.com/datasets/shivam2503/diamonds/download?datasetVersionNumber=1

The download is a zip file named `archive.zip`.

### ORIGINAL DATASET

A copy of this zip file, along with the extracted CSV file (`diamonds.csv`), can be found in the following path:

> 00_Data/00_Datasets/Originals/

### PROJECT DATASET

The CSV file for project use is located in the following path:

> 00_Data/00_Datasets/diamonds.csv

# DATA SELECTION

Data selection is the process of identifying and choosing relevant data from the various sources of collected data. Data selection is a crucial step because working with large datasets that contain irrelevant or unnecessary information can lead to inefficiencies in analysis and modeling, as well as increased computational costs. Therefore, selecting the right data subset is essential for obtaining accurate and meaningful insights. 

## SELECTION METHOD

Since there is only 1 data source for this project, there is no need to perform any data selection.

# DATA INTEGRATION

Data integration refers to the process of combining data from different sources into a unified and coherent view, typically within a single system or application. The goal is to provide a comprehensive and holistic understanding of the data, enabling better decision-making, analysis, and insights.

## DATA INTEGRATION METHOD

Since there is only 1 data source for this project, there is no need to perform any data integration.

# DATA CLEANING

Data cleaning, also known as data cleansing or data scrubbing, is a fundamental process in data preparation. It involves identifying and correcting errors, inconsistencies, and inaccuracies in a dataset. It's an essential step in the data analysis and modeling pipeline, as the quality of the data directly impacts the reliability and validity of the insights and decisions derived from it. 

## LOADING AND INSPECTING THE DATASET

The following scripts load the dataset.

In [3]:
# Path to the dataset
pathToDataset = "./../00_Data/00_Datasets/"

In [4]:
# Filename
diamondsCSVFilename = "diamonds.csv"

In [5]:
# Create a data frame
diamonds_df = pd.read_csv(
    pathToDataset + diamondsCSVFilename,
    index_col = None
)

# Display the newly created data frame
diamonds_df

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...,...
53935,53936,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,53937,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,53938,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,53939,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


### DATA DICTIONARY

A data dictionary is a structured collection of metadata that provides detailed information about the contents, format, and meaning of data within the dataset. It serves as a comprehensive reference guide for understanding the characteristics and properties of the data elements stored in the dataset.

The following table is the data dictionary for this project:

| Column Name | Description                                                                          | Categorical or Numerical | Type | DType |
|:------------|:-------------------------------------------------------------------------------------|--------------------------|------|-------|
| carat| Carat denotes the weight of a diamond, not the size. 1 carat = 1/5 grams (0.2 grams)        | Numerical | Continuous | Float|
| cut  | Refers to its proportions, symmetry, and polish, which greatly affect its brilliance, sparkle, and overall appearance | Categorical | Ordinal | Object |
| color | Refers to the presence or absence of color in a diamond, specifically the degree of colorlessness or lack of hue | Categorical | Ordinal | Object |
| clarity | Refers to the presence or absence of internal flaws                                      | Categorical | Ordinal | Object |
| depth   | (depth %) A measurement used to assess the depth of a diamond relative to its width (or diameter)  | Numerical | Continuous | Float |
| table   | (table %) The table percentage refers to the size of the table facet of a diamond relative to the diameter of the entire diamond's crown (the top portion above the girdle) | Numerical | Continuous | Float |
| price | The price of diamonds in USD, the target for this project | Numerical | Continuous | Integer |
| x | The crown height of a diamond | Numerical | Continuous | Float |
| y | The girdle diameter of a diamond | Numerical | Continuous | Float |
| z | The pavilion depth of a diamond | Numerical | Continuous | Float |

Any columns other than the ones identified above are not required for this analysis.
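
The data dictionary can also be expressed in code, so that a loaded data frame is checked against it automatically. The following is a minimal sketch and not part of the original workflow: the `expected_kinds` mapping and the helper `check_against_dictionary` are hypothetical names, and the two-row data frame stands in for the real dataset.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype, is_numeric_dtype

# Expected kind of each column, taken from the data dictionary above
expected_kinds = {
    "carat": "float", "cut": "text", "color": "text", "clarity": "text",
    "depth": "float", "table": "float", "price": "integer",
    "x": "float", "y": "float", "z": "float",
}

# One predicate per kind; "text" means any non-numeric column
kind_checks = {
    "float": is_float_dtype,
    "integer": is_integer_dtype,
    "text": lambda series: not is_numeric_dtype(series),
}


def check_against_dictionary(df, expected):
    """Report extra columns, missing columns, and columns of the wrong kind."""
    return {
        "extra": [c for c in df.columns if c not in expected],
        "missing": [c for c in expected if c not in df.columns],
        "mismatched": [
            c for c, kind in expected.items()
            if c in df.columns and not kind_checks[kind](df[c])
        ],
    }


# Stand-in rows (illustrative only), including one column not in the dictionary
sample_df = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "carat": [0.23, 0.21],
    "cut": ["Ideal", "Premium"],
    "color": ["E", "E"],
    "clarity": ["SI2", "SI1"],
    "depth": [61.5, 59.8],
    "table": [55.0, 61.0],
    "price": [326, 326],
    "x": [3.95, 3.89],
    "y": [3.98, 3.84],
    "z": [2.43, 2.31],
})

report = check_against_dictionary(sample_df, expected_kinds)
print(report)
```

Any name listed under `extra` is a candidate for removal, and any name under `mismatched` points to a column whose type differs from the dictionary.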

## IDENTIFYING ERRORS AND INCONSISTENCIES

The first step in data cleaning is identifying any anomalies or discrepancies within the dataset. This could include missing values, duplicate records, incorrect data types, or formatting issues.

### DATASET ANOMALIES / DISCREPANCIES

The following script will provide dataset details related to:

1. Column name and index number
2. The column Dtype
3. If there are any missing values in that column

Inspect the dataset and confirm these details are consistent with the table found in the data dictionary.

In [6]:
# Inspect the data frame
diamonds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB


#### OBSERVATIONS

1. The column named `Unnamed: 0` is not a required feature in this dataset. It needs to be deleted.
2. The column Dtypes are consistent with the expected values listed in the data dictionary.
3. There are no missing values.
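
The `info()` output reports non-null counts, from which the absence of missing values is inferred. A more direct check is sketched below on a small stand-in frame. The rows are hypothetical, and one missing value is introduced deliberately so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Stand-in data with one deliberately missing carat value
toy_df = pd.DataFrame({
    "carat": [0.23, np.nan, 0.31],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 335],
})

# Count missing values per column
missing_per_column = toy_df.isna().sum()

# Keep only the names of columns that contain at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(columns_with_missing)
```

Applied to `diamonds_df`, an empty list would be consistent with observation 3.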

#### REMOVING ANOMALIES / DISCREPANCIES

The following script(s) will be used to correct the anomalies / discrepancies found in the dataset:

In [7]:
# Delete unnecessary columns
diamonds_df = diamonds_df.drop(
    columns = ['Unnamed: 0']
)

# Confirm the operation was successful
diamonds_df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [8]:
# Inspect the dataset to confirm changes have been made
diamonds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
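
pandas assigns names such as `Unnamed: 0` to columns whose header cell is empty, which commonly happens when a CSV file was saved together with its row index. As an alternative to dropping the column after loading, the file can be read with `index_col = 0`, or restricted to the required columns with `usecols`. The sketch below uses a small in-memory CSV with hypothetical rows in place of `diamonds.csv`.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV whose first header cell is empty
csv_text = ",carat,cut,price\n1,0.23,Ideal,326\n2,0.21,Premium,326\n"

# Default read: the unnamed column is kept as a regular column
default_df = pd.read_csv(io.StringIO(csv_text))

# Option 1: treat the first column as the row index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: load only the columns that are needed
selected_df = pd.read_csv(io.StringIO(csv_text), usecols=["carat", "cut", "price"])

print(list(default_df.columns))
print(list(indexed_df.columns))
print(list(selected_df.columns))
```

Note that the first option keeps the file's own numbering as the index (starting at 1 in this sample), whereas the second keeps the default index starting at 0.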


### DUPLICATE RECORDS

Duplicated records affect data quality and consistency. Duplicated records also create bias and reduce computer performance when loading and using the dataset.

#### CHECKING FOR DUPLICATES

The following scripts will check the dataset for duplicate records:

In [9]:
# Check for duplicate records and store the results
duplicateValues_df = diamonds_df[
    diamonds_df.duplicated(keep = 'first')
].sort_values(
    by = ['carat', 'cut', 'color']
)

# Display the results
duplicateValues_df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
47296,0.30,Good,J,VS1,63.4,57.0,394,4.23,4.26,2.69
28593,0.30,Ideal,G,VS2,63.0,55.0,675,4.31,4.29,2.71
34422,0.30,Ideal,G,IF,62.1,55.0,863,4.32,4.35,2.69
31627,0.30,Ideal,H,SI1,62.2,57.0,450,4.26,4.29,2.66
31629,0.30,Ideal,H,SI1,62.2,57.0,450,4.27,4.28,2.66
...,...,...,...,...,...,...,...,...,...,...
24863,2.50,Fair,H,SI2,64.9,58.0,13278,8.46,8.43,5.48
26608,2.54,Very Good,H,SI2,63.5,56.0,16353,8.68,8.65,5.50
26554,2.66,Good,H,SI2,63.8,57.0,16239,8.71,8.65,5.54
27516,3.01,Fair,I,SI2,65.8,56.0,18242,8.99,8.94,5.90


**Duplicated Records Report**

> * Total number of records in the dataset: 53,940
> 
> * Total number of duplicate records: 146

The number of duplicate records is small: 146 of 53,940 records, or roughly 0.27% of the dataset.
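
The figures in the report can be computed directly rather than read off the displayed frame. The sketch below uses a small stand-in frame with hypothetical rows. It also shows how the `keep` argument of `duplicated()` changes what is counted.

```python
import pandas as pd

# Stand-in data: the first three rows are identical
toy_df = pd.DataFrame({
    "carat": [0.30, 0.30, 0.30, 0.72],
    "cut": ["Ideal", "Ideal", "Ideal", "Good"],
    "price": [450, 450, 450, 2757],
})

total_records = len(toy_df)

# keep="first" flags every repeat after the first occurrence
extra_copies = int(toy_df.duplicated(keep="first").sum())

# keep=False flags every member of a duplicated group, including the first
all_in_groups = int(toy_df.duplicated(keep=False).sum())

duplicate_share = extra_copies / total_records
print(f"{extra_copies} of {total_records} records are duplicates ({duplicate_share:.2%})")
```

With `keep = 'first'`, the count matches the number of rows that `drop_duplicates()` removes by default.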

#### DELETING DUPLICATE RECORDS

The following scripts will delete duplicate records.

In [10]:
# Create a new dataset with duplicates removed
diamondsNoDuplicates_df = diamonds_df.drop_duplicates()

# Verify the operation was successful
diamondsNoDuplicates_df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


### INCORRECT DATA TYPES / FORMATTING ISSUES

Incorrect data types and formatting issues are typically the result of either human error or data corruption. The following scripts inspect the dataset for incorrect data types and formatting issues.

In [11]:
# Create a list of features in order to inspect the column
features_list = diamondsNoDuplicates_df.columns.tolist()

features_list

['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

In [12]:
# Create a drop down in order to inspect the dataset column by column
features_widget = widgets.Dropdown(
    options = features_list,
    value = features_list[0],
    description = 'Feature:',
    disabled = False,
)

In [13]:
# Inspect each column for incorrect data types and formatting issues
@widgets.interact(
    feature = features_widget
)
def datasetInspector(feature):
    # Create a query of the feature
    df = diamondsNoDuplicates_df[feature]

    # If the feature is non-numerical, provide data on unique values
    if df.dtype == 'object':
        print()
        print(f"The unique values found in {feature} are:", sorted(df.unique()))
        print()

    return df.describe()


#### OBSERVATIONS

There do not appear to be any incorrect data types or formatting issues in this dataset.
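
The interactive inspector requires a live notebook kernel. A non-interactive variant is sketched below on a small stand-in frame. The rows are hypothetical and `summarize_text_columns` is an illustrative name. It collects the unique values of every non-numeric column so that misspellings or inconsistent capitalization stand out.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_text_columns(df):
    """Return {column: sorted unique values} for each non-numeric column."""
    return {
        column: sorted(df[column].dropna().unique())
        for column in df.columns
        if not is_numeric_dtype(df[column])
    }


# Stand-in data with one inconsistent label ("ideal" instead of "Ideal")
toy_df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "ideal"],
    "color": ["E", "E", "I"],
})

unique_values = summarize_text_columns(toy_df)
for column, values in unique_values.items():
    print(column, values)
```

In this stand-in data, the `cut` column lists both `Ideal` and `ideal`, which is the kind of formatting issue the inspection is meant to surface.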

# DATA TRANSFORMATION

Data transformation is the process of converting data from its raw form into a format suitable for analysis, modeling, or machine learning algorithms. It involves applying a set of rules or functions to modify the data, ensuring it meets the requirements for a specific task or technique.

Typical data transformation tasks at this step of the process include the following:

* **Standardizing Units of Measure:** If your dataset has different units of measure, pick a standard unit of measure or convert the values to a standard scale between 0 and 1. A common example is standardizing the date format, or converting prices to a common currency.
* **Feature Engineering:** This involves either creating new features, or transforming existing features to improve model performance or extract more meaningful information.
* **Grouping, Bucket/Binning:** Grouping categorical values. Dividing a numerical value into bins / buckets.
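
Two of the tasks listed above, scaling values to the range 0 to 1 and binning, can be illustrated briefly. This is a sketch on made-up values. The bin edges and labels are arbitrary choices for illustration, not values used in this project.

```python
import pandas as pd

prices = pd.Series([326, 2757, 5000, 18242], name="price")

# Min-max scaling: map the smallest value to 0 and the largest to 1
scaled = (prices - prices.min()) / (prices.max() - prices.min())

# Binning: assign each price to a labelled interval.
# Intervals are closed on the right by default: (0, 1000], (1000, 5000], (5000, 20000]
price_band = pd.cut(
    prices,
    bins=[0, 1000, 5000, 20000],
    labels=["low", "mid", "high"],
)

print(scaled.round(3).tolist())
print(price_band.tolist())
```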

## STANDARDIZING UNITS OF MEASURE

There is no requirement to standardize units of measure in this dataset.

## FEATURE ENGINEERING

Feature engineering involves taking the existing features of a dataset and manipulating them to create new features that are more relevant to the problem being solved.

### FEATURE ENGINEERING REQUIREMENTS

The feature engineering done on this dataset consists of renaming the columns and rearranging them to help with analytics.

#### RENAME COLUMNS

The following columns will be renamed:

| Old Column Name | New Column Name      | Reason for change               |
|-----------------|----------------------|---------------------------------|
| carat           | Carat                | Consistent naming style         |
| cut             | Cut                  | Consistent naming style         |
| color           | Color                | Consistent naming style         |
| clarity         | Clarity              | Consistent naming style         |
| depth           | Total Depth %        | Better description of feature   |
| table           | Table %              | Better description of feature   |
| price           | Price                | Consistent naming style         |
| x               | Crown Height (mm)    | Better description of feature   |
| y               | Girdle Diameter (mm) | Better description of feature   |
| z               | Pavilion Depth (mm)  | Better description of feature   |

Below are the scripts that will rename this dataset.
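
Before renaming, it can help to confirm that every key in the mapping matches an existing column, because `rename()` ignores keys that are not found unless `errors = 'raise'` is passed. The sketch below uses a shortened, illustrative mapping and a stand-in frame. `unmatched_keys` is a hypothetical helper, and the key `X` is a deliberate typo.

```python
import pandas as pd


def unmatched_keys(df, mapping):
    """Return the mapping keys that do not match any column of df."""
    return [old for old in mapping if old not in df.columns]


toy_df = pd.DataFrame({"carat": [0.23], "depth": [61.5], "x": [3.95]})

# 'X' is a deliberate typo for 'x'
mapping = {"carat": "Carat", "depth": "Total Depth %", "X": "Crown Height (mm)"}

problems = unmatched_keys(toy_df, mapping)

# Default behaviour: the unmatched key is skipped without any message
renamed_df = toy_df.rename(columns=mapping)

# Strict behaviour: an unmatched key raises a KeyError
strict_failed = False
try:
    toy_df.rename(columns=mapping, errors="raise")
except KeyError:
    strict_failed = True

print(problems)
print(list(renamed_df.columns))
```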

In [14]:
# Dictionary mapping old names to new names
replaceNames_dict = {
    'carat' : 'Carat',
    'cut' : 'Cut',
    'color' : 'Color',
    'clarity' : 'Clarity',
    'price' : 'Price',
    'depth' : 'Total Depth %',
    'table' : 'Table %',
    'x' : 'Crown Height (mm)',
    'y' : 'Girdle Diameter (mm)',
    'z' : 'Pavilion Depth (mm)'
}

In [15]:
# Rename the columns; assigning the result (rather than using inplace = True)
# avoids the SettingWithCopyWarning raised on the de-duplicated data frame
diamondsNoDuplicates_df = diamondsNoDuplicates_df.rename(
    columns = replaceNames_dict
)

# Inspect the results
diamondsNoDuplicates_df

Unnamed: 0,Carat,Cut,Color,Clarity,Total Depth %,Table %,Price,Crown Height (mm),Girdle Diameter (mm),Pavilion Depth (mm)
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [18]:
# Reordered columns list
reorderedColumns_list = [
    'Carat',
    'Cut',
    'Crown Height (mm)',
    'Girdle Diameter (mm)',
    'Pavilion Depth (mm)',
    'Table %',
    'Total Depth %',
    'Clarity',
    'Color',
    'Price'
]

In [19]:
# Create a new dataset, with reorganized columns
diamondsNoDuplicates_df = diamondsNoDuplicates_df[reorderedColumns_list]

# Inspect the new dataset
diamondsNoDuplicates_df

Unnamed: 0,Carat,Cut,Crown Height (mm),Girdle Diameter (mm),Pavilion Depth (mm),Table %,Total Depth %,Clarity,Color,Price
0,0.23,Ideal,3.95,3.98,2.43,55.0,61.5,SI2,E,326
1,0.21,Premium,3.89,3.84,2.31,61.0,59.8,SI1,E,326
2,0.23,Good,4.05,4.07,2.31,65.0,56.9,VS1,E,327
3,0.29,Premium,4.20,4.23,2.63,58.0,62.4,VS2,I,334
4,0.31,Good,4.34,4.35,2.75,58.0,63.3,SI2,J,335
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,5.75,5.76,3.50,57.0,60.8,SI1,D,2757
53936,0.72,Good,5.69,5.75,3.61,55.0,63.1,SI1,D,2757
53937,0.70,Very Good,5.66,5.68,3.56,60.0,62.8,SI1,D,2757
53938,0.86,Premium,6.15,6.12,3.74,58.0,61.0,SI2,H,2757
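
Selecting columns with a list drops any column that the list omits, so it is worth confirming that the reordered list is a permutation of the existing columns. A minimal sketch with a stand-in frame follows. `is_permutation` is a hypothetical helper, and the column lists are illustrative.

```python
import pandas as pd


def is_permutation(df, ordered_columns):
    """True if ordered_columns contains exactly the columns of df, in any order."""
    return sorted(ordered_columns) == sorted(df.columns)


toy_df = pd.DataFrame({"Carat": [0.23], "Cut": ["Ideal"], "Price": [326]})

complete_order = ["Price", "Carat", "Cut"]
incomplete_order = ["Price", "Carat"]  # 'Cut' would be dropped silently

complete_ok = is_permutation(toy_df, complete_order)
incomplete_ok = is_permutation(toy_df, incomplete_order)

# Reorder only after the check has passed
reordered_df = toy_df[complete_order]

print(complete_ok, incomplete_ok)
print(list(reordered_df.columns))
```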



For reference, a general data cleaning workflow typically covers the following steps:

1. **Identifying Errors and Inconsistencies**: The first step in data cleaning is identifying any anomalies or discrepancies within the dataset. This could include missing values, duplicate records, outliers, incorrect data types, or formatting issues.

2. **Handling Missing Values**: Missing values are common in real-world datasets and can arise due to various reasons such as data entry errors, equipment malfunction, or intentional omission. Data cleaning involves deciding how to handle these missing values, whether by imputing them with a calculated value (e.g., mean, median, mode), deleting rows or columns with missing values, or using more sophisticated techniques like predictive modeling.

3. **Removing Duplicates**: Duplicate records can skew analysis results and introduce bias into models. Data cleaning typically involves identifying and removing duplicate entries from the dataset to ensure each observation is unique.

4. **Correcting Inaccurate Data**: Inaccurate data can arise from human error, measurement errors, or system glitches. Data cleaning may involve validating data against known standards or business rules and correcting any inaccuracies found.

5. **Standardizing and Normalizing Data**: Data may come from different sources or systems with varying formats and standards. Standardizing and normalizing data involve converting it into a consistent format or unit of measurement to facilitate analysis and comparison.

6. **Dealing with Outliers**: Outliers are data points that deviate significantly from the rest of the dataset and can distort statistical analyses and machine learning models. Data cleaning may involve identifying and either removing outliers or transforming them to mitigate their impact.

7. **Ensuring Consistency**: Consistency is crucial for ensuring that data is interpreted and analyzed correctly. Data cleaning involves enforcing consistency in data formats, naming conventions, and encoding standards across the dataset.

8. **Validating Data Integrity**: Data cleaning also includes validating the integrity of the dataset to ensure that it adheres to logical constraints and business rules. This may involve cross-referencing data across different fields or verifying relationships between related entities.

9. **Documenting Changes**: It's essential to document the data cleaning process, including the steps taken and the rationale behind them. This documentation helps ensure transparency, reproducibility, and accountability in the data analysis workflow.

By performing thorough data cleaning, data scientists can enhance the quality, reliability, and usability of the dataset, ultimately leading to more accurate and robust analysis, modeling, and decision-making.
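
Two of the steps above, imputing missing values and flagging outliers, are sketched below on made-up values. Median imputation and the 1.5 × IQR rule are common conventions chosen here for illustration. The diamonds dataset in this notebook has no missing values, so the example is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry and one extreme value
values = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, 100.0])

# Impute the missing value with the median of the observed values
imputed = values.fillna(values.median())

# Flag outliers with the 1.5 * IQR rule
q1, q3 = imputed.quantile([0.25, 0.75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
is_outlier = (imputed < lower_bound) | (imputed > upper_bound)

print(imputed.tolist())
print(is_outlier.tolist())
```

Whether a flagged value should be removed, transformed, or kept is a judgment call that depends on the analysis objectives.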

# SUMMARY

This notebook collected the diamonds dataset from Kaggle, removed the unneeded `Unnamed: 0` column, removed 146 duplicate records, and renamed and reordered the columns, producing a tabular dataset in an explainable, repeatable, and reproducible fashion.

## NEXT STEPS

NOTEBOOK 2 describes the business rules related to diamond pricing.