<h1 style='color:rgb(52, 152, 219)'; align=center><font size = 8> DIAMOND PRICES - ANALYSIS AND MODELING </font></h1>

<h2 style='color:rgb(52, 152, 219)'; align=left><font size = 6> NOTEBOOK 3 - DATA MINING </font></h2>

# REVISION HISTORY

| REV | DESCRIPTION             | DATE         |  BY   | CHECK | APPROVE  |
|:---:|:-----------------------:|:------------:|:-----:|:-----:|:--------:|
| A0  | ISSUED FOR REVIEW (IFR) | 2024-APR-XX  |  IAC  |       |          |
|     |                         |              |       |       |          |

## DETAILED DESCRIPTION OF REVISIONS

> **REV A0** - HOLD

# INTRODUCTION

This notebook describes the steps and methods used in the data mining process. The goal of data mining is to create a tabular dataset. The following steps are used in this notebook to create a high-quality dataset:

* [X] **Data Collection:** Gather raw data from various sources such as databases, APIs, files, or web scraping.
* [X] **Data Selection:** Identify relevant datasets that contain information pertinent to the analysis objectives.
* [X] **Data Integration:** Combine multiple datasets if necessary to create a unified dataset for analysis.
* [ ] **Data Cleaning:** Perform data cleaning processes to handle missing values, outliers, duplicates, and inconsistencies.
* [ ] **Data Transformation:** Transform the data into a tabular dataset suitable for analysis, including converting data types and scaling numeric variables.

# LIBRARY

The following libraries are required to run this notebook.

In [1]:
# Library to create and handle a tabular dataset
import pandas as pd

In [2]:
# Library to inspect missing values in a dataset
import missingno as msno

# DATA COLLECTION

Data collection is the crucial initial step in the data mining process: relevant data is gathered from various sources to build the datasets that will be used for analysis and modeling. These sources can include databases, files, APIs, sensors, social media, or any other data repositories. The collected data should align with the objectives of the analysis or research, which means identifying the types of data needed and the sources where this data can be obtained.

## COLLECTION SOURCES

The data needed for this project comes from a single source: the [Kaggle website](https://www.kaggle.com/). Kaggle is a popular platform for data science and machine learning competitions, datasets, and learning resources. It was founded in 2010 and acquired by Google in 2017.

The following link takes you to the Kaggle page hosting the dataset:

> https://www.kaggle.com/datasets/shivam2503/diamonds

## COLLECTION METHOD

The dataset must be downloaded manually and unarchived (uncompressed), and a copy must then be placed in a working folder.

## DATA SOURCE LOCATION

The following link will download a copy of the dataset so that you can save it to a local folder:

> https://www.kaggle.com/datasets/shivam2503/diamonds/download?datasetVersionNumber=1

A zip file named `archive.zip` will be downloaded.

### ORIGINAL DATASET

A copy of this zip file, along with the extracted CSV file `diamonds.csv`, can be found at the following path:

> 00_Data/00_Datasets/Originals/

### PROJECT DATASET

The CSV file for project use is located at the following path:

> 00_Data/00_Datasets/diamonds.csv
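The manual unarchive-and-copy step could also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_dataset` and its parameters are hypothetical, and it assumes `archive.zip` has already been downloaded and contains `diamonds.csv` at its top level.

```python
import shutil
import zipfile
from pathlib import Path


def extract_dataset(zip_path, originals_dir, project_dir, member="diamonds.csv"):
    """Extract `member` next to the original archive and copy it for project use.

    Hypothetical helper: the folder arguments are supplied by the caller.
    """
    originals_dir = Path(originals_dir)
    project_dir = Path(project_dir)
    originals_dir.mkdir(parents=True, exist_ok=True)
    project_dir.mkdir(parents=True, exist_ok=True)

    # Extract the single CSV file from the archive
    with zipfile.ZipFile(zip_path) as archive:
        extracted = Path(archive.extract(member, path=originals_dir))

    # Keep the extracted original untouched and work on a copy
    target = project_dir / member
    shutil.copy2(extracted, target)
    return target


# Example call (assumes the notebook runs one level below the data folder):
# extract_dataset(
#     "./../00_Data/00_Datasets/Originals/archive.zip",
#     "./../00_Data/00_Datasets/Originals/",
#     "./../00_Data/00_Datasets/",
# )
```

Keeping the extracted file in the originals folder and working on a copy means the raw download is never modified by later cleaning steps.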

# DATA SELECTION

Data selection is the process of identifying and choosing relevant data from the various sources of collected data. Data selection is a crucial step because working with large datasets that contain irrelevant or unnecessary information can lead to inefficiencies in analysis and modeling, as well as increased computational costs. Therefore, selecting the right data subset is essential for obtaining accurate and meaningful insights. 

## SELECTION METHOD

Since this project has only one data source, no data selection is needed.

# DATA INTEGRATION

Data integration refers to the process of combining data from different sources into a unified and coherent view, typically within a single system or application. The goal is to provide a comprehensive and holistic understanding of the data, enabling better decision-making, analysis, and insights.

## DATA INTEGRATION METHOD

Since this project has only one data source, no data integration is needed.

# DATA CLEANING

Data cleaning, also known as data cleansing or data scrubbing, is a fundamental process in data preparation. It involves identifying and correcting errors, inconsistencies, and inaccuracies in a dataset. It's an essential step in the data analysis and modeling pipeline, as the quality of the data directly impacts the reliability and validity of the insights and decisions derived from it. 

## LOADING THE DATASET

The following cells load the dataset.

In [3]:
# Path to the dataset
pathToDataset = "./../00_Data/00_Datasets/"

In [4]:
# Filename
diamondsCSVFilename = "diamonds.csv"

In [5]:
# Create a data frame
dimaonds_df = pd.read_csv(
    pathToDataset + diamondsCSVFilename,
    index_col = None
)

# Inspect the data frame
dimaonds_df

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...,...
53935,53936,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,53937,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,53938,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,53939,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


## IDENTIFYING ERRORS AND INCONSISTENCIES

The first step in data cleaning is identifying any anomalies or discrepancies within the dataset. This could include missing values, duplicate records, outliers, incorrect data types, or formatting issues.
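As a rough illustration of this first pass, the sketch below counts missing values and duplicate rows and lists the column data types. The helper name `summarize_issues` is hypothetical, and the toy frame uses made-up values rather than the project data.

```python
import pandas as pd


def summarize_issues(df):
    """Return simple counts of common data quality problems (hypothetical helper)."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy frame (made-up values) with one missing price and one duplicated row
toy_df = pd.DataFrame(
    {
        "carat": [0.23, 0.21, 0.21, 0.29],
        "cut": ["Ideal", "Premium", "Premium", "Good"],
        "price": [326.0, 326.0, 326.0, None],
    }
)

report = summarize_issues(toy_df)
print(report)
```

The same function could be called on the loaded data frame to decide which cleaning steps are actually needed.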

### IDENTIFYING ANOMALIES OR DISCREPANCIES

The expected fields found in this dataset are:

> **Carat:** The weight of the diamond, measured in carats.

> **Cut:** The quality of the diamond's cut, ranging from 'Fair' to 'Ideal'.

> **Clarity:** The level of imperfections or blemishes within the diamond, categorized from 'I1' (worst) to the best: 'IF' (internally flawless).

> **Color:** The color grade of the diamond, ranging from 'J' (worst) to 'D' (best).

> **x:** Diamond length in mm

> **y:** Diamond width in mm

> **z:** Diamond depth in mm

> **Depth:** Total depth percentage = z / mean(x, y) = 2 * z / (x + y)

> **Table:** Width of top of diamond relative to widest point

> **Price:** The price of the diamond, in USD. This is the target feature.

Any fields not related to the features described above are not required and need to be deleted.
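The depth definition above can be used as a consistency check between the `depth`, `x`, `y`, and `z` fields. The sketch below is an illustration only: the function name `depth_percent` is hypothetical, and the factor of 100 assumes the `depth` column stores a percentage, as values such as 61.5 suggest.

```python
def depth_percent(x, y, z):
    """Total depth as a percentage: 100 * 2 * z / (x + y).

    Works on plain numbers and, element-wise, on pandas Series.
    Rows where x + y equals zero would need separate handling.
    """
    return 100 * 2 * z / (x + y)


# First displayed row: x = 3.95, y = 3.98, z = 2.43, recorded depth = 61.5
computed = depth_percent(3.95, 3.98, 2.43)
print(round(computed, 1))  # 61.3
```

The computed value is close to, but not identical with, the recorded depth, so any comparison against the `depth` column should allow a small tolerance.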

In [7]:
# Inspect the data frame
dimaonds_df

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...,...
53935,53936,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,53937,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,53938,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,53939,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


The feature `Unnamed: 0` is not a required feature in this dataset. It only repeats the row number, so it needs to be deleted.

In [10]:
# Delete unnecessary columns
dimaonds_df = dimaonds_df.drop(
    columns = ['Unnamed: 0']
)

# Confirm that operation was successful
dimaonds_df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
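Dropping a column by name raises a `KeyError` if the cell is run a second time, because the column no longer exists. One possible alternative, sketched below with a hypothetical helper `drop_unnamed_columns` and made-up values, removes every column whose name starts with `Unnamed` and can be re-run safely.

```python
import pandas as pd


def drop_unnamed_columns(df):
    """Return a copy of df without columns whose name starts with 'Unnamed'."""
    unnamed = [col for col in df.columns if str(col).startswith("Unnamed")]
    return df.drop(columns=unnamed)


# Toy frame (made-up values) that mimics the leftover row-number column
toy_df = pd.DataFrame(
    {"Unnamed: 0": [1, 2], "carat": [0.23, 0.21], "price": [326, 326]}
)

cleaned = drop_unnamed_columns(toy_df)
print(list(cleaned.columns))  # ['carat', 'price']
```

Calling the function on an already cleaned frame leaves it unchanged, which keeps the notebook repeatable when cells are executed out of order.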



1. **Identifying Errors and Inconsistencies**: The first step in data cleaning is identifying any anomalies or discrepancies within the dataset. This could include missing values, duplicate records, outliers, incorrect data types, or formatting issues.

2. **Handling Missing Values**: Missing values are common in real-world datasets and can arise due to various reasons such as data entry errors, equipment malfunction, or intentional omission. Data cleaning involves deciding how to handle these missing values, whether by imputing them with a calculated value (e.g., mean, median, mode), deleting rows or columns with missing values, or using more sophisticated techniques like predictive modeling.

3. **Removing Duplicates**: Duplicate records can skew analysis results and introduce bias into models. Data cleaning typically involves identifying and removing duplicate entries from the dataset to ensure each observation is unique.

4. **Correcting Inaccurate Data**: Inaccurate data can arise from human error, measurement errors, or system glitches. Data cleaning may involve validating data against known standards or business rules and correcting any inaccuracies found.

5. **Standardizing and Normalizing Data**: Data may come from different sources or systems with varying formats and standards. Standardizing and normalizing data involve converting it into a consistent format or unit of measurement to facilitate analysis and comparison.

6. **Dealing with Outliers**: Outliers are data points that deviate significantly from the rest of the dataset and can distort statistical analyses and machine learning models. Data cleaning may involve identifying and either removing outliers or transforming them to mitigate their impact.

7. **Ensuring Consistency**: Consistency is crucial for ensuring that data is interpreted and analyzed correctly. Data cleaning involves enforcing consistency in data formats, naming conventions, and encoding standards across the dataset.

8. **Validating Data Integrity**: Data cleaning also includes validating the integrity of the dataset to ensure that it adheres to logical constraints and business rules. This may involve cross-referencing data across different fields or verifying relationships between related entities.

9. **Documenting Changes**: It's essential to document the data cleaning process, including the steps taken and the rationale behind them. This documentation helps ensure transparency, reproducibility, and accountability in the data analysis workflow.
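A few of the steps above (removing duplicates, imputing missing values, and flagging outliers) can be sketched as follows. This is an illustration under stated assumptions, not the method adopted by this project: median imputation and the 1.5 × IQR rule are example choices, the helper names are hypothetical, and the toy values are made up.

```python
import pandas as pd


def basic_clean(df, numeric_cols):
    """Drop exact duplicate rows, then fill missing numeric values with the median."""
    cleaned = df.drop_duplicates().copy()
    for col in numeric_cols:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    return cleaned


def flag_iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy frame (made-up values): one duplicated row, one missing carat, one extreme carat
toy_df = pd.DataFrame(
    {
        "carat": [0.2, 0.3, 0.3, 0.4, None, 5.0],
        "price": [300, 400, 400, 500, 450, 18000],
    }
)

cleaned = basic_clean(toy_df, numeric_cols=["carat"])
outlier_mask = flag_iqr_outliers(cleaned["carat"])
print(cleaned.assign(carat_outlier=outlier_mask))
```

Flagging outliers instead of deleting them keeps the decision visible, which supports the documentation step listed above.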

By performing thorough data cleaning, data scientists can enhance the quality, reliability, and usability of the dataset, ultimately leading to more accurate and robust analysis, modeling, and decision-making.

# SUMMARY

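This notebook describes the data collection, data selection, and data integration steps for the diamonds dataset, and begins the data cleaning step by loading the dataset and removing the unneeded `Unnamed: 0` column. The work is documented so that it can be executed in an explainable, repeatable, and reproducible fashion.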

## NEXT STEPS

NOTEBOOK 2 describes the business rules related to diamond pricing.