<h1 style='color:rgb(52, 152, 219)'; align=center><font size = 8> DIAMOND PRICES - ANALYSIS AND MODELING </font></h1>

<h2 style='color:rgb(52, 152, 219)'; align=left><font size = 6> NOTEBOOK 00-01: OBJECTIVES & METHODOLOGY </font></h2>

# REVISION HISTORY

| REV | DESCRIPTION             | DATE         |  BY   | CHECK | APPROVE  |
|:---:|:-----------------------:|:------------:|:-----:|:-----:|:--------:|
| A0  | ISSUED FOR REVIEW (IFR) | 2024-APR-XX  |  IAC  |       |          |
|     |                         |              |       |       |          |

## DETAILED DESCRIPTION OF REVISIONS

> **REV A0** - HOLD

# INTRODUCTION

Diamonds are a highly valued gemstone. They are a complex commodity with a well-defined grading system called the 4C's of Diamonds. These 4C's are: 

* Cut
* Color
* Clarity
* Carat

The purpose of this project is to model diamond prices by analyzing the relationship between the 4C's and the price of a diamond.

# OBJECTIVES 

The objectives of this project are as follows:

1. **Identify Key Factors:** Determine which diamond characteristics have the most significant impact on pricing.
2. **Model Development:** Develop predictive models that estimate diamond prices based on their characteristics.
3. **Insight Generation:** Gain insights into diamond pricing that allow consumers to see their options based on price.
4. **Recommendations:** Provide recommendations that can be used by the consumer, according to the analysis findings and modeling.

# DATA SOURCES

A data source is the primary location where data is collected and stored. Data can be stored in unstructured, semi-structured, or structured formats. Sources of data can include:

* Databases
* Files
* APIs
* Web Scraping
* Sensors and IoT Devices
* Cloud Services
* External Partners

Each data source has its own characteristics, including data format, structure, accessibility, and frequency of updates. Understanding these differences is crucial for selecting the appropriate tools and techniques for data extraction. 

There is only 1 data source for this project, described below.

## KAGGLE

The dataset used in this project can be found on [the Kaggle website](https://www.kaggle.com/). Kaggle is a popular platform for data science and machine learning competitions, datasets, and learning resources. It was founded in 2010 and acquired by Google in 2017.

The Kaggle page where the dataset can be downloaded is: [Diamonds - Analyze diamonds by their cut, color, clarity, prices and other attributes](https://www.kaggle.com/datasets/shivam2503/diamonds).

### ATTRIBUTION

The person credited with creating this dataset is: `Shivam Agrawal`. See [Diamonds - Analyze diamonds by their cut, color, clarity, prices and other attributes](https://www.kaggle.com/datasets/shivam2503/diamonds)

### DESCRIPTION OF DATASET

The dataset is a CSV file containing information on various diamond characteristics and their corresponding prices. The features found in this dataset are:

| Feature | Description      |
|---------|------------------|
| carat   | The weight of the diamond, measured in carats |
| cut     | The quality of the diamond's cut, ranging from 'Fair' to 'Ideal' |
| clarity | How free the diamond is of inclusions and blemishes, graded from 'I1' (worst) to 'IF' (internally flawless, best) |
| color   | The color grade of the diamond, ranging from 'J' (worst) to 'D' (best) |
| x       | Diamond length in mm |
| y       | Diamond width in mm |
| z       | Diamond depth in mm |
| depth   | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) |
| table   | Width of top of diamond relative to widest point |
| price   | The price of the diamond, in USD. This is the target feature |
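
The `depth` formula in the table can be checked directly from the `x`, `y`, and `z` columns. The sketch below is illustrative only: the function name `total_depth_pct` and the sample measurements are hypothetical, and it assumes the ratio is multiplied by 100 so that the result reads as a percentage.

```python
def total_depth_pct(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    if x + y == 0:
        raise ValueError("x + y must be non-zero")
    # z / mean(x, y) is the same as 2 * z / (x + y).
    return 100 * 2 * z / (x + y)


# Hypothetical measurements in mm (not taken from the dataset).
print(total_depth_pct(x=4.0, y=4.0, z=2.5))  # 62.5
```

Recomputing `depth` this way and comparing it with the recorded column is one possible consistency check during data cleaning.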

# METHODOLOGY

This project follows a three-step analytical process for extracting actionable insights from data. The steps in this process are as follows:

1. **Descriptive Analytics:** involves the exploration and summarization of historical data to understand past trends and patterns. It provides a foundational understanding of what happened, offering valuable context for further analysis. 
2. **Predictive Analytics:** utilizes statistical algorithms and machine learning techniques to forecast future outcomes, based on historical data patterns. By identifying potential future trends and behaviors, predictive analytics empowers organizations to anticipate opportunities and mitigate risks.
3. **Prescriptive Analytics:** takes the analysis a step further by recommending actions or strategies to optimize outcomes. It leverages advanced modeling techniques to simulate various scenarios and determine the most effective course of action.

## SCOPE OF DESCRIPTION

This notebook describes the **Descriptive Analytics** step. The Predictive and Prescriptive Analytics steps will be described in other notebooks.

# DESCRIPTIVE ANALYTICS

Descriptive analytics lays the foundation for data-driven decision-making. It provides a comprehensive understanding of historical data patterns and trends. The workflow that will be followed in performing descriptive analytics is shown below:

<img 
     src="../../00_Data/01_Assets/DescriptiveAnalytics.png" 
     alt="Descriptive Analytics Workflow"
     style="width:1000px;height:450px;"
     >

## ETL (EXTRACT, TRANSFORM, LOAD)

ETL stands for Extract, Transform, Load. It refers to the process of extracting raw data from a data source, transforming data into a tabular dataset, and loading data to a target destination such as a folder or database. The process of creating a tabular dataset plays a critical role in ensuring data quality, consistency, and usability for analysis. 

<div class="alert alert-info" role="alert">
  <h3 class="alert-heading">What is a tabular dataset?</h3>
  <p>A tabular dataset is a structured form of data, commonly used in analytics and machine learning. It organizes data into rows and columns, where each row represents an individual observation or record, and every column represents a specific attribute or feature of the data.</p>
  <hr>
  <p class="mb-0">Tabular datasets are typically stored in formats such as CSV (Comma-separated values), spreadsheets, or relational databases, and they are widely used in various domains for storing, analyzing, and visualizing data.</p>
</div>

Below is an explanation of the ETL process:

1. **Extract:**

    * **Data Source Connectivity:** ETL tools or scripts are used to connect to the data sources and retrieve the necessary data. This may involve querying databases, reading files, or pulling data from APIs.

    * **Data Extraction:** The extraction phase involves retrieving raw data from a data source. Some examples of common data sources include:
      
        *  Files (music, images, spreadsheets, or other digital files)
        *  Databases
        *  APIs
        *  Web services
        *  Web scraping
        *  Streaming data sources
        
        Data can be extracted in its raw form, or from a structured / semi-structured format such as: CSV, JSON, XML, or relational databases.

2. **Transform:**

    * **Cleaning and Standardization:** Creating a [tidy compliant](https://about.dataclassroom.com/blog/keep-your-data-tidy) dataset by cleaning, handling missing values, removing duplicates, correcting errors, and standardizing data and data type formats.

3. **Load:**

    * **Data Loading Strategies:** Describe how often the data is updated, and what kind of data pipeline is required to use it in a continuous analytics or machine learning project.
    * **Destination Schema:** Define the format that will be used. CSV is the most common, but data stored in a data warehouse, data lake, or database is typically loaded into a predefined schema or data model in the target destination. This ensures consistency and compatibility with downstream analytics and reporting tools.
    * **Data Loading:** The load phase involves loading the transformed data into a target destination, such as a folder, data warehouse, data lake, or database. This can be a one-time load or a continuous process, depending on the frequency of data updates and the requirements of the analysis.
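
The three ETL phases can be sketched with the Python standard library alone. This is a minimal illustration, not the project's pipeline: the function names `extract`, `transform`, and `load`, the in-memory `RAW_CSV` text, and its rows are all hypothetical, and dropping rows with missing values is only one of several possible cleaning choices.

```python
import csv
import io

# Hypothetical raw extract; in the project this would be the downloaded CSV file.
RAW_CSV = """carat,cut,price
0.30,Ideal,500
0.30,Ideal,500
0.50, premium ,
0.70,Good,2100
"""


def extract(source: str) -> list[dict[str, str]]:
    """Extract: read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows: list[dict[str, str]]) -> list[dict[str, object]]:
    """Transform: standardize text, convert types, drop incomplete and duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        values = {key: value.strip() for key, value in row.items()}
        if any(value == "" for value in values.values()):
            continue  # missing value: drop the row (an assumed policy)
        record = {
            "carat": float(values["carat"]),
            "cut": values["cut"].title(),
            "price": int(values["price"]),
        }
        key = tuple(record.values())
        if key not in seen:  # remove exact duplicates
            seen.add(key)
            cleaned.append(record)
    return cleaned


def load(rows: list[dict[str, object]]) -> str:
    """Load: serialize the cleaned rows as CSV text (a file or database in practice)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["carat", "cut", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tidy_rows = transform(extract(RAW_CSV))
print(load(tidy_rows))
```

Keeping each phase in its own function makes the process easier to document per data source, as the methodology requires.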

### SUMMARY

* ETL is the process of creating a tabular dataset from a data source.
* The ETL process must be documented for every data source used on a project.
* A dataset is fit-for-purpose if it is [tidy compliant](https://vita.had.co.nz/papers/tidy-data.pdf).

A tidy compliant dataset supports efficient data analysis: it reduces errors, simplifies data processing, and improves data visualization. It is a critical step to set the foundations for analytics and machine learning.

## DATA MINING

The goal of data mining is to combine all the datasets created during ETL into one **main** dataset, and to verify that the data meets [the tidy specification](https://vita.had.co.nz/papers/tidy-data.pdf) and any other quality requirements before it is used in analytics and machine learning.

The steps used to create this dataset are as follows:

1. **Data Collection:** Gather the relevant datasets, and store them together.
2. **Data Selection:** Identify the data in each dataset that is pertinent to the analysis objectives.
3. **Data Integration:** Combine the selected data into one unified dataset for analysis.
4. **Data Cleaning:** Perform data cleaning processes to handle missing values, duplicates, and inconsistencies. Clean data is data that meets [the tidy specification](https://vita.had.co.nz/papers/tidy-data.pdf). 
5. **Data Transformation:** Transform the data where required, making the dataset suitable for analysis and machine learning.
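
The selection, integration, and cleaning steps above can be sketched as follows. This project has a single data source, so the two input datasets, the `note` column, and the helper names `build_main_dataset` and `check_quality` are hypothetical and serve only to illustrate the idea.

```python
# Two hypothetical datasets produced by ETL; the project itself has a single source.
batch_a = [
    {"carat": 0.30, "cut": "Ideal", "price": 500, "note": "x"},
    {"carat": 0.70, "cut": "Good", "price": 2100, "note": "y"},
]
batch_b = [
    {"carat": 0.70, "cut": "Good", "price": 2100, "note": "z"},
    {"carat": 1.00, "cut": "Premium", "price": 5000, "note": "w"},
]

SELECTED = ("carat", "cut", "price")  # columns relevant to the analysis objectives


def build_main_dataset(*datasets):
    """Select the relevant columns, integrate all datasets, and drop duplicates."""
    main, seen = [], set()
    for dataset in datasets:
        for row in dataset:
            record = tuple(row[column] for column in SELECTED)
            if record not in seen:
                seen.add(record)
                main.append(dict(zip(SELECTED, record)))
    return main


def check_quality(rows):
    """Verify every row has exactly the selected columns with non-missing values."""
    return all(
        set(row) == set(SELECTED) and all(row[c] is not None for c in SELECTED)
        for row in rows
    )


main_dataset = build_main_dataset(batch_a, batch_b)
print(len(main_dataset), check_quality(main_dataset))  # 3 True
```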

### SUMMARY

* Data Mining is the process of combining multiple datasets into 1 main dataset.
* It allows for the dataset to be explainable, repeatable, and reproducible.
* This process also validates that the dataset complies with the [tidy specification](https://vita.had.co.nz/papers/tidy-data.pdf).

This process ensures that high quality data is used in analytics and machine learning. 

## DESCRIPTIVE STATISTICS

Descriptive statistics is the starting point for analytics. It is essential for understanding the basic properties of a dataset and identifying patterns and trends, and it serves as a foundation for further analysis, hypothesis testing, and model building. It provides methods for organizing, visualizing, and summarizing data to understand its central tendency, variability, and distribution, without making inferences or generalizations to a larger population.

The dataset will be described as follows: 

* **Measures of Central Tendency:** Calculate descriptive statistics such as mean, median, and mode to understand the central tendency of the data.
* **Measures of Dispersion:** Compute measures like standard deviation, variance, and range to assess the spread or variability of the data.
* **Frequency Distributions:** Create frequency tables or histograms to visualize the distribution of categorical and numerical variables.
* **Percentiles:** Calculate percentiles to identify specific data points' position within the dataset.
* **Skewness and Kurtosis:** Skewness measures the asymmetry of the distribution, indicating whether the data is skewed to the left or right. Kurtosis measures how heavy the tails of the distribution are, which indicates how prone the data is to extreme values.
* **Correlation Analysis:** Compute correlation coefficients to understand the relationships between pairs of variables.
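
As a rough illustration, the measures listed above can be computed with Python's built-in `statistics` module. The sample values are hypothetical, and the helpers `skewness`, `excess_kurtosis`, and `pearson` are written here for illustration using population (not sample-corrected) formulas, which is an assumption rather than a project requirement.

```python
import statistics as st

# Hypothetical prices (USD) and carat weights, for illustration only.
prices = [400, 500, 500, 900, 2700]
carats = [0.25, 0.30, 0.30, 0.50, 1.00]


def skewness(values):
    """Population skewness: the third standardized moment."""
    mean, sd = st.fmean(values), st.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def excess_kurtosis(values):
    """Population excess kurtosis: the fourth standardized moment minus 3."""
    mean, sd = st.fmean(values), st.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = st.fmean(xs), st.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = {
    "mean": st.fmean(prices),
    "median": st.median(prices),
    "mode": st.mode(prices),
    "stdev": st.stdev(prices),
    "variance": st.variance(prices),
    "range": max(prices) - min(prices),
    "quartiles": st.quantiles(prices, n=4),
    "skewness": skewness(prices),
    "excess_kurtosis": excess_kurtosis(prices),
    "corr_carat_price": pearson(carats, prices),
}
for name, value in summary.items():
    print(name, value)
```

In this made-up sample the mean sits well above the median, which is the kind of right skew that a single expensive stone produces.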

### SUMMARY

* Descriptive statistics offers a high-level overview of the dataset.
* Enables a deep understanding of the data, helping identify potential issues and set expectations for how the data can be used in analytics and machine learning.
* Provides a concise summary of the data, helping understand the underlying characteristics, patterns, and trends.

Descriptive statistics validates that the dataset is fit-for-purpose, explainable, repeatable, and reproducible. It provides valuable context for more advanced analytical techniques, such as predictive and prescriptive analytics, enabling organizations to build upon this foundational knowledge to make more accurate forecasts and informed decisions.

## EDA (EXPLORATORY DATA ANALYSIS)

The primary goal of EDA is to understand the data and uncover patterns, trends, relationships, and anomalies that may be present in the data. It involves exploring and visualizing the data to gain an understanding of its characteristics, distributions, and relationships. EDA helps identify potential patterns and insights that can guide further analysis. 

The key components of EDA include:

* **Univariate Analysis:** Explore individual variables to understand their distributions, outliers, and potential patterns.
* **Bivariate Analysis:** Investigate relationships between pairs of variables through scatter plots, correlation matrices, or box plots.
* **Multivariate Analysis:** Examine interactions between multiple variables using techniques like heatmaps, pair plots, or dimensionality reduction methods.
* **Visualization:** Create visualizations such as histograms, bar charts, line plots, and heatmaps to represent data distributions and relationships effectively.
* **Pattern Identification:** Identify trends, anomalies, clusters, or patterns within the data using statistical methods or visualization tools.
* **Hypothesis Generation:** The process of formulating potential explanations or theories about relationships, patterns, or phenomena observed in the data. Hypothesis generation is a critical step in the data analysis process as it guides further investigation and hypothesis testing to validate or refute the proposed hypotheses.
* **ANOVA (Analysis of Variance):** A statistical technique that tests whether the means of independent groups differ by a statistically significant amount. It is typically applied to three or more groups, since a t-test already covers the two-group case. ANOVA is only performed where required, and is subject to a project change order.
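
To make the ANOVA item concrete, the sketch below computes the one-way ANOVA F-statistic by hand: the variance between group means divided by the variance within groups. The grouped prices and the function name `one_way_anova_f` are hypothetical. Turning the F-statistic into a p-value requires the F distribution, which a library routine such as `scipy.stats.f_oneway` would provide.

```python
import statistics as st

# Hypothetical prices (USD) grouped by cut, for illustration only.
groups = {
    "Good": [400, 500, 600],
    "Premium": [700, 800, 900],
    "Ideal": [1000, 1100, 1200],
}


def one_way_anova_f(samples):
    """Return the one-way ANOVA F-statistic for a list of groups."""
    all_values = [v for sample in samples for v in sample]
    grand_mean = st.fmean(all_values)
    k, n = len(samples), len(all_values)
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(s) * (st.fmean(s) - grand_mean) ** 2 for s in samples)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - st.fmean(s)) ** 2 for s in samples for v in s)
    # Each sum of squares is divided by its degrees of freedom.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_stat = one_way_anova_f(list(groups.values()))
print(f_stat)  # 27.0
```

A large F-statistic suggests that the group means differ by more than the within-group variation would explain.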

### SUMMARY

* EDA is a conversation with the data.
* The specific techniques and tools used will vary depending on the dataset.
* You gain a deeper understanding of your dataset, setting the stage for informed modeling, analysis, and insights!

# CONCLUSION

This notebook describes the project objectives and the methodology that will be followed in executing an analytical project that is explainable, repeatable, and reproducible.

## NEXT STEPS

NOTEBOOK 00_02 describes the business rules related to diamond pricing.