# Predicting Diamond Prices

## Phase 1: Data preparation & Visualisation

<br>


Group name: Group 27 <br>

Name & IDs of group members: 

## Table of Contents
  * [Introduction](#Introduction)
    + [Dataset Source](#Dataset-Source)
    + [Dataset Details](#Dataset-Details)
    + [Dataset Variables](#Dataset-Variables)
  * [Target Feature](#Target-Feature)
  * [Goals and Objectives](#Goals-and-Objectives)
  * [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
    + [Missing Values](#Missing-Values)
    + [Incorrect Values](#Incorrect-Values)
    + [Calculating and Removing Outliers](#Calculating-and-Removing-Outliers)
    + [Aggregation](#Aggregation)
    + [Renaming Columns](#Renaming-Columns)
    + [Random Sampling](#Random-Sampling)
  * [Data Exploration and Visualisation](#Data-Exploration-and-Visualisation)
    + [Univariable Visualisation](#Univariable-Visualisation)
    + [Two Variable Visualisation](#Two-Variable-Visualisation)
    + [Three Variable Visualisation](#Three-Variable-Visualisation)
  * [Summary and Conclusion](#Summary-and-Conclusion)
  * [References](#References)

# Introduction 

## Dataset Source

The 'Diamonds' dataset, used in a study conducted by Shivam Agrawal, was sourced from Kaggle (2022). It describes almost 54,000 diamonds by their cut, colour, clarity, price and other attributes.

## Dataset Details

This dataset records the attributes of individual diamonds to support data analysis and visualisation. The attributes are carat, cut, color, clarity, depth percentage, table, price, length, width and depth. Together they should provide enough information to predict diamond prices through predictive modelling.

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None) 

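# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
# The "seaborn" matplotlib style name is deprecated in recent matplotlib releases; apply seaborn's default theme instead
sns.set_theme()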

#Name of the dataset to be imported and read
df = pd.read_csv('diamonds.csv')

print("Number of rows in the dataset:", df.shape[0])
print("Number of columns in the dataset:", df.shape[1])

These are 10 randomly sampled observations from the Diamonds dataset.

In [None]:
df.sample(10, random_state=5)

## Dataset Variables

The attributes of each column in this dataset are described in this table.

#### Table 1: Description of features in this dataset

| Name | Data type | Units | Description | 
| :-- | :-- | :-- | :-- |
| Index counter | Discrete Numeric | NA | Index of each diamond |
| Carat | Continuous Numeric | Carats | Carat weight of the diamond (1 carat = 0.20 g) |
| Cut | Ordinal Categorical | NA | Quality of the cut; in increasing order: Fair, Good, Very Good, Premium, Ideal |
| Color | Ordinal Categorical | NA | Colour grade of the diamond; (best) D, E, F, G, H, I, J (worst) |
| Clarity | Ordinal Categorical | NA | How obvious inclusions (small imperfections) are within the diamond. Listed from best to worst: <br> <b>IF:</b> Internally Flawless <br><b>VVS1 or VVS2:</b> Very Very Slightly Included <br> <b>VS1 or VS2:</b> Very Slightly Included <br> <b>SI1 or SI2:</b> Slightly Included <br> <b>I1 or I2:</b> Included |
| Table | Continuous Numeric | Percentage | Width of the diamond's table (the facet seen when the diamond is viewed face up) relative to its widest point |
| price | Continuous Numeric | US dollars | Cost of the diamond |
| x | Continuous Numeric | Millimetres | Length of the diamond |
| y | Continuous Numeric | Millimetres | Width of the diamond |
| z | Continuous Numeric | Millimetres | Depth of the diamond |
| Depth | Continuous Numeric | Percentage | Depth percentage: the depth measured from the culet (flat face at the bottom of the gemstone) to the table, divided by the girdle diameter (the girdle is the line that separates the crown from the pavilion at the edge of a diamond) |
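
The depth percentage in Table 1 can be illustrated with a small calculation. The sketch below assumes the definition commonly documented for this dataset, depth percentage = z / mean(x, y) × 100, and uses made-up measurements rather than rows from `diamonds.csv`; the frame `dims` and the column `depth_pct` are hypothetical names.

```python
import pandas as pd

# Hypothetical measurements in millimetres for three diamonds (not taken from diamonds.csv)
dims = pd.DataFrame({
    "x": [4.00, 5.00, 6.00],   # length
    "y": [4.00, 5.10, 6.10],   # width
    "z": [2.48, 3.10, 3.70],   # depth
})

# Assumed definition: depth percentage = depth / mean(length, width) * 100
dims["depth_pct"] = (100 * dims["z"] / dims[["x", "y"]].mean(axis=1)).round(1)
print(dims)
```

Recorded depth percentages in the real data may differ slightly from this calculation because the published measurements are rounded.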

# Target Feature

The aim of this report is to investigate how a range of variables affects the perceived value of a diamond. The target feature for this project is therefore the price of a diamond in US dollars, as it is a direct representation of its value.

# Goals and Objectives

Throughout history, people have been drawn to exquisite and unique items. Diamonds have been prized as jewels since ancient times, are admired for their brilliance, and are regarded as the pinnacle of luxury in jewellery. They are treasured for much more than their alluring beauty, though: their physical properties allow people to use them for many other purposes, such as cutting tools and other tasks requiring durability. These distinctive characteristics make diamonds valued above all other stones and the most popular gemstone in the world.

Because of these qualities, a predictive model for diamond prices would have many practical applications in the real world. For example, it could help buyers determine whether the price of a particular diamond is reasonable. Potential sellers of jewellery could also use the model to estimate the price of their diamond.

There are two main objectives in this project. The first is to predict the price of diamonds from a number of features and to identify which features appear to be the strongest predictors of price. The second, which follows the data preprocessing and preparation that are the focus of this Phase 1 report, is to undertake exploratory data analysis using basic descriptive statistics and data visualisation plots to gain insight into the patterns and correlations present in the data.

At this stage, we assume that the rows of the dataset are independent of one another. That is, we assume that the price of one diamond does not affect the price of any other diamond in the dataset. This assumption allows us to use traditional predictive models such as multiple linear regression.

# Data Cleaning and Preprocessing

This process aims to identify data quality issues and transform the data so that the raw data is suitable for processing and analytics such as predictive modelling, which increases the validity of the data. The data cleaning and preprocessing undertaken for this project include:
- Checking for missing values and removing rows accordingly
- Identifying and removing incorrect values such as outliers
- Data aggregation, including encoding categorical columns
- Checking column names and modifying them if necessary

Let's first display all columns in the dataset.

In [None]:
df.columns

## Missing Values 

In [None]:
#check for missing values here 
print("Number of missing values for each column:")
df.isnull().sum()

As there are no null or missing values in this dataset, no rows will need to be dropped for this step.
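
For comparison, the sketch below shows how rows would be dropped if missing values had been found. It uses a made-up three-row frame (`toy`, a hypothetical name) rather than the diamonds data, so it only illustrates the mechanics of the check and the removal.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing price; the values are made up for illustration
toy = pd.DataFrame({
    "carat": [0.23, 0.31, 0.70],
    "price": [326.0, np.nan, 2757.0],
})

missing_per_column = toy.isnull().sum()  # same check as in the cell above
toy_clean = toy.dropna()                 # drop every row that contains a missing value

print(missing_per_column)
print("Shape after dropping rows:", toy_clean.shape)
```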

## Incorrect Values

The aim of this step is to identify values that are inherently incorrect, such as negative values or outliers.

In [None]:
#check for outliers
from IPython.display import display, HTML
display(HTML('<b>Table 2: Summary of numerical features</b>'))
df.describe(include=['int64','float64']).T

The summary of numerical features shows that no feature has a negative value, as all the minimum values are greater than or equal to zero. However, closer examination of the values in Table 2 indicates the existence of outliers in the following columns:
- carat
- price
- x
- y
- z

These were identified by comparing the minimum and maximum values of each column with its mean, as a minimum or maximum that lies far from the mean may indicate the presence of outliers.
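
A minimum of exactly zero is not negative, but for a physical dimension such as x, y or z it would still be an impossible value. Whether the dataset contains such rows should be read from Table 2; the sketch below only shows, on a made-up frame (`toy` and `dimension_cols` are hypothetical names), how they could be counted and filtered.

```python
import pandas as pd

# Made-up dimensions in millimetres; two rows contain an impossible zero
toy = pd.DataFrame({
    "x": [4.0, 0.0, 6.1],
    "y": [4.1, 5.0, 6.0],
    "z": [2.5, 3.1, 0.0],
})

dimension_cols = ["x", "y", "z"]
has_zero_dim = (toy[dimension_cols] == 0).any(axis=1)  # True where any dimension is zero
n_zero = int(has_zero_dim.sum())
toy_valid = toy[~has_zero_dim]                         # keep rows with strictly positive dimensions

print("Rows with a zero dimension:", n_zero)
print(toy_valid)
```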

## Calculating and Removing Outliers

In [None]:
#calculating whisker for carat
iqr_c = 1.04 - 0.40
lowerwhisker_c = 0.40 - 1.5*iqr_c
upperwhisker_c = 1.04 + 1.5*iqr_c
print(f"Lower whisker: {lowerwhisker_c}, Upper whisker: {upperwhisker_c}")

df = df[(df['carat'] > lowerwhisker_c) & (df['carat'] < upperwhisker_c)]

In [None]:
#calculating whisker for length
iqr_x = 6.54 - 4.71
lowerwhisker_x = 4.71 - 1.5*iqr_x
upperwhisker_x = 6.54 + 1.5*iqr_x
print(f"Lower whisker: {lowerwhisker_x}, Upper whisker: {upperwhisker_x}")

df = df[(df['x'] > lowerwhisker_x) & (df['x'] < upperwhisker_x)]

In [None]:
#calculating whisker for width
iqr_y = 6.54 - 4.72
lowerwhisker_y = 4.72 - 1.5*iqr_y
upperwhisker_y = 6.54 + 1.5*iqr_y
print(f"Lower whisker: {lowerwhisker_y}, Upper whisker: {upperwhisker_y}")

df = df[(df['y'] > lowerwhisker_y) & (df['y'] < upperwhisker_y)]

In [None]:
#calculating whisker for depth

iqr_z = 4.04 - 2.91
lowerwhisker_z = 2.91 - 1.5*iqr_z
upperwhisker_z = 4.04 + 1.5*iqr_z
print(f"Lower whisker: {lowerwhisker_z}, Upper whisker: {upperwhisker_z}")

df = df[(df['z'] > lowerwhisker_z) & (df['z'] < upperwhisker_z)]

In [None]:
#calculating whisker for price
iqr_p = 5324.25 - 950.0
# lower whisker is based on Q1 (950.0), upper whisker on Q3 (5324.25)
lowerwhisker_p = 950.0 - 1.5*iqr_p
upperwhisker_p = 5324.25 + 1.5*iqr_p
print(f"Lower whisker: {lowerwhisker_p}, Upper whisker: {upperwhisker_p}")

df = df[(df['price'] > lowerwhisker_p) & (df['price'] < upperwhisker_p)]

In [None]:
df.describe(include=['int64','float64']).T 
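
The cells above type the quartiles in by hand from Table 2. As an alternative, the whiskers could be computed directly from the data. The sketch below is one way to do this under the same 1.5 × IQR rule; `iqr_bounds` is a hypothetical helper and `toy` is made-up data, not part of the original analysis.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower whisker, upper whisker) using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up carat values with one obvious outlier
toy = pd.DataFrame({"carat": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]})

lower, upper = iqr_bounds(toy["carat"])
# Keep only rows strictly inside the whiskers, as in the cells above
toy_filtered = toy[(toy["carat"] > lower) & (toy["carat"] < upper)]

print(f"Lower whisker: {lower:.2f}, Upper whisker: {upper:.2f}")
print("Rows kept:", len(toy_filtered))
```

Computing the bounds from the column avoids transcription slips. Note that the quartiles change after each filter is applied, so the result depends on whether the bounds are taken from the original data or recomputed after each step.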

## Aggregation

In [None]:
#encoding cut (0 = worst, 4 = best)
cut = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
# assign the result back rather than replacing in place on a column selection
df['cut'] = df['cut'].replace(cut)
df.head()

In [None]:
#encoding color (0 = worst, 6 = best)
color = {'D': 6, 'E': 5, 'F': 4, 'G': 3, 'H': 2, 'I': 1, 'J': 0}
df['color'] = df['color'].replace(color)
df.head()

In [None]:
#encoding clarity (0 = worst, 8 = best)
clarity = {'I2': 0, 'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}
df['clarity'] = df['clarity'].replace(clarity)
df.head()

In [None]:
print(f"This is the dataset shape: {df.shape} \n")
print("These are the data types; 'object' stands for string type:")
print(df.dtypes)
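
The dictionaries above encode each grade as an integer by hand. An alternative that keeps the grade labels while still recording their order is an ordered categorical. The sketch below shows the idea for cut on a made-up frame; `cut_order`, `toy` and `cut_code` are hypothetical names.

```python
import pandas as pd

# Grades listed from worst to best, matching the order used in the encoding above
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium", "Good"]})
toy["cut"] = pd.Categorical(toy["cut"], categories=cut_order, ordered=True)

# Integer codes follow the category order, so Fair -> 0 and Ideal -> 4
toy["cut_code"] = toy["cut"].cat.codes
print(toy)
```

The integer codes match the hand-written mapping, and the original labels remain available for plot axes and legends.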

## Renaming Columns

In [None]:
df.columns

In [None]:
#Changing name of columns
df.columns = df.columns.str.lower().str.strip()

columns_mapping = {
    'depth': 'depth percentage',
    'x': 'length',
    'y': 'width',
    'z': 'depth',
}

df = df.rename(columns = columns_mapping)
df.sample(5, random_state=999)

## Random Sampling

In [None]:
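# draw a reproducible random sample of 1,500 diamonds
df = df.sample(n=1500, random_state=5)
print(df.shape)  # print explicitly; only the last expression of a cell is displayed
df.sample(5, random_state=5)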
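
Fixing `random_state` is what makes the sample reproducible between runs. The sketch below demonstrates this on a made-up frame (`toy`, `sample_a` and `sample_b` are hypothetical names): two calls with the same seed select the same rows.

```python
import pandas as pd

# Made-up frame with 100 rows
toy = pd.DataFrame({"price": range(100)})

sample_a = toy.sample(n=10, random_state=5)
sample_b = toy.sample(n=10, random_state=5)

# The same seed selects the same rows in the same order
same = sample_a.index.equals(sample_b.index)
print("Samples identical:", same)
```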

# Data Exploration and Visualisation

## Univariable Visualisation

### Histogram of diamond length

In [None]:
sns.histplot(df['length'], bins=10, color='deeppink')
plt.xlabel("Length of a diamond (mm)")
plt.ylabel("Count")
plt.title("Histogram of length of diamonds")
plt.show()

### Boxplot of price of diamond

In [None]:
plt.title("Boxplot of price of diamond")
# pass the column by keyword; a boxplot has no count axis, so only the price axis is labelled
sns.boxplot(x=df['price'], color='skyblue')
plt.xlabel('Price (US dollars)')
plt.show()

### Histogram of diamond clarity grade

In [None]:
plt.title("Histogram of Clarity of Diamonds")
# clarity holds integer codes, so draw one bar per code instead of arbitrary bin edges
sns.histplot(df['clarity'], discrete=True, color='red')
plt.xlabel("Grade for Clarity of Diamonds")
plt.ylabel("Number of Diamonds")
plt.show()

### Histogram of carat

In [None]:
sns.histplot(df['carat'], bins = 10)
plt.xlabel("Carat of a diamond")
plt.ylabel("Count")
plt.title("Histogram of Carat of Diamonds")
plt.show();

## Two Variable Visualisation

### Scatterplot of price of diamonds according to their length

In [None]:
sns.scatterplot(x=df['length'], y=df['price'])  # x and y must be passed by keyword in recent seaborn versions
plt.xlabel("Length of a diamond")
plt.ylabel("Price of diamond")
plt.title("Scatter Plot of length and price")
plt.show();

### Boxplot of price grouped by clarity

In [None]:
plt.title("Boxplot of price grouped by clarity")
sns.boxplot(x="clarity", y="price", data=df, palette="rocket_r")
plt.show()

### Scatterplot of table and depth percentage against price

In [None]:
fig = plt.figure()
fig.set_figwidth(20)
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)



ax1.scatter(df['table'], df['price'])
ax1.set_title('Scatterplot of price and table percentage')
ax1.set_xlabel('table percentage')
ax1.set_ylabel('price')


ax2.scatter(df['depth percentage'], df['price'])
ax2.set_title('Scatterplot of price and depth percentage')
ax2.set_xlabel('depth percentage')
ax2.set_ylabel('price')
plt.show()

In [None]:
#alternate version 
sns.relplot(data=df, x="carat", y="price", kind="line")
plt.title('Plot of carat and price of the diamond')
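
The two-variable plots can be complemented with a numeric summary. The sketch below ranks features by their Pearson correlation with price on a made-up frame (`toy` and `corr_with_price` are hypothetical names; the numbers are illustrative and are not results from the diamonds data).

```python
import pandas as pd

# Made-up values chosen so that carat tracks price closely and table does not
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.2],
    "table": [57.0, 55.0, 60.0, 56.0, 58.0],
    "price": [500, 1200, 2300, 4800, 6500],
})

# Pearson correlation of every feature with price, strongest first
corr_with_price = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

Applied to the cleaned `df`, the same expression would give a first indication of which features are most strongly associated with price, keeping in mind that Pearson correlation only captures linear relationships.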

## Three Variable Visualisation

### Scatterplot of diamond price against length, categorised by color

In [None]:
sns.scatterplot(x=df['length'], y=df['price'], hue=df['color'])
plt.title("Scatterplot of price by length and color")
plt.xlabel("Length of a diamond")
plt.ylabel("Price of diamond")
plt.show();

In [None]:
sns.scatterplot(x='carat', y='price', hue='length', data=df, palette="viridis")
plt.title('Scatterplot of Diamond Prices by Carat and Length', fontsize = 15)
plt.xlabel('Carat')  # the x-axis shows carat; length is shown by colour
plt.ylabel('Diamond Prices')
plt.legend(loc='lower right', title='Length')
plt.show();

In [None]:
sns.scatterplot(x=df['carat'], y=df['price'], hue=df['clarity'], palette="rocket_r")
plt.title("Scatter plot of Price by Carat coloured by Clarity")
plt.show();

In [None]:
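# x and y are passed by keyword; the returned Axes is named ax to avoid confusion with the x column
ax = sns.boxplot(x=df['color'], y=df['price'], hue=df['cut'], palette="flare")

ax.legend(bbox_to_anchor=(1.05, 1),loc=2, borderaxespad=0.,
          title="Cut Grade");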

In [None]:
bp=sns.boxplot(x=df['clarity'],
               y=df['price'],
               hue=df['cut'], 
               data=df, palette="viridis");

bp.legend(bbox_to_anchor=(1.05, 1),loc=2, borderaxespad=0.,
          title="Cut Grade");

plt.title("Boxplot of price according to clarity and cut");
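
The grouped boxplots can also be summarised as a table of medians, which makes individual groups easier to compare. The sketch below builds such a table with `pivot_table` on made-up encoded values (`toy` and `median_price` are hypothetical names, not results from the diamonds data).

```python
import pandas as pd

# Made-up encoded grades and prices
toy = pd.DataFrame({
    "clarity": [3, 3, 3, 5, 5, 5],
    "cut":     [2, 4, 4, 2, 4, 4],
    "price":   [900, 1000, 1200, 1500, 1800, 2000],
})

# Median price for every clarity (rows) and cut (columns) combination
median_price = toy.pivot_table(index="clarity", columns="cut",
                               values="price", aggfunc="median")
print(median_price)
```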

In [None]:
sns.scatterplot(x='clarity', y='price', hue='length', data=df)
plt.title('Scatterplot of Diamond Prices by Clarity and Length', fontsize = 15);
plt.xlabel('Clarity')
plt.ylabel('Diamond prices')
plt.legend(loc='lower right', title='Length')
plt.show();

# Summary and Conclusion

add the summary here

# References

Uses of diamonds - Mining for schools. (n.d.). www.miningforschools.co.za. https://www.miningforschools.co.za/lets-explore/diamond/uses-of-diamonds