# fill in with table of content later (5)

# introduction 

## Dataset source (3)

The 'Diamonds' dataset was used in a study conducted by Shivam Agrawal and sourced from Kaggle 2022. It analyzes almost 54,000 diamonds by their cut, colour, clarity, price and other attributes.

## Dataset details (5)

This dataset involves various details about diamonds to help with data analysis and visualization based on their attributes. These attributes include carat, cut, color, clarity, depth percentage, table, price, length, width and depth. These attributes make it sufficient enough to predict the price of diamonds through predictive modelling.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None) 

###
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn")

df = pd.read_csv('diamonds.csv')

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])

Number of rows: 53940
Number of columns: 10


Here are 10 randomly generated observations from the Diamonds dataset.

In [2]:
df.sample(10, random_state=5)

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
4500,0.9,Premium,H,VS2,60.0,59.0,6.23,6.16,3.72,3629
38643,0.33,Ideal,F,VVS2,61.9,56.0,4.46,4.42,2.75,1040
4924,0.9,Good,H,SI1,57.1,56.0,6.33,6.28,3.6,3726
38167,0.34,Ideal,G,IF,60.1,57.0,4.54,4.58,2.74,1014
1937,0.9,Good,E,SI2,64.1,57.0,6.09,6.07,3.9,3084
34265,0.31,Premium,J,SI1,60.9,60.0,4.38,4.36,2.66,465
21774,1.51,Very Good,G,SI1,62.9,58.0,7.25,7.28,4.57,9841
4061,0.34,Very Good,E,SI2,63.1,56.0,4.51,4.46,2.83,571
10715,0.31,Good,D,SI2,63.7,55.0,4.35,4.32,2.76,593
9527,1.02,Very Good,H,SI1,63.0,58.0,6.34,6.4,4.01,4617


## Dataset variables (18) 

explain variables in dataset (in table format with four columns: name of variable, data type, units, brief description) (lavinia is doing it)

| Name | Datatype | Units | Description | 
| :-- | :-- | :-- | :-- |
| Index counter | Discrete Numeric | NA | Index of each diamond |
| Carat | Continuous Numeric | Carats | Carat weight of diamond (1 carat = 0.20g) |
| Cut | Ordinal Categorical | NA | Quality of cut; Increasing order: Fair, Good, Very Good,Premium, Ideal |
| Color | Ordinal Categorical | NA | Colour grade of diamond; (best)D, E, F, G, H, I, J(worst) |
| Clarity | Ordinal Categorical | NA | How obvious inclusions(small imperfections) are within the diamonds. List from best to worst: <br> <b>IF:</b> flawless <br><b>VVS1 or VVS2:</b> Very Very Slightly Included <br> <b>VS1 or VS2:</b> Very Slightly Included <br> <b>S1 or S2</b>: Slightly Included <br> <b>nI1 or I2</b>:Included|
| Table | Continuous Numeric | Percentage | width of the diamond's table(facet seen when diamond is viwed face up) relative to it's widest point |
| price | Continuous Numeric | US dollars | cost of the diamond |
| x | Continuous Numeric | Millimeter | length of the diamond |
| y | Continuous Numeric | Millimeter | width of the diamond |
| y | Continuous Numeric | Millimeter | depth of the diamond |
| Depth | Continuous Numeric | percentage | Depth percentage measured from the cutlet(flat face at the bottom of the gemstone) to the table, divided by its girdle(line that separates the crown from the pavilion of the edge of a diamond) diameter |

## Target variables (2)

The aim of this report is to investigate how a range of different variable can impact the price/value of a diamond. Therefore, the Target feature for this project will be price of diamonds in US dollars. 

# Goals and objectives (7)

Throughout history, people have been drawn to exquisite, unique items. Diamonds are still regarded as the pinnacle of luxury in jewelery since they have been prized as jewels from ancient times and are admired for their brilliance. Diamonds are treasured for much more than just their alluring beauty, though. They have different qualities that allow people to use it for many different purposes such as a cutting tool, and other tasks requiring durability. This makes diamonds valued beyond all other stones due to their distinctive physical characteristics and are the most popular gemstone in the world.

Because of these different aspects that we discussed, a predictive model for diamonds' prices would have many practical use and applications in the real world. For example, it could help buyers determine if the price of a singular diamond is reasonable. Potential sellers of jewelry could also use this model to predict an estimate of the price of their diamond.

There are 2 main objectives in this project. The first one is to predict the price of diamonds based on a number of different features, and which features appear to be the greatest indicators or predictors of the diamonds' prices. In addition to that, after some data preprocessing and preparation, which is the focus of this Phase 1 report, the second goal is to undertake some exploratory data analysis using basic descriptive statistics and data visualisation plots to obtain some insight into the patterns and correlations existent in the data.

At this stage, our presumption is that our dataset's rows are not associated. That is, we are assuming that the price of a certain diamond doesn't affect the price of another in this dataset. By making this assumption, we are able to utilize traditional predictive models such as multiple linear regression.
(https://www.miningforschools.co.za/lets-explore/diamond/uses-of-diamonds)

# Data cleaning and processing (15)

This process aims to ... 

### Missing values 

In [3]:
#check for missing values here 
print("Number of missing values for each column:")
df.isnull().sum()

Number of missing values for each column:


carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
price      0
dtype: int64

### Incorrect values

In [4]:
#check for outliers
from IPython.display import display, HTML
display(HTML('<b>Table 2: Summary of numerical features</b>'))
df.describe(include=['int64','float64']).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
carat,53940.0,0.79794,0.474011,0.2,0.4,0.7,1.04,5.01
depth,53940.0,61.749405,1.432621,43.0,61.0,61.8,62.5,79.0
table,53940.0,57.457184,2.234491,43.0,56.0,57.0,59.0,95.0
x,53940.0,5.731157,1.121761,0.0,4.71,5.7,6.54,10.74
y,53940.0,5.734526,1.142135,0.0,4.72,5.71,6.54,58.9
z,53940.0,3.538734,0.705699,0.0,2.91,3.53,4.04,31.8
price,53940.0,3932.799722,3989.439738,326.0,950.0,2401.0,5324.25,18823.0


In [5]:
#describe outlier stuff, if values super far from mean then it's outlier (look at width, depth and price)

In [6]:
#whisker stuff for carat
iqr_c = 1.04 - 0.40
lowerwhisker_c = 0.40 - 1.5*iqr_c
upperwhisker_c = 1.04 + 1.5*iqr_c
print(f"Lower whisker: {lowerwhisker_c}, Upper whisker: {upperwhisker_c}")

Lower whisker: -0.5599999999999999, Upper whisker: 2.0


In [7]:
df['carat'] = df[(df['carat'] > lowerwhisker_c) & (df['carat'] < upperwhisker_c)]['carat']

In [8]:
#whisker stuff for width
iqr_y = 6.54 - 4.72
lowerwhisker_y = 4.72 - 1.5*iqr_y
upperwhisker_y = 6.54 + 1.5*iqr_y
print(f"Lower whisker: {lowerwhisker_y}, Upper whisker: {upperwhisker_y}")

Lower whisker: 1.9899999999999993, Upper whisker: 9.27


In [9]:
df['y'] = df[(df['y'] > lowerwhisker_y) & (df['y'] < upperwhisker_y)]['y']

In [10]:
#whisker stuff for depth
iqr_z = 4.04 - 2.91
lowerwhisker_z = 2.91 - 1.5*iqr_z
upperwhisker_z = 4.04 + 1.5*iqr_z
print(f"Lower whisker: {lowerwhisker_z}, Upper whisker: {upperwhisker_z}")

Lower whisker: 1.2150000000000003, Upper whisker: 5.734999999999999


In [11]:
df['z'] = df[(df['z'] > lowerwhisker_z) & (df['z'] < upperwhisker_z)]['z']

In [12]:
#whisker stuff for price
iqr_p = 5324.25 - 950.0
lowerwhisker_p = 5324.25 - 1.5*iqr_p
upperwhisker_p = 950.0 + 1.5*iqr_p
print(f"Lower whisker: {lowerwhisker_p}, Upper whisker: {upperwhisker_p}")

Lower whisker: -1237.125, Upper whisker: 7511.375


In [13]:
df['price'] = df[(df['price'] > lowerwhisker_p) & (df['price'] < upperwhisker_p)]['price']

In [24]:
df.describe(include=['int64','float64']).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
carat,51786.0,0.742335,0.393009,0.2,0.39,0.7,1.02,1.99
cut,53940.0,2.904097,1.1166,0.0,2.0,3.0,4.0,4.0
color,53940.0,3.405803,1.701105,0.0,2.0,3.0,5.0,6.0
clarity,53940.0,4.05102,1.647136,1.0,3.0,4.0,5.0,8.0
depth percentage,53940.0,61.749405,1.432621,43.0,61.0,61.8,62.5,79.0
table,53940.0,57.457184,2.234491,43.0,56.0,57.0,59.0,95.0
length,53940.0,5.731157,1.121761,0.0,4.71,5.7,6.54,10.74
width,53911.0,5.732353,1.109132,3.68,4.72,5.71,6.54,9.26
depth,53891.0,3.538265,0.689473,1.41,2.91,3.53,4.03,5.73
price,45583.0,2495.628239,1916.805856,326.0,862.0,1822.0,3909.5,7511.0


### ID-like columns

In [14]:
#make id column into index for this table

### aggregation

In [15]:
#encoding cut
cut = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
df['cut'].replace(cut, inplace=True)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,0.23,4,E,SI2,61.5,55.0,3.95,3.98,2.43,326.0
1,0.21,3,E,SI1,59.8,61.0,3.89,3.84,2.31,326.0
2,0.23,1,E,VS1,56.9,65.0,4.05,4.07,2.31,327.0
3,0.29,3,I,VS2,62.4,58.0,4.2,4.23,2.63,334.0
4,0.31,1,J,SI2,63.3,58.0,4.34,4.35,2.75,335.0


In [16]:
#encoding color
color = {'D': 6, 'E': 5, 'F': 4, 'G': 3, 'H': 2, 'I': 1, 'J': 0}
df['color'].replace(color, inplace=True)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,0.23,4,5,SI2,61.5,55.0,3.95,3.98,2.43,326.0
1,0.21,3,5,SI1,59.8,61.0,3.89,3.84,2.31,326.0
2,0.23,1,5,VS1,56.9,65.0,4.05,4.07,2.31,327.0
3,0.29,3,1,VS2,62.4,58.0,4.2,4.23,2.63,334.0
4,0.31,1,0,SI2,63.3,58.0,4.34,4.35,2.75,335.0


In [17]:
#encoding clarity
clarity = {'I2': 0, 'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}
df['clarity'].replace(clarity, inplace=True)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,0.23,4,5,2,61.5,55.0,3.95,3.98,2.43,326.0
1,0.21,3,5,3,59.8,61.0,3.89,3.84,2.31,326.0
2,0.23,1,5,5,56.9,65.0,4.05,4.07,2.31,327.0
3,0.29,3,1,4,62.4,58.0,4.2,4.23,2.63,334.0
4,0.31,1,0,2,63.3,58.0,4.34,4.35,2.75,335.0


In [18]:
print(f"This is the dataset shape: {df.shape} \n")
print(f"These are the data types; 'object' stands for string type:")
print(df.dtypes)

This is the dataset shape: (53940, 10) 

These are the data types; 'object' stands for string type:
carat      float64
cut          int64
color        int64
clarity      int64
depth      float64
table      float64
x          float64
y          float64
z          float64
price      float64
dtype: object


In [19]:
#Changing name of columns
df.columns = df.columns.str.lower().str.strip()

columns_mapping = {
    'depth': 'depth percentage',
    'x': 'length',
    'y': 'width',
    'z': 'depth',
}

df = df.rename(columns = columns_mapping)
df.sample(5, random_state=999)

Unnamed: 0,carat,cut,color,clarity,depth percentage,table,length,width,depth,price
38848,0.4,4,1,8,62.2,56.0,4.75,4.71,2.94,1050.0
9023,1.04,4,2,3,61.9,57.0,6.49,6.46,4.01,4515.0
51799,0.75,3,6,2,60.6,56.0,5.94,5.9,3.59,2415.0
35562,0.35,3,3,5,61.2,58.0,4.54,4.51,2.77,906.0
18923,1.49,2,3,2,62.5,58.0,7.2,7.26,4.52,


In [20]:
#samples

# Data Exploration and Visualisation(15)

two people need to do 3 graphs, 3 people need to do 4 graphs 
this needs to be a mix of scatter, bar, box, count,

### Univariable Visualisation

In [21]:
#add min of 6 graphs

### Two Variable Visualisation

In [22]:
#add min of 6 graphs

### Three Variable Visualisation

In [23]:
#add min of 6 graphs

# Literature review (optional)

Minimum 10 journal articles and 4 conference papers

# Summary and conclusion

add the summary here

# References

add references here