# Analysing the Diamonds Dataset from Basic to Advanced with R

This analysis explores a dataset of 53,940 diamonds using a variety of graphical techniques, and explains how to interpret the results and draw insights from a high-level perspective. Data-driven decisions are only as good as the data analysed and the ability to draw insights from it. The analysis starts from the raw dataset and proceeds step by step, so let's see where it ends.


In [None]:
# Import dataset from input directory
diamonds<-read.csv("../input/diamonds/diamonds.csv")

# Let's do some cleaning
Let's look at the data and do some cleaning. The columns are X, carat, cut, color, clarity, depth, table, price, x, y and z. The first column, "X", is just the row number, so we delete it. The remaining columns give the diamond's weight in carats, its cut type, color, clarity level, depth, table (the width of the flat top of the diamond when viewed face up), its price, and x, y and z, which are its length, width and height.

In [None]:
head(diamonds)
diamonds$X <- NULL

**Let's also look at a quick summary**

We have cut, color, clarity as categorical types and carat, depth, table, price, x, y and z as numerical data types.

**Things To Notice At First Glance**

Looking at the summary below, we can notice the following problems.
1. The categorical columns are stored as character and need to be converted to factors.
2. The minimum of x, y and z is 0, which means the dataset contains invalid entries: a diamond must have a non-zero length, width and height to exist.
3. The maximum values of carat, y and z are far above their third quartiles, which suggests wrong entries or outliers.

In [None]:
summary(diamonds)
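
The zero minimums seen in the summary can also be counted directly. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative and are not rows from the dataset.

```r
# Toy stand-in for the diamonds data; values are invented for illustration.
toy <- data.frame(
  x = c(3.95, 0.00, 4.05, 6.50),
  y = c(3.98, 3.84, 0.00, 6.47),
  z = c(2.43, 2.31, 0.00, 4.01)
)

# Number of zero entries in each dimension column
zero_counts <- colSums(toy[, c("x", "y", "z")] == 0)
print(zero_counts)

# Rows where at least one dimension is zero
bad_rows <- which(rowSums(toy[, c("x", "y", "z")] == 0) > 0)
print(bad_rows)
```

Running the same two expressions on `diamonds` would show how many rows are affected before anything is removed.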

**Sometimes the Most General Data Cleaning Routine Is Enough**

The most general data cleaning routine is as follows:
1. Remove Duplicates
2. Fix Structure of Dataset
3. Filter Wrong Entries & Outliers
4. Manage Missing Values
5. Validate and Quality Check

**1. Remove Duplicates**

Let's start with the first one and remove duplicates. It is not necessary in this dataset but if needed in any other dataset it is done as follows.


In [None]:
# Load dplyr for distinct()
library("dplyr")
# Remove duplicate rows and store the result in a new dataset
unduplicated <- distinct(diamonds)
# We will use "unduplicated" from now on so our original data stays the same

# Stats of before and after removing duplicates
cat("Before removing duplicates:" ,nrow(diamonds))
cat("     After removing duplicates:" ,nrow(unduplicated))
cat("     Duplicates Removed:", nrow(diamonds)-nrow(unduplicated))
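
If we want to see which rows are duplicates before dropping them, base R's `duplicated()` can be used. This is a small sketch on made-up rows (`toy` is hypothetical), not the dplyr call used above.

```r
# Toy data with one repeated row; values are invented for illustration.
toy <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29),
  cut   = c("Ideal", "Premium", "Ideal", "Premium"),
  price = c(326, 326, 326, 334)
)

# duplicated() flags every repeat of an earlier row (the first copy is kept)
is_dup <- duplicated(toy)
print(toy[is_dup, ])

# Dropping the flagged rows gives the same rows as unique()
deduped <- toy[!is_dup, ]
cat("Before:", nrow(toy), " After:", nrow(deduped),
    " Removed:", sum(is_dup), "\n")
```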

**2. Fixing Structure Of Dataset**

This step matters because the categorical columns need to be converted to factors.

In [None]:
# summary() shows no level counts for cut, color and clarity
# while they are character, so convert them to factors
unduplicated$cut <- as.factor(unduplicated$cut)
unduplicated$color <- as.factor(unduplicated$color)
unduplicated$clarity <- as.factor(unduplicated$clarity)

# check the structure of dataset
str(unduplicated)
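
One thing to keep in mind is that `as.factor()` sorts the levels alphabetically, while the cut grades have a natural order. The sketch below shows how an ordered factor could be built. The level order follows the one documented for the `diamonds` data shipped with ggplot2, and it is an assumption that the CSV uses the same labels.

```r
# Toy vector of cut grades; the labels are assumed to match the CSV.
cut_chr <- c("Ideal", "Fair", "Very Good", "Premium", "Good", "Ideal")

# as.factor() sorts levels alphabetically
alpha_levels <- levels(as.factor(cut_chr))
print(alpha_levels)

# An ordered factor keeps the grading order, worst to best
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_ord <- factor(cut_chr, levels = cut_levels, ordered = TRUE)
print(levels(cut_ord))

# Comparisons now respect the grading order
at_least_premium <- cut_ord >= "Premium"
print(at_least_premium)
```

With ordered levels, plots and tables list the grades from worst to best instead of alphabetically.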

Now that the structure is fixed, we move on to the next step.

**3. Filter Wrong Entries and Outliers**

As we noticed earlier, some entries have diamond dimensions of zero, which is not realistic. We remove those entries and create a new dataset called we_removed, which has the wrong entries (we) removed.

In [None]:
# Remove wrong entries (we) from the unduplicated dataset
we_removed <- subset(unduplicated, x!=0)
we_removed <- subset(we_removed, y!=0)
we_removed <- subset(we_removed, z!=0)

**Let's look at the outliers graphically first.**

After removing the wrong entries above, we move on to the outliers. We use box plots to find them.

In [None]:
library(ggplot2)
library(cowplot)
outliers_removed <- we_removed
oCarat <- ggplot(outliers_removed, aes(y=carat)) + 
  ggtitle("Outliers in Carat") + geom_boxplot() + theme_light()
oDepth<-ggplot(outliers_removed, aes(y=depth)) + 
  ggtitle("Outliers in Depth") + geom_boxplot() + theme_light()
oTable<-ggplot(outliers_removed, aes(y=table)) + 
  ggtitle("Outliers in Table") + geom_boxplot() + theme_light()
oX<-ggplot(outliers_removed, aes(y=x)) + 
  ggtitle("Outliers in X") + geom_boxplot() + theme_light()
oY<-ggplot(outliers_removed, aes(y=y)) + 
  ggtitle("Outliers in Y") + geom_boxplot() + theme_light()
oZ<-ggplot(outliers_removed, aes(y=z)) + 
  ggtitle("Outliers in Z") + geom_boxplot() + theme_light()
oPrice <- ggplot(outliers_removed, aes(y=price)) + 
  ggtitle("Outliers in Price") + geom_boxplot() + theme_light()
plot_grid(oCarat, oDepth, oTable, oX, oY, oZ, oPrice, ncol = 3, nrow = 3)

**Now, let's remove these outliers**

In [None]:
# Remove box plot outliers one column at a time, in the same order as before.
# A logical index is used instead of -which(): if a column has no outliers,
# -which() returns an empty index and every row would be dropped.
# x is included for completeness; the original run found no outliers in it.
for (col in c("carat", "depth", "table", "x", "y", "z", "price")) {
  out <- boxplot(outliers_removed[[col]], plot = FALSE)$out
  outliers_removed <- outliers_removed[!(outliers_removed[[col]] %in% out), ]
}
cleaned_data <- outliers_removed
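
To make the box plot rule concrete: with the default settings, a value is flagged when it lies more than 1.5 times the box height beyond the edge of the box. The sketch below applies this to a made-up vector (`toy` is hypothetical) and also shows a pitfall of dropping rows with `-which()` when nothing is flagged.

```r
# Toy measurements with one obvious extreme value (invented for illustration).
toy <- data.frame(carat = c(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 5.0))

# Values flagged by the box plot rule
out <- boxplot.stats(toy$carat)$out
print(out)

# Logical indexing keeps every row that is not flagged
kept <- toy[!(toy$carat %in% out), , drop = FALSE]
print(nrow(kept))

# Pitfall: when nothing is flagged, which() returns an empty index,
# and negative indexing with an empty index drops every row.
none <- numeric(0)
n_with_which   <- nrow(toy[-which(toy$carat %in% none), , drop = FALSE])
n_with_logical <- nrow(toy[!(toy$carat %in% none), , drop = FALSE])
print(c(with_which = n_with_which, with_logical = n_with_logical))
```

Removing outliers column by column also changes the quartiles of the remaining data, so a second pass can flag new points. Whether to repeat the step is a judgment call.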

**4. Manage Missing Values**

After removing outliers, we check whether any value in the dataset is missing, because we want the dataset to be complete. Here the check returns FALSE, which means the dataset is complete and contains no missing values.

If, however, a dataset does contain missing values, there are two ways to deal with them.

1. If the number of missing values is small compared to the size of the dataset, we can remove the affected rows because doing so will not affect the analysis much.
2. If, on the other hand, the dataset is small and rows with missing values cannot be removed, we can use the predictive mean matching method from the mice package. It fills each missing value with an observed value taken from a row whose model-predicted value for that column is close to the prediction for the incomplete row.
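
The two options can be sketched as follows on synthetic data. Everything in `toy` is invented for illustration, and the imputation part only runs if the mice package happens to be installed; the settings (`m = 1`, `seed = 1`) are arbitrary choices for the example, not a recommendation.

```r
# Toy data with two missing prices; all values are invented for illustration.
set.seed(1)
toy <- data.frame(
  carat = round(runif(40, 0.2, 2), 2),
  depth = round(rnorm(40, 61.7, 1.4), 1)
)
toy$price <- round(4000 * toy$carat + rnorm(40, 0, 300))
toy$price[c(5, 17)] <- NA

# How many values are missing in each column?
na_counts <- colSums(is.na(toy))
print(na_counts)

# Option 1: drop incomplete rows (reasonable when few rows are affected)
dropped <- na.omit(toy)
print(nrow(dropped))

# Option 2: impute with predictive mean matching, if mice is installed
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(toy, method = "pmm", m = 1, seed = 1, printFlag = FALSE)
  filled <- mice::complete(imp, 1)
  print(anyNA(filled))
}
```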

Lastly, we look at the summary of the dataset again to see whether anything else needs cleaning or fixing. If not, we are good to go.

In [None]:
any(is.na(cleaned_data))

In [None]:
summary(cleaned_data)

The summary above looks good and indicates that the cleaning worked. No dimension has a minimum of zero any more. The extreme values are gone as well, which we can tell by comparing the minimum and maximum of each column with its quartiles. The data is now clean and ready for analysis, which we do next.

# Diamonds Dataset - Exploratory Data Analysis

In the exploratory analysis we use some data visualisation techniques to look at the dataset and draw insights. Data visualisation offers a simplified view of the data that is normally very hard to get any other way. This is what makes it so powerful.

Let's start with a simple pie chart and use the categorical columns to visualise the dataset.

In [None]:
# Plot pie chart
piechart<- count(cleaned_data, cut)
pieCut<-ggplot(piechart, aes(x="", y=n, fill=cut)) + 
  geom_bar(stat = "identity", width = 1) + 
  coord_polar("y", start = 0) + 
  theme_void() + ggtitle("Diamonds per Cut")
piechart<- count(cleaned_data, color)
pieColor<-ggplot(piechart, aes(x="", y=n, fill=color)) + 
  geom_bar(stat = "identity", width = 1) + 
  coord_polar("y", start = 0) + 
  theme_void() + ggtitle("Diamonds per Color")
piechart<- count(cleaned_data, clarity)
pieClarity<-ggplot(piechart, aes(x="", y=n, fill=clarity)) + 
  geom_bar(stat = "identity", width = 1) + 
  coord_polar("y", start = 0) + 
  theme_void() + ggtitle("Diamonds per Clarity")
plot_grid(pieClarity, pieColor, pieCut, nrow = 3)

In [None]:
# One bar chart per categorical column, labelled with the counts
barchart <- count(cleaned_data, clarity)
ggplot(barchart, aes(x=clarity, y=n, fill=clarity, color=clarity, label=n)) +
  geom_col() + geom_text(nudge_y = 600) + ggtitle("Diamonds per Clarity") +
  theme_light()
barchart <- count(cleaned_data, cut)
ggplot(barchart, aes(x=cut, y=n, fill=cut, color=cut, label=n)) +
  geom_col() + geom_text(nudge_y = 600) + ggtitle("Diamonds per Cut") +
  theme_light()
barchart <- count(cleaned_data, color)
ggplot(barchart, aes(x=color, y=n, fill=color, color=color, label=n)) +
  geom_col() + geom_text(nudge_y = 600) + ggtitle("Diamonds per Color") +
  theme_light()
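
The counts behind these charts are often easier to compare as percentages. A minimal base R sketch on a made-up vector of cut grades (`cut_toy` is hypothetical):

```r
# Toy vector of cut grades (invented for illustration).
cut_toy <- c("Ideal", "Ideal", "Premium", "Good", "Ideal",
             "Very Good", "Premium", "Fair", "Ideal", "Good")

# Counts per category, then shares in percent
counts <- table(cut_toy)
shares <- round(100 * prop.table(counts), 1)

# Largest category first
print(sort(shares, decreasing = TRUE))
```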

In [None]:
ggplot(cleaned_data, aes(x=table, fill=cut)) +
  geom_histogram(binwidth = 1, color="white") +
  ggtitle("Table of Diamonds") + theme_light()
ggplot(cleaned_data, aes(x=depth, fill=cut)) +
  geom_histogram(binwidth = 0.5, color="white") +
  ggtitle("Depth of Diamonds") + theme_light()
ggplot(cleaned_data, aes(x=carat, fill=cut)) +
  geom_histogram(binwidth = 0.1, color="white") +
  ggtitle("Carat of Diamonds") + theme_light()
ggplot(cleaned_data, aes(x=price, fill=cut)) +
  geom_histogram(binwidth = 100, color="white") +
  ggtitle("Price of Diamonds") + theme_light()

In [None]:
# Each numeric variable against price, with a fitted linear trend line
sp_carat<-ggplot(cleaned_data, aes(x=price, y=carat)) +
  geom_point(color="blue") + geom_smooth(method = lm, color= "darkred") +
  ggtitle("Carat vs Price") + theme_light()
sp_table<-ggplot(cleaned_data, aes(x=price, y=table)) +
  geom_point(color="blue") + geom_smooth(method = lm, color= "darkred") +
  ggtitle("Table vs Price") + theme_light()
sp_depth<-ggplot(cleaned_data, aes(x=price, y=depth)) +
  geom_point(color="blue") + geom_smooth(method = lm, color= "darkred") +
  ggtitle("Depth vs Price") + theme_light()
sp_x<-ggplot(cleaned_data, aes(x=price, y=x)) +
  geom_point(color="blue") + geom_smooth(method = lm, color= "darkred") +
  ggtitle("X vs Price") + theme_light()
sp_y<-ggplot(cleaned_data, aes(x=price, y=y)) +
  geom_point(color="blue") + geom_smooth(method = lm, color= "darkred") +
  ggtitle("Y vs Price") + theme_light()
sp_z<-ggplot(cleaned_data, aes(x=price, y=z)) +
  geom_point(color="blue") + geom_smooth(method = lm, color= "darkred") +
  ggtitle("Z vs Price") + theme_light()
plot_grid(sp_carat, sp_table, sp_depth, sp_x, sp_y, sp_z,  nrow = 3)

# Analysing the Price Variable
Let's go deeper into the pricing of diamonds. We divide the diamonds into four groups (low, medium, high and top) using the quartiles of price.

In [None]:
# Group boundaries: minimum, Q1, median, Q3 and maximum of price.
# quantile() avoids shadowing the base functions min, max and median
# and avoids indexing into summary() by position.
pr <- cleaned_data$price
price_breaks <- quantile(pr, probs = c(0, 0.25, 0.5, 0.75, 1))
priceRange <- cut(pr, breaks = price_breaks,
                  labels = c("low", "medium", "high", "top"),
                  include.lowest = TRUE)
priceGroup <- data.frame(cleaned_data, priceRange)

**Summary of diamonds with 'low' price group:**

In [None]:
summary(priceGroup$price[priceGroup$priceRange=="low"])

**Summary of diamonds with 'medium' price group:**

In [None]:
summary(priceGroup$price[priceGroup$priceRange=="medium"])

**Summary of diamonds with 'high' price group:**

In [None]:
summary(priceGroup$price[priceGroup$priceRange=="high"])

**Summary of diamonds with 'top' price group:**

In [None]:
summary(priceGroup$price[priceGroup$priceRange=="top"])
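
The per-group summaries can also be produced in a single call with `tapply()`. The sketch below uses made-up prices (`price` and `price_range` are hypothetical names) and the same quartile-based grouping idea.

```r
# Toy prices (invented for illustration), grouped by quartile.
price <- c(350, 500, 800, 950, 1200, 2400, 3100, 4800,
           5300, 9000, 11000, 15000)
breaks <- quantile(price, probs = c(0, 0.25, 0.5, 0.75, 1))
price_range <- cut(price, breaks = breaks,
                   labels = c("low", "medium", "high", "top"),
                   include.lowest = TRUE)

# Group sizes
group_sizes <- table(price_range)
print(group_sizes)

# One summary per group in a single call
group_summaries <- tapply(price, price_range, summary)
print(group_summaries)
```

Because the breaks are quartiles, each group holds roughly a quarter of the rows.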

**Box plots of price by cut type**

In [None]:
ggplot(cleaned_data, aes(x=cut, y=price, fill=cut)) +
  geom_boxplot() + ggtitle("Prices with cut types") +
  theme_light()

In [None]:
corZP <- ggplot(cleaned_data, aes(x=z, y=price)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Price and Z")
corCP <-ggplot(cleaned_data, aes(x=carat, y=price)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Price and Carat")
corDP <- ggplot(cleaned_data, aes(x=depth, y=price)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Price and Depth")
corTP <- ggplot(cleaned_data, aes(x=table, y=price)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Price and Table")
corXP <- ggplot(cleaned_data, aes(x=x, y=price)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Price and X")
corYP <- ggplot(cleaned_data, aes(x=y, y=price)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Price and Y")
plot_grid(corCP, corDP, corTP, corXP, corYP, corZP, ncol = 3, nrow = 2)

# Compute Volume Variable from Dimensions
**We see a strong correlation between price and the dimensions of the diamonds. So let's combine the dimensions into a new variable called volume and check how it correlates with price.**

In [None]:
cleaned_data$volume <- cleaned_data$x*cleaned_data$y*cleaned_data$z
cleaned_data$volume <- round(cleaned_data$volume, digits = 2)
cor.test(x=cleaned_data$price, y=cleaned_data$volume)
ggplot(cleaned_data, aes(x=volume, y=price)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Price and Volume")
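
`cor.test()` returns an object, so the coefficient, its confidence interval and the p-value can be pulled out and reported in one line. A minimal sketch on synthetic data follows; the numbers are generated for illustration and say nothing about the real dataset.

```r
# Synthetic data, invented for illustration: price rises with volume plus noise.
set.seed(42)
volume <- runif(200, 30, 300)
price  <- 40 * volume + rnorm(200, sd = 1500)

res <- cor.test(x = price, y = volume)   # Pearson correlation by default

r     <- unname(res$estimate)   # correlation coefficient
ci    <- res$conf.int           # 95% confidence interval for r
p_val <- res$p.value

cat(sprintf("r = %.3f, 95%% CI [%.3f, %.3f], p = %.3g\n",
            r, ci[1], ci[2], p_val))

# Pearson's r is symmetric, so swapping x and y gives the same value
same_r <- isTRUE(all.equal(r, cor(volume, price)))
print(same_r)
```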

**As expected, volume correlates strongly with price. Let's check whether it also correlates with carat, because carat correlated strongly with price too.**

In [None]:
cor.test(x=cleaned_data$carat, y=cleaned_data$volume)
ggplot(cleaned_data, aes(x=carat, y=volume)) + 
  geom_point(color="blue") + geom_smooth(method = lm, color="darkred") +
  theme_light() + ggtitle("Carat and Volume")

**Now we can see that volume is strongly correlated with carat as well as with price.**

**The plots below double-check whether the table of a diamond correlates with any other variable, and they show that it does not. So at this point we know that strong correlations exist between the price, volume and carat variables in the dataset.**

In [None]:
tcarat <- ggplot(cleaned_data, aes(x=carat, y=table)) + 
  ggtitle("Table and Carat") + 
  geom_point(color="blue") + 
  geom_smooth(method = lm, color="darkred") + 
  theme_light()
tDepth<-ggplot(cleaned_data, aes(x=table,y=depth)) + 
  ggtitle("Table and Depth") + 
  geom_point(color="blue") + 
  geom_smooth(method = lm, color="darkred") + 
  theme_light()
tX<-ggplot(cleaned_data, aes(x=table,y=x)) + 
  ggtitle("Table and X") + 
  geom_point(color="blue") + 
  geom_smooth(method = lm, color="darkred") + 
  theme_light()
tY<-ggplot(cleaned_data, aes(x=table,y=y)) + 
  ggtitle("Table and Y") + 
  geom_point(color="blue") + 
  geom_smooth(method = lm, color="darkred") + 
  theme_light()
tZ<-ggplot(cleaned_data, aes(x=table,y=z)) + 
  ggtitle("Table and Z") + 
  geom_point(color="blue") + 
  geom_smooth(method = lm, color="darkred") + 
  theme_light()
tPrice <- ggplot(cleaned_data, aes(x=table,y=price)) + 
  ggtitle("Table and Price") + 
  geom_point(color="blue") + 
  geom_smooth(method = lm, color="darkred") + 
  theme_light()
plot_grid(tcarat, tDepth, tX, tY, tZ, tPrice, ncol = 3, nrow = 2)
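
The visual check can be complemented with a correlation matrix, which puts one number on each pair of variables. The sketch below uses synthetic data in which one column tracks carat and the `table` column is generated independently; the relationship and the values are assumptions made for the example.

```r
# Synthetic data, invented for illustration: x tracks carat, table does not.
set.seed(7)
carat <- runif(300, 0.2, 2)
toy <- data.frame(
  carat = carat,
  x     = 6.5 * carat^(1 / 3) + rnorm(300, sd = 0.05),
  table = rnorm(300, mean = 57, sd = 2)
)

# Pairwise Pearson correlations, rounded for reading
cor_mat <- round(cor(toy), 2)
print(cor_mat)

# Correlations of table with every other variable
table_cors <- cor_mat["table", setdiff(colnames(cor_mat), "table")]
print(table_cors)
```

On the real data, `cor()` would be applied to the numeric columns of `cleaned_data` only, since factors cannot be passed to it.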