# Diamonds

The diamonds dataset is a well-known dataset. It's one of the basic example datasets you get with the R programming suite, but you can also [download](https://www.kaggle.com/datasets/shivam2503/diamonds) it as a csv. We've done that for you and stored it in the files folder; the exercises below use it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("files/diamonds.csv", index_col=0)
df.head()


Also describe the dataset, so you have an idea of how big it is and what sizes we are talking about.

In [None]:
df.describe()

Create a bar chart showing how many diamonds there are of every cut. You can use matplotlib and grouping, but seaborn's countplot is easier.

In [None]:
# Up to you!



The x-axis of the previous graph looks like this:

![](files/2022-08-30-13-39-46.png)

That isn't good, because the cut of a diamond has a natural order:

<code>['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']</code>


Turn the "cut" column into an ordered categorical column. You can use the list in this markdown cell. We have a second list like it:

<code>['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']</code>

This one is for "clarity", another ordered categorical column.

In [None]:
# Up to you!
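One possible approach, sketched on a tiny made-up sample rather than the real file: `pd.Categorical` with `ordered=True` attaches the quality order to the column. The `sample` frame and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical mini-sample; the real data comes from files/diamonds.csv
sample = pd.DataFrame(
    {"cut": ["Ideal", "Fair", "Premium", "Good", "Very Good", "Ideal"]}
)

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
sample["cut"] = pd.Categorical(sample["cut"], categories=cut_order, ordered=True)

# Sorting and comparisons now follow the quality order, not the alphabet
sorted_cuts = sample["cut"].sort_values().tolist()
better_than_good = sample[sample["cut"] > "Good"]
```

The same pattern should work for "clarity" with the second list.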



Now recreate the previous graph. The labels on the axis should be sorted correctly now.

In [None]:
# Up to you!



Show a distribution of diamonds by size (aka carat). Try to do it in three words. (A _histogram_ on _carat_ in the dataset _df_, but in reverse).

In [None]:
# Up to you!



A good plot, but with histograms the choice of bins is important. Draw four histograms with a varying number of bins (5, 10, 20 and 40).

In [None]:
# Up to you!
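To see why the bin count matters, here is a small numeric sketch that uses `np.histogram` on synthetic values (an assumption, standing in for the carat column). The `bins` argument means the same thing in `plt.hist`.

```python
import numpy as np

# Synthetic right-skewed values; NOT the real carat data
rng = np.random.default_rng(1)
values = rng.lognormal(mean=-0.5, sigma=0.6, size=1_000)

summary = {}
for bins in (5, 10, 20, 40):
    counts, edges = np.histogram(values, bins=bins)
    summary[bins] = (counts, edges)
    width = edges[1] - edges[0]  # all bins share the same width
    print(f"{bins:>2} bins, width {width:.2f}, tallest bin holds {counts.max()} values")
```

Fewer bins hide detail, more bins show more of the shape but also more noise.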



Carat weight distribution is skewed. What does this mean?

(In the following code block you should only write text. It's a small statistical interpretation that you can get on the exam as well.)

In [None]:
# Up to you!
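If you want numbers to back up your interpretation, the sketch below compares mean, median and skewness on synthetic right-skewed values (an assumption, not the real carat column). On the real data you would call the same methods on the carat column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for carat
rng = np.random.default_rng(0)
carat_like = pd.Series(rng.lognormal(mean=-0.5, sigma=0.6, size=5_000))

mean_value = carat_like.mean()
median_value = carat_like.median()
skewness = carat_like.skew()  # > 0 means a longer tail on the right

print(f"mean={mean_value:.2f} median={median_value:.2f} skew={skewness:.2f}")
```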



In the last graph you notice that more is going on in this dataset. There are spikes. Also, why does the x-axis go up to 5 when the last visible data point is at 3?

Show the diamonds with a carat size above 3.5. How many are there?

In [None]:
# Up to you!
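A minimal sketch of boolean filtering, using a made-up five-row frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical rows; only the pattern matters
sample = pd.DataFrame(
    {"carat": [0.3, 1.0, 3.6, 4.5, 0.7], "price": [400, 5000, 11000, 18000, 2000]}
)

big = sample[sample["carat"] > 3.5]  # keep rows where the condition is True
n_big = len(big)
print(big)
print(n_big, "diamonds above 3.5 carat")
```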



These are what we call outliers. They're not that interesting, especially since we're talking about 9 observations out of more than 50,000.

Draw:
* A histogram
* 200 bins
* Carat size below 3.5

In [None]:
# Up to you!



Some more non-technical questions:

* Which values are most common? Why?
* Which values are rare? Why? Is this expected?
* Why are there more values to the right of each peak?
* Why are there almost no diamonds bigger than 3 carats?
* There seem to be (sub)groups, or clusters, of similar values.
    * How are observations within a cluster similar?
    * How are observations from separate clusters different?

To interpret, imagine you are buying (or receiving) a diamond engagement ring. What sounds bigger, 1 carat or 0.99 carats?

In [None]:
# Up to you!



Plot a histogram with 100 bins on the Y value (which is the width of the diamond in mm).

In [None]:
# Up to you!



Many outliers here! The bins start at 0 and end at 58.9, while most observations lie between 3 and 10.

Show price, x, y and z for every diamond with a Y smaller than 3 or bigger than 20.

In [None]:
# Up to you!



You should note that:

* A couple of diamonds have no size (in this dataframe) and should not be considered in this graph.
* There's a diamond 5.8cm wide that only costs $12k.

Redraw the graph without these values.

In [None]:
# Up to you!



So far we have ignored these observations completely, which was fine because we were only looking at Y. But when we continue working with the dataset we can't just drop them, because that would mean losing all the other data in those rows.

So set all X, Y and Z values of 0 or 20 or more to NA. Use [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html?highlight=replace#pandas.DataFrame.replace) or [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).

In [None]:
# Up to you!
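One way to do this with `loc`, sketched on a made-up four-row frame: build a boolean mask per column and assign `np.nan` only to the flagged cells, so the rows themselves stay in place.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a few impossible sizes
sample = pd.DataFrame({
    "price": [500, 900, 12000, 700],
    "x": [4.0, 0.0, 8.1, 5.2],
    "y": [4.1, 0.0, 58.9, 5.3],
    "z": [2.5, 0.0, 8.0, 31.8],
})

for col in ["x", "y", "z"]:
    bad = (sample[col] == 0) | (sample[col] >= 20)
    sample.loc[bad, col] = np.nan  # only this cell is blanked, the row is kept

print(sample)
```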



Now show a list of all diamonds where X, Y or Z is NA.

In [None]:
df[df[["x", "y", "z"]].isna().any(axis=1)]

Show the same table as before. That is harder now: the old filter no longer finds these rows, because comparing NA with < or > is always False. But we saved you a list of line numbers:

<code>[11964,15952,24068,24521,26244,27430,49190,49557,49558]</code>

Use [iloc](https://www.statology.org/pandas-select-rows-by-index/).

(Getting the wrong lines? There is a difference between the line number and the position that iloc expects. The difference is 1: positions are 0-based, line numbers are 1-based.)

In [None]:
# Up to you!
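The off-by-one can be sketched as follows. The frame is made up, and its index is assumed to start at 1, as it would if the first CSV column holds 1-based row numbers.

```python
import pandas as pd

# Hypothetical frame whose index labels start at 1 (an assumption)
sample = pd.DataFrame({"y": [4.1, 0.0, 5.3, 58.9, 6.0]}, index=[1, 2, 3, 4, 5])

line_numbers = [2, 4]  # made-up 1-based line numbers

positions = [n - 1 for n in line_numbers]  # iloc counts from 0
by_position = sample.iloc[positions]

# With a 1-based index, loc accepts the line numbers as labels directly
by_label = sample.loc[line_numbers]
```

Both selections return the same rows here; which one fits depends on how the index of your dataframe was built.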



And a new histogram on Y without the filtering?

In [None]:
# Up to you!



Based on this histogram, does a boxplot look like a good idea? Draw one first, then check if it captures the specifics of your data.

In [None]:
# Up to you!



No. You miss out on all of the nuances of sizes. Now do multiple boxplots (using seaborn) with all prices per cut.

In [None]:
# Up to you!
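A boxplot draws the quartiles of each group. As a numeric companion, the sketch below computes them per cut on invented prices; the numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented prices, for illustration only
sample = pd.DataFrame({
    "cut": ["Fair", "Fair", "Ideal", "Ideal", "Ideal", "Premium", "Premium"],
    "price": [3000, 5000, 800, 1200, 4000, 2500, 6500],
})

# describe() returns count, mean, std, min, quartiles and max per group
summary = sample.groupby("cut")["price"].describe()[["25%", "50%", "75%"]]
print(summary)
```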



Looks wrong, doesn't it? You would expect the premium and ideal diamonds to be more expensive.

Although the best cut type is Ideal, its price is the lowest. According to the average prices, the most expensive diamonds belong to the Premium and Fair cut types. These results show that cut alone is not enough to explain the response variable price, since price does not increase as the cut improves.

Do the same, but show the weight instead of the price.

In [None]:
# Up to you!



Still the same picture. Maybe the color comes into play? The color of the diamond is a categorical variable that goes from D (best) to J (worst). Draw the price boxplots per color.

In [None]:
# Up to you!



Although the best color type is D, its price is one of the lowest. According to the average prices, the most expensive diamonds belong to the J and I colors, which are actually the two worst color types in this dataset. These results clearly show that color alone is not enough to explain the response variable price, since price does not increase as the color improves.

Compare the cut with the clarity on average price. Clarity is a categorical measurement of how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best).

Both are categorical variables by the way, so a [crosstable](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) would be a good idea.

In [None]:
# Up to you!
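A sketch of an average-price crosstab on invented rows: `pd.crosstab` takes the two categorical columns plus `values` and `aggfunc` to fill the cells with a mean instead of a count.

```python
import pandas as pd

# Invented rows, for illustration only
sample = pd.DataFrame({
    "cut": ["Fair", "Fair", "Ideal", "Ideal", "Ideal"],
    "clarity": ["SI1", "VS1", "SI1", "SI1", "VS1"],
    "price": [4000, 3000, 1000, 2000, 2500],
})

mean_price = pd.crosstab(
    sample["cut"], sample["clarity"], values=sample["price"], aggfunc="mean"
)
print(mean_price)
```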



Even when both cut and clarity improve, the price isn't higher. So these two variables together are still not enough to explain the price of a diamond.

What would happen if we plot the price vs the carat? Both are continuous variables, so a scatterplot is fine.

You could simply write <code>plt.scatter(df.carat, df.price)</code>, but there would be much room for improvement.

* There are more than 50,000 dots on there. Make sure they are small enough.
* Add labels to the axes.
* Add a first-order (linear) trendline.
* Add a fourth-order trendline.
* Make sure Y is limited between 0 and 20,000 (there are no values outside that range, but the trendlines want to escape those limits).

In [None]:
# Up to you!
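The trendlines can be computed with `np.polyfit`. The sketch below uses synthetic carat and price values (the relation is invented) and leaves the plotting calls as comments.

```python
import numpy as np

# Synthetic data with an invented relation; NOT the real diamonds
rng = np.random.default_rng(2)
carat = rng.uniform(0.2, 3.0, size=500)
price = 4000 * carat**2 + rng.normal(0, 500, size=500)

# Least-squares polynomial fits of degree 1 and degree 4
linear = np.poly1d(np.polyfit(carat, price, deg=1))
quartic = np.poly1d(np.polyfit(carat, price, deg=4))

# Evaluate both fits on an evenly spaced grid so the lines are drawn smoothly
grid = np.linspace(carat.min(), carat.max(), 100)
linear_line = linear(grid)
quartic_line = quartic(grid)

# Plotting, with matplotlib imported as plt:
# plt.scatter(carat, price, s=1)
# plt.plot(grid, linear_line)
# plt.plot(grid, quartic_line)
# plt.xlabel("carat"); plt.ylabel("price"); plt.ylim(0, 20_000)
```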



Note the covariance: when the price goes up, so does the weight. Maybe it's the other way around, but you would need a domain specialist for that (or some common sense). Keep in mind that covariance or correlation doesn't always imply causality.

![](files/2022-08-31-15-35-41.png)

[spurious-correlations](http://www.tylervigen.com/spurious-correlations)