# Diamonds modeling

A model is not only useful for predicting a value, as we did with the MPG dataset. It can also be used to remove a known relationship from our data, which helps us see another relationship.

For example: when working with diamonds, the size is a big factor in the cost. But diamonds also have a cut ([link](https://www.diamondcuts.com/)), and that cut is related to the price of the diamond as well. It doesn't look that way, though, when we look at the box plots.

In [None]:
# !pip install seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df=pd.read_csv("../files/diamonds.csv", index_col=0)
df.head()

In [None]:
from pandas.api.types import CategoricalDtype

cut_type = CategoricalDtype(categories=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], ordered=True)
df["cut"] = df['cut'].astype(cut_type)

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 5))
sns.countplot(x="cut", data=df, ax = ax1)
sns.boxplot(data=df, x='cut', y='price', ax = ax2)

It's mainly the box plots that are baffling: why are Fair-cut diamonds among the more expensive ones? Ideal-cut diamonds even seem to be the cheapest kind! And it's not a lack-of-data problem, as even for Fair-cut diamonds there are more than 1000 rows in the dataset.

So we add some domain knowledge: **Weight is an important factor in price.** Let’s try to separate out the effect of carat on the price.

(The graph below is interesting: why is the alpha set to 0.1? What does this mean for the dark areas?)

In [None]:
df.plot(kind='scatter', x="carat", y="price", grid=True,fontsize=10, figsize=(12, 6), alpha=0.1)

Price versus carat is clearly not linear: the price rises faster than the weight. It looks like a power-law relationship, where the price is proportional to the weight raised to some power (for example, the _squared_ weight). To turn such a relationship into a linear one, you apply a log transformation to both variables.
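
As a quick check of why the log transformation helps, here is a minimal sketch on synthetic data. The scale of 5000 and the exponent of 1.7 are made-up values for illustration, not estimates from the diamonds dataset. A relationship of the form price = scale × carat^exponent becomes a straight line once both variables are log-transformed, and the slope of that line is the exponent.

```python
import numpy as np

# Hypothetical power law: price = scale * carat ** exponent (values are made up).
scale, exponent = 5000.0, 1.7
carat = np.linspace(0.2, 3.0, 50)
price = scale * carat ** exponent

# Taking the log of both sides gives:
#   log(price) = log(scale) + exponent * log(carat)
log_carat = np.log(carat)
log_price = np.log(price)

# A straight-line fit on the log-log data recovers the exponent as the slope
# and log(scale) as the intercept.
slope, intercept = np.polyfit(log_carat, log_price, deg=1)
print(f"slope = {slope:.3f}, exp(intercept) = {np.exp(intercept):.1f}")
```

Because the synthetic data contains no noise, the fitted slope matches the chosen exponent. On real data the points scatter around the line instead.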

Let's transform the price and the carat values into log-scaled columns.

In [None]:
df["log_price"] = np.log(df["price"])
df["log_carat"] = np.log(df["carat"])
# df.head()
df.plot(kind='scatter', x="log_carat", y="log_price", grid=True,fontsize=10, figsize=(12, 6), alpha=0.1)

Nice and linear. Do remember that you can't read an actual price or weight directly from the log_price and log_carat columns. The units are different (log of dollars and log of carat, which have no direct physical meaning). To get back to dollars or carat, you have to undo the log with np.exp.

Next step is creating a model.

In [None]:
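from sklearn import linear_model

# scikit-learn expects 2D arrays of shape (n_samples, n_features)
x = df.log_carat.values.reshape(-1, 1)
y = df.log_price.values.reshape(-1, 1)

# Fit the line log_price = a * log_carat + b
regr = linear_model.LinearRegression()
model = regr.fit(x, y)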

fig, ax = plt.subplots(figsize=(12,6))

df.plot(kind='scatter', x="log_carat", y="log_price", grid=True,fontsize=10, ax=ax,  figsize=(12, 6), alpha=0.1)
plt.plot(x, regr.predict(x), color='red')

In [None]:
print(f"a= {model.coef_[0][0]}, b= {model.intercept_[0]}")
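
The printed a and b describe the line log_price = a * log_carat + b. The sketch below shows one way to turn that line into a prediction in dollars. The function name `predict_price` and the coefficients 1.7 and 8.5 are hypothetical placeholders for illustration; the real values are the ones printed by the cell above.

```python
import numpy as np

def predict_price(carat, a, b):
    """Predict a price in dollars from a weight in carat,
    given the fitted line log_price = a * log_carat + b."""
    log_price = a * np.log(carat) + b
    return np.exp(log_price)  # undo the natural log to get back to dollars

# Hypothetical coefficients for illustration only (not fitted on the dataset).
a_example, b_example = 1.7, 8.5
print(predict_price(np.array([0.5, 1.0, 2.0]), a_example, b_example))
```

For a diamond of exactly 1 carat the log of the weight is 0, so the predicted price is simply exp(b).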

And there is our linear model. Now we use it to predict the log price for every weight. Once we have this predicted value, we use it to calculate the residual (or error) for every actual value.

In [None]:
df['log_price_predicted'] = model.predict(df.log_carat.values.reshape(-1, 1))
df['log_price_residuals'] = df['log_price'] - df['log_price_predicted']

# df.head()
df.plot(kind='scatter', x="log_carat", y="log_price_residuals", grid=True,fontsize=10, figsize=(12, 6), alpha=0.1)

And there we have it: a graph of what is left of the price once the strong linear relationship with the weight (in carat) has been removed. That means we can draw the graph showing the relationship between the price and the cut without being bothered by the weight!
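
To see why this trick works, here is a minimal sketch on synthetic data with only two cuts. Everything in it is an assumption for illustration: the weight ranges, the pricing rule (slope 1.7, intercept 8.5, a bonus of 0.3 for the better cut) and the noise level are made up. It uses `np.polyfit` instead of scikit-learn to stay short. The worse cut is deliberately given heavier stones, so the raw prices point the wrong way while the residuals show the cut effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # synthetic diamonds per cut

# Made-up data: the worse cut ("Fair") tends to be heavier than the better cut ("Ideal").
carat_fair = rng.uniform(0.5, 2.0, n)
carat_ideal = rng.uniform(0.3, 1.5, n)
carat = np.concatenate([carat_fair, carat_ideal])
is_ideal = np.concatenate([np.zeros(n, dtype=bool), np.ones(n, dtype=bool)])

# Hypothetical pricing rule: weight dominates, a better cut adds a small bonus.
log_price = 8.5 + 1.7 * np.log(carat) + 0.3 * is_ideal + rng.normal(0, 0.1, 2 * n)

# Raw comparison: the heavier "Fair" diamonds look more expensive.
raw_fair = np.median(log_price[~is_ideal])
raw_ideal = np.median(log_price[is_ideal])

# Remove the weight effect: fit log_price on log_carat and keep the residuals.
slope, intercept = np.polyfit(np.log(carat), log_price, deg=1)
residuals = log_price - (intercept + slope * np.log(carat))
res_fair = np.median(residuals[~is_ideal])
res_ideal = np.median(residuals[is_ideal])

print(f"raw median log price  Fair: {raw_fair:.2f}  Ideal: {raw_ideal:.2f}")
print(f"median residual       Fair: {res_fair:.2f}  Ideal: {res_ideal:.2f}")
```

In this toy setup the raw median is higher for the worse cut, but after subtracting the weight-based prediction the better cut comes out on top.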

In [None]:
sns.boxplot(data=df, x='cut', y='log_price_residuals')

See the median (the line inside each box) going up as the cut improves? That is what we needed.

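As for the unit of the Y-axis: it's the difference between the log_price and the predicted log_price, in other words the log of the ratio between the actual and the predicted price. It is not an amount in dollars, but it is a good indication of how much more or less a diamond costs than its weight alone would suggest.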
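
If a more tangible number is needed, a residual on the natural-log scale can be converted into a price ratio by exponentiating it. The helper name `residual_to_price_ratio` below is a hypothetical one chosen for this sketch.

```python
import numpy as np

def residual_to_price_ratio(log_residual):
    """Convert a residual on the natural-log scale into the ratio
    actual price / predicted price."""
    return np.exp(log_residual)

# A residual of 0 means the diamond costs exactly what its weight predicts.
for r in (-0.5, 0.0, 0.5):
    ratio = residual_to_price_ratio(r)
    print(f"residual {r:+.1f} -> actual price is {ratio:.2f}x the predicted price")
```

A positive residual therefore means the diamond is more expensive than its weight suggests, and a negative one means it is cheaper.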