**Note**: Much of the encoding process comes from a great tutorial, where you can read more: https://pythonprogramming.net/machine-learning-python3-pandas-data-analysis/. I have been following the author for a while, and his tutorials and YouTube videos are well worth your time.

## Goal

The year is 1979 and you have just made the most important decision of your life: you asked someone to marry you. Being the analytics-minded person that you are, you decide to check whether you overpaid for the ring. You will build a machine learning model on the diamonds dataset from the 1970s to estimate how much you overpaid or underpaid relative to market prices.

In [1]:
import numpy as np
import pandas as pd

#Some configuration settings
%matplotlib inline
pd.set_option("display.max_columns", 100)

In [2]:
# import diamonds.csv
diamonds_df = pd.read_csv('diamonds.csv', index_col=0)

In [3]:
# take a look at your data
diamonds_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [4]:
# Check missing values
diamonds_df.isnull().sum()/len(diamonds_df)

carat      0.0
cut        0.0
color      0.0
clarity    0.0
depth      0.0
table      0.0
price      0.0
x          0.0
y          0.0
z          0.0
dtype: float64

In [5]:
# Take a look at the different cuts
diamonds_df['cut'].unique()

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

In [6]:
# Create dictionaries for cut, clarity and color to encode
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}
color_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}

In [7]:
# Use map to encode these new dictionaries into your variables
diamonds_df['cut'] = diamonds_df['cut'].map(cut_class_dict)
diamonds_df['clarity'] = diamonds_df['clarity'].map(clarity_dict)
diamonds_df['color'] = diamonds_df['color'].map(color_dict)
diamonds_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,5,6,4,61.5,55.0,326,3.95,3.98,2.43
2,0.21,4,6,5,59.8,61.0,326,3.89,3.84,2.31
3,0.23,2,6,7,56.9,65.0,327,4.05,4.07,2.31
4,0.29,4,2,6,62.4,58.0,334,4.2,4.23,2.63
5,0.31,2,1,4,63.3,58.0,335,4.34,4.35,2.75
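
One thing worth checking after an encoding step like this is that every label had a key in its dictionary, because `Series.map` returns `NaN` for any value it cannot find. The sketch below uses a toy frame (not the diamonds data) with a made-up label, `"Excellent"`, that is deliberately missing from the mapping.

```python
import pandas as pd

# Toy frame standing in for diamonds_df; "Excellent" is a hypothetical label
# that is deliberately absent from the mapping
toy_df = pd.DataFrame({"cut": ["Ideal", "Fair", "Excellent", "Good"]})
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

encoded = toy_df["cut"].map(cut_class_dict)

# Series.map yields NaN wherever the original value has no key in the dictionary
unmapped = toy_df.loc[encoded.isna(), "cut"].unique().tolist()
print(unmapped)  # ['Excellent']
```

Running the same `isna()` check on the real columns after mapping would confirm that no category was silently dropped.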


In [8]:
# Create X and y (X = 'carat', 'cut', 'color', 'clarity', y = 'price')
X = diamonds_df[['carat', 'cut', 'color', 'clarity']]
y = diamonds_df.price

In [9]:
# Create training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [10]:
# Question: Do you need to normalize the data when using a decision tree regressor?
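
One way to explore the question above is to fit the same tree on raw and on standardized features and compare the predictions. This is a minimal sketch on synthetic data (the feature ranges and the price formula are assumptions made up for illustration, not the diamonds dataset). Because a tree splits on the ordering of values within each feature, rescaling a feature should not change which samples land in which leaf.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the four features: a carat-like column plus three ordinal codes
carat = rng.integers(20, 300, size=200) / 100
codes = rng.integers(1, 8, size=(200, 3))
X_toy = np.column_stack([carat, codes])
# Made-up price formula with noise, used only so the tree has something to learn
y_toy = 4000 * carat + 150 * codes.sum(axis=1) + rng.normal(0, 100, size=200)

# Standardize every column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_toy)

# Same hyperparameters and seed; only the feature scale differs
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y_toy)

same_predictions = np.allclose(tree_raw.predict(X_toy), tree_scaled.predict(X_scaled))
print(same_predictions)
```

If the two sets of predictions agree, that suggests normalization is not needed for a single decision tree, unlike for distance-based or gradient-based models.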

In [11]:
# Create ML process with base decision tree (import, instantiate, fit, predict, evaluate)

# import
from sklearn.tree import DecisionTreeRegressor

# instantiate
DT = DecisionTreeRegressor()

# fit
DT.fit(X_train, y_train)

# predict
y_pred = DT.predict(X_test)

# evaluate (mean absolute error)
from sklearn.metrics import mean_absolute_error

print('MAE: ', mean_absolute_error(y_test, y_pred))

MAE:  323.85021829696655
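
To make the metric concrete, here is the same calculation done by hand with NumPy. The prices are hypothetical values chosen to keep the arithmetic easy; the result is in the same units as the target, so an MAE of about 324 reads as an average miss of roughly $324 per diamond.

```python
import numpy as np

# Hypothetical actual and predicted prices in dollars
y_true = np.array([326, 334, 5000, 12000])
y_hat = np.array([400, 300, 4500, 12600])

# Mean absolute error: the average of |actual - predicted|
mae = np.mean(np.abs(y_true - y_hat))
print(mae)  # (74 + 34 + 500 + 600) / 4 = 302.0
```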


In [21]:
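# Use GridSearchCV to tune max_depth and min_samples_leaf with 5-fold cross-validation
# Note: no scoring argument is passed to GridSearchCV below, so it uses the regressor's default score (R^2, higher is better);
# best_score_ is therefore a mean R^2, and the negated value printed below is -R^2, not an MAE
# To tune on MAE instead, pass scoring='neg_mean_absolute_error'; scores are then negated errors, so values closer to zero are better
# see https://stackoverflow.com/questions/48244219/is-sklearn-metrics-mean-squared-error-the-larger-the-better-negated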

from sklearn.model_selection import GridSearchCV

#create a dictionary of all values we want to test
param_grid = {'max_depth': np.arange(3, 15), 
              'min_samples_leaf': np.arange(1, 5)}
    
# decision tree model
DT = DecisionTreeRegressor()
    
#use gridsearch to test all values
DT_gridsearch = GridSearchCV(DT, param_grid, cv=5)
    
#fit model to data
DT_gridsearch.fit(X, y)
        
DT_gridsearch.best_params_, -(DT_gridsearch.best_score_), DT_gridsearch.n_splits_

({'max_depth': 14, 'min_samples_leaf': 4}, -0.27631320327176084, 5)
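
For comparison, here is a minimal sketch of a grid search that is scored on MAE and fitted on the training split only, so the test split stays untouched for the final evaluation. It runs on synthetic data (the features and price formula are assumptions for illustration), and the names `X_tr`, `X_te`, `search` and `cv_mae` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for X and y: a carat-like column plus three ordinal codes
carat = rng.integers(20, 300, size=400) / 100
codes = rng.integers(1, 8, size=(400, 3))
X_toy = np.column_stack([carat, codes])
y_toy = 4000 * carat + 150 * codes.sum(axis=1) + rng.normal(0, 100, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

param_grid = {"max_depth": np.arange(3, 8), "min_samples_leaf": np.arange(1, 5)}

# scoring="neg_mean_absolute_error" makes GridSearchCV maximize the negated MAE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_tr, y_tr)  # cross-validate on the training split only

cv_mae = -search.best_score_  # flip the sign to get the MAE in dollars
test_mae = mean_absolute_error(y_te, search.predict(X_te))  # refit best model on held-out data
print(search.best_params_, cv_mae, test_mae)
```

With this setup, negating `best_score_` gives a positive MAE, and `search.predict` uses the best estimator refitted on the whole training split.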

In [22]:
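# Refit with max_depth=3 and min_samples_leaf=2
# Note: these are not the best_params_ reported by the grid search above (max_depth=14, min_samples_leaf=4),
# and this shallower tree scores a higher MAE than the base tree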
DT_model_optimal = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2)

DT_model_optimal.fit(X_train, y_train)

y_pred = DT_model_optimal.predict(X_test)

print('MAE: ', mean_absolute_error(y_test, y_pred))

MAE:  772.4148701073096


In [24]:
# Make a prediction for a 1-carat diamond, passing cut, color and clarity as encoded values (all set to 1 here)
carat = 1
cut = 1
color = 1 
clarity = 1 


print(DT_model_optimal.predict(np.array([carat, cut, color, clarity]).reshape(1,-1)))

[5397.34805517]


In [25]:
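# Create a function that takes the carat, cut, color and clarity (as encoded values) plus the price paid,
# and reports how much you overpaid or underpaid relative to the model's price

def diamond_app(carat, cut, color, clarity, what_you_paid):
    # Model's estimated market price for this diamond
    ml_price = DT_model_optimal.predict(np.array([carat, cut, color, clarity]).reshape(1,-1))[0]
    # A positive difference means the model values the diamond above what you paid
    difference = ml_price - what_you_paid
    
    if difference > 0:
        print('You underpaid ${}'.format(round(difference, 2)))
    else:
        print('You overpaid ${}'.format(round(abs(difference), 2)))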

In [26]:
diamond_app(1, 2, 3, 4, 10000)

You overpaid $4602.65
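
The function above expects the encoded integers, which are easy to mix up. As a hedged variation, the sketch below accepts the original labels, looks them up in the same dictionaries, and returns the difference instead of printing it. The `price_gap` helper and `toy_model` are hypothetical names, and the model is trained on a small synthetic set so the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6,
                "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}
color_dict = {"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7}

# Small synthetic training set (not the diamonds data) so the sketch is self-contained
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.integers(20, 300, size=300) / 100,  # carat
    rng.integers(1, 6, size=300),           # cut code
    rng.integers(1, 8, size=300),           # color code
    rng.integers(3, 11, size=300),          # clarity code
])
y_toy = 4000 * X_toy[:, 0] + 100 * X_toy[:, 1:].sum(axis=1)  # made-up price formula
toy_model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)

def price_gap(model, carat, cut, color, clarity, what_you_paid):
    """Return model price minus price paid; positive means you paid less than the estimate."""
    # Dictionary lookups raise KeyError on an unknown label instead of passing NaN to the model
    row = np.array([[carat, cut_class_dict[cut], color_dict[color], clarity_dict[clarity]]])
    return float(model.predict(row)[0] - what_you_paid)

gap = price_gap(toy_model, 1.0, "Good", "H", "SI2", 10000)
print(round(gap, 2))
```

Returning a number keeps the pricing logic separate from the message, so the caller can decide how to report an overpayment or underpayment.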
