<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

## 0. python imports & setup

for learning purposes, libraries will be imported inside its corresponding usage section...

## 1. data loading

* diamonds: labeled data we can use for training and testing
* diamonds_predict: diamonds to predict its price and upload result to Kaggle

In [219]:
# imports 
import numpy as np
import pandas as pd

In [220]:
df = pd.read_csv('data/train/diamonds_train_cleaned.csv', index_col=[0])
df['xy'] = df['x']/df['y']
df['xz'] = df['x']/df['z']
df['zy'] = df['z']/df['y']
df['table_depth'] = df['table']/df['depth']
#df['xyz'] = df['x']* df['y'] * df['z']
#df['xyz_sum'] = df['x'] + df['y'] + df['z']
df['carat/dimensons'] = df ['carat'] / (df['x']* df['y'] * df['z'])
df.replace([np.inf, -np.inf], 0 , inplace=True)

df.head()


Unnamed: 0,depth,table,x,y,z,price,carat,cut,color,clarity,city,xy,xz,zy,table_depth,xyz,carat/dimensons
0,62.4,58.0,6.83,6.79,4.25,4268,1.21,Premium,J,VS2,Dubai,1.005891,1.607059,0.62592,0.929487,197.096725,0.006139
1,61.6,58.0,6.4,6.35,3.93,3513,1.02,Premium,J,VS2,Dubai,1.007874,1.628499,0.618898,0.941558,159.7152,0.006386
2,62.3,58.0,5.86,5.8,3.63,1792,0.77,Premium,J,VS2,Dubai,1.010345,1.614325,0.625862,0.930979,123.37644,0.006241
3,59.6,60.0,7.58,7.48,4.49,7553,1.51,Premium,J,VS2,Dubai,1.013369,1.688196,0.600267,1.006711,254.575816,0.005931
4,60.2,62.0,5.4,5.33,3.23,1176,0.57,Premium,J,VS2,Dubai,1.013133,1.671827,0.606004,1.0299,92.96586,0.006131


In [221]:
df.head().T

Unnamed: 0,0,1,2,3,4
depth,62.4,61.6,62.3,59.6,60.2
table,58.0,58.0,58.0,60.0,62.0
x,6.83,6.4,5.86,7.58,5.4
y,6.79,6.35,5.8,7.48,5.33
z,4.25,3.93,3.63,4.49,3.23
price,4268,3513,1792,7553,1176
carat,1.21,1.02,0.77,1.51,0.57
cut,Premium,Premium,Premium,Premium,Premium
color,J,J,J,J,J
clarity,VS2,VS2,VS2,VS2,VS2


In [222]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_columns = list((df.select_dtypes(include=numerics)).columns)
numerical_columns.remove('price')
#numerical_columns.remove('depth')
#numerical_columns.remove('table')
numerical_columns.remove('x')
numerical_columns.remove('y')
numerical_columns.remove('z')
numerical_columns

['depth',
 'table',
 'carat',
 'xy',
 'xz',
 'zy',
 'table_depth',
 'xyz',
 'carat/dimensons']

as you can see, there are both categorical and numerical columns...

## 2. eda

this section is up to you! this guided lesson is about a machine learning pipeline...

## 3. ml preprocessing

in this section I will teach how to use scikit-learn's Pipiline and ColumnTransformer, one of the best practices for composing preprocessing and modeling in a single and elegand class... pay attention as it is hard to understand...

In [223]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
* https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

let's identify numerical and categorical features...

In [224]:
#NUM_FEATS = ['carat', 'depth', 'table', 'x', 'y', 'z']
categorical_columns = ['cut', 'color', 'clarity']
FEATS = numerical_columns + categorical_columns
TARGET = 'price'

let's define a preprocessing transformer for numerical columns...

In [225]:
numeric_transformer = \
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), 
                ('scaler', StandardScaler())])

let's define a preprocessing transformer for categorical columns...

In [226]:
categorical_transformer = \
Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                ('onehot', OneHotEncoder(handle_unknown='ignore'))])

let's join these transformers using a `ColumnTransformer`:

In [227]:
preprocessor = \
ColumnTransformer(transformers=[('num', numeric_transformer, numerical_columns),
                                ('cat', categorical_transformer, categorical_columns)])

inspecting the full preprocessor:

In [228]:
preprocessor

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['depth', 'table', 'carat', 'xy', 'xz', 'zy',
                                  'table_depth', 'xyz', 'carat/dimensons']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['cut', 'color', 'clarity'])])

how does this preprocessing looks like?

at least in this case, it is at the cost of interpretability of transformed DataFrame...

In [229]:
pd.DataFrame(data=preprocessor.fit_transform(df)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
0,0.455382,0.250569,0.868347,0.625806,-0.325378,0.567775,-0.031149,0.878032,0.093385,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,-0.108285,0.250569,0.468532,0.817358,0.228399,0.101037,0.221683,0.390588,1.919605,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.384923,0.250569,-0.057541,1.056033,-0.137696,0.563894,0.0001,-0.083259,0.846357,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,-1.517453,1.152886,1.499634,1.348159,1.770336,-1.137133,1.586309,1.627543,-1.440584,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,-1.094703,2.055203,-0.478399,1.325383,1.347527,-0.755893,2.072,-0.479805,0.035516,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [230]:
df[FEATS]

Unnamed: 0,depth,table,carat,xy,xz,zy,table_depth,xyz,carat/dimensons,cut,color,clarity
0,62.4,58.0,1.21,1.005891,1.607059,0.625920,0.929487,197.096725,0.006139,Premium,J,VS2
1,61.6,58.0,1.02,1.007874,1.628499,0.618898,0.941558,159.715200,0.006386,Premium,J,VS2
2,62.3,58.0,0.77,1.010345,1.614325,0.625862,0.930979,123.376440,0.006241,Premium,J,VS2
3,59.6,60.0,1.51,1.013369,1.688196,0.600267,1.006711,254.575816,0.005931,Premium,J,VS2
4,60.2,62.0,0.57,1.013133,1.671827,0.606004,1.029900,92.965860,0.006131,Premium,J,VS2
...,...,...,...,...,...,...,...,...,...,...,...,...
40450,62.2,54.0,0.54,0.994307,1.602446,0.620493,0.868167,90.300396,0.005980,Ideal,F,IF
40451,61.9,54.0,0.53,0.994286,1.611111,0.617143,0.872375,88.792200,0.005969,Ideal,F,IF
40452,62.3,55.0,0.30,0.990783,1.598513,0.619816,0.882825,50.200780,0.005976,Ideal,F,IF
40453,60.9,55.0,0.26,0.981087,1.627451,0.602837,0.903120,44.763975,0.005808,Ideal,F,IF


## 4. train a simple model

first, lets train a simple model using holdout, train - test split...

In [231]:
from sklearn.model_selection import train_test_split

In [232]:
diamonds_train, diamonds_test = train_test_split(df, random_state=42)

In [233]:
print(diamonds_train.shape)
print(diamonds_test.shape)

(30318, 17)
(10107, 17)


let's choose a model from scikit-learn cheatsheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [234]:
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

model= Pipeline(steps=[('preprocessor', preprocessor),
                       ('regressor', RandomForestRegressor())])

In [235]:
model.fit(diamonds_train[FEATS], diamonds_train[TARGET]);

## 5. check model performance on test and train data

In [236]:
from sklearn.metrics import mean_squared_error

In [237]:
y_test = model.predict(diamonds_test[FEATS])
y_train = model.predict(diamonds_train[FEATS])

In [238]:
print(f"test error: {mean_squared_error(y_pred=y_test, y_true=diamonds_test[TARGET], squared=False)}")
print(f"train error: {mean_squared_error(y_pred=y_train, y_true=diamonds_train[TARGET], squared=False)}")

test error: 539.9640735010596
train error: 208.59414295357251


## 6. check model performance using cross validation

In [239]:
from sklearn.model_selection import cross_val_score

In [240]:
scores = cross_val_score(model, 
                         df[FEATS], 
                         df[TARGET], 
                         scoring='neg_root_mean_squared_error', 
                         cv=5, n_jobs=-1)

In [241]:
import numpy as np
np.mean(-scores)

548.0537778835132

## 7. optimize model using grid search

In [27]:
from sklearn.model_selection import RandomizedSearchCV

In [28]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'regressor__n_estimators': [16, 32, 64, 128, 256, 512],
    'regressor__max_depth': [2, 4, 8, 16],
}

grid_search = RandomizedSearchCV(model, 
                                 param_grid, 
                                 cv=5, 
                                 verbose=10, 
                                 scoring='neg_root_mean_squared_error', 
                                 n_jobs=-1,
                                 n_iter=32)

grid_search.fit(df[FEATS], df[TARGET])

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV 2/5; 1/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=8, regressor__n_estimators=256
[CV 4/5; 1/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=8, regressor__n_estimators=256
[CV 1/5; 2/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=4, regressor__n_estimators=16
[CV 5/5; 1/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=8, regressor__n_estimators=256
[CV 3/5; 2/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=4, regressor__n_estimators=16
[CV 3/5; 1/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=8, regressor__n_estimators=256
[CV 2/5; 2/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=4, regressor__n_estimators=16
[CV 1/5; 1/32] START preprocessor__num__imputer__strategy=mean, regressor__max_depth=8, regressor__n_estimators=256
[CV 1/5; 2/32

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(transformers=[('num',
                                                                               Pipeline(steps=[('imputer',
                                                                                                SimpleImputer(strategy='median')),
                                                                                               ('scaler',
                                                                                                StandardScaler())]),
                                                                               ['depth',
                                                                                'table',
                                                                                'carat',
                                                                                'xy',
             

In [29]:
grid_search.best_params_

{'regressor__n_estimators': 512,
 'regressor__max_depth': 16,
 'preprocessor__num__imputer__strategy': 'median'}

In [30]:
grid_search.best_score_

-541.0102333789998

## 8. prepare submission

In [31]:
diamonds_train = pd.read_csv('data/test/diamonds_test.csv')
diamonds_train['xy'] = diamonds_train['x']/diamonds_train['y']
diamonds_train['xz'] = diamonds_train['x']/diamonds_train['z']
diamonds_train['zy'] = diamonds_train['z']/diamonds_train['y']
diamonds_train['table_depth'] = diamonds_train['table']/df['depth']
#diamonds_train['xyz'] = diamonds_train['x']* diamonds_train['y'] * diamonds_train['z']
#diamonds_train['xyz_sum'] = diamonds_train['x'] + diamonds_train['y'] + diamonds_train['z']
diamonds_train['carat/dimensons'] = diamonds_train ['carat'] / (diamonds_train['x']* diamonds_train['y'] * diamonds_train['z'])
diamonds_train.replace([np.inf, -np.inf], 0 , inplace=True)

diamonds_train

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,city,xy,xz,zy,table_depth,carat/dimensons
0,0,0.79,Very Good,F,SI1,62.7,60.0,5.82,5.89,3.67,Amsterdam,0.988115,1.585831,0.623090,0.961538,0.006279
1,1,1.20,Ideal,J,VS1,61.0,57.0,6.81,6.89,4.18,Surat,0.988389,1.629187,0.606676,0.925325,0.006118
2,2,1.57,Premium,H,SI1,62.2,61.0,7.38,7.32,4.57,Kimberly,1.008197,1.614880,0.624317,0.979133,0.006359
3,3,0.90,Very Good,F,SI1,63.8,54.0,6.09,6.13,3.90,Kimberly,0.993475,1.561538,0.636215,0.906040,0.006182
4,4,0.50,Very Good,F,VS1,62.9,58.0,5.05,5.09,3.19,Amsterdam,0.992141,1.583072,0.626719,0.963455,0.006098
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13480,13480,0.57,Ideal,E,SI1,61.9,56.0,5.35,5.32,3.30,Amsterdam,1.005639,1.621212,0.620301,0.919540,0.006069
13481,13481,0.71,Ideal,I,VS2,62.2,55.0,5.71,5.73,3.56,New York City,0.996510,1.603933,0.621291,0.878594,0.006096
13482,13482,0.70,Ideal,F,VS1,61.6,55.0,5.75,5.71,3.53,Tel Aviv,1.007005,1.628895,0.618214,0.892857,0.006040
13483,13483,0.70,Very Good,F,SI2,58.8,57.0,5.85,5.89,3.45,Surat,0.993209,1.695652,0.585739,0.934426,0.005889


In [32]:
y_pred = grid_search.predict(diamonds_train[FEATS])

In [33]:
submission_df = pd.DataFrame({'id': diamonds_train['id'], 'price': y_pred})

In [34]:
submission_df.head()

Unnamed: 0,id,price
0,0,3071.301686
1,1,5410.637553
2,2,9521.648622
3,3,3990.109439
4,4,1647.07125


In [40]:
submission_df.describe()

Unnamed: 0,id,price
count,13485.0,13485.0
mean,6742.0,3955.183574
std,3892.928525,3948.800998
min,0.0,370.126102
25%,3371.0,961.396883
50%,6742.0,2465.923075
75%,10113.0,5314.871014
max,13484.0,18225.807471


In [41]:
submission_df.price.clip(0, 20000, inplace=True)

In [42]:
submission_df.to_csv('data/prediction/price_prediction.csv', index=False)

## 9. let's try more models...

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>