
# Objective

To illustrate a supervised learning workflow for a regression task: predicting diamond prices from their attributes.


# Setup

In [1]:
import sklearn
import joblib

from sklearn.datasets import fetch_openml

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer

from sklearn.pipeline import make_pipeline

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
sklearn.set_config(display='diagram')

# Business Context

For this session, consider Brilliant Earth, a popular diamond jeweller with 30 showrooms across the US, which faces a price prediction problem. Customers in its retail outlets often ask how the price would change if some aspect of an ornament changed, for example: "If I decreased the carat of the diamonds used in this design, by how much would the price reduce?" Such queries usually require an expert to step in on the shop floor and detract from the customer experience.

The company also wants a price predictor tool on its website so that customers can engage with the brand better. No such tool exists at the moment, and the business team estimates that a price predictor will increase both traffic to the website and the time visitors spend on it.

The dataset used in this session was scraped from the Brilliant Earth website and is hosted on OpenML.

# Data

In [3]:
dataset = fetch_openml(data_id=43355, as_frame=True, parser="auto")

In [4]:
data_df = dataset.data

In [5]:
data_df.sample(5)

|  | id | url | shape | price | carat | cut | color | clarity | report | type | date_fetched |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4388 | 10079231 | https://www.brilliantearth.com//loose-diamonds... | Oval | 640 | 0.3 | Ideal | D | SI1 | GIA | natural | '2020-11-29 12-26 PM' |
| 94428 | 9833774 | https://www.brilliantearth.com//lab-diamonds-s... | Round | 1880 | 1.19 | Ideal | I | VS1 | GCAL | lab | '2020-11-29 12-26 PM' |
| 84864 | 9890736 | https://www.brilliantearth.com//lab-diamonds-s... | Round | 1070 | 0.93 | Ideal | J | VS1 | IGI | lab | '2020-11-29 12-26 PM' |
| 71208 | 9932586 | https://www.brilliantearth.com//lab-diamonds-s... | Round | 330 | 0.3 | 'Super Ideal' | D | SI2 | IGI | lab | '2020-11-29 12-26 PM' |
| 87423 | 9308611 | https://www.brilliantearth.com//lab-diamonds-s... | Radiant | 1300 | 0.76 | Good | E | VVS2 | IGI | lab | '2020-11-29 12-26 PM' |


## Data Description

- id: Diamond identification number provided by Brilliant Earth (int)

- url: URL for the diamond details page (string)

- shape: External geometric appearance of a diamond (string/categorical)

- price: Price in U.S. dollars (int)

- carat: Unit of measurement used to describe the weight of a diamond (float)

- cut: Facets, symmetry, and reflective qualities of a diamond (string/categorical)

- color: Natural color or lack of color visible within a diamond, based on the GIA grade scale (string/categorical)

- clarity: Visibility of natural microscopic inclusions and imperfections within a diamond (string/categorical)

- report: Diamond certificate or grading report provided by an independent gemology lab (string)

- type: Natural or lab created diamonds (string)

- date_fetched: Date the data was fetched (date)



In [6]:
target = 'price'
numeric_features = ['carat']
categorical_features = ['shape', 'cut', 'color', 'clarity', 'report', 'type']

# EDA

In [7]:
data_df[numeric_features].describe()

|  | carat |
| --- | --- |
| count | 119307.0 |
| mean | 0.884169 |
| std | 0.671141 |
| min | 0.25 |
| 25% | 0.4 |
| 50% | 0.7 |
| 75% | 1.1 |
| max | 15.32 |


In [8]:
data_df[categorical_features].describe()

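|  | shape | cut | color | clarity | report | type |
| --- | --- | --- | --- | --- | --- | --- |
| count | 119307 | 119307 | 119307 | 119307 | 119307 | 119307 |
| unique | 10 | 5 | 7 | 8 | 4 | 2 |
| top | Round | 'Super Ideal' | E | VS1 | GIA | natural |
| freq | 76080 | 55244 | 24730 | 27259 | 68782 | 70313 |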


In [9]:
data_df[target].describe()

count    1.193070e+05
mean     3.286843e+03
std      9.114695e+03
min      2.700000e+02
25%      9.000000e+02
50%      1.770000e+03
75%      3.490000e+03
max      1.348720e+06
Name: price, dtype: float64

# Model Estimation

In [10]:
X = data_df[numeric_features + categorical_features]
y = data_df[target]

In [11]:
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

In [12]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

In [13]:
model_linear_regression = LinearRegression(n_jobs=-1)

In [14]:
model_pipeline = make_pipeline(
    preprocessor,
    model_linear_regression
)

In [15]:
model_pipeline.fit(Xtrain, ytrain)

# Model Evaluation

In [16]:
model_pipeline.predict(Xtest)

array([-3387.14837879, 10361.32166214,  -724.1359052 , ...,
        1922.05376402,    10.64585411,  3726.7628516 ])
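
Some of the predictions above are negative, which cannot be a valid price. One plausible reason (an assumption, not something verified on this dataset) is that price grows faster than linearly with carat, so a straight line fitted over the whole range undershoots at the low end. The sketch below reproduces that effect on a made-up convex price curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up convex relationship between carat and price (illustration only).
toy_carat = np.linspace(0.25, 3.0, 50).reshape(-1, 1)
toy_price = 3000 * toy_carat.ravel() ** 2

line = LinearRegression().fit(toy_carat, toy_price)

# The true toy price at 0.25 carat is positive, but the fitted line
# predicts a negative value there.
true_price_smallest = 3000 * 0.25 ** 2
predicted_price_smallest = line.predict([[0.25]])[0]
print(true_price_smallest, predicted_price_smallest)
```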

In [17]:
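# squared=False returns the RMSE. It works in scikit-learn 1.2.2 (the version pinned below)
# but is deprecated from 1.4, which adds root_mean_squared_error for this purpose.
print(f"RMSE: {mean_squared_error(ytest, model_pipeline.predict(Xtest), squared=False)}")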

RMSE: 7233.547395723714


In [18]:
print(f"R-squared: {r2_score(ytest, model_pipeline.predict(Xtest))}")

R-squared: 0.4717574720851655
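
For reference, both metrics can be computed directly from the residuals. The sketch below does so on a few made-up values (the `y_true` and `y_pred` arrays are assumptions for illustration) and compares the results with scikit-learn's functions: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices (illustration only).
y_true = np.array([900.0, 1770.0, 3490.0, 640.0])
y_pred = np.array([1000.0, 1500.0, 3000.0, 700.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the target's units.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R-squared: share of the variance around the mean that the model explains.
ss_residual = np.sum(residuals ** 2)
ss_total = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_residual / ss_total

# Cross-check against scikit-learn.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

Read this way, the RMSE reported above is in dollars, so it can be compared directly with the median price from the EDA section.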


# Serialization

In [19]:
!pip show scikit-learn

Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /opt/anaconda3/lib/python3.11/site-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: imbalanced-learn


In [20]:
%%writefile requirements.txt
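scikit-learn==1.2.2
# pandas is required by fetch_openml(as_frame=True) in train.py; its version is not recorded in this notebook
pandas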

Writing requirements.txt
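
The version is pinned because a scikit-learn model saved with one version is not guaranteed to load or behave identically under another. A loading script could guard against this with a small check like the sketch below. The helper name `matches_pinned_version` and the constant are hypothetical, introduced only for illustration.

```python
import warnings

import sklearn

# Version pinned in requirements.txt.
PINNED_SKLEARN_VERSION = "1.2.2"


def matches_pinned_version(installed: str, pinned: str = PINNED_SKLEARN_VERSION) -> bool:
    """Return True if the installed version equals the pinned one, else warn."""
    if installed != pinned:
        warnings.warn(
            f"scikit-learn {installed} is installed but the model was "
            f"trained with {pinned}; loading it may fail or change results."
        )
        return False
    return True


version_ok = matches_pinned_version(sklearn.__version__)
print(f"Installed scikit-learn matches pinned version: {version_ok}")
```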


In [21]:
%%writefile train.py

import joblib

from sklearn.datasets import fetch_openml

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer

from sklearn.pipeline import make_pipeline

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score  # only R-squared is logged by this script

dataset = fetch_openml(data_id=43355, as_frame=True, parser="auto")

data_df = dataset.data

target = 'price'
numeric_features = ['carat']
categorical_features = ['shape', 'cut', 'color', 'clarity', 'report', 'type']

print("Creating data subsets")

X = data_df[numeric_features + categorical_features]
y = data_df[target]

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

model_linear_regression = LinearRegression(n_jobs=-1)

print("Estimating Model Pipeline")

model_pipeline = make_pipeline(
    preprocessor,
    model_linear_regression
)

model_pipeline.fit(Xtrain, ytrain)

print("Logging Metrics")
print(f"R-squared: {r2_score(ytest, model_pipeline.predict(Xtest))}")

print("Serializing Model")

saved_model_path = "model.joblib"

joblib.dump(model_pipeline, saved_model_path)

Writing train.py


In [22]:
!python train.py

Creating data subsets
Estimating Model Pipeline
Logging Metrics
R-squared: 0.4717574720851655
Serializing Model


# Test Predictions

In [23]:
saved_model = joblib.load("model.joblib")

In [24]:
saved_model

In [25]:
saved_model.predict(Xtest)

array([-3387.14837879, 10361.32166214,  -724.1359052 , ...,
        1922.05376402,    10.64585411,  3726.7628516 ])
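
The reloaded predictions match the in-memory ones shown in the Model Evaluation section, and that check can be made explicit. The sketch below does a save and load round trip on a toy pipeline in a temporary directory, then runs a what-if query of the kind described in the business context. The toy rows, prices, and `toy_` names are made up for illustration and are not part of the dataset.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data, made up for illustration.
toy_X = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 0.9, 1.1, 1.5],
    "cut": ["Ideal", "Good", "Ideal", "Good", "Ideal", "Good"],
})
toy_y = pd.Series([400, 700, 1200, 1500, 2300, 3100])

toy_pipeline = make_pipeline(
    make_column_transformer(
        (StandardScaler(), ["carat"]),
        (OneHotEncoder(handle_unknown="ignore"), ["cut"]),
    ),
    LinearRegression(),
).fit(toy_X, toy_y)

# Save and reload in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.joblib"
    joblib.dump(toy_pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
round_trip_ok = np.allclose(toy_pipeline.predict(toy_X), restored.predict(toy_X))
print(f"Round trip preserved predictions: {round_trip_ok}")

# What-if query: same cut, carat reduced from 1.0 to 0.8.
what_if = pd.DataFrame({"carat": [1.0, 0.8], "cut": ["Ideal", "Ideal"]})
price_at_1_0, price_at_0_8 = restored.predict(what_if)
print(f"Predicted price change: {price_at_0_8 - price_at_1_0:.2f}")
```

Because the preprocessor is serialized together with the model, the what-if frame can hold raw column values, with no separate scaling or encoding step at prediction time.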