# Modeling

**The goal is to build something like this web predictor: https://pgurazada1-diamond-price-predictor.hf.space**

#### Step 0: Install the Dependencies

In [3]:
# %pip install --upgrade scikit-learn

In [4]:
# %pip install openml

In [65]:
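# Note: sklearn.datasets.fetch_openml (imported below) downloads the dataset;
# the separate openml package is not needed for it.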
from sklearn import impute, tree, pipeline
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

import joblib
import os

#### Step 1: Collect Data

**(A) Translation of business problem into a data problem.**
* The target
* The features
* Sources of data
 
**(B) What are the business KPIs here? How will the business team measure the success of your effort?**

- Simply put, the business does not care about accuracy, precision, recall, F1 score, or regression metrics. It has its own concerns.
- Ask the business which KPIs your model will be measured on.

**(C) A Data Engineer usually handles the orchestration of the workflows commonly referred to as Extract-Transform-Load (ETL) jobs.**

**(D) Data is scraped from SKUs listed on https://www.brilliantearth.com and hosted on https://www.openml.org**


###### Access the data from OpenML and pull it into the training environment

In [2]:
dataset = fetch_openml(data_id=43355,
                       as_frame=True,
                       parser='auto')

In [6]:
print(dataset.DESCR)

Context

Buying a diamond can be frustrating and expensive.
It inspired me to create this dataset of 119K natural and lab-created diamonds from brilliantearth.com to demystify the value of the 4 Cs: cut, color, clarity, carat.
This data was scraped using DiamondScraper.

Content

| Attribute | Description | Data Type |
|---|---|---|
| id | Diamond identification number provided by Brilliant Earth | int |
| url | URL for the diamond details page | string |
| shape | External geometric appearance of a diamond | string/categorical |
| price | Price in U.S. dollars | int |
| carat | Unit of measurement used to describe the weight of a diamond | float |
| cut | Facets, symmetry, and reflective qualities of a diamond | string/categorical |
| color | Natural color or lack of color visible within a diamond, based on the GIA grade scale | string/categorical |
| clarity | Visibility of natural microscopic inclusions and imperfections within a diamond | string/categorical |
| report | Diamond certificate or grading report provided by an independent gemology lab | s |

In [7]:
diamond_prices = dataset.data

In [8]:
diamond_prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119307 entries, 0 to 119306
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            119307 non-null  int64  
 1   url           119307 non-null  object 
 2   shape         119307 non-null  object 
 3   price         119307 non-null  int64  
 4   carat         119307 non-null  float64
 5   cut           119307 non-null  object 
 6   color         119307 non-null  object 
 7   clarity       119307 non-null  object 
 8   report        119307 non-null  object 
 9   type          119307 non-null  object 
 10  date_fetched  119307 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 10.0+ MB


Decide which features in the dataset are relevant to the data problem as framed.

In [9]:
target = 'price'
numeric_features = ['carat']
categorical_features = ['shape', 'cut', 'color', 'clarity', 'report', 'type']

#### Step 2: EDA - Exploratory Data Analysis

**Crucial to identify good test cases for the deployed model.** For instance: what happens if a customer is looking for a 50k-carat diamond, far beyond anything in the dataset?

**Good EDA** allows you to build a good customer-facing interface that works with the actual data the customer can interact with.

In [10]:
diamond_prices.head()

| | id | url | shape | price | carat | cut | color | clarity | report | type | date_fetched |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10086429 | https://www.brilliantearth.com//loose-diamonds... | Round | 400 | 0.3 | 'Very Good' | J | SI2 | GIA | natural | '2020-11-29 12-26 PM' |
| 1 | 10016334 | https://www.brilliantearth.com//loose-diamonds... | Emerald | 400 | 0.31 | Ideal | I | SI1 | GIA | natural | '2020-11-29 12-26 PM' |
| 2 | 9947216 | https://www.brilliantearth.com//loose-diamonds... | Emerald | 400 | 0.3 | Ideal | I | VS2 | GIA | natural | '2020-11-29 12-26 PM' |
| 3 | 10083437 | https://www.brilliantearth.com//loose-diamonds... | Round | 400 | 0.3 | Ideal | I | SI2 | GIA | natural | '2020-11-29 12-26 PM' |
| 4 | 9946136 | https://www.brilliantearth.com//loose-diamonds... | Emerald | 400 | 0.3 | Ideal | I | SI1 | GIA | natural | '2020-11-29 12-26 PM' |


In [12]:
diamond_prices.loc[:, [target]].describe()

| | price |
|---|---|
| count | 119307.0 |
| mean | 3286.843 |
| std | 9114.695 |
| min | 270.0 |
| 25% | 900.0 |
| 50% | 1770.0 |
| 75% | 3490.0 |
| max | 1348720.0 |


In [17]:
diamond_prices.loc[:, numeric_features].describe()

| | carat |
|---|---|
| count | 119307.0 |
| mean | 0.884169 |
| std | 0.671141 |
| min | 0.25 |
| 25% | 0.4 |
| 50% | 0.7 |
| 75% | 1.1 |
| max | 15.32 |


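**`Observation`**
   - The largest diamond in the data is 15.32 carats (the smallest is 0.25), so the end model/interface should only accept carat values up to about 15. Beyond that range the model has seen no examples and its predicted price cannot be trusted.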
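
One way to act on this observation is to record the observed carat range and check customer input against it before predicting. The sketch below is a minimal illustration: the helper name `check_carat` is hypothetical, and the bounds are hard-coded from the `describe()` output above rather than read from the data.

```python
# Bounds copied from the describe() output above (min 0.25, max 15.32).
CARAT_MIN, CARAT_MAX = 0.25, 15.32


def check_carat(carat: float) -> bool:
    """Return True when the requested carat lies inside the training range."""
    return CARAT_MIN <= carat <= CARAT_MAX


in_range = check_carat(1.1)        # a typical request
out_of_range = check_carat(50.0)   # far outside anything the model has seen
print(in_range, out_of_range)
```

In an interface, a failed check would typically produce a validation message instead of a price.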

In [14]:
diamond_prices.loc[:, categorical_features].describe()

| | shape | cut | color | clarity | report | type |
|---|---|---|---|---|---|---|
| count | 119307 | 119307 | 119307 | 119307 | 119307 | 119307 |
| unique | 10 | 5 | 7 | 8 | 4 | 2 |
| top | Round | 'Super Ideal' | E | VS1 | GIA | natural |
| freq | 76080 | 55244 | 24730 | 27259 | 68782 | 70313 |
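
The unique values of each categorical feature can feed the dropdowns of a customer-facing interface, so users can only pick values the model was trained on. A minimal sketch, assuming a small stand-in DataFrame (the rows are illustrative, not taken from the dataset) and a hypothetical `dropdown_options` dictionary:

```python
import pandas as pd

# Illustrative stand-in for diamond_prices; values are made up for the example.
sample = pd.DataFrame({
    "shape": ["Round", "Emerald", "Round"],
    "type": ["natural", "lab", "natural"],
})
categorical = ["shape", "type"]

# Map each categorical feature to the sorted list of values observed in the data.
dropdown_options = {col: sorted(sample[col].unique()) for col in categorical}
print(dropdown_options)
```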


#### Step 3: Build Model

**REMEMBER: WE NEVER DEPLOY MODELS, ONLY MODEL PIPELINES**

In [20]:
# separate the features and the target
# exclude id, url and date_fetched from the features 

X = diamond_prices.drop(columns=[target, 'id', 'url', 'date_fetched'])
y = diamond_prices[target]

In [24]:
# Create a train-test split 
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
                                                test_size = 0.2,
                                                random_state=42)

We save the data under a date-stamped file name so that each version of the training data can be tracked properly.

In [26]:
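# Save a date-stamped version of the training data
os.makedirs('data', exist_ok=True)  # to_csv fails if the folder is missing
Xtrain.to_csv('data/20230322_training_features.csv', index=False)
ytrain.to_csv('data/20230322_training_target.csv', index=False)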
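
A date prefix such as `20230322` can be generated rather than typed by hand. The following is a sketch, assuming the same `<date>_<name>` naming used above; the helper `versioned_path` is hypothetical.

```python
from datetime import date
from pathlib import Path


def versioned_path(folder: str, name: str, on: date) -> Path:
    """Build '<folder>/<YYYYMMDD>_<name>' for the given date."""
    return Path(folder) / f"{on.strftime('%Y%m%d')}_{name}"


path = versioned_path("data", "training_features.csv", date(2023, 3, 22))
print(path.as_posix())
```

Passing `date.today()` instead of a fixed date would stamp the file with the day the split was made.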

#### Preprocessing & Assembling the Preprocessor into the Model Pipeline

Bundling the preprocessor with the model means that inputs entered by the customer in the interface are preprocessed the same way as the training data before a prediction is returned. This happens in the background.

In [31]:
preprocessor = make_column_transformer((StandardScaler(), numeric_features), 
                                       (OneHotEncoder(handle_unknown='ignore'), 
                                        categorical_features))
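
The `handle_unknown='ignore'` setting matters for a deployed interface: a category that never appeared during fitting is encoded as all zeros instead of raising an error. The sketch below shows this on a tiny made-up frame; `demo_preprocessor` and the data are illustrative, and `sparse_threshold=0` is added only to force a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative training frame (values are made up).
train = pd.DataFrame({"carat": [0.3, 0.5, 1.0],
                      "cut": ["Ideal", "Good", "Ideal"]})

demo_preprocessor = make_column_transformer(
    (StandardScaler(), ["carat"]),
    (OneHotEncoder(handle_unknown="ignore"), ["cut"]),
    sparse_threshold=0,  # always return a dense array
)
demo_preprocessor.fit(train)

# 'Fair' never appeared during fit, so both one-hot columns come out as zero.
unseen = pd.DataFrame({"carat": [0.5], "cut": ["Fair"]})
encoded = demo_preprocessor.transform(unseen)
print(encoded)
```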

In [40]:
model_pipeline = make_pipeline(preprocessor,
                               DecisionTreeRegressor())

In [41]:
model_pipeline

In [42]:
model_pipeline.fit(Xtrain, ytrain)

In [44]:
model_pipeline.named_steps

{'columntransformer': ColumnTransformer(transformers=[('standardscaler', StandardScaler(), ['carat']),
                                 ('onehotencoder',
                                  OneHotEncoder(handle_unknown='ignore'),
                                  ['shape', 'cut', 'color', 'clarity', 'report',
                                   'type'])]),
 'decisiontreeregressor': DecisionTreeRegressor()}

In [68]:
# model_pipeline.named_steps['columntransformer'].transformers_[1][1].get

In [46]:
model_pipeline.named_steps['columntransformer'].transformers_

[('standardscaler', StandardScaler(), ['carat']),
 ('onehotencoder',
  OneHotEncoder(handle_unknown='ignore'),
  ['shape', 'cut', 'color', 'clarity', 'report', 'type'])]

#### Step 4: Serialize Model

Serializing the fitted pipeline keeps the model stored and available for making predictions even after the Jupyter kernel is shut down.

In [49]:
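saved_model_path = "models/model-v1.joblib"
os.makedirs(os.path.dirname(saved_model_path), exist_ok=True)  # joblib.dump needs the folder to exist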

In [52]:
joblib.dump(model_pipeline, saved_model_path)

['models/model-v1.joblib']

In [56]:
os.path.getsize(saved_model_path)

5312064

`Observation`: the serialized pipeline takes 5,312,064 bytes on disk, about 5.3 MB (5.07 MiB).

In [57]:
saved_model = joblib.load("models/model-v1.joblib")

In [58]:
saved_model
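
A quick sanity check after loading is to confirm that the restored object reproduces the original predictions. A minimal sketch on made-up data, writing to a temporary directory rather than `models/`; all names here are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: price as a function of carat only.
X_toy = np.array([[0.3], [0.5], [1.0], [1.5]])
y_toy = np.array([400, 700, 3000, 6000])
original = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)

# Dump and reload inside a temporary folder that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model-demo.joblib")
    joblib.dump(original, demo_path)
    restored = joblib.load(demo_path)

# The restored model should give identical predictions.
same = np.array_equal(original.predict(X_toy), restored.predict(X_toy))
print(same)
```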

In [66]:
print(mean_absolute_error(ytest, saved_model.predict(Xtest)))

405.18052086281756
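
An MAE of about 405 means the predictions are off by roughly $405 on average. Whether that is good depends on a reference point; a common one is a baseline that always predicts the training median. The sketch below uses made-up prices and scikit-learn's `DummyRegressor` to show the comparison. It does not reproduce the number above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up prices; the features are irrelevant to a median baseline.
y_train = np.array([400, 900, 1770, 3490])
y_test = np.array([500, 2000])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# The baseline ignores X and always predicts the median of y_train.
baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(baseline_mae)
```

A model is only worth deploying if its MAE is clearly below such a baseline on the same test set.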

