In [1]:
import plotly.express as px
from pycaret.datasets import get_data
import numpy as np
from pycaret.regression import *

In [2]:
# load the dataset from pycaret
data = get_data('diamond')

   Carat Weight    Cut  Color  Clarity  Polish  Symmetry  Report  Price
0          1.10  Ideal      H      SI1      VG        EX     GIA   5169
1          0.83  Ideal      H      VS1      ID        ID    AGSL   3470
2          0.85  Ideal      H      SI1      EX        EX     GIA   3183
3          0.91  Ideal      E      SI1      VG        VG     GIA   4370
4          0.83  Ideal      G      SI1      EX        EX     GIA   3171


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Carat Weight  6000 non-null   float64
 1   Cut           6000 non-null   object 
 2   Color         6000 non-null   object 
 3   Clarity       6000 non-null   object 
 4   Polish        6000 non-null   object 
 5   Symmetry      6000 non-null   object 
 6   Report        6000 non-null   object 
 7   Price         6000 non-null   int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 375.1+ KB


In [4]:
data.nunique()

Carat Weight     196
Cut                5
Color              6
Clarity            7
Polish             4
Symmetry           4
Report             2
Price           4821
dtype: int64

- Based on the number of unique values, we can treat the following features as categorical:
    - `Cut`
    - `Color`
    - `Clarity`
    - `Polish`
    - `Symmetry`
    - `Report`
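
The same selection can be made programmatically. The following is a minimal sketch, not part of the original notebook: it runs on a small hypothetical frame that mimics a few of the diamond columns, the helper name `find_categorical` is made up for illustration, and the cutoff `max_unique=10` is an assumed threshold.

```python
import pandas as pd

# Hypothetical miniature of the diamond data (values are for illustration only)
sample = pd.DataFrame({
    "Carat Weight": [1.10, 0.83, 0.85, 0.91, 0.83, 1.52],
    "Cut": ["Ideal", "Ideal", "Very Good", "Good", "Ideal", "Fair"],
    "Report": ["GIA", "AGSL", "GIA", "GIA", "GIA", "AGSL"],
    "Price": [5169, 3470, 3183, 4370, 3171, 9000],
})

def find_categorical(df, max_unique=10):
    """Return non-numeric columns with at most `max_unique` distinct values."""
    return [
        col for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
        and df[col].nunique() <= max_unique
    ]

categorical_features = find_categorical(sample)
print(categorical_features)
```

Applied to the full dataset, the call would be `find_categorical(data)`; the threshold should be adjusted to the data at hand.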

In [5]:
data["Cut"].value_counts()


Ideal              2482
Very Good          2428
Good                708
Signature-Ideal     253
Fair                129
Name: Cut, dtype: int64

In [6]:
data["Color"].value_counts()

G    1501
H    1079
F    1013
I     968
E     778
D     661
Name: Color, dtype: int64

In [7]:
data["Clarity"].value_counts()

SI1     2059
VS2     1575
VS1     1192
VVS2     666
VVS1     285
IF       219
FL         4
Name: Clarity, dtype: int64

In [8]:
data["Polish"].value_counts()

EX    2425
VG    2409
ID     595
G      571
Name: Polish, dtype: int64

In [9]:
data["Symmetry"].value_counts()

VG    2417
EX    2059
G      916
ID     608
Name: Symmetry, dtype: int64

In [10]:
data["Report"].value_counts()

GIA     5266
AGSL     734
Name: Report, dtype: int64

In [11]:
# scatter plot of Carat Weight vs. Price, one facet per Cut, with an OLS trendline
# (trendline='ols' requires the statsmodels package)
fig = px.scatter(data, x='Carat Weight', y='Price',
                 facet_col='Cut', opacity=0.25, trendline='ols',
                 trendline_color_override='red', title='Diamond Case Study')
fig.show()

In [12]:
# plot histogram - distribution of the target variable.
fig = px.histogram(data, x=["Price"], title = 'Histogram of Price')
fig.show()

- `Price` is right-skewed, so we check whether a log transformation brings its distribution closer to normal.

In [13]:
# create a copy of data
data_copy = data.copy()
# create a new feature Log_Price
data_copy['Log_Price'] = np.log(data['Price'])
# plot histogram
fig = px.histogram(data_copy, x=["Log_Price"], title = 'Histogram of Log Price')
fig.show()


- The **log transform** confirms our hypothesis: it makes the distribution of the target variable `Price` approximately normal.
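
To back the visual impression with a number, one option is to compare the sample skewness before and after the transform. The sketch below is only an illustration: it uses synthetic log-normal data rather than the diamond prices, and the helper `sample_skewness` and the distribution parameters are assumptions made for this example.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: near 0 for symmetric data, > 0 for a right tail."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

# Synthetic right-skewed "prices" (assumed parameters, not the diamond data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=8.5, sigma=0.8, size=5000)

skew_raw = sample_skewness(prices)
skew_log = sample_skewness(np.log(prices))
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

On the notebook's data, the same helper could be applied to `data['Price']` and `data_copy['Log_Price']`.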

### Data Preparation

- In PyCaret, `setup` is the first and mandatory step of every ML experiment.
- For more information, see the [preprocessing documentation](https://pycaret.org/preprocessing/).

In [14]:
# initialize setup
main_setup = setup(data, target = 'Price', 
                   transform_target = True, 
                   log_experiment = True, 
                   experiment_name = 'zoomcamp')

   Description                Value
0  session_id                 3103
1  Target                     Price
2  Original Data              (6000, 8)
3  Missing Values             False
4  Numeric Features           1
5  Categorical Features       6
6  Ordinal Features           False
7  High Cardinality Features  False
8  High Cardinality Method
9  Transformed Train Set      (4199, 28)


2022/09/12 18:15:32 INFO mlflow.tracking.fluent: Experiment with name 'zoomcamp' does not exist. Creating a new experiment.


- Here we passed `log_experiment = True` and `experiment_name = 'zoomcamp'`, which tells PyCaret to automatically log all metrics, hyperparameters, and model artifacts behind the scenes as you progress through the modeling phase.
- This is possible thanks to PyCaret's integration with MLflow.
- `transform_target = True` transforms the `Price` variable using the Box-Cox transformation.
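
For reference, the Box-Cox transform of a positive value `y` with parameter `lam` is `(y**lam - 1) / lam`, with `log(y)` as the limiting case `lam = 0`. The sketch below only illustrates that formula and its inverse. It is not PyCaret's implementation: the function names are hypothetical, and the fixed `lam=0.5` is an arbitrary choice, whereas in practice a suitable value is normally estimated from the data.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)  # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Map transformed values back to the original scale."""
    z = np.asarray(z, dtype=float)
    if lam == 0:
        return np.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)

# Prices from the first five rows shown earlier in the notebook
prices = np.array([5169.0, 3470.0, 3183.0, 4370.0, 3171.0])

transformed = box_cox(prices, lam=0.5)          # lam=0.5 is an assumption
restored = inverse_box_cox(transformed, lam=0.5)
print(np.allclose(restored, prices))
```

The inverse matters because a model trained on the transformed target predicts on the transformed scale, and those predictions have to be mapped back to prices.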