# Regression with a Tabular Gemstone Price Dataset

Thank you to everyone who participated in and contributed to last year's Tabular Playground Series. And many thanks to all those who took the time to provide constructive feedback! We're thrilled that there continues to be interest in these types of challenges, and we're continuing the series this year but with a few changes.

First, the series is getting upgraded branding. We've dropped "Tabular" from the name because, while we anticipate this series will still have plenty of tabular competitions, we'll also have some other formats. You'll also notice freshly upgraded (better looking and more fun!) banner and thumbnail images.

Second, rather than naming the challenges by month and year, we're moving to a Season-Edition format. This year is Season 3, and each challenge will be a new Edition. We're doing this to have more flexibility: competitions going forward won't necessarily align with each month like they did in previous years (although some might!), we'll have competitions with different durations, and we may occasionally have multiple competitions running at the same time.

Regardless of these changes, the goals of the Playground Series remain the same: to give the Kaggle community a variety of fairly lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. We hope we continue to meet this objective!

Following the great start and participation in January, we will continue Tabular Tuesday in February, launching a new competition every Tuesday at 00:00 UTC, with each competition now running for 2 weeks. Again, these will be fairly lightweight datasets that are synthetically generated from real-world data, and they will provide an opportunity to quickly iterate through various model and feature engineering ideas, create visualizations, etc.

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('train.csv')

In [5]:
df

index,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387
2,2,0.70,Ideal,G,VS1,61.2,57.0,5.69,5.73,3.50,2772
3,3,0.32,Ideal,G,VS1,61.6,56.0,4.38,4.41,2.71,666
4,4,1.70,Premium,G,VS2,62.6,59.0,7.65,7.61,4.77,14453
...,...,...,...,...,...,...,...,...,...,...,...
193568,193568,0.31,Ideal,D,VVS2,61.1,56.0,4.35,4.39,2.67,1130
193569,193569,0.70,Premium,G,VVS2,60.3,58.0,5.75,5.77,3.47,2874
193570,193570,0.73,Very Good,F,SI1,63.1,57.0,5.72,5.75,3.62,3036
193571,193571,0.34,Very Good,D,SI1,62.9,55.0,4.45,4.49,2.81,681


In [4]:
del df['id']

In [6]:
df['cut'].value_counts()

cut,count
Ideal,92454
Premium,49910
Very Good,37566
Good,11622
Fair,2021


In [3]:
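# Encode the cut labels as integers so they can be used as a numeric feature.
# Note: the codes are arbitrary labels and do not follow a quality ranking.
d={'Ideal':1,'Fair':2,'Good':3,'Very Good':4,'Premium':5}
df['cut']=df['cut'].map(d)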
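
Because the integer codes are passed to the models as a numeric feature, their order can matter. A minimal sketch of an ordered encoding follows, assuming the usual diamond grading order Fair < Good < Very Good < Premium < Ideal. That order is an assumption, not something stated in this notebook, and `CUT_ORDER`, `sample`, and `cut_code` are illustrative names.

```python
import pandas as pd

# Assumed quality order, worst to best (not confirmed by the notebook).
CUT_ORDER = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Tiny illustrative frame; in the notebook this would be df['cut'] before mapping.
sample = pd.DataFrame({'cut': ['Premium', 'Very Good', 'Ideal', 'Fair', 'Good']})

# Map each label to its rank so that larger codes mean better cuts.
cut_rank = {label: rank for rank, label in enumerate(CUT_ORDER, start=1)}
sample['cut_code'] = sample['cut'].map(cut_rank)

# Labels missing from the mapping would become NaN, so check before training.
assert sample['cut_code'].notna().all()
print(sample)
```

Whether an ordered code helps depends on the model; tree-based models are less sensitive to the choice than linear models or neural networks.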

In [11]:
df.isnull().sum()

column,null_count
carat,0
cut,0
color,0
clarity,0
depth,0
table,0
x,0
y,0
z,0
price,0


In [10]:
#pip install pycaret

In [9]:
from pycaret.regression import *

In [12]:
setup(data=df,target='price')

index,Description,Value
0,Session id,7643
1,Target,price
2,Target type,Regression
3,Original data shape,"(193573, 10)"
4,Transformed data shape,"(193573, 23)"
5,Transformed train set shape,"(135501, 23)"
6,Transformed test set shape,"(58072, 23)"
7,Numeric features,7
8,Categorical features,2
9,Preprocess,True


<pycaret.regression.oop.RegressionExperiment at 0x7f7ecb3fa6e0>

In [13]:
best_model=compare_models()

ID,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
knn,K Neighbors Regressor,396.0647,576867.9478,759.4266,0.9645,0.1478,0.1098,10.365
dt,Decision Tree Regressor,428.929,717846.7039,847.2202,0.9558,0.1545,0.1115,1.662
lr,Linear Regression,625.9297,909698.4765,953.7275,0.9439,0.533,0.3224,1.374
lasso,Lasso Regression,624.5202,909821.7535,953.7943,0.9439,0.5304,0.318,4.997
ridge,Ridge Regression,626.0119,909701.776,953.7293,0.9439,0.5332,0.3224,0.449
llar,Lasso Least Angle Regression,624.419,909832.2609,953.7974,0.9439,0.5326,0.3179,0.433
br,Bayesian Ridge,625.9362,909698.5948,953.7276,0.9439,0.5331,0.3224,0.792
par,Passive Aggressive Regressor,833.1143,1434427.8003,1179.5923,0.9117,0.7666,0.5036,1.921
huber,Huber Regressor,881.0224,1676750.1746,1290.9742,0.8967,0.7001,0.5102,3.635
en,Elastic Net,1203.498,3032468.3657,1741.3498,0.8132,0.8107,0.4682,0.543


KeyboardInterrupt: 

In [14]:
# Take absolute values before sorting so strong negative correlations are not ranked last.
df.corr(numeric_only=True)['price'].abs().sort_values(ascending=False)

feature,price
price,1.0
carat,0.943396
y,0.901033
x,0.901004
z,0.893037
table,0.174915
cut,0.141717
depth,0.00188


In [4]:
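# Select the numeric predictors (cut was mapped to integers above);
# color and clarity are not included here.
x=df[['carat', 'depth', 'table', 'x', 'y','z','cut']]
y=df[['price']]

In [5]:
# All selected columns are already numeric, so get_dummies leaves x unchanged.
x=pd.get_dummies(x,drop_first=True)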
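
Since `color` and `clarity` were left out of `x`, `get_dummies` has no text columns to encode. As a hedged illustration of what one-hot encoding those two columns could look like, here is a small sketch built from the first three rows shown earlier. The names `sample` and `encoded` are illustrative, and including these columns is a suggestion, not what this notebook does.

```python
import pandas as pd

# First three rows of the training data shown above, reduced to three columns.
sample = pd.DataFrame({
    'carat': [1.52, 2.03, 0.70],
    'color': ['F', 'J', 'G'],
    'clarity': ['VS2', 'SI2', 'VS1'],
})

# One-hot encode the two text columns; drop_first removes one level per
# column to avoid redundant (perfectly collinear) indicator columns.
encoded = pd.get_dummies(
    sample, columns=['color', 'clarity'], drop_first=True, dtype=float
)
print(encoded.columns.tolist())
```

With the full dataset the same call would produce one indicator column per remaining level of `color` and `clarity`.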

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=42)
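
The features are passed to the network on their raw scales (for example `carat` near 1 and `depth` near 60). Neural networks often train more smoothly on standardized inputs, so one option is to fit a scaler on the training split only and reuse it for the test split. The sketch below uses synthetic stand-in data and illustrative names (`features`, `target`, `x_tr`, `x_te`); it is not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: two columns on different scales.
rng = np.random.default_rng(42)
features = np.column_stack([
    rng.uniform(0.2, 3.0, size=200),    # carat-like values
    rng.uniform(55.0, 65.0, size=200),  # depth-like values
])
target = rng.uniform(300, 18000, size=200)  # price-like values

x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.20, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test split leaks into preprocessing.
scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print(x_tr_scaled.mean(axis=0), x_tr_scaled.std(axis=0))
```

The scaled arrays would then take the place of `x_train` and `x_test` in the `fit` and `predict` calls.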

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [9]:
model=Sequential()
model.add(Dense(128,activation='relu'))
model.add(Dense(64,activation='relu'))
model.add(Dense(32,activation='relu'))
model.add(Dense(32,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')

In [10]:
history=model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=10,batch_size=20,verbose=1)

Epoch 1/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━ 26s 3ms/step - loss: 5830611.5000 - val_loss: 1439284.3750
Epoch 2/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━ 30s 4ms/step - loss: 1454490.0000 - val_loss: 1359916.5000
Epoch 3/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━ 35s 3ms/step - loss: 1430035.6250 - val_loss: 1360048.2500
Epoch 4/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━ 24s 3ms/step - loss: 1416427.2500 - val_loss: 1341618.3750
Epoch 5/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━ 40s 3ms/step - loss: 1414561.7500 - val_loss: 1442196.8750
Epoch 6/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━ 42s 3ms/step - loss: 1418029.2500 - val_loss: 1326700.7500
Epoch 7/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━ 25s 3ms/step - loss: 1403394.8750 - val_loss: 1361581.6250
Epoch 8/10
7743/7743 ━━━━━━━━━━━━━━━━━━━━

In [11]:
# tahmin is Turkish for "prediction": the model's predicted prices for the test set.
tahmin=model.predict(x_test)

1210/1210 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step


In [12]:
from sklearn.metrics import mean_squared_error,r2_score

In [13]:
r2_score(y_test,tahmin)

0.919446089783461

In [14]:
mean_squared_error(y_test,tahmin)**.5

1140.9281360199004
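
To make the two reported metrics concrete, here is a minimal sketch that computes RMSE and R2 directly from their definitions with NumPy. The helper names `rmse` and `r2` are illustrative, the `actual` values are prices from the first rows shown earlier, and the `predicted` values are made up purely for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

actual = [13619, 13387, 2772, 666]     # prices from the rows shown earlier
predicted = [13000, 14000, 3000, 700]  # made-up predictions for illustration

print(rmse(actual, predicted), r2(actual, predicted))
```

RMSE is expressed in the same units as `price`, which makes it easier to interpret than the raw MSE loss printed during training.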