This notebook presents my solution to the **HackerEarth Machine Learning Competition: Predict A(rt)**, which placed **1st** on the leaderboard.

**Task**

You work for a company that sells sculptures that are acquired from various artists around the world. Your task is to predict the cost required to ship these sculptures to customers based on the information provided in the dataset.

I used the **CatBoost** algorithm because it trained much faster and gave much better results than XGBoost.

In [1]:
# Extract the competition data from the zip archive

import zipfile
# The context manager closes the archive automatically
with zipfile.ZipFile("/content/drive/MyDrive/Hackerearth_Art.zip", 'r') as zip_ref:
    zip_ref.extractall("/Hackerearth_Art_NEW")

In [2]:
# Install the catboost package
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/96/3b/bb419654adcf7efff42ed8a3f84e50c8f236424b7ed1cc8ccd290852e003/catboost-0.24.4-cp37-none-manylinux1_x86_64.whl (65.7MB)
Installing collected packages: catboost
Successfully installed catboost-0.24.4


In [3]:
#importing Required Libraries

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import sklearn.metrics as metrics
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
import warnings
from catboost import CatBoostRegressor
warnings.filterwarnings("ignore")

In [4]:
# Let's look at the data: read the train and test sets from the dataset folder

train = pd.read_csv('/Hackerearth_Art_NEW/dataset/train.csv')
test = pd.read_csv('/Hackerearth_Art_NEW/dataset/test.csv')

In [5]:
train.head()

Unnamed: 0,Customer Id,Artist Name,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Scheduled Date,Delivery Date,Customer Location,Cost
0,fffe3900350033003300,Billy Jenkins,0.26,17.0,6.0,4128.0,Brass,13.91,16.27,Yes,Yes,No,Airways,No,Working Class,No,06/07/15,06/03/15,"New Michelle, OH 50777",-283.29
1,fffe3800330031003900,Jean Bryant,0.28,3.0,3.0,61.0,Brass,6.83,15.0,No,No,No,Roadways,No,Working Class,No,03/06/17,03/05/17,"New Michaelport, WY 12072",-159.96
2,fffe3600370035003100,Laura Miller,0.07,8.0,5.0,237.0,Clay,4.96,21.18,No,No,No,Roadways,Yes,Working Class,Yes,03/09/15,03/08/15,"Bowmanshire, WA 19241",-154.29
3,fffe350031003300,Robert Chaires,0.12,9.0,,,Aluminium,5.81,16.31,No,No,No,,No,Wealthy,Yes,05/24/15,05/20/15,"East Robyn, KY 86375",-161.16
4,fffe3900320038003400,Rosalyn Krol,0.15,17.0,6.0,324.0,Aluminium,3.18,11.94,Yes,Yes,Yes,Airways,No,Working Class,No,12/18/16,12/14/16,"Aprilside, PA 52793",-159.23


*Looking at the data, we can see that many columns are categorical and that some, such as Artist Name, are of no use for prediction and can be removed.*

In [6]:
test.head()

Unnamed: 0,Customer Id,Artist Name,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Scheduled Date,Delivery Date,Customer Location
0,fffe3400310033003300,James Miller,0.35,53.0,18.0,871.0,Wood,5.98,19.11,Yes,Yes,No,Airways,No,Working Class,No,07/03/17,07/06/17,"Santoshaven, IA 63481"
1,fffe3600350035003400,Karen Vetrano,0.67,7.0,4.0,108.0,Clay,6.92,13.96,No,No,No,Roadways,Yes,Working Class,No,05/02/16,05/02/16,"Ericksonton, OH 98253"
2,fffe3700360030003500,Roseanne Gaona,0.61,6.0,5.0,97.0,Aluminium,4.23,13.62,Yes,No,No,Airways,No,Working Class,No,01/04/18,01/06/18,APO AP 83453
3,fffe350038003600,Todd Almanza,0.14,15.0,8.0,757.0,Clay,6.28,23.79,No,Yes,No,Roadways,Yes,Wealthy,No,09/14/17,09/17/17,"Antonioborough, AL 54778"
4,fffe3500390032003500,Francis Rivero,0.63,10.0,4.0,1673.0,Marble,4.39,17.83,No,Yes,Yes,Roadways,No,Working Class,Yes,12/03/17,12/02/17,"Lake Frances, LA 03040"


In [7]:
# We will store the customer id so that we can use them later for submission 
cus_id = pd.DataFrame(test['Customer Id'])

In [8]:
df = train  # alias, not a copy: every change made to df also applies to train

In [9]:
df.drop(['Customer Id','Artist Name'], axis = 1, inplace=True)
df.shape

(6500, 18)

In [10]:
test.drop(['Customer Id','Artist Name'], axis =1, inplace=True)
test.shape

(3500, 17)

In [11]:
# Let's look at the number of missing values in the data
df.isna().sum()

Artist Reputation         750
Height                    375
Width                     584
Weight                    587
Material                  764
Price Of Sculpture          0
Base Shipping Price         0
International               0
Express Shipment            0
Installation Included       0
Transport                1392
Fragile                     0
Customer Information        0
Remote Location           771
Scheduled Date              0
Delivery Date               0
Customer Location           0
Cost                        0
dtype: int64

***Many columns contain null values, so we will impute them: the median for numeric columns and the mode for categorical columns.***

In [12]:
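# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not apply to the original frame
df['Artist Reputation'] = df['Artist Reputation'].fillna(df['Artist Reputation'].median())
test['Artist Reputation'] = test['Artist Reputation'].fillna(test['Artist Reputation'].median())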
df['Artist Reputation'].isna().any()

False

In [13]:
df['Height'] = df['Height'].fillna(df['Height'].median())
test['Height'] = test['Height'].fillna(test['Height'].median())
df['Height'].isna().any()

False

In [14]:
df['Width'] = df['Width'].fillna(df['Width'].median())
test['Width'] = test['Width'].fillna(test['Width'].median())
df['Width'].isna().any()

False

In [15]:
df['Weight'] = df['Weight'].fillna(df['Weight'].median())
test['Weight'] = test['Weight'].fillna(test['Weight'].median())
df['Weight'].isna().any()

False

In [16]:
df['Transport'] = df['Transport'].fillna(df['Transport'].mode()[0])
test['Transport'] = test['Transport'].fillna(test['Transport'].mode()[0])
df['Transport'].isna().any()

False

In [17]:
df['Remote Location'] = df['Remote Location'].fillna(df['Remote Location'].mode()[0])
test['Remote Location'] = test['Remote Location'].fillna(test['Remote Location'].mode()[0])
df['Remote Location'].isna().any()

False

In [18]:
df['Material'] = df['Material'].fillna(df['Material'].mode()[0])
test['Material'] = test['Material'].fillna(test['Material'].mode()[0])
df['Material'].isna().any()

False
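
The imputation cells above repeat one pattern: numeric columns get the median and categorical columns get the mode. A minimal sketch of that pattern as a single loop follows. The helper name `impute_like_train` is hypothetical, and it makes one assumption that differs from the cells above: the fill values are computed on the training frame only and then reused for the test frame, so both frames are filled with the same values. The toy frames are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd


def impute_like_train(train_df, test_df, numeric_cols, categorical_cols):
    """Hypothetical helper: fill NaNs in both frames with statistics from train_df."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in numeric_cols:
        fill = train_df[col].median()      # median ignores NaN by default
        train_out[col] = train_out[col].fillna(fill)
        test_out[col] = test_out[col].fillna(fill)
    for col in categorical_cols:
        fill = train_df[col].mode()[0]     # most frequent non-null value
        train_out[col] = train_out[col].fillna(fill)
        test_out[col] = test_out[col].fillna(fill)
    return train_out, test_out


# Toy frames (illustrative values only)
toy_train = pd.DataFrame({'Height': [17.0, 3.0, np.nan, 9.0],
                          'Material': ['Brass', 'Brass', 'Clay', np.nan]})
toy_test = pd.DataFrame({'Height': [np.nan, 7.0],
                         'Material': [np.nan, 'Wood']})

filled_train, filled_test = impute_like_train(
    toy_train, toy_test, numeric_cols=['Height'], categorical_cols=['Material'])
print(filled_test)
```

For the real data, the numeric list would be `Artist Reputation`, `Height`, `Width`, `Weight` and the categorical list `Transport`, `Remote Location`, `Material`, matching the cells above.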

***We will now convert the date columns to datetime values and store the difference between them, in days, in a new column.***

In [19]:
df['Scheduled Date'] = pd.to_datetime(df['Scheduled Date'])
df['Delivery Date'] = pd.to_datetime(df['Delivery Date'])
df['scheduleDiff'] = (df['Delivery Date'] - df['Scheduled Date']).dt.days  # difference in whole days

In [20]:
test['Scheduled Date'] = pd.to_datetime(test['Scheduled Date'])
test['Delivery Date'] = pd.to_datetime(test['Delivery Date'])
test['scheduleDiff'] = (test['Delivery Date'] - test['Scheduled Date']).dt.days  # difference in whole days

test.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Scheduled Date,Delivery Date,Customer Location,scheduleDiff
0,0.35,53.0,18.0,871.0,Wood,5.98,19.11,Yes,Yes,No,Airways,No,Working Class,No,2017-07-03,2017-07-06,"Santoshaven, IA 63481",3
1,0.67,7.0,4.0,108.0,Clay,6.92,13.96,No,No,No,Roadways,Yes,Working Class,No,2016-05-02,2016-05-02,"Ericksonton, OH 98253",0
2,0.61,6.0,5.0,97.0,Aluminium,4.23,13.62,Yes,No,No,Airways,No,Working Class,No,2018-01-04,2018-01-06,APO AP 83453,2
3,0.14,15.0,8.0,757.0,Clay,6.28,23.79,No,Yes,No,Roadways,Yes,Wealthy,No,2017-09-14,2017-09-17,"Antonioborough, AL 54778",3
4,0.63,10.0,4.0,1673.0,Marble,4.39,17.83,No,Yes,Yes,Roadways,No,Working Class,Yes,2017-12-03,2017-12-02,"Lake Frances, LA 03040",-1


In [21]:
df.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Scheduled Date,Delivery Date,Customer Location,Cost,scheduleDiff
0,0.26,17.0,6.0,4128.0,Brass,13.91,16.27,Yes,Yes,No,Airways,No,Working Class,No,2015-06-07,2015-06-03,"New Michelle, OH 50777",-283.29,-4
1,0.28,3.0,3.0,61.0,Brass,6.83,15.0,No,No,No,Roadways,No,Working Class,No,2017-03-06,2017-03-05,"New Michaelport, WY 12072",-159.96,-1
2,0.07,8.0,5.0,237.0,Clay,4.96,21.18,No,No,No,Roadways,Yes,Working Class,Yes,2015-03-09,2015-03-08,"Bowmanshire, WA 19241",-154.29,-1
3,0.12,9.0,8.0,3102.0,Aluminium,5.81,16.31,No,No,No,Roadways,No,Wealthy,Yes,2015-05-24,2015-05-20,"East Robyn, KY 86375",-161.16,-4
4,0.15,17.0,6.0,324.0,Aluminium,3.18,11.94,Yes,Yes,Yes,Airways,No,Working Class,No,2016-12-18,2016-12-14,"Aprilside, PA 52793",-159.23,-4
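
As a check on the date handling, the sketch below recomputes `scheduleDiff` for the first two training rows shown above. It assumes the raw strings are in month/day/two-digit-year order (`%m/%d/%y`), which is consistent with values such as 05/24/15 in the data but is not stated by the source; passing an explicit `format` is an addition of this sketch.

```python
import pandas as pd

# First two training rows, as they appear in the raw CSV
sample = pd.DataFrame({
    'Scheduled Date': ['06/07/15', '03/06/17'],
    'Delivery Date': ['06/03/15', '03/05/17'],
})

# Assumed format: month/day/two-digit year
scheduled = pd.to_datetime(sample['Scheduled Date'], format='%m/%d/%y')
delivered = pd.to_datetime(sample['Delivery Date'], format='%m/%d/%y')

# Negative values mean the delivery date precedes the scheduled date
sample['scheduleDiff'] = (delivered - scheduled).dt.days
print(sample['scheduleDiff'].tolist())  # [-4, -1]
```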


**Time to do some Feature Engineering**

Since this is a shipping problem, we create features that relate to shipping:

1.   Volume
2.   Mean shipping price of materials
3.   Weight per shipping 



In [22]:
df['Weight/Shipping'] = df['Weight'] / df['Base Shipping Price']
df['mean_shipping_per_material_transport'] = df.groupby(['Material'])['Base Shipping Price'].transform('mean')
df['volume'] = (df['Height']**2) * (df['Width'])

In [23]:
test['Weight/Shipping'] = test['Weight'] / test['Base Shipping Price']
test['mean_shipping_per_material_transport'] = test.groupby(['Material'])['Base Shipping Price'].transform('mean')
test['volume'] = (test['Height']**2) * (test['Width'])

In [24]:
df['Width_bins'] = np.floor(df['Width'] / 10)               # width in buckets of 10
df['price_bins'] = np.floor(df['Price Of Sculpture'] / 10)  # price in buckets of 10

In [25]:
test['Width_bins'] = np.floor(test['Width'] / 10)               # width in buckets of 10
test['price_bins'] = np.floor(test['Price Of Sculpture'] / 10)  # price in buckets of 10

In [26]:
test.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Scheduled Date,Delivery Date,Customer Location,scheduleDiff,Weight/Shipping,mean_shipping_per_material_transport,volume,Width_bins,price_bins
0,0.35,53.0,18.0,871.0,Wood,5.98,19.11,Yes,Yes,No,Airways,No,Working Class,No,2017-07-03,2017-07-06,"Santoshaven, IA 63481",3,45.578231,17.757576,50562.0,1.0,0.0
1,0.67,7.0,4.0,108.0,Clay,6.92,13.96,No,No,No,Roadways,Yes,Working Class,No,2016-05-02,2016-05-02,"Ericksonton, OH 98253",0,7.73639,25.037646,196.0,0.0,0.0
2,0.61,6.0,5.0,97.0,Aluminium,4.23,13.62,Yes,No,No,Airways,No,Working Class,No,2018-01-04,2018-01-06,APO AP 83453,2,7.12188,19.729349,180.0,0.0,0.0
3,0.14,15.0,8.0,757.0,Clay,6.28,23.79,No,Yes,No,Roadways,Yes,Wealthy,No,2017-09-14,2017-09-17,"Antonioborough, AL 54778",3,31.820092,25.037646,1800.0,0.0,0.0
4,0.63,10.0,4.0,1673.0,Marble,4.39,17.83,No,Yes,Yes,Roadways,No,Working Class,Yes,2017-12-03,2017-12-02,"Lake Frances, LA 03040",-1,93.830623,55.69383,400.0,0.0,0.0
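
The engineered columns are computed twice above, once for `df` and once for `test`. A sketch of the same formulas wrapped in one function follows; the name `add_shipping_features` is hypothetical and not part of the notebook. Note that the `volume` column is Height squared times Width, which is a proxy rather than a true volume because the data has no depth column. The single toy row reuses the first test record shown above.

```python
import numpy as np
import pandas as pd


def add_shipping_features(frame):
    """Hypothetical helper: return a copy of frame with the engineered columns."""
    out = frame.copy()
    out['Weight/Shipping'] = out['Weight'] / out['Base Shipping Price']
    # Mean base shipping price within each Material group
    out['mean_shipping_per_material_transport'] = (
        out.groupby('Material')['Base Shipping Price'].transform('mean'))
    out['volume'] = out['Height'] ** 2 * out['Width']
    out['Width_bins'] = np.floor(out['Width'] / 10)
    out['price_bins'] = np.floor(out['Price Of Sculpture'] / 10)
    return out


# First test record from the table above
row = pd.DataFrame({
    'Height': [53.0], 'Width': [18.0], 'Weight': [871.0], 'Material': ['Wood'],
    'Price Of Sculpture': [5.98], 'Base Shipping Price': [19.11],
})
features = add_shipping_features(row)
print(features[['Weight/Shipping', 'volume', 'Width_bins', 'price_bins']])
```

With a one-row frame the per-material mean is just that row's own shipping price; on the full data it is the group mean, as in the cells above.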


**Time to remove some unwanted columns**

In [27]:
df.drop(['Delivery Date', 'Scheduled Date'], inplace=True, axis=1)
df.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Customer Location,Cost,scheduleDiff,Weight/Shipping,mean_shipping_per_material_transport,volume,Width_bins,price_bins
0,0.26,17.0,6.0,4128.0,Brass,13.91,16.27,Yes,Yes,No,Airways,No,Working Class,No,"New Michelle, OH 50777",-283.29,-4,253.7185,41.622725,1734.0,0.0,1.0
1,0.28,3.0,3.0,61.0,Brass,6.83,15.0,No,No,No,Roadways,No,Working Class,No,"New Michaelport, WY 12072",-159.96,-1,4.066667,41.622725,27.0,0.0,0.0
2,0.07,8.0,5.0,237.0,Clay,4.96,21.18,No,No,No,Roadways,Yes,Working Class,Yes,"Bowmanshire, WA 19241",-154.29,-1,11.189802,26.772745,320.0,0.0,0.0
3,0.12,9.0,8.0,3102.0,Aluminium,5.81,16.31,No,No,No,Roadways,No,Wealthy,Yes,"East Robyn, KY 86375",-161.16,-4,190.190067,19.568071,648.0,0.0,0.0
4,0.15,17.0,6.0,324.0,Aluminium,3.18,11.94,Yes,Yes,Yes,Airways,No,Working Class,No,"Aprilside, PA 52793",-159.23,-4,27.135678,19.568071,1734.0,0.0,0.0


In [28]:
test.drop(['Delivery Date', 'Scheduled Date'], inplace=True, axis=1)
test.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Customer Location,scheduleDiff,Weight/Shipping,mean_shipping_per_material_transport,volume,Width_bins,price_bins
0,0.35,53.0,18.0,871.0,Wood,5.98,19.11,Yes,Yes,No,Airways,No,Working Class,No,"Santoshaven, IA 63481",3,45.578231,17.757576,50562.0,1.0,0.0
1,0.67,7.0,4.0,108.0,Clay,6.92,13.96,No,No,No,Roadways,Yes,Working Class,No,"Ericksonton, OH 98253",0,7.73639,25.037646,196.0,0.0,0.0
2,0.61,6.0,5.0,97.0,Aluminium,4.23,13.62,Yes,No,No,Airways,No,Working Class,No,APO AP 83453,2,7.12188,19.729349,180.0,0.0,0.0
3,0.14,15.0,8.0,757.0,Clay,6.28,23.79,No,Yes,No,Roadways,Yes,Wealthy,No,"Antonioborough, AL 54778",3,31.820092,25.037646,1800.0,0.0,0.0
4,0.63,10.0,4.0,1673.0,Marble,4.39,17.83,No,Yes,Yes,Roadways,No,Working Class,Yes,"Lake Frances, LA 03040",-1,93.830623,55.69383,400.0,0.0,0.0


In [29]:
df.drop(['Customer Location'], inplace=True, axis=1)
test.drop(['Customer Location'], inplace=True, axis=1)

**Now we can separate the numerical and categorical columns and then encode the categorical ones**


In [30]:
# np.object was removed from recent NumPy releases; the 'object' dtype string selects the same columns
num_f = df.select_dtypes(include=[np.number])
cat_f = df.select_dtypes(include=['object'])
num_f1 = test.select_dtypes(include=[np.number])
cat_f2 = test.select_dtypes(include=['object'])

In [31]:
enc = LabelEncoder()
for i in cat_f:
  df[i] = enc.fit_transform(cat_f[i])
  test[i] = enc.fit_transform(cat_f2[i])

In [32]:
df.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Cost,scheduleDiff,Weight/Shipping,mean_shipping_per_material_transport,volume,Width_bins,price_bins
0,0.26,17.0,6.0,4128.0,1,13.91,16.27,1,1,0,0,0,1,0,-283.29,-4,253.7185,41.622725,1734.0,0.0,1.0
1,0.28,3.0,3.0,61.0,1,6.83,15.0,0,0,0,1,0,1,0,-159.96,-1,4.066667,41.622725,27.0,0.0,0.0
2,0.07,8.0,5.0,237.0,3,4.96,21.18,0,0,0,1,1,1,1,-154.29,-1,11.189802,26.772745,320.0,0.0,0.0
3,0.12,9.0,8.0,3102.0,0,5.81,16.31,0,0,0,1,0,0,1,-161.16,-4,190.190067,19.568071,648.0,0.0,0.0
4,0.15,17.0,6.0,324.0,0,3.18,11.94,1,1,1,0,0,1,0,-159.23,-4,27.135678,19.568071,1734.0,0.0,0.0


In [33]:
test.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,scheduleDiff,Weight/Shipping,mean_shipping_per_material_transport,volume,Width_bins,price_bins
0,0.35,53.0,18.0,871.0,6,5.98,19.11,1,1,0,0,0,1,0,3,45.578231,17.757576,50562.0,1.0,0.0
1,0.67,7.0,4.0,108.0,3,6.92,13.96,0,0,0,1,1,1,0,0,7.73639,25.037646,196.0,0.0,0.0
2,0.61,6.0,5.0,97.0,0,4.23,13.62,1,0,0,0,0,1,0,2,7.12188,19.729349,180.0,0.0,0.0
3,0.14,15.0,8.0,757.0,3,6.28,23.79,0,1,0,1,1,0,0,3,31.820092,25.037646,1800.0,0.0,0.0
4,0.63,10.0,4.0,1673.0,4,4.39,17.83,0,1,1,1,0,1,1,-1,93.830623,55.69383,400.0,0.0,0.0
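
The encoding cell calls `fit_transform` separately on the train and test columns. That yields matching integer codes only when both columns contain exactly the same set of categories, because `LabelEncoder` numbers the sorted unique values it sees. The sketch below shows the mismatch on toy data and one way to guarantee a shared mapping, by fitting on the values of both frames together. This is a suggested variation, not what the notebook ran, and the toy `Material` values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_train = pd.DataFrame({'Material': ['Brass', 'Clay', 'Wood', 'Brass']})
toy_test = pd.DataFrame({'Material': ['Wood', 'Clay']})  # 'Brass' never appears here

# Separate fits: each encoder builds its own mapping
separate_train = LabelEncoder().fit_transform(toy_train['Material'])
separate_test = LabelEncoder().fit_transform(toy_test['Material'])

# Shared fit: one mapping built from the categories of both frames
enc = LabelEncoder()
enc.fit(pd.concat([toy_train['Material'], toy_test['Material']]))
shared_train = enc.transform(toy_train['Material'])
shared_test = enc.transform(toy_test['Material'])

print(list(separate_train), list(separate_test))  # 'Wood' is 2 in train but 1 in test
print(list(shared_train), list(shared_test))      # 'Wood' is 2 in both
```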


**The target variable (Cost) has some negative values, so we will log-transform its absolute value.**

In [34]:
df['Cost1'] = np.log1p(abs(df['Cost']))
df.drop(['Cost'],axis=1,inplace=True)
df.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,scheduleDiff,Weight/Shipping,mean_shipping_per_material_transport,volume,Width_bins,price_bins,Cost1
0,0.26,17.0,6.0,4128.0,1,13.91,16.27,1,1,0,0,0,1,0,-4,253.7185,41.622725,1734.0,0.0,1.0,5.649995
1,0.28,3.0,3.0,61.0,1,6.83,15.0,0,0,0,1,0,1,0,-1,4.066667,41.622725,27.0,0.0,0.0,5.081156
2,0.07,8.0,5.0,237.0,3,4.96,21.18,0,0,0,1,1,1,1,-1,11.189802,26.772745,320.0,0.0,0.0,5.045294
3,0.12,9.0,8.0,3102.0,0,5.81,16.31,0,0,0,1,0,0,1,-4,190.190067,19.568071,648.0,0.0,0.0,5.088584
4,0.15,17.0,6.0,324.0,0,3.18,11.94,1,1,1,0,0,1,0,-4,27.135678,19.568071,1734.0,0.0,0.0,5.07661
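
The transform used here is `log1p(|Cost|)`, and predictions are later mapped back with `expm1`. The short sketch below, using the first two `Cost` values from the tables above, shows the round trip and makes one consequence explicit: because the absolute value is taken first, the inverse recovers the magnitude of the cost but not its original sign.

```python
import numpy as np

cost = np.array([-283.29, -159.96])   # first two Cost values from the training data

cost_log = np.log1p(np.abs(cost))     # forward transform, as in the cell above
recovered = np.expm1(cost_log)        # inverse transform used on predictions

print(np.round(cost_log, 6))          # matches the Cost1 column for these rows
print(np.round(recovered, 2))         # magnitudes only: the negative sign is gone
```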


In [36]:
#Splitting the data for training 

X = df.drop(['Cost1'], axis=1)
y = df['Cost1']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=700, shuffle=True)

In [37]:
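# `train` is the same object as `df` (see `df = train` above), so it already holds the engineered columns
X = train.drop(['Cost1'], axis=1)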
X.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,scheduleDiff,Weight/Shipping,mean_shipping_per_material_transport,volume,Width_bins,price_bins
0,0.26,17.0,6.0,4128.0,1,13.91,16.27,1,1,0,0,0,1,0,-4,253.7185,41.622725,1734.0,0.0,1.0
1,0.28,3.0,3.0,61.0,1,6.83,15.0,0,0,0,1,0,1,0,-1,4.066667,41.622725,27.0,0.0,0.0
2,0.07,8.0,5.0,237.0,3,4.96,21.18,0,0,0,1,1,1,1,-1,11.189802,26.772745,320.0,0.0,0.0
3,0.12,9.0,8.0,3102.0,0,5.81,16.31,0,0,0,1,0,0,1,-4,190.190067,19.568071,648.0,0.0,0.0
4,0.15,17.0,6.0,324.0,0,3.18,11.94,1,1,1,0,0,1,0,-4,27.135678,19.568071,1734.0,0.0,0.0


In [38]:
y = train['Cost1']
y

0       5.649995
1       5.081156
2       5.045294
3       5.088584
4       5.076610
          ...   
6495    6.772428
6496    7.206392
6497    5.873666
6498    8.524864
6499    6.584059
Name: Cost1, Length: 6500, dtype: float64

In [39]:
# These parameters came from hyperparameter tuning, so I'm applying them directly

params = {'depth': 5, 'n_estimators': 1999, 'learning_rate': 0.07203293}
cat = CatBoostRegressor(random_state = 1, **params)



In [None]:
# Disabled: save the model to disk with pickle.
# Note that `cat` has not been fitted yet at this point in the notebook.
# import pickle
# with open('/Hackerearth_Art_NEW/catboostmodel.pkl', 'wb') as file:
#     pickle.dump(cat, file)

In [40]:

cat.fit(X,y)

0:	learn: 1.5588830	total: 48.7ms	remaining: 1m 37s
1:	learn: 1.4677843	total: 51.6ms	remaining: 51.5s
2:	learn: 1.3835231	total: 54.1ms	remaining: 36s
3:	learn: 1.3061480	total: 56.5ms	remaining: 28.2s
4:	learn: 1.2372486	total: 58.8ms	remaining: 23.4s
5:	learn: 1.1753268	total: 61.2ms	remaining: 20.3s
6:	learn: 1.1116925	total: 63.6ms	remaining: 18.1s
7:	learn: 1.0548798	total: 65.8ms	remaining: 16.4s
8:	learn: 1.0025295	total: 68.1ms	remaining: 15.1s
9:	learn: 0.9540553	total: 70.7ms	remaining: 14.1s
10:	learn: 0.9100002	total: 74ms	remaining: 13.4s
11:	learn: 0.8684441	total: 76.4ms	remaining: 12.7s
12:	learn: 0.8315043	total: 78.8ms	remaining: 12s
13:	learn: 0.7973447	total: 81.1ms	remaining: 11.5s
14:	learn: 0.7660819	total: 83.4ms	remaining: 11s
15:	learn: 0.7373722	total: 85.5ms	remaining: 10.6s
16:	learn: 0.7095360	total: 87.6ms	remaining: 10.2s
17:	learn: 0.6847950	total: 89.9ms	remaining: 9.89s
18:	learn: 0.6620150	total: 92.1ms	remaining: 9.59s
19:	learn: 0.6408798	total: 9

<catboost.core.CatBoostRegressor at 0x7fd649b3bf50>

In [41]:
y_pred2 = cat.predict(X)
y_pred2

array([5.62452186, 5.06920403, 5.04915548, ..., 5.94904221, 8.65790105,
       6.53738494])

In [42]:
# Note: this score is computed on the same training rows the model was fitted on
actual = np.expm1(y)
predicted = np.expm1(y_pred2)
score = 100*max(0, 1-metrics.mean_squared_log_error(actual, predicted))
score

98.44722558473401
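
The score above is `100 * max(0, 1 - MSLE)`, computed on the same rows the model was fitted on, so it measures training fit rather than performance on unseen data. The sketch below writes the metric out with NumPy so the formula is visible; the function name `msle_score` and the second set of predictions are hypothetical. A hold-out estimate could be obtained by fitting on `X_train`, `y_train` from the earlier split and scoring the predictions for `X_val`.

```python
import numpy as np


def msle_score(actual, predicted):
    """Hypothetical helper: 100 * max(0, 1 - mean squared log error)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Same definition as sklearn's mean_squared_log_error: mean of squared log1p differences
    msle = np.mean((np.log1p(actual) - np.log1p(predicted)) ** 2)
    return 100 * max(0.0, 1 - msle)


actual = [283.29, 159.96]                      # cost magnitudes from the first two rows
perfect = msle_score(actual, actual)           # identical predictions give the maximum score
rough = msle_score(actual, [250.0, 200.0])     # made-up predictions, for illustration only
print(perfect, rough)
```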

In [43]:
pred2 = cat.predict(test)
pred2 = np.expm1(pred2)
pred2

array([   256.23991103,    254.93364468,    184.3472797 , ...,
          675.20056288,    276.93848439, 156629.25398922])

In [44]:
sub1 = cus_id
sub1['Cost'] = pred2
sub1.to_csv("/Hackerearth_Art_NEW/submit_new1.csv", index = False)
sub1.head()

Unnamed: 0,Customer Id,Cost
0,fffe3400310033003300,256.239911
1,fffe3600350035003400,254.933645
2,fffe3700360030003500,184.34728
3,fffe350038003600,199.402754
4,fffe3500390032003500,341.686492
