<a href="https://colab.research.google.com/github/khatgarhaastha/WandbExperiment/blob/main/AasthaKhatgarh_Week4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 - Models and Experimentation

## Step 1 Training a model

For this demo, we adapt [this XGBoost tutorial](https://www.datacamp.com/tutorial/xgboost-in-python) to train an XGBoost model, then experiment with it and tune its hyperparameters.


If you run this notebook locally, set up a virtual environment first:
- Use Python 3.10 or earlier.
- Create the virtual environment with `venv`:

`python -m venv NAME`

- Avoid Anaconda, Poetry, or similar package management platforms.
- Install packages with pip:

`python -m pip install <package-name>`

- Once you are done working in the virtual environment, deactivate it with `deactivate`.
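
A quick way to confirm that the active interpreter respects the version ceiling above is a small check like the one below. This is only a sketch: the helper name `python_version_ok` and the idea of enforcing the limit in code are assumptions, not part of the original setup steps.

```python
import sys

# Hypothetical helper (not part of the original notebook): returns True when
# the (major, minor) version is at or below the 3.10 ceiling mentioned above.
def python_version_ok(version=None, max_supported=(3, 10)):
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor <= max_supported

if not python_version_ok():
    print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
          "is newer than 3.10; some packages may not install cleanly.")
```

Running this inside the activated environment checks the interpreter that `python -m pip` will actually use.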

### Install packages

In [1]:
!pip install wandb -qU

In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


### Import data

We will use the Diamonds dataset, loaded from Seaborn. It is also available on [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

Follow the link to read about the features. We will be predicting the price of each diamond.

In [3]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

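| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |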


In [4]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [5]:
diamonds.shape

(53940, 10)

In [6]:
X,y = diamonds.drop('price', axis=1), diamonds[['price']]

# Cast cut, color and clarity to the pandas category dtype so that XGBoost can handle them as categorical features.

X['cut'] = X['cut'].astype('category')
X['color'] = X['color'].astype('category')
X['clarity'] = X['clarity'].astype('category')

### Split the data and train a model

In [7]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [8]:
# Define hyperparameters
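# GPU training: "gpu_hist" is deprecated since XGBoost 2.0 in favor of tree_method="hist" with device="cuda"
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}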

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
)



In [9]:
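# Evaluate on the test set with Root Mean Squared Error (RMSE)

predictions = model.predict(dtest)
# Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"RMSE: {rmse}")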

RMSE: 532.8838153117543
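
To make the metric concrete, here is a minimal pure-Python sketch of what RMSE computes: the square root of the mean squared difference between predictions and targets, expressed in the same units as the target (here, price). The function and the toy numbers are illustrative assumptions, not values from the diamonds data.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the mean squared difference."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy prices (made up for illustration, not taken from the diamonds data)
actual = [300.0, 500.0, 700.0]
predicted = [310.0, 480.0, 730.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.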




### Incorporate validation

In [10]:
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 100

# Evaluation sets to monitor during training; the test split doubles as the validation set here
evals = [(dtrain, "train"), (dtest, "validation")]

In [11]:
evals = [(dtrain, "train"), (dtest, "validation")]

model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=10,
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[10]	train-rmse:550.99470	validation-rmse:571.16640
[20]	train-rmse:491.51435	validation-rmse:544.08058
[30]	train-rmse:464.38845	validation-rmse:537.01895
[40]	train-rmse:445.99106	validation-rmse:533.85127
[50]	train-rmse:430.36010	validation-rmse:532.90320
[60]	train-rmse:418.87898	validation-rmse:533.04629
[70]	train-rmse:409.66247	validation-rmse:533.58046
[80]	train-rmse:397.34048	validation-rmse:534.31963
[90]	train-rmse:389.94294	validation-rmse:532.61946
[99]	train-rmse:377.70831	validation-rmse:532.88383


In [12]:
# Incorporate early stopping
n = 10000


model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[50]	train-rmse:430.36010	validation-rmse:532.90320
[100]	train-rmse:377.56825	validation-rmse:532.79980
[102]	train-rmse:376.20429	validation-rmse:532.59813
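
The stopping rule behind `early_stopping_rounds` can be sketched in a few lines: training halts once the validation metric has gone a fixed number of rounds without improving. The code below is a simplified illustration of that rule, not XGBoost's implementation; the function name and the scores are made up for the example.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, last_round) for a lower-is-better metric."""
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, round_idx
    return best_round, len(val_scores) - 1

# Made-up validation scores, purely for illustration
scores = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.4]
print(early_stopping_round(scores, patience=2))
```

A small patience stops early and may miss a later improvement (as with the final score above), while a large patience trains longer before giving up.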


In [13]:
# Cross-validation

params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 1000

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)




In [14]:
results.head()

| | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|
| 0 | 2861.153015 | 8.266765 | 2861.773555 | 36.937516 |
| 1 | 2081.378004 | 5.534608 | 2084.973481 | 32.064109 |
| 2 | 1545.361682 | 3.287745 | 1553.681211 | 31.059209 |
| 3 | 1182.364236 | 3.585787 | 1192.464771 | 26.157805 |
| 4 | 941.828819 | 2.971779 | 958.467497 | 23.613538 |


In [15]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

549.1039652582465
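
As a rough illustration of what `nfold=5` means, the sketch below partitions sample indices into folds so that every sample is held out exactly once. It is a simplification written for this explanation (no shuffling, hypothetical function name), not how `xgb.cv` builds its folds internally.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), test_idx)
```

Each round's `test-rmse-mean` in the results table is the average over the held-out folds, which is why it is a steadier estimate than a single train/test split.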

## Start W&B


- Log in to your W&B profile using the code below.
- Alternatively, you can set environment variables. Several environment variables change the behavior of W&B logging. The most important are:
    - WANDB_API_KEY - find this in the "Settings" section under your profile
    - WANDB_BASE_URL - this is the URL of the W&B server

- Find your API token under "Profile" -> "Settings" in the W&B App.
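
If you take the environment-variable route, a small check like the following can confirm what is set before a run starts. The variable names come from the list above; the helper `wandb_env_status` is a hypothetical convenience written for this example, and it deliberately reports only whether each variable is present so the API key is never printed.

```python
import os

# Variable names come from the list above; the helper itself is hypothetical.
WANDB_VARS = ("WANDB_API_KEY", "WANDB_BASE_URL")

def wandb_env_status(environ=None):
    """Report which W&B variables are set, without exposing their values."""
    environ = os.environ if environ is None else environ
    return {name: bool(environ.get(name)) for name in WANDB_VARS}

print(wandb_env_status())
```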



In [16]:
!pip install wandb



In [17]:
import wandb

wandb.login()

wandb.init(
    # Set the W&B project where this run will be logged
    project="Week 4",
    config={
        "Logging": "RMSE ERROR",
        "MODEL TYPE": "XGBOOST",
        "Hyper Parameter": ["Learning Rate", "Max Depth"],
    },
)

learning_rates = [0.01, 0.1, 1, 10]
max_depths = [5, 6]

# Grid search: train one model per (learning rate, max depth) pair
for learning_rate in learning_rates:
    for max_depth in max_depths:
        params = {
            "objective": "reg:squarederror",
            "tree_method": "hist",
            "device": "cuda",
            "learning_rate": learning_rate,
            "max_depth": max_depth,
        }
        model = xgb.train(
            params=params,
            dtrain=dtrain,
            num_boost_round=n,
            evals=evals,
        )

        # Score on the held-out split and log the result to W&B
        predictions = model.predict(dtest)
        rmse = np.sqrt(mean_squared_error(y_test, predictions))

        wandb.log({"Learning Rate": learning_rate, "RMSE_Loss": rmse, "max_depth": max_depth})




wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: khatgarh-aastha (khatgarh-aastha0014). Use `wandb login --relogin` to force relogin


[0]	train-rmse:3952.55708	validation-rmse:3949.80754
[1]	train-rmse:3915.55510	validation-rmse:3912.78792
[2]	train-rmse:3878.97544	validation-rmse:3876.12914
[3]	train-rmse:3842.78846	validation-rmse:3839.85949
[4]	train-rmse:3806.98393	validation-rmse:3803.95353
[5]	train-rmse:3771.56190	validation-rmse:3768.49003
[6]	train-rmse:3736.51203	validation-rmse:3733.35154
[7]	train-rmse:3701.80275	validation-rmse:3698.43428
[8]	train-rmse:3667.46018	validation-rmse:3663.84118
[9]	train-rmse:3633.44516	validation-rmse:3629.59855
[10]	train-rmse:3599.78512	validation-rmse:3595.71835
[11]	train-rmse:3566.52404	validation-rmse:3562.25174
[12]	train-rmse:3533.57538	validation-rmse:3529.06934
[13]	train-rmse:3500.98543	validation-rmse:3496.25290
[14]	train-rmse:3468.73912	validation-rmse:3463.77791
[15]	train-rmse:3436.86308	validation-rmse:3431.70991
[16]	train-rmse:3405.27079	validation-rmse:3399.97296
[17]	train-rmse:3374.01820	validation-rmse:3368.58308
[18]	train-rmse:3343.00014	validation-


[21]	train-rmse:3251.98019	validation-rmse:3245.93704
[22]	train-rmse:3222.15696	validation-rmse:3215.98916
[23]	train-rmse:3192.74197	validation-rmse:3186.41546
[24]	train-rmse:3163.69066	validation-rmse:3157.17789
[25]	train-rmse:3134.77831	validation-rmse:3128.24724
[26]	train-rmse:3106.31856	validation-rmse:3099.66496
[27]	train-rmse:3078.12726	validation-rmse:3071.44546
[28]	train-rmse:3050.06496	validation-rmse:3043.48244
[29]	train-rmse:3022.50633	validation-rmse:3015.84312
[30]	train-rmse:2995.08799	validation-rmse:2988.39582
[31]	train-rmse:2967.90551	validation-rmse:2961.31213
[32]	train-rmse:2941.11226	validation-rmse:2934.40841
[33]	train-rmse:2914.50049	validation-rmse:2907.85566
[34]	train-rmse:2888.38305	validation-rmse:2881.64791
[35]	train-rmse:2862.48281	validation-rmse:2855.53057
[36]	train-rmse:2836.73166	validation-rmse:2829.75053
[37]	train-rmse:2811.26062	validation-rmse:2804.27246
[38]	train-rmse:2786.05733	validation-rmse:2779.06548
[39]	train-rmse:2761.25391	v


[15]	train-rmse:3427.65383	validation-rmse:3422.21992
[16]	train-rmse:3395.56677	validation-rmse:3389.96690
[17]	train-rmse:3363.75667	validation-rmse:3358.08757
[18]	train-rmse:3332.31965	validation-rmse:3326.47775
[19]	train-rmse:3301.17180	validation-rmse:3295.26749
[20]	train-rmse:3270.41375	validation-rmse:3264.35438
[21]	train-rmse:3239.92519	validation-rmse:3233.88165
[22]	train-rmse:3209.73054	validation-rmse:3203.59201
[23]	train-rmse:3179.80215	validation-rmse:3173.60675
[24]	train-rmse:3150.18892	validation-rmse:3144.01912
[25]	train-rmse:3120.92669	validation-rmse:3114.64985
[26]	train-rmse:3091.93693	validation-rmse:3085.54884
[27]	train-rmse:3063.14791	validation-rmse:3056.74740
[28]	train-rmse:3034.75572	validation-rmse:3028.31985
[29]	train-rmse:3006.57676	validation-rmse:3000.08035
[30]	train-rmse:2978.78205	validation-rmse:2972.18011
[31]	train-rmse:2951.11499	validation-rmse:2944.49426
[32]	train-rmse:2923.86171	validation-rmse:2917.20940
[33]	train-rmse:2896.92357	v


[18]	train-rmse:876.04459	validation-rmse:870.20933
[19]	train-rmse:835.72850	validation-rmse:830.74876
[20]	train-rmse:799.97011	validation-rmse:795.34176
[21]	train-rmse:767.54168	validation-rmse:763.90796
[22]	train-rmse:740.82562	validation-rmse:738.32150
[23]	train-rmse:716.57180	validation-rmse:715.43194
[24]	train-rmse:695.35435	validation-rmse:695.91944
[25]	train-rmse:675.79616	validation-rmse:678.07192
[26]	train-rmse:660.31635	validation-rmse:663.60880
[27]	train-rmse:646.02883	validation-rmse:650.00573
[28]	train-rmse:632.50089	validation-rmse:636.87752
[29]	train-rmse:620.32877	validation-rmse:625.63176
[30]	train-rmse:610.24458	validation-rmse:616.48974
[31]	train-rmse:601.33728	validation-rmse:608.95054
[32]	train-rmse:593.38461	validation-rmse:601.44164
[33]	train-rmse:586.78165	validation-rmse:596.37761
[34]	train-rmse:580.18515	validation-rmse:590.00542
[35]	train-rmse:574.76032	validation-rmse:586.21111
[36]	train-rmse:570.13529	validation-rmse:582.19838
[37]	train-r


[14]	train-rmse:1052.85512	validation-rmse:1048.28505
[15]	train-rmse:982.80467	validation-rmse:978.20323
[16]	train-rmse:921.95189	validation-rmse:918.57590
[17]	train-rmse:868.86000	validation-rmse:866.21695
[18]	train-rmse:821.45477	validation-rmse:819.79389
[19]	train-rmse:780.15752	validation-rmse:780.41317
[20]	train-rmse:744.73249	validation-rmse:746.78021
[21]	train-rmse:712.48331	validation-rmse:716.67222
[22]	train-rmse:685.73700	validation-rmse:691.37735
[23]	train-rmse:662.00527	validation-rmse:669.30421
[24]	train-rmse:641.58231	validation-rmse:651.29228
[25]	train-rmse:622.49311	validation-rmse:634.41151
[26]	train-rmse:607.23098	validation-rmse:620.24614
[27]	train-rmse:593.31149	validation-rmse:608.92380
[28]	train-rmse:581.00149	validation-rmse:598.52109
[29]	train-rmse:571.35028	validation-rmse:590.04102
[30]	train-rmse:562.18272	validation-rmse:582.61892
[31]	train-rmse:554.44091	validation-rmse:576.08962
[32]	train-rmse:547.95331	validation-rmse:570.85828
[33]	train


[20]	train-rmse:534.15503	validation-rmse:625.20865
[21]	train-rmse:531.19671	validation-rmse:624.81109
[22]	train-rmse:526.65594	validation-rmse:623.71310
[23]	train-rmse:524.71252	validation-rmse:624.09216
[24]	train-rmse:520.53834	validation-rmse:624.76006
[25]	train-rmse:516.49792	validation-rmse:624.95814
[26]	train-rmse:512.86374	validation-rmse:623.36803
[27]	train-rmse:508.17552	validation-rmse:621.71147
[28]	train-rmse:504.47074	validation-rmse:621.56041
[29]	train-rmse:501.70129	validation-rmse:623.40245
[30]	train-rmse:498.36103	validation-rmse:620.94348
[31]	train-rmse:497.07856	validation-rmse:622.11906
[32]	train-rmse:496.01848	validation-rmse:623.21396
[33]	train-rmse:494.27469	validation-rmse:621.81862
[34]	train-rmse:491.56497	validation-rmse:626.55317
[35]	train-rmse:489.29301	validation-rmse:627.11675
[36]	train-rmse:486.23739	validation-rmse:627.24382
[37]	train-rmse:485.77792	validation-rmse:626.88635
[38]	train-rmse:481.96334	validation-rmse:625.01534
[39]	train-r


[12]	train-rmse:499.46558	validation-rmse:616.88384
[13]	train-rmse:496.11044	validation-rmse:615.84327
[14]	train-rmse:487.49457	validation-rmse:613.80263
[15]	train-rmse:481.88289	validation-rmse:616.17644
[16]	train-rmse:474.15490	validation-rmse:617.57460
[17]	train-rmse:467.80670	validation-rmse:616.29988
[18]	train-rmse:462.64034	validation-rmse:617.64919
[19]	train-rmse:460.42357	validation-rmse:617.90774
[20]	train-rmse:459.41657	validation-rmse:617.34662
[21]	train-rmse:455.14380	validation-rmse:619.24698
[22]	train-rmse:451.47170	validation-rmse:619.32644
[23]	train-rmse:447.18370	validation-rmse:623.97567
[24]	train-rmse:443.66658	validation-rmse:622.12417
[25]	train-rmse:437.74016	validation-rmse:620.29732
[26]	train-rmse:436.24065	validation-rmse:619.33118
[27]	train-rmse:431.94956	validation-rmse:620.94879
[28]	train-rmse:429.12783	validation-rmse:619.19715
[29]	train-rmse:424.24780	validation-rmse:619.29201
[30]	train-rmse:421.53975	validation-rmse:620.99924
[31]	train-r


[11]	train-rmse:1078039840585116.50000	validation-rmse:1077879103871953.50000
[12]	train-rmse:9692923326102616.00000	validation-rmse:9691302230233348.00000
[13]	train-rmse:81040172322154832.00000	validation-rmse:81146060464305712.00000
[14]	train-rmse:688450974443589632.00000	validation-rmse:687322867045399936.00000
[15]	train-rmse:4305434412174650880.00000	validation-rmse:4312127161504396800.00000
[16]	train-rmse:inf	validation-rmse:inf
[17]	train-rmse:inf	validation-rmse:inf
[18]	train-rmse:inf	validation-rmse:inf
[19]	train-rmse:inf	validation-rmse:inf
[20]	train-rmse:inf	validation-rmse:inf
[21]	train-rmse:inf	validation-rmse:inf
[22]	train-rmse:inf	validation-rmse:inf
[23]	train-rmse:inf	validation-rmse:inf
[24]	train-rmse:inf	validation-rmse:inf
[25]	train-rmse:inf	validation-rmse:inf
[26]	train-rmse:inf	validation-rmse:inf
[27]	train-rmse:inf	validation-rmse:inf
[28]	train-rmse:inf	validation-rmse:inf
[29]	train-rmse:inf	validation-rmse:inf
[30]	train-rmse:inf	validation-rmse:in


[11]	train-rmse:1074990788849983.75000	validation-rmse:1078040584598995.87500
[12]	train-rmse:9656668027636066.00000	validation-rmse:9683672340785148.00000
[13]	train-rmse:81947611904893632.00000	validation-rmse:82344108348039232.00000
[14]	train-rmse:731643152519285120.00000	validation-rmse:734948349880987392.00000
[15]	train-rmse:4270517686378369024.00000	validation-rmse:4276249013604696576.00000
[16]	train-rmse:inf	validation-rmse:inf
[17]	train-rmse:inf	validation-rmse:inf
[18]	train-rmse:inf	validation-rmse:inf
[19]	train-rmse:inf	validation-rmse:inf
[20]	train-rmse:inf	validation-rmse:inf
[21]	train-rmse:inf	validation-rmse:inf
[22]	train-rmse:inf	validation-rmse:inf
[23]	train-rmse:inf	validation-rmse:inf
[24]	train-rmse:inf	validation-rmse:inf
[25]	train-rmse:inf	validation-rmse:inf
[26]	train-rmse:inf	validation-rmse:inf
[27]	train-rmse:inf	validation-rmse:inf
[28]	train-rmse:inf	validation-rmse:inf
[29]	train-rmse:inf	validation-rmse:inf
[30]	train-rmse:inf	validation-rmse:in
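
The logs above show the RMSE running off to `inf` for the largest learning rate, so any comparison across the grid should skip non-finite scores. The sketch below enumerates the same grid with `itertools.product` and picks the best finite result. The scorer `toy_score` is a stand-in invented for illustration; a real run would train a model and return its validation RMSE, and the helper `best_finite` is likewise an assumption rather than part of the original notebook.

```python
import itertools
import math

learning_rates = [0.01, 0.1, 1, 10]
max_depths = [5, 6]

def best_finite(results):
    """Return the (params, score) pair with the lowest finite score, or None."""
    finite = [(p, s) for p, s in results if math.isfinite(s)]
    return min(finite, key=lambda pair: pair[1]) if finite else None

# Stand-in scorer (an assumption for illustration only): a real run would
# train a model and return its validation RMSE instead.
def toy_score(learning_rate, max_depth):
    return math.inf if learning_rate >= 10 else abs(learning_rate - 0.1) + max_depth

grid = list(itertools.product(learning_rates, max_depths))
results = [((lr, depth), toy_score(lr, depth)) for lr, depth in grid]
print(len(grid), best_finite(results))
```

Flattening the nested loops into one list of combinations also makes it straightforward to log each combination as its own row.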


In [20]:


params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda", "learning_rate": 0.01, "max_depth": 5}

n = 1500

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=25
)

print(f"Best RMSE : {results['test-rmse-mean'].min()}")



Best RMSE : 535.1642848700978


Results : https://api.wandb.ai/links/khatgarh-aastha0014/00gvzmr5