# How to be a DaskMaster PART-3
Implementing LabelEncoder from dask_ml.preprocessing and XGBRegressor from dask_ml.xgboost on a Dask DataFrame. Nail Dask in 15 minutes. Let’s see how it is done.
This is my last blog post on Dask. Reading these three parts won’t exactly get you to God-Level, but you will get a fair idea of how to become “The God” in Dask.


In [1]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import pyarrow

In [2]:
from dask.distributed import Client, progress
client = Client(n_workers=4, memory_limit='4GB')
client

Client: Scheduler: tcp://127.0.0.1:54823, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 16.00 GB


In [3]:
data = dd.read_parquet('zomato.parquet', engine='pyarrow')

In [4]:
del(data["url"])
del(data["address"])
del(data["menu_item"])
del(data["dish_liked"])
del(data["listed_in(type)"])
del(data["phone"])
del(data["reviews_list"])
del(data["rest_type"])
del(data["cuisines"])

In [5]:
del(data["name"])

In [6]:
data=client.persist(data)

In [7]:
data["rate"]=data["rate"].str.replace("-","650")

In [8]:
data["approx_cost(for two people)"]=data["approx_cost(for two people)"].str.replace(",","")

In [9]:
data["rate"]=data["rate"].str.replace("/5","")
data["rate"]=data["rate"].str.replace("NEW","0.0")

In [10]:
# ((data.isnull() | data.isna()).sum() * 100 / data.index.size).round(2).compute()
data["location"]=data["location"].fillna(method="ffill")
data["approx_cost(for two people)"]=data["approx_cost(for two people)"].fillna(method="ffill")

In [11]:
Y=data.iloc[:,2]
del(data["rate"])
X=data

In [12]:
X=client.persist(X)

# LabelEncoding:
We have categorical values in our data, so we will use LabelEncoder. Label encoding is a technique that represents every distinct value in a column with an integer. We do this because the model doesn’t understand strings, so we give it numeric data instead.
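To make the idea concrete, here is a minimal NumPy sketch of the mapping a label encoder produces: each distinct string gets an integer code based on its position in the sorted list of labels. The toy column is made up for illustration, and this is not the dask_ml implementation, only the same idea on a tiny in-memory array.

```python
import numpy as np

# Toy column standing in for a categorical feature such as "online_order".
values = np.array(["Yes", "No", "No", "Yes", "No"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the integer position of each original value within that sorted list.
classes, codes = np.unique(values, return_inverse=True)

print(classes)  # ['No' 'Yes']
print(codes)    # [1 0 0 1 0]

# Indexing the labels with the codes recovers the original strings.
decoded = classes[codes]
```

Note that the integers carry no real ordering: "Yes" is not "greater" than "No", it just sorts later alphabetically.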

In [14]:
from dask_ml.preprocessing import LabelEncoder

In [15]:
encoder=LabelEncoder()
encoder2=LabelEncoder()
encoder3=LabelEncoder()
encoder4=LabelEncoder()

In the above code, we imported LabelEncoder from dask_ml.preprocessing and not from scikit-learn. It’s important to understand that we are dealing with a Dask DataFrame and not with an ordinary pandas DataFrame.

In [18]:
del(X["votes"])
del(X["approx_cost(for two people)"])
X2=X.values
encod=encoder.fit_transform(X2[:,0]).compute()
encod2=encoder2.fit_transform(X2[:,1]).compute()
encod3=encoder3.fit_transform(X2[:,2]).compute()
encod4=encoder4.fit_transform(X2[:,3]).compute()

In the above code, we deleted the columns from X that don’t need encoding (they were already numeric). After that, we converted X into an array with X2=X.values. Then we simply called fit_transform on every column of that array.
* It is important to understand that the Label Encoder returns its result as an array. It always returns an array-like structure, so we will convert every array back into a Dask DataFrame.

In [25]:
encod5=data["votes"].values.compute()
encod6=data["approx_cost(for two people)"].values.compute()
X=pd.DataFrame({'online_order': encod, 'book_table': encod2, "location":encod3,"listed_in(city)":encod4,
              "votes":encod5,"approx_cost(for two people)":encod6},
             columns=['online_order', 'book_table',"location","listed_in(city)","votes","approx_cost(for two people)"])
X=dd.from_pandas(X,npartitions=100)

We converted those array-like structures back into a pandas DataFrame. Then we turned the pandas DataFrame into a Dask DataFrame with npartitions=100.
Next, we clean our target Y and label encode that too.

In [31]:
X=client.persist(X)

In [32]:
Y=client.persist(Y)

In [34]:
Y=Y.replace("None",0.0)
Y=Y.fillna(0)  # no NaN is left after this, so a further ffill would change nothing

In [38]:
# Use a dedicated encoder for the target instead of refitting the one used for X.
encoder_y=LabelEncoder()
Y2=Y.astype(float)
encodery=encoder_y.fit_transform(Y2).compute()
encodery

In [41]:
encodery

array([23, 23, 20, ...,  0, 25, 16], dtype=int64)

# After this, we will change Y back into a Dask DataFrame.

In [42]:
Y=pd.DataFrame(encodery)
Y=dd.from_pandas(Y,npartitions=100)

In [43]:
Y=client.persist(Y)

# Now we have both our X and Y ready for training and testing

In [47]:
X=X.astype(float)
Y=Y.astype(float)
from dask_ml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y)

# Model Implementation:
Now we fit the XGBRegressor model and get our predictions.
After that, we will convert the predictions into a Dask collection and calculate the r2_score.

In [50]:
from dask_ml.xgboost import XGBRegressor

est = XGBRegressor()
est.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints=None,
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=1, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method=None, validate_parameters=False, verbosity=None)

In [51]:
prediction = est.predict(X_test)

In [52]:
prediction=dd.from_array(prediction,chunksize=500000)
from sklearn.metrics import r2_score
r2_score(y_test,prediction)

0.8781144173343649

Well, the score is pretty impressive given that we haven’t done any feature engineering or removed any outliers from the dataset. Why haven’t we done any of that?
# Because our main goal was to learn Dask and implement it.
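For readers wondering what that r2_score number measures, here is a small sketch of the standard R² formula, 1 - SS_res / SS_tot, written with NumPy. The helper name r2 and the toy values are made up for illustration; this is not how scikit-learn implements it internally, just the same definition on a few numbers.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared error of the model's predictions.
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Squared error of always predicting the mean of the targets.
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 4.0, 5.0, 4.0]
y_pred = [3.0, 4.0, 4.0, 4.0]

score = r2(y_true, y_pred)
print(score)  # 0.5
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets.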

# THE END