### Dask XGBoost

#### Introduction to Dask and Dask cuDF

[Dask](https://dask.org/) is a Python library for parallel computing. In Dask programming, we create computational graphs that define the code we **would like** to execute, then hand these graphs to a Dask scheduler, which evaluates them lazily and efficiently in parallel. Besides using multiple CPU cores or threads, Dask schedulers can be configured to execute computational graphs on multiple CPUs or, as we will do in this workshop, on multiple GPUs. Because it can spread work across multiple compute resources, Dask makes it possible to operate on datasets that are larger than the memory of any single resource.
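To make the idea of a lazily evaluated graph concrete, here is a minimal sketch using `dask.delayed`. The functions `square` and `add` are hypothetical helpers introduced only for illustration; they are not part of this workshop's code.

```python
from dask import delayed


@delayed
def square(x):
    return x * x


@delayed
def add(a, b):
    return a + b


# Calling delayed functions only records tasks in a graph; nothing runs yet.
total = add(square(3), square(4))

# compute() hands the graph to a scheduler, which is free to run the two
# independent square() tasks in parallel before adding their results.
result = total.compute()
print(result)  # 25
```

Until `compute()` is called, `total` is only a description of the work, which is what allows the scheduler to decide where and when each task runs.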

[Dask cuDF](https://github.com/rapidsai/dask-cudf) can be used to distribute dataframe operations on larger-than-memory datasets across multiple GPUs. In this notebook you'll receive an introduction to some key Dask concepts, learn how to set up a Dask cluster that uses multiple GPUs, and learn how to perform simple dataframe operations on distributed Dask dataframes.

#### Initialize Dask 

We begin by starting a Dask scheduler, which takes care of distributing our work across the 4 available GPUs. To do this, we start a `LocalCUDACluster` instance using our host machine's IP, and then instantiate a client that can communicate with the cluster.

In [1]:
import dask
from dask import delayed
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

In [2]:
cluster = LocalCUDACluster(ip="")
client = Client(cluster)
client

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


Client
  Scheduler: tcp://192.168.99.2:40069
  Dashboard: http://192.168.99.2:33697/status
Cluster
  Workers: 4
  Cores: 4
  Memory: 270.39 GB


### The Dask Dashboard 
As you can see, the `client` instance gives us information about our CUDA cluster (utilizing 4 GPUs), as well as information about our client connection. Dask ships with an incredibly helpful dashboard, which runs on port `8787` by default. In the session above that port was already in use, so the dashboard was hosted on a random port instead (`33697`, as shown in the `Dashboard` link). Open a new browser tab now at `<YOUR_IP_ADDRESS>:<DASHBOARD_PORT>`, for example `ec2-12-345-67-890.us-east-2.compute.amazonaws.com:8787` when the default port is free, which should open the Dask dashboard, currently idle.
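Because the dashboard port can change between sessions, it may be easier to ask the client for its dashboard URL than to guess the port. The sketch below uses a CPU-only `LocalCluster` as a stand-in for the `LocalCUDACluster` of this notebook, and requests an arbitrary free port with `dashboard_address=":0"`; both choices are assumptions made so the example runs without GPUs.

```python
from dask.distributed import Client, LocalCluster

# CPU-only stand-in for the LocalCUDACluster used in this notebook.
# dashboard_address=":0" requests any free port, similar to the fallback
# that happens when port 8787 is already taken.
cluster = LocalCluster(n_workers=1, threads_per_worker=1,
                       processes=False, dashboard_address=":0")
client = Client(cluster)

# The URL reported by the client reflects the port actually in use
# (it may be empty if the dashboard dependencies are not installed).
dashboard_url = client.dashboard_link
print(dashboard_url)

client.close()
cluster.close()
```

With the notebook's own `client`, evaluating `client.dashboard_link` should report the same address that appears in the `Dashboard` field of the client summary.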


### Import Libraries

In [3]:
# !pip install dask_xgboost
import numpy as np; print('numpy Version:', np.__version__)
import pandas as pd; print('pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import dask_xgboost; print('Dask XGBoost Version:', dask_xgboost.__version__)
import dask_cudf

import time 


import rapids_lib_v8 as rl
# NOTE: anytime changes are made to rapids_lib_v8.py you can either:
#   1. refresh/reload via the code below, OR
#   2. restart the kernel
import importlib; importlib.reload(rl)

numpy Version: 1.16.4
pandas Version: 0.24.2
Scikit-Learn Version: 0.21.2
Dask XGBoost Version: 0.1.7


<module 'rapids_lib_v8' from '/rapids/notebooks/ml_tutorial/testing/rapids_lib_v8.py'>

In [4]:
%store -r expLog
%store -r trainData_pDF
%store -r trainLabels_pDF 
%store -r testData_pDF
%store -r testLabels_pDF

nCores = !nproc --all
nCores = int(nCores[0]) # we want to extract number of cores the CPU has 

paramsCPU = {
    'max_depth': 10,
    'learning_rate': .1,
    'num_boost_rounds': 100,
    'lambda': 1,
    'objective': 'binary:hinge',
    'tree_method': 'hist',
    'n_jobs': nCores,
    'random_state': 0
}

paramsGPU = {
    'max_depth': 10,
    'learning_rate': .1,
    'num_boost_rounds': 100,
    'lambda': 1,
    'objective': 'binary:hinge',
    'tree_method': 'gpu_hist',
    'n_gpus': 1,    
    'random_state': 0
}

In [7]:
''' -------------------------------------------------------------------------
>  CPU Train and Test
------------------------------------------------------------------------- '''
def train_model_CPU (DaskClient, trainData_pDF, trainLabels_pDF, paramsCPU = {}):
    print('training xgboost model on {} CPU cores'.format(nCores) )

    # we will pass the Pandas DataFrame to DMatrix
#     trainDMatrix = dask_xgboost.DMatrix( trainData_pDF, label = trainLabels_pDF)
    startTime = time.time()
    # NOTE: dask_xgboost.train also requires `client`, `data`, and `labels`;
    # as written, this call raises the TypeError shown in the output of In [8].
    xgBoostModelCPU = dask_xgboost.train( dtrain = trainData_pDF, params = paramsCPU,
                                   num_boost_round = paramsCPU['num_boost_rounds'])

    elapsedTime = time.time() - startTime
    print("CPU training time: " + str(elapsedTime), "seconds.")
    return xgBoostModelCPU, elapsedTime

def test_model_CPU ( DaskClient, trainedModelCPU, testData_pDF, testLabels_pDF ):
    print('testing xgboost model on CPU')
#     testDMatrix = dask_xgboost.DMatrix( testData_pDF, label=testLabels_pDF)
    startTime = time.time()

    predictionsCPU = trainedModelCPU.predict(testData_pDF)
    elapsedTime = time.time() - startTime
    print("CPU testing time: " + str(elapsedTime), "seconds.")
    return predictionsCPU, elapsedTime


In [8]:
#model training
trainedModelCPU, t_trainCPU = train_model_CPU ( client,trainData_pDF, trainLabels_pDF, paramsCPU )
#model inferencing
predictionsCPU, t_inferCPU = test_model_CPU ( client,trainedModelCPU, testData_pDF, testLabels_pDF )

training xgboost model on 40 CPU cores


TypeError: train() missing 3 required positional arguments: 'client', 'data', and 'labels'