![](https://github.com/ashishpatel26/Rapidsai_Machine_learning_on_GPU/raw/main/images/rapidsailogo.jpg?raw=true)

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

This notebook sets `CUDA_VISIBLE_DEVICES=2` so that cuDF runs on the third GPU (device indices are zero-based); the choice of GPU is arbitrary.
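A minimal sketch of pinning the device from inside Python instead of the shell. It assumes the variable is set before any CUDA-backed library such as `cudf` or `cupy` is imported, since the setting is read when CUDA initializes; the value `2` simply mirrors the one used in this notebook.

```python
import os

# Must run before importing cudf/cupy: CUDA reads this variable at initialization.
# Device indices are zero-based, so "2" selects the third physical GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"Visible GPU indices: {visible}")
```

Inside the process, the single visible GPU is then addressed as device 0.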

- Pandas: Single CPU
- Dask: Multiple CPUs (plays essentially the same role as Apache Spark)
> Like Apache Spark, Dask operations are lazy. Instead of being executed at that moment, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.
- cuDF: Single GPU
- Dask-cuDF: Multiple GPUs
- Dask-cuDF: Multiple GPUs in the Cluster
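
The lazy behavior described in the quote above can be illustrated with a toy example in plain Python. This is only a sketch of the idea of deferred evaluation, not Dask's implementation; `Lazy`, `load`, and `total` are hypothetical names invented for the illustration.

```python
class Lazy:
    """Toy deferred computation: stores a function and its inputs, runs on compute()."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve upstream Lazy nodes first, then run this node's function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which functions actually ran, and in what order


def load():
    calls.append("load")
    return [1, 2, 3, 4]


def total(values):
    calls.append("total")
    return sum(values)


graph = Lazy(total, Lazy(load))  # builds the graph; nothing has executed yet
assert calls == []               # no work was done while building the graph
result = graph.compute()         # evaluation happens only here
print(result, calls)
```

This is why the Dask and Dask-cuDF cells in this notebook end with `.compute()` or `.head()`: those calls trigger the actual work.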

A Dask cluster can be set up as shown below, but it conflicts with the existing CPU-based Dask, and a single-GPU Dask-cuDF example is sufficient here, so no separate cluster is started.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Create a Dask Cluster with one worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)
```

In [1]:
import os
import numpy as np
import cupy as cp
import pandas as pd
import cudf
import dask
import dask.array as da
import dask_cudf
import time

cp.random.seed(220919)

print(pd.__version__)
print(dask.__version__)
print(cudf.__version__)
print(dask_cudf.__version__)

1.4.2
2022.05.2
22.06.00
22.06.00


In [2]:
# Generate 100M rows
pdf = pd.DataFrame({'x': np.random.random(100000000),
                    'y': np.random.randint(0, 100000000, size=100000000)})
ddf = dask.dataframe.from_pandas(pdf, npartitions=4)
cdf = cudf.DataFrame.from_pandas(pdf)
dcdf = dask_cudf.from_cudf(cdf, npartitions=1)

In [3]:
cdf

| | x | y |
|---|---|---|
| 0 | 0.380743 | 72288222 |
| 1 | 0.339687 | 4515962 |
| 2 | 0.166739 | 19331262 |
| 3 | 0.042991 | 63469325 |
| 4 | 0.207507 | 41130973 |
| ... | ... | ... |
| 99999995 | 0.951648 | 95471732 |
| 99999996 | 0.918114 | 90689195 |
| 99999997 | 0.960183 | 80650076 |
| 99999998 | 0.924045 | 96561793 |


In [4]:
# Size: 1.5 GB
cdf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   x       float64
 1   y       int64
dtypes: float64(1), int64(1)
memory usage: 1.5 GB
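
The reported size can be checked with simple arithmetic: two 8-byte columns over 100M rows. The sketch below assumes the report uses 1024-based units and ignores the index, which is negligible for a `RangeIndex`.

```python
rows = 100_000_000
bytes_per_row = 8 + 8          # float64 column x + int64 column y

total_bytes = rows * bytes_per_row
gib = total_bytes / 2**30      # assumption: the report uses 1024-based units

print(f"{total_bytes:,} bytes = {gib:.2f} GiB")
```

The result rounds to the 1.5 GB shown by `cdf.info()`.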


# 100M Elements Mean

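Because no separate cluster was started, Dask-cuDF runs on a single GPU, and the overhead of distributed scheduling makes it slower than plain cuDF. However, Dask-cuDF can also process data that exceeds the 80 GB of GPU memory.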

In [7]:
%time pdf.x.mean()
%time ddf.x.mean().compute()
%time cdf.x.mean()
%time dcdf.x.mean().compute()

CPU times: user 164 ms, sys: 15.1 ms, total: 179 ms
Wall time: 177 ms
CPU times: user 299 ms, sys: 747 µs, total: 299 ms
Wall time: 47.2 ms
CPU times: user 2.04 ms, sys: 0 ns, total: 2.04 ms
Wall time: 2.04 ms
CPU times: user 7.22 ms, sys: 0 ns, total: 7.22 ms
Wall time: 6.93 ms


0.5000558935082762
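
A partitioned mean can be sketched in plain Python to show where both the overhead and the larger-than-memory capability come from: each partition is reduced to a small `(sum, count)` pair, and only those pairs are combined. This is a generic illustration under that assumption, not Dask's or Dask-cuDF's actual code; `partition_stats` and `chunked_mean` are hypothetical helpers.

```python
import math
import random

random.seed(0)
data = [random.random() for _ in range(100_000)]


def partition_stats(chunk):
    # Each partition reports only a (sum, count) pair, never its raw rows.
    return sum(chunk), len(chunk)


def chunked_mean(values, npartitions):
    size = math.ceil(len(values) / npartitions)
    chunks = (values[i:i + size] for i in range(0, len(values), size))
    stats = [partition_stats(c) for c in chunks]
    total = math.fsum(s for s, _ in stats)   # combine the partial sums
    count = sum(n for _, n in stats)         # combine the partial counts
    return total / count


full = sum(data) / len(data)
print(math.isclose(chunked_mean(data, 4), full, rel_tol=1e-9))
```

Since only one partition has to be resident at a time, the full dataset never needs to fit in memory at once; the price is the extra bookkeeping of splitting and combining.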

From here on, the experiments use a real dataset instead of the simple synthetic array.

In [8]:
pdf = pd.read_csv("../loan-default-data/SmallSizedData.csv")
ddf = dask.dataframe.read_csv("../loan-default-data/SmallSizedData.csv", blocksize=25e6)  # 25MB chunks
cdf = cudf.read_csv("../loan-default-data/SmallSizedData.csv")
dcdf = dask_cudf.read_csv("../loan-default-data/SmallSizedData.csv", npartitions=1)

In [9]:
# Size: 176 MB
cdf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 887379 entries, 0 to 887378
Data columns (total 22 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   cust_id            887379 non-null  int64
 1   year               887379 non-null  int64
 2   state              887379 non-null  object
 3   date_issued        887379 non-null  object
 4   date_final         887379 non-null  int64
 5   emp_duration       887379 non-null  float64
 6   own_type           887379 non-null  object
 7   income_type        887379 non-null  object
 8   app_type           887379 non-null  object
 9   loan_purpose       887379 non-null  object
 10  interest_payments  887379 non-null  object
 11  grade              887379 non-null  object
 12  annual_pay         887379 non-null  int64
 13  loan_amount        887379 non-null  int64
 14  interest_rate      887379 non-null  float64
 15  loan_duration      887379 non-null  object
 16  dti                8873

# Selection

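For simple lookups, cuDF is slightly faster than Pandas. Dask is very slow because it has to fetch results from every partition, and an index-based lookup returns the matching rows from all partitions.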

In [10]:
%time pdf.loc[200004:200008, ['cust_id', 'year']]

CPU times: user 763 µs, sys: 73 µs, total: 836 µs
Wall time: 812 µs


| | cust_id | year |
|---|---|---|
| 200004 | 7964779 | 2013 |
| 200005 | 7964799 | 2013 |
| 200006 | 7964808 | 2013 |
| 200007 | 7964861 | 2013 |
| 200008 | 7964872 | 2013 |


In [11]:
ddf

Dask DataFrame structure (npartitions=5; partition contents are unevaluated and displayed as `...`):

| Column | Dtype |
|---|---|
| cust_id | int64 |
| year | int64 |
| state | object |
| date_issued | object |
| date_final | int64 |
| emp_duration | float64 |
| own_type | object |
| income_type | object |
| app_type | object |
| loan_purpose | object |
| interest_payments | object |
| grade | object |
| annual_pay | int64 |
| loan_amount | int64 |
| interest_rate | float64 |
| loan_duration | object |
| dti | float64 |
| total_pymnt | float64 |
| total_rec_prncp | float64 |
| recoveries | float64 |
| installment | float64 |
| is_default | int64 |


In [12]:
# Dask splits the data into 5 partitions, so selecting by index returns the matching rows from each partition.
ddf.loc[100004:100008, ['cust_id', 'year']].compute()

| | cust_id | year |
|---|---|---|
| 100004 | 3006131 | 2013 |
| 100005 | 3006168 | 2013 |
| 100006 | 3006171 | 2013 |
| 100007 | 3006172 | 2013 |
| 100008 | 3006181 | 2013 |
| 100004 | 12956272 | 2014 |
| 100005 | 12956282 | 2014 |
| 100006 | 12956288 | 2014 |
| 100007 | 12956294 | 2014 |
| 100008 | 12956305 | 2014 |
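
The repeated index labels in the output above suggest that each partition carries its own index restarting at 0. Under that assumption, the toy sketch below shows why a label slice matches rows in more than one partition. The partitions are hypothetical plain dictionaries, not Dask objects.

```python
# Three hypothetical partitions; each restarts its index at 0.
partitions = [
    {0: "a0", 1: "a1", 2: "a2"},
    {0: "b0", 1: "b1", 2: "b2"},
    {0: "c0", 1: "c1", 2: "c2"},
]


def loc_slice(partition, start, stop):
    # Label-based slice, inclusive on both ends like DataFrame.loc.
    return [(k, v) for k, v in partition.items() if start <= k <= stop]


# The same label range is looked up in every partition and the hits are concatenated.
matches = [row for p in partitions for row in loc_slice(p, 1, 2)]
print(matches)
```

A partition that does not contain the requested labels simply contributes no rows.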


In [15]:
%time ddf.loc[100004:100008, ['cust_id', 'year']].compute()
%time cdf.loc[200004:200008, ['cust_id', 'year']]
%time dcdf.loc[200004:200008, ['cust_id', 'year']].compute()

CPU times: user 1.29 s, sys: 148 ms, total: 1.44 s
Wall time: 633 ms
CPU times: user 239 µs, sys: 203 µs, total: 442 µs
Wall time: 439 µs
CPU times: user 50.3 ms, sys: 42.6 ms, total: 92.9 ms
Wall time: 110 ms


| | cust_id | year |
|---|---|---|
| 200004 | 7964779 | 2013 |
| 200005 | 7964799 | 2013 |
| 200006 | 7964808 | 2013 |
| 200007 | 7964861 | 2013 |
| 200008 | 7964872 | 2013 |
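
The Pandas and cuDF timings in this section are sub-millisecond and come from single `%time` runs, so they are sensitive to noise. A hedged sketch of a repeated measurement using the standard `time` module, which the notebook already imports, is shown here; `best_of` is a hypothetical helper and the lambda is a stand-in workload rather than one of the DataFrame operations.

```python
import time


def best_of(func, repeat=5):
    """Return the fastest wall time in seconds over `repeat` calls of `func`."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in the notebook this would be one of the lookups above.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed * 1e3:.3f} ms")
```

In a notebook, the built-in `%timeit` magic serves the same purpose.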


# Sorting Values

A full sort slows down considerably under distributed processing. cuDF, running on a single GPU, is the fastest. Dask-cuDF is only marginally faster than Pandas.

In [17]:
%time pdf.sort_values('cust_id').head()
%time ddf.sort_values('cust_id').head()
%time cdf.sort_values('cust_id').head()
%time dcdf.sort_values('cust_id').head()

CPU times: user 141 ms, sys: 24.2 ms, total: 165 ms
Wall time: 164 ms
CPU times: user 4.06 s, sys: 428 ms, total: 4.48 s
Wall time: 2.08 s
CPU times: user 4.22 ms, sys: 15.6 ms, total: 19.8 ms
Wall time: 19.8 ms
CPU times: user 84.1 ms, sys: 58.5 ms, total: 143 ms
Wall time: 151 ms


| | cust_id | year | state | date_issued | date_final | emp_duration | own_type | income_type | app_type | loan_purpose | ... | annual_pay | loan_amount | interest_rate | loan_duration | dti | total_pymnt | total_rec_prncp | recoveries | installment | is_default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7853 | 54734 | 2009 | Haryana | 01/08/2009 | 1102011 | 0.5 | RENT | Low | INDIVIDUAL | debt_consolidation | ... | 85000 | 25000 | 11.89 | 36 months | 19.48 | 29324.32 | 25000.0 | 0.0 | 829.1 | 0 |
| 614 | 55521 | 2008 | Karnataka | 01/07/2008 | 1032010 | 0.5 | RENT | Low | INDIVIDUAL | debt_consolidation | ... | 30000 | 1000 | 16.08 | 36 months | 23.84 | 1207.76 | 999.99 | 0.0 | 35.2 | 0 |
| 615 | 55742 | 2008 | West Bengal | 01/05/2008 | 1062011 | 0.5 | RENT | Low | INDIVIDUAL | credit_card | ... | 65000 | 7000 | 10.71 | 36 months | 14.29 | 8215.45 | 7000.0 | 0.0 | 228.22 | 0 |
| 636 | 56413 | 2008 | Nagaland | 01/04/2008 | 1102008 | 10.0 | MORTGAGE | Medium | INDIVIDUAL | debt_consolidation | ... | 189500 | 7000 | 16.08 | 36 months | 22.47 | 1231.9 | 783.46 | 0.25 | 246.38 | 1 |
| 470508 | 56705 | 2015 | Andhra Pradesh | 01/11/2015 | 1012016 | 10.0 | MORTGAGE | Low | INDIVIDUAL | debt_consolidation | ... | 33500 | 11000 | 9.99 | 36 months | 18.38 | 376.25 | 263.31 | 0.0 | 354.89 | 0 |
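
One way to see why a global sort is expensive across partitions is a generic range-partitioned sort, sketched below in plain Python. It is an illustration of the general technique under stated assumptions, not a description of Dask's or Dask-cuDF's exact algorithm; the sample size and partition count are arbitrary.

```python
import random

random.seed(1)
keys = [random.randrange(1_000_000) for _ in range(10_000)]
npartitions = 4

# Step 1: pick boundaries from a sample so partitions cover roughly equal key ranges.
sample = sorted(random.sample(keys, 200))
step = len(sample) // npartitions
boundaries = [sample[i * step] for i in range(1, npartitions)]

# Step 2 (the "shuffle"): every key moves to the partition that owns its range.
buckets = [[] for _ in range(npartitions)]
for k in keys:
    target = sum(k >= b for b in boundaries)  # number of boundaries at or below k
    buckets[target].append(k)

# Step 3: sort each partition locally; concatenating them is then globally sorted.
result = [k for bucket in buckets for k in sorted(bucket)]
print(result == sorted(keys))
```

The shuffle in step 2 touches every row, which is extra work that a single in-memory sort does not have to do.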


# Boolean Indexing

Distributed processing was expected to help here, but Dask is slower than Pandas. cuDF, by contrast, is the fastest.

In [31]:
%time pdf[pdf.cust_id > 83000].head()
%time ddf[ddf.cust_id > 83000].head()
%time cdf[cdf.cust_id > 83000].head()
%time dcdf[dcdf.cust_id > 83000].head()

CPU times: user 63.6 ms, sys: 25.6 ms, total: 89.2 ms
Wall time: 88 ms
CPU times: user 274 ms, sys: 8.32 ms, total: 282 ms
Wall time: 282 ms
CPU times: user 14 ms, sys: 11.1 ms, total: 25.1 ms
Wall time: 25.1 ms
CPU times: user 55.8 ms, sys: 78 ms, total: 134 ms
Wall time: 150 ms


| | cust_id | year | state | date_issued | date_final | emp_duration | own_type | income_type | app_type | loan_purpose | ... | annual_pay | loan_amount | interest_rate | loan_duration | dti | total_pymnt | total_rec_prncp | recoveries | installment | is_default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180675 | 2007 | Andhra Pradesh | 01/12/2007 | 1032009 | 10.0 | MORTGAGE | Low | INDIVIDUAL | debt_consolidation | ... | 73000 | 25000 | 10.91 | 36 months | 22.13 | 13650.38 | 8767.32 | 2207.65 | 817.41 | 1 |
| 1 | 85781 | 2007 | Rajasthan | 01/06/2007 | 1072010 | 0.5 | RENT | Low | INDIVIDUAL | other | ... | 40000 | 1400 | 10.91 | 36 months | 8.61 | 1663.04 | 1400.0 | 0.0 | 45.78 | 0 |
| 2 | 85675 | 2007 | Manipur | 01/06/2007 | 1062010 | 10.0 | RENT | Low | INDIVIDUAL | other | ... | 25000 | 1000 | 14.07 | 36 months | 16.27 | 1231.38 | 1000.0 | 0.0 | 34.21 | 0 |
| 3 | 84918 | 2007 | Andhra Pradesh | 01/09/2007 | 1042008 | 10.0 | MORTGAGE | Low | INDIVIDUAL | other | ... | 65000 | 5000 | 7.43 | 36 months | 0.28 | 5200.44 | 5000.0 | 0.0 | 155.38 | 0 |
| 4 | 84670 | 2007 | Arunachal Pradesh | 01/06/2007 | 1082009 | 10.0 | MORTGAGE | High | INDIVIDUAL | other | ... | 300000 | 5000 | 7.75 | 36 months | 5.38 | 5565.65 | 5000.0 | 0.0 | 156.11 | 0 |


# Query API

cuDF is also the fastest at SQL-style queries through the `query` API.

In [32]:
%time pdf.query('cust_id > 83000').head()
%time ddf.query('cust_id > 83000').head()
%time cdf.query('cust_id > 83000').head()
%time dcdf.query('cust_id > 83000').head()

CPU times: user 82.5 ms, sys: 7.92 ms, total: 90.5 ms
Wall time: 89.1 ms
CPU times: user 283 ms, sys: 2.97 ms, total: 286 ms
Wall time: 286 ms
CPU times: user 17.2 ms, sys: 8.56 ms, total: 25.7 ms
Wall time: 25.7 ms
CPU times: user 66.9 ms, sys: 69.5 ms, total: 136 ms
Wall time: 151 ms


| | cust_id | year | state | date_issued | date_final | emp_duration | own_type | income_type | app_type | loan_purpose | ... | annual_pay | loan_amount | interest_rate | loan_duration | dti | total_pymnt | total_rec_prncp | recoveries | installment | is_default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180675 | 2007 | Andhra Pradesh | 01/12/2007 | 1032009 | 10.0 | MORTGAGE | Low | INDIVIDUAL | debt_consolidation | ... | 73000 | 25000 | 10.91 | 36 months | 22.13 | 13650.38 | 8767.32 | 2207.65 | 817.41 | 1 |
| 1 | 85781 | 2007 | Rajasthan | 01/06/2007 | 1072010 | 0.5 | RENT | Low | INDIVIDUAL | other | ... | 40000 | 1400 | 10.91 | 36 months | 8.61 | 1663.04 | 1400.0 | 0.0 | 45.78 | 0 |
| 2 | 85675 | 2007 | Manipur | 01/06/2007 | 1062010 | 10.0 | RENT | Low | INDIVIDUAL | other | ... | 25000 | 1000 | 14.07 | 36 months | 16.27 | 1231.38 | 1000.0 | 0.0 | 34.21 | 0 |
| 3 | 84918 | 2007 | Andhra Pradesh | 01/09/2007 | 1042008 | 10.0 | MORTGAGE | Low | INDIVIDUAL | other | ... | 65000 | 5000 | 7.43 | 36 months | 0.28 | 5200.44 | 5000.0 | 0.0 | 155.38 | 0 |
| 4 | 84670 | 2007 | Arunachal Pradesh | 01/06/2007 | 1082009 | 10.0 | MORTGAGE | High | INDIVIDUAL | other | ... | 300000 | 5000 | 7.75 | 36 months | 5.38 | 5565.65 | 5000.0 | 0.0 | 156.11 | 0 |


# Applymap

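cuDF is also overwhelmingly faster at transforming data by applying a function. Dask-cuDF, on the other hand, gives up all of cuDF's speed advantage and falls short of expectations. (A fix is described separately later.)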

In [33]:
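def add(x):
    return x + 5

# Pandas and cuDF apply the function directly to the column.
%time pdf['total_pymnt'].apply(add)
# .compute() first materializes the column as a Pandas Series, then apply runs in Pandas.
%time ddf['total_pymnt'].compute().apply(add)
%time cdf['total_pymnt'].apply(add)
# .compute() first materializes the column as a cuDF Series, then apply runs in cuDF.
%time dcdf['total_pymnt'].compute().apply(add)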

CPU times: user 174 ms, sys: 4.42 ms, total: 178 ms
Wall time: 177 ms
CPU times: user 796 ms, sys: 37.8 ms, total: 834 ms
Wall time: 340 ms
CPU times: user 2.5 ms, sys: 677 µs, total: 3.17 ms
Wall time: 3.06 ms
CPU times: user 51.7 ms, sys: 52 ms, total: 104 ms
Wall time: 106 ms


0         13655.38
1          1668.04
2          1236.38
3          5205.44
4          5570.65
            ...   
887374        5.00
887375        5.00
887376      252.39
887377        5.00
887378        5.00
Name: total_pymnt, Length: 887379, dtype: float64

# String Manipulation

In [34]:
%time pdf['state'].unique()[0].lower()
%time ddf['state'].unique().compute()[0].lower()
%time cdf['state'].unique()[0].lower()
%time dcdf['state'].unique().compute()[0].lower()

CPU times: user 48.8 ms, sys: 227 µs, total: 49 ms
Wall time: 47.8 ms
CPU times: user 669 ms, sys: 61.4 ms, total: 730 ms
Wall time: 221 ms
CPU times: user 11.5 ms, sys: 3.76 ms, total: 15.2 ms
Wall time: 15.7 ms
CPU times: user 47.4 ms, sys: 63.9 ms, total: 111 ms
Wall time: 113 ms


'andhra pradesh'

# Conclusion

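There are few cases in which Dask, despite its support for distributed processing, delivers a further speedup. cuDF on the GPU, however, is far faster than both Pandas and Dask. On an A100, cuDF is the best choice for data that fits within the 80 GB of GPU memory. NVIDIA also recommends, as shown below, preprocessing data with cuDF and then training on it in Apache Arrow format through the same GPU memory.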

![](https://github.com/rapidsai/cudf/raw/branch-21.08/img/rapids_arrow.png)

This notebook examined the high performance of cuDF and found that Dask performs worse than expected. The following notebook, however, covers Dask optimization and shows that Dask-cuDF can also deliver excellent performance.
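
As a rough summary, the wall times reported by the `%time` cells in this notebook can be turned into speedup ratios. The numbers below are copied from those outputs (converted to milliseconds); they come from single runs, so the ratios should be read as indicative only.

```python
# Wall times in milliseconds, copied from the %time outputs in this notebook.
wall_ms = {
    "mean":             {"pandas": 177,   "dask": 47.2, "cudf": 2.04,  "dask_cudf": 6.93},
    "selection":        {"pandas": 0.812, "dask": 633,  "cudf": 0.439, "dask_cudf": 110},
    "sort":             {"pandas": 164,   "dask": 2080, "cudf": 19.8,  "dask_cudf": 151},
    "boolean indexing": {"pandas": 88,    "dask": 282,  "cudf": 25.1,  "dask_cudf": 150},
    "query":            {"pandas": 89.1,  "dask": 286,  "cudf": 25.7,  "dask_cudf": 151},
    "apply":            {"pandas": 177,   "dask": 340,  "cudf": 3.06,  "dask_cudf": 106},
    "string":           {"pandas": 47.8,  "dask": 221,  "cudf": 15.7,  "dask_cudf": 113},
}

# Speedup of cuDF relative to Pandas for each operation.
speedup = {op: t["pandas"] / t["cudf"] for op, t in wall_ms.items()}
for op, s in sorted(speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:<17} cuDF is {s:5.1f}x faster than Pandas")
```

On these figures the cuDF advantage ranges from under 2x for index selection to roughly 87x for the 100M-element mean.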

# References
- https://github.com/ashishpatel26/Rapidsai_Machine_learning_on_GPU