![](https://github.com/ashishpatel26/Rapidsai_Machine_learning_on_GPU/raw/main/images/rapidsailogo.jpg?raw=true)

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

This notebook sets `CUDA_VISIBLE_DEVICES=2` to pin cuDF, by arbitrary choice, to the third GPU (device index 2).
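As a minimal sketch of that pinning, assuming the variable is set from inside the notebook rather than in the shell that launched it, the assignment has to happen before any GPU library is imported:

```python
import os

# Assumption: device index "2" (the third GPU) mirrors the setting described above.
# Set the variable before importing any CUDA-backed library, because the set of
# visible devices is fixed once the CUDA context has been created.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# import cudf  # import GPU libraries only after this point

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Inside the process, the single visible GPU is then addressed as device 0, regardless of its physical index.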

- Pandas: Single CPU
- Dask: Multiple CPUs (plays essentially the same role as Apache Spark)
- cuDF: Single GPU
- Dask-cuDF: Multiple GPUs
- Dask-cuDF: Multiple GPUs in the Cluster

A Dask cluster can be configured as shown below, but it conflicts with the existing CPU-based Dask setup. A Dask-cuDF example on a single GPU is sufficient here, so no separate cluster is started.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Create a Dask Cluster with one worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)
```

In [1]:
import os
import numpy as np
import cupy as cp
import pandas as pd
import cudf
import dask
import dask.array as da
import dask_cudf
import time

cp.random.seed(220919)

print(pd.__version__)
print(dask.__version__)
print(cudf.__version__)
print(dask_cudf.__version__)

1.4.2
2022.05.2
22.06.00
22.06.00


In [221]:
# Generate 100M rows
pdf = pd.DataFrame({'x': np.random.random(100000000),
                    'y': np.random.randint(0, 100000000, size=100000000)})
ddf = dask.dataframe.from_pandas(pdf, npartitions=4)
cdf = cudf.DataFrame.from_pandas(pdf)
dcdf = dask_cudf.from_cudf(cdf, npartitions=1)

In [36]:
cdf

| | x | y |
|---|---|---|
| 0 | 0.638084 | 85984516 |
| 1 | 0.212242 | 2583148 |
| 2 | 0.954734 | 85234516 |
| 3 | 0.313399 | 65686761 |
| 4 | 0.175229 | 92447388 |
| ... | ... | ... |
| 99999995 | 0.971702 | 48925664 |
| 99999996 | 0.195571 | 14184915 |
| 99999997 | 0.093529 | 75506169 |
| 99999998 | 0.020003 | 1796415 |


In [37]:
# Size: 1.5 GB
cdf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   x       float64
 1   y       int64
dtypes: float64(1), int64(1)
memory usage: 1.5 GB
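The reported 1.5 GB can be checked with simple arithmetic. The sketch below assumes the report uses 1024-based units and ignores the negligible size of the `RangeIndex`:

```python
ROWS = 100_000_000
BYTES_PER_VALUE = {"x": 8, "y": 8}  # float64 and int64 are both 8 bytes wide

total_bytes = ROWS * sum(BYTES_PER_VALUE.values())
total_gib = total_bytes / 1024**3  # assumption: the report uses 1024-based units

print(f"{total_bytes:,} bytes = {total_gib:.2f} GiB")
```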


# 100M Elements Mean

- Pandas: 173 ms
- Dask: 46.4 ms
- cuDF: 1.31 ms
- Dask-cuDF: 6.76 ms

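Because no separate cluster was started, Dask-cuDF runs on a single GPU here, and the overhead of distributed task scheduling makes it about five times slower than cuDF (6.76 ms vs 1.31 ms). Dask-cuDF can, however, process data that exceeds GPU memory (80 GB in this case).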

In [40]:
pdf.x.mean()

0.4999567872380305

In [10]:
%timeit pdf.x.mean()

173 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
%timeit ddf.x.mean().compute()

46.4 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%timeit cdf.x.mean()

1.31 ms ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [13]:
%timeit dcdf.x.mean().compute()

6.76 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
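`%timeit` is only available inside IPython. For running the same comparison from a plain script, the standard-library `timeit` module can produce a similar "mean ± std. dev. of 7 runs" figure. The helper name `time_call` and the toy workload are hypothetical, and the use of the population standard deviation is an assumption about how `%timeit` reports its spread:

```python
import statistics
import timeit


def time_call(func, repeat=7, number=10):
    """Return (mean, std) in seconds per call over `repeat` runs of `number` loops."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


# Toy workload standing in for pdf.x.mean(); replace the lambda with the
# DataFrame call to be measured.
data = [i * 0.5 for i in range(100_000)]
mean_s, std_s = time_call(lambda: sum(data) / len(data))

print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e6:.1f} µs per loop")
```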


From here on, the experiments use real data instead of a simple random array.

In [89]:
pdf = pd.read_csv("../loan-default-data/SmallSizedData.csv")
ddf = dask.dataframe.read_csv("../loan-default-data/SmallSizedData.csv", blocksize=25e6)  # 25MB chunks
cdf = cudf.read_csv("../loan-default-data/SmallSizedData.csv")
dcdf = dask_cudf.read_csv("../loan-default-data/SmallSizedData.csv", npartitions=1)

In [80]:
# Size: 176 MB
cdf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 887379 entries, 0 to 887378
Data columns (total 22 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   cust_id            887379 non-null  int64
 1   year               887379 non-null  int64
 2   state              887379 non-null  object
 3   date_issued        887379 non-null  object
 4   date_final         887379 non-null  int64
 5   emp_duration       887379 non-null  float64
 6   own_type           887379 non-null  object
 7   income_type        887379 non-null  object
 8   app_type           887379 non-null  object
 9   loan_purpose       887379 non-null  object
 10  interest_payments  887379 non-null  object
 11  grade              887379 non-null  object
 12  annual_pay         887379 non-null  int64
 13  loan_amount        887379 non-null  int64
 14  interest_rate      887379 non-null  float64
 15  loan_duration      887379 non-null  object
 16  dti                8873

# Selection

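A simple lookup is slightly faster in cuDF than in Pandas. Dask is very slow because it has to query every partition and gather the results. For an index-based lookup, matching rows from every partition are returned, because each partition read from the CSV file has its own index starting at 0.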

In [90]:
pdf.loc[100004:100008, ['cust_id', 'year']]

| | cust_id | year |
|---|---|---|
| 100004 | 3006131 | 2013 |
| 100005 | 3006168 | 2013 |
| 100006 | 3006171 | 2013 |
| 100007 | 3006172 | 2013 |
| 100008 | 3006181 | 2013 |


In [84]:
%timeit pdf.loc[200004:200008, ['cust_id', 'year']]

342 µs ± 874 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [91]:
ddf

Dask DataFrame structure (npartitions=5); the values are shown as `...` because they are not computed yet:

| Column | Dtype |
|---|---|
| cust_id | int64 |
| year | int64 |
| state | object |
| date_issued | object |
| date_final | int64 |
| emp_duration | float64 |
| own_type | object |
| income_type | object |
| app_type | object |
| loan_purpose | object |
| interest_payments | object |
| grade | object |
| annual_pay | int64 |
| loan_amount | int64 |
| interest_rate | float64 |
| loan_duration | object |
| dti | float64 |
| total_pymnt | float64 |
| total_rec_prncp | float64 |
| recoveries | float64 |
| installment | float64 |
| is_default | int64 |


In [93]:
# The Dask DataFrame is split into 5 partitions, so selecting by index label returns the matching rows from every partition.
ddf.loc[100004:100008, ['cust_id', 'year']].compute()

| | cust_id | year |
|---|---|---|
| 100004 | 3006131 | 2013 |
| 100005 | 3006168 | 2013 |
| 100006 | 3006171 | 2013 |
| 100007 | 3006172 | 2013 |
| 100008 | 3006181 | 2013 |
| 100004 | 12956272 | 2014 |
| 100005 | 12956282 | 2014 |
| 100006 | 12956288 | 2014 |
| 100007 | 12956294 | 2014 |
| 100008 | 12956305 | 2014 |
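The repeated index labels in this output can be reproduced with a library-free model. The sketch below is only an illustration, not Dask's implementation: it assumes that index labels restart at 0 in every partition, and the names `partitions` and `loc_slice` are hypothetical.

```python
# Model each partition as a list of (index_label, value) rows.
# Assumption: labels restart at 0 in every partition, as happens when a CSV
# file is read in independent blocks.
rows = [f"cust_{i}" for i in range(10)]
partition_size = 5
partitions = [
    list(enumerate(rows[start:start + partition_size]))
    for start in range(0, len(rows), partition_size)
]


def loc_slice(partitions, first, last):
    """Apply an inclusive label slice to every partition and concatenate."""
    return [row for part in partitions for row in part if first <= row[0] <= last]


result = loc_slice(partitions, 3, 4)
print(result)
# [(3, 'cust_3'), (4, 'cust_4'), (3, 'cust_8'), (4, 'cust_9')]
```

Each partition contributes its own rows for labels 3 and 4, which is the same pattern as the duplicated labels 100004 to 100008 in the Dask result.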


In [94]:
%timeit ddf.loc[100004:100008, ['cust_id', 'year']].compute()

619 ms ± 20.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [95]:
%timeit cdf.loc[200004:200008, ['cust_id', 'year']]

288 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [97]:
%timeit dcdf.loc[200004:200008, ['cust_id', 'year']].compute()

80.3 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Sorting Values

A full sort slows down considerably under distributed processing. cuDF, running on a single GPU, is the fastest. Dask-cuDF is nevertheless somewhat faster than Pandas (116 ms vs 166 ms).

In [98]:
pdf.sort_values('cust_id').head()

Unnamed: 0,cust_id,year,state,date_issued,date_final,emp_duration,own_type,income_type,app_type,loan_purpose,...,annual_pay,loan_amount,interest_rate,loan_duration,dti,total_pymnt,total_rec_prncp,recoveries,installment,is_default
7853,54734,2009,Haryana,01/08/2009,1102011,0.5,RENT,Low,INDIVIDUAL,debt_consolidation,...,85000,25000,11.89,36 months,19.48,29324.32,25000.0,0.0,829.1,0
614,55521,2008,Karnataka,01/07/2008,1032010,0.5,RENT,Low,INDIVIDUAL,debt_consolidation,...,30000,1000,16.08,36 months,23.84,1207.76,999.99,0.0,35.2,0
615,55742,2008,West Bengal,01/05/2008,1062011,0.5,RENT,Low,INDIVIDUAL,credit_card,...,65000,7000,10.71,36 months,14.29,8215.45,7000.0,0.0,228.22,0
636,56413,2008,Nagaland,01/04/2008,1102008,10.0,MORTGAGE,Medium,INDIVIDUAL,debt_consolidation,...,189500,7000,16.08,36 months,22.47,1231.9,783.46,0.25,246.38,1
470508,56705,2015,Andhra Pradesh,01/11/2015,1012016,10.0,MORTGAGE,Low,INDIVIDUAL,debt_consolidation,...,33500,11000,9.99,36 months,18.38,376.25,263.31,0.0,354.89,0


In [99]:
%timeit pdf.sort_values('cust_id').head()

166 ms ± 582 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [103]:
%timeit ddf.sort_values('cust_id').head()

2.02 s ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [105]:
%timeit cdf.sort_values('cust_id').head()

20.4 ms ± 96.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [107]:
%timeit dcdf.sort_values('cust_id').head()

116 ms ± 670 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Boolean Indexing

I expected distributed processing to help here, but Dask is about three times slower than Pandas (283 ms vs 97.7 ms). cuDF, by contrast, is the fastest.

In [115]:
pdf[pdf.cust_id > 83000].head()

Unnamed: 0,cust_id,year,state,date_issued,date_final,emp_duration,own_type,income_type,app_type,loan_purpose,...,annual_pay,loan_amount,interest_rate,loan_duration,dti,total_pymnt,total_rec_prncp,recoveries,installment,is_default
0,180675,2007,Andhra Pradesh,01/12/2007,1032009,10.0,MORTGAGE,Low,INDIVIDUAL,debt_consolidation,...,73000,25000,10.91,36 months,22.13,13650.38,8767.32,2207.65,817.41,1
1,85781,2007,Rajasthan,01/06/2007,1072010,0.5,RENT,Low,INDIVIDUAL,other,...,40000,1400,10.91,36 months,8.61,1663.04,1400.0,0.0,45.78,0
2,85675,2007,Manipur,01/06/2007,1062010,10.0,RENT,Low,INDIVIDUAL,other,...,25000,1000,14.07,36 months,16.27,1231.38,1000.0,0.0,34.21,0
3,84918,2007,Andhra Pradesh,01/09/2007,1042008,10.0,MORTGAGE,Low,INDIVIDUAL,other,...,65000,5000,7.43,36 months,0.28,5200.44,5000.0,0.0,155.38,0
4,84670,2007,Arunachal Pradesh,01/06/2007,1082009,10.0,MORTGAGE,High,INDIVIDUAL,other,...,300000,5000,7.75,36 months,5.38,5565.65,5000.0,0.0,156.11,0


In [114]:
%timeit pdf[pdf.cust_id > 83000].head()

97.7 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [116]:
%timeit ddf[ddf.cust_id > 83000].head()

283 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [117]:
%timeit cdf[cdf.cust_id > 83000].head()

23.1 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [118]:
%timeit dcdf[dcdf.cust_id > 83000].head()

113 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# QueryAPI

Filtering with SQL-like expressions through `query()` is also fastest in cuDF.

In [131]:
pdf.query('cust_id > 83000').head()

Unnamed: 0,cust_id,year,state,date_issued,date_final,emp_duration,own_type,income_type,app_type,loan_purpose,...,annual_pay,loan_amount,interest_rate,loan_duration,dti,total_pymnt,total_rec_prncp,recoveries,installment,is_default
0,180675,2007,Andhra Pradesh,01/12/2007,1032009,10.0,MORTGAGE,Low,INDIVIDUAL,debt_consolidation,...,73000,25000,10.91,36 months,22.13,13650.38,8767.32,2207.65,817.41,1
1,85781,2007,Rajasthan,01/06/2007,1072010,0.5,RENT,Low,INDIVIDUAL,other,...,40000,1400,10.91,36 months,8.61,1663.04,1400.0,0.0,45.78,0
2,85675,2007,Manipur,01/06/2007,1062010,10.0,RENT,Low,INDIVIDUAL,other,...,25000,1000,14.07,36 months,16.27,1231.38,1000.0,0.0,34.21,0
3,84918,2007,Andhra Pradesh,01/09/2007,1042008,10.0,MORTGAGE,Low,INDIVIDUAL,other,...,65000,5000,7.43,36 months,0.28,5200.44,5000.0,0.0,155.38,0
4,84670,2007,Arunachal Pradesh,01/06/2007,1082009,10.0,MORTGAGE,High,INDIVIDUAL,other,...,300000,5000,7.75,36 months,5.38,5565.65,5000.0,0.0,156.11,0


In [122]:
%timeit pdf.query('cust_id > 83000').head()

108 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [128]:
%timeit ddf.query('cust_id > 83000').head()

288 ms ± 5.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [129]:
%timeit cdf.query('cust_id > 83000').head()

23.6 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [130]:
%timeit dcdf.query('cust_id > 83000').head()

120 ms ± 839 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Applymap

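Applying a user-defined function to the data is overwhelmingly faster in cuDF (1.68 ms vs 174 ms for Pandas). The slowdown of Dask-cuDF, which adds distributed-processing overhead, is larger than expected. Note that both Dask variants call `.compute()` before `.apply()`, so the function runs on the gathered Pandas or cuDF Series rather than on the partitions.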
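For comparison, the function could also be applied partition by partition before the result is gathered. This is a hedged sketch for the CPU-based Dask Series only and was not part of the measurements in this notebook. The helper name `apply_on_workers` is hypothetical, and `float64` is assumed as the output dtype because `total_pymnt` is a float column:

```python
def add(x):
    return x + 5


def apply_on_workers(column, dtype="float64"):
    """Apply `add` on each partition, then gather the result.

    `column` is assumed to be a Dask Series such as ddf['total_pymnt'].
    `meta` tells Dask the name and dtype of the output, so it does not have
    to infer them by running the function on dummy data.
    """
    lazy = column.apply(add, meta=(column.name, dtype))
    return lazy.compute()
```

Usage would be `apply_on_workers(ddf['total_pymnt'])`. Whether this is faster than gathering first depends on the data size and scheduler overhead.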

In [146]:
def add(x):
    return x + 5

%timeit pdf['total_pymnt'].apply(add)

174 ms ± 385 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [149]:
%timeit ddf['total_pymnt'].compute().apply(add)

330 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [150]:
%timeit cdf['total_pymnt'].apply(add)

1.68 ms ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [151]:
%timeit dcdf['total_pymnt'].compute().apply(add)

90 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# String Manipulation

In [33]:
pdf['state'].unique()[0].lower()

'andhra pradesh'

In [203]:
%timeit pdf['state'].unique()[0].lower()

44.2 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [204]:
%timeit ddf['state'].unique().compute()[0].lower()

202 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [205]:
%timeit cdf['state'].unique()[0].lower()

11.1 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [206]:
%timeit dcdf['state'].unique().compute()[0].lower()

99.1 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Conclusion

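There are few cases where Dask, with its distributed processing, actually improves speed. cuDF on a GPU, however, is far faster than both Pandas and Dask. For data that fits in GPU memory (80 GB on an A100), preprocessing with cuDF is the best choice. NVIDIA also recommends, as shown below, training directly on data preprocessed with cuDF by passing it in Apache Arrow format within the same GPU memory.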

![](https://github.com/rapidsai/cudf/raw/branch-21.08/img/rapids_arrow.png)
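The comparison can be summarized from the `%timeit` results reported in this notebook. The sketch below only restates those measurements in milliseconds; the dictionary layout and variable names are illustrative:

```python
# Timings in milliseconds, copied from the %timeit outputs in this notebook.
timings_ms = {
    "mean":             {"Pandas": 173,   "Dask": 46.4, "cuDF": 1.31,  "Dask-cuDF": 6.76},
    "selection":        {"Pandas": 0.342, "Dask": 619,  "cuDF": 0.288, "Dask-cuDF": 80.3},
    "sort":             {"Pandas": 166,   "Dask": 2020, "cuDF": 20.4,  "Dask-cuDF": 116},
    "boolean indexing": {"Pandas": 97.7,  "Dask": 283,  "cuDF": 23.1,  "Dask-cuDF": 113},
    "query":            {"Pandas": 108,   "Dask": 288,  "cuDF": 23.6,  "Dask-cuDF": 120},
    "apply":            {"Pandas": 174,   "Dask": 330,  "cuDF": 1.68,  "Dask-cuDF": 90},
    "string":           {"Pandas": 44.2,  "Dask": 202,  "cuDF": 11.1,  "Dask-cuDF": 99.1},
}

# Fastest library per operation, and the operations where Dask beats Pandas.
fastest = {op: min(row, key=row.get) for op, row in timings_ms.items()}
dask_wins = [op for op, row in timings_ms.items() if row["Dask"] < row["Pandas"]]

for op, row in timings_ms.items():
    speedup = row["Pandas"] / row["cuDF"]
    print(f"{op:<17} fastest: {fastest[op]:<9} cuDF vs Pandas: {speedup:6.1f}x")

print("Dask faster than Pandas on:", dask_wins)
```

In these measurements cuDF is the fastest library for every operation, and CPU-based Dask beats Pandas only on the 100M-element mean.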