# cuDF vs Pandas speed comparison 

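This notebook times three tasks (reading a CSV file, a group-by aggregation, and a simple linear regression) in Pandas and in cuDF, using about 6.2 million rows of popular baby names.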

*Baby name data by state provided by the US Social Security Administration. For the source data see [here](https://www.ssa.gov/oact/babynames/limits.html).*

---
### Download data

In [1]:
%%script echo skipping
%%capture
%%bash
wget "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
unzip namesbystate.zip -d namesbystate && rm namesbystate.zip
cat namesbystate/*.TXT >> namesbystate.csv
rm -r namesbystate

skipping


### Import libraries

In [2]:
import time
import cudf as cd
import pandas as pd
import cupy as cp
import numpy as np
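
The cells in this notebook time each step by subtracting two `time.time()` readings. A minimal sketch of a reusable alternative follows, assuming a hypothetical helper named `timed` (it is not part of the original notebook). It uses `time.perf_counter()`, which is intended for measuring short durations, and can collect the results in a dictionary.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Time the enclosed block and optionally store the elapsed seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the raw value for later comparison
        print(f"{label}: {elapsed:.3f}s")

# Usage with a stand-in workload; in this notebook the body would be
# a call such as pd.read_csv(...) or cd.read_csv(...).
results = {}
with timed("sum", results):
    total = sum(range(1_000_000))
```

Collecting the timings in one dictionary makes it easier to build a summary from the measured values rather than from numbers copied by hand.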

----
### Read data using Pandas

In [3]:
pd.DataFrame({'a': [0]}); # initialize pandas
startTime = time.time()
pdf = pd.read_csv('namesbystate.csv', names=["state", "sex", "year", "name", "rank"])
print("Records:", len(pdf))
time.time() - startTime

Records: 6215834


2.5206143856048584


### Read data using cuDF

In [4]:
cd.DataFrame({'a': [0]}); # initialize cudf 
startTime = time.time()
cdf = cd.read_csv('namesbystate.csv', names=["state", "sex", "year", "name", "rank"])
print("Records:", len(cdf))
time.time() - startTime

Records: 6215834


0.20464205741882324

---
### Aggregate data with Pandas

In [5]:
startTime = time.time()
print(pdf.groupby(["year", "name", "sex"]).count())
time.time() - startTime

                  state  rank
year name    sex             
1910 Aaron   M       13    13
     Abbie   F        4     4
     Abe     M        4     4
     Abner   M        2     2
     Abraham M        6     6
...                 ...   ...
2020 Zymere  M        1     1
     Zymir   M        8     8
     Zyon    M       14    14
     Zyra    F        4     4
     Zyrah   F        1     1

[642570 rows x 2 columns]


1.7794814109802246

### Aggregate data with cuDF

In [6]:
startTime = time.time()
print(cdf.groupby(["year", "name", "sex"]).count())
time.time() - startTime

                   state  rank
year name     sex             
1920 Fern     F       32    32
1992 Jamila   F       18    18
1999 Shannan  F        3     3
1994 Tyree    M       21    21
1945 Signe    F        1     1
...                  ...   ...
1999 Destin   M       19    19
2012 Precious F       11    11
1990 Tyresha  F        1     1
     Shilpa   F        1     1
1923 Lon      M        6     6

[642570 rows x 2 columns]


0.08793067932128906
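
The two aggregation outputs above list the same 642,570 groups in different orders: the Pandas result is sorted by the group keys, while the cuDF result is not, most likely because cuDF does not sort group keys by default. A minimal sketch of an order-independent comparison follows. It runs on Pandas alone, with a few made-up rows and `sort=False` standing in for the unsorted cuDF result; the helper name `same_groups` is hypothetical. With a real cuDF result, converting it first with `.to_pandas()` should allow the same check.

```python
import pandas as pd

# Tiny stand-in for the baby-name table (made-up rows, same column names).
df = pd.DataFrame({
    "state": ["CA", "NY", "CA", "TX", "NY"],
    "sex":   ["F", "F", "M", "F", "F"],
    "year":  [1910, 1910, 1910, 1911, 1911],
    "name":  ["Mary", "Mary", "John", "Mary", "Anna"],
    "rank":  [295, 1923, 237, 52, 30],
})

keys = ["year", "name", "sex"]
sorted_counts = df.groupby(keys).count()                # Pandas sorts group keys by default
unsorted_counts = df.groupby(keys, sort=False).count()  # groups in order of first appearance

def same_groups(a, b):
    """Return True if two grouped results match, ignoring row order."""
    return a.sort_index().equals(b.sort_index())

print(same_groups(sorted_counts, unsorted_counts))
```

Sorting both sides before comparing keeps the check independent of whichever ordering each library happens to return.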

---
### Simple linear model with Pandas

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

lr = LinearRegression(fit_intercept=True)
lr.fit(np.array([0,1.0]).reshape(-1, 1),np.array([0,1.0]))  # warm-up fit, excluded from the timing

startTime = time.time()
X = pdf[['year']]
y = pdf['name'].str.len()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
model = lr.fit(x_train, y_train)
lr.predict(x_test)
time.time() - startTime

3.05558180809021

### Simple linear model with cuDF

In [8]:
from cuml import train_test_split
from cuml import LinearRegression

lr = LinearRegression(fit_intercept = True, normalize = False, algorithm='svd')
lr.fit(cp.array([0,1.0]).reshape(-1, 1), cp.array([0,1.0]))

startTime = time.time()
X = cdf[['year']].astype('float32')
y = cdf['name'].str.len()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
model = lr.fit(x_train, y_train)
lr.predict(x_test)
time.time() - startTime

0.2588474750518799

---
### Summary

&nbsp;|Records|Pandas|cuDF|Improvement
---|---:|---:|---:|---:
Reading|6,215,834|2.52s|0.20s|**12.3x**
Aggregating|642,570|1.78s|0.09s|**20.2x**
Regression|5,594,251|3.06s|0.26s|**11.8x**
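
The improvement column is the Pandas time divided by the cuDF time. A minimal sketch of that arithmetic follows, using the unrounded wall-clock values printed by the cells above; the dictionary and the `speedup` function are illustrative names, not part of the original notebook. Ratios computed from unrounded timings can differ slightly from ratios computed after rounding each time to two decimals.

```python
# Wall-clock times in seconds, copied from the cell outputs: (Pandas, cuDF).
timings = {
    "Reading":     (2.5206143856048584, 0.20464205741882324),
    "Aggregating": (1.7794814109802246, 0.08793067932128906),
    "Regression":  (3.05558180809021, 0.2588474750518799),
}

def speedup(pandas_seconds, cudf_seconds):
    """How many times faster the cuDF step ran than the Pandas step."""
    return pandas_seconds / cudf_seconds

for step, (pandas_s, cudf_s) in timings.items():
    print(f"{step}: {pandas_s:.2f}s vs {cudf_s:.2f}s -> {speedup(pandas_s, cudf_s):.1f}x")
```

These are single-run measurements, so the ratios are best read as rough indications rather than stable benchmarks.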