# Performance Comparison: pandas Versus RAPIDS cuDF

This tutorial uses `timeit` to benchmark pandas and RAPIDS cuDF on the same workload and compare their performance.

## System Details

### GPU

In [1]:
!nvidia-smi -q



Timestamp                           : Tue Mar 10 21:00:35 2020
Driver Version                      : 440.31
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:81:00.0
    Product Name                    : Tesla T4
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0561119011981
    GPU UUID                        : GPU-8b4068b3-1bcf-8dbe-978e-8eacb3c22801
    Minor Number                    : 0
    VBIOS Version                   : 90.04.38.00.03
    MultiGPU Board                  : No
    Board ID                        : 0x8100
    GPU Part Number                 : 900-2G183-0000-0

## Benchmark Setup

### Installations

Install v3io-generator to create a 1 GB data set for the benchmark.<br>
You only need to run the generator once, and then you can reuse the generated data set.

### Imports

In [2]:
import os
import yaml
import time
import datetime
import json
import itertools

# Generator
from v3io_generator import metrics_generator, deployment_generator

# Dataframes
import cudf
import pandas as pd

### Configurations

In [3]:
# Benchmark configurations
metric_names = ['cpu_utilization', 'latency', 'packet_loss', 'throughput']
nlargest = 10
source_file = os.path.join(os.getcwd(), 'data', 'ops.logs') # Use full path


os.environ['SOURCE_PATH'] = source_file                    # Expose for display
os.environ['SOURCE_DIR'] = os.path.dirname(source_file)    # Expose for display
os.environ['SOURCE_FILE'] = os.path.basename(source_file)  # Expose for display

### Create the Data Source

Use v3io-generator to create a time-series network-operations dataset for 100 companies, including 4 metrics (CPU utilization, latency, throughput, and packet loss).<br>
Then, write the dataset to a JSON file to be used as the data source.

In [4]:
# Create a metadata factory
dep_gen = deployment_generator.deployment_generator()
faker = dep_gen.get_faker()

# Design the metadata
dep_gen.add_level(name='company', number=100, level_type=faker.company)

# Generate a deployment structure
deployment_df = dep_gen.generate_deployment()

# Initialize the metric values
for metric in metric_names:
    deployment_df[metric] = 0

deployment_df.head()

|   | company | cpu_utilization | latency | packet_loss | throughput |
|---|---|---|---|---|---|
| 0 | Hess-Brooks | 0 | 0 | 0 | 0 |
| 1 | Humphrey__Vang_and_Higgins | 0 | 0 | 0 | 0 |
| 2 | Mckee-Garcia | 0 | 0 | 0 | 0 |
| 3 | Howell_PLC | 0 | 0 | 0 | 0 |
| 4 | Shaw-Coleman | 0 | 0 | 0 | 0 |


Specify the metrics configuration for the generator.

In [5]:
metrics_configuration = yaml.safe_load("""
errors: {length_in_ticks: 50, rate_in_ticks: 150}
timestamps: {interval: 5s, stochastic_interval: false}
metrics:
  cpu_utilization:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 70, noise: 0, sigma: 10}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  latency:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 5}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  packet_loss:
    accuracy: 0
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 2}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 50, min: 0, validate: true}
  throughput:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 250, noise: 0, sigma: 20}
    is_threshold_below: false
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 300, min: 0, validate: true}
""")

Create the data according to the given hierarchy and metrics configuration.

In [6]:
met_gen = metrics_generator.Generator_df(metrics_configuration, 
                                         user_hierarchy=deployment_df, 
                                         initial_timestamp=time.time())

metrics = met_gen.generate_range(start_time=datetime.datetime.now(),
                                 end_time=datetime.datetime.now()+datetime.timedelta(hours=62),
                                 as_df=True,
                                 as_iterator=False)

# Ensure that the source-file parent directory exists.
os.makedirs(os.path.dirname(source_file), exist_ok=True)

# Write the metrics to the source file as JSON Lines (one record per line)
with open(source_file, 'w') as f:
    metrics.to_json(f,
                    orient='records',
                    lines=True)
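
The `orient='records'` and `lines=True` arguments produce a JSON Lines file, in which every line is an independent JSON object. The following standard-library sketch illustrates that layout with a round trip through an in-memory buffer. The two records and the helper names `write_json_lines` and `read_json_lines` are hypothetical and are not part of v3io-generator or pandas.

```python
import io
import json

# Hypothetical sample records; the field names mirror the generated data set.
records = [
    {"company": "Hess-Brooks", "cpu_utilization": 76.37, "timestamp": 1583874132961},
    {"company": "Mckee-Garcia", "cpu_utilization": 90.35, "timestamp": 1583874132961},
]


def write_json_lines(rows, handle):
    """Write one compact JSON object per line (the JSON Lines layout)."""
    for row in rows:
        handle.write(json.dumps(row, separators=(",", ":")) + "\n")


def read_json_lines(handle):
    """Parse each non-empty line as an independent JSON object."""
    return [json.loads(line) for line in handle if line.strip()]


buffer = io.StringIO()
write_json_lines(records, buffer)
buffer.seek(0)
round_trip = read_json_lines(buffer)
print(round_trip == records)
```

Because each line can be parsed on its own, a reader can process such a file line by line, which is why both benchmark cells pass `lines=True` when reading it.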

### Validate the Target File Size

Check the size of the generated file.

In [7]:
!ls -lah ${SOURCE_DIR} | grep ${SOURCE_FILE}

-rw-r--r-- 1 root nogroup 1.2G Mar 10 21:09 ops.logs
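
The reported size can be sanity-checked with a back-of-the-envelope estimate from the configuration: 62 hours of data at a 5-second interval for 100 companies. The figure of roughly 270 bytes per record is an assumption based on the length of the sample records in this notebook, and the sketch also assumes one record per company per tick.

```python
# Values taken from the notebook configuration.
hours = 62             # generate_range spans 62 hours
interval_seconds = 5   # timestamps: {interval: 5s}
companies = 100        # add_level(name='company', number=100, ...)

# Assumption: roughly 270 bytes per JSON record (estimated from sample rows).
approx_bytes_per_record = 270

ticks = hours * 3600 // interval_seconds
records = ticks * companies
approx_size_gib = records * approx_bytes_per_record / 2**30

print(f"{ticks:,} ticks x {companies} companies = {records:,} records")
print(f"approximate file size: {approx_size_gib:.2f} GiB")
```

Under these assumptions the estimate lands slightly above 1 GiB, which is in line with the size that `ls` reports.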


In [8]:
!head ${SOURCE_PATH}

{"company":"Hess-Brooks","cpu_utilization":76.3749519467,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":0.9564425845,"packet_loss_is_error":false,"throughput":240.4432458583,"throughput_is_error":false,"timestamp":1583874132961}
{"company":"Humphrey__Vang_and_Higgins","cpu_utilization":74.2560893723,"cpu_utilization_is_error":false,"latency":1.3648952547,"latency_is_error":false,"packet_loss":0.0,"packet_loss_is_error":false,"throughput":227.331291144,"throughput_is_error":false,"timestamp":1583874132961}
{"company":"Mckee-Garcia","cpu_utilization":90.3479072447,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":5.0601654427,"packet_loss_is_error":false,"throughput":238.7143407116,"throughput_is_error":false,"timestamp":1583874132961}
{"company":"Howell_PLC","cpu_utilization":77.4882786327,"cpu_utilization_is_error":false,"latency":9.592764932,"latency_is_error":false,"packet_loss":4.2747499587,"packet_loss_is_er

## Benchmark

The benchmark tests use the following flow:

- Read file
- Compute aggregations
- Get the n-largest values
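
As a plain-Python illustration of what these three steps compute (not of how pandas or cuDF implement them), the following sketch runs the same flow on a tiny in-memory data set. The company names `A` and `B` and all metric values are hypothetical.

```python
import heapq
import io
import json
from collections import defaultdict
from statistics import mean

# Hypothetical miniature data source in the same JSON Lines layout.
sample = io.StringIO(
    '{"company": "A", "cpu_utilization": 70.0, "latency": 1.0}\n'
    '{"company": "A", "cpu_utilization": 90.0, "latency": 3.0}\n'
    '{"company": "B", "cpu_utilization": 80.0, "latency": 2.0}\n'
)
metric_names = ["cpu_utilization", "latency"]

# 1. Read file: one JSON object per line.
rows = [json.loads(line) for line in sample]

# 2. Compute aggregations: min, max, and mean of each metric per company.
grouped = defaultdict(lambda: defaultdict(list))
for row in rows:
    for name in metric_names:
        grouped[row["company"]][name].append(row[name])

aggregated = {
    company: {
        name: {"min": min(vals), "max": max(vals), "mean": mean(vals)}
        for name, vals in metrics.items()
    }
    for company, metrics in grouped.items()
}

# 3. Get the n-largest rows by cpu_utilization (from the original rows).
top_rows = heapq.nlargest(2, rows, key=lambda r: r["cpu_utilization"])

print(aggregated["A"]["cpu_utilization"])
print([r["cpu_utilization"] for r in top_rows])
```

The benchmark cells express steps 2 and 3 as single `groupby(...).agg(...)` and `nlargest(...)` calls, which lets each library run them with its own optimized implementation.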

In [9]:
benchmark_file = source_file

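The following examples use the `%%timeit` cell magic, which executes the cell repeatedly and reports the mean time per loop.<br>
You can set the number of loops per run (`-n`) and the number of runs (`-r`); for example: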
```
%%timeit -n 1 -r 1
```
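
Outside a notebook, a similar measurement can be sketched with the standard-library `timeit` module, where `number` plays the role of `-n` and `repeat` plays the role of `-r`. The `workload` function below is a hypothetical stand-in for a benchmarked cell body.

```python
import statistics
import timeit


def workload():
    """Hypothetical stand-in for the body of a benchmarked cell."""
    return sum(i * i for i in range(10_000))


# number corresponds to -n (loops per run); repeat corresponds to -r (runs).
loops, runs = 3, 5
totals = timeit.repeat(workload, number=loops, repeat=runs)

# Each entry is the total time of one run, so divide by the loop count.
per_loop = [total / loops for total in totals]

print(
    f"{statistics.mean(per_loop) * 1e3:.3f} ms "
    f"+- {statistics.pstdev(per_loop) * 1e3:.3f} ms per loop "
    f"({runs} runs, {loops} loops each)"
)
```

With a single run (`-r 1`), the standard deviation is necessarily zero, so such a measurement gives no indication of run-to-run variation.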

### cuDF Benchmark

In [10]:
%%timeit

# Read file
gdf = cudf.read_json(benchmark_file, lines=True)

# Perform aggregation
ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get the n-largest values (from the original DataFrame)
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

4.43 s ± 47.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### pandas Benchmark

In [11]:
%%timeit

# Read file
pdf = pd.read_json(benchmark_file, lines=True)

# Perform aggregation
gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get the n-largest values (from the original DataFrame)
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

51.8 s ± 1.77 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Test Load Times

#### cuDF

In [12]:
%%timeit -r 2
gdf = cudf.read_json(benchmark_file, lines=True)

4.14 s ± 52.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)


#### pandas

In [13]:
%%timeit
pdf = pd.read_json(benchmark_file, lines=True)

50.3 s ± 6.38 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Test Aggregation

Load the file into memory with both libraries so that `timeit` measures only the aggregations.

In [14]:
gdf = cudf.read_json(benchmark_file, lines=True)
pdf = pd.read_json(benchmark_file, lines=True)

#### cuDF

In [15]:
%%timeit -n 1 -r 1

ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

604 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


#### pandas

In [16]:
%%timeit -n 1 -r 1

gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

4.18 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
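
To put the measurements side by side, the following sketch computes the pandas-to-cuDF time ratio for each stage from the mean timings reported in this notebook. The stage labels are chosen here for illustration. These ratios apply only to this run on this system, and the aggregation timings come from a single run each.

```python
# Mean timings in seconds, copied from the %%timeit output in this notebook.
timings = {
    "end-to-end": {"cudf": 4.43, "pandas": 51.8},
    "load": {"cudf": 4.14, "pandas": 50.3},
    "aggregation": {"cudf": 0.604, "pandas": 4.18},
}

# Ratio of pandas time to cuDF time for each stage.
speedups = {
    stage: result["pandas"] / result["cudf"]
    for stage, result in timings.items()
}

for stage, factor in speedups.items():
    print(f"{stage}: cuDF is about {factor:.1f}x faster than pandas")
```

In these measurements, reading the JSON file accounts for most of the end-to-end time in both libraries, so the end-to-end ratio is close to the load ratio.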
