<a id="introduction"></a>
## RAPIDS Foundations with cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames.

**Table of Contents**

* [RAPIDS Foundations with cuDF](#introduction)
* [Setup](#setup)
* [cuDF DataFrame Basics](#basics)
* [Input/Output](#io)
* [Mortgage Dataset](#mortgage)
* [cuDF API](#cudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai:0.6-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai)
* `rapidsai/rapidsai-nightly:0.6-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on an NVIDIA Tesla V100 GPU. Your system may differ, and you may need to modify the code or install additional packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [None]:
!nvidia-smi

Next, let's see what CUDA version we have:

In [None]:
!nvcc --version

<a id="basics"></a>
## cuDF DataFrame Basics

As we showed in the previous tutorial, cuDF DataFrames are tabular data structures that reside on the GPU. We interface with cuDF DataFrames in the same way we interface with Pandas DataFrames that reside on the CPU - with a few deviations.

In the next several sections, we'll show how to create and manipulate cuDF DataFrames. For more information on using cuDF DataFrames, check out the documentation: https://rapidsai.github.io/projects/cudf/en/latest/index.html

#### Creating a cudf.DataFrame using lists

There are several ways to create a cuDF DataFrame. The easiest of these is to instantiate an empty cuDF DataFrame and then use Python list objects or NumPy arrays to create columns. Below, we import the cuDF library and create an empty cuDF DataFrame.

In [None]:
import cudf; print('cuDF Version:', cudf.__version__)


df = cudf.DataFrame()
print(df)

Next, we can create two columns named `key` and `value` by using the bracket notation with the cuDF DataFrame and storing either a list of Python values or a NumPy array into that column.

In [None]:
import numpy as np; print('NumPy Version:', np.__version__)

# here we create two columns named "key" and "value"
df['key'] = [0, 1, 2, 3, 4]
df['value'] = np.arange(10, 15)
print(df)

#### Creating a cudf.DataFrame using a list of tuples or a dictionary

Another way we can create a cuDF DataFrame is by providing a mapping of column names to column values, either via a list of tuples or by using a dictionary. In the examples below, we create a list of two-value tuples; the first value is the name of the column - for example, `id` or `timestamp` - and the second value is a list of Python objects or a NumPy array. Note that we don't have to constrain the data stored in our cuDF DataFrames to common data types like integers or floats - we can use more exotic data types such as datetimes or strings. We'll investigate how such data types behave on the GPU a bit later.

In [None]:
from datetime import datetime, timedelta


ids = np.arange(5)
t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
timestamps = [t0 + timedelta(seconds=x) for x in range(5)]
timestamps_np = np.array(timestamps, dtype='datetime64')

In [None]:
df = cudf.DataFrame([('id', ids), ('timestamp', timestamps_np)])
print(df)

Alternatively, we can create a dictionary of key-value pairs, where each key represents a column name and the value associated with that key holds the values that belong in that column.

In [None]:
df = cudf.DataFrame({'id': ids, 'timestamp': timestamps_np})
print(df)

#### Creating a cudf.DataFrame from a Pandas DataFrame

Pandas DataFrames are first-class citizens within cuDF - this means that we can create a cuDF DataFrame from a Pandas DataFrame and vice versa.

In [None]:
import pandas as pd

pandas_df = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [0.1, 0.2, None, 0.3]})
print(pandas_df)

In [None]:
df = cudf.from_pandas(pandas_df)
print(df)

#### Inspecting a cuDF DataFrame

Evaluating a DataFrame as the last expression in a notebook cell displays its contents.

In [None]:
df

The `head` method returns the first few rows, which is a quick way to preview a large DataFrame.

In [None]:
print(df.head())

#### Columns

The `columns` attribute lists the column names of the DataFrame.

In [None]:
print(df.columns)


In [None]:
# df.columns = ['']
# df

#### Data types

The `dtypes` attribute reports the data type of each column.

In [None]:
print(df.dtypes)

#### Series

Selecting a single column with bracket notation returns a cuDF Series, the one-dimensional counterpart of a DataFrame.

In [None]:
print(df['a'].head())

#### Index

The `index` attribute holds the row labels of the DataFrame.

In [None]:
df.index

#### Converting a cudf.DataFrame to a Pandas DataFrame

Calling `to_pandas` copies the data from GPU memory back to the host and returns a Pandas DataFrame.

In [None]:
df.to_pandas()

#### Converting a cudf.DataFrame to a NumPy Array

One way to obtain a NumPy array is to convert to a Pandas DataFrame first and then read its `values` attribute.

In [None]:
df.to_pandas().values

#### Converting a cudf.DataFrame to Arrow

The `to_arrow` method converts the DataFrame to an Apache Arrow table.

In [None]:
df.to_arrow()

<a id="io"></a>
## Input/Output

cuDF can read tabular data from several file formats directly into a DataFrame. This section covers CSV, Parquet, and ORC files.

#### Loading a CSV file

Perhaps one of the most common ways to create cuDF DataFrames is by loading a table that is stored as a file on disk. cuDF provides a lot of functionality for reading in a variety of different data formats. Below, we show how easy it is to read in a CSV file:

In [None]:
df = cudf.read_csv('../datasets/iris.csv')
print(df)

CSV files come in many flavors, and cuDF tries to be flexible by mirroring the Pandas API wherever possible. For more information on the parameters available when reading files, see the cuDF IO documentation:

https://rapidsai.github.io/projects/cudf/en/latest/api.html#cudf.io.csv.read_csv

#### Loading a Parquet file

CSV is a row-based file format because it represents the data row by row. However, data can also be represented in a columnar format. A common implementation of such a file format is Parquet. Columnar file formats have a number of advantages over traditional row-based file formats, and cuDF provides support for them as well.

In [None]:
df = cudf.read_parquet('../datasets/iris.parq')
print(df)
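To make the row-based versus columnar distinction concrete, here is a minimal plain-Python sketch. It does not use cuDF or Parquet, and the sample values and variable names are made up for illustration. It stores the same small table both ways and checks that a single column comes out identical from either layout.

```python
# Row-based layout: one record per row, in the spirit of a CSV file.
rows = [
    ("setosa", 5.1, 3.5),
    ("versicolor", 7.0, 3.2),
    ("virginica", 6.3, 3.3),
]
names = ("species", "sepal_length", "sepal_width")

# Columnar layout: one sequence per column, in the spirit of a columnar file.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Reading one column from the row layout has to visit every record...
sepal_length_from_rows = [row[1] for row in rows]
# ...whereas the columnar layout already holds it as a single sequence.
sepal_length_from_columns = columns["sepal_length"]

print(sepal_length_from_rows == sepal_length_from_columns)
```

Analytical queries often touch only a few columns, which is one reason columnar layouts tend to suit them.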

#### Loading an ORC file

ORC is another columnar file format. The cell below is commented out; it sketches how such a file could be read with `cudf.read_orc`.

In [None]:
# df = cudf.read_orc('../datasets/iris.orc')
# print(df)

#### Loading from S3


In [None]:
# df = cudf.read_csv('../datasets/iris.csv')
# print(df)

#### Writing to a CSV file


In [None]:
# df.to_csv('../datasets/iris.csv')

#### Writing to a Parquet file


In [None]:
# df.to_parquet('../datasets/iris.parq')

#### Writing to an ORC file


In [None]:
# df.to_orc('../datasets/iris.orc')

<a id="mortgage"></a>
## Mortgage Dataset

This dataset is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.

To obtain the full raw dataset, visit [Fannie Mae](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html), register for an account, and download the data.

#### Download


Dataset location: https://docs.rapids.ai/datasets/mortgage-data

The RAPIDS container hosted on our Docker Hub has notebooks that use the following datasets. To set them up:

1. Download the datasets inside the container using `wget`, or download them to the local host and use a Docker volume mount to `/rapids/data/`.
2. Decompress the dataset using `tar xzvf NAME_OF_DATASET.tgz`.
3. Confirm that the following directory structure exists in `/rapids/data/`:

* /rapids/data/mortgage/acq/         <- all acquisition data
* /rapids/data/mortgage/perf/        <- all performance data
* /rapids/data/mortgage/names.csv    <- lender name normalization


Additional background on the data is available at:

* http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html
* https://loanperformancedata.fanniemae.com/lppub-docs/FNMA_SF_Loan_Performance_FAQs.pdf

In [None]:
# %%bash

# BASE_URL=http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data
# OUTPUT_DIRECTORY=/datasets/rapids/mortgage

# # 1 Year - ~4 GB
# FILENAME=mortgage_2000.tgz
# # 2 Years - ~X GB
# FILENAME=mortgage_2000-2001.tgz
# # 4 Years - ~X GB
# FILENAME=mortgage_2000-2003.tgz
# # 8 Years - ~X GB
# FILENAME=mortgage_2000-2007.tgz
# # 16 Years - ~X GB
# FILENAME=mortgage_2000-2015.tgz
# # 17 Years - ~X GB
# FILENAME=mortgage_2000-2016.tgz

# wget ${BASE_URL}/${FILENAME} -O ${OUTPUT_DIRECTORY}/${FILENAME}
# tar -xvf ${OUTPUT_DIRECTORY}/${FILENAME} -C ${OUTPUT_DIRECTORY}


In [None]:
%%bash

ls -alh /datasets/rapids/mortgage/mortgage_2000_1gb
ls -alh /datasets/rapids/mortgage/mortgage_2000_1gb/acq
ls -alh /datasets/rapids/mortgage/mortgage_2000_1gb/perf
du -h -d 1 /datasets/rapids/mortgage/mortgage_2000_1gb/acq
du -h -d 1 /datasets/rapids/mortgage/mortgage_2000_1gb/perf

#### Acquisition

The acquisition files describe each loan at origination. Below, we build the path to the first acquisition file, define the column names and data types, and load the file with `cudf.read_csv`.

In [None]:
import os

base_path = os.path.join('/', 'datasets', 'rapids', 'mortgage', 'mortgage_2000_1gb')
acquisition_path = os.path.join(base_path, 'acq')
acquisition_filenames = sorted(os.listdir(acquisition_path))
acquisition_file_path = os.path.join(acquisition_path, acquisition_filenames[0])

In [None]:
from collections import OrderedDict

acquisition_dtypes = OrderedDict([
        ('loan_id', 'int64'),
        ('orig_channel', 'category'),
        ('seller_name', 'category'),
        ('orig_interest_rate', 'float64'),
        ('orig_upb', 'int64'),
        ('orig_loan_term', 'int64'),
        ('orig_date', 'date'),
        ('first_pay_date', 'date'),
        ('orig_ltv', 'float64'),
        ('orig_cltv', 'float64'),
        ('num_borrowers', 'float64'),
        ('dti', 'float64'),
        ('borrower_credit_score', 'float64'),
        ('first_home_buyer', 'category'),
        ('loan_purpose', 'category'),
        ('property_type', 'category'),
        ('num_units', 'int64'),
        ('occupancy_status', 'category'),
        ('property_state', 'category'),
        ('zip', 'int64'),
        ('mortgage_insurance_percent', 'float64'),
        ('product_type', 'category'),
        ('coborrow_credit_score', 'float64'),
        ('mortgage_insurance_type', 'float64'),
        ('relocation_mortgage_indicator', 'category')
    ])

In [None]:
print('Loading file:', acquisition_file_path)
acquisition_df = cudf.read_csv(acquisition_file_path, delimiter='|',
                               names=list(acquisition_dtypes.keys()),
                               dtype=list(acquisition_dtypes.values()))

In [None]:
print(acquisition_df.head())
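The mortgage files have no header row and use `|` as the delimiter, which is why the column names and data types are supplied separately. The following plain-Python sketch illustrates the same idea with the standard `csv` module rather than cuDF. The two sample rows, the three-column `schema`, and the converter functions are assumptions made up for this illustration.

```python
import csv
import io

# Hypothetical two-row sample in a pipe-delimited, header-less style.
raw = "100001|R|6.5\n100002|C|7.25\n"

# Column names paired with the Python type used to parse each column.
schema = [("loan_id", int), ("orig_channel", str), ("orig_interest_rate", float)]
names = [name for name, _ in schema]

# Because fieldnames is given, the first line is treated as data, not a header.
reader = csv.DictReader(io.StringIO(raw), fieldnames=names, delimiter="|")
records = [
    {name: convert(row[name]) for name, convert in schema}
    for row in reader
]

print(records[0])
```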

#### Performance

The performance files record the status of each loan for every monthly reporting period. We load the first performance file in the same way as the acquisition data.

In [None]:
performance_path = os.path.join(base_path, 'perf')
performance_filenames = sorted(os.listdir(performance_path))
performance_filename = performance_filenames[0]
performance_file_path = os.path.join(performance_path, performance_filename)

In [None]:
performance_dtypes = OrderedDict([
            ('loan_id', 'int64'),
            ('monthly_reporting_period', 'date'),
            ('servicer', 'category'),
            ('interest_rate', 'float64'),
            ('current_actual_upb', 'float64'),
            ('loan_age', 'float64'),
            ('remaining_months_to_legal_maturity', 'float64'),
            ('adj_remaining_months_to_maturity', 'float64'),
            ('maturity_date', 'date'),
            ('msa', 'float64'),
            ('current_loan_delinquency_status', 'int32'),
            ('mod_flag', 'category'),
            ('zero_balance_code', 'category'),
            ('zero_balance_effective_date', 'date'),
            ('last_paid_installment_date', 'date'),
            ('foreclosed_after', 'date'),
            ('disposition_date', 'date'),
            ('foreclosure_costs', 'float64'),
            ('prop_preservation_and_repair_costs', 'float64'),
            ('asset_recovery_costs', 'float64'),
            ('misc_holding_expenses', 'float64'),
            ('holding_taxes', 'float64'),
            ('net_sale_proceeds', 'float64'),
            ('credit_enhancement_proceeds', 'float64'),
            ('repurchase_make_whole_proceeds', 'float64'),
            ('other_foreclosure_proceeds', 'float64'),
            ('non_interest_bearing_upb', 'float64'),
            ('principal_forgiveness_upb', 'float64'),
            ('repurchase_make_whole_proceeds_flag', 'category'),
            ('foreclosure_principal_write_off_amount', 'float64'),
            ('servicing_activity_indicator', 'category')
        ])

In [None]:
print('Loading file:', performance_file_path)
performance_df = cudf.read_csv(performance_file_path, delimiter='|',
                               names=list(performance_dtypes.keys()),
                               dtype=list(performance_dtypes.values()))

In [None]:
print(performance_df.head())

#### Names

The `names.csv` file is used for lender name normalization: it maps each raw `seller_name` to a normalized name stored in the `new` column.

In [None]:
names_dtypes = OrderedDict([
        ("seller_name", "category"),
        ("new", "category"),
    ])

In [None]:
names_file_path = os.path.join(base_path, 'names.csv')
print('Loading file:', names_file_path)
names_df = cudf.read_csv(names_file_path, delimiter='|', 
                         names=list(names_dtypes.keys()), 
                         dtype=list(names_dtypes.values()))

In [None]:
print(names_df.head())

<a id="cudfapi"></a>
## cuDF API

In this section, we walk through common operations on cuDF DataFrames, including selection, missing data, aggregation, merges, and categoricals.

#### Selecting Rows or Columns

Bracket notation with a column name selects a single column.

In [None]:
# df['a']

A column can also be accessed as an attribute of the DataFrame.

In [None]:
# df.a

Slicing with a range of integers selects rows by position.

In [None]:
# df[0:10]

The `loc` indexer selects rows by label and columns by name in a single call.

In [None]:
# print(df.loc[2:5, ['a', 'b']])

#### Boolean Indexing

A boolean mask, such as the result of a comparison, selects only the elements where the mask is true.

In [None]:
# print(df.b[df.b > 15])

#### Missing Data

The `fillna` method replaces missing values with a specified value.

In [None]:
# print(s.fillna(999))

#### Statistical Operations

A Series supports common reductions such as `mean` and `var`.

In [None]:
# print(s.mean(), s.var())
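As a point of reference for what a mean and a variance compute, here is a small sketch using Python's standard `statistics` module instead of cuDF. The sample values are made up. Note that `statistics.variance` divides by `n - 1` and `statistics.pvariance` divides by `n`; which convention a given `var` method uses by default is something to confirm in that library's documentation.

```python
import statistics

values = [1, 2, 3, 4]

mean = statistics.mean(values)
# Sample variance: sum of squared deviations divided by (n - 1).
sample_var = statistics.variance(values)
# Population variance: sum of squared deviations divided by n.
population_var = statistics.pvariance(values)

print(mean, sample_var, population_var)
```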

#### Applymap Operations

The `applymap` method applies a user-defined function to each element of a Series.

In [None]:
# def add_ten(num):
#     return num + 10

# print(df['a'].applymap(add_ten))

#### Histogramming

The `value_counts` method counts how many times each distinct value occurs in a Series.

In [None]:
# print(df.a.value_counts())
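Counting distinct values can be pictured with `collections.Counter` from the Python standard library. This is a plain-Python analogy with made-up values, not the cuDF implementation.

```python
from collections import Counter

values = [0, 1, 1, 2, 2, 2]

# Tally how often each distinct value occurs.
counts = Counter(values)

# most_common lists (value, count) pairs from most to least frequent.
print(counts.most_common())
```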

#### Merges

The `merge` method joins two DataFrames on one or more key columns. Below, we perform a left join of the acquisition data with the lender name mapping on `seller_name`.

In [None]:
df_merged = acquisition_df.merge(names_df, on=['seller_name'], how='left')

In [None]:
df_merged.dtypes

In [None]:
print(df_merged.shape)
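The semantics of a left join can be sketched in plain Python: every row of the left table is kept, and rows without a match on the right receive a missing value. The tables and names below are hypothetical, and the sketch assumes each key appears at most once in the right-hand mapping.

```python
# Hypothetical left table (loans) and right table (name mapping).
loans = [
    {"loan_id": 1, "seller_name": "BANK A, N.A."},
    {"loan_id": 2, "seller_name": "BANK B"},
    {"loan_id": 3, "seller_name": "OTHER"},
]
name_map = {"BANK A, N.A.": "Bank A", "BANK B": "Bank B"}

# Left join: keep every left row; use None when the key has no match.
merged = [dict(row, new=name_map.get(row["seller_name"])) for row in loans]

for row in merged:
    print(row)
```

With unique keys on the right, a left join yields exactly as many rows as the left table, which is a useful sanity check after a merge.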

#### Concatenations

The `cudf.concat` function stacks DataFrames on top of one another. Below, we combine the first and last rows of the performance data into a single DataFrame.

In [None]:
head_df = performance_df.head()
tail_df = performance_df.tail()
head_and_tail_df = cudf.concat([head_df, tail_df], ignore_index=True)

In [None]:
print(head_and_tail_df.shape)

In [None]:
print(head_and_tail_df)
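The effect of `ignore_index=True` can be illustrated with a plain-Python sketch: the pieces are stacked, and the original row labels are replaced by a fresh 0-based numbering. The labels and values here are made up for illustration.

```python
# Two hypothetical pieces of a larger table, each a list of (label, value) pairs.
head = [(0, "a"), (1, "b")]
tail = [(98, "y"), (99, "z")]

# Keeping the original labels simply stacks the pieces.
stacked = head + tail

# Ignoring the index discards the old labels and renumbers the rows from 0.
renumbered = list(enumerate(value for _, value in stacked))

print(stacked)
print(renumbered)
```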

#### Appends

The `append` method concatenates one Series onto the end of another.

In [None]:
# print(df.a.head().append(df.b.head()))

#### Groupbys

A groupby splits a DataFrame into groups that share a key and then aggregates each group. First, we add two columns to group on.

In [None]:
# df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
# df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]

Grouping on a single column and calling `sum` totals the remaining columns within each group.

In [None]:
# print(df.groupby('agg_col1').sum())

We can also group on several columns at once, or use `agg` to apply a different aggregation to each column.

In [None]:
# print(df.groupby(['agg_col1', 'agg_col2']).sum())

In [None]:
# print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))
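The split-then-aggregate idea behind a groupby can be sketched in plain Python. This is an analogy using a made-up list of `(agg_col1, a)` pairs, not the cuDF implementation.

```python
from collections import defaultdict

# Hypothetical rows: (agg_col1, a) pairs.
rows = [(1, 10), (0, 11), (1, 12), (0, 13), (1, 14)]

# Split the rows by key and combine each group with a running sum.
totals = defaultdict(int)
for key, value in rows:
    totals[key] += value

print(dict(totals))
```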

#### Categoricals

Categorical columns store each distinct value once and represent the data as integer codes. Below, we create a categorical column in Pandas and convert it to cuDF.

In [None]:
# pdf = pd.DataFrame({"id":[1,2,3,4,5,6], "grade":['a', 'b', 'b', 'a', 'a', 'e']})
# pdf["grade"] = pdf["grade"].astype("category")

# gdf = cudf.DataFrame.from_pandas(pdf)
# print(gdf)

In [None]:
# print(gdf.grade.cat.categories)

In [None]:
# print(gdf.grade.cat.codes)
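The relationship between categories and codes can be sketched in plain Python, using the same `grade` values as the commented-out cell above. The sketch assumes the categories are the sorted distinct values, which is how Pandas infers them for string data; it is an illustration, not the cuDF internals.

```python
grades = ["a", "b", "b", "a", "a", "e"]

# Categories: the sorted distinct values.
categories = sorted(set(grades))

# Codes: the position of each value within the categories.
lookup = {value: code for code, value in enumerate(categories)}
codes = [lookup[value] for value in grades]

print(categories)
print(codes)
```

Storing small integer codes instead of repeated strings is what makes categorical columns compact.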

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames: how to create and inspect them, how to load data from files, and how to use common parts of the cuDF API on the mortgage dataset.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)