<a id="introduction"></a>
## Exploratory Data Analysis
#### By Paul Hendricks
-------

In this notebook, we show how to get started with GPU DataFrames using cuDF in RAPIDS.

**Table of Contents**

* [Exploratory Data Analysis](#introduction)
* [Setup](#setup)
* Creating a CUDA Cluster
* Loading data
* Fundamentals of a cuDF DataFrame
* Working with NULLs
* [Conclusion](#conclusion)

Before going any further, let's make sure we have access to `matplotlib` and `seaborn`, popular Python libraries for visualizing data.

In [None]:
import os

try:
    import matplotlib; print('Matplotlib Version:', matplotlib.__version__)
except ModuleNotFoundError:
    os.system('conda install -y matplotlib')

In [None]:
try:
    import seaborn as sns; print('Seaborn Version:', sns.__version__)
except ModuleNotFoundError:
    os.system('conda install -y seaborn')
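The two cells above import each library and fall back to `conda install` when the import fails. As a minimal sketch of the same check, the helper below (a hypothetical name, `is_installed`, not part of any library) asks Python whether a package can be found without importing it. It only reports availability and does not install anything.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found in this environment."""
    # find_spec locates a top-level module without importing it
    return importlib.util.find_spec(package_name) is not None


# report which of the plotting libraries used in this notebook are present
for name in ('matplotlib', 'seaborn'):
    status = 'available' if is_installed(name) else 'missing'
    print(name, ':', status)
```

A check like this is handy when you want to see what is missing before deciding how to install it.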

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai:0.6-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai)
* `rapidsai/rapidsai-nightly:0.6-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on an NVIDIA Tesla V100 GPU. Your system may differ, so you may need to modify the code or install additional packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [None]:
!nvidia-smi

Next, let's see what CUDA version we have:

In [None]:
!nvcc --version
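If either command above fails, the tool may simply not be on your `PATH`. The sketch below, which assumes nothing beyond the Python standard library, checks for both tools before you try to run them. The helper name `tool_available` is our own.

```python
import shutil


def tool_available(tool_name):
    """Return True if `tool_name` is found on the PATH."""
    return shutil.which(tool_name) is not None


# check the two command-line tools used in the cells above
for tool in ('nvidia-smi', 'nvcc'):
    print(tool, ':', 'found' if tool_available(tool) else 'not found on PATH')
```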

Next, let's load some helper functions from `matplotlib` and configure the Jupyter Notebook for visualization.

In [None]:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt


%matplotlib inline

## Creating a CUDA Cluster

Before we start working with cuDF DataFrames with Dask, we need to set up a local CUDA cluster and a client to work with our GPUs.

In [None]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import subprocess

# parse the hostname IP address
cmd = "hostname --all-ip-addresses"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
ip_address = str(output.decode()).split()[0]

# create a local CUDA cluster
cluster = LocalCUDACluster(ip=ip_address)
client = Client(cluster)
client
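The cell above takes the first address printed by `hostname --all-ip-addresses` and passes it to the cluster. To make that parsing step easier to follow, here is a small sketch that isolates it in a function and runs it on a made-up output string. The function name and the sample addresses are hypothetical, and unlike the cell above, the sketch raises an error when the command prints nothing.

```python
def first_ip_address(raw_output):
    """Return the first address in the output of `hostname --all-ip-addresses`."""
    # the command prints space-separated addresses followed by a newline
    addresses = raw_output.decode().split()
    if not addresses:
        raise ValueError('no IP addresses found in command output')
    return addresses[0]


# hypothetical output for a host with two network interfaces
sample_output = b'10.0.0.5 172.17.0.1 \n'
print(first_ip_address(sample_output))
```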

## Loading data

We will work with the performance files from the mortgage dataset stored under `/datasets/rapids/mortgage/mortgage_2000_1gb`. First, let's list the files, then define the column names and data types, and finally read the files into a Dask cuDF DataFrame.

In [None]:
!ls -alh /datasets/rapids/mortgage/mortgage_2000_1gb/perf

In [None]:
from collections import OrderedDict

base_path = os.path.join('/', 'datasets', 'rapids', 'mortgage', 'mortgage_2000_1gb')

dtypes = OrderedDict([
        ('loan_id', 'int64'),
        ('monthly_reporting_period', 'date'),
        ('servicer', 'category'),
        ('interest_rate', 'float64'),
        ('current_actual_upb', 'float64'),
        ('loan_age', 'float64'),
        ('remaining_months_to_legal_maturity', 'float64'),
        ('adj_remaining_months_to_maturity', 'float64'),
        ('maturity_date', 'date'),
        ('msa', 'float64'),
        ('current_loan_delinquency_status', 'int32'),
        ('mod_flag', 'category'),
        ('zero_balance_code', 'category'),
        ('zero_balance_effective_date', 'date'),
        ('last_paid_installment_date', 'date'),
        ('foreclosed_after', 'date'),
        ('disposition_date', 'date'),
        ('foreclosure_costs', 'float64'),
        ('prop_preservation_and_repair_costs', 'float64'),
        ('asset_recovery_costs', 'float64'),
        ('misc_holding_expenses', 'float64'),
        ('holding_taxes', 'float64'),
        ('net_sale_proceeds', 'float64'),
        ('credit_enhancement_proceeds', 'float64'),
        ('repurchase_make_whole_proceeds', 'float64'),
        ('other_foreclosure_proceeds', 'float64'),
        ('non_interest_bearing_upb', 'float64'),
        ('principal_forgiveness_upb', 'float64'),
        ('repurchase_make_whole_proceeds_flag', 'category'),
        ('foreclosure_principal_write_off_amount', 'float64'),
        ('servicing_activity_indicator', 'category')
    ])

In [None]:
import cudf; print('cuDF Version:', cudf.__version__)
import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)
import numpy as np; print('NumPy Version:', np.__version__)
import os

filepath = os.path.join(base_path, 'perf', 'Performance_*')
# filepath = os.path.join(base_path, 'perf', 'Performance_2000Q1.txt_0')
# name the DataFrame perf_df, which is the name the cells below use
perf_df = dask_cudf.read_csv(filepath, delimiter='|',
                             names=list(dtypes.keys()), dtype=list(dtypes.values()))
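The performance files are pipe-delimited and have no header row, which is why the column names come from the keys of `dtypes`. The sketch below illustrates that idea on a tiny in-memory sample using only the standard library `csv` module. The three-column subset and the sample rows are made up for illustration, and the sketch does not show how `dask_cudf` converts types or represents empty fields.

```python
import csv
import io
from collections import OrderedDict

# a small subset of the columns, for illustration only
sample_dtypes = OrderedDict([
    ('loan_id', 'int64'),
    ('servicer', 'category'),
    ('interest_rate', 'float64'),
])

# hypothetical pipe-delimited rows with no header line
sample_text = '100001|BANK A|7.5\n100002||8.0\n'

# column names come from the dictionary keys, in insertion order
names = list(sample_dtypes.keys())
reader = csv.DictReader(io.StringIO(sample_text), fieldnames=names, delimiter='|')
rows = list(reader)

print(names)
for row in rows:
    print(dict(row))
```

Note that the standard library reader returns every field as a string and leaves an empty field as an empty string.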

<a id=""></a>
## Fundamentals of a cuDF DataFrame

<a id=""></a>
## Working with NULLs

Next, let's find out how complete each column is. We count the total number of rows, count the non-null values in each column, and then print the percentage of rows in each column that hold an actual value.

In [None]:
# calculate number of rows
number_of_rows = perf_df.map_partitions(len).compute().sum()
print(number_of_rows)
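The row count above is computed one partition at a time: `len` runs on every partition, and the per-partition lengths are summed. The following sketch mimics that pattern with plain Python lists standing in for partitions. The data is made up, and no Dask or cuDF calls are involved.

```python
# hypothetical partitions, each a list of rows
partitions = [
    [{'loan_id': 1}, {'loan_id': 2}, {'loan_id': 3}],
    [{'loan_id': 4}, {'loan_id': 5}],
    [],
]

# one length per partition, analogous to map_partitions(len)
partition_lengths = [len(partition) for partition in partitions]

# summing the per-partition lengths gives the total row count
total_rows = sum(partition_lengths)
print(partition_lengths, total_rows)
```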

In [None]:
# calculate how many rows in each column have actual values
# # ideal (a single call over all columns)
# column_counts = perf_df.count()
# column_counts

# alternative
column_counts = []
for column in list(perf_df.columns):
    column_count = perf_df[column].count().compute()
    column_counts.append((column, column_count))

In [None]:
for column, count in column_counts:
    print(column, ':', (count / number_of_rows) * 100)
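To make the percentage calculation concrete, here is a minimal sketch of the same arithmetic on a handful of made-up rows, where `None` stands in for a missing value. The helper name `non_null_percentages` and the sample data are hypothetical; the sketch uses plain Python rather than cuDF.

```python
# hypothetical rows; None stands in for a missing value
rows = [
    {'loan_id': 1, 'servicer': 'BANK A', 'foreclosure_costs': None},
    {'loan_id': 2, 'servicer': None, 'foreclosure_costs': None},
    {'loan_id': 3, 'servicer': 'BANK B', 'foreclosure_costs': 1250.0},
    {'loan_id': 4, 'servicer': 'BANK A', 'foreclosure_costs': None},
]


def non_null_percentages(rows):
    """Return {column: percentage of rows holding a non-null value}."""
    total = len(rows)
    columns = rows[0].keys()
    return {
        # count the rows where this column is present, then scale to a percentage
        column: 100 * sum(row[column] is not None for row in rows) / total
        for column in columns
    }


for column, pct in non_null_percentages(rows).items():
    print(column, ':', pct)
```

Columns with a low percentage, such as `foreclosure_costs` in this sample, are mostly empty and may need special handling before analysis.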

<a id="conclusion"></a>
## Conclusion

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)