## Data Generation with RAPIDS

In this notebook you will learn how to easily generate a synthetic dataset and do the basic manipulation needed to use it for training.

**Table of Contents**
* 1. [Import Libraries](#Import-Libraries)
* 2. [Data Loading](#Data-Loading)


### Import Libraries

Built-in Python imports:

In [None]:
import time

import os
import sys
from enum import Enum

if sys.version_info[0] >= 3:
    from urllib.request import urlretrieve  # pylint: disable=import-error,no-name-in-module
else:
    from urllib import urlretrieve  # pylint: disable=import-error,no-name-in-module

class LearningTask(Enum):
    REGRESSION = 1
    CLASSIFICATION = 2
    MULTICLASS_CLASSIFICATION = 3

class Data:  # pylint: disable=too-few-public-methods,too-many-arguments
    def __init__(self, X_train, X_test, y_train, y_test, learning_task, qid_train=None,
                 qid_test=None):
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test
        self.learning_task = learning_task
        # For ranking task
        self.qid_train = qid_train
        self.qid_test = qid_test

Additional CPU imports:

In [2]:
import numpy as np;print('numpy Version:', np.__version__)
import pandas as pd;print('pandas Version:', pd.__version__)
import sklearn
import pickle
import numpy.matlib
# Visualization libraries
import ipyvolume as ipv
import matplotlib.pyplot as plt 
from mpl_toolkits.mplot3d import Axes3D

numpy Version: 1.16.2
pandas Version: 0.24.2


Import Algorithms and Dataset libraries:

In [3]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

Imports for GPU-accelerated datasets and algorithms:

In [4]:
import cupy;print('cupy Version', cupy.__version__)
import cudf;print('cudf Version', cudf.__version__)


import rapids_lib_v10 as rl
# NOTE: any time changes are made to rapids_lib_v10.py you can either:
#   1. refresh/reload via the code below, OR
#   2. restart the kernel
import importlib; importlib.reload(rl)

cupy Version 6.2.0
cudf Version 0.9.0


<module 'rapids_lib_v10' from '/rapids/notebooks/ml_tutorial/version_101/rapids_lib_v10.py'>

### Data Loading

Here we will load the [Higgs Boson detection data](https://archive.ics.uci.edu/ml/datasets/HIGGS). The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features.

**Attribute Information**

- The first column is the class label (1 for signal, 0 for background)
- 21 low-level features (kinematic properties): lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag
- 7 high-level features derived by physicists: m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.
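
The attribute list above can be turned into column names programmatically. The following is a minimal sketch, assuming a naming convention in which spaces and hyphens become underscores; the variable names (`low_level`, `high_level`, `higgs_columns`) are illustrative and not part of the dataset itself.

```python
# Column names follow the attribute list above; underscores replace spaces
# and hyphens (a naming convention assumed for this sketch).
label = ["label"]

low_level = ["lepton_pT", "lepton_eta", "lepton_phi",
             "missing_energy_magnitude", "missing_energy_phi"]
# Each of the four jets contributes pt, eta, phi and a b-tag.
for jet in range(1, 5):
    low_level += [f"jet_{jet}_{prop}" for prop in ("pt", "eta", "phi", "b_tag")]

high_level = ["m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]

higgs_columns = label + low_level + high_level
print(len(low_level), len(high_level), len(higgs_columns))
```

Building the list this way makes it easy to check that there are 21 low-level and 7 high-level features, and it avoids inconsistent spellings between the jet columns.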


We will use the function below to download the data and split it into training and testing parts. Here we use 80% of the data for training. You can adjust this fraction (`test_size`) as well as how many rows (out of 11 million in total) you want to read (`nrows`). Reading fewer rows is useful for testing purposes.

In [5]:
def prepare_higgs(dataset_folder, nrows):
    """Download (if needed), load, split, and cache the HIGGS dataset."""
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz'
    local_url = os.path.join(dataset_folder, os.path.basename(url))
    # The cache file name encodes the row count, so each subset gets its own pickle
    pickle_url = os.path.join(dataset_folder,
                              "higgs" + ("" if nrows is None else "-" + str(nrows)) + ".pkl")

    # Reuse a previously prepared split if one exists
    if os.path.exists(pickle_url):
        with open(pickle_url, "rb") as f:
            return pickle.load(f)

    # Download the compressed CSV only if it is not already on disk
    if not os.path.isfile(local_url):
        urlretrieve(url, local_url)
    column_names = ['label','lepton_pT','lepton_eta','lepton_phi','missing_energy_magnitude','missing_energy_phi','jet_1_pt',
                    'jet_1_eta','jet_1_phi','jet_1_b_tag','jet_2_pt','jet_2_eta','jet_2_phi','jet_2_b_tag',
                    'jet_3_pt','jet_3_eta','jet_3_phi','jet_3_b_tag','jet_4_pt','jet_4_eta','jet_4_phi',
                    'jet_4_b_tag','m_jj','m_jjj','m_lv','m_jlv','m_bb','m_wbb','m_wwbb']
    higgs = pd.read_csv(local_url, nrows=nrows, header=None, names=column_names)
    X = higgs.iloc[:, 1:]  # features: every column except the first
    y = higgs.iloc[:, 0]   # label: the first column

    # Hold out 20% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77,
                                                        test_size=0.2)
    data = Data(X_train, X_test, y_train, y_test, LearningTask.CLASSIFICATION)
    with open(pickle_url, "wb") as f:
        pickle.dump(data, f, protocol=4)
    return data
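
The function above avoids repeating the slow download and parse by caching its result as a pickle whose file name includes the row count. The load-or-build pattern it relies on can be sketched in isolation as follows. This is only an illustration: `load_or_build` and `expensive_build` are hypothetical helpers, not part of the notebook or of any library.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Return the cached object if present; otherwise build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f, protocol=4)
    return result


calls = []


def expensive_build():
    calls.append(1)  # record every real (uncached) build
    return {"rows": 100}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "higgs-100.pkl")
    first = load_or_build(path, expensive_build)   # builds and writes the cache
    second = load_or_build(path, expensive_build)  # reads the cache instead

print(len(calls), first == second)
```

One consequence of this design is that a stale pickle is returned as-is, so delete the `.pkl` file whenever the preparation logic changes.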

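Let's now read the dataset. With `num_rows = 11000000` we load the full dataset rather than a subset, so the first run will take a few minutes; reduce `num_rows` for quicker experiments. While waiting, you can open a new JupyterLab window and set up your GPU Dashboard for monitoring purposes.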


In [6]:
num_rows = 11000000
higgs_df = prepare_higgs(".", num_rows)

In [7]:
print('Number of rows in training dataset:', len(higgs_df.X_train))
print('Number of rows in testing dataset:',len(higgs_df.X_test))

Number of rows in training dataset: 8800000
Number of rows in testing dataset: 2200000
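
The row counts above are what an 80/20 split of 11,000,000 rows should give. The sketch below reproduces that arithmetic with a simplified, pure-Python stand-in for a shuffled split. It is not scikit-learn's implementation; `split_indices` is a hypothetical helper, and rounding the test count up is an assumption made for this illustration.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.2, seed=77):
    """Shuffle row indices and split them into train and test parts."""
    n_test = math.ceil(test_fraction * n_rows)  # test rows, rounded up (assumption)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))

# Same arithmetic for the full dataset, without materializing the indices
n_total = 11_000_000
n_test_full = math.ceil(0.2 * n_total)
print(n_total - n_test_full, n_test_full)
```

Changing `test_fraction` here plays the same role as changing `test_size` in the call to `train_test_split`.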


In [8]:
print(type(higgs_df.X_train))
print(type(higgs_df.y_train))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


Now we will convert the data object to two formats:

  1. **Pandas DataFrame** objects for CPU-based tasks
  2. **cuDF DataFrame** objects for GPU tasks

[RAPIDS cuDF](https://github.com/rapidsai/cudf) is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

In [9]:
# Pandas dataframe objects 
trainData_pDF = higgs_df.X_train
testData_pDF = higgs_df.X_test
trainLabels_pDF = higgs_df.y_train.to_frame()
testLabels_pDF = higgs_df.y_test.to_frame()

In [10]:
#cuDF dataframe objects 
trainData_cDF = cudf.DataFrame.from_pandas(trainData_pDF)
testData_cDF = cudf.DataFrame.from_pandas(testData_pDF)
trainLabels_cDF = cudf.DataFrame.from_pandas(trainLabels_pDF)
testLabels_cDF = cudf.DataFrame.from_pandas(testLabels_pDF)

We will now save these objects using the `%store` magic command so that we can use them in our next task.

In [11]:
%store trainData_cDF
%store trainLabels_cDF
%store testData_cDF 
%store testLabels_cDF

Stored 'trainData_cDF' (DataFrame)
Stored 'trainLabels_cDF' (DataFrame)
Stored 'testData_cDF' (DataFrame)
Stored 'testLabels_cDF' (DataFrame)


In [12]:

%store trainData_pDF
%store trainLabels_pDF 
%store testData_pDF
%store testLabels_pDF

Stored 'trainData_pDF' (DataFrame)
Stored 'trainLabels_pDF' (DataFrame)
Stored 'testData_pDF' (DataFrame)
Stored 'testLabels_pDF' (DataFrame)
