# Lab 1:  Data Loading, GPU Dataframe Creation, and Data Manipulation
Thanks to Anaconda for some material

In this lab you will learn how to:
- Load data into a GPU Data Frame (GDF)
- Manipulate data in a GDF and perform basic ETL functions


As you progress through this lab, look for instances of ***TASK:***. Each one marks an action you need to take to complete the lab.

***TASK:*** Execute the cell below to auto-time execution of every cell

In [None]:
# Add autotime of each block
!pip install ipython-autotime
%load_ext autotime

<br>
## Loading Data
The file we are going to load is netflow1.csv
  
Size = 965M<br>
Records = 17,296,829

<p>
### Traditional interface through Pandas

In [None]:
import pandas as pd

In [None]:
# Let's define the data: column names and data types
cols = [
     "strdate",
    "srcip",
    "dstip",
    "srcport",
    "dstport",
    "srcbytes",
    "dstbytes"   
]


dtypes = {    
    "strdate"  : str,
    "srcip"    : str,
    "dstip"    : str,
    "srcport"  : int,
    "dstport"  : int,
    "srcbytes" : int,
    "dstbytes" : int

}

In [None]:
file_1 = '../data/netflow1.csv'

In [None]:
df = pd.read_csv(file_1,  names=cols, dtype=dtypes, skiprows=1)

In [None]:
df.dtypes

<br>
## Creating a GPU Dataframe


In [None]:
import pygdf

In [None]:
gdf = pygdf.DataFrame.from_pandas(df)

In [None]:
gdf.dtypes

<br>
*Remember that df = CPU and gdf = GPU, since we will switch back and forth for performance comparisons.*

<br>
## Column Functionals and Transformations
One of the basic GDF operations is the column transform. We can apply built-in operations to each column, such as aggregation, arithmetic, and type casting:

***TASK:*** Run the following code block to display the first 5 rows of the 'srcport' column

In [None]:
gdf['srcport'].head()

<br>
***TASK:*** Find the largest amount of data sent to a destination

<details><summary>Click for Answer (one possible)</summary>
<p>
gdf['dstbytes'].max()
</p>
</details>

In [None]:
gdf['dstbytes'].max() 

In [None]:
# Since the maximum value fits in an int32, let's convert the data type
import numpy as np
gdf['dstbytes'] = gdf['dstbytes'].astype(np.int32)

In [None]:
gdf.dtypes

<br>
***TASK:*** What is the smallest data type that the SRC or DST ports could be converted to?
<br>
<br>
<details><summary>Click for answer:  Port - smallest data type</summary>
<p>
gdf['srcport'].max() is 65,534, which exceeds the int16 maximum of 32,767, so int32 is the smallest signed type that fits. The same holds for dstport.

</p>
</details>
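
The reasoning in the answer above can be checked on the CPU with NumPy's `np.iinfo`, which reports the range of each integer type. This is a minimal sketch: `smallest_int_dtype` is a hypothetical helper written for this illustration, not part of pygdf. It also shows that an unsigned 16-bit type would be wide enough for a port number; whether the GDF library accepts unsigned types is an assumption you would need to verify, which is why the lab answer stays with int32.

```python
import numpy as np

# Hypothetical helper (not part of pygdf): return the narrowest NumPy integer
# dtype whose range covers [min_value, max_value].
def smallest_int_dtype(max_value, min_value=0, signed_only=True):
    if signed_only:
        candidates = [np.int8, np.int16, np.int32, np.int64]
    else:
        # Ordered by width so the first match is the narrowest type.
        candidates = [np.uint8, np.int8, np.uint16, np.int16,
                      np.uint32, np.int32, np.uint64, np.int64]
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= min_value and max_value <= info.max:
            return np.dtype(candidate)
    raise ValueError("no integer dtype can hold this range")

port_max = 65534  # value reported by gdf['srcport'].max() in this lab
print(smallest_int_dtype(port_max))                     # int32
print(smallest_int_dtype(port_max, signed_only=False))  # uint16
```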



<br>
### Transformations
***TASK:*** Create a new GDF column called totalbytes that is the sum of the src and dst bytes

In [None]:
gdf['totalbytes'] = gdf['srcbytes'] + gdf['dstbytes']

In [None]:
gdf['totalbytes'].max()

In [None]:
# Run the same transformation on the CPU to compare the timing
df['totalbytes'] = df['srcbytes'] + df['dstbytes']

In [None]:
df['totalbytes'].max()

This performance gain comes from a small dataset and a simple transformation. As data size and analytic complexity increase, so does the performance gap between the GPU and the CPU.

<br>
### Filtering 
Filtering is done with the query() function that takes an expression string of column names. 
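
To see what such an expression string looks like, here is a minimal CPU-side sketch using pandas, whose `DataFrame.query` accepts the same kind of expression. The toy port values are made up for illustration, and it is an assumption that the GDF `query()` behaves the same way for simple comparisons like this one.

```python
import pandas as pd

# Toy data (illustrative values only, not taken from netflow1.csv)
toy = pd.DataFrame({
    "srcport": [1025, 1026, 1027, 1028],
    "dstport": [80, 443, 80, 22],
})

# Expression string that refers to columns by name
not_port_80 = toy.query("dstport != 80")

# Equivalent boolean-mask form, shown for comparison
mask_result = toy[toy["dstport"] != 80]

print(len(not_port_80))
```

Both forms keep only the rows whose destination port is not 80.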

In [None]:
gdf['srcport'].count()

***Note:*** The current beta version does not support strings, so we need to drop the date and IP fields

In [None]:
gdf.drop_column('strdate')
gdf.drop_column('srcip')
gdf.drop_column('dstip')

In [None]:
gdf.dtypes

In [None]:
# How many events in the dataset do not connect to port 80?
# (Despite its name, port_80 holds the rows where dstport != 80.)
port_80 = gdf.query('dstport != 80')

In [None]:
port_80['srcport'].count()

In [None]:
port_80.dtypes

In [None]:
port_80.head()

<p>
### Grouping and Counting

In [None]:
from collections import OrderedDict

In [None]:
# Add a column to count over; it is a copy of dstport, so counting it per group gives the rows per destination port
gdf['count'] = gdf['dstport']
gdf.dtypes

In [None]:
aggs = OrderedDict()
aggs['count'] = 'count'

stats = gdf.groupby(['dstport']).agg(aggs)

In [None]:
stats.dtypes

In [None]:
stats.head(25)

In [None]:
del stats

***TASK:***
- Normalize the total byte count to KB
- Compute the mean and standard deviation
- Compute the Z-Score
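
The three steps above amount to computing z = (x - mean) / std on the KB values. As a small CPU-side sketch on made-up byte counts, the same arithmetic can be checked with NumPy. The sample standard deviation (`ddof=1`) is used here because that is the pandas default; whether the GDF `std()` uses the same convention is an assumption to verify.

```python
import numpy as np

# Made-up byte counts for illustration only
total_bytes = np.array([512, 1024, 2048, 4096, 8192], dtype=np.int64)

# Step 1: normalize bytes to KB
kb = total_bytes / 1024.0

# Step 2: mean and sample standard deviation
mean = kb.mean()
std = kb.std(ddof=1)

# Step 3: z-score of every value
z = (kb - mean) / std

print("Std == %f \t Mean == %f" % (std, mean))
print(z)
```

After this transformation the z values have a mean of zero and a sample standard deviation of one, which is a quick sanity check for the GPU result.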

Answer below

In [None]:
gdf.dtypes

In [None]:
import math

In [None]:
def ScaleData(totalbytes, KB, Scale):
    # Row-wise function for apply_rows: divide each totalbytes value by Scale
    for i in range(totalbytes.size):
        KB[i] = totalbytes[i] / Scale

In [None]:
gdf = gdf.apply_rows(ScaleData, incols=['totalbytes'], outcols=dict(KB=np.float32), kwargs=dict(Scale=(1024)))

In [None]:
gdf.head(25)

In [None]:
mean = gdf['KB'].mean()

In [None]:
std = gdf['KB'].std()

In [None]:
print("Std == %f \t Mean == %f" % (std, mean))

In [None]:
def NormalizeData(KB, Z, s, m):
    # Row-wise function for apply_rows: Z = (KB - mean) / std
    for i in range(KB.size):
        Z[i] = (KB[i] - m) / s


In [None]:
gdf = gdf.apply_rows(NormalizeData, incols=['KB'], outcols=dict(Z=np.float32), kwargs=dict(s=std, m=mean ))

In [None]:
gdf.head(30)