# Lab 1:  Data Loading, GPU Dataframe Creation, and Data Manipulation
Thanks to Anaconda for some material

In this lab you will learn how to:
- Load data into a GPU Data Frame (GDF)
- Manipulate data in a GDF to perform some basic ETL and statistical functions


As you progress through this lab, look for instances of ***TASK:***. Each one marks a point where you need to take an action to complete the lab.

***TASK:*** Execute the cell below to auto-time execution of every cell

In [1]:
# Add autotime of each block
!pip install ipython-autotime
%load_ext autotime



<br>
_____
## Step 1: Loading Data
The file we are going to load is netflow1.csv
  
Size = 965 MB<br>
Records = 17,296,829

<p>
### Traditional interface through Pandas

In [2]:
import pandas as pd

time: 202 ms


In [3]:
# let's define the data - column names and data types
cols = [
    "strdate",
    "srcip",
    "dstip",
    "srcport",
    "dstport",
    "srcbytes",
    "dstbytes"   
]


dtypes = {    
    "strdate"  : str,
    "srcip"    : str,
    "dstip"    : str,
    "srcport"  : int,
    "dstport"  : int,
    "srcbytes" : int,
    "dstbytes" : int

}

time: 3.51 ms


In [4]:
file_1 = '../data/netflow1.csv'

time: 1.85 ms


In [5]:
# the data file contains a header line that needs to be skipped
df = pd.read_csv(file_1,  names=cols, dtype=dtypes, skiprows=1)

time: 10.1 s
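
To see how `names`, `dtype`, and `skiprows` work together, here is a minimal sketch on a tiny in-memory CSV. The two data rows are sample netflow records in the same layout as the lab data. The header names in the sample text (`Date`, `Src`, and so on) are made up for illustration, since the real header line of netflow1.csv is not shown in this lab.

```python
import io
import pandas as pd

# Hypothetical header line; the real header of netflow1.csv is not shown in this lab.
sample = io.StringIO(
    "Date,Src,Dst,SPort,DPort,SBytes,DBytes\n"
    "2013-04-01 07:50:16,172.20.0.3,172.255.255.255,137,137,1104,0\n"
    "2013-04-01 08:05:00,172.10.1.17,10.0.0.13,5040,80,454,633\n"
)

cols = ["strdate", "srcip", "dstip", "srcport", "dstport", "srcbytes", "dstbytes"]
dtypes = {"strdate": str, "srcip": str, "dstip": str,
          "srcport": int, "dstport": int, "srcbytes": int, "dstbytes": int}

# skiprows=1 discards the file's own header line, so names= supplies the column labels
small = pd.read_csv(sample, names=cols, dtype=dtypes, skiprows=1)
print(small.dtypes)
```

Without `skiprows=1`, the header text would be read as a data row and the integer conversion requested in `dtype` would fail.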


In [6]:
df.dtypes

strdate     object
srcip       object
dstip       object
srcport      int64
dstport      int64
srcbytes     int64
dstbytes     int64
dtype: object

time: 14.5 ms


<br>
### Creating a GPU Dataframe


In [7]:
import pygdf

time: 331 ms


In [8]:
gdf = pygdf.DataFrame.from_pandas(df)

time: 915 ms


In [9]:
gdf.dtypes

strdate     object
srcip       object
dstip       object
srcport      int64
dstport      int64
srcbytes     int64
dstbytes     int64
dtype: object

time: 5.55 ms


<br>
That's it - the data is now loaded and available for accelerated Data Analysis on the GPU
<p>
*Remember that df = CPU and gdf = GPU, since we will switch back and forth for performance comparisons*<p>
***Data Loading Runtime*** = pd.read_csv + DataFrame.from_pandas
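
With the timings recorded above, that sum is roughly 10.1 s + 0.915 s, or about 11 s. The sketch below shows one way to collect such per-stage timings in code rather than reading them off each cell. The `timed` helper is hypothetical, and `frame.copy()` is only a CPU stand-in for the `DataFrame.from_pandas` transfer so that the example runs without a GPU.

```python
import io
import time
import pandas as pd

def timed(label, fn, timings):
    """Run fn(), store its wall-clock duration under label, and return its result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result

timings = {}
source = io.StringIO("a,b\n1,2\n3,4\n")

frame = timed("read_csv", lambda: pd.read_csv(source), timings)
# Stand-in for the GPU transfer step; replace with the real conversion call on a GPU machine
moved = timed("from_pandas", lambda: frame.copy(), timings)

total = sum(timings.values())
print(f"data loading runtime: {total:.6f} s")
```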

***Tasks***:  Run the code block to display the first 5 rows of the GDF

In [10]:
# what does the data look like
gdf.head()

              strdate        srcip           dstip srcport dstport srcbytes dstbytes
0 2013-04-01 07:50:16   172.20.0.3 172.255.255.255     137     137     1104        0
1 2013-04-01 07:50:21  172.10.0.40 172.255.255.255     137     137      184        0
2 2013-04-01 08:05:00  172.10.1.17       10.0.0.13    5040      80      454      633
3 2013-04-01 08:05:01   172.10.1.9       10.1.0.77    5067      80      454      633
4 2013-04-01 08:05:01 172.30.2.114        10.0.0.9    8749      80      453      632

time: 21.6 ms


### Transitioning GDF back to Pandas

In [11]:
# just as easy
df2 = gdf.to_pandas()

time: 1.15 s


In [12]:
df2.dtypes

strdate     object
srcip       object
dstip       object
srcport      int64
dstport      int64
srcbytes     int64
dstbytes     int64
dtype: object

time: 5.41 ms


In [14]:
# clean up since we don't need df2; a NameError here means df2 was already deleted (for example, the cell was run twice)
del df2

NameError: name 'df2' is not defined

time: 98.6 ms


<br>
## Column Functionals and Transformations
One of the basic GDF operations is the column transform. To do that we use built-in arithmetic operations on each column.

***Note:*** The following functions operate against a GDF column, not against the full dataset.

<br>
The following operations will all be against the dstbytes column
<br>

In [15]:
# How many 'Non-NULL' records are in the dataset?
gdf['dstbytes'].count()

17296828

time: 3.83 ms


In [16]:
# Same function on CPU
df['dstbytes'].count()

17296828

time: 51.6 ms


<br>
***Tasks*** Find the largest amount of data sent to a destination

<details><summary>Click for Answer</summary>
<p>
gdf['dstbytes'].max()
</p>
</details>

In [17]:
#remove
gdf['dstbytes'].max() 

7898550

time: 5.45 ms


***Tasks*** What about the Min and Average?

<details><summary>Click for Answer on Min and Average</summary>
<p>
gdf['dstbytes'].min()
gdf['dstbytes'].mean()
</p>
</details>

In [18]:
# remove
gdf['dstbytes'].min()

0

time: 3.61 ms


In [19]:
# remove
gdf['dstbytes'].mean()

1531.7509521977092

time: 5.48 ms


#### Changing Data Types

In [20]:
# Since the largest dstbytes value fits in an int32, let's convert the data type
import numpy as np
gdf['dstbytes'] = gdf['dstbytes'].astype(np.int32)

time: 122 ms
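
The choice of int32 can be checked against the value ranges that NumPy reports for each integer type. The sketch below is one way to do that; `smallest_signed_int` is a hypothetical helper written for this example, and it only considers signed types.

```python
import numpy as np

def smallest_signed_int(min_value, max_value):
    """Return the narrowest signed NumPy integer type that holds [min_value, max_value]."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= min_value and max_value <= info.max:
            return candidate
    raise ValueError("range does not fit in a 64-bit signed integer")

# Minimum and maximum reported for dstbytes by the min() and max() cells above
print(smallest_signed_int(0, 7898550).__name__)  # prints "int32"
```

The maximum of 7,898,550 is above the int16 limit of 32,767 and well below the int32 limit of 2,147,483,647.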




In [21]:
gdf.dtypes

strdate     object
srcip       object
dstip       object
srcport      int64
dstport      int64
srcbytes     int64
dstbytes     int32
dtype: object

time: 5.73 ms


<br>
***Task*** 
* determine the smallest data type for SRC and DST Port fields
* convert those fields
<br>
<br>
<details><summary>Click for answer</summary>
<p>
gdf['srcport'].max() is 65,534, which exceeds the int16 maximum of 32,767, so int32 is the smallest signed integer type that fits; the same holds for dstport
gdf['srcport'] = gdf['srcport'].astype(np.int32)
gdf['dstport'] = gdf['dstport'].astype(np.int32)
</p>
</details>



In [22]:
#remove 
gdf['srcport'].max()

65534

time: 9.32 ms


In [23]:
#remove
gdf['srcport'] = gdf['srcport'].astype(np.int32)
gdf['dstport'] = gdf['dstport'].astype(np.int32)

time: 13.2 ms




In [24]:
# validate by looking at the dtypes.
gdf.dtypes

strdate     object
srcip       object
dstip       object
srcport      int32
dstport      int32
srcbytes     int64
dstbytes     int32
dtype: object

time: 5.43 ms


***Note*** The data type transforms could have been done during data loading

***Question*** Why do we care about using smaller data types?
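
One part of the answer is memory footprint, which can be estimated with simple arithmetic. The sketch below uses the row count returned by `count()` earlier in this lab and the item sizes that NumPy reports; it covers only the raw column values and ignores any overhead a dataframe library adds.

```python
import numpy as np

rows = 17_296_828  # value returned by gdf['dstbytes'].count() above

footprint = {}
for dtype in (np.int64, np.int32):
    name = np.dtype(dtype).name
    # itemsize is the number of bytes used to store one value of this type
    footprint[name] = rows * np.dtype(dtype).itemsize / 1e6
    print(f"{name}: {footprint[name]:.1f} MB per column")
```

Halving the width of a column halves the bytes that have to be stored and moved, which tends to matter more on a GPU, where device memory is usually smaller than host memory.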

<br>
### Transformations
***Tasks*** Create a new GDF column called ***totalbytes*** that is the sum of src and dst bytes

In [26]:
gdf['totalbytes'] = gdf['srcbytes'] + gdf['dstbytes']

time: 99.2 ms




In [27]:
# Verify that a new column was created
gdf.dtypes

strdate       object
srcip         object
dstip         object
srcport        int32
dstport        int32
srcbytes       int64
dstbytes       int32
totalbytes     int64
dtype: object

time: 5.96 ms
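
Note that totalbytes is int64 even though dstbytes was narrowed to int32. The sketch below reproduces that on the CPU with pandas and a few toy values; it illustrates pandas and NumPy type promotion, which the GDF output above appears to mirror.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "srcbytes": np.array([1104, 184, 454], dtype=np.int64),
    "dstbytes": np.array([0, 0, 633], dtype=np.int32),
})

# Adding an int64 column to an int32 column promotes the result to the wider type
toy["totalbytes"] = toy["srcbytes"] + toy["dstbytes"]
print(toy.dtypes)
```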


In [28]:
# What is the max byte size
gdf['totalbytes'].max()

8094046

time: 5.51 ms


#### Let's try that same function on the CPU

In [29]:
# Time the same operation on the CPU
df['totalbytes'] = df['srcbytes'] + df['dstbytes']

time: 189 ms


In [30]:
df['totalbytes'].max()

8094046

time: 135 ms


This performance gain comes from a small dataset and a simple transformation.  As data size and analytic complexity increase, so does the performance gap between CPU and GPU.

### Dropping Columns

***Note*** Current GOAI Beta version does not support Strings, so we can drop the date and IP fields<br>
(roadmap section later will highlight Strings)

In [32]:
# Let's drop a column
gdf.drop_column('strdate')

time: 3.77 ms


In [33]:
gdf.dtypes

srcip         object
dstip         object
srcport        int32
dstport        int32
srcbytes       int64
dstbytes       int32
totalbytes     int64
dtype: object

time: 12.6 ms


<br>
***Task*** Drop the srcip and dstip columns
<br>
<br>
<details><summary>Click for answer</summary>
<p>
gdf.drop_column('srcip')
gdf.drop_column('dstip')
</p>
</details>

In [34]:
gdf.drop_column('srcip')
gdf.drop_column('dstip')

time: 2.52 ms


<br>
### Filtering 
Filtering is done with the query() function that takes an expression string of column names. 

In [35]:
# Let's get a count for reference
gdf['srcport'].count()

17296828

time: 3.71 ms


In [36]:
# Extract a new DataFrame where the DST port is not 80 (despite its name, port_80 holds the non-port-80 records)
port_80 = gdf.query('dstport != 80')



time: 1.17 s




In [37]:
port_80['srcport'].count()

831256

time: 3.26 ms


In [38]:
port_80.dtypes

srcport       int32
dstport       int32
srcbytes      int64
dstbytes      int32
totalbytes    int64
dtype: object

time: 5.65 ms


In [39]:
port_80.head()

  srcport dstport srcbytes dstbytes totalbytes
0     137     137     1104        0       1104
1     137     137      184        0        184
1105     138     138      243        0        243
1283    4156      25     1901     1123       3024
1908    3754      25      422      457        879

time: 35.6 ms
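
The index values in the output above (0, 1, 1105, ...) are the original row positions of the records that passed the filter. The sketch below shows the same behaviour with `query()` in pandas on a few toy rows, as a CPU stand-in for the GDF call, and compares it with an equivalent boolean mask.

```python
import pandas as pd

toy = pd.DataFrame({
    "srcport": [137, 5040, 5067, 4156],
    "dstport": [137, 80, 80, 25],
})

# Keep only rows whose destination port is not 80
not_port_80 = toy.query("dstport != 80")
print(not_port_80.index.tolist())  # prints [0, 3]

# The same filter written as a boolean mask
mask_version = toy[toy["dstport"] != 80]
```

Rows 1 and 2 are dropped, and the surviving rows keep their original labels 0 and 3 rather than being renumbered.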


***Task:*** Drop the port_80 dataframe 

In [45]:
#remove
del port_80

time: 1.76 ms


<p>
### Grouping and Aggregations

In [40]:
from collections import OrderedDict

time: 1.45 ms


In [41]:
gdf.dtypes

srcport       int32
dstport       int32
srcbytes      int64
dstbytes      int32
totalbytes    int64
dtype: object

time: 5.73 ms


In [46]:
# add a column for count
gdf['count'] = gdf['dstport']
gdf.dtypes

srcport       int32
dstport       int32
srcbytes      int64
dstbytes      int32
totalbytes    int64
count         int32
dtype: object

time: 8.43 ms


In [47]:
aggs = OrderedDict()
aggs['count'] = 'count'

stats = gdf.groupby(['dstport']).agg(aggs)



time: 15.1 s


In [48]:
stats.dtypes

dstport      int32
count      float64
dtype: object

time: 7.92 ms
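
The helper column named count only gives the aggregation something to count; its values do not matter, because a count aggregation tallies the non-null entries in each group. The sketch below shows the same pattern in pandas on toy data as a CPU stand-in. One difference to expect: pandas reports the counts as integers, while the GDF output above shows float64.

```python
import pandas as pd

toy = pd.DataFrame({"dstport": [137, 80, 80, 25, 80]})

# Placeholder column to aggregate; only the number of entries per group is used
toy["count"] = toy["dstport"]

stats = toy.groupby("dstport", as_index=False).agg({"count": "count"})
print(stats)
```

In pandas, `toy.groupby("dstport").size()` gives the same per-port totals without the placeholder column.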


In [None]:
stats.head(25)

In [None]:
del stats

***Tasks:*** What is the count of each SRC - DST port grouping?
<br>
<br>
<details><summary>Click for answer</summary>
<p>
aggs = OrderedDict()
aggs['count'] = 'count'

stats = gdf.groupby(['srcport','dstport']).agg(aggs)
</p>
</details> 

In [49]:
aggs = OrderedDict()
aggs['count'] = 'count'

stats = gdf.groupby(['srcport','dstport']).agg(aggs)



time: 47.8 s


In [50]:
stats.head(25)

   srcport dstport   count
 0       0       0 86927.0
 1      20    3098     1.0
 2      20    3099     1.0
 3      20    3100     1.0
 4      20    3103     1.0
 5      20    3544     1.0
 6      20    3545     1.0
 7      20    3546     1.0
 8      20    3547     1.0
 9      20    3548     1.0
10      20    3549     1.0
11      20    3552     1.0
12      20    3553     1.0
13      20    8403     1.0
14      20    8404     1.0
15      20    8405     1.0
16      20    8406     1.0
17      20    8407     1.0
18      20    8408     1.0
19      21    1048     1.0
20      21    1074     1.0
21      21    1084     1.0
22      21    1087     1.0
23      21    1137     1.0
24      21    1139     1.0

time: 16.2 ms


## Independent Task

***TASK***
- Normalize the total byte count to KB
- Compute the mean and standard deviation
- Compute the Z-score

Answer below

In [None]:
gdf.dtypes

In [None]:
# Row-wise kernel: divide each totalbytes value by Scale and write it to KB
def ScaleData(totalbytes, KB, Scale):
    for i in range(totalbytes.size):
        KB[i] = totalbytes[i] / Scale

In [None]:
gdf = gdf.apply_rows(ScaleData, incols=['totalbytes'], outcols=dict(KB=np.float32), kwargs=dict(Scale=(1024)))

In [None]:
gdf.head(25)

In [None]:
mean = gdf['KB'].mean()

In [None]:
std = gdf['KB'].std()

In [None]:
print("Std == %f \t Mean == %f" % (std, mean))

In [None]:
# Row-wise kernel: Z-score of each KB value, given the mean m and standard deviation s
def NormalizeData(KB, Z, s, m):
    for i in range(KB.size):
        Z[i] = (KB[i] - m) / s


In [None]:
gdf = gdf.apply_rows(NormalizeData, incols=['KB'], outcols=dict(Z=np.float32), kwargs=dict(s=std, m=mean ))

In [None]:
gdf.head(30)
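
As a CPU cross-check of the steps above, the same pipeline can be written with NumPy on a handful of values: scale to KB, compute the mean and standard deviation, then form z = (KB - mean) / std. The input values are the totalbytes figures shown in the filtering output earlier. The use of `ddof=1` (sample standard deviation) is an assumption; check which convention the GDF `std()` uses before comparing numbers.

```python
import numpy as np

# totalbytes values taken from the port_80.head() output earlier in this lab
totalbytes = np.array([1104, 184, 243, 3024, 879], dtype=np.int64)

kb = totalbytes / 1024       # normalize bytes to KB
mean = kb.mean()
std = kb.std(ddof=1)         # assumption: sample standard deviation
z = (kb - mean) / std        # Z-score of each record

print("Std == %f \t Mean == %f" % (std, mean))
print(z)
```

By construction the Z column has a mean of about 0 and a standard deviation of about 1, which makes a quick sanity check for the GPU result as well.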