# Lab n: Classification

In [2]:
## Notebook Settings
# Add autotime of each block
!pip install ipython-autotime
%load_ext autotime



### Goals:
- Learn the basics of cyber network data with respect to consumer IoT devices
- Load network data into Pandas and a GDF (for comparison)
- Explore network data and features
- Use XGBoost to build a classification model
- Test the model

This lab builds on the previous labs and will utilize some of those skills.

### Background

#### The Internet of Things and Data at a Massive Scale

Gartner estimates that there are currently over 8.4 billion Internet of Things (IoT) devices. By 2020, that number is [estimated to surpass 20 billion](https://www.zdnet.com/article/iot-devices-will-outnumber-the-worlds-population-this-year-for-the-first-time/). These devices range from consumer devices (e.g., Amazon Echo, smart TVs, smart cameras, doorbells) to commercial devices (e.g., building automation systems, keycard entry). All of these devices exhibit behavior on the Internet as they communicate with their own clouds and user-specified integrations.

#### Types of Network Data

The most detailed type of data that is typically collected on a network is full Packet CAPture (PCAP) data. This information is detailed and contains everything about the communication, including: source address, destination address, protocols used, bytes transferred, and even the raw data (e.g., image, audio file, executable). PCAP data is fine-grained, meaning that there is a record for each frame being transmitted. A typical communication is composed of many individual packets/frames.

If we aggregate PCAP data so that there is one row of data per communication session, we call the result flow-level data. A simplified example of this relationship is shown in the figure below.

![PCAP_flow_relationship](pcap_vs_flow.png)
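
The packet-to-flow aggregation described above can be sketched with a pandas `groupby`. This is a simplified illustration, not how the lab's flow data was produced: the packet records and column names below are made up, and the sketch groups packets in one direction only, whereas real flow tools typically merge both directions of a session into a single record.

```python
import pandas as pd

# Hypothetical packet-level records (one row per packet); all values are invented.
packets = pd.DataFrame({
    "ts":       [0.00, 0.01, 0.02, 0.05, 1.00, 1.20],
    "src_ip":   ["192.168.1.10"] * 4 + ["192.168.1.20"] * 2,
    "src_port": [50000, 50000, 50000, 50000, 40000, 40000],
    "dst_ip":   ["52.0.0.1"] * 4 + ["192.168.1.1"] * 2,
    "dst_port": [443, 443, 443, 443, 53, 53],
    "proto":    ["tcp"] * 4 + ["udp"] * 2,
    "length":   [60, 1500, 1500, 52, 74, 90],
})

# Packets that share this 5-tuple are treated as one flow.
flow_key = ["src_ip", "src_port", "dst_ip", "dst_port", "proto"]

flows = packets.groupby(flow_key, as_index=False).agg(
    start=("ts", "min"),       # time of first packet in the flow
    end=("ts", "max"),         # time of last packet in the flow
    pkts=("length", "count"),  # number of packets
    bytes=("length", "sum"),   # total bytes across packets
)
flows["duration"] = flows["end"] - flows["start"]
print(flows)
```

Six packet rows collapse into two flow rows, each summarizing one session with packet counts, byte counts, and a duration.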

For this tutorial, we use data from the University of New South Wales. In a lab environment, researchers there [collected nearly three weeks of IoT data from 21 IoT devices](http://149.171.189.1). They also kept a detailed [list of devices by MAC address](http://149.171.189.1/resources/List_Of_Devices.txt), which gives us ground truth for each IoT device's behavior on the network.

**Our goal is to utilize the behavior exhibited in the network data to classify IoT devices.**

### Data Investigation

Let's first look at some of the data. We'll load a PCAP file using PyShark (a Python wrapper for TShark).

In [4]:
import pyshark
cap = pyshark.FileCapture("/cwshare/unsw_iot/16-09-27.pcap")

time: 1.94 ms


In [5]:
print(cap[0])

Packet (Length: 156)
Layer ETH:
	Destination: 14:cc:20:51:33:ea
	.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
	Source: 30:8c:fb:2f:e4:b2
	.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
	Type: IPv4 (0x0800)
	Address: 14:cc:20:51:33:ea
	.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
	.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
	Address: 30:8c:fb:2f:e4:b2
Layer IP:
	Source GeoIP: Unknown
	0000 00.. = Differentiated Services Codepoint: Default (0)
	Protocol: TCP (6)
	Destination GeoIP Country: United States
	Total Length: 142
	Destination: 52.87.241.159
	Header checksum status: Unverified
	Fragment offset: 0
	Destination GeoIP City: Ashburn, VA
	Destination GeoIP Latitude: 39.033501
	.... 0101 = Header Length: 20 bytes (5)
	..0. .... = More fragments: Not set
	Header checksum: 0x7fc2 [validation disabled]
	Destination GeoIP AS Number: AS14618 Amazon.com, Inc.
	Source: 192.

That's a lot of features! In addition to having multiple layers (which may differ between packets), PCAP data poses several other problems when used directly. The payload is often encrypted (note the SSL layer in the example above), which makes it useless for analysis. The lack of aggregation also makes it difficult to differentiate between packets. What we really care about for this application is what a *session* looks like; for example, how a Roku interacts with the network is likely quite different from how a Google Home does.

To save time in the tutorial, all three weeks of PCAP data have already been transformed to flow data, which we can load into a regular pandas DataFrame. Because of how the data was created, the file has a header row (with column names) as well as a footer row. We want to use the header but skip the footer.

In [6]:
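import pandas as pd
# skipfooter is only supported by the Python parser engine, so request it explicitly
pdf = pd.read_csv("/cwshare/unsw_iot/bro/conn.log", sep='\t', skipfooter=1, engine='python')
print("==> pdf shape: ", pdf.shape)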

  


==> pdf shape:  (950384, 23)
time: 12.1 s


Let's look at the aggregated data to get a better sense of the columns and their data types.

In [7]:
pdf.head()

| | ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | service | duration | orig_bytes | ... | local_resp | missed_bytes | history | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes | tunnel_parents | orig_l2_addr | resp_l2_addr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1474553000.0 | CIlOTU4kRBDOEJ2zf | 192.168.1.241 | 61725 | 192.168.1.1 | 53 | udp | dns | - | - | ... | - | 0 | Dc | 1 | 74 | 0 | 0 | - | 70:ee:50:18:34:43 | 14:cc:20:51:33:ea |
| 1 | 1474553000.0 | CmV0US1aCCPzrVRz36 | 192.168.1.193 | 4425 | 192.168.1.223 | 49153 | tcp | http | 0.008820 | 196 | ... | - | 0 | ShADadfF | 5 | 464 | 5 | 461 | - | ec:1a:59:83:28:11 | ec:1a:59:79:f4:89 |
| 2 | 1474553000.0 | CEXvDL2UPDYnBDtd6h | 192.168.1.193 | 4426 | 192.168.1.223 | 49153 | tcp | http | 0.008664 | 198 | ... | - | 0 | ShADadfF | 5 | 466 | 5 | 461 | - | ec:1a:59:83:28:11 | ec:1a:59:79:f4:89 |
| 3 | 1474553000.0 | CEXDAD42Irgl4M5go8 | 192.168.1.193 | 4977 | 192.168.1.249 | 49152 | tcp | http | 0.020995 | 186 | ... | - | 0 | ShADadfF | 5 | 454 | 5 | 1438 | - | ec:1a:59:83:28:11 | 00:16:6c:ab:6b:88 |
| 4 | 1474553000.0 | CW1YbA2fZzHrztJ0rl | 192.168.1.193 | 4978 | 192.168.1.249 | 49152 | tcp | http | 0.018730 | 186 | ... | - | 0 | ShADadfF | 5 | 454 | 5 | 1438 | - | ec:1a:59:83:28:11 | 00:16:6c:ab:6b:88 |


time: 39.2 ms


In [8]:
pdf.dtypes

ts                float64
uid                object
id.orig_h          object
id.orig_p           int64
id.resp_h          object
id.resp_p           int64
proto              object
service            object
duration           object
orig_bytes         object
resp_bytes         object
conn_state         object
local_orig         object
local_resp         object
missed_bytes        int64
history            object
orig_pkts           int64
orig_ip_bytes       int64
resp_pkts           int64
resp_ip_bytes       int64
tunnel_parents     object
orig_l2_addr       object
resp_l2_addr       object
dtype: object

time: 3.95 ms
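
Note that `duration`, `orig_bytes`, and `resp_bytes` come back as `object` rather than numeric types. The preview above shows why: missing values are written as `-`, so pandas keeps those columns as strings. A minimal sketch of one way to convert them, using a tiny stand-in frame (the name `sample` is hypothetical; the values are copied from the first rows of the preview):

```python
import pandas as pd

# Small stand-in for two of the conn.log columns, with "-" marking missing values.
sample = pd.DataFrame({
    "duration":   ["-", "0.008820", "0.008664"],
    "orig_bytes": ["-", "196", "198"],
})

for col in ["duration", "orig_bytes"]:
    # errors="coerce" turns anything unparseable (here "-") into NaN
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
print(sample)
```

An alternative, under the same assumption that `-` always means "missing", would be to pass `na_values='-'` to `pd.read_csv` so the conversion happens at load time.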


In [15]:
# maybe we rename the columns

time: 507 µs
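
The cell above only notes that the columns might be renamed. One possible approach, shown here as a sketch rather than as the lab's chosen convention, is to replace the dots in names such as `id.orig_h` with underscores so the columns can be used as attributes and in query strings. The frame `demo` is a hypothetical stand-in holding a few of the real column names.

```python
import pandas as pd

# Empty frame with a few of the column names shown above, just to demonstrate the rename.
demo = pd.DataFrame(columns=["ts", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p"])

# Apply the same transformation to every column name: dots become underscores.
demo = demo.rename(columns=lambda name: name.replace(".", "_"))
print(list(demo.columns))
```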


In [None]:
# convert to GDF

In [23]:
import pygdf

time: 1.11 ms


In [24]:
gdf = pygdf.DataFrame.from_pandas(pdf)

time: 250 ms


In [21]:
gdf.dtypes

ts                float64
uid                object
id.orig_h          object
id.orig_p           int64
id.resp_h          object
id.resp_p           int64
proto              object
service            object
duration           object
orig_bytes         object
resp_bytes         object
conn_state         object
local_orig         object
local_resp         object
missed_bytes        int64
history            object
orig_pkts           int64
orig_ip_bytes       int64
resp_pkts           int64
resp_ip_bytes       int64
tunnel_parents     object
orig_l2_addr       object
resp_l2_addr       object
dtype: object

time: 5.14 ms


In [29]:
labels_pdf = pd.read_csv("/cyshare/KDD2018/lab_mac_labels.tab", sep='\t')

time: 4.7 ms


In [34]:
labels_pdf.head()

| | Device | MAC | Connection |
|---|---|---|---|
| 0 | Smart Things | d0:52:a8:00:67:5e | Wired |
| 1 | Amazon Echo | 44:65:0d:56:cc:d3 | Wireless |
| 2 | Netatmo Welcome | 70:ee:50:18:34:43 | Wireless |
| 3 | TP-Link Day Night Cloud camera | f4:f2:6d:93:51:f1 | Wireless |
| 4 | Samsung SmartCam | 00:16:6c:ab:6b:88 | Wireless |


time: 9.93 ms


In [43]:
labels_gdf = pygdf.DataFrame.from_pandas(labels_pdf)

time: 9.27 ms


In [48]:
labels_gdf.set_index('MAC')

KeyError: <class 'numpy.object_'>

time: 22 ms
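
The `KeyError` above names `numpy.object_`, which suggests that this version of PyGDF cannot index on a string column. One possible workaround (an assumption, not something this lab confirms) is to encode each MAC address as a 48-bit integer in pandas before moving the data to the GPU, so that the key column has a numeric dtype. The helper `mac_to_int` and the column `mac_int` are hypothetical names introduced for this sketch.

```python
import pandas as pd

def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' to its 48-bit integer value."""
    return int(mac.replace(":", ""), 16)

# Two rows copied from the labels table above.
labels = pd.DataFrame({
    "Device": ["Smart Things", "Amazon Echo"],
    "MAC": ["d0:52:a8:00:67:5e", "44:65:0d:56:cc:d3"],
})

# A 48-bit value always fits in a signed 64-bit integer.
labels["mac_int"] = labels["MAC"].map(mac_to_int).astype("int64")
print(labels[["Device", "mac_int"]])
```

Applying the same encoding to `orig_l2_addr` in the flow data would give both frames a matching integer key.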


In [40]:
labels_pdf.set_index('MAC')

| MAC | Device | Connection |
|---|---|---|
| d0:52:a8:00:67:5e | Smart Things | Wired |
| 44:65:0d:56:cc:d3 | Amazon Echo | Wireless |
| 70:ee:50:18:34:43 | Netatmo Welcome | Wireless |
| f4:f2:6d:93:51:f1 | TP-Link Day Night Cloud camera | Wireless |
| 00:16:6c:ab:6b:88 | Samsung SmartCam | Wireless |
| 30:8c:fb:2f:e4:b2 | Dropcam | Wireless |
| 00:62:6e:51:27:2e | Insteon Camera | Wired |
| e8:ab:fa:19:de:4f | unknown device 1 | Wireless |
| 00:24:e4:11:18:a8 | Withings Smart Baby Monitor | Wired |
| ec:1a:59:79:f4:89 | Belkin Wemo switch | Wireless |


time: 9.41 ms


In [None]:
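# PyGDF's set_index failed on the string MAC column above, so join the labels in pandas instead
labeled_pdf = pdf.set_index('orig_l2_addr').join(labels_pdf.set_index('MAC'))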