# Probability

**Start downloading this file and place it in example_pcaps: https://drive.google.com/file/d/1Lr1dleCbZcQWfHoW_u6Q2uZFte17Y2Z_/view?usp=sharing. You'll need it later!**

## Introduction

Today, we're going to explore probability. Probability is a powerful tool that lets us answer interesting questions about our data, and it serves as the foundation of a commonly used machine learning technique for classification. We'll also build a Naïve Bayes classifier from scratch, so you'll get hands-on experience coding a machine learning classifier!

Let's start with some simple probability examples on the board. Let's see how much you can recall from lecture!

Say I have a bucket with 10 blue balls and 20 red balls. If I choose a ball at random from the bucket, what is the probability that I choose a red ball? That is, we want to calculate:

$P(\text{red ball})\ =\ ??$

This is equal to the number of red balls divided by the total number of balls.

$P(\text{red ball})\ =\ \frac{\text{# of red balls}}{\text{# of total balls}}\ =\ \frac{20}{30}\ =\ \frac{2}{3}$

Similarly, the chance of picking a blue ball is:

$P(\text{blue ball})\ =\ \frac{\text{# of blue balls}}{\text{# of total balls}}\ =\ \frac{10}{30}\ =\ \frac{1}{3}$

Now, let's say we want to find the probability of picking a red ball out of the bucket, **AND THEN** picking a blue ball out of the bucket. When we want to find the probability of two events both occurring, we multiply the probability of the first event by the probability of the second event given that the first has occurred. The resulting probability is:

$P(\text{red ball})*P(\text{blue ball}\ |\ \text{red ball missing})$

Here, we introduce the concept of conditional probability. $P(\text{blue ball}\ |\ \text{red ball missing})$ represents the probability that a blue ball is pulled from the bucket, **given** that a red ball has already been taken out.

Are these two events independent? Does pulling a red ball first affect the probability of then pulling a blue ball? If it had no effect, the overall probability would be equivalent to:

$P(\text{red ball})*P(\text{blue ball})$

But it's not! By removing a red ball, there are now fewer overall balls to choose from, which changes the resulting probability. The full probability is therefore calculated as:

$P(\text{red ball})*P(\text{blue ball}\ |\ \text{red ball missing})\ =\ \frac{20}{30}*\frac{10}{29}\ =\ \frac{20}{87}$
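
The arithmetic above can be checked with exact fractions. This is a minimal sketch for review purposes; the variable names are illustrative and are not part of the lab code.

```python
from fractions import Fraction

red, blue = 20, 10
total = red + blue

p_red = Fraction(red, total)                   # 20/30 = 2/3
p_blue_given_red = Fraction(blue, total - 1)   # one red ball is gone: 10/29
p_red_then_blue = p_red * p_blue_given_red     # 20/87

# What we would get if the two draws were (wrongly) treated as independent
p_if_independent = p_red * Fraction(blue, total)   # 2/9

print(p_red_then_blue, p_if_independent)
```

The two results differ (20/87 versus 2/9), which is another way of seeing that the draws are not independent.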

Now that you've had a chance to review, let's dive into the data.

## Basic Probability Analysis on Network Traffic

### Probability of a TCP Packet

Let's compute the probability that a packet from our capture was a TCP packet:

$P(\text{TCP Packet})\ =\ \frac{\text{# of TCP packets}}{\text{# of total packets}}$

We'll start by loading some captured data into Python and selecting the packets whose protocol is TCP. You'll need to fill in the blanks with the correct information. For tcp_packets, there are three options for each blank: "data", "protocol", or "TCP". Consult yesterday's lab if you need to!

In [2]:
from data_collection.parse_pcap import pcap_to_pandas
from utils import *
import pandas as pd
import numpy as np
from datetime import datetime, timezone
from sklearn.preprocessing import LabelEncoder
from math import log

In [3]:
data = pcap_to_pandas('example_pcaps/ross.pcap') # call our helper "pcap_to_pandas" function, and pass in the path to the capture file

In [8]:
num_total_packets = len(data) # number of total packets
data.head(n=5)

print(data.shape)

(189085, 18)


In [7]:
tcp_packets = data.loc[data['protocol'] == 'TCP'] # packets with the protocol column equal to "TCP"
udp_packets = data.loc[data['protocol'] == 'UDP']
tcp_packets.head(n=5)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
2,2018-07-30 14:51:41.370868,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,82,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700302
3,2018-07-30 14:51:41.370965,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700399
4,2018-07-30 14:51:41.370966,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.7004
5,2018-07-30 14:51:41.370966,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.7004
6,2018-07-30 14:51:41.370967,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700401


#### TCP Analysis

In [6]:
# len gives the number of packets in some data
# number of TCP packets
num_tcp_packets = len(tcp_packets) 
print(tcp_packets.shape)

tcp_probability = num_tcp_packets / num_total_packets # probability that a packet is a TCP packet

print(tcp_probability)

(144264, 18)
0.7629584578364228


#### UDP Analysis

In [10]:
# len gives the number of packets in some data
# number of UDP packets
num_udp_packets = len(udp_packets) 
print(udp_packets.shape)

udp_probability = num_udp_packets / num_total_packets # probability that a packet is a UDP packet

print(udp_probability)

(44785, 18)
0.23685115159848746
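
The TCP and UDP probabilities above add up to slightly less than 1, because 36 of the 189085 packets (189085 - 144264 - 44785) use some other protocol. One way to get the probability of every protocol in a single pass is to tally the values. The sketch below uses a made-up list standing in for the `protocol` column, so it runs without the capture file.

```python
from collections import Counter

# Toy stand-in for data['protocol']; these values are made up for illustration.
protocols = ["TCP", "TCP", "UDP", "TCP", "ICMP", "UDP", "TCP", "TCP"]

counts = Counter(protocols)
probabilities = {proto: n / len(protocols) for proto, n in counts.items()}

print(probabilities)
```

Because every packet has exactly one protocol, the probabilities in the dictionary sum to 1.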


### Conditional Probability of a DNS Packet, Given Port 53

Now, let's compute the probability that a packet is a DNS packet, given that the source port or destination port is 53. A DNS packet is a DNS query **OR** a DNS response. 

We are calculating:

$P(\text{DNS Query} \cup \text{DNS Response}\ |\ \text{Source Port == 53} \cup \text{Dst Port == 53})$

The $\cup$ means "union": the event occurs if either of its parts occurs.

The probability can be calculated as:

$P(\text{DNS Query} \cup \text{DNS Response}\ |\ \text{Source Port == 53} \cup \text{Dst Port == 53})\ =\ \frac{\text{# of packets with a DNS query or DNS response field and a SRC or DST port of 53}}{\text{# of packets with a SRC or DST port of 53}}$

Because this is a conditional probability, we divide by only the number of packets that satisfy the condition (SRC or DST port equal to 53), rather than by the total number of packets.
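
To make the role of the denominator concrete, here is a small sketch on made-up packets. The tuple layout `(src_port, dst_port, has_dns_field)` is an assumption chosen for this example and is not the format of the lab's data frame.

```python
# Each toy packet is (src_port, dst_port, has_dns_field); values are made up.
packets = [
    (53, 40000, True),
    (40001, 53, True),
    (443, 40002, False),
    (53, 40003, False),   # port 53, but no DNS field was parsed
    (40004, 80, False),
]

# The condition: source or destination port is 53
on_port_53 = [p for p in packets if p[0] == 53 or p[1] == 53]
# The event, counted only among packets that satisfy the condition
dns_on_port_53 = [p for p in on_port_53 if p[2]]

p_dns_given_53 = len(dns_on_port_53) / len(on_port_53)       # 2 / 3
p_dns_overall = sum(p[2] for p in packets) / len(packets)    # 2 / 5

print(p_dns_given_53, p_dns_overall)
```

Conditioning on port 53 shrinks the denominator from 5 packets to 3, which is why the conditional probability is higher than the overall one in this toy data.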

For dns_queries and dns_responses, there are three options for each blank: "data", "dns_query", or "dns_resp". 

In [19]:
# packets with a DNS query column that isn't None
# (notnull() alone is not enough: responses also carry the query name, so keep only packets sent to port 53)
dns_queries = data.loc[data["dns_query"].notnull()]
dns_queries = dns_queries.loc[dns_queries["port_dst"] == 53]

# packets with a DNS response column that isn't None
# (this filter was also unreliable on its own, so keep only packets sent from port 53)
dns_responses = data.loc[data["dns_resp"].notnull()]
dns_responses = dns_responses.loc[dns_responses["port_src"] == 53]

dns_responses.head(n=10)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
1202,2018-07-30 14:51:52.314586,b'forums.ffshrine.org.',b'forums.ffshrine.org.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,161,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,29421.0,53.0,UDP,1532980000.0,11.64402
1388,2018-07-30 14:51:52.641743,b'img.ffshrine.org.',b'img.ffshrine.org.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,160,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,56591.0,53.0,UDP,1532980000.0,11.971177
3238,2018-07-30 14:51:59.080689,b'imgur.com.',b'imgur.com.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,238,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,22619.0,53.0,UDP,1532980000.0,18.410123
3323,2018-07-30 14:51:59.121489,b'i.imgur.com.',b'i.imgur.com.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,264,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,1811.0,53.0,UDP,1532980000.0,18.450923
4379,2018-07-30 14:51:59.497398,b'puppet.princeton.edu.',b'princeton.edu.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,139,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,59329.0,53.0,UDP,1532980000.0,18.826832
4382,2018-07-30 14:51:59.498462,b'puppet.princeton.edu.',b'princeton.edu.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,139,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,61119.0,53.0,UDP,1532980000.0,18.827896
4384,2018-07-30 14:51:59.499223,b'puppet.',b'.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,141,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,51394.0,53.0,UDP,1532980000.0,18.828657
4386,2018-07-30 14:51:59.499993,b'puppet.',b'.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,141,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,49858.0,53.0,UDP,1532980000.0,18.829427
4716,2018-07-30 14:51:59.873431,b'clients1.google.com.',b'clients1.google.com.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,367,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,14038.0,53.0,UDP,1532980000.0,19.202865
8144,2018-07-30 14:52:30.329074,b'realtalk-princeton.tumblr.com.',b'realtalk-princeton.tumblr.com.',128.112.92.150,2154847000.0,128.112.136.10,2154859000.0,True,530,a8:60:b6:01:d0:a9,185133323899049,04:09:73:5f:c9:00,4438636873984,8669.0,53.0,UDP,1532980000.0,49.658508


We should expect one response for each query. Let's check that assumption.

In [21]:
print(dns_queries.shape)
print(dns_responses.shape)

(283, 18)
(283, 18)


In [22]:
src_port_53 = data.loc[data["port_src"] == 53]
dst_port_53 = data.loc[data["port_dst"] == 53]
print(src_port_53.shape)
print(dst_port_53.shape)

(283, 18)
(283, 18)


In [25]:
num_dns_queries = len(dns_queries)
num_dns_responses = len(dns_responses)

num_dns_total = num_dns_queries + num_dns_responses

# check the total
print(num_dns_total)

num_port_53 = len(src_port_53) + len(dst_port_53)

# should be the same
print(num_port_53)

# Note: This is tricky! Consult the DNS columns of the data in this notebook and/or Wireshark if you are stuck.
# probability that a packet is a DNS packet, given that at least one port is 53
dns_probability = (num_dns_queries + num_dns_responses) / num_port_53 
print(dns_probability) # Should be 1 (100%).

566
566
1.0


You should expect an answer of 100%. If you got over 100% instead, your probability is likely overcounting some packets.
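
One common source of overcounting is adding two counts that can overlap. The sketch below, on made-up port pairs, shows that adding the "source port is 53" count to the "destination port is 53" count differs from counting "either port is 53" whenever a packet has port 53 on both ends.

```python
# Toy (src_port, dst_port) pairs; the first packet uses port 53 on both ends.
ports = [(53, 53), (53, 40000), (40001, 53), (443, 40002)]

src_53 = sum(1 for s, d in ports if s == 53)                 # 2
dst_53 = sum(1 for s, d in ports if d == 53)                 # 2
either_53 = sum(1 for s, d in ports if s == 53 or d == 53)   # 3

# The sum counts the (53, 53) packet twice; the single condition does not.
print(src_53 + dst_53, either_53)
```

Counting with a single combined condition is the safer habit, since it gives the size of the union directly.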

### Probability that a DNS Response is Longer than the Mean Packet Length

Now let's answer the following question about our packets:
What is the probability that a given DNS response is longer than the average length of all packets?

$P($Length > Mean Length of **All** Packets | DNS Response$)$

In [26]:
# the mean length of all packets
mean_length = data['length'].mean() 
print(mean_length)

614.1422376180025


In [31]:
dns_responses['length'].mean()

329.85159010600705

In [34]:
# DNS responses with a length longer than mean_length
longer_than_mean = dns_responses[dns_responses["length"] > int(mean_length)] 

num_longer = len(longer_than_mean)
print(num_longer)
print(dns_responses['length'].max())

0
550
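
With the numbers printed above, none of the 283 DNS responses is longer than the mean (the longest is 550, below the mean of about 614), so the probability is 0 / 283 = 0. The final division can be sketched as a small helper; `prob_longer_than` is a hypothetical name, and the lengths below are made up so that the result is not trivially zero.

```python
def prob_longer_than(lengths, threshold):
    """Fraction of lengths strictly greater than threshold."""
    if not lengths:
        raise ValueError("need at least one length")
    return sum(1 for n in lengths if n > threshold) / len(lengths)

# Toy response lengths (made up); the threshold is the mean printed above.
toy_response_lengths = [161, 160, 238, 700, 139]
toy_probability = prob_longer_than(toy_response_lengths, 614.1422376180025)

print(toy_probability)  # 1 of the 5 toy lengths exceeds the mean
```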


### Homework Exercises

(Challenge!) Find the probability that a DNS request is immediately followed by a DNS response in the packet trace. This will give us an idea of how fast DNS responses are received, relative to other network traffic.

# Naïve Bayes Classifier

Now we're going to use the Naïve Bayes algorithm to predict which task a user is most likely doing given a particular packet. While there are existing Python functions for performing a Naïve Bayes classification, we already know everything we need to do it ourselves!

### Loading the Packets and the Labels

We first need to label the data with the activity that was happening at the time each packet was received.

If you haven't already, download the ross.pcap file from https://drive.google.com/file/d/1Lr1dleCbZcQWfHoW_u6Q2uZFte17Y2Z_/view?usp=sharing and place it in the AI4ALL-IoT/example_pcaps folder.

In [47]:
# Load the data, may take a few minutes.
# data = pcap_to_pandas("example_pcaps/ross.pcap")
labels = pd.read_csv('example_pcaps/ross_labels.txt', header=None, names=["time", "activity"])
labels.head(n=5)

Unnamed: 0,time,activity
0,2018-07-30 15:51:41.327734,WEB
1,2018-07-30 15:54:12.815653,AUDIO
2,2018-07-30 15:56:09.083618,VIDEO
3,2018-07-30 15:58:24.929799,WEB
4,2018-07-30 15:58:33.808876,GAMING


Next, let's add a timestamp (remember, this is measured in seconds since the epoch) to each entry in the activity log.

In [91]:
from dateutil import tz

def convert_to_datetime(time):
    return datetime.strptime(time, '%Y-%m-%d %H:%M:%S.%f')
    
labels['datetime'] = labels['time'].apply(convert_to_datetime)

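# Intended offset: GMT -0400 (Eastern Daylight Time, in effect on 2018-07-30).
# Caution: 'EST' normally resolves to a fixed UTC-5 zone, which shifts every timestamp by one hour.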
tzlocal = tz.gettz('EST')
#tzlocal = datetime.now().astimezone().tzinfo
labels['timestamp'] = labels['datetime'].apply(lambda dt: dt.replace(tzinfo=tzlocal).timestamp())

labels.head(n=50)
#print(tzlocal)

Unnamed: 0,time,activity,datetime,timestamp,label
0,2018-07-30 15:51:41.327734,WEB,2018-07-30 15:51:41.327734,1532984000.0,4
1,2018-07-30 15:54:12.815653,AUDIO,2018-07-30 15:54:12.815653,1532984000.0,0
2,2018-07-30 15:56:09.083618,VIDEO,2018-07-30 15:56:09.083618,1532984000.0,3
3,2018-07-30 15:58:24.929799,WEB,2018-07-30 15:58:24.929799,1532984000.0,4
4,2018-07-30 15:58:33.808876,GAMING,2018-07-30 15:58:33.808876,1532984000.0,1
5,2018-07-30 16:00:20.571626,INACTIVE,2018-07-30 16:00:20.571626,1532984000.0,2
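
The cell above aims for GMT -0400, yet the zone it requests is `EST`, which is normally a fixed UTC-5 offset. The sketch below shows how much that choice matters for the first label time. It uses fixed offsets from the standard library instead of `dateutil`, and it assumes the log was written in US Eastern local time, which the lab does not state outright.

```python
from datetime import datetime, timedelta, timezone

naive = datetime.strptime("2018-07-30 15:51:41.327734", "%Y-%m-%d %H:%M:%S.%f")

edt = timezone(timedelta(hours=-4))  # Eastern Daylight Time, in effect in July
est = timezone(timedelta(hours=-5))  # Eastern Standard Time

ts_edt = naive.replace(tzinfo=edt).timestamp()
ts_est = naive.replace(tzinfo=est).timestamp()

# The EST reading lands one hour (3600 seconds) later than the EDT reading.
print(ts_est - ts_edt)
```

An hour is longer than the whole activity log shown above, so a wrong offset is enough to move every label boundary outside the packet trace. A region name such as `America/New_York` (again an assumption about where the capture was taken) would let the time zone library pick the daylight saving offset automatically.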


### Assign an activity label to each row in the packet trace.

Next, we're going to use the activity log to label the data set. We use a label encoder to assign an integer label for each activity in the dataset.

In [92]:
label_encoder = LabelEncoder()
labels['label'] = label_encoder.fit_transform(labels['activity'])

labels.head(n=5)

Unnamed: 0,time,activity,datetime,timestamp,label
0,2018-07-30 15:51:41.327734,WEB,2018-07-30 15:51:41.327734,1532984000.0,4
1,2018-07-30 15:54:12.815653,AUDIO,2018-07-30 15:54:12.815653,1532984000.0,0
2,2018-07-30 15:56:09.083618,VIDEO,2018-07-30 15:56:09.083618,1532984000.0,3
3,2018-07-30 15:58:24.929799,WEB,2018-07-30 15:58:24.929799,1532984000.0,4
4,2018-07-30 15:58:33.808876,GAMING,2018-07-30 15:58:33.808876,1532984000.0,1
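
The integers in the `label` column follow the sorted order of the activity names, which is how `LabelEncoder` numbers its classes. The plain-Python sketch below reproduces that mapping for the six log entries; it mimics the encoder for illustration and is not a replacement for the scikit-learn call.

```python
activities = ["WEB", "AUDIO", "VIDEO", "WEB", "GAMING", "INACTIVE"]

classes = sorted(set(activities))   # unique names in alphabetical order
to_label = {name: i for i, name in enumerate(classes)}
encoded = [to_label[a] for a in activities]

print(to_label)
print(encoded)  # [4, 0, 3, 4, 1, 2]
```

Keeping this mapping in mind makes later output easier to read: label 2, for example, is INACTIVE.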


In [93]:

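# BUG: All packets are assigned activity 2 (INACTIVE, the last entry in the log), regardless of time.
# Each pass relabels every packet at or after that activity's start, so this is the result
# whenever every packet time is later than the final label timestamp. Check that data['time']
# and labels['timestamp'] use the same clock (time zone), or use a different condition.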
for index, row in labels.iterrows():
    data.loc[data['time'] >= row['timestamp'], 'label'] = row['label']
    
num_labels = max(labels['label'])
print(num_labels)

data.head(n=50)

4


Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed,label
0,2018-07-30 14:51:40.670566,,,255.255.255.255,4294967000.0,128.112.93.99,2154848000.0,False,184,ff:ff:ff:ff:ff:ff,281474976710655,0c:4d:e9:b0:8e:4b,13528772677195,17500.0,17500.0,UDP,1532980000.0,0.0,2.0
1,2018-07-30 14:51:40.670856,,,128.112.93.255,2154848000.0,128.112.93.99,2154848000.0,False,184,ff:ff:ff:ff:ff:ff,281474976710655,0c:4d:e9:b0:8e:4b,13528772677195,17500.0,17500.0,UDP,1532980000.0,0.00029,2.0
2,2018-07-30 14:51:41.370868,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,82,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700302,2.0
3,2018-07-30 14:51:41.370965,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700399,2.0
4,2018-07-30 14:51:41.370966,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.7004,2.0
5,2018-07-30 14:51:41.370966,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.7004,2.0
6,2018-07-30 14:51:41.370967,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700401,2.0
7,2018-07-30 14:51:41.370973,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700407,2.0
8,2018-07-30 14:51:41.370974,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700408,2.0
9,2018-07-30 14:51:41.370975,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700409,2.0
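
The rule the labeling loop is meant to implement is: a packet belongs to the most recent activity that started at or before the packet's time. Here is a minimal sketch of that rule on made-up numbers, using a binary search over the sorted start times. `label_at` is a hypothetical helper, not part of the lab's `utils`.

```python
from bisect import bisect_right

# Toy activity log: start times in seconds and their labels (values made up).
starts = [100.0, 250.0, 400.0]
labels_for_start = [4, 0, 3]

def label_at(t):
    """Label of the last activity that started at or before t, else None."""
    i = bisect_right(starts, t)
    return labels_for_start[i - 1] if i > 0 else None

packet_times = [50.0, 100.0, 249.9, 250.0, 999.0]
assigned = [label_at(t) for t in packet_times]

print(assigned)  # [None, 4, 4, 0, 3]
```

A packet that arrives before the first activity gets no label, and any packet after the last start gets the last label. If every packet time is later than every start time, all packets receive the last label, which matches the symptom in the output above.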


In [67]:
print(labels['activity'],labels['datetime'])
print(num_labels)
data.head(n=1)

0          WEB
1        AUDIO
2        VIDEO
3          WEB
4       GAMING
5     INACTIVE
Name: activity, dtype: object 0   2018-07-30 15:51:41.327734
1   2018-07-30 15:54:12.815653
2   2018-07-30 15:56:09.083618
3   2018-07-30 15:58:24.929799
4   2018-07-30 15:58:33.808876
5   2018-07-30 16:00:20.571626
Name: datetime, dtype: datetime64[ns]
4


Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed,label
0,2018-07-30 14:51:40.670566,,,255.255.255.255,4294967000.0,128.112.93.99,2154848000.0,False,184,ff:ff:ff:ff:ff:ff,281474976710655,0c:4d:e9:b0:8e:4b,13528772677195,17500.0,17500.0,UDP,1532980000.0,0.0,2.0


### Create training and testing sets. 

Finally, we're going to take 20% of the data set and reserve it as test data.

In [79]:
msk = np.random.rand(len(data)) < 0.8
train = data[msk]
test = data[~msk]

training_size = len(train)
print(training_size)

train['label'].head(n=10)

151215


0    2.0
1    2.0
2    2.0
3    2.0
4    2.0
5    2.0
6    2.0
7    2.0
8    2.0
9    2.0
Name: label, dtype: float64
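
A random mask gives a split that is close to 80/20 but not exact (here 151215 of 189085 rows, about 79.97%), and it changes on every run unless the random generator is seeded. The sketch below shows the same idea with the standard library and a fixed seed; `split_mask` is a hypothetical helper written for this example.

```python
import random

def split_mask(n, train_fraction=0.8, seed=0):
    """Return a list of n booleans; True marks a training row."""
    rng = random.Random(seed)
    return [rng.random() < train_fraction for _ in range(n)]

mask = split_mask(1000)
train_count = sum(mask)
test_count = len(mask) - train_count

print(train_count, test_count)  # close to 800 / 200, but not exact
```

Calling `split_mask` again with the same seed returns the same mask, so accuracy numbers can be compared between runs. In the NumPy version, seeding the generator before calling `np.random.rand` has the same effect.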

## Classification

The simplest statistic we need to compute is the probability that each label occurs:

In [75]:
label_probs = np.zeros(num_labels + 1)

for i in range(num_labels + 1):
    label_probs[i] = len(train[train['label'] == i]) / training_size
    print(i, label_probs[i])

0 0.0
1 0.0
2 1.0
3 0.0
4 0.0


Next, we are going to go through the training set and tally up which values appear in different fields and how often they appear.

In [22]:
train[train['label'] == 2].head(n=20)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed,label
0,2018-07-30 21:51:40.670566,,,255.255.255.255,4294967000.0,128.112.93.99,2154848000.0,False,184,ff:ff:ff:ff:ff:ff,281474976710655,0c:4d:e9:b0:8e:4b,13528772677195,17500.0,17500.0,UDP,1532980000.0,0.0,2.0
1,2018-07-30 21:51:40.670856,,,128.112.93.255,2154848000.0,128.112.93.99,2154848000.0,False,184,ff:ff:ff:ff:ff:ff,281474976710655,0c:4d:e9:b0:8e:4b,13528772677195,17500.0,17500.0,UDP,1532980000.0,0.00029,2.0
2,2018-07-30 21:51:41.370868,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,82,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700302,2.0
5,2018-07-30 21:51:41.370966,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.7004,2.0
6,2018-07-30 21:51:41.370967,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700401,2.0
7,2018-07-30 21:51:41.370973,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700407,2.0
8,2018-07-30 21:51:41.370974,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700408,2.0
9,2018-07-30 21:51:41.370975,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700409,2.0
10,2018-07-30 21:51:41.370976,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.70041,2.0
11,2018-07-30 21:51:41.370980,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980000.0,0.700414,2.0


In [23]:
ip_table = {} # The table maps each (label, value) pair to the number of times it was seen.
ip_list = [] # The list stores each distinct value we've seen.

dns_table = {}
dns_list = []

port_table = {}
port_list = []

protocol_table = {}
protocol_list = []

def update_table(table, lst, label, value):
    # Skip missing values; pandas stores them as None or NaN, and pd.notna handles both.
    if pd.notna(value):
        table[(label, value)] = table.get((label, value), 0) + 1
        if value not in lst:
            lst.append(value)

for index, row in train.iterrows():
    update_table(ip_table, ip_list, row['label'], row['ip_src'])
    update_table(ip_table, ip_list, row['label'], row['ip_dst'])
    update_table(port_table, port_list, row['label'], row['port_src'])
    update_table(port_table, port_list, row['label'], row['port_dst'])
    update_table(protocol_table, protocol_list, row['label'], row['protocol'])
    update_table(dns_table, dns_list, row['label'], row['dns_query'])
    update_table(dns_table, dns_list, row['label'], row['dns_resp'])

Now we use these tallies to compute the logarithm of the event probabilities. Typically, we prefer to work with log probabilities because many of these events have very small chances of happening. When multiplied together, the resulting joint probability often ends up too small for computers to represent accurately. Taking logarithms does not change what we are trying to do conceptually, but it improves the numerical properties of the algorithm.
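
A short experiment shows the problem. The probabilities below are chosen for illustration only: multiplying 400 values of 0.001 underflows to exactly zero in floating point, while the sum of their logarithms stays well within range.

```python
from math import log

probs = [1e-3] * 400   # 400 unlikely events; values chosen for illustration

product = 1.0
for p in probs:
    product *= p       # underflows to 0.0 long before the loop ends

log_sum = sum(log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -2763.1
```

Once the product has collapsed to zero, every class looks equally impossible, whereas log scores can still be compared.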

In [24]:
def compute_log_probs(table, lst, smoothing):
    log_probs = {}
    
    for l in range(num_labels + 1):
        total = sum([table.get((l, val), 0) for val in lst])
        
        for val in lst:
            if (l, val) in table:
                log_probs[(l, val)] = log(table[(l, val)] + smoothing) - log(total + smoothing * (len(lst) + 1))
        
        log_probs[(l, '<UNK>')] = log(smoothing) - log(total + smoothing * (len(lst) + 1))
    
    return log_probs

ip_log_prob = compute_log_probs(ip_table, ip_list, 1e-5)
dns_log_prob = compute_log_probs(dns_table, dns_list, 1e-5)
port_log_prob = compute_log_probs(port_table, port_list, 1e-5)
protocol_log_prob = compute_log_probs(protocol_table, protocol_list, 1e-5)
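
The `smoothing` term gives every value, including values never seen with a label, a small nonzero probability. The sketch below applies the same formula as `compute_log_probs` to made-up tallies for a single label and checks that the smoothed probabilities, including the `'<UNK>'` entry, still sum to 1.

```python
from math import exp, log

counts = {"a": 3, "b": 1}   # toy tallies for one label (made up)
smoothing = 1e-5
vocab = list(counts)
total = sum(counts.values())
denom = total + smoothing * (len(vocab) + 1)   # +1 reserves mass for '<UNK>'

log_probs = {v: log(counts[v] + smoothing) - log(denom) for v in vocab}
log_probs["<UNK>"] = log(smoothing) - log(denom)

prob_sum = sum(exp(lp) for lp in log_probs.values())
print(prob_sum)  # 1.0 up to rounding
```

With a smoothing value as small as 1e-5, unseen values are penalized heavily (a log probability near -12.9 here) without ever producing `log(0)`.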

Finally, we are ready to create the classifier. When presented with a new row of data, we simply sum all the relevant log probabilities for each class and report the class with the highest log probability.
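
In symbols, and under the Naïve Bayes assumption that the fields are independent given the label, the rule can be written as below. The notation is introduced here for illustration: $y$ is a candidate label, $x_1, \dots, x_n$ are the field values of the row (IP addresses, ports, protocol, and DNS names when present), and $\hat{y}$ is the predicted label.

$\hat{y}\ =\ \arg\max_{y}\ \left[\log P(y)\ +\ \sum_{i=1}^{n} \log P(x_i\ |\ y)\right]$

The first term is the log of the label probability computed earlier, and each term of the sum is one lookup in the log probability tables.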

In [25]:
def get_log_prob(table, val, label):
    return table.get((label, val), table[(label, '<UNK>')])

def classify(row):
    best_label = -1
    best_label_score = float('-Inf')
    
    for l in range(num_labels + 1):
        score = log(label_probs[l])
        
        score = score + get_log_prob(ip_log_prob, row['ip_src'], l)
        score = score + get_log_prob(ip_log_prob, row['ip_dst'], l)
        
        if row['is_dns']:
            # pd.notna also catches NaN, which pandas uses for missing values
            if pd.notna(row['dns_query']):
                score = score + get_log_prob(dns_log_prob, row['dns_query'], l)
            
            if pd.notna(row['dns_resp']):
                score = score + get_log_prob(dns_log_prob, row['dns_resp'], l)
        
        score = score + get_log_prob(port_log_prob, row['port_src'], l)
        score = score + get_log_prob(port_log_prob, row['port_dst'], l)
        
        score = score + get_log_prob(protocol_log_prob, row['protocol'], l)
        
        if score > best_label_score:
            best_label = l
            best_label_score = score
    
    return best_label

correct = 0

for index, row in test.iterrows():
    if classify(row) == row['label']:
        correct = correct + 1

print('Accuracy: {}'.format(correct / len(test)))

ValueError: math domain error
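
The error above is what `log` raises when its argument is 0, and it is consistent with `log(label_probs[l])` being evaluated for a label whose training probability is 0.0 (every label except 2 in the output printed earlier). The underlying cause is the labeling problem, but the classifier can also be made robust to empty classes. A minimal sketch, with `safe_log` as a hypothetical helper:

```python
from math import log

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising ValueError."""
    return log(p) if p > 0 else float("-inf")

label_probs = [0.0, 0.0, 1.0, 0.0, 0.0]   # the priors printed earlier

try:
    log(label_probs[0])
except ValueError as err:
    print("log(0) fails:", err)

scores = [safe_log(p) for p in label_probs]
best = max(range(len(scores)), key=scores.__getitem__)

print(best)  # 2, the only label with a nonzero prior
```

A score of negative infinity can never beat a finite score, so a label that never occurs in the training data is simply never predicted. Smoothing the label probabilities, as was done for the field values, would be another option.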