In [11]:
from data_collection.parse_pcap import pcap_to_pandas
from utils import *

# Probability

## Introduction

Today, we're going to explore probability. Probability is a powerful tool that lets us answer interesting questions about our data, and it serves as the foundation of a commonly used machine learning technique for classification. We'll also be building a Naive Bayes classifier from scratch, so you'll get hands-on experience coding a machine learning classifier yourself!

Let's start with some simple probability examples on the board. Let's see how much you can recall from lecture!

Say I have a bucket with 10 blue balls and 20 red balls. If I choose a ball at random from the bucket, what is the probability that I choose a red ball? That is, we want to calculate:

$P($red ball$)\ =\ ??$

This is equal to the number of red balls divided by the total number of balls.

$P($red ball$)\ =\ \frac{\text{# of red balls}}{\text{# of total balls}}\ =\ \frac{20}{30}\ =\ \frac{2}{3}$

Similarly, the chance of picking a blue ball is:

$P($blue ball$)\ =\ \frac{\text{# of blue balls}}{\text{# of total balls}}\ =\ \frac{10}{30}\ =\ \frac{1}{3}$
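
If you'd like to check these fractions yourself, here is a minimal sketch using Python's standard `fractions` module. The variable names are our own and are not part of the lab's helper code.

```python
from fractions import Fraction

# Bucket contents from the example above
num_red = 20
num_blue = 10
num_total = num_red + num_blue

# Fraction reduces to lowest terms automatically
p_red = Fraction(num_red, num_total)    # 20/30 reduces to 2/3
p_blue = Fraction(num_blue, num_total)  # 10/30 reduces to 1/3

print(p_red, p_blue)   # 2/3 1/3
print(p_red + p_blue)  # 1, since every ball is either red or blue
```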

Now, let's say we want to find the probability of picking a red ball out of the bucket, **AND THEN** picking a blue ball out of the bucket. When we want to find the probability of two events both occurring, we multiply their probabilities together. The resulting probability is:

$P(\text{red ball})*P(\text{blue ball}\ |\ \text{red ball missing})$

Here, we introduce the concept of conditional probability. $P(\text{blue ball}\ |\ \text{red ball missing})$ represents the probability that a blue ball is pulled from the bucket, **given** that a red ball has already been taken out.

Are these two events independent? Does pulling a red ball first change the probability of then pulling a blue ball? If it had no effect, the overall probability would be equivalent to:

$P(\text{red ball})*P(\text{blue ball})$

But it's not! By removing a red ball, there are now fewer overall balls to choose from, which changes the resulting probability. The full probability is therefore calculated as:

$P(\text{red ball})*P(\text{blue ball}\ |\ \text{red ball missing})\ =\ \frac{20}{30}*\frac{10}{29}\ =\ \frac{20}{87}$
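
To see how much the dependence matters, the short sketch below compares the correct calculation with the value we would get if we (wrongly) treated the two draws as independent. The variable names are ours, chosen for illustration.

```python
from fractions import Fraction

# First draw: 20 red balls out of 30 total
p_red_first = Fraction(20, 30)

# Second draw: 10 blue balls out of the 29 balls that remain
p_blue_given_red_missing = Fraction(10, 29)

# Correct probability of red AND THEN blue
p_dependent = p_red_first * p_blue_given_red_missing

# What we would get if removing the red ball changed nothing
p_if_independent = Fraction(20, 30) * Fraction(10, 30)

print(p_dependent)       # 20/87
print(p_if_independent)  # 2/9
```

The two results differ (20/87 is slightly larger than 2/9), which confirms that the draws are not independent.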

Now that you've had a chance to review, let's dive into the data.

## Exercises

#### Probability of a TCP Packet

Let's compute the probability that a packet from our capture was a TCP packet:

$P(\text{TCP Packet})$

We'll start by loading some captured data into Python, and filtering out packets that don't have a DNS query field or a DNS response field. You'll need to fill in the blanks with the correct information. For tcp_packets, there are three options for each blank: "data", "protocol", or "TCP". Consult yesterday's lab if you need to!

In [8]:
data = ?? # call our helper "pcap_to_pandas" function, and pass in the argument "example_pcaps/tplink_switch.pcap"
tcp_packets = ??[??[??] == "??"] # packets with the protocol column equal to "TCP"

# len gives the number of packets in some data
num_tcp_packets = len(??) # number of TCP packets
num_total_packets = len(??) # number of total packets

tcp_probability = ?? / ?? # probability that a packet is a TCP packet

print(tcp_probability)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
15,2017-12-07 15:11:31.532799,b's1a.time.edu.cn.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,32835.0,UDP,1512677000.0,7.217349
16,2017-12-07 15:11:31.763646,b'devs.tplinkcloud.com.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,80,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,43866.0,UDP,1512677000.0,7.448196
17,2017-12-07 15:11:31.775682,b'devs.tplinkcloud.com.',b'devs.tplinkcloud.com.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,533,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,43866.0,53.0,UDP,1512677000.0,7.460232
21,2017-12-07 15:11:31.885528,b's1a.time.edu.cn.',b's1a.time.edu.cn.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,121,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,32835.0,53.0,UDP,1512677000.0,7.570078
57,2017-12-07 15:11:47.922651,b's1b.time.edu.cn.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,39900.0,UDP,1512678000.0,23.607201


#### Probability of a DNS Packet, Given Source Port or Dest Port is 53

Now, let's compute the probability that a packet from our capture was a DNS packet, given that at least one of its ports was 53. We define a DNS packet as a packet that has a DNS query **OR** a DNS response field. We are calculating:

$P($DNS Query $\cup$ DNS Response | Source Port == 53 $\cup$ Dst Port == 53$)$

The $\cup$ means "OR".

The probability can be calculated as:
$P(\text{DNS Query} \cup \text{DNS Response}\ |\ \text{Source Port == 53} \cup \text{Dst Port == 53})\ =\ \frac{\text{# of packets with a DNS query or DNS response field}}{\text{# of packets with a SRC port or DST port 53}}$

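Because this is a conditional probability, rather than dividing by the total number of packets, we divide by only the # of packets that satisfy the condition that the SRC or DST port is equal to 53. In general, the numerator should likewise count only packets that satisfy both the event and the condition.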
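
The counting idea can be sketched on a tiny, made-up set of packets. The data and names below are hypothetical and deliberately avoid pandas, so this is an illustration of the concept rather than the answer to the exercise.

```python
# Toy packets as (has_dns_field, uses_port_53) pairs; these values are made up.
packets = [
    (True, True),
    (True, True),
    (False, True),
    (False, False),
    (False, False),
]

# Step 1: keep only the packets that satisfy the condition (port 53).
port_53_packets = [p for p in packets if p[1]]

# Step 2: among those, count the packets that also have a DNS field.
dns_and_port_53 = [p for p in port_53_packets if p[0]]

# Conditional probability: divide by the size of the conditioned group.
p_dns_given_port_53 = len(dns_and_port_53) / len(port_53_packets)  # 2 of 3

# Unconditional probability: divide by all packets.
p_dns_overall = sum(1 for p in packets if p[0]) / len(packets)     # 2 of 5

print(p_dns_given_port_53)
print(p_dns_overall)
```

Conditioning on port 53 shrinks the denominator from 5 packets to 3, which is why the two probabilities differ.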

You'll need to fill in the blanks with the correct information. For dns_queries and dns_responses, there are three options for each blank: "data", "dns_query", or "dns_resp". Consult yesterday's lab if you need to!

In [21]:
dns_queries = ??[??[??].notnull()] # packets with a DNS query column that isn't None
dns_responses = ??[??[??].notnull()] # packets with a DNS response column that isn't None

src_port_53 = data[data["port_src"] == 53]
dst_port_53 = data[data["port_dst"] == 53]

num_dns_queries = len(??)
num_dns_responses = len(??)
num_dns_total = num_dns_queries + num_dns_responses

num_port_53 = len(src_port_53) + len(dst_port_53)

# Note: This is tricky! Consult the DNS columns of the data in this notebook and/or Wireshark if you are stuck.
dns_probability = ?? / num_port_53 # probability that a packet is a DNS packet, given that at least one port is 53

print(dns_probability) # Should be 1 (100%).

1.5


You should expect an answer of 100%. If you got over 100% instead, your probability is likely overcounting some packets!

Hint: Examine the "dns_query" and "dns_resp" columns of packets that contain a DNS query or response.
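
Overcounting happens whenever two groups overlap and their sizes are simply added. The sketch below shows the effect with Python sets of hypothetical packet ids (made up for illustration, not taken from the capture): the size of a union is $|A \cup B| = |A| + |B| - |A \cap B|$, not $|A| + |B|$.

```python
# Hypothetical packet ids, made up for illustration.
has_query = {1, 2, 3, 4}  # packets with a query field
has_response = {3, 4}     # packets with a response field (both also have a query field)

# Adding the two sizes counts packets 3 and 4 twice.
naive_count = len(has_query) + len(has_response)

# The set union counts every packet exactly once.
union_count = len(has_query | has_response)

print(naive_count, union_count)  # 6 4
```

Dividing the naive count by the true number of packets gives 6 / 4 = 1.5, a "probability" above 100%, which is the symptom described above.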

Now let's answer the following question about our packets. What is the probability that a given DNS response has a length longer than the average length of all packets?

$P($Length > Mean Length of **All** Packets | DNS Response$)$

In [44]:
mean_length = ??[??].mean() # the mean length of all packets
longer_than_mean = dns_responses[dns_responses["length"] > ??] # DNS responses with a length longer than mean_length

num_longer = len(longer_than_mean)
print(num_longer / num_dns_responses)

0.5


In [43]:
longer_than_mean

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
17,2017-12-07 15:11:31.775682,b'devs.tplinkcloud.com.',b'devs.tplinkcloud.com.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,533,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,43866.0,53.0,UDP,1512677000.0,7.460232
119,2017-12-07 15:16:20.105092,b'fr.pool.ntp.org.',b'fr.pool.ntp.org.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,513,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,34673.0,53.0,UDP,1512678000.0,295.789642
123,2017-12-07 15:16:20.364034,b'devs.tplinkcloud.com.',b'devs.tplinkcloud.com.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,533,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,59227.0,53.0,UDP,1512678000.0,296.048584


## Additional Exercises

1. Find the probability that a DNS request is immediately followed by a DNS response in the packet trace. This will give us an idea of how fast DNS responses are received, relative to other network traffic.
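
One possible way to approach this exercise is to pair each packet with the packet that follows it. The sketch below does this on a made-up trace of labels. It assumes the packets are already in capture order, it counts any response as a match (not necessarily the response to that particular request), and it ignores a request at the very end of the trace, since nothing follows it. You will still need to adapt the idea to the real data.

```python
# Toy trace in capture order; the labels are made up for illustration.
trace = ["query", "response", "other", "query", "other", "response", "query", "response"]

# Pair each packet with its immediate successor: (trace[0], trace[1]), (trace[1], trace[2]), ...
pairs = list(zip(trace, trace[1:]))

# Requests that have a following packet at all
num_queries = sum(1 for cur, _ in pairs if cur == "query")

# Requests whose very next packet is a response
num_followed = sum(1 for cur, nxt in pairs if cur == "query" and nxt == "response")

# P(next packet is a response | current packet is a request)
print(num_followed / num_queries)
```

In this toy trace, 2 of the 3 requests are immediately followed by a response.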