# Basic Analysis of Network Traffic Traces

In this laboratory, we will explore the basics of network traffic capture. 

## Learning Objectives

By the end of this lab, you should understand the following:

* How to capture a network traffic trace.
* The meaning of the following fields in the trace: (1) IP address; (2) MAC address; (3) length; (4) DNS queries and responses.

## Setup

Before we get started, you will need to install a tool to generate packet captures. There are also some example pcaps in the `pcaps` directory of this repository, but it is worth becoming familiar with capturing network traffic yourself.

**Wireshark** The fundamental data that we will use for analysis, in this laboratory and others, is a _network packet trace_, sometimes called a "pcap".  [Wireshark](https://wireshark.org/) is a tool that we can use to capture and analyze network traffic data from the devices on a network. 

### Warmup: Basic Wireshark Analysis

First, use Wireshark to collect a packet trace. Save the trace as a regular pcap (not pcapng) somewhere on your local machine. Note the location where you have saved the file, as we will be loading that file into the notebook later.
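
Because Wireshark can save in either format, it can help to confirm which one you ended up with. The first four bytes of a capture file identify its format. The sketch below uses a hypothetical helper, `capture_format`, that is not part of this repository's `lib` directory; the magic numbers are the ones defined by the pcap and pcapng file formats.

```python
import struct

# Magic numbers found in the first four bytes of a classic pcap file.
PCAP_MAGICS = {
    b"\xa1\xb2\xc3\xd4": "pcap (big-endian, microsecond timestamps)",
    b"\xd4\xc3\xb2\xa1": "pcap (little-endian, microsecond timestamps)",
    b"\xa1\xb2\x3c\x4d": "pcap (big-endian, nanosecond timestamps)",
    b"\x4d\x3c\xb2\xa1": "pcap (little-endian, nanosecond timestamps)",
}
PCAPNG_MAGIC = b"\x0a\x0d\x0d\x0a"  # block type of a pcapng Section Header Block


def capture_format(first_bytes):
    """Classify a capture from its first four bytes (hypothetical helper)."""
    magic = bytes(first_bytes[:4])
    if magic == PCAPNG_MAGIC:
        return "pcapng"
    return PCAP_MAGICS.get(magic, "unknown")


# Build a little-endian pcap global header in memory instead of reading a file:
# magic, version 2.4, timezone offset, sigfigs, snaplen, link type 1 (Ethernet).
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
print(capture_format(header))
```

To check a real file, pass it the result of `open(path, "rb").read(4)`, where `path` is wherever you saved your trace.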

Using Wireshark, answer the following questions:
* How many packets are in the trace?
* What is the total volume of traffic in the trace?

These are fairly straightforward questions that Wireshark itself can easily answer. Doing more complicated analysis (and eventually machine learning) requires more sophisticated processing. For that, in this course, we will rely on Python, pandas, and scikit-learn.

## Analyzing Packet Captures in Python

We will now load the packet capture you have generated into Python, specifically into an analysis library called Pandas, which will allow us to ask more complex questions.

In [12]:
import pandas as pd
from datetime import datetime, timezone

# Allow us to load modules from the parent directory
import sys
sys.path.append("../lib") 
from parse_pcap import pcap_to_pandas, send_rates

# Insert your own packet capture here.

#pcap = pcap_to_pandas('../pcaps/uchicagocs-web-20200714.pcap')
pcap = pcap_to_pandas('../pcaps/uchicagocs-web-20200821.pcap')

# Keep rows from index label 26 onward, then look at the first 10 rows.
pcap = pcap.loc[26:,:]
pcap.head(10)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
26,2020-08-21 09:06:50,b'www.cs.uchicago.edu.',,192.168.1.1,3232235777,192.168.1.23,3232235799,True,79,28:80:88:27:6b:a7,44532505209767,3c:15:c2:d9:d3:50,66064161035088,53,55851,UDP,1598018810.95664,1.257173
27,2020-08-21 09:06:50,b'www.cs.uchicago.edu.',b'www.cs.uchicago.edu.',192.168.1.23,3232235799,192.168.1.1,3232235777,True,113,3c:15:c2:d9:d3:50,66064161035088,28:80:88:27:6b:a7,44532505209767,55851,53,UDP,1598018810.986062,1.286595
28,2020-08-21 09:06:50,,,128.135.164.125,2156373117,192.168.1.23,3232235799,False,78,28:80:88:27:6b:a7,44532505209767,3c:15:c2:d9:d3:50,66064161035088,443,51158,TCP,1598018810.987998,1.288531
29,2020-08-21 09:06:51,,,192.168.1.23,3232235799,128.135.164.125,2156373117,False,66,3c:15:c2:d9:d3:50,66064161035088,28:80:88:27:6b:a7,44532505209767,51158,443,TCP,1598018811.0035,1.304033
30,2020-08-21 09:06:51,,,128.135.164.125,2156373117,192.168.1.23,3232235799,False,54,28:80:88:27:6b:a7,44532505209767,3c:15:c2:d9:d3:50,66064161035088,443,51158,TCP,1598018811.003608,1.304141
31,2020-08-21 09:06:51,,,128.135.164.125,2156373117,192.168.1.23,3232235799,False,571,28:80:88:27:6b:a7,44532505209767,3c:15:c2:d9:d3:50,66064161035088,443,51158,TCP,1598018811.005313,1.305846
32,2020-08-21 09:06:51,,,192.168.1.23,3232235799,128.135.164.125,2156373117,False,60,3c:15:c2:d9:d3:50,66064161035088,28:80:88:27:6b:a7,44532505209767,51158,443,TCP,1598018811.026194,1.326727
33,2020-08-21 09:06:51,,,192.168.1.23,3232235799,128.135.164.125,2156373117,False,1514,3c:15:c2:d9:d3:50,66064161035088,28:80:88:27:6b:a7,44532505209767,51158,443,TCP,1598018811.037603,1.338136
34,2020-08-21 09:06:51,,,192.168.1.23,3232235799,128.135.164.125,2156373117,False,1514,3c:15:c2:d9:d3:50,66064161035088,28:80:88:27:6b:a7,44532505209767,51158,443,TCP,1598018811.037609,1.338142
35,2020-08-21 09:06:51,,,192.168.1.23,3232235799,128.135.164.125,2156373117,False,1230,3c:15:c2:d9:d3:50,66064161035088,28:80:88:27:6b:a7,44532505209767,51158,443,TCP,1598018811.037611,1.338144
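
The `ip_*_int` and `mac_*_int` columns look like integer encodings of the corresponding address columns. How `parse_pcap` computes them is not shown here, so the following is only a sketch, with hypothetical helper names `ip_to_int` and `mac_to_int`, that happens to reproduce the values in the table above.

```python
import ipaddress


def ip_to_int(ip):
    """Interpret a dotted-quad IPv4 address as a 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def mac_to_int(mac):
    """Interpret a colon-separated MAC address as a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


print(ip_to_int("192.168.1.1"))         # 3232235777
print(mac_to_int("28:80:88:27:6b:a7"))  # 44532505209767
```

Integer forms like these are convenient later on, because most machine learning libraries expect numeric features rather than strings.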


### Basic Dataframe Statistics

You can use the `shape` attribute to discover how many rows and columns exist in your dataset and the `columns` attribute to get the column labels.

In [13]:
print('{}\n\n'.format(pcap.shape))
print(pcap.columns)

(254, 18)


Index(['datetime', 'dns_query', 'dns_resp', 'ip_dst', 'ip_dst_int', 'ip_src',
       'ip_src_int', 'is_dns', 'length', 'mac_dst', 'mac_dst_int', 'mac_src',
       'mac_src_int', 'port_dst', 'port_src', 'protocol', 'time',
       'time_normed'],
      dtype='object')
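
Both `shape` and `columns` are attributes rather than methods, so they are read without parentheses. A minimal sketch on a small, made-up dataframe (the addresses and lengths are invented for illustration):

```python
import pandas as pd

# Toy stand-in for a packet capture; values are made up for illustration.
toy = pd.DataFrame({
    "ip_src": ["10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "ip_dst": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "length": [74, 1514, 66],
})

n_rows, n_cols = toy.shape   # a (rows, columns) tuple
print(n_rows, n_cols)        # 3 3
print(list(toy.columns))     # ['ip_src', 'ip_dst', 'length']
print(len(toy))              # len() also gives the row count: 3
```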


### Slicing and Sub-Selecting Data

Pandas lets us select a subset of columns with `.loc`. Let's use it to cut the dataframe down to the columns on which we want to do further analysis.

In [14]:
pcap = pcap.loc[:,['datetime','ip_src','ip_dst',
                   'length','port_src','port_dst','protocol']]
pcap.head(10)

Unnamed: 0,datetime,ip_src,ip_dst,length,port_src,port_dst,protocol
26,2020-08-21 09:06:50,192.168.1.23,192.168.1.1,79,55851,53,UDP
27,2020-08-21 09:06:50,192.168.1.1,192.168.1.23,113,53,55851,UDP
28,2020-08-21 09:06:50,192.168.1.23,128.135.164.125,78,51158,443,TCP
29,2020-08-21 09:06:51,128.135.164.125,192.168.1.23,66,443,51158,TCP
30,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,54,51158,443,TCP
31,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,571,51158,443,TCP
32,2020-08-21 09:06:51,128.135.164.125,192.168.1.23,60,443,51158,TCP
33,2020-08-21 09:06:51,128.135.164.125,192.168.1.23,1514,443,51158,TCP
34,2020-08-21 09:06:51,128.135.164.125,192.168.1.23,1514,443,51158,TCP
35,2020-08-21 09:06:51,128.135.164.125,192.168.1.23,1230,443,51158,TCP
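
One detail of `.loc` is easy to miss: it selects by label, and a label slice includes its end point. That is why `pcap.loc[26:,:]` earlier kept the row labeled 26. The sketch below shows the difference from the positional `.iloc` on a made-up dataframe whose index deliberately does not start at zero.

```python
import pandas as pd

# Made-up rows; the index starts at 24 to mimic a trimmed capture.
toy = pd.DataFrame(
    {
        "ip_src": ["10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
        "length": [74, 1514, 66, 1230],
        "protocol": ["UDP", "TCP", "TCP", "TCP"],
    },
    index=[24, 25, 26, 27],
)

# .loc selects rows by index label, and the end label is included.
rows = toy.loc[25:26, :]
print(list(rows.index))            # [25, 26]

# A list of names selects columns, in the order given.
cols = toy.loc[:, ["protocol", "length"]]
print(list(cols.columns))          # ['protocol', 'length']

# .iloc is positional and excludes the end position.
print(list(toy.iloc[1:2].index))   # [25]
```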


### Conditional Slicing

You can slice a dataframe based on conditionals.  Here we select only the rows whose source IP address corresponds to a certain value.

In [15]:
pcap[pcap['ip_src'] == '192.168.1.23'].head(10)

Unnamed: 0,datetime,ip_src,ip_dst,length,port_src,port_dst,protocol
26,2020-08-21 09:06:50,192.168.1.23,192.168.1.1,79,55851,53,UDP
28,2020-08-21 09:06:50,192.168.1.23,128.135.164.125,78,51158,443,TCP
30,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,54,51158,443,TCP
31,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,571,51158,443,TCP
36,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,54,51158,443,TCP
37,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,54,51158,443,TCP
38,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,54,51158,443,TCP
41,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,54,51158,443,TCP
42,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,180,51158,443,TCP
43,2020-08-21 09:06:51,192.168.1.23,128.135.164.125,1514,51158,443,TCP
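
A condition such as `pcap['ip_src'] == '192.168.1.23'` produces a boolean Series with one value per row, and indexing with it keeps the rows marked `True`. Conditions can also be combined. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up packets for illustration.
toy = pd.DataFrame({
    "ip_src": ["10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "port_dst": [53, 50000, 443, 50001],
    "protocol": ["UDP", "UDP", "TCP", "TCP"],
    "length": [79, 113, 78, 1514],
})

from_client = toy["ip_src"] == "10.0.0.2"   # boolean Series, one value per row
is_tcp = toy["protocol"] == "TCP"

# Combine masks with & (and), | (or), and ~ (not).
client_tcp = toy[from_client & is_tcp]
print(client_tcp["length"].tolist())        # [78]

# Inline comparisons need parentheses because & and | bind tightly.
large_or_dns = toy[(toy["length"] > 1000) | (toy["port_dst"] == 53)]
print(large_or_dns.index.tolist())          # [0, 3]
```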


### Further Analysis Questions

You could ask some follow-up questions about the web download above:
* How many total bytes were exchanged in this web download (in both directions)?
* How many total bytes went from the web server to the client device (e.g., web browser), i.e., the "download"?
* How long did the total download take?
* What is the maximum packet size (length)? What is the average packet size?

You may find the Pandas documentation and examples helpful (e.g., [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)).

In [16]:
# Total bytes exchanged.
# HINT: use the pandas "groupby" function

# Total bytes downloaded.
# HINT: extend the first part by first using a conditional slice to only select the download packets.
# (download would be ip_dst equal to the IP address of the client, ip_src equal to the IP address of the server)

# Total time.
# HINT: Time of last row minus time of first row.

# Max/average length.
# HINT: Apply max, mean pandas functions to the 'length' column of the pcap dataframe.
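
To see the shape of these computations without giving away the answers for your own trace, here is one possible approach on a tiny made-up trace. The addresses, the `time` values, and the lengths are all invented, and the toy keeps a numeric `time` column, which the sliced `pcap` dataframe above no longer has.

```python
import pandas as pd

# Made-up trace: client 10.0.0.2 downloads from server 203.0.113.5.
toy = pd.DataFrame({
    "time": [0.00, 0.02, 0.03, 0.05, 0.09],
    "ip_src": ["10.0.0.2", "203.0.113.5", "10.0.0.2", "203.0.113.5", "203.0.113.5"],
    "ip_dst": ["203.0.113.5", "10.0.0.2", "203.0.113.5", "10.0.0.2", "10.0.0.2"],
    "length": [78, 66, 54, 1514, 1230],
})

# Total bytes in both directions.
total_bytes = toy["length"].sum()

# Download bytes: packets from the server to the client.
download = toy[(toy["ip_src"] == "203.0.113.5") & (toy["ip_dst"] == "10.0.0.2")]
download_bytes = download["length"].sum()

# Duration: time of the last row minus time of the first row.
duration = toy["time"].iloc[-1] - toy["time"].iloc[0]

print(total_bytes)                                 # 2942
print(download_bytes)                              # 2810
print(round(duration, 2))                          # 0.09
print(toy["length"].max(), toy["length"].mean())   # 1514 588.4
```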


---
## Basic Analysis of Traffic Using Pandas

Try some of the examples below using a trace that has multiple destination IP addresses (e.g., all of your web traffic).

### List of Unique Destination IP Addresses

What are the unique destinations that our network is communicating with?  We can use the `unique` method to retrieve those.

In [17]:
unique_dst_ip = pd.DataFrame(pcap['ip_dst'].unique())[0]
print(unique_dst_ip)

0        192.168.1.1
1       192.168.1.23
2    128.135.164.125
Name: 0, dtype: object


### Most Popular Destination IP Addresses

We can use `groupby`, `sum`, and `sort_values` to group the rows of the dataframe and determine the most popular destination IP addresses, measured here by total bytes.

In [18]:
# Only 'length' is summed, so select just the grouping key and that column.
pkts_dst = pcap.loc[:,['ip_dst','length']]
pkts_dst.groupby(['ip_dst']).sum().sort_values(by='length',ascending=False)

ip_dst,length
192.168.1.23,107974
128.135.164.125,28166
192.168.1.1,2591
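
"Most popular" can mean most bytes or most packets, and the two rankings can differ. As a sketch on made-up data, a single `groupby` can report both by using named aggregation:

```python
import pandas as pd

# Made-up packets: many small packets to one host, a few large ones to another.
toy = pd.DataFrame({
    "ip_dst": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "length": [60, 1514, 54, 66, 1230],
})

stats = (
    toy.groupby("ip_dst")["length"]
    .agg(packets="count", bytes="sum")   # one output column per keyword
    .sort_values(by="bytes", ascending=False)
)
print(stats)
```

Here `10.0.0.2` receives the most bytes even though `10.0.0.1` receives more packets.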


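Next, we define a reverse DNS lookup function, which maps an IP address back to a hostname. We first test the lookup on a couple of known addresses.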

In [19]:
from dns import resolver
from dns import reversename

# test reverse DNS lookup
addr = reversename.from_address('34.193.201.2')
print(resolver.query(addr, "PTR")[0])

ec2-34-193-201-2.compute-1.amazonaws.com.


In [20]:
# test reverse DNS lookup
addr = reversename.from_address('204.80.104.218')
print(resolver.query(addr, "PTR")[0])

zoomnye218mmr.ny.zoom.us.


In [21]:
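def reverse_lookup(ip):
    """Return the PTR (reverse DNS) name for ip, or 'N/A' if the lookup fails."""
    # Guard against missing addresses, which may appear as None.
    if str(ip) == 'None':
        return 'None'
    addr = reversename.from_address(ip)
    try:
        # Note: dnspython 2.x deprecates resolver.query in favor of resolver.resolve.
        return str(resolver.query(addr, "PTR")[0])
    except Exception:
        return 'N/A'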

### Apply a Function to an Entire Dataframe

Use the pandas `apply` method to create a new column with the DNS names associated with each destination.

Then look at the unique destination names in the trace.

In [22]:
pcap['name_dst'] = pcap['ip_dst'].apply(reverse_lookup)

In [23]:
unique_dst_name = pd.DataFrame(pcap['name_dst'].unique())[0]
print(unique_dst_name)

0                     N/A
1    hnd.cs.uchicago.edu.
Name: 0, dtype: object
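
Applying a lookup to every row repeats the same DNS query for every packet to the same destination, which can be slow on a large trace. One alternative, sketched here with a hypothetical helper `annotate_names` and a stand-in resolver so that it runs offline, is to look up each distinct address once and then map the results back onto the rows.

```python
import pandas as pd


def annotate_names(df, lookup):
    """Add a name_dst column, calling lookup once per distinct ip_dst."""
    names = {ip: lookup(ip) for ip in df["ip_dst"].unique()}
    out = df.copy()
    out["name_dst"] = out["ip_dst"].map(names)
    return out


# Stand-in for a real resolver; the address-to-name mapping is invented.
calls = []


def fake_lookup(ip):
    calls.append(ip)
    return {"203.0.113.5": "www.example.com."}.get(ip, "N/A")


toy = pd.DataFrame(
    {"ip_dst": ["203.0.113.5", "10.0.0.2", "203.0.113.5", "203.0.113.5"]}
)
named = annotate_names(toy, fake_lookup)
print(named["name_dst"].tolist())
print(len(calls))   # 2 lookups for 4 rows
```

In the notebook, you would pass `reverse_lookup` in place of `fake_lookup`.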


## Functions

It is often useful to encapsulate functionality in functions so that we can use those functions again.

Write functions to count ("sum") the length field so that we can know how much total traffic in bytes is sent to each destination, either by IP address or by name.

In [24]:
def volume_stats_by_ip(pcap):
    return pcap.loc[:,['ip_dst','length']].groupby('ip_dst').sum().sort_values(by=['length'], ascending=False)


def volume_stats_by_name(pcap):
    return pcap.loc[:,['name_dst','length']].groupby('name_dst').sum().sort_values(by=['length'], ascending=False)

In [25]:
volume_stats_by_ip(pcap)

ip_dst,length
192.168.1.23,107974
128.135.164.125,28166
192.168.1.1,2591


In [26]:
volume_stats_by_name(pcap)

name_dst,length
N/A,110565
hnd.cs.uchicago.edu.,28166
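
The two functions above differ only in the column they group by, so one possible refactoring is to pass that column in as a parameter. The sketch below uses a hypothetical name, `volume_stats`, and made-up data.

```python
import pandas as pd


def volume_stats(pcap, key):
    """Total bytes per distinct value of `key`, largest first (hypothetical helper)."""
    return (
        pcap.groupby(key)["length"]
        .sum()
        .sort_values(ascending=False)
        .to_frame()
    )


# Made-up packets for illustration.
toy = pd.DataFrame({
    "ip_dst": ["10.0.0.2", "203.0.113.5", "10.0.0.2", "10.0.0.1"],
    "name_dst": ["N/A", "www.example.com.", "N/A", "N/A"],
    "length": [1514, 78, 1230, 79],
})

print(volume_stats(toy, "ip_dst"))
print(volume_stats(toy, "name_dst"))
```

Any other column, such as `protocol` or `port_dst`, can be used as the key in the same way.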


## Going Further

For homework, define some questions you want to ask about the network traffic trace and write some functions to analyze the trace.

Here are some example questions.  You can pick one of these or define one yourself:
* What are the maximum, median, minimum, and mean packet sizes?
* How many DNS queries (destination port 53) are there in this trace?
* What is the most popular DNS query in the trace?
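
As a starting point, here is how these questions might be approached on a small made-up trace. Two assumptions to keep in mind: the toy stores `dns_query` as plain strings, whereas the real trace above shows bytes values such as `b'www.cs.uchicago.edu.'`, and the DNS columns were dropped when we cut down the columns earlier, so you would need to reload the capture to use them.

```python
import pandas as pd

# Made-up packets: three DNS queries, one DNS response, one TCP packet.
toy = pd.DataFrame({
    "port_dst": [53, 50000, 443, 53, 53],
    "dns_query": ["www.example.com.", "www.example.com.", None,
                  "www.example.org.", "www.example.com."],
    "length": [79, 113, 78, 75, 79],
})

# Packet size summary.
print(toy["length"].agg(["max", "median", "min", "mean"]))

# DNS queries are the packets sent to destination port 53.
queries = toy[toy["port_dst"] == 53]
print(len(queries))   # 3

# Most popular query name among those packets.
most_popular = queries["dns_query"].value_counts().idxmax()
print(most_popular)   # www.example.com.
```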