# Basic Analysis of Network Traffic Traces

In this laboratory, we will explore the basics of network traffic capture. 

## Learning Objectives

By the end of this lab, you should understand the following:

* How to capture a network traffic trace.
* What the following fields in the trace mean: (1) IP Address; (2) MAC Address; (3) Length; (4) DNS queries and responses.

## Setup

Before we get started, you will need to install a tool to generate packet captures. The `pcaps` directory of this repository contains some example pcaps, but it is good for everyone to become familiar with performing their own network traffic capture.

**Wireshark.** The fundamental data that we will use for analysis, in this laboratory and others, is a _network packet trace_, sometimes called a "pcap". [Wireshark](https://wireshark.org/) is a tool that we can use to capture and analyze network traffic data from the devices on a network.

### Warmup: Basic Wireshark Analysis

First, you should use Wireshark to collect a packet trace. Save the trace as a regular pcap (not pcapng) somewhere on your local machine. Note the location where you have saved the file, as we will be loading that file into the notebook later.

Using Wireshark, answer the following questions:
* How many packets are in the trace?
* What is the total volume of traffic in the trace?

Wireshark itself can easily answer these fairly straightforward questions. Doing more complicated analysis (and eventually machine learning) requires more sophisticated processing. For that, in this course, we will rely on Python, pandas, and scikit-learn.
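
As an aside on what a "regular pcap" contains, the two warmup questions can also be answered by reading the classic pcap file layout directly: a 24-byte global header followed by one 16-byte record header per packet. The sketch below is a minimal illustration using only the standard library. The helper names `pcap_summary` and `make_fake_pcap` are made up for this example, the synthetic packets are zero-filled, and the reader handles only the classic microsecond format (not pcapng).

```python
import io
import struct


def pcap_summary(stream):
    """Return (packet_count, total_bytes) for a classic pcap byte stream."""
    global_header = stream.read(24)
    if len(global_header) < 24:
        raise ValueError("truncated pcap global header")

    # The magic number tells us the byte order used by the capturing machine.
    magic = struct.unpack("<I", global_header[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"
    elif magic == 0xD4C3B2A1:
        endian = ">"
    else:
        raise ValueError("not a classic (microsecond) pcap file")

    count, total = 0, 0
    while True:
        record_header = stream.read(16)
        if len(record_header) < 16:
            break
        # ts_sec, ts_usec, captured length, original (on-the-wire) length
        _ts_sec, _ts_usec, incl_len, orig_len = struct.unpack(
            endian + "IIII", record_header
        )
        stream.read(incl_len)  # skip the captured packet bytes
        count += 1
        total += orig_len
    return count, total


def make_fake_pcap(lengths):
    """Build an in-memory little-endian pcap with zero-filled packets (illustration only)."""
    header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    records = b"".join(
        struct.pack("<IIII", 0, 0, n, n) + bytes(n) for n in lengths
    )
    return io.BytesIO(header + records)


print(pcap_summary(make_fake_pcap([60, 1291, 1291])))  # (3, 2642)
```

For a capture on disk, the same function could be called on a file opened in binary mode, for example `with open(path, "rb") as f: pcap_summary(f)`, where `path` is wherever you saved your trace.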

## Analyzing Packet Captures in Python

We will now load the packet capture you have generated into Python, specifically into pandas, a data analysis library that will allow us to ask more complex questions.

In [1]:
import pandas as pd
from datetime import datetime, timezone

# Allow us to load modules from the parent directory
import sys
sys.path.append("../lib") 
from parse_pcap import pcap_to_pandas, send_rates

# Insert your own packet capture here.

#pcap = pcap_to_pandas('/tmp/example-20200523.pcap') 
pcap = pcap_to_pandas('/Users/feamster/Downloads/uchicagocs-web-20200714.pcap') 
pcap.head(n=4)


### Basic Dataframe Statistics

You can use the `shape` attribute to discover how many rows and columns exist in your dataset and the `columns` attribute to get a list of column headers.

In [12]:
print(pcap.shape)
print(pcap.columns)

(3589, 18)
Index(['datetime', 'dns_query', 'dns_resp', 'ip_dst', 'ip_dst_int', 'ip_src',
       'ip_src_int', 'is_dns', 'length', 'mac_dst', 'mac_dst_int', 'mac_src',
       'mac_src_int', 'port_dst', 'port_src', 'protocol', 'time',
       'time_normed'],
      dtype='object')
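
The column list includes integer variants of the address fields (`ip_src_int`, `mac_src_int`, and so on). How `parse_pcap` computes them is not shown here, so the following is only a sketch of one common encoding: an IPv4 address read as a 32-bit number and a MAC address read as a 48-bit number. The helper names `ip_to_int` and `mac_to_int` are hypothetical.

```python
import ipaddress


def ip_to_int(ip):
    """Interpret a dotted-quad IPv4 address as a 32-bit integer."""
    return int(ipaddress.ip_address(ip))


def mac_to_int(mac):
    """Interpret a colon-separated MAC address as a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


print(ip_to_int("192.168.1.13"))        # 3232235789
print(mac_to_int("00:00:00:00:01:0a"))  # 266
```

Integer forms like these are convenient as numeric features later on, whereas the string forms are easier to read.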


### Slicing and Sub-Selecting Data

pandas lets us select a subset of columns with `.loc`. Let's use it to cut our list of columns down to the ones we want to analyze further.

In [13]:
pcap = pcap.loc[:,['datetime','dns_query','dns_resp','ip_src','ip_dst',
                   'is_dns','length','port_src','port_dst','protocol']]

### Conditional Slicing

You can slice a dataframe based on conditionals.  Here we select only the rows whose source IP address corresponds to a certain value.

In [5]:
pcap[pcap['ip_src'] == '192.168.1.13'].head(2)

| | datetime | dns_query | dns_resp | ip_src | ip_dst | is_dns | length | port_src | port_dst | protocol |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-05-23 21:47:42 | | | 192.168.1.13 | 204.80.104.218 | False | 1291 | 54012 | 8801 | UDP |
| 1 | 2020-05-23 21:47:42 | | | 192.168.1.13 | 204.80.104.218 | False | 1291 | 54012 | 8801 | UDP |
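
Conditions can also be combined. The sketch below uses a small made-up dataframe (`toy`, not real capture data) with the same column names as `pcap` to show `&` for "and" and `isin` for matching any value in a list.

```python
import pandas as pd

# Toy stand-in for the `pcap` dataframe; the values are invented for illustration.
toy = pd.DataFrame({
    "ip_src":   ["192.168.1.13", "192.168.1.13", "204.80.104.218", "192.168.1.1"],
    "ip_dst":   ["204.80.104.218", "192.168.1.1", "192.168.1.13", "192.168.1.13"],
    "protocol": ["UDP", "UDP", "UDP", "TCP"],
    "length":   [1291, 80, 1100, 60],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
large_from_host = toy[(toy["ip_src"] == "192.168.1.13") & (toy["length"] > 1000)]

# isin() keeps rows whose value matches any entry in the list.
to_local = toy[toy["ip_dst"].isin(["192.168.1.1", "192.168.1.13"])]

print(len(large_from_host), len(to_local))  # 1 3
```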


## Basic Analysis of Traffic Using Pandas

### List of Unique Destination IP Addresses

Which unique destinations is our network communicating with? We can use the `unique` method to retrieve them.

In [6]:
unique_dst_ip = pd.DataFrame(pcap['ip_dst'].unique())[0]
print(unique_dst_ip)

0      204.80.104.218
1        192.168.1.13
2         192.168.1.1
3        54.82.161.19
4       34.203.91.157
5       172.217.4.110
6        172.217.4.78
7        157.240.2.53
8         224.0.0.251
9     108.177.111.189
10      18.211.133.65
11     198.252.206.25
12        192.168.1.4
13               None
14        192.168.1.6
15      140.82.112.25
16       192.168.1.10
17      172.217.4.234
18        3.80.20.191
Name: 0, dtype: object
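
The list above mixes several kinds of addresses: the `192.168.1.x` entries are private addresses on the local network, `224.0.0.251` is a multicast address (the one used by multicast DNS), and the rest are public Internet hosts. As a rough sketch, the standard-library `ipaddress` module can make that distinction; the function name `classify_ip` and its labels are chosen here for illustration.

```python
import ipaddress


def classify_ip(ip):
    """Label an address as 'none', 'multicast', 'private', or 'public'."""
    if ip is None:
        return "none"  # rows without a destination IP address
    addr = ipaddress.ip_address(ip)
    if addr.is_multicast:
        return "multicast"
    if addr.is_private:
        return "private"
    return "public"


for ip in ["192.168.1.13", "224.0.0.251", "204.80.104.218", None]:
    print(ip, classify_ip(ip))
```

Applied to the dataframe, this could look like `pcap['ip_dst'].apply(classify_ip)`.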


### Most Popular Destination IP Addresses

We can use `groupby`, `sum`, and `sort_values` to determine the most popular destination IP addresses, measured by the total number of bytes sent to each.

In [7]:
# Keep only the columns needed for the sum; summing 'datetime' is not meaningful.
pkts_dst = pcap.loc[:,['ip_dst','length']]
pkts_dst.groupby(['ip_dst']).sum().sort_values(by='length',ascending=False)

| ip_dst | length |
|---|---|
| 192.168.1.13 | 1409869 |
| 204.80.104.218 | 764918 |
| 172.217.4.110 | 2403 |
| 172.217.4.78 | 1915 |
| 224.0.0.251 | 1378 |
| 34.203.91.157 | 888 |
| 192.168.1.1 | 859 |
| 54.82.161.19 | 690 |
| 18.211.133.65 | 690 |
| 192.168.1.6 | 484 |
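
"Most popular" can mean most bytes or most packets, and the two rankings can differ. The following sketch, on a small made-up dataframe (`toy`, with documentation-range addresses), contrasts summing `length` with counting rows via `value_counts`.

```python
import pandas as pd

# Invented data: one large packet to one host, three small packets to another.
toy = pd.DataFrame({
    "ip_dst": ["192.0.2.1", "192.0.2.2", "192.0.2.2", "192.0.2.2"],
    "length": [1500, 100, 100, 100],
})

# Ranking by bytes: total length per destination.
bytes_per_dst = toy.groupby("ip_dst")["length"].sum().sort_values(ascending=False)

# Ranking by packets: number of rows per destination.
packets_per_dst = toy["ip_dst"].value_counts()

print(bytes_per_dst.idxmax())    # 192.0.2.1
print(packets_per_dst.idxmax())  # 192.0.2.2
```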


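IP addresses alone are hard to interpret, so next we define a reverse DNS lookup function that maps an IP address to a hostname via its PTR record. We first test the lookup on individual addresses.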

In [8]:
from dns import resolver
from dns import reversename

# test reverse DNS lookup
addr = reversename.from_address('34.193.201.2')
print(resolver.query(addr, "PTR")[0])

ec2-34-193-201-2.compute-1.amazonaws.com.


In [9]:
# test reverse DNS lookup
addr = reversename.from_address('204.80.104.218')
print(resolver.query(addr, "PTR")[0])

zoomnye218mmr.zoom.us.


In [10]:
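def reverse_lookup(ip):
    # Some rows have no destination IP address (shown as None above).
    if str(ip) == 'None':
        return 'None'
    try:
        # Build the reverse-lookup name and ask for its PTR record.
        # Note: dnspython 2.x deprecates resolver.query() in favor of resolver.resolve().
        addr = reversename.from_address(ip)
        return str(resolver.query(addr, "PTR")[0])
    except Exception:
        # No PTR record, a timeout, or a malformed address.
        return 'N/A'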

### Apply a Function to an Entire Dataframe

Use the pandas `apply` function to create a new column with the DNS names associated with each destination.

Then look at the unique destination names in the trace.

In [11]:
pcap['name_dst'] = pcap['ip_dst'].apply(reverse_lookup)
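
Applying the lookup to every row issues one DNS query per packet, even though the trace contains only a handful of unique destinations. One possible way to reduce that, sketched below, is to cache results with `functools.lru_cache`. To keep the example runnable offline, `fake_reverse_lookup` is a made-up stand-in for `reverse_lookup` that records its calls instead of querying DNS.

```python
from functools import lru_cache

calls = []  # records every address that reaches the underlying lookup


def fake_reverse_lookup(ip):
    """Hypothetical stand-in for reverse_lookup; performs no network access."""
    calls.append(ip)
    return "None" if ip is None else f"host-for-{ip}"


# Wrap the lookup so that each distinct address is resolved only once.
cached_lookup = lru_cache(maxsize=None)(fake_reverse_lookup)

ips = ["192.168.1.13", "204.80.104.218", "192.168.1.13", "192.168.1.13", None]
names = [cached_lookup(ip) for ip in ips]

print(len(ips), "rows,", len(calls), "lookups")  # 5 rows, 3 lookups
```

With the real function, the equivalent would be to wrap `reverse_lookup` the same way and pass the wrapped function to `apply`.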

In [12]:
unique_dst_name = pd.DataFrame(pcap['name_dst'].unique())[0]
print(unique_dst_name)

0                         zoomnye218mmr.zoom.us.
1                                            N/A
2      ec2-54-82-161-19.compute-1.amazonaws.com.
3     ec2-34-203-91-157.compute-1.amazonaws.com.
4                     ord36s04-in-f14.1e100.net.
5                     ord37s18-in-f14.1e100.net.
6            whatsapp-cdn-shv-01-ort2.fbcdn.net.
7     ec2-18-211-133-65.compute-1.amazonaws.com.
8                             stackoverflow.com.
9                                           None
10              lb-140-82-112-25-iad.github.com.
11                   ord30s31-in-f234.1e100.net.
12      ec2-3-80-20-191.compute-1.amazonaws.com.
Name: 0, dtype: object


## Functions

It is often useful to encapsulate functionality in functions so that we can use those functions again.

Write functions to count ("sum") the length field so that we can know how much total traffic in bytes is sent to each destination, either by IP address or by name.

In [13]:
def volume_stats_by_ip(pcap):
    return pcap.loc[:,['ip_dst','length']].groupby('ip_dst').sum().sort_values(by=['length'], ascending=False)


def volume_stats_by_name(pcap):
    return pcap.loc[:,['name_dst','length']].groupby('name_dst').sum().sort_values(by=['length'], ascending=False)

In [14]:
volume_stats_by_ip(pcap)

| ip_dst | length |
|---|---|
| 192.168.1.13 | 1409869 |
| 204.80.104.218 | 764918 |
| 172.217.4.110 | 2403 |
| 172.217.4.78 | 1915 |
| 224.0.0.251 | 1378 |
| 34.203.91.157 | 888 |
| 192.168.1.1 | 859 |
| 54.82.161.19 | 690 |
| 18.211.133.65 | 690 |
| 192.168.1.6 | 484 |


In [15]:
volume_stats_by_name(pcap)

| name_dst | length |
|---|---|
| | 1413206 |
| zoomnye218mmr.zoom.us. | 764918 |
| ord36s04-in-f14.1e100.net. | 2403 |
| ord37s18-in-f14.1e100.net. | 1915 |
| ec2-34-203-91-157.compute-1.amazonaws.com. | 888 |
| ec2-18-211-133-65.compute-1.amazonaws.com. | 690 |
| ec2-54-82-161-19.compute-1.amazonaws.com. | 690 |
| | 295 |
| whatsapp-cdn-shv-01-ort2.fbcdn.net. | 163 |
| lb-140-82-112-25-iad.github.com. | 160 |
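
The two functions differ only in the column they group by, so they could be folded into one function that takes the column name as a parameter. This is a sketch of that idea rather than part of the lab's library; `volume_stats` and the toy dataframe are invented for illustration.

```python
import pandas as pd


def volume_stats(df, by):
    """Total bytes per value of column `by`, largest first."""
    return (df.loc[:, [by, "length"]]
              .groupby(by)
              .sum()
              .sort_values(by="length", ascending=False))


# Made-up rows with the same column names as the lab's dataframe.
toy = pd.DataFrame({
    "ip_dst":   ["192.0.2.1", "192.0.2.2", "192.0.2.1"],
    "name_dst": ["a.example.", "b.example.", "a.example."],
    "length":   [100, 500, 150],
})

print(volume_stats(toy, "ip_dst"))
print(volume_stats(toy, "name_dst"))
```

Note that `groupby` drops rows whose key is missing by default, so packets without a destination IP address do not appear in the per-IP totals.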


## Going Further

For homework, define some questions you want to ask about the network traffic trace and write some functions to analyze the trace.

Here are some example questions.  You can pick one of these or define one yourself:
* What are the maximum, median, minimum, and mean packet sizes?
* How many DNS queries (destination port 53) are there in this trace?
* What is the most popular DNS query in the trace?
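
As a possible starting point for the first two questions, the sketch below works on a small made-up dataframe (`toy`) that reuses the `length` and `port_dst` column names. It shows the general shape of an answer, not the result for your own trace.

```python
import pandas as pd

# Invented packets; replace `toy` with your own `pcap` dataframe.
toy = pd.DataFrame({
    "length":   [60, 80, 100, 1500],
    "port_dst": [53, 443, 53, 8801],
})

# Several summary statistics of the packet size in one call.
size_stats = toy["length"].agg(["min", "median", "mean", "max"])

# A boolean comparison sums to the number of matching rows.
dns_packets = int((toy["port_dst"] == 53).sum())

print(size_stats)
print(dns_packets)  # 2
```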