Workshop on IoT Networking - Data Analysis Track  - Version 0.8

Network Traffic Analysis Using Python/Jupyter

Guilherme G. Martins - gmartins uchicago edu 2020

Other tracks:
- IoT Networking - Data Analysis - Using Python to Analyze IoT Network Traffic  
- IoT Networking - Data Collection - Using Single Board Computers to Collect and Monitor your Network Traffic. (TBD)
- IoT Survey on open source components and IoT building blocks. (TBD)

Requisites:
- Mac/Windows/Linux OS with terminal
- Python3/jupyter notebook/virtualenv+pip (https://jupyter.org/install)
- Wireshark Application (https://www.wireshark.org/#download)
- *bash* and *wget* via terminal
- IoTLab's geoip API (Use it inside the IoT Lab, network 192.168.XXX.0/24)

Motivation:

It is no secret that the proliferation of connected devices poses challenges from both security and privacy standpoints. Your home network used to be a safe place with a handful of well-known devices. Now it is hard even to keep track of the total number of connected devices: temperature sensors, cameras, smart toys, and refrigerators, just to name a few. Multiple technologies enable these devices to communicate and interact with each other; Bluetooth, Zigbee, and Near Field Communication (NFC) are just a few examples of communication protocols. But to use the full set of features provided by your IoT devices and applications, in most cases an internet connection is required so that the device can send data to and receive data from the cloud, potentially compromising security and privacy. Beyond just hoping that IoT designers and operators are doing the right thing by keeping both the backend and the IoT software secure, there are a few concepts, tools, and techniques that can be used to expose how these devices operate. The goal of this exercise is to explore some of these techniques.

By the end of this exercise you should know:
- How to decode a network traffic capture file (pcap) into csv (comma-separated values);
- How to identify network packets from a specific device in your network;
- How to visualize the TCP/UDP endpoints for all the established external connections;
- How to correlate activity and interaction with the IoT devices with the volume of sent and received data.

References:

https://en.wikipedia.org/wiki/Ethernet_frame <br />
https://en.wikipedia.org/wiki/MAC_address <br />
https://en.wikipedia.org/wiki/IPv4 <br />
https://en.wikipedia.org/wiki/Internet_Protocol <br />
https://en.wikipedia.org/wiki/Transmission_Control_Protocol <br />
<Insert new Nick's book title here>

### 1. Testing Requirements

In [None]:
# we'll be using bash to run the wget and tshark transformation scripts
!/bin/bash --version

In [None]:
#wget to easily download the datasets and files from jupyter
!wget --version

In [None]:
# Make sure the terminal command 'tshark' is in your PATH environment variable and ready to be used.
# We'll use tshark to extract .csv data from the .pcap (packet capture format) so we can run the analysis
path=%env PATH
%env PATH=$path:/Applications/Wireshark.app/Contents/MacOS/
!tshark --version

### 2. Downloading and Preparing the Data

In [None]:
# To analyze IoT devices you need to link information from multiple sources.
# The very first step is to look at the MAC addresses and translate the first 3 octets
# into the manufacturer ID. (Keep in mind that MAC addresses can be cloned or simply
# set to any arbitrary address by malicious code running with root privileges
# in the IoT firmware.)
# The MAC address resolution can be done using the fields eth.dst_resolved and
# eth.src_resolved while extracting csv from pcap (or even by enabling MAC address
# resolution in the Wireshark GUI), but here we want to understand how to link this
# information from a reliable source without relying on an external application.
# https://en.wikipedia.org/wiki/MAC_address
# https://en.wikipedia.org/wiki/Organizationally_unique_identifier
ouiurl="http://standards-oui.ieee.org/oui/oui.txt" 
#ouiurl="https://linuxnet.ca/ieee/oui.txt" #sanitized version of oui dataset
!if [ ! -f 'oui.txt' ]; then wget $ouiurl; else echo "INFO: file present"; fi

In [None]:
##
# use any pcap file here
pcapfile = "camera1.pcap"
iotlaburl = "http://192.168.143.1/" # don't forget the "/" at the end

In [None]:
##
# let's work with a sample pcap file for now
# later you will use Wireshark to capture traffic from your own laptop
pcapurl = iotlaburl+"iotlab/pcapsample/"+pcapfile # 3 days of packet capture
!if [ ! -f $pcapfile ]; then wget $pcapurl; else echo "INFO: file present"; fi

In [None]:
!tail -n 20 oui.txt

In [None]:
##
# If you have any trouble importing pandas or any other package
# there are a few things you can do, one is to install conda or virtualenv
# 
# UNCOMMENT and run below to install the dependencies using virtualenv:

#!pip3 install virtualenv # install virtual environment
#!virtualenv venv # create venv folder to hold all the packages
#!source venv/bin/activate; pip3 install -r requirements.txt #<--- all the dependencies for this notebook
#import sys
#sys.path.append("venv/lib/python3.7/site-packages")

import re
import pandas as pd
import requests
from urllib.parse import urljoin

In [None]:
def generate_oui_dataframe():
    with open('oui.txt','r') as f:
        ouilines = f.readlines()
        p = re.compile("^(..-..-..).*\t\t(.*)") # Extract mac prefix 44-4A-DB and Manufacturer 
        macoui=[] # Organizational Unique Identifier OUI eg 44:4a:db
        macman=[] # manufacturer eg "Apple, Inc."
        for line in ouilines:
            r = p.match(line)
            if r is not None:
                try:
                    r1=r.group(1).replace("-",":").lower()
                    r2=r.group(2)
                except IndexError as ie:
                    print("WARN: generate_oui_dataframe regex - " + str(ie))
                    continue
                macoui.append(r1)
                macman.append(r2)
        df=pd.DataFrame({'macoui':macoui, 'macman':macman})
    return df

ouidf=generate_oui_dataframe()
ouidf
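
The prefix lookup that the OUI table enables can be sketched on a single MAC address. This is a minimal illustration, not the notebook's own method: `lookup_manufacturer` is a hypothetical helper, and the two-entry table is sample data (one entry reuses the example values from the comments above, the other is made up) rather than the contents of `oui.txt`.

```python
# Sample OUI table: prefix -> manufacturer (illustrative values only)
sample_oui = {
    '44:4a:db': 'Apple, Inc.',              # example values from the comments above
    'aa:bb:cc': 'Example Vendor (made up)',  # hypothetical entry
}

def lookup_manufacturer(mac, oui_table):
    """Return the manufacturer for a MAC address, or 'unknown' if its prefix is not listed."""
    # Normalize to the same format as the table: lowercase, colon separated.
    normalized = mac.lower().replace('-', ':')
    prefix = normalized[:8]  # first 3 octets, e.g. '44:4a:db'
    return oui_table.get(prefix, 'unknown')

print(lookup_manufacturer('44-4A-DB-01-02-03', sample_oui))
print(lookup_manufacturer('ff:ff:ff:00:00:00', sample_oui))
```

With the real data, the same dictionary could be built from the dataframe with `dict(zip(ouidf['macoui'], ouidf['macman']))`.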

In [None]:
##
# we will be looking mostly at tcp / udp connections
# keep in mind that data can also travel encapsulated or crafted
# inside other protocols and packet fields. We commonly rely on
# standard IPS/IDS signatures for detecting these cases,
# as they are likely to deviate from the "normal" usage
#
# please note that the csv can expand to 2x or 3x the size of
# the original pcap

!tshark -r $pcapfile -T fields -e frame.number -e frame.time \
-e frame.time_delta -e frame.time_relative -e eth.src_resolved  \
-e eth.dst_resolved -e eth.src -e eth.dst -e eth.type -e ip.version \
-e ip.hdr_len -e ip.len -e ip.id -e ip.flags.df -e ip.flags.mf \
-e ip.flags.rb -e ip.flags.sf -e ip.dsfield.dscp -e ip.dsfield.ecn \
-e ip.tos -e ip.ttl -e ip.proto -e ip.src -e ip.dst -e udp.srcport \
-e udp.dstport -e udp.length -e tcp.flags.cwr -e tcp.flags.ecn \
-e tcp.flags.urg -e tcp.flags.ack -e tcp.flags.push -e tcp.flags.reset \
-e tcp.flags.syn -e tcp.flags.fin -e tcp.flags.res -e tcp.flags.ns \
-e tcp.payload -e tcp.len -e frame.len -E header=y -E separator=\| > pcapfile.csv

In [None]:
dfpcap1 = pd.read_csv("pcapfile.csv", delimiter="|")
dfpcap1.head()

In [None]:
###
# based on each TCP connection, there's an association
# between your local device and an external ip address
# ip_mac will hold that mapping to use later on
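ip_mac=dfpcap1[['ip.dst','eth.src']].dropna().drop_duplicates()
ouiman=[]
for index, row in ip_mac.iterrows():
    # look up the manufacturer by the first 3 octets (OUI) of the MAC address
    match=ouidf[ouidf['macoui']==row['eth.src'][:8]]['macman']
    ouiman.append(match.iloc[0] if not match.empty else 'unknown')
ip_mac['ouiman']=ouiman
ip_mac.head(10)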

In [None]:
# We want to say, for example:
# my Apple device (a4:83:e7) connected to an ip address
# in Ireland while I was interacting with it.
# The ip address is managed by the XYZ ISP or CDN.

###
# count frames (packets) for each internet connection destination (end-point)
# rename ip.src to ip.dst; this is a small hack to allow us to join dfs.
ip_dir='ip.src' # download direction, for upload use ip.dst
frames_per_dest=dfpcap1.loc[~dfpcap1[ip_dir].str.startswith('192.168.', na=False)].\
    groupby(ip_dir)['frame.number'].nunique().reset_index().rename(columns={'ip.src':'ip.dst'})
#frames_per_dest.head(10)

##
# join with ip_mac to see which device is the local IoT endpoint
# we want to know the exact ouiman (eg, Apple, Inc.) involved in the
# connection
connection=frames_per_dest.set_index('ip.dst').join(ip_mac.set_index('ip.dst')).\
    sort_values(by=['frame.number'], ascending=False).reset_index()
top5loc=connection.head(5) #we are only processing the top talkers here
top5loc                    #ideally we want to account for all of the connections

In [None]:
##
# We will use the IoT Lab API to help with geoip mapping
# run this cell connected to the IoT Lab Wifi
# this should print "OK" meaning that our local API is up and running
r = requests.request("GET", iotlaburl)
r.text


##### At this point you should have all the data in the right format (dataframe) to start plotting some interesting graphics:
- oui dataset
- pcap file
- local API access
- connection list with merged data from oui 


### 3. Plotting TCP/UDP End-points Using Geo Location of IP addresses

GeoIP locations usually provide a very good source of information
to help investigate the source and destination of IoT communications.
We're going to use a wrapper API around the MaxMind GeoIP dataset. It's not
100% accurate at finding the exact location, but it's
reliable when it comes to identifying the organizations that manage
these IP destinations. For more information: https://www.maxmind.com/en/geoip-demo
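
Because geoip data is not always complete, it can help to normalize each API response before storing it. The sketch below assumes the response is a JSON object with the keys `isp`, `city`, `country`, `lat` and `lng` (the keys this notebook reads from the lab API); `parse_geoip` is a hypothetical helper and the sample payload is invented for illustration.

```python
def parse_geoip(payload):
    """Normalize one geoip JSON payload (a dict) into a flat record."""
    def to_float(value):
        # Coordinates may arrive as strings, or be missing altogether.
        try:
            return float(value)
        except (TypeError, ValueError):
            return None

    return {
        'isp': payload.get('isp') or 'unknown',
        'city': payload.get('city') or 'unknown',
        'country': payload.get('country') or 'unknown',
        'lat': to_float(payload.get('lat')),
        'lng': to_float(payload.get('lng')),
    }

# Invented sample payload: the city is empty, as often happens with CDN addresses.
sample = {'isp': 'Example ISP', 'city': '', 'country': 'US',
          'lat': '41.795', 'lng': '-87.60'}
record = parse_geoip(sample)
print(record)
```

Records with `lat` or `lng` set to `None` can then be skipped when plotting instead of raising an error in the middle of a long lookup loop.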

In [None]:
###
# If you have any trouble importing the libraries below
# try running the pip installation using the requirements.txt
# files along with the virtual environment activation.
# The commands provided earlier should be all you need.

from bokeh.io import output_file, show, save
from bokeh.models import GeoJSONDataSource
from bokeh.plotting import figure
from bokeh.sampledata.sample_geojson import geojson
from bokeh.tile_providers import get_provider, Vendors
from bokeh.models import ColumnDataSource, HoverTool
import json
from IPython.display import IFrame
import math

In [None]:
###
# This can take a long time if the pcap is larger than 200MB - be prepared.
# Alternatively, you may want to purchase the entire MaxMind dataset.
#
# What to expect: known names in the ISP field, like google, amazon, fastly
# Not all ips have a clear location; sometimes CDNs colocate ips
# to make content delivery accessible at the edge with low latency.
isp=[]
city=[]
country=[]
lat=[]
lng=[]
color=[]
for index, row in connection.iterrows():
    url=urljoin(iotlaburl, "geoip/"+row['ip.dst']) # iotlaburl already ends with "/"
    r = requests.request("GET", url)
    j = r.json()
    isp.append(j['isp'])
    city.append(j['city'])
    country.append(j['country'])
    lat.append(float(j['lat']))
    lng.append(float(j['lng']))
    color.append('green') # TODO: anomaly detection here, mark red
    #print(r.json())
connection['isp']=isp
connection['city']=city
connection['country']=country
connection['lat']=lat
connection['lng']=lng
connection['color']=color
connection.head(5)

In [None]:
coords=[]
for index, row in connection.iterrows():
    d=row.to_dict()
    coords.append(d)

In [None]:
# adapted from
# https://towardsdatascience.com/
# exploring-and-visualizing-chicago-transit-data-using-pandas-and-bokeh-part-ii-intro-to-bokeh-5dca6c5ced10
def merc(Coords):
    """Convert a dict with 'lat' and 'lng' (degrees) to Web Mercator (x, y) in meters."""
    lat = Coords['lat']
    lng = Coords['lng']

    r_major = 6378137.000  # WGS84 semi-major axis in meters
    x = r_major * math.radians(lng)
    # scaling by r_major directly also works for lng == 0
    y = r_major * math.log(math.tan(math.pi/4.0 + math.radians(lat)/2.0))
    return (x, y)
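
The projection can be sanity-checked with a round trip. This is a minimal sketch of the spherical Web Mercator formulas; `to_web_mercator` and `from_web_mercator` are hypothetical names used only for this illustration, and the test point is the Hyde Park coordinate used in this notebook.

```python
import math

R_MAJOR = 6378137.0  # WGS84 semi-major axis in meters

def to_web_mercator(lat, lng):
    """Project latitude/longitude in degrees to Web Mercator x/y in meters."""
    x = R_MAJOR * math.radians(lng)
    y = R_MAJOR * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y

def from_web_mercator(x, y):
    """Inverse projection, back to latitude/longitude in degrees."""
    lng = math.degrees(x / R_MAJOR)
    lat = math.degrees(2.0 * math.atan(math.exp(y / R_MAJOR)) - math.pi / 2.0)
    return lat, lng

x, y = to_web_mercator(41.795, -87.60)
lat, lng = from_web_mercator(x, y)
print(x, y, lat, lng)
```

Note that the projection is undefined at the poles (latitude of exactly +90 or -90 degrees), so such coordinates should be filtered out before plotting.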

In [None]:
# Where are you now? (this is optional, just to plot some cool lines)
origin = {'lat':41.795,'lng':-87.60} #Chicago, Hyde Park, US

In [None]:
def map_plot(coords, origin = None, title = None):
    """Plot multiple points in a world map using array of dict coordinates.

    Keyword arguments:
    coords -- array of dict objects containing lat,lng,color 
              eg. [{lat: 41.795, lng: -87.60, color: 'green'}]
    origin -- coordinates for your origin location (default None)
    """
    
    if origin:
        o = merc(origin)
        x_origin = [o[0]]
        y_origin = [o[1]]

    tile_provider = get_provider(Vendors.CARTODBPOSITRON)

    # TODO: enable the hover tool to display ISP information
    #hover = HoverTool(tooltips=[
    #    ("station", "@stationname"),
    #    ("ridership","@ridership")
    #])

    #source = ColumnDataSource(data=dict(
    #                        x=list(Merged['coords_x']), 
    #                        y=list(Merged['coords_y']),
    #                        ridership=list(Merged['monthtotal']),
    #                        sizes=list(Merged['circle_sizes']),
    #                        stationname=list(Merged['STATION_NAME'])))

    p = figure(x_range=(-18780000, 18000000), y_range=(-1000000, 7000000),
               x_axis_type="mercator", y_axis_type="mercator", 
               plot_width = 980, plot_height = 500, title = title)
               #tools=[hover, '','wheel_zoom','save']
    
    p.add_tile(tile_provider)

    if origin:
        p.circle(x=x_origin, y=y_origin, size=10, color="black")

    ###
    # plot multiple lines coming from the origin to dest. coordinates
    #
    if origin:
        for coord in coords:
            c = merc(coord)
            c_x = [c[0]]
            c_y = [c[1]]
            #print(c)
            #print(coord['color'])
            # one line per call: a flat list of x values and a flat list of y values
            p.multi_line(xs=[[x_origin[0], c_x[0]]], 
                         ys=[[y_origin[0], c_y[0]]],
                         color=[coord['color']],
                         line_width=2)
            p.circle(x=c_x, y=c_y, size=10, color=coord['color'])
    ###
    # only plot point coordinates
    #
    else:
        for coord in coords:
            c = merc(coord)
            c_x = [c[0]]
            c_y = [c[1]]
            p.circle(x=c_x, y=c_y, size=10, color=coord['color'])

    output_file("tile1.html")
    save(p)
    return IFrame(src='./tile1.html', width=1024, height=500)
    

map_plot (coords, origin, title='TCP/UDP End Points')


In [None]:
connection.groupby('city')['isp'].nunique()

In [None]:
##
# anything in Europe or Asia?
connection[connection['lng']>0]

### 4. Visualizing IoT Network Traffic Over Time

Here we want to inspect the TCP/UDP connections established by the device, looking at the volume of data sent and received over the duration of the packet capture. A good exercise would be to start and stop a video session using an IoT smart camera and a phone: stop for 1 or 2 minutes, then restart the streaming. The traffic capture must be started before the beginning of the video streaming and stopped after the end of all sessions.

For a device like a camera, the volume of traffic is expected to increase abruptly when streaming starts and to stay at a steady, higher level than in the idle state for the whole video streaming session. If the camera isn't being used, minimal or no traffic is expected. Please keep in mind that smart cameras might even avoid sending data if no movement is detected. Also, other types of IoT devices behave differently, and it won't always be clear what the traffic pattern of your smart refrigerator should look like.
 
If the packet capture only contains a few minutes of traffic for 1 or 2 devices, it isn't really a problem to analyze. But a dataset with multiple devices becomes really challenging to visualize, so better tools and methods are required. Not to mention that this is only a tiny snapshot in time. As you read this, tons of data are being generated, to the point where only an automated, systematic approach to exposing anomalies can make a difference.
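
The idea of "volume of data over time" can be sketched on a tiny synthetic capture before applying it to the real one. The data below is invented (a local camera at 192.168.1.10 talking to a documentation-range address), and the integer `second` column is a simplification assumed here for illustration; the notebook itself bins on a timedelta column.

```python
import numpy as np
import pandas as pd

# Synthetic capture: four frames between a local camera and one external endpoint
frames = pd.DataFrame({
    'frame.time_relative': [0.2, 0.7, 1.4, 3.1],
    'ip.src': ['192.168.1.10', '192.168.1.10', '203.0.113.5', '192.168.1.10'],
    'ip.dst': ['203.0.113.5', '203.0.113.5', '192.168.1.10', '203.0.113.5'],
    'frame.len': [100, 200, 60, 1500],
})

# Assign every frame to a 1-second bin (rounded up to the next whole second)
frames['second'] = np.ceil(frames['frame.time_relative']).astype(int)

# Sum the bytes per direction and per second
per_second = (frames
              .groupby(['ip.src', 'ip.dst', 'second'])['frame.len']
              .sum()
              .reset_index())

upload = per_second[per_second['ip.src'] == '192.168.1.10']
download = per_second[per_second['ip.dst'] == '192.168.1.10']
print(upload)
print(download)
```

Plotting `frame.len` against `second` for each direction gives the kind of upload/download curve described above, where a streaming session shows up as a sustained plateau.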


In [None]:
import numpy as np
from matplotlib import pyplot as plt

In [None]:
###
# For analyzing or processing network traffic, and precisely 
# look at the content of each TCP/UDP connection, we usually
# group packets by a 5-tuple unique identification of each connection.
# The 5-tuple contains ip.src, ip.dst, tcp.dstport, tcp.srcport and ip.proto.
# Here we'll use a simplified grouping of packets considering only
# ip src and dst addresses.

##
# Get the length in time_relative of the full pcap and create time bins
bins = list(range(0,int(np.ceil(dfpcap1['frame.time_relative'].max())+1)))
bins = [float(x) for x in bins]

labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels

# Add delta (elapsed time rounded up to the next whole second)
dfpcap1['delta']=pd.to_timedelta(np.ceil(dfpcap1['frame.time_relative']), unit="s")

# Sorted in ascending order, so delta increases over time
df_dw = dfpcap1.groupby(['ip.src', 'ip.dst',\
                         pd.Grouper(key='delta', freq='1s')])['frame.len'].sum()\
                         .reset_index().sort_values(by=['delta'], ascending=True)
df_up = dfpcap1.groupby(['ip.dst', 'ip.src',\
                         pd.Grouper(key='delta', freq='1s')])['frame.len'].sum()\
                         .reset_index().sort_values(by=['delta'], ascending=True)

df_conn=dfpcap1.groupby(['ip.src','ip.dst'])['frame.len'].sum()\
    .reset_index().rename(columns={'frame.len':'total_bytes'})\
    .sort_values(by=['total_bytes'], ascending=False)


print("INFO: Processing " + str(len(df_conn.index)) + "/2 connections.")

# Map each destination IP to the destination MAC seen on the wire.
# ASSUMPTION: ip_mac_map is not defined anywhere else in this notebook,
# so it is built here from the ip.dst and eth.dst columns.
ip_mac_map = dfpcap1[['ip.dst','eth.dst']].dropna().drop_duplicates()

visited = set()

for index, row in df_conn.iterrows():
    # skip the reverse flow: both directions are processed in a single pass
    if (row['ip.src'], row['ip.dst']) in visited:
        continue
    visited.add((row['ip.dst'], row['ip.src']))

    eth_src = eth_dst = "unknown"
    try:
        eth_src = ip_mac_map[ip_mac_map['ip.dst']==row['ip.src']]['eth.dst'].iloc[0]
        eth_dst = ip_mac_map[ip_mac_map['ip.dst']==row['ip.dst']]['eth.dst'].iloc[0]
    except IndexError:
        # no MAC address was seen for one of the ip addresses
        pass
    print("-- Connection " + row['ip.src'] + " " + eth_src +
          " <-> "+ row['ip.dst'] + " " + eth_dst)
    print(" Total Bytes: " + str(row['total_bytes']))
 
    ###
    # we can suppress tiny size connections, 
    # but what if it's malware c&c packet?
    if row['total_bytes'] < 500:
        continue
    
    ##
    # The mac resolution and mapping can be found in both directions.
    # The connection dataframe doesn't have complete information,
    # so one of the two lookups should fail.
    direction = ['ip.src', 'ip.dst']
    for dir in direction:
        try:
            print(" Device: " + connection[connection['ip.dst']==row[dir]]['ouiman'].iloc[0])
        except: pass

        try:
            print(" City: " + connection[connection['ip.dst']==row[dir]]['city'].iloc[0])
            print(" ISP: " + connection[connection['ip.dst']==row[dir]]['isp'].iloc[0])
            print(" LatLng: " + str(connection[connection['ip.dst']==row[dir]]['lat'].iloc[0])
                + " " + str(connection[connection['ip.dst']==row[dir]]['lng'].iloc[0]))
        except Exception as e:
            #print("FAILED for "+dir+" error:" +str(e))
            pass

    plt.figure(figsize=(30, 8))
    df_dw1=df_dw[(df_dw['ip.src']==row['ip.src']) & (df_dw['ip.dst']==row['ip.dst'])]
    df_up1=df_up[(df_up['ip.dst']==row['ip.src']) & (df_up['ip.src']==row['ip.dst'])]

    plt.plot(df_dw1['delta'], df_dw1['frame.len'], 'r-o', label = 'Download')
    plt.plot(df_up1['delta'], df_up1['frame.len'], 'b-o', label = 'Upload')

    plt.legend(loc="upper left")
    plt.show()
    
    # This is a 1-second aggregation of network traffic;
    # a short connection (< 1 sec) should appear as a single dot (1 for up, 1 for dw)
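
The cell above groups packets by IP pair only. Grouping by the full 5-tuple mentioned in its comments could look like the sketch below. It runs on invented data, and the `srcport` and `dstport` column names are assumptions made for this illustration: the csv extracted earlier in this notebook does not include the TCP port fields.

```python
import pandas as pd

# Invented packets: two flows from the camera to the same server, plus one reply
packets = pd.DataFrame({
    'ip.src':    ['192.168.1.10', '192.168.1.10', '192.168.1.10', '203.0.113.5'],
    'ip.dst':    ['203.0.113.5', '203.0.113.5', '203.0.113.5', '192.168.1.10'],
    'srcport':   [50000, 50000, 50001, 443],    # hypothetical column name
    'dstport':   [443, 443, 443, 50000],        # hypothetical column name
    'ip.proto':  [6, 6, 6, 6],                  # 6 = TCP
    'frame.len': [100, 200, 300, 400],
})

five_tuple = ['ip.src', 'ip.dst', 'srcport', 'dstport', 'ip.proto']

# One row per unidirectional flow, with its frame count and byte total
flows = (packets
         .groupby(five_tuple)['frame.len']
         .agg(['count', 'sum'])
         .rename(columns={'count': 'frames', 'sum': 'bytes'})
         .reset_index())
print(flows)
```

Compared with the IP-pair grouping, this separates parallel connections between the same two hosts, at the cost of producing many more groups on a busy network.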


In [None]:
# Exercise: Plot total network traffic from both directions: the device <--> the router.

In [None]:
###
# Found an IP endpoint in Asia according to MaxMind,
# We can still dig in, get more info, wait,
# Google Cloud in Asia?? MaxMind, are you ok?
r = requests.request("GET", urljoin(iotlaburl, "geoip/35.201.123.184"))
r.json()

In [None]:
##
# Despite the bizarre location, MaxMind is right about the ISP/CDN
!whois 35.201.123.184

In [None]:
## 
# Latency isn't high (< 10ms), it shouldn't be too far, possibly colocated
!ping -c 5 35.201.123.184

In [None]:
##
# Traceroute usually takes a different path every time you run it,
# but it can still be used to reveal important information about
# intermediary hops
!traceroute 35.201.123.184

#traceroute to 35.201.123.184 (35.201.123.184), 64 hops max, 52 byte packets
#...
# 8  r-equinix-isp-ae2-2213.wiscnet.net (216.56.50.45)  4.567 ms  5.589 ms  4.622 ms
# 9  72.14.218.180 (72.14.218.180)  5.763 ms  4.679 ms  4.221 ms
#10  108.170.243.193 (108.170.243.193)  4.016 ms
#    108.170.243.225 (108.170.243.225)  6.219 ms
#    108.170.244.1 (108.170.244.1)  5.244 ms
#11  216.239.51.117 (216.239.51.117)  5.365 ms
#    72.14.232.153 (72.14.232.153)  4.855 ms  <--- intermediary hops can tell something
#    72.14.232.169 (72.14.232.169)  4.571 ms   <--- 
#12  184.123.201.35.bc.googleusercontent.com (35.201.123.184)  4.512 ms  5.164 ms  4.771 ms

In [None]:
##
# IP in Australia, closer to Asia, also Google.
# Nothing to worry about
r = requests.request("GET", urljoin(iotlaburl, "geoip/72.14.232.169"))
r.json()