# Assignment: Video Quality Inference

To this point in the class, you have learned various techniques for loading and analyzing packet captures, generating features from those packet captures, and training and evaluating models using those features.

In this assignment, you will put all of this together, using a labeled network traffic trace to train a model that automatically infers video quality of experience.

## Part 1: Warmup

The first part of this assignment builds directly on the hands-on activities but extends them slightly.

### Extract Features from the Network Traffic

Load the `netflix.pcap` file, which is a packet trace that includes network traffic. 

Click [here](https://github.com/noise-lab/ml-systems/blob/main/docs/notebooks/data/netflix.pcap) to download `netflix.pcap`.


In [1]:
import pandas as pd
import numpy as np
from scapy.all import *

netflix_packets = rdpcap('netflix.pcap')

print(netflix_packets)



<netflix.pcap: TCP:141143 UDP:232 ICMP:16 Other:80>


In [6]:
for i in range(10):
    print(netflix_packets[i].summary())
    print(netflix_packets[i].layers())

#layers: Ether, IP, UDP, DNS Qry

Ether / IP / UDP / DNS Qry "b'fonts.gstatic.com.'" 
[<class 'scapy.layers.l2.Ether'>, <class 'scapy.layers.inet.IP'>, <class 'scapy.layers.inet.UDP'>, <class 'scapy.layers.dns.DNS'>]
Ether / IP / UDP / DNS Qry "b'fonts.gstatic.com.'" 
[<class 'scapy.layers.l2.Ether'>, <class 'scapy.layers.inet.IP'>, <class 'scapy.layers.inet.UDP'>, <class 'scapy.layers.dns.DNS'>]
Ether / IP / UDP / DNS Qry "b'googleads.g.doubleclick.net.'" 
[<class 'scapy.layers.l2.Ether'>, <class 'scapy.layers.inet.IP'>, <class 'scapy.layers.inet.UDP'>, <class 'scapy.layers.dns.DNS'>]
Ether / IP / UDP / DNS Qry "b'googleads.g.doubleclick.net.'" 
[<class 'scapy.layers.l2.Ether'>, <class 'scapy.layers.inet.IP'>, <class 'scapy.layers.inet.UDP'>, <class 'scapy.layers.dns.DNS'>]
Ether / IP / UDP / DNS Qry "b'ytimg.l.google.com.'" 
[<class 'scapy.layers.l2.Ether'>, <class 'scapy.layers.inet.IP'>, <class 'scapy.layers.inet.UDP'>, <class 'scapy.layers.dns.DNS'>]
Ether / IP / UDP / DNS Qry "b'r4---sn-gxo5uxg-jqbe.googlevideo.c

### Identifying the Service Type

Use the DNS traffic to filter the packet trace for Netflix traffic.

In [None]:
# Check for the DNS layer directly rather than searching the packet's string form.
if netflix_packets[1].haslayer(DNS):
    print("yes")

# A DNS answer's summary quotes the resolved address; split on '"' to pull it out.
print(netflix_packets[8].summary())
print(netflix_packets[8].summary().split('"'))
print((netflix_packets[8].summary()).split('"')[1])

yes
Ether / IP / UDP / DNS Ans "216.58.213.162" 
['Ether / IP / UDP / DNS Ans ', '216.58.213.162', ' ']
216.58.213.162
Looking for DNS responses...
Total DNS responses found: 62
  - Ether / IP / UDP / DNS Ans "216.58.213.162" 
  - Ether / IP / UDP / DNS Ans "2a00:1450:4007:805::2003" 
  - Ether / IP / UDP / DNS Ans "172.217.18.195" 

Checking for Netflix DNS responses...


In [2]:
netflix_domains = ['netflix.com', 'netflix.net', 'nflxso.net', 'nflxvideo.net', 'nflxext.com', 'nflximg.com', 'nflximg.net']
# source: https://www.netify.ai/resources/applications/netflix

netflix_traffic = []      # summaries of DNS packets that mention a Netflix domain
netflix_traffic_ips = []  # addresses taken from matching DNS answers

for packet in netflix_packets:
    packet_summary = packet.summary()

    if 'DNS' in packet_summary:
        for domain in netflix_domains:
            if domain in packet_summary:
                netflix_traffic.append(packet_summary)

                # Caveat: an address answer's summary (e.g. DNS Ans "216.58.213.162")
                # quotes only the resolved address, not the queried name, so the
                # domain test above does not match such answers.
                if 'DNS Ans' in packet_summary:
                    parts = packet_summary.split('"')

                    if len(parts) > 1:
                        ip_addr = parts[1]

                        if ip_addr not in netflix_traffic_ips:
                            netflix_traffic_ips.append(ip_addr)

print(netflix_traffic_ips)

[]
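
The empty list printed above is consistent with the summary format shown earlier: an answer summary such as `DNS Ans "216.58.213.162"` quotes the address but not the queried name, so a substring test for a Netflix domain does not match it. One possible workaround, sketched below, is to pair each answer record's name with its address and match the name by domain suffix. The `(name, address)` input format and the helper names `is_netflix_name` and `netflix_ips` are assumptions made for illustration; reading those pairs out of each packet's DNS answer records is not shown here, and the example records are made up.

```python
NETFLIX_DOMAINS = ['netflix.com', 'netflix.net', 'nflxso.net', 'nflxvideo.net',
                   'nflxext.com', 'nflximg.com', 'nflximg.net']


def is_netflix_name(name, domains=NETFLIX_DOMAINS):
    """Return True if a DNS name equals or is a subdomain of a listed domain."""
    # Accept bytes or str, and drop the trailing dot of a fully qualified name.
    if isinstance(name, bytes):
        name = name.decode('ascii', errors='ignore')
    name = name.rstrip('.').lower()
    return any(name == d or name.endswith('.' + d) for d in domains)


def netflix_ips(records, domains=NETFLIX_DOMAINS):
    """Collect unique answer addresses whose record name is a Netflix domain.

    `records` is an iterable of (name, address) pairs (an assumed input format).
    """
    ips = []
    for name, address in records:
        if is_netflix_name(name, domains) and address not in ips:
            ips.append(address)
    return ips


# Made-up example records, for illustration only.
example_records = [
    (b'fonts.gstatic.com.', '192.0.2.10'),
    (b'www.netflix.com.', '198.51.100.7'),
    ('ipv4-c001.1.nflxvideo.net.', '203.0.113.5'),
    (b'www.netflix.com.', '198.51.100.7'),
    ('notnetflix.com.', '192.0.2.99'),
]
print(netflix_ips(example_records))
```

Matching on a suffix with a leading dot avoids false positives such as `notnetflix.com`. A limitation of this sketch is that an answer reached through a CNAME chain may carry a record name outside the listed domains and would then be missed.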


### Generate Statistics

Generate statistics and features for the Netflix traffic flows. Use the `netml` library or any other technique that you choose to generate a set of features that you think would be good features for your model. 

In [3]:
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

print("getting pcap...\n")
netflix_pcap = PCAP('netflix.pcap')

print("pcap2pandas...\n")
netflix_pcap.pcap2pandas()

print("pcap2flows...\n")
netflix_pcap.pcap2flows()

print("pcap features...\n")
netflix_pcap.flow2features('IAT', fft = False, header = False)

getting pcap...

pcap2pandas...

pcap2flows...

pcap features...



In [4]:
# Inspect the feature matrix: one row per flow
print("getting features\n")
netflix_features = netflix_pcap.features
print(netflix_features.shape)
print(len(netflix_pcap.flows))

getting features

(184, 359)
184


In [5]:
# Per-column summary statistics of the feature matrix
print("feature stats:\n")
print(f"min: {np.min(netflix_features, axis = 0)}\n")
print(f"max: {np.max(netflix_features, axis = 0)}\n")
print(f"mean: {np.mean(netflix_features, axis = 0)}\n")
print(f"std: {np.std(netflix_features, axis = 0)}\n")

feature stats:

min: [0.01274514 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0

**Write a brief justification for the features that you have chosen.**

I chose inter-arrival time (IAT) features because they capture the timing patterns of a flow. Video streaming sends packets in bursts, and when video quality degrades the timing becomes irregular, so measuring these patterns can reveal problems. Good video streams show a steady data flow, while poor-quality streams show an uneven one, and these features can detect that difference. They can also help distinguish specific issues: a frozen video (unusually long gaps between packets), a change in quality (a different pattern in the data flow), and buffering (abnormal timing).

### Inferring Segment Downloads

In addition to the features that you could generate using the `netml` library or similar, add to your feature vector a "segment downloads rate" feature, which indicates the number of video segments downloaded for a given time window.

Note: If you are using the `netml` library, generating features with `SAMP` style options may be useful, as this option gives you time windows, and you can then simply add the segment download rate to that existing dataframe.

In [None]:
"""
Segment definition:
 For the "segment downloads rate" feature, you can define a segment as a burst of packets separated by at least a 1-second gap.
 The number of such segments within a given time window can then be used as your "segment downloads rate" feature, e.g., x segments/second

 The code below uses 10-second windows.
"""

sorted_packets = sorted(netflix_packets, key = lambda x: x.time)
start = sorted_packets[0].time

segments = []
seg_in_curr = 0
in_burst = False

for i, packet in enumerate(sorted_packets):
    idx = int((packet.time - start) / 10)

    if idx >= len(segments):
        if len(segments) > 0:
            segments.append(seg_in_curr)
        else:
            segments.append(0)

        seg_in_curr = 0

    if i > 0:
        gap = packet.time - sorted_packets[i - 1].time

        if gap > 1:
            if in_burst:
                seg_in_curr += 1
                in_burst = False
            else:
                in_burst = True
        else:
            in_burst = True

if in_burst:
    seg_in_curr += 1

if (seg_in_curr > 0) or (len(segments) == 0):
    segments.append(seg_in_curr)

print(segments)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 2, 0, 0, 2, 2, 3, 3, 3, 3, 2, 2, 4, 4, 3, 3, 3, 3, 2, 2, 3, 3, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 3, 2]
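
The segment definition in the docstring above (a burst of packets separated from its neighbors by a gap of at least 1 second) can also be counted directly from sorted timestamps. The sketch below is one possible formulation and not the method used in the cell above: it assigns each burst to the window in which it starts and fills empty windows with zeros. The function name `segments_per_window` and its parameters are hypothetical, and it takes plain timestamps in seconds rather than packets; the example timestamps are made up.

```python
def segments_per_window(timestamps, window=10.0, min_gap=1.0):
    """Count bursts ("segments") that start in each fixed-size time window.

    A new burst starts at the first packet and at every packet that arrives
    at least `min_gap` seconds after the previous one.
    """
    times = sorted(float(t) for t in timestamps)
    if not times:
        return []

    start = times[0]
    n_windows = int((times[-1] - start) // window) + 1
    counts = [0] * n_windows

    previous = None
    for t in times:
        if previous is None or t - previous >= min_gap:
            counts[int((t - start) // window)] += 1
        previous = t
    return counts


# Made-up timestamps: three bursts start in the first 10 s, none in the
# second window, and one in the third.
example = [0.0, 0.1, 0.2, 3.0, 3.1, 7.5, 7.6, 25.0, 25.2]
print(segments_per_window(example))
```

Dividing each count by the window length converts it to segments per second. Because every window gets an entry, the position in the returned list maps directly to a time offset of `position * window` seconds.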


In [12]:
# Attach the segment count of each flow's starting window to its feature vector.
start = sorted_packets[0].time
size = 10  # window length in seconds
better_features = []

for i, flow in enumerate(netflix_pcap.flows):
    # Each flow is a (flow_key, packets) tuple, so unpack it before reading
    # packet timestamps; indexing the tuple itself has no `.time` attribute.
    flow_key, flow_packets = flow
    if not flow_packets:
        continue

    flow_start = flow_packets[0].time
    idx = int((flow_start - start) / size)

    # Windows past the end of `segments` get a count of 0.
    if idx < len(segments):
        rate = segments[idx]
    else:
        rate = 0

    original = netflix_pcap.features[i]
    temp = np.append(original, rate)
    better_features.append(temp)

print(better_features)

## Part 2: Video Quality Inference

You will now load the complete video dataset from a previous study to train and test models based on these features to automatically infer the quality of a streaming video flow.

For this part of the assignment, you will need two pickle files, which you can download by running the code below:

```

!gdown 'https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS' -O netflix_session.pkl
!gdown 'https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI' -O video_dataset.pkl

```

### Load the File

Load the video dataset pickle file.

### Clean the File

1. The dataset contains video resolutions that are not valid. Remove entries in the dataset that do not contain a valid video resolution. Valid resolutions are 240, 360, 480, 720, 1080.

2. The file also contains columns that are unnecessary (in fact, unhelpful!) for performing predictions. Identify those columns, and remove them.

**Briefly explain why you removed those columns.**
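
The two cleaning steps above can be sketched on plain Python records as follows. This is a minimal illustration under stated assumptions: the label column name `resolution`, the helper `clean_records`, and the example rows (including the `session_id` column) are all hypothetical, and the sketch does not say which columns of the real dataset should be dropped.

```python
VALID_RESOLUTIONS = {240, 360, 480, 720, 1080}


def clean_records(records, label_key='resolution', drop_keys=()):
    """Keep rows with a valid resolution and remove the listed columns."""
    cleaned = []
    for row in records:
        # Step 1: skip rows whose label is missing or not a valid resolution.
        if row.get(label_key) not in VALID_RESOLUTIONS:
            continue
        # Step 2: copy the row without the unwanted columns.
        cleaned.append({k: v for k, v in row.items() if k not in drop_keys})
    return cleaned


# Made-up rows; 'session_id' stands in for an identifier-like column.
rows = [
    {'session_id': 'a', 'throughput': 1.2, 'resolution': 480},
    {'session_id': 'b', 'throughput': 0.3, 'resolution': 0},
    {'session_id': 'c', 'throughput': 4.8, 'resolution': 1080},
]
print(clean_records(rows, drop_keys=('session_id',)))
```

With a pandas DataFrame, the same two steps are commonly written with `isin` for the row filter and `drop(columns=...)` for the column removal.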

### Prepare Your Data

Prepare your data matrix, determine your features and labels, and perform a train-test split on your data.
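
A train-test split shuffles the row indices once and then holds out a fixed fraction for testing, so that no row appears in both sets. The sketch below shows the idea with the standard library only; the function name `train_test_split_rows`, the 20% test fraction, and the toy data are assumptions for illustration. In practice, scikit-learn's `train_test_split` offers the same behavior plus options such as stratification.

```python
import random


def train_test_split_rows(features, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    if len(features) != len(labels):
        raise ValueError('features and labels must have the same length')

    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: ten rows with one feature each and made-up resolution labels.
X = [[i] for i in range(10)]
y = [240, 240, 360, 360, 480, 480, 720, 720, 1080, 1080]
x_train, x_test, y_train, y_test = train_test_split_rows(X, y)
print(len(x_train), len(x_test))
```

Fixing the seed keeps the held-out rows the same across runs, which makes later comparisons between models or hyperparameter settings fair.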

### Train Your Model

1. Select a model of your choice.
2. Train the model using your training data.

### Tune Your Model

Perform hyperparameter tuning to find optimal parameters for your model.
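
One common form of hyperparameter tuning is an exhaustive grid search: score every combination of candidate values and keep the best. The sketch below illustrates only the search loop. The helper `grid_search`, the parameter names `max_depth` and `n_estimators`, and the `toy_score` function are hypothetical stand-ins; in a real run the score function would train the model on the training data and return a validation or cross-validated score. scikit-learn's `GridSearchCV` packages this pattern together with cross-validation.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy score function standing in for a validation score; by construction
# it peaks at max_depth=8 and n_estimators=100.
def toy_score(max_depth, n_estimators):
    return 1.0 - abs(max_depth - 8) / 20 - abs(n_estimators - 100) / 1000


grid = {'max_depth': [4, 8, 16], 'n_estimators': [50, 100, 200]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The score used for selection should come from validation data carved out of the training set, so that the test set stays untouched until the final evaluation.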

### Evaluate Your Model

Evaluate your model's performance using the following metrics:

1. Accuracy
2. F1 Score
3. Confusion Matrix
4. ROC/AUC
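
To make the first three metrics concrete, the sketch below computes a confusion matrix, accuracy, and macro-averaged F1 score by hand for a multi-class problem. The function names and the toy labels are made up for illustration, and macro averaging is one of several possible F1 averaging choices. ROC/AUC is not covered here because it requires predicted class scores or probabilities rather than hard labels.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true labels and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(matrix))) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels and predictions for three resolution classes.
labels = [240, 360, 480]
y_true = [240, 240, 360, 360, 480, 480]
y_pred = [240, 360, 360, 360, 480, 240]

cm = confusion_matrix(y_true, y_pred, labels)
print(cm)
print(round(accuracy(cm), 3), round(macro_f1(cm), 3))
```

Reading the matrix row by row shows which resolutions are confused with which; for resolution inference, errors between adjacent classes are usually less harmful than errors between distant ones, a distinction that accuracy alone hides.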

## Part 3: Predict the Ongoing Resolution of a Real Netflix Session

Now that you have your model, it's time to put it into practice!

Use a preprocessed Netflix video session to infer **and plot** the resolution at 10-second time intervals.