# Assignment: Video Quality Inference

So far in this class, you have learned various techniques for loading and analyzing packet captures, generating features from those packet captures, and training and evaluating models using those features.

In this assignment, you will put all of this together by training a model that automatically infers video quality of experience from a labeled network traffic trace.

## Part 1: Warmup

The first part of this assignment builds directly on the hands-on activities but extends them slightly.

### Extract Features from the Network Traffic

Load the `netflix.pcap` file, which is a packet trace that includes network traffic. 

Click [here](https://github.com/noise-lab/ml-systems/blob/main/docs/notebooks/data/netflix.pcap) to download `netflix.pcap`.


In [11]:
import scapy.all as scapy
# from netml.pparser.process_pcap import from_pcap
from netml.pparser.parser import PCAP
import pandas as pd
import os

pcap_file = 'netflix.pcap'
try:
    packets = scapy.rdpcap(pcap_file)
    print(f"Successfully loaded {len(packets)} packets from {pcap_file}")
except FileNotFoundError:
    print(f"Error: {pcap_file} not found")
except Exception as e:
    print(f"An error occurred loading the pcap: {e}")

Successfully loaded 141471 packets from netflix.pcap


In [None]:
# List every domain name that appears in a DNS query.
# `packets` was already loaded in the previous cell, so the pcap is not read again.
domains = set()
for pkt in packets:
    if pkt.haslayer(scapy.DNS) and pkt[scapy.DNS].qr == 0:  # qr == 0 marks a DNS query
        try:
            domains.add(pkt[scapy.DNS].qd.qname.decode('utf-8').strip('.'))
        except (AttributeError, UnicodeDecodeError):
            # Skip queries with a missing question section or an undecodable name
            pass

print(f"Total unique domains: {len(domains)}")
for d in sorted(domains):
    print(d)


Total unique domains: 39
543458527._teamviewer._tcp.local
Diegos-MacBook-Air.local
_privet._tcp.local
accounts.google.com
adservice.google.fr
apis.google.com
assets.nflxext.com
codex.nflxext.com
customerevents.netflix.com
db._dns-sd._udp.0.43.168.192.in-addr.arpa
dr._dns-sd._udp.0.43.168.192.in-addr.arpa
fonts.gstatic.com
freegeoip.net
google.com
googleads.g.doubleclick.net
ipv4-c001-cdg001-ix.1.oca.nflxvideo.net
ipv4-c005-cdg001-ix.1.oca.nflxvideo.net
ipv4-c024-cdg001-ix.1.oca.nflxvideo.net
ipv4-c063-cdg001-ix.1.oca.nflxvideo.net
ipv4-c069-cdg001-ix.1.oca.nflxvideo.net
ipv4-c071-cdg001-ix.1.oca.nflxvideo.net
ipv4-c072-cdg001-ix.1.oca.nflxvideo.net
mpittoni-macbook._ftp._tcp.local
mpittoni-macbook._sftp-ssh._tcp.local
occ-0-56-55.1.nflxso.net
push.prod.netflix.com
r._dns-sd._udp.0.43.168.192.in-addr.arpa
r4---sn-gxo5uxg-jqbe.googlevideo.com
ssl.gstatic.com
tp-s.nflximg.net
update.googleapis.com
www.google.com
www.google.fr
www.googleadservices.com
www.gstatic.com
www.netflix.com
www.yo

### Identifying the Service Type

Use the DNS traffic to filter the packet trace for Netflix traffic.

In [None]:
# Based on the domains observed, we can filter for Netflix-related traffic
netflix_domains = [
    "nflxvideo.net",
    "nflxext.com",
    "nflxso.net",
    "nflximg.net",
    "netflix.com"
]

netflix_ips = set()

# We only want to keep responses: DNS response (qr=1) with answers (ancount > 0)
for pkt in packets:
    if pkt.haslayer(scapy.DNS) and pkt[scapy.DNS].qr == 1 and pkt[scapy.DNS].ancount > 0:
        for i in range(pkt[scapy.DNS].ancount):
            answer = pkt[scapy.DNS].an[i]
            
            # I had to use Gemini to do this chunk.
            # Prompt: "In Scapy, how do I check if a DNS answer is an 'A' (IPv4) 
            # record and extract the domain name and IP address?"
            if answer.type == 1: # 'A' (IPv4) record
                domain = answer.rrname.decode('utf-8')
                if any(nd in domain for nd in netflix_domains):
                    netflix_ips.add(answer.rdata)

print(f"\nIdentified {len(netflix_ips)} Netflix-related IP addresses.")
print(netflix_ips)

# Get all packets related to Netflix IPs
netflix_packets = []
for pkt in packets:
    if pkt.haslayer(scapy.IP):
        # Keep packets where either source or destination IP is in netflix_ips
        if pkt[scapy.IP].src in netflix_ips or pkt[scapy.IP].dst in netflix_ips:
            netflix_packets.append(pkt)

print(f"\nFiltered down to {len(netflix_packets)} Netflix-related packets.")


filtered_pcap_file = 'netflix_only.pcap'
scapy.wrpcap(filtered_pcap_file, netflix_packets)
print(f"Saved filtered Netflix traffic to {filtered_pcap_file}")


Identified 15 Netflix-related IP addresses.
{'52.210.19.176', '34.252.77.54', '198.38.120.167', '52.208.128.101', '198.38.120.130', '198.38.120.166', '52.48.148.78', '198.38.120.164', '198.38.120.137', '198.38.120.134', '52.210.133.255', '198.38.120.153', '52.19.39.146', '198.38.120.162', '52.48.8.150'}

Filtered down to 138633 Netflix-related packets.
Saved filtered Netflix traffic to netflix_only.pcap
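
Substring matching on domain names can over-match, for example on a name that merely contains `netflix.com` in the middle. The following is a minimal sketch of stricter matching on label boundaries. The helper name `is_netflix_domain` is hypothetical; the suffix list is the one used in the cell above.

```python
NETFLIX_SUFFIXES = (
    "nflxvideo.net",
    "nflxext.com",
    "nflxso.net",
    "nflximg.net",
    "netflix.com",
)


def is_netflix_domain(name, suffixes=NETFLIX_SUFFIXES):
    """Return True if `name` equals one of the suffixes or is a subdomain of one."""
    # DNS names in packet captures often carry a trailing dot and mixed case.
    normalized = name.rstrip(".").lower()
    return any(normalized == s or normalized.endswith("." + s) for s in suffixes)


print(is_netflix_domain("ipv4-c001-cdg001-ix.1.oca.nflxvideo.net."))  # True
print(is_netflix_domain("netflix.com.example.org"))                   # False
```

Whether this changes the set of IP addresses for a given trace depends on the names that actually appear in its DNS answers.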


### Generate Statistics

Generate statistics and features for the Netflix traffic flows. Use the `netml` library or any other technique that you choose to generate a set of features that you expect to be useful for your model.

In [17]:
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

pcap = PCAP('netflix_only.pcap', flow_ptks_thres=2)

pcap.pcap2flows()

# Extract inter-arrival time features
pcap.flow2features('IAT', fft=False, header=False)

iat_features = pcap.features

In [20]:
iat_features_df = pd.DataFrame(iat_features)
print(iat_features_df.head())

        0         1         2         3         4         5         6    \
0  0.714240  0.000158  1.207696  0.017828  0.001739  2.609759  0.001923   
1  0.715549  0.000191  1.286184  0.017959  0.002137  2.543728  0.002128   
2  0.713235  0.000234  1.128762  0.017737  0.001898  2.618819  0.002096   
3  0.768140  0.000545  2.326351  0.015718  0.001761  1.712796  0.981150   
4  0.771010  0.000206  2.449168  0.017225  0.002280  1.603454  0.987820   

        7             8         9    ...  925  926  927  928  929  930  931  \
0  1.369284  7.958562e+00  0.557696  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
1  9.313323  5.603709e-01  0.001295  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
2  1.425096  7.973116e+00  0.555151  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
3  0.021840  9.536743e-07  0.541786  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
4  3.662901  9.536743e-07  0.582064  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0   

   932  933  934  
0  0.0  0.0  0.0  
1  0.0  0.0  0.0  
2  0.0  0.0  0.0 

In [21]:
# Extract statistical features
pcap.flow2features('STATS', fft=False, header=False)
stats_features = pcap.features

# Extract sampled-bytes features
pcap.flow2features('SAMP_SIZE', fft=False, header=False)
samp_size_features = pcap.features


In [22]:
stat_features_df = pd.DataFrame(stats_features)
samp_size_features_df = pd.DataFrame(samp_size_features)
print(stat_features_df.head())

           0         1           2           3          4     5     6      7   \
0   14.440177  0.831015   73.683308   88.666667  48.140997  66.0  66.0   69.0   
1   14.442865  0.761622   69.099863   90.727273  49.772374  66.0  66.0   72.0   
2   14.437420  0.831173   73.697377   88.666667  48.140997  66.0  66.0   69.0   
3  138.854511  1.497971  184.560082  123.206731  66.443951  66.0  66.0  200.0   
4   74.817607  0.360878   45.203263  125.259259  66.350692  66.0  66.0  200.0   

     8      9      10       11  
0  66.0  200.0   12.0   1064.0  
1  66.0  200.0   11.0    998.0  
2  66.0  200.0   12.0   1064.0  
3  54.0  200.0  208.0  25627.0  
4  54.0  200.0   27.0   3382.0  


**Write a brief justification for the features that you have chosen.**

I chose the **STATS** representation after using the `netml` library to extract several types of features from the Netflix traffic flows. STATS provides summary statistics for each flow (packet rate, byte rate, packet size statistics, total packets, and total bytes). These form a short, fixed-length vector per flow, which is easy to feed into a model.
I also experimented with IAT (inter-arrival times) and SAMP_SIZE (per-interval byte counts) to capture traffic burstiness, which is often correlated with video segment downloads and playback quality.

### Inferring Segment Downloads

In addition to the features that you could generate using the `netml` library or similar, add a "segment download rate" feature to your feature vector. This feature indicates the number of video segments downloaded in a given time window.

Note: If you are using the `netml` library, generating features with `SAMP` style options may be useful, as this option gives you time windows, and you can then simply add the segment download rate to that existing dataframe.

In [25]:
pcap.flow2features('SAMP_SIZE', fft=False, header=False)
samp_features = pcap.features
df_samp = pd.DataFrame(samp_features)

In [None]:
# Choosing a threshold
import numpy as np
values = df_samp.to_numpy().flatten()
values = values[values > 0]  
print(np.percentile(values, [50, 75, 90, 95, 99]))


[   1761.     12901.5   106884.2   286426.45 1093697.6 ]


In [29]:
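# 50,000 bytes sits between the 75th (~12.9 kB) and 90th (~106.9 kB) percentiles printed above,
# so only the larger bursts, which likely correspond to video segment downloads, are counted.
threshold = 50_000
# For each flow (row), count the sampling windows whose byte count exceeds the threshold
segment_counts = (df_samp > threshold).sum(axis=1)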
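
A count like the one above can be turned into a rate by dividing by the time each flow was active. The following is a minimal sketch under stated assumptions, not a reference solution: it assumes each row is a flow and each column is a fixed-length sampling window, and treats trailing all-zero windows as padding. The function name `segment_download_rate`, the demo values, and the 2-second `window_seconds` are hypothetical.

```python
import numpy as np
import pandas as pd


def segment_download_rate(df_bytes, threshold, window_seconds):
    """Segments per second for each row of a per-window byte-count table."""
    values = df_bytes.to_numpy()
    # A window counts as one segment download when its bytes exceed the threshold.
    counts = (values > threshold).sum(axis=1)
    # Active span = number of windows up to and including the last non-empty one.
    nonzero = values > 0
    n_cols = values.shape[1]
    from_end = np.argmax(nonzero[:, ::-1], axis=1)
    active = np.where(nonzero.any(axis=1), n_cols - from_end, 0)
    duration = active * float(window_seconds)
    # Rows with no traffic get a rate of 0 instead of a division by zero.
    rates = np.divide(counts, duration, out=np.zeros(len(counts)), where=duration > 0)
    return pd.Series(rates, index=df_bytes.index, name="segment_download_rate")


# Hypothetical per-window byte counts for three flows
demo = pd.DataFrame([
    [1_000, 60_000, 0, 0],
    [70_000, 80_000, 2_000, 90_000],
    [0, 0, 0, 0],
])
rates = segment_download_rate(demo, threshold=50_000, window_seconds=2.0)
print(rates.tolist())  # [0.25, 0.375, 0.0]
```

The resulting series shares the input's index, so it can be attached to the feature dataframe as an extra column.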



## Part 2: Video Quality Inference

You will now load the complete video dataset from a previous study to train and test models based on these features to automatically infer the quality of a streaming video flow.

For this part of the assignment, you will need two pickle files, which you can download by running the code below:

```

!gdown 'https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS' -O netflix_session.pkl
!gdown 'https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI' -O video_dataset.pkl

```

### Load the File

Load the video dataset pickle file.
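
One way to load and inspect the file is sketched below. It assumes the pickle holds a pandas `DataFrame`, which the assignment text does not state; `load_video_dataset` is a hypothetical helper name. Only unpickle files from a source you trust, since unpickling can execute arbitrary code.

```python
from pathlib import Path

import pandas as pd


def load_video_dataset(path="video_dataset.pkl"):
    """Load the pickled dataset (assumed here to be a pandas DataFrame)."""
    return pd.read_pickle(path)


# Only runs once the file has been downloaded into the working directory.
if Path("video_dataset.pkl").exists():
    video_df = load_video_dataset()
    print(video_df.shape)   # rows and columns
    print(video_df.dtypes)  # column names and types, useful for the cleaning step
```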

### Clean the File

1. The dataset contains video resolutions that are not valid. Remove entries in the dataset that do not contain a valid video resolution. Valid resolutions are 240, 360, 480, 720, and 1080.

2. The file also contains columns that are unnecessary (in fact, unhelpful!) for performing predictions. Identify those columns, and remove them.

**Briefly explain why you removed those columns.**
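
The two cleaning steps can be sketched as follows. The column names `resolution`, `session_id`, and `throughput_kbps` are hypothetical stand-ins, since the dataset's actual schema is not shown here; substitute the real label column and the columns you decide to drop.

```python
import pandas as pd

VALID_RESOLUTIONS = [240, 360, 480, 720, 1080]


def clean_dataset(df, label_col="resolution", drop_cols=()):
    """Keep rows with a valid resolution and drop the listed columns."""
    cleaned = df[df[label_col].isin(VALID_RESOLUTIONS)].copy()
    # errors="ignore" lets the same call work even if a listed column is absent.
    return cleaned.drop(columns=list(drop_cols), errors="ignore")


# Hypothetical toy data, not taken from the real dataset
demo = pd.DataFrame({
    "session_id": ["a", "a", "b", "b"],
    "throughput_kbps": [900.0, 50.0, 5200.0, 300.0],
    "resolution": [240, 0, 1080, 144],
})
cleaned = clean_dataset(demo, drop_cols=["session_id"])
print(cleaned)
```

Identifier-like columns are a common candidate for removal because they let a model memorize sessions rather than learn from traffic behavior, but which columns qualify depends on the actual data.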

### Prepare Your Data

Prepare your data matrix, determine your features and labels, and perform a train-test split on your data.
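
A minimal sketch of the split using scikit-learn's `train_test_split` follows. The feature matrix here is synthetic random data standing in for the cleaned dataset, and the 80/20 ratio and `random_state` are arbitrary choices. Stratifying on the label keeps the class proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, 5 features, labels drawn from the valid resolutions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.choice([240, 360, 480, 720, 1080], size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% of the rows for testing
    stratify=y,       # preserve class proportions in both splits
    random_state=42,  # make the split reproducible
)
print(X_train.shape, X_test.shape)  # (160, 5) (40, 5)
```

If several rows come from the same video session, a group-aware splitter such as scikit-learn's `GroupShuffleSplit` may give a more honest estimate, because it keeps each session entirely in one split.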

### Train Your Model

1. Select a model of your choice.
2. Train the model using your training data.

### Tune Your Model

Perform hyperparameter tuning to find optimal parameters for your model.
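
As one possible approach, the sketch below runs a small cross-validated grid search with scikit-learn's `GridSearchCV` over a random forest. The model choice, the grid values, the macro-F1 scoring, and the synthetic data from `make_classification` are all illustrative assumptions rather than requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 5-class problem standing in for the training split
X_train, y_train = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=5, random_state=0
)

param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data only
    scoring="f1_macro",  # weights every resolution class equally
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
best_model = search.best_estimator_  # refit on the full training data by default
```

Tuning on cross-validation folds of the training set keeps the test set untouched until the final evaluation.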

### Evaluate Your Model

Evaluate your model using the following metrics:

1. Accuracy
2. F1 Score
3. Confusion Matrix
4. ROC/AUC
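
The four metrics above can be computed with scikit-learn as sketched below. The data and model are synthetic placeholders. Because resolution is a multiclass label, this sketch assumes macro averaging for F1 and a one-vs-rest scheme for ROC AUC, which needs predicted class probabilities rather than hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 5-class problem standing in for the video dataset
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

y_pred = clf.predict(X_test)         # hard labels for accuracy, F1, confusion matrix
y_proba = clf.predict_proba(X_test)  # per-class probabilities for ROC AUC

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
# Rows are true classes, columns are predicted classes, in the order of clf.classes_
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
auc = roc_auc_score(
    y_test, y_proba, multi_class="ovr", average="macro", labels=clf.classes_
)

print(f"accuracy={acc:.3f}  macro F1={macro_f1:.3f}  ROC AUC (OvR)={auc:.3f}")
print(cm)
```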

## Part 3: Predict the Ongoing Resolution of a Real Netflix Session

Now that you have your model, it's time to put it into practice!

Use a preprocessed Netflix video session to infer **and plot** the resolution at 10-second time intervals.
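
The plotting step might look like the sketch below. It assumes you already have one predicted resolution per 10-second window, in time order; how those predictions are produced depends on the session file's format, which is not shown here. The function name `plot_resolution_timeline` and the demo predictions are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_resolution_timeline(predicted, interval_seconds=10):
    """Step plot of one predicted resolution per fixed-length time window."""
    predicted = np.asarray(predicted)
    # Window i covers [i * interval_seconds, (i + 1) * interval_seconds).
    start_times = np.arange(len(predicted)) * interval_seconds
    fig, ax = plt.subplots(figsize=(8, 3))
    # where="post" holds each value until the start of the next window.
    ax.step(start_times, predicted, where="post")
    ax.set_yticks([240, 360, 480, 720, 1080])
    ax.set_xlabel("Time since session start (s)")
    ax.set_ylabel("Predicted resolution (p)")
    ax.set_title("Inferred resolution per 10-second interval")
    fig.tight_layout()
    return fig, ax


# Hypothetical predictions, e.g. the output of model.predict(session_features)
demo_predictions = [240, 360, 480, 480, 720, 720, 1080, 1080]
fig, ax = plot_resolution_timeline(demo_predictions)
```

A step plot suits this data because resolution is a discrete level that stays constant within each interval.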