# Assignment: Video Quality Inference

To this point in the class, you have learned various techniques for loading and analyzing packet captures, generating features from those packet captures, and training and evaluating models using those features.

In this assignment, you will put all of this together: you will use a labeled network traffic trace to train a model that automatically infers video quality of experience.

## Part 1: Warmup

The first part of this assignment builds directly on the hands-on activities but extends them slightly.

### Extract Features from the Network Traffic

Load the `netflix.pcap` file, which is a packet trace that includes network traffic. 

Click [here](https://github.com/noise-lab/ml-systems/blob/main/docs/notebooks/data/netflix.pcap) to download `netflix.pcap`.


In [146]:
# Import pcap file into a Pandas DataFrame

from scapy.all import rdpcap, IP, TCP, UDP, DNS
import pandas as pd

pkts = rdpcap("netflix.pcap")  # may use lots of RAM for big files
rows = []
for p in pkts:
    is_dns = DNS in p
    proto = "DNS" if is_dns else ("TCP" if TCP in p else ("UDP" if UDP in p else p.name))
    txid = p[DNS].id if is_dns else None
    rows.append({
        "timestamp": float(p.time),  # every scapy packet carries a capture timestamp
        "length": len(p),
        "src_ip": p[IP].src if IP in p else None,
        "dst_ip": p[IP].dst if IP in p else None,
        "protocol": proto,
        "src_port": p[TCP].sport if TCP in p else (p[UDP].sport if UDP in p else None),
        "dst_port": p[TCP].dport if TCP in p else (p[UDP].dport if UDP in p else None),
        "txid": txid,
        "info": str(p.summary())
    })

pcap_import = pd.DataFrame(rows)

In [147]:
pcap_import.head()

Unnamed: 0,timestamp,length,src_ip,dst_ip,protocol,src_port,dst_port,txid,info
0,1518358200.53468,77,192.168.43.72,128.93.77.234,DNS,55697.0,53.0,60684.0,"Ether / IP / UDP / DNS Qry ""b'fonts.gstatic.co..."
1,1518358200.53483,77,192.168.43.72,128.93.77.234,DNS,59884.0,53.0,12314.0,"Ether / IP / UDP / DNS Qry ""b'fonts.gstatic.co..."
2,1518358200.53941,87,192.168.43.72,128.93.77.234,DNS,61223.0,53.0,4563.0,"Ether / IP / UDP / DNS Qry ""b'googleads.g.doub..."
3,1518358200.5412,87,192.168.43.72,128.93.77.234,DNS,58785.0,53.0,4740.0,"Ether / IP / UDP / DNS Qry ""b'googleads.g.doub..."
4,1518358200.54578,78,192.168.43.72,128.93.77.234,DNS,51938.0,53.0,13362.0,"Ether / IP / UDP / DNS Qry ""b'ytimg.l.google.c..."


### Identifying the Service Type

Use the DNS traffic to filter the packet trace for Netflix traffic.

In [148]:
# Get transaction IDs from Netflix DNS queries

nflx_queries = pcap_import[pcap_import["info"].str.contains(r"netflix|nflx", na=False) &
                           pcap_import["info"].str.contains(r"Qry", na=False)]

nflx_txids = [int(x) for x in nflx_queries["txid"].unique().tolist()]

In [149]:
print(nflx_txids)

[45466, 12361, 24416, 7939, 30582, 14313, 37909, 26231, 48727, 21826, 9887, 33001, 49332, 14721, 33500, 34542, 60735, 43441]


In [150]:
# Get Netflix DNS answers by matching txids

nflx_answers = pcap_import[pcap_import["txid"].isin(nflx_txids) & 
                           pcap_import["protocol"].eq("DNS") & 
                           pcap_import["info"].str.contains(r"Ans", na=False)]

nflx_answers.head(10)

Unnamed: 0,timestamp,length,src_ip,dst_ip,protocol,src_port,dst_port,txid,info
101,1518358202.90277,109,128.93.77.234,192.168.43.72,DNS,53.0,48058.0,30582.0,"Ether / IP / UDP / DNS Ans ""198.38.120.130"""
102,1518358202.90278,102,128.93.77.234,192.168.43.72,DNS,53.0,43209.0,45466.0,"Ether / IP / UDP / DNS Ans ""52.19.39.146"""
103,1518358202.90281,113,128.93.77.234,192.168.43.72,DNS,53.0,4046.0,7939.0,"Ether / IP / UDP / DNS Ans ""52.210.19.176"""
104,1518358202.90283,109,128.93.77.234,192.168.43.72,DNS,53.0,50901.0,14313.0,"Ether / IP / UDP / DNS Ans ""198.38.120.153"""
105,1518358202.90286,96,128.93.77.234,192.168.43.72,DNS,53.0,28162.0,12361.0,"Ether / IP / UDP / DNS Ans ""23.57.80.120"""
108,1518358202.90331,96,128.93.77.234,192.168.43.72,DNS,53.0,48245.0,24416.0,"Ether / IP / UDP / DNS Ans ""23.57.80.120"""
219,1518358203.64291,94,128.93.77.234,192.168.43.72,DNS,53.0,57216.0,26231.0,"Ether / IP / UDP / DNS Ans ""198.38.120.137"""
224,1518358203.65937,96,128.93.77.234,192.168.43.72,DNS,53.0,55348.0,48727.0,"Ether / IP / UDP / DNS Ans ""23.57.80.120"""
228,1518358203.67025,109,128.93.77.234,192.168.43.72,DNS,53.0,15562.0,37909.0,"Ether / IP / UDP / DNS Ans ""198.38.120.167"""
1019,1518358212.32345,108,128.93.77.234,192.168.43.72,DNS,53.0,36897.0,21826.0,"Ether / IP / UDP / DNS Ans ""34.252.77.54"""


In [158]:
# Get Netflix IPs

nflx_ips = set()
for info in nflx_answers["info"]:
    parts = info.split("\"")
    for part in parts:
        if part.count(".") == 3:  # crude check for IPv4 address
            nflx_ips.add(part)

print(nflx_ips)

{'198.38.120.167', '198.38.120.166', '52.210.133.255', '198.38.120.134', '52.48.148.78', '52.48.8.150', '198.38.120.153', '23.57.80.120', '52.19.39.146', '198.38.120.162', '52.208.128.101', '198.38.120.130', '52.210.19.176', '198.38.120.137', '34.252.77.54', '198.38.120.164'}
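
The three-dot check in the cell above would also accept any quoted string that happens to contain three dots, such as a hostname, should one appear in an answer summary. A stricter alternative, sketched here with the standard-library `ipaddress` module, is shown below. The helper name `extract_ipv4` and the sample strings are hypothetical and only illustrate the idea.

```python
import ipaddress


def extract_ipv4(info: str) -> set:
    """Return the quoted parts of a summary string that parse as IPv4 addresses."""
    found = set()
    for part in info.split('"'):
        candidate = part.strip()
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError if not valid IPv4
        except ValueError:
            continue
        found.add(candidate)
    return found


# Hypothetical summary strings, modeled on the "info" column above.
address_like = "Ether / IP / UDP / DNS Ans \"198.38.120.130\""
cname_like = "Ether / IP / UDP / DNS Ans \"b'www.netflix.com.'\""

ips_from_address = extract_ipv4(address_like)
ips_from_cname = extract_ipv4(cname_like)  # three dots, but not an IP address
print(ips_from_address, ips_from_cname)
```

Applied to the notebook, the same helper could replace the inner loop by calling `nflx_ips.update(extract_ipv4(info))` for each row.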


In [152]:
# Get all Netflix packets by IP address

nflx_pkts = pcap_import[(pcap_import["src_ip"].isin(nflx_ips)) | (pcap_import["dst_ip"].isin(nflx_ips))].copy()

nflx_pkts.head(10)

Unnamed: 0,timestamp,length,src_ip,dst_ip,protocol,src_port,dst_port,txid,info
107,1518358202.90327,78,192.168.43.72,198.38.120.130,TCP,58451.0,443.0,,Ether / IP / TCP 192.168.43.72:58451 > 198.38....
109,1518358202.90332,78,192.168.43.72,198.38.120.130,TCP,58452.0,443.0,,Ether / IP / TCP 192.168.43.72:58452 > 198.38....
110,1518358202.90341,78,192.168.43.72,198.38.120.130,TCP,58453.0,443.0,,Ether / IP / TCP 192.168.43.72:58453 > 198.38....
112,1518358202.90363,78,192.168.43.72,52.19.39.146,TCP,58454.0,443.0,,Ether / IP / TCP 192.168.43.72:58454 > 52.19.3...
113,1518358202.90369,78,192.168.43.72,52.19.39.146,TCP,58455.0,443.0,,Ether / IP / TCP 192.168.43.72:58455 > 52.19.3...
114,1518358202.90377,78,192.168.43.72,52.19.39.146,TCP,58456.0,443.0,,Ether / IP / TCP 192.168.43.72:58456 > 52.19.3...
115,1518358202.90381,78,192.168.43.72,52.19.39.146,TCP,58457.0,443.0,,Ether / IP / TCP 192.168.43.72:58457 > 52.19.3...
116,1518358202.90394,78,192.168.43.72,52.19.39.146,TCP,58458.0,443.0,,Ether / IP / TCP 192.168.43.72:58458 > 52.19.3...
117,1518358202.90397,78,192.168.43.72,52.19.39.146,TCP,58459.0,443.0,,Ether / IP / TCP 192.168.43.72:58459 > 52.19.3...
119,1518358202.90421,78,192.168.43.72,52.210.19.176,TCP,58460.0,443.0,,Ether / IP / TCP 192.168.43.72:58460 > 52.210....


### Generate Statistics

Generate statistics and features for the Netflix traffic flows. Use the `netml` library or any other technique that you choose to generate a set of features that you think would be good features for your model. 

In [160]:
import numpy as np
# Keep TCP and UDP traffic only, then group packets into 10-second windows.
# Per-window features: packet count, total bytes, mean/max/min packet size,
# mean inter-arrival time (IAT), and number of unique destination IPs.

df = nflx_pkts.copy()
df = df[df["protocol"].fillna("").str.lower().isin(["tcp", "udp"])].reset_index(drop=True)

# ensure timestamp and length numeric
df["timestamp"] = df["timestamp"].astype(float)
df["length"] = df["length"].astype(int)

# compute 10-second time window (align to multiples of 10)
df["time_window"] = (np.floor(df["timestamp"] / 10) * 10).astype(int)

# compute inter-arrival time within each time window
df["iat_within_window"] = df.groupby("time_window")["timestamp"].diff()

# aggregate per time window
nflx_windows = df.groupby("time_window").agg(
    packet_count=("length", "size"),
    total_bytes=("length", "sum"),
    mean_pkt_size=("length", "mean"),
    max_pkt_size=("length", "max"),
    min_pkt_size=("length", "min"),
    mean_iat=("iat_within_window", "mean"),
    unique_dst_ips=("dst_ip", "nunique"),
).reset_index()

# merge aggregated features back into nflx_pkts
nflx_pkts = df.merge(nflx_windows, on="time_window", how="left")

# show result
print("Per-window summary rows:", len(nflx_windows))
display(nflx_windows.head())

Per-window summary rows: 50


Unnamed: 0,time_window,packet_count,total_bytes,mean_pkt_size,max_pkt_size,min_pkt_size,mean_iat,unique_dst_ips
0,1518358200,679,80019,117.84831,200,66,0.01045,8
1,1518358210,494,63930,129.41296,200,54,0.0201,8
2,1518358220,614,80536,131.16612,200,54,0.0162,10
3,1518358230,509,72511,142.45776,200,66,0.019,4
4,1518358240,388,56428,145.43299,200,66,0.0248,4


**Write a brief justification for the features that you have chosen.**

### Inferring Segment Downloads

In addition to the features that you could generate using the `netml` library or similar, add to your feature vector a "segment download rate" feature, which indicates the number of video segments downloaded in a given time window.

Note: If you are using the `netml` library, generating features with `SAMP` style options may be useful, as this option gives you time windows, and you can then simply add the segment download rate to that existing dataframe.

In [162]:

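# NOTE: this counts every packet in the window, so the column is identical to
# packet_count (see the output below). It is a placeholder, not a true count
# of video segments.
segment_download_rate = nflx_pkts.groupby("time_window").size().reset_index(name="segment_download_rate")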
nflx_windows = nflx_windows.merge(segment_download_rate, on="time_window", how="left")
nflx_windows["segment_download_rate"] = nflx_windows["segment_download_rate"].fillna(0).astype(int)

# show result
print("Per-window summary rows:", len(nflx_windows))
display(nflx_windows.head())

Per-window summary rows: 50


Unnamed: 0,time_window,packet_count,total_bytes,mean_pkt_size,max_pkt_size,min_pkt_size,mean_iat,unique_dst_ips,segment_download_rate
0,1518358200,679,80019,117.84831,200,66,0.01045,8,679
1,1518358210,494,63930,129.41296,200,54,0.0201,8,494
2,1518358220,614,80536,131.16612,200,54,0.0162,10,614
3,1518358230,509,72511,142.45776,200,66,0.019,4,509
4,1518358240,388,56428,145.43299,200,66,0.0248,4,388
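
The cell above counts all packets per window, so `segment_download_rate` duplicates `packet_count`. One possible heuristic, offered only as a sketch, is to treat each upstream packet that is large enough to carry a request as the start of a segment download and to count those per window. The helper name `count_segment_requests`, the 100-byte cutoff, and the toy addresses are assumptions for illustration and are not part of the assignment; the cutoff in particular would need tuning against the real trace.

```python
import numpy as np
import pandas as pd


def count_segment_requests(pkts, client_ip, min_request_len=100, window_s=10):
    """Count upstream packets that look like requests, per time window.

    min_request_len is an assumed cutoff that separates bare ACKs from
    packets carrying a request; it is not derived from the real trace.
    """
    is_upstream = pkts["src_ip"] == client_ip
    is_request = pkts["length"] >= min_request_len
    up = pkts[is_upstream & is_request].copy()
    up["time_window"] = (np.floor(up["timestamp"] / window_s) * window_s).astype(int)
    return up.groupby("time_window").size().rename("segment_requests").reset_index()


# Toy packets: 10.0.0.2 plays the client, 198.51.100.7 plays the server.
toy = pd.DataFrame({
    "timestamp": [0.5, 1.0, 1.2, 11.0, 12.0, 12.5],
    "length": [180, 66, 200, 150, 66, 160],
    "src_ip": ["10.0.0.2", "10.0.0.2", "198.51.100.7",
               "10.0.0.2", "10.0.0.2", "10.0.0.2"],
})

segment_counts = count_segment_requests(toy, client_ip="10.0.0.2")
print(segment_counts)
```

The resulting frame can be merged into `nflx_windows` on `time_window` in the same way as the other per-window features, with missing windows filled with 0.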


## Part 2: Video Quality Inference

You will now load the complete video dataset from a previous study to train and test models based on these features to automatically infer the quality of a streaming video flow.

For this part of the assignment, you will need two pickle files, which you can download by running the code below:

```

!gdown 'https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS' -O netflix_session.pkl
!gdown 'https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI' -O video_dataset.pkl

```

### Load the File

Load the video dataset pickle file.

In [174]:
netflix_video = pd.read_pickle("video_dataset.pkl")  # file saved by the gdown commands above

### Clean the File

1. The dataset contains video resolutions that are not valid. Remove entries in the dataset that do not contain a valid video resolution. Valid resolutions are 240, 360, 480, 720, and 1080.

In [175]:
valid_res = {240, 360, 480, 720, 1080}
netflix_video = netflix_video[netflix_video["resolution"].isin(valid_res)]

2. The file also contains columns that are unnecessary (in fact, unhelpful!) for performing predictions. Identify those columns, and remove them.

In [176]:
# Column cleanup steps applied below:
#   1. drop columns that contain any NaN
#   2. drop columns holding nested structures (lists, dicts, arrays, ...)
#   3. drop columns with a single constant value
#   4. drop columns that exactly duplicate another column

from pandas.api import types as ptypes

df = netflix_video.copy()

# 1 Drop columns that contain any NaN
df = df.dropna(axis=1)

# 2 Remove columns with nested/unhashable values
def is_scalar_series(s):
    # Try to detect columns that contain lists, dicts, arrays, or tuples
    try:
        sample = s.head(100)
        for x in sample:
            if isinstance(x, (list, dict, set, tuple, np.ndarray)):
                return False
        return True
    except Exception:
        return False

scalar_cols = [c for c in df.columns if is_scalar_series(df[c])]
df = df[scalar_cols]

# 3 Remove columns with constant value
nunique = df.nunique(dropna=False)
df = df.loc[:, nunique > 1]

# 4 Remove duplicate columns
df = df.loc[:, ~df.T.duplicated()]

netflix_video_clean = df.copy()

# show resulting dataframe shape and head
print("Resulting shape:", netflix_video_clean.shape)
display(netflix_video_clean.head())

Resulting shape: (49748, 216)


Unnamed: 0,10_avg_chunksize,10_chunksizes_50,10_chunksizes_75,10_chunksizes_85,10_chunksizes_85R,10_chunksizes_90,10_chunksizes_90R,10_max_chunksize,10_min_chunksize,10_std_chunksize,...,userStrBytesInFlight,userSynFlags,userTwoRetransmit,userXRetransmit,userZeroRetransmit,startup3.3,startup6.6,startup5,startup10,startup_mc
208,148947.1,43473.0,185098.0,361832.3,361832.3,461040.2,461040.2,539882,4380,185126.0422,...,0,4,0.0,0.10873,0.55794,False,False,False,False,12.0
209,91984.2,101882.0,122323.0,159843.4,159843.4,181011.8,181011.8,196778,24498,62449.10736,...,0,1,0.0,0.11687,0.5498,False,False,False,False,12.0
210,147725.0,111373.0,224157.0,286280.0,286280.0,325845.8,325845.8,396800,24498,134268.55972,...,0,0,0.0,0.12685,0.53981,False,False,False,False,12.0
211,246420.6,297522.0,330737.5,361030.7,361030.7,378404.0,378404.0,396800,24498,127394.37245,...,0,1,0.0,0.18611,0.48056,False,False,False,False,12.0
212,336681.0,354461.0,399719.5,409866.3,409866.3,414060.0,414060.0,438000,198238,79844.368,...,0,0,0.0,0.19444,0.47222,False,False,False,False,12.0


**Briefly explain why you removed those columns.**

### Prepare Your Data

Prepare your data matrix, determine your features and labels, and perform a train-test split on your data.
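
A minimal sketch of this step, assuming the label is the `resolution` column used in the cleaning step and that every remaining numeric or boolean column is a candidate feature. The toy frame and its `feat_a`/`feat_b` columns are made up so the example runs on its own; with the real data, `netflix_video_clean` would take the place of `toy`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataset (hypothetical feature columns).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feat_a": rng.normal(size=200),
    "feat_b": rng.normal(size=200),
    "resolution": np.tile([240, 360, 480, 720, 1080], 40),
})

# Labels are the resolution; features are every other numeric/boolean column.
y = toy["resolution"]
X = toy.drop(columns=["resolution"]).select_dtypes(include=["number", "bool"])

# Stratify so each resolution keeps the same share in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

A random split treats rows as independent. If rows from the same video session end up on both sides of the split, the test score may look better than it should, so a group-aware split could be worth considering when a session identifier is available.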

### Train Your Model

1. Select a model of your choice.
2. Train the model using your training data.
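
As one possible choice, the sketch below trains a random forest classifier with scikit-learn. The random forest and its settings are assumptions made for illustration, not a required model, and the synthetic data from `make_classification` only stands in for the real feature matrix and resolution labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 5-class problem standing in for the five resolution classes.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6, n_classes=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training split only; the test split is held out for evaluation.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Tree ensembles need no feature scaling, which is convenient for a feature set that mixes byte counts, ratios, and flags; other model families may need a scaling step first.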

### Tune Your Model

Perform hyperparameter tuning to find optimal parameters for your model.
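
One common way to do this is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the parameter grid, the 3-fold setting, and the macro-F1 scoring choice are illustrative assumptions and would need adjusting for the real dataset and model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the training split.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Small, hypothetical grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="f1_macro",  # weights every class equally
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The search should see only the training split, so that the test split remains an untouched estimate of how the tuned model generalizes.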

### Evaluate Your Model

Evaluate your model according to the following metrics:

1. Accuracy
2. F1 Score
3. Confusion Matrix
4. ROC/AUC
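
The four metrics above can be computed with `sklearn.metrics`, as in the sketch below. Because resolution is a multi-class label, the example assumes macro-averaged F1 and a one-vs-rest ROC AUC computed from predicted probabilities; other averaging choices are equally defensible. The data and classifier are synthetic stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 5-class problem standing in for the resolution labels.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

y_pred = clf.predict(X_test)        # hard class predictions
y_prob = clf.predict_proba(X_test)  # per-class probabilities, ordered as clf.classes_

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="macro")
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)  # rows: true, columns: predicted
auc = roc_auc_score(y_test, y_prob, multi_class="ovr", labels=clf.classes_)

print(f"accuracy={acc:.3f}  macro-F1={f1:.3f}  OvR AUC={auc:.3f}")
print(cm)
```

Reading the confusion matrix row by row shows which resolutions get mistaken for which, which a single accuracy number hides.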

## Part 3: Predict the Ongoing Resolution of a Real Netflix Session

Now that you have your model, it's time to put it into practice!

Use a preprocessed Netflix video session to infer **and plot** the resolution at 10-second time intervals.
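
A minimal plotting sketch, assuming the session yields one feature row (and therefore one prediction) per 10-second window. The `predicted` array below is made-up stand-in data; in practice it would come from calling the trained model's `predict` on the session's feature rows.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in predictions, one per 10-second window (hypothetical values).
predicted = np.array([240, 240, 360, 480, 480, 720, 720, 1080, 1080, 720])
elapsed_s = np.arange(len(predicted)) * 10  # start time of each window, in seconds

fig, ax = plt.subplots(figsize=(8, 3))
# A step plot holds each predicted resolution for the length of its window.
ax.step(elapsed_s, predicted, where="post")
ax.set_yticks([240, 360, 480, 720, 1080])
ax.set_xlabel("Time since session start (s)")
ax.set_ylabel("Predicted resolution (p)")
ax.set_title("Inferred resolution per 10-second window")
fig.tight_layout()
```

A step plot fits better than a line plot here because resolution is categorical: the model predicts one class per window, not a value that varies smoothly between windows.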