# Assignment: Video Quality Inference

To this point in the class, you have learned techniques for loading and analyzing packet captures of various types, generating features from those packet captures, and training and evaluating models using those features.

In this assignment, you will put all of this together, using a labeled network traffic trace to train a model that automatically infers video quality of experience.

## Part 1: Warmup

The first part of this assignment builds directly on the hands-on activities but extends them slightly.

### Extract Features from the Network Traffic

Load the `netflix.pcap` file, which is a packet trace that includes network traffic. 

Click [here](https://github.com/noise-lab/ml-systems/blob/main/docs/notebooks/data/netflix.pcap) to download `netflix.pcap`.


In [1]:
import pandas as pd

In [2]:
# I opened the pcap in Wireshark and exported it as a CSV.
# That is why the filename below is different from what you may expect

packets = pd.read_csv("ml-systems-netflix-packets.csv")

In [3]:
packets.head(10)

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info,dest (r),dest(u),src (r),src(u),New Column
0,1,0.0,192.168.43.72,128.93.77.234,DNS,77,Standard query 0xed0c A fonts.gstatic.com,53.0,53.0,55697.0,55697.0,1
1,2,0.00015,192.168.43.72,128.93.77.234,DNS,77,Standard query 0x301a AAAA fonts.gstatic.com,53.0,53.0,59884.0,59884.0,2
2,3,0.004726,192.168.43.72,128.93.77.234,DNS,87,Standard query 0x11d3 A googleads.g.doubleclic...,53.0,53.0,61223.0,61223.0,3
3,4,0.006522,192.168.43.72,128.93.77.234,DNS,87,Standard query 0x1284 AAAA googleads.g.doublec...,53.0,53.0,58785.0,58785.0,4
4,5,0.011103,192.168.43.72,128.93.77.234,DNS,78,Standard query 0x3432 AAAA ytimg.l.google.com,53.0,53.0,51938.0,51938.0,5
5,6,0.012354,192.168.43.72,128.93.77.234,DNS,96,Standard query 0xb756 A r4---sn-gxo5uxg-jqbe.g...,53.0,53.0,20949.0,20949.0,6
6,7,0.012474,192.168.43.72,128.93.77.234,DNS,75,Standard query 0x62ab A ssl.gstatic.com,53.0,53.0,58025.0,58025.0,7
7,8,0.012567,192.168.43.72,128.93.77.234,DNS,74,Standard query 0x42fb A www.google.com,53.0,53.0,15895.0,15895.0,8
8,9,0.319268,128.93.77.234,192.168.43.72,DNS,386,Standard query response 0x11d3 A 216.58.213.162,61223.0,61223.0,53.0,53.0,9
9,10,0.319288,192.168.43.72,128.93.77.234,DNS,75,Standard query 0x8756 A www.gstatic.com,53.0,53.0,18154.0,18154.0,10


### Cleaning the data frame

Before getting started on the next step, I wanted to get a little more familiar with the data. First, I wanted to answer:

1. What is the difference between "dest (r)" and "dest(u)"?
2. What is the difference between "src (r)" and "src(u)"?
3. What is going on with the "New Column" column?

I could of course look it up or ask Claude, but where is the fun in that? I wanted to first investigate what is going on myself.

In [4]:
# this cell shows there are 80 instances where "dest (r)" is not equal to "dest(u)"
len(packets) - sum(packets["dest (r)"] == packets["dest(u)"]) 

80

In [5]:
# we investigate what is happening in the cases where these aren't equal
packets[packets["dest (r)"] != packets["dest(u)"]].head(5)

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info,dest (r),dest(u),src (r),src(u),New Column
877,878,8.391182,CeLink_0d:2b:a7,Apple_01:4c:54,ARP,60,Who has 192.168.43.72? Tell 192.168.43.1,,,,,878
878,879,8.391235,Apple_01:4c:54,CeLink_0d:2b:a7,ARP,42,192.168.43.72 is at e4:ce:8f:01:4c:54,,,,,879
2324,2325,33.085201,fe80::a021:b7ff:febb:19ec,ff02::1,ICMPv6,86,Multicast Listener Query,,,,,2325
2388,2389,34.215829,fe80::e6ce:8fff:fe01:4c54,ff02::1:ff01:4c54,ICMPv6,86,Multicast Listener Report,,,,,2389
2459,2460,35.215949,fe80::e6ce:8fff:fe01:4c54,ff02::1:ff77:3595,ICMPv6,86,Multicast Listener Report,,,,,2460


In [6]:
# we can see that these 80 values are accounted for by NaN values
print(packets[~packets["dest (r)"].isna() & (packets["dest (r)"] != packets["dest(u)"])])

# I used Claude on this cell to debug an issue that I was having with indexing in pandas:
# https://claude.ai/share/d68d6049-2976-400d-9395-e8a6c1cf7f19

Empty DataFrame
Columns: [No., Time, Source, Destination, Protocol, Length, Info, dest (r), dest(u), src (r), src(u), New Column]
Index: []


In [7]:
# we can now show the same thing with "src (r)" and "src(u)"
# we can see that these 80 values are accounted for by NaN values
print(packets[~packets["src (r)"].isna() & (packets["src (r)"] != packets["src(u)"])])

Empty DataFrame
Columns: [No., Time, Source, Destination, Protocol, Length, Info, dest (r), dest(u), src (r), src(u), New Column]
Index: []


In [8]:
# now we investigate the "New Column"
new_col_list = packets["New Column"].to_list()

In [9]:
# now we show that it is just the row number, indexed from one
for i, value in enumerate(new_col_list, start=1):
    assert value == i

So to answer my original questions from above:

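1. The values for "dest (r)" and "dest(u)" are identical wherever a port is present. The only mismatches are rows with missing (NaN) ports, such as ARP and ICMPv6 packets.
2. The values for "src (r)" and "src(u)" are identical in the same way.
3. The values in "New Column" are just the row number, indexed from one, so they contain no meaningful information.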
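The reason NaN rows show up as mismatches is that an element-wise comparison never treats NaN as equal to NaN. A minimal sketch on a toy frame (the values are illustrative, not taken from the trace) shows the effect, along with `Series.equals`, which does treat NaNs in the same position as equal:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two port columns; values are illustrative.
toy = pd.DataFrame({
    "dest (r)": [53.0, 443.0, np.nan],
    "dest(u)": [53.0, 443.0, np.nan],
})

# Element-wise comparison: NaN != NaN evaluates to True, so the NaN row counts as a mismatch.
naive_mismatches = int((toy["dest (r)"] != toy["dest(u)"]).sum())

# Series.equals considers NaNs in the same location to be equal.
same_with_nan = toy["dest (r)"].equals(toy["dest(u)"])

print(naive_mismatches, same_with_nan)
```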

Therefore we can collapse "dest (r)" and "dest(u)" into a single "dest" column. Likewise, we can collapse "src (r)" and "src(u)" into a single "src" column. We can drop "New Column" entirely.

In [25]:
packets = packets.drop(['src (r)', 'dest (r)', 'New Column'], axis=1)
packets = packets.rename(columns={"dest(u)": "dest", "src(u)": "src"})

# I referenced the link below for how to drop a column in pandas
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

# I referenced the link below to know how to rename columns in pandas
# https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas

In [31]:
packets.head(10)

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info,dest,src
0,1,0.0,192.168.43.72,128.93.77.234,DNS,77,Standard query 0xed0c A fonts.gstatic.com,53.0,55697.0
1,2,0.00015,192.168.43.72,128.93.77.234,DNS,77,Standard query 0x301a AAAA fonts.gstatic.com,53.0,59884.0
2,3,0.004726,192.168.43.72,128.93.77.234,DNS,87,Standard query 0x11d3 A googleads.g.doubleclic...,53.0,61223.0
3,4,0.006522,192.168.43.72,128.93.77.234,DNS,87,Standard query 0x1284 AAAA googleads.g.doublec...,53.0,58785.0
4,5,0.011103,192.168.43.72,128.93.77.234,DNS,78,Standard query 0x3432 AAAA ytimg.l.google.com,53.0,51938.0
5,6,0.012354,192.168.43.72,128.93.77.234,DNS,96,Standard query 0xb756 A r4---sn-gxo5uxg-jqbe.g...,53.0,20949.0
6,7,0.012474,192.168.43.72,128.93.77.234,DNS,75,Standard query 0x62ab A ssl.gstatic.com,53.0,58025.0
7,8,0.012567,192.168.43.72,128.93.77.234,DNS,74,Standard query 0x42fb A www.google.com,53.0,15895.0
8,9,0.319268,128.93.77.234,192.168.43.72,DNS,386,Standard query response 0x11d3 A 216.58.213.162,61223.0,53.0
9,10,0.319288,192.168.43.72,128.93.77.234,DNS,75,Standard query 0x8756 A www.gstatic.com,53.0,18154.0


### Identifying the Service Type

Use the DNS traffic to filter the packet trace for Netflix traffic.

In [27]:
# The variable below is from a previous hands-on activity: https://github.com/noise-lab/ml-systems/blob/main/docs/notebooks/03-Performance-Service-Clean.ipynb
NF_DOMAINS = ["nflxvideo",
              "netflix",
              "nflxso",
              "nflxext"]

In [28]:
set(packets['Protocol'])

{'ARP',
 'BOOTP',
 'DNS',
 'EAPOL',
 'HTTP',
 'ICMP',
 'ICMPv6',
 'IGMPv2',
 'MDNS',
 'NBNS',
 'SSLv2',
 'TCP',
 'TLSv1',
 'TLSv1.2'}

In [29]:
# get only DNS packets

dns_packets = packets[packets["Protocol"] == "DNS"]

In [47]:
# I used this chat with claude to learn the syntax for how to iterate over rows
# in a pandas dataframe:
# https://claude.ai/share/323a5fde-d8a8-4104-81cb-f89683cd0d3c

our_ip = "192.168.43.72"

netflix_transaction_ids = set()
netflix_ips = set()

for index, row in dns_packets.iterrows():
    words = row['Info'].split()

    # if the packet is a DNS response, see if the response matches any
    # netflix query
    if words[2] == "response":
        transaction_id = words[3]
        if transaction_id not in netflix_transaction_ids:
            continue
        netflix_transaction_ids.remove(transaction_id)

        netflix_ip = words[-1]

        # make sure the last word looks like an IPv4 address
        assert '.' in netflix_ip
        for num in netflix_ip.split('.'):
            int(num)  # raises ValueError if a part is not numeric

        netflix_ips.add(netflix_ip)
        # a response is never also a query, so skip the query handling below
        continue

    # if the packet is a DNS request for netflix, extract and track the
    # transaction id
    for domain in NF_DOMAINS:
        if domain in row['Info']:
            transaction_id = words[2]

            assert '0x' in transaction_id
            netflix_transaction_ids.add(transaction_id)
            break

In [48]:
netflix_ips

{'198.38.120.130',
 '198.38.120.134',
 '198.38.120.137',
 '198.38.120.153',
 '198.38.120.162',
 '198.38.120.164',
 '198.38.120.166',
 '198.38.120.167',
 '23.57.80.120',
 '34.252.77.54',
 '52.19.39.146',
 '52.208.128.101',
 '52.210.133.255',
 '52.210.19.176',
 '52.48.148.78',
 '52.48.8.150'}
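The same query/response matching can also be written as a small function over the `Info` strings, which makes it easy to test on a handful of lines. This is only a sketch: `resolve_ips` and the two regular expressions are hypothetical names, the patterns assume the Wireshark summary format seen in this trace, and unlike the loop above it collects every IPv4 address in a matching response rather than only the last word.

```python
import re

# Matches summaries such as "Standard query 0xed0c A fonts.gstatic.com"
# and "Standard query response 0x11d3 A 216.58.213.162".
DNS_INFO = re.compile(r"^Standard query (?P<response>response )?(?P<txid>0x[0-9a-f]+)\b")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def resolve_ips(infos, domains):
    """Return IPv4 addresses from responses whose query named one of `domains`."""
    pending = set()  # transaction IDs of matching queries still awaiting a response
    ips = set()
    for info in infos:
        match = DNS_INFO.match(info)
        if match is None:
            continue
        txid = match.group("txid")
        if match.group("response"):
            if txid in pending:
                pending.discard(txid)
                ips.update(IPV4.findall(info))
        elif any(domain in info for domain in domains):
            pending.add(txid)
    return ips

# Summary lines copied from the trace output in this notebook.
sample = [
    "Standard query 0x11d3 A googleads.g.doubleclic...",
    "Standard query 0xb19a A www.netflix.com",
    "Standard query response 0x11d3 A 216.58.213.162",
    "Standard query response 0xb19a A 52.19.39.146",
]
resolved = resolve_ips(sample, ["nflxvideo", "netflix", "nflxso", "nflxext"])
print(resolved)
```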

We have extracted the IP addresses of Netflix servers from the DNS traffic. As a sanity check,
let's now inspect the DNS packets that contain these IP addresses.

In [50]:
netflix_transaction_ids = []
for index, row in dns_packets.iterrows():
    for ip in netflix_ips:
        if ip in row["Info"]:
            print(row["Info"])
            netflix_transaction_ids.append(row["Info"].split()[3])

Standard query response 0x7776 A 198.38.120.130
Standard query response 0xb19a A 52.19.39.146
Standard query response 0x1f03 A 52.210.19.176
Standard query response 0x37e9 A 198.38.120.153
Standard query response 0x3049 A 23.57.80.120
Standard query response 0x5f60 A 23.57.80.120
Standard query response 0x6677 A 198.38.120.137
Standard query response 0xbe57 A 23.57.80.120
Standard query response 0x9415 A 198.38.120.167
Standard query response 0x5542 A 34.252.77.54
Standard query response 0x269f A 198.38.120.134
Standard query response 0x80e9 A 198.38.120.164
Standard query response 0xc0b4 A 198.38.120.166
Standard query response 0x3981 A 198.38.120.162
Standard query response 0x82dc A 52.48.148.78
Standard query response 0x86ee A 52.48.8.150
Standard query response 0xed3f A 52.208.128.101
Standard query response 0xa9b1 A 52.210.133.255


In [54]:
for index, row in dns_packets.iterrows():
    for transaction_id in netflix_transaction_ids:
        if transaction_id in row["Info"]:
            print(row["Info"])

Standard query 0xb19a A www.netflix.com
Standard query 0x3049 A assets.nflxext.com
Standard query 0x5f60 A codex.nflxext.com
Standard query 0x1f03 A customerevents.netflix.com
Standard query 0x7776 A ipv4-c001-cdg001-ix.1.oca.nflxvideo.net
Standard query 0x37e9 A ipv4-c024-cdg001-ix.1.oca.nflxvideo.net
Standard query response 0x7776 A 198.38.120.130
Standard query response 0xb19a A 52.19.39.146
Standard query response 0x1f03 A 52.210.19.176
Standard query response 0x37e9 A 198.38.120.153
Standard query response 0x3049 A 23.57.80.120
Standard query response 0x5f60 A 23.57.80.120
Standard query 0x9415 A ipv4-c072-cdg001-ix.1.oca.nflxvideo.net
Standard query 0x6677 A occ-0-56-55.1.nflxso.net
Standard query 0xbe57 A tp-s.nflximg.net
Standard query response 0x6677 A 198.38.120.137
Standard query response 0xbe57 A 23.57.80.120
Standard query response 0x9415 A 198.38.120.167
Standard query 0x5542 A push.prod.netflix.com
Standard query response 0x5542 A 34.252.77.54
Standard query 0x269f A ipv
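With the Netflix server addresses in hand, the remaining step in this section is to filter the full trace down to packets exchanged with those servers. A minimal sketch, assuming the `Source` and `Destination` columns shown earlier; `filter_by_ips` is a hypothetical helper and the toy rows (including their lengths) are illustrative:

```python
import pandas as pd

def filter_by_ips(packets, ips):
    """Keep rows whose Source or Destination is one of `ips`."""
    mask = packets["Source"].isin(ips) | packets["Destination"].isin(ips)
    return packets[mask]

# Toy rows standing in for the cleaned `packets` frame.
toy = pd.DataFrame({
    "Source": ["192.168.43.72", "198.38.120.130", "192.168.43.72"],
    "Destination": ["198.38.120.130", "192.168.43.72", "128.93.77.234"],
    "Length": [66, 1514, 77],
})
netflix_toy = filter_by_ips(toy, {"198.38.120.130"})
print(len(netflix_toy))
```

On the real data, the call would presumably be `filter_by_ips(packets, netflix_ips)`.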

### Generate Statistics

Generate statistics and features for the Netflix traffic flows. Use the `netml` library or any other technique that you choose to generate a set of features that you think would work well for your model.

**Write a brief justification for the features that you have chosen.**
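One option that avoids `netml` is to aggregate the packet table directly with pandas. The sketch below is an assumption about what reasonable flow features might look like, not a prescribed feature set: `flow_features` is a hypothetical helper, a flow is taken to be the unidirectional 4-tuple of addresses and ports, and rows without ports (NaN keys) are dropped by `groupby` by default.

```python
import pandas as pd

# Unidirectional flow key built from the cleaned column names used in this notebook.
FLOW_KEY = ["Source", "Destination", "src", "dest"]

def flow_features(packets):
    """Aggregate packet rows into one row of summary statistics per flow."""
    features = packets.groupby(FLOW_KEY).agg(
        n_packets=("Length", "count"),
        total_bytes=("Length", "sum"),
        mean_length=("Length", "mean"),
        first_seen=("Time", "min"),
        last_seen=("Time", "max"),
    )
    features["duration"] = features["last_seen"] - features["first_seen"]
    return features.reset_index()

# Toy packets; the values are illustrative, not taken from the trace.
toy = pd.DataFrame({
    "Time": [0.0, 0.5, 1.5],
    "Source": ["192.168.43.72", "198.38.120.130", "192.168.43.72"],
    "Destination": ["198.38.120.130", "192.168.43.72", "198.38.120.130"],
    "src": [50000.0, 443.0, 50000.0],
    "dest": [443.0, 50000.0, 443.0],
    "Length": [100, 1500, 200],
})
flows = flow_features(toy)
print(flows)
```

Packet counts, byte totals, and durations are plausible candidates because higher-resolution video generally moves more bytes per unit of time, but the justification should come from your own inspection of the data.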

### Inferring Segment Downloads

In addition to the features that you could generate using the `netml` library or similar, add to your feature vector a "segment download rate" feature, which indicates the number of video segments downloaded in a given time window.

Note: If you are using the `netml` library, generating features with `SAMP` style options may be useful, as this option gives you time windows, and you can then simply add the segment download rate to that existing dataframe.
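A rough way to count segment downloads without decrypting anything is to count upstream packets that are large enough to be requests. The sketch below rests on that heuristic, which is an assumption rather than something this assignment specifies: `segment_download_rate` is a hypothetical helper, and the `min_request_bytes` threshold of 200 and the toy rows are illustrative and would need tuning against the real trace.

```python
import pandas as pd

def segment_download_rate(packets, client_ip, window=10.0, min_request_bytes=200):
    """Count likely segment requests per time window.

    Assumption: an upstream packet (sent by the client) longer than
    `min_request_bytes` is a request for a new video segment.
    """
    is_request = (packets["Source"] == client_ip) & (packets["Length"] > min_request_bytes)
    upstream = packets[is_request]
    window_index = (upstream["Time"] // window).astype(int)
    n_windows = int(packets["Time"].max() // window) + 1
    counts = upstream.groupby(window_index).size()
    # Windows with no requests are missing from the groupby result; fill them with zero.
    counts = counts.reindex(range(n_windows), fill_value=0)
    return counts.rename("segment_requests").rename_axis("window")

# Toy packets: small upstream packets stand in for ACKs, large ones for requests.
toy = pd.DataFrame({
    "Time": [1.0, 2.0, 3.0, 12.0, 14.0, 25.0],
    "Source": ["192.168.43.72", "198.38.120.130", "192.168.43.72",
               "192.168.43.72", "192.168.43.72", "198.38.120.130"],
    "Length": [500, 1514, 66, 450, 480, 1514],
})
rate = segment_download_rate(toy, "192.168.43.72")
print(rate)
```

Because the result is indexed by window number, it can be joined onto any other per-window feature table that uses the same window length.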

## Part 2: Video Quality Inference

You will now load the complete video dataset from a previous study to train and test models based on these features to automatically infer the quality of a streaming video flow.

For this part of the assignment, you will need two pickle files, which you can download by running the code below:

```

!gdown 'https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS' -O netflix_session.pkl
!gdown 'https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI' -O video_dataset.pkl

```

### Load the File

Load the video dataset pickle file.

### Clean the File

1. The dataset contains video resolutions that are not valid. Remove entries in the dataset that do not contain a valid video resolution. Valid resolutions are 240, 360, 480, 720, 1080.

2. The file also contains columns that are unnecessary (in fact, unhelpful!) for performing predictions. Identify those columns, and remove them.

**Briefly explain why you removed those columns.**
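The resolution filter can be sketched with `isin`. The column name `resolution` is an assumption, since the dataset's schema is not shown here, and `keep_valid_resolutions` is a hypothetical helper; check `df.columns` on the real file first.

```python
import pandas as pd

VALID_RESOLUTIONS = [240, 360, 480, 720, 1080]

def keep_valid_resolutions(df, column="resolution"):
    """Drop rows whose resolution label is not one of the valid values."""
    return df[df[column].isin(VALID_RESOLUTIONS)].reset_index(drop=True)

# Toy frame; both columns and their values are illustrative.
toy = pd.DataFrame({
    "resolution": [240, 0, 720, 1080, 144, 480],
    "total_bytes": [10, 20, 30, 40, 50, 60],
})
cleaned = keep_valid_resolutions(toy)
print(cleaned)
```

Unhelpful columns can then be removed with `DataFrame.drop(columns=[...])` once you have identified them.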

### Prepare Your Data

Prepare your data matrix, determine your features and labels, and perform a train-test split on your data.

### Train Your Model

1. Select a model of your choice.
2. Train the model using your training data.

### Tune Your Model

Perform hyperparameter tuning to find optimal parameters for your model.
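A minimal sketch of a stratified split followed by a grid search, using scikit-learn. The random forest, the parameter grid, and the synthetic data from `make_classification` are all arbitrary stand-ins chosen for illustration; the assignment leaves the model choice open.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and resolution labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Stratify so each resolution class keeps its share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1_macro",
)
search.fit(X_train, y_train)
print(search.best_params_)
```

Tuning uses cross-validation on the training split only, so the test split stays untouched for the final evaluation.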

### Evaluate Your Model

Evaluate your model using the following metrics:

1. Accuracy
2. F1 Score
3. Confusion Matrix
4. ROC/AUC
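The four metrics above can be sketched with `sklearn.metrics`. The labels, predictions, and probabilities below are made up for illustration. Since resolution is a multi-class label, the sketch assumes macro-averaged F1 and one-vs-rest AUC, which requires class probabilities (for example from `predict_proba`) rather than hard predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

labels = [240, 360, 480]
y_true = np.array([240, 240, 360, 360, 480, 480])
y_pred = np.array([240, 360, 360, 360, 480, 240])
# Illustrative class probabilities, one column per entry in `labels`; each row sums to 1.
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.2, 0.3],
])

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", labels=labels)

print(accuracy, macro_f1, auc)
print(cm)
```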

## Part 3: Predict the Ongoing Resolution of a Real Netflix Session

Now that you have your model, it's time to put it into practice!

Use a preprocessed Netflix video session to infer **and plot** the resolution at 10-second time intervals.
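A step plot is a natural fit here, since each prediction holds for a whole interval. The sketch below assumes one prediction per 10-second window; `plot_resolution` is a hypothetical helper and the prediction list is a placeholder for whatever `model.predict` returns on the session features.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_resolution(predictions, interval=10.0):
    """Step plot of predicted resolution, one prediction per `interval` seconds."""
    times = np.arange(len(predictions)) * interval
    fig, ax = plt.subplots(figsize=(8, 3))
    # where="post" holds each value until the start of the next interval.
    ax.step(times, predictions, where="post")
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Predicted resolution (p)")
    ax.set_yticks([240, 360, 480, 720, 1080])
    return fig, ax

# Placeholder predictions standing in for the model's output on the session.
fig, ax = plot_resolution([240, 360, 480, 480, 720, 720])
```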