### Extract Features from the Network Traffic

Load the `netflix.pcap` file, which is a packet trace that includes network traffic. 

Click [here](https://github.com/noise-lab/ml-systems/blob/main/docs/notebooks/data/netflix.pcap) to download `netflix.pcap`.


In [None]:
from scapy.all import rdpcap, IP, TCP, UDP, DNS
import pandas as pd

# Read pcap and build rows with a robust 'length' computation
pkts = rdpcap("../../myData/netflix.pcap")
rows = []
for pkt in pkts:
    is_DNS = DNS in pkt
    proto = "DNS" if is_DNS else ("TCP" if TCP in pkt else ("UDP" if UDP in pkt else "Other"))
    # compute packet length robustly; len(pkt) normally works for scapy packets
    try:
        pkt_len = len(pkt)
    except Exception:
        pkt_len = getattr(pkt, 'len', None)
    rows.append({
        "timestamp": float(getattr(pkt, "time", 0)),
        "length": pkt_len,
        "src_ip": pkt[IP].src if IP in pkt else None,
        "dst_ip": pkt[IP].dst if IP in pkt else None,
        "txn_id": pkt[DNS].id if is_DNS else None,
        "protocol": proto,
        "src_port": pkt.sport if (TCP in pkt or UDP in pkt) else None,
        "dst_port": pkt.dport if (TCP in pkt or UDP in pkt) else None,
        "info": str(pkt.summary())
    })
pcap__import = pd.DataFrame(rows)

# Normalise: ensure there is a numeric 'length' column for downstream processing
if 'length' not in pcap__import.columns and 'len' in pcap__import.columns:
    pcap__import['length'] = pcap__import['len']
# coerce to numeric and use pandas nullable integer dtype where possible
if 'length' in pcap__import.columns:
    pcap__import['length'] = pd.to_numeric(pcap__import['length'], errors='coerce').astype('Int64')



In [None]:
netflix_dns = pcap__import[
    pcap__import["info"].str.contains(r"netflix|nflx", case=False, na=False)
]
print(netflix_dns.head(15))

### Identifying the Service Type

Use the DNS traffic to filter the packet trace for Netflix traffic.

In [None]:
# Keep only Netflix DNS queries (not answers).
# The pattern has no capture group, which avoids the pandas "match groups" UserWarning.
netflix_qry = pcap__import[
    pcap__import["info"].str.contains(r"netflix|nflx", case=False, na=False)
    & pcap__import["info"].str.contains("Qry", case=False, na=False)
]

# Show the transaction IDs with a few context columns
print(netflix_qry[["timestamp","src_ip","dst_ip","txn_id","info"]].head(30))
# Drop repeated queries that share a transaction ID
netflix_qry = netflix_qry.drop_duplicates(subset=["txn_id"])
print(netflix_qry[["timestamp","src_ip","dst_ip","txn_id","info"]].head(30))
# Collect the transaction IDs into a set for fast membership tests
netflix_txn_ids = set(netflix_qry["txn_id"])
print(netflix_txn_ids)


In [None]:
# Find every DNS answer whose transaction ID matches one of the Netflix queries
nflx_answers = pcap__import[
    pcap__import["txn_id"].isin(netflix_txn_ids)
    & pcap__import["info"].str.contains("Ans", case=False, na=False)
    & pcap__import["protocol"].eq("DNS")
]
print(nflx_answers.head())
print(len(nflx_answers))

In [None]:
# Extract the IP addresses from the Netflix DNS answers
nflx_ips = set()
for info in nflx_answers["info"]:
    parts = info.split("\"")
    for part in parts:
        if part.count(".") == 3:  # crude check for an IPv4 address
            nflx_ips.add(part)
print(nflx_ips)
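
The dot-counting check above also accepts strings such as a four-label hostname. A stricter alternative, shown here only as a minimal sketch, validates each candidate with the standard-library `ipaddress` module. The helper name `extract_ipv4` and the example summary strings are assumptions made for illustration; the real scapy summaries may be formatted differently.

```python
import ipaddress
import re

# Candidate dotted quads; validity is checked separately with ipaddress
_CANDIDATE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

def extract_ipv4(summaries):
    """Return the set of valid IPv4 addresses found in an iterable of strings."""
    found = set()
    for text in summaries:
        for token in _CANDIDATE.findall(text):
            try:
                found.add(str(ipaddress.IPv4Address(token)))
            except ipaddress.AddressValueError:
                pass  # e.g. "999.1.1.1" looks like an address but is not one
    return found

# Hypothetical summary strings, used only to exercise the helper
example = ['DNS Ans "52.1.2.3"', 'DNS Ans "www.a.b.com."', 'DNS Ans "999.1.1.1"']
print(extract_ipv4(example))
```

With the notebook's data, the call would be `extract_ipv4(nflx_answers["info"])`.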

### Generate Statistics

Generate statistics and features for the Netflix traffic flows. Use the `netml` library or any other technique that you choose to generate a set of features that you think would be good features for your model. 

In [None]:
# Copy the slice so that adding columns later does not trigger SettingWithCopyWarning
nflx_pkts = pcap__import[
    (pcap__import["src_ip"].isin(nflx_ips)) | (pcap__import["dst_ip"].isin(nflx_ips))
].copy()

print(nflx_pkts.head())

# Basic statistics for the Netflix packets and traffic flows
print(len(nflx_pkts))
print(nflx_pkts["length"].describe())

# Summary statistics across all numeric columns, as candidate features for the model
print(nflx_pkts.describe())

# Distribution of packets over time
nflx_pkts["timestamp"].plot(kind='hist', bins=50, title='Distribution of Packet Timestamps', xlabel='Timestamp', ylabel='Frequency')
# Other potentially useful features: inter-arrival times, packet sizes, burstiness, protocol distribution, flow durations, and packet counts per flow.
# Inter-arrival time:
print("inter-arrival time statistics:")
nflx_pkts = nflx_pkts.sort_values(by="timestamp")
nflx_pkts["inter_arrival_time"] = nflx_pkts["timestamp"].diff().fillna(0)
print(nflx_pkts["inter_arrival_time"].describe())

#burstiness: standard deviation of inter-arrival times
burstiness = nflx_pkts["inter_arrival_time"].std()
print(f"Burstiness (std of inter-arrival times): {burstiness}")

# Protocol distribution: the fraction of packets carried by each protocol
protocol_counts = nflx_pkts["protocol"].value_counts(normalize=True)
print("Protocol Distribution:")
print(protocol_counts)

#flow durations and packet counts per flow
nflx_pkts["flow_id"] = nflx_pkts.apply(lambda row: f"{row['src_ip']}-{row['dst_ip']}-{row['src_port']}-{row['dst_port']}-{row['protocol']}", axis=1)
flow_stats = nflx_pkts.groupby("flow_id").agg(
    flow_duration=pd.NamedAgg(column="timestamp", aggfunc=lambda x: x.max() - x.min()),
    packet_count=pd.NamedAgg(column="timestamp", aggfunc="count")
).reset_index()
print("Flow Statistics:")
print(flow_stats.describe())


**Write a brief justification for the features that you have chosen.**

### Inferring Segment Downloads

In addition to the features that you could generate using the `netml` library or similar, add to your feature vector a "segment downloads rate" feature, which indicates the number of video segments downloaded for a given time window.

Note: If you are using the `netml` library, generating features with `SAMP` style options may be useful, as this option gives you time windows, and you can then simply add the segment download rate to that existing dataframe.

In [None]:
# Add a "segment download rate" feature for each 10-second time window.
# Note: this counts every packet in the window, so it is only a rough proxy for the number of video segments downloaded.
nflx_pkts["time_window"] = (nflx_pkts["timestamp"] // 10) * 10  # 10-second windows
segment_download_rate = nflx_pkts.groupby("time_window").size().reset_index(name="segment_download_rate")
nflx_pkts = nflx_pkts.merge(segment_download_rate, on="time_window", how="left")
print(nflx_pkts[["timestamp", "time_window", "segment_download_rate"]])
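
One possible way to get closer to actual segment counts is to count only the uplink packets that look like segment requests. The sketch below assumes that a packet sent to a Netflix server IP and larger than a size threshold is one request. The function name, the `min_request_len` threshold of 100 bytes, and the demo addresses are all hypothetical and would need tuning against the real trace.

```python
import pandas as pd

def segment_download_rate(pkts, server_ips, window=10, min_request_len=100):
    """Count likely segment requests per time window.

    Assumption: an uplink packet (destined to a server IP) larger than
    `min_request_len` bytes is a request for one video segment.
    """
    uplink = pkts[pkts["dst_ip"].isin(server_ips) & (pkts["length"] > min_request_len)]
    windows = (uplink["timestamp"] // window) * window  # start time of each window
    return (
        uplink.groupby(windows).size()
        .rename("segment_download_rate")
        .rename_axis("time_window")
        .reset_index()
    )

# Synthetic packets: 198.51.100.7 plays the role of a Netflix server
demo = pd.DataFrame({
    "timestamp": [0.5, 1.0, 3.0, 4.0, 12.0, 15.0, 27.0],
    "length":    [400, 1500, 60, 410, 420, 1500, 380],
    "dst_ip":    ["198.51.100.7", "10.0.0.2", "198.51.100.7", "198.51.100.7",
                  "198.51.100.7", "10.0.0.2", "198.51.100.7"],
})
rates = segment_download_rate(demo, {"198.51.100.7"})
print(rates)
```

With the notebook's data, the call would be `segment_download_rate(nflx_pkts, nflx_ips)`, and the result can be merged back on `time_window`.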

## Part 2: Video Quality Inference

You will now load the complete video dataset from a previous study to train and test models based on these features to automatically infer the quality of a streaming video flow.

For this part of the assignment, you will need two pickle files, which we provide for you by running the code below:

```

!gdown 'https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS' -O netflix_session.pkl
!gdown 'https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI' -O video_dataset.pkl

```

### Load the File

Load the video dataset pickle file.

In [None]:
!gdown 'https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS' -O netflix_session.pkl
!gdown 'https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI' -O video_dataset.pkl

### Clean the File

1. The dataset contains video resolutions that are not valid. Remove entries in the dataset that do not contain a valid video resolution. Valid resolutions are 240, 360, 480, 720, 1080.

In [None]:
import pickle
import pandas as pd

with open('../../myData/netflix_dataset.pkl', 'rb') as f:
    session_data = pickle.load(f)

print(session_data.columns.tolist())
print(session_data.head())


2. The file also contains columns that are unnecessary (in fact, unhelpful!) for performing predictions. Identify those columns, and remove them.

**Briefly explain why you removed those columns.**
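
The two cleaning steps above might be sketched as follows. The column names `resolution`, `session_id`, and `avg_throughput` are assumptions for illustration only; check `session_data.columns` for the real names before adapting this.

```python
import pandas as pd

VALID_RESOLUTIONS = [240, 360, 480, 720, 1080]

def clean_dataset(df, label_col="resolution", drop_cols=()):
    """Keep rows with a valid resolution, then drop unhelpful columns."""
    cleaned = df[df[label_col].isin(VALID_RESOLUTIONS)]
    # errors="ignore" skips names that are not present in the frame
    return cleaned.drop(columns=list(drop_cols), errors="ignore")

# Synthetic stand-in for the real dataset
demo = pd.DataFrame({
    "resolution": [240, 0, 720, 1440],
    "session_id": ["a", "b", "c", "d"],       # hypothetical identifier column
    "avg_throughput": [1.2, 0.3, 4.8, 9.1],   # hypothetical feature column
})
cleaned = clean_dataset(demo, drop_cols=["session_id"])
print(cleaned)
```

Identifier-like columns are typical candidates for removal because a model can memorize them instead of learning from traffic features.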

### Prepare Your Data

Prepare your data matrix, determine your features and labels, and perform a train-test split on your data.
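
As a rough illustration of this step, the sketch below separates features from labels and performs a stratified split with scikit-learn. The data is synthetic, and the column names `f1`, `f2`, `f3`, and `resolution` are placeholders rather than the dataset's real columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: three numeric features and a resolution label
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
df["resolution"] = rng.choice([240, 360, 480, 720, 1080], size=200)

X = df.drop(columns=["resolution"])  # feature matrix
y = df["resolution"]                 # labels

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```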

### Train Your Model

1. Select a model of your choice.
2. Train the model using your training data.

### Tune Your Model

Perform hyperparameter tuning to find optimal parameters for your model.
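
One common way to do this is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` with a random forest on synthetic data. The model choice, the parameter grid, and the `f1_macro` scoring are arbitrary examples, not requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class problem standing in for the training data
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Small illustrative grid; a real search would usually cover more values
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="f1_macro",  # macro-averaged F1 treats all classes equally
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refit on all of the supplied data with the best parameters found.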

### Evaluate Your Model

Evaluate your model according to the following metrics:

1. Accuracy
2. F1 Score
3. Confusion Matrix
4. ROC/AUC
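
The four metrics listed above can be computed with `sklearn.metrics`, as in this minimal sketch. The labels, predictions, and probabilities are made-up values for a three-class toy case; in practice they would come from `model.predict` and `model.predict_proba` on the test set.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, roc_auc_score
)

labels = [240, 360, 480]
y_true = np.array([240, 240, 360, 360, 480, 480])
y_pred = np.array([240, 360, 360, 360, 480, 240])
# Per-class probabilities; columns follow the order of `labels`, rows sum to 1
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.3],
])

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted
# One-vs-rest AUC, since resolution is a multi-class label
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", labels=labels)

print(acc, f1, auc)
print(cm)
```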

## Part 3: Predict the Ongoing Resolution of a Real Netflix Session

Now that you have your model, it's time to put it into practice!

Use a preprocessed Netflix video session to infer **and plot** the resolution at 10-second time intervals.