# README

This README gives an overview of our project process, how to run the notebooks, our key findings, and future directions.

## Overview

Data preprocessing makes up the bulk of the work in most machine learning projects. You need to make sure that the data is processed correctly, is comprehensive, and reflects the environment you are trying to predict. A common recommendation is that incomplete data is worse than no data, so rows with NaN values that cannot be recovered should be dropped. We challenge this practice with a new question: what if you could predict these values instead of dropping them? Could this improve model accuracy when predicting video QoE? We test this through the steps outlined below:

1. We collected a pcap file during a Netflix streaming session and induced NaN values in a feature of our choosing. We chose only one feature for now, window size, to eliminate confounding variables.

2. We collected resolution labels during the same Netflix session by running a JS script in the Chrome DevTools console (outlined in Label_Generation_Script.txt). This script extracted "videoWidth" from the Netflix video element once per second as a proxy for video resolution.

3. With the NaN-induced pcap file and the labels in hand, we test the control condition, in which we drop the rows that contain NaN values. We then train our model using an 80/20 train/test split to see how well it predicts QoE.

4. This gives us a baseline accuracy for predicting QoE. For the next model, we first need a model that can predict window size values (to fill in the NaN values). We cannot train it on the original pcap, so we use a separate capture (netflix_traffic_unlabeled.pcap), again with an 80/20 train/test split. Once we have a model that can predict window size, we go back to our original pcap file and fill in the NaN values with the predicted window sizes. We then run the same model that was used for the baseline and compare accuracies.
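Steps 1 and 2 produce two streams that have to be joined: per-packet features from the pcap and one resolution label per second. The snippet below is a minimal sketch of one way to do that join, not the notebooks' actual code. It assumes each packet has already been extracted into a dict with a `timestamp` in seconds; the function name `label_packets_by_second`, the `start_time` parameter, and the sample values are hypothetical.

```python
import math


def label_packets_by_second(packets, labels, start_time):
    """Attach a per-second resolution label to each packet.

    packets:    list of dicts, each with a "timestamp" key (seconds, float).
    labels:     labels[i] is the videoWidth recorded i seconds after start_time.
    start_time: timestamp at which label collection began (assumed known).

    Packets that fall outside the labelled interval are skipped.
    """
    labelled = []
    for pkt in packets:
        second = math.floor(pkt["timestamp"] - start_time)
        if 0 <= second < len(labels):
            labelled.append({**pkt, "resolution": labels[second]})
    return labelled


# Made-up packets and labels, for illustration only.
packets = [
    {"timestamp": 100.2, "window_size": 501},
    {"timestamp": 100.9, "window_size": 498},
    {"timestamp": 101.4, "window_size": 510},
    {"timestamp": 105.0, "window_size": 512},  # after the last label, skipped
]
labels = [1280, 1920]  # videoWidth at seconds 0 and 1 of the session
labelled = label_packets_by_second(packets, labels, start_time=100.0)
```

Every packet inherits the label of the second it arrived in, so the capture and the label script need to share a clock or a known start offset.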

## How to Run

### Step 1: Induce NaN Values

Start with the Inducing_NaN_Values.ipynb notebook. It takes the original pcap file (netflix_resolution) and induces NaN values in the Window Size column. After running this notebook, you will have a netflix_resolution_nan_window.pcap file.
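For readers who want the idea without opening the notebook, here is a minimal sketch of inducing NaN values in one column. It works on rows that have already been extracted from the pcap into dicts; how the notebook reads and writes the pcap itself is not shown. The function name `induce_nan`, the `fraction` and `seed` parameters, and the sample rows are assumptions made for illustration.

```python
import math
import random


def induce_nan(rows, column, fraction, seed=0):
    """Return a copy of rows with `fraction` of the values in `column` set to NaN."""
    rng = random.Random(seed)  # fixed seed so the same rows are masked on every run
    n_missing = round(len(rows) * fraction)
    missing_idx = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, column: math.nan} if i in missing_idx else dict(row)
        for i, row in enumerate(rows)
    ]


# Made-up rows, for illustration only.
rows = [{"window_size": 500 + i, "length": 1400} for i in range(10)]
masked = induce_nan(rows, "window_size", fraction=0.3)
n_nan = sum(math.isnan(r["window_size"]) for r in masked)
```

Masking a copy rather than the original keeps the true window sizes available, which is useful later for checking how close the predicted values are.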

### Step 2: Run a Baseline Netflix QoE Model

Next, run the "1:Baseline_Netfli_QoE_Inference.ipynb" notebook to get a baseline accuracy for our model.
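The baseline's data preparation can be sketched as follows: drop every row with a missing window size, then hold out 20% of the remaining rows for testing. This is an illustration under stated assumptions, not the notebook's code. The helper names `drop_nan_rows` and `split_80_20` are hypothetical, and shuffling before the split is an assumption.

```python
import math
import random


def drop_nan_rows(rows, column):
    """Keep only the rows whose value in `column` is not NaN."""
    return [r for r in rows if not math.isnan(r[column])]


def split_80_20(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up rows: every fifth row has a missing window size.
rows = [
    {"window_size": math.nan if i % 5 == 0 else 500 + i, "resolution": 1280}
    for i in range(50)
]
complete = drop_nan_rows(rows, "window_size")
train, test = split_80_20(complete)
```

In a pandas workflow the same two steps are usually `DataFrame.dropna(subset=[...])` followed by a train/test split helper.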

### Step 3: Run a Netflix QoE Model with Induced Variables

Run "2:Predicted_Values_Netflix_QoE.ipynb". This notebook takes the NaN-induced pcap, trains a model on netflix_traffic_unlabeled.pcap to predict the missing values, and uses that model (window_size_model.pkl) to fill in the values in our NaN-induced pcap and produce a second accuracy score.
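The fill-in step can be sketched as below. Any callable that maps a row's other features to a window size can stand in for the trained model. The `toy_predict` function is a made-up placeholder, not the model stored in window_size_model.pkl, and `fill_nan_with_predictions` is a hypothetical helper name.

```python
import math


def fill_nan_with_predictions(rows, column, predict):
    """Replace NaN entries in `column` with predict(features).

    `features` is the row without `column`, so the predictor never sees
    the value it is asked to reconstruct.
    """
    filled = []
    for row in rows:
        if math.isnan(row[column]):
            features = {k: v for k, v in row.items() if k != column}
            row = {**row, column: predict(features)}
        filled.append(row)
    return filled


def toy_predict(features):
    """Placeholder predictor (assumption): large packets get a larger window."""
    return 512 if features["length"] > 1000 else 256


# Made-up rows, for illustration only.
rows = [
    {"length": 1400, "window_size": 501},
    {"length": 1400, "window_size": math.nan},
    {"length": 60, "window_size": math.nan},
]
filled = fill_nan_with_predictions(rows, "window_size", toy_predict)
```

Rows that already have a window size pass through unchanged, so the QoE model sees the same data as the baseline plus the rows that would otherwise have been dropped.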

## Findings

We found that simply dropping rows with NaN values was more accurate than trying to predict the values. There are two likely reasons:

1. Quality of Training Data: The model used to predict missing window sizes was trained on a different pcap file (netflix_traffic_unlabeled.pcap) which may have captured different network conditions, traffic patterns, or connection characteristics compared to the target file (netflix_resolution.pcap). This domain shift between training and inference data could have introduced systematic errors in the predictions, making them less accurate than the actual values.

2. Window Size is Highly Context-Dependent: TCP window sizes are dynamic values that depend on many factors, including network congestion, receiver buffer availability, and flow control mechanisms. Our feature set (source/destination ports, sequence numbers, ACK numbers, flags, TTL, packet length) may not have captured enough contextual information to accurately predict these values. Unlike more static features, window size requires understanding the full conversation state and network conditions at that specific moment in time.
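Because the NaN values were induced rather than naturally missing, the true window sizes are still known, so both explanations can be probed by measuring imputation error directly. The snippet below is a minimal sketch of that check. The function name `imputation_mae` and the sample values are hypothetical, and it is not a result from this project.

```python
import math


def imputation_mae(original, masked, filled, column):
    """Mean absolute error of the filled values, over the masked positions only."""
    errors = [
        abs(o[column] - f[column])
        for o, m, f in zip(original, masked, filled)
        if math.isnan(m[column])
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Made-up values, for illustration only.
original = [{"window_size": 500}, {"window_size": 520}, {"window_size": 480}]
masked = [{"window_size": 500}, {"window_size": math.nan}, {"window_size": math.nan}]
filled = [{"window_size": 500}, {"window_size": 510}, {"window_size": 500}]
mae = imputation_mae(original, masked, filled, "window_size")
```

A large error on the target capture alongside a small error on the held-out 20% of the training capture would point toward the first explanation; a large error on both would point toward the second.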

These findings supported our hypothesis: the contrary result would have implied that the decades-old ML practice of dropping incomplete rows was mistaken.

## Future Directions

One next step is to tune the hyperparameters of our window size model. If we can achieve high accuracy on window size prediction, there appears to be little drawback to using the model. The current version, by contrast, may be passing inaccurate predictions on to the QoE model and compounding their effect.

We also plan to predict other features to answer a broader question: is there a set of features for which using machine learning to predict missing values works better than dropping them?