# Term Project: Checkpoint #2

**Group 8:** Palvi Sabherwal, Emily Thai, Hannah Shakouri

## PROJECT UPDATE

1. Provide an update on the status of your data collection. Have you been able to successfully collect and label a packet trace collected through netunicorn for your project? If yes, how many traces have you collected? Do you plan to scale up your data collection any further?

**Currently, we have collected the data of five videos from each platform: YouTube, Vimeo, and Twitch. For YouTube and Vimeo, each video has a different duration (1 min, 2 min, 5 min, 10 min, and 15 min). For Twitch, each video is around 1 minute long. Our pipeline is setup to collect data from 15 videos (5 from each platform). In total, our pipeline has collected 15 packet traces (.pcap files). We are currently in the process of labeling and extracting the features of the packet traces we have collected so far. We are considering expanding our data collection (by 1-2 videos) for each platform. In the code below, we are using netUnicorn to collect the packet traces and preproccessing them to train our model.**

2. Provide a list of features that you will extract from the packet traces for your model.

**For our model, we plan on extracting these features from the packet traces:**
- **Size:** of video chunk in bytes
- **Delivery Rate:** rate (in bytes per second) at which data is delivered
- **RRT (Round-Trip Time):** latency (in seconds) between client and server
- **Transmission Time:**  total time taken to download the video chunk

3. Please provide a high-level explanation of the model type that you plan to train / evaluate. Provide a link to the python sci-kit implementation that you plan to use.

**We plan on using the *RandomForestRegressor* to train and evaluate our data collection. The *RandomForestRegressor* is suitable for our project because it can predict numerical outcomes, such as the download time for video chunks or QoE metrics. This can be useful in understanding the significance of different features in classifying network traffic. This scikit-learn model is more complex and useful for capturing non-linear relationships between features and outcomes. In the code below, we have provided the implementation of the model we plan on using for our project.**

https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html

# From Checkpoint #1

## EXPERIMENT

*Provide a high-level explanation of the experiment(s) that you want to run through netUnicorn/netReplica. What type of data do you need to collect?*

Our project’s goal is to predict download times of video chunks on video streaming platforms to minimize the interruptions caused by fluctuations in network performance. The input of our model will be historic QoS metrics, including throughput, latency, and packet loss. As for the training data we plan on running through netUnicorn/netReplica, we will collect time-series data on historical QoS metrics for video streaming sessions. 

## DATA COLLECTION PIPELINES
*Provide (pseudo)code for the pipeline(s) that you will run for your data collection.*

### Import Statements
*For each task in your pipeline, provide a reference to the implementation of this task that you will use for your data collection.*

These imports statements are our reference to the implementations used in our pipeline tasks. 

In [1]:
import os
import time
import pandas as pd

from netunicorn.client.remote import RemoteClient, RemoteClientException
from netunicorn.base import Experiment, ExperimentStatus, Pipeline
from netunicorn.library.tasks.capture.tcpdump import StartCapture, StopNamedCapture
from netunicorn.library.tasks.upload.fileio import UploadToFileIO
from netunicorn.library.tasks.upload.webdav import UploadToWebDav
from netunicorn.library.tasks.basic import SleepTask
from netunicorn.library.tasks.measurements.ookla_speedtest import SpeedTest
from netunicorn.library.tasks.video_watchers.youtube_watcher import WatchYouTubeVideo
from netunicorn.library.tasks.video_watchers.vimeo_watcher import WatchVimeoVideo
from netunicorn.library.tasks.video_watchers.twitch_watcher import WatchTwitchStream

### Set Up netUnicorn
Choosing a device for our project. Using our group's netUnicorn API credentials.

In [2]:
NETUNICORN_ENDPOINT = os.environ.get('NETUNICORN_ENDPOINT', 'https://pinot.cs.ucsb.edu/netunicorn')
NETUNICORN_LOGIN = os.environ.get('NETUNICORN_LOGIN', 'cs190n8')       
NETUNICORN_PASSWORD = os.environ.get('NETUNICORN_PASSWORD', 'kfazTdrx')

In [3]:
client = RemoteClient(endpoint=NETUNICORN_ENDPOINT, login=NETUNICORN_LOGIN, password=NETUNICORN_PASSWORD)
print("Health Check: {}".format(client.healthcheck()))
nodes = client.get_nodes()
print(nodes)

Health Check: True
[<Uncountable node pool with next node template: [aws-fargate-A-cs190n8-, aws-fargate-B-cs190n8-, aws-fargate-ARM64-cs190n8-]>]


In [4]:
working_node = 'aws-fargate-A-cs190n8-'

### Collecting Network Data for Video Streaming
In our collecting network data pipeline we will be collecting packet captures while streaming video for YouTube.

In [5]:
from netunicorn.executor import get_local_executor
executor = get_local_executor(pipeline)

NameError: name 'pipeline' is not defined

In [6]:
pipeline = Pipeline()

# Flag to enable early stopping -- so if any task fails pipeline would go on working
# pipeline.early_stopping = False

# Generate Data for YouTube
pipeline.then(StartCapture(filepath="/tmp/youtube_1min.pcap", name="1min_captureYT"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://www.youtube.com/watch?v=0g1Q4fBDp2U&pp=ygUMMSBtaW4gdmlkZW9z", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="1min_captureYT"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/youtube_2min.pcap", name="2min_captureYT"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://www.youtube.com/watch?v=0CmtDk-joT4&ab_channel=Koi", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="2min_captureYT"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/youtube_5min.pcap", name="5min_captureYT"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://www.youtube.com/watch?v=6-2ra25RVRs&pp=ygUMNSBtaW4gdmlkZW9z", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="5min_captureYT"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/youtube_10min.pcap", name="10min_captureYT"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://www.youtube.com/watch?v=vCkhJeom7zU&pp=ygUNMTAgbWluIHZpZGVvcw%3D%3D", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="10min_captureYT"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/youtube_15min.pcap", name="15min_captureYT"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://www.youtube.com/watch?v=co47u19cbds&pp=ygUNMTUgbWluIHZpZGVvcw%3D%3D", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="15min_captureYT"))

# Generate Data for Twitch
pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/twitch_capture1.pcap", name="1_captureTW"))
for _ in range(5):
    pipeline.then(WatchTwitchStream("https://www.twitch.tv/emilycc/clip/BoringGloriousGoblinOMGScoots-HzN323BdbgWMC8z2", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="1_captureTW"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/twitch_capture2.pcap", name="2_captureTW"))
for _ in range(5):
    pipeline.then(WatchTwitchStream("https://www.twitch.tv/chess24/clip/EvilHedonisticLadiesDBstyle-Qvx2Iv5rD18TBkMc", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="2_captureTW"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/twitch_capture3.pcap", name="3_captureTW"))
for _ in range(5):
    pipeline.then(WatchTwitchStream("https://www.twitch.tv/dinabelenkaya/clip/BlueArtisticLarkDendiFace-6GsaEv3Rt8vpRoTL", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="3_captureTW"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/twitch_capture4.pcap", name="4_captureTW"))
for _ in range(5):
    pipeline.then(WatchTwitchStream("https://www.twitch.tv/chess24/clip/PlayfulEndearingOpossumPeoplesChamp-d1sejwuAMspNlEz7", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="4_captureTW"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/twitch_capture5.pcap", name="5_captureTW"))
for _ in range(5):
    pipeline.then(WatchTwitchStream("", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="5_captureTW"))

pipeline.then(UploadToWebDav(filepaths={"/tmp/youtube_1min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/youtube_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/youtube_2min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/youtube_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/youtube_5min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/youtube_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/youtube_10min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/youtube_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/youtube_15min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/youtube_capture", username="uploader", password="uploader"))

pipeline.then(UploadToWebDav(filepaths={"/tmp/twitch_capture1.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/twitch_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/twitch_capture2.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/twitch_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/twitch_capture3.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/twitch_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/twitch_capture4.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/twitch_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/twitch_capture5.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/twitch_capture", username="uploader", password="uploader"))

Pipeline(426a5aff-b1f7-4afb-bc8a-a427446bdf46): {'root': [<netunicorn.library.tasks.capture.tcpdump.StartCapture object at 0x7f50cd6bc820>], 1: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f50cd6bc970>], 2: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f50cd6bcb80>], 3: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f50cd6bcbe0>], 4: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f50cd6bcc10>], 5: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f50cd6bcc70>], 6: [<netunicorn.library.tasks.capture.tcpdump.StopNamedCapture object at 0x7f50cd6bcb50>], 7: [<netunicorn.library.tasks.basic.SleepTask with name 01d1484f-a582-44b0-a190-16899b926e7a>], 8: [<netunicorn.library.tasks.capture.tcpdump.StartCapture object at 0x7f50cd6bcac0>], 9: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoV

In [32]:
executor()

ERROR:executor_local:asyncio.run() cannot be called from a running event loop
Traceback (most recent call last):
  File "/srv/netunicorn/netunicorn-executor/src/netunicorn/executor/executor.py", line 134, in __call__
    asyncio.run(self.execute())
  File "/usr/lib/python3.10/asyncio/runners.py", line 33, in run
    raise RuntimeError(
RuntimeError: asyncio.run() cannot be called from a running event loop
CRITICAL:executor_local:Failed to execute the graph. Shutting down.
  break
INFO:executor_local:Skipping reporting results due to execution graph setting.


In [6]:
pipeline = Pipeline()
# Flag to enable early stopping -- so if any task fails pipeline would go on working
# pipeline.early_stopping = False

# Generate Data for Vimeo
pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/vimeo_1min.pcap", name="1min_captureVM"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://vimeo.com/820625227?autoplay=1", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="1min_captureVM"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/vimeo_2min.pcap", name="2min_captureVM"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://vimeo.com/867196026", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="2min_captureVM"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/vimeo_5min.pcap", name="5min_captureVM"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://vimeo.com/391703912", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="5min_captureVM"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/vimeo_10min.pcap", name="10min_captureVM"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://vimeo.com/675873896", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="10min_captureVM"))

pipeline.then(SleepTask(2))

pipeline.then(StartCapture(filepath="/tmp/vimeo_15min.pcap", name="15min_captureVM"))
for _ in range(5):
    pipeline.then(WatchVimeoVideo("https://vimeo.com/1031379349", 10))
pipeline.then(StopNamedCapture(start_capture_task_name="15min_captureVM"))

pipeline.then(UploadToWebDav(filepaths={"/tmp/vimeo_1min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/vimeo_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/vimeo_2min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/vimeo_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/vimeo_5min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/vimeo_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/vimeo_10min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/vimeo_capture", username="uploader", password="uploader"))
pipeline.then(UploadToWebDav(filepaths={"/tmp/vimeo_15min.pcap"}, endpoint="http://snl-server-5.cs.ucsb.edu/cs190n/cs190n8/vimeo_capture", username="uploader", password="uploader"))

Pipeline(94e758bd-c3e4-415d-9ae0-3898c9c0a863): {'root': [<netunicorn.library.tasks.basic.SleepTask with name 886ff3c3-8d68-4152-afcf-3cfe33228a45>], 1: [<netunicorn.library.tasks.capture.tcpdump.StartCapture object at 0x7f38c9e8f700>], 2: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f38c9e8f6d0>], 3: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f38c9e8c880>], 4: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f38c9e8c7c0>], 5: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f38c9e8c820>], 6: [<netunicorn.library.tasks.video_watchers.vimeo_watcher.WatchVimeoVideo object at 0x7f38c9e8c7f0>], 7: [<netunicorn.library.tasks.capture.tcpdump.StopNamedCapture object at 0x7f38c9e8c700>], 8: [<netunicorn.library.tasks.basic.SleepTask with name 2bd9699a-e514-46a9-9728-db5da4df4ac2>], 9: [<netunicorn.library.tasks.capture.tcpdump.StartCapture ob

In [7]:
client = RemoteClient(endpoint=NETUNICORN_ENDPOINT, login=NETUNICORN_LOGIN, password=NETUNICORN_PASSWORD)
print("Health Check: {}".format(client.healthcheck()))
nodes = client.get_nodes()
print(nodes)

Health Check: True
[<Uncountable node pool with next node template: [aws-fargate-A-cs190n8-, aws-fargate-B-cs190n8-, aws-fargate-ARM64-cs190n8-]>]


In [8]:
working_nodes = nodes.filter(lambda node: node.name.startswith(working_node)).take(1)

# Creating the experiment
experiment = Experiment().map(pipeline, working_nodes)
print(experiment)

 - Deployment: Node=aws-fargate-A-cs190n8-1, executor_id=, prepared=False, error=None


### Preparing the Experiment
We will use a predefined DockerImage.

In [9]:
from netunicorn.base import DockerImage
for deployment in experiment:
    # you can explore the image on the DockerHub
    deployment.environment_definition = DockerImage(image='satyandraguthula/netunicorn_images')

In [10]:
experiment_label = "da1aco113c1i0ns"

Now we can prepare the experiment, check for any errors and execute.

In [11]:
try:
    client.delete_experiment(experiment_label)
except RemoteClientException:
    pass

client.prepare_experiment(experiment, experiment_label)

while True:
    info = client.get_experiment_status(experiment_label)
    print(info.status)
    if info.status == ExperimentStatus.READY:
        break
    time.sleep(20)

ExperimentStatus.PREPARING
ExperimentStatus.READY


In [12]:
for deployment in client.get_experiment_status(experiment_label).experiment:
    print(f"Prepared: {deployment.prepared}, error: {deployment.error}")

Prepared: True, error: None


In [13]:
client.start_execution(experiment_label)

while True:
    info = client.get_experiment_status(experiment_label)
    print(info.status)
    if info.status != ExperimentStatus.RUNNING:
        break
    time.sleep(20)

ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.RUNNING
ExperimentStatus.FINISHED


In [14]:
from returns.pipeline import is_successful

for report in info.execution_result:
    print(f"Node name: {report.node.name}")
    print(f"Error: {report.error}")

    result, log = report.result  # report stores results of execution and corresponding log
    
    # result is a returns.result.Result object, could be Success of Failure
    print(f"Result is: {type(result)}")
    data = result.unwrap() if is_successful(result) else result.failure()
    for key, value in data.items():
        print(f"{key}: {value}")

    # we also can explore logs
    for line in log:
        print(line.strip())
    print()

Node name: aws-fargate-A-cs190n8-1
Error: None
Result is: <class 'returns.result.Failure'>
1min_captureYT: [<Success: 9>]
3b6eff90-078b-45e1-9c71-57619891588f: [<Success: Video finished by timeout: 10 seconds>]
65cb8823-f035-4133-b040-66721a8b6637: [<Success: Video finished by timeout: 10 seconds>]
43e4e13b-36a3-47d6-92b6-0ce438ddf593: [<Success: Video finished by timeout: 10 seconds>]
61e0c191-a261-4473-8f54-53acddac555c: [<Failure: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/netunicorn/executor/utils.py", line 32, in decorator
    result = function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/netunicorn/library/tasks/video_watchers/vimeo_watcher.py", line 134, in run
  File "/usr/local/lib/python3.10/dist-packages/netunicorn/library/tasks/video_watchers/vimeo_watcher.py", line 74, in watch
  File "/usr/local/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 495, in close
    self.execute(Command.CLOSE

# Checkpoint #2

### Convert Packet Trace to CSV

In [16]:
df_youtube1 = pd.read_csv("/mnt/md0/cs190n/cs190n8/youtube_1min_capture_ISCX.csv")
df_youtube2 = pd.read_csv("/mnt/md0/cs190n/cs190n8/youtube_2min_capture_ISCX.csv")
df_youtube5 = pd.read_csv("/mnt/md0/cs190n/cs190n8/youtube_5min_capture_ISCX.csv")
df_youtube10 = pd.read_csv("/mnt/md0/cs190n/cs190n8/youtube_10min_capture_ISCX.csv")
df_youtube15 = pd.read_csv("/mnt/md0/cs190n/cs190n8/youtube_15min_capture_ISCX.csv")

df_vimeo1 = pd.read_csv("/mnt/md0/cs190n/cs190n8/vimeo_1min_capture_ISCX.csv")
df_vimeo2 = pd.read_csv("/mnt/md0/cs190n/cs190n8/vimeo_2min_capture_ISCX.csv")
df_vimeo5 = pd.read_csv("/mnt/md0/cs190n/cs190n8/vimeo_5min_capture_ISCX.csv")
df_vimeo10 = pd.read_csv("/mnt/md0/cs190n/cs190n8/vimeo_10min_capture_ISCX.csv")
df_vimeo15 = pd.read_csv("/mnt/md0/cs190n/cs190n8/vimeo_15min_capture_ISCX.csv")

df_twitch1 = pd.read_csv("/mnt/md0/cs190n/cs190n8/twitch_capture1_ISCX.csv")
df_twitch2 = pd.read_csv("/mnt/md0/cs190n/cs190n8/twitch_capture2_ISCX.csv")
df_twitch3 = pd.read_csv("/mnt/md0/cs190n/cs190n8/twitch_capture3_ISCX.csv")
df_twitch4 = pd.read_csv("/mnt/md0/cs190n/cs190n8/twitch_capture4_ISCX.csv")
df_twitch5 = pd.read_csv("/mnt/md0/cs190n/cs190n8/twitch_capture5_ISCX.csv")

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/md0/cs190n-test/youtube_video_capture_ISCX.csv'

### Separate CSVs to video_sent & video_acked

In [1]:
def separate_video_packets(df):
    # Add label to classify packets as 'sent' or 'acked'
    df['Label'] = 'other'
    
    # Define criteria for video sent and video acked
    df.loc[df['Total Fwd Packet'] > 0, 'Label'] = 'sent'  # Forward packets (video sent)
    df.loc[df['Total Bwd packets'] > 0, 'Label'] = 'acked'  # Backward packets (video acked)
    
    # Filter out irrelevant rows (optional, depending on your needs)
    df = df.drop(df[(df['Protocol'] == 17) & (df['Label'] != 'sent')].index)
    
    # Separate into sent and acked dataframes
    df_sent = df[df['Label'] == 'sent']
    df_acked = df[df['Label'] == 'acked']
    
    return df_sent, df_acked

df_youtube1_sent, df_youtube1_acked = separate_video_packets(df_youtube1)
df_youtube2_sent, df_youtube2_acked = separate_video_packets(df_youtube2)
df_youtube5_sent, df_youtube5_acked = separate_video_packets(df_youtube5)
df_youtube10_sent, df_youtube10_acked = separate_video_packets(df_youtube10)
df_youtube15_sent, df_youtube15_acked = separate_video_packets(df_youtube15)

df_vimeo1_sent, df_vimeo1_acked = separate_video_packets(df_vimeo1)
df_vimeo2_sent, df_vimeo2_acked = separate_video_packets(df_vimeo2)
df_vimeo5_sent, df_vimeo5_acked = separate_video_packets(df_vimeo5)
df_vimeo10_sent, df_vimeo10_acked = separate_video_packets(df_vimeo10)
df_vimeo15_sent, df_vimeo15_acked = separate_video_packets(df_vimeo15)

df_twitch1_sent, df_twitch1_acked = separate_video_packets(df_twitch1)
df_twitch2_sent, df_twitch2_acked = separate_video_packets(df_twitch2)
df_twitch3_sent, df_twitch3_acked = separate_video_packets(df_twitch3)
df_twitch4_sent, df_twitch4_acked = separate_video_packets(df_twitch4)
df_twitch5_sent, df_twitch5_acked = separate_video_packets(df_twitch5)

NameError: name 'df_twitch' is not defined

In [None]:
import pandas as pd
import numpy as np
import pickle
import argparse

In [None]:
VIDEO_DURATION = 180180
PKT_BYTES = 1500
MILLION = 1000000
PAST_CHUNKS = 8
FUTURE_CHUNKS = 5

### Prepare CSV Data

In [None]:
def prepare_raw_data(video_sent_path, video_acked_path, time_start=None, time_end=None):
    """
    Load data from files and calculate chunk transmission times.
    """
    video_sent_df = pd.read_csv(video_sent_path)
    video_acked_df = pd.read_csv(video_acked_path)

    # Rename "time (ns GMT)" to "time" for convenience
    video_sent_df.rename(columns={'time (ns GMT)': 'time'}, inplace=True)
    video_acked_df.rename(columns={'time (ns GMT)': 'time'}, inplace=True)

    # Convert nanosecond timestamps to datetime
    video_sent_df['time'] = pd.to_datetime(video_sent_df['time'], unit='ns')
    video_acked_df['time'] = pd.to_datetime(video_acked_df['time'], unit='ns')

    # Filter by time range
    if time_start:
        time_start = pd.to_datetime(time_start)
        video_sent_df = video_sent_df[video_sent_df['time'] >= time_start]
        video_acked_df = video_acked_df[video_acked_df['time'] >= time_start]
    if time_end:
        time_end = pd.to_datetime(time_end)
        video_sent_df = video_sent_df[video_sent_df['time'] <= time_end]
        video_acked_df = video_acked_df[video_acked_df['time'] <= time_end]

    # Process the data
    return calculate_trans_times(video_sent_df, video_acked_df)

### Calculate Transmission Times & Divide into Chunks

In [None]:
def calculate_trans_times(video_sent_df, video_acked_df):
    """
    Calculate transmission times from video_sent and video_acked datasets using session_id.
    """
    d = {}
    last_video_ts = {}

    for _, row in video_sent_df.iterrows():
        session = row['session_id']  # Use only session_id to track sessions
        if session not in d:
            d[session] = {}
            last_video_ts[session] = None

        video_ts = int(row['video_ts'])
        if last_video_ts[session] is not None:
            if video_ts != last_video_ts[session] + VIDEO_DURATION:
                continue

        last_video_ts[session] = video_ts
        d[session][video_ts] = {
            'sent_ts': pd.Timestamp(row['time']),
            'size': float(row['size']) / PKT_BYTES,
            'delivery_rate': float(row['delivery_rate']) / PKT_BYTES,
            'rtt': float(row['rtt']) / MILLION,
        }

    for _, row in video_acked_df.iterrows():
        session = row['session_id']  # Use only session_id
        if session not in d:
            continue

        video_ts = int(row['video_ts'])
        if video_ts not in d[session]:
            continue

        dsv = d[session][video_ts]
        sent_ts = dsv['sent_ts']
        acked_ts = pd.Timestamp(row['time'])
        dsv['acked_ts'] = acked_ts
        dsv['trans_time'] = (acked_ts - sent_ts).total_seconds()

    return d

In [None]:
prepare_raw_data("/mnt/md0/cs190n/video_sent.csv", "/mnt/md0/cs190n/video_acked.csv")

In [None]:
def append_past_chunks(ds, next_ts, row):
    i = 1
    past_chunks = []
    while i <= PAST_CHUNKS:
        ts = next_ts - i * VIDEO_DURATION
        if ts in ds and 'trans_time' in ds[ts]:
            past_chunks = [ds[ts]['delivery_rate'],
                           ds[ts]['rtt'],
                           ds[ts]['size'], 
                           ds[ts]['trans_time']] + past_chunks
        else:
            nts = ts + VIDEO_DURATION  # padding with the nearest ts
            padding = [ds[nts]['delivery_rate'],
                       ds[nts]['rtt']]
            if nts == next_ts:
                padding += [0, 0]  # next_ts is the first chunk to send
            else:
                padding += [ds[nts]['size'], ds[nts]['trans_time']]
            break
        i += 1
    if i != PAST_CHUNKS + 1:  # break in the middle; padding must exist
        while i <= PAST_CHUNKS:
            past_chunks = padding + past_chunks
            i += 1
    row += past_chunks

In [None]:
def prepare_input_output(d):
    ret = [{'in': [], 'out': []} for _ in range(5)]  # FUTURE_CHUNKS = 5

    for session in d:
        ds = d[session]

        for next_ts in ds:
            if 'trans_time' not in ds[next_ts]:
                continue

            row = []

            # Append past chunks
            append_past_chunks(ds, next_ts, row)

            # Append the TCP info of the next chunk
            row += [ds[next_ts]['delivery_rate'],
                    ds[next_ts]['rtt']]

            # Generate FUTURE_CHUNKS rows
            for i in range(5):  # FUTURE_CHUNKS = 5
                row_i = row.copy()

                ts = next_ts + i * VIDEO_DURATION
                if ts in ds and 'trans_time' in ds[ts]:
                    row_i += [ds[ts]['size']]

                    ret[i]['in'].append(row_i)
                    ret[i]['out'].append(ds[ts]['trans_time'])

    return ret

In [None]:
def save_processed_data(output_file, processed_data):
    """
    Save processed data to a file.
    """
    with open(output_file, 'wb') as f:
        pickle.dump(processed_data, f)
    print(f"Processed data saved to {output_file}")

In [None]:
if __name__ == '__main__':
    DEFAULT_VIDEO_SENT_PATH = ''
    DEFAULT_VIDEO_ACKED_PATH = ''
    DEFAULT_OUTPUT_FILE = ''
    
    #Latest datasets can be found at https://puffer.stanford.edu/results/
    
    parser = argparse.ArgumentParser(description="Process video streaming datasets.")
    parser.add_argument('--video_sent_path', type=str, help='Path to the video_sent dataset CSV file.')
    parser.add_argument('--video_acked_path', type=str, help='Path to the video_acked dataset CSV file.')
    parser.add_argument('--output_file', type=str, help='Path to save the processed data.')
    parser.add_argument('--time_start', type=str, default=None, help='Start time for filtering data (RFC3339 format).')
    parser.add_argument('--time_end', type=str, default=None, help='End time for filtering data (RFC3339 format).')
    #args = parser.parse_args()
    #processed_data = prepare_input_output(prepare_raw_data(args.video_sent_path, args.video_acked_path,
    #    time_start=args.time_start, time_end=args.time_end))
    # save_processed_data(args.output_file, processed_data)
    processed_data = prepare_input_output(prepare_raw_data(DEFAULT_VIDEO_SENT_PATH, DEFAULT_VIDEO_ACKED_PATH,
        time_start=None, time_end=None))
    save_processed_data(DEFAULT_OUTPUT_FILE, processed_data)

## Training the Model

In [4]:
%pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [1]:
# required imports
import sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import metrics
from sklearn.tree import plot_tree

In [None]:
# Select the features and target variable
X = df[['trans_time', 'size', 'delivery_rate', 'rtt']]  
y = df['video_chunk_download_time']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# R-squared score (good for understanding how well the model fits the data)
r2 = model.score(X_test, y_test)
print(f'R-squared: {r2}')

# Feature Importance: This shows the relative importance of each feature
print("Feature Importances:")
print(model.feature_importances_)

### Plotting Results

In [None]:
import matplotlib.pyplot as plt

# Plot actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Actual vs Predicted (Random Forest)')
plt.show()