# Case Study: Evaluating Congestion Control for Satellite Networks

### Scenario:
Your lab is simulating congestion control protocols on satellite networks. Your labmates have run some simulations and stored the resulting data on GitHub. They have also written scripts to create plots, but there are some issues they want you to look at because they heard you are taking ***Programming With Data*** as part of ***PCT***. Work through the tasks below to help your lab make good decisions.

You don't need any prior knowledge about satellite networks or congestion control protocols to complete any of the examples.



## Task 1: Download the Data

In [None]:
# download and extract tar file
!wget https://raw.githubusercontent.com/mitdbg/practical-programming-with-data/main/2026/day1_materials/sim_logs.tar.gz
!tar -xzf sim_logs.tar.gz

In [None]:
# examine the extracted files
!ls sim_logs

## Task 2:
Somebody suggested to your labmate that they use the `parquet` file format to store the large logs produced by the simulations. Find out why they may have made this suggestion by looking at some differences in the performance of `parquet` vs `csv`.

In [None]:
import pandas as pd

# read parquet file
df = pd.read_parquet("sim_logs/LeoSharedPath-1ms-TcpBbr-throughput.parquet")

# save data as csv
df.to_csv("sim_logs/LeoSharedPath-1ms-TcpBbr-throughput.csv", index=False)

In [None]:
# first, let's look at the difference in the files' raw size(s)
!ls -lth sim_logs/ | grep LeoSharedPath-1ms-TcpBbr-throughput

In [None]:
# next, let's time how long it takes for us to read each file from disk
import time

start_time = time.time()
df = pd.read_parquet("sim_logs/LeoSharedPath-1ms-TcpBbr-throughput.parquet")
parquet_read_time = time.time() - start_time

start_time = time.time()
df = pd.read_csv("sim_logs/LeoSharedPath-1ms-TcpBbr-throughput.csv")
csv_read_time = time.time() - start_time

print(f"parquet read time: {parquet_read_time:.3f} s")
print(f"csv read time: {csv_read_time:.3f} s")
print("---")
print(f"speedup: {csv_read_time / parquet_read_time:.2f}x")
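
A single wall-clock measurement can be noisy, since the operating system may already have a file cached or may be busy with other work. As a minimal sketch, the hypothetical helper `best_of` below (not part of the lab's scripts) repeats a call several times with `time.perf_counter` and keeps the fastest run. A stand-in workload is used so the block runs on its own.

```python
import time

def best_of(fn, repeats=5):
    """Call fn() `repeats` times and return the fastest elapsed time in seconds.

    `best_of` is a hypothetical helper for illustration only.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# stand-in workload so the sketch is self-contained
elapsed = best_of(lambda: sum(range(100_000)), repeats=3)
print(f"fastest of 3 runs: {elapsed:.6f} s")
```

In the notebook, the same helper could wrap either reader, for example `best_of(lambda: pd.read_parquet("sim_logs/LeoSharedPath-1ms-TcpBbr-throughput.parquet"))`.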

In [None]:
# now let's examine the first few rows of the dataframe
df.head()

In [None]:
# parquet enables you to only read / load certain columns from the raw data;
# as a final experiment, let's see how long it takes us to load the flowId and bytes
# columns with parquet, relative to loading the entire CSV and post-filtering it
import time

start_time = time.time()
df = pd.read_parquet("sim_logs/LeoSharedPath-1ms-TcpBbr-throughput.parquet", columns=["flowId", "bytes"])
parquet_read_time = time.time() - start_time

start_time = time.time()
df = pd.read_csv("sim_logs/LeoSharedPath-1ms-TcpBbr-throughput.csv")
df = df.loc[:, ["flowId", "bytes"]]
csv_read_time = time.time() - start_time

print(f"parquet read time: {parquet_read_time:.3f} s")
print(f"csv read time: {csv_read_time:.3f} s")
print("---")
print(f"speedup: {csv_read_time / parquet_read_time:.2f}x")
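
For a fairer CSV baseline, `pd.read_csv` also accepts a `usecols` argument that keeps only the named columns. Because CSV is row-oriented text, the parser still has to scan every line, so this typically narrows the gap rather than closing it. The sketch below uses a tiny in-memory CSV; the `time(s)` column and all values are made up for illustration.

```python
import io
import pandas as pd

# tiny in-memory CSV standing in for the throughput log (made-up values)
csv_text = "time(s),flowId,bytes\n0.0,1,1500\n0.1,2,3000\n0.2,1,4500\n"

# usecols makes pandas keep only the named columns while parsing
subset_df = pd.read_csv(io.StringIO(csv_text), usecols=["flowId", "bytes"])
print(subset_df)
```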

## Task 3:
Let's examine the queueing delay associated with the BBR algorithm (`TcpBbr`) in the `GsToGs` setting.

In [None]:
delay_df = pd.read_parquet("sim_logs/LeoGsToGs-1ms-TcpBbr-delay.parquet")

In [None]:
delay_df.head()

In [None]:
# Q: How many unique flows are in this data sample?
delay_df['flowId'].nunique()

In [None]:
# Data Cleaning: There's a typo in the column name "queue_dealy(us)"; rename this column to "queue_delay(us)"
delay_df.rename(columns={'queue_dealy(us)': 'queue_delay(us)'}, inplace=True)

In [None]:
# Plotting: plot the queueing delay in **milliseconds** vs. time in seconds
import matplotlib.pyplot as plt

delay_df['delay(ms)'] = delay_df['queue_delay(us)'] / 1000.0

plt.figure(figsize=(12, 7))
plt.scatter(delay_df['time(s)'], delay_df['delay(ms)'], s=15, alpha=0.7, color='b')

plt.title('Queuing Delay Over Time')
plt.xlabel('Time (s)')
plt.ylabel('Queuing Delay (ms)')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()

plt.show()

In [None]:
# Plotting: to better examine the queueing delay behavior, generate another
# version of the plot which does not contain samples with >1.0 ms of delay
import matplotlib.pyplot as plt

delay_df['delay(ms)'] = delay_df['queue_delay(us)'] / 1000.0
# keep samples with at most 1.0 ms of delay (only samples with >1.0 ms are dropped)
delay_df = delay_df[delay_df['delay(ms)'] <= 1.0]

plt.figure(figsize=(12, 7))
plt.scatter(delay_df['time(s)'], delay_df['delay(ms)'], s=15, alpha=0.7, color='b')

plt.title('Queuing Delay Over Time (Delay <= 1.0 ms)')
plt.xlabel('Time (s)')
plt.ylabel('Queuing Delay (ms)')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()

plt.show()
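
Note that the cell above overwrites `delay_df`, so the high-delay samples are gone for any later cell. One option, sketched here on made-up samples, is to store the filtered rows in a separate dataframe (`low_delay_df` is a hypothetical name) and leave the full data untouched.

```python
import pandas as pd

# made-up delay samples in microseconds, for illustration only
sample_df = pd.DataFrame({
    "time(s)": [0.0, 0.5, 1.0, 1.5],
    "queue_delay(us)": [200.0, 1000.0, 2500.0, 800.0],
})
sample_df["delay(ms)"] = sample_df["queue_delay(us)"] / 1000.0

# keep samples with at most 1.0 ms of delay in a new dataframe
low_delay_df = sample_df[sample_df["delay(ms)"] <= 1.0]

# the original dataframe still holds every sample
print(len(sample_df), len(low_delay_df))
```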

## Task 4:

Let's now examine the congestion window of various congestion control algorithms for the `GsToGs` setting.

In [None]:
# Task: load the congestion window (cwnd) data for `TcpBbr`, `TcpCubic`, `TcpNewReno`, and `TcpHybla` in the `GsToGs` setting
import pandas as pd

bbr_df = pd.read_parquet("sim_logs/LeoGsToGs-1ms-TcpBbr-cwnd.parquet")
cubic_df = pd.read_parquet("sim_logs/LeoGsToGs-1ms-TcpCubic-cwnd.parquet")
reno_df = pd.read_parquet("sim_logs/LeoGsToGs-1ms-TcpNewReno-cwnd.parquet")
hybla_df = pd.read_parquet("sim_logs/LeoGsToGs-1ms-TcpHybla-cwnd.parquet")

In [None]:
# let's examine the head of each dataframe to ensure their schemas match
print(bbr_df.head())
print("---")
print(cubic_df.head())
print("---")
print(reno_df.head())
print("---")
print(hybla_df.head())

In [None]:
# Task: coalesce all the data into a single dataframe, adding a new column 'cc_algo',
# which stores the name of the congestion control algorithm associated with that data;
bbr_df['cc_algo'] = 'TcpBbr'
cubic_df['cc_algo'] = 'TcpCubic'
reno_df['cc_algo'] = 'TcpNewReno'
hybla_df['cc_algo'] = 'TcpHybla'

# ignore_index=True gives the combined dataframe a fresh, unique row index
df = pd.concat([bbr_df, cubic_df, reno_df, hybla_df], ignore_index=True)

In [None]:
# Task: Remove outliers by filtering for congestion windows < 3e5
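# .copy() makes the filtered result an independent dataframe, so adding columns
# to it later does not trigger a SettingWithCopyWarning
df = df.loc[df['cwnd'] < 3e5, :].copy()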

In [None]:
# Plotting: plot the congestion window as a function of time for each algorithm;
# plot each algorithm as a separate line on the same plot w/colors and a legend to distinguish them
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 7))

colors = ['b', 'r', 'g', 'k']
for idx, algo in enumerate(df['cc_algo'].unique()):
  algo_df = df.loc[df.cc_algo == algo, :]
  plt.plot(algo_df['time(s)'], algo_df['cwnd'], label=algo, alpha=0.7, color=colors[idx])

plt.title('Cwnd Over Time')
plt.xlabel('Time (s)')
plt.ylabel('Congestion Window (pkts)')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.legend(loc="lower right")
plt.tight_layout()

plt.show()

In [None]:
# Task: bucket the data into 10s time intervals, write a function which computes time(s) // 10
# and apply that function to the time(s) column, saving the output in a new "interval" column
def bucket(t):
  # the parameter is named t so it does not shadow the imported time module
  return t // 10

df['interval'] = df['time(s)'].apply(bucket)
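
The task asks for a function plus `.apply()`, but floor division also works on a whole column at once, which is usually faster on large logs. A minimal sketch on made-up timestamps:

```python
import pandas as pd

# made-up timestamps, for illustration only
toy_df = pd.DataFrame({"time(s)": [0.5, 9.9, 10.0, 27.3]})

# floor-divide the entire column in one vectorized operation
toy_df["interval"] = toy_df["time(s)"] // 10
print(toy_df["interval"].tolist())  # [0.0, 0.0, 1.0, 2.0]
```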

In [None]:
# Task: compute and print the average congestion window for each unique (cc_algo, interval) (Hint: use .groupby())
for (algo, interval), algo_df in df.groupby(['cc_algo', 'interval']):
  print(f"({algo}, {interval}) -> {algo_df['cwnd'].mean():.1f}")
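
The same averages can be computed without an explicit loop by selecting the `cwnd` column on the grouped object and calling `.mean()`. The sketch below uses made-up values; the name `mean_cwnd` is chosen for illustration.

```python
import pandas as pd

# made-up congestion window samples, for illustration only
toy_df = pd.DataFrame({
    "cc_algo": ["TcpBbr", "TcpBbr", "TcpCubic", "TcpCubic"],
    "interval": [0.0, 0.0, 0.0, 1.0],
    "cwnd": [10.0, 20.0, 30.0, 50.0],
})

# one row per (cc_algo, interval) pair, with the mean cwnd in a named column
mean_cwnd = (
    toy_df.groupby(["cc_algo", "interval"])["cwnd"]
    .mean()
    .reset_index(name="mean_cwnd")
)
print(mean_cwnd)
```

Keeping the result as a dataframe makes it easy to reuse in later steps instead of re-running the loop.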

In [None]:
# Task: in how many intervals does TcpCubic have the largest mean congestion window?
#
# Hint:
# - Initialize two empty lists, one to contain data for TcpCubic and another to contain data for the other algorithms
# - Use the same groupby from before to compute the mean congestion window for each algorithm in each interval
# - Put each interval and congestion window into a dictionary:
#   - for cubic data: {"interval": interval, "mean_cwnd_cubic": group_df['cwnd'].mean()}
#   - for the rest: {"interval": interval, "mean_cwnd": group_df['cwnd'].mean()}
#   - append the dictionary to its respective list (cubic data --> cubic list, other data --> other list)
# - Construct a dataframe from each list
# - Join the dataframes on the "interval" column (use .merge()) and store the output in a final dataframe
# - Group the final dataframe by interval, and count the intervals in which mean_cwnd_cubic > mean_cwnd for all rows
cubic_data, other_data = [], []
for (algo, interval), algo_df in df.groupby(['cc_algo', 'interval']):
  if algo == 'TcpCubic':
    cubic_data.append({"interval": interval, "mean_cwnd_cubic": algo_df['cwnd'].mean()})
  else:
    other_data.append({"interval": interval, "mean_cwnd": algo_df['cwnd'].mean()})

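# use new names so the raw cubic_df loaded earlier in Task 4 is not overwritten
cubic_means_df = pd.DataFrame(cubic_data)
other_means_df = pd.DataFrame(other_data)

# each merged row pairs cubic's mean with one other algorithm's mean for the same interval
merged_df = cubic_means_df.merge(other_means_df, on='interval')

cubic_wins = 0
for interval, interval_df in merged_df.groupby("interval"):
  # cubic wins the interval only if it beats every other algorithm
  if (interval_df['mean_cwnd_cubic'] > interval_df['mean_cwnd']).all():
    cubic_wins += 1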

print(f"Cubic wins {cubic_wins} intervals")
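
As an alternative to the list-and-merge approach in the hint, a pivot table puts one interval per row and one algorithm per column, so the comparison becomes a single vectorized expression. This is a sketch on made-up data, not the lab's method; `toy_df` and `toy_cubic_wins` are hypothetical names.

```python
import pandas as pd

# made-up congestion window samples, for illustration only
toy_df = pd.DataFrame({
    "cc_algo": ["TcpCubic", "TcpBbr", "TcpNewReno", "TcpCubic", "TcpBbr", "TcpNewReno"],
    "interval": [0, 0, 0, 1, 1, 1],
    "cwnd": [50.0, 40.0, 30.0, 10.0, 60.0, 20.0],
})

# rows are intervals, columns are algorithms, cells hold the mean cwnd
means = toy_df.pivot_table(index="interval", columns="cc_algo", values="cwnd", aggfunc="mean")

# cubic wins an interval when its mean is strictly greater than every other algorithm's mean
others = means.drop(columns="TcpCubic")
toy_cubic_wins = int((means["TcpCubic"] > others.max(axis=1)).sum())
print(f"Cubic wins {toy_cubic_wins} intervals")
```

In this toy data, cubic has the largest mean only in interval 0, so the count is 1.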
