# Predicting Wi-Fi Throughput and Delay Using a Linear Regression Model


From 10,000 simulations of a single-cell Wi-Fi network, we have a dataset that maps network configurations (number of stations, offered load, contention window values, etc.) to network performance metrics (throughput, delay). We use this dataset to train a linear regression model and then use the model to predict network performance.

Steps:

1. Load the dataset
2. Observe correlation between input variables
3. Train (fit) the model
4. Predict and evaluate the results

Initial configuration and library import:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
import pandas as pd
from sklearn import preprocessing
from google.colab import files

## 1. Loading the dataset

The dataset is available as a CSV file with the following columns:

- Inputs
  - `station_no` - number of stations in the network,
  - `x` and `y` - side lengths of the rectangular network area [m],
  - `area` - the area of the network: $x \times y$ [m$^2$],
  - `load` - offered load [b/s] (in the sample rows below it equals `station_no` × 250,000 b/s, i.e., 250 kb/s per station),
  - `cw` - the minimum contention window value [slots] (the maximum is always 1023),
  - `channel` - the channel width [MHz],
  - `pkt_size` - packet size [b],
- Outputs  
  - `max_snr`, `avg_snr`, `min_snr` - maximum, average, and minimum observed SNR [dBm],
  - `p_fail` - fraction of failed transmissions (i.e., not correctly received), caused by either a transmission error or a collision (the transmission error probability is fixed at 0.1),
  - `throughput` - aggregate network throughput [b/s],
  - `avg_del` - average time required to transmit a (head of line) packet [s],
  - `total_airtime` - sum of transmission time required for all transmissions,
  - `proportional_airtime` - airtime used for successful transmissions (does not account for time spent in collisions or retransmissions).
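
As a quick sanity check on the column definitions, the sketch below rebuilds a few input columns by hand and confirms that `area` equals `x` × `y`. The five rows are copied from the sample output printed in the loading step of this notebook. The frame name `sample` is an assumption made for illustration, and the check is not part of the original analysis.

```python
import pandas as pd

# Five rows copied from the printed sample of the dataset (inputs only).
# `sample` is a hypothetical name used only in this sketch.
sample = pd.DataFrame({
    "station_no": [21, 7, 13, 9, 11],
    "load": [5250000, 1750000, 3250000, 2250000, 2750000],
    "x": [22, 23, 4, 25, 31],
    "y": [39, 25, 24, 18, 31],
    "area": [858, 575, 96, 450, 961],
})

# `area` is documented as x * y, so the two should agree row by row.
area_ok = (sample["x"] * sample["y"] == sample["area"]).all()

# Ratio of offered load to the number of stations, useful for inspecting units.
load_per_station = sample["load"] / sample["station_no"]

print("area == x * y for every row:", area_ok)
print("load / station_no:", load_per_station.tolist())
```

The same two expressions can be applied to `data_set_all` once it has been loaded.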

Load, parse, and print dataset:

In [None]:
url='https://drive.google.com/file/d/1ZwXZwbhpN1Z0Llzr6sq5WN87TgiNJQP-/view?usp=sharing'
file_id=url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
data_set_all = pd.read_csv(dwn_url, sep=',', names=['station_no','load','x','y','area','cw','channel',
                                                    'pkt_size','max_snr','avg_snr','min_snr','p_fail',
                                                    'throughput','avg_del','total_airtime','proportional_airtime'] )

# Drop any rows with null values
data_set_all.dropna(axis=0, how='any', inplace=True)

print('Data set (first five rows):')
print(data_set_all.head())

Data set (first five rows):
   station_no       load     x     y   area     cw  channel  pkt_size  \
0        21.0  5250000.0  22.0  39.0  858.0   63.0     80.0    4000.0   
1         7.0  1750000.0  23.0  25.0  575.0  511.0    160.0    6000.0   
2        13.0  3250000.0   4.0  24.0   96.0    7.0     40.0   12000.0   
3         9.0  2250000.0  25.0  18.0  450.0    7.0    160.0    8000.0   
4        11.0  2750000.0  31.0  31.0  961.0   63.0     20.0   12000.0   

     max_snr    avg_snr    min_snr    p_fail   throughput   avg_del  \
0 -42.928330 -63.360578 -74.965338  0.176645  5249995.023  0.002946   
1 -58.124635 -60.061144 -61.387044  0.103061  1749999.978  0.003899   
2 -42.650904 -51.205925 -63.162513  0.130817  3249999.721  0.000726   
3 -54.870343 -58.795338 -65.139086  0.127697  2249999.841  0.000648   
4 -52.336210 -66.339800 -76.562695  0.107643  2749999.950  0.001515   

   total_airtime  proportional_airtime  
0       0.731502              0.638437  
1       0.154115        
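
The loading cell relies on two small mechanisms: turning a Google Drive sharing link into a direct-download link, and reading a header-less CSV with explicit column names. A minimal sketch of both follows. The helper name `drive_download_url` and the three-column in-memory CSV are assumptions introduced for illustration so that the second part runs without network access.

```python
import io
import pandas as pd

def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive '.../file/d/<id>/view' link into a direct-download link."""
    file_id = share_url.split("/")[-2]  # the ID is the second-to-last path segment
    return "https://drive.google.com/uc?id=" + file_id

share_url = "https://drive.google.com/file/d/1ZwXZwbhpN1Z0Llzr6sq5WN87TgiNJQP-/view?usp=sharing"
print(drive_download_url(share_url))

# Offline illustration: a header-less CSV in which the second row is incomplete.
# The column subset and the values are made up for this sketch.
raw = io.StringIO("21,5250000,0.0029\n7,,0.0039\n13,3250000,0.0007\n")
demo = pd.read_csv(raw, names=["station_no", "load", "avg_del"])

# Same cleaning step as in the notebook: drop rows containing any null value.
demo = demo.dropna(axis=0, how="any")
print(demo)
```

Because `names` is passed, pandas treats the first line of the file as data rather than as a header, which is why the printed rows of the real dataset start at index 0 with numeric values.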

## 2. Data correlation

Which input parameters impact output metrics such as throughput and delay? We check the correlation using a heatmap.

In [None]:
column_names = ['station_no','load','x','y','area','cw','channel','pkt_size',
                'max_snr','avg_snr','min_snr', 'throughput', 'avg_del']
dataplot = sb.heatmap(data_set_all[column_names].corr(), cmap="Reds", annot=False)
plt.show()
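
A heatmap is convenient for spotting patterns, but a sorted list of correlations with a single target is often easier to read. The sketch below shows one way to produce such a ranking. It uses synthetic stand-in data (the frame `toy` and its columns `a`, `b`, `c` are assumptions, not part of the Wi-Fi dataset) so that the expected outcome is known in advance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data (not the Wi-Fi dataset): the target depends
# strongly on `a`, weakly on `b`, and not at all on `c`.
toy = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
toy["target"] = 3.0 * toy["a"] + 0.5 * toy["b"] + rng.normal(scale=0.5, size=n)

# Pearson correlation matrix, the same quantity the heatmap colours.
corr = toy.corr()

# Rank the inputs by the absolute value of their correlation with the target.
ranking = corr["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

On the notebook's data, the equivalent expression would be `data_set_all[column_names].corr()['throughput']`. Note that the Pearson coefficient captures only linear association, which matches the linear model fitted in the next step.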

## 3. Model fitting

We choose which columns to use and then fit a linear regression model to our data.

In [None]:
# Uncomment one of the following lines depending on required analysis
# For including all features:

#column_names = ['station_no','load','x','y','area','cw','channel','pkt_size',
#                'max_snr','avg_snr','min_snr']

# For analysing throughput:
column_names = ['station_no','load','max_snr']

# For analysing delay:
#column_names = ['station_no','load','cw']

X = data_set_all[column_names]

# Select throughput or delay:
y = data_set_all['throughput']
#y = data_set_all['avg_del']

# Load function for data splitting
from sklearn.model_selection import train_test_split

# Split the data set
trainX, testX, trainY, testY = train_test_split(X,y, test_size = 0.3)

# Regression analysis starts here
from sklearn.linear_model import LinearRegression

# Linear regression model with n predictors:
#  y = c + b_1*x_1 + ... + b_n*x_n,
# where
#   y is the outcome,
#   x_i are the predictors (the selected columns),
#   b_i are the regression coefficients (slopes),
#   c is the intercept.

model = LinearRegression()
model.fit(trainX, trainY)

# Check the model parameters

print(f"b = {model.coef_}, c = {model.intercept_}")
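
To build intuition for what `coef_` and `intercept_` mean, the following sketch fits the same kind of model to synthetic data generated from a known linear relationship. All names (`X_toy`, `y_toy`, `toy_model`) and the generating coefficients are assumptions chosen for illustration. The `random_state` argument is an optional addition that makes the split reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data with a known linear relationship (hypothetical values):
#   y = 5 + 2*x1 - 3*x2 + noise
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = 5.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

# random_state fixes the shuffle so the split is the same on every run.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

toy_model = LinearRegression().fit(X_tr, y_tr)

# The fitted parameters should be close to the values used to generate the data.
print("coefficients:", toy_model.coef_)
print("intercept:", toy_model.intercept_)
```

Each entry of `coef_` is the change in the predicted outcome per unit change of the corresponding predictor with the others held fixed, so coefficients of predictors measured on very different scales (for example b/s versus dB) are not directly comparable in magnitude.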


## 4. Data prediction

In [None]:
# Generate predictions from the trained model
pred=model.predict(testX)

# Calculate MSE, RMSE
from sklearn.metrics import mean_squared_error
print('MSE =', mean_squared_error(testY,pred))
print('RMSE =', np.sqrt(mean_squared_error(testY,pred)))
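
An MSE or RMSE value is hard to judge in isolation. The sketch below computes both by hand, compares them with scikit-learn's result, and adds two reference points: the error of a baseline that always predicts the mean, and the coefficient of determination $R^2$. The arrays `y_true` and `y_pred` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values, chosen only for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE is the mean of the squared residuals; RMSE is its square root,
# which brings the error back to the units of the target.
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

# Baseline: always predict the mean of the true values.
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)

print("MSE (manual)  =", mse_manual)
print("MSE (sklearn) =", mean_squared_error(y_true, y_pred))
print("RMSE          =", rmse_manual)
print("baseline MSE  =", baseline_mse)
print("R^2           =", r2_score(y_true, y_pred))
```

In the notebook, the same comparison can be made with `testY` and `pred`: an RMSE well below the standard deviation of `testY` suggests that the model explains most of the variation.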

Plot the true and predicted values for the first 100 test samples:


In [None]:
results = pd.DataFrame()
results['testY']=testY
results['pred']=pred

# .iloc[:100] selects the first 100 test samples by position
results['testY'].iloc[:100].plot(alpha=0.5, color = 'red', marker='o', linestyle='None', label='Actual');
results['pred'].iloc[:100].plot(alpha=0.5, color = 'blue', marker='x', linestyle='None', label='Predicted');
plt.legend()
plt.xlabel('Configuration ID')
plt.ylabel('Throughput [b/s]')
#plt.ylabel('Delay [s]')
plt.show()
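
A complementary view is to scatter the predicted values against the actual ones: points on the diagonal are perfect predictions, and systematic deviations from it show where the linear model breaks down. The helper `plot_pred_vs_actual` below is a hypothetical function written for this sketch, and the three sample points are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pred_vs_actual(actual, predicted, label="Throughput [b/s]"):
    """Scatter predicted against actual values with a y = x reference line."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    fig, ax = plt.subplots()
    ax.scatter(actual, predicted, alpha=0.5, marker="x")
    # Reference line spanning the full range of both series.
    lo = min(actual.min(), predicted.min())
    hi = max(actual.max(), predicted.max())
    ax.plot([lo, hi], [lo, hi], color="gray", linestyle="--", label="Ideal (y = x)")
    ax.set_xlabel("Actual " + label)
    ax.set_ylabel("Predicted " + label)
    ax.legend()
    return fig, ax

# Hypothetical values for illustration only.
fig, ax = plot_pred_vs_actual([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

With the notebook's variables this would be called as `plot_pred_vs_actual(testY, pred)`, or with `label="Delay [s]"` when the delay model is selected, followed by `plt.show()`.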