To what extent can network performance indicators (signal strength, network type, data throughput, and readings from the Signal Hound and SDR hardware) predict latency in cellular networks using linear regression?

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import statsmodels.api as sm

In [22]:
signal_metric = pd.read_csv("https://raw.githubusercontent.com/izaan-khudadad/Data-Mining/refs/heads/main/signal_metrics.csv", na_values=['?'])
signal_metric.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16829 entries, 0 to 16828
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Timestamp                     16829 non-null  object 
 1   Locality                      16829 non-null  object 
 2   Latitude                      16829 non-null  float64
 3   Longitude                     16829 non-null  float64
 4   Signal Strength (dBm)         16829 non-null  float64
 5   Signal Quality (%)            16829 non-null  float64
 6   Data Throughput (Mbps)        16829 non-null  float64
 7   Latency (ms)                  16829 non-null  float64
 8   Network Type                  16829 non-null  object 
 9   BB60C Measurement (dBm)       16829 non-null  float64
 10  srsRAN Measurement (dBm)      16829 non-null  float64
 11  BladeRFxA9 Measurement (dBm)  16829 non-null  float64
dtypes: float64(9), object(3)
memory usage: 1.5+ MB


In [23]:
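df = signal_metric.copy()  # work on a copy so the raw frame keeps its original column names

# Clean and standardize column names
df.columns = (
    df.columns
    .str.strip()                       # remove leading/trailing spaces
    .str.lower()                       # convert to lowercase
    .str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation/symbols such as ( ) %
    .str.replace(r'\s+', '_', regex=True)     # replace spaces with underscores
)
# Note: 'Signal Quality (%)' becomes 'signal_quality_' because the space before '(%)' survives.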

# Display new column names
print(df.columns.tolist())

['timestamp', 'locality', 'latitude', 'longitude', 'signal_strength_dbm', 'signal_quality_', 'data_throughput_mbps', 'latency_ms', 'network_type', 'bb60c_measurement_dbm', 'srsran_measurement_dbm', 'bladerfxa9_measurement_dbm']


In [42]:
df.describe()

        latitude  longitude  signal_strength_dbm  signal_quality_  data_throughput_mbps
count    16829.0    16829.0              16829.0          16829.0               16829.0
mean   25.594796  85.137314           -90.072484              0.0             16.182856
std     0.089881   0.090095             5.399368              0.0             25.702734
min    25.414575  84.957936          -116.942267              0.0              1.000423
25%    25.522858  85.064124           -93.615962              0.0              2.001749
50%    25.595383  85.138149           -89.665566              0.0              2.997175
75%     25.66762  85.209504           -86.145491              0.0              9.956314
max    25.773648  85.316994           -74.644848              0.0             99.985831

       latency_ms  bb60c_measurement_dbm  srsran_measurement_dbm  bladerfxa9_measurement_dbm
count     16829.0                16829.0                 16829.0                     16829.0
mean   101.313624              -68.82015              -74.439562                   -68.81993
std     56.010418              40.046739               43.215204                   39.996934
min     10.019527            -115.667514             -124.652054                 -119.207545
25%     50.320775             -94.021959             -101.249987                  -93.749032
50%    100.264318             -89.126942              -96.838442                  -89.282746
75%    149.951112                    0.0                     0.0                         0.0
max    199.991081                    0.0                     0.0                         0.0
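
The summary statistics show `signal_quality_` with a standard deviation of 0 and the three hardware measurement columns with 0.0 as both their 75th percentile and maximum. The following is a minimal sketch of how such columns might be flagged before modeling. The `toy` frame and its values are made up for illustration, and reading 0.0 dBm as a "no reading" placeholder is an assumption that would need checking against the data source.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook's cleaned names.
toy = pd.DataFrame({
    "signal_strength_dbm": [-90.1, -85.3, -95.7, -88.0],
    "signal_quality_": [0.0, 0.0, 0.0, 0.0],
    "bb60c_measurement_dbm": [-94.0, 0.0, -89.1, 0.0],
})

# Columns with a single unique value cannot help a regression model.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Share of exact zeros per column; a large share in a dBm column may mean
# that 0.0 stands in for a missing reading (an assumption to verify).
zero_share = (toy == 0).mean()

print(constant_cols)
print(zero_share)
```

Running the same two checks on `df` would show whether the zeros in the measurement columns are frequent enough to distort their coefficients.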


In [24]:
df.isnull().sum()

timestamp                     0
locality                      0
latitude                      0
longitude                     0
signal_strength_dbm           0
signal_quality_               0
data_throughput_mbps          0
latency_ms                    0
network_type                  0
bb60c_measurement_dbm         0
srsran_measurement_dbm        0
bladerfxa9_measurement_dbm    0
dtype: int64

In [None]:
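# signal_quality_ is omitted: it is 0.0 in every row (std = 0), so it carries no information.
X = df[['signal_strength_dbm', 'data_throughput_mbps',
        'network_type', 'bb60c_measurement_dbm', 'srsran_measurement_dbm',
        'bladerfxa9_measurement_dbm']]

y = df['latency_ms']

# One-hot encode network_type. drop_first=True omits the first category in sorted order,
# which becomes the baseline that the remaining network_type_* coefficients are measured against.
X = pd.get_dummies(X, drop_first=True)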
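
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up data. The category label `"3G"` is an assumption chosen for illustration; the notebook output does not show which network type is actually dropped.

```python
import pandas as pd

# Toy data with made-up values; the "3G" label is an illustrative assumption.
toy = pd.DataFrame({
    "signal_strength_dbm": [-90.0, -85.0, -95.0, -88.0],
    "network_type": ["3G", "4G", "5G", "LTE"],
})

encoded = pd.get_dummies(toy, drop_first=True)

# Categories are sorted, and the first one ("3G" here) gets no column of its own.
# It becomes the baseline against which the other dummy coefficients are read.
print(encoded.columns.tolist())
```

A row of the baseline type has all `network_type_*` dummies equal to zero, so its latency is captured by the intercept rather than by a coefficient.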

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [46]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [47]:
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

In [51]:
print("Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mae:.2f} ms")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f} ms")
print(f"R² Score: {r2:.3f}")

Model Evaluation:
Mean Absolute Error (MAE): 18.14 ms
Root Mean Squared Error (RMSE): 22.42 ms
R² Score: 0.842
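
For readers who want to see what the three metrics measure, the sketch below computes them by hand with NumPy. The arrays are made up for illustration and are not taken from the dataset; the `_toy` names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np

# Made-up targets and predictions, purely for illustration.
y_true = np.array([100.0, 150.0, 50.0, 200.0])
y_hat = np.array([110.0, 140.0, 60.0, 180.0])

errors = y_true - y_hat
mae_toy = np.mean(np.abs(errors))           # average absolute miss, in ms
rmse_toy = np.sqrt(np.mean(errors ** 2))    # weights large misses more heavily

ss_res = np.sum(errors ** 2)                        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # variance around the mean
r2_toy = 1.0 - ss_res / ss_tot                      # share of variance explained

print(f"MAE={mae_toy:.2f}  RMSE={rmse_toy:.2f}  R2={r2_toy:.3f}")
```

As a rough cross-check on the reported figures: latency has a standard deviation of about 56.01 ms in the summary statistics, so an RMSE of 22.42 ms implies 1 - (22.42 / 56.01)² ≈ 0.84, in line with the reported R². The standard deviation covers the full dataset rather than the test split, so this is only approximate.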


In [50]:
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)

print("\nFeature Coefficients:")
print(coefficients)


Feature Coefficients:
                      Feature  Coefficient
7            network_type_LTE     9.968396
4  bladerfxa9_measurement_dbm     0.096919
2       bb60c_measurement_dbm     0.014040
1        data_throughput_mbps     0.003966
3      srsran_measurement_dbm    -0.007922
0         signal_strength_dbm    -0.088481
5             network_type_4G   -65.726637
6             network_type_5G  -110.754302
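
Raw coefficients are per unit of each feature, so features on different scales are hard to compare directly. One hedged way to put the continuous features on a common footing is to multiply each coefficient by that feature's standard deviation. The sketch below does this with numbers copied from the outputs above; the standard deviations come from `df.describe()` on the full dataset rather than the training split, so the results are approximate.

```python
# Coefficients from the fitted model and standard deviations from df.describe(),
# both copied from the printed outputs. The std values cover the full dataset,
# not just the training split, so this is only an approximation.
coef = {
    "signal_strength_dbm": -0.088481,
    "data_throughput_mbps": 0.003966,
    "bb60c_measurement_dbm": 0.014040,
    "srsran_measurement_dbm": -0.007922,
    "bladerfxa9_measurement_dbm": 0.096919,
}
std = {
    "signal_strength_dbm": 5.399368,
    "data_throughput_mbps": 25.702734,
    "bb60c_measurement_dbm": 40.046739,
    "srsran_measurement_dbm": 43.215204,
    "bladerfxa9_measurement_dbm": 39.996934,
}

# Predicted change in latency (ms) for a one-standard-deviation increase.
per_sd = {name: coef[name] * std[name] for name in coef}

# Print features from largest to smallest absolute effect.
for name, effect in sorted(per_sd.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:<28} {effect:+.2f} ms per SD")
```

On this scale every continuous feature shifts predicted latency by less than 4 ms per standard deviation, far below the magnitude of the 4G and 5G coefficients.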


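1. Network type is by far the strongest predictor of latency. Each coefficient is measured against the reference category omitted by drop_first=True: with the other features held fixed, 5G predicts about 111 ms lower latency and 4G about 66 ms lower, while LTE predicts about 10 ms higher. 5G is therefore roughly 45 ms below 4G and 121 ms below LTE.
2. Signal strength has a small negative coefficient (-0.088 ms per dBm): a stronger signal is associated with slightly lower latency, but across the full observed range (about -117 to -75 dBm) the predicted difference is under 4 ms.
3. The hardware measurement variables (BB60C, srsRAN, BladeRFxA9) also have small per-dBm coefficients, at most about 0.1 ms per dBm. At least a quarter of their values are exactly 0.0 dBm (their 75th percentile and maximum are both 0.0), so these coefficients should be read with caution.
4. signal_quality_ is 0.0 in every row (standard deviation 0), so it carries no information. Leaving it out simplifies the model and cannot reduce predictive power.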
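
The first point compares network types. Because each dummy coefficient is measured against the category dropped by `drop_first=True`, the gap between any two types follows by subtraction. A minimal sketch using the coefficients printed above; the `"baseline"` key and the `latency_gap` helper are hypothetical names introduced for illustration.

```python
# Network type coefficients copied from the printed output. Each is relative
# to the reference category dropped by drop_first=True, which has coefficient 0.
network_coef = {
    "baseline": 0.0,   # hypothetical label for the dropped category
    "LTE": 9.968396,
    "4G": -65.726637,
    "5G": -110.754302,
}

def latency_gap(a: str, b: str) -> float:
    """Predicted latency of type a minus type b, other features held fixed."""
    return network_coef[a] - network_coef[b]

print(f"5G vs 4G:  {latency_gap('5G', '4G'):+.1f} ms")
print(f"5G vs LTE: {latency_gap('5G', 'LTE'):+.1f} ms")
print(f"4G vs LTE: {latency_gap('4G', 'LTE'):+.1f} ms")
```

These gaps describe the fitted linear model only. They assume the effect of network type is the same at every level of the other features, which the model imposes rather than tests.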