
# Dataset Preprocessing for AI-based Resource Allocation in Open RAN

This notebook prepares a **final, working dataset** for:
- Centralized AI model
- Federated (local) models

Datasets used:
- `android_traffic.csv`
- `Midterm_53_group.csv`

All input features are derived **only from these two files**; the target label (`allocated_bandwidth`) is synthetic and is constructed later in the notebook.


In [1]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


## Load datasets

In [2]:

# android_traffic.csv is semicolon-delimited; Midterm_53_group.csv uses the default comma
android_df = pd.read_csv("android_traffic.csv", sep=';')
network_df = pd.read_csv("Midterm_53_group.csv")

print(android_df.shape)
print(network_df.shape)


(7845, 17)
(394136, 7)


## Aggregate Android traffic (user demand features)

In [3]:

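# Group every 10 consecutive rows into one sample (row position // 10 is the block id)
android_agg = android_df.groupby(
    np.arange(len(android_df)) // 10
).agg({
    'vulume_bytes': 'sum',      # column name as spelled in the CSV header
    'external_ips': 'nunique'   # number of distinct values within the block
}).reset_index(drop=True)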

android_agg.rename(columns={
    'vulume_bytes': 'traffic_load',
    'external_ips': 'num_users'
}, inplace=True)

print(android_agg.head())


   traffic_load  num_users
0         77450          4
1        126410          6
2        194753          4
3        104932          7
4       1131664          6
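
The block-wise grouping used above can be checked on a toy frame. This is a minimal sketch: the values in `toy` are made up for illustration, and `block_size` is a hypothetical name for the constant that the notebook hard-codes as 10.

```python
import numpy as np
import pandas as pd

# Toy stand-in for android_df (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "vulume_bytes": [100, 200, 300, 400, 500],
    "external_ips": [1, 1, 2, 3, 3],
})

block_size = 2  # the notebook uses 10
block_id = np.arange(len(toy)) // block_size  # array([0, 0, 1, 1, 2])

toy_agg = toy.groupby(block_id).agg({
    "vulume_bytes": "sum",       # total bytes per block
    "external_ips": "nunique",   # distinct values per block
}).reset_index(drop=True)

print(toy_agg)
```

The last block is simply shorter when the row count is not a multiple of the block size, so its sum covers fewer rows than the others.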


## Aggregate Network traffic (packet behavior features)

In [4]:

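# Same 10-row blocks as above: count the packets and average their length
network_agg = network_df.groupby(
    np.arange(len(network_df)) // 10
).agg({
    'Length': ['count', 'mean']
}).reset_index(drop=True)

# Flatten the two-level column index produced by the list of aggregations
network_agg.columns = ['packet_count', 'avg_packet_size']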

print(network_agg.head())


   packet_count  avg_packet_size
0            10             66.4
1            10             60.0
2            10             80.9
3            10             69.6
4            10            109.6


## Feature-level fusion (NO timestamp merge)

In [5]:

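# The two sources are not merged on timestamps; rows are paired by position.
# Truncate both frames to the shorter length, then concatenate column-wise.
min_len = min(len(android_agg), len(network_agg))

final_df = pd.concat([
    android_agg.iloc[:min_len],
    network_agg.iloc[:min_len]
], axis=1)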

print(final_df.shape)
print(final_df.head())


(785, 4)
   traffic_load  num_users  packet_count  avg_packet_size
0         77450          4            10             66.4
1        126410          6            10             60.0
2        194753          4            10             80.9
3        104932          7            10             69.6
4       1131664          6            10            109.6
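
Positional fusion works here because both aggregated frames carry a fresh `0..n-1` index. `pd.concat(axis=1)` aligns on index labels, not on row position, so the sketch below (toy frames `a` and `b`, invented for illustration) shows what happens when the labels do not match and how resetting the index restores a row-by-row pairing.

```python
import pandas as pd

a = pd.DataFrame({"traffic_load": [10, 20, 30]})
# b deliberately carries index labels that do not overlap with a's
b = pd.DataFrame({"packet_count": [1, 2, 3, 4, 5]}, index=[5, 6, 7, 8, 9])

min_len = min(len(a), len(b))  # 3

# Label-based alignment: no shared labels, so the result is padded with NaN
naive = pd.concat([a.iloc[:min_len], b.iloc[:min_len]], axis=1)

# Resetting both indexes makes the concatenation purely positional
fused = pd.concat(
    [a.iloc[:min_len].reset_index(drop=True),
     b.iloc[:min_len].reset_index(drop=True)],
    axis=1,
)

print(naive.shape, fused.shape)
```

Pairing by position means block *i* of one source is matched with block *i* of the other purely by row order; it implies no temporal relationship between the two.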


## Create synthetic resource allocation label

In [6]:

# Synthetic target: weighted sum of the raw (not yet normalized) features
final_df['allocated_bandwidth'] = (
    0.7 * final_df['traffic_load'] +
    0.3 * final_df['packet_count']
)
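
As a quick sanity check on this formula, the sketch below recomputes the label for the first three fused rows printed earlier. Judging by the printed output, `packet_count` is 10 in every retained row, in which case the second term contributes a constant 3 and the label is an affine function of `traffic_load` alone. The names `rows`, `label`, and `affine` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# First fused rows as printed in the notebook output
rows = pd.DataFrame({
    "traffic_load": [77450, 126410, 194753],
    "packet_count": [10, 10, 10],
})

label = 0.7 * rows["traffic_load"] + 0.3 * rows["packet_count"]

# With packet_count fixed at 10, the second term is a constant offset of 3
affine = 0.7 * rows["traffic_load"] + 3.0

print(label.round(1).tolist())
print(np.allclose(label, affine))
```

A model trained on this target can therefore recover it from `traffic_load` alone; the other features carry no information about the label as constructed.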


## Normalize features

In [7]:

# Scale only the input features to [0, 1]; the label stays in its raw units
features = ['traffic_load', 'num_users', 'packet_count', 'avg_packet_size']

scaler = MinMaxScaler()
final_df[features] = scaler.fit_transform(final_df[features])

print(final_df.describe())


       traffic_load   num_users  packet_count  avg_packet_size  \
count    785.000000  785.000000         785.0       785.000000   
mean       0.037735    0.450159           0.0         0.434240   
std        0.063268    0.205270           0.0         0.304573   
min        0.000000    0.000000           0.0         0.000000   
25%        0.009460    0.250000           0.0         0.126685   
50%        0.024890    0.500000           0.0         0.481431   
75%        0.042585    0.625000           0.0         0.675653   
max        1.000000    1.000000           0.0         1.000000   

       allocated_bandwidth  
count         7.850000e+02  
mean          1.157355e+05  
std           1.940428e+05  
min           3.000000e+00  
25%           2.901590e+04  
50%           7.634010e+04  
75%           1.306118e+05  
max           3.067005e+06  
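
The all-zero `packet_count` column in the summary above is what min-max scaling produces for a constant column. The sketch below writes the scaling by hand in NumPy to make the arithmetic visible; `min_max_scale` is a hypothetical helper, not part of scikit-learn, and it is only assumed to mirror what `MinMaxScaler` does with its default `[0, 1]` range.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against division by zero when a column has zero range
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Columns: packet_count (constant), avg_packet_size (values from the printed head)
X = [[10, 66.4], [10, 60.0], [10, 80.9], [10, 69.6], [10, 109.6]]
scaled = min_max_scale(X)

print(scaled)
```

A column that is identically zero after scaling adds nothing to a model, so it may be worth dropping `packet_count` or aggregating the packet capture in a way that lets the count vary.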


## Save final dataset

In [8]:

final_df.to_csv("C:\\Users\\YASH\\Documents\\Project\\data\\final_ran_dataset.csv", index=False)
print("Saved final_ran_dataset.csv")


Saved final_ran_dataset.csv


## Split dataset into local edge-node datasets (Federated Learning)

In [9]:

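num_nodes = 3

# Split row positions, not the DataFrame: np.array_split on a DataFrame
# relies on the deprecated DataFrame.swapaxes and emits a FutureWarning.
index_chunks = np.array_split(np.arange(len(final_df)), num_nodes)
local_datasets = [final_df.iloc[idx] for idx in index_chunks]

for i, df in enumerate(local_datasets):
    df.to_csv(f"C:\\Users\\YASH\\Documents\\Project\\data\\edge_node_{i+1}.csv", index=False)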

print("Edge-node datasets created")


Edge-node datasets created
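
To see how 785 rows are distributed across three nodes, the sketch below splits row positions with `np.array_split` and slices a toy frame with `iloc`. The frame `toy` is a placeholder with the same number of rows as `final_df`; its contents are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": range(785)})  # same row count as final_df
num_nodes = 3

# array_split tolerates uneven division: the first chunks get one extra row
index_chunks = np.array_split(np.arange(len(toy)), num_nodes)
parts = [toy.iloc[idx] for idx in index_chunks]

sizes = [len(p) for p in parts]
print(sizes)
```

The chunks are contiguous, so each node receives a consecutive slice of the row order. If the nodes should see comparable distributions, shuffling first (for example with `toy.sample(frac=1, random_state=0)`) is one option; whether that suits the federated setup is a design decision this notebook does not settle.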


