STEP 1: Dataset Overview

Dataset Name: Delhi Traffic Density Dataset
City: Delhi
Data Type: Time-series traffic density data
Granularity: Per-second measurements
Duration: 40 recording days between September and December 2020 (not consecutive)

Each CSV file represents one day of traffic data collected using traffic cameras at a Delhi intersection.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
import pickle


In [2]:
import os

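# Raw string so the backslashes in the Windows path are not read as escape sequences
data_path = Path(r"D:\daProjects\Traffic_flow_prediction\data\extracted\DelhiTrafficDensityDataset")

# Keep only CSV files, sorted by name for a reproducible order
files = sorted(f for f in os.listdir(data_path) if f.endswith(".csv"))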

print("Number of CSV files:", len(files))
print("Sample files:", files[:5])


Number of CSV files: 40
Sample files: ['Dec15.csv', 'Dec16.csv', 'Dec17.csv', 'Dec18.csv', 'Dec19.csv']


In [3]:
sample_file = os.path.join(data_path, files[0])
df = pd.read_csv(sample_file)

df.head(), df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54001 entries, 0 to 54000
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   EpochTime      54001 non-null  int64  
 1   QueueDensity1  54001 non-null  float64
 2   StopDensity1   54001 non-null  float64
 3   QueueDensity2  54001 non-null  float64
 4   StopDensity2   54001 non-null  float64
 5   QueueDensity3  54001 non-null  float64
 6   StopDensity3   54001 non-null  float64
 7   QueueDensity4  54001 non-null  float64
 8   StopDensity4   54001 non-null  float64
 9   QueueDensity5  54001 non-null  float64
 10  StopDensity5   54001 non-null  float64
 11  QueueDensity6  54001 non-null  float64
 12  StopDensity6   54001 non-null  float64
dtypes: float64(12), int64(1)
memory usage: 5.4 MB


(    EpochTime  QueueDensity1  StopDensity1  QueueDensity2  StopDensity2  \
 0  1607995800       0.081189      0.037368       0.055712      0.053962   
 1  1607995801       0.073563      0.054372       0.054795      0.050808   
 2  1607995802       0.081688      0.056802       0.054722      0.038284   
 3  1607995803       0.133200      0.084858       0.160540      0.123520   
 4  1607995804       0.210780      0.055671       0.078904      0.054119   
 
    QueueDensity3  StopDensity3  QueueDensity4  StopDensity4  QueueDensity5  \
 0       0.078610      0.076280       0.107167      0.081873       0.055056   
 1       0.078111      0.075976       0.109480      0.082113       0.046894   
 2       0.077518      0.075904       0.109600      0.092484       0.043800   
 3       0.077409      0.071430       0.108866      0.086182       0.043163   
 4       0.078076      0.076397       0.109781      0.087974       0.046313   
 
    StopDensity5  QueueDensity6  StopDensity6  
 0      0.011932  

In [4]:
df[['EpochTime']].head()


    EpochTime
0  1607995800
1  1607995801
2  1607995802
3  1607995803
4  1607995804


STEP 2: Timestamp Conversion and Dataset Merging
In this step, all daily CSV files are loaded programmatically.
The Unix epoch time is converted to a human-readable datetime (in UTC, since no time zone is applied), and the files are merged into a single time-ordered dataset.
This merged dataset will serve as the base for further aggregation and modeling.
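
As a quick check of what the conversion produces, the sketch below converts the first EpochTime value from the sample file. It assumes the values are standard UTC-based Unix seconds. The `Asia/Kolkata` conversion is an illustrative addition and is not part of the notebook's pipeline.

```python
import pandas as pd

# First EpochTime value shown in the sample file output
epoch = pd.Series([1607995800])

# unit="s" interprets the integers as seconds since 1970-01-01 UTC
utc_ts = pd.to_datetime(epoch, unit="s", utc=True)

# Illustrative only: view the same instant in Delhi local time (IST, UTC+05:30)
ist_ts = utc_ts.dt.tz_convert("Asia/Kolkata")

print(utc_ts.iloc[0])  # 2020-12-15 01:30:00+00:00
print(ist_ts.iloc[0])  # 2020-12-15 07:00:00+05:30
```

The notebook keeps the naive UTC values, so the 01:30 start seen in the outputs corresponds to 07:00 in Delhi.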

In [5]:
data_path = r"D:\daProjects\Traffic_flow_prediction\data\extracted\DelhiTrafficDensityDataset"

csv_files = sorted([f for f in os.listdir(data_path) if f.endswith(".csv")])

print("Total CSV files found:", len(csv_files))
csv_files[:5]


Total CSV files found: 40


['Dec15.csv', 'Dec16.csv', 'Dec17.csv', 'Dec18.csv', 'Dec19.csv']

In [6]:
# Load files, convert EpochTime, and merge
df_list = []

for file in csv_files:
    file_path = os.path.join(data_path, file)
    temp_df = pd.read_csv(file_path)

    # Convert Unix epoch seconds to datetime (naive timestamps in UTC; Delhi local time is UTC+05:30)
    temp_df['timestamp'] = pd.to_datetime(temp_df['EpochTime'], unit='s')

    df_list.append(temp_df)

traffic_df = pd.concat(df_list, ignore_index=True)


In [7]:
# Sort Dataset by Time (CRITICAL)
traffic_df = traffic_df.sort_values('timestamp').reset_index(drop=True)

traffic_df[['timestamp']].head(), traffic_df[['timestamp']].tail()


(            timestamp
 0 2020-09-12 01:30:00
 1 2020-09-12 01:30:01
 2 2020-09-12 01:30:02
 3 2020-09-12 01:30:03
 4 2020-09-12 01:30:04,
                   timestamp
 2160035 2020-12-19 16:29:56
 2160036 2020-12-19 16:29:57
 2160037 2020-12-19 16:29:58
 2160038 2020-12-19 16:29:59
 2160039 2020-12-19 16:30:00)

In [8]:
# Validate the Merged Dataset
print("Total rows:", traffic_df.shape[0])
print("Start timestamp:", traffic_df['timestamp'].min())
print("End timestamp:", traffic_df['timestamp'].max())

print("\nMissing values per column:")
print(traffic_df.isnull().sum())


Total rows: 2160040
Start timestamp: 2020-09-12 01:30:00
End timestamp: 2020-12-19 16:30:00

Missing values per column:
EpochTime        0
QueueDensity1    0
StopDensity1     0
QueueDensity2    0
StopDensity2     0
QueueDensity3    0
StopDensity3     0
QueueDensity4    0
StopDensity4     0
QueueDensity5    0
StopDensity5     0
QueueDensity6    0
StopDensity6     0
timestamp        0
dtype: int64


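Outcome of STEP 2:
All 40 daily CSV files were merged into a single dataset of 2,160,040 rows (54,001 per file on average) after converting Unix epoch timestamps to datetime format.
The resulting dataset is sorted in temporal order and has no missing values in any column. It is not continuous in wall-clock time: the first and last timestamps fall at 01:30 and 16:30 UTC, consistent with about 15 hours of recording per day, and the 40 recording days are spread between 2020-09-12 and 2020-12-19.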
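
Because the files cover separate recording days, it can help to check where the merged timestamps jump. The sketch below uses a hypothetical four-row series rather than the real data and flags any step longer than the 1-second sampling interval.

```python
import pandas as pd

# Hypothetical miniature of the merged data: two recording blocks on different days
ts = pd.Series(pd.to_datetime([
    "2020-09-12 16:29:59", "2020-09-12 16:30:00",
    "2020-09-14 01:30:00", "2020-09-14 01:30:01",
]))

# Time elapsed since the previous row; the first row has no predecessor (NaT)
step = ts.diff()

# Any step longer than the 1-second sampling interval marks a discontinuity
gaps = step[step > pd.Timedelta(seconds=1)]
print(len(gaps), "gap(s); largest:", gaps.max())
```

On the notebook's data, the same check would presumably be applied to `traffic_df['timestamp']` after sorting.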

STEP 3: Lane-wise Aggregation and Hourly Resampling
In this step, multiple lane-level queue density measurements are aggregated into a single traffic density signal.
The per-second data is then resampled to hourly averages to reduce noise and create a stable time-series suitable for deep learning models such as LSTM.

In [9]:
queue_density_cols = [
    'QueueDensity1', 'QueueDensity2', 'QueueDensity3',
    'QueueDensity4', 'QueueDensity5', 'QueueDensity6'
]

traffic_df['avg_queue_density'] = traffic_df[queue_density_cols].mean(axis=1)


In [10]:
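# Keep only the columns needed downstream; .copy() avoids working on a view of the merged frame
traffic_df = traffic_df[['timestamp', 'avg_queue_density']].copy()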
traffic_df.head()


            timestamp  avg_queue_density
0 2020-09-12 01:30:00           0.183016
1 2020-09-12 01:30:01           0.164574
2 2020-09-12 01:30:02           0.170759
3 2020-09-12 01:30:03           0.165333
4 2020-09-12 01:30:04           0.170892


In [11]:
traffic_df = traffic_df.set_index('timestamp')
traffic_df.head()


                     avg_queue_density
timestamp
2020-09-12 01:30:00           0.183016
2020-09-12 01:30:01           0.164574
2020-09-12 01:30:02           0.170759
2020-09-12 01:30:03           0.165333
2020-09-12 01:30:04           0.170892


In [12]:
# Resample per-second data to hourly averages ('h' replaces the deprecated 'H' alias)
hourly_df = traffic_df.resample('h').mean()

hourly_df.head(), hourly_df.shape


(                     avg_queue_density
 timestamp                             
 2020-09-12 01:00:00           0.264208
 2020-09-12 02:00:00           0.329075
 2020-09-12 03:00:00           0.352610
 2020-09-12 04:00:00           0.393319
 2020-09-12 05:00:00           0.401703,
 (2368, 1))

In [13]:
hourly_df.isnull().sum()


avg_queue_density    1728
dtype: int64

In [14]:
hourly_df = hourly_df[['avg_queue_density']]


In [15]:
# Drop hourly bins with no recordings; the remaining index has gaps between recording days
hourly_df = hourly_df.dropna()


In [16]:
print("Total hourly records:", hourly_df.shape[0])
print("Hourly data range:")
print("Start:", hourly_df.index.min())
print("End:", hourly_df.index.max())

hourly_df.head()


Total hourly records: 640
Hourly data range:
Start: 2020-09-12 01:00:00
End: 2020-12-19 16:00:00


                     avg_queue_density
timestamp
2020-09-12 01:00:00           0.264208
2020-09-12 02:00:00           0.329075
2020-09-12 03:00:00           0.352610
2020-09-12 04:00:00           0.393319
2020-09-12 05:00:00           0.401703


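Outcome of STEP 3:
Lane-wise queue density values were averaged to form a single traffic density metric.
The per-second data was resampled to hourly averages, which smooths second-to-second noise. Of the 2,368 hourly bins between the first and last timestamp, 1,728 contained no data and were dropped, leaving 640 hourly records (16 per recording day on average). The retained series therefore has gaps between recording days and is the input for deep learning–based traffic flow prediction.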
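
The effect of hourly resampling on data with recording gaps can be seen on a small synthetic example. The sketch below is not the notebook's data: it assumes two lanes with constant densities and two one-hour recording blocks, and adds a per-bin sample count to show which bins are empty or only partly covered.

```python
import pandas as pd

# Hypothetical per-second index: one hour starting at 01:30, then one hour at 05:00
block1 = pd.date_range("2020-09-12 01:30:00", periods=3600, freq="s")
block2 = pd.date_range("2020-09-12 05:00:00", periods=3600, freq="s")
toy = pd.DataFrame(
    {"QueueDensity1": 0.2, "QueueDensity2": 0.4},
    index=block1.append(block2),
)

# Lane-wise average, as in the notebook but with two lanes instead of six
toy["avg_queue_density"] = toy[["QueueDensity1", "QueueDensity2"]].mean(axis=1)

# Hourly mean plus the number of samples that contributed to each bin
hourly = toy["avg_queue_density"].resample("h").agg(["mean", "count"])
print(hourly)

# Bins with count == 0 have a NaN mean; dropna() removes them
hourly_clean = hourly.dropna(subset=["mean"])
```

The 01:00 and 02:00 bins each hold only 1,800 samples because the block starts at half past the hour, the 03:00 and 04:00 bins are empty, and the 05:00 bin is fully covered. Keeping the `count` column is one way to decide whether partly covered bins should be trusted.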

STEP 4: Feature Engineering and Sequence Creation
In this step, sliding-window sequences are created from the hourly density signal, split chronologically, and scaled to prepare the dataset for LSTM-based traffic flow prediction. Only avg_queue_density is used as model input; no time-based features are added in the cells below.

In [17]:
hourly_df = hourly_df.reset_index()
hourly_df.head()


            timestamp  avg_queue_density
0 2020-09-12 01:00:00           0.264208
1 2020-09-12 02:00:00           0.329075
2 2020-09-12 03:00:00           0.352610
3 2020-09-12 04:00:00           0.393319
4 2020-09-12 05:00:00           0.401703


In [18]:
# Use only the density signal, as an array of shape (n_hours, 1)
data = hourly_df[['avg_queue_density']].values


In [19]:
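def create_sequences(data, seq_length=48):
    """Build sliding windows: X[i] holds seq_length consecutive rows, y[i] the value that follows."""
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i + seq_length])
        y.append(data[i + seq_length][0])
    return np.array(X), np.array(y)

# 48 consecutive hourly records; because empty hours were dropped, a window spans
# several recording days rather than 48 wall-clock hours
SEQ_LENGTH = 48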
X_raw, y_raw = create_sequences(data, SEQ_LENGTH)

print("X_raw shape:", X_raw.shape)   # (samples, 48, 1)
print("y_raw shape:", y_raw.shape)

X_raw shape: (592, 48, 1)
y_raw shape: (592,)
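
The windowing can be checked on a toy signal. The sketch below is an alternative, vectorized formulation using NumPy's `sliding_window_view` rather than the notebook's loop. The 10-value signal and window length of 3 are assumptions chosen for readability.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy signal with the same (n, 1) layout as `data` in the notebook
toy = np.arange(10, dtype=float).reshape(-1, 1)
seq_length = 3

# All length-3 windows over the time axis; drop the last one, which has no target
windows = sliding_window_view(toy[:, 0], seq_length)[:-1]
X = windows[..., np.newaxis]   # shape (7, 3, 1)
y = toy[seq_length:, 0]        # shape (7,)

print(X.shape, y.shape)
print(X[0].ravel(), "->", y[0])   # [0. 1. 2.] -> 3.0
```

The number of samples is `len(data) - seq_length`, which matches the notebook's 640 - 48 = 592.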


In [20]:
# Chronological 80/20 split (no shuffling), so every test window comes after the training windows
split_index = int(0.8 * len(X_raw))

X_train_raw, X_test_raw = X_raw[:split_index], X_raw[split_index:]
y_train_raw, y_test_raw = y_raw[:split_index], y_raw[split_index:]

In [21]:
scaler = MinMaxScaler()

# Fit scaler on training X
X_train_scaled = scaler.fit_transform(
    X_train_raw.reshape(-1, 1)
).reshape(X_train_raw.shape)

# Transform test X
X_test_scaled = scaler.transform(
    X_test_raw.reshape(-1, 1)
).reshape(X_test_raw.shape)

# Scale y using SAME scaler
y_train_scaled = scaler.transform(y_train_raw.reshape(-1, 1)).ravel()
y_test_scaled  = scaler.transform(y_test_raw.reshape(-1, 1)).ravel()

# ==============================
# Final outputs for training
# ==============================
X_train = X_train_scaled
X_test  = X_test_scaled
y_train = y_train_scaled
y_test  = y_test_scaled

print("Final X_train shape:", X_train.shape)
print("Final X_test shape:", X_test.shape)

Final X_train shape: (473, 48, 1)
Final X_test shape: (119, 48, 1)
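
The arithmetic behind the scaling step can be written out by hand. The sketch below is a NumPy-only illustration with made-up values, not the notebook's scaler. It applies the min-max formula that `MinMaxScaler` uses with its default (0, 1) range, with the minimum and maximum taken from the training split only.

```python
import numpy as np

# Hypothetical density values; the test split contains a value above the training maximum
train = np.array([0.10, 0.30, 0.50])
test = np.array([0.20, 0.60])

# Statistics come from the training split only, so no test information leaks in
lo, hi = train.min(), train.max()

def scale(x):
    """Map values so the training minimum becomes 0 and the training maximum becomes 1."""
    return (x - lo) / (hi - lo)

def unscale(x_scaled):
    """Invert scale() to recover values in the original density units."""
    return x_scaled * (hi - lo) + lo

test_scaled = scale(test)
print(test_scaled)           # second value exceeds 1 because 0.60 > training maximum
print(unscale(test_scaled))  # approximately the original test values
```

Scaled test values can fall outside [0, 1] when the test period contains densities not seen in training. With the saved scaler, model predictions would be mapped back to density units with `scaler.inverse_transform(predictions.reshape(-1, 1))`.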


In [22]:


# Save sequences
np.save("X_train.npy", X_train)
np.save("y_train.npy", y_train)
np.save("X_test.npy", X_test)
np.save("y_test.npy", y_test)

# Save scaler
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)


Outcome of STEP 4:
The hourly density signal was converted into sliding-window sequences of 48 steps (592 samples).
Training and test sets were created with an 80–20 split (473 and 119 samples), preserving temporal order.
Values were scaled with a MinMaxScaler fitted on the training inputs only. No time-based features (hour of day, day of week, weekend indicator) were added in this version.
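
The cells above feed only avg_queue_density to the model. If calendar features such as hour of day, day of week, and a weekend flag are wanted as extra inputs, a minimal sketch could look like the following. The column names and the two example timestamps are hypothetical, and the hours are in UTC, as in the notebook.

```python
import pandas as pd

# Hypothetical hourly timestamps (naive UTC, like hourly_df['timestamp'])
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-09-12 01:00:00", "2020-12-15 07:00:00"])
})

# Calendar features derived from the timestamp column
demo["hour"] = demo["timestamp"].dt.hour
demo["day_of_week"] = demo["timestamp"].dt.dayofweek        # Monday=0 ... Sunday=6
demo["is_weekend"] = (demo["day_of_week"] >= 5).astype(int)  # Saturday or Sunday

print(demo)
```

Adding such columns would change the input shape from (samples, 48, 1) to (samples, 48, n_features), so the sequence creation and scaling steps would need to be adapted accordingly.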

In [23]:
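# Create the output folder if it does not exist, then save the hourly series
Path("data").mkdir(parents=True, exist_ok=True)
hourly_df.to_csv("data/hourly_processed.csv", index=False)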
