# Urban Traffic Prediction
This project preprocesses hourly traffic data from four junctions in Isfahan, engineers calendar and lag features, and trains a LightGBM model to predict hourly car counts.

## Dataset Overview
- **Rows**: 48,120
- **Columns**: 3 initially (the CSV also carries an unnamed index column, visible as `Unnamed: 0` in the preview below)
- **Features**: `DateTime`, `Junction`, `Car`
- **Target**: `Car` (number of cars passing per hour)
- **Time Span**: May 2020 to December 2021

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from khayyam import JalaliDatetime
from pathlib import Path
import lightgbm as lgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split



In [3]:
# Load dataset
data_path = Path('traffic.csv')
df = pd.read_csv(data_path, parse_dates=['DateTime'])
df.head()

Unnamed: 0,DateTime,Junction,Car
0,2020-05-02 00:00:00,1,25
1,2020-05-02 01:00:00,1,23
2,2020-05-02 02:00:00,1,20
3,2020-05-02 03:00:00,1,12
4,2020-05-02 04:00:00,1,19


## Feature Engineering
### 1. Convert to Jalali DateTime
- Convert the Gregorian `DateTime` column to the Persian (Jalali) calendar and store the result as `JalaliDateTime`.

In [4]:
df['JalaliDateTime'] = df['DateTime'].apply(JalaliDatetime)

### 2. Discretize Hour
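- Bin the hour of day into six time blocks that follow daily traffic patterns: 00-05, 06-11, 12-14, 15-17, 18-21, and 22-23 (labels 0 to 5). The `hour` column stores the block label, not the raw hour.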

In [5]:
def discretize_hour(hour):
    if hour < 6: return 0
    elif hour < 12: return 1
    elif hour < 15: return 2
    elif hour < 18: return 3
    elif hour < 22: return 4
    return 5

df['hour'] = df['DateTime'].dt.hour.apply(discretize_hour)
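
The block boundaries can be sanity-checked without pandas. The sketch below reproduces the same mapping with `bisect` from the standard library; `BLOCK_EDGES` and `hour_block` are illustrative names that do not appear in the notebook.

```python
from bisect import bisect_right

# Upper-exclusive edges copied from discretize_hour: 6, 12, 15, 18, 22.
BLOCK_EDGES = [6, 12, 15, 18, 22]

def hour_block(hour: int) -> int:
    """Return the time-block label (0-5) for an hour in 0-23."""
    return bisect_right(BLOCK_EDGES, hour)

# Group the 24 hours by label to see which hours share a block.
blocks = {}
for h in range(24):
    blocks.setdefault(hour_block(h), []).append(h)

for label, hours in sorted(blocks.items()):
    print(label, f"{hours[0]:02d}:00 to {hours[-1]:02d}:59")
```

Listing the hours per label makes it easy to confirm that every hour from 0 to 23 lands in exactly one block.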

### 3. Holiday Indicator
- Mark Fridays as holidays (1) and other days as non-holidays (0).

In [6]:
# Friday is weekday 4 in pandas (Monday=0); flag it as the weekly holiday
df['IsHoliday'] = (df['DateTime'].dt.dayofweek == 4).astype(int)
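
The flag depends only on the weekday of each timestamp, so official public holidays are not covered. As a minimal illustration, the same check can be written with the standard library; `is_friday` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime

def is_friday(ts: datetime) -> int:
    """Return 1 when ts falls on a Friday, else 0 (weekday(): Monday=0, Friday=4)."""
    return int(ts.weekday() == 4)

# 2020-05-02 is the first timestamp in the dataset preview (a Saturday),
# so the day before it is a Friday.
print(is_friday(datetime(2020, 5, 1)))
print(is_friday(datetime(2020, 5, 2)))
```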

### 4. Seasonal Indicator
- Label cold months (Mehr to Esfand, months 7-12) as 1 and warm months (Farvardin to Shahrivar, months 1-6) as 0.

In [7]:
df['IsCold'] = df['JalaliDateTime'].apply(lambda x: 1 if x.month > 6 else 0)
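
For readers less familiar with the Jalali calendar, the sketch below lists the twelve months with the flag each would receive under the rule above. `JALALI_MONTHS` and `is_cold` are illustrative names introduced here.

```python
# Jalali month names in calendar order (month 1 = Farvardin, month 12 = Esfand).
JALALI_MONTHS = [
    "Farvardin", "Ordibehesht", "Khordad", "Tir", "Mordad", "Shahrivar",
    "Mehr", "Aban", "Azar", "Dey", "Bahman", "Esfand",
]

def is_cold(jalali_month: int) -> int:
    """Return 1 for months 7-12 (Mehr to Esfand), else 0."""
    return int(jalali_month > 6)

for number, name in enumerate(JALALI_MONTHS, start=1):
    print(f"{number:2d} {name:<12} IsCold={is_cold(number)}")
```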

### 5. One-Hot Encode Junction
- Convert `Junction` into four binary columns.

In [8]:
junction_encoded = pd.get_dummies(df['Junction'], prefix='Junc')
df = df.join(junction_encoded)

### 6. Additional Features
- **Day of Week**: Extracted from `DateTime` (0=Mon, 6=Sun).
- **Month**: Jalali month for seasonal patterns.
- **Lag Feature**: Number of cars in the previous hour per junction.
- **Rolling Mean**: 24-hour rolling average of car counts per junction.

In [9]:
# Day of week
df['day_of_week'] = df['DateTime'].dt.dayofweek

# Jalali month
df['month'] = df['JalaliDateTime'].apply(lambda x: x.month)

# Lag feature (previous hour's car count per junction)
df['lag_1'] = df.groupby('Junction')['Car'].shift(1).bfill()

# Rolling mean (24-hour window per junction)
df['rolling_mean_24'] = df.groupby('Junction')['Car'].transform(lambda x: x.rolling(24, min_periods=1).mean())
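
To make the group-wise behaviour concrete, here is a small sketch on made-up numbers (two junctions, three readings each). It is a variation rather than a copy of the cell above: the back-fill is done inside each junction via `groupby(...).bfill()`, which would only differ from the notebook's column-wide back-fill if rows of different junctions were interleaved. The window is shortened to 2 so the effect shows on six rows.

```python
import pandas as pd

# Toy data: two junctions, three hourly readings each (values are made up).
toy = pd.DataFrame({
    "Junction": [1, 1, 1, 2, 2, 2],
    "Car":      [10, 20, 30, 100, 200, 300],
})

grouped = toy.groupby("Junction")["Car"]

# Previous reading within each junction; the first row of each group is NaN.
toy["lag_1"] = grouped.shift(1)

# Back-fill within each junction so a gap is never filled from another group.
toy["lag_1_filled"] = toy.groupby("Junction")["lag_1"].bfill()

# Rolling mean over up to 2 rows per junction (24 in the notebook).
toy["rolling_mean_2"] = grouped.transform(lambda s: s.rolling(2, min_periods=1).mean())

print(toy)
```

Note that the back-filled value in the first row of each junction equals that row's own count, which matches the first preview row of the processed data (`Car` = 25, `lag_1` = 25.0).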


In [11]:
df.head()

Unnamed: 0,DateTime,Junction,Car,JalaliDateTime,hour,IsHoliday,IsCold,Junc_1,Junc_2,Junc_3,Junc_4,day_of_week,month,lag_1,rolling_mean_24
0,2020-05-02 00:00:00,1,25,1399-02-13 00:00:00.000000,0,0,0,True,False,False,False,5,2,25.0,25.0
1,2020-05-02 01:00:00,1,23,1399-02-13 01:00:00.000000,0,0,0,True,False,False,False,5,2,25.0,24.0
2,2020-05-02 02:00:00,1,20,1399-02-13 02:00:00.000000,0,0,0,True,False,False,False,5,2,23.0,22.666667
3,2020-05-02 03:00:00,1,12,1399-02-13 03:00:00.000000,0,0,0,True,False,False,False,5,2,20.0,20.0
4,2020-05-02 04:00:00,1,19,1399-02-13 04:00:00.000000,0,0,0,True,False,False,False,5,2,12.0,19.8


## Model Training and Evaluation
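- Train a LightGBM regressor (`LGBMRegressor`) on the log-transformed target (`log1p` of `Car`).
- Split the data at random: 70% train, 30% test (`random_state=42`).
- Evaluate with the R² score on the original count scale, reported as a percentage.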

In [10]:
# Prepare features and target
features = ['hour', 'IsHoliday', 'IsCold', 'Junc_1', 'Junc_2', 'Junc_3', 'Junc_4', 
            'day_of_week', 'month', 'lag_1', 'rolling_mean_24']
target = 'Car'

X = df[features]
y = np.log1p(df[target])  # Log-transform target for better distribution

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train LightGBM model
lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
lgb_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = np.expm1(lgb_model.predict(X_test))  # Reverse log-transform
y_test_raw = np.expm1(y_test)
r2 = r2_score(y_test_raw, y_pred) * 100
print(f'Model R² Score: {r2:.2f}%')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000889 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 435
[LightGBM] [Info] Number of data points in the train set: 33684, number of used features: 11
[LightGBM] [Info] Start training from score 3.420151
Model R² Score: 96.11%
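
The two ideas in this cell, the `log1p`/`expm1` round trip and the R² metric, can be illustrated with the standard library alone. In the sketch below, `r2_score_manual` is a hypothetical stand-in for `sklearn.metrics.r2_score` using the usual definition 1 - SS_res / SS_tot; the counts are the five values from the data preview and the predictions are made up for illustration.

```python
import math

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hourly counts from the preview rows; log1p is what the model is trained on.
counts = [25, 23, 20, 12, 19]
log_counts = [math.log1p(c) for c in counts]

# expm1 inverts log1p, bringing values back to the original count scale.
restored = [math.expm1(v) for v in log_counts]

# Made-up predictions on the original scale, for illustration only.
predictions = [24, 22, 21, 13, 18]
r2 = r2_score_manual(counts, predictions)
print(f"R² = {r2:.4f} ({r2 * 100:.2f}%)")
```

Scoring on the restored scale, as the notebook does, keeps the metric in units of cars per hour rather than log counts.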


## Save Outputs
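- Export the submission columns to `outputs/df.csv` and the full processed frame to `outputs/processed_df.csv`, then bundle both into `outputs/results.zip`.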

In [None]:
# Submission columns
submission_cols = ['JalaliDateTime', 'hour', 'IsHoliday', 'IsCold', 'Junc_1', 'Junc_2', 'Junc_3', 'Junc_4']
submission_df = df[submission_cols]

# Save files
output_dir = Path('outputs')
output_dir.mkdir(exist_ok=True)

submission_df.to_csv(output_dir / 'df.csv', index=False)
df.to_csv(output_dir / 'processed_df.csv', index=False)

# Compress outputs
import zipfile
files = ['df.csv', 'processed_df.csv']
with zipfile.ZipFile(output_dir / 'results.zip', 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    for file in files:
        zf.write(output_dir / file, file)
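
After writing the archive, it can be worth confirming what it contains. The following sketch repeats the zip step in a temporary directory with placeholder files (stand-ins for the real CSVs) and then reads the archive back; it is an optional check, not part of the original notebook.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "outputs"
    output_dir.mkdir(exist_ok=True)

    # Placeholder files; the notebook writes these with DataFrame.to_csv.
    names = ("df.csv", "processed_df.csv")
    for name in names:
        (output_dir / name).write_text("placeholder\n")

    archive = output_dir / "results.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # The second argument (arcname) stores the file at the archive root
            # instead of under the outputs/ path.
            zf.write(output_dir / name, name)

    # Read the archive back: list members and verify their checksums.
    with zipfile.ZipFile(archive) as zf:
        archived_names = zf.namelist()
        is_intact = zf.testzip() is None  # None means no corrupt member was found

print(archived_names, is_intact)
```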