## Phase 1A: Project Setup & Data (Kaggle → /content)

In this step, we:
1. Configure the Kaggle API in Google Colab using `kaggle.json`.
2. Download the I-94 traffic dataset **once** using the Kaggle API.
3. Extract the ZIP file into `/content/` (not Google Drive) for faster reads.
4. Load the CSV directly from `/content/...csv` and display basic info.

**Output:** Dataset successfully loaded into a pandas DataFrame (`df`) from `/content/`.
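Steps 3 and 4 can be made slightly more defensive than taking the first glob match. The helper below is a minimal sketch, not part of the original notebook: `extract_and_find_csvs` is a hypothetical name, and it assumes the downloaded archive contains at least one CSV file. It returns only CSVs that came from the archive, so unrelated files already present in the destination folder are ignored.

```python
import zipfile
from pathlib import Path


def extract_and_find_csvs(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return paths of the CSVs it contained."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest)
        # Only consider members of this archive, not other files in dest_dir.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
    if not csv_names:
        raise FileNotFoundError(f"No CSV files found inside {zip_path}")
    return sorted(dest / name for name in csv_names)
```

With this helper, the load step could become `pd.read_csv(extract_and_find_csvs(zip_path, "/content")[0])`, and a missing file fails with a clear message instead of an `IndexError`.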


In [None]:
# =========================================
# PHASE 1A: PROJECT SETUP & DATA (KAGGLE)
# =========================================

# 1) Install Kaggle (usually already installed in Colab, but safe)
!pip -q install kaggle

# 2) Upload your kaggle.json when prompted (from Kaggle > Account > Create New API Token)
from google.colab import files
files.upload()  # upload kaggle.json

# 3) Put kaggle.json in the right place + permissions
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

# 4) Download the dataset (set to YOUR Kaggle dataset slug)
# Note: this slug resolves to the multi-junction "Traffic Prediction Dataset"
# (traffic.csv with DateTime, Junction, and Vehicles columns), which the later cells assume.
DATASET_SLUG = "fedesoriano/traffic-prediction-dataset"

!kaggle datasets download -d {DATASET_SLUG} -p /content --force

# 5) Unzip into /content
import zipfile, os, glob

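zip_files = glob.glob("/content/*.zip")
if not zip_files:
    raise FileNotFoundError("No ZIP found in /content; check the Kaggle download step.")
zip_path = zip_files[0]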
with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall("/content")

print("Extracted files:", os.listdir("/content"))

# 6) Locate the CSV and load it from /content (FAST)
import pandas as pd

csv_files = glob.glob("/content/*.csv")
print("CSV files found:", csv_files)

# pick the first CSV (we'll confirm the name once it prints)
csv_path = csv_files[0]

df = pd.read_csv(csv_path)
print("Loaded:", csv_path)
print("Shape:", df.shape)
df.head()


## Phase 1B: Data Cleaning & Baseline Understanding

In this phase, we prepared the I-94 traffic dataset for time-series modeling.

### Key Steps
- Converted `DateTime` into a proper datetime format.
- Sorted data chronologically (required so that sliding windows follow the true time order).
- Checked for missing values and duplicates.
- Extracted basic temporal features (hour, day, month, weekday).
- Performed baseline statistical analysis on traffic volume.
- Visualized traffic trends over time and hourly patterns.

### Observations
- The dataset contains **48,120 records** with no missing values.
- Traffic volume shows strong temporal patterns.
- Clear hourly variation supports its suitability for time-series forecasting.
- The `Vehicles` column is selected as the prediction target.

**Output:** Cleaned, time-ordered dataset ready for LSTM sequence generation.
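The cleaning steps above can be sketched on a tiny frame. The values below are invented for illustration; only the column names follow the notebook. The spacing check at the end is an extra step not present in the original cells: a missing-value count will not reveal hours that are absent from the series altogether, so it may be worth confirming that consecutive rows are exactly one hour apart.

```python
import pandas as pd

# Toy frame with made-up values, deliberately out of order.
toy = pd.DataFrame({
    "DateTime": ["2015-11-01 02:00:00", "2015-11-01 00:00:00", "2015-11-01 01:00:00"],
    "Vehicles": [10, 15, 13],
})

# Parse timestamps and sort chronologically.
toy["DateTime"] = pd.to_datetime(toy["DateTime"])
toy = toy.sort_values("DateTime").reset_index(drop=True)

# Basic temporal features.
toy["hour"] = toy["DateTime"].dt.hour
toy["dayofweek"] = toy["DateTime"].dt.dayofweek  # Monday=0 ... Sunday=6

# An hourly series should advance by exactly one hour between rows.
gaps = toy["DateTime"].diff().dropna()
is_regular = bool((gaps == pd.Timedelta(hours=1)).all())

print(toy)
print("Regular hourly spacing:", is_regular)
```

On the full dataset this check is only meaningful per junction, since several junctions share the same timestamps.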


In [None]:
# =========================================
# PHASE 1B: DATA CLEANING & BASELINE EDA
# =========================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1) Re-load dataset explicitly (safe practice)
df = pd.read_csv("/content/traffic.csv")

# 2) Convert DateTime to proper datetime format
df['DateTime'] = pd.to_datetime(df['DateTime'])

# 3) Sort by time (CRITICAL for time-series & LSTM)
df = df.sort_values('DateTime').reset_index(drop=True)

# 4) Basic structure check
print("Dataset Info:")
df.info()

# 5) Missing values check
print("\nMissing values per column:")
print(df.isnull().sum())

# 6) Duplicate check
print("\nDuplicate rows:", df.duplicated().sum())

# 7) Basic statistics for target variable
print("\nVehicles statistics:")
print(df['Vehicles'].describe())

# 8) Time-based feature extraction (for later use)
df['hour'] = df['DateTime'].dt.hour
df['day'] = df['DateTime'].dt.day
df['month'] = df['DateTime'].dt.month
df['dayofweek'] = df['DateTime'].dt.dayofweek

df.head()


In [None]:
# =========================================
# BASELINE TRAFFIC VOLUME TREND
# =========================================

plt.figure(figsize=(14,5))
plt.plot(df['DateTime'], df['Vehicles'])
plt.title("Traffic Volume Over Time (Baseline)")
plt.xlabel("Time")
plt.ylabel("Number of Vehicles")
plt.show()


In [None]:
# =========================================
# HOURLY TRAFFIC PATTERN
# =========================================

hourly_avg = df.groupby('hour')['Vehicles'].mean()

plt.figure(figsize=(10,5))
plt.plot(hourly_avg.index, hourly_avg.values)
plt.title("Average Traffic Volume by Hour of Day")
plt.xlabel("Hour")
plt.ylabel("Average Vehicles")
plt.xticks(range(0,24))
plt.show()


## Phase 2A: Data Preparation for LSTM

In this phase, the cleaned dataset is transformed into a format suitable for LSTM modeling.

### Steps Performed
- Selected a single junction to reduce complexity.
- Isolated traffic volume as the prediction target.
- Normalized values using Min-Max scaling.
- Generated time-series sequences using a 24-hour sliding window.
- Split data into training and testing sets while preserving temporal order.

### Output
- `X_train`, `y_train` for model training
- `X_test`, `y_test` for evaluation

These sequences capture temporal dependencies required for LSTM learning.
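To make the sliding-window step concrete, here is a small sketch on a toy series of ten values with a window of 3 instead of 24. The series is invented; the windowing and the chronological split mirror the approach described above.

```python
import numpy as np


def create_sequences(values, seq_length):
    """Return (X, y): each X[i] holds seq_length steps, y[i] is the step after."""
    X, y = [], []
    for i in range(len(values) - seq_length):
        X.append(values[i:i + seq_length])
        y.append(values[i + seq_length])
    return np.array(X), np.array(y)


# Toy series 0..9 as a column vector, standing in for the scaled Vehicles column.
series = np.arange(10, dtype=float).reshape(-1, 1)

X, y = create_sequences(series, seq_length=3)
print(X.shape, y.shape)      # (7, 3, 1) (7, 1)
print(X[0].ravel(), y[0])    # [0. 1. 2.] [3.]

# Chronological split: the earliest 80% of windows train, the rest test.
split_index = int(len(X) * 0.8)
X_train, X_test = X[:split_index], X[split_index:]
print(len(X_train), len(X_test))  # 5 2
```

A series of length `n` yields `n - seq_length` windows, and the resulting `(samples, timesteps, features)` shape is what a Keras LSTM layer expects as input.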


In [None]:
# =========================================
# PHASE 2A: LSTM DATA PREPARATION
# =========================================

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 1) Select ONE junction (recommended for first model)
junction_id = 1
df_junction = df[df['Junction'] == junction_id].copy()

print("Selected Junction:", junction_id)
print("Shape:", df_junction.shape)

# 2) Keep only DateTime and target variable
data = df_junction[['DateTime', 'Vehicles']].set_index('DateTime')

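# 3) Normalize traffic volume to [0, 1] (not strictly required, but it helps LSTM training converge)
# Note: the scaler is fit on the full series here, so the test period's min/max also inform the scaling.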
scaler = MinMaxScaler(feature_range=(0,1))
scaled_values = scaler.fit_transform(data[['Vehicles']])

# 4) Create sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)

SEQ_LENGTH = 24  # 24-hour window

X, y = create_sequences(scaled_values, SEQ_LENGTH)

print("X shape:", X.shape)
print("y shape:", y.shape)

# 5) Train-test split (time-aware, no shuffling)
split_ratio = 0.8
split_index = int(len(X) * split_ratio)

X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])


## Phase 2B: LSTM Model Development and Training

In this phase, an LSTM neural network was designed and trained to predict hourly traffic volume.

### Model Architecture
- Two stacked LSTM layers with 50 units each
- Fully connected output layer
- Adam optimizer with Mean Squared Error loss

### Training Strategy
- Up to 30 epochs, with early stopping (patience of 5 epochs)
- Validation loss monitored on the last 10% of the training data
- Overfitting limited by restoring the best-performing weights

### Output
- Trained LSTM model
- Training and validation loss history


In [None]:
# =========================================
# PHASE 2B: LSTM MODEL DEVELOPMENT
# =========================================

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# 1) Build LSTM model
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])),
    LSTM(50),
    Dense(1)
])

# 2) Compile model
model.compile(
    optimizer='adam',
    loss='mse'
)

# 3) Model summary (for documentation)
model.summary()

# 4) Train model
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    epochs=30,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stop],
    verbose=1
)


## Phase 2C: Model Evaluation (RMSE, MAE) and Prediction Visualization

In this phase, the trained LSTM model is evaluated on unseen test data.

### Steps Performed
- Generated traffic volume predictions on the test set.
- Converted predictions back to the original scale using inverse Min-Max scaling.
- Evaluated the model using:
  - **RMSE (Root Mean Squared Error)**
  - **MAE (Mean Absolute Error)**
- Visualized model performance by plotting **Actual vs Predicted** traffic volume.

### Output
- RMSE and MAE values for model performance
- Prediction plot showing how closely the model follows real traffic trends
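As a worked example of the evaluation arithmetic, the sketch below inverts a Min-Max scaling by hand and computes both metrics with NumPy. The scaler bounds and the four data points are invented, and `inverse_min_max` is a hypothetical stand-in for `scaler.inverse_transform`.

```python
import numpy as np

# Invented scaler bounds and values, for illustration only.
data_min, data_max = 5.0, 105.0


def inverse_min_max(scaled, lo, hi):
    """Map values from [0, 1] back to the original [lo, hi] range."""
    return scaled * (hi - lo) + lo


y_true_scaled = np.array([0.10, 0.20, 0.40, 0.50])
y_pred_scaled = np.array([0.12, 0.18, 0.43, 0.46])

y_true = inverse_min_max(y_true_scaled, data_min, data_max)  # about [15, 25, 45, 55]
y_pred = inverse_min_max(y_pred_scaled, data_min, data_max)  # about [17, 23, 48, 51]

errors = y_pred - y_true                  # about [2, -2, 3, -4]
rmse = np.sqrt(np.mean(errors ** 2))      # sqrt(33 / 4), about 2.87
mae = np.mean(np.abs(errors))             # 11 / 4 = 2.75

print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```

Both metrics are in vehicles per hour once the scaling is inverted. RMSE is never smaller than MAE, and the gap between them widens when a few large errors dominate.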


In [None]:
# =========================================
# PHASE 2C: EVALUATION (RMSE, MAE + PLOTS)
# =========================================

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error

# 1) Predict on test set
y_pred_scaled = model.predict(X_test)

# 2) Inverse transform back to original "Vehicles" scale
y_test_inv = scaler.inverse_transform(y_test)
y_pred_inv = scaler.inverse_transform(y_pred_scaled)

# 3) Compute metrics
rmse = np.sqrt(mean_squared_error(y_test_inv, y_pred_inv))
mae = mean_absolute_error(y_test_inv, y_pred_inv)

print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")

# 4) Plot Actual vs Predicted (first 300 points for readability)
N = 300

plt.figure(figsize=(14,5))
plt.plot(y_test_inv[:N], label="Actual")
plt.plot(y_pred_inv[:N], label="Predicted")
plt.title("Actual vs Predicted Traffic Volume (Test Set)")
plt.xlabel("Time Step (Hourly)")
plt.ylabel("Vehicles")
plt.legend()
plt.show()


## Saving the Model

In [None]:
# Save trained LSTM model
model.save("/content/lstm_traffic_model.keras")

print("Model saved successfully.")


## Model Persistence

The trained LSTM model and data scaler were saved to Google Drive to ensure persistence and reproducibility. This allows the model to be reused for deployment without retraining and protects the project artifacts from runtime disconnections.


In [None]:
# =========================================
# SAVE PROJECT TO GOOGLE DRIVE
# =========================================

from google.colab import drive
import os

# 1) Mount Google Drive
drive.mount('/content/drive')

# 2) Create project directory (only once)
project_dir = "/content/drive/MyDrive/LSTM_Traffic_Prediction_Project"
os.makedirs(project_dir, exist_ok=True)

# 3) Save trained LSTM model
model_path = os.path.join(project_dir, "lstm_traffic_model.keras")
model.save(model_path)

print("Model saved at:", model_path)


In [None]:
# =========================================
# SAVE SCALER
# =========================================

import joblib

scaler_path = os.path.join(project_dir, "minmax_scaler.save")
joblib.dump(scaler, scaler_path)

print("Scaler saved at:", scaler_path)


## Phase 3, Step 1: Environment Reset and Artifact Verification

Before deployment, we reset any running processes from previous attempts and verify that the trained artifacts are safely stored in Google Drive.

### Actions
- Stopped any previously running Streamlit/tunneling processes.
- Mounted Google Drive.
- Verified presence of:
  - `lstm_traffic_model.keras` (trained LSTM model)
  - `minmax_scaler.save` (MinMaxScaler used during training)
- Listed project folder contents to confirm correct file paths.

### Output
Confirmed model and scaler files exist and are ready for deployment.
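The verification can also be expressed as a check that fails loudly instead of printing booleans. This is a minimal sketch under the assumption that an empty file should count as missing; `missing_artifacts` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def missing_artifacts(project_dir, filenames):
    """Return the filenames that are absent or empty in project_dir."""
    root = Path(project_dir)
    missing = []
    for name in filenames:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


REQUIRED = ["lstm_traffic_model.keras", "minmax_scaler.save"]

# Example usage in the notebook (path taken from the cells above):
# problems = missing_artifacts("/content/drive/MyDrive/LSTM_Traffic_Prediction_Project", REQUIRED)
# if problems:
#     raise FileNotFoundError(f"Missing artifacts: {problems}")
```

Raising here stops the deployment cells early, before Streamlit starts and fails inside `load_artifacts()`.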


In [None]:
# =========================================
# PHASE 3 (STEP 1): CLEAN RESTART + VERIFY FILES
# =========================================

# 1) Stop anything left running (Streamlit/ngrok/etc.)
!pkill -f streamlit || true
!pkill -f ngrok || true
!pkill -f cloudflared || true

# 2) Mount Drive
from google.colab import drive
drive.mount('/content/drive')

# 3) Verify artifacts exist
import os

project_dir = "/content/drive/MyDrive/LSTM_Traffic_Prediction_Project"
model_path = os.path.join(project_dir, "lstm_traffic_model.keras")
scaler_path = os.path.join(project_dir, "minmax_scaler.save")

print("Project dir:", project_dir)
print("Model exists:", os.path.exists(model_path), "|", model_path)
print("Scaler exists:", os.path.exists(scaler_path), "|", scaler_path)

# 4) List folder contents
print("\nFolder contents:")
print(os.listdir(project_dir))


## Phase 3, Step 2: Web Application Development

A Streamlit-based web application was created to deploy the trained LSTM traffic prediction model.

### Features
- Loads the trained LSTM model and scaler from Google Drive
- Accepts traffic volume values for the previous 24 hours
- Predicts traffic volume for the next hour
- Displays results in a user-friendly interface

### Output
A complete `app.py` file ready for deployment.
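The core of the app is the reshaping of 24 user-entered values into the `(1, 24, 1)` tensor the model was trained on. The sketch below shows that path in isolation. The manual Min-Max formula stands in for `scaler.transform`, the bounds are invented, and `to_model_input` is a hypothetical name; it also adds a length check that the app itself does not perform.

```python
import numpy as np

SEQ_LENGTH = 24  # must match the window size used during training


def to_model_input(last_values, data_min, data_max):
    """Scale SEQ_LENGTH raw values to [0, 1] and shape them as (1, SEQ_LENGTH, 1)."""
    values = np.asarray(last_values, dtype=float)
    if values.shape != (SEQ_LENGTH,):
        raise ValueError(f"Expected {SEQ_LENGTH} values, got shape {values.shape}")
    scaled = (values - data_min) / (data_max - data_min)
    return scaled.reshape(1, SEQ_LENGTH, 1)


# Invented inputs: 24 readings of 50 vehicles, with assumed scaler bounds 0..100.
X_input = to_model_input([50.0] * SEQ_LENGTH, data_min=0.0, data_max=100.0)
print(X_input.shape)     # (1, 24, 1)
print(X_input[0, 0, 0])  # 0.5
```

Inputs outside the range seen during training scale to values below 0 or above 1, so predictions for such inputs should be treated with caution.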


In [None]:
# =========================================
# PHASE 3 (STEP 2): CREATE STREAMLIT APP
# =========================================

app_code = r'''
import numpy as np
import streamlit as st
import tensorflow as tf
import joblib
import os

# ------------------------------
# Page Configuration
# ------------------------------
st.set_page_config(
    page_title="Traffic Volume Prediction (LSTM)",
    page_icon="🚦",
    layout="centered"
)

st.title("🚦 Traffic Volume Prediction System")
st.write(
    "This system predicts **next-hour traffic volume** using a trained "
    "**LSTM model** based on the I-94 Traffic Dataset."
)

# ------------------------------
# Load Model & Scaler
# ------------------------------
PROJECT_DIR = "/content/drive/MyDrive/LSTM_Traffic_Prediction_Project"
MODEL_PATH = os.path.join(PROJECT_DIR, "lstm_traffic_model.keras")
SCALER_PATH = os.path.join(PROJECT_DIR, "minmax_scaler.save")

@st.cache_resource
def load_artifacts():
    model = tf.keras.models.load_model(MODEL_PATH)
    scaler = joblib.load(SCALER_PATH)
    return model, scaler

model, scaler = load_artifacts()

# ------------------------------
# Input Section
# ------------------------------
st.subheader("Input: Last 24 Hourly Traffic Volumes")

st.info("Enter traffic volume values for the previous 24 hours.")

inputs = []
cols = st.columns(4)

for i in range(24):
    with cols[i % 4]:
        value = st.number_input(
            f"Hour {i+1}",
            min_value=0.0,
            value=50.0,
            step=1.0
        )
        inputs.append(value)

# ------------------------------
# Prediction
# ------------------------------
if st.button("Predict Next Hour Traffic"):
    data = np.array(inputs).reshape(-1, 1)

    data_scaled = scaler.transform(data)
    X_input = data_scaled.reshape(1, 24, 1)

    prediction_scaled = model.predict(X_input)
    prediction = scaler.inverse_transform(prediction_scaled)

    st.success(
        f"✅ **Predicted Traffic Volume (Next Hour): "
        f"{prediction[0][0]:.2f} vehicles**"
    )

# ------------------------------
# Footer
# ------------------------------
st.caption(
    "Model: LSTM | Window Size: 24 Hours | "
    "Dataset: I-94 Traffic Volume"
)
'''

# Write as UTF-8 explicitly because the app source contains emoji
with open("app.py", "w", encoding="utf-8") as f:
    f.write(app_code)

print("✅ app.py created successfully.")


## Phase 3, Step 3: Launch Streamlit Server

The Streamlit application was launched inside the Colab runtime on port 8501.

### Actions
- Installed deployment dependencies (Streamlit + joblib).
- Started the Streamlit server in the background.
- Verified successful startup by checking the server logs.

### Output
A running Streamlit server listening on port `8501`.
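Reading the log right after a background launch can race with server startup. One alternative, sketched below, is to poll the port until it accepts a TCP connection. `wait_for_port` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait and retry.
            time.sleep(interval)
    return False


# Example usage after starting Streamlit in the background:
# if not wait_for_port("127.0.0.1", 8501):
#     print("Streamlit did not start; inspect streamlit.log")
```

A successful connection only shows that something is listening on the port; the log is still the place to look for application errors.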


In [None]:
# =========================================
# PHASE 3 (STEP 3): RUN STREAMLIT LOCALLY
# =========================================

!pip -q install streamlit joblib

# Stop any previous Streamlit process (safe)
!pkill -f streamlit || true

# Run Streamlit in background and log output to a file
!streamlit run app.py --server.port 8501 --server.headless true > streamlit.log 2>&1 &

# Give the server a few seconds to start, then show the last lines of the log
!sleep 5
!tail -n 20 streamlit.log


## Phase 3, Step 4A: Install Cloudflare Tunnel (cloudflared)

Cloudflare Tunnel was installed to expose the Streamlit app publicly from the Colab runtime.
All shell commands are executed using `!` in Colab.


In [None]:
# =========================================
# PHASE 3 (STEP 4A): INSTALL CLOUDFLARED
# =========================================

!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O cloudflared
!chmod +x cloudflared
!mv cloudflared /usr/local/bin/cloudflared

!cloudflared --version


In [None]:
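# Stop any running Streamlit/cloudflared processes, then confirm port 8501 is free
# (no output from lsof means nothing is listening)
!pkill -f streamlit || true
!pkill -f cloudflared || true
!lsof -i :8501 || true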


In [None]:
# Start Streamlit in background and log output
!streamlit run app.py --server.port 8501 --server.headless true --server.address 0.0.0.0 > streamlit.log 2>&1 &

# Wait briefly for startup, then show logs (look for the app URL lines and no errors)
!sleep 5
!tail -n 40 streamlit.log

# Confirm something is listening on port 8501
!lsof -i :8501 | head -n 5


## Phase 3, Step 4B: Generate Public URL

A Cloudflare Tunnel is started to expose the Streamlit application running locally on port 8501.
The output includes a `trycloudflare.com` URL, which can be opened in a browser for a live demo.
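If the tunnel output is captured to a file rather than read from the cell, the public URL can be pulled out with a regular expression. This is a sketch under assumptions: the sample log lines below are invented and the real `cloudflared` output format may differ, and `find_tunnel_url` is a hypothetical helper.

```python
import re

# Matches a quick-tunnel hostname such as https://some-random-words.trycloudflare.com
TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_tunnel_url(log_text):
    """Return the first trycloudflare.com URL in log_text, or None if absent."""
    match = TUNNEL_URL.search(log_text)
    return match.group(0) if match else None


# Invented log excerpt, for illustration only.
sample_log = (
    "INF Requesting new quick Tunnel on trycloudflare.com...\n"
    "INF |  https://example-words-here.trycloudflare.com  |\n"
)
print(find_tunnel_url(sample_log))  # https://example-words-here.trycloudflare.com
```

The pattern requires the `https://` scheme and a subdomain, so the bare `trycloudflare.com` mention in the first sample line is not matched.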


In [None]:
# =========================================
# PHASE 3 (STEP 4B): START PUBLIC TUNNEL
# =========================================

!cloudflared tunnel --url http://localhost:8501


In [None]:
# Download app.py to the local machine for safekeeping
from google.colab import files
files.download("app.py")
