hzhao65/SmartEdgeCDN

SmartEdgeCDN

AI-assisted edge cache optimization for content delivery networks.

SmartEdgeCDN is a simulation framework that demonstrates how machine learning can optimize cache selection at CDN edge points-of-presence (POPs) to reduce latency, improve hit rates, and lower egress costs.

What & Why

The Problem

CDNs distribute content globally through edge POPs, but cache capacity is limited. Traditional heuristics (LRU, popularity-based) don't anticipate future request patterns. Poor cache decisions lead to:

  • Higher latency: Every cache miss forces an origin fetch, which this simulation models as 4x slower than an edge hit
  • Increased costs: Origin egress is modeled as 10x more expensive per MB than edge serving
  • Lower hit rates: Suboptimal asset selection wastes limited cache capacity

The Solution

SmartEdgeCDN uses a lightweight Logistic Regression classifier to predict which assets are likely to be requested soon, then selects cache contents to maximize expected hits within capacity constraints.

Key Innovation: Score assets by predicted_hit_probability / size_kb instead of just popularity / size_kb.

Results (Seed=42)

From a deterministic simulation of 4,000 requests across 4 regions:

Policy   Avg(ms)  p95(ms)  HitRate  Cost($)
--------------------------------------------------
Base     142.1    220.0    0.41     0.153
AI       126.0    200.5    0.47     0.131
Δ(%)     -11.3%   -8.8%    +14.6%   -14.4%

Summary:

  • 11.3% lower average latency (142.1 ms → 126.0 ms)
  • 14.6% higher cache hit rate in relative terms (0.41 → 0.47)
  • 14.4% lower egress cost ($0.153 → $0.131)
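
The Δ(%) figures are relative changes against the baseline, so the hit-rate gain is 14.6% of 0.41 rather than 14.6 percentage points. A minimal sketch of that arithmetic, using the rounded values from the results table (the helper name `pct_change` is illustrative and not taken from the repository):

```python
def pct_change(base: float, new: float) -> float:
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0

# Values copied from the results table above.
base = {"avg_ms": 142.1, "hit_rate": 0.41, "cost_usd": 0.153}
ai = {"avg_ms": 126.0, "hit_rate": 0.47, "cost_usd": 0.131}

deltas = {name: round(pct_change(base[name], ai[name]), 1) for name in base}
print(deltas)  # {'avg_ms': -11.3, 'hit_rate': 14.6, 'cost_usd': -14.4}
```

Because the table values are already rounded, recomputing a delta this way can differ from the reported one in the last digit.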

Architecture

┌─────────────────────────────────────────────────────────┐
│                    SmartEdgeCDN Pipeline                 │
└─────────────────────────────────────────────────────────┘

1. DATA GENERATION
   ┌──────────────┐
   │ 600 Assets   │ → size, type, popularity
   └──────────────┘
         ↓
   ┌──────────────┐
   │ 4000 Requests│ → timestamp, region, asset_id
   └──────────────┘    + recent_hits feature
         ↓
   ┌──────────────┐
   │ Train (60%)  │
   │ Test  (40%)  │
   └──────────────┘

2. MODEL TRAINING
   ┌────────────────────────────────────┐
   │ LogisticRegression                 │
   │ Features: size, popularity,        │
   │           recent_hits, region      │
   │ Label: will_hit (next 50 requests) │
   └────────────────────────────────────┘

3. CACHE SELECTION (per region)
   
   Baseline:              AI Policy:
   ┌──────────────┐      ┌──────────────┐
   │ popularity   │      │ pred_prob    │
   │ ──────────   │      │ ──────────   │
   │   size_kb    │      │   size_kb    │
   └──────────────┘      └──────────────┘
          ↓                     ↓
   ┌──────────────────────────────┐
   │ Greedy fill until 10MB       │
   └──────────────────────────────┘

4. SIMULATION
   ┌─────────────────────────────────┐
   │ For each test request:          │
   │  • Check if asset in cache      │
   │  • Hit: latency/4, low cost     │
   │  • Miss: full latency, high cost│
   └─────────────────────────────────┘

5. METRICS & VISUALIZATION
   ┌──────────────────────────────────┐
   │ • Average & P95 latency          │
   │ • Cache hit rate                 │
   │ • Total egress cost              │
   │ • Comparison plots (PNG)         │
   └──────────────────────────────────┘

Quickstart

Installation

git clone https://github.com/hzhao65/SmartEdgeCDN.git
cd SmartEdgeCDN
make install

Or with pip:

pip install -e .

Run Full Comparison

smartedgecdn compare

This runs the complete pipeline:

  1. Generates synthetic data
  2. Trains ML model
  3. Simulates both policies
  4. Outputs metrics and plots

Outputs

After running, check the outputs/ directory:

outputs/
├── requests.csv           # Generated request stream
├── metrics_baseline.csv   # Baseline policy metrics
├── metrics_ai.csv         # AI policy metrics
├── plot_latency.png       # Latency comparison
├── plot_cache_hit.png     # Hit rate comparison
└── plot_cost.png          # Cost comparison

CLI Commands

# Generate data only
smartedgecdn make-data

# Train model only
smartedgecdn train

# Run single policy
smartedgecdn simulate --policy baseline
smartedgecdn simulate --policy ai

# Full comparison (recommended)
smartedgecdn compare

Testing

Run the deterministic tests, which verify that the AI policy improves on the baseline:

make test

Or with pytest directly:

pytest -v

Key test: test_end_to_end.py::test_ai_improves_over_baseline verifies:

  • AI reduces avg latency by ≥5%
  • AI improves hit rate by ≥5%
  • AI reduces egress cost by ≥5%

Repository Structure

smartedgecdn/
├── smartedgecdn/
│   ├── __init__.py
│   ├── config.py          # Settings & configuration
│   ├── data.py            # Synthetic data generation
│   ├── model.py           # ML model training
│   ├── policies.py        # Cache selection policies
│   ├── simulate.py        # Simulation engine
│   ├── metrics.py         # Performance metrics
│   ├── plot.py            # Visualization
│   └── cli.py             # Command-line interface
├── tests/
│   ├── test_data.py
│   ├── test_policies.py
│   ├── test_metrics.py
│   └── test_end_to_end.py
├── pyproject.toml
├── Makefile
├── README.md
└── LICENSE

How It Works

1. Data Generation

Assets: 600 synthetic assets with realistic size distributions:

  • Videos: 1-5 MB (20%)
  • Images: 100-1000 KB (40%)
  • HTML/JS/CSS: 50-500 KB (40%)

Requests: 4,000 requests with:

  • Regional distribution (US-heavy: 60% US, 40% EU/Asia)
  • Popularity-based sampling (Beta distribution)
  • Temporal features (recent_hits rolling window)
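
The generation step above could look roughly like the following sketch. It uses only the standard library and is not the code in data.py. Several details are assumptions because the README does not state them: sizes are drawn uniformly inside each range, 1 MB is treated as 1000 KB, the Beta parameters are (2, 5), and the 60/40 regional split is divided evenly within each group.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# (type, share of assets, min KB, max KB), from the lists above.
ASSET_TYPES = [
    ("video", 0.20, 1000, 5000),
    ("image", 0.40, 100, 1000),
    ("web", 0.40, 50, 500),
]
# Assumption: only "60% US, 40% EU/Asia" is given; the even split is a guess.
REGION_WEIGHTS = {"US-West": 0.30, "US-East": 0.30, "Europe": 0.20, "Asia": 0.20}


def make_assets(n=600):
    """Create n synthetic assets with a type, a size and a popularity score."""
    type_weights = [share for _, share, _, _ in ASSET_TYPES]
    assets = []
    for asset_id in range(n):
        kind, _, lo, hi = rng.choices(ASSET_TYPES, weights=type_weights)[0]
        assets.append({
            "asset_id": asset_id,
            "type": kind,
            "size_kb": rng.uniform(lo, hi),  # assumption: uniform within range
            "popularity_score": rng.betavariate(2, 5),  # assumption: Beta(2, 5)
        })
    return assets


def make_requests(assets, n=4000):
    """Sample n requests; popular assets are requested more often."""
    popularity = [a["popularity_score"] for a in assets]
    picked = rng.choices(assets, weights=popularity, k=n)
    regions = rng.choices(list(REGION_WEIGHTS),
                          weights=list(REGION_WEIGHTS.values()), k=n)
    return [
        {"timestamp": t, "region": regions[t], "asset_id": picked[t]["asset_id"]}
        for t in range(n)
    ]


assets = make_assets()
requests = make_requests(assets)
```

Here the timestamp is simply the position of the request in the stream, which keeps later windowed features easy to compute.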

2. Feature Engineering

Numeric features:

  • size_kb: Asset size
  • popularity_score: Base popularity (0-1)
  • recent_hits: Count of requests for the asset in the last 10 timestamps
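
One way to compute recent_hits is a sliding window per asset, sketched below. The README does not spell out the window semantics, so this assumes the count covers earlier requests for the same asset (in any region) whose timestamp lies within the previous 10 timestamps, excluding the current request. The function name `add_recent_hits` is hypothetical.

```python
from collections import defaultdict, deque


def add_recent_hits(requests, window=10):
    """Return one recent_hits count per request, in input order.

    requests must be sorted by timestamp.
    """
    seen = defaultdict(deque)  # asset_id -> timestamps of its recent requests
    counts = []
    for req in requests:
        ts, asset = req["timestamp"], req["asset_id"]
        history = seen[asset]
        # Drop timestamps that have fallen out of the window [ts - window, ts].
        while history and history[0] < ts - window:
            history.popleft()
        counts.append(len(history))
        history.append(ts)
    return counts


stream = [
    {"timestamp": 0, "asset_id": 1},
    {"timestamp": 1, "asset_id": 1},
    {"timestamp": 2, "asset_id": 2},
    {"timestamp": 3, "asset_id": 1},
    {"timestamp": 20, "asset_id": 1},
]
print(add_recent_hits(stream))  # [0, 1, 0, 2, 0]
```

Computing the feature from past requests only keeps it free of information about the future, which matters because the label looks ahead in the same stream.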

Categorical features:

  • region: One-hot encoded (US-West, US-East, Europe, Asia)

Label creation:

  • will_hit = 1 if the asset is requested again within the next 50 requests in the same region
  • will_hit = 0 otherwise
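
The labeling rule can be sketched with a single backward pass over the request stream. This is an illustration, not the repository's code, and it assumes "next 50 requests" means the next 50 positions in the overall stream rather than in the region's own sub-stream. The name `label_will_hit` is hypothetical.

```python
def label_will_hit(requests, horizon=50):
    """Label each request 1 if the same (region, asset) recurs within horizon."""
    next_seen = {}  # (region, asset_id) -> index of the nearest later request
    labels = [0] * len(requests)
    # Walk backwards so next_seen always holds the closest future occurrence.
    for i in range(len(requests) - 1, -1, -1):
        key = (requests[i]["region"], requests[i]["asset_id"])
        j = next_seen.get(key)
        if j is not None and j - i <= horizon:
            labels[i] = 1
        next_seen[key] = i
    return labels


stream = [
    {"region": "US-West", "asset_id": 7},
    {"region": "Europe", "asset_id": 7},   # same asset, different region
    {"region": "US-West", "asset_id": 7},
]
print(label_will_hit(stream))             # [1, 0, 0]
print(label_will_hit(stream, horizon=1))  # [0, 0, 0]
```

The backward pass runs in linear time, whereas scanning 50 requests ahead for every row would cost 50 comparisons per request.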

3. Model Training

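from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",  # Reweight classes to handle label imbalance
    random_state=42           # Fixed seed for reproducibility
)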

Numeric features are standardized with StandardScaler, and the region is one-hot encoded with OneHotEncoder.
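
A plausible way to wire the preprocessing and the classifier together is a scikit-learn Pipeline with a ColumnTransformer, sketched below. How model.py actually combines them is not shown in this README, so the structure, the `build_model` name, and the tiny hand-made training frame are all assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["size_kb", "popularity_score", "recent_hits"]
CATEGORICAL = ["region"]


def build_model() -> Pipeline:
    """Scale numeric columns, one-hot encode region, then fit the classifier."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    clf = LogisticRegression(
        max_iter=1000, class_weight="balanced", random_state=42
    )
    return Pipeline([("preprocess", preprocess), ("clf", clf)])


# Made-up rows, only to demonstrate the fit / predict_proba calls.
train = pd.DataFrame({
    "size_kb": [120.0, 3500.0, 80.0, 900.0, 60.0, 4200.0],
    "popularity_score": [0.9, 0.1, 0.8, 0.3, 0.7, 0.2],
    "recent_hits": [3, 0, 2, 0, 4, 0],
    "region": ["US-West", "Asia", "US-East", "Europe", "US-West", "Asia"],
    "will_hit": [1, 0, 1, 0, 1, 0],
})

model = build_model()
model.fit(train[NUMERIC + CATEGORICAL], train["will_hit"])
# Column 1 of predict_proba is the probability of will_hit == 1.
pred_prob = model.predict_proba(train[NUMERIC + CATEGORICAL])[:, 1]
```

Keeping the scaler and encoder inside the pipeline means they are fitted on the training rows only and then reused unchanged at prediction time.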

4. Cache Selection

Baseline Policy:

score = popularity_score / size_kb
# Select highest-scoring assets until capacity

AI Policy:

score = predicted_hit_probability / size_kb
# Select highest-scoring assets until capacity

Both policies use a greedy knapsack approach with a 10 MB cache capacity per region.
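
A minimal sketch of the greedy fill shared by both policies follows. It is not the code in policies.py. Two choices are assumptions: 10 MB is treated as 10,000 KB, and an asset that does not fit is skipped so that smaller, lower-ranked assets can still use the remaining space. The function name `select_cache` and the `pred_prob` key are hypothetical.

```python
def select_cache(assets, score_key, capacity_kb=10_000):
    """Pick assets in descending score_key / size_kb order until full."""
    ranked = sorted(
        assets, key=lambda a: a[score_key] / a["size_kb"], reverse=True
    )
    cached, used_kb = set(), 0.0
    for asset in ranked:
        if used_kb + asset["size_kb"] <= capacity_kb:
            cached.add(asset["asset_id"])
            used_kb += asset["size_kb"]
    return cached


assets = [
    {"asset_id": "a", "size_kb": 4000, "popularity_score": 0.9, "pred_prob": 0.2},
    {"asset_id": "b", "size_kb": 500, "popularity_score": 0.3, "pred_prob": 0.8},
    {"asset_id": "c", "size_kb": 7000, "popularity_score": 0.8, "pred_prob": 0.9},
]
baseline_cache = select_cache(assets, "popularity_score")  # {'a', 'b'}
ai_cache = select_cache(assets, "pred_prob")               # {'b', 'c'}
```

The only difference between the two policies is the numerator of the score, so one function parameterized by `score_key` covers both.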

5. Simulation

For each request:

  • Cache hit: latency = base_latency / 4, cost = $0.00002/MB
  • Cache miss: latency = base_latency, cost = $0.0002/MB

Base latencies by region:

  • US-West: 80ms → 20ms (hit)
  • US-East: 100ms → 25ms (hit)
  • Europe: 180ms → 45ms (hit)
  • Asia: 220ms → 55ms (hit)
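
The per-request rules above can be sketched as a small replay loop. This illustrates the stated latency and cost model and is not the code in simulate.py. The function signature, the dictionary shapes, and the 1 MB = 1000 KB conversion are assumptions.

```python
BASE_LATENCY_MS = {"US-West": 80, "US-East": 100, "Europe": 180, "Asia": 220}
HIT_COST_PER_MB = 0.00002
MISS_COST_PER_MB = 0.0002


def simulate(requests, caches, size_kb):
    """Replay requests against fixed per-region caches.

    caches: region -> set of cached asset_ids
    size_kb: asset_id -> asset size in KB
    """
    if not requests:
        raise ValueError("requests must not be empty")
    latencies, hits, cost = [], 0, 0.0
    for req in requests:
        base = BASE_LATENCY_MS[req["region"]]
        size_mb = size_kb[req["asset_id"]] / 1000  # assumption: 1 MB = 1000 KB
        if req["asset_id"] in caches.get(req["region"], set()):
            hits += 1
            latencies.append(base / 4)          # hit: a quarter of base latency
            cost += size_mb * HIT_COST_PER_MB
        else:
            latencies.append(base)              # miss: full origin latency
            cost += size_mb * MISS_COST_PER_MB
    n = len(requests)
    return {"avg_ms": sum(latencies) / n, "hit_rate": hits / n, "cost_usd": cost}


result = simulate(
    requests=[
        {"region": "US-West", "asset_id": "a"},  # cached in US-West: hit
        {"region": "Asia", "asset_id": "b"},     # nothing cached in Asia: miss
    ],
    caches={"US-West": {"a"}},
    size_kb={"a": 1000, "b": 2000},
)
print(result["avg_ms"], result["hit_rate"])  # 120.0 0.5
```

In this example the hit costs 1 MB × $0.00002 and the miss costs 2 MB × $0.0002, for a total of $0.00042.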

Ideas for Enhancement

  1. LRU Baseline: Implement time-based eviction policy
  2. Cooperative Caching: Share popularity signals across regions
  3. Prefetching: Proactively cache assets before requests
  4. Reinforcement Learning: Use contextual bandits for online learning
  5. Multi-objective: Balance latency, cost, and capacity jointly
  6. Real Data: Apply to CDN access logs (Cloudflare, Akamai)
  7. Neural Networks: Try LSTM for temporal patterns
  8. A/B Testing: Simulate gradual rollout with traffic splitting

Configuration

Edit smartedgecdn/config.py to adjust:

  • Number of assets/requests
  • Region distributions
  • Cache capacities
  • Cost model parameters
  • Model hyperparameters
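
The layout of config.py is not shown in this README. As a purely hypothetical illustration, the settings listed above could be grouped in a frozen dataclass whose defaults are the values quoted elsewhere in this document. Every field name below is an assumption.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative only."""
    seed: int = 42
    n_assets: int = 600
    n_requests: int = 4000
    train_fraction: float = 0.6
    cache_capacity_kb: int = 10_000      # 10 MB per region
    recent_hits_window: int = 10
    label_horizon: int = 50
    hit_cost_per_mb: float = 0.00002
    miss_cost_per_mb: float = 0.0002
    max_iter: int = 1000
    base_latency_ms: dict = field(default_factory=lambda: {
        "US-West": 80, "US-East": 100, "Europe": 180, "Asia": 220,
    })


defaults = Settings()
# Derive a variant without mutating the defaults, e.g. a 20 MB cache.
bigger_cache = replace(defaults, cache_capacity_kb=20_000)
```

A frozen dataclass keeps a run's parameters immutable, which fits the deterministic, seed-driven design described above.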
