This project implements an intelligent worm that learns to navigate its environment using Deep Reinforcement Learning (DRL). The worm uses a Deep Q-Network (DQN) to develop sophisticated movement patterns and exploration strategies.
- Create and activate a virtual environment:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
.\venv\Scripts\activate
- Install dependencies:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # For CUDA support
pip install pygame numpy matplotlib pandas plotly seaborn
- Create required directories:
mkdir -p models/saved analytics/reports
- Deep Q-Learning Implementation: Uses PyTorch for neural network training
- GPU Acceleration: Optimized for NVIDIA GPUs with CUDA support
- Real-time Visualization: Pygame-based display of the worm's behavior
- Analytics Dashboard: Tracks and visualizes learning progress
- Fast Training Mode: Dedicated training script for rapid learning
- Save/Load System: Preserves learned behaviors across sessions
- Adaptive Difficulty: Game difficulty scales with worm length
- Dynamic Plant System: Plants grow, mature, and die naturally
- Emotional Expression System: Visual feedback of worm's internal state
- Smooth Movement: Enhanced rewards for fluid motion
- Learning Rate Scheduling: Automatic learning rate adjustment
- CUDA Optimization: Improved GPU utilization for faster training
- Headless Mode: Support for training without visualization
The worm's brain is implemented using PyTorch with the following architecture:
DQN Architecture:
- Input Layer (14 neurons): State information (position, velocity, angle, plants, walls, hunger)
- Hidden Layer 1 (256 neurons, ReLU): Pattern recognition with Kaiming initialization
- Hidden Layer 2 (256 neurons, ReLU): Complex feature processing
- Hidden Layer 3 (128 neurons, ReLU): Decision refinement
- Output Layer (9 neurons): Movement actions (8 directions + no movement)
- Weight Initialization: Kaiming normal initialization for better gradient flow
- Optimizer: Adam optimizer with learning rate 0.0005
- Learning Rate Schedule: StepLR scheduler (gamma=0.9, step_size=100)
- Reward Scaling: Dynamic reward normalization with 20-step window
- Minimum Std: 5.0 (increased from 1.0 for more variation)
- Memory Buffer: 100,000 experiences
- Batch Size: 64 samples (512 in fast training mode)
- Epsilon: Starts at 1.0, decays to 0.1, decay rate 0.9998
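A minimal PyTorch sketch of the network and training setup described above (layer sizes, Kaiming initialization, Adam at lr=0.0005, and the StepLR schedule are taken from the lists above; the class and variable names are illustrative rather than the exact ones in models/dqn.py):

```python
import torch
import torch.nn as nn
import torch.optim as optim

class DQN(nn.Module):
    """Illustrative 14 -> 256 -> 256 -> 128 -> 9 network matching the description above."""
    def __init__(self, state_size: int = 14, action_size: int = 9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_size),
        )
        # Kaiming normal initialization for better gradient flow through ReLU layers
        for layer in self.net:
            if isinstance(layer, nn.Linear):
                nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
                nn.init.zeros_(layer.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy_net = DQN().to(device)
optimizer = optim.Adam(policy_net.parameters(), lr=0.0005)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.9)
```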
The worm learns through a sophisticated reward system:
- Food and Hunger:
  - Base food reward: +100.0
  - Hunger multiplier: 2.0x
  - Starvation penalty: -1.5 base rate
  - Shrinking penalty: -25.0
- Movement and Exploration:
  - Smooth movement: +2.0
  - Exploration bonus: +5.0
  - Sharp turns: -0.1
  - Direction changes: -0.05
- Growth and Progress:
  - Growth reward: +50.0
  - Plant spawn mechanics:
    - Base spawn chance: 0.02
    - Min plants: 2, Max plants: 8
    - Spawn cooldown: 60 frames
- Safety and Survival:
  - Wall collision: -50.0
  - Wall proximity: -2.0
  - Extended wall contact: -20.0 base, scaling by 1.2x
  - Danger zone starts at 1.8x head size
  - Wall stay recovery: 0.4
  - Hunger system:
    - Max hunger: 1000
    - Base hunger rate: 0.1
    - Hunger gain from plant: 300
    - Shrink threshold: 50% hunger
    - Shrink cooldown: 60 frames
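For reference, the values above could be grouped into a single configuration block. The following is an illustrative sketch using plain Python dictionaries, not the literal structure used in the code:

```python
# Illustrative grouping of the reward, hunger, and plant constants listed above.
REWARDS = {
    "food_base": 100.0, "hunger_multiplier": 2.0,
    "starvation_base": -1.5, "shrink_penalty": -25.0,
    "smooth_movement": 2.0, "exploration_bonus": 5.0,
    "sharp_turn": -0.1, "direction_change": -0.05,
    "growth": 50.0,
    "wall_collision": -50.0, "wall_proximity": -2.0,
    "wall_contact_base": -20.0, "wall_contact_scale": 1.2,
}

HUNGER = {
    "max": 1000, "base_rate": 0.1, "gain_per_plant": 300,
    "shrink_threshold": 0.5,   # shrink below 50% hunger
    "shrink_cooldown": 60,     # frames
}

PLANTS = {
    "spawn_chance": 0.02, "min": 2, "max": 8, "spawn_cooldown": 60,  # frames
}
```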
The worm's behavior and learning are guided by a reward system based on Maslow's Hierarchy of Needs, creating more realistic and human-like behavior patterns:
- Food & Hunger: Primary survival drive
- Exponential rewards (10-30) for eating based on hunger level
- Severe starvation penalties:
- -0.1 at 50% hunger (mild discomfort)
- -0.4 at 25% hunger (increasing distress)
- -2.25 at 10% hunger (severe penalty)
- -25.0 at 1% hunger (critical survival state)
- Emotional expressions show distress when starving
- Collision Avoidance: -2.0 penalty for wall collisions
- Movement Safety:
- Penalties up to -1.0 for sharp turns
- Small rewards (0.1) for smooth, stable movement
- Expressions reflect distress during unsafe behavior
- Growth Rewards: +8.0 for growing longer
- Only available when basic needs are met (>50% hunger)
- Enhanced happiness expressions when growing while healthy
- Small rewards (0.05) for movement and exploration
- Only activated when well-fed (>80% hunger)
- Encourages curiosity once basic needs are satisfied
This hierarchical reward structure ensures the worm:
- Prioritizes survival above all else
- Develops safe movement patterns
- Pursues growth only when healthy
- Explores its environment when comfortable
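A simplified sketch of how this need-gating could be expressed, assuming a single reward function that receives a normalized hunger value in [0, 1] (the function and argument names are illustrative, and the eating reward is a linear stand-in for the exponential scaling described above):

```python
def hierarchical_reward(hunger_pct: float, ate_food: bool, grew: bool,
                        collided: bool, smooth_move: bool, explored: bool) -> float:
    """Illustrative reward gating: higher-level rewards only apply once lower needs are met."""
    reward = 0.0
    # 1. Survival: eating rewards grow with hunger; starvation penalties escalate
    if ate_food:
        reward += 10.0 + 20.0 * (1.0 - hunger_pct)   # roughly 10-30 depending on hunger
    if hunger_pct < 0.01:
        reward -= 25.0
    elif hunger_pct < 0.10:
        reward -= 2.25
    elif hunger_pct < 0.25:
        reward -= 0.4
    elif hunger_pct < 0.50:
        reward -= 0.1
    # 2. Safety: collisions and jerky movement are penalized regardless of other needs
    if collided:
        reward -= 2.0
    if smooth_move:
        reward += 0.1
    # 3. Growth: only rewarded when basic needs are met (>50% hunger)
    if grew and hunger_pct > 0.50:
        reward += 8.0
    # 4. Exploration: only encouraged when well-fed (>80% hunger)
    if explored and hunger_pct > 0.80:
        reward += 0.05
    return reward
```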
The worm's facial expressions provide visual feedback of its current state in the hierarchy, from distress when starving to contentment during healthy growth.
The worm features a sophisticated facial animation system that brings personality to the AI agent:
- Dynamic pupil movement with smooth interpolation (speed: 0.15)
- Natural blinking system with random intervals (3-6 seconds)
- Smooth blink animations (0.3s duration)
- Minimum 2.0s between blinks
- Eye convergence system (max 40% inward convergence)
- Detailed eye anatomy:
  - White sclera
  - Black pupils
  - Dynamic pupil positioning
  - Size scales with head size (25% of head size)
- Three expression states: smile (1), neutral (0), frown (-1)
- Smooth expression transitions
- Expression magnitude control
- Configurable hold times for expressions
- Base expression change speed: 2.0 (adjustable by magnitude)
- Smooth movement interpolation
- Dynamic head rotation
- Fluid body segment following
- Responsive wall collision reactions
- Hunger state visual feedback
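An illustrative sketch of how the pupil, expression, and blink timing constants above might drive per-frame smoothing (the class and attribute names here are hypothetical):

```python
import random

class ExpressionState:
    """Illustrative per-frame smoothing of pupil, expression, and blink timing."""
    PUPIL_SPEED = 0.15        # pupil interpolation factor per frame
    EXPRESSION_SPEED = 2.0    # base expression change speed (units per second)
    MIN_BLINK_GAP = 2.0       # minimum seconds between blinks

    def __init__(self):
        self.pupil_x = 0.0
        self.expression = 0.0           # -1 frown, 0 neutral, 1 smile
        self.target_expression = 0.0
        self.next_blink = random.uniform(3.0, 6.0)   # blink every 3-6 seconds

    def update(self, dt: float, target_pupil_x: float) -> None:
        # Smoothly move the pupil toward its target position
        self.pupil_x += (target_pupil_x - self.pupil_x) * self.PUPIL_SPEED
        # Step the expression toward its target at the base change speed
        delta = self.target_expression - self.expression
        step = self.EXPRESSION_SPEED * dt
        self.expression += max(-step, min(step, delta))
        # Count down to the next blink, then schedule another one 3-6 s out
        self.next_blink -= dt
        if self.next_blink <= 0:
            self.next_blink = max(self.MIN_BLINK_GAP, random.uniform(3.0, 6.0))
```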
The training uses several advanced DRL techniques:
- Experience Replay:
  - Memory buffer of 50,000 experiences with normalized rewards
  - Reward clipping to the [-1.0, 1.0] range
- Batch Learning:
  - Regular mode: 64 experiences per batch
  - Fast training mode: 512 experiences per batch
  - Multiple batches processed for stable learning
- Target Network: Updated every 1,000 steps for stable learning
- Epsilon-Greedy Strategy:
  - Starts at 1.0 (full exploration)
  - Regular mode:
    - Decays to 0.1 (increased minimum exploration)
    - Slower decay rate of 0.9998
  - Fast training mode:
    - Decays to 0.01
    - Decay rate of 0.9995
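A minimal sketch of the replay buffer with reward clipping and the regular-mode epsilon-greedy schedule described above (names and structure are illustrative, not the exact implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Illustrative experience replay with reward clipping, as described above."""
    def __init__(self, capacity: int = 50_000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done) -> None:
        reward = max(-1.0, min(1.0, reward))   # clip rewards to [-1.0, 1.0]
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        return random.sample(list(self.memory), batch_size)

# Epsilon-greedy schedule (regular-mode values from above)
epsilon, epsilon_min, epsilon_decay = 1.0, 0.1, 0.9998

def select_action(q_values, n_actions: int = 9) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    global epsilon
    action = random.randrange(n_actions) if random.random() < epsilon else int(q_values.argmax())
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return action
```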
- Training Mode:
python train.py
This starts the training process. The worm will learn to:
- Navigate efficiently
- Avoid walls
- Explore the environment
- Develop smooth movement patterns
- Demo Mode:
python app.py --demo
Shows the worm using its best learned behavior.
- Key Controls:
- ESC: Exit
- SPACE: Pause/Resume
- R: Reset position
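An illustrative Pygame event handler for these controls (the actual handling lives in app.py; worm.reset() is a hypothetical method name):

```python
import pygame

def handle_events(worm, paused: bool) -> tuple[bool, bool]:
    """Process the keyboard controls listed above; returns (running, paused)."""
    running = True
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN:
            if event.key == pygame.K_ESCAPE:
                running = False           # ESC: exit
            elif event.key == pygame.K_SPACE:
                paused = not paused       # SPACE: pause/resume
            elif event.key == pygame.K_r:
                worm.reset()              # R: reset position (hypothetical method)
    return running, paused
```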
The system generates detailed analytics every 50 episodes:
- Learning progress (rewards)
- Movement smoothness
- Exploration coverage
- Wall collision frequency
- Training metrics (epsilon, loss)
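An illustrative sketch of how such a periodic report might be written with pandas (the real reporting lives in analytics/metrics.py; the function name, column names, and output path are assumptions):

```python
import pandas as pd

def save_report(history: list[dict], episode: int, out_dir: str = "analytics/reports") -> None:
    """Write a CSV summary of per-episode metrics every 50 episodes."""
    if episode % 50 != 0:
        return
    # Each dict in history holds per-episode values such as reward, smoothness,
    # coverage, wall collisions, epsilon, and loss.
    df = pd.DataFrame(history)
    df.to_csv(f"{out_dir}/report_{episode}.csv", index=False)
```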
.
├── app.py # Main visualization application
├── train.py # Fast training script
├── models/
│ └── dqn.py # DQN implementation
├── analytics/
│ └── metrics.py # Analytics and reporting
└── README.md
- Python 3.12+
- PyTorch 2.0+ with CUDA support
- Pygame 2.5+
- NumPy
- Pandas (for analytics)
- Plotly (for visualization)
- Seaborn (for additional visualizations)
- Multi-agent training
- Curriculum learning
- Environment complexity scaling
- Competitive scenarios
- Enhanced sensory inputs