# Training World Models in VizDoom from Scratch

This notebook documents the process of setting up and training a World Models implementation for VizDoom with specific dependencies:

- Ubuntu 16.04
- Python 3.5.4
- TensorFlow 1.8.0
- NumPy 1.13.3
- VizDoom Gym Levels (Latest commit 60ff576 on Mar 18, 2017)
- OpenAI Gym 0.9.4
- cma 2.2.0
- mpi4py 2
- Jupyter Notebook

The World Models approach consists of three components:
1. **VAE (Vision)**: Compresses the high-dimensional visual input into a latent representation
2. **MDN-RNN (Memory)**: Predicts future latent states based on current states and actions
3. **Controller**: Maps latent states to actions using a simple neural network

Let's start by setting up our Docker environment to ensure we have the correct dependencies.

## 1. Setup Environment and Dependencies

First, we'll create a Dockerfile that sets up the correct environment for our experiment. This Dockerfile will:
1. Use Ubuntu 16.04 as the base image
2. Install Python 3.5 (Ubuntu 16.04's packages provide 3.5.2 rather than 3.5.4)
3. Install TensorFlow 1.8.0, NumPy 1.13.3, and other dependencies
4. Install VizDoom and the required gym environments

Let's create this Dockerfile and build our Docker image.

In [None]:
%%writefile Dockerfile.vizdoom

FROM ubuntu:16.04

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    libboost-all-dev \
    libgtk2.0-dev \
    libopenmpi-dev \
    libsdl2-dev \
    python3-dev \
    python3-pip \
    python3-numpy \
    wget \
    zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*

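# Python 3.5 headers and Tk bindings (Ubuntu 16.04 packages Python 3.5.2, not 3.5.4)
RUN apt-get update && apt-get install -y python3.5-dev python3.5-tk

# Upgrade pip, staying below 21.0 (pip 21 dropped Python 3.5 support)
RUN pip3 install --upgrade "pip<21"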

# Install Python dependencies
RUN pip3 install \
    tensorflow==1.8.0 \
    numpy==1.13.3 \
    gym==0.9.4 \
    cma==2.2.0 \
    mpi4py==2.0.0 \
    matplotlib==2.2.3 \
    jupyter \
    pillow \
    scipy==1.0.0

# Install VizDoom
RUN apt-get update && apt-get install -y \
    libbz2-dev \
    libffi-dev \
    libfreetype6-dev \
    libjpeg-dev \
    liblzma-dev \
    libncurses5-dev \
    libncursesw5-dev \
    libpng-dev \
    libreadline-dev \
    libssl-dev \
    libsqlite3-dev \
    libx11-dev \
    libgl1-mesa-dev \
    tk-dev

# Clone and build VizDoom
RUN git clone https://github.com/mwydmuch/ViZDoom.git \
    && cd ViZDoom \
    && python3 setup.py build \
    && python3 setup.py install

# Clone VizDoom gym environments
RUN git clone https://github.com/ppaquette/gym-doom.git \
    && cd gym-doom \
    && git reset --hard 60ff576 \
    && pip3 install -e .

# Create working directory
WORKDIR /app

# Copy the code
COPY doomrnn/ /app/doomrnn/

# Expose port for Jupyter Notebook
EXPOSE 8888

# Default command
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

Now, let's build the Docker image. This may take some time as it's installing all the dependencies.

```bash
docker build -t worldmodels-vizdoom -f Dockerfile.vizdoom .
```

Once the image is built, we can run a container with the following command:

```bash
docker run -p 8888:8888 -v "$(pwd)":/app worldmodels-vizdoom
```

This will start a Jupyter Notebook server that we can access through our browser.
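
Once the notebook server is running, it can be worth confirming that the container really has the pinned versions. The snippet below is a minimal sketch of such a check, not part of the original setup: `installed_version` and `find_mismatches` are hypothetical helpers, and the check assumes each package exposes a `__version__` attribute that starts with the pinned version string.

```python
import importlib

# Pinned versions copied from the dependency list at the top of this notebook.
PINNED = {
    "tensorflow": "1.8.0",
    "numpy": "1.13.3",
    "gym": "0.9.4",
}

def installed_version(module_name):
    """Return the module's __version__ string, or None if it is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

def find_mismatches(pinned, lookup=installed_version):
    """Return (name, expected, found) for every package that does not match."""
    mismatches = []
    for name, expected in sorted(pinned.items()):
        found = lookup(name)
        # Assumption: a matching install reports a version starting with the pin.
        if found is None or not str(found).startswith(expected):
            mismatches.append((name, expected, found))
    return mismatches

# Inside the container one would call: find_mismatches(PINNED)
# Here the lookup is faked so the example runs anywhere.
fake_install = {"tensorflow": "1.8.0", "numpy": "1.14.0", "gym": "0.9.4"}
for name, expected, found in find_mismatches(PINNED, lookup=fake_install.get):
    print("{}: expected {}, found {}".format(name, expected, found))
```

An empty result from `find_mismatches(PINNED)` suggests the environment matches the versions listed above.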

Now, let's proceed with understanding and implementing the World Models architecture for VizDoom.

## 2. Understanding the VizDoom Environment

VizDoom provides a 3D environment based on the Doom game, allowing an agent to learn to navigate and act in a 3D world. The observations are RGB images from the agent's perspective, and the actions are discrete (move forward, turn left, turn right, shoot, etc.).

The World Models paper uses the "Take Cover" scenario, where the agent needs to avoid fireballs thrown by enemies. Let's first explore this environment.

In [None]:
# This code would run in the Docker container
# Here we're showing what we would execute

import gym
import ppaquette_gym_doom
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
%matplotlib inline

# Create the environment
env = gym.make('ppaquette/DoomTakeCover-v0')

# Reset the environment
observation = env.reset()

# Render the environment
plt.figure(figsize=(8, 6))
imshow(observation)
plt.title('VizDoom Take Cover - Initial Observation')
plt.axis('off')
plt.show()

# Get the observation space and action space
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
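
Because the actions are discrete, the later training code represents each one as a one-hot vector before concatenating it with a latent state. The following is a small NumPy sketch of that encoding; `one_hot` and `random_one_hot` are hypothetical helper names, and the number of actions is passed in explicitly rather than read from the environment.

```python
import numpy as np

def one_hot(action_idx, n_actions):
    """Encode a discrete action index as a one-hot float vector."""
    vec = np.zeros(n_actions, dtype=np.float32)
    vec[action_idx] = 1.0
    return vec

def random_one_hot(n_actions, rng=np.random):
    """Pick a uniformly random action and return (index, one-hot vector)."""
    idx = int(rng.randint(0, n_actions))
    return idx, one_hot(idx, n_actions)

print(one_hot(2, 4))  # [0. 0. 1. 0.]
```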

## 3. Overview of World Models Architecture

The World Models architecture for VizDoom consists of three main components:

1. **VAE (Vision)**: A variational autoencoder that compresses the high-dimensional visual input (RGB images) into a low-dimensional latent representation.
2. **MDN-RNN (Memory)**: A recurrent neural network with a mixture density network output layer that predicts future latent states based on current states and actions.
3. **Controller**: A simple neural network that maps latent states to actions.
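
To make the data flow between the three components concrete, here is a toy NumPy sketch of a single timestep. The random linear maps are stand-ins chosen only so that the shapes line up; they are not the real VAE, MDN-RNN, or Controller, and the sizes `Z_SIZE` and `N_ACTIONS` are illustrative assumptions.

```python
import numpy as np

Z_SIZE, N_ACTIONS = 64, 2  # illustrative sizes
rng = np.random.RandomState(0)

# Random linear maps standing in for the three trained models.
W_vision = rng.randn(64 * 64 * 3, Z_SIZE) * 0.01
W_memory = rng.randn(Z_SIZE + N_ACTIONS, Z_SIZE) * 0.1
W_control = rng.randn(Z_SIZE, N_ACTIONS) * 0.1

def vision(frame):
    """Stand-in for the VAE encoder: frame -> latent vector z."""
    return frame.reshape(-1).dot(W_vision)

def memory(z, action):
    """Stand-in for the MDN-RNN: (z, action) -> predicted next z."""
    return np.concatenate([z, action]).dot(W_memory)

def controller(z):
    """Stand-in for the Controller: z -> action probabilities."""
    logits = z.dot(W_control)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

frame = rng.rand(64, 64, 3)           # fake observation in [0, 1]
z = vision(frame)                     # compress the frame
probs = controller(z)                 # choose an action from z
action = np.zeros(N_ACTIONS)
action[np.argmax(probs)] = 1.0        # one-hot encode the chosen action
z_next_pred = memory(z, action)       # predict the next latent state
print(z.shape, probs.shape, z_next_pred.shape)
```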

Let's implement each of these components.

## 4. Implementing the VAE (Vision)

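The VAE takes the high-dimensional visual input (RGB images) and compresses it into a low-dimensional latent vector (64 dimensions by default in the code below). The memory and controller models then work with these compact vectors instead of raw pixels, which gives them a much smaller input to learn from.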

The VAE architecture from the World Models paper consists of:
- Encoder: Several convolutional layers that reduce the image to a low-dimensional latent vector
- Latent space: A probabilistic representation of the input
- Decoder: Several deconvolutional layers that reconstruct the image from the latent vector
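
The spatial sizes inside the encoder and decoder follow from simple arithmetic. The sketch below traces them for the kernel sizes and strides used in the implementation further down (four stride-2 convolutions with kernel 4, then transposed convolutions with kernels 5, 5, 6, 6), assuming unpadded ("valid") layers, which is the default for `tf.layers.conv2d`. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def deconv_out(size, kernel, stride):
    """Output width of an unpadded ('valid') transposed convolution."""
    return (size - 1) * stride + kernel

def trace(size, layers, step):
    """Apply `step` layer by layer and record every intermediate width."""
    sizes = [size]
    for kernel, stride in layers:
        size = step(size, kernel, stride)
        sizes.append(size)
    return sizes

encoder_sizes = trace(64, [(4, 2)] * 4, conv_out)
decoder_sizes = trace(1, [(5, 2), (5, 2), (6, 2), (6, 2)], deconv_out)
print(encoder_sizes)  # [64, 31, 14, 6, 2]
print(decoder_sizes)  # [1, 5, 13, 30, 64]
```

The final encoder feature map is 2x2 with 256 channels, which is why the implementation flattens it to `2*2*256` values before the latent layers.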

Let's implement the VAE for VizDoom:

In [None]:
# This is the VAE implementation for VizDoom
# In the Docker container, this would be in doomrnn/vae.py

import json

import numpy as np
import tensorflow as tf

class ConvVAE(object):
    def __init__(self, z_size=64, batch_size=100, learning_rate=0.0001, kl_tolerance=0.5, is_training=True, reuse=False, gpu_mode=True):
        self.z_size = z_size
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.is_training = is_training
        self.kl_tolerance = kl_tolerance
        self.reuse = reuse
        self.gpu_mode = gpu_mode
        with tf.variable_scope('conv_vae', reuse=self.reuse):
            if not gpu_mode:
                with tf.device('/cpu:0'):
                    tf.logging.info('Model using cpu.')
                    self._build_graph()
            else:
                tf.logging.info('Model using gpu.')
                self._build_graph()
            self._init_session()
            self._init_saver()
            
    def _build_graph(self):
        # Placeholders for input and output
        self.x = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])
        
        # Encoder
        h = tf.layers.conv2d(self.x, 32, 4, strides=2, activation=tf.nn.relu, name="enc_conv1")
        h = tf.layers.conv2d(h, 64, 4, strides=2, activation=tf.nn.relu, name="enc_conv2")
        h = tf.layers.conv2d(h, 128, 4, strides=2, activation=tf.nn.relu, name="enc_conv3")
        h = tf.layers.conv2d(h, 256, 4, strides=2, activation=tf.nn.relu, name="enc_conv4")
        h = tf.reshape(h, [-1, 2*2*256])
        
        # VAE latent layers
        self.mu = tf.layers.dense(h, self.z_size, name="enc_fc_mu")
        self.logvar = tf.layers.dense(h, self.z_size, name="enc_fc_log_var")
        self.sigma = tf.exp(self.logvar / 2.0)
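        # Match the noise shape to the actual batch so that encoding works
        # for any number of input frames, not only for batch_size of them
        self.epsilon = tf.random_normal(tf.shape(self.mu))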
        self.z = self.mu + self.sigma * self.epsilon
        
        # Decoder
        h = tf.layers.dense(self.z, 4*256, name="dec_fc")
        h = tf.reshape(h, [-1, 1, 1, 4*256])
        h = tf.layers.conv2d_transpose(h, 128, 5, strides=2, activation=tf.nn.relu, name="dec_deconv1")
        h = tf.layers.conv2d_transpose(h, 64, 5, strides=2, activation=tf.nn.relu, name="dec_deconv2")
        h = tf.layers.conv2d_transpose(h, 32, 6, strides=2, activation=tf.nn.relu, name="dec_deconv3")
        self.y = tf.layers.conv2d_transpose(h, 3, 6, strides=2, activation=tf.nn.sigmoid, name="dec_deconv4")
        
        # Loss
        # Reconstruction loss (sum of squared pixel errors per image)
        self.reconstruction_loss = tf.reduce_sum(
            tf.square(self.x - self.y),
            reduction_indices=[1, 2, 3]
        )
        self.reconstruction_loss = tf.reduce_mean(self.reconstruction_loss)
        
        # KL divergence loss
        self.kl_loss = -0.5 * tf.reduce_sum(
            (1 + self.logvar - tf.square(self.mu) - tf.exp(self.logvar)),
            reduction_indices=1
        )
        
        # Apply KL tolerance
        self.kl_loss = tf.maximum(self.kl_loss, self.kl_tolerance * self.z_size)
        self.kl_loss = tf.reduce_mean(self.kl_loss)
        
        # Total loss
        self.loss = self.reconstruction_loss + self.kl_loss
        
        # Optimizer
        self.optimizer = tf.train.AdamOptimizer(self.learning_rate)
        self.train_op = self.optimizer.minimize(self.loss)
        
    def _init_session(self):
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        
    def _init_saver(self):
        self.saver = tf.train.Saver()
        
    def close_sess(self):
        self.sess.close()
        
    def encode(self, x):
        return self.sess.run(self.z, feed_dict={self.x: x})
    
    def decode(self, z):
        return self.sess.run(self.y, feed_dict={self.z: z})
    
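    def _model_variables(self):
        # Only this model's variables; other models may share the default graph
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='conv_vae')
    
    def get_model_params(self):
        # Get this model's trainable variables as NumPy arrays
        params = self.sess.run(self._model_variables())
        return params
    
    def set_model_params(self, params):
        # Assign the given values to this model's trainable variables
        var_list = self._model_variables()
        ops = []
        for i, var in enumerate(var_list):
            ops.append(tf.assign(var, params[i]))
        self.sess.run(ops)
    
    def save_json(self, json_path):
        # Save model parameters as JSON (NumPy arrays are not JSON-serializable,
        # so each one is converted to a nested list first)
        params = [p.tolist() for p in self.get_model_params()]
        with open(json_path, 'w') as f:
            json.dump(params, f)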
    
    def load_json(self, json_path):
        # Load model parameters from JSON
        with open(json_path, 'r') as f:
            params = json.load(f)
        self.set_model_params(params)
    
    def train(self, x):
        # Train the VAE
        _, loss, r_loss, kl_loss = self.sess.run(
            [self.train_op, self.loss, self.reconstruction_loss, self.kl_loss],
            feed_dict={self.x: x}
        )
        return loss, r_loss, kl_loss
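
The two less obvious pieces of the loss above are the reparameterized sample and the KL tolerance. The NumPy sketch below mirrors those formulas outside TensorFlow so they can be checked by hand; it is an illustration of the same arithmetic, not a replacement for the graph code, and the function names are hypothetical.

```python
import numpy as np

def sample_z(mu, logvar, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    sigma = np.exp(logvar / 2.0)
    return mu + sigma * rng.randn(*mu.shape)

def kl_per_example(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)

def kl_with_tolerance(mu, logvar, kl_tolerance=0.5):
    """Clamp each example's KL from below at kl_tolerance * z_size, then average."""
    z_size = mu.shape[1]
    floor = kl_tolerance * z_size
    return np.mean(np.maximum(kl_per_example(mu, logvar), floor))

mu = np.zeros((3, 64))
logvar = np.zeros((3, 64))
print(kl_per_example(mu, logvar))     # all zeros: the posterior equals the prior
print(kl_with_tolerance(mu, logvar))  # 32.0, the floor 0.5 * 64
```

With the floor in place, the KL term stops contributing gradient once it falls below `kl_tolerance * z_size`, so the optimizer is no longer pushed to shrink it further at the expense of reconstruction quality.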

## 5. Implementing the MDN-RNN (Memory)

The MDN-RNN takes the latent representation from the VAE and predicts future latent states based on the current state and the action. It uses a mixture density network output layer to model the uncertainty in the predictions.

The architecture consists of:
- LSTM layers: To capture temporal dependencies
- Mixture Density Network: To model the distribution of the next latent state
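
As a reference for the loss used below, here is a NumPy sketch of the mixture negative log-likelihood. It assumes, as the implementation does, one mixture weight per component shared across all latent dimensions and diagonal Gaussians whose log densities are summed over those dimensions. The function names are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    summed = np.sum(np.exp(a - a_max), axis=axis)
    return np.squeeze(a_max, axis=axis) + np.log(summed)

def mdn_nll(pi, mu, sigma, y):
    """Mean negative log-likelihood of y under a Gaussian mixture.

    pi: [N, K] mixture weights, mu and sigma: [N, K, D], y: [N, D].
    """
    y = y[:, None, :]  # broadcast the target against the K components
    log_normal = (-0.5 * np.log(2.0 * np.pi) - np.log(sigma)
                  - 0.5 * ((y - mu) / sigma) ** 2)
    log_prob = log_normal.sum(axis=-1) + np.log(pi)  # shape [N, K]
    return -np.mean(logsumexp(log_prob, axis=-1))

# One standard-normal component in two dimensions, evaluated at its mean.
pi = np.ones((1, 1))
mu = np.zeros((1, 1, 2))
sigma = np.ones((1, 1, 2))
y = np.zeros((1, 2))
print(mdn_nll(pi, mu, sigma, y))  # log(2 * pi), about 1.8379
```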

Let's implement the MDN-RNN for VizDoom:

In [None]:
# This is the MDN-RNN implementation for VizDoom
# In the Docker container, this would be in doomrnn/rnn.py

import numpy as np
import tensorflow as tf
import json

class MDNRNN(object):
    def __init__(self, z_size=64, action_size=2, hidden_units=256, 
                 n_mixtures=5, batch_size=100, learning_rate=0.001, grad_clip=1.0,
                 is_training=True, reuse=False):
        self.z_size = z_size
        self.action_size = action_size
        self.hidden_units = hidden_units
        self.n_mixtures = n_mixtures
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.grad_clip = grad_clip
        self.is_training = is_training
        self.reuse = reuse
        
        with tf.variable_scope('mdn_rnn', reuse=self.reuse):
            self._build_graph()
            self._init_session()
            self._init_saver()
            
    def _build_graph(self):
        # Placeholders
        self.x = tf.placeholder(tf.float32, shape=[None, None, self.z_size + self.action_size])
        self.y = tf.placeholder(tf.float32, shape=[None, None, self.z_size])
        self.seq_lengths = tf.placeholder(tf.int32, shape=[None])
        self.initial_state = None
        
        # LSTM layers
        lstm_cell = tf.nn.rnn_cell.LSTMCell(self.hidden_units)
        self.initial_state = lstm_cell.zero_state(batch_size=self.batch_size, dtype=tf.float32)
        
        # RNN
        outputs, final_state = tf.nn.dynamic_rnn(
            lstm_cell, 
            self.x,
            initial_state=self.initial_state,
            dtype=tf.float32,
            sequence_length=self.seq_lengths
        )
        
        # Reshape outputs for dense layers
        outputs_flat = tf.reshape(outputs, [-1, self.hidden_units])
        
        # MDN outputs
        # For each mixture, we need:
        # - pi: the mixture weight
        # - mu: the mean
        # - sigma: the standard deviation
        n_outputs = self.n_mixtures * (2 * self.z_size + 1)
        
        # Dense layer
        mdn_outputs = tf.layers.dense(outputs_flat, n_outputs)
        
        # Split MDN outputs
        mdn_out_pi, mdn_out_mu, mdn_out_sigma = self._split_mdn_outputs(mdn_outputs)
        
        # Apply softmax to pi
        mdn_out_pi = tf.nn.softmax(mdn_out_pi, axis=-1)
        
        # Apply exp to sigma to ensure it's positive
        mdn_out_sigma = tf.exp(mdn_out_sigma)
        
        # Loss function
        self.loss = self._mdn_loss(mdn_out_pi, mdn_out_mu, mdn_out_sigma, self.y)
        
        # Optimizer with gradient clipping
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        grads, variables = zip(*optimizer.compute_gradients(self.loss))
        grads, _ = tf.clip_by_global_norm(grads, self.grad_clip)
        self.train_op = optimizer.apply_gradients(zip(grads, variables))
        
    def _split_mdn_outputs(self, mdn_outputs):
        # Split the MDN outputs into pi, mu, and sigma
        mdn_output_shape = tf.shape(mdn_outputs)
        batch_size = mdn_output_shape[0]
        
        # Split along the last dimension
        mdn_out_pi = mdn_outputs[:, :self.n_mixtures]
        mdn_out_mu = mdn_outputs[:, self.n_mixtures:self.n_mixtures*(1+self.z_size)]
        mdn_out_sigma = mdn_outputs[:, self.n_mixtures*(1+self.z_size):]
        
        # Reshape mu and sigma
        mdn_out_mu = tf.reshape(mdn_out_mu, [batch_size, self.n_mixtures, self.z_size])
        mdn_out_sigma = tf.reshape(mdn_out_sigma, [batch_size, self.n_mixtures, self.z_size])
        
        return mdn_out_pi, mdn_out_mu, mdn_out_sigma
    
    def _mdn_loss(self, pi, mu, sigma, y):
        # Calculate the MDN loss
        y = tf.reshape(y, [-1, 1, self.z_size])
        
        # Per-component log density, summed over the latent dimensions
        dist = tf.contrib.distributions.Normal(loc=mu, scale=sigma)
        log_prob = dist.log_prob(y)
        log_prob = tf.reduce_sum(log_prob, axis=-1)
        
        # Weight by the mixture probabilities
        weighted_log_prob = log_prob + tf.log(pi + 1e-8)
        
        # Log-sum-exp over the mixture axis, computed stably; the result has
        # one entry per (sequence, timestep) pair
        log_likelihood = tf.reduce_logsumexp(weighted_log_prob, axis=-1)
        
        # Negative log likelihood (padded timesteps are included in this mean)
        loss = -tf.reduce_mean(log_likelihood)
        
        return loss
    
    def _init_session(self):
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        
    def _init_saver(self):
        self.saver = tf.train.Saver()
        
    def close_sess(self):
        self.sess.close()
        
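    def _model_variables(self):
        # Only this model's variables; other models may share the default graph
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='mdn_rnn')
    
    def get_model_params(self):
        # Get this model's trainable variables as NumPy arrays
        params = self.sess.run(self._model_variables())
        return params
    
    def set_model_params(self, params):
        # Assign the given values to this model's trainable variables
        var_list = self._model_variables()
        ops = []
        for i, var in enumerate(var_list):
            ops.append(tf.assign(var, params[i]))
        self.sess.run(ops)
    
    def save_json(self, json_path):
        # Save model parameters as JSON (NumPy arrays are not JSON-serializable,
        # so each one is converted to a nested list first)
        params = [p.tolist() for p in self.get_model_params()]
        with open(json_path, 'w') as f:
            json.dump(params, f)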
    
    def load_json(self, json_path):
        # Load model parameters from JSON
        with open(json_path, 'r') as f:
            params = json.load(f)
        self.set_model_params(params)
    
    def train(self, x, y, seq_lengths):
        # Train the MDN-RNN
        _, loss = self.sess.run(
            [self.train_op, self.loss],
            feed_dict={self.x: x, self.y: y, self.seq_lengths: seq_lengths}
        )
        return loss
    
    def predict(self, x, state=None):
        # Predict the next latent state
        # This is not implemented in this simplified version
        pass

## 6. Implementing the Controller

The Controller is a small neural network that maps latent states to actions. Because it has few parameters, it can be trained with an evolution strategy (here CMA-ES) that maximizes the episode reward directly, without gradients.

The architecture is simple:
- Input: Latent state from the VAE
- Hidden layers: Two fully connected layers with tanh activations
- Output: Action probabilities (softmax over the discrete actions)
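
The forward pass of such a network fits in a few lines of NumPy. The sketch below follows the layer sizes that appear as defaults in the implementation (64 inputs, 40 hidden units, 2 actions) but uses random weights and hypothetical function names; it is meant only to show the computation and how small the parameter vector is.

```python
import numpy as np

def init_controller(z_size=64, hidden_units=40, action_size=2, seed=0):
    """Create random (weights, bias) pairs for a two-hidden-layer MLP."""
    rng = np.random.RandomState(seed)
    sizes = [z_size, hidden_units, hidden_units, action_size]
    return [(rng.randn(m, n) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def action_probs(params, z):
    """Forward pass: tanh hidden layers followed by a softmax output."""
    h = z
    for w, b in params[:-1]:
        h = np.tanh(h.dot(w) + b)
    w, b = params[-1]
    logits = h.dot(w) + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

params = init_controller()
n_params = sum(w.size + b.size for w, b in params)
print(n_params)  # 4322
print(action_probs(params, np.zeros((1, 64))))  # [[0.5 0.5]] for a zero input
```

A few thousand parameters is small enough for a population-based search to cover, which is the reason the Controller is kept this simple.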

Let's implement the Controller:

In [None]:
# This is the Controller implementation for VizDoom
# In the Docker container, this would be in doomrnn/controller.py

import numpy as np
import tensorflow as tf
import json

class Controller(object):
    def __init__(self, z_size=64, action_size=2, hidden_units=40, batch_size=100, learning_rate=0.001):
        self.z_size = z_size
        self.action_size = action_size
        self.hidden_units = hidden_units
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        
        # Build under a dedicated scope so this model's variables can be
        # told apart from the VAE's, which live in the same default graph
        with tf.variable_scope('controller'):
            self._build_graph()
            self._init_session()
            self._init_saver()
        
    def _build_graph(self):
        # Placeholders
        self.z = tf.placeholder(tf.float32, shape=[None, self.z_size])
        
        # Hidden layers
        h = tf.layers.dense(self.z, self.hidden_units, activation=tf.nn.tanh)
        h = tf.layers.dense(h, self.hidden_units, activation=tf.nn.tanh)
        
        # Output layer
        self.action_logits = tf.layers.dense(h, self.action_size)
        self.action_probs = tf.nn.softmax(self.action_logits)
        
    def _init_session(self):
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        
    def _init_saver(self):
        self.saver = tf.train.Saver()
        
    def close_sess(self):
        self.sess.close()
        
    def get_action(self, z):
        # Get the action probabilities
        action_probs = self.sess.run(self.action_probs, feed_dict={self.z: z})
        return action_probs
    
    def _model_variables(self):
        # Only the controller's variables, not those of other models
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='controller')
    
    def get_model_params(self):
        # Get the controller's trainable variables as NumPy arrays
        params = self.sess.run(self._model_variables())
        return params
    
    def set_model_params(self, params):
        # Assign the given values to the controller's trainable variables
        var_list = self._model_variables()
        ops = []
        for i, var in enumerate(var_list):
            ops.append(tf.assign(var, params[i]))
        self.sess.run(ops)
    
    def save_json(self, json_path):
        # Save model parameters as JSON (NumPy arrays are not JSON-serializable,
        # so each one is converted to a nested list first)
        params = [p.tolist() for p in self.get_model_params()]
        with open(json_path, 'w') as f:
            json.dump(params, f)
    
    def load_json(self, json_path):
        # Load model parameters from JSON
        with open(json_path, 'r') as f:
            params = json.load(f)
        self.set_model_params(params)

## 7. Data Collection and Preprocessing

Before we can train our models, we need to collect data from the VizDoom environment. We'll use random actions to explore the environment and collect observations.

For the VAE, we need images from the environment.
For the MDN-RNN, we need sequences of latent states and actions.
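
Since the MDN-RNN needs actions alongside observations, one option is to store each episode's frames and actions together in a single compressed archive. The sketch below shows that idea with `np.savez_compressed`; it is an alternative storage layout offered as an assumption, not what the collection code in this notebook does, and `save_episode` and `load_episode` are hypothetical helpers.

```python
import os
import tempfile

import numpy as np

def save_episode(path, frames, actions):
    """Store one episode's frames (uint8) and action indices side by side."""
    frames = np.asarray(frames, dtype=np.uint8)
    actions = np.asarray(actions, dtype=np.int64)
    if len(frames) != len(actions):
        raise ValueError("need exactly one action per stored frame")
    np.savez_compressed(path, frames=frames, actions=actions)

def load_episode(path):
    """Load the frames and actions written by save_episode."""
    with np.load(path) as data:
        return data["frames"], data["actions"]

# Synthetic episode: 5 frames of 64x64 RGB and 5 action indices.
rng = np.random.RandomState(0)
demo_frames = rng.randint(0, 256, size=(5, 64, 64, 3))
demo_actions = rng.randint(0, 2, size=5)
demo_path = os.path.join(tempfile.mkdtemp(), "episode_0.npz")
save_episode(demo_path, demo_frames, demo_actions)
frames, actions = load_episode(demo_path)
print(frames.shape, frames.dtype, actions.shape)
```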

In [None]:
# Data collection for VizDoom
# In the Docker container, this would be executed directly

import gym
import ppaquette_gym_doom
import numpy as np
import random
import os
from PIL import Image

def collect_data(num_episodes=100, max_steps=1000, data_dir="data"):
    """Collect data from VizDoom environment using random actions"""
    # Create the environment
    env = gym.make('ppaquette/DoomTakeCover-v0')
    
    # Create data directory
    os.makedirs(data_dir, exist_ok=True)
    
    # Collect data
    total_frames = 0
    for episode in range(num_episodes):
        # Reset the environment
        observation = env.reset()
        
        # Downsample and normalize the observation
        obs = process_frame(observation)
        
        for step in range(max_steps):
            # Take a random action
            action = env.action_space.sample()
            
            # Step the environment
            next_observation, reward, done, info = env.step(action)
            
            # Downsample and normalize the next observation
            next_obs = process_frame(next_observation)
            
            # Save the current observation as an image (actions are not stored here)
            save_frame(obs, os.path.join(data_dir, "frame_{}.png".format(total_frames)))
            
            # Update counters
            total_frames += 1
            obs = next_obs
            
            if done:
                break
                
        print("Episode {}/{} completed, total frames: {}".format(episode + 1, num_episodes, total_frames))
    
    env.close()
    print("Data collection completed. Total frames: {}".format(total_frames))
    
def process_frame(frame, target_size=(64, 64)):
    """Preprocess a frame: resize and normalize"""
    # Convert to PIL Image
    img = Image.fromarray(frame)
    
    # Resize
    img = img.resize(target_size)
    
    # Convert back to numpy array and normalize
    img_array = np.array(img) / 255.0
    
    return img_array

def save_frame(frame, filepath):
    """Save a frame as an image file"""
    # Convert to PIL Image
    img = Image.fromarray((frame * 255).astype(np.uint8))
    
    # Save the image
    img.save(filepath)

## 8. Training the VAE

Now that we have collected data, we can train the VAE to compress the observations into a latent representation.
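
Each epoch visits the frames in a fresh random order and feeds them in fixed-size batches. The sketch below shows that pattern for frames already held in memory; like the loader in the cell that follows, it drops the final partial batch. `iterate_minibatches` is a hypothetical helper and the tiny array stands in for real image data.

```python
import numpy as np

def iterate_minibatches(frames, batch_size, rng):
    """Yield shuffled, full-size batches; the leftover partial batch is dropped."""
    order = rng.permutation(len(frames))
    n_batches = len(frames) // batch_size
    for i in range(n_batches):
        idx = order[i * batch_size:(i + 1) * batch_size]
        yield frames[idx]

rng = np.random.RandomState(0)
fake_frames = np.arange(10).reshape(10, 1)  # stand-in for [N, 64, 64, 3] images
batches = list(iterate_minibatches(fake_frames, batch_size=4, rng=rng))
print(len(batches), batches[0].shape)  # 2 batches of shape (4, 1)
```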

In [None]:
# Training the VAE
# In the Docker container, this would be in doomrnn/vae_train.py

import os
import numpy as np
import tensorflow as tf
from PIL import Image
import glob
import json
from vae import ConvVAE

def load_frames(data_dir="data", batch_size=100):
    """Load frames from the data directory"""
    # Get all frame files
    frame_files = glob.glob(os.path.join(data_dir, "frame_*.png"))
    
    # Shuffle the files
    np.random.shuffle(frame_files)
    
    # Load frames in batches
    n_files = len(frame_files)
    n_batches = n_files // batch_size
    
    for i in range(n_batches):
        batch_files = frame_files[i*batch_size:(i+1)*batch_size]
        batch_frames = []
        
        for file in batch_files:
            # Load the image
            img = Image.open(file)
            img_array = np.array(img) / 255.0
            batch_frames.append(img_array)
            
        yield np.array(batch_frames)

def train_vae(data_dir="data", model_dir="vae", batch_size=100, num_epochs=10, z_size=64):
    """Train the VAE on the collected data"""
    # Create model directory
    os.makedirs(model_dir, exist_ok=True)
    
    # Create the VAE
    vae = ConvVAE(z_size=z_size, batch_size=batch_size)
    
    # Training loop
    for epoch in range(num_epochs):
        # Epoch stats
        epoch_loss = 0
        epoch_r_loss = 0
        epoch_kl_loss = 0
        n_batches = 0
        
        # Load data batches
        for batch in load_frames(data_dir, batch_size):
            # Train on batch
            loss, r_loss, kl_loss = vae.train(batch)
            
            # Update stats
            epoch_loss += loss
            epoch_r_loss += r_loss
            epoch_kl_loss += kl_loss
            n_batches += 1
            
        # Print epoch stats (guard against an empty data directory)
        epoch_loss /= max(1, n_batches)
        epoch_r_loss /= max(1, n_batches)
        epoch_kl_loss /= max(1, n_batches)
        print("Epoch {}/{}, Loss: {:.4f}, Reconstruction Loss: {:.4f}, KL Loss: {:.4f}".format(
            epoch + 1, num_epochs, epoch_loss, epoch_r_loss, epoch_kl_loss))
        
        # Save model
        vae.save_json(os.path.join(model_dir, "vae.json"))
        
    print("VAE training completed.")
    vae.close_sess()

## 9. Training the MDN-RNN

After training the VAE, we can use it to encode observations into latent states and train the MDN-RNN to predict future latent states.
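
Episodes end at different times, so the sequences have to be padded to a common length before they can be batched. The NumPy sketch below builds padded inputs `(z_t, a_t)`, targets `z_{t+1}`, the true lengths, and a mask marking the real timesteps. It assumes each episode holds one more latent state than actions; `build_rnn_batch` is a hypothetical helper.

```python
import numpy as np

def build_rnn_batch(z_series, action_series):
    """Pad episodes to a common length.

    z_series[i] has shape [T_i + 1, z_size]; action_series[i] has shape
    [T_i, action_size]. Returns inputs x, targets y, lengths, and a mask.
    """
    lengths = np.array([len(a) for a in action_series])
    max_len = int(lengths.max())
    z_size = z_series[0].shape[1]
    action_size = action_series[0].shape[1]
    n = len(z_series)
    x = np.zeros((n, max_len, z_size + action_size), dtype=np.float32)
    y = np.zeros((n, max_len, z_size), dtype=np.float32)
    for i, (z, a) in enumerate(zip(z_series, action_series)):
        t = len(a)
        x[i, :t, :z_size] = z[:t]      # current latent state
        x[i, :t, z_size:] = a          # action taken in that state
        y[i, :t] = z[1:t + 1]          # latent state that followed
    # 1.0 for real timesteps, 0.0 for padding
    mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.float32)
    return x, y, lengths, mask

z0 = np.arange(8, dtype=np.float32).reshape(4, 2)   # 4 states, 3 transitions
a0 = np.eye(2, dtype=np.float32)[[0, 1, 0]]
z1 = np.ones((2, 2), dtype=np.float32)              # 2 states, 1 transition
a1 = np.eye(2, dtype=np.float32)[[1]]
x, y, lengths, mask = build_rnn_batch([z0, z1], [a0, a1])
print(x.shape, y.shape, lengths, mask)
```

The mask can be used to weight a per-timestep loss so that padded positions do not contribute to the average.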

In [None]:
# Training the MDN-RNN
# In the Docker container, this would be in doomrnn/rnn_train.py

import os
import numpy as np
import tensorflow as tf
import json
import gym
import ppaquette_gym_doom
from PIL import Image
from vae import ConvVAE
from rnn import MDNRNN

def generate_rnn_data(vae, num_episodes=100, max_steps=1000, data_dir="rnn_data"):
    """Generate data for RNN training using the trained VAE"""
    # Create the environment
    env = gym.make('ppaquette/DoomTakeCover-v0')
    
    # Create data directory
    os.makedirs(data_dir, exist_ok=True)
    
    # Collect data
    z_series = []
    action_series = []
    
    for episode in range(num_episodes):
        # Reset the environment
        observation = env.reset()
        
        # Preprocess the observation
        obs = process_frame(observation)
        
        # Encode the observation
        z = vae.encode(np.array([obs]))[0]
        
        # Initialize episode data
        episode_z = [z]
        episode_actions = []
        
        for step in range(max_steps):
            # Take a random action
            action_idx = np.random.randint(0, env.action_space.n)
            action = np.zeros(env.action_space.n)
            action[action_idx] = 1
            
            # Step the environment
            next_observation, reward, done, info = env.step(action_idx)
            
            # Preprocess the next observation
            next_obs = process_frame(next_observation)
            
            # Encode the next observation
            next_z = vae.encode(np.array([next_obs]))[0]
            
            # Store the action and next latent state
            episode_actions.append(action)
            episode_z.append(next_z)
            
            # Update observation
            obs = next_obs
            z = next_z
            
            if done:
                break
                
        # Store the episode data
        z_series.append(np.array(episode_z))
        action_series.append(np.array(episode_actions))
        
        print("Episode {}/{} completed, steps: {}".format(episode + 1, num_episodes, len(episode_z)))
    
    env.close()
    
    # Save the data
    np.save(os.path.join(data_dir, "z_series.npy"), z_series)
    np.save(os.path.join(data_dir, "action_series.npy"), action_series)
    
    print("RNN data generation completed. Episodes: {}".format(len(z_series)))
    
    return z_series, action_series

def train_rnn(z_series, action_series, model_dir="rnn", batch_size=100, num_epochs=10, z_size=64, action_size=8):
    """Train the MDN-RNN on the generated data"""
    # Create model directory
    os.makedirs(model_dir, exist_ok=True)
    
    # Create the RNN
    rnn = MDNRNN(z_size=z_size, action_size=action_size, batch_size=batch_size)
    
    # Training loop
    for epoch in range(num_epochs):
        # Epoch stats
        epoch_loss = 0
        n_batches = 0
        
        # Shuffle the episodes
        indices = np.random.permutation(len(z_series))
        
        # Train on each episode
        for i in range(0, len(indices), batch_size):
            # Get batch indices
            batch_indices = indices[i:i+batch_size]
            actual_batch_size = len(batch_indices)
            
            # Skip if batch is too small
            if actual_batch_size < batch_size:
                continue
                
            # Get batch episodes
            batch_z = [z_series[idx] for idx in batch_indices]
            batch_actions = [action_series[idx] for idx in batch_indices]
            
            # Create input sequences (z and action) and output sequences (next z)
            max_seq_len = max(len(z) for z in batch_z)
            seq_lengths = [len(z)-1 for z in batch_z]  # -1 because we need the next z
            
            # Skip if sequences are too short
            if max(seq_lengths) <= 0:
                continue
                
            # Create padded sequences
            x_seq = np.zeros((batch_size, max_seq_len-1, z_size + action_size))
            y_seq = np.zeros((batch_size, max_seq_len-1, z_size))
            
            for b in range(batch_size):
                z = batch_z[b]
                actions = batch_actions[b]
                seq_len = min(len(z)-1, max_seq_len-1)
                
                for t in range(seq_len):
                    # Input: concatenate z and action
                    x_seq[b, t, :z_size] = z[t]
                    x_seq[b, t, z_size:] = actions[t]
                    
                    # Output: next z
                    y_seq[b, t] = z[t+1]
            
            # Train on batch
            loss = rnn.train(x_seq, y_seq, seq_lengths)
            
            # Update stats
            epoch_loss += loss
            n_batches += 1
            
        # Print epoch stats
        epoch_loss /= max(1, n_batches)
        print("Epoch {}/{}, Loss: {:.4f}".format(epoch + 1, num_epochs, epoch_loss))
        
        # Save model
        rnn.save_json(os.path.join(model_dir, "rnn.json"))
        
    print("RNN training completed.")
    rnn.close_sess()

## 10. Training the Controller with CMA-ES

Finally, we can train the Controller using evolutionary strategies (CMA-ES) to maximize the reward in the environment.
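
An evolution strategy works on a single flat parameter vector, while the network stores its weights as several arrays of different shapes. The sketch below shows the round trip between the two, plus the sign flip needed because CMA-ES minimizes a cost. The helper names are hypothetical and the shapes are made up for illustration.

```python
import numpy as np

def flatten_params(params):
    """Concatenate a list of arrays into one flat vector."""
    return np.concatenate([np.asarray(p).ravel() for p in params])

def unflatten_params(flat, shapes):
    """Cut a flat vector back into arrays with the given shapes."""
    params, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        params.append(np.asarray(flat[start:start + size]).reshape(shape))
        start += size
    if start != len(flat):
        raise ValueError("flat vector length does not match the shapes")
    return params

def reward_to_cost(reward):
    """A minimizer needs a cost, so a higher reward maps to a lower cost."""
    return -reward

shapes = [(3, 2), (2,), (2, 4)]
params = [np.arange(int(np.prod(s)), dtype=float).reshape(s) for s in shapes]
flat = flatten_params(params)
restored = unflatten_params(flat, shapes)
print(flat.shape)  # (16,)
```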

In [None]:
# Training the Controller with CMA-ES
# In the Docker container, this would be in doomrnn/train.py

import os
import numpy as np
import tensorflow as tf
import json
import gym
import ppaquette_gym_doom
import cma
from PIL import Image
from vae import ConvVAE
from rnn import MDNRNN
from controller import Controller

def process_frame(frame, target_size=(64, 64)):
    """Preprocess a frame: resize and normalize"""
    # Convert to PIL Image
    img = Image.fromarray(frame)
    
    # Resize
    img = img.resize(target_size)
    
    # Convert back to numpy array and normalize
    img_array = np.array(img) / 255.0
    
    return img_array

def evaluate_controller(vae, controller, n_trials=3, max_steps=1000):
    """Evaluate the controller in the environment"""
    # Create the environment
    env = gym.make('ppaquette/DoomTakeCover-v0')
    
    # Run multiple trials and average the rewards
    rewards = []
    
    for trial in range(n_trials):
        # Reset the environment
        observation = env.reset()
        
        # Preprocess the observation
        obs = process_frame(observation)
        
        # Encode the observation
        z = vae.encode(np.array([obs]))
        
        # Initialize episode reward
        total_reward = 0
        
        for step in range(max_steps):
            # Get action probabilities
            action_probs = controller.get_action(z)[0]
            
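            # Sample action (cast to float64 and renormalize so that
            # np.random.choice accepts the float32 softmax output)
            action_probs = np.asarray(action_probs, dtype=np.float64)
            action_probs /= action_probs.sum()
            action_idx = np.random.choice(len(action_probs), p=action_probs)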
            
            # Step the environment
            next_observation, reward, done, info = env.step(action_idx)
            
            # Update total reward
            total_reward += reward
            
            if done:
                break
                
            # Preprocess the next observation
            next_obs = process_frame(next_observation)
            
            # Encode the next observation
            z = vae.encode(np.array([next_obs]))
            
        rewards.append(total_reward)
        
    env.close()
    
    # Return the average reward
    return np.mean(rewards)

def train_controller_cma(vae_path="vae/vae.json", model_dir="controller", z_size=64, action_size=8, 
                         hidden_units=40, sigma_init=0.5, popsize=64, num_generations=100):
    """Train the controller using CMA-ES"""
    # Create model directory
    os.makedirs(model_dir, exist_ok=True)
    
    # Load the VAE
    vae = ConvVAE(z_size=z_size, batch_size=1, is_training=False)
    vae.load_json(vae_path)
    
    # Create the controller
    controller = Controller(z_size=z_size, action_size=action_size, hidden_units=hidden_units)
    
    # Get the initial parameters
    init_params = controller.get_model_params()
    param_count = sum(p.size for p in init_params)
    
    # Flatten the parameters
    init_params_flat = np.concatenate([p.flatten() for p in init_params])
    
    # Initialize CMA-ES
    es = cma.CMAEvolutionStrategy(init_params_flat, sigma_init, 
                                   {'popsize': popsize, 'maxiter': num_generations})
    
    # Training loop
    best_reward = -np.inf
    best_params = None
    
    for generation in range(num_generations):
        # Sample solutions
        solutions = es.ask()
        
        # Evaluate solutions
        rewards = []
        
        for i, solution in enumerate(solutions):
            # Reshape the solution
            solution_params = []
            start_idx = 0
            
            for p in init_params:
                param_size = p.size
                solution_params.append(solution[start_idx:start_idx+param_size].reshape(p.shape))
                start_idx += param_size
                
            # Set the controller parameters
            controller.set_model_params(solution_params)
            
            # Evaluate the controller
            reward = evaluate_controller(vae, controller)
            rewards.append(-reward)  # CMA-ES minimizes, so we negate the reward
            
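            # str.format is used because f-strings require Python 3.6+, and this setup targets Python 3.5
            print("Generation {}/{}, Solution {}/{}, Reward: {:.2f}".format(
                generation+1, num_generations, i+1, popsize, -rewards[-1]))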
            
        # Update CMA-ES
        es.tell(solutions, rewards)
        
        # Check for new best
        best_idx = np.argmin(rewards)
        if -rewards[best_idx] > best_reward:
            best_reward = -rewards[best_idx]
            best_solution = solutions[best_idx]
            
            # Reshape the best solution
            best_params = []
            start_idx = 0
            
            for p in init_params:
                param_size = p.size
                best_params.append(best_solution[start_idx:start_idx+param_size].reshape(p.shape))
                start_idx += param_size
                
            # Save the best parameters
            controller.set_model_params(best_params)
            controller.save_json(os.path.join(model_dir, "controller.json"))
            
            print("New best reward: {:.2f}".format(best_reward))
            
        # Print generation stats
        print("Generation {}/{}, Mean Reward: {:.2f}, Best Reward: {:.2f}".format(
            generation+1, num_generations, -np.mean(rewards), best_reward))
        
        # Save generation stats
        with open(os.path.join(model_dir, "gen_{}.json".format(generation+1)), 'w') as f:
            json.dump({
                'generation': generation+1,
                'mean_reward': -np.mean(rewards),
                'best_reward': best_reward
            }, f)
            
    print("Controller training completed.")
    vae.close_sess()
    controller.close_sess()

## 11. Running the Full Training Pipeline

Now that we have implemented all components, let's run the full training pipeline to train our World Models for VizDoom.
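
Before launching Step 5, it can help to estimate how many environment episodes the controller search will consume. The sketch below uses the defaults that appear in `train_controller_cma` and `evaluate_controller` (population of 64, 3 trials per candidate, 100 generations, at most 1000 steps per episode). The helper name `cma_budget` is hypothetical and only illustrates the arithmetic.

```python
def cma_budget(popsize=64, n_trials=3, num_generations=100, max_steps=1000):
    """Return (episodes, max_env_steps) for a CMA-ES run with these settings."""
    # Every candidate in every generation is scored on n_trials episodes
    episodes = popsize * n_trials * num_generations
    # Upper bound only: episodes usually end before max_steps is reached
    max_env_steps = episodes * max_steps
    return episodes, max_env_steps

episodes, max_env_steps = cma_budget()
print("Episodes:", episodes)
print("Upper bound on environment steps:", max_env_steps)
```

Since the candidates are evaluated one after another in the training loop above, the episode count is a rough proxy for wall-clock cost.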

In [None]:
# Full training pipeline
# In the Docker container, this would be executed directly

import os
import numpy as np
import json
import matplotlib.pyplot as plt

# Parameters
data_dir = "data"
vae_dir = "vae"
rnn_dir = "rnn"
controller_dir = "controller"
z_size = 64
action_size = 8
hidden_units = 40
batch_size = 100
num_vae_epochs = 10
num_rnn_epochs = 10
num_controller_generations = 100

# 1. Collect data for VAE training
print("Step 1: Collecting data for VAE training...")
collect_data(num_episodes=100, max_steps=1000, data_dir=data_dir)

# 2. Train VAE
print("\nStep 2: Training VAE...")
train_vae(data_dir=data_dir, model_dir=vae_dir, batch_size=batch_size, num_epochs=num_vae_epochs, z_size=z_size)

# 3. Generate data for RNN training
print("\nStep 3: Generating data for RNN training...")
vae = ConvVAE(z_size=z_size, batch_size=1, is_training=False)
vae.load_json(os.path.join(vae_dir, "vae.json"))
z_series, action_series = generate_rnn_data(vae, num_episodes=100, max_steps=1000, data_dir="rnn_data")
vae.close_sess()

# 4. Train RNN
print("\nStep 4: Training RNN...")
train_rnn(z_series, action_series, model_dir=rnn_dir, batch_size=batch_size, num_epochs=num_rnn_epochs, 
          z_size=z_size, action_size=action_size)

# 5. Train Controller with CMA-ES
print("\nStep 5: Training Controller with CMA-ES...")
train_controller_cma(vae_path=os.path.join(vae_dir, "vae.json"), model_dir=controller_dir, z_size=z_size, 
                     action_size=action_size, hidden_units=hidden_units, num_generations=num_controller_generations)

print("\nTraining pipeline completed!")

## 12. Visualizing Training Progress

Let's visualize the training progress of the CMA-ES algorithm for the Controller.
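
The per-generation mean reward can be noisy, so a smoothed curve is sometimes easier to read. As a minimal sketch, assuming a simple moving average is good enough for inspection (the `moving_average` helper is hypothetical and not part of the training code):

```python
import numpy as np

def moving_average(values, window=5):
    """Smooth a 1-D sequence with a simple moving average ('valid' mode)."""
    values = np.asarray(values, dtype=float)
    if window < 1:
        raise ValueError("window must be >= 1")
    if len(values) < window:
        return values  # too short to smooth; return unchanged
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(smoothed)
```

The smoothed series has `len(values) - window + 1` points, so it would be plotted against `generations[window-1:]`.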

In [None]:
# Visualizing training progress
# In the Docker container, this would be executed directly

import os
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def plot_training_progress(controller_dir="controller", num_generations=100):
    """Plot the training progress of the Controller"""
    # Collect generation stats
    generations = []
    mean_rewards = []
    best_rewards = []
    
    for gen in range(1, num_generations+1):
        gen_file = os.path.join(controller_dir, "gen_{}.json".format(gen))
        
        if os.path.exists(gen_file):
            with open(gen_file, 'r') as f:
                stats = json.load(f)
                
            generations.append(stats['generation'])
            mean_rewards.append(stats['mean_reward'])
            best_rewards.append(stats['best_reward'])
    
    # Plot the progress
    plt.figure(figsize=(12, 6))
    plt.plot(generations, mean_rewards, 'b-', label='Mean Reward')
    plt.plot(generations, best_rewards, 'r-', label='Best Reward')
    plt.xlabel('Generation')
    plt.ylabel('Reward')
    plt.title('CMA-ES Training Progress')
    plt.legend()
    plt.grid(True)
    plt.savefig(os.path.join(controller_dir, 'training_progress.png'))
    plt.show()

# Plot the training progress
plot_training_progress(controller_dir=controller_dir, num_generations=num_controller_generations)

## 13. Testing the Trained Model

Finally, let's test our trained World Models in the VizDoom environment.
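
The test loop below samples an action from the controller's output at every step, so rewards vary between runs even for a fixed controller. As a hedged sketch, a small helper could support both sampling and greedy (argmax) selection. `select_action` is a hypothetical name, not a method of the `Controller` class, and it assumes `get_action` yields a vector of non-negative probabilities.

```python
import numpy as np

def select_action(action_probs, greedy=False, rng=np.random):
    """Pick an action index from a probability vector.

    greedy=True takes the most probable action; otherwise one is sampled.
    """
    probs = np.asarray(action_probs, dtype=np.float64)
    # Renormalize so small rounding drift does not break the sampling call
    probs = probs / probs.sum()
    if greedy:
        return int(np.argmax(probs))
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.1, 0.7, 0.2], dtype=np.float32)
print(select_action(probs, greedy=True))
```

Greedy selection gives repeatable evaluations, while sampling matches how the controller was scored during training in this notebook.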

In [None]:
# Testing the trained model
# In the Docker container, this would be executed directly

import os
import numpy as np
import tensorflow as tf
import json
import gym
import ppaquette_gym_doom
import matplotlib.pyplot as plt
from PIL import Image
%matplotlib inline

def test_model(vae_path="vae/vae.json", controller_path="controller/controller.json", 
               z_size=64, action_size=8, hidden_units=40, num_episodes=5, max_steps=1000, 
               render=True):
    """Test the trained model in the environment"""
    # Load the VAE
    vae = ConvVAE(z_size=z_size, batch_size=1, is_training=False)
    vae.load_json(vae_path)
    
    # Load the Controller
    controller = Controller(z_size=z_size, action_size=action_size, hidden_units=hidden_units)
    controller.load_json(controller_path)
    
    # Create the environment
    env = gym.make('ppaquette/DoomTakeCover-v0')
    
    # Run episodes
    episode_rewards = []
    
    for episode in range(num_episodes):
        # Reset the environment
        observation = env.reset()
        
        # Initialize episode data
        total_reward = 0
        frames = []
        
        for step in range(max_steps):
            # Preprocess the observation
            obs = process_frame(observation)
            
            # Save the frame if rendering is enabled
            if render:
                frames.append(observation)
                
            # Encode the observation
            z = vae.encode(np.array([obs]))
            
            # Get action probabilities
            action_probs = controller.get_action(z)[0]
            
            # Sample action
            action_idx = np.random.choice(len(action_probs), p=action_probs)
            
            # Step the environment
            observation, reward, done, info = env.step(action_idx)
            
            # Update total reward
            total_reward += reward
            
            if done:
                break
                
        episode_rewards.append(total_reward)
        print("Episode {}/{}, Reward: {:.2f}".format(episode+1, num_episodes, total_reward))
        
        # Display a few frames if rendering is enabled
        if render and len(frames) > 0:
            # Display a subset of frames
            n_frames = min(5, len(frames))
            # squeeze=False keeps axes 2-D, so indexing also works when n_frames == 1
            fig, axes = plt.subplots(1, n_frames, figsize=(20, 4), squeeze=False)
            
            for i, frame_idx in enumerate(np.linspace(0, len(frames)-1, n_frames).astype(int)):
                axes[0, i].imshow(frames[frame_idx])
                axes[0, i].axis('off')
                axes[0, i].set_title("Frame {}".format(frame_idx))
                
            plt.suptitle("Episode {}, Reward: {:.2f}".format(episode+1, total_reward))
            plt.show()
    
    env.close()
    vae.close_sess()
    controller.close_sess()
    
    # Print overall stats
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    print("\nTest completed, Mean Reward: {:.2f} ± {:.2f}".format(mean_reward, std_reward))
    
    return episode_rewards

# Test the trained model
test_model(vae_path=os.path.join(vae_dir, "vae.json"), 
           controller_path=os.path.join(controller_dir, "controller.json"), 
           z_size=z_size, action_size=action_size, hidden_units=hidden_units)

## 14. Conclusion and Next Steps

We have successfully implemented and trained a World Models architecture for the VizDoom environment. The key components are:

1. **VAE (Vision)**: Compresses high-dimensional visual input into a low-dimensional latent representation.
2. **MDN-RNN (Memory)**: Predicts future latent states based on current states and actions.
3. **Controller**: Maps latent states to actions using a simple neural network trained with CMA-ES.

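This implementation follows the architecture described in the World Models paper, with some simplifications for clarity. Most notably, the controller here acts on the VAE latent vector alone, and it is trained in the real environment rather than inside the RNN's dream.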
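
For comparison, the controller in the World Models paper is a single linear layer over the concatenation of the VAE latent z and the MDN-RNN hidden state h. A minimal sketch of that mapping follows. The hidden size of 256 is an assumption chosen purely for illustration, and `linear_controller` and the array names are hypothetical, not part of this notebook's `Controller` class.

```python
import numpy as np

def linear_controller(z, h, W, b):
    """Compute a = W [z; h] + b for a single time step."""
    features = np.concatenate([z, h])  # shape: (z_size + h_size,)
    return np.dot(W, features) + b

z_size, h_size, action_size = 64, 256, 8  # h_size is an assumed value
rng = np.random.RandomState(0)
z = rng.randn(z_size)
h = rng.randn(h_size)
W = rng.randn(action_size, z_size + h_size) * 0.01
b = np.zeros(action_size)

a = linear_controller(z, h, W, b)
print(a.shape)
print("Parameters to evolve:", W.size + b.size)
```

Keeping the controller this small is what makes a gradient-free optimizer such as CMA-ES practical.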

### Next Steps

There are several ways to extend and improve this implementation:

1. **Temperature Parameter**: Add a temperature parameter to the MDN-RNN to control the stochasticity of the predictions.
2. **Dream Training**: Train the controller purely in the dream environment (using only the RNN for state predictions).
3. **Longer Training**: Train the models for more epochs/generations to improve performance.
4. **Hyperparameter Tuning**: Experiment with different hyperparameters for each component.
5. **Different Environments**: Apply the same architecture to other VizDoom scenarios or different environments.
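
The temperature idea in item 1 can be sketched for a single one-dimensional Gaussian mixture. This is one common formulation, assumed here rather than taken from this notebook's `MDNRNN` class: the mixture logits are divided by the temperature and the chosen component's standard deviation is scaled by its square root. The function name and arguments are hypothetical.

```python
import numpy as np

def sample_mdn(logmix, mean, logstd, temperature=1.0, rng=np.random):
    """Sample one value from a 1-D Gaussian mixture with a temperature.

    logmix, mean, logstd: arrays of shape (num_mixtures,)
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Higher temperature flattens the mixture weights, lower sharpens them
    scaled = np.asarray(logmix, dtype=np.float64) / temperature
    scaled -= scaled.max()  # numerical stability for the softmax
    weights = np.exp(scaled)
    weights /= weights.sum()
    k = rng.choice(len(weights), p=weights)
    # Scale the chosen component's standard deviation by sqrt(temperature)
    std = np.exp(logstd[k]) * np.sqrt(temperature)
    return mean[k] + std * rng.randn()

logmix = np.log(np.array([0.5, 0.3, 0.2]))
mean = np.array([-1.0, 0.0, 1.0])
logstd = np.log(np.array([0.1, 0.1, 0.1]))
print(sample_mdn(logmix, mean, logstd, temperature=1.15))
```

With a temperature above 1 the samples become more varied, and with a temperature below 1 they concentrate on the most likely component.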

### References

- Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
- [World Models Website](https://worldmodels.github.io/)

# Copying Trained Models to the tf_models Directory

Now that we have the VAE, RNN, and initial_z models trained, we need to copy these files to the `tf_models` subdirectory in the `doomrnn` folder. This will allow us to use these trained models for the CMA-ES evolutionary training process.

In [None]:
import os
import shutil

# Create the target directory if it doesn't exist
os.makedirs('doomrnn/tf_models', exist_ok=True)

# Copy the model files
source_files = [
    'vae/vae.json',
    'initial_z/initial_z.json',
    'rnn/rnn.json'
]

for source_file in source_files:
    source_path = os.path.join('doomrnn', source_file)
    target_path = os.path.join('doomrnn/tf_models', os.path.basename(source_file))
    
    if os.path.exists(source_path):
        shutil.copy2(source_path, target_path)
        print("Copied {} to {}".format(source_path, target_path))
    else:
        print("Warning: Source file {} does not exist".format(source_path))

# Verify the files were copied
print("\nFiles in tf_models directory:")
for file in os.listdir('doomrnn/tf_models'):
    if file.endswith('.json'):
        print(" - {}".format(file))

# Updating Git Repository

Now let's update our git repository with the new model files. We'll add the model files to the repository and commit the changes.
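
The cell below calls `os.system`, which ignores failures and, inside Jupyter, typically sends the command's output to the terminal that launched the server rather than to the cell. As a hedged alternative, a small wrapper around `subprocess.run` can capture the output and raise on a non-zero exit code. `run_checked` is a hypothetical helper name, and the demonstration command is a harmless stand-in for the git calls.

```python
import subprocess
import sys

def run_checked(cmd):
    """Run a command (list of arguments), print its output, raise on failure."""
    print("$ " + " ".join(cmd))
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            universal_newlines=True)
    print(result.stdout, end="")
    result.check_returncode()  # raises CalledProcessError on a non-zero exit code
    return result.stdout

# Harmless demonstration that works without a git repository
out = run_checked([sys.executable, "-c", "print('ok')"])

# Inside the repository, the same helper could wrap the git steps, for example:
# run_checked(["git", "status"])
# run_checked(["git", "add", "doomrnn/tf_models/vae.json"])
```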

In [None]:
# Using os.system to run git commands
import os

# Check git status
print("Current git status:")
os.system("git status")

# Add the model files to git
print("\nAdding model files to git:")
os.system("git add doomrnn/tf_models/*.json")

# Commit the changes
print("\nCommitting changes:")
os.system('git commit -m "Add trained VAE, RNN, and initial_z models for doom"')

# Show git status after commit
print("\nGit status after commit:")
os.system("git status")

# Optional: Push to your fork
# Note: Uncomment the line below if you want to push to your fork
# os.system("git push origin master")

# CMA-ES Training Process

After copying the model files and updating the git repository, we need to run the CMA-ES training process on a 64-core CPU instance. Here are the steps:

1. **Start a 64-core CPU instance** (if you haven't already)
2. **Log into the machine**
3. **Navigate to the doomrnn directory**
4. **Run the training script**: `python train.py`

The training process will continue until you stop it with Ctrl+C. It's recommended to run for about 200 generations (4-5 hours), which should be sufficient to get good results.

While training is running, you can monitor the progress using the `plot_training_progress.ipynb` notebook, which loads and visualizes the log files being generated.
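
The guideline of about 200 generations in 4-5 hours implies a per-generation time, which is handy for checking whether a run is on pace. A small worked calculation follows. The helper name is hypothetical, and the figures are just the ones quoted above, not measurements.

```python
def seconds_per_generation(total_hours, generations):
    """Average wall-clock seconds per generation."""
    return total_hours * 3600.0 / generations

low = seconds_per_generation(4, 200)
high = seconds_per_generation(5, 200)
print("Expected pace: {:.0f} to {:.0f} seconds per generation".format(low, high))
```

A run that is much slower than this range may not be using all of the instance's cores.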

In [None]:
# When on the 64-core CPU instance, run this command in the terminal:
# cd doomrnn
# python train.py

# For reference, here's what the command would look like if executed from Python:
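import subprocess

# Don't run this cell directly unless you're on the 64-core CPU instance
# This is just for reference
def run_cma_es_training():
    # cwd runs train.py from inside doomrnn without changing the notebook's
    # own working directory, so relative paths in later cells keep working
    subprocess.run(['python', 'train.py'], cwd='doomrnn')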
    
# Note: The training will run until you manually stop it with Ctrl+C
# It's recommended to run for ~200 generations (4-5 hours)

# Monitoring Training Progress

While the CMA-ES training is running, you can monitor its progress using the `plot_training_progress.ipynb` notebook. This notebook will load the log files being generated and create visualizations of the training progress.

Let's prepare a cell that you can use to open this notebook:

In [None]:
# To open the training progress notebook, run:
import os
import subprocess
import sys

notebook_path = os.path.join('doomrnn', 'plot_training_progress.ipynb')

if os.path.exists(notebook_path):
    print("Opening notebook at: {}".format(notebook_path))
    if sys.platform == 'win32':
        os.startfile(notebook_path)  # Windows-specific
    elif sys.platform == 'darwin':  # macOS
        subprocess.run(['open', notebook_path])
    else:  # Linux
        subprocess.run(['xdg-open', notebook_path])
else:
    print("Warning: Notebook not found at {}".format(notebook_path))

# Saving Training Results to Git

After you've completed the training (recommended to run for about 200 generations or 4-5 hours), you'll need to add the log files to your git repository.

These log files are stored in the `doomrnn/log` directory with `.json` extensions. They contain the training history and the best models found during training.
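
Before committing, it may be worth checking which log files exist and how large they are. The sketch below assumes only what is stated above, namely that the logs are `.json` files in `doomrnn/log`. The helper `list_json_logs` is hypothetical.

```python
import glob
import os

def list_json_logs(log_dir="doomrnn/log"):
    """Return (path, size_in_bytes) for each .json file in log_dir, sorted by name."""
    paths = sorted(glob.glob(os.path.join(log_dir, "*.json")))
    return [(p, os.path.getsize(p)) for p in paths]

logs = list_json_logs()
total = sum(size for _, size in logs)
print("Found {} log file(s), {:.1f} KiB in total".format(len(logs), total / 1024.0))
for path, size in logs:
    print(" - {} ({} bytes)".format(path, size))
```

If the directory is missing or empty, the helper returns an empty list, which is a cue to check that training actually wrote its logs.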

Here's how to add them to your git repository:

In [None]:
# Using os.system to run git commands for saving log files
import os

# After training is complete, run these commands:

# Check git status
print("Current git status:")
os.system("git status")

# Add the log files to git
print("\nAdding log files to git:")
os.system("git add doomrnn/log/*.json")

# Commit the changes
print("\nCommitting changes:")
os.system('git commit -m "Add CMA-ES training logs for doom"')

# Show git status after commit
print("\nGit status after commit:")
os.system("git status")

# Optional: Push to your fork
# Note: Uncomment the line below if you want to push to your fork
# os.system("git push origin master")

# Shut down the instance after you're done
print("\nRemember to shut down the instance after completing these steps!")

# Summary of Steps

Here's a summary of the steps we've covered:

1. **Copy trained models**:
   - Copy VAE, RNN, and initial_z models to the tf_models directory
   - Update git repository with these files

2. **Run CMA-ES training**:
   - Start a 64-core CPU instance
   - Navigate to doomrnn directory
   - Run `python train.py`
   - Monitor progress with the plotting notebook
   - Stop after ~200 generations (4-5 hours)

3. **Save training results**:
   - Add log files to git repository
   - Commit changes
   - Push to your fork (optional)
   - Shut down the instance

These steps complete the training pipeline for the World Models approach in the VizDoom environment.