# Conway's Game of Life controlled with Reinforcement Learning
*Lucas Wilson, Dec 1, 2018*

# Introduction

This project explores how easily Conway's Game of Life (CGOL) can be controlled. This experimentation could have value in a couple of different areas, since CGOL is a type of cellular automaton and has emergent properties.
* Controlling the system demonstrates an understanding of how the underlying mechanics (individual cell manipulation) affect large, emergent behaviors (cell population and structures). Emergent behavior is largely uncorrelated with underlying states, so such control is not expected.
* If CGOL is representative of processes found in nature, then controlling the behavior of CGOL implies a potential to control those processes. With deep learning, this may be possible.

To analyze this, I created the following:
* A headless model of CGOL (used by the GUI) which can quickly simulate CGOL independent of the GUI
* A Conway's Game of Life GUI for generating, viewing, and simulating Conway's Game of Life simulations.
* A restricted feed forward neural network package made with PyTorch.
* A Problem class definition which describes a problem for AI agents to interact with.
* A general reinforcement agent to control these Problem objects

I represented the CGOL problem with my Problem class. My reinforcement agent interacts with problems that follow the specified API, so it was then able to interact with my CGOL model. I varied several aspects of the problem: which actions were available from specific states, how the initial state was created, what constituted a reward (usually maximizing cell population), and how the size of the problem affected the RL agent's performance. I used an agent that makes random actions as a control.

### Findings
* The RL agent favored passive moves over exploring the system: its strategy was to create still life structures and then stop interacting with the system. If forced to be more interactive (reward incentives or limiting passive actions), it performed better.
* If the initial state was always the same, it was very good at memorizing paths to maximize reward. However, if it was random, it performed worse. With random states, it would try to converge to a still life, and then stop interacting with the system. If it failed to converge, it performed as well as the control. This implies a lack of understanding of the system but instead a memorization of paths to still life.
* For smaller cell grids, RL is extremely effective in maximizing the population of cells. That is because it is capable of memorizing the state space as opposed to learning it. For large state spaces, random action (control) outperformed the RL agent in population, but the RL agent was able to keep cells alive for longer (generations vs boom and bust populations).
* The RL agent is capable of outperforming a random agent, but it lacks a true understanding of the system.

# Methods

I created a Python package called `cgolai` to assist with this experimentation. The full source code for this is on [my github](https://github.com/larkwt96/c440), but all files needed are included in the submitted tar file.

###  A headless model of CGOL (used by the GUI) which can quickly simulate CGOL independent of the GUI

I built a headless model of CGOL so that I could run CGOL simulations quickly without the requirement of a GUI. I wrote the code myself. It's very efficient and designed specifically to work well with other components (the GUI and CgolProblem class). It records past cell states as well as which cells were modified before stepping to the next state. It's also capable of saving these records to a file so that I can load them back up later.

###  A Conway's Game of Life GUI for generating, viewing, and simulating Conway's Game of Life simulations.

This component has little to do with AI, so you can skip it unless you're interested. I had fun with this since I haven't built many GUIs that weren't web pages. I struggled a lot to learn how pygame graphics work and how to design the view/controller and model dynamic. The only reason I built this was to be able to easily view and interact with CGOL simulations (which the model above generates).

The GUI was built with pygame and generally follows the MVC pattern. Design and implementation were done completely by me with guidance from the pygame website docs and things I learned from CS 414 (OO software design). MVC was a logical decision since I already had the requirement of the CGOL model being independent, but the view and controller are very closely integrated (but still separate) since it's a small program.

The software has lots of functionalities that let you save, load and draw your own CGOL initial conditions. You can let them play or you can move step by step (with rewinding also as a possibility). It's also easy to run:
`python3 -m cgolai.cgol -f saved_simulation.dat`

(I'm most proud of the game loop structure. Most guides have a loop which checks for events, but this uses CPU time to constantly check. I designed it to be completely event driven so that it will wait until an event occurs and act on it. However, when events occur very quickly, the thread swapping degraded performance. To solve this, I had it process all pending events whenever a single event was identified. Then, if lots of events were generated at once (a mouse drag or high simulation rate, for example), it would process all of them while only blocking on one. While my simulation is paused, the CPU is completely idle.)
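The "block on one event, then drain the rest" idea can be sketched as follows. This is only an illustration that uses Python's standard `queue` module as a stand-in for the pygame event queue; the function name `drain_events` is hypothetical and not taken from the GUI code. In pygame, the analogous calls would be `pygame.event.wait()` (block for one event) followed by `pygame.event.get()` (return everything already queued).

```python
import queue


def drain_events(events):
    """Block until one event arrives, then collect every event already queued."""
    batch = [events.get()]  # blocks here; no busy-waiting while idle
    while True:
        try:
            batch.append(events.get_nowait())  # non-blocking drain of the backlog
        except queue.Empty:
            return batch


# Stand-in for a burst of GUI events (e.g. a mouse drag)
events = queue.Queue()
for name in ("mouse_down", "mouse_move", "mouse_up"):
    events.put(name)

handled = drain_events(events)
print(handled)
```

One blocking call per burst keeps the loop idle when nothing happens while avoiding a thread wake-up for every individual event.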

This is the initial state. You can see dots in the middle. These symbolize flipped cells.

<img src="img/demo1.png" style="display: inline-block;" width="400px" height="600px"/>

Here is the state a few iterations later:

<img src="img/demo2.png" style="display: inline-block;" width="200px" height="300px"/>
<img src="img/demo3.png" style="display: inline-block;" width="200px" height="300px"/>
<img src="img/demo4.png" style="display: inline-block;" width="200px" height="300px"/>

And about 100 iterations later:

<img src="img/demo5.png" style="display: inline-block;" width="200px" height="300px"/>

This is a state filled with still life and oscillators. The cells will stay alive forever. My RL agent would converge to these and then avoid changing cells near any of the structures.

###  (Failed) Personal implementation of Neural Network

Originally, I planned to implement a neural network manually, but it didn't work. I did the math to derive the gradient of the weights for stochastic gradient descent, and the math is correct, but the implementation doesn't work well.

**Problems**
* NN doesn't fit well compared to given NN code from class and compared to the pytorch NN I built.
* Weights and gradients explode

**Why?**

The math was for the weights gradient of a single point. I perform the optimization step per point for a number of iterations. Fitting subsequent points overwrites the learning of the previous point.

**Solution**

The solution is batching, which uses matrix multiplication, but I haven't done this since I decided to just use PyTorch.

###  A restricted feed forward neural network package made with PyTorch.

I modeled the design using the documentation from [their website](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html). Instead of building my own `torch.nn.Module`, I used `torch.nn.Sequential` since it was exactly what I needed (a restricted feed forward neural network). I created a wrapper class called `cgolai.ai.NNTorch` for this module which builds the network as well as trains it.

###  A Problem class definition which describes a problem for AI agents to interact with.

I created a problem API:
* A goal: maximize or minimize reward.
* An initial state. `reset()` will initialize the state.
* A set of available actions. `actions()` will return a list of actions
* A means of performing actions. `do(action)` will perform the chosen action and return the state-action pair and reward
* A definition of the state-action pair for SARSA. `key(action)` will return a state-action pair.
* A means of checking for terminal states. `is_terminal()` will check for terminal states.

The RL agent interacts with these objects. I separated this functionality so that I could insert whatever problem definition I want. This allows me to define other problems (such as Hanoi for testing) as well as variable problem definitions where I can limit the type of actions the RL agent can make.

###  A general reinforcement agent to control these Problem objects

The neural network Q-function uses my PyTorch network, so back propagation and parameter optimization are handled automatically. I use ReLU as the activation function and Adam as the optimization technique.

The Reinforcement Learning Agent was designed to model the code given in lecture 21 (RL with Neural Network as Q-function). The SARSA tuple comes from the problem object used by the reinforcement agent. The RL agent runs through the problem several times (each run ends by either reaching a maximum allowed number of steps or reaching a goal state). It collects Q-values and rewards. Once several runs have been made, the Q function is updated in batches.

$Q(s_i, a_i) \leftarrow r + Q(s_{i+1}, a_{i+1})$

Here is the general process of a single batch (some lines have been removed for readability).
```python
# reset the problem
steps = 0
self._problem.reset()

# get first state-action key and reward
action, q_val = self.choose_best_action(explore=True)
key, reward = self._problem.do(action)

# while there are still more state-action keys, explore them
while not self._problem.is_terminal():

    # step
    steps += 1

    # get expected Q-value for next action
    action, q_val = self.choose_best_action(explore=True)

    # collect sample: Q(s_i, a_i) = r + Q(s_i+1, a_i+1)
    samples.append([*key, reward, q_val])

    # get next key and reward
    key, reward = self._problem.do(action)
    replay_samples.append(key)
```
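As a rough sketch of what happens to the collected samples afterwards, each sample can be split into a network input (the state-action key) and a regression target following the update rule above. The helper name `build_training_batch` and the example numbers are hypothetical; only the `[*key, reward, q_val]` sample layout comes from the loop above.

```python
def build_training_batch(samples):
    """Split SARSA samples into network inputs and regression targets.

    Each sample is laid out as [*key, reward, next_q], so the target for
    Q(s_i, a_i) is reward + next_q.
    """
    inputs, targets = [], []
    for *key, reward, next_q in samples:
        inputs.append(key)
        targets.append(reward + next_q)
    return inputs, targets


# Two made-up samples with a 4-element state-action key each
samples = [
    [0, 1, 1, 0, 2.0, 0.5],
    [1, 1, 0, 0, 3.0, 1.0],
]
inputs, targets = build_training_batch(samples)
print(inputs)
print(targets)
```

The resulting `inputs` and `targets` are what a batched fit of the Q-network would consume.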

### Experiment Results

More details for experiments are in the [Experimentation](#Experimentation) section.

#### Experiment 1
* The optimal solution is to find a means of generating still life and then make no actions on the system.

#### Experiment 2
* Without the idle action, the RL agent performed very similarly, since it often found a way to choose cells which don't affect the generated still life structures.

#### Experiment 3
* Removing the ability to idle actually increases the performance of the RL agent (possibly due to better forced exploration).
* Randomness hurts the performance as expected, but the RL agent is still capable of guiding the cells into still life structures when it recognizes the state.

#### Experiment 4
* I suspect that this factor doesn't have much of an effect on RL since the state deviates from the initial state so quickly.

#### Experiment 5
* RL is memorizing the small state space and doesn't perform well in large state spaces.

# Pictures

Here are sequences from a 10x10 trial.

Here is the beginning of the trial:

<img src="img/test_start6.png" style="display: inline;"/>
<img src="img/test_start7.png" style="display: inline;"/>
<img src="img/test_start8.png" style="display: inline;"/>

These demonstrate how the agent chooses open space. It behaved like this for about 7 iterations.

<img src="img/open_space1.png" style="display: inline;"/>
<img src="img/open_space2.png" style="display: inline;"/>
<img src="img/open_space3.png" style="display: inline;"/>
<img src="img/open_space4.png" style="display: inline;"/>
<img src="img/open_space5.png" style="display: inline;"/>

Here is a solution it converges to after the system becomes an oscillator or still life. In this case, it was an oscillator.

<img src="img/osc_soln1.png" style="display: inline;"/>
<img src="img/osc_soln2.png" style="display: inline;"/>
<img src="img/osc_soln3.png" style="display: inline;"/>
<img src="img/osc_soln4.png" style="display: inline;"/>

Here are a few more example solutions. These contain only still lifes and show no convergent behavior.

<img src="img/soln1.png" style="display: inline;"/>
<img src="img/soln2.png" style="display: inline;"/>



# Results

# Problem Definition
First, I define what a problem is so that the RL agent can use it:
    
```python
class Problem(ABC):
    def __init__(self, maximize=True):
        """ problem should be initialized to the point of where actions will return actions """
        self.maximize = maximize

    @abstractmethod
    def is_terminal(self):
        """ Returns True iff the state is a terminal state """
        pass

    @abstractmethod
    def actions(self):
        """ Returns list of possible actions """
        pass

    @abstractmethod
    def key(self, action):
        """ Return the state-action key from the current state given the action """
        pass

    @abstractmethod
    def do(self, action):
        """ Perform the specified action on current state, and returns (state-action key, reward) """
        pass

    @abstractmethod
    def reset(self):
        """ Initialize the state to the initial position """
        pass
```

Problems have:
* An initial state. `reset()` will initialize the state.
* A set of available actions. `actions()` will return a list of actions
* A means of performing actions. `do(action)` will perform the chosen action and return the state-action pair and reward
* A definition of the state-action pair for SARSA. `key(action)` will return a state-action pair.
* A means of checking for terminal states. `is_terminal()` will check for terminal states.
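To make the API concrete, here is a toy problem that follows the same method names. It is a hypothetical example (`CountdownProblem` is not part of `cgolai`), written as a plain class so the snippet stands alone; in the package it would subclass `Problem`. It assumes that `do(action)` returns the key for the state *before* the action is applied, which is how the agent loop above appears to use it.

```python
class CountdownProblem:
    """Hypothetical toy problem: drive a counter to zero in as few steps as possible."""

    def __init__(self, start=3):
        self.maximize = False  # each step costs 1, so the goal is to minimize
        self.start = start
        self.reset()

    def reset(self):
        """Initialize the state to the initial position."""
        self.value = self.start

    def is_terminal(self):
        return self.value == 0

    def actions(self):
        return [-1, +1]

    def key(self, action):
        """State-action key for the current state and the given action."""
        return (self.value, action)

    def do(self, action):
        key = self.key(action)  # assumption: key describes the pre-action state
        self.value += action
        return key, 1


problem = CountdownProblem(start=3)
total = 0
while not problem.is_terminal():
    _, reward = problem.do(-1)
    total += reward
print(total)
```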

## Problem Definition: CGOL

I experiment with a couple different problem definitions in terms of CGOL.

CGOL is a grid of "cells" defined by a set of rules:
* Cells die when over or under populated (0, 1, 4+ neighbors)
* Cells stay alive otherwise (2, 3 neighbors)
* Cells can be born when there is a specific number of cells (3 neighbors)

Terminology:
* still-life - a shape which doesn't change through iterations
* oscillator - a shape which changes shape but periodically returns to a previous iteration

Project Specific Terminology:
* maxout - This is my term: the terminal state is never reached since there is a still life and the actions being chosen are perpetuating the still life

CGOL is a type of cellular automaton that can potentially model some similar processes found in nature (Caballero). The original problem definition uses an infinite grid, but that kind of problem space is too large and complex, so I chose a fixed size for the grid. Each edge is connected to the opposite edge, which makes the grid a toroid.
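The rules and the toroidal wrap-around can be sketched in a few lines of plain Python. This is a minimal illustration, not the `cgolai` model itself; the function name `step` and the list-of-lists grid are assumptions made for the example.

```python
def step(grid):
    """Advance a toroidal CGOL grid (list of lists of 0/1) by one generation."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the 8 neighbors; the modulo wraps around the edges (toroid)
            neighbors = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            if grid[r][c] == 1:
                new_grid[r][c] = 1 if neighbors in (2, 3) else 0  # survival
            else:
                new_grid[r][c] = 1 if neighbors == 3 else 0  # birth
    return new_grid


# A 2x2 block is a still life, even on a small 4x4 toroid
block = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# A blinker is a period-2 oscillator
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(step(block) == block)
print(step(step(blinker)) == blinker)
```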

### State-Action Definition
* State: The RL agent controls CGOL on a toroid, and the state is a grid of cells that are each alive or dead (1 or 0, respectively).
* Actions: In the original problem, the only "action" is setting the initial state. Here, the action varies between experiments but is generally a selection of cells to revive or eliminate.
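One way to picture the single-cell action set is sketched below. The helper names (`available_actions`, `apply_action`) and the use of `None` for the idle action are assumptions for illustration, not the `CgolProblem` implementation. Note that a 4x4 grid with an idle action gives 17 possible actions.

```python
def available_actions(rows, cols, allow_idle=True):
    """List every single-cell flip, optionally preceded by an idle action (None)."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    return ([None] if allow_idle else []) + cells


def apply_action(grid, action):
    """Return a copy of the grid with the chosen cell flipped (idle leaves it unchanged)."""
    new_grid = [row[:] for row in grid]
    if action is not None:
        r, c = action
        new_grid[r][c] = 1 - new_grid[r][c]  # revive a dead cell or eliminate a live one
    return new_grid


grid = [[0, 1], [0, 0]]
print(len(available_actions(4, 4)))
print(apply_action(grid, (0, 1)))
print(apply_action(grid, None))
```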

# Reinforcement Learning Algorithm

In order to measure the effectiveness of the RL agent, I use an agent that makes random actions as a control.

The problem with this kind of measurement, however, is that random actions are very likely to destroy still lifes that exist in the grid. RL agents taking greedy actions will make the same move repeatedly if the state doesn't change. Therefore (and my experimentation shows this), random agents can't avoid disrupting a still life while RL agents can. This matters when reward is defined by how many iterations the agent can keep cells alive (while maximizing population).

Experiments are a series of trials where the agent controls the problem, reward is collected, and if a maximum number of steps is reached, the problem is stopped (under the assumption that it can go on forever).
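A trial with a step cutoff can be sketched like this. The names `run_trial`, `random_agent`, and the stub `NeverEnds` problem are hypothetical and only illustrate the procedure; the stub pays a reward of 8 per step so that the numbers are easy to check by hand.

```python
import random


def run_trial(problem, choose_action, cutoff=500):
    """Run one trial and return (steps, total_reward, maxout)."""
    problem.reset()
    steps, total_reward = 0, 0
    while not problem.is_terminal():
        if steps >= cutoff:
            # maxout: the cutoff was hit before any terminal state
            return steps, total_reward, True
        _, reward = problem.do(choose_action(problem))
        steps += 1
        total_reward += reward
    return steps, total_reward, False


def random_agent(problem):
    """Control agent: pick uniformly among the available actions."""
    return random.choice(problem.actions())


class NeverEnds:
    """Hypothetical stub problem that never terminates and pays 8 per step."""

    def reset(self):
        pass

    def is_terminal(self):
        return False

    def actions(self):
        return [0]

    def do(self, action):
        return (0, action), 8


result = run_trial(NeverEnds(), random_agent, cutoff=500)
print(result)  # (500, 4000, True)
```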

The neural net for the RL agent has 3 layers. The number of nodes in each layer is the number of cells on the grid, capped at 300. For example, a 4x4 grid gives 16x16x16, but a 20x20 grid gives 300x300x300, not 400x400x400.
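The sizing rule amounts to a one-line computation. The function name `hidden_layer_sizes` is hypothetical; the cap of 300 and the 3 layers come from the description above.

```python
def hidden_layer_sizes(rows, cols, layers=3, cap=300):
    """Width of each layer: the number of grid cells, capped at `cap`."""
    width = min(rows * cols, cap)
    return [width] * layers


print(hidden_layer_sizes(4, 4))
print(hidden_layer_sizes(20, 20))
```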

# Experimentation

## Experiment 1

I define the available actions to be reviving or eliminating a single cell or doing nothing. The initial state was the same every time, and grid was 4x4 (very small). Reward was how many cells were still alive in the next generation.

### Results:

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: same random flip of size 4x4 and density 0.3
    cutoff: 500 steps
    
    base
    base trial: 0; steps: 11; reward: 57; maxout: False
    base trial: 1; steps: 14; reward: 63; maxout: False
    base trial: 2; steps: 4; reward: 14; maxout: False
    base trial: 3; steps: 4; reward: 16; maxout: False
    base trial: 4; steps: 38; reward: 222; maxout: False
    total steps: 71
    total rewards: 372
    total maxouts: 0

    test
    test trial: 0; steps: 500; reward: 4000; maxout: True
    test trial: 1; steps: 500; reward: 4000; maxout: True
    test trial: 2; steps: 500; reward: 4000; maxout: True
    test trial: 3; steps: 500; reward: 4000; maxout: True
    test trial: 4; steps: 500; reward: 4000; maxout: True
    total steps: 2500
    total rewards: 20000
    total maxouts: 5
   
Because the initial state was the same every time, the RL agent did a simple search for optimal reinforcement. Without fail, the trials would all be maxed out during the testing phase. Through trial and error, it learned how to manipulate the initial state into a still life. Once a still life was generated, it would make actions that don't affect the still life (either not moving or selecting isolated cells far from the structure).

**Conclusion: The optimal solution is to find a means of generating still life and then make no actions on the system.**

## Experiment 2

Variable change: Remove idle action.

Removing the idle action forces the RL agent to interact with the system instead of idling to avoid destroying still life structures.

### Results:

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: same random flip of size 4x4 and density 0.3 no idle
    cutoff: 500 steps
    
    base
    base trial: 0; steps: 3; reward: 15; maxout: False
    base trial: 1; steps: 1; reward: 0; maxout: False
    base trial: 2; steps: 19; reward: 108; maxout: False
    base trial: 3; steps: 8; reward: 37; maxout: False
    base trial: 4; steps: 1; reward: 0; maxout: False
    total steps: 32
    total rewards: 160
    total maxouts: 0

    test
    test trial: 0; steps: 500; reward: 4000; maxout: True
    test trial: 1; steps: 500; reward: 4000; maxout: True
    test trial: 2; steps: 500; reward: 4000; maxout: True
    test trial: 3; steps: 500; reward: 4000; maxout: True
    test trial: 4; steps: 500; reward: 4000; maxout: True
    total steps: 2500
    total rewards: 20000
    total maxouts: 5
    
The random control seems to have performed worse, but the chance of it choosing the idle action was only 1/17. I don't think the change affected the control very much, so this difference is likely due to chance.

**Conclusion: Without the idle action, the RL agent performed very similarly, since it often found a way to choose cells which don't affect the generated still life structures.**

## Experiment 3

Variable change: Initial state is random

Now, the initial state is randomly generated instead of the same state over and over. This forces the agent to have to learn to understand the board instead of finding a path to a still life. I also vary no idle again.

### Results

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: static idle
    base
    total steps: 100
    total rewards: 490
    total maxouts: 0
    test
    total steps: 75
    total rewards: 390
    total maxouts: 0

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: static no idle
    base
    total steps: 145
    total rewards: 710
    total maxouts: 0
    test
    total steps: 7500
    total rewards: 40110
    total maxouts: 15

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: random idle
    base
    total steps: 85
    total rewards: 404
    total maxouts: 0
    test
    total steps: 1051
    total rewards: 7692
    total maxouts: 2

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: random no idle
    base
    total steps: 155
    total rewards: 810
    total maxouts: 0
    test
    total steps: 2060
    total rewards: 13216
    total maxouts: 4

There are several things to notice here.

### Idle vs No Idle

* Disallowing no-action moves seems to improve the agent's performance. With random states, it performed roughly twice as well.
* With non-random initial states and idling allowed, the RL agent failed to find a means to maxout. This can be credited to a bad initial state that's very difficult (or impossible) to foster into emergent structures.
* I wonder if this is due to better exploration since the agent is forced to interact with the system and break still life structures.

**Conclusion: Removing the ability to idle actually increases the performance of the RL agent (possibly due to better forced exploration).**

### Random initial states vs Same initial states

* There is variability in the testing phase since the initial conditions are no longer static. This demonstrates that the RL agent is being forced to evaluate a path instead of using the one it memorized as good.
* Randomness had no effect on the performance of the control. This makes sense since the initial state has no effect on how the random agent behaves. A new random state vs the same random state is effectively the same.
* The RL agent performed much worse with randomness. This makes sense since it's being forced to understand the system, but it's interesting that it still performs better than the control and is capable of achieving maxout. I think that it randomly encounters a state that it recognizes from its training and is able to control this state into a still life, where it then chooses effectively idle actions.

**Conclusion: Randomness hurts the performance as expected, but the RL agent is still capable of guiding the cells into still life structures when it recognizes the state.**


## Experiment 4

Variable change: Initial density (the chance of each cell initially being alive)

### Results

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: new random flip of size 5x5 and density 0.3 no idle
    base
    total steps: 216
    total rewards: 1457
    total maxouts: 0
    test
    total steps: 3219
    total rewards: 21141
    total maxouts: 6

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: new random flip of size 5x5 and density 0.5 no idle
    base
    total steps: 323
    total rewards: 2490
    total maxouts: 0
    test
    total steps: 1878
    total rewards: 9012
    total maxouts: 3

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: new random flip of size 5x5 and density 0.7 no idle
    base
    total steps: 441
    total rewards: 3476
    total maxouts: 0
    test
    total steps: 2267
    total rewards: 13397
    total maxouts: 4

The performance of the control increased with density, and the RL agent has too much variability to tell. It seems to have performed well for lower and higher density but not very well for half alive / half dead.

**Conclusion: I suspect that this factor doesn't have much of an effect on RL since the state deviates from the initial state so quickly.**

## Experiment 5

Variable change: Problem (grid) size

### Results

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: new random flip of size 4x4 and density 0.3
    base
    total steps: 35
    total rewards: 176
    total maxouts: 0

    test
    total steps: 1015
    total rewards: 6902
    total maxouts: 2

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: new random flip of size 10x10 and density 0.3
    base
    total steps: 450
    total rewards: 8937
    total maxouts: 0

    test
    total steps: 1688
    total rewards: 15531
    total maxouts: 3

    -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    Title: new random flip of size 20x20 and density 0.3
    base
    total steps: 2116
    total rewards: 87956
    total maxouts: 3

    test
    total steps: 2500
    total rewards: 62130
    total maxouts: 5

For smaller cell grids, RL is extremely effective in maximizing the population of cells. That is because it is capable of memorizing the state space as opposed to learning it. For large state spaces, random action (control) outperformed the RL agent in population, but the RL agent was able to keep cells alive for longer (generations vs boom and bust populations).

**Conclusion: RL is memorizing the small state space and doesn't perform well in large state spaces.**

# Conclusions

### What I Learned
#### About Conway's Game of Life and applied Reinforcement Learning
* The RL agent favored passive moves over exploring the system: its strategy was to create still life structures and then stop interacting with the system. If forced to be more interactive (reward incentives or limiting passive actions), it performed better.
* If the initial state was always the same, it was very good at memorizing paths to maximize reward. However, if it was random, it performed worse. With random states, it would try to converge to a still life, and then stop interacting with the system. If it failed to converge, it performed as well as the control. This implies a lack of understanding of the system but instead a memorization of paths to still life.
* For smaller cell grids, RL is extremely effective in maximizing the population of cells. That is because it is capable of memorizing the state space as opposed to learning it. For large state spaces, random action (control) outperformed the RL agent in population, but the RL agent was able to keep cells alive for longer (generations vs boom and bust populations).
* The RL agent is capable of outperforming a random agent, but it lacks a true understanding of the system.

#### About Programming and AI

Here are some more general topics that I understand much more now:
* Back propagation and minimization problems
* The general SARSA Neural Network Q-function Reinforcement Learning Algorithm
* Applying MVC to create a GUI application where the Model is independent from the Controller/View
* How to create installable python modules (my `cgolai` package can be installed like other packages, with `setup.py`).
* Unit testing with python's `unittest` to test my module
* Git for version control and distributed work (easy switching between GPU machine vs local)
* Compressed pickling for storing model states
* PEP 8 code style standards

### Experimentation and Difficulties

Problems with my experimentation methodologies:
* I didn't do much hyper-parameter tuning, so it's possible there are better parameters which train the model much better than my experiments found.
* Training larger networks took very long, so I usually didn't let the network train for very long. This means the RL agent wasn't performing as well as it could have, but I think letting it train longer has the same performance as the RL in a smaller state, i.e., it will memorize the larger state as well. However, since it can't really memorize the larger state, it would be interesting to know if it could draw accurate extrapolations from such an intense training.
* I'm pretty sure that my conversions from numpy.ndarray/list/torch.Tensor result in my neural network training very inefficiently (either not using GPU or making unnecessary copies). It was functional, so I didn't look further into it.
* I put tables in my report instead of graphing them, which I apologize for.

### Resources

See cs440_notes.docx for the reading component for the Honors Requirement.
* Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Chapter 6: Temporal-Difference Learning. Second edition, complete draft. 5 Nov 2017.
* Caballero, Lorena, Bob Hodge, and Sergio Hernandez. Conway’s “Game of Life” and the Epigenetic Principle.
* Gadaleta, Sabino, and Gerhard Dangelmayr. Reinforcement Learning Chaos Control Using Value Sensitive Vector-Quantization.

