## Vision

*Overview of project and purpose*

Hoping to incorporate some of the advancements in artificial intelligence into games, including those made by indie developers, Unity Technologies developed *Puppo, The Corgi* to demonstrate the capabilities of its ML-Agents toolkit. In a world where state machines are state-of-the-art "AI", even simple tools that leverage machine learning and neural network techniques are most appreciated. They open the door not necessarily to smarter non-player characters, but perhaps to automated beta testing by autonomous agents, or to the use of generative adversarial networks to create artwork that populates game worlds. There seem to be many useful applications of artificial intelligence to aid game developers in creating the worlds we love to explore. Hopefully, then, the ML-Agents toolkit and perhaps even the *Puppo* example can serve as valuable starting points for developers to dip their toes into the world of AI.

This project seeks to explore and tweak the existing *Puppo, The Corgi* example project so that it can serve as a playground for getting an idea of how reinforcement learning works within the ML-Agents framework.

## Background
*Describe technologies being used (and why you chose them).
Include citations of work on which you've based your system. Cite technologies used from class and new technologies you've experimented with.*

My project's starting point, *Puppo, The Corgi*, is a sample project featuring the cute but helpless titular canine, who starts off with neither animations nor any form of movement other than slight alterations to his various cylindrical limbs. Through training, a model is developed that lets him move his little legs in ways that at first propel him towards his stick, and later create a cute, if perhaps comical, custom animation for the puppy.

The example project that I'm working with uses Unity's ML-Agents toolkit, a tool that interfaces with TensorFlow through an external Python script to compute a model. The project is available for download here: https://blogs.unity3d.com/2018/10/02/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit/

The ML-Agents toolkit is predominantly driven by reinforcement learning. This example in particular uses Proximal Policy Optimization (PPO), which has become the default reinforcement learning algorithm at OpenAI thanks to its simplicity, ease of use, and performance (https://openai.com/blog/openai-baselines-ppo/).
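
For readers unfamiliar with Proximal Policy Optimization, its central idea is a clipped surrogate objective that keeps each policy update from straying too far from the previous policy. The sketch below is a minimal plain-Python illustration of that objective as described in the PPO paper. It is not the ML-Agents implementation; the function name and the `epsilon=0.2` default are assumptions made for illustration.

```python
def ppo_clipped_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: estimated advantage of each sampled action
    epsilon:    clip range (illustrative default, an assumption)
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the probability ratio to [1 - epsilon, 1 + epsilon].
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# A large ratio on a good action is capped, so the update gains nothing
# from pushing the policy further than the clip range allows.
capped = ppo_clipped_objective([1.5], [1.0])
# A large ratio on a bad action is not capped, so the penalty is felt in full.
uncapped = ppo_clipped_objective([1.5], [-1.0])
```

Taking the minimum of the two terms is what makes the objective conservative: clipping can only lower the estimate, never raise it.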



## Implementation

*Summarize your implementation (how it extends referenced work)*

At present, I've been able to get the existing project up and running, including all of the configuration work needed to enable training locally, though I have not seriously begun tweaking the model or extending much beyond the sample project.

The default hyper-parameters that I've been using are:

    DogBrain:
        normalize: true
        num_epoch: 3
        time_horizon: 1000
        batch_size: 2048
        buffer_size: 20480
        gamma: 0.995
        max_steps: 2e6
        summary_freq: 3000
        num_layers: 3
        hidden_units: 512

There are a variety of other hyper-parameters that could also be added to the configuration. It would definitely be interesting to see how the parameters for using RNNs to establish some memory, or those related to what the toolkit's documentation describes as curiosity, would affect the model. Some best practices for tuning the PPO hyper-parameters can be found here (https://github.com/Unity-Technologies/ml-agents/blob/0.5.0/docs/Training-PPO.md).
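
To get a feel for what the default values listed above imply, a little arithmetic helps. The sketch below assumes, following the Training-PPO documentation, that the trainer collects `buffer_size` experiences, splits them into minibatches of `batch_size`, and makes `num_epoch` passes over them. The helper names are hypothetical, and the "effective horizon" of 1 / (1 - gamma) is a common rule of thumb rather than anything the toolkit reports.

```python
# Default DogBrain values copied from the configuration above.
hyperparameters = {
    "num_epoch": 3,
    "batch_size": 2048,
    "buffer_size": 20480,
    "gamma": 0.995,
}


def minibatches_per_epoch(cfg):
    """Number of minibatches in one pass over the experience buffer."""
    if cfg["buffer_size"] % cfg["batch_size"] != 0:
        raise ValueError("buffer_size should be a multiple of batch_size")
    return cfg["buffer_size"] // cfg["batch_size"]


def gradient_updates_per_buffer(cfg):
    """Gradient steps taken each time the buffer fills (assumed behaviour)."""
    return cfg["num_epoch"] * minibatches_per_epoch(cfg)


def effective_horizon(gamma):
    """Rule-of-thumb number of steps over which future rewards still matter."""
    return 1.0 / (1.0 - gamma)


print(minibatches_per_epoch(hyperparameters))        # minibatches per pass
print(gradient_updates_per_buffer(hyperparameters))  # updates per full buffer
print(round(effective_horizon(hyperparameters["gamma"])))  # steps
```

With these defaults, each full buffer yields 10 minibatches per pass and 30 gradient updates, and a discount of 0.995 corresponds to looking roughly 200 steps ahead.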


The reward being used at the moment is 0.01 * velocity.dot(direction) - 0.001 - 0.001 * angular_force + reached_target. This provides an orientation bonus (the velocity/direction dot product) for moving Puppo towards the stick, a time penalty (the -0.001) to keep him from dilly-dallying, a rotation penalty (the angular force term) to make him look more like a real dog, who would get dizzy quickly, and, most importantly, a bonus for reaching the stick. The reached_target term is 1 if the target has been reached and 0 otherwise. Compared to the 0.01 and 0.001 coefficients on the other terms, that is two to three orders of magnitude larger, which makes sense given that reaching the stick is the ultimate goal of the little pup.
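
The reward expression can be transcribed term by term as a small function. This is only a Python sketch of the formula quoted above, not the sample project's actual Unity code; the function name, the tuple representation of vectors, and the treatment of `angular_force` as a plain number are assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as tuples or lists."""
    return sum(x * y for x, y in zip(a, b))


def puppo_reward(velocity, direction, angular_force, reached_target):
    """Per-step reward, transcribed from the formula in the text."""
    orientation_bonus = 0.01 * dot(velocity, direction)  # moving towards the stick
    time_penalty = -0.001                                # discourages dawdling
    rotation_penalty = -0.001 * angular_force            # discourages spinning
    target_bonus = 1.0 if reached_target else 0.0        # the actual goal
    return orientation_bonus + time_penalty + rotation_penalty + target_bonus


# Moving at unit speed straight at the stick, without spinning:
step_reward = puppo_reward((1.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0, False)
# The same step, but on the frame where the stick is reached:
goal_reward = puppo_reward((1.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0, True)
```

In this example the shaping terms contribute about 0.009 per step, while reaching the stick adds a full 1.0, which illustrates how strongly the target bonus dominates.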


## Results

*Results of your system, compare to other work*

While watching the many Puppos going through the learning process, there was a noticeable difference between the later stages of training and the earlier ones, where they would reach the stick, but only after a bit of flailing. The training log from the first 48,000 steps shows the mean reward climbing steadily:

    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 3000. Mean Reward: 1.702. Std of Reward: 0.945. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 6000. Mean Reward: 1.752. Std of Reward: 0.909. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 9000. Mean Reward: 2.028. Std of Reward: 0.917. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 12000. Mean Reward: 2.278. Std of Reward: 0.917. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 15000. Mean Reward: 2.537. Std of Reward: 1.079. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 18000. Mean Reward: 2.654. Std of Reward: 1.100. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 21000. Mean Reward: 2.890. Std of Reward: 1.164. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 24000. Mean Reward: 2.939. Std of Reward: 1.157. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 27000. Mean Reward: 3.084. Std of Reward: 1.135. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 30000. Mean Reward: 3.111. Std of Reward: 1.279. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 33000. Mean Reward: 3.246. Std of Reward: 1.330. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 36000. Mean Reward: 3.236. Std of Reward: 1.373. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 39000. Mean Reward: 3.232. Std of Reward: 1.344. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 42000. Mean Reward: 3.254. Std of Reward: 1.400. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 45000. Mean Reward: 3.357. Std of Reward: 1.455. Training.
    INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 48000. Mean Reward: 3.480. Std of Reward: 1.467. Training.



By the time I stopped training (at 444,000 steps), the mean reward was around 3.8 to 3.9. That is more than double the mean reward of the first few summaries but, interestingly, not much higher than the 3.48 reached at the 48,000-step mark, which suggests that most of the improvement happened early in training.
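
Comparisons like these are easier to make when the summaries are pulled out of the training log programmatically. The sketch below assumes the log lines keep exactly the format shown above; the regular expression and function name are hypothetical, and only the first and last lines of the excerpt are used as sample input.

```python
import re

# Matches the "Step / Mean Reward / Std of Reward" portion of a summary line.
SUMMARY_PATTERN = re.compile(
    r"Step: (?P<step>\d+)\. "
    r"Mean Reward: (?P<mean>-?\d+\.\d+)\. "
    r"Std of Reward: (?P<std>\d+\.\d+)\."
)


def parse_training_log(text):
    """Return a list of (step, mean_reward, std_reward) tuples."""
    rows = []
    for line in text.splitlines():
        match = SUMMARY_PATTERN.search(line)
        if match:
            rows.append((
                int(match.group("step")),
                float(match.group("mean")),
                float(match.group("std")),
            ))
    return rows


sample_log = """\
INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 3000. Mean Reward: 1.702. Std of Reward: 0.945. Training.
INFO:mlagents.trainers: puppo0-0: DogBrain: Step: 48000. Mean Reward: 3.480. Std of Reward: 1.467. Training.
"""

rows = parse_training_log(sample_log)
first_step, first_mean, _ = rows[0]
last_step, last_mean, _ = rows[-1]
improvement = last_mean / first_mean
print(f"Step {first_step} to {last_step}: mean reward grew {improvement:.2f}x")
```

On these two lines the mean reward roughly doubles between step 3,000 and step 48,000, in line with the observation that most of the gain came early.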

## Implications



*Discuss social/ethical implications of using these technologies*

Advances in reinforcement learning seem to be driving many of the coolest technologies around at the moment. As with any technological advancement, there's the threat that new technologies carry extra baggage that disrupts the environments they're placed in, in unexpected ways. One of the largest concerns is often the loss of jobs as new algorithms learn to do things cheaper and faster than humans.

Looking at the games industry in particular, these sorts of things feel like toy examples, mostly designed to spark developer interest in tinkering with AI. But just as truck drivers might be rightfully afraid for their jobs given advancements in self-driving vehicles, it might not be too long before slightly more complex versions of Puppo show how reinforcement learning could automate animators' jobs away. However, especially in the games industry, I suspect that while AI techniques will hopefully become more prevalent, they will be intended to supplement and enhance developers' abilities rather than outright replace them. And perhaps it's foolish pride or foolish faith, but I find it hard to imagine any algorithm that can automate away the creativity and ingenuity that human developers rely on to create compelling worlds that tug, perhaps, even at the souls of the players.