Solving the OpenAI Gym MountainCar-v0 Environment with Deep Q-Learning
OpenAI Gym (http://gym.openai.com/) is a toolkit for developing and comparing reinforcement learning algorithms. This is my implementation of a Deep Q-Learning agent for the MountainCar-v0 environment. In this environment, a small car is stuck in a valley and has to reach the flag at the top of the hill on the right. The sooner it gets there, the higher its score. At each step the car picks one of three actions: push left, push right, or do nothing. After every action, the environment returns a reward and the car's new state (its position and velocity).
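To make the interaction loop concrete, here is a minimal pure-Python sketch of the classic MountainCar-v0 dynamics. It is a simplified stand-in for the Gym environment, not the code used in this project: the function names (reset, step, run_episode) and the policy signature are hypothetical, and the 200-step cap is applied inside run_episode rather than by Gym.

```python
import math
import random

# Constants of the classic MountainCar-v0 dynamics.
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5
MAX_STEPS = 200  # episode length cap


def reset(rng):
    """Start near the bottom of the valley with zero velocity."""
    return (rng.uniform(-0.6, -0.4), 0.0)


def step(state, action):
    """Apply action 0 (push left), 1 (no push) or 2 (push right)."""
    position, velocity = state
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    done = position >= GOAL_POS  # the flag sits at position 0.5
    return (position, velocity), -1.0, done  # reward is -1 on every step


def run_episode(policy, seed=0):
    """Play one episode and return the score (sum of rewards)."""
    rng = random.Random(seed)
    state, score = reset(rng), 0.0
    for _ in range(MAX_STEPS):
        state, reward, done = step(state, policy(state, rng))
        score += reward
        if done:
            break
    return score


def random_policy(state, rng):
    """Hypothetical baseline: pick one of the three actions at random."""
    return rng.randrange(3)


def always_right(state, rng):
    """Hypothetical baseline: always push towards the flag."""
    return 2
```

With these dynamics, run_episode(always_right) returns -200.0: the engine alone is too weak to climb the hill, so the car has to rock back and forth to build momentum, which is what the agent has to learn.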
The agent starts out by choosing actions mostly at random and gradually learns which actions lead to a higher score (the Epsilon value in the log is the probability of taking a random action). You can follow its progress by watching the score logged for each episode.
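The shift from random actions to learned ones is usually implemented as epsilon-greedy exploration. The sketch below shows one common way to do it; the function names, the starting value and the decay rate of 0.995 are assumptions for illustration, and only the floor of 0.001 is taken from the Epsilon column of the training log.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random action
    # Greedy action: index of the largest predicted Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.001):
    """Shrink epsilon after each episode, but never below epsilon_min.

    decay=0.995 is an assumed value; epsilon_min matches the log.
    """
    return max(epsilon_min, epsilon * decay)
```

For example, select_action([0.1, 0.9, 0.3], epsilon=0.0) always returns 1, the index of the largest Q-value, while epsilon=1.0 always returns a random action.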
Training and test results can be found in train_results.log and test_results.log, respectively.
For training, I set a target of -110 for the average score. The car receives a reward of -1 at every step and an episode is cut off after 200 steps, so it scores -200 if it never reaches the flag. Reaching the flag ends the episode early, which improves the score, and the faster the car gets there the higher the score. To train the car to reach the flag as fast as possible, I used the Q-learning target (reward + gamma * np.max(next_Q_target)) as the value the network learns to predict.
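The expression reward + gamma * np.max(next_Q_target) is the standard Q-learning (Bellman) target. The sketch below shows how it is typically computed for a single transition. It uses the built-in max in place of np.max to stay dependency-free; the function names, the default gamma of 0.99 and the handling of terminal states are assumptions, not taken from the project code.

```python
def q_target(reward, next_q_values, done, gamma=0.99):
    """Return reward + gamma * max(next_q_values), or just reward at episode end.

    next_q_values: Q-values predicted for the next state, one per action.
    gamma: discount factor (0.99 is an assumed value).
    """
    if done:
        # No future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def training_target(current_q_values, action, reward, next_q_values, done, gamma=0.99):
    """Copy the current predictions and overwrite only the action that was taken."""
    target = list(current_q_values)
    target[action] = q_target(reward, next_q_values, done, gamma)
    return target
```

For example, q_target(-1.0, [-10.0, -8.0, -12.0], done=False, gamma=0.5) evaluates to -1.0 + 0.5 * -8.0 = -5.0. Overwriting only the entry of the action taken leaves the error for the other actions at zero, so the network is only corrected for the action it actually tried.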
The last training episodes, from train_results.log:
Episode 743 Time Taken: 31.48 sec Score: -109.00 State: 0.505025593192 Average Q-Target: -14.1120 Epsilon: 0.001 Average Score: -110.03
Episode 744 Time Taken: 36.07 sec Score: -125.00 State: 0.501472864486 Average Q-Target: -60.9826 Epsilon: 0.001 Average Score: -110.14
Episode 745 Time Taken: 30.07 sec Score: -104.00 State: 0.521960339024 Average Q-Target: -56.2346 Epsilon: 0.001 Average Score: -110.31
Episode 746 Time Taken: 26.00 sec Score: -90.00 State: 0.510257725214 Average Q-Target: -36.6640 Epsilon: 0.001 Average Score: -110.15
Episode 747 Time Taken: 34.39 sec Score: -119.00 State: 0.536389897949 Average Q-Target: -11.7476 Epsilon: 0.001 Average Score: -109.70
Episode 748 Time Taken: 32.40 sec Score: -112.00 State: 0.520691140417 Average Q-Target: -44.0669 Epsilon: 0.001 Average Score: -109.76
Episode 749 Time Taken: 31.25 sec Score: -108.00 State: 0.501927383055 Average Q-Target: -57.2569 Epsilon: 0.001 Average Score: -109.75
Episode 750 Time Taken: 35.22 sec Score: -122.00 State: 0.507334402534 Average Q-Target: -46.2420 Epsilon: 0.001 Average Score: -109.35
Episode 751 Time Taken: 32.34 sec Score: -112.00 State: 0.519616429384 Average Q-Target: -59.1135 Epsilon: 0.001 Average Score: -109.60
Model training finished!
Average Score over last 100 episodes: -109.6 Number of Episodes: 751
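A stopping rule based on the average score over the last 100 episodes could be sketched as follows. The names are hypothetical, and the log suggests the project's actual check differs in detail (training continued for a few episodes after the logged average first rose above -110), so treat this as an illustration of the idea rather than the project's code.

```python
from collections import deque

SOLVED_THRESHOLD = -110.0  # target average score
WINDOW = 100               # number of most recent episodes to average


def is_solved(recent_scores, threshold=SOLVED_THRESHOLD, window=WINDOW):
    """True once a full window of scores averages at least the threshold."""
    if len(recent_scores) < window:
        return False
    return sum(recent_scores) / len(recent_scores) >= threshold


# A deque with maxlen drops the oldest score automatically,
# so it always holds at most the last WINDOW episode scores.
recent_scores = deque(maxlen=WINDOW)
```

In a training loop, each finished episode would append its score to recent_scores and then call is_solved(recent_scores) to decide whether to stop.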
The last test iterations, from test_results.log:
Iteration: 92 Score: -105.0
Iteration: 93 Score: -105.0
Iteration: 94 Score: -105.0
Iteration: 95 Score: -105.0
Iteration: 96 Score: -104.0
Iteration: 97 Score: -105.0
Iteration: 98 Score: -105.0
Iteration: 99 Score: -105.0
Iteration: 100 Score: -103.0
Total Avg. Score over 100 consecutive iterations : -102.84
Agent finished test within expected reward boundary! Environment is solved.
This was my first Deep Q-Learning project after my Udacity course, so it was very interesting to model and plan. I took some inspiration from GitHub user "harshitandro" to get started. I enjoyed watching the training progress and tweaking the input parameters to get it to work. MountainCar showed me how Deep Q-Learning can learn a control policy over a continuous state space, instead of developers spending arduous hours crafting one by hand.