### OpenAI Gym. LunarLander-v2 environment

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. 
The reward for moving from the top of the screen to the landing pad and reaching zero speed is about 100..140 points.
If the lander moves away from the landing pad, it loses that reward again. 
The episode finishes when the lander crashes or comes to rest, receiving an additional -100 or +100 points, respectively. 
Each leg ground contact is worth +10 points. Firing the main engine costs 0.3 points per frame. The environment counts as solved at 200 points. 
Landing outside the landing pad is possible. 
Fuel is infinite, so an agent can learn to fly and then land on its first attempt. 
Four discrete actions are available: do nothing, fire the left orientation engine, fire the main engine, fire the right orientation engine.

**Problem:**  
Guide the lander to the center of the landing pad and bring it to zero speed while maintaining the correct orientation (both legs pointing down).  

**State vector (s):**  
* s[0] - x coordinate  
* s[1] - y coordinate  
* s[2] - x speed  
* s[3] - y speed  
* s[4] - angle  
* s[5] - angular speed  
* s[6] - if first leg has contact  
* s[7] - if second leg has contact  
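
As a small illustration of the layout above, the 8-element vector can be unpacked into named fields. This is only a sketch: `LanderState`, `unpack_state`, and the field names are hypothetical helpers, not part of Gym or of the scripts in this notebook, and the sample values are made up.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Named view of the state vector (illustrative field names)."""
    x: float                   # s[0]
    y: float                   # s[1]
    vx: float                  # s[2]
    vy: float                  # s[3]
    angle: float               # s[4]
    angular_speed: float       # s[5]
    first_leg_contact: bool    # s[6]
    second_leg_contact: bool   # s[7]


def unpack_state(s: Sequence[float]) -> LanderState:
    """Convert a raw 8-element state vector into a LanderState."""
    if len(s) != 8:
        raise ValueError(f"expected 8 state values, got {len(s)}")
    return LanderState(
        float(s[0]), float(s[1]), float(s[2]), float(s[3]),
        float(s[4]), float(s[5]), bool(s[6]), bool(s[7]),
    )


# Made-up sample: slightly right of the pad, descending, second leg touching.
state = unpack_state([0.1, 1.4, -0.2, -0.5, 0.05, 0.0, 0.0, 1.0])
print(state.y, state.second_leg_contact)
```

Named fields make later code (for example a heuristic controller) easier to read than raw indices.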


**Possible actions:**   
- 0 - Do nothing  
- 1 - Fire left engine  
- 2 - Fire main engine  
- 3 - Fire right engine  
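
The action codes above can be given names so that control code does not rely on bare integers. A minimal sketch; the `Action` enum is a hypothetical helper introduced here for illustration, not something the environment provides.

```python
from enum import IntEnum


class Action(IntEnum):
    """Discrete action codes as listed above (names are illustrative)."""
    NOOP = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


# IntEnum members compare equal to plain ints, so they can be passed
# anywhere an integer action code is expected.
print(Action(2).name, int(Action.FIRE_RIGHT))
```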

**Rewards:**
*   +100..140 for moving from the top of the screen to the landing pad at zero speed
*   +100 for a successful landing (coming to rest)
*   +10 for each leg ground contact
*   -100 for crashing into the ground
*   -0.3 for each frame in which the main engine fires

---

### Keyboard control

In [None]:
!python lunar_lander_keyboard.py

---

### Heuristic solution (fuzzy controller)

In [None]:
!python lunar_lander_heuristic.py

---

### Q-table solution

$\epsilon$-greedy policy  

TD(0) update:  
![q-value equation](images/math.svg)
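
For readers who cannot load the image, the update is written out here under the assumption that it is the standard one-step Q-learning form (the image itself may use a different but equivalent notation):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Here $s_t$ and $a_t$ are the current state and action, $r_{t+1}$ is the reward received after taking $a_t$, and $s_{t+1}$ is the next state. The bracketed term is the TD error.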

**Hyperparameters:**  
*   $\alpha$ (alpha) is the learning rate ($0 < \alpha \leq 1$).
*   $\gamma$ (gamma) is the discount factor ($0 \leq \gamma \leq 1$). It determines how much importance we give to future rewards. A high discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes the agent consider only the immediate reward, hence making it greedy.
*   $\epsilon$ (epsilon) is the probability of random action. In our case we start from full exploration (epsilon=1.0) and decrease the value to epsilon_min (=0.01) as episodes pass.
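
To show how the three hyperparameters interact, here is a minimal pure-Python sketch of an $\epsilon$-greedy choice, a one-step TD update, and an epsilon schedule. It is not the code in `lunar_lander_qtable_td.py`: the function names, the dictionary-based table, the string state keys, and the multiplicative per-episode decay are all assumptions made for illustration, and the update assumes the Q-learning (max over next actions) form.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # do nothing, left engine, main engine, right engine


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def td0_update(q, s, a, r, s_next, done, alpha, gamma):
    """Apply one TD(0) update to q[s][a] and return the TD error."""
    # No bootstrapping from the next state once the episode has ended.
    target = r if done else r + gamma * max(q[s_next])
    td_error = target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


def decay_epsilon(epsilon, epsilon_min, epsilon_decay):
    """Assumed schedule: multiply by epsilon_decay, never going below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)


# Q-table keyed by a (hypothetical) discretized state, one value per action.
q = defaultdict(lambda: [0.0] * N_ACTIONS)
rng = random.Random(0)

# One made-up transition: fire the main engine (action 2) and receive reward 1.0.
td_error = td0_update(q, s="s0", a=2, r=1.0, s_next="s1", done=False,
                      alpha=0.1, gamma=0.99)
greedy_action = epsilon_greedy(q["s0"], epsilon=0.0, rng=rng)

# Decay epsilon once per episode using the values from the text.
epsilon = 1.0
for _ in range(2000):
    epsilon = decay_epsilon(epsilon, epsilon_min=0.01, epsilon_decay=0.996)

print(q["s0"], greedy_action, epsilon)
```

With all values initialised to zero, the single update moves `q["s0"][2]` by $\alpha$ times the TD error, so the greedy choice in that state becomes the main engine.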

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = [15, 9]
from lunar_lander_qtable_td import QTable, create_env
env = create_env()

In [None]:
# Create the agent; training for N episodes runs in the next cell
agent = QTable(env, alpha=0.1, gamma=.99, epsilon=1.0, epsilon_min=.01, epsilon_max=1.0, epsilon_decay=0.996)

In [None]:
loss = agent.train(20000)

### Final reward

In [None]:
reward = np.load(r'qtbl_td_reward_20000.npy')
window = 500
freq = 20
avg_reward = pd.DataFrame(reward).rolling(window).mean()
plt.plot(np.arange(1, len(reward)+1, freq), reward[::freq], label='final reward')
plt.plot(avg_reward, label='moving average on final reward')
plt.xlabel('episode')
plt.ylabel('reward')
plt.title('Final reward over training episodes')
plt.legend()
plt.show();

### Q-table visualization

In [None]:
# Load q-tables of agents with different performance
agent_1k = QTable(env)
agent_1k.load_qtable(r'checkpoints\qtbl_td_e1000.npy')
agent_20k = QTable(env)
agent_20k.load_qtable(r'checkpoints\qtbl_td_e20000.npy')
qtbl_1k = agent_1k.qtbl_2d
qtbl_20k = agent_20k.qtbl_2d

In [None]:
fig, ax = plt.subplots(1,2, figsize=(16,30)) 
actions = ['idle', 'left engine', 'main engine', 'right engine']
start_idx = 18000
n_samples = 300
ax[0].set_title('for 1000 episodes')
ax[1].set_title('for 20000 episodes')
x_1k = sns.heatmap(qtbl_1k[start_idx:start_idx+n_samples], xticklabels=actions, ax=ax[0])
x_20k = sns.heatmap(qtbl_20k[start_idx:start_idx+n_samples], xticklabels=actions, ax=ax[1])

### Epsilon decay function

In [None]:
total_episodes = 20000
episodes = np.arange(total_episodes)
# epsilon = np.clip(1.0 - np.log10((episodes + 1) / (total_episodes * 0.1)), 0.1, 1.0)
epsilon = agent.decay_function(episodes, total_episodes)
plt.rcParams['figure.figsize'] = [15, 9]
plt.plot(episodes, epsilon, label='epsilon decay')
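# Min-max normalize the moving-average reward so it shares the [0, 1] axis with epsilon
plt.plot(episodes, (avg_reward - np.min(avg_reward)) / (np.max(avg_reward) - np.min(avg_reward)), label='normalized moving-average reward')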
plt.xlabel('episode')
plt.ylabel('epsilon')
plt.legend()
plt.title('Decaying epsilon and reward over training episodes')
plt.show();

---

### Agent performance at the beginning (1000 episodes)
<!-- ![before_training](images/training/before_training.gif) -->

In [None]:
!python evaluate_qtable.py "checkpoints\qtbl_td_e1000.npy" 1

### Agent performance after 5000 episodes
<!-- ![mid_training](images/training/mid_training.gif) -->

In [None]:
!python evaluate_qtable.py "checkpoints\qtbl_td_e5000.npy" 1

### Agent performance after 20000 episodes
<!-- ![after_training](images/training/after_training.gif) -->

In [None]:
!python evaluate_qtable.py "checkpoints\qtbl_td_e20000.npy" 5

---