## Self-check questions:
* What is reinforcement learning?
* What is an environment?
* What is an agent?
* What is a reward, and what forms can it take?
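
As a rough illustration of how these terms fit together, here is a minimal sketch of the agent-environment loop. `CorridorEnv`, `random_agent` and `run_episode` are hypothetical names invented for this example; they only mimic the `reset()` / `step()` style of interface and are not part of Gym.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial observation (the state)."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right); return (observation, reward, done)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        # The reward is just a number: a penalty per step, a bonus at the goal.
        reward = 10.0 if done else -1.0
        return self.position, reward, done


def random_agent(observation):
    """Hypothetical agent: ignores the observation and picks an action at random."""
    return random.choice([0, 1])


def run_episode(env, agent, max_steps=100):
    """Run the interaction loop and return the total reward the agent collected."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


random.seed(0)
print("random agent:", run_episode(CorridorEnv(), random_agent))
print("always-right agent:", run_episode(CorridorEnv(), lambda obs: 1))
```

The agent that always moves right pays three step penalties and then collects the goal bonus, so its total reward is 7.0; the random agent usually does worse because it wastes steps.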

In [3]:
import gym

#create a single game instance
env = gym.make("Taxi-v2")

#start new game
env.reset();

In [9]:
print("initial observation code:", env.reset())
print('printing observation:')
env.render()
print("observations:", env.observation_space, 'n=', env.observation_space.n)
print("actions:", env.action_space, 'n=', env.action_space.n)

initial observation code: 364
printing observation:
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

observations: Discrete(500) n= 500
actions: Discrete(6) n= 6
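
The observation is a single integer, so it can help to unpack it. The sketch below assumes the state is packed the way Gym's Taxi implementation does it, as `((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination`; the helper `decode_taxi_state` and the `LOCATIONS` list are hypothetical names written for this example. It also shows where the 500 states come from.

```python
def decode_taxi_state(code):
    """Split a Taxi observation code into (taxi_row, taxi_col, passenger, destination).

    Assumption: the code was packed as
    ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination.
    """
    code, destination = divmod(code, 4)
    code, passenger = divmod(code, 5)
    taxi_row, taxi_col = divmod(code, 5)
    return taxi_row, taxi_col, passenger, destination


# Passenger index 4 means the passenger is already inside the taxi.
LOCATIONS = ["R", "G", "Y", "B", "in taxi"]

taxi_row, taxi_col, passenger, destination = decode_taxi_state(364)
print(taxi_row, taxi_col, LOCATIONS[passenger], LOCATIONS[destination])

# 5 rows * 5 columns * 5 passenger locations * 4 destinations = 500 states
print(5 * 5 * 5 * 4)
```

Under that assumption, code 364 decodes to a taxi at row 3, column 3, with the passenger waiting at G and the destination at R.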


In [34]:
import numpy as np
n_states = env.observation_space.n
n_actions = env.action_space.n
def get_random_policy():
    """
    Build a numpy array representing the agent's policy.
    The array has one element per environment state (n_states, 500 for Taxi).
    Each element is an integer from 0 to n_actions - 1 (0 to 5 for Taxi),
    the action to take from that state.
    """
    return np.random.choice(range(n_actions), size=n_states)

In [36]:
np.random.seed(1234)
policies = [get_random_policy() for i in range(10**4)]
assert all([len(p) == n_states for p in policies]), 'policy length should always be n_states'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should match n_actions-1'
action_probas = np.unique(policies, return_counts=True)[-1] /10**4. /n_states
print("Action frequencies over 10^4 samples:",action_probas)
assert np.allclose(action_probas, [1. / n_actions] * n_actions, atol=0.05), "The policies aren't uniformly random (or maybe it's just extremely bad luck)"
print("Seems fine!")

Action frequencies over 10^4 samples: [ 0.1668264  0.1667818  0.166578   0.166587   0.166508   0.1667188]
Seems fine!


### Let's evaluate!
* Implement a simple function that runs one game and returns the total reward

In [28]:
def sample_reward(env, policy, max_steps=100):
    """
    Interact with the environment and return the sum of all rewards.
    If the game doesn't end within the step limit (e.g. the agent keeps walking into a wall),
    force-end the game and return whatever reward was collected so far.
    Tip: see signature of env.step(...) method above.
    """
    n_steps = 0
    # start from the state the environment actually reset to
    state = env.reset()
    total_reward = 0
    done = False

    # note: `<=` lets an episode run for up to max_steps + 1 steps
    while not done and n_steps <= max_steps:
        state, reward, done, _ = env.step(policy[state])
        total_reward += reward
        n_steps += 1

    return total_reward

In [27]:
print("generating 10^3 sessions...")
rewards = [sample_reward(env, get_random_policy()) for _ in range(10**3)]
assert all([type(r) in (int, float) for r in rewards]), 'sample_reward must return a single number'
print("Looks good!")

generating 10^3 sessions...
Looks good!


In [29]:
def evaluate(policy, n_times=100):
    """Run several evaluations and average the score the policy gets."""
    rewards = [sample_reward(env, policy) for i in range(n_times)]
    return float(np.mean(rewards))
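
Averaging over several episodes matters because a single episode's return is noisy (each game starts from a random state). The sketch below uses a made-up stand-in for the episode return, a normal distribution with mean -200 and standard deviation 50, to show how the spread of the averaged score shrinks as more episodes are averaged. All names and numbers here are hypothetical and are not measurements from Taxi.

```python
import numpy as np

rng = np.random.default_rng(0)


def noisy_episode_return():
    """Hypothetical stand-in for one episode: mean -200, standard deviation 50."""
    return rng.normal(loc=-200.0, scale=50.0)


def averaged_score(n_times):
    """Average n_times noisy returns, mirroring what an evaluation function does."""
    return float(np.mean([noisy_episode_return() for _ in range(n_times)]))


spread = {}
for n_times in (1, 10, 100):
    # Repeat the whole evaluation 200 times to see how much the estimate varies.
    estimates = [averaged_score(n_times) for _ in range(200)]
    spread[n_times] = float(np.std(estimates))
    print(n_times, "episodes per estimate -> spread of estimates:", round(spread[n_times], 1))
```

The spread falls roughly as one over the square root of the number of episodes, so two policies with similar true quality can still swap places between evaluations.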
        

### Main loop

In [31]:
best_policy = None
best_score = -float('inf')

from tqdm import tqdm
for i in tqdm(range(10000)):
    policy = get_random_policy()
    score = evaluate(policy)
    if score > best_score:
        best_score = score
        best_policy = policy
        print("New best score:", score)

  0%|          | 2/10000 [00:00<27:50,  5.99it/s]

New best score: -691.22
New best score: -610.13


  0%|          | 5/10000 [00:00<28:21,  5.88it/s]

New best score: -574.76


  0%|          | 10/10000 [00:01<26:54,  6.19it/s]

New best score: -555.68


  0%|          | 14/10000 [00:02<26:58,  6.17it/s]

New best score: -538.58


  0%|          | 39/10000 [00:06<25:57,  6.39it/s]

New best score: -493.58


  0%|          | 45/10000 [00:07<25:58,  6.39it/s]

New best score: -387.11


 19%|█▉        | 1911/10000 [04:55<20:50,  6.47it/s]

New best score: -379.37


 83%|████████▎ | 8291/10000 [21:26<04:25,  6.44it/s]

New best score: -368.84


100%|██████████| 10000/10000 [25:50<00:00,  6.45it/s]


In [33]:
print(best_score)

-368.84


# Part II Genetic algorithm 

The next task is to devise a more efficient way to perform policy search.
We'll do that with a bare-bones evolutionary algorithm.
(Unless you're feeling masochistic and want to try something entirely different, which earns bonus points if it works.)
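
Before applying the idea to policies, here is a minimal sketch of a bare-bones evolutionary loop on a toy problem: evolve integer vectors to match a hidden target. Everything in it (`TARGET`, `fitness`, `toy_crossover`, `toy_mutation`, the population sizes) is a hypothetical stand-in chosen for illustration, not the solution expected in the cells below.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES, N_VALUES = 20, 6  # hypothetical sizes for the toy problem
TARGET = rng.integers(N_VALUES, size=N_GENES)  # hidden "ideal" individual


def fitness(individual):
    """Toy score: number of positions that agree with TARGET (higher is better)."""
    return int(np.sum(individual == TARGET))


def toy_crossover(parent1, parent2):
    """Take each gene from parent1 or parent2 with equal probability."""
    mask = rng.random(N_GENES) < 0.5
    return np.where(mask, parent1, parent2)


def toy_mutation(parent, p=0.1):
    """Replace each gene with a random value with probability p."""
    mask = rng.random(N_GENES) < p
    return np.where(mask, rng.integers(N_VALUES, size=N_GENES), parent)


population_size = 30
population = [rng.integers(N_VALUES, size=N_GENES) for _ in range(population_size)]
initial_best = max(fitness(ind) for ind in population)

for generation in range(50):
    # Breed: children from random pairs plus mutated copies of random parents.
    children = [
        toy_crossover(population[rng.integers(population_size)],
                      population[rng.integers(population_size)])
        for _ in range(15)
    ]
    mutants = [toy_mutation(population[rng.integers(population_size)]) for _ in range(15)]
    # Select: keep the fittest individuals among parents and offspring.
    candidates = population + children + mutants
    candidates.sort(key=fitness, reverse=True)
    population = candidates[:population_size]

final_best = fitness(population[0])
print("best fitness before:", initial_best, "after:", final_best, "out of", N_GENES)
```

Because the toy fitness is deterministic and the parents compete with their offspring, the best fitness can never decrease here. With a noisy score, such as an average over sampled game sessions, that guarantee no longer holds.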

In [43]:
def crossover(policy1, policy2, p=0.5):
    """
    for each state, with probability p take action from policy1, else policy2
    """
    # True where the child inherits from policy1 (probability p), False for policy2
    take_from_first = np.random.choice([False, True], size=n_states, p=[1 - p, p])
    # np.where keeps the parents' integer dtype, so actions stay valid integer ids
    return np.where(take_from_first, policy1, policy2)

In [44]:
def mutation(policy, p=0.1):
    """
    for each state, with probability p replace action with random action
    Tip: mutation can be written as crossover with random policy
    """
    mutated_policy = crossover(get_random_policy(), policy, p)
    return mutated_policy
    

In [45]:
np.random.seed(1234)
policies = [crossover(get_random_policy(), get_random_policy()) 
            for i in range(10**4)]

assert all([len(p) == n_states for p in policies]), 'policy length should always be n_states'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should be n_actions-1'

assert any([np.mean(crossover(np.zeros(n_states), np.ones(n_states))) not in (0, 1)
               for _ in range(100)]), "Make sure your crossover changes each action independently"
print("Seems fine!")

Seems fine!


In [46]:

n_epochs = 100 #how many cycles to make
pool_size = 100 #how many policies to maintain
n_crossovers = 50 #how many crossovers to make on each step
n_mutations = 50 #how many mutations to make on each tick


In [47]:
print("initializing...")
pool = [get_random_policy() for i in range(pool_size)]
pool_scores = [evaluate(i) for i in pool]

initializing...


In [48]:
assert type(pool) == type(pool_scores) == list
assert len(pool) == len(pool_scores) == pool_size
assert all([type(score) in (float, int) for score in pool_scores])


In [50]:
#main loop
for epoch in tqdm(range(n_epochs)):
    print("Epoch %s:"%epoch)
    
    
    
    crossovered = [crossover(pool[i], pool[-i - 1]) for i in np.random.choice(range(pool_size), n_crossovers)]
    mutated = [mutation(pool[i]) for i in np.random.choice(range(pool_size), n_mutations)]
    
    assert type(crossovered) == type(mutated) == list
    
    #add new policies to the pool
    pool = pool + crossovered + mutated
    pool_scores = [evaluate(i) for i in pool]
    
    #select pool_size best policies
    selected_indices = np.argsort(pool_scores)[-pool_size:]
    pool = [pool[i] for i in selected_indices]
    pool_scores = [pool_scores[i] for i in selected_indices]

    #print the best policy so far (last in ascending score order)
    print("best score:", pool_scores[-1])

  0%|          | 0/100 [00:00<?, ?it/s]

Epoch 0:


  1%|          | 1/100 [00:27<45:14, 27.42s/it]

best score: -414.38
Epoch 1:


  2%|▏         | 2/100 [00:54<44:40, 27.35s/it]

best score: -378.38
Epoch 2:


  3%|▎         | 3/100 [01:22<44:23, 27.46s/it]

best score: -378.47
Epoch 3:


  4%|▍         | 4/100 [01:50<44:15, 27.66s/it]

best score: -405.47
Epoch 4:


  5%|▌         | 5/100 [02:22<45:16, 28.59s/it]

best score: -378.56
Epoch 5:


  6%|▌         | 6/100 [02:51<44:41, 28.52s/it]

best score: -379.19
Epoch 6:


  7%|▋         | 7/100 [03:21<44:32, 28.74s/it]

best score: -369.56
Epoch 7:


  8%|▊         | 8/100 [03:49<43:59, 28.69s/it]

best score: -388.01
Epoch 8:


  9%|▉         | 9/100 [04:17<43:25, 28.63s/it]

best score: -349.67
Epoch 9:


 10%|█         | 10/100 [04:44<42:44, 28.49s/it]

best score: -360.29
Epoch 10:


 11%|█         | 11/100 [05:12<42:07, 28.40s/it]

best score: -306.38
Epoch 11:


 12%|█▏        | 12/100 [05:39<41:30, 28.30s/it]

best score: -334.19
Epoch 12:


 13%|█▎        | 13/100 [06:06<40:54, 28.21s/it]

best score: -342.92
Epoch 13:


 14%|█▍        | 14/100 [06:33<40:19, 28.13s/it]

best score: -324.74
Epoch 14:


 15%|█▌        | 15/100 [07:00<39:45, 28.06s/it]

best score: -324.83
Epoch 15:


 16%|█▌        | 16/100 [07:28<39:12, 28.01s/it]

best score: -334.01
Epoch 16:


 17%|█▋        | 17/100 [07:55<38:41, 27.97s/it]

best score: -324.92
Epoch 17:


 18%|█▊        | 18/100 [08:22<38:08, 27.91s/it]

best score: -280.19
Epoch 18:


 19%|█▉        | 19/100 [08:49<37:37, 27.87s/it]

best score: -280.01
Epoch 19:


 20%|██        | 20/100 [09:16<37:06, 27.84s/it]

best score: -270.2
Epoch 20:


 21%|██        | 21/100 [09:43<36:35, 27.79s/it]

best score: -288.92
Epoch 21:


 22%|██▏       | 22/100 [10:11<36:07, 27.78s/it]

best score: -270.56
Epoch 22:


 23%|██▎       | 23/100 [10:38<35:37, 27.77s/it]

best score: -243.83
Epoch 23:


 24%|██▍       | 24/100 [11:05<35:07, 27.74s/it]

best score: -253.28
Epoch 24:


 25%|██▌       | 25/100 [11:32<34:38, 27.71s/it]

best score: -244.01
Epoch 25:


 26%|██▌       | 26/100 [11:59<34:08, 27.68s/it]

best score: -216.83
Epoch 26:


 27%|██▋       | 27/100 [12:26<33:38, 27.65s/it]

best score: -226.55
Epoch 27:


 28%|██▊       | 28/100 [12:53<33:09, 27.63s/it]

best score: -217.1
Epoch 28:


 29%|██▉       | 29/100 [13:21<32:41, 27.62s/it]

best score: -216.74
Epoch 29:


 30%|███       | 30/100 [13:48<32:12, 27.60s/it]

best score: -208.55
Epoch 30:


 31%|███       | 31/100 [14:15<31:43, 27.59s/it]

best score: -225.74
Epoch 31:


 32%|███▏      | 32/100 [14:42<31:14, 27.57s/it]

best score: -226.73
Epoch 32:


 33%|███▎      | 33/100 [15:09<30:45, 27.55s/it]

best score: -208.91
Epoch 33:


 34%|███▍      | 34/100 [15:36<30:17, 27.54s/it]

best score: -208.73
Epoch 34:


 35%|███▌      | 35/100 [16:03<29:49, 27.53s/it]

best score: -226.28
Epoch 35:


 36%|███▌      | 36/100 [16:30<29:20, 27.51s/it]

best score: -190.82
Epoch 36:


 37%|███▋      | 37/100 [16:57<28:52, 27.50s/it]

best score: -208.19
Epoch 37:


 38%|███▊      | 38/100 [17:24<28:24, 27.49s/it]

best score: -190.37
Epoch 38:


 39%|███▉      | 39/100 [17:51<27:56, 27.49s/it]

best score: -164.0
Epoch 39:


 40%|████      | 40/100 [18:19<27:28, 27.48s/it]

best score: -181.64
Epoch 40:


 41%|████      | 41/100 [18:46<27:00, 27.46s/it]

best score: -190.28
Epoch 41:


 42%|████▏     | 42/100 [19:14<26:34, 27.49s/it]

best score: -190.73
Epoch 42:


 43%|████▎     | 43/100 [19:43<26:08, 27.51s/it]

best score: -163.46
Epoch 43:


 44%|████▍     | 44/100 [20:10<25:40, 27.51s/it]

best score: -163.55
Epoch 44:


 45%|████▌     | 45/100 [20:37<25:12, 27.50s/it]

best score: -172.73
Epoch 45:


 46%|████▌     | 46/100 [21:04<24:44, 27.49s/it]

best score: -145.37
Epoch 46:


 47%|████▋     | 47/100 [21:32<24:17, 27.49s/it]

best score: -163.55
Epoch 47:


 48%|████▊     | 48/100 [21:59<23:49, 27.49s/it]

best score: -154.82
Epoch 48:


 49%|████▉     | 49/100 [22:26<23:21, 27.49s/it]

best score: -145.91
Epoch 49:


 50%|█████     | 50/100 [22:53<22:53, 27.48s/it]

best score: -163.28
Epoch 50:


 51%|█████     | 51/100 [23:20<22:26, 27.47s/it]

best score: -127.91
Epoch 51:


 52%|█████▏    | 52/100 [23:48<21:58, 27.46s/it]

best score: -127.91
Epoch 52:


 53%|█████▎    | 53/100 [24:15<21:30, 27.46s/it]

best score: -136.91
Epoch 53:


 54%|█████▍    | 54/100 [24:42<21:02, 27.45s/it]

best score: -136.73
Epoch 54:


 55%|█████▌    | 55/100 [25:09<20:35, 27.45s/it]

best score: -118.82
Epoch 55:


 56%|█████▌    | 56/100 [25:37<20:08, 27.46s/it]

best score: -163.1
Epoch 56:


 57%|█████▋    | 57/100 [26:06<19:41, 27.48s/it]

best score: -127.64
Epoch 57:


 58%|█████▊    | 58/100 [26:34<19:14, 27.48s/it]

best score: -127.91
Epoch 58:


 59%|█████▉    | 59/100 [27:01<18:46, 27.48s/it]

best score: -118.91
Epoch 59:


 60%|██████    | 60/100 [27:28<18:18, 27.47s/it]

best score: -127.64
Epoch 60:


 61%|██████    | 61/100 [27:55<17:51, 27.47s/it]

best score: -127.91
Epoch 61:


 62%|██████▏   | 62/100 [28:22<17:23, 27.47s/it]

best score: -128.0
Epoch 62:


 63%|██████▎   | 63/100 [28:49<16:56, 27.46s/it]

best score: -119.0
Epoch 63:


 64%|██████▍   | 64/100 [29:17<16:28, 27.46s/it]

best score: -110.0
Epoch 64:


 65%|██████▌   | 65/100 [29:44<16:01, 27.46s/it]

best score: -101.0
Epoch 65:


 66%|██████▌   | 66/100 [30:12<15:33, 27.47s/it]

best score: -118.73
Epoch 66:


 67%|██████▋   | 67/100 [30:40<15:06, 27.47s/it]

best score: -109.91
Epoch 67:


 68%|██████▊   | 68/100 [31:07<14:39, 27.47s/it]

best score: -110.0
Epoch 68:


 69%|██████▉   | 69/100 [31:35<14:11, 27.46s/it]

best score: -109.91
Epoch 69:


 70%|███████   | 70/100 [32:02<13:43, 27.46s/it]

best score: -118.82
Epoch 70:


 71%|███████   | 71/100 [32:29<13:16, 27.46s/it]

best score: -110.0
Epoch 71:


 72%|███████▏  | 72/100 [32:56<12:48, 27.46s/it]

best score: -109.91
Epoch 72:


 73%|███████▎  | 73/100 [33:23<12:21, 27.45s/it]

best score: -101.0
Epoch 73:


 74%|███████▍  | 74/100 [33:51<11:53, 27.45s/it]

best score: -101.0
Epoch 74:


 75%|███████▌  | 75/100 [34:17<11:25, 27.44s/it]

best score: -101.0
Epoch 75:


 76%|███████▌  | 76/100 [34:45<10:58, 27.44s/it]

best score: -101.0
Epoch 76:


 77%|███████▋  | 77/100 [35:12<10:30, 27.43s/it]

best score: -109.91
Epoch 77:


 78%|███████▊  | 78/100 [35:39<10:03, 27.43s/it]

best score: -101.0
Epoch 78:


 79%|███████▉  | 79/100 [36:06<09:36, 27.43s/it]

best score: -101.0
Epoch 79:


 80%|████████  | 80/100 [36:33<09:08, 27.42s/it]

best score: -101.0
Epoch 80:


 81%|████████  | 81/100 [37:01<08:40, 27.42s/it]

best score: -101.0
Epoch 81:


 82%|████████▏ | 82/100 [37:28<08:13, 27.42s/it]

best score: -101.0
Epoch 82:


 83%|████████▎ | 83/100 [37:55<07:46, 27.42s/it]

best score: -109.91
Epoch 83:


 84%|████████▍ | 84/100 [38:22<07:18, 27.42s/it]

best score: -101.0
Epoch 84:


 85%|████████▌ | 85/100 [38:49<06:51, 27.41s/it]

best score: -101.0
Epoch 85:


 86%|████████▌ | 86/100 [39:17<06:23, 27.41s/it]

best score: -101.0
Epoch 86:


 87%|████████▋ | 87/100 [39:44<05:56, 27.41s/it]

best score: -101.0
Epoch 87:


 88%|████████▊ | 88/100 [40:11<05:28, 27.40s/it]

best score: -101.0
Epoch 88:


 89%|████████▉ | 89/100 [40:38<05:01, 27.40s/it]

best score: -101.0
Epoch 89:


 90%|█████████ | 90/100 [41:05<04:33, 27.40s/it]

best score: -101.0
Epoch 90:


 91%|█████████ | 91/100 [41:32<04:06, 27.40s/it]

best score: -101.0
Epoch 91:


 92%|█████████▏| 92/100 [42:00<03:39, 27.40s/it]

best score: -101.0
Epoch 92:


 93%|█████████▎| 93/100 [42:27<03:11, 27.40s/it]

best score: -101.0
Epoch 93:


 94%|█████████▍| 94/100 [42:55<02:44, 27.39s/it]

best score: -101.0
Epoch 94:


 95%|█████████▌| 95/100 [43:22<02:16, 27.39s/it]

best score: -101.0
Epoch 95:


 96%|█████████▌| 96/100 [43:51<01:49, 27.41s/it]

best score: -101.0
Epoch 96:


 97%|█████████▋| 97/100 [44:19<01:22, 27.41s/it]

best score: -101.0
Epoch 97:


 98%|█████████▊| 98/100 [44:46<00:54, 27.41s/it]

best score: -101.0
Epoch 98:


 99%|█████████▉| 99/100 [45:13<00:27, 27.41s/it]

best score: -101.0
Epoch 99:


100%|██████████| 100/100 [45:41<00:00, 27.41s/it]

best score: -101.0





Seminar feedback form: [link](https://docs.google.com/forms/d/e/1FAIpQLSf-08wFrEke6zKlysETYiqAjH5CRXtOKut5Q77Tr5rdVId7zA/)