# MDP

When Inoketeniy Valdemarovich Nosochkov arrives at work, he wants to go to a bar with probability 50%, to eat with probability 10%, and to sleep with probability 20%. If he wants nothing, he keeps working. When Kesha is drinking in the bar, on 10% of visits he goes back to work and with probability 30% he goes to sleep; the rest of the time he keeps drinking. When he wakes up, he goes to eat with probability 40% and goes back to the bar to keep drinking with probability 60%. If Mr. Nosochkov has eaten, he starts working with probability 80%; if work does not pan out, he starts wanting to sleep.

Determine the probabilities that our hero is working, drinking in the bar, sleeping, or eating right now, given that whenever Inoketeniy wants something, he does it.
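
One way to read "right now" is as the long-run (stationary) distribution of the Markov chain; that reading is an assumption, not something the task states explicitly. The sketch below shows the technique on a hypothetical two-state chain (the states and numbers are made up for illustration), and the same function could be applied to the four-state transition matrix you build from the description above.

```python
# Hypothetical two-state chain used only for illustration.
STATES = ["sunny", "rainy"]
# P[i][j] = probability of moving from state i to state j; each row sums to 1.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]


def stationary_distribution(P, n_iter=1000):
    """Approximate the stationary distribution by repeatedly applying P."""
    n = len(P)
    dist = [1.0 / n] * n  # start from a uniform guess
    for _ in range(n_iter):
        # One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


pi = stationary_distribution(P)
for name, prob in zip(STATES, pi):
    print(f"{name}: {prob:.4f}")
```

Power iteration is used here because it needs no linear algebra library; solving the linear system `pi = pi P` together with `sum(pi) = 1` would give the same result.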

# Practice: gym interface and crossentropy method

_Reference:_ This notebook is based on Practical RL [week01](https://github.com/yandexdataschool/Practical_RL/tree/master/week01_intro)

In [None]:
import sys, os

if "google.colab" in sys.modules and not os.path.exists(".setup_complete"):
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/setup_colab.sh -O- | bash
    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ["DISPLAY"] = ":1"

## OpenAI Gym

We're going to spend the next several weeks learning algorithms that solve decision processes, so we need some interesting decision problems to test them on.

That's where OpenAI Gym comes into play. It's a Python library that wraps many classical decision problems, including robot control, video games and board games.

So here's how it works:

In [None]:
import gym
gym.__version__

In [None]:
import matplotlib.pyplot as plt

env = gym.make("MountainCar-v0")
env.reset()

print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
plt.imshow(env.render("rgb_array"));

### Gym interface

The three main methods of an environment are
* `reset()`: reset environment to the initial state, and _return it_
* `render()`: show current environment state (a more colorful version)
* `step(a)`: commit action `a` and return `(new_state, reward, is_done, info)`
  * `new_state`: the new state right after committing the action `a`
  * `reward`: a number representing your reward for committing action `a`
  * `is_done`: True if the MDP has just finished, False if still in progress
  * `info`: some auxiliary stuff about what just happened. For now, ignore it.
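
To make the interface concrete without depending on any particular Gym version, here is a minimal sketch of a hand-written environment that follows the same `reset()` / `step(a)` convention with the four-value return described above. `CorridorEnv` and its dynamics are hypothetical and not part of Gym. Note that newer Gym releases and Gymnasium return five values from `step` instead of four.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell of a corridor."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Put the agent back at the start and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clip the move.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        is_done = self.position == self.length - 1
        reward = -1.0  # constant step penalty, so shorter episodes score higher
        info = {}
        return self.position, reward, is_done, info


env_demo = CorridorEnv()
state = env_demo.reset()
total_reward, is_done = 0.0, False
while not is_done:
    state, reward, is_done, info = env_demo.step(1)  # always go right
    total_reward += reward
print("final state:", state, "total reward:", total_reward)
```

The loop at the end has the same shape as the interaction loops used in the rest of this notebook: reset once, then call `step` until `is_done` is true.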

In [None]:
state = env.reset()
print("initial state:", state)

In MountainCar, the observation is just two numbers: car position and velocity.

Let's take action 2, which stands for "go right".

In [None]:
print("taking action 2 (right)")
new_state, reward, is_done, _ = env.step(2)

print("new state:", new_state)
print("reward:", reward)
print("is game over?:", is_done)

As you can see, the car has moved to the right slightly (around 0.0005).

### Play with it

Below is the code that drives the car to the right. However, if you simply use the default policy, the car will not reach the flag at the far right due to gravity.

__Your task__ is to fix it. Find a strategy that reaches the flag. 

You are not required to build any sophisticated algorithms for now, and you definitely don't need to know any reinforcement learning for this. Feel free to hard-code :)

In [None]:
actions = {"left": 0, "stop": 1, "right": 2}

def policy(state, time_step):
    # Write the code for your policy here. You can use the current state
    # (a tuple of position and velocity), the current time step, or both,
    # if you want.
    
    # Your code here

    return 

In [None]:
from IPython.display import clear_output, display

state = env.reset()
time_limit = 250
is_done = False
for time_step in range(time_limit):
    # Choose an action based on your policy and pass it to the environment.
    # We don't do anything with the reward here because MountainCar is a very
    # simple environment, and the reward is a constant -1 (meaning that your
    # goal is to end the episode as quickly as possible).
    action = policy(state, time_step)
    state, reward, is_done, _ = env.step(action)

    # Draw game image on display.
    clear_output(wait=True)
    plt.imshow(env.render("rgb_array"))
    plt.show()

    if is_done:
        print("Well done!")
        break
if not is_done:
    print("Time limit exceeded. Try again.")

## Crossentropy method

Now that we know how `gym` works, let's try to solve a more complicated problem using the crossentropy method.

In [None]:
env = gym.make("Taxi-v3")
env.reset()
env.render()

`Taxi-v3` is a much more sophisticated environment, with many more possible states and more actions at our disposal.

In [None]:
n_states, n_actions = env.observation_space.n, env.action_space.n
print(f"n_states={n_states}, n_actions={n_actions}")

That's definitely a lot. Way too much to hard-code as we did with the previous problem. Let's use the crossentropy method on this one.

### Create stochastic policy

This time our policy should be a probability distribution.

```policy[s, a] = P(take action a | in state s)```

Since we still use integer state and action representations, you can use a 2-dimensional array to represent the policy.

Please initialize the policy __uniformly__, that is, the probabilities of all actions should be equal.
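
As an illustration of how such a table is read, the sketch below samples actions from a small hand-written policy. The table `toy_policy`, its size, and the helper `sample_action` are hypothetical and exist only for this example; they are not the policy you are asked to initialize.

```python
import numpy as np

# Hypothetical policy table for 3 states and 2 actions; each row sums to 1.
toy_policy = np.array([
    [0.5, 0.5],  # state 0: both actions equally likely
    [0.9, 0.1],  # state 1: strongly prefers action 0
    [0.0, 1.0],  # state 2: always takes action 1
])
assert np.allclose(toy_policy.sum(axis=1), 1.0)

rng = np.random.default_rng(0)


def sample_action(policy_table, state):
    """Draw one action index according to the row of probabilities for `state`."""
    n_actions = policy_table.shape[1]
    return int(rng.choice(n_actions, p=policy_table[state]))


samples = [sample_action(toy_policy, 2) for _ in range(5)]
print(samples)  # state 2 always yields action 1
```

Row `s` of the table is the whole distribution `P(a | s)`, so sampling an action only ever needs one row.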

In [None]:
import numpy as np

def initialize_policy(n_states, n_actions):
    # Create an array to store action probabilities
    
    # Your code here
    
    return policy

In [None]:
policy = initialize_policy(n_states, n_actions)
policy

### Play the game

Let's play the game just like before; however, this time we will also record states, actions and rewards to use them in the training loop.

In [None]:
def generate_session(env, policy, time_limit=10**4):
    state = env.reset()
    states, actions = [], []
    total_reward = 0.
    for _ in range(time_limit):
        # Choose action based on policy and take it.
        # Record information we just got from the environment.
        # Your code here



        if is_done:
            break

    return states, actions, total_reward

In [None]:
states, actions, reward = generate_session(env, policy)

Let's see the initial reward distribution for our "naive" policy.

In [None]:
sample_rewards = [generate_session(env, policy, time_limit=1000)[2] for _ in range(200)]
plt.hist(sample_rewards, bins=20)
plt.vlines([np.percentile(sample_rewards, 50)], [0], [100], label="50th percentile", color="green")
plt.vlines([np.percentile(sample_rewards, 90)], [0], [100], label="90th percentile", color="red")
plt.legend();

In [None]:
np.percentile(sample_rewards, 90)

### Crossentropy method step

In [None]:
def select_elites(states_batch, actions_batch, rewards_batch, percentile):
    """
    Select states and actions from games that have rewards >= percentile.

    Compute minimum reward for session to be elite and choose elite states
    and actions based on this threshold.

    Note that states_batch and actions_batch are both 2d lists, i.e. lists
    containing lists of states and actions from each session in batch.
    """
    
    elite_states = []
    elite_actions = []
    # Your code here


    return elite_states, elite_actions
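
The docstring above talks about a minimum reward for a session to be elite. A minimal sketch of just that thresholding step, on made-up rewards, might look like this; the names `toy_rewards` and `elite_mask` are hypothetical, and collecting the matching states and actions is still left to you.

```python
import numpy as np

# Hypothetical total rewards of four sessions.
toy_rewards = [3.0, 1.0, 4.0, 2.0]

# The 50th percentile of these rewards is the cut-off for being "elite".
threshold = np.percentile(toy_rewards, 50)

# A session is elite if its reward is at or above the threshold.
elite_mask = [r >= threshold for r in toy_rewards]

print("threshold:", threshold)
print("elite sessions:", elite_mask)
```

With a percentile of 50, half of the sessions here pass the cut-off; raising the percentile makes the selection stricter.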

In [None]:
def get_new_policy(elite_states, elite_actions):
    """
    Given a list of elite states/actions from select_elites, return a new
    policy where each action probability is proportional to

        policy[s, a] ~ #[occurrences of s and a in elite states/actions]

    Don't forget to normalize the policy to get valid probabilities.
    For states that you never visited, use a uniform distribution.
    """

    new_policy = np.ones([n_states, n_actions])

    # Set probabilities for actions given elite states & actions.
    # Your code here

    
    return new_policy
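
The rule `policy[s, a] ~ #[occurrences of s and a]` from the docstring can be illustrated on a tiny made-up example. This is one possible way to count and normalize, not necessarily the intended solution; all `toy_*` names and sizes are hypothetical.

```python
import numpy as np

toy_n_states, toy_n_actions = 3, 2
# Hypothetical elite (state, action) pairs; state 2 never appears.
toy_states = [0, 0, 0, 1]
toy_actions = [1, 1, 0, 0]

# Count how often each (state, action) pair occurs among the elites.
counts = np.zeros((toy_n_states, toy_n_actions))
np.add.at(counts, (toy_states, toy_actions), 1)

# Normalize visited rows; unvisited rows keep a uniform distribution.
row_sums = counts.sum(axis=1, keepdims=True)
visited = row_sums[:, 0] > 0
toy_new_policy = np.full((toy_n_states, toy_n_actions), 1.0 / toy_n_actions)
toy_new_policy[visited] = counts[visited] / row_sums[visited]

print(toy_new_policy)
```

State 0 was seen three times (twice with action 1), so its row becomes `[1/3, 2/3]`, while the never-visited state 2 stays at `[0.5, 0.5]`.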

### Training loop

Generate sessions, select N best and fit to those.

In [None]:
def show_progress(rewards_batch, log, percentile, reward_range=[-990, +10]):
    """
    A convenience function that displays training progress. 
    No cool math here, just charts.
    """

    mean_reward = np.mean(rewards_batch)
    threshold = np.percentile(rewards_batch, percentile)
    log.append([mean_reward, threshold])
    
    plt.figure(figsize=[8, 4])
    plt.subplot(1, 2, 1)

    mean_rewards = [mean_reward for mean_reward, threshold in log]
    reward_thresholds = [threshold for mean_reward, threshold in log]
    plt.plot(mean_rewards, label="Mean rewards")
    plt.plot(reward_thresholds, label="Reward thresholds")
    plt.legend()
    plt.grid()

    plt.subplot(1, 2, 2)
    plt.hist(rewards_batch, range=reward_range)
    plt.vlines(
        [np.percentile(rewards_batch, percentile)],
        ymin=[0],
        ymax=[100],
        label="percentile",
        color="red",
    )
    plt.legend()
    plt.grid()
    clear_output(wait=True)
    print(f"mean reward = {mean_reward:.3f}, threshold={threshold:.3f}")
    plt.show()

In [None]:
# reset policy
policy = initialize_policy(n_states, n_actions)

In [None]:
n_sessions = 250     # sample this many sessions
percentile = 70      # sessions with rewards at or above this percentile are elites
learning_rate = 0.8  # how quickly the policy is updated, on a scale from 0 to 1

log = []

for i in range(40):
    # Generate a list of n_sessions new sessions, select elites and compute
    # new policy based on them. After that update the existing policy wrt
    # learning rate.
    sessions = [generate_session(env, policy) for _ in range(n_sessions)]
    states_batch = [session_states for session_states, session_actions, session_reward in sessions]
    actions_batch = [session_actions for session_states, session_actions, session_reward in sessions]
    rewards_batch = [session_reward for session_states, session_actions, session_reward in sessions]

    # Your code here
    

    # display results on chart
    show_progress(rewards_batch, log, percentile)
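
The comment in the loop above asks you to update the existing policy with respect to the learning rate. One common way to do this, assumed here rather than prescribed by the notebook, is a convex combination of the old policy and the policy fitted to the elites. The arrays below are hypothetical toy values.

```python
import numpy as np

old_policy = np.array([[0.5, 0.5], [0.5, 0.5]])
fitted_policy = np.array([[1.0, 0.0], [0.2, 0.8]])  # hypothetical fit to elites
toy_learning_rate = 0.8

# Convex combination: rows still sum to 1 as long as both inputs do.
blended = toy_learning_rate * fitted_policy + (1 - toy_learning_rate) * old_policy

print(blended)
```

A learning rate of 1 would replace the old policy outright, while smaller values keep some probability on actions that did not show up among the current elites.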

### Analysing the results

You may have noticed that on the taxi problem the mean reward quickly climbs from very low values to a near-optimal score and then drops back. This is caused (at least in part) by the innate randomness of the environment: the starting positions of the passenger and the driver change from episode to episode.

In that case, if the crossentropy policy has failed to learn how to win from one particular starting point, it will simply discard that starting point, because no sessions that begin there will make it into the "elites".

To mitigate that problem, you can either reduce the threshold for elite sessions (duct tape way) or change the way you evaluate strategy (theoretically correct way). For each starting state, you can sample an action randomly, and then evaluate this action by running several games starting from it and averaging the total reward. Choosing elite sessions with this kind of sampling (where each session's reward is counted as the average of the rewards of all sessions with the same starting state and action) should improve the performance of your policy.
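
The averaging idea from the paragraph above can be sketched on made-up data as follows. The tuple layout `(start_state, first_action, total_reward)` and all values are hypothetical; in a real run they would come from your recorded sessions.

```python
from collections import defaultdict

# Hypothetical sessions as (start_state, first_action, total_reward).
toy_sessions = [
    (0, 1, -10.0),
    (0, 1, -30.0),
    (0, 2, 5.0),
    (7, 1, -200.0),
    (7, 1, -100.0),
]

# Group total rewards by the (start state, first action) pair.
grouped = defaultdict(list)
for start_state, first_action, total_reward in toy_sessions:
    grouped[(start_state, first_action)].append(total_reward)

# Mean reward of all sessions sharing a start state and first action.
mean_reward = {key: sum(vals) / len(vals) for key, vals in grouped.items()}

# Each session is scored by its group's mean instead of its own reward.
adjusted_rewards = [mean_reward[(s, a)] for s, a, _ in toy_sessions]
print(adjusted_rewards)
```

Elite selection would then use `adjusted_rewards` in place of the raw per-session rewards, so a single lucky or unlucky rollout has less influence.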

## Digging deeper: approximate crossentropy with neural networks

In this section we'll extend your CEM implementation with neural networks! You will train a multi-layer neural network to solve simple continuous state space games.

![img](https://watanimg.elwatannews.com/old_news_images/large/249765_Large_20140709045740_11.jpg)

In [None]:
# .env is to remove auto-assigned time limit wrapper
env = gym.make("CartPole-v0").env

env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape[0]

print("state vector dim =", state_dim)
print("n_actions =", n_actions)
plt.imshow(env.render("rgb_array"));

Here, just like in `MountainCar-v0`, we control a cart that we can move right or left. However, the goal is different: in this environment we want to keep the pole attached to the top of the cart from falling for as long as possible.

### Neural Network Policy

For this assignment we'll utilize the simplified neural network implementation from [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). Here's what you'll need:
* `agent.partial_fit(states, actions)` - make a single training pass over the data to increase the probability of provided `actions` in provided `states`
* `agent.predict_proba(states)` - predict probabilities of all actions, a matrix of shape `[len(states), n_actions]`

In [None]:
from sklearn.neural_network import MLPClassifier

agent = MLPClassifier(
    hidden_layer_sizes=(20, 20),
    activation="tanh",
)

# initialize agent to the dimension of state space and number of actions
agent.partial_fit([env.reset()] * n_actions, range(n_actions), range(n_actions))

Despite the apparent differences, you will find the training procedure for such an agent to be very similar to the one we used in the previous part. We won't even need to rewrite most of our helper functions! However, one thing that has changed is the way we get the action probabilities. So let's adapt our `generate_session` function to this new agent-based policy.

In [None]:
def generate_session(env, agent, time_limit=10**4):
    state = env.reset()
    states, actions = [], []
    total_reward = 0.
    for _ in range(time_limit):
        # Use agent to predict a vector of action probabilities for current 
        # state and use the probabilities you predicted to pick an action.
        # Sample actions, don't just take the most likely one!

        action = np.random.choice  # Your code here
        
        states.append(state)
        actions.append(action)
        state, reward, is_done, _ = env.step(action)
        total_reward += reward

        # Record information we just got from the environment.
        
        if is_done:
            break

    return states, actions, total_reward

In [None]:
states, actions, reward = generate_session(env, agent, time_limit=100)
print("states:", np.stack(states))
print("actions:", actions)
print("reward:", reward)

### Training loop

In [None]:
n_sessions = 100
percentile = 70

log = []

for _ in range(100):
    # Generate new sessions, select elites and update our agent.

    sessions = [generate_session(env, agent) for _ in range(n_sessions)]
    states_batch = [session_states for session_states, session_actions, session_reward in sessions]
    actions_batch = [session_actions for session_states, session_actions, session_reward in sessions]
    rewards_batch = [session_reward for session_states, session_actions, session_reward in sessions]

    elite_states, elite_actions = select_elites(states_batch, actions_batch, rewards_batch, percentile)

    # Your code here

    show_progress(
        rewards_batch, 
        log, 
        percentile, 
        reward_range=[0, np.max(rewards_batch)]
    )

    if np.mean(rewards_batch) > 190:
        print("You Win! You may stop training now via KeyboardInterrupt.")
        break

### Analysing the results

In [None]:
from IPython.display import clear_output, display
total_reward = 0
total_steps = 0
# Your code here