
# 3 Custom Environments
In this task you are asked to create an environment for a reinforcement learning agent. It is common to build environments against the OpenAI Gym interface. The interface defines a good baseline for what is needed to train an RL agent and makes it easy to try out different environments with the same algorithm.
If you need more information, take a look at the documentation: https://www.gymlibrary.ml/.
If you need some examples, you can find the implementations of all official environments on GitHub: https://github.com/openai/gym/tree/master/gym/envs

In [1]:
pip install gym==0.21.0



In [2]:
import gym
import numpy as np
import time
from IPython.display import display, clear_output

## 3.1 The Environment
An OpenAI Gym environment consists of at least three methods: `__init__`, the constructor, which sets all the necessary values; a `step` function, which describes the behaviour of the environment; and a `reset` function, which resets the environment to a starting state. In addition, a `render` function is usually provided to visualize the behaviour of the environment.

#### \__init__:
The constructor of the environment defines all the necessary variables. To set the bounds of our environment we have to define the `action_space` and the `observation_space`. The `gym.spaces` module contains the necessary classes to do so. In our case we use `gym.spaces.Discrete`, because we only want to handle discrete values.
The Discrete space works a bit like a range, with some extra methods. A linear representation of the state is helpful for tabular learning, because it makes creating a Q-table really easy.
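To illustrate why a linear state index is convenient, here is a minimal sketch that builds a Q-table with plain Python lists. The names `n_states`, `n_actions`, `q_table` and `greedy_action` are illustrative and not part of gym. With gym, the two sizes would typically be read from the `n` attribute of the two Discrete spaces.

```python
# Sizes as they would be reported by Discrete(100).n and Discrete(4).n
n_states = 100
n_actions = 4

# One row per state, one column per action, all values start at zero
q_table = [[0.0] * n_actions for _ in range(n_states)]

# A linear state is directly usable as a row index
state, action = 42, 3
q_table[state][action] = 1.5


def greedy_action(q_table, state):
    """Return the index of the highest-valued action for a state."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)
```

With a two-dimensional state you would need either a three-dimensional table or a dictionary keyed by position, which is why the linear form is the simpler choice here.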

#### step:
The step method takes an action and returns a tuple of the form (observation, reward, done, info). The observation is the state that results from taking the action. The reward is the reward handed out for taking the given action in the previous state. The done variable is a boolean and indicates whether the episode has come to an end. The info value is a dictionary for optional diagnostic information and may be empty.
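The following sketch shows how a caller typically consumes the return value of `step`. It assumes gym 0.21, the version installed above, where `step` also returns an `info` dictionary as a fourth element. `CountdownEnv` and `run_episode` are hypothetical names made up for this illustration, and the toy class deliberately does not inherit from `gym.Env` so that the example runs on its own.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment: the state counts down from 3 to 0."""

    def reset(self):
        self.state = 3
        return self.state

    def step(self, action):
        # The action is ignored here; every step moves one state closer to 0
        self.state -= 1
        done = self.state == 0
        reward = 10 if done else -1
        info = {}  # optional diagnostics, empty in this sketch
        return self.state, reward, done, info


def run_episode(env):
    """Run one episode with random actions and return the summed reward."""
    env.reset()
    total_reward, done = 0, False
    while not done:
        action = random.randrange(4)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward
```

The loop has the same structure as the random-agent test further down: sample an action, call `step`, and stop once `done` is true.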

#### reset:
The reset method resets the environment to a start state. It returns the new state of the environment.

#### render:
The render method visualizes the state of the environment. There are many different ways to do so, e.g. creating a visual representation with vector graphics or printing to the terminal.
We want to focus on the easiest way, which is printing the state. Find a good and easily printable representation of the internal state (e.g. a numpy array) and print it. To print over the previous output, you can call the `clear_output` function (imported above from `IPython.display`) before you print the state.
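One possible printable representation is a block of text with one character per cell. The sketch below is only an illustration: `grid_to_text` and its parameters are hypothetical names, and the symbols `A`, `T` and `.` are arbitrary choices.

```python
def grid_to_text(agent, target, size=10):
    """Build a text picture of a square grid.

    agent and target are (row, column) tuples. The target is drawn
    with priority, so it hides the agent when both share a cell.
    """
    rows = []
    for x in range(size):
        row = ""
        for y in range(size):
            if (x, y) == target:
                row += "T"
            elif (x, y) == agent:
                row += "A"
            else:
                row += "."
        rows.append(row)
    return "\n".join(rows)


# Small 3x3 example: agent in the top row, target in the bottom right corner
print(grid_to_text(agent=(0, 1), target=(2, 2), size=3))
```

Building the whole string first and printing it once keeps the output from flickering when the previous output is cleared right before each print.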

### Encoding and decoding
These functions are not required for a gym environment. However, it might be useful to write some functions that encode and decode the state, converting between its linearized form and a two-dimensional form.

#### decode_action
returns the move that corresponds to the given action index

#### decode_state
returns a 2D representation of the linear state

#### encode_state
returns a linear representation of the 2D state.
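As a minimal sketch of the encoding, assuming a square grid stored row by row: the linear index is `row * size + column`, and decoding is the matching integer division and remainder. The standalone functions and the `size` parameter below are illustrative; in the environment they would be methods of the class.

```python
SIZE = 10  # assumed edge length of the square grid


def encode_state(position, size=SIZE):
    """Turn a (row, column) position into a single integer index."""
    row, col = position
    return row * size + col


def decode_state(state, size=SIZE):
    """Turn an integer index back into a (row, column) position."""
    return divmod(state, size)


# Encoding followed by decoding gives back the original position
index = encode_state((5, 5))
position = decode_state(index)
```

A quick way to check such a pair of functions is to verify that decoding and re-encoding every index from 0 to 99 returns the same index.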


### Task 3.1.1
- Create a two dimensional, discrete environment of the size 10x10.
- Each episode the agent should start at a random position, while the target always stays at the same position.
- The agent should be able to move in all 4 directions. If the agent hits a wall, it should do nothing.
- An episode ends if the agent reaches the target.
- Reaching the target results in a reward of 10, while every other action should give a small negative reward.

In [9]:
class CustomEnv(gym.Env):
    metadata = {'render.modes': ['human']}
    def __init__(self):
        super(CustomEnv, self).__init__()
        # Write a constructor for your environment
        # Define the action_space and observation_space
        # Position your agent and the target in the environment
        self.action_space = gym.spaces.Discrete(4)
        self.observation_space = gym.spaces.Discrete(100)
        self._agent_location = 0, 0
        
    def step(self, action):
        ax, ay = self.decode_action(action)
        sx, sy = self._agent_location
        sx += ax
        sy += ay
        # Keep the agent inside the 10x10 grid (valid indices are 0..9)
        if sx < 0:
            sx = 0
        if sy < 0:
            sy = 0
        if sx >= 10:
            sx = 9
        if sy >= 10:
            sy = 9
        self._agent_location = sx, sy
        
        done = sx == 5 and sy == 5
        reward = 10 if done else -1
        # The fourth element is the info dict, which is empty here
        return (self.encode_state((sx, sy)), reward, done, {})
     
    def reset(self):
        self._agent_location = np.random.randint(0, 10), np.random.randint(0, 10)
        return self.encode_state(self._agent_location)
        
    def render(self, mode='human'):
        sx, sy = self._agent_location
        for x in range(0, 10):
            for y in range(0, 10):
                if x == 5 and y == 5:
                    print("T", end='')
                elif x == sx and y == sy:
                    print("A", end='')
                else:
                    print(" ", end='')
            print("")
    
    def decode_action(self, action):
        # Map the action index to a (dx, dy) move: up, down, left, right
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        return moves[action]

    def decode_state(self, state):
        return state // 10, state % 10
    
    def encode_state(self, state):
        a, b = state
        return a * 10 + b
    

## 3.2 Test with a random agent
The following cell allows you to test your environment with a random agent.

In [10]:
env = CustomEnv()
env.reset()  # start the episode at a random position
done = False
while not done:
    a = env.action_space.sample()
    _,_, done,_ = env.step(a)
    clear_output(wait=True)
    env.render()
    time.sleep(0.1)

          
          
          
          
          
     T    
          
          
          
          


## Task 3.3 Test with a Q-Learning Agent
In the previous task we wrote an agent that used the SARSA algorithm. Now we want to use a similar algorithm, Q-Learning, to solve your own custom environment. Also visualise your training progress (cumulative rewards over time).

The main difference between SARSA and Q-Learning is the way the Q-Values are calculated. Therefore, you can recycle most of your code.
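As a hedged sketch of that difference, the two functions below compute the update target for each algorithm on a Q-table stored as a list of rows. All names (`sarsa_target`, `q_learning_target`, `update`, `alpha`, `gamma`) are illustrative and not taken from the previous task. SARSA bootstraps from the action that the policy actually chose in the next state, while Q-Learning bootstraps from the best action in the next state.

```python
def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy target: uses the action actually taken next."""
    return reward + gamma * q_table[next_state][next_action]


def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy target: uses the best action in the next state."""
    return reward + gamma * max(q_table[next_state])


def update(q_table, state, action, target, alpha):
    """Move the stored Q-value a fraction alpha towards the target."""
    q_table[state][action] += alpha * (target - q_table[state][action])


# Tiny example with 2 states and 2 actions
q = [[0.0, 0.0], [1.0, 2.0]]
update(q, 0, 0, q_learning_target(q, -1, 1, gamma=0.5), alpha=0.5)
```

For a transition that ends the episode, the bootstrap term is usually dropped, so the target is just the reward. That case is left out of the sketch.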

In [None]:
# your code here

### Task 3.3.1 Size concerns for Tabular RL:
The table for learning our simple environment currently has the size 100x4, since we have 100 possible states and 4 actions. How much bigger would the table get if we allowed the target to be placed anywhere?

In [None]:
#Your Answer here