# Linear World

In this notebook we implement a very simple example of the agent-environment interface used in reinforcement learning, called "linear world".

The world consists of $n$ places in a row, labelled $0, 1, \dots, n-1$, and the state of the world consists of the position where the player is located. The initial state has the player in the middle of the world:
- The (empty) world for $n=5$: `"_ _ _ _ _"`
- With the player in position $2$: `"_ _ X _ _"`

The actions that the agent can take are `LEFT` and `RIGHT`, each moving the player one place in the indicated direction. In the two outer positions ($0$ and $n-1$), both actions result in a step towards the inside:
- After action `RIGHT`: `"_ _ _ X _"`
- After action `RIGHT`: `"_ _ _ _ X"`
- After action `LEFT` or `RIGHT`: `"_ _ _ X _"`

The reward is $+1$ for an action that leaves the player in one of the outer positions, and $0$ otherwise. A possible sequence of events in this setting is:
- `"_ _ X _ _"` $S_0 = 2, A_0 = \text{"RIGHT"}$
- `"_ _ _ X _"` $R_1 = 0, S_1 = 3, A_1 = \text{"LEFT"}$
- `"_ _ X _ _"` $R_2 = 0, S_2 = 2, A_2 = \text{"LEFT"}$
- `"_ X _ _ _"` $R_3 = 0, S_3 = 1, A_3 = \text{"LEFT"}$
- `"X _ _ _ _"` $R_4 = 1, S_4 = 0, A_4 = \text{"RIGHT"}$
- `"_ X _ _ _"` $R_5 = 0, S_5 = 1, A_5 = \dots$
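
The sequence of events listed above can be replayed with a few lines of Python. This is a minimal sketch, independent of the class developed in this notebook; the helper name `next_state_and_reward` and the string action labels are assumptions made for illustration only.

```python
def next_state_and_reward(state, action, n):
    """One step in a linear world with n places (hypothetical helper)."""
    if state == 0:
        new_state = 1                 # left edge: always step inwards
    elif state == n - 1:
        new_state = n - 2             # right edge: always step inwards
    elif action == "LEFT":
        new_state = state - 1
    elif action == "RIGHT":
        new_state = state + 1
    else:
        raise ValueError(f"unknown action: {action!r}")
    # Reward 1 if the player ends up in one of the outer positions
    reward = 1 if new_state in (0, n - 1) else 0
    return new_state, reward


n = 5
state = n // 2                        # S_0 = 2
trajectory = []                       # collects the pairs (R_{t+1}, S_{t+1})
for action in ["RIGHT", "LEFT", "LEFT", "LEFT", "RIGHT"]:
    state, reward = next_state_and_reward(state, action, n)
    trajectory.append((reward, state))

print(trajectory)
```

Each pair in `trajectory` corresponds to one line of the list, from $(R_1, S_1)$ to $(R_5, S_5)$.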




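The state space is $\mathcal{S} = \{0, \dots, n-1\}$ (or $\{1, \dots, n\}$ with 1-based labels).

The action space is $\mathcal{A} = \{\text{LEFT}, \text{RIGHT}\}$, which can be encoded as $\{-1, +1\}$.

The reward space is $\mathcal{R} = \{0, 1\}$.

The probability of the new state $s'$, given the current state $s$ and the action $a$, is

$$
p(s' \mid s, a) =
\begin{cases}
1 & \text{if } s = 0 \text{ and } s' = 1 \\
1 & \text{if } s = n-1 \text{ and } s' = n-2 \\
1 & \text{if } 0 < s < n-1,\ a = \text{RIGHT} \text{ and } s' = s+1 \\
1 & \text{if } 0 < s < n-1,\ a = \text{LEFT} \text{ and } s' = s-1 \\
0 & \text{otherwise}
\end{cases}
$$

Since every transition has probability $0$ or $1$, we can equivalently write a deterministic function $f(s, a) = s'$:

$$
f(s, a) =
\begin{cases}
1 & \text{if } s = 0 \\
n-2 & \text{if } s = n-1 \\
s+1 & \text{if } 0 < s < n-1 \text{ and } a = \text{RIGHT} \\
s-1 & \text{if } 0 < s < n-1 \text{ and } a = \text{LEFT}
\end{cases}
$$

In the deterministic case, choosing an action amounts to choosing the next state. The probability of receiving the reward $r = 1$ is therefore

$$
p(r = 1 \mid s, a) =
\begin{cases}
1 & \text{if } s = 1 \text{ and } a = \text{LEFT} \\
1 & \text{if } s = n-2 \text{ and } a = \text{RIGHT} \\
0 & \text{otherwise}
\end{cases}
$$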
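
Because the dynamics are deterministic, the whole model fits in a small lookup table. The sketch below tabulates the successor state and the reward for every state-action pair with $n = 5$. The names `dynamics` and `table` are hypothetical, and actions are encoded as $-1$ (left) and $+1$ (right), following the action set given above.

```python
def dynamics(s, a, n):
    """Return (s', r) for state s and action a in {-1, +1} (hypothetical helper)."""
    if s == 0:
        s_next = 1                    # left edge: step inwards
    elif s == n - 1:
        s_next = n - 2                # right edge: step inwards
    else:
        s_next = s + a                # interior: move by -1 or +1
    r = 1 if s_next in (0, n - 1) else 0
    return s_next, r


n = 5
# Map every (state, action) pair to its unique successor state and reward
table = {(s, a): dynamics(s, a, n) for s in range(n) for a in (-1, +1)}

for (s, a), (s_next, r) in sorted(table.items()):
    print(f"s={s}, a={a:+d} -> s'={s_next}, r={r}")
```

Only two of the ten pairs are rewarded, namely $(s, a) = (1, -1)$ and $(s, a) = (n-2, +1)$, in agreement with the reward probabilities stated above.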

*<span style="color:red">Below, the parts indicated by `#??` need to be filled in!</span>*

In [1]:
# We use constants 1 and 2 to represent LEFT, RIGHT:
LEFT = 1
RIGHT = 2

In [None]:
class LinearWorld:
    def __init__(self, length):
        # Store length of world as a private attribute, so that the user
        # does not change it by accident
        self._length = length

        # Initialize state of world in the middle
        self._state = length // 2

    def step(self, action):
        # For a given action, update the state and return (state, reward)

        # Compute new state
        # (1. handle the outer two positions)
        if self._state == 0:
            self._state = 1
            # (for a non-deterministic variant: random.randint(1, self._length - 1))
        elif self._state == self._length - 1:
            self._state = self._length - 2
        # (2. change position to the left/right)
        elif action == LEFT:
            self._state -= 1
        elif action == RIGHT:
            self._state += 1
        else:
            raise ValueError('action must be LEFT or RIGHT')

        # Compute reward (depends only on the new state, not on the previous one)
        if self._state == 0 or self._state == self._length - 1:
            reward = 1
        else:
            reward = 0

        # Return state and reward
        return self._state, reward

    def reset(self):
        # Reset the position to the middle
        self._state = self._length // 2

    def showWorld(self):
        # Print a representation of the linear world
        print(str(self))

    def __str__(self):
        # (!) Advanced concept:
        # Custom string-conversion (used e.g. by `print()`)

        # "_" for every empty spot, "X" for the player, separated by spaces
        # as in the examples above. Same string as printed by .showWorld(),
        # but returned instead of printed.
        return ' '.join('X' if i == self._state else '_'
                        for i in range(self._length))


## Testing the linear world

First, we create an instance of `LinearWorld`, then we use the `.step()` method to perform actions ($A_t$) and observe the resulting state ($S_{t+1}$) and reward ($R_{t+1}$).

In [None]:
# Create a new instance of the LinearWorld class
lw = LinearWorld(7)

In [None]:
# Check the attributes `_length` and `_state` of the instance
print(lw._length)
print(lw._state)

In [None]:
# Make a step and assign the outcome to variables (state + reward)
state, reward = lw.step(RIGHT)

# Print the outcome
print(state)
print(reward)

We can use the method `.showWorld()` to visualize the events "graphically":

In [None]:
# Make a step
lw.step(RIGHT)

# Show the new state of the world
lw.showWorld()

## Two simple policies

Next, we implement two policies and see how they perform over a timespan of $T = 100$ steps:
- The random policy chooses an action uniformly at random
- The "right" policy always goes `RIGHT`

In [None]:
# We use the standard-library module `random` to choose a random action
import random 

In [None]:
# Number of steps
T = 100

In [None]:
# Run the random policy for T steps, update the total reward
lw = LinearWorld(7)
totalRandom = 0
for t in range(T):
    a = random.choice([LEFT, RIGHT])
    state, reward = lw.step(a)
    totalRandom += reward
    lw.showWorld()

In [None]:
# Check the total rewards we got
print(totalRandom)
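
The total reward of the random policy changes from run to run. For reproducible experiments, one option is a seeded generator. The sketch below is self-contained: the helpers `step` and `run_random_policy` are hypothetical stand-ins that mirror the rules of linear world, and `random.Random` is the standard-library generator class.

```python
import random

LEFT, RIGHT = 1, 2


def step(state, action, n):
    """Hypothetical stand-alone version of the linear world update."""
    if state == 0:
        state = 1
    elif state == n - 1:
        state = n - 2
    elif action == LEFT:
        state -= 1
    else:
        state += 1
    return state, int(state in (0, n - 1))


def run_random_policy(seed, n=7, T=100):
    rng = random.Random(seed)         # private generator, global state untouched
    state, total = n // 2, 0
    for _ in range(T):
        state, reward = step(state, rng.choice([LEFT, RIGHT]), n)
        total += reward
    return total


total_a = run_random_policy(seed=0)
total_b = run_random_policy(seed=0)
print(total_a, total_b)               # both runs use the same seed, so the totals agree
```

Different seeds give different totals, so comparing policies fairly usually means averaging over several seeds.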

In [None]:
# Run the "right" policy for T steps, update the total reward
lw = LinearWorld(7)
totalRight = 0
for t in range(T):
    _, reward = lw.step(RIGHT)  # we do not care about the state
    totalRight += reward
    lw.showWorld()

In [None]:
# Check the total rewards we got
print(totalRight)
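
For the "right" policy, the total can also be worked out by hand. Starting from position $3$ in a world of length $7$, the player reaches the right edge after $3$ steps and then alternates between positions $5$ and $6$, collecting a reward on every second step. The sketch below checks this count with a stand-alone loop. The variable names are illustrative only, and the closed-form expression assumes that $T$ is at least the number of steps needed to reach the edge.

```python
n, T = 7, 100

state, total = n // 2, 0
for _ in range(T):
    # "Right" policy: step right, except at the right edge, where the player steps inwards
    state = n - 2 if state == n - 1 else state + 1
    total += int(state == n - 1)

d = (n - 1) - n // 2                  # steps needed to reach the right edge the first time
expected = (T - d) // 2 + 1           # one reward at step d, then one every second step
print(total, expected)
```

For $n = 7$ and $T = 100$ both values are $49$.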