# Basic Reinforcement Learning Implementation

**Capstone Project:** To Study the Methodologies of Reinforcement Learning and to Develop an Understanding of Problem Representation in Code<br>
**Author:** Pranav Panchal<br>
**Notebook:** 1 of 3<br>
**Next Notebook:** Exploration of OpenAI Gym<br>

### Table of contents
1. [What is Machine Learning?](#What-is-Machine-Learning?)
2. [What is Reinforcement Learning?](#What-is-Reinforcement-Learning?)
3. [Basic RL Implementation](#Basic-RL-Implementation)
4. [Project Statement](#Project-Statement)

### What is Machine Learning?

Machine learning is a field of computer science that builds models which learn a behavior from a dataset and can then perform a set of tasks. Machine learning models can be broadly divided into three categories: supervised learning, unsupervised learning, and reinforcement learning.

<img src = "https://www.mathworks.com/discovery/reinforcement-learning/_jcr_content/mainParsys3/discoverysubsection/mainParsys/image.adapt.full.medium.png/1663235165820.png" width = 600>
<center><i>(Image Source: <a>https://www.mathworks.com/discovery/reinforcement-learning/_jcr_content/mainParsys3/discoverysubsection/mainParsys/image.adapt.full.medium.png/1663235165820.png</a>)</i></center>

This notebook gives the most basic implementation of a reinforcement learning algorithm.

### What is Reinforcement Learning?

In reinforcement learning, an `agent` is introduced to an environment that has a set of rules and objectives. The agent is initially unaware of these rules. It can, however, experience the environment, i.e. observe the `state` of the environment. Based on the state, the agent must take an `action` from a predetermined set of `actions`. This action in turn changes the `state` of the environment, and the cycle continues.

An `action` may sometimes lead to a positive outcome (the target objective) and at other times to a negative one. The environment is coded so that a `reward` is handed to the agent when a positive outcome occurs, and the agent tries to maximise this `reward`. Conversely, a `penalty` is incurred when the outcome is undesired.

Eventually, the `agent` learns its environment and is able to perform the task efficiently.
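
The observe, act, reward cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CoinFlipEnvironment`, `run_episode`, and the matching rule for the reward are hypothetical names and choices, not part of this notebook's environment.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical toy environment: the state is a coin face, 0 or 1."""

    def __init__(self, max_steps=5):
        self.steps_left = max_steps
        self.state = random.randint(0, 1)

    def observe(self):
        # The agent can see the current state
        return self.state

    def step(self, action):
        # Reward (+1) when the action matches the state, penalty (-1) otherwise
        reward = 1 if action == self.state else -1
        # The environment then moves to a new state
        self.state = random.randint(0, 1)
        self.steps_left -= 1
        return reward

    def is_done(self):
        return self.steps_left == 0


def run_episode(env, policy):
    """Repeat the observe -> act -> reward cycle until the episode ends."""
    total_reward = 0
    while not env.is_done():
        state = env.observe()
        action = policy(state)
        total_reward += env.step(action)
    return total_reward


# A policy that copies the observed state is rewarded on every step
print(run_episode(CoinFlipEnvironment(max_steps=5), policy=lambda state: state))  # prints 5
```

A policy that always picks the opposite face, `lambda state: 1 - state`, would collect the penalty on every step instead, which is the signal a learning agent would use to change its behaviour.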

<img src = "https://hub.packtpub.com/wp-content/uploads/2019/12/reinforcement-learning-768x626.png" width = 600>
<center><i>(Image Source: <a>https://hub.packtpub.com/wp-content/uploads/2019/12/reinforcement-learning-768x626.png</a>)</i></center>

### Basic RL Implementation

Let's create a simple environment.

In [1]:
# Importing dependencies
import random
import numpy as np
from typing import List

np.random.seed(42)

In [2]:
# Defining the sample environment
class SampleEnvironment:
    '''
    A class to represent a simple environment.
    '''
    def __init__(self):
        '''
        Initialises the environment.
        Initialises the variable steps_left with an integer value.

        The environment state is represented by an array of 3 integers from 0 to 8
        (np.random.randint excludes its upper bound of 9) and the environment is
        designed to reward the agent when it takes a step.

        The reward is 0 if the integer at index 2 of the state is even and 1 when it is odd.

        Example:
            State: [4 8 1]   Reward: 1
            State: [6 5 5]   Reward: 1
            State: [8 7 6]   Reward: 0
        '''
        self.steps_left = 10
        
    def get_observation(self):
        '''
        INPUT: none
        OUTPUT: returns the state of the environment

        Returns an array depicting the state of the environment.
        '''
        state = np.random.randint(0, 9, 3)
        return state
    
    def get_actions(self, state):
        '''
        INPUT: state
        OUTPUT: list of possible actions

        Returns an array of possible actions, generated as the range from 0 to state[1] inclusive.

        Example:
            INPUT: [4 8 1]   OUTPUT: [0 1 2 3 4 5 6 7 8]
            INPUT: [6 5 5]   OUTPUT: [0 1 2 3 4 5]
            INPUT: [8 7 6]   OUTPUT: [0 1 2 3 4 5 6 7]
        '''
        list_of_actions = np.arange(0, state[1]+1, 1)
        return list_of_actions
    
    def is_done(self):
        '''
        INPUT: none
        OUTPUT: bool

        Checks whether the steps_left variable has value 0; returns True if it is 0, else returns False.
        '''
        # Checks whether the number of steps are over
        if self.steps_left == 0:
            print("Game is over")
            return True
        else:
            return False
    
    def action(self, state):
        '''
        INPUT: state
        OUTPUT: int, (0 or 1)

        Take an action step within an environment.

        Example:
        INPUT: [4, 8, 1]   OUTPUT: 1
        INPUT: [6, 5, 5]   OUTPUT: 1
        INPUT: [8, 7, 6]   OUTPUT: 0
        '''
        # Reduces the steps_left by 1
        self.steps_left -= 1
        
        # Checks whether the integer at index 2 of state is even or odd
        if state[2] % 2 == 0:
            return 0
        else:
            return 1
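
The two NumPy calls used in this class both exclude their upper bound: `np.random.randint(0, 9, 3)` yields values from 0 to 8, and `np.arange(0, state[1]+1, 1)` needs the `+1` so that `state[1]` itself is a valid action. A minimal check, assuming a separate `np.random.RandomState` generator so the notebook's global seed is left untouched:

```python
import numpy as np

# Local generator with the same randint semantics as np.random.randint
rng = np.random.RandomState(0)

# The upper bound 9 is exclusive, so no sample can exceed 8
samples = rng.randint(0, 9, size=1000)
print(samples.min() >= 0, samples.max() <= 8)  # prints True True

# np.arange also stops before its end value, hence state[1] + 1
state = np.array([6, 5, 5])
actions = np.arange(0, state[1] + 1, 1)
print(actions)  # prints [0 1 2 3 4 5]
```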

Next, we will create the agent which can interact with the above environment.

In [3]:
# Define the agent class
class Agent:
    def __init__(self):
        '''
            Initialises the agent and initialises the total reward at 0.
        '''
        self.total_reward = 0.0
        
    def step(self, env:SampleEnvironment):
        '''
            INPUT: sample environment
            OUTPUT: none (the reward is added to self.total_reward)
            
            Defines a step taken by the agent in the environment.
        '''
        # Get the current state of the environment
        state = env.get_observation()
        
        # Get the list of possible actions for the given state
        actions = env.get_actions(state)
        
        # Get the reward for the given state
        reward = env.action(state)
        
        # Print the variables
        print(f"State: {state}\tReward: {reward}\tList of actions: {actions}")
        
        # Calculate the total reward
        self.total_reward += reward

Now that we have written the code for both the environment and the agent, we can initialise both and check the result.

In [4]:
# Initialise the environment
env = SampleEnvironment()

# Initialise the agent
agent = Agent()

# Take a step in the environment till the done condition is met
while not env.is_done():
    agent.step(env)

# Print the total reward
print(f"Total reward got: {agent.total_reward}")

State: [6 3 7]	Reward: 1	List of actions: [0 1 2 3]
State: [4 6 2]	Reward: 0	List of actions: [0 1 2 3 4 5 6]
State: [6 7 4]	Reward: 0	List of actions: [0 1 2 3 4 5 6 7]
State: [3 7 7]	Reward: 1	List of actions: [0 1 2 3 4 5 6 7]
State: [2 5 4]	Reward: 0	List of actions: [0 1 2 3 4 5]
State: [1 7 5]	Reward: 1	List of actions: [0 1 2 3 4 5 6 7]
State: [1 4 0]	Reward: 0	List of actions: [0 1 2 3 4]
State: [5 8 0]	Reward: 0	List of actions: [0 1 2 3 4 5 6 7 8]
State: [2 6 3]	Reward: 1	List of actions: [0 1 2 3 4 5 6]
State: [8 2 4]	Reward: 0	List of actions: [0 1 2]
Game is over
Total reward got: 4.0
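
As a rough sanity check on this output: `state[2]` is drawn uniformly from the nine values 0 to 8, of which four (1, 3, 5, 7) are odd, so each step yields a reward of 1 with probability 4/9. Over the ten steps of an episode:

$$
\mathbb{E}[\text{total reward}] = 10 \times \frac{4}{9} = \frac{40}{9} \approx 4.44
$$

The observed total of 4.0 is in line with this expected value.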


Though this is not a very sophisticated `environment-agent` model, it captures the basic mechanism of reinforcement learning.

Our `agent` is able to observe the `state` of the environment, and the set of `actions` available to the agent depends on the `state` of the environment. (Though we receive the list of `actions`, we have not coded any separate interaction based on it; our agent is only able to step forward within the environment.) The `reward` is a consequence of the `action` taken on the current `state`.
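
One way to make the list of `actions` matter is to let the agent pick from it and have the environment score that choice. The sketch below is an assumption, not the notebook's code: `ActionEnvironment`, `RandomAgent`, the two-argument `action(state, chosen)` method, and the parity-matching reward rule are all hypothetical. Each class keeps its own random generator so the global seed set earlier is not disturbed.

```python
import numpy as np


class ActionEnvironment:
    """Hypothetical variant of SampleEnvironment whose reward depends on the chosen action."""

    def __init__(self, steps=10, seed=0):
        self.steps_left = steps
        self.rng = np.random.RandomState(seed)  # local generator

    def get_observation(self):
        # Same state layout as SampleEnvironment: 3 integers from 0 to 8
        return self.rng.randint(0, 9, 3)

    def get_actions(self, state):
        # Same rule as SampleEnvironment: actions 0..state[1] inclusive
        return np.arange(0, state[1] + 1, 1)

    def is_done(self):
        return self.steps_left == 0

    def action(self, state, chosen):
        # Assumed rule: reward 1 when the chosen action has the same parity as state[2]
        self.steps_left -= 1
        return int(chosen % 2 == state[2] % 2)


class RandomAgent:
    """Picks uniformly at random from the actions the environment offers."""

    def __init__(self, seed=0):
        self.total_reward = 0.0
        self.rng = np.random.RandomState(seed)

    def step(self, env):
        state = env.get_observation()
        actions = env.get_actions(state)
        chosen = self.rng.choice(actions)  # the agent's decision
        self.total_reward += env.action(state, chosen)


env = ActionEnvironment()
agent = RandomAgent()
while not env.is_done():
    agent.step(env)
print(f"Total reward got: {agent.total_reward}")
```

A random policy like this one does not learn anything yet; it only shows where a decision rule would plug in once the reward depends on the chosen action.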

### Project Statement
**<center>To study the methodologies of Reinforcement Learning and to develop an understanding of problem representation in code.</center>**

The aim of the project is to understand the methodology of applying a reinforcement learning algorithm to a problem. It aims to show how one can create an environment that represents a problem in Python and train a learning agent to interact with it and achieve the objectives of the problem.

Everyone knows the snake game. The objective of the snake game is to eat the apple and let the snake grow in length. We will try to implement a reinforcement learning model to play a game of snake.

<img src = "https://media.istockphoto.com/vectors/cute-funny-snake-vector-cartoon-reptile-isolated-on-white-background-vector-id1128740486?k=20&m=1128740486&s=612x612&w=0&h=L5qjI89nOrA_5POMFOAzGF_yHOtMZTzSEZqRTaalNg4=" width = 400>
<center><i>(Image Source: <a>https://media.istockphoto.com/vectors/cute-funny-snake-vector-cartoon-reptile-isolated-on-white-background-vector-id1128740486?k=20&m=1128740486&s=612x612&w=0&h=L5qjI89nOrA_5POMFOAzGF_yHOtMZTzSEZqRTaalNg4=</a>)</i></center>

This problem has been tackled and successfully solved by many; however, the idea is not only to implement the reinforcement learning model, but also to learn how to define and code the environment which represents the problem to be solved.

In reinforcement learning, the environment definition plays an important role. The representation of the state and the different reward and penalty weights affect the behaviour of the learning agent.
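
To illustrate how such weights enter the environment code, here is a minimal sketch of a reward function for a snake game. Everything in it is an assumption made for illustration: the `RewardWeights` fields, their values, and the `snake_reward` helper are hypothetical and are not taken from any later notebook.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    """Hypothetical reward and penalty weights for a snake environment."""
    apple: float = 10.0   # reward for eating the apple
    death: float = -10.0  # penalty for hitting a wall or the snake's own body
    step: float = -0.1    # small penalty per move to discourage wandering


def snake_reward(ate_apple: bool, died: bool, weights: RewardWeights) -> float:
    """Combine the events of a single move into one scalar reward."""
    reward = weights.step
    if ate_apple:
        reward += weights.apple
    if died:
        reward += weights.death
    return reward


default = RewardWeights()
no_step_cost = RewardWeights(step=0.0)

# The same uneventful move is scored differently under the two weightings
print(snake_reward(ate_apple=False, died=False, weights=default))       # prints -0.1
print(snake_reward(ate_apple=False, died=False, weights=no_step_cost))  # prints 0.0
```

With a step penalty, an agent is pushed towards reaching the apple quickly; without it, circling forever costs nothing. Small changes of this kind are what the paragraph above refers to.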

In the next notebook, we will apply this understanding of reinforcement learning by using Q-learning on a pre-defined environment from the gym library.