## Introduction

Reinforcement learning (RL) algorithms try to find the actions that earn the greatest cumulative reward. A reward can be, for example, winning a game, earning more money, or beating an opponent.

The reinforcement learning process can be modeled as an iterative loop that works as follows:

    1. The RL agent receives state S⁰ from the environment.
    
    2. Based on state S⁰, the RL agent takes an action A⁰; say, the agent moves right. Initially, the action is chosen at random.
    
    3. The environment is now in a new state S¹.
    
    4. The environment gives a reward R¹ to the RL agent.

<img src="RL_1.png" width="500" style="left:1px">
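
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the `LineWorld` environment, its `reset`/`step` methods, and the `run_episode` helper are hypothetical names invented for this example, and the agent here always acts at random.

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); position stays in [0, 4]
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done


def run_episode(env, max_steps=1000, seed=0):
    rng = random.Random(seed)
    state = env.reset()                         # 1. agent receives the state
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])            # 2. agent acts (randomly here)
        state, reward, done = env.step(action)  # 3. new state, 4. reward
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

A learning agent would replace the random choice in step 2 with a decision based on the current state and on the rewards it has seen so far.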

## Reward Maximization

**The agent should choose the best possible action at each step in order to maximize the cumulative reward.**

<img src="RL_2.png" width="300" style="left:1px">

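But simply adding up the rewards this way doesn't work well because of uncertainty: rewards further in the future are less predictable than immediate ones, so they are usually discounted.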

<img src="RL_3.png" width="500" style="left:1px">
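
As a rough illustration of discounting, the sketch below computes a discounted sum of rewards, assuming the usual form in which the reward received k steps ahead is weighted by 𝛾 raised to the power k. The function name `discounted_return` and the sample rewards are made up for this example; 𝛾 is the same discount factor that appears in the MRP and MDP definitions later on.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k]."""
    g = 0.0
    # Work backwards so each step is just: reward now + gamma * (rest)
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
print(discounted_return([1.0, 1.0, 1.0], 1.0))  # 3.0
```

With 𝛾 = 1 this reduces to the plain sum of rewards; with a smaller 𝛾 the agent cares more about rewards that arrive soon.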

# Tasks and their types

A task is a single instance of a reinforcement learning problem.

## Types
    - Continuous tasks
    - Episodic tasks
    
### Continuous Tasks
    These tasks continue forever: the agent continues to act until the process is stopped manually.
    E.g., an RL agent that performs automated trading.
    
### Episodic Tasks
    These tasks have a starting point and an ending point.
    E.g., a game, which ends when either we win or the opponent wins.

## Exploration and exploitation trade off

    Exploration is the process of finding more information about the environment.
    
    Exploitation is the process of using known information to maximize the reward.
    
    Real-life example:
        Say you go to the same restaurant every day: you are exploiting. If, on the other hand, you look for a new restaurant every time you eat out, you are exploring.
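
One simple and widely used way to balance the two is the epsilon-greedy rule, which the text above does not describe: with probability `epsilon` the agent explores by picking a random option, and otherwise it exploits the option with the best current estimate. The sketch below is a minimal illustration; the function name and the `estimates` list are assumptions made for this example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick an index: random with probability epsilon, else the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                       # explore
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit


# Hypothetical estimated "enjoyment" of three restaurants.
estimates = [0.2, 0.9, 0.5]
rng = random.Random(0)
choice = epsilon_greedy(estimates, epsilon=0.1, rng=rng)
```

With `epsilon=0` the agent always returns to its favourite restaurant; with `epsilon=1` it picks a restaurant at random every time.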

## Approaches to Reinforcement Learning

There are three common approaches to solving a reinforcement learning problem (policy-based, value-based, and model-based); the first two are described here:
   1. **Policy-based approach**
            We learn a policy function that maps each state to the best action.
            The policy itself is what we optimize.
            Policies are further divided into two types:
            
              Deterministic 
                  A policy at a given state s always returns the same action a. 
                  That is, the mapping is fixed: **S = s ➡ A = a**.
              Stochastic 
                  The policy gives a probability distribution over the possible actions. 
                  That is, stochastic policy ➡ **p( A = a | S = s )**
            
   2. **Value-based approach**
            We optimize a value function, which gives the maximum expected future reward at each state.
        
            The value of a state is the total reward an RL agent can expect to collect in the future, 
            starting from that state.
            
            
<img src="RL_4.png" width="500" style="left:1px">

             At each step, the agent moves to the state with the highest value.
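
To make the difference between the two kinds of policy concrete, here is a small sketch. The state names, action names, and probabilities are invented for illustration; a deterministic policy is represented as a plain lookup table, and a stochastic policy as a table of probability distributions that is sampled from.

```python
import random

# Hypothetical states and actions, for illustration only.
deterministic_policy = {"s0": "right", "s1": "left"}

stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    # Same state in, same action out, every time.
    return deterministic_policy[state]


def act_stochastic(state, rng):
    # Sample an action according to p(A = a | S = s).
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `act_deterministic("s0")` always returns `"right"`, whereas repeated calls to `act_stochastic("s0", rng)` return `"right"` about 80% of the time.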

## Markov Decision Process

The agent observes the environment's output, which consists of a reward and the next state, and then acts on it. This whole process is a **Markov Decision Process** (MDP).

To understand the MDP, first we have to understand the Markov property.

**The Markov property**
    _“The future is independent of the past given the present.”_
    
    P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
    
    For a Markov state S and successor state S′, the state transition probability function is defined by
    
<img src="RL_5.png" width="200" style="left:1px">


**Markov Process**
    A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, … **with the Markov property**. A Markov process (or Markov chain) is a tuple (S, P), where S is the state space and P is the transition function. These two components fully define the dynamics of the system.
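
A Markov chain of this kind can be simulated directly from its transition function. The sketch below assumes a hypothetical two-state weather chain stored as a nested dictionary; the state names and probabilities are made up. Note that the next state is sampled using only the current state, which is exactly the Markov property.

```python
import random

# Hypothetical two-state chain: P[s][s2] = probability of moving from s to s2.
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def sample_chain(P, start, n_steps, rng):
    states = [start]
    for _ in range(n_steps):
        current = states[-1]            # only the present state is used
        successors = list(P[current])
        probs = [P[current][s] for s in successors]
        states.append(rng.choices(successors, weights=probs, k=1)[0])
    return states


trajectory = sample_chain(P, "sunny", 10, random.Random(0))
```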
    
    
**Markov Reward Process**
    A Markov Reward Process (MRP) is a Markov process with a value judgment: it says how much reward is accumulated along a particular sampled sequence of states.
    
    An MRP is a tuple (S, P, R, 𝛾), where S is a finite state space, P is the state transition probability function, 𝛾 is a discount factor, and R is a reward function, where
    
<img src="RL_6.png" width="200" style="left:1px">    
    
**Bellman Equation**
    The agent tries to maximize the expected sum of rewards from every state it lands in.
    
    We unroll the return Gt,
<img src="RL_7.png" width="200" style="left:1px">    

That gives us the Bellman equation for MRPs,

<img src="RL_8.png" width="300" style="left:1px">    
    
    
**The value of a state s is the reward we get upon leaving that state, plus a discounted average over the possible successor states, where the value of each successor state is weighted by the probability that we land in it.**
    
<img src="RL_9.png" width="300" style="left:1px">    
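
The Bellman equation for MRPs can be turned into a simple numerical procedure: start from a guess of zero for every state and repeatedly replace each value with "reward plus discounted average of successor values" until the values stop changing. The sketch below assumes a small made-up MRP with states `A`, `B`, and an absorbing `end` state; the function name `evaluate_mrp` and the data layout are choices made for this example only.

```python
def evaluate_mrp(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate v(s) = R(s) + gamma * sum over s2 of P(s, s2) * v(s2)."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_v = {
            s: R[s] + gamma * sum(p * v[s2] for s2, p in P[s].items())
            for s in P
        }
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


# Hypothetical MRP: P[s][s2] is a transition probability, R[s] a reward.
P = {
    "A": {"B": 1.0},
    "B": {"A": 0.5, "end": 0.5},
    "end": {"end": 1.0},
}
R = {"A": 1.0, "B": 2.0, "end": 0.0}

values = evaluate_mrp(P, R, gamma=0.5)
```

For this example the values converge to about 2.29 for `A` and 2.57 for `B`, which can be checked by substituting them back into the equation by hand.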
    
    
**Markov Decision process**
An MDP is a Markov Reward Process with decisions; it is an environment in which all states are Markov. This is what we want to solve.

An MDP is a tuple (S, A, P, R, 𝛾), where S is our state space, A is a finite set of actions, P is the state transition probability function,

<img src="RL_10.png" width="300" style="left:1px">    

R is the reward function

<img src="R_11.png" width="300" style="left:1px"> 

and 𝛾 ∈ [0, 1] is a discount factor.


Remember that a policy 𝜋 is a distribution over actions given states. A policy fully defines the behavior of an agent,

<img src="RL_12.png" width="300" style="left:1px">

The state-value function V𝜋(s) of an MDP is the expected return starting from state s and then following policy 𝜋.

### The value function tells us how good it is to be in state s when following policy 𝜋, i.e. the expected return when we sample all actions according to policy 𝜋

<img src="RL_17.png" width="700" style="left:1px">


### “The action-value function q𝜋(s, a) is the expected return starting from state s, taking action a, and then following policy 𝜋.”

<img src="RL_13.png" width="500" style="left:1px">    




# Bellman Expectation Equation

<img src="R_18.png" width="500" style="left:1px">    


From a particular state s there are multiple possible actions, so we average over the actions we might take, weighted by the policy.

### V𝜋(s) tells us how good it is to be in a particular state,

<img src="RL_16.png" width="300" style="left:1px">

### q𝜋(s, a) tells us how good it is to take a particular action from a given state.

<img src="RL_14.png" width="300" style="left:1px">
            
***Stitching together the Bellman expectation equation for V***

<img src="RL_15.png" width="400" style="left:1px">    

***Stitching together the Bellman expectation equation for q***
<img src="RL_19.png" width="400" style="left:1px">    

  ***Finally, we get the immediate reward for our action and then average over the states we might land in: the value of each possible next state is weighted by the probability that the environment selects it, and these terms are summed.***

<img src="R_20.png" width="600" style="left:1px">    
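
The Bellman expectation equations can be applied repeatedly to estimate V𝜋 for a fixed policy. The sketch below is one way to do that under stated assumptions: the two-state MDP, the uniform random policy, and the names `q_from_v` and `evaluate_policy` are all invented for illustration. `q_from_v` computes the immediate reward plus the discounted average over next states, and `evaluate_policy` averages those action values over the policy.

```python
def q_from_v(P, R, gamma, v, s, a):
    """q(s, a) = R(s, a) + gamma * sum over s2 of P(s2 | s, a) * v(s2)."""
    return R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])


def evaluate_policy(P, R, gamma, pi, tol=1e-10, max_iters=10_000):
    """Iterate v(s) = sum over a of pi(a | s) * q(s, a) until it settles."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_v = {
            s: sum(pi[s][a] * q_from_v(P, R, gamma, v, s, a) for a in P[s])
            for s in P
        }
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


# Hypothetical MDP: P[s][a] is a list of (probability, next_state) pairs.
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(1.0, "s1")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
pi = {s: {"stay": 0.5, "go": 0.5} for s in P}  # uniform random policy

v_pi = evaluate_policy(P, R, gamma=0.5, pi=pi)
```

Once `v_pi` is known, `q_from_v(P, R, 0.5, v_pi, "s0", "go")` gives the action value of taking `go` in `s0` and following the same policy afterwards.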

## Optimal Value Functions

**The optimal state-value function V*(s) is the maximum value function over all policies.**

<img src="RL_21.png" width="300" style="left:1px">    

**The optimal action-value function q*(s, a) is the maximum action value function over all policies.**

<img src="RL_22.png" width="300" style="left:1px">    
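
The text above defines the optimal value functions but does not give a way to compute them. One standard method, value iteration, replaces the average over actions with a maximum over actions and repeats the update until the values settle. The sketch below is a minimal illustration on a made-up two-state MDP; the function name, the data layout, and the numbers are assumptions for this example.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate v(s) = max over a of [R(s, a) + gamma * expected next value]."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_v = {
            s: max(
                R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


# Hypothetical MDP: P[s][a] is a list of (probability, next_state) pairs.
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(1.0, "s1")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}

v_star = value_iteration(P, R, gamma=0.5)
```

In this example the best behaviour is to move to `s1` and stay there, which gives optimal values of about 3 for `s0` and 4 for `s1`.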
