# Introduction

Reinforcement learning lies *in between* supervised and unsupervised learning. Supervised learning has a label for every example and unsupervised learning has none, whereas reinforcement learning has *sparse, time-delayed labels* (rewards). It provides a mathematical framework for an agent (an AI) that interacts with an environment over *time* and learns through **trial and error**.

A Markov chain is a set of states together with a **transition matrix** that gives the probability of moving from one state to the next, one step at a time. The future is conditionally independent of all **previous states** given the current state: the current state encapsulates everything needed to determine the next state (in the RL setting, once an input action is received). Example: a chess configuration.
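
As a minimal sketch of a transition matrix in action, the snippet below advances a probability distribution over states one step at a time. The two weather states and all probabilities are hypothetical, chosen only for illustration.

```python
# Hypothetical two-state chain; the states and probabilities are made up.
states = ["sunny", "rainy"]

# transition[i][j] = P(next state is j | current state is i); each row sums to 1.
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]

def step(distribution, transition):
    """Advance a probability distribution over states by one time step."""
    n = len(distribution)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]

start = [1.0, 0.0]                      # start in "sunny" with certainty
after_one = step(start, transition)     # depends only on the current distribution
after_two = step(after_one, transition)
print(dict(zip(states, after_two)))
```

Each call to `step` uses only the current distribution, never the history, which is the Markov property described above.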

RL problems are represented as a **Markov decision process** (MDP), an extension of the Markov chain that adds decision and reward elements.
Five components: set of states, transitions, set of actions, starting state, reward.

*State* - Set of tokens that represent every possible condition.

*Action* - Set of all possible decisions. A denotes the whole action set and A(S) the actions available in a particular state S. Actions can be deterministic (S x A -> S') or stochastic (S x A -> Prob(S)).

*Model* - The effect of an action. In particular, T(S, a, S’) defines a transition T where being in state S and taking action ‘a’ takes us to state S’ (S and S’ may be the same). For stochastic (noisy, non-deterministic) actions we also define a probability P(S’|S,a), the transition probability matrix, which gives the probability of reaching state S’ if action ‘a’ is taken in state S.
*The transition matrix is like a next-state function, except that every state is treated as a possible consequence of taking an action in a state.*

*Reward* - Real-valued response, denoted by R(S, a, S').
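
To make the model and reward definitions concrete, here is a small sketch of a transition table and a reward function. The states `A` and `B`, the actions `stay` and `go`, and every number are hypothetical; only the names `T` and `R` follow the notation above.

```python
import random

# Hypothetical two-state MDP; states, actions and numbers are made up.
# T[(S, a)] maps each reachable next state S' to P(S' | S, a).
T = {
    ("A", "stay"): {"A": 1.0},           # deterministic action
    ("A", "go"): {"B": 0.8, "A": 0.2},   # stochastic (noisy) action
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 0.8, "B": 0.2},
}

def R(S, a, S_next):
    """Reward R(S, a, S'): 1 for landing in state B, otherwise 0."""
    return 1.0 if S_next == "B" else 0.0

def sample_step(S, a, rng):
    """Sample S' from P(. | S, a) and return (S', reward)."""
    next_states = list(T[(S, a)])
    weights = [T[(S, a)][s] for s in next_states]
    S_next = rng.choices(next_states, weights=weights, k=1)[0]
    return S_next, R(S, a, S_next)

rng = random.Random(0)
print(sample_step("A", "stay", rng))  # always ('A', 0.0): this action is deterministic
print(sample_step("A", "go", rng))    # lands in B with probability 0.8
```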

*Discount factor* ($\gamma$) - It represents the relative importance of current rewards with respect to future rewards. It settles the choice between an action that yields an immediate reward but leads to a less rewarding state and an action that pays off later. Discounting makes the worth of a reward decrease with time, so with $\gamma < 1$ and bounded rewards the infinite sum of rewards stays finite. This is what lets us define an **infinite-horizon** optimal solution (a k-horizon solution is a policy that yields the maximal expected reward sum from time 0 to k), and it is what the usual convergence arguments rely on.
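
A short sketch of how discounting weights a reward sequence: the function below computes the sum of $\gamma^t r_t$ over a finite list of rewards. The function name and the example rewards are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# The same reward of 1 is worth less the later it arrives.
print(discounted_return([1.0, 0.0, 0.0], gamma=0.9))
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```

If every reward is at most R_max in magnitude and $\gamma < 1$, the infinite discounted sum is bounded by R_max / (1 - $\gamma$), which is why the infinite-horizon objective is well defined.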

*Policy* - It is the solution to our MDP. It is a mapping from states to actions: it indicates the action to be taken in a particular state. A policy is denoted by π, with π(s) -> a.
π\* is called the optimal policy, the one that maximises the expected cumulative (discounted) reward.
**The policy is a guide that tells you what action to take at any point. It is not a plan, but it uncovers the plan by returning the action to take in each state. So no matter where you are, it suggests the action that is best there.**
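
The difference between a policy and a plan can be sketched with a lookup table. The corridor environment below (states 0 to 3, goal at 3) and its action names are hypothetical.

```python
# Hypothetical 1-D corridor with states 0..3 and the goal at state 3.
GOAL = 3

# A deterministic policy: one action for every state, not a fixed action sequence.
policy = {0: "right", 1: "right", 2: "right", 3: "stay"}

def next_state(s, a):
    """Deterministic next-state function for the corridor."""
    if a == "right":
        return min(s + 1, GOAL)
    if a == "left":
        return max(s - 1, 0)
    return s

def rollout(start, policy, max_steps=10):
    """Follow the policy from `start` and return the visited states."""
    s = start
    path = [s]
    for _ in range(max_steps):
        a = policy[s]
        if a == "stay":
            break
        s = next_state(s, a)
        path.append(s)
    return path

print(rollout(0, policy))  # [0, 1, 2, 3]
print(rollout(2, policy))  # [2, 3]
```

The same table produces a sensible path from any starting state, whereas a fixed plan such as "right, right, right" is only correct from state 0.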

Put the states in a grid ==> Capture the essence of the environment by dividing it into the five components ==> Take decisions ==> The solution is a policy.

Goal - Choose the optimal action that maximises the long-term expected reward.

Summary - An AI learns how to interact optimally with a real-time environment using time-delayed labels, called rewards, as a signal.
The Markov decision process is a mathematical framework for defining the RL problem using states, actions and rewards.
Through interaction, an AI learns a policy that returns, for a given state, the action with the highest expected long-term reward.


# Bellman Equation

STATE: a numeric representation of what the agent is observing at a particular point in time in the environment, e.g., current pixel values
ACTION: the input the agent provides to the environment, calculated by applying a policy to the current state, e.g., a joystick press
REWARD: a feedback signal from the environment reflecting how well the agent is achieving the goals of the game, e.g., score or enemies killed

RL builds on dynamic programming (DP), introduced by Dr. Richard Bellman. DP solves a complex problem by recursively breaking it down into smaller subproblems.

What does this equation solve?
Ans - It evaluates the *value* of a state: the *expected reward* obtainable from that state, which expresses how advantageous or disadvantageous the state is *relative* to the others. *Note* that even taking the best possible (optimal) action at a point may not yield the expected reward, because outcomes can be stochastic.

**Explanation of the basic Bellman equation follows here**
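
One common way to write the Bellman equation, using the P(S’|S,a) and R(S, a, S') notation from the Introduction. This is a sketch for a deterministic policy π; these notes do not fix a particular form, so treat the exact notation as an assumption.

$$
V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma \, V^{\pi}(s') \right]
$$

$$
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V^{*}(s') \right]
$$

The first line says the value of a state is the expected immediate reward plus the discounted value of the next state. The second replaces the policy's action with the best available action, which is the recursive subproblem structure that dynamic programming exploits.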

Goal - Find the optimal action, the one that maximises the expected long-term discounted reward.
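
One standard dynamic-programming way to reach this goal is value iteration, which these notes do not name; the sketch below is an illustration under that assumption, not the method the notes prescribe. The two-state MDP, its rewards and all constants are hypothetical.

```python
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Hypothetical model: P[(s, a)] is a list of (next_state, probability) pairs.
P = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"): [("B", 0.8), ("A", 0.2)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "go"): [("A", 0.8), ("B", 0.2)],
}

def reward(s, a, s_next):
    """Hypothetical reward: 1 for landing in state B, otherwise 0."""
    return 1.0 if s_next == "B" else 0.0

def q_value(V, s, a):
    """Expected reward of taking a in s, then continuing with values V."""
    return sum(p * (reward(s, a, s2) + GAMMA * V[s2]) for s2, p in P[(s, a)])

def value_iteration(tol=1e-8):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V

V = value_iteration()
# The greedy policy picks, in each state, the action with the highest value.
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
print(V, policy)
```

Because the discount factor is below 1, each backup shrinks the gap to the fixed point, so the loop terminates for any positive tolerance.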

Use a neural network (NN) to estimate the value of a state ==> Pick the action estimated to maximise the expected reward

**Tips for gamma.**

Fine-tune gamma between 0.9 and 0.99.
A lower value encourages short-term thinking, while a higher value emphasizes future rewards.
The discount factor is the 'sense of urgency in the real world'.
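
A rough way to see what a given gamma means is the rule-of-thumb horizon 1 / (1 - gamma), which is about 10 steps for 0.9 and about 100 steps for 0.99. The helper names below are hypothetical and the snippet is only a sketch for building intuition.

```python
def effective_horizon(gamma):
    """Rule-of-thumb planning horizon 1 / (1 - gamma), valid for 0 <= gamma < 1."""
    return 1.0 / (1.0 - gamma)

def weight_at(gamma, t):
    """Weight gamma**t applied to a reward received t steps in the future."""
    return gamma ** t

for gamma in (0.9, 0.99):
    horizon = effective_horizon(gamma)
    # Compare how much a reward 50 steps away still counts under each gamma.
    print(gamma, round(horizon), round(weight_at(gamma, 50), 3))
```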

# Value function

The value function, or value of a policy, tells us "how good" a state is for an agent. It is equal to the **expected discounted reward** when starting from state 's' and following policy 'π' thereafter. It therefore depends entirely on the policy chosen.

Types :-
-> State-Value function: *How good is a state*
-> Action-Value function: *How good is a state-action pair* (Q-function)

Optimal Value function (V\*(s)): The maximum expected discounted reward achievable from state s over all policies.
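
The three quantities above can be written side by side. The notation is a sketch under the assumption that $r_t$ is the reward at step $t$ and that P and R are the transition probability and reward from the Introduction.

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s \right]
$$

$$
Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]
$$

$$
V^{*}(s) = \max_{\pi} V^{\pi}(s) = \max_{a} Q^{*}(s, a)
$$

The action-value function scores a state-action pair by taking action a first and following π afterwards, and the optimal state value is the best action value available in that state.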