<a href="https://colab.research.google.com/github/koushikpr/Machine-Learning-Prerequisites/blob/Reinforcement-Learning/Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Reinforcement Learning: The method by which, the machine learns the process and comes up with a model of its own.

Some Terminologies To Go Ahead

1. Environment: Like the platform or the surrounding that our agent will explore
2. Agent: An entity that explores the environment.
3. State: Tells us the status of the agent in the environment
4. Action: Any interaction between the agent and the environment is considered an action.
5. Reward: The actions taken by the agent results in rewards(output). Our goal is to maximise the rewards.

## Q-Learning

Q-Learning is a method used learning matrix of action-reward. The matrix is in the form of nxm where, 

n = no of states

m = no of actions

This is an integration of Linear Algebra and Truth Tables.

# Learning the Q-Table

Lets look into the finding the values with the help of stages.

Stage-1: Assigning 0 as the initial condition for the Q-Table

Stage-2: There are 2 ways for learning the values for the Q-Table:

a. Randomly picking actions

b. Using the current Q-table to predict the next

As a beginner we will be looking into assigning random actions and later on we will use these tables to assign the next values. The agent will stop taking actions if it has reached the time limit/completed the goal/reached the end of the environment

Stage-3: The formula for updating the Q-Table after each action is as follows:
> $ Q[state, action] = Q[state, action] + \alpha * (reward + \gamma * max(Q[newState, :]) - Q[state, action]) $

- $\alpha$ stands for the **Learning Rate**

- $\gamma$ stands for the **Discount Factor**


Learning rate: $\alpha$ is a numerical value that depict how much change occurs between any different stages of the Q-Table. Higher value means there is a drastic change between stages in the Table.

Discount Factor: $\gamma$ depicts how much focus is put on the current and future rewards. Higher value means the o/p is considered heavily.




### Q-learning Example

In this example we will be using openAI which is platform for developers to build machine learning models.


Step 1: Importing Open AI Gym

In [36]:
import gym

Step 2: Loading an environment

In [37]:
env  = gym.make('FrozenLake-v0')

Step 3: Few basics interaction commands

In [38]:
print(env.observation_space.n)# no of states
print(env.action_space.n)# no of actions
env.reset()# resets the environment to default stage
action  = env.action_space.sample() # gets a random action
newstate,reward,done,info = env.step(action) # returns information abt the certain action
print(newstate,reward,done,info)
env.render() # returns a GUI of the environment

16
4
0 0.0 False {'prob': 0.3333333333333333}
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG


What is our Frozenlake Environment:

Our goal in the frozen lake environment is to navigate through the ice without breaking it to reach to the other side.There are 

16 stages for each square.

4 actions we have are (left,right,up,down)

4 Types of blocks(frozen,hole,start,goal)

Step 3: Building the Q-Table

In [44]:
import gym
import numpy as np
import time

env = gym.make('FrozenLake-v0')
states = env.observation_space.n
actions = env.action_space.n

Q=np.zeros((states,actions))#creating a base Q-Table with all zeros
Q

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

Step 4: Assigning Constants

In [48]:
episodes = 20000 #epochs 
turns = 100 #max steps

learningrate = 0.81
gamma = 0.96

render = False #this is to see the render
e = 0.9


Step 5: Picking an Action

We know that we have 2 ways to pick an action random or learning. here we will define a constant that gives us the probability of selecting a random action or learning from the table

In [41]:
e = 0.9
#picking the action based on the random probability
if np.random.uniform(0,1)< e:#selecting the action based on uniform Random Variable
  action  = env.action_space.sample()#takes a random action from the envoirnment
else:
  action = np.argmax(Q[state,:])#selects the max argument from the current state(row)

Step 6: Extraction States and updating Q-Values

In [50]:
rewards = []

for episode in range(episodes):
  state = env.reset()#resets the environment
  for _ in range(turns):
    if render:#to check the render
      env.render()

    if np.random.uniform(0,1)<e:
      action = env.action_space.sample()
    else:
      action = np.argmax(Q[state,:])

    nextstate,reward,done,_ = env.step(action) #returns the values of the current action

    Q[state,action] = Q[state,action] + learningrate * (reward + gamma *np.max(Q[newstate,:])-Q[state,action])

    state = nextstate#goes to the next stage

    if done:#if reached goal
      rewards.append(reward)
      e -=0.001
      break 

print(Q)
avg  = sum(rewards)/len(rewards)
print(avg)

[[0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.00020944 0.183141   0.15495698]
 [0.         0.         0.         0.        ]]
0.0
