### Computer games as an example of Reinforcement Learning

- RL is a trial-and-error method: the agent tries actions and learns from what happens
- It is much like learning to ride a bicycle


#### Its parts are:
- ***AI bot*** (the agent), which interacts with the environment
- An environment, which has a state
- The agent interacts with the environment by taking an action
- That action belongs to a fixed set of actions
- The environment then returns a state from the set of states to the agent
- The environment also gives the agent a reward along with the state
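The loop described above can be sketched without any library. This is only an illustration: `ToyCorridorEnv`, its reward values, and `run_episode` are made-up names, not part of Gym. The state is a position on a short line, and the agent picks actions at random.

```python
import random


class ToyCorridorEnv:
    """Hypothetical environment: walk along positions 0..4; reaching 4 ends the game."""

    ACTIONS = (0, 1)  # set of actions: 0 = step left, 1 = step right

    def __init__(self):
        self.position = 0

    def reset(self):
        # Put the game back in its initial state and return that state
        self.position = 0
        return self.position

    def step(self, action):
        # Apply the action, then report the new state, a reward and a done flag
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, rng, max_steps=1000):
    """Let a random agent interact with the environment for one episode."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = rng.choice(env.ACTIONS)        # the agent picks an action
        state, reward, done = env.step(action)  # the environment answers
        total_reward += reward
        if done:
            break
    return state, total_reward


final_state, total = run_episode(ToyCorridorEnv(), random.Random(0))
print(final_state, total)
```

The Gym environments used below follow the same reset/step pattern, with richer states and an extra info value returned by `step`.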


### Teaching our AI bot:
- We will use OpenAI Gym
 

In [17]:
pip install gym

Note: you may need to restart the kernel to use updated packages.


In [1]:
import gym

In [2]:
# Load the environment (the CartPole game)
env = gym.make('CartPole-v0')

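#### The environment comes with some important attributes and methods:
- action_space : the set of valid actions
- observation_space : the set of valid states (observations)
- reset() : resets the env and returns the initial state
- step(action) : applies an action and returns the next state, the reward, a done flag and extra info
- render() : draws the current state of the game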

In [3]:
env.reset() ## Take game to initial state

array([ 0.03158367,  0.00335383, -0.00151384,  0.02523392])

In [4]:
env.action_space ## Means we can move only right or left

Discrete(2)

In [5]:
env.action_space.n ## 2 actions we can take

2

In [6]:
env.observation_space

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
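The bounds printed above are not specific to CartPole: 3.4028234663852886e+38 is the largest finite value a 32-bit float can hold, so the box is effectively reporting "no practical limit" for at least one component. A quick check with the standard library only (variable names are illustrative):

```python
import struct

# Largest finite IEEE 754 single-precision value: (2 - 2**-23) * 2**127
float32_max = (2 - 2**-23) * 2.0**127

# The same value obtained by decoding the bit pattern 0x7F7FFFFF as a float32
decoded = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

print(float32_max)
print(float32_max == decoded)
```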

In [7]:
env.reset()

for t in range(1000):
    random_action = env.action_space.sample()
    env.step(random_action) ## randomly push the cart left or right
    env.render()

env.close()



#### Playing CartPole with a random strategy:
- Game episode : we will play multiple games (episodes), each lasting a different number of time steps
- The step() function in more detail
- Detecting when the game is over : step() returns a done flag

In [8]:
for e in range(20): ## we will play 20 games 
    #e = episode
    observation = env.reset()
    
    for t in range(50):
        # Play each episode (game) for at most 50 time steps
            
        env.render()
        action = env.action_space.sample()
        observation,reward,done,other_info = env.step(action)
        
        if done:
            
            print("Game Episode :{}/{} High Score :{}".format(e,20,t))
            break
            
env.close()
print("All 20 episodes are over")
        

Game Episode :0/20 High Score :22
Game Episode :1/20 High Score :45
Game Episode :2/20 High Score :13
Game Episode :3/20 High Score :15
Game Episode :4/20 High Score :8
Game Episode :5/20 High Score :9
Game Episode :6/20 High Score :15
Game Episode :7/20 High Score :12
Game Episode :8/20 High Score :11
Game Episode :9/20 High Score :36
Game Episode :10/20 High Score :21
Game Episode :11/20 High Score :15
Game Episode :12/20 High Score :11
Game Episode :13/20 High Score :13
Game Episode :14/20 High Score :13
Game Episode :15/20 High Score :14
Game Episode :16/20 High Score :22
Game Episode :17/20 High Score :11
Game Episode :18/20 High Score :13
Game Episode :19/20 High Score :10
All 20 episodes are over
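To put a number on how the random strategy performs, the scores from the log above can be summarised. The list below is copied by hand from that output; the variable names are illustrative.

```python
# Time step at which each of the 20 random episodes ended, copied from the log above
scores = [22, 45, 13, 15, 8, 9, 15, 12, 11, 36,
          21, 15, 11, 13, 13, 14, 22, 11, 13, 10]

mean_score = sum(scores) / len(scores)
best_score = max(scores)
worst_score = min(scores)

print(mean_score, best_score, worst_score)
```

A random player therefore survives roughly 16 time steps on average in this run, which is the baseline a learned strategy should beat.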


### Q-Learning:
- Building our own strategy
- Q function : Q(s,a)
- In Q(s,a), 's' is the state and 'a' is the action taken in that state; the value estimates how good that action is in the long run
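#### Bellman Equation:
Reference: https://ai.stackexchange.com/questions/11057/what-is-the-bellman-operator-in-reinforcement-learning

- Q(s,a) = r + (gamma) * MAX{Q(s_,a_)}, where s_ is the next state and the maximum is taken over all actions a_ available there
- gamma = discount factor (between 0 and 1): how much future rewards count compared to immediate ones
- r = reward returned by the environment for taking action a in state s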
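A small worked example may make the equation concrete. This is a sketch, not the training code: `bellman_target` is a hypothetical helper, the Q-values are invented numbers, and gamma = 0.95 is an assumed discount factor.

```python
def bellman_target(reward, next_q_values, done, gamma=0.95):
    """Target for Q(s, a): r + gamma * max over a_ of Q(s_, a_), or just r if the game ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical numbers: reward 1.0 and Q-value estimates for the two actions in the next state
print(bellman_target(1.0, [0.5, 2.0], done=False))  # 1.0 + 0.95 * 2.0
print(bellman_target(1.0, [0.5, 2.0], done=True))   # terminal step: the reward alone
```

When the game is over there is no next state, so the target falls back to the reward alone.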

#### Agent Design: Exploration vs. Exploitation Tradeoff
- Exploration is good in the beginning : it lets the agent try various random actions
- Exploitation is good towards the end : it lets the agent reuse what it learned from past experience (memory)
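One common way to balance the two is the epsilon-greedy rule: explore with probability epsilon, exploit otherwise. The sketch below uses plain Python lists in place of a neural network; `epsilon_greedy` and the Q-values are illustrative assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (random action), otherwise exploit (best known action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.2, 0.7]  # hypothetical Q-value estimates for actions 0 and 1

print(epsilon_greedy(q_values, epsilon=0.0))  # never explores, so it picks action 1

rng = random.Random(0)
picks = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
print(picks.count(0), picks.count(1))         # pure exploration: both actions get picked
```

Starting with a high epsilon and lowering it over time moves the agent from exploration towards exploitation.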

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import os
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import random

%matplotlib inline 

In [8]:
class Agent:
    
    def __init__(self,state_size,action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95 ## Discount factor
        self.epsilon = 1.0 ## Exploration rate: 100% random actions in the beginning
        self.epsilon_decay = 0.995 ## Epsilon is multiplied by 0.995 each time the agent trains on an experience, so exploration fades gradually
        self.epsilon_min = 0.01 ## Floor for epsilon, reached once the agent has played a lot of games
        ## Even after 1000 games, about 1% of the actions stay random and the rest come from past knowledge
        
        self.learning_rate = 0.001
        self.model = self._createmodel()
        
    def _createmodel(self):   
        model = Sequential()
        model.add(Dense(24,input_dim=self.state_size,activation='relu'))

        model.add(Dense(24,activation='relu'))

        model.add(Dense(self.action_size,activation='linear'))

        model.compile(loss='mse',optimizer=Adam(learning_rate=self.learning_rate))

        #model.summary()
        return model
    
    
    def remember(self,state,action,reward,next_state,done):
        # Remember all past experiences
        self.memory.append((state,action,reward,next_state,done))
    
    def act(self,state):
        # Epsilon-greedy method
        if np.random.rand() <= self.epsilon:
            ## Explore: take a random action
            return random.randrange(self.action_size)
        
        ## Otherwise exploit: ask the NN for the action with the highest predicted Q-value
        
        return np.argmax(self.model.predict(state)[0])
    
    
    def train(self,batch_size):
        # Train using Replay Buffer
        
        minibatch = random.sample(self.memory,batch_size)
        
        for experience in minibatch:
            state,action,reward,next_state,done = experience
            
            ## X : state
            ## Y : target Q-values for that state
            
            if not done:
                # Game is not over yet: use the Bellman equation to approximate the target value
                
                target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
                
            else:

                target = reward

            target_f = self.model.predict(state)
            target_f[0][action] = target

            # X = state
            # Y = target_f

            self.model.fit(state,target_f,epochs=1,verbose=0)
                
            # This check sits inside the minibatch loop, so epsilon decays once per sampled experience
            if self.epsilon > self.epsilon_min:
                
                self.epsilon *= self.epsilon_decay
                
    def load(self,name):
        self.model.load_weights(name)
        
    def save(self,name):
        self.model.save_weights(name)
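In the `train` method above, the epsilon decay is applied once per sampled experience, so a single call with a batch of 32 shrinks epsilon 32 times. The following sketch replays only that arithmetic, using the same constants as the class, to estimate how quickly exploration fades; the variable names are illustrative.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
batch_size = 32  # experiences replayed per call to train()

decay_steps = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    decay_steps += 1

# Each train() call applies the decay once per sampled experience
train_calls = -(-decay_steps // batch_size)  # ceiling division

print(decay_steps, train_calls)
print(round(epsilon_decay ** batch_size, 2))  # epsilon after the first train() call
```

Under these assumptions epsilon reaches its floor after only a few dozen calls to `train`, so the agent stops exploring very early in a long training run.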
        
                   

#### Why does the agent need a memory?
- We don't have any training data
- We will store the agent's past experiences of playing the game in memory
- We will use that memory to train the neural network

- We will use a double-ended queue (deque), where items can be added or removed at both ends
- Once the memory is full, the oldest experience is removed as each new one is added
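The behaviour of such a memory can be seen with a very small `deque`. The sketch assumes a capacity of 3 so the eviction is easy to see (the agent above uses 2000), and the stored experience tuples are dummy values.

```python
import random
from collections import deque

memory = deque(maxlen=3)  # tiny capacity for illustration only

for step in range(5):
    # Each experience is a (state, action, reward, next_state, done) tuple; values here are dummies
    memory.append((step, 0, 1.0, step + 1, False))

print([exp[0] for exp in memory])  # the two oldest experiences (0 and 1) were dropped

minibatch = random.sample(list(memory), 2)  # random minibatch used for one training pass
print(len(minibatch))
```

Sampling at random, rather than replaying the latest steps in order, gives the network a mix of old and new experiences in each batch.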


In [9]:
# Example of the model we are using
model = Sequential()
model.add(Dense(24,input_dim=4,activation='relu'))

model.add(Dense(24,activation='relu'))

model.add(Dense(2,activation='linear'))

model.compile(loss='mse',optimizer=Adam(learning_rate=0.001))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 24)                120       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
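The parameter counts in the summary above can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic (`dense_params` is a hypothetical helper, not a Keras function):

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights (inputs x units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [4, 24, 24, 2]  # state size, two hidden layers, action size
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))
```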


In [10]:
X = np.random.rand(1,4)
model.predict(X)

array([[-0.06790003,  0.08345148]], dtype=float32)

### Training the DQN Agent (Deep Q-Learner)

In [11]:
n_episodes = 1000
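output_dir = "C:/Users/Asus/Desktop/carpole_output/"
os.makedirs(output_dir, exist_ok=True) ## Create the folder if it is missing, so saving weights does not fail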

In [12]:
state_size = 4
action_size = 2
batch_size = 32
agent = Agent(state_size=state_size,action_size=action_size)
done = False

In [None]:
for e in range(n_episodes):
    state = env.reset()
    state = np.reshape(state,[1,state_size])
    
    for time in range(5000):
        
        env.render()
        action = agent.act(state) ## Action is 0 or 1
        next_state,reward,done,other_info = env.step(action)
        reward = reward if not done else -10  ## IMPORTANT: penalize the action that ended the game
        next_state = np.reshape(next_state,[1,state_size])
        agent.remember(state,action,reward,next_state,done) ## Experience for the agent
        
        
        if done:
            
            # The total is hard-coded to 20 here, so the log shows "/20" even though n_episodes is 1000
            print("Game Episode :{}/{} High Score :{} Exploration Rate {:.2}".format(e,20,time,agent.epsilon))
            break
            
    if len(agent.memory) > batch_size:
        agent.train(batch_size)
        
    if e%50 == 0:
        agent.save(output_dir+"weights_"+'{:04d}'.format(e)+".hdf5")

print("Deep Q-Learner Model trained")
env.close()
            

Game Episode :0/20 High Score :10 Exploration Rate 1.0
Game Episode :1/20 High Score :32 Exploration Rate 1.0
Game Episode :2/20 High Score :13 Exploration Rate 0.85
Game Episode :3/20 High Score :21 Exploration Rate 0.73
Game Episode :4/20 High Score :12 Exploration Rate 0.62
Game Episode :5/20 High Score :10 Exploration Rate 0.53
Game Episode :6/20 High Score :10 Exploration Rate 0.45
Game Episode :7/20 High Score :11 Exploration Rate 0.38
Game Episode :8/20 High Score :10 Exploration Rate 0.33
Game Episode :9/20 High Score :9 Exploration Rate 0.28
Game Episode :10/20 High Score :8 Exploration Rate 0.24
Game Episode :11/20 High Score :9 Exploration Rate 0.2
Game Episode :12/20 High Score :8 Exploration Rate 0.17
Game Episode :13/20 High Score :9 Exploration Rate 0.15
Game Episode :14/20 High Score :9 Exploration Rate 0.12
Game Episode :15/20 High Score :9 Exploration Rate 0.11
Game Episode :16/20 High Score :7 Exploration Rate 0.09
Game Episode :17/20 High Score :9 Exploration Rate 0

Game Episode :146/20 High Score :9 Exploration Rate 0.01
Game Episode :147/20 High Score :9 Exploration Rate 0.01
Game Episode :148/20 High Score :9 Exploration Rate 0.01
Game Episode :149/20 High Score :10 Exploration Rate 0.01
Game Episode :150/20 High Score :9 Exploration Rate 0.01
Game Episode :151/20 High Score :9 Exploration Rate 0.01
Game Episode :152/20 High Score :10 Exploration Rate 0.01
Game Episode :153/20 High Score :9 Exploration Rate 0.01
Game Episode :154/20 High Score :7 Exploration Rate 0.01
Game Episode :155/20 High Score :9 Exploration Rate 0.01
Game Episode :156/20 High Score :9 Exploration Rate 0.01
Game Episode :157/20 High Score :8 Exploration Rate 0.01
Game Episode :158/20 High Score :8 Exploration Rate 0.01
Game Episode :159/20 High Score :8 Exploration Rate 0.01
Game Episode :160/20 High Score :10 Exploration Rate 0.01
Game Episode :161/20 High Score :9 Exploration Rate 0.01
Game Episode :162/20 High Score :9 Exploration Rate 0.01
Game Episode :163/20 High Sc

Game Episode :290/20 High Score :9 Exploration Rate 0.01
Game Episode :291/20 High Score :8 Exploration Rate 0.01
Game Episode :292/20 High Score :8 Exploration Rate 0.01
Game Episode :293/20 High Score :8 Exploration Rate 0.01
Game Episode :294/20 High Score :9 Exploration Rate 0.01
Game Episode :295/20 High Score :8 Exploration Rate 0.01
Game Episode :296/20 High Score :9 Exploration Rate 0.01
Game Episode :297/20 High Score :10 Exploration Rate 0.01
Game Episode :298/20 High Score :9 Exploration Rate 0.01
Game Episode :299/20 High Score :9 Exploration Rate 0.01
Game Episode :300/20 High Score :8 Exploration Rate 0.01
Game Episode :301/20 High Score :9 Exploration Rate 0.01
Game Episode :302/20 High Score :9 Exploration Rate 0.01
Game Episode :303/20 High Score :8 Exploration Rate 0.01
Game Episode :304/20 High Score :7 Exploration Rate 0.01
Game Episode :305/20 High Score :7 Exploration Rate 0.01
Game Episode :306/20 High Score :8 Exploration Rate 0.01
Game Episode :307/20 High Scor

Game Episode :434/20 High Score :9 Exploration Rate 0.01
Game Episode :435/20 High Score :8 Exploration Rate 0.01
Game Episode :436/20 High Score :8 Exploration Rate 0.01
Game Episode :437/20 High Score :8 Exploration Rate 0.01
Game Episode :438/20 High Score :9 Exploration Rate 0.01
Game Episode :439/20 High Score :8 Exploration Rate 0.01
Game Episode :440/20 High Score :7 Exploration Rate 0.01
Game Episode :441/20 High Score :8 Exploration Rate 0.01
Game Episode :442/20 High Score :9 Exploration Rate 0.01
Game Episode :443/20 High Score :10 Exploration Rate 0.01
Game Episode :444/20 High Score :8 Exploration Rate 0.01
Game Episode :445/20 High Score :7 Exploration Rate 0.01
Game Episode :446/20 High Score :8 Exploration Rate 0.01
Game Episode :447/20 High Score :9 Exploration Rate 0.01
Game Episode :448/20 High Score :8 Exploration Rate 0.01
Game Episode :449/20 High Score :8 Exploration Rate 0.01
Game Episode :450/20 High Score :8 Exploration Rate 0.01
Game Episode :451/20 High Scor
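The line marked IMPORTANT in the training loop replaces the reward of the final step with -10. To my understanding CartPole pays +1 on every step, including the one that ends the game, so without this change the agent gets no signal that the ending step was bad. A minimal sketch of the substitution (`shaped_reward` is a hypothetical helper that mirrors that line):

```python
def shaped_reward(reward, done, penalty=-10.0):
    """Replace the reward of the step that ends the episode with a penalty."""
    return penalty if done else reward


# Assumed raw rewards for a three-step episode: +1 on every step, even the last one
raw_rewards = [1.0, 1.0, 1.0]
done_flags = [False, False, True]

shaped = [shaped_reward(r, d) for r, d in zip(raw_rewards, done_flags)]
print(shaped)
```

Because the Bellman target for a terminal step is the reward alone, the penalty becomes the training target for the action that ended the game.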