<a href="https://colab.research.google.com/github/pegahahadian/university/blob/main/Q_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install gym



Reinforcement learning is learning how to map situations to actions so as to maximize a numerical reward signal. Gym is a toolkit for developing and comparing reinforcement learning algorithms.

In [None]:
import gym
import random
import numpy as np

Play one game of blackjack with random actions.

In [None]:
env = gym.make("Blackjack-v0")
observation = env.reset()
memory = []
for _ in range(10):
  action = env.action_space.sample() 
  observation, reward, done, info = env.step(action)
  memory.append((observation,action,reward,done))
  if done:
    break

Calling `env.step` returns an observation, a reward, a boolean indicating whether the episode has finished, and an info dictionary.
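The code above unpacks four values from `env.step`, which matches the older Gym API. Gym 0.26 and later, and Gymnasium, return five values, splitting `done` into `terminated` and `truncated`. A minimal sketch of a helper that accepts either shape; `unpack_step` is a hypothetical name, not part of Gym:

```python
def unpack_step(result):
    """Normalize an env.step result to (observation, reward, done, info).

    Hypothetical helper, not part of Gym. Accepts the older 4-value form
    and the newer 5-value form (obs, reward, terminated, truncated, info).
    """
    if len(result) == 5:
        observation, reward, terminated, truncated, info = result
        return observation, reward, terminated or truncated, info
    observation, reward, done, info = result
    return observation, reward, done, info


# Hand-written tuples stand in for a live environment here.
old_style = ((14, 2, False), 0.0, False, {})
new_style = ((14, 2, False), -1.0, True, False, {})
print(unpack_step(old_style))
print(unpack_step(new_style))
```

With a helper like this, the loop body could read `observation, reward, done, info = unpack_step(env.step(action))` regardless of the installed version.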

In [None]:
memory

[((14, 2, False), 1, 0.0, False), ((14, 2, False), 0, 1.0, True)]

**States**

The observation is a 3-tuple of:
the player's current sum,
the dealer's one showing card (1-10, where 1 is an ace),
and whether or not the player holds a usable ace (0 or 1).
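A usable ace is one that can count as 11 without pushing the hand over 21. A minimal sketch of how such an observation could be derived from raw cards; the function names are illustrative, and this is not claimed to be Gym's own code:

```python
def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21


def hand_sum(hand):
    """Best total for the hand, counting one ace as 11 when that is safe."""
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)


def observe(player_hand, dealer_showing):
    """Build the (player sum, dealer card, usable ace) observation."""
    return (hand_sum(player_hand), dealer_showing, usable_ace(player_hand))


print(observe([1, 6], 10))     # (17, 10, True): the ace counts as 11
print(observe([1, 6, 9], 10))  # (16, 10, False): the ace must count as 1
```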

**Actions**

In [None]:
env.action_space
# Stay = 0
# Hit = 1

Discrete(2)

**Rewards**

In [None]:
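# Win = 1
# Draw = 0 (a hit that does not bust also returns 0)
# Loss = -1

def compute_avg_reward(memory):
  # Mean reward per recorded step (not per episode); index 2 of each entry is the reward.
  rewards = [r[2] for r in memory]
  return sum(rewards)/len(memory)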

#compute_avg_reward(memory)
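`compute_avg_reward` divides by the number of recorded steps, so an episode with several hits adds more than one entry to the denominator. A sketch of a per-episode alternative, assuming each memory entry is `(observation, action, reward, done)` as above; `avg_reward_per_episode` is a hypothetical helper:

```python
def avg_reward_per_episode(memory):
    """Average the total reward of each finished episode.

    Assumes entries of the form (observation, action, reward, done),
    where done=True marks the last step of an episode.
    """
    episode_totals = []
    running = 0.0
    for _, _, reward, done in memory:
        running += reward
        if done:
            episode_totals.append(running)
            running = 0.0
    if not episode_totals:
        raise ValueError("memory contains no finished episode")
    return sum(episode_totals) / len(episode_totals)


# The single game recorded earlier: one hit, then a winning stay.
sample = [((14, 2, False), 1, 0.0, False), ((14, 2, False), 0, 1.0, True)]
print(avg_reward_per_episode(sample))  # 1.0, versus 0.5 per step
```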

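Now let's play 100 games of random blackjack and keep track of our score. Each game needs its own `env.reset()`; otherwise later steps act on a hand that has already finished and the total is not meaningful.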

In [None]:
env = gym.make("Blackjack-v0")
memory = []
episodes = 100
for e in range(episodes):
  # Deal a fresh hand for every episode; stepping a finished hand gives meaningless rewards.
  observation = env.reset()
  for _ in range(10):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    memory.append((observation,action,reward,done))
    if done:
      break

In [None]:
rewards = [r[2] for r in memory]
sum(rewards)

-100.0

Let's try building a simple agent.

In [None]:
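class RuleBasedAgent():
  
  def __init__(self,hit_probability):
    # Stored for reference only; act() below follows a fixed rule and never uses it.
    self.hit_probability = hit_probability
  
  def act(self,state):
    # state[0] is the player's current sum: stay on 17 or more, otherwise hit.
    if state[0]>=17:
      return 0
    else:
      return 1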

In [None]:
env = gym.make("Blackjack-v0")

memory = []
agent = RuleBasedAgent(.8)
episodes = 100
for e in range(episodes):
  state = env.reset()
  for _ in range(10):
    # Act on the current episode's state, not a stale variable from an earlier cell.
    action = agent.act(state)
    state, reward, done, info = env.step(action)
    memory.append((state,action,reward,done))
    if done:
      break

In [None]:
compute_avg_reward(memory)

-0.26
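An average alone does not show how often the agent wins, draws, or loses. A small sketch that tallies the final reward of each episode, assuming the same `(observation, action, reward, done)` entries; `tally_outcomes` is a hypothetical helper:

```python
from collections import Counter


def tally_outcomes(memory):
    """Count wins, draws, and losses from the last step of each episode."""
    counts = Counter()
    for _, _, reward, done in memory:
        if not done:
            continue  # intermediate hits carry no outcome
        if reward > 0:
            counts["win"] += 1
        elif reward < 0:
            counts["loss"] += 1
        else:
            counts["draw"] += 1
    return counts


# Hand-written entries for illustration, not results from a real run.
example = [
    ((20, 5, False), 0, 1.0, True),   # stay and win
    ((15, 5, False), 1, 0.0, False),  # hit, still playing
    ((25, 5, False), 1, -1.0, True),  # hit and bust
    ((18, 9, False), 0, 0.0, True),   # stay and draw
]
print(tally_outcomes(example))
```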

**A reinforcement learning policy**

One of the things that sets reinforcement learning apart from other kinds of machine learning, such as supervised learning, is that the agent explores its environment to discover the optimal policy. A policy specifies what action to take in every state.

In [None]:
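def epsilon_greedy(agent, state, epsilon=.3):
  # Explore with probability epsilon, otherwise defer to the agent's own choice.
  if random.random()<epsilon:
    return env.action_space.sample()
  else:
    return agent.act(state)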
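For reference, epsilon-greedy is usually stated as: with probability epsilon pick a random action, otherwise pick the action with the highest estimated value. A self-contained sketch over a list of Q-values; the function name and the `rng` parameter are illustrative choices:

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest estimate (the first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(epsilon_greedy_action([-0.4, 0.1], epsilon=0.0))  # 1: pure exploitation
```

Passing a seeded `random.Random` instance as `rng` makes the exploration reproducible.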

The task of the agent is to estimate the value of taking an action in a given state. This is referred to as the Q-value. Algorithms that learn to act from estimated Q-values include Q-learning. Q can be estimated in a variety of ways, including Monte Carlo methods and neural networks. We will show an example with a neural network.
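Before the neural version, the Monte Carlo idea mentioned above can be sketched with a plain table: play an episode, then average the return that followed each (state, action) pair. This is an illustrative every-visit variant with no discounting, not the method used in the rest of the notebook, and all names are hypothetical:

```python
from collections import defaultdict


class MonteCarloQ:
    """Tabular Q estimate: the running mean of returns per (state, action)."""

    def __init__(self):
        self.q = defaultdict(float)     # (state, action) -> mean return
        self.visits = defaultdict(int)  # (state, action) -> sample count

    def update(self, episode):
        """episode: list of (state, action, reward) in the order played."""
        ret = 0.0
        # Walk backwards so `ret` is the total reward from each step onward.
        for state, action, reward in reversed(episode):
            ret += reward
            key = (state, action)
            self.visits[key] += 1
            self.q[key] += (ret - self.q[key]) / self.visits[key]


mc = MonteCarloQ()
mc.update([((12, 10, False), 1, 0.0), ((19, 10, False), 0, 1.0)])  # hit, then stay and win
mc.update([((12, 10, False), 1, -1.0)])                            # hit and bust
print(mc.q[((12, 10, False), 1)])  # 0.0: the mean of returns +1 and -1
```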

In [None]:
import keras
from keras.layers import Input, Dense
from keras.models import Model

class NeuralQAgent():
  
  def __init__(self,epsilon):
    self.epsilon = epsilon
    self.memory = []
    self.init_model()
    
  def init_model(self):
    inputs = Input(shape=(3,))
    x = Dense(5,activation='relu')(inputs)
    x1 = Dense(5,activation='relu')(x)
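    # Linear outputs: Q-values are value estimates (rewards here range from -1 to 1), not probabilities.
    output = Dense(2,activation='linear')(x1)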

    model = Model(inputs=inputs,outputs=output)
    model.compile(optimizer='adam',loss='mean_squared_error')
    self.model = model
    
  def act(self,state):
    state = np.array(state,ndmin=2)

    # Explore with probability epsilon; otherwise train on a batch and act greedily.
    if random.random()<self.epsilon:
      return env.action_space.sample()
    else:
      self.update_Q()
      action = int(np.argmax(self.model.predict(state)))
      return action

    
  def update_Q(self):
    # Wait until the replay memory can fill one batch of 32 transitions.
    if len(self.memory)<32:
      return
    batch = random.sample(self.memory,32)
    states = np.array([np.array(s[0]) for s in batch])
    actions = np.array([a[1] for a in batch])
    rewards = np.array([a[2] for a in batch])
    targets = self.model.predict(states)
    # Overwrite only the taken action's value with the observed immediate reward.
    for i in range(len(targets)):
      targets[i][actions[i]] = rewards[i]
    self.model.fit(states,targets,verbose=0)
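The `update_Q` method above regresses on the immediate reward only. A standard Q-learning target also adds the discounted value of the best next action whenever the episode has not ended. A plain-Python sketch of that target computation, assuming each transition also has predicted Q-values for its next state; `gamma` and the function name are assumptions, not part of the original agent:

```python
def q_learning_targets(q_current, q_next, actions, rewards, dones, gamma=1.0):
    """Copy q_current and replace each taken action's entry with
    reward + gamma * max(Q(next_state, .)), or just reward at episode end.
    """
    targets = [list(row) for row in q_current]
    for i, (action, reward, done) in enumerate(zip(actions, rewards, dones)):
        # No bootstrapping from the next state once the episode is over.
        bootstrap = 0.0 if done else gamma * max(q_next[i])
        targets[i][action] = reward + bootstrap
    return targets


q_current = [[0.2, -0.1], [0.0, 0.3]]
q_next = [[0.5, 0.4], [9.9, 9.9]]  # second row is ignored: that episode ended
print(q_learning_targets(q_current, q_next, actions=[1, 0],
                         rewards=[0.0, -1.0], dones=[False, True]))
# [[0.2, 0.5], [-1.0, 0.3]]
```

Using this in the agent would also require storing each transition's next state in `memory`, which the original tuples do not include.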

In [None]:
env = gym.make("Blackjack-v0")


agent = NeuralQAgent(.1)

episodes = 100000
for e in range(episodes):
  state = env.reset()
  while True:
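    action = agent.act(state)
    # Record the state the action was taken in, then advance to the next state.
    next_state, reward, done, info = env.step(action)
    agent.memory.append((state,action,reward,done))
    state = next_state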
    if done:
      break

In [None]:
compute_avg_reward(agent.memory)

-0.27001510857881644