### **[Reinforcement Learning] Assignment #2**
# **Monte Carlo (MC) Control Policy Evaluation in Blackjack Environment**

**(Due: November 16, 2020, 12:00 AM) - (2019 55718 - MELIA PUTRI H)**

### **Gym Libraries**
> Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball.

In [1]:
import gym
from gym import spaces
from gym.utils import seeding

### **BlackJack Environment**

In [2]:
def cmp(a,b):
  return float(a>b)-float(a<b)

In [3]:
#1=Ace, 2-10=Number Cards, Jack/Queen/King=10
deck=[1,2,3,4,5,6,7,8,9,10,10,10,10]

In [4]:
def draw_card(np_random):
  return int(np_random.choice(deck))

In [5]:
def draw_hand(np_random):
  return [draw_card(np_random),draw_card(np_random)]

In [6]:
#Does this hand have a usable Ace card?
def usable_ace(hand):
  return 1 in hand and sum(hand)+10<=21

In [7]:
#Return the current hand total
def sum_hand(hand):
  if usable_ace(hand):
    return sum(hand)+10
  return sum(hand)

In [8]:
#Is this hand a bust?
def is_bust(hand):
  return sum_hand(hand)>21

In [9]:
#What is the score of this hand (0 if bust)?
def score(hand):
  return 0 if is_bust(hand) else sum_hand(hand)

In [10]:
#Is this hand a natural Black Jack? (21)
def is_natural(hand):
  return sorted(hand)==[1,10] #Cards are an Ace and a 10, Jack, Queen, or King
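
The hand helpers above can be sanity-checked on a few hands. The sketch below re-declares the helpers so that it runs on its own; the example hands (`soft_17`, `hard_17`, `natural`) are illustrative names that do not appear in the assignment.

```python
# Same logic as the helper cells above, repeated so this snippet is self-contained.
def usable_ace(hand):
    # An Ace is "usable" if counting it as 11 does not bust the hand.
    return 1 in hand and sum(hand) + 10 <= 21

def sum_hand(hand):
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)

def is_natural(hand):
    return sorted(hand) == [1, 10]

soft_17 = [1, 6]       # Ace counted as 11: 11 + 6 = 17
hard_17 = [1, 6, 10]   # Ace must count as 1: 1 + 6 + 10 = 17
natural = [10, 1]      # Ace plus a ten-valued card

for name, hand in [("soft_17", soft_17), ("hard_17", hard_17), ("natural", natural)]:
    print(name, sum_hand(hand), usable_ace(hand), is_natural(hand))
```

Both 17s have the same total but differ in the usable-ace flag, which is why the observation returned by the environment includes that flag as a third component.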

In [11]:
class BlackJackEnv(gym.Env):
  def __init__(self,natural=False):
    self.action_space=spaces.Discrete(2)
    self.observation_space=spaces.Tuple((spaces.Discrete(32),
                                         spaces.Discrete(11),
                                         spaces.Discrete(2)))
    self.seed()

    #Flag to payout 1.5 on a "natural" Black Jack Win, like casino rules
    self.natural=natural

    #Start the First Game
    self.reset()

  def seed(self,seed=None):
    self.np_random,seed=seeding.np_random(seed)
    return [seed]
  
  def step(self,action):
    assert self.action_space.contains(action)
    #HIT: Add a card to the player's hand and return
    if action:
      self.player.append(draw_card(self.np_random))
      if is_bust(self.player):
        done=True
        reward=-1.
      else:
        done=False
        reward=0.
    #STICK: Play out the dealer's hand, and score
    else:
      done=True
      while sum_hand(self.dealer)<17:
        self.dealer.append(draw_card(self.np_random))
      reward=cmp(score(self.player),score(self.dealer))
      if self.natural and is_natural(self.player) and reward==1.:
        reward=1.5
    return self._get_obs(),reward,done,{}
  
  def _get_obs(self):
    return(sum_hand(self.player),self.dealer[0],usable_ace(self.player))
  
  def reset(self):
    self.dealer=draw_hand(self.np_random)
    self.player=draw_hand(self.np_random)
    return self._get_obs()

### **BlackJack with MC Control**

1. Consider the $k$-th episode:

  $(S_1,A_1,R_2,\ldots,S_T)\sim\pi$

2. For each state $S_t$ and action $A_t$ in the episode:

  $N(S_t,A_t)\leftarrow N(S_t,A_t)+1$

  $Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$

3. Improve the policy based on the new action-value function:

  $\epsilon\leftarrow\frac{1}{k}$

  $\pi\leftarrow\epsilon\text{-greedy}(Q)$
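
As a minimal sketch of steps 2 and 3, the snippet below applies the incremental update to a hypothetical list of returns for a single state-action pair and shows that it reproduces the plain sample mean. The names `returns`, `epsilon_greedy`, and `rng` are assumptions introduced for illustration, not part of the assignment code.

```python
import numpy as np

# Hypothetical returns G_t observed for one (state, action) pair.
returns = [1.0, -1.0, -1.0, 1.0, 0.0]

q, n = 0.0, 0
for g in returns:
    n += 1                # N(S_t, A_t) <- N(S_t, A_t) + 1
    q += (g - q) / n      # Q <- Q + (1/N) * (G_t - Q)

# The running estimate equals the average of all returns seen so far.
print(np.isclose(q, np.mean(returns)))

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
greedy_action = epsilon_greedy(np.array([-0.2, 0.1]), epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the selection is purely greedy, while `epsilon=1.0` makes it uniformly random; decaying $\epsilon$ moves the policy between these two extremes.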

In [12]:
import numpy as np
import random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [13]:
env=BlackJackEnv()
Q=np.zeros([2,10,10,2])  #Action values Q[usable_ace, player_sum-12, dealer_card-1, action]
N=np.zeros([2,10,10,2])  #Number of Visits
gamma=1                  #Discount Factor
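
The return $G_t$ used in the update is the discounted sum of rewards from step $t$ to the end of the episode. The sketch below shows one common way to compute all returns of an episode in a single backward pass; `discounted_returns` is a hypothetical helper, and the training loop in this notebook computes the same quantity with an explicit `np.sum` instead.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] using G_t = r_t + gamma * G_{t+1}."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# In Blackjack only the final step carries a nonzero reward,
# so with gamma = 1 every step of the episode gets the same return.
undiscounted = discounted_returns([0.0, 0.0, 1.0], gamma=1.0)
discounted = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
print(undiscounted, discounted)
```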

In [14]:
num_episodes=500000

In [15]:
def state_idx(state):
  return state[2]*1,state[0]-12,state[1]-1
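
To make the table layout concrete, the sketch below maps two observations to their indices. It assumes, as in the training loop, that the player sum is already in the range 12 to 21 before the agent acts. The function is re-declared with unpacked names for readability.

```python
def state_idx(state):
    # state = (player_sum, dealer_showing_card, usable_ace_flag)
    player_sum, dealer_card, ace = state
    return int(ace), player_sum - 12, dealer_card - 1

lowest = state_idx((12, 1, False))    # smallest tracked sum, dealer shows an Ace
highest = state_idx((21, 10, True))   # largest sum, dealer shows a ten-valued card
print(lowest, highest)
```

These indices cover exactly the first three axes of a `(2, 10, 10, 2)` array, with the last axis reserved for the action.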

In [16]:
def setrandom():
  action=random.randrange(0,2)
  return action

In [17]:
winning_counter=0
for epi in range(num_episodes):
  state=env.reset()
  done=False
  state_history=[]
  reward_history=[]
  state_action_history=[]

  while state[0]<12:
    state,_,done,_=env.step(1)

  e=1.0/((epi//500)+1)  #Epsilon: decays stepwise, dropping once every 500 episodes
    
  while done==False:
    if np.random.rand(1)<e:
      action=setrandom()
    else:
      action=np.argmax(Q[state[2]*1,state[0]-12,state[1]-1,:])

    next_state,reward,done,_=env.step(action)
    state_history.append(state)
    reward_history.append(reward)
    state_action_history.append([state[2]*1,state[0]-12,state[1]-1,action])
    state=next_state

  winning_counter+=(reward>0)*1.0

  T=len(state_action_history)
  
  for t in range(T):
    G_t=np.sum(np.array(reward_history[t:T])*gamma**np.arange(T-t))
    #Indices of the (state, action) pair visited at step t
    ace,p,d,a=state_action_history[t]
    N[ace,p,d,a]+=1
    Q[ace,p,d,a]=Q[ace,p,d,a]+(1/N[ace,p,d,a])*(G_t-Q[ace,p,d,a])
    
  if (epi+1)%10000==0:
    print("Episode: %6d, Winning rate: %.2f"%(epi+1, winning_counter/(epi+1)))

Episode:  10000, Winning rate: 0.39
Episode:  20000, Winning rate: 0.40
Episode:  30000, Winning rate: 0.41
Episode:  40000, Winning rate: 0.41
Episode:  50000, Winning rate: 0.41
Episode:  60000, Winning rate: 0.41
Episode:  70000, Winning rate: 0.42
Episode:  80000, Winning rate: 0.42
Episode:  90000, Winning rate: 0.42
Episode: 100000, Winning rate: 0.42
Episode: 110000, Winning rate: 0.42
Episode: 120000, Winning rate: 0.42
Episode: 130000, Winning rate: 0.42
Episode: 140000, Winning rate: 0.42
Episode: 150000, Winning rate: 0.42
Episode: 160000, Winning rate: 0.42
Episode: 170000, Winning rate: 0.42
Episode: 180000, Winning rate: 0.42
Episode: 190000, Winning rate: 0.42
Episode: 200000, Winning rate: 0.42
Episode: 210000, Winning rate: 0.42
Episode: 220000, Winning rate: 0.42
Episode: 230000, Winning rate: 0.42
Episode: 240000, Winning rate: 0.42
Episode: 250000, Winning rate: 0.42
Episode: 260000, Winning rate: 0.42
Episode: 270000, Winning rate: 0.42
Episode: 280000, Winning rat

In [18]:
print(Q)

[[[[-7.69688318e-01 -9.28571429e-01]
   [-2.72233202e-01 -3.53846154e-01]
   [-2.33446328e-01 -3.84615385e-01]
   [-2.18069250e-01 -2.58064516e-01]
   [-1.70765341e-01 -5.62500000e-01]
   [-1.61823766e-01 -4.44444444e-01]
   [-4.88595438e-01 -7.85714286e-01]
   [-5.18264840e-01 -5.51724138e-01]
   [-7.27272727e-01 -3.63979255e-01]
   [-5.73333333e-01 -4.25789474e-01]]

  [[-1.00000000e+00 -5.78160060e-01]
   [-2.87878788e-01 -5.88235294e-01]
   [-4.66666667e-01 -2.85470085e-01]
   [-6.00000000e-01 -2.99450549e-01]
   [-1.42159257e-01 -5.21739130e-01]
   [-2.57142857e-01 -2.17597208e-01]
   [-4.94055148e-01 -8.23529412e-01]
   [-7.33333333e-01 -3.61032197e-01]
   [-5.50520231e-01 -5.71428571e-01]
   [-6.45569620e-01 -4.65630115e-01]]

  [[-7.50000000e-01 -5.99106513e-01]
   [-2.75099867e-01 -3.90000000e-01]
   [-2.28253141e-01 -2.30769231e-01]
   [-2.05804111e-01 -4.32432432e-01]
   [-1.80475594e-01 -4.11764706e-01]
   [-1.57843848e-01 -2.50000000e-01]
   [-4.77654353e-01 -7.24137931e-0

#### **Visualization**

In [19]:
X=np.arange(1,11)
Y=np.arange(12,22)
X,Y=np.meshgrid(X,Y)

In [20]:
Q.shape

(2, 10, 10, 2)

In [None]:
Q.max(axis=3).shape  #V(s)=max_a Q(s,a): taking the max over the action axis gives shape (2,10,10)

In [None]:
fig=plt.figure(figsize=(16,15))

#plot_surface needs a 2D array, so reduce Q to V(s)=max_a Q(s,a) over the action axis
ax0=fig.add_subplot(211,projection='3d')
ax0.plot_surface(X,Y,Q[0].max(axis=-1),rstride=1,cstride=1,cmap='coolwarm')
ax0.set_xlabel('Dealer')
ax0.set_ylabel('Player Sum')
ax0.set_zlabel('V Value')
ax0.set_title('No Usable Ace')

ax1=fig.add_subplot(212,projection='3d')
ax1.plot_surface(X,Y,Q[1].max(axis=-1),rstride=1,cstride=1,cmap='coolwarm')
ax1.set_xlabel('Dealer')
ax1.set_ylabel('Player Sum')
ax1.set_zlabel('V Value')
ax1.set_title('Usable Ace')

plt.show()
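
Besides the value surface, the learned table also implies a greedy policy. The sketch below shows how both could be read off an array of the same shape as `Q`. It uses a random stand-in (`Q_demo`, a hypothetical name) rather than the trained table, so the printed values carry no meaning for Blackjack.

```python
import numpy as np

# Random stand-in with the same layout as Q: [usable_ace, player_sum-12, dealer_card-1, action].
rng = np.random.default_rng(0)
Q_demo = rng.normal(size=(2, 10, 10, 2))

V_demo = Q_demo.max(axis=3)          # state value under the greedy policy
policy_demo = Q_demo.argmax(axis=3)  # greedy action per state: 0 = stick, 1 = hit

# Example lookup: no usable Ace, player sum 16, dealer shows a 10.
ace, player_sum, dealer_card = 0, 16, 10
action = policy_demo[ace, player_sum - 12, dealer_card - 1]
print("hit" if action == 1 else "stick")
```

Applied to the trained `Q`, the same two reductions would give the surface plotted above and a stick/hit table for each of the two usable-ace cases.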