# Decision-Making by Two People (Learning the Stone-Picking Game)

・We use Q-learning, a reinforcement learning method.<br>
・We use two agents.<br>
・The agents take turns removing 1 to 3 stones from a pile of 11; the agent that takes the last stone loses.


In [1]:
import numpy as np

Variable settings

In [2]:
BOTTLE_N = 11  # number of stones

# Q-value initialization
# Rows: state (number of stones taken so far); columns: action (0, 1, 2 = take 1, 2, 3 stones)
QV0 = np.zeros((BOTTLE_N+1,3), dtype=np.float32)
QV1 = np.zeros((BOTTLE_N+1,3), dtype=np.float32)
QVs = [QV0, QV1]

State transition: the function that advances the state

In [3]:
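def step(action, state, turn):
    # action is 0, 1, or 2, meaning the agent takes 1, 2, or 3 stones
    state = state + action + 1
    rewards = [0,0]
    done = False
    if state >= BOTTLE_N:
        # The agent whose turn it is took the last stone, so it loses
        state = BOTTLE_N
        rewards[turn] = -1
        rewards[(turn+1)%2] = 1
        done = True
    return state, rewards, done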
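
As a quick sanity check of the transition rule, the sketch below replays one short game by hand. It restates `step` so that it runs on its own; the move sequence (both agents always take 3 stones) is an arbitrary example and is not taken from the training run.

```python
BOTTLE_N = 11  # number of stones, as in the notebook

def step(action, state, turn):
    # Same transition rule as above: action 0/1/2 means taking 1/2/3 stones.
    state = state + action + 1
    rewards = [0, 0]
    done = False
    if state >= BOTTLE_N:
        state = BOTTLE_N
        rewards[turn] = -1            # the agent that takes the last stone loses
        rewards[(turn + 1) % 2] = 1
        done = True
    return state, rewards, done

# Arbitrary example game: the agents alternate and each always takes 3 stones.
state, turn, done = 0, 0, False
history = []
while not done:
    state, rewards, done = step(2, state, turn)
    history.append((turn, state, rewards))
    turn = (turn + 1) % 2

for entry in history:
    print(entry)  # (agent that moved, state after the move, rewards)
```

In this example agent 1 makes the fourth move, overshoots the 11th stone, and receives the reward of -1, while agent 0 receives +1.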

Action selection: the function that selects the next action from the Q-values

In [4]:
def getAction(state, epsilon, qv):
    # ε-greedy method: as epsilon shrinks, the agent gradually takes only the
    # action that is optimal at that time
    if epsilon > np.random.uniform(0, 1):
        next_action = np.random.choice([0, 1, 2])
    else:
        a = np.where(qv[state]==qv[state].max())[0]
        # If several actions are tied for optimal, select randomly among them
        next_action = np.random.choice(a)
    return next_action
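
To make the two branches of the ε-greedy rule concrete, here is a minimal sketch that calls a restated `getAction` on a small Q-table. The table `qv` and the seed are made up for this illustration; only the selection rule comes from the notebook.

```python
import numpy as np

def getAction(state, epsilon, qv):
    # Same selection rule as above.
    if epsilon > np.random.uniform(0, 1):
        return np.random.choice([0, 1, 2])
    best = np.where(qv[state] == qv[state].max())[0]
    return np.random.choice(best)

# Hypothetical Q-table with two states, made up for this illustration.
qv = np.array([[0.1, 0.9, 0.9],   # state 0: actions 1 and 2 tie for the maximum
               [0.5, 0.0, 0.0]],  # state 1: action 0 is the unique maximum
              dtype=np.float32)

np.random.seed(0)  # fixed seed so the sketch is repeatable

# epsilon = 0.0 never explores; epsilon = 1.0 always explores.
greedy_tied = {int(getAction(0, 0.0, qv)) for _ in range(200)}
greedy_unique = {int(getAction(1, 0.0, qv)) for _ in range(200)}
explore = {int(getAction(1, 1.0, qv)) for _ in range(200)}

print("greedy, tied maximum:  ", sorted(greedy_tied))
print("greedy, unique maximum:", sorted(greedy_unique))
print("always exploring:      ", sorted(explore))
```

With no exploration the function only ever returns actions that attain the maximum Q-value, breaking ties at random; with full exploration every action can appear regardless of the Q-values.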

Q-value update: the function that updates the Q-value from the state, the action, the reward, and the next state

In [5]:
def updateQValue(action, reward, state, state_old, qv, gamma, alpha):
    # state is the next state; state_old is the state in which action was taken
    maxQ = np.max(qv[state])
    qv[state_old][action] = (1-alpha)*qv[state_old][action]+alpha*(reward + gamma*maxQ)
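
The arithmetic of this update can be followed on a few hand-picked transitions. The sketch below restates `updateQValue` and applies it with the notebook's `gamma = 0.9` and `alpha = 0.5`; the transitions themselves are hypothetical examples chosen so the numbers are easy to check.

```python
import numpy as np

def updateQValue(action, reward, state, state_old, qv, gamma, alpha):
    # Same update rule as above.
    maxQ = np.max(qv[state])
    qv[state_old][action] = (1 - alpha) * qv[state_old][action] + alpha * (reward + gamma * maxQ)

gamma, alpha = 0.9, 0.5
qv = np.zeros((12, 3), dtype=np.float32)  # same shape as QV0 and QV1

# Terminal update: the agent was in state 9, took 1 stone (action 0), and the
# opponent then had to take the last stone, so the game ends in state 11 with reward +1.
updateQValue(0, 1, 11, 9, qv, gamma, alpha)
first = float(qv[9][0])    # 0.5*0 + 0.5*(1 + 0.9*0)
updateQValue(0, 1, 11, 9, qv, gamma, alpha)
second = float(qv[9][0])   # 0.5*0.5 + 0.5*(1 + 0.9*0)

# Non-terminal update: the agent was in state 5, took 1 stone (action 0), the
# opponent took 3, and the agent now faces state 9 with reward 0.
updateQValue(0, 0, 9, 5, qv, gamma, alpha)
bootstrapped = float(qv[5][0])  # 0.5*0 + 0.5*(0 + 0.9*max(qv[9]))

print(first, second, bootstrapped)
```

Repeating the terminal update moves the value toward the reward of 1, and the non-terminal update passes a discounted share of that value back to the earlier state.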

Variable settings

In [6]:
num_episodes = 100  # number of episodes
gamma = 0.9  # discount rate
alpha = 0.5  # learning rate

Running reinforcement learning

In [7]:
for episode in range(num_episodes):  # repeat for the number of episodes
    state = 0
    state_old = [0,0]
    rewards = [0,0]
    actions = [0,0]
    epsilon = 0.5 * (1 / (episode + 1))  # exploration rate decays with the episode number
    while True:
        # Agent 0's turn
        actions[0] = getAction(state, epsilon, QVs[0])
        state_old[0] = state
        state, rewards, done = step(actions[0], state, 0)
        # Update agent 1 for its previous move. On the first turn agent 1 has not
        # moved yet, so this call uses the placeholders actions[1] = 0, state_old[1] = 0.
        updateQValue(actions[1], rewards[1], state, state_old[1], QVs[1], gamma, alpha)
        if done:
            updateQValue(actions[0], rewards[0], state, state_old[0], QVs[0], gamma, alpha)
            print('{} : 0 Lose, 1 Win!!'.format(episode))
            break
        # Agent 1's turn
        actions[1] = getAction(state, epsilon, QVs[1])
        state_old[1] = state
        state, rewards, done = step(actions[1], state, 1)
        # Update agent 0 for its previous move
        updateQValue(actions[0], rewards[0], state, state_old[0], QVs[0], gamma, alpha)
        if done:
            updateQValue(actions[1], rewards[1], state, state_old[1], QVs[1], gamma, alpha)
            print('{} : 0 Win!!, 1 Lose'.format(episode))
            break

0 : 0 Win!!, 1 Lose
1 : 0 Win!!, 1 Lose
2 : 0 Lose, 1 Win!!
3 : 0 Lose, 1 Win!!
4 : 0 Win!!, 1 Lose
5 : 0 Win!!, 1 Lose
6 : 0 Lose, 1 Win!!
7 : 0 Win!!, 1 Lose
8 : 0 Lose, 1 Win!!
9 : 0 Lose, 1 Win!!
10 : 0 Lose, 1 Win!!
11 : 0 Lose, 1 Win!!
12 : 0 Lose, 1 Win!!
13 : 0 Win!!, 1 Lose
14 : 0 Lose, 1 Win!!
15 : 0 Lose, 1 Win!!
16 : 0 Lose, 1 Win!!
17 : 0 Lose, 1 Win!!
18 : 0 Win!!, 1 Lose
19 : 0 Win!!, 1 Lose
20 : 0 Lose, 1 Win!!
21 : 0 Lose, 1 Win!!
22 : 0 Lose, 1 Win!!
23 : 0 Win!!, 1 Lose
24 : 0 Win!!, 1 Lose
25 : 0 Win!!, 1 Lose
26 : 0 Win!!, 1 Lose
27 : 0 Win!!, 1 Lose
28 : 0 Win!!, 1 Lose
29 : 0 Win!!, 1 Lose
30 : 0 Win!!, 1 Lose
31 : 0 Win!!, 1 Lose
32 : 0 Win!!, 1 Lose
33 : 0 Win!!, 1 Lose
34 : 0 Win!!, 1 Lose
35 : 0 Win!!, 1 Lose
36 : 0 Win!!, 1 Lose
37 : 0 Win!!, 1 Lose
38 : 0 Win!!, 1 Lose
39 : 0 Lose, 1 Win!!
40 : 0 Win!!, 1 Lose
41 : 0 Win!!, 1 Lose
42 : 0 Win!!, 1 Lose
43 : 0 Win!!, 1 Lose
44 : 0 Win!!, 1 Lose
45 : 0 Win!!, 1 Lose
46 : 0 Win!!, 1 Lose
47 : 0 Win!!, 1 Lose
48
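
The training loop sets the exploration rate to `0.5 * (1 / (episode + 1))`, so exploration fades quickly. The following sketch only tabulates that schedule; the sample episodes and the 1% threshold are arbitrary choices made for illustration.

```python
num_episodes = 100  # as in the notebook

# Exploration rate for each episode, using the same formula as the training loop.
epsilons = [0.5 * (1 / (episode + 1)) for episode in range(num_episodes)]

for episode in (0, 1, 9, 99):
    print(episode, epsilons[episode])

# First episode whose exploration rate is strictly below 1% (arbitrary threshold).
first_below = next(e for e, eps in enumerate(epsilons) if eps < 0.01)
print("first episode with epsilon < 0.01:", first_below)
```

Under this schedule half of the decisions in the first episode are random, while from episode 50 onward fewer than one decision in a hundred is.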

In [8]:
print("Agent 0")
print(QVs[0])
print("Agent 1")
print(QVs[1])
np.savetxt('QV0.txt', QVs[0])
np.savetxt('QV1.txt', QVs[1])

Agent 0
[[-0.03203613  0.8099999  -0.02645507]
 [ 0.          0.          0.        ]
 [-0.10124999 -0.24046874 -0.18140624]
 [ 0.         -0.25312498  0.8999864 ]
 [ 0.          0.899978    0.        ]
 [ 0.899999    0.         -0.4359375 ]
 [-0.703125   -0.646875   -0.6234375 ]
 [-0.26874998 -0.39374998  0.99999905]
 [-0.4359375   0.9999999   0.        ]
 [ 1.          0.         -0.5       ]
 [-0.984375   -0.96875    -0.984375  ]
 [ 0.          0.          0.        ]]
Agent 1
[[-0.7274703   0.          0.        ]
 [ 0.20309943  0.          0.        ]
 [-0.8098532  -0.80977774 -0.80984014]
 [ 0.          0.          0.5352539 ]
 [ 0.          0.590625    0.        ]
 [ 0.63984376 -0.01406249 -0.225     ]
 [-0.8999899  -0.89998716 -0.89998543]
 [ 0.          0.          0.96875   ]
 [ 0.          0.96875    -0.5       ]
 [ 0.9921875  -0.5        -0.5       ]
 [-1.         -1.         -1.        ]
 [ 0.          0.          0.        ]]


Check whether the agents have learned the same moves as the winning strategy.

In [9]:
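for j in range(2):
    print("Agent", j)
    for i in range(BOTTLE_N):
        a = np.where(QVs[j][i]==QVs[j][i].max())[0]
        print('Remaining',BOTTLE_N-i,'Take',a+1,'Winning move',(BOTTLE_N-i-1)%4,'any' if (BOTTLE_N-i-1)%4 == 0 else \
              ('undetermined' if a.size >1 else ('correct' if (BOTTLE_N-i-1)%4 == a+1 else 'incorrect')))

Agent 0
Remaining 11 Take [2] Winning move 2 correct
Remaining 10 Take [1 2 3] Winning move 1 undetermined
Remaining 9 Take [1] Winning move 0 any
Remaining 8 Take [3] Winning move 3 correct
Remaining 7 Take [2] Winning move 2 correct
Remaining 6 Take [1] Winning move 1 correct
Remaining 5 Take [3] Winning move 0 any
Remaining 4 Take [3] Winning move 3 correct
Remaining 3 Take [2] Winning move 2 correct
Remaining 2 Take [1] Winning move 1 correct
Remaining 1 Take [2] Winning move 0 any
Agent 1
Remaining 11 Take [2 3] Winning move 2 undetermined
Remaining 10 Take [1] Winning move 1 correct
Remaining 9 Take [2] Winning move 0 any
Remaining 8 Take [3] Winning move 3 correct
Remaining 7 Take [2] Winning move 2 correct
Remaining 6 Take [1] Winning move 1 correct
Remaining 5 Take [3] Winning move 0 any
Remaining 4 Take [3] Winning move 3 correct
Remaining 3 Take [2] Winning move 2 correct
Remaining 2 Take [1] Winning move 1 correct
Remaining 1 Take [1 2 3] Winning move 0 any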
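
The check above compares each learned move with `(BOTTLE_N-i-1)%4`, where 0 means that no move wins. As an independent cross-check of that rule, the sketch below searches the game tree directly, assuming the rule that whoever takes the last stone loses. The helper names `can_win` and `winning_takes` are hypothetical and do not appear in the notebook.

```python
from functools import lru_cache

BOTTLE_N = 11  # number of stones, as in the notebook

@lru_cache(maxsize=None)
def can_win(remaining):
    # True if the player to move can force a win with `remaining` stones left.
    # Taking the last stone loses, so a useful move must leave at least one
    # stone and put the opponent in a losing position.
    return any(not can_win(remaining - take)
               for take in (1, 2, 3) if remaining - take >= 1)

def winning_takes(remaining):
    # All takes that leave the opponent in a losing position.
    return [take for take in (1, 2, 3)
            if remaining - take >= 1 and not can_win(remaining - take)]

for remaining in range(BOTTLE_N, 0, -1):
    # Columns: stones remaining, winning takes found by search, (remaining - 1) % 4
    print(remaining, winning_takes(remaining), (remaining - 1) % 4)
```

For every pile size from 1 to 11 the search finds either the single take given by `(remaining - 1) % 4` or, when that value is 0, no winning take at all, which is why those rows are labeled as accepting any move.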
