# Decision-Making by One Person (Mint Tablet Problem)

・This notebook uses Q-learning, a reinforcement learning method.<br>
・Only one agent is used.


In [11]:
import numpy as np

State transition: the function that changes the state in response to an action and returns the reward

In [2]:
def step(state, action):
    reward = 0
    if state == 0:  # closed
        if action == 0:  # open
            state = 1
    elif state == 1:  # opened, with mint
        if action == 1:  # close
            state = 0
        elif action == 2:  # tip
            state = 2
            reward = 1
    else:  # opened, without mint
        if action == 1:  # close
            state = 0
    return state, reward
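
The transitions encoded by `step` can also be read as a small lookup table. The sketch below is one possible restatement, not part of the original notebook: the constant names and `step_table` are hypothetical, introduced only to label the integers 0, 1, and 2 used above.

```python
# Hypothetical names for readability; the notebook itself uses bare integers.
CLOSED, OPEN_WITH_MINT, OPEN_EMPTY = 0, 1, 2   # states
OPEN, CLOSE, TIP = 0, 1, 2                     # actions

# (state, action) -> (next_state, reward)
# Pairs that are not listed leave the state unchanged and give no reward.
TRANSITIONS = {
    (CLOSED, OPEN): (OPEN_WITH_MINT, 0),
    (OPEN_WITH_MINT, CLOSE): (CLOSED, 0),
    (OPEN_WITH_MINT, TIP): (OPEN_EMPTY, 1),
    (OPEN_EMPTY, CLOSE): (CLOSED, 0),
}

def step_table(state, action):
    """Table-driven equivalent of step(): unlisted pairs keep the state, reward 0."""
    return TRANSITIONS.get((state, action), (state, 0))

print(step_table(CLOSED, OPEN))          # (1, 0)
print(step_table(OPEN_WITH_MINT, TIP))   # (2, 1)
print(step_table(OPEN_EMPTY, TIP))       # (2, 0)
```

Only tipping an open case that still holds a mint yields a reward, so the agent has to close and reopen the case before it can be rewarded again.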

Action selection: the function that selects the next action from the Q-values

In [13]:
def getAction(state, epsilon, qv):
    # ε-greedy method: explore with probability epsilon, otherwise take the
    # action with the highest Q-value; as epsilon shrinks, the greedy action
    # is taken more and more often
    if epsilon > np.random.uniform(0, 1):
        # explore over all three actions (open, close, tip)
        next_action = np.random.choice([0, 1, 2])
    else:
        a = np.where(qv[state] == qv[state].max())[0]
        # If several actions tie for the highest Q-value, pick one of them at random
        next_action = np.random.choice(a)
    return next_action
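
To see how ε-greedy selection behaves, the following sketch samples actions from a single row of Q-values. It is an illustration under assumptions: `epsilon_greedy` is a hypothetical stand-alone helper (not the notebook's `getAction`), it uses NumPy's `default_rng` generator with a fixed seed, and the row of Q-values is the one printed for state 1 after episode 1 in the output further down.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of the Q-table (hypothetical helper)."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(len(q_row)))      # explore: any action
    best = np.flatnonzero(q_row == q_row.max())   # all actions tied for the maximum
    return int(rng.choice(best))                  # exploit, with random tie-breaking

rng = np.random.default_rng(0)
q_row = np.array([0.225, 0.0, 0.75])  # Q-values of state 1 after episode 1

# With epsilon = 0.5, roughly half of the picks are exploratory.
counts = np.bincount(
    [epsilon_greedy(q_row, 0.5, rng) for _ in range(1000)], minlength=3
)
# With epsilon = 0, only the greedy action is ever chosen.
greedy_only = {epsilon_greedy(q_row, 0.0, rng) for _ in range(100)}

print(counts)       # action 2 dominates; actions 0 and 1 come from exploration
print(greedy_only)  # {2}
```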

Q-value update: the function that updates the Q-value from the current state, the action, the reward, and the next state
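
Written as an equation, the update performed by the function in the following cell appears to be the standard Q-learning rule, where $\alpha$ corresponds to `alpha`, $\gamma$ to `gamma`, $r$ to `reward`, and $s'$ to `next_state`:

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
$$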

In [14]:
def updateQValue(qv, state, action, reward, next_state, gamma, alpha):
    next_maxQ=max(qv[next_state])
    qv[state, action] = (1 - alpha) * qv[state, action] + alpha * (reward + gamma * next_maxQ)
    return qv
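
As a worked example of this update, the sketch below recomputes the entry `QV[1, 2]` (state 1, action "tip") for the two tips in episode 1 of the output further down. The helper `q_update` is hypothetical and operates on a single number rather than the whole table; it assumes `gamma = 0.9`, `alpha = 0.5`, and that all Q-values of the next state (state 2) are still zero at both updates, which holds for that episode.

```python
def q_update(q_sa, reward, next_max_q, gamma=0.9, alpha=0.5):
    """One Q-learning update for a single (state, action) entry."""
    return (1 - alpha) * q_sa + alpha * (reward + gamma * next_max_q)

q = 0.0                                     # QV[1, 2] starts at zero
q = q_update(q, reward=1, next_max_q=0.0)   # first tip
first = q
q = q_update(q, reward=1, next_max_q=0.0)   # second tip
print(first, q)   # 0.5 0.75
```

The final value 0.75 matches the entry printed for `QV[1, 2]` after episode 1.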

Variable settings

In [15]:
num_episodes = 5  # total number of trials (episodes)
num_steps = 10  # number of actions in one trial
gamma = 0.9  # discount rate
alpha = 0.5  # learning rate

Run reinforcement learning
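
The training loop in the next cell sets `epsilon = 0.5 * (1 / (episode + 1))`, so the exploration rate shrinks from one trial to the next. A minimal sketch of that schedule for the five trials configured above (the list name `epsilons` is introduced here only for illustration):

```python
num_episodes = 5

# Same formula as in the training loop: epsilon halves, then keeps shrinking.
epsilons = [0.5 * (1 / (episode + 1)) for episode in range(num_episodes)]

for episode, eps in enumerate(epsilons, start=1):
    print(episode, round(eps, 4))
# 1 0.5
# 2 0.25
# 3 0.1667
# 4 0.125
# 5 0.1
```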

In [16]:
QV = np.zeros((3, 3))  # Q-table: 3 states x 3 actions
for episode in range(num_episodes):  # loop over the trials
    state = 0  # reset to the initial state (closed)
    sum_reward = 0  # cumulative reward
    epsilon = 0.5 * (1 / (episode + 1))  # exploration rate decays with each trial
    for t in range(num_steps):  # loop over the steps of one trial
        action = getAction(state, epsilon, QV)  # select the action for the current state
        next_state, reward = step(state, action)
        print("state:", state, " action:", action, " reward:", reward)
        sum_reward += reward  # add the reward
        QV = updateQValue(QV, state, action, reward, next_state, gamma, alpha)
        state = next_state
    print('episode : %d total reward %d' %(episode+1, sum_reward))
    print(QV)

state: 0  action: 0  reward: 0
state: 1  action: 1  reward: 0
state: 0  action: 0  reward: 0
state: 1  action: 0  reward: 0
state: 1  action: 2  reward: 1
state: 2  action: 1  reward: 0
state: 0  action: 0  reward: 0
state: 1  action: 0  reward: 0
state: 1  action: 2  reward: 1
state: 2  action: 1  reward: 0
episode : 1 total reward 2
[[0.225   0.      0.     ]
 [0.225   0.      0.75   ]
 [0.      0.10125 0.     ]]
state: 0  action: 0  reward: 0
state: 1  action: 1  reward: 0
state: 0  action: 0  reward: 0
state: 1  action: 2  reward: 1
state: 2  action: 0  reward: 0
state: 2  action: 1  reward: 0
state: 0  action: 0  reward: 0
state: 1  action: 2  reward: 1
state: 2  action: 1  reward: 0
state: 0  action: 0  reward: 0
episode : 2 total reward 2
[[0.8413875  0.         0.        ]
 [0.225      0.2025     1.09696875]
 [0.0455625  0.46485141 0.        ]]
state: 0  action: 0  reward: 0
state: 1  action: 2  reward: 1
state: 2  action: 1  reward: 0
state: 0  action: 0  reward: 0
state: 1  a