# 13wk-1: Reinforcement Learning (1) – bandit

최규빈  
2024-05-29

<a href="https://colab.research.google.com/github/guebin/DL2024/blob/main/posts/13wk-2.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" style="text-align: left"></a>

# 1. Lecture video

In [1]:
#{{<video https://youtu.be/playlist?list=PLQqh36zP38-zoOHd7w3N5q9Jc5P34Ux8X&si=MdJTHM3a27MCAssp >}}

# 2. Environment setup ($\star\star\star$)

`-` Installation (Colab)

``` python
!pip install -q swig
!pip install gymnasium
!pip install gymnasium[box2d]
```

# 3. Imports

In [2]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

-   ref: <https://gymnasium.farama.org/index.html>

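# 4. Reinforcement learning intro

`-` Reinforcement learning (rough explanation): given some “(game) environment”, the task of learning “what to do” in it

`-` DeepMind: Breakout $\to$ AlphaGo

-   <https://www.youtube.com/watch?v=TmPfTpjtdgg>

`-` The future of reinforcement learning? (Can you make a living if you get good at it?)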

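# 5. Game1: the `Bandit` game

## A. Game description and raw code

`-` Problem setup: there are two buttons. Assume that pressing `버튼0` (button 0) gives a reward of 1 and pressing `버튼1` (button 1) gives a reward of 100.

`-` Which action should be taken first?

-   At the start, nothing is known.
-   So just press “anything” for now.

`-` Let's write code that presses a button at random.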

In [3]:
action_space = ['버튼0', '버튼1'] 
action = np.random.choice(action_space)
action

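> Remember the terms `action_space` and `action`.

`-` Let's implement the reward that follows from pressing a button. For now it is a plain `if`/`else`; it is wrapped in a method (`.step()`) later.

In [4]:
if action == '버튼0': # if 버튼0 (button 0) was pressed
    reward = 1 
else: # if 버튼1 (button 1) was pressed
    reward = 100 

> Remember the term `reward`.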

In [5]:
reward
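
The same reward rule can also be written as a function. The sketch below is only an illustration: the helper name `get_reward` is hypothetical and does not appear in the original notebook.

```python
import numpy as np

action_space = ['버튼0', '버튼1']

def get_reward(action):
    """Hypothetical helper: reward for pressing the given button."""
    if action == '버튼0':   # button 0
        return 1
    return 100              # button 1

# Press a random button and look up its reward.
action = np.random.choice(action_space)
reward = get_reward(action)
print(action, reward)
```

Keeping the rule in one place means every later loop can call the same function instead of repeating the `if`/`else`.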

`-` Let's press random buttons about 10 times and accumulate some data.

In [6]:
for _ in range(10):
    action = np.random.choice(action_space)
    if action == '버튼0': 
        reward = 1 
    else: 
        reward = 100     
    print(action,reward) 

버튼0 1
버튼1 100
버튼0 1
버튼0 1
버튼0 1
버튼0 1
버튼0 1
버튼1 100
버튼1 100
버튼1 100

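`-` Realization: this is an “environment” in which pressing `버튼0` gives 1 point and pressing `버튼1` gives 100 points. $\to$ So the right “action” here is to press `버튼1`.

-   Reinforcement learning is the field that systematizes the $\to$ step above.

> Remember the term `environment`.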

In [7]:
for _ in range(10):
    action = action_space[1]
    if action == '버튼0': 
        reward = 1 
    else: 
        reward = 100     
    print(action,reward) 

버튼1 100
버튼1 100
버튼1 100
버튼1 100
버튼1 100
버튼1 100
버튼1 100
버튼1 100
버튼1 100
버튼1 100

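-   Game cleared

`-` Reinforcement learning: understand the environment $\to$ the agent decides on an action

> Remember the term `agent`.

***Sentences used to say that the process above went well***

-   The reinforcement learning was successful.
-   The agent completed the environment's task.
-   The agent learned successfully in the environment.
-   The agent learned the correct action.
-   Game cleared (informal)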

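`-` We want to fix a metric that says when the game has been cleared.

-   First thought: why not declare the game cleared the moment `버튼1` is pressed?
-   Second thought: no, it could be pressed by chance.
-   Clear condition: (1) the first 20 tries just proceed; (2) after that, the game counts as cleared once the mean of the most recent 20 rewards is at least 95.[1]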
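
The clear condition can be sketched as a small predicate. The function name `is_cleared` and its parameters are assumptions made for illustration; the notebook itself inlines this check in each loop.

```python
import numpy as np

def is_cleared(rewards, n_warmup=20, window=20, threshold=95):
    """Hypothetical helper: True once the clear condition is met.

    The first `n_warmup` tries never count as cleared; afterwards the
    mean of the last `window` rewards must reach `threshold`.
    """
    if len(rewards) <= n_warmup:
        return False
    return bool(np.mean(rewards[-window:]) >= threshold)

print(is_cleared([100] * 20))                # still in the warm-up phase
print(is_cleared([100] * 21))                # 20 recent rewards, all 100
print(is_cleared([100] * 18 + [1, 1, 100]))  # two 1s in the recent window
```

The three calls print `False`, `True`, and `False`: the last window averages 90.1, which is below 95.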

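[1] `버튼1` is the right button to press, but an occasional mistake is forgiven: with rewards of 1 and 100, the mean of 20 rewards stays at or above 95 as long as at most one of them comes from `버튼0`.

`-` Raw code 1: an agent that does not understand the environment. It practically never clears the game.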

In [8]:
action_space = [0,1] 
rewards = [] 
for t in range(50): # rarely clears even with 10000 tries
    action = np.random.choice(action_space) # the ignorant agent's action (random guess)
    if action == 0: 
        reward = 1 
        rewards.append(reward)
    else: 
        reward = 100
        rewards.append(reward)
    #--# 
    print(
        f"n_try = {t+1}\t"
        f"action = {action}\t"
        f"reward = {reward}\t"
        f"mean(recent_rewards) = {np.mean(rewards[-20:])}"
    )
    #--#
    if t < 20:
        pass
    else:
        if np.mean(rewards[-20:]) >= 95:
            break 

n_try = 1   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 2   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 3   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 4   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 5   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 6   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 7   action = 1  reward = 100    mean(recent_rewards) = 15.142857142857142
n_try = 8   action = 0  reward = 1  mean(recent_rewards) = 13.375
n_try = 9   action = 0  reward = 1  mean(recent_rewards) = 12.0
n_try = 10  action = 0  reward = 1  mean(recent_rewards) = 10.9
n_try = 11  action = 1  reward = 100    mean(recent_rewards) = 19.0
n_try = 12  action = 1  reward = 100    mean(recent_rewards) = 25.75
n_try = 13  action = 1  reward = 100    mean(recent_rewards) = 31.46153846153846
n_try = 14  action = 1  reward = 100    mean(recent_rewards) = 36.357142857142854
n_try = 15  action = 1  reward = 100    mean(r
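
To see why random guessing rarely clears the game, consider one fixed window of 20 presses. The sketch below assumes each press is an independent fair coin flip between the two buttons (the default behaviour of `np.random.choice` without `p`); the variable names are illustrative only.

```python
from math import comb

window = 20

# Smallest number k of 100-point presses in the window such that
# the window mean (100*k + 1*(window-k)) / window reaches 95.
k_min = next(k for k in range(window + 1)
             if (100 * k + (window - k)) / window >= 95)

# P(X >= k_min) for X ~ Binomial(window, 0.5)
p_window = sum(comb(window, k) for k in range(k_min, window + 1)) / 2 ** window

print(k_min)     # 19
print(p_window)  # 21 / 2**20, roughly 2e-05
```

At least 19 of the 20 presses must be button 1, which happens for a given window with probability of about 0.002%.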

`-` Raw code 2: an agent that has understood the environment. Game cleared.

In [9]:
action_space = [0,1]
rewards = [] 
for t in range(50): 
    #action = np.random.choice(action_space) # the ignorant agent's action (random guess)
    action = 1 # the action of an agent that understands the environment
    if action == 0: 
        reward = 1 
        rewards.append(reward)
    else: 
        reward = 100
        rewards.append(reward)
    #--# 
    print(
        f"n_try = {t+1}\t"
        f"action = {action}\t"
        f"reward = {reward}\t"
        f"mean(recent_rewards) = {np.mean(rewards[-20:])}"
    )
    #--#
    if t < 20:
        pass
    else:
        if np.mean(rewards[-20:]) >= 95:
            break 

n_try = 1   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 2   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 3   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 4   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 5   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 6   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 7   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 8   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 9   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 10  action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 11  action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 12  action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 13  action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 14  action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 15  action = 1  reward = 1

## C. Revision 1: implementing `Env`

`-` Declare the `Bandit` class and implement `.step()`

In [10]:
class Bandit: 
    def step(self, action):
        if action == 0:
            return 1 
        else: 
            return 100 

In [11]:
action_space = [0,1]
env = Bandit()
rewards = []
for t in range(50): 
    action = np.random.choice(action_space)
    #action = 1
    reward = env.step(action)
    rewards.append(reward)
    #--# 
    print(
        f"n_try = {t+1}\t"
        f"action = {action}\t"
        f"reward = {reward}\t"
        f"mean(recent_rewards) = {np.mean(rewards[-20:])}"
    )
    #--#
    if t < 20:
        pass
    else:
        if np.mean(rewards[-20:]) >= 95:
            break 

n_try = 1   action = 1  reward = 100    mean(recent_rewards) = 100.0
n_try = 2   action = 0  reward = 1  mean(recent_rewards) = 50.5
n_try = 3   action = 0  reward = 1  mean(recent_rewards) = 34.0
n_try = 4   action = 1  reward = 100    mean(recent_rewards) = 50.5
n_try = 5   action = 1  reward = 100    mean(recent_rewards) = 60.4
n_try = 6   action = 1  reward = 100    mean(recent_rewards) = 67.0
n_try = 7   action = 1  reward = 100    mean(recent_rewards) = 71.71428571428571
n_try = 8   action = 0  reward = 1  mean(recent_rewards) = 62.875
n_try = 9   action = 0  reward = 1  mean(recent_rewards) = 56.0
n_try = 10  action = 1  reward = 100    mean(recent_rewards) = 60.4
n_try = 11  action = 1  reward = 100    mean(recent_rewards) = 64.0
n_try = 12  action = 0  reward = 1  mean(recent_rewards) = 58.75
n_try = 13  action = 1  reward = 100    mean(recent_rewards) = 61.92307692307692
n_try = 14  action = 1  reward = 100    mean(recent_rewards) = 64.64285714285714
n_try = 15  action = 0  r

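## D. Revision 2: implementing `Agent` (human intelligence)

`-` Designing the `Agent` class

-   It acts, and it remembers its own action and the reward received from the environment.
-   Implement the `.act()` and `.save_experience()` methods.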

In [12]:
class Agent:
    def __init__(self):
        self.action_space = [0,1]
        self.action = None 
        self.reward = None 
        self.actions = [] 
        self.rewards = []
    def act(self):
        self.action = np.random.choice(self.action_space) # ignorant agent (random guess)
        #self.action = 1 # enlightened agent
    def save_experience(self):
        self.actions.append(self.action)
        self.rewards.append(self.reward)

– The code runs roughly as follows –

**Time 0**: init

In [13]:
env = Bandit()
agent = Agent() 

In [14]:
agent.action, agent.reward

**Time 1**: agent \>\> env

In [15]:
agent.act()

In [16]:
agent.action, agent.reward

In [17]:
env.agent_action = agent.action

**Time 2**: agent \<\< env

In [18]:
agent.reward = env.step(env.agent_action)

In [19]:
agent.action, agent.reward, env.agent_action

In [20]:
agent.actions,agent.rewards

In [21]:
agent.save_experience()

In [22]:
agent.actions,agent.rewards

– Full code –

In [23]:
env = Bandit() 
agent = Agent()
for t in range(50): 
    # step1: agent >> env 
    agent.act() 
    env.agent_action = agent.action
    # step2: agent << env 
    agent.reward = env.step(env.agent_action)
    agent.save_experience() 
    #--# 
    print(
        f"n_try = {t+1}\t"
        f"action = {agent.action}\t"
        f"reward = {agent.reward}\t"
        f"mean(recent_rewards) = {np.mean(agent.rewards[-20:])}"
    )
    #--#
    if t < 20:
        pass
    else:
        if np.mean(agent.rewards[-20:]) >= 95:
            break 

n_try = 1   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 2   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 3   action = 1  reward = 100    mean(recent_rewards) = 34.0
n_try = 4   action = 0  reward = 1  mean(recent_rewards) = 25.75
n_try = 5   action = 0  reward = 1  mean(recent_rewards) = 20.8
n_try = 6   action = 0  reward = 1  mean(recent_rewards) = 17.5
n_try = 7   action = 0  reward = 1  mean(recent_rewards) = 15.142857142857142
n_try = 8   action = 0  reward = 1  mean(recent_rewards) = 13.375
n_try = 9   action = 0  reward = 1  mean(recent_rewards) = 12.0
n_try = 10  action = 0  reward = 1  mean(recent_rewards) = 10.9
n_try = 11  action = 0  reward = 1  mean(recent_rewards) = 10.0
n_try = 12  action = 0  reward = 1  mean(recent_rewards) = 9.25
n_try = 13  action = 1  reward = 100    mean(recent_rewards) = 16.23076923076923
n_try = 14  action = 0  reward = 1  mean(recent_rewards) = 15.142857142857142
n_try = 15  action = 1  reward = 100    mean(recent_re

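## E. Revision 3: implementing `Agent` (artificial intelligence)

`-` Limitations of the solution so far

-   Reinforcement learning is really a formalization of the “$\to$” step in
    “understand the environment $\to$ decide on an action”.
-   In the code so far, however, the agent picked the optimal action[1]
    **“intuitively”** as soon as the environment was understood (a human
    hard-coded it), so the machine cannot be said to have learned on its own.

`-` Design the agent so that it can learn from data by itself. Subtitle:
let's design `agent.learn()`.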

1.  Collect data and build a `q_table`. The `q_table` contains the
    following.

|      Action       | Mean reward  |
|:-----------------:|:------------:|
| Button 0 ($=a_0$) |  1 ($=q_0$)  |
| Button 1 ($=a_1$) | 100 ($=q_1$) |

2.  Set an appropriate policy (=`policy`) based on the `q_table`.

-   In this example, a “policy” that selects button 0 and button 1 with
    probabilities $\big(\frac{q_0}{q_0+q_1},\frac{q_1}{q_0+q_1}\big)$,
    respectively, should be enough.

> Remember the terms `q_table` and `policy` here.
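
As a quick numerical check of this policy, the sketch below plugs in the table values $q_0=1$ and $q_1=100$. It also estimates how often a window of 20 presses drawn from the policy would satisfy the clear condition, using the fact that a mean of at least 95 requires at least 19 presses of button 1. Variable names are illustrative, and the presses are assumed independent.

```python
import numpy as np

q_table = np.array([1.0, 100.0])   # mean rewards (q_0, q_1) from the table
policy = q_table / q_table.sum()   # (q_0/(q_0+q_1), q_1/(q_0+q_1))
print(policy)                      # about [0.0099, 0.9901]

# Probability that at least 19 of 20 independent presses are button 1
p1 = policy[1]
p_clear_window = p1 ** 20 + 20 * p1 ** 19 * (1 - p1)
print(p_clear_window)              # a little above 0.98
```

So even though the policy still presses button 0 about 1% of the time, a window of 20 policy-driven presses meets the threshold most of the time.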

[1] Pressing `버튼1` (button 1).

`-` Example code for computing the `q_table`

In [24]:
agent.actions = [0,1,1,0,1,0,0] 
agent.rewards = [1,101,102,1,99,1,1.2] 
actions = np.array(agent.actions)
rewards = np.array(agent.rewards)

In [25]:
q0 = rewards[actions == 0].mean()
q1 = rewards[actions == 1].mean()

In [26]:
agent.q = np.array([q0,q1]) 
agent.q

In [27]:
prob = agent.q / agent.q.sum()
prob 

In [28]:
action = np.random.choice([0,1], p= prob)
action

`-` Final code, cleaned up

In [29]:
class Bandit: 
    def step(self, action):
        if action == 0:
            return 1 
        else: 
            return 100 
class Agent:
    def __init__(self):
        self.action_space = [0,1]
        self.action = None 
        self.reward = None 
        self.actions = [] 
        self.rewards = []
        self.q_table = np.array([0,0]) 
        self.n_experience = 0 
    def act(self):
        if self.n_experience < 20: 
            self.action = np.random.choice(self.action_space)
        else: 
            prob = self.q_table / self.q_table.sum()
            self.action = np.random.choice(self.action_space, p= prob)
    def save_experience(self):
        self.actions.append(self.action)
        self.rewards.append(self.reward)
        self.n_experience += 1 
    def learn(self):
        if self.n_experience<20: 
            pass 
        else: 
            actions = np.array(self.actions)
            rewards = np.array(self.rewards)
            q0 = rewards[actions == 0].mean()
            q1 = rewards[actions == 1].mean()
            self.q_table = np.array([q0,q1]) 
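
One edge case in `learn()`: if one of the two actions was never chosen during the 20 random tries, `rewards[actions == a]` is empty and its `.mean()` is `nan`, so the probabilities computed in `act()` would contain `nan` and sampling from them fails. This is very unlikely with 20 fair draws, but a guard is cheap. The sketch below is one possible guard, written as a standalone hypothetical function `estimate_q` (not part of the original class) that falls back to a default value for untried actions.

```python
import numpy as np

def estimate_q(actions, rewards, n_actions=2, default=0.0):
    """Hypothetical helper: mean reward per action.

    Actions that were never tried keep `default` instead of becoming nan.
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    q = np.full(n_actions, default)
    for a in range(n_actions):
        mask = actions == a
        if mask.any():               # only average when there is data
            q[a] = rewards[mask].mean()
    return q

print(estimate_q([0, 1, 1, 0, 1, 0, 0], [1, 101, 102, 1, 99, 1, 1.2]))
print(estimate_q([1, 1, 1], [100, 100, 100]))  # action 0 never tried
```

With a default of 0, an untried action simply gets probability 0 under the proportional policy; whether that is desirable is a design choice, since it also means the action is never explored afterwards.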

In [30]:
env = Bandit() 
agent = Agent()
for t in range(50): 
    ## 1. main code
    # step1: agent >> env 
    agent.act() 
    env.agent_action = agent.action
    # step2: agent << env 
    agent.reward = env.step(env.agent_action)
    agent.save_experience() 
    # step3: learn 
    agent.learn()
    #--# 
    print(
        f"n_try = {t+1}\t"
        f"action = {agent.action}\t"
        f"reward = {agent.reward}\t"
        f"mean(recent_rewards) = {np.mean(agent.rewards[-20:])}"
    )
    #--#
    if t < 20:
        pass
    else:
        if np.mean(agent.rewards[-20:]) >= 95:
            break  

n_try = 1   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 2   action = 0  reward = 1  mean(recent_rewards) = 1.0
n_try = 3   action = 1  reward = 100    mean(recent_rewards) = 34.0
n_try = 4   action = 1  reward = 100    mean(recent_rewards) = 50.5
n_try = 5   action = 1  reward = 100    mean(recent_rewards) = 60.4
n_try = 6   action = 0  reward = 1  mean(recent_rewards) = 50.5
n_try = 7   action = 1  reward = 100    mean(recent_rewards) = 57.57142857142857
n_try = 8   action = 1  reward = 100    mean(recent_rewards) = 62.875
n_try = 9   action = 0  reward = 1  mean(recent_rewards) = 56.0
n_try = 10  action = 1  reward = 100    mean(recent_rewards) = 60.4
n_try = 11  action = 1  reward = 100    mean(recent_rewards) = 64.0
n_try = 12  action = 1  reward = 100    mean(recent_rewards) = 67.0
n_try = 13  action = 0  reward = 1  mean(recent_rewards) = 61.92307692307692
n_try = 14  action = 0  reward = 1  mean(recent_rewards) = 57.57142857142857
n_try = 15  action = 1  reward = 