# Reinforcement Learning Example: OpenAI Gym Series 1. Pole Balancing (CartPole)

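* An OpenAI 'pole balancing' example for understanding how reinforcement learning is implemented
* Author: RJBrooker https://github.com/RJBrooker/Q-learning-demo-Cartpole-V1
* Speaker: 동준상 (naebon1@gmail.com) / 2021.1.14 / KIDET 한국국방기술학회 AI Seminar
* Note: this source code runs into several problems on Google Colab, so for now it can only be run in an Anaconda environment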

### CartPole Example Overview

* A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.
* This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson.
* Reference: Reinforcement Learning: An Introduction (copy hosted by Stanford University): https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf

### Episode Termination

* Pole Angle is more than ±12 degrees.
* Cart Position is more than ±2.4 (center of the cart reaches the edge of the display).
* Episode length is greater than 200 (the CartPole-v0 limit; the CartPole-v1 environment created below allows 500 steps).
* Solved Requirements: considered solved when the average return is greater than or equal to 195.0 over 100 consecutive trials (the CartPole-v0 threshold; for CartPole-v1 it is 475.0).
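
The termination conditions above can be sketched as a small check. This is only an illustration, not the environment's own code: `episode_should_end` and the constant names are hypothetical, and the thresholds are copied from the list above.

```python
import math

# Thresholds copied from the termination list above.
ANGLE_LIMIT_RAD = math.radians(12)   # pole angle limit (12 degrees)
POSITION_LIMIT = 2.4                 # cart position limit
MAX_EPISODE_STEPS = 200              # episode length limit as stated above

def episode_should_end(cart_position: float, pole_angle_rad: float, steps_taken: int) -> bool:
    """Hypothetical helper: True if any of the listed termination conditions holds."""
    pole_fell = abs(pole_angle_rad) > ANGLE_LIMIT_RAD
    cart_left_track = abs(cart_position) > POSITION_LIMIT
    out_of_time = steps_taken > MAX_EPISODE_STEPS   # "greater than 200" as worded above
    return pole_fell or cart_left_track or out_of_time

print(episode_should_end(0.0, 0.0, 10))               # False: upright, centred, early
print(episode_should_end(2.5, 0.0, 10))               # True: cart past the edge
print(episode_should_end(0.0, math.radians(13), 10))  # True: pole tilted too far
```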

## Installing and Importing Libraries

In [None]:
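#!pip install --upgrade pip
# The cells below use the classic Gym API (reset() returns only the observation,
# step() returns 4 values). gym 0.26 changed both, so an older release is pinned here.
!pip install "gym<0.26"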

import gym

import numpy as np 
import time, math, random
from typing import Tuple

In [None]:
# The PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-learn

# If KBinsDiscretizer cannot be imported, run `conda update scikit-learn` in a console
from sklearn.preprocessing import KBinsDiscretizer

## CartPole-v1

In [None]:
env = gym.make('CartPole-v1')

In [None]:
env

## Checking the Untrained Agent's Behavior and Visualising the Environment

* Visualise the environment/simulation

In [None]:
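# Trivial policy: always push the cart to the right (action 1)
policy = lambda obs: 1

for _ in range(5):
    obs = env.reset()
    for _ in range(80):
        actions = policy(obs)
        obs, reward, done, info = env.step(actions) 
        env.render()
        time.sleep(0.05)
        if done:  # stop stepping once the episode has ended
            break

env.close()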

In [None]:
# Look at the docstring.
?env.env

## Hard-Coded Policy

In [None]:
#Simple policy function 
policy = lambda _,__,___, tip_velocity : int( tip_velocity > 0 )
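
To make the rule concrete, here is a minimal sketch of the same policy written as a named function and applied to a few made-up observations. It assumes the usual CartPole observation order (cart position, cart velocity, pole angle, pole angular velocity) and action coding (0 = push left, 1 = push right). The name `tip_velocity_policy` and the sample values are illustrative only.

```python
def tip_velocity_policy(cart_position, cart_velocity, pole_angle, tip_velocity):
    """Hypothetical named version of the lambda above:
    push right (1) if the pole tip is moving right, otherwise push left (0)."""
    return int(tip_velocity > 0)

# Made-up observations for illustration (not recorded from the environment).
sample_observations = [
    (0.0, 0.0, 0.02, 0.5),    # pole tip moving right
    (0.0, 0.0, -0.02, -0.3),  # pole tip moving left
    (0.1, -0.2, 0.01, 0.0),   # no tip movement
]
actions = [tip_velocity_policy(*obs) for obs in sample_observations]
print(actions)  # [1, 0, 0]
```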

## Q-learning

* Convert CartPole's continuous state space into a discrete one.

In [None]:
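# Only the pole angle (index 2) and pole angular velocity (index 3) are discretized:
# 6 bins for the angle, 12 bins for the velocity.
n_bins = ( 6 , 12 )
# The velocity is binned over ±50 degrees/s (converted to radians), since its nominal bounds are effectively unbounded.
lower_bounds = [ env.observation_space.low[2], -math.radians(50) ]
upper_bounds = [ env.observation_space.high[2], math.radians(50) ]

def discretizer( _ , __ , angle, pole_velocity ) -> Tuple[int,...]:
    """Convert the continuous state into a discrete state"""
    est = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
    est.fit([lower_bounds, upper_bounds ])
    return tuple(map(int,est.transform([[angle, pole_velocity]])[0]))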
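
What the uniform binning does can be sketched with plain NumPy. This is an approximation for illustration, not a replacement for the `KBinsDiscretizer` call above, and bin assignment may differ for values that fall exactly on a bin edge. The angle bounds of ±0.418 rad are an assumed stand-in for `env.observation_space.low[2]` and `high[2]`, and `discretize_uniform` is a hypothetical name.

```python
import math
import numpy as np

N_BINS = np.array([6, 12])                        # same bin counts as n_bins above
LOWER = np.array([-0.418, -math.radians(50)])     # assumed angle bound, velocity bound
UPPER = np.array([0.418, math.radians(50)])

def discretize_uniform(angle: float, pole_velocity: float) -> tuple:
    """Hypothetical helper: map (angle, velocity) to equal-width bin indices."""
    values = np.array([angle, pole_velocity])
    ratios = (values - LOWER) / (UPPER - LOWER)    # 0.0 at lower bound, 1.0 at upper bound
    idx = np.floor(ratios * N_BINS).astype(int)    # equal-width bin index
    idx = np.clip(idx, 0, N_BINS - 1)              # out-of-range values go to the edge bins
    return tuple(int(i) for i in idx)

print(discretize_uniform(0.0, 0.0))      # (3, 6)
print(discretize_uniform(-10.0, 10.0))   # (0, 11)
```

With 6 × 12 bins the agent only ever sees 72 distinct states, which is what makes a tabular method feasible here.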

## Initialise the Q-value table with zeros

In [None]:
Q_table = np.zeros(n_bins + (env.action_space.n,))
Q_table.shape
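
As a minimal sketch of how such a table is laid out and indexed, assuming CartPole's two actions (push left, push right) in place of `env.action_space.n`. The names `n_state_bins`, `n_actions` and `q_table_sketch` are illustrative, chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np

n_state_bins = (6, 12)   # same bin counts as n_bins above
n_actions = 2            # assumption: CartPole has two actions

# One Q value per (angle bin, velocity bin, action) combination.
q_table_sketch = np.zeros(n_state_bins + (n_actions,))
print(q_table_sketch.shape)      # (6, 12, 2)

state = (3, 6)                   # an illustrative discrete state
print(q_table_sketch[state])     # [0. 0.]  -> Q values of both actions in that state

q_table_sketch[state][1] = 0.5   # write the Q value of action 1 in that state
print(q_table_sketch[3, 6, 1])   # 0.5
```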

## Create a policy function

* Define a policy( ) function that uses the Q-table to greedily select the action with the highest Q value
* Combined with the random exploration step in the training loop, this gives an epsilon-greedy policy

In [None]:
def policy( state : tuple ):
    """Choose the greedy action (highest Q value); random exploration is added in the training loop"""
    return np.argmax(Q_table[state])
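
For comparison, an epsilon-greedy choice can also be written as a single function. The sketch below is an assumption-laden illustration, not the notebook's method: `epsilon_greedy_action` is a hypothetical helper, and the seeded generator is used only to make the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)

def epsilon_greedy_action(q_values: np.ndarray, epsilon: float) -> int:
    """Hypothetical helper: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: highest Q value

q_values = np.array([0.2, 0.7])                   # made-up Q values for one state
print(epsilon_greedy_action(q_values, epsilon=0.0))  # 1 (always greedy when epsilon is 0)
```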

## Q-value update function

In [None]:
def new_Q_value( reward : float ,  new_state : tuple , discount_factor=1 ) -> float:
    """Temporal-difference target for updating the Q-value of a state-action pair"""
    future_optimal_value = np.max(Q_table[new_state])
    learned_value = reward + discount_factor * future_optimal_value
    return learned_value
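
A worked numeric example may help. The sketch below combines the target computed above with the learning-rate blend `(1-lr)*old_value + lr*learnt_value` used in the training loop. The helper name `blended_q_value` and all input numbers are made up for illustration.

```python
import math

def blended_q_value(old_value, reward, best_next_value, lr, discount_factor=1.0):
    """Hypothetical helper: TD target followed by the learning-rate blend."""
    td_target = reward + discount_factor * best_next_value   # same form as new_Q_value
    return (1 - lr) * old_value + lr * td_target

# Made-up numbers: old Q = 0.5, reward = 1, best next-state Q = 2, lr = 0.1
# target = 1 + 1 * 2 = 3, new Q = 0.9 * 0.5 + 0.1 * 3 = 0.75
updated = blended_q_value(old_value=0.5, reward=1.0, best_next_value=2.0, lr=0.1)
print(round(updated, 4))  # 0.75
```

With `lr=1` the old value is replaced by the target outright, and with `lr=0` nothing is learned, which is why the learning rate is decayed over episodes.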

## Adaptive learning with a decaying learning rate

In [None]:
# Adaptive learning of Learning Rate
def learning_rate(n : int , min_rate=0.01 ) -> float  :
    """Decaying learning rate"""
    return max(min_rate, min(1.0, 1.0 - math.log10((n + 1) / 25)))

## Adaptive learning with a decaying exploration rate

In [None]:
def exploration_rate(n : int, min_rate= 0.1 ) -> float :
    """Decaying exploration rate"""
    return max(min_rate, min(1, 1.0 - math.log10((n  + 1) / 25)))
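
To see how these schedules behave, the sketch below evaluates the same formula at a few episode numbers. `log_decay` is a hypothetical name that takes the floor as an explicit argument, and the episode numbers are arbitrary sample points.

```python
import math

def log_decay(n: int, min_rate: float) -> float:
    """Same formula as the two rate functions above, with the floor passed explicitly."""
    return max(min_rate, min(1.0, 1.0 - math.log10((n + 1) / 25)))

for episode in (0, 24, 49, 99, 249, 999):
    lr = log_decay(episode, min_rate=0.01)   # learning-rate floor
    eps = log_decay(episode, min_rate=0.1)   # exploration-rate floor
    print(f"episode {episode:>4}: learning_rate={lr:.3f}, exploration_rate={eps:.3f}")
```

Both rates stay at 1.0 for the first 25 episodes, then fall logarithmically and reach their floors (0.01 and 0.1) within roughly the first 250 episodes. The remaining episodes of a long run are therefore spent mostly exploiting, with small updates.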

## Training the Agent and Visualising Its Behavior

* Number of episodes: 10,000
* The continuous state information is discretized

In [None]:
n_episodes = 10000 
for e in range(n_episodes):
    
    # Discretize the initial state into buckets
    current_state, done = discretizer(*env.reset()), False
    
    while not done:
        
        # policy action 
        action = policy(current_state) # exploit
        
        # insert random action
        if np.random.random() < exploration_rate(e) : 
            action = env.action_space.sample() # explore 
         
        # increment environment
        obs, reward, done, _ = env.step(action)
        new_state = discretizer(*obs)
        
        # Update Q-Table
        lr = learning_rate(e)
        learnt_value = new_Q_value(reward , new_state )
        old_value = Q_table[current_state][action]
        Q_table[current_state][action] = (1-lr)*old_value + lr*learnt_value
        
        current_state = new_state
        
        # Render the cartpole environment
        env.render()
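
The loop above does not record how well each episode went. One way to judge progress against the "solved" criterion stated earlier is to keep a list of per-episode returns and check a moving average. The sketch below assumes such a list exists: `is_solved` is a hypothetical helper, the default threshold and window are the 195.0 and 100 quoted earlier, and the sample returns are made up.

```python
import numpy as np

def is_solved(episode_returns, threshold=195.0, window=100) -> bool:
    """Hypothetical helper: True if any `window` consecutive episodes
    have an average return of at least `threshold`."""
    returns = np.asarray(episode_returns, dtype=float)
    if len(returns) < window:
        return False
    # Moving average over every run of `window` consecutive episodes.
    moving_avg = np.convolve(returns, np.ones(window) / window, mode="valid")
    return bool(np.any(moving_avg >= threshold))

# Illustrative, made-up returns (not results from this notebook).
print(is_solved([200.0] * 100))   # True
print(is_solved([50.0] * 300))    # False
```

In the training loop, this would mean summing `reward` within each episode and appending the total to a list after the inner `while` loop ends.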