## SmartGrid with DQN


### SmartGrid with DQN: Learning Objectives

**Topic**: Applying DQN to the SmartGrid environment.

In [1]:
import grid2op
from grid2op.PlotGrid import PlotMatplot 
from grid2op.Reward import L2RPNReward
from tqdm import tqdm, notebook # for easy progress bar
import os
import sys
sys.path.append("..")
from grid_agent import GridAgent

import numpy as np
import collections
import random

The SmartGrid environment is structured much like gym, but its state (observation) and action differ.
Let's look at those differences first.

In [2]:
env = grid2op.make("rte_case5_example", test=True,  reward_class=L2RPNReward)


You are using a development environment. This environment is not intended for training agents. It might not be up to date and its primary use if for tests (hence the "test=True" you passed as argument). Use at your own risk.



Let's run one test episode with the same code we used for CartPole.

The code runs without errors, but the action and obs are clearly different from those of a typical gym environment.

In [3]:
# reset must be called at the start of every episode.
obs = env.reset()

# Sample a random action and apply it to the environment.
action = env.action_space.sample()
print("Sampled action:\n {}\n".format(action))
obs, reward, done, info = env.step(action)

print("obs : {}\nreward : {}\ndone : {}\n".format(obs, reward, done))

# Test a single episode
obs, done, ep_duration = env.reset(), False, 0

while True: 
    action = env.action_space.sample() # select an action
    obs, reward, done, info = env.step(action)  # apply the action to the environment
    ep_duration += 1
    if done:  # check whether the episode has ended
        break
        
env.close()  
print("episode duration : ", ep_duration) 

Sampled action:
 This action will:
	 - NOT change anything to the injections
	 - NOT perform any redispatching action
	 - NOT force any line status
	 - NOT switch any line status
	 - NOT switch anything in the topology
	 - Set the bus of the following element:
	 	 - assign bus 2 to line (extremity) 1 [on substation 2]
	 	 - assign bus 1 to line (extremity) 4 [on substation 2]
	 - disconnect line (origin) 5 [on substation 2]

obs : <grid2op.Space.GridObjects.CompleteObservation_rte_case5_example object at 0x7fb247d4f780>
reward : 3.427182197570801
done : False

episode duration :  8


### Converter

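The SmartGrid environment uses its own formats for state and action. As shown above, we can step through it the same way as an ordinary Gym environment, but we cannot train the DQN we implemented earlier on it directly: the network takes a numeric array as input and selects an action by integer index, while the environment works with its own observation and action objects.

To train the previously implemented DQN in the SmartGrid environment, the state and action must be converted to match the network model, and the SmartGrid environment provides Converter functions for this purpose.

Let's convert the observation and the action with the pre-built Converter functions.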

In [4]:
# Use an Agent extended with a Converter for the grid environment.
agent = GridAgent(env=env)

### Observation space

**agent.convert_obs(obs)**

The convert_obs function takes a grid-environment observation as input and converts it into a gym-style observation (ndarray) suitable for training.

In [5]:
### Observation space
grid_obs = env.reset()
print(grid_obs)

gym_obs = agent.convert_obs(grid_obs)
print(gym_obs.shape)

<grid2op.Space.GridObjects.CompleteObservation_rte_case5_example object at 0x7fb247b41ac8>
(39,)
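
The implementation of `convert_obs` is not shown in this notebook, so the following is only a minimal sketch of the general idea, assuming the conversion amounts to concatenating selected observation fields into one flat vector. The dictionary `toy_obs`, its field names and values, and the helper `flatten_obs` are all hypothetical stand-ins, not part of `GridAgent` or grid2op.

```python
import numpy as np

# Hypothetical stand-in for a grid observation: named fields of different lengths.
toy_obs = {
    "rho": [0.42, 0.61, 0.35],           # illustrative line loading ratios
    "line_status": [True, True, False],  # illustrative connection flags
    "load_p": [10.0, 8.5],               # illustrative load values
}

def flatten_obs(obs, keys):
    """Concatenate the selected fields into a single flat float32 vector."""
    parts = [np.asarray(obs[k], dtype=np.float32).ravel() for k in keys]
    return np.concatenate(parts)

vec = flatten_obs(toy_obs, ["rho", "line_status", "load_p"])
print(vec.shape)
```

A fixed-length vector like this is what a dense network expects as input; in the notebook the real converted observation has 39 entries.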


### Action space

**agent.id_to_act(act(int))**

The id_to_act function takes an int action as input and converts it into an action that the grid environment can use.

In [6]:
### Action space
print("number of action = ", agent.num_actions)

gym_act = np.random.randint(agent.num_actions)
print("int-like action = ", gym_act)

grid_act = agent.id_to_act(gym_act)
print("grid_ action \n", grid_act)

number of action =  67
int-like action =  33
grid_ action 
 This action will:
	 - NOT change anything to the injections
	 - NOT perform any redispatching action
	 - NOT force any line status
	 - NOT switch any line status
	 - NOT switch anything in the topology
	 - Set the bus of the following element:
	 	 - assign bus 2 to line (origin) 0 [on substation 0]
	 	 - assign bus 1 to line (origin) 1 [on substation 0]
	 	 - assign bus 1 to line (origin) 2 [on substation 0]
	 	 - assign bus 2 to line (origin) 3 [on substation 0]
	 	 - assign bus 1 to generator 0 [on substation 0]
	 	 - assign bus 1 to load 0 [on substation 0]


In [7]:
for i in range(agent.num_actions):
    print(i, agent.id_to_act(i))

0 This action will:
	 - NOT change anything to the injections
	 - NOT perform any redispatching action
	 - NOT force any line status
	 - NOT switch any line status
	 - NOT switch anything in the topology
	 - NOT force any particular bus configuration
1 This action will:
	 - NOT change anything to the injections
	 - NOT perform any redispatching action
	 - NOT force any line status
	 - switch status of 1 powerlines ([0])
	 - NOT switch anything in the topology
	 - NOT force any particular bus configuration
2 This action will:
	 - NOT change anything to the injections
	 - NOT perform any redispatching action
	 - NOT force any line status
	 - switch status of 1 powerlines ([1])
	 - NOT switch anything in the topology
	 - NOT force any particular bus configuration
3 This action will:
	 - NOT change anything to the injections
	 - NOT perform any redispatching action
	 - NOT force any line status
	 - switch status of 1 powerlines ([2])
	 - NOT switch anything in the topology
	 - NOT force an
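
The listing above suggests that the integer ids simply index into a fixed table of pre-built actions, with id 0 being the do-nothing action. A minimal sketch of that idea follows, assuming a plain lookup table; the dictionaries here are hypothetical placeholders for real grid action objects, and this `id_to_act` is an illustrative helper, not the `GridAgent` method.

```python
import numpy as np

# Hypothetical action table: index 0 is "do nothing", the rest toggle one line each.
# Plain dicts stand in for real grid action objects.
n_lines = 4
action_table = [{"switch_line": None}] + [{"switch_line": i} for i in range(n_lines)]

def id_to_act(action_id):
    """Map an integer action id to the corresponding entry of the table."""
    return action_table[action_id]

num_actions = len(action_table)

# The network only ever has to output an index in [0, num_actions).
rng = np.random.default_rng(0)
a = int(rng.integers(num_actions))
print(num_actions, id_to_act(0), id_to_act(a))
```

With such a table the Q-network needs exactly `num_actions` outputs, one per table entry.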

In [8]:
import tensorflow as tf
import tensorflow.keras.layers as kl
import tensorflow.keras.optimizers as ko
from collections import deque
from tqdm import tqdm, notebook  # library that displays training progress more cleanly

In [9]:
buffer_size=5000
learning_rate=0.01
epsilon=1.0
epsilon_decay=0.995
min_epsilon=0.01
gamma=0.98
batch_size=16
target_update_iter=400
train_nums=10000
start_learning = 40
max_iter=200

# Neural Network Model 
class Model(tf.keras.Model):
    def __init__(self, num_actions, units=[32, 32]):
        super().__init__()
        self.fc1 = kl.Dense(units[0], activation='relu', kernel_initializer='he_uniform')
        self.fc2 = kl.Dense(units[1], activation='relu', kernel_initializer='he_uniform')
        self.logits = kl.Dense(num_actions, name='q_values')

    
    # forward propagation
    def call(self, inputs):
        x = self.fc1(inputs)
        x = self.fc2(x)
        x = self.logits(x)
        return x

    # return the best action, i.e. the one that maximizes the action-value (Q) predicted by the network
    # a* = argmax_a' Q(s, a')
    def action_value(self, obs):
        q_values = self.predict(obs)
        best_action = np.argmax(q_values, axis=-1)
        return best_action[0]

class ReplayBuffer:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.count = 0
        self.buffer = deque(maxlen=buffer_size) 

    # store transition of each step in replay buffer
    def store(self, s, a, r, next_s, d):
        experience = (s, a, r, d, next_s)
        self.buffer.append(experience)
        self.count += 1

    # Sample a random minibatch of transitions
    def sample(self, batch_size):
        batch = []
        if self.count < batch_size:
            batch = random.sample(self.buffer, self.count)
        else:
            batch = random.sample(self.buffer, batch_size)

        s_batch, a_batch, r_batch, d_batch, s2_batch = map(np.array, list(zip(*batch)))
        return s_batch, a_batch, r_batch, s2_batch, d_batch
    
    def clear(self):
        self.buffer.clear()
        self.count = 0
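
As a quick sanity check on the exploration hyperparameters above, the sketch below computes the value of epsilon after a given number of steps. It assumes epsilon starts at 1.0, is multiplied by `epsilon_decay` once per environment step, and is floored at `min_epsilon`, which is how the training loop in this notebook updates it. `epsilon_at` and `steps_to_floor` are names introduced here for illustration only.

```python
import math

epsilon_decay = 0.995
min_epsilon = 0.01

def epsilon_at(step):
    """Epsilon after `step` multiplicative decays from 1.0, floored at min_epsilon."""
    return max(epsilon_decay ** step, min_epsilon)

# Smallest number of steps t such that epsilon_decay ** t <= min_epsilon
steps_to_floor = math.ceil(math.log(min_epsilon) / math.log(epsilon_decay))

print(round(epsilon_at(610), 2))
print(epsilon_at(1220))
print(steps_to_floor)
```

With these settings exploration is essentially over after roughly the first thousand of the 10000 training steps, so most of the run is spent acting greedily.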

In [10]:
network = Model(agent.num_actions)
action_types = ["random", "do_nothing"]
max_steps = env.chronics_handler.max_timestep()

# test before train
for action_type in action_types:
    done = False
    obs, done, ep_reward = env.reset(), False, 0
    actions = []
    msg = "Smart grid [ {} ] agent".format(action_type)
    for t in notebook.tqdm(range(max_steps), desc=msg):
        if action_type == "random":  # select a random action
            action = np.random.randint(agent.num_actions)
        elif action_type == "do_nothing":  # select the do-nothing action (action number 0)
            action = 0
        actions.append(action)
        converted_act = agent.id_to_act(action)
        obs, reward, done, _ = env.step(converted_act)
        ep_reward += reward
        if done:
            break
    print(actions[:10])

[2, 26, 52, 47, 39]


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [13]:
replay_buffer = ReplayBuffer(buffer_size)

network = Model(agent.num_actions)
target_network = Model(agent.num_actions)
target_network.set_weights(network.get_weights()) # initialize target network weight 
opt = ko.Adam(learning_rate=.0015)  # note: the learning_rate hyperparameter (0.01) defined earlier is not used here
network.compile(optimizer=opt, loss='mse')

obs = env.reset()
epi_duration = 0
epi = 0 # number of episode taken
epsilon=1.0
avg_duration = deque(maxlen=5)

for t in notebook.tqdm(range(1, train_nums+1), desc='train with DQN'):
    # epsilon update
    if epsilon > min_epsilon:
        epsilon = max(epsilon * epsilon_decay, min_epsilon)

    #######################  step 1  ####################### 
    ####        Select action using epsilon-greedy       ### 
    ########################################################   

    gym_obs = agent.convert_obs(obs)
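    # greedy action: the one that maximizes the Q-value predicted by the network
    best_action = network.action_value(np.atleast_2d(gym_obs))  # input the obs to the network model 
    
    # epsilon-greedy: with prob. epsilon take the exploratory action, otherwise the greedy one
    if np.random.rand() < epsilon:
        # NOTE: exploration here falls back to the do-nothing action (id 0);
        # uncomment the next line for standard uniformly random exploration.
        #gym_act = np.random.randint(agent.num_actions)
        gym_act = 0
    else:
        gym_act = best_action   # with prob. 1 - epsilon, select the greedy action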
    
    #######################  step 2  ####################### 
    #### Take step and store transition to replay buffer ### 
    ########################################################
    
    grid_act = agent.id_to_act(gym_act)
    next_obs, reward, done, _ = env.step(grid_act)    # Execute action in the env to return s'(next state), r, done
    gym_next_obs = agent.convert_obs(next_obs)
    replay_buffer.store(gym_obs, gym_act, reward, gym_next_obs, done)
    
    epi_duration += 1
    
    #######################  step 3  ####################### 
    ####     Train network (perform gradient descent)    ### 
    ########################################################
    
    if t > start_learning and t % 2 == 0:
        # compute the target value
        # np.amax -> returns the largest value along the given axis
        s_batch, a_batch, r_batch, ns_batch, done_batch = replay_buffer.sample(batch_size)
        target_q = r_batch + gamma * np.amax(target_network.predict(ns_batch), axis=1) * (1- done_batch)  
        q_values = network.predict(s_batch) 
        for i, action in enumerate(a_batch):
            q_values[i][action] = target_q[i]

        network.train_on_batch(s_batch, q_values)
    
    #######################  step 4  ####################### 
    ####             Update target network               ### 
    ########################################################
      
    if t % target_update_iter == 0:
        target_network.set_weights(network.get_weights()) # assign the current network parameters to target network
 
    obs = next_obs  # s <- s'
    # if episode ends (done)
    if (epi_duration >= max_iter) or done:
        epi += 1 # num of episode 
        avg_duration.append(epi_duration)
        if epi % 5 == 0:
            print("[Episode {:>5}] avg duration: {:>6.2f}  --eps : {:>4.2f} --steps : {:>5}".format(epi, np.mean(avg_duration), epsilon, t))
        obs, done, epi_duration = env.reset(), False, 0  # Environment reset


[Episode     5] avg duration: 122.00  --eps : 0.05 --steps :   610
[Episode    10] avg duration: 122.00  --eps : 0.01 --steps :  1220
[Episode    15] avg duration: 122.00  --eps : 0.01 --steps :  1830
[Episode    20] avg duration: 122.00  --eps : 0.01 --steps :  2440
[Episode    25] avg duration: 200.00  --eps : 0.01 --steps :  3440
[Episode    30] avg duration: 200.00  --eps : 0.01 --steps :  4440
[Episode    35] avg duration: 200.00  --eps : 0.01 --steps :  5440
[Episode    40] avg duration: 161.00  --eps : 0.01 --steps :  6245
[Episode    45] avg duration: 176.60  --eps : 0.01 --steps :  7128
[Episode    50] avg duration: 200.00  --eps : 0.01 --steps :  8128
[Episode    55] avg duration: 161.00  --eps : 0.01 --steps :  8933
[Episode    60] avg duration: 122.00  --eps : 0.01 --steps :  9543
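
The target computation in the training step can be checked by hand on a toy batch. The sketch below mirrors that step with NumPy only: `next_q` and `q_values` are made-up arrays standing in for the outputs of `target_network.predict(ns_batch)` and `network.predict(s_batch)`, and every number is illustrative. The done flags are cast to float explicitly here to make the masking visible.

```python
import numpy as np

gamma = 0.98

# Toy batch of 3 transitions with 2 possible actions (all values made up).
r_batch = np.array([1.0, 0.5, 2.0])
done_batch = np.array([False, False, True])
a_batch = np.array([1, 0, 1])
next_q = np.array([[0.2, 1.0],   # stand-in for target_network.predict(ns_batch)
                   [3.0, 2.0],
                   [5.0, 4.0]])
q_values = np.zeros((3, 2))      # stand-in for network.predict(s_batch)

# y = r + gamma * max_a' Q_target(s', a'), with the bootstrap term dropped on terminal steps
not_done = 1.0 - done_batch.astype(np.float64)
target_q = r_batch + gamma * np.amax(next_q, axis=1) * not_done

# Only the entry of the action actually taken is replaced by the target,
# so the MSE loss is zero for all other actions.
q_values[np.arange(len(a_batch)), a_batch] = target_q

print(target_q)
print(q_values)
```

The third transition is terminal, so its target is just the reward; for the other two, the discounted maximum next-state value is added.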



In [14]:
max_steps = env.chronics_handler.max_timestep()

# test after training
for i in range(5):
    obs, done, ep_reward = env.reset(), False, 0
    actions = []
    for t in notebook.tqdm(range(max_steps), desc=""):
        converted_obs = agent.convert_obs(obs)
        action = network.action_value(np.atleast_2d(converted_obs))
        actions.append(action)
        converted_act = agent.id_to_act(action)
        obs, reward, done, _ = env.step(converted_act)
        ep_reward += reward
        if done:
            break

0
0
0
0
0
