### 5. Smart Charging Using Reinforcement Learning:
**Original Exercise:** <br>
Consider an electric taxi driver who can charge her vehicle at home. To simplify the problem, we assume that the vehicle always arrives at home at 2 p.m. and leaves the garage at 4 p.m. each day. We want to design an intelligent charging system (an automated agent). Therefore, instead of a flat charging rate, the charging agent adjusts the charging power every 15 minutes, which is bounded between 0 kW and the highest rate (e.g., 22 kW). Also, the vehicle's battery has a capacity that cannot be exceeded. After leaving the garage, the taxi needs enough energy to complete its working day. The energy demand is a stochastic value following a normal distribution (you should choose the parameters, e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh) and must be generated exactly when the driver wants to leave. The agent’s goal is to avoid running out of energy (you should consider a very high penalty for running out of energy) and to minimize the recharging cost. The recharging cost follows an exponential function of the power (i.e., ![image.png](attachment:image.png)), where 𝛼𝑡 is the time coefficient and p is the charging rate.

The task is to create the environment (a very simple discrete event simulation) that receives the agent's decisions and returns the reward. In addition, you must define a Markov decision process, including states, actions, and reward function, and solve it using a reinforcement learning algorithm (e.g., deep q-network) to find optimal charging policies. To allow the use of discrete action methods, you can consider only limited charging options such as zero, low, medium, high.

In [1]:
### Imports
#Basic
import numpy as np
import random
import math
# Gym
from gym import Env
from gym.spaces import Box, Discrete
# Keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
# Keras RL
from rl.agents import DQNAgent
from rl.agents import SARSAAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

### Creating of Environment

In [25]:

class Environment(Env):
    def __init__(self, printB, randomInit):
        #Possible Actions for charging zero, low, medium to high
        self.action_space = Discrete(4)
        #Vehicle's battery: 69KWh; Timeframe 2p.m. to 4p.m.: 8x 15 minute intervals
        self.battery_limit = 69
        self.observation_space = np.array([Box(low=np.array([0]), high = np.array([self.battery_limit])), Box(low=np.array([0]), high = np.array([8]))])
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        self.time_delta =  15/60
        # Battery load at initialization
        self.battery = 25
        self.randomInit = randomInit
        if self.randomInit:
            self.battery = 25 + random.randint(-5,5)
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])
        self.printB = printB


    def step(self, action):
        # Setting loading interval +1/8 /--> +15/120 minutes
        self.time += 1
        self.state[1] = self.time

        # Seting new battery state
        #zero
        load = 0
        if action == 2:
            #low
            load += 7 * self.time_delta
            # loading until max capacity 
            if self.battery + load > self.battery_limit:
                diff = self.battery_limit - self.battery
                load = diff / self.time_delta
        if action == 3:
            #medium
            load += 14 * self.time_delta
            # loading until max capacity 
            if self.battery + load > self.battery_limit:
                diff = self.battery_limit - self.battery
                load = diff / self.time_delta
        if action == 4:
            #high
            load += 22 * self.time_delta
            # loading until max capacity 
            if self.battery + load > self.battery_limit:
                diff = self.battery_limit - self.battery
                load = diff / self.time_delta
        # load multiplied by 15 min to get the 
        self.battery += load
        self.state[0] = self.battery

        # Cost function
        reward = self.time * math.exp(load) * (-1)
        # Because e^0 = 1
        if action == 0:
            reward = 0

        if self.printB:
            print("time:" + str(self.time) + " |load:" + str(load) + " |reward:" + str(round(reward, 2)))

        #Checking if 2 Hours are done
        if self.time >= 8:
            #Demand is a random value following a normal distribution (e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh)
            kwh_needed = np.random.normal(loc=30, scale=5)
            if self.printB:
                print("|needed:" + str(round(kwh_needed, 2))+" |battery:" + str(round(self.battery, 2)))
            # The agent’s goal is to avoid running out of energy (with a very high penalty) 
            if kwh_needed > self.battery:
                #print("NO BATTERY")
                reward -= 10000
            done = True
        else:
            done = False

        info = {}

        # Returning the step information
        return self.state, reward, done, info
    
    def reset(self):
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        # Battery load at initialization
        self.battery = 25
        if self.randomInit:
            self.battery = 25 + random.randint(-5,5)
            if self.printB:
                print("Initialized with :" + str(self.battery) + " KWh")
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])
        return self.state

In [15]:
printB = False
randomInit = False
env = Environment(printB, randomInit)
printB = True
printEnv = Environment(printB, randomInit)


In [13]:
# Destilling important Information for our model
states = env.observation_space.shape
actions = env.action_space.n
print("Actions: " + str(actions) + " | States: " + str(states))

Actions: 4 | States: (2,)


In [14]:
# Defining our model
model = Sequential()    
model.add(Flatten(input_shape=(1,2)))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(actions, activation='linear'))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 2)                 0         
                                                                 
 dense (Dense)               (None, 8)                 24        
                                                                 
 dense_1 (Dense)             (None, 4)                 36        
                                                                 
 dense_2 (Dense)             (None, 4)                 20        
                                                                 
Total params: 80
Trainable params: 80
Non-trainable params: 0
_________________________________________________________________


In [16]:
# Defining and training of Deep Q-Network Agent
policy = BoltzmannQPolicy()
memory = SequentialMemory(limit=10000*2, window_length=1)
dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions, nb_steps_warmup=10000*1, target_model_update=1e-2)
dqn.compile(Adam(learning_rate=0.05), metrics=['mae'])
dqn.fit(env, nb_steps=10000*4, visualize=False, verbose=1)

Training for 40000 steps ...
Interval 1 (0 steps performed)
  127/10000 [..............................] - ETA: 7s - reward: -329.3679

  updates=self.state_updates,


1250 episodes - episode_reward: -3713.918 [-10192.902, -36.755]

Interval 2 (10000 steps performed)
1250 episodes - episode_reward: -2838.008 [-10827.886, 0.000] - loss: 1985692.603 - mae: 213.255 - mean_q: -121.788

Interval 3 (20000 steps performed)
1250 episodes - episode_reward: -1332.180 [-10695.424, 0.000] - loss: 1321667.500 - mae: 384.488 - mean_q: -367.431

Interval 4 (30000 steps performed)
done, took 219.675 seconds


<keras.callbacks.History at 0x2154243fa60>

In [18]:
# Testing the agent on the simple Environment
results = dqn.test(printEnv, nb_episodes=20, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 20 episodes ...
time:1 |load:1.75 |reward:-5.75
time:2 |load:1.75 |reward:-11.51
time:3 |load:1.75 |reward:-17.26
time:4 |load:1.75 |reward:-23.02
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:20.2 |battery:39.0
Episode 1: reward: -207.166, steps: 8
time:1 |load:1.75 |reward:-5.75
time:2 |load:1.75 |reward:-11.51
time:3 |load:1.75 |reward:-17.26
time:4 |load:1.75 |reward:-23.02
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:28.75 |battery:39.0
Episode 2: reward: -207.166, steps: 8
time:1 |load:1.75 |reward:-5.75
time:2 |load:1.75 |reward:-11.51
time:3 |load:1.75 |reward:-17.26
time:4 |load:1.75 |reward:-23.02
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:36.61 |battery:39.0
Episode 3: reward: -207.

## Advanced Environment
With random Battery initialization between 20 and 30 to have a better look at how the agent takes the current battery state into account when acting.

In [27]:
printB = False
randomInit = True
env2 = Environment(printB, randomInit)
printB = True
printEnv2 = Environment(printB, randomInit)

  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")


In [22]:
# Destilling important Information for our model
states = env2.observation_space.shape
actions = env2.action_space.n
model = Sequential()    
model.add(Flatten(input_shape=(1,2)))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(actions, activation='linear'))

# Defining and training of Deep Q-Network Agent
policy = BoltzmannQPolicy()
memory = SequentialMemory(limit=10000*2, window_length=1)
dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions, nb_steps_warmup=10000*1, target_model_update=1e-2)
dqn.compile(Adam(learning_rate=0.01), metrics=['mae'])
dqn.fit(env2, nb_steps=10000*10, visualize=False, verbose=1)

Training for 100000 steps ...
Interval 1 (0 steps performed)
1250 episodes - episode_reward: -8138.684 [-10263.581, 0.000]

Interval 2 (10000 steps performed)
1250 episodes - episode_reward: -1316.161 [-10875.511, -36.000] - loss: 983880.477 - mae: 1086.251 - mean_q: -1011.721

Interval 3 (20000 steps performed)
1250 episodes - episode_reward: -1331.858 [-10618.833, -20.509] - loss: 486143.375 - mae: 789.708 - mean_q: -775.298

Interval 4 (30000 steps performed)
1250 episodes - episode_reward: -1441.930 [-10704.179, -57.037] - loss: 348399.625 - mae: 497.468 - mean_q: -444.481

Interval 5 (40000 steps performed)
1250 episodes - episode_reward: -1818.577 [-10725.198, -39.773] - loss: 350977.625 - mae: 607.743 - mean_q: -566.582

Interval 6 (50000 steps performed)
1250 episodes - episode_reward: -1918.697 [-10738.952, -43.282] - loss: 369449.312 - mae: 742.853 - mean_q: -725.647

Interval 7 (60000 steps performed)
1250 episodes - episode_reward: -1789.513 [-10698.249, -33.264] - loss: 42

<keras.callbacks.History at 0x215466f89d0>

In [72]:
# Testing the agent on the simple Environment
results = dqn.test(printEnv2, nb_episodes=1, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 1 episodes ...
Initialized with :28 KWh
time:1 |load:3.5 |reward:-33.12
time:2 |load:3.5 |reward:-66.23
time:3 |load:3.5 |reward:-99.35
time:4 |load:3.5 |reward:-132.46
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:30.64 |battery:49.0
Episode 1: reward: -480.774, steps: 8
-480.77418916307215
