### 5. Smart Charging Using Reinforcement Learning:
**Original Exercise:** <br>
Consider an electric taxi driver who can charge her vehicle at home. To simplify the problem, we assume that the vehicle always arrives at home at 2 p.m. and leaves the garage at 4 p.m. each day. We want to design an intelligent charging system (an automated agent). Therefore, instead of a flat charging rate, the charging agent adjusts the charging power every 15 minutes, which is bounded between 0 kW and the highest rate (e.g., 22 kW). Also, the vehicle's battery has a capacity that cannot be exceeded. After leaving the garage, the taxi needs enough energy to complete its working day. The energy demand is a stochastic value following a normal distribution (you should choose the parameters, e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh) and must be generated exactly when the driver wants to leave. The agent’s goal is to avoid running out of energy (you should consider a very high penalty for running out of energy) and to minimize the recharging cost. The recharging cost follows an exponential function of the power (i.e., ![image.png](attachment:image.png)), where 𝛼𝑡 is the time coefficient and p is the charging rate.

The task is to create the environment (a very simple discrete event simulation) that receives the agent's decisions and returns the reward. In addition, you must define a Markov decision process, including states, actions, and reward function, and solve it using a reinforcement learning algorithm (e.g., deep q-network) to find optimal charging policies. To allow the use of discrete action methods, you can consider only limited charging options such as zero, low, medium, high.

In [14]:
### Imports
#Basic
import numpy as np
import random
import math
# Gym
from gym import Env
from gym.spaces import Box, Discrete
# Keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
# Keras RL
from rl.agents import DQNAgent
from rl.agents import SARSAAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

## Environment Class
The "randomInit" parameter gives us the option to initialize 2 different environments; One of them a little bit more advanced. 

In [40]:

class Environment(Env):
    def __init__(self, printB, randomInit):
        #Possible Actions for charging zero, low, medium to high
        self.action_space = Discrete(4)
        #Vehicle's battery: 69KWh; Timeframe 2p.m. to 4p.m.: 8x 15 minute intervals
        self.battery_limit = 69
        self.observation_space = np.array([Box(low=np.array([0]), high = np.array([self.battery_limit])), Box(low=np.array([0]), high = np.array([8]))])
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        self.time_delta =  15/60
        # Battery load at initialization
        self.battery = 25
        self.randomInit = randomInit
        if self.randomInit:
            self.battery += random.randint(-15,5)
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])
        self.printB = printB


    def step(self, action):
        # Setting loading interval +1/8 /--> +15/120 minutes
        self.time += 1
        self.state[1] = self.time

        # Seting new battery state
        #zero
        load = 0
        if action == 2:
            #low
            load += 7 * self.time_delta
            # loading until max capacity 
            if self.battery + load > self.battery_limit:
                diff = self.battery_limit - self.battery
                load = diff / self.time_delta
        if action == 3:
            #medium
            load += 14 * self.time_delta
            # loading until max capacity 
            if self.battery + load > self.battery_limit:
                diff = self.battery_limit - self.battery
                load = diff / self.time_delta
        if action == 4:
            #high
            load += 22 * self.time_delta
            # loading until max capacity 
            if self.battery + load > self.battery_limit:
                diff = self.battery_limit - self.battery
                load = diff / self.time_delta
        # load multiplied by 15 min to get the 
        self.battery += load
        self.state[0] = self.battery

        # Cost function
        reward = self.time * math.exp(load) * (-1)
        # Because e^0 = 1
        if action == 0:
            reward = 0

        if self.printB:
            print("time:" + str(self.time) + " |load:" + str(load) + " |reward:" + str(round(reward, 2)))

        #Checking if 2 Hours are done
        if self.time >= 8:
            #Demand is a random value following a normal distribution (e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh)
            kwh_needed = np.random.normal(loc=30, scale=5)
            if self.printB:
                print("|needed:" + str(round(kwh_needed, 2))+" |battery:" + str(round(self.battery, 2)))
            # The agent’s goal is to avoid running out of energy (with a very high penalty) 
            if kwh_needed > self.battery:
                #High Penalty
                reward -= 10000
            done = True
        else:
            done = False

        info = {}

        # Returning the step information
        return self.state, reward, done, info
    
    def reset(self):
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        # Battery load at initialization
        self.battery = 25
        if self.randomInit:
            self.battery +=random.randint(-15,5)
            if self.printB:
                print("Initialized with :" + str(self.battery) + " KWh")
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])
        return self.state

## First Environment
It has all the components that are required by the exercise.
As the initialization is not specified in the exercise we initialize the car battery with 25KWh.

In [16]:
printB = False
randomInit = False
env = Environment(printB, randomInit)
printB = True
printEnv = Environment(printB, randomInit)


In [17]:
# Destilling important Information for our model
states = env.observation_space.shape
actions = env.action_space.n
print("Actions: " + str(actions) + " | States: " + str(states))

Actions: 4 | States: (2,)


In [18]:
# Defining our model
model = Sequential()    
model.add(Flatten(input_shape=(1,2)))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(actions, activation='linear'))
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_2 (Flatten)         (None, 2)                 0         
                                                                 
 dense_7 (Dense)             (None, 8)                 24        
                                                                 
 dense_8 (Dense)             (None, 4)                 36        
                                                                 
 dense_9 (Dense)             (None, 4)                 20        
                                                                 
Total params: 80
Trainable params: 80
Non-trainable params: 0
_________________________________________________________________


In [19]:
# Defining and training of Deep Q-Network Agent
policy = BoltzmannQPolicy()
memory = SequentialMemory(limit=10000*2, window_length=1)
dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions, nb_steps_warmup=10000*1, target_model_update=1e-2)
dqn.compile(Adam(learning_rate=0.1), metrics=['mae'])
dqn.fit(env, nb_steps=10000*10, visualize=False, verbose=1)

Training for 100000 steps ...
Interval 1 (0 steps performed)
1250 episodes - episode_reward: -1191.791 [-1192.156, -1110.074]

Interval 2 (10000 steps performed)
1250 episodes - episode_reward: -1250.003 [-10403.385, 0.000] - loss: 298046.923 - mae: 143.749 - mean_q: -131.207

Interval 3 (20000 steps performed)
1250 episodes - episode_reward: -535.166 [-10207.166, -207.166] - loss: 384229.906 - mae: 361.060 - mean_q: -374.914

Interval 4 (30000 steps performed)
1250 episodes - episode_reward: -1018.669 [-10711.934, -13.000] - loss: 246899.781 - mae: 438.659 - mean_q: -479.478

Interval 5 (40000 steps performed)
1250 episodes - episode_reward: -2221.886 [-10749.216, -16.000] - loss: 580289.438 - mae: 542.407 - mean_q: -563.796

Interval 6 (50000 steps performed)
1250 episodes - episode_reward: -2241.106 [-10674.476, -13.000] - loss: 961144.812 - mae: 712.879 - mean_q: -782.422

Interval 7 (60000 steps performed)
1250 episodes - episode_reward: -2262.816 [-10729.952, -18.000] - loss: 973

<keras.callbacks.History at 0x26a09469960>

In [23]:
# Testing the agent on the simple Environment
results = dqn.test(printEnv, nb_episodes=2, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 2 episodes ...
time:1 |load:1.75 |reward:-5.75
time:2 |load:1.75 |reward:-11.51
time:3 |load:1.75 |reward:-17.26
time:4 |load:1.75 |reward:-23.02
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:32.23 |battery:39.0
Episode 1: reward: -207.166, steps: 8
time:1 |load:1.75 |reward:-5.75
time:2 |load:1.75 |reward:-11.51
time:3 |load:1.75 |reward:-17.26
time:4 |load:1.75 |reward:-23.02
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:27.73 |battery:39.0
Episode 2: reward: -207.166, steps: 8
-207.1656963362063


We can see the learned policy.

The agent always loads 14 KWh in the two hours. The resulting battery state thus is 39KWh.
He does it in the most efficient way, which is by loading with 7KW at all times.

As the initialized battery state is always the same it does the same for all test episodes.
Of course this result is dependent on the choosen very high penalty of -10000 if the car runs out of energy.

## Advanced Environment
With random battery initialization between 15 and 30 to have a better look at how the agent takes the battery state into account when acting.

In [69]:
printB = False
randomInit = True
env2 = Environment(printB, randomInit)
printB = True
printEnv2 = Environment(printB, randomInit)

In [70]:
# Destilling important Information for our model
states = env2.observation_space.shape
actions = env2.action_space.n
model2 = Sequential()    
model2.add(Flatten(input_shape=(1,2)))
#model2.add(Dense(64, activation='relu'))
model2.add(Dense(32, activation='relu'))
model2.add(Dense(16, activation='relu'))
model2.add(Dense(8, activation='relu'))
#model2.add(Dense(4, activation='relu'))
model2.add(Dense(actions, activation='linear'))

# Defining and training of Deep Q-Network Agent
policy = BoltzmannQPolicy()
memory = SequentialMemory(limit=10000*2, window_length=1)
dqn2 = DQNAgent(model=model2, memory=memory, policy=policy, nb_actions=actions, nb_steps_warmup=10000*1, target_model_update=1e-2)
dqn2.compile(Adam(learning_rate=0.005), metrics=['mae'])
dqn2.fit(env2, nb_steps=10000*5, visualize=False, verbose=1)

Training for 50000 steps ...
Interval 1 (0 steps performed)
1250 episodes - episode_reward: -7697.962 [-10517.732, -33.755]

Interval 2 (10000 steps performed)
1250 episodes - episode_reward: -5199.809 [-10967.348, 0.000] - loss: 1060532.947 - mae: 2848.122 - mean_q: -3171.436

Interval 3 (20000 steps performed)
1250 episodes - episode_reward: -3896.592 [-10860.659, -150.110] - loss: 627184.375 - mae: 2171.474 - mean_q: -2262.274

Interval 4 (30000 steps performed)
1250 episodes - episode_reward: -3745.487 [-10845.150, -113.759] - loss: 377627.250 - mae: 1738.747 - mean_q: -1797.645

Interval 5 (40000 steps performed)
done, took 452.884 seconds


<keras.callbacks.History at 0x26a2101e590>

Let's have a look at the policy this agent learned and compare it to the policy of the simpler environment.

We can see, that the agent reacts to the battery state. If the initial state is lower he loads more compared to a higher initial battery state.
He seems to try to get to a battery state of around 38 to 47 at the end of the two hours.

In [72]:
results = dqn2.test(printEnv2, nb_episodes=1, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 1 episodes ...
Initialized with :12 KWh
time:1 |load:3.5 |reward:-33.12
time:2 |load:3.5 |reward:-66.23
time:3 |load:3.5 |reward:-99.35
time:4 |load:3.5 |reward:-132.46
time:5 |load:3.5 |reward:-165.58
time:6 |load:3.5 |reward:-198.69
time:7 |load:3.5 |reward:-231.81
time:8 |load:3.5 |reward:-264.92
|needed:27.97 |battery:40.0
Episode 1: reward: -1192.156, steps: 8
-1192.1562705129234


Above is an example, where the initialized 12KWh battery state makes the agent load with 14kw at all 15 minute intervals.

In [80]:
results = dqn2.test(printEnv2, nb_episodes=1, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 1 episodes ...
Initialized with :30 KWh
time:1 |load:3.5 |reward:-33.12
time:2 |load:3.5 |reward:-66.23
time:3 |load:1.75 |reward:-17.26
time:4 |load:1.75 |reward:-23.02
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:24.37 |battery:47.5
Episode 1: reward: -289.248, steps: 8
-289.24824418426607


Above is an example, where a high initialized battery makes the agent not load as much as in the first example.

In [73]:
results = dqn2.test(printEnv2, nb_episodes=1, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 1 episodes ...
Initialized with :28 KWh
time:1 |load:3.5 |reward:-33.12
time:2 |load:3.5 |reward:-66.23
time:3 |load:3.5 |reward:-99.35
time:4 |load:1.75 |reward:-23.02
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:38.12 |battery:47.25
Episode 1: reward: -371.331, steps: 8
-371.3307920323258


And this example above is somewhere in between the examples before.

You can also have a look at 25 random exmaples following:

In [83]:
results = dqn2.test(printEnv2, nb_episodes=25, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 25 episodes ...
Initialized with :26 KWh
time:1 |load:3.5 |reward:-33.12
time:2 |load:3.5 |reward:-66.23
time:3 |load:3.5 |reward:-99.35
time:4 |load:3.5 |reward:-132.46
time:5 |load:1.75 |reward:-28.77
time:6 |load:1.75 |reward:-34.53
time:7 |load:1.75 |reward:-40.28
time:8 |load:1.75 |reward:-46.04
|needed:23.58 |battery:47.0
Episode 1: reward: -480.774, steps: 8
Initialized with :13 KWh
time:1 |load:3.5 |reward:-33.12
time:2 |load:3.5 |reward:-66.23
time:3 |load:3.5 |reward:-99.35
time:4 |load:3.5 |reward:-132.46
time:5 |load:3.5 |reward:-165.58
time:6 |load:3.5 |reward:-198.69
time:7 |load:3.5 |reward:-231.81
time:8 |load:3.5 |reward:-264.92
|needed:31.95 |battery:41.0
Episode 2: reward: -1192.156, steps: 8
Initialized with :13 KWh
time:1 |load:3.5 |reward:-33.12
time:2 |load:3.5 |reward:-66.23
time:3 |load:3.5 |reward:-99.35
time:4 |load:3.5 |reward:-132.46
time:5 |load:3.5 |reward:-165.58
time:6 |load:3.5 |reward:-198.69
time:7 |load:3.5 |reward:-231.81
time:8 |load:3