### 5. Smart Charging Using Reinforcement Learning:
**Original Exercise:** <br>
Consider an electric taxi driver who can charge her vehicle at home. To simplify the problem, we assume that the vehicle always arrives at home at 2 p.m. and leaves the garage at 4 p.m. each day. We want to design an intelligent charging system (an automated agent). Therefore, instead of a flat charging rate, the charging agent adjusts the charging power every 15 minutes, which is bounded between 0 kW and the highest rate (e.g., 22 kW). Also, the vehicle's battery has a capacity that cannot be exceeded. After leaving the garage, the taxi needs enough energy to complete its working day. The energy demand is a stochastic value following a normal distribution (you should choose the parameters, e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh) and must be generated exactly when the driver wants to leave. The agent’s goal is to avoid running out of energy (you should consider a very high penalty for running out of energy) and to minimize the recharging cost. The recharging cost follows an exponential function of the power (i.e., ![image.png](attachment:image.png)), where 𝛼𝑡 is the time coefficient and p is the charging rate.

The task is to create the environment (a very simple discrete event simulation) that receives the agent's decisions and returns the reward. In addition, you must define a Markov decision process, including states, actions, and reward function, and solve it using a reinforcement learning algorithm (e.g., deep q-network) to find optimal charging policies. To allow the use of discrete action methods, you can consider only limited charging options such as zero, low, medium, high.


**In Bullet Points:**
- Problem description:
    - An electric taxi driver can charge her vehicle at home between 2 p.m. and 4 p.m. each day
    - The charging agent adjusts the charging power every 15 minutes within a range of 0 kW to 22 kW
    - The vehicle's battery has a limited capacity that cannot be exceeded
    - The taxi needs enough energy to complete its working day; this energy demand is a random value following a normal distribution (e.g., 𝜇 = 30 kWh, 𝜎 = 5 kWh) and is generated when the driver leaves
    - The agent’s goal is to avoid running out of energy (with a very high penalty) and to minimize the recharging cost, which is an exponential function of the power (i.e., ![image.png](attachment:image.png)), where 𝛼𝑡 is the time coefficient and p is the charging rate
- Task description:
    - Create the environment that simulates the charging process and the energy demand, and returns the reward to the agent based on its actions
    - Define a Markov decision process, including states, actions, and reward function, that models the problem
    - Solve the Markov decision process using a reinforcement learning algorithm (e.g., a deep Q-network) to find optimal charging policies
    - To allow discrete-action methods, restrict the charging power to a few options such as zero, low, medium, high
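
The bullets above can be turned into a small MDP sketch. It assumes four power levels (0, 7, 14, 22 kW), a 69 kWh battery, and a conversion from power to energy per 15-minute step (kWh = kW × 0.25 h). The names `POWER_LEVELS_KW`, `transition`, and `is_terminal`, as well as the clipping at capacity, are illustrative assumptions and not part of the exercise text.

```python
from typing import Tuple

# Assumed discrete charging options in kW: zero, low, medium, high.
POWER_LEVELS_KW = (0.0, 7.0, 14.0, 22.0)
INTERVAL_HOURS = 0.25        # the agent decides every 15 minutes
N_INTERVALS = 8              # 2 p.m. to 4 p.m.
CAPACITY_KWH = 69.0          # assumed battery capacity

State = Tuple[float, int]    # (battery level in kWh, interval index 0..8)


def transition(state: State, action: int) -> State:
    """Apply one charging decision and return the next state."""
    battery, t = state
    energy = POWER_LEVELS_KW[action] * INTERVAL_HOURS  # kW * h = kWh
    battery = min(battery + energy, CAPACITY_KWH)      # never exceed capacity
    return battery, t + 1


def is_terminal(state: State) -> bool:
    """The episode ends once all eight intervals have passed."""
    return state[1] >= N_INTERVALS


state = (20.0, 0)
state = transition(state, 3)   # charge at 22 kW for 15 minutes
print(state)                   # (25.5, 1)
```

The environments in this notebook instead add the chosen value directly to the battery level in each step, so they treat the action as energy per interval rather than as power.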

In [1]:
### Imports
#Basic
import numpy as np
import random
import math
# Gym
from gym import Env
from gym.spaces import Box, Discrete
# Keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
# Keras RL
from rl.agents import DQNAgent
from rl.agents import SARSAAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

## Simple Environment for Testing

In [107]:
### Simple environment without randomness (fixed start level and fixed demand) for testing
class SimpleEnvironment(Env):
    def __init__(self):
        #Possible Actions for charging zero, low, medium to high
        self.action_space = Discrete(4)
        #Vehicle's battery: 69 kWh; Timeframe 2p.m. to 4p.m.: 8x 15 minute intervals
        # One Box over both state components: [battery level, time step]
        self.observation_space = Box(low=np.array([0, 0], dtype=np.float32), high=np.array([69, 8], dtype=np.float32), dtype=np.float32)
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        # 20 KWh loaded battery at initialization
        self.battery = 20
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])


    def step(self, action):
        # Advance the clock by one 15-minute interval (1/8 of the 2-hour window)
        self.time += 1
        self.state[1] = self.time

        # Setting new battery state (Discrete(4) yields actions 0, 1, 2, 3)
        #zero
        load = 0
        if action == 1:
            #low
            load += 7
        elif action == 2:
            #medium
            load += 14
        elif action == 3:
            #high
            load += 22
        self.battery += load
        self.state[0] = self.battery

        # Simple cost function
        reward = (1/self.time)*load*(-1)

        
        print("time:" + str(self.time) + " |load:" + str(load) + " |reward:" + str(reward))

        #Checking if 2 Hours are done
        if self.time >= 8:
            # Static Demand
            kwh_needed = 30
            print("|needed:" + str(round(kwh_needed, 2))+" |battery:" + str(round(self.battery, 2)))
            # The agent’s goal is to avoid running out of energy (with a very high penalty) 
            if kwh_needed > self.battery:
                reward -= 100
            done = True
        else:
            done = False

        info = {}

        # Returning the step information
        return self.state, reward, done, info
    
    def reset(self):
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        # 20 KWh loaded battery at initialization
        self.battery = 20
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])
        return self.state

In [108]:
simpleEnv = SimpleEnvironment()

In [109]:
# Some random examples
episodes = 2
for episode in range(1, episodes+1):
    print("__ Day " + str(episode) + " ___")
    state = simpleEnv.reset()
    done = False
    score = 0 
    while not done:
        action = simpleEnv.action_space.sample()
        n_state, reward, done, info = simpleEnv.step(action)
        score+=reward
    print('--> Score:{}'.format(round(score, 2)))

__ Day 1 ___
time:1 |load:7 |reward:-7.0
time:2 |load:0 |reward:-0.0
time:3 |load:0 |reward:-0.0
time:4 |load:7 |reward:-1.75
time:5 |load:0 |reward:-0.0
time:6 |load:0 |reward:-0.0
time:7 |load:0 |reward:-0.0
time:8 |load:0 |reward:-0.0
|needed:30 |battery:34
--> Score:-8.75
__ Day 2 ___
time:1 |load:0 |reward:-0.0
time:2 |load:0 |reward:-0.0
time:3 |load:7 |reward:-2.333333333333333
time:4 |load:0 |reward:-0.0
time:5 |load:0 |reward:-0.0
time:6 |load:14 |reward:-2.333333333333333
time:7 |load:14 |reward:-2.0
time:8 |load:0 |reward:-0.0
|needed:30 |battery:55
--> Score:-6.67
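
The two scores above can be recomputed by hand from the logged loads. The following sketch re-implements the simple cost function outside of Gym; the helper names `simple_reward` and `episode_score` are hypothetical, and the start level (20), demand (30), and penalty (100) are taken from the environment above.

```python
def simple_reward(t: int, load: float) -> float:
    """Per-step reward of the simple environment: -load / t for t = 1..8."""
    return -(1 / t) * load


def episode_score(loads, battery=20, demand=30, penalty=100):
    """Sum the step rewards and subtract the penalty if the battery falls short."""
    score = sum(simple_reward(t, load) for t, load in enumerate(loads, start=1))
    if demand > battery + sum(loads):
        score -= penalty
    return score


day1 = [7, 0, 0, 7, 0, 0, 0, 0]     # loads logged for Day 1
day2 = [0, 0, 7, 0, 0, 14, 14, 0]   # loads logged for Day 2
print(round(episode_score(day1), 2))  # -8.75
print(round(episode_score(day2), 2))  # -6.67
```

Because the reward is divided by the time step, the same load is cheaper late in the window, which is why late charging scores better here.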


In [110]:
# Reading the state shape and the number of actions from the environment
states = simpleEnv.observation_space.shape
actions = simpleEnv.action_space.n
print("Actions: " + str(actions) + " | States: " + str(states))

Actions: 4 | States: (2,)


In [111]:
# Defining our model
model = Sequential()    
model.add(Flatten(input_shape=(1,2)))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(actions, activation='linear'))
model.summary()

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_11 (Flatten)        (None, 2)                 0         
                                                                 
 dense_33 (Dense)            (None, 8)                 24        
                                                                 
 dense_34 (Dense)            (None, 4)                 36        
                                                                 
 dense_35 (Dense)            (None, 4)                 20        
                                                                 
Total params: 80
Trainable params: 80
Non-trainable params: 0
_________________________________________________________________
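
The parameter counts in the summary follow from the layer sizes: a dense layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` parameters (weights plus one bias per unit). A minimal check, with `dense_param_counts` as a hypothetical helper name:

```python
def dense_param_counts(layer_sizes):
    """Weights plus biases for each fully connected layer: (n_in + 1) * n_out."""
    return [(n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


sizes = [2, 8, 4, 4]   # flattened input, two hidden layers, one output per action
counts = dense_param_counts(sizes)
print(counts, sum(counts))  # [24, 36, 20] 80
```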


In [112]:
# Defining and training of Deep Q-Network Agent
policy = BoltzmannQPolicy()
memory = SequentialMemory(limit=100, window_length=1)
dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions, nb_steps_warmup=8*100, target_model_update=1e-2)
dqn.compile(Adam(learning_rate=0.01), metrics=['mae'])
dqn.fit(simpleEnv, nb_steps=30000, visualize=False, verbose=1)

Training for 30000 steps ...
Interval 1 (0 steps performed)
time:1 |load:0 |reward:-0.0
    1/10000 [..............................] - ETA: 3:09:48 - reward: 0.0000e+00
time:2 |load:0 |reward:-0.0
time:3 |load:7 |reward:-2.333333333333333
time:4 |load:7 |reward:-1.75
time:5 |load:7 |reward:-1.4000000000000001
time:6 |load:7 |reward:-1.1666666666666665
time:7 |load:0 |reward:-0.0
time:8 |load:7 |reward:-0.875
|needed:30 |battery:55
time:1 |load:7 |reward:-7.0
time:2 |load:7 |reward:-3.5
time:3 |load:7 |reward:-2.333333333333333
time:4 |load:7 |reward:-1.75
time:5 |load:7 |reward:-1.4000000000000001
time:6 |load:7 |reward:-1.1666666666666665
time:7 |load:7 |reward:-1.0
time:8 |load:7 |reward:-0.875
|needed:30 |battery:76
time:1 |load:7 |reward:-7.0
time:2 |load:7 |reward:-3.5
   18/10000 [..............................] - ETA: 29s - reward: -2.0583
time:3 |load:7 |reward:-2.333333333333333
time:4 |load:7 |reward:-1.75
time:5 |load:7 |reward:-1.4000000000000001
time:6 |load:7 |reward

<keras.callbacks.History at 0x288bd1d9fc0>
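
The `BoltzmannQPolicy` used for training selects actions at random, with probabilities that grow with the estimated Q-values (a softmax). The sketch below illustrates that idea in plain NumPy; it is not the keras-rl source, and the function names as well as the example Q-values are made up for illustration.

```python
import numpy as np


def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values; a larger tau gives more uniform exploration."""
    q = np.asarray(q_values, dtype=float) / tau
    q = q - q.max()            # shift for numerical stability, result unchanged
    exp_q = np.exp(q)
    return exp_q / exp_q.sum()


def sample_action(q_values, tau=1.0, rng=None):
    """Draw one action index according to the Boltzmann probabilities."""
    if rng is None:
        rng = np.random.default_rng()
    probs = boltzmann_probabilities(q_values, tau)
    return int(rng.choice(len(probs), p=probs))


q = [-1.75, -2.75, -7.0, -100.0]   # hypothetical Q-values for the four actions
print(boltzmann_probabilities(q).round(3))
print(sample_action(q, rng=np.random.default_rng(0)))
```

When Q-values differ by large amounts, the probabilities concentrate almost entirely on the best action, so exploration effectively stops.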

In [114]:
# Testing the agent on the simple Environment
results = dqn.test(simpleEnv, nb_episodes=2, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 2 episodes ...
time:1 |load:0 |reward:-0.0
time:2 |load:0 |reward:-0.0
time:3 |load:0 |reward:-0.0
time:4 |load:0 |reward:-0.0
time:5 |load:0 |reward:-0.0
time:6 |load:0 |reward:-0.0
time:7 |load:0 |reward:-0.0
time:8 |load:14 |reward:-1.75
|needed:30 |battery:34
Episode 1: reward: -1.750, steps: 8
time:1 |load:0 |reward:-0.0
time:2 |load:0 |reward:-0.0
time:3 |load:0 |reward:-0.0
time:4 |load:0 |reward:-0.0
time:5 |load:0 |reward:-0.0
time:6 |load:0 |reward:-0.0
time:7 |load:0 |reward:-0.0
time:8 |load:14 |reward:-1.75
|needed:30 |battery:34
Episode 2: reward: -1.750, steps: 8
-1.75
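
Because the simple environment is deterministic, the test result can be compared against an exhaustive search over all 4^8 charging schedules. The sketch assumes the loads 0, 7, 14, and 22 for the four actions and, like the environment above, does not enforce the battery capacity; `episode_score` is a hypothetical helper that mirrors the simple cost function.

```python
from itertools import product

LOADS = (0, 7, 14, 22)   # assumed loads for zero, low, medium, high


def episode_score(loads, battery=20, demand=30, penalty=100):
    """Total reward of one schedule in the simple environment."""
    score = sum(-(1 / t) * load for t, load in enumerate(loads, start=1))
    if demand > battery + sum(loads):
        score -= penalty
    return score


# Enumerate all 65,536 schedules and keep the best one.
best = max(product(LOADS, repeat=8), key=episode_score)
print(best, episode_score(best))  # (0, 0, 0, 0, 0, 0, 0, 14) -1.75
```

The best schedule charges 14 in the last interval only, which matches the behavior and the reward of -1.75 seen in the two test episodes.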


## Environment from the Exercise

In [28]:

class Environment(Env):
    def __init__(self):
        #Possible Actions for charging zero, low, medium to high
        self.action_space = Discrete(4)
        #Vehicle's battery: 69 kWh; Timeframe 2p.m. to 4p.m.: 8x 15 minute intervals
        # One Box over both state components: [battery level, time step]
        self.observation_space = Box(low=np.array([0, 0], dtype=np.float32), high=np.array([69, 8], dtype=np.float32), dtype=np.float32)
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        # 0-20 KWh loaded battery at initialization
        self.battery = 10 + random.randint(-10,10)
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])


    def step(self, action):
        # Advance the clock by one 15-minute interval (1/8 of the 2-hour window)
        self.time += 1
        self.state[1] = self.time

        # Setting new battery state (Discrete(4) yields actions 0, 1, 2, 3)
        #zero
        load = 0
        if action == 1:
            #low
            load += 7
        elif action == 2:
            #medium
            load += 14
        elif action == 3:
            #high
            load += 22
        self.battery += load
        self.state[0] = self.battery

        # Cost function: exponential in the load, weighted by the time step
        reward = self.time * math.exp(load) * (-1)
        # Because e^0 = 1, not charging would still cost self.time; make it free instead
        if load == 0:
            reward = 0

        
        print("time:" + str(self.time) + " |load:" + str(load) + " |reward:" + str(reward))

        #Checking if 2 Hours are done
        if self.time >= 8:
            #Demand is a random value following a normal distribution (e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh)
            kwh_needed = np.random.normal(loc=30, scale=5)
            print("|needed:" + str(round(kwh_needed, 2))+" |battery:" + str(round(self.battery, 2)))
            # The agent’s goal is to avoid running out of energy (with a very high penalty) 
            if kwh_needed > self.battery:
                print("NO BATTERY")
                reward -= 500000
            done = True
        else:
            done = False

        info = {}

        # Returning the step information
        return self.state, reward, done, info
    
    def reset(self):
        # Starting at 2p.m.: 0 (going to 3:45p.m.: 7)
        self.time = 0
        # 0-20 KWh loaded battery at initialization
        self.battery = 10 + random.randint(-10,10)
        # our state consisting of battery status and time
        self.state = np.array([self.battery, self.time])
        return self.state
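
Since the demand is drawn from a normal distribution when the driver leaves, every final battery level corresponds to a probability of running out of energy. The sketch below computes that probability from the normal CDF, assuming the parameters used above (𝜇 = 30 kWh, 𝜎 = 5 kWh); `shortfall_probability` is a hypothetical helper name.

```python
import math


def shortfall_probability(battery_kwh, mu=30.0, sigma=5.0):
    """P(demand > battery) for demand ~ Normal(mu, sigma)."""
    z = (battery_kwh - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - cdf


for level in (20, 30, 35, 40, 45):
    print(level, round(shortfall_probability(level), 4))
```

At 30 kWh the risk is 50 %, at 35 kWh about 16 %, and at 40 kWh about 2 %, so the agent has to weigh extra charging cost against the expected penalty.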

In [29]:
env = Environment()



In [30]:
# Reading the state shape and the number of actions from the environment
states = env.observation_space.shape
actions = env.action_space.n
print("Actions: " + str(actions) + " | States: " + str(states))

Actions: 4 | States: (2,)


In [31]:
# Defining our model
model = Sequential()    
model.add(Flatten(input_shape=(1,2)))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(actions, activation='linear'))
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_5 (Flatten)         (None, 2)                 0         
                                                                 
 dense_23 (Dense)            (None, 64)                192       
                                                                 
 dense_24 (Dense)            (None, 32)                2080      
                                                                 
 dense_25 (Dense)            (None, 16)                528       
                                                                 
 dense_26 (Dense)            (None, 8)                 136       
                                                                 
 dense_27 (Dense)            (None, 4)                 36        
                                                                 
 dense_28 (Dense)            (None, 4)                

In [33]:
# Defining and training of Deep Q-Network Agent
policy = BoltzmannQPolicy()
memory = SequentialMemory(limit=500, window_length=1)
dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions, nb_steps_warmup=8*1000, target_model_update=1e-2)
dqn.compile(Adam(learning_rate=0.01), metrics=['mae'])
dqn.fit(env, nb_steps=500000, visualize=False, verbose=1)

Training for 500000 steps ...
Interval 1 (0 steps performed)




time:1 |load:14 |reward:-1202604.2841647768
    1/10000 [..............................] - ETA: 36:56 - reward: -1202604.2842
time:2 |load:7 |reward:-2193.266316856917
time:3 |load:7 |reward:-3289.8994752853755
time:4 |load:14 |reward:-4810417.136659107
time:5 |load:7 |reward:-5483.165792142292
time:6 |load:7 |reward:-6579.798950570751
time:7 |load:14 |reward:-8418229.989153437
time:8 |load:14 |reward:-9620834.273318214
|needed:33.51 |battery:100
time:1 |load:7 |reward:-1096.6331584284585
time:2 |load:7 |reward:-2193.266316856917
time:3 |load:14 |reward:-3607812.85249433
time:4 |load:7 |reward:-4386.532633713834
time:5 |load:14 |reward:-6013021.420823884
time:6 |load:14 |reward:-7215625.70498866
time:7 |load:7 |reward:-7676.43210899921
time:8 |load:7 |reward:-8773.065267427668
|needed:35.77 |battery:95
time:1 |load:7 |reward:-1096.6331584284585
time:2 |load:7 |reward:-2193.266316856917
time:3 |load:0 |reward:0
time:4 |load:7 |reward:-4386.532633713834
time:5 |load:14 |reward:-6013021.42
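
The rewards in this log span many orders of magnitude. The sketch below tabulates the cost function `t * e^load` for the assumed loads 0, 7, 14, and 22 and compares it with the penalty of 500000. The constant and function names are hypothetical, and the schedule of six low-power steps is only one example of a plan that adds 42 kWh.

```python
import math

PENALTY = 500_000          # penalty for running out of energy, as in the environment
LOADS = (0, 7, 14, 22)     # assumed loads for zero, low, medium, high


def charging_cost(t, load):
    """Cost used in the environment above: t * e^load, with zero load free."""
    return 0.0 if load == 0 else t * math.exp(load)


# Cost of each load in the first and in the last interval.
for load in LOADS:
    costs = [charging_cost(t, load) for t in (1, 8)]
    print(load, [f"{c:.3g}" for c in costs])

# Example plan: six low-power steps at t = 1..6 add 42 kWh in total.
cheap = sum(charging_cost(t, 7) for t in range(1, 7))
print(round(cheap), cheap < PENALTY)  # 23029 True
```

A single medium step already costs more than the penalty, while a sequence of early low-power steps costs far less. Under this reward, never charging is therefore not the cheapest strategy, and the very large reward range may be one reason why training is difficult.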

In [None]:
# Testing the agent on the exercise environment
results = dqn.test(env, nb_episodes=100, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 100 episodes ...
time:1 |load:0 |reward:0
time:2 |load:0 |reward:0
time:3 |load:0 |reward:0
time:4 |load:0 |reward:0
time:5 |load:0 |reward:0
time:6 |load:0 |reward:0
time:7 |load:0 |reward:0
time:8 |load:0 |reward:0
|needed:27.1 |battery:4
NO BATTERY
Episode 1: reward: -500000.000, steps: 8
time:1 |load:0 |reward:0
time:2 |load:0 |reward:0
time:3 |load:0 |reward:0
time:4 |load:0 |reward:0
time:5 |load:0 |reward:0
time:6 |load:0 |reward:0
time:7 |load:0 |reward:0
time:8 |load:0 |reward:0
|needed:32.29 |battery:4
NO BATTERY
Episode 2: reward: -500000.000, steps: 8
time:1 |load:0 |reward:0
time:2 |load:0 |reward:0
time:3 |load:0 |reward:0
time:4 |load:0 |reward:0
time:5 |load:0 |reward:0
time:6 |load:0 |reward:0
time:7 |load:0 |reward:0
time:8 |load:0 |reward:0
|needed:34.44 |battery:8
NO BATTERY
Episode 3: reward: -500000.000, steps: 8
time:1 |load:0 |reward:0
time:2 |load:0 |reward:0
time:3 |load:0 |reward:0
time:4 |load:0 |reward:0
time:5 |load:0 |reward:0
time:6 |load:0 |