In [None]:
# Lets try to do some Reinforcement Learning -.-

### 5. Smart Charging Using Reinforcement Learning:
**Original Exercise:** <br>
Consider an electric taxi driver who can charge her vehicle at home. To simplify the problem, we assume that the vehicle always arrives at home at 2 p.m. and leaves the garage at 4 p.m. each day. We want to design an intelligent charging system (an automated agent). Therefore, instead of a flat charging rate, the charging agent adjusts the charging power every 15 minutes, which is bounded between 0 kW and the highest rate (e.g., 22 kW). Also, the vehicle's battery has a capacity that cannot be exceeded. After leaving the garage, the taxi needs enough energy to complete its working day. The energy demand is a stochastic value following a normal distribution (you should choose the parameters, e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh) and must be generated exactly when the driver wants to leave. The agent’s goal is to avoid running out of energy (you should consider a very high penalty for running out of energy) and to minimize the recharging cost. The recharging cost follows an exponential function of the power (i.e., ![image.png](attachment:image.png)), where 𝛼𝑡 is the time coefficient and p is the charging rate.

The task is to create the environment (a very simple discrete event simulation) that receives the agent's decisions and returns the reward. In addition, you must define a Markov decision process, including states, actions, and reward function, and solve it using a reinforcement learning algorithm (e.g., deep q-network) to find optimal charging policies. To allow the use of discrete action methods, you can consider only limited charging options such as zero, low, medium, high.


**In Bulletpoints:**
- Problem description:
    - An electric taxi driver can charge her vehicle at home between 2 p.m. and 4 p.m. each day
    - The charging agent adjusts the charging power every 15 minutes within a range of 0 kW to 22 kW
    - The vehicle's battery has a limited capacity that cannot be exceeded
    - The taxi needs enough energy to complete its working day, which is a random value following a normal distribution (e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh)
    - The agent’s goal is to avoid running out of energy (with a very high penalty) and to minimize the recharging cost, which is an exponential function of the power (i.e., ![image.png](attachment:image.png)), where 𝛼𝑡 is the time coefficient and p is the charging rate
- Task description:
    - Create the environment that simulates the charging process and the energy demand, and returns the reward to the agent based on its actions
    - Define a Markov decision process, including states, actions, and reward function, that models the problem
    - Solve the Markov decision process using a reinforcement learning algorithm (e.g., deep q-network) to find optimal charging policies
    - Consider only discrete action methods, such as zero, low, medium, high, for the charging power

In [None]:
# First try mit Hilfe von diesem Tutorial:
# https://www.section.io/engineering-education/building-a-reinforcement-learning-environment-using-openai-gym/

In [1]:
import numpy as np
from gym import Env
from gym.spaces import Box, Discrete
import random
import math

In [None]:
class CustomEnv(Env):
    def __init__(self):
        
        # a range of 0 kW to 22 kW
        #self.action_space = Box(low=0, high=22)
        #a range from zero, low, medium to high
        self.action_space = Discrete(4)

        # The vehicle's battery has a limited capacity that cannot be exceeded (69KWh)
        #self.observation_space = Box(low=0, high=69)
        self.observation_space = Box(low=np.array([0]), high = np.array([69]))

        # [20,40] KWh loaded battery at initialization
        #self.state = 20 + random.randint(-10,10)
        self.state = 20

        # The charging agent adjusts the charging power every 15 minutes --> time is in [0,7] in 2 Hours
        self.time = 0


    def step(self, action):
        # Setting loading interval +1 /--> +15 minutes
        self.time += 1

        # Seting new battery state
        #zero
        load = 0
        if action == 2:
            #low
            load += 7
        if action == 3:
            #medium
            load += 14
        if action == 4:
            #high
            load += 22
        self.state += load

        print("time: " + str(self.time) + "; load:" + str(load))

        # Calculating Negative Reward from Energy Costs
        #reward = pow(self.time, 2) * math.exp(load) * (-1)
        reward = (1/self.time)*8*load*(-1)
        print("reward: " + str(reward))

        #Checking if 2 Hours are done
        #Giving panalty if car ran out of battery
        if self.time >= 8:
            #The taxi needs enough energy to complete its working day, 
            # which is a random value following a normal distribution (e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh)
            #kwh_needed = np.random.normal(loc=30, scale=5)
            kwh_needed = 30
            print("|needed: " + str(round(kwh_needed, 2))+"|")
            print("|state: " + str(round(self.state, 2))+"|")
            # The agent’s goal is to avoid running out of energy (with a very high penalty) 
            if kwh_needed > self.state:
                #reward -= 100000000
                #reward -= 1000000000
                reward -= 200
            done = True
        else:
            done = False

        info = {}

        #print("Battery State: " + str(self.state))
        #print("Reward: " + str(reward))
        # Returning the step information
        return self.state, reward, done, info
    
    def reset(self):
        # [20,40] KWh loaded battery at initialization
        #self.state = 20 + random.randint(-10,10)
        self.state = 20
        # The charging agent adjusts the charging power every 15 minutes --> time is in [0,7] in 2 Hours
        self.time = 0
        return self.state

In [68]:
class CustomEnv2(Env):
    def __init__(self):
        
        # a range of 0 kW to 22 kW
        #self.action_space = Box(low=0, high=22)
        #a range from zero, low, medium to high
        self.action_space = Discrete(4)

        # The vehicle's battery has a limited capacity that cannot be exceeded (69KWh)
        #self.observation_space = Box(low=0, high=69)
        self.observation_space = np.array([Box(low=np.array([0]), high = np.array([69])), Box(low=np.array([0]), high = np.array([8]))])

        # The charging agent adjusts the charging power every 15 minutes --> time is in [0,7] in 2 Hours
        self.time = 0

        # [20,40] KWh loaded battery at initialization
        #self.state = 20 + random.randint(-10,10)
        self.battery = 20

        self.state = np.array([self.battery, self.time])


    def step(self, action):
        # Setting loading interval +1 /--> +15 minutes
        self.time += 1
        self.state[1] = self.time

        # Seting new battery state
        #zero
        load = 0
        if action == 2:
            #low
            load += 7
        if action == 3:
            #medium
            load += 14
        if action == 4:
            #high
            load += 22
        self.battery += load
        self.state[0] = self.battery

        print("time: " + str(self.time) + "; load:" + str(load))

        # Calculating Negative Reward from Energy Costs
        #reward = pow(self.time, 2) * math.exp(load) * (-1)
        reward = (1/self.time)*8*load*(-1)
        print("reward: " + str(reward))

        #Checking if 2 Hours are done
        #Giving panalty if car ran out of battery
        if self.time >= 8:
            #The taxi needs enough energy to complete its working day, 
            # which is a random value following a normal distribution (e.g., 𝜇= 30 kWh, 𝜎 = 5 kWh)
            #kwh_needed = np.random.normal(loc=30, scale=5)
            kwh_needed = 30
            print("|needed: " + str(round(kwh_needed, 2))+"|")
            print("|state: " + str(round(self.battery, 2))+"|")
            # The agent’s goal is to avoid running out of energy (with a very high penalty) 
            if kwh_needed > self.battery:
                #reward -= 100000000
                #reward -= 1000000000
                reward -= 200
            done = True
        else:
            done = False

        info = {}

        #print("Battery State: " + str(self.state))
        #print("Reward: " + str(reward))
        # Returning the step information
        return self.state, reward, done, info
    
    def reset(self):
        # The charging agent adjusts the charging power every 15 minutes --> time is in [0,7] in 2 Hours
        self.time = 0

        # [20,40] KWh loaded battery at initialization
        #self.state = 20 + random.randint(-10,10)
        self.battery = 20

        self.state = np.array([self.battery, self.time])
        return self.state

In [69]:
env = CustomEnv2()

In [70]:
episodes = 7 #7 days
for episode in range(1, episodes+1):
    print("__ Day " + str(episode) + " ___")
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, round(score, 2)))

__ Day 1 ___
time: 1; load:0
reward: -0.0
time: 2; load:0
reward: -0.0
time: 3; load:14
reward: -37.33333333333333
time: 4; load:0
reward: -0.0
time: 5; load:14
reward: -22.400000000000002
time: 6; load:7
reward: -9.333333333333332
time: 7; load:7
reward: -8.0
time: 8; load:0
reward: -0.0
|needed: 30|
|state: 62|
Episode:1 Score:-77.07
__ Day 2 ___
time: 1; load:0
reward: -0.0
time: 2; load:7
reward: -28.0
time: 3; load:7
reward: -18.666666666666664
time: 4; load:0
reward: -0.0
time: 5; load:14
reward: -22.400000000000002
time: 6; load:0
reward: -0.0
time: 7; load:0
reward: -0.0
time: 8; load:0
reward: -0.0
|needed: 30|
|state: 48|
Episode:2 Score:-69.07
__ Day 3 ___
time: 1; load:7
reward: -56.0
time: 2; load:7
reward: -28.0
time: 3; load:14
reward: -37.33333333333333
time: 4; load:7
reward: -14.0
time: 5; load:14
reward: -22.400000000000002
time: 6; load:0
reward: -0.0
time: 7; load:0
reward: -0.0
time: 8; load:0
reward: -0.0
|needed: 30|
|state: 69|
Episode:3 Score:-157.73
__ Day 4 

In [71]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam

In [72]:
states = env.observation_space.shape
actions = env.action_space.n

In [73]:
actions

4

In [74]:
states

(2,)

In [75]:
def build_model(states, actions):
    model = Sequential()    
    #model.add(Dense(69, activation='relu', input_shape=(2,)))
    model.add(Flatten(input_shape=(1,2)))
    #model.add(Dense(32, activation='relu', input_shape=states))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

In [76]:
del model

In [77]:
model = build_model(states, actions)

In [78]:
model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_6 (Flatten)         (None, 2)                 0         
                                                                 
 dense_18 (Dense)            (None, 32)                96        
                                                                 
 dense_19 (Dense)            (None, 16)                528       
                                                                 
 dense_20 (Dense)            (None, 4)                 68        
                                                                 
Total params: 692
Trainable params: 692
Non-trainable params: 0
_________________________________________________________________


In [79]:
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

In [80]:
def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, 
                  nb_actions=actions, nb_steps_warmup=1000, target_model_update=1e-2)
    return dqn

In [81]:
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=0.01), metrics=['mae'])
dqn.fit(env, nb_steps=30000, visualize=False, verbose=1)

Training for 30000 steps ...
Interval 1 (0 steps performed)
time: 1; load:0
reward: -0.0
    1/10000 [..............................] - ETA: 1:35:14 - reward: 0.0000e+00time: 2; load:0
reward: -0.0
time: 3; load:7
reward: -18.666666666666664
time: 4; load:7
reward: -14.0
time: 5; load:0
reward: -0.0
time: 6; load:7
reward: -9.333333333333332
time: 7; load:7
reward: -8.0
time: 8; load:7
reward: -7.0
|needed: 30|
|state: 55|
time: 1; load:7
reward: -56.0
time: 2; load:0
reward: -0.0
time: 3; load:0
reward: -0.0
time: 4; load:7
reward: -14.0
time: 5; load:7
reward: -11.200000000000001
time: 6; load:7
reward: -9.333333333333332
time: 7; load:7
reward: -8.0
time: 8; load:7
reward: -7.0
|needed: 30|
|state: 62|
time: 1; load:7
reward: -56.0
time: 2; load:7
reward: -28.0
time: 3; load:0
reward: -0.0
time: 4; load:7
reward: -14.0
   20/10000 [..............................] - ETA: 26s - reward: -13.0267      time: 5; load:0
reward: -0.0
time: 6; load:0
reward: -0.0
time: 7; load:7
reward: -8.0

<keras.callbacks.History at 0x1679fcf5120>

In [82]:
results = dqn.test(env, nb_episodes=150, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 150 episodes ...
time: 1; load:0
reward: -0.0
time: 2; load:0
reward: -0.0
time: 3; load:0
reward: -0.0
time: 4; load:0
reward: -0.0
time: 5; load:0
reward: -0.0
time: 6; load:0
reward: -0.0
time: 7; load:0
reward: -0.0
time: 8; load:14
reward: -14.0
|needed: 30|
|state: 34|
Episode 1: reward: -14.000, steps: 8
time: 1; load:0
reward: -0.0
time: 2; load:0
reward: -0.0
time: 3; load:0
reward: -0.0
time: 4; load:0
reward: -0.0
time: 5; load:0
reward: -0.0
time: 6; load:0
reward: -0.0
time: 7; load:0
reward: -0.0
time: 8; load:14
reward: -14.0
|needed: 30|
|state: 34|
Episode 2: reward: -14.000, steps: 8
time: 1; load:0
reward: -0.0
time: 2; load:0
reward: -0.0
time: 3; load:0
reward: -0.0
time: 4; load:0
reward: -0.0
time: 5; load:0
reward: -0.0
time: 6; load:0
reward: -0.0
time: 7; load:0
reward: -0.0
time: 8; load:14
reward: -14.0
|needed: 30|
|state: 34|
Episode 3: reward: -14.000, steps: 8
time: 1; load:0
reward: -0.0
time: 2; load:0
reward: -0.0
time: 3; load:0
reward: -