# Reinforcement Learning to Regulate Shower Temperature
One of the problems in my current apartment is that the shower temperature fluctuates frequently, especially when someone flushes the toilet. In this project, I wanted to simulate the fluctuating shower environment and see whether I could build a reinforcement learning model with TensorFlow to keep the temperature within the desired range.

Both supervised and reinforcement learning learn a mapping from inputs to outputs, but reinforcement learning has no labeled answers; instead, it uses rewards and punishments as signals for good and bad behavior. The goal in reinforcement learning is to find a policy that maximizes the agent's total cumulative reward.
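To make "total cumulative reward" concrete, here is a minimal sketch of how a return can be computed from a list of per-step rewards. The function name `discounted_return` and the discount factor `gamma` are my own illustrative choices, not part of any library; with `gamma=1.0` it reduces to the plain sum of rewards, which is how the episode scores later in this notebook are computed.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, where a reward k steps in the future is weighted by gamma**k.

    `gamma` is a hypothetical discount factor chosen for illustration.
    """
    total = 0.0
    # Work backwards so each step adds its reward plus the discounted future return
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# With gamma=1.0 the return is just the plain sum of the rewards
print(discounted_return([1, 1, -1, 1], gamma=1.0))
```

A discount factor below 1 makes the agent care more about near-term rewards than distant ones.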

In [1]:
from gym import Env # allows me to build a custom environment
from gym.spaces import Discrete, Box # allows me to define the actions to take in the environment and current state
import numpy as np
import random

The 5 important pieces needed in every reinforcement learning model are: environment, state, reward, policy, and value. The environment is the world within which the agent operates. The state is the current situation of the agent. The reward is feedback from the environment. The policy is a method to map the agent's state to actions. Value is the future reward that an agent expects to receive by taking an action in a particular state.

In this project, the initialization consists of setting up an action space (which defines the 5 actions that the RL model can take), an observation space (which defines the bounds of the shower temperature), the current state (the current temperature of the shower), and shower_length (how long each shower lasts, in seconds).

The RL model receives a positive reward (+1) when it keeps the shower temperature within the specified range (between 87 and 93 degrees Fahrenheit) and a negative reward (-1) when it does not. The environment's transition rule is that the shower's temperature changes by the chosen action minus 2. So if the model chooses action 4, the temperature increases by 4 - 2 = 2. If the model chooses action 2, the temperature stays the same (2 - 2 = 0).
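As a quick illustration of these two rules on their own, here is a small sketch. The helper names `temperature_change` and `reward_for` are hypothetical and only used here; the environment class in this notebook applies the same arithmetic inline.

```python
def temperature_change(action):
    """Map an action in {0, 1, 2, 3, 4} to a temperature change in {-2, -1, 0, +1, +2}."""
    return action - 2


def reward_for(temperature, low=87, high=93):
    """+1 if the temperature is inside the comfortable range (inclusive), otherwise -1."""
    return 1 if low <= temperature <= high else -1


# Show the temperature change for every available action
for action in range(5):
    print(action, temperature_change(action))
```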

In [3]:
class ShowerEnv(Env):
    def __init__(self):
        # constructor: runs automatically when I create a new instance of the class
        # initializing actions I can take, the observation space (temperature/length)
        # actions I can take: down 2, down 1, stay, up 1, up 2
        # discrete comes from gym.spaces. This allows me to have 5 values 
        #(0 - go down two, 1 - go down one, 2 - stay same, 3 - go up one, 4 - go up two)
        self.action_space = Discrete(5)
        
        # Temperature array
        # defines where the shower is currently at, which can be used to tweak/produce the reward
        # the Box space can also hold n dimensional tensors, dataframes, images, and audio
        self.observation_space = Box(low=0.0, high=150.0, shape=(1,), dtype=np.float32)
        
        # setting the start temperature (in fahrenheit)
        # I like my showers pretty warm, so it will start within 3 degrees of 90
        self.state = 90 + random.randint(-3, 3)
        
        # shower length in seconds
        self.shower_length = 60
    
    def step(self, action):
        # step function runs whenever I take a step in the environment
        # Apply action
        # action is going to be 0 through 4, as defined in the action space. Here I am applying my action to the state
        # 0 - 2 = -2 temperature
        # 1 - 2 = -1 temperature
        # 2 - 2 = 0 temperature
        # 3 - 2 = +1 temperature
        # 4 - 2 = +2 temperature
        self.state += action - 2
        
        # Reduce shower length by 1 second
        self.shower_length -= 1
        
        # Calculate reward
        # if the shower temperature is within the optimal range, then the reward is 1, otherwise it is -1
        # the model is going to try to converge so that the temperature is always within this range
        if 87 <= self.state <= 93:
            reward = 1
        else:
            reward = -1
            
        # check if shower is done
        done = self.shower_length <= 0
            
        # apply temperature noise
        # this will serve to fluctuate the temperature up and down, which is also what my real shower does
        self.state += random.randint(-1, 1)
        
        # set placeholder for info, required by the Gym step API
        info = {}
        
        # return step information
        return self.state, reward, done, info
    
    def render(self):
        # could be used for visualizations, not using here since there are no visualizations
        pass
    
    def reset(self):
        # where I can reset my environment
        # resetting the temperature
        self.state = 90 + random.randint(-3, 3)
        # resetting the shower length
        self.shower_length = 60
        return self.state
        

# Creating a New Instance of the Shower Environment

In [23]:
env = ShowerEnv()



In [5]:
# example of the results within the action space
env.action_space.sample()

4

In [6]:
# example of the observation space
env.observation_space.sample()

array([55.723892], dtype=float32)

# 10 Sample Episodes Where the Actions are Randomly Sampled
If the action_space is sampled randomly over and over, what do the scores look like over the course of 10 episodes? Here, I can see that sometimes the random agent does alright and gets a positive score. Other times, it gets the lowest possible score of -60.

In [7]:
episodes = 10
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
    
    while not done:
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+= reward
    print('Episode:{} Score:{}'.format(episode, score))

Episode:1 Score:-30
Episode:2 Score:-42
Episode:3 Score:-60
Episode:4 Score:-8
Episode:5 Score:2
Episode:6 Score:-44
Episode:7 Score:2
Episode:8 Score:-42
Episode:9 Score:-18
Episode:10 Score:14
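For comparison with the random scores above, here is a sketch of a hand-written rule that always nudges the temperature back toward 90. It is not part of the RL pipeline; it only shows what a good score looks like in this environment. To keep it self-contained, `run_rule_based_episode` (a hypothetical helper) re-implements the same dynamics as ShowerEnv in plain Python instead of using the Gym class.

```python
import random


def run_rule_based_episode(seed=None, steps=60):
    """Play one episode with a simple 'move toward 90' rule and return the score.

    The dynamics mirror ShowerEnv: apply the action, compute the reward,
    then add random noise of -1, 0, or +1 degrees.
    """
    rng = random.Random(seed)
    temperature = 90 + rng.randint(-3, 3)
    score = 0
    for _ in range(steps):
        # Desired change toward 90, limited to the +/-2 degrees the actions allow
        delta = max(-2, min(2, 90 - temperature))
        action = delta + 2  # convert back to an action index in 0..4
        temperature += action - 2
        score += 1 if 87 <= temperature <= 93 else -1
        temperature += rng.randint(-1, 1)
    return score


for episode in range(1, 4):
    print('Episode:{} Score:{}'.format(episode, run_rule_based_episode(seed=episode)))
```

Because the start temperature is within 3 degrees of 90 and the noise is at most 1 degree per step, this rule never leaves the 87 to 93 range, so every episode scores the maximum of 60. That is the ceiling a trained agent can hope to match.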


# Create a Deep Learning Model with Keras
I'm going to write a function to build a Sequential DNN model. The input_shape of the first layer will be a single number which represents the observation space. The output of the model will contain 5 neurons since there are 5 actions that the RL model can take.

In [8]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

In [9]:
# the shape of the states (1 value)
states = env.observation_space.shape
# number of actions that I have 
actions = env.action_space.n

In [10]:
def build_model(states, actions):
    model = Sequential()
    model.add(Dense(24, activation='relu', input_shape=states))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

In [17]:
model = build_model(states, actions)

In [18]:
# the model takes in the temperature as input and outputs one value for each of the 5 actions (0 through 4)
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 24)                48        
_________________________________________________________________
dense_4 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_5 (Dense)              (None, 5)                 125       
Total params: 773
Trainable params: 773
Non-trainable params: 0
_________________________________________________________________
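The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. Here is a short sketch of that arithmetic (`dense_params` is a helper name I made up for this check, not a Keras function).

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer: weights + biases."""
    return n_inputs * n_units + n_units


# (inputs, units) for the three Dense layers: 1 -> 24 -> 24 -> 5
layer_sizes = [(1, 24), (24, 24), (24, 5)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
print(counts, sum(counts))  # [48, 600, 125] 773
```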


# Build Agent with Keras-RL
There are many different agents within the Keras-RL library, such as DQNAgent, NAFAgent, DDPGAgent, SARSAAgent, and CEMAgent. There are also different styles of RL models, such as policy-based RL and value-based RL.

For this project, I'm going to use the DQNAgent with a BoltzmannQPolicy. A DQN agent is a value-based reinforcement learning agent: it trains a neural network to estimate the Q-value (the expected future reward) of each action in a given state. DQN is a variant of Q-learning, and it operates only within discrete action spaces. The BoltzmannQPolicy is the exploration strategy: it picks actions at random, with higher-valued actions being more likely. Finally, the DQNAgent is going to need some memory to store past experience, which will be accomplished with SequentialMemory.
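To give a feel for what a Boltzmann policy does with the network's outputs, here is a minimal standard-library sketch. It is my own illustration of the general idea (a softmax over Q-values), not the keras-rl source; the function names and the temperature parameter `tau` are assumptions made for this example.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Turn Q-values into action probabilities with a softmax.

    `tau` is an assumed temperature parameter: larger values flatten the
    distribution (more exploration), smaller values favor the best action.
    """
    max_q = max(q_values)
    # Subtracting the max keeps exp() numerically stable without changing the result
    weights = [math.exp((q - max_q) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, tau=1.0, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up Q-values for the 5 shower actions; action 2 has the highest value
q_values = [-1.0, 0.5, 2.0, 0.5, -1.0]
print(boltzmann_probabilities(q_values))
print(sample_action(q_values))
```

Unlike always picking the highest Q-value, this keeps trying the other actions occasionally, which is what lets the agent explore during training.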

Here is the link to learn more about keras-rl: https://keras-rl.readthedocs.io/en/latest/

In [26]:
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

In [20]:
# pass in the model and the actions I can take in the environment
def build_agent(model, actions):
    # set up the policy
    policy = BoltzmannQPolicy()
    # set up the memory
    memory = SequentialMemory(limit=50000, window_length=1)
    # set up DQNAgent and pass in model, memory, policy 
    dqn = DQNAgent(model=model, memory=memory, policy=policy,
                  nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn
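A note on target_model_update=1e-2: as I understand it, a value below 1 makes the agent blend the target network toward the trained network by a small fraction after each update, instead of copying the weights all at once. Here is a rough sketch of that blending on plain lists of numbers; `soft_update` is a hypothetical helper for illustration, not a keras-rl function.

```python
def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction `tau` of the way toward the online weight."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]


# Toy example: the target weights start at 0 and the online weights are 1
target = [0.0, 0.0]
online = [1.0, 1.0]
for _ in range(3):
    target = soft_update(target, online)
print(target)
```

With a small tau the target network changes slowly, which tends to keep the Q-value targets stable during training.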

In [21]:
# instantiating
dqn = build_agent(model, actions)
# compiling
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
# fitting
dqn.fit(env, nb_steps=50000, visualize=False, verbose = 1)

Training for 50000 steps ...
Interval 1 (0 steps performed)
166 episodes - episode_reward: -19.602 [-60.000, 44.000] - loss: 0.628 - mae: 4.243 - mean_q: -3.558

Interval 2 (10000 steps performed)
167 episodes - episode_reward: -17.246 [-60.000, 54.000] - loss: 0.765 - mae: 5.434 - mean_q: -6.311

Interval 3 (20000 steps performed)
167 episodes - episode_reward: -14.623 [-60.000, 48.000] - loss: 0.784 - mae: 5.666 - mean_q: -6.661

Interval 4 (30000 steps performed)
166 episodes - episode_reward: -9.687 [-58.000, 56.000] - loss: 0.691 - mae: 4.593 - mean_q: -5.294

Interval 5 (40000 steps performed)
done, took 451.955 seconds


<tensorflow.python.keras.callbacks.History at 0x278384d8760>

In [73]:
# testing the dqn on the custom environment
scores = dqn.test(env, nb_episodes=100, visualize=False)
print(np.mean(scores.history['episode_reward']))

Testing for 100 episodes ...
Episode 1: reward: -50.000, steps: 60
Episode 2: reward: -58.000, steps: 60
Episode 3: reward: -56.000, steps: 60
Episode 4: reward: -54.000, steps: 60
Episode 5: reward: -60.000, steps: 60
Episode 6: reward: -54.000, steps: 60
Episode 7: reward: -58.000, steps: 60
Episode 8: reward: -56.000, steps: 60
Episode 9: reward: -58.000, steps: 60
Episode 10: reward: -56.000, steps: 60
Episode 11: reward: -60.000, steps: 60
Episode 12: reward: -60.000, steps: 60
Episode 13: reward: -56.000, steps: 60
Episode 14: reward: -56.000, steps: 60
Episode 15: reward: -56.000, steps: 60
Episode 16: reward: -60.000, steps: 60
Episode 17: reward: -58.000, steps: 60
Episode 18: reward: -56.000, steps: 60
Episode 19: reward: -58.000, steps: 60
Episode 20: reward: -58.000, steps: 60
Episode 21: reward: -54.000, steps: 60
Episode 22: reward: -54.000, steps: 60
Episode 23: reward: -52.000, steps: 60
Episode 24: reward: -58.000, steps: 60
Episode 25: reward: -60.000, steps: 60
Episo