# Overview
## State Space
- The state space consists of an 80,000-cell grid, representing different geographical locations in Puerto Rico.
- Each cell has attributes like solar PV output, wind power density, elevation, slope, cyclone risk score, building density, road density, and distance to transmission lines.
- Approximately 70% of cells are unavailable for development due to environmental or other constraints.
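
As a rough illustration of the scale, the sketch below derives the grid dimensions from a bounding box and cell size, then counts the cells left available under a mask. The bounds and the 500 m cell size are the values hard-coded in `gdf_to_tensor` in this notebook. The random mask is purely hypothetical and only stands in for the real constraint layers.

```python
import random

# Bounding box and cell size in metres (same values as gdf_to_tensor in this notebook).
X_START, X_END = 100_000, 300_000
Y_START, Y_END = 200_000, 300_000
SQUARE_SIZE = 500

grid_width = (X_END - X_START) // SQUARE_SIZE   # 400 columns
grid_height = (Y_END - Y_START) // SQUARE_SIZE  # 200 rows
n_cells = grid_width * grid_height              # 80,000 cells

# Hypothetical mask: 1 = unavailable, 0 = available, with roughly 70% masked.
rng = random.Random(0)
masked = [1 if rng.random() < 0.7 else 0 for _ in range(n_cells)]
available = n_cells - sum(masked)

print(grid_width, grid_height, n_cells)
print(f"available fraction: {available / n_cells:.2f}")
```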

## Action Space
- Two types of actions are available: building a solar array or a wind turbine.
- Actions can be taken on any available cell.
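
A discrete action space like this can be flattened into consecutive integer ids, which is what a Q-network output layer expects. The sketch below mirrors the idea behind `create_action_to_gdf_mapping` in this notebook, including its slope limits of 0.05 for solar and 0.15 for wind. It uses plain dictionaries instead of a GeoDataFrame, and the function name `build_action_mapping` and the sample cells are hypothetical.

```python
def build_action_mapping(cells, solar_max_slope=0.05, wind_max_slope=0.15):
    """Map consecutive integer action ids to (cell_id, installation_type) pairs."""
    mapping = {}
    for cell_id, cell in cells.items():
        if cell["masked"]:
            continue  # unavailable cells contribute no actions
        if cell["slope"] <= solar_max_slope:
            mapping[len(mapping)] = (cell_id, "solar")
        if cell["slope"] <= wind_max_slope:
            mapping[len(mapping)] = (cell_id, "wind")
    return mapping


# Hypothetical cells: one flat, one masked, one moderately sloped, one too steep.
cells = {
    0: {"masked": 0, "slope": 0.02},
    1: {"masked": 1, "slope": 0.01},
    2: {"masked": 0, "slope": 0.10},
    3: {"masked": 0, "slope": 0.30},
}
mapping = build_action_mapping(cells)
print(mapping)  # {0: (0, 'solar'), 1: (0, 'wind'), 2: (2, 'wind')}
```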

## Rewards and Costs
The reward function should incorporate:
- Energy production potential (solar and wind).
- Costs or penalties associated with building on certain terrains (e.g., high elevation or steep slopes).
- Penalties for building in high cyclone risk areas.
- Incentives for maintaining a balance between solar and wind energy.
- Incentives for early deployment and distributed grid development.
- Penalties for high building or road density areas.
- Distance to transmission lines.
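
One simple way to combine terms like these is a weighted sum in which costs carry negative weights. The sketch below is only an illustration of that idea: the term names, weights, and values are all hypothetical, and the environment in this notebook currently sums only the install costs and the power output reward.

```python
def total_reward(terms, weights):
    """Weighted sum of reward terms; cost terms are given negative weights."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical weights and per-step term values.
weights = {
    "energy": 1.0,
    "terrain_cost": -1.0,
    "cyclone_risk": -0.5,
    "transmission_distance": -0.1,
}
terms = {
    "energy": 10.0,
    "terrain_cost": 2.0,
    "cyclone_risk": 4.0,
    "transmission_distance": 5.0,
}
reward = total_reward(terms, weights)
print(reward)  # 10 - 2 - 2 - 0.5 = 5.5
```

Keeping the weights in one dictionary makes it easy to rescale terms so that no single one dominates the reward.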

## RL Model
- Model Choice: Given the size of the state space, a model-free deep RL algorithm (such as Deep Q-Networks or an Actor-Critic method) is suitable.
- Representation: The state representation should include the current status of each grid cell (whether it has a solar array, a wind turbine, or is vacant) along with its attributes.
- Sequence of Actions: The RL agent will sequentially choose actions (where to build next) based on the current state of the grid.
- Terminal State: An episode ends when the environment reaches a target level of energy capacity or after a fixed number of steps.
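
Because cells become occupied as the episode progresses, some actions stop being valid. One option is to mask invalid actions before taking the greedy maximum over Q-values. The sketch below shows that idea under stated assumptions; it is not what this notebook does, since the environment here returns a penalty of -1 for an invalid action instead. The function name `masked_greedy_action` and the sample values are hypothetical.

```python
def masked_greedy_action(q_values, valid):
    """Return the index of the highest Q-value among actions flagged as valid."""
    best_action, best_q = None, float("-inf")
    for action, (q, ok) in enumerate(zip(q_values, valid)):
        if ok and q > best_q:
            best_action, best_q = action, q
    if best_action is None:
        raise ValueError("no valid actions remain")
    return best_action


q_values = [0.2, 1.5, 0.9, 1.1]
valid = [True, False, True, True]  # action 1 targets an occupied cell
action = masked_greedy_action(q_values, valid)
print(action)  # 3, because the higher-valued action 1 is masked out
```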

# Implementation Steps
## Environment Setup: 
- Implement the environment to reflect the grid and its dynamics, including applying the binary mask for unavailable cells.
- The step(action) method should update the grid state based on the chosen action and calculate the immediate reward or cost.

## Agent Development:
- Use PyTorch for implementing the neural network models for the agent.
- The agent needs to learn a policy that maximizes long-term rewards, considering the complex reward structure and large state space.

## Training and Evaluation:
- Set up a training loop where the agent interacts with the environment, receives feedback, and improves its policy.
- Periodically evaluate the agent's performance, possibly using separate evaluation episodes or metrics like total energy capacity achieved or adherence to environmental constraints.

## Hyperparameter Tuning:
- Adjust learning rates, exploration rates, discount factors, and network architecture as needed to improve performance.
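
The exploration rate is a good example of how much a single hyperparameter matters. The sketch below reproduces the multiplicative per-episode decay used by `train` in this notebook, so different settings can be compared before running anything. With the values from the Execution cell (start 1, floor 0.01, decay 0.5), exploration hits its floor by the eighth episode. The slower decay of 0.9 is a hypothetical alternative, shown only for comparison.

```python
def epsilon_schedule(start, end, decay, episodes):
    """Epsilon used in each episode under multiplicative per-episode decay."""
    values, epsilon = [], start
    for _ in range(episodes):
        values.append(epsilon)
        epsilon = max(end, epsilon * decay)  # never drop below the floor
    return values


fast = epsilon_schedule(1.0, 0.01, 0.5, 16)  # settings from the Execution cell
slow = epsilon_schedule(1.0, 0.01, 0.9, 16)  # hypothetical slower decay

print([round(e, 4) for e in fast[:8]])
print(round(slow[-1], 4))
```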
 
## Scalability:
Due to the large state space, it may be necessary to:
- use function approximation for value functions
- prioritize important experiences in the replay buffer
- parallelize computation
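
Prioritizing experiences can be sketched as sampling buffer indices with probability proportional to a power of the absolute TD error. The example below is a simplified illustration, not a full prioritized replay implementation: it samples with replacement and omits the importance-sampling correction that a complete version would normally apply. The function name and the `alpha` and `eps` defaults are assumptions.

```python
import random


def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-3, rng=random):
    """Sample indices with probability proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    return rng.choices(range(len(td_errors)), weights=priorities, k=batch_size)


# Hypothetical TD errors: the third transition was far more surprising than the rest.
td_errors = [0.01, 0.02, 5.0, 0.01]
rng = random.Random(0)
indices = sample_prioritized(td_errors, batch_size=1000, rng=rng)
share = indices.count(2) / len(indices)
print(f"share of samples drawn from the high-error transition: {share:.2f}")
```

The small `eps` keeps transitions with zero error from never being replayed, and `alpha` controls how strongly sampling favors large errors (0 gives uniform sampling).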

## Visualization and Analysis:
- Develop tools to visualize the evolving grid layout and analyze the trade-offs made by the RL agent between different objectives (like energy maximization vs. environmental constraints).

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from scipy.spatial.distance import cdist, pdist, squareform

import random
from collections import namedtuple, deque

import pandas as pd
import numpy as np

# Environment

In [2]:
class RenewableEnergyEnvironment:
    def __init__(self, grid_gdf):
        self.weights = {
        'transmission_build_cost': -1.0,
        'early_choice_reward': 1.0,
        'distributed_grid_reward': 1.0,
        }
        self.grid_columns = ['slope', 
                             'distance_to_transmission_line', 
                             'cyclone_risk',
                            'water',
                            'occupied',
                            'masked']
        self.grid_columns += [f'demand_{i}' for i in range(1,25)]
        self.grid_columns += [f'wind_power_kW_hour_{i}' for i in range(1,25)]
        self.grid_columns += [f'solar_power_kW_hour_{i}' for i in range(1,25)]
        
        # Initialize the environment
        self.starting_environment = grid_gdf
        self.state_gdf = grid_gdf.copy()
        self.state_tensor = self.gdf_to_tensor(self.state_gdf)
        self.mapping, self.action_space_size = self.create_action_to_gdf_mapping()
        
        self.total_energy_output = 0
        self.required_energy_output = 5000000

        demand_df = pd.read_csv('../data/generation_and_demand/demand_profile.csv')
        self.demand = demand_df['demand_MW'].to_numpy() * 1000
        
        self.decay_rate = 0.1 #TODO determine good decay rate
        self.max_distance = 1000 #TODO get max distance between two cells
        
    def reset(self):
        # Reset the environment to the initial state
        self.state_gdf = self.starting_environment.copy()
        self.state_tensor = self.gdf_to_tensor(self.starting_environment)
        self.total_energy_output = 0
        return self.state_tensor

    def gdf_to_tensor(self, gdf):
        # Calculate grid dimensions
        x_start = 100000
        x_end = 300000
        y_start = 200000
        y_end = 300000
        square_size = 500
        
        grid_width = int((x_end - x_start) / square_size)
        grid_height = int((y_end - y_start) / square_size)
        
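        # Arrange the data as (channels, height, width), one channel per grid column.
        # Assumes the rows of gdf are ordered row by row across the grid.
        flat_data = gdf[self.grid_columns].to_numpy().T.reshape(-1, grid_height, grid_width)
    
        # Create a 3D tensor; the batch dimension is added later with unsqueeze(0)
        tensor = torch.tensor(flat_data, dtype=torch.float32)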

        return tensor

    def create_action_to_gdf_mapping(self):
        unmasked_gdf = self.state_gdf[self.state_gdf['masked'] == 0]
        
        mapping = {}
        action_idx = 0  # Initialize action index
    
        for _, row in unmasked_gdf.iterrows():
            # Check for valid solar action
            if row['slope'] <= 0.05:  # Slope check for solar
                mapping[action_idx] = (row.name, 'solar')
                action_idx += 1
    
            # Check for valid wind action
            if row['slope'] <= 0.15:  # Slope check for wind
                mapping[action_idx] = (row.name, 'wind')
                action_idx += 1
        
        action_space_size = action_idx  # Total number of valid actions
        return mapping, action_space_size

    
    def step(self, action):
        # Apply the action to the environment and return the result
        # action: integer index into self.mapping, which resolves to (cell_index, action_type),
        # where action_type is 'solar' or 'wind'
        
        cell_index, action_type = self.mapping[action]

        # Check if the action is valid
        if not self.is_valid_action(cell_index, action_type):
            reward = -1  # Penalty for invalid action
            done = self.is_terminal_state()
            return self.state_tensor, reward, done, {}

        # Apply the action
        self.apply_action(cell_index, action_type)

        # Calculate reward
        reward = self.calculate_reward()

        # Update the running energy total. calculate_energy_output() covers every
        # installation built so far, so earlier installations are counted again on each step.
        self.total_energy_output += self.calculate_energy_output().sum()
        print(f'Cell: {cell_index}, Action: {action_type}, Reward: {reward}, Energy: {self.total_energy_output}')
        # Check if the state is terminal
        done = self.is_terminal_state()
        return self.state_tensor, reward, done, {}

    def is_valid_action(self, cell_index, action_type):
        # Implement logic to check if an action is valid
        # Example: Check if the cell is not masked and not already occupied
        # Label-based lookup, matching row.name in the mapping and .at in apply_action
        cell = self.state_gdf.loc[cell_index]
        if cell['masked']: # This should never occur
            print('Something has gone wrong. Attempting to build on a masked cell')
            return False
        elif cell['occupied']:
            return False
        elif action_type == 'solar' and cell['slope'] > 0.05:
            return False
        elif action_type == 'wind' and cell['slope'] > 0.15:
            return False
        
        return True

    def apply_action(self, cell_index, action_type):
        # Implement the changes to the environment based on the action
        # Example: Mark the cell as occupied and record the type of installation
        self.state_gdf.at[cell_index, 'occupied'] = 1
        self.state_gdf.at[cell_index, 'installation_type'] = action_type
        self.state_tensor = self.gdf_to_tensor(self.state_gdf)

    def calculate_reward(self):
        # Solar installation cost
        solar_install_cost = self.calculate_solar_install_cost()

        # Wind turbine installation cost
        wind_install_cost = self.calculate_wind_install_cost()
        
        # Power output reward (solar and wind combined)
        power_output_reward = self.calculate_power_output_reward()

        # Transmission build cost
        # transmission_build_cost = self.calculate_transmission_build_cost()
        
        # Cyclone risk cost
        # cyclone_risk_cost = self.calculate_cyclone_risk_cost()
    
        # Distribution reward
        # distributed_grid_reward = self.calculate_distributed_grid_reward()
    
        # Early choice reward
        early_choice_reward = self.time_dependent_reward_factor()
    
        costs_rewards = (solar_install_cost,
                         wind_install_cost,
                         power_output_reward)
                         # transmission_build_cost,
                         # cyclone_risk_cost,
                         # distributed_grid_reward,
                         # early_choice_reward)
                        
        
        total_reward = self.calculate_total_reward(costs_rewards)
    
        return total_reward

    def calculate_energy_output(self):
        # Filter for solar and wind installations
        occupied = self.state_gdf['occupied'] == 1  # boolean mask, whether the column holds 0/1 or True/False
        solar_gdf = self.state_gdf[occupied & (self.state_gdf['installation_type'] == 'solar')]
        wind_gdf = self.state_gdf[occupied & (self.state_gdf['installation_type'] == 'wind')]
    
        # Prepare column names for solar and wind power
        solar_power_columns = [f'solar_power_kW_hour_{i}' for i in range(1, 25)]
        wind_power_columns = [f'wind_power_kW_hour_{i}' for i in range(1, 25)]
    
        # Vectorized sum of power output for solar and wind for each hour
        total_solar_power = solar_gdf[solar_power_columns].sum().to_numpy()
        total_wind_power = wind_gdf[wind_power_columns].sum().to_numpy()
    
        total_power = total_solar_power + total_wind_power
        
        return total_power

    def calculate_power_output_reward(self):
        """Uses supply and demand curve to determine the amount of demand satisfied by
        the solar and wind installations"""
        cost_kWh = .22  # TODO
        demand = self.demand

        # Supply beyond demand earns nothing: cap each hour's output at that hour's demand
        total_power = self.calculate_energy_output()
        reward = np.minimum(total_power, demand).sum() * cost_kWh
    
        return reward
    
    def calculate_solar_install_cost(self):
        """Cost of installing solar array on cell. Based on elevation and slope"""
        return 0

    def calculate_wind_install_cost(self):
        """Cost of installing wind turbine on cell. Based on elevation and slope"""
        return 0

    def transmission_line_cost_per_km(self, distance):
        MILES_TO_KM = 1.60934
        COST_PER_KM = 2.29 * 1000000 / MILES_TO_KM  # Dollars per km, converted from $2.29 million per mile
        SHORT_DISTANCE_THRESHOLD = 3 * MILES_TO_KM  # 3 miles, expressed in km to match distance
        MEDIUM_DISTANCE_THRESHOLD = 10 * MILES_TO_KM  # 10 miles, expressed in km to match distance
        
        if distance < SHORT_DISTANCE_THRESHOLD:
            cost_modifier = 1.5  # 50% increase for less than 3 miles
        elif distance < MEDIUM_DISTANCE_THRESHOLD:
            cost_modifier = 1.2  # 20% increase for 3-10 miles
        else:
            cost_modifier = 1  # No modification for more than 10 miles
        return distance * COST_PER_KM * cost_modifier
    
    def calculate_transmission_build_cost(self):
        # Constants for cost adjustment
        # Extract occupied cells
        occupied_cells = self.state_gdf[self.state_gdf['occupied'] == True]
    
        # Check if there is only one occupied cell
        if len(occupied_cells) == 1:
            # For a single occupied cell, use the distance to transmission line for cost calculation
            occupied_cell = occupied_cells.iloc[0]
            distance_km = occupied_cell['distance_to_transmission_line']
            build_cost = self.transmission_line_cost_per_km(distance_km)
        else:
            # Get coordinates of occupied cells
            coords = np.array(list(zip(occupied_cells.geometry.x, occupied_cells.geometry.y)))
    
            # Calculate pairwise distances between occupied cells
            distances = cdist(coords, coords)
    
            # Replace zeros in distance matrix with np.inf to avoid zero distance to itself
            np.fill_diagonal(distances, np.inf)
    
            # Find the nearest installation for each installation
            nearest_installation_distances = np.min(distances, axis=1)
    
            # Determine the relevant distance for cost calculation
            relevant_distances = np.minimum(nearest_installation_distances, occupied_cells['distance_to_transmission_line'].to_numpy())
    
            # Calculate build cost
            build_costs = [self.transmission_line_cost_per_km(distance) for distance in relevant_distances]
            build_cost = sum(build_costs)
            
        return build_cost
    
    def calculate_distributed_grid_reward(self):
        # Extract the coordinates of the occupied cells (where installations are located)
        occupied_cells = self.state_gdf[self.state_gdf['occupied'] == True]
        if len(occupied_cells) < 2:
            # If there are fewer than two installations, we cannot calculate distances
            return 1
    
        coords = np.array(list(zip(occupied_cells.geometry.x, occupied_cells.geometry.y)))
    
        # Calculate pairwise distances between all occupied cells
        distances = pdist(coords)
    
        # Calculate the average distance. The larger this is, the more distributed the installations are.
        avg_distance = np.mean(distances)
    
        # Normalize the reward such that it ranges between 0 and 1
        normalized_reward = avg_distance / self.max_distance
    
        return normalized_reward
    
    def time_dependent_reward_factor(self):
        # Time-dependent reward factor: it decays as the number of installations
        # already built (a proxy for elapsed time) grows
        action_number = self.state_gdf.occupied.sum()
        return 1 / (1 + self.decay_rate * action_number)
    
    def calculate_total_reward(self, costs_rewards):
        """Sum the install costs and the power output reward into a single scalar reward."""
        solar_install_cost = costs_rewards[0]
        wind_install_cost = costs_rewards[1]
        power_output_reward = costs_rewards[2]

        total_cost_reward = solar_install_cost + wind_install_cost + power_output_reward
        
        return total_cost_reward

    def is_terminal_state(self):
        # The episode ends when the total energy output meets the requirement
        return self.total_energy_output >= self.required_energy_output

    def render(self):
        # Optional: Implement a method to visualize the current state of the environment
        pass

# Neural Network Architecture

In [3]:
class DQN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(input_shape[0], 32, kernel_size=3, stride=4)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.bn3 = nn.BatchNorm2d(64)

        self.input_shape = input_shape  # Store input_shape for feature size calculation
        self.fc1 = nn.Linear(self._feature_size(), 512)
        self.fc2 = nn.Linear(512, num_actions)

    def _feature_size(self):
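        # Batch norm does not change tensor shapes, so only the conv layers are needed here;
        # skipping it also keeps the dummy input out of the running statistics.
        with torch.no_grad():
            x = self.conv3(self.conv2(self.conv1(torch.zeros(1, *self.input_shape))))
        return x.view(1, -1).size(1)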

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
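
The flattened feature size that `_feature_size` measures with a dummy forward pass can also be worked out by hand. For a convolution with no padding or dilation, the output length along one dimension is `floor((size - kernel) / stride) + 1`. The sketch below applies that formula to the three conv layers above, assuming the 200 x 400 grid implied by `gdf_to_tensor`. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of one spatial dimension for a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_size(height, width, layers, out_channels):
    """Flattened feature count after a stack of (kernel, stride) conv layers."""
    for kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return out_channels * height * width


# (kernel_size, stride) for conv1, conv2, conv3 as defined in the DQN class
layers = [(3, 4), (4, 2), (3, 1)]
n_features = feature_size(200, 400, layers, out_channels=64)
print(n_features)  # 66176 = 64 channels * 22 * 47
```

Under these assumptions `fc1` maps 66,176 inputs to 512 units, so it holds most of the convolutional trunk's parameters.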


# DQN Agent

In [4]:
class DQNAgent:
    def __init__(self, state_space, action_space_size):
        self.action_space_size = action_space_size
        self.model = DQN(state_space.shape, action_space_size)
        self.optimizer = optim.Adam(self.model.parameters(), lr=0.001)

    def select_action(self, state, epsilon):
        if random.random() > epsilon:
            # Choose the best action (exploitation)
            with torch.no_grad():
                q_values = self.model(state)
                action = q_values.max(1)[1].item()  # Select the action with the highest Q-value
        else:
            # Choose a random action (exploration)
            action = random.randrange(self.action_space_size)

        return action

    def learn(self, batch):
        states, actions, rewards, next_states, dones = batch
        
        # Compute Q values
        current_q_values = self.model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        next_q_values = self.model(next_states).max(1)[0]
        expected_q_values = rewards + (1 - dones) * next_q_values.detach()

        # Compute loss
        loss = torch.nn.functional.mse_loss(current_q_values, expected_q_values)

        # Backpropagation
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
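
The target in `learn` adds the next state's maximum Q-value to the reward without a discount factor, which is equivalent to a discount of 1. The sketch below shows the more common discounted form of the same target in plain Python, so the arithmetic is easy to check. The `gamma` value and the function name `td_targets` are assumptions and do not appear in the agent above.

```python
def td_targets(rewards, next_q_max, dones, gamma=0.99):
    """One-step TD targets: r + gamma * max_a Q(s', a), with no bootstrap on terminal steps."""
    return [r + gamma * q * (1.0 - d) for r, q, d in zip(rewards, next_q_max, dones)]


# Two hypothetical transitions: the second one ends the episode.
targets = td_targets(
    rewards=[1.0, 2.0],
    next_q_max=[10.0, 10.0],
    dones=[0.0, 1.0],
    gamma=0.9,
)
print(targets)  # approximately [10.0, 2.0]
```

A discount below 1 keeps targets bounded on long episodes. Many DQN implementations also compute `next_q_max` from a separate, periodically updated target network to stabilize training.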


# Replay Buffer

In [5]:
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done'])

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        experience = Experience(state, action, reward, next_state, done)
        self.buffer.append(experience)

    def sample(self, batch_size):
        experiences = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*experiences)

        # Stack the tuples of tensors to create a single tensor for each component
        states = torch.stack(states)
        actions = torch.tensor(actions, dtype=torch.long)
        rewards = torch.tensor(rewards, dtype=torch.float)
        next_states = torch.stack(next_states)
        dones = torch.tensor(dones, dtype=torch.float)

        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Training Loop

In [6]:
def train(agent, environment, episodes, epsilon_start, epsilon_end, epsilon_decay, replay_buffer, batch_size):
    epsilon = epsilon_start
    for episode in range(episodes):
        state = environment.reset()
        done = False
        total_reward = 0  # To keep track of total reward per episode

        while not done:
            action = agent.select_action(state.unsqueeze(0), epsilon)
            next_state, reward, done, _ = environment.step(action)

            # Store experience in replay buffer
            replay_buffer.store(state, action, reward, next_state, bool(done))

            # Check if buffer is ready for sampling
            if len(replay_buffer) > batch_size:
                # Sample a batch from replay buffer
                batch = replay_buffer.sample(batch_size)
                # Learn from the sampled experiences
                agent.learn(batch)

            # Update state
            state = next_state
            total_reward += reward

        # Decay epsilon
        epsilon = max(epsilon_end, epsilon_decay * epsilon)  # Ensure epsilon doesn't go below the minimum

        # Optional: Log training progress
        print(f"Episode {episode + 1}/{episodes}, Total Reward: {total_reward}, Epsilon: {epsilon}")

    print("Training complete.")

# Execution

In [7]:
state_gdf = pd.read_parquet('../data/processed/state_randomized.parquet')

environment = RenewableEnergyEnvironment(state_gdf)
state_space = environment.state_tensor
action_space_size = environment.action_space_size

agent = DQNAgent(state_space, action_space_size)

episodes = 16

epsilon_start = 1
epsilon_end = 0.01
epsilon_decay = 0.5

replay_buffer = ReplayBuffer(128)

batch_size = 4

train(agent, environment, episodes, epsilon_start, epsilon_end, epsilon_decay, replay_buffer, batch_size)

Cell: 33636, Action: solar, Reward: 30388.599920272827, Energy: 138129.99963760376
Cell: 63454, Action: wind, Reward: 32223.842721214296, Energy: 284602.01200675964
Cell: 28939, Action: solar, Reward: 60914.31863807678, Energy: 561485.2785434723
Cell: 42539, Action: wind, Reward: 63823.1969283104, Energy: 851590.7191267014
Cell: 20038, Action: solar, Reward: 93819.92284517288, Energy: 1278044.9138774872
Cell: 34237, Action: wind, Reward: 96275.26881954193, Energy: 1715659.7721481323
Cell: 71284, Action: wind, Reward: 98069.6536786461, Energy: 2161430.9252328873
Cell: 11861, Action: solar, Reward: 125312.80485462189, Energy: 2731034.5836629868
Cell: 19837, Action: wind, Reward: 127599.1231158638, Energy: 3311030.597826004
Cell: 35337, Action: solar, Reward: 155709.6237138176, Energy: 4018801.614706993
Cell: 36333, Action: wind, Reward: 158082.83201587677, Energy: 4737359.9420518875
Cell: 62524, Action: solar, Reward: 186506.83231485367, Energy: 5585118.270755768
Episode 1/16, Total Rewa

KeyboardInterrupt: 