# Overview
## State Space
- The state space consists of an 80,000-cell grid, representing different geographical locations in Puerto Rico.
- Each cell has attributes like solar PV output, wind power density, elevation, slope, cyclone risk score, building density, road density, and distance to transmission lines.
- Approximately 70% of cells are unavailable for development due to environmental or other constraints.
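
The availability constraint above can be sketched as a boolean mask over a flattened grid. This is a minimal illustration, not the notebook's implementation: `build_mask` and `available_cells` are hypothetical names, and the mask is drawn at random as a stand-in for the real constraint layer.

```python
import random

def build_mask(num_cells, unavailable_fraction, seed=0):
    """Return a list of booleans; True marks a cell that cannot be developed.

    Assumption: unavailable cells are picked at random here, only to
    stand in for the real environmental constraint layer.
    """
    rng = random.Random(seed)
    num_unavailable = round(num_cells * unavailable_fraction)
    unavailable = set(rng.sample(range(num_cells), num_unavailable))
    return [i in unavailable for i in range(num_cells)]

def available_cells(mask):
    """Indices of cells the agent is allowed to build on."""
    return [i for i, masked in enumerate(mask) if not masked]

mask = build_mask(num_cells=80_000, unavailable_fraction=0.7)
candidates = available_cells(mask)
```

Restricting the agent to `candidates` shrinks the set of cells it has to consider from 80,000 to roughly 30% of that.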

## Action Space
- Two types of actions are available: building a solar array or a wind turbine.
- Actions can be taken on any available cell.
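
A Q-network outputs one value per discrete action, so the (cell, installation type) pairs described above need a flat integer encoding. The sketch below shows one possible mapping; `ACTION_TYPES`, `encode_action`, and `decode_action` are hypothetical names that do not appear elsewhere in this notebook.

```python
ACTION_TYPES = ("solar", "wind")

def encode_action(cell_index, action_type):
    """Map a (cell, type) pair to a single integer action id."""
    return cell_index * len(ACTION_TYPES) + ACTION_TYPES.index(action_type)

def decode_action(action_id):
    """Inverse of encode_action: recover (cell_index, action_type)."""
    cell_index, type_index = divmod(action_id, len(ACTION_TYPES))
    return cell_index, ACTION_TYPES[type_index]

# Round trip for one example action
action_id = encode_action(42, "wind")
cell_index, action_type = decode_action(action_id)
```

With 80,000 cells and two installation types, this encoding yields 160,000 action ids before any masking of unavailable cells.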

## Rewards and Costs
The reward function should incorporate:
- Energy production potential (solar and wind).
- Costs or penalties associated with building on certain terrains (e.g., high elevation or steep slopes).
- Penalties for building in high cyclone risk areas.
- Incentives for maintaining a balance between solar and wind energy.
- Incentives for early deployment and distributed grid development.
- Penalties for high building or road density areas.
- Penalties that increase with distance to transmission lines.
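
One way to combine terms like these is to min-max normalize each raw score and take a weighted sum, with negative weights acting as penalties. The sketch below assumes that scheme; `normalize` and `weighted_reward` are hypothetical helpers, and the example numbers are made up for illustration.

```python
def normalize(value, lower, upper):
    """Clip value to [lower, upper] and rescale it to [0, 1]."""
    if upper == lower:
        return 0.0
    clipped = min(max(value, lower), upper)
    return (clipped - lower) / (upper - lower)

def weighted_reward(raw_terms, weights, bounds):
    """Weighted sum of normalized terms; negative weights are penalties."""
    total = 0.0
    for name, value in raw_terms.items():
        lower, upper = bounds[name]
        total += weights[name] * normalize(value, lower, upper)
    return total

# Illustrative values only
weights = {"cyclone_risk": -0.7, "wind_solar_balance": 1.0}
bounds = {"cyclone_risk": (0, 10), "wind_solar_balance": (0, 1)}
raw_terms = {"cyclone_risk": 5, "wind_solar_balance": 0.5}
reward = weighted_reward(raw_terms, weights, bounds)
```

Normalizing first keeps terms with large raw magnitudes (such as distances in kilometers) from dominating terms that are already on a 0 to 1 scale.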

## RL Model
- Model Choice: Given the size of the state space, a model-free deep RL algorithm (such as Deep Q-Networks or Actor-Critic methods) is suitable.
- Representation: The state representation should include the current status of each grid cell (whether it has a solar array, a wind turbine, or is vacant) along with its attributes.
- Sequence of Actions: The RL agent will sequentially choose actions (where to build next) based on the current state of the grid.
- Terminal State: The agent is done when the environment reaches a certain level of energy capacity or after a fixed number of steps.
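
The representation and terminal-state bullets above can be sketched as follows. This is only an illustration under assumed names: `STATUSES`, `encode_cell`, and `is_done` are hypothetical, and the attribute values are placeholders.

```python
STATUSES = ("vacant", "solar", "wind")

def encode_cell(status, attributes):
    """One-hot encode the installation status and append the cell attributes."""
    one_hot = [1.0 if status == s else 0.0 for s in STATUSES]
    return one_hot + [float(a) for a in attributes]

def is_done(total_capacity, target_capacity, step_count, max_steps):
    """Stop once the capacity target is met or the step budget is used up."""
    return total_capacity >= target_capacity or step_count >= max_steps

# A cell holding a wind turbine, with two placeholder attribute values
features = encode_cell("wind", [0.2, 0.5])
```

Stacking one such feature vector per cell gives the per-cell channels that a convolutional network can consume.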

# Implementation Steps
## Environment Setup: 
- Implement the environment to reflect the grid and its dynamics, including applying the binary mask for unavailable cells.
- The step(action) method should update the grid state based on the chosen action and calculate the immediate reward or cost.

## Agent Development:
- Use PyTorch for implementing the neural network models for the agent.
- The agent needs to learn a policy that maximizes long-term rewards, considering the complex reward structure and large state space.

## Training and Evaluation:
- Set up a training loop where the agent interacts with the environment, receives feedback, and improves its policy.
- Periodically evaluate the agent's performance, possibly using separate evaluation episodes or metrics like total energy capacity achieved or adherence to environmental constraints.

## Hyperparameter Tuning:
- Adjust learning rates, exploration rates, discount factors, and network architecture as needed to improve performance.
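
When tuning the exploration rate, it helps to know how many episodes a multiplicative decay takes to reach its floor. The sketch below assumes the schedule `epsilon = max(epsilon_end, epsilon_start * decay ** episode)`; the function names and the example values are assumptions for illustration.

```python
import math

def epsilon_at(episode, start, end, decay):
    """Exploration rate after `episode` multiplicative decay steps."""
    return max(end, start * decay ** episode)

def episodes_until_floor(start, end, decay):
    """Smallest number of decay steps after which epsilon sits at `end`."""
    return math.ceil(math.log(end / start) / math.log(decay))

# Example values, chosen only for illustration
n = episodes_until_floor(start=1.0, end=0.05, decay=0.995)
floor_epsilon = epsilon_at(n, start=1.0, end=0.05, decay=0.995)
```

If `n` is larger than the planned number of training episodes, the agent never stops exploring heavily, which suggests lowering `decay` or raising the episode count.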
 
## Scalability:
Because the state space is large, it may be necessary to:
- use function approximation for value functions
- prioritize important experiences in the replay buffer
- parallelize computation
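
Prioritizing experiences can be sketched with proportional sampling, where a transition with priority p is drawn with probability proportional to p raised to a power alpha. This simplified version omits the importance-sampling correction that usually accompanies prioritized replay; the function names and the default `alpha=0.6` are assumptions.

```python
import random

def sampling_probabilities(priorities, alpha=0.6):
    """Turn priorities into probabilities; alpha=0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_indices(priorities, batch_size, alpha=0.6, rng=random):
    """Draw buffer indices (with replacement) in proportion to priority."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)

# Priorities would typically be absolute TD errors; these are placeholders
priorities = [0.1, 0.5, 2.0, 1.0]
probs = sampling_probabilities(priorities)
batch = sample_indices(priorities, batch_size=3, rng=random.Random(0))
```

Transitions with larger priorities are replayed more often, which can speed up learning when informative experiences are rare.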

## Visualization and Analysis:
- Develop tools to visualize the evolving grid layout and analyze the trade-offs made by the RL agent between different objectives (like energy maximization vs. environmental constraints).

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from scipy.spatial.distance import cdist, pdist, squareform

import random
import numpy as np

# Environment

In [6]:
class RenewableEnergyEnvironment:
    def __init__(self, grid_df, bounds):
        # grid_df: one row per grid cell, holding the cell attributes.
        # bounds: dict mapping each reward term to its (min, max) range. The
        # maxima are not defined anywhere in this notebook, so the caller must
        # supply them; the expected keys are the same as in self.weights.
        self.grid_df = grid_df
        self.state = None
        self.total_energy_output = 0
        # Required energy output for each of the 24 hours of the day
        self.required_energy_output = [100] * 24
        self.weights = {
            'transmission_line_distance': -1.0,
            'wind_solar_balance': 1.0,
            'building_density': -0.5,
            'road_density': -0.3,
            'cyclone_risk': -0.7,
            'early_choice_reward': 1.0,
            'distributed_grid_reward': 1.0,
        }
        self.bounds = bounds

    def reset(self):
        # Reset the environment to the initial state
        self.state = self.grid_df.copy()
        self.total_energy_output = 0
        return self.state

    def step(self, action):
        # Apply the action to the environment and return the result
        # action: Tuple (cell_index, action_type) where action_type could be 'solar' or 'wind'
        cell_index, action_type = action

        # Check if the action is valid
        if not self.is_valid_action(cell_index, action_type):
            reward = -1  # Penalty for invalid action
            done = self.is_terminal_state()
            return self.state, reward, done, {}

        # Apply the action
        self.apply_action(cell_index, action_type)

        # Calculate reward
        reward = self.calculate_reward(self.state)

        # Add the energy output of the new installation to the running total
        self.total_energy_output += self.calculate_energy_output(cell_index, action_type)

        # Check if the state is terminal
        done = self.is_terminal_state()

        return self.state, reward, done, {}

    def is_valid_action(self, cell_index, action_type):
        # Implement logic to check if an action is valid
        # Example: Check if the cell is not masked and not already occupied
        cell = self.state.iloc[cell_index]
        if cell['masked']:
            return False
        elif cell['occupied']:
            return False
        elif action_type == 'solar' and cell['slope'] > 0.05:
            return False
        elif action_type == 'wind' and cell['slope'] > 0.15:
            return False
        
        return True

    def apply_action(self, cell_index, action_type):
        # Implement the changes to the environment based on the action
        # Example: Mark the cell as occupied and record the type of installation
        self.state.at[cell_index, 'occupied'] = True
        self.state.at[cell_index, 'installation_type'] = action_type

    def calculate_reward(self, state):
        # Combine the individual cost and reward terms into a single scalar.
        # TODO: the helpers for solar/wind power reward, transmission loss,
        # cyclone risk and early choice are not implemented in this notebook
        # yet, and the distributed-grid helper still needs its grid and
        # max_distance arguments wired in.

        # Installation costs
        solar_install_cost = self.calculate_solar_install_cost()
        wind_install_cost = self.calculate_wind_install_cost()

        # Power rewards
        solar_power_reward = self.calculate_solar_power_reward()
        wind_power_reward = self.calculate_wind_power_reward()

        # Transmission loss and build costs
        transmission_loss_cost = self.calculate_transmission_loss_cost()
        transmission_build_cost = self.calculate_transmission_build_cost(state)

        # Cyclone risk cost
        cyclone_risk_cost = self.calculate_cyclone_risk_cost()

        # Distribution and early choice rewards
        distributed_grid_reward = self.calculate_distributed_grid_reward()
        early_choice_reward = self.calculate_early_choice_reward()

        costs_rewards = (solar_install_cost,
                         wind_install_cost,
                         solar_power_reward,
                         wind_power_reward,
                         transmission_loss_cost,
                         transmission_build_cost,
                         cyclone_risk_cost,
                         distributed_grid_reward,
                         early_choice_reward)

        total_reward = self.calculate_total_reward(costs_rewards, self.weights)

        return total_reward

    def calculate_energy_output(self, cell_index, action_type):
        # Calculate the energy output for the action
        # Example: Different output for solar and wind
        if action_type == 'solar':
            return self.state.iloc[cell_index]['solar_output']
        elif action_type == 'wind':
            return self.state.iloc[cell_index]['wind_output']
        return 0

    def calculate_solar_install_cost(self):
        """Cost of installing a solar array on a cell, based on elevation and slope."""
        # TODO: derive the cost from self.state and self.bounds
        raise NotImplementedError

    def calculate_wind_install_cost(self):
        """Cost of installing a wind turbine on a cell, based on elevation and slope."""
        # TODO: derive the cost from self.state and self.bounds
        raise NotImplementedError

    @staticmethod
    def calculate_power_output_reward(environment_gdf, demand):
        """Uses the supply and demand curves to determine how much demand is
        satisfied by the solar and wind installations."""
        cost_kWh = 9999  # TODO

        # Filter for solar and wind installations
        # ('installation_type' is the column written by apply_action)
        solar_gdf = environment_gdf[(environment_gdf['occupied']) & (environment_gdf['installation_type'] == 'solar')]
        wind_gdf = environment_gdf[(environment_gdf['occupied']) & (environment_gdf['installation_type'] == 'wind')]

        # Prepare column names for solar and wind power
        solar_power_columns = [f'solar_power_{i}' for i in range(1, 25)]
        wind_power_columns = [f'wind_power_{i}' for i in range(1, 25)]

        # Vectorized sum of power output for each hour. Converting to NumPy
        # adds the totals by position; adding the pandas Series directly would
        # align on the differing column names and produce NaN.
        total_solar_power = solar_gdf[solar_power_columns].sum().to_numpy()
        total_wind_power = wind_gdf[wind_power_columns].sum().to_numpy()

        # Calculate the reward using a vectorized minimum.
        # demand must hold one value per hour (24 values).
        total_power = total_solar_power + total_wind_power
        reward = np.minimum(total_power, np.asarray(demand)).sum() * cost_kWh

        return reward
    
    @staticmethod
    def calculate_transmission_build_cost(environment_gdf):
        # Assumption: geometry coordinates and distance_to_transmission_line
        # are both expressed in kilometers.
        cost_fn = RenewableEnergyEnvironment.transmission_line_cost_per_km

        # Extract occupied cells
        occupied_cells = environment_gdf[environment_gdf['occupied'] == True]

        # Check if there is only one occupied cell
        if len(occupied_cells) == 1:
            # For a single occupied cell, use the distance to transmission line for cost calculation
            occupied_cell = occupied_cells.iloc[0]
            distance_km = occupied_cell['distance_to_transmission_line']
            build_cost = cost_fn(distance_km)
        else:
            # Get coordinates of occupied cells
            coords = np.array(list(zip(occupied_cells.geometry.x, occupied_cells.geometry.y)))

            # Calculate pairwise distances between occupied cells
            distances = squareform(pdist(coords))

            # Replace zeros on the diagonal with np.inf to ignore each cell's distance to itself
            np.fill_diagonal(distances, np.inf)

            # Find the nearest installation for each installation
            nearest_installation_distances = np.min(distances, axis=1)

            # Connect to whichever is closer: the nearest installation or the transmission line
            relevant_distances = np.minimum(nearest_installation_distances, occupied_cells['distance_to_transmission_line'].to_numpy())

            # Calculate build cost
            build_cost = sum(cost_fn(distance) for distance in relevant_distances)

        return build_cost

    @staticmethod
    def transmission_line_cost_per_km(distance):
        # distance is in kilometers; the returned build cost is in $ millions
        KM_PER_MILE = 1.60934
        COST_PER_KM = 2.29 / KM_PER_MILE  # $ millions per km, converted from cost per mile
        SHORT_DISTANCE_THRESHOLD = 3 * KM_PER_MILE  # 3 miles, expressed in km
        MEDIUM_DISTANCE_THRESHOLD = 10 * KM_PER_MILE  # 10 miles, expressed in km

        if distance < SHORT_DISTANCE_THRESHOLD:
            cost_modifier = 1.5  # 50% increase for less than 3 miles
        elif distance < MEDIUM_DISTANCE_THRESHOLD:
            cost_modifier = 1.2  # 20% increase for 3-10 miles
        else:
            cost_modifier = 1  # No modification for more than 10 miles
        return distance * COST_PER_KM * cost_modifier
    
    @staticmethod
    def calculate_distributed_grid_reward(environment_gdf, max_distance):
        # Extract the occupied cells (where installations are located)
        occupied_cells = environment_gdf[environment_gdf['occupied'] == True]
        if len(occupied_cells) < 2:
            # With fewer than two installations there are no pairwise distances to average
            return 1

        coords = np.array(list(zip(occupied_cells.geometry.x, occupied_cells.geometry.y)))

        # Calculate pairwise distances between all occupied cells
        distances = pdist(coords)

        # Calculate the average distance. The larger this is, the more distributed the installations are.
        avg_distance = np.mean(distances)

        # Normalize the reward; it stays between 0 and 1 as long as
        # max_distance is at least the largest possible pairwise distance
        normalized_reward = avg_distance / max_distance

        return normalized_reward
    
    @staticmethod
    def time_dependent_reward_factor(gdf, decay_rate):
        # Time-dependent reward factor: it shrinks as more installations are
        # built, using the number of occupied cells as the measure of elapsed time
        action_number = gdf['occupied'].sum()
        return 1 / (1 + decay_rate * action_number)

    def calculate_total_reward(self, costs_rewards, weights):
        """Calculate the total cost/reward from the individual terms and their weights."""
        # TODO: normalize each term with self.bounds and combine using weights
        raise NotImplementedError

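    def is_terminal_state(self):
        # The episode ends when the total energy output meets the requirement.
        # Assumption: the requirement is the sum of the 24 hourly values, since
        # a scalar total cannot be compared against the hourly list directly.
        return self.total_energy_output >= sum(self.required_energy_output)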

    def render(self):
        # Optional: Implement a method to visualize the current state of the environment
        pass


# Neural Network Architecture

In [27]:
class DQN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.bn3 = nn.BatchNorm2d(64)

        self.input_shape = input_shape  # Store input_shape for feature size calculation
        self.fc1 = nn.Linear(self._feature_size(), 512)
        self.fc2 = nn.Linear(512, num_actions)

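    def _feature_size(self):
        # Run a dummy input through the conv layers to get the flattened size.
        # Batch norm does not change tensor shapes, so it is skipped here.
        with torch.no_grad():
            x = torch.zeros(1, *self.input_shape)
            x = self.conv3(self.conv2(self.conv1(x)))
            return x.view(1, -1).size(1)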

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)


# DQN Agent

In [4]:
class DQNAgent:
    def __init__(self, state_space, action_space, learning_rate=1e-3, gamma=0.99):
        self.state_space = state_space
        self.action_space = action_space
        self.model = DQN(state_space.shape, action_space.n)
        # The learning_rate and gamma defaults are placeholders to be tuned
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)
        self.gamma = gamma  # Discount factor for future rewards

    def select_action(self, state, epsilon):
        if random.random() > epsilon:
            # Choose the best action (exploitation)
            with torch.no_grad():
                # Add a batch dimension of size 1
                state_tensor = torch.as_tensor(np.asarray(state), dtype=torch.float32).unsqueeze(0)
                q_values = self.model(state_tensor)
                action = q_values.max(1)[1].item()  # Select the action with the highest Q-value
        else:
            # Choose a random action (exploration)
            action = random.randrange(self.action_space.n)

        return action

    def learn(self, batch):
        states, actions, rewards, next_states, dones = batch

        # Convert to PyTorch tensors (via NumPy so sequences of arrays convert cleanly)
        states = torch.as_tensor(np.array(states), dtype=torch.float32)
        actions = torch.as_tensor(np.array(actions), dtype=torch.long)
        rewards = torch.as_tensor(np.array(rewards), dtype=torch.float32)
        next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
        dones = torch.as_tensor(np.array(dones), dtype=torch.float32)

        # Q-values of the actions that were actually taken
        current_q_values = self.model(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Bootstrapped targets; no gradient flows through the target
        with torch.no_grad():
            next_q_values = self.model(next_states).max(1)[0]
            expected_q_values = rewards + self.gamma * (1 - dones) * next_q_values

        # Compute loss
        loss = F.mse_loss(current_q_values, expected_q_values)

        # Backpropagation
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()


# Training Loop

In [5]:
def train(agent, environment, episodes, epsilon_start, epsilon_end, epsilon_decay, replay_buffer, batch_size):
    epsilon = epsilon_start
    for episode in range(episodes):
        state = environment.reset()
        done = False
        total_reward = 0  # To keep track of total reward per episode

        while not done:
            action = agent.select_action(state, epsilon)
            next_state, reward, done, _ = environment.step(action)

            # Store experience in replay buffer
            replay_buffer.store(state, action, reward, next_state, done)

            # Check if buffer is ready for sampling
            if len(replay_buffer) > batch_size:
                # Sample a batch from replay buffer
                batch = replay_buffer.sample(batch_size)
                # Learn from the sampled experiences
                agent.learn(batch)

            # Update state
            state = next_state
            total_reward += reward

        # Decay epsilon
        epsilon = max(epsilon_end, epsilon_decay * epsilon)  # Ensure epsilon doesn't go below the minimum

        # Optional: Log training progress
        print(f"Episode {episode + 1}/{episodes}, Total Reward: {total_reward}, Epsilon: {epsilon}")

    print("Training complete.")
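
The training loop above relies on a `replay_buffer` object with `store`, `sample`, and `__len__`, which is not defined in this notebook. A minimal uniform-sampling buffer with that interface might look like the sketch below; the class name `ReplayBuffer` and the `capacity` parameter are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # Once the buffer is full, the oldest experiences are dropped first
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, regrouped field by field
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Small usage example with placeholder transitions
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store([step], step % 2, 1.0, [step + 1], False)
sampled = buffer.sample(2)
```

`sample` returns five tuples in the same order that `learn` unpacks them, so the batch can be passed to the agent directly.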