# Overview
## State Space
- The state space consists of an 80,000-cell grid, representing different geographical locations in Puerto Rico.
- Each cell has attributes like solar PV output, wind power density, elevation, slope, cyclone risk score, building density, road density, and distance to transmission lines.
- Approximately 70% of cells are unavailable for development due to environmental or other constraints.

## Action Space
- Two types of actions are available: building a solar array or a wind turbine.
- Actions can be taken on any available cell.

## Rewards and Costs
The reward function should incorporate:
- Energy production potential (solar and wind).
- Costs or penalties associated with building on certain terrains (e.g., high elevation or steep slopes).
- Penalties for building in high cyclone risk areas.
- Incentives for maintaining a balance between solar and wind energy.
- Incentives for early deployment and distributed grid development.
- Penalties for high building or road density areas.
- Penalties that grow with distance to existing transmission lines.
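
One plausible way to combine these terms is a weighted sum of min-max normalized values. The sketch below is an assumption about how weights and bounds could interact, not the author's final formula; `normalize` and `weighted_reward` are hypothetical helpers and the numbers are purely illustrative.

```python
def normalize(value, lower, upper):
    """Min-max scale value into [0, 1], clamping values outside the bounds."""
    if upper <= lower:
        return 0.0
    return min(max((value - lower) / (upper - lower), 0.0), 1.0)

def weighted_reward(terms, weights, bounds):
    """Sum weight * normalized value over all terms."""
    return sum(
        weights[name] * normalize(value, *bounds[name])
        for name, value in terms.items()
    )

# Illustrative numbers only: negative weights act as penalties
weights = {"cyclone_risk": -0.7, "wind_solar_balance": 1.0}
bounds = {"cyclone_risk": (0.0, 10.0), "wind_solar_balance": (0.0, 1.0)}
terms = {"cyclone_risk": 5.0, "wind_solar_balance": 0.8}
reward = weighted_reward(terms, weights, bounds)  # -0.7 * 0.5 + 1.0 * 0.8
```

Normalizing first keeps terms with large raw ranges (such as distances) from dominating terms that are already on a small scale.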

## RL Model
- Model Choice: Given the size of the state space, a model-free deep RL algorithm with function approximation (such as Deep Q-Networks or Actor-Critic methods) is suitable.
- Representation: The state representation should include the current status of each grid cell (whether it has a solar array, a wind turbine, or is vacant) along with its attributes.
- Sequence of Actions: The RL agent will sequentially choose actions (where to build next) based on the current state of the grid.
- Terminal State: The agent is done when the environment reaches a certain level of energy capacity or after a fixed number of steps.
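
The representation described above can be sketched as a stack of image-like channels, which is the input shape a convolutional network expects. This assumes the cells can be laid out on a rectangular `(H, W)` raster; `encode_state` and the status codes are hypothetical, and the attribute values are synthetic.

```python
import numpy as np

VACANT, SOLAR, WIND = 0, 1, 2  # hypothetical status codes

def encode_state(status, attributes):
    """Stack one-hot status channels on top of static attribute channels.

    status: (H, W) integer array holding VACANT, SOLAR, or WIND.
    attributes: (C, H, W) float array of per-cell attributes.
    Returns a (3 + C, H, W) float32 array.
    """
    one_hot = np.stack([status == code for code in (VACANT, SOLAR, WIND)])
    return np.concatenate([one_hot.astype(np.float32), attributes.astype(np.float32)])

status = np.zeros((4, 5), dtype=np.int64)
status[1, 2] = SOLAR  # one solar array has been built
attributes = np.random.default_rng(0).random((8, 4, 5))  # 8 synthetic attribute layers
state = encode_state(status, attributes)  # shape (11, 4, 5)
```

Only the three status channels change during an episode; the attribute channels are fixed, so they could be computed once and reused.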

# Implementation Steps
## Environment Setup: 
- Implement the environment to reflect the grid and its dynamics, including applying the binary mask for unavailable cells.
- The step(action) method should update the grid state based on the chosen action and calculate the immediate reward or cost.
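
Besides penalizing invalid actions inside `step`, the binary mask can also be applied before an action is chosen, so the agent never proposes an unavailable cell. The following is a minimal sketch of that alternative; `masked_argmax` is a hypothetical helper and the Q-values are made up.

```python
import numpy as np

def masked_argmax(q_values, valid_mask):
    """Return the index of the highest Q-value among valid actions."""
    if not valid_mask.any():
        raise ValueError("no valid actions remain")
    # Invalid actions get -inf so they can never win the argmax
    masked = np.where(valid_mask, q_values, -np.inf)
    return int(np.argmax(masked))

q_values = np.array([0.2, 1.5, 0.9, 1.1])
valid_mask = np.array([True, False, True, True])  # action 1 is masked out
best = masked_argmax(q_values, valid_mask)
```

Here the unmasked argmax would pick action 1, while the masked version falls back to the best remaining action.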

## Agent Development:
- Use PyTorch for implementing the neural network models for the agent.
- The agent needs to learn a policy that maximizes long-term rewards, considering the complex reward structure and large state space.

## Training and Evaluation:
- Set up a training loop where the agent interacts with the environment, receives feedback, and improves its policy.
- Periodically evaluate the agent's performance, possibly using separate evaluation episodes or metrics like total energy capacity achieved or adherence to environmental constraints.
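
Separate evaluation episodes can be run with exploration switched off. The sketch below assumes an environment with `reset()` and `step(action)` returning `(state, reward, done, info)`; `evaluate` and `CountingEnv` are hypothetical names, and the toy environment exists only to make the example runnable.

```python
def evaluate(policy, env, episodes, max_steps=1000):
    """Run episodes with a fixed policy and return the mean undiscounted return."""
    returns = []
    for _ in range(episodes):
        state = env.reset()
        total, done, steps = 0.0, False, 0
        while not done and steps < max_steps:
            state, reward, done, _ = env.step(policy(state))
            total += reward
            steps += 1
        returns.append(total)
    return sum(returns) / len(returns)

class CountingEnv:
    """Toy stand-in environment: reward 1 per step, done after 3 steps."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, {}

mean_return = evaluate(lambda state: 0, CountingEnv(), episodes=5)
```

The `max_steps` cap guards against evaluation episodes that never reach a terminal state.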

## Hyperparameter Tuning:
- Adjust learning rates, exploration rates, discount factors, and network architecture as needed to improve performance.
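
When tuning the exploration rate, it helps to know how long a multiplicative decay takes to reach its floor. The helper below is a small sketch for that calculation; `episodes_until_floor` is a hypothetical name and the values passed in are examples, not recommended settings.

```python
import math

def episodes_until_floor(epsilon_start, epsilon_end, epsilon_decay):
    """Episodes of multiplicative decay needed for epsilon to reach epsilon_end."""
    # Solve epsilon_start * epsilon_decay ** n <= epsilon_end for the smallest integer n
    return math.ceil(math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay))

n = episodes_until_floor(epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.995)
```

If `n` is larger than the planned number of training episodes, the agent never reaches its minimum exploration rate, which suggests a faster decay or a longer run.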
 
## Scalability:
Due to the large state space, it may be necessary to:
- use function approximation for value functions
- prioritize important experiences in the replay buffer
- parallelize computation
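
Prioritizing experiences can be sketched as sampling buffer indices in proportion to a priority such as the absolute TD error. This is a simplified illustration: it samples with replacement and omits the importance-sampling correction that a full prioritized replay scheme would apply. `sample_prioritized` and the exponent `alpha=0.6` are assumptions.

```python
import random

def sample_prioritized(priorities, batch_size, alpha=0.6, rng=random):
    """Sample buffer indices with probability proportional to priority ** alpha."""
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)

rng = random.Random(0)  # seeded for reproducibility
priorities = [0.1, 0.1, 5.0, 0.1]  # e.g. absolute TD errors; index 2 is the outlier
indices = sample_prioritized(priorities, batch_size=1000, rng=rng)
share_of_index_2 = indices.count(2) / len(indices)
```

Setting `alpha=0` recovers uniform sampling, so the exponent controls how strongly the buffer favors high-error transitions.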

## Visualization and Analysis:
- Develop tools to visualize the evolving grid layout and analyze the trade-offs made by the RL agent between different objectives (like energy maximization vs. environmental constraints).
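
Before investing in a full plotting tool, a text rendering can be enough to inspect small grids during debugging. The sketch below assumes each cell is summarized by one of four status strings; `SYMBOLS` and `render_grid` are hypothetical.

```python
# Hypothetical one-character symbols for each cell status
SYMBOLS = {"masked": "#", "vacant": ".", "solar": "S", "wind": "W"}

def render_grid(cells, width):
    """Render a flat list of cell statuses as rows of single characters."""
    rows = [cells[i:i + width] for i in range(0, len(cells), width)]
    return "\n".join("".join(SYMBOLS[c] for c in row) for row in rows)

cells = ["masked", "vacant", "solar", "vacant", "wind", "masked"]
text = render_grid(cells, width=3)
print(text)
```

The same function could back the environment's `render` method for quick checks, with a map-based view reserved for the full 80,000-cell grid.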

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import random
import numpy as np

# Environment

In [2]:
class RenewableEnergyEnvironment:
    def __init__(self, grid_df, bounds=None):
        # Initialize the environment
        self.grid_df = grid_df
        self.state = None
        self.total_energy_output = 0
        # Weight of each reward term; negative weights act as penalties
        self.weights = {
            'transmission_line_distance': -1.0,
            'wind_solar_balance': 1.0,
            'building_density': -0.5,
            'road_density': -0.3,
            'cyclone_risk': -0.7,
            'early_choice_reward': 1.0,
            'distributed_grid_reward': 1.0,
        }
        # (min, max) range of each reward term. The maxima depend on the dataset,
        # so the caller supplies them as a dict keyed like self.weights, e.g.
        # {'cyclone_risk': (0, max_cyclone_risk), 'road_density': (0, max_road_density), ...}
        self.bounds = bounds if bounds is not None else {}
        

    def reset(self):
        # Reset the environment to the initial state
        self.state = self.grid_df.copy()
        self.total_energy_output = 0
        return self.state

    def step(self, action):
        # Apply the action to the environment and return the result
        # action: Tuple (cell_index, action_type) where action_type could be 'solar' or 'wind'
        cell_index, action_type = action

        # Check if the action is valid
        if not self.is_valid_action(cell_index, action_type):
            reward = -1  # Penalty for invalid action
            done = self.is_terminal_state()
            return self.state, reward, done, {}

        # Apply the action
        self.apply_action(cell_index, action_type)

        # Calculate reward
        reward = self.calculate_reward(self.state)

        # Update total energy output or other state attributes as needed
        self.total_energy_output += self.calculate_energy_output(cell_index, action_type)

        # Check if the state is terminal
        done = self.is_terminal_state()

        return self.state, reward, done, {}

    def is_valid_action(self, cell_index, action_type):
        # An action is valid if the installation type is known and the cell
        # is neither masked nor already occupied
        if action_type not in ('solar', 'wind'):
            return False
        # Label-based lookup, consistent with the .at[] writes in apply_action
        cell = self.state.loc[cell_index]
        return not cell['masked'] and not cell['occupied']

    def apply_action(self, cell_index, action_type):
        # Implement the changes to the environment based on the action
        # Example: Mark the cell as occupied and record the type of installation
        self.state.at[cell_index, 'occupied'] = True
        self.state.at[cell_index, 'installation_type'] = action_type

    def calculate_reward(self, state):
        # Running totals for each per-cell cost and reward term
        solar_install_cost = 0
        wind_install_cost = 0
        solar_power_reward = 0
        wind_power_reward = 0
        transmission_loss_cost = 0
        transmission_build_cost = 0
        cyclone_risk_cost = 0
        idx = None  # Remains None only if the grid has no rows

        # The bare calculate_* names below resolve to the module-level helper functions
        for idx in state.index:
            # Calculate individual costs and rewards

            # Solar installation cost
            solar_install_cost += calculate_solar_install_cost(state, idx, self.bounds)

            # Wind turbine installation cost
            wind_install_cost += calculate_wind_install_cost(state, idx, self.bounds)

            # Solar power reward
            solar_power_reward += calculate_solar_power_reward(state, idx, self.bounds)

            # Wind power reward
            wind_power_reward += calculate_wind_power_reward(state, idx, self.bounds)

            # Transmission loss cost
            transmission_loss_cost += calculate_transmission_loss_cost(state, idx, self.bounds)

            # Transmission build cost
            transmission_build_cost += calculate_transmission_build_cost(state, idx, self.bounds)

            # Cyclone risk cost
            cyclone_risk_cost += calculate_cyclone_risk_cost(state, idx, self.bounds)

        # Distribution reward (grid-level; idx is the last cell visited by the loop)
        distributed_grid_reward = calculate_distributed_grid_reward(state, idx, self.bounds)

        # Early choice reward
        early_choice_reward = calculate_early_choice_reward(state, idx, self.bounds)

        costs_rewards = (solar_install_cost,
                         wind_install_cost,
                         solar_power_reward,
                         wind_power_reward,
                         transmission_loss_cost,
                         transmission_build_cost,
                         cyclone_risk_cost,
                         distributed_grid_reward,
                         early_choice_reward)

        total_reward = calculate_total_reward(costs_rewards, self.weights)

        return total_reward

    def calculate_energy_output(self, cell_index, action_type):
        # Calculate the energy output for the action
        # Example: Different output for solar and wind
        # Label-based lookup, consistent with the .at[] writes in apply_action
        if action_type == 'solar':
            return self.state.at[cell_index, 'solar_output']
        elif action_type == 'wind':
            return self.state.at[cell_index, 'wind_output']
        return 0

    # The methods below are thin wrappers around the module-level helper
    # functions of the same name, passing in the environment's state and bounds.

    def calculate_solar_install_cost(self, idx):
        """Cost of installing solar array on cell. Based on elevation and slope"""
        return calculate_solar_install_cost(self.state, idx, self.bounds)

    def calculate_wind_install_cost(self, idx):
        """Cost of installing wind turbine on cell. Based on elevation and slope"""
        return calculate_wind_install_cost(self.state, idx, self.bounds)

    def calculate_solar_power_reward(self, idx):
        """Uses GHI and demand curve to determine the amount of demand satisfied by
        the solar installation"""
        return calculate_solar_power_reward(self.state, idx, self.bounds)

    def calculate_wind_power_reward(self, idx):
        """Uses Wind Speed and demand curve to determine the amount of demand
        satisfied by the wind installation"""
        return calculate_wind_power_reward(self.state, idx, self.bounds)

    def calculate_transmission_loss_cost(self, idx):
        """Energy lost due to transmission loss. Uses building density to determine
        demand. Calculates average distance the power from the installation will
        need to travel. Uses this distance to calculate transmission cost"""
        return calculate_transmission_loss_cost(self.state, idx, self.bounds)

    def calculate_transmission_build_cost(self, idx):
        """Cost of building new transmission infrastructure to serve all installations.
        Based on min(distance to nearest transmission line,
                     distance to nearest previous installation)
        And cost per km of new transmission lines"""
        return calculate_transmission_build_cost(self.state, idx, self.bounds)

    def calculate_cyclone_risk_cost(self, idx):
        """Increase cost proportional to chance of being destroyed. Uses cyclone risk
        score at grid cell and type of installation."""
        return calculate_cyclone_risk_cost(self.state, idx, self.bounds)

    def calculate_distributed_grid_reward(self, idx):
        """Reward proportional to average distance between installations."""
        return calculate_distributed_grid_reward(self.state, idx, self.bounds)

    def calculate_early_choice_reward(self, idx):
        """Rewards early installations -- this is a multiplying factor that
        scales the total reward based on how early the action is taken"""
        return calculate_early_choice_reward(self.state, idx, self.bounds)

    def calculate_total_reward(self, costs_rewards):
        """Calculate total cost/reward based on the individual cost/reward terms
        and their weights"""
        return calculate_total_reward(costs_rewards, self.weights)

    def is_terminal_state(self):
        # Define the terminal condition
        # Example: Terminal state when a certain total energy output is reached
        required_energy_output = 10000  # Example value
        return self.total_energy_output >= required_energy_output

    def render(self):
        # Optional: Implement a method to visualize the current state of the environment
        pass

# Neural Network Architecture

In [3]:
class DQN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        self.input_shape = input_shape  # Store input_shape for feature size calculation
        self.fc1 = nn.Linear(self._feature_size(), 512)
        self.fc2 = nn.Linear(512, num_actions)

    def _feature_size(self):
        with torch.no_grad():  # No need to track gradients here
            return self.conv3(self.conv2(self.conv1(torch.zeros(1, *self.input_shape)))).view(1, -1).size(1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
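
The `_feature_size` helper finds the flattened feature count by running a dummy forward pass. The same number can be derived by hand from the unpadded convolution formula, `out = (size - kernel) // stride + 1`, applied once per layer. The sketch below works through that arithmetic; `conv_out` and `dqn_feature_size` are hypothetical helpers, and the 84x84 input is only an example size.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1

def dqn_feature_size(height, width, channels=64):
    """Flattened feature count after the three conv layers defined in DQN."""
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return channels * height * width

features = dqn_feature_size(84, 84)  # 84 -> 20 -> 9 -> 7, so 64 * 7 * 7
```

Working backwards through the same formula, each spatial dimension of the input must be at least 36 for the final layer to produce any output, which constrains how the grid can be rasterized for this architecture.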

# DQN Agent

In [4]:
class DQNAgent:
    def __init__(self, state_space, action_space, learning_rate=1e-3, gamma=0.99):
        self.state_space = state_space
        self.action_space = action_space
        self.gamma = gamma  # Discount factor for future rewards
        self.model = DQN(state_space.shape, action_space.n)
        # Optimizer used in learn(); Adam and the default learning rate are assumptions
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)

    def select_action(self, state, epsilon):
        if random.random() > epsilon:
            # Choose the best action (exploitation)
            with torch.no_grad():
                # Add a batch dimension before the forward pass
                state_tensor = torch.as_tensor(np.asarray(state), dtype=torch.float32).unsqueeze(0)
                q_values = self.model(state_tensor)
                action = q_values.max(1)[1].item()  # Select the action with the highest Q-value
        else:
            # Choose a random action (exploration)
            action = random.randrange(self.action_space.n)

        return action

    def learn(self, batch):
        states, actions, rewards, next_states, dones = batch

        # Convert to PyTorch tensors (via NumPy to avoid slow list-of-array conversion)
        states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        actions = torch.as_tensor(np.asarray(actions), dtype=torch.long)
        rewards = torch.as_tensor(np.asarray(rewards), dtype=torch.float32)
        next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
        dones = torch.as_tensor(np.asarray(dones), dtype=torch.float32)

        # Compute Q values for the actions that were actually taken
        current_q_values = self.model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        # Bootstrapped targets are treated as constants, so no gradients are tracked
        with torch.no_grad():
            next_q_values = self.model(next_states).max(1)[0]
        expected_q_values = rewards + self.gamma * (1 - dones) * next_q_values

        # Compute loss
        loss = F.mse_loss(current_q_values, expected_q_values)

        # Backpropagation
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
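
The bootstrapped target in `learn` is the one-step TD target. Written with an explicit discount factor `gamma` (one of the hyperparameters listed under Hyperparameter Tuning), it is `y = r + gamma * (1 - done) * max_a' Q(s', a')`. The NumPy sketch below works through that formula on made-up numbers; `td_targets` is a hypothetical helper, not part of the agent.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma):
    """One-step TD target for each transition in a batch."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    # Best achievable Q-value in each next state
    max_next_q = np.asarray(next_q_values, dtype=np.float64).max(axis=1)
    # Terminal transitions (done=1) keep only the immediate reward
    return rewards + gamma * (1.0 - dones) * max_next_q

targets = td_targets(
    rewards=[1.0, -1.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.9,
)
```

The first transition bootstraps from the next state (1.0 + 0.9 * 2.0), while the second is terminal and contributes only its reward.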


# Training Loop

In [5]:
def train(agent, environment, episodes, epsilon_start, epsilon_end, epsilon_decay, replay_buffer, batch_size):
    epsilon = epsilon_start
    for episode in range(episodes):
        state = environment.reset()
        done = False
        total_reward = 0  # To keep track of total reward per episode

        while not done:
            action = agent.select_action(state, epsilon)
            next_state, reward, done, _ = environment.step(action)

            # Store experience in replay buffer
            replay_buffer.store(state, action, reward, next_state, done)

            # Check if buffer is ready for sampling
            if len(replay_buffer) > batch_size:
                # Sample a batch from replay buffer
                batch = replay_buffer.sample(batch_size)
                # Learn from the sampled experiences
                agent.learn(batch)

            # Update state
            state = next_state
            total_reward += reward

        # Decay epsilon
        epsilon = max(epsilon_end, epsilon_decay * epsilon)  # Ensure epsilon doesn't go below the minimum

        # Optional: Log training progress
        print(f"Episode {episode + 1}/{episodes}, Total Reward: {total_reward}, Epsilon: {epsilon}")

    print("Training complete.")
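
The loop above expects a `replay_buffer` object with `store`, `sample`, and `__len__`, which is not defined in this notebook. A minimal uniform-sampling version could look like the following sketch; the class name, the `capacity` parameter, and the optional `seed` are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.buffer), batch_size)
        # Transpose transitions into (states, actions, rewards, next_states, dones)
        return tuple(map(list, zip(*batch)))

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.store(step, 0, 1.0, step + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(2)
```

The tuple returned by `sample` unpacks in the same order that `DQNAgent.learn` expects.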

# Calculating Final Reward

In [22]:
# The functions below are stubs: each documents the intended calculation and
# raises NotImplementedError until the calculation is written.

def calculate_solar_install_cost(grid_gdf, idx, bounds):
    """Cost of installing solar array on cell. Based on elevation and slope"""
    raise NotImplementedError("solar install cost is not implemented yet")

def calculate_wind_install_cost(grid_gdf, idx, bounds):
    """Cost of installing wind turbine on cell. Based on elevation and slope"""
    raise NotImplementedError("wind install cost is not implemented yet")

def calculate_solar_power_reward(grid_gdf, idx, bounds):
    """Uses GHI and demand curve to determine the amount of demand satisfied by
    the solar installation"""
    raise NotImplementedError("solar power reward is not implemented yet")

def calculate_wind_power_reward(grid_gdf, idx, bounds):
    """Uses Wind Speed and demand curve to determine the amount of demand
    satisfied by the wind installation"""
    raise NotImplementedError("wind power reward is not implemented yet")

def calculate_transmission_loss_cost(grid_gdf, idx, bounds):
    """Energy lost due to transmission loss. Uses building density to determine
    demand. Calculates average distance the power from the installation will
    need to travel. Uses this distance to calculate transmission cost"""
    raise NotImplementedError("transmission loss cost is not implemented yet")

def calculate_transmission_build_cost(grid_gdf, idx, bounds):
    """Cost of building new transmission infrastructure to serve all installations.
    Based on min(distance to nearest transmission line,
                 distance to nearest previous installation)
    And cost per km of new transmission lines"""
    raise NotImplementedError("transmission build cost is not implemented yet")

def calculate_cyclone_risk_cost(grid_gdf, idx, bounds):
    """Increase cost proportional to chance of being destroyed. Uses cyclone risk
    score at grid cell and type of installation."""
    raise NotImplementedError("cyclone risk cost is not implemented yet")

def calculate_distributed_grid_reward(grid_gdf, idx, bounds):
    """Reward proportional to average distance between installations."""
    raise NotImplementedError("distributed grid reward is not implemented yet")

def calculate_early_choice_reward(grid_gdf, idx, bounds):
    """Rewards early installations -- this is a multiplying factor that
    scales the total reward based on how early the action is taken"""
    raise NotImplementedError("early choice reward is not implemented yet")

def calculate_total_reward(costs_rewards, weights):
    """Calculate total cost/reward based on the individual cost/reward terms
    and their weights"""
    raise NotImplementedError("total reward is not implemented yet")
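
As one example of how a stub could be filled in, the distributed grid reward is described as proportional to the average distance between installations. The sketch below computes that average for a list of coordinates. It assumes installations can be reduced to `(x, y)` points in a projected coordinate system; `mean_pairwise_distance` is a hypothetical helper, and the coordinates are made up.

```python
from itertools import combinations
from math import dist

def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0  # fewer than two installations, so no spread to reward
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

installations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
spread = mean_pairwise_distance(installations)  # pair distances are 5, 10, and 5
```

The pairwise loop is quadratic in the number of installations, so a running update per new installation may be preferable if this is evaluated at every step.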