# Overview
## State Space
- The state space consists of an 80,000-cell grid, representing different geographical locations in Puerto Rico.
- Each cell has attributes like solar PV output, wind power density, elevation, slope, cyclone risk score, building density, road density, and distance to transmission lines.
- Approximately 70% of cells are unavailable for development due to environmental or other constraints.

## Action Space
- Two types of actions are available: building a solar array or a wind turbine.
- Actions can be taken on any available cell.
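
A value-based agent usually expects a single integer per action, so the two action types need a flat encoding. The sketch below assumes a hypothetical layout in which each cell contributes two consecutive action ids; the names `ACTION_TYPES`, `encode_action`, and `decode_action` are illustrative and do not appear in the notebook.

```python
from typing import Tuple

ACTION_TYPES = ("solar", "wind")  # hypothetical ordering of the two build actions


def encode_action(cell_index: int, action_type: str) -> int:
    """Map (cell_index, action_type) to a flat integer action id."""
    return cell_index * len(ACTION_TYPES) + ACTION_TYPES.index(action_type)


def decode_action(action_id: int) -> Tuple[int, str]:
    """Invert encode_action: recover (cell_index, action_type)."""
    cell_index, type_id = divmod(action_id, len(ACTION_TYPES))
    return cell_index, ACTION_TYPES[type_id]


n_cells = 80_000
n_actions = n_cells * len(ACTION_TYPES)  # size of the flat action space before masking
```

With this encoding, the integer chosen by the agent can be decoded into the `(cell_index, action_type)` tuple that the environment's `step` method expects.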

## Rewards and Costs
The reward function should incorporate:
- Energy production potential (solar and wind).
- Costs or penalties associated with building on certain terrains (e.g., high elevation or steep slopes).
- Penalties for building in high cyclone risk areas.
- Incentives for maintaining a balance between solar and wind energy.
- Incentives for early deployment and distributed grid development.
- Penalties for high building or road density areas.
- Penalties that grow with distance to transmission lines.
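
As one illustration of how a single component could be scored, the sketch below gives a possible form for the solar/wind balance incentive. The formula is an assumption made for this example, not the notebook's definition: it returns 1.0 for an even split of installed capacity and 0.0 when only one source has been built.

```python
def wind_solar_balance(solar_capacity: float, wind_capacity: float) -> float:
    """Return a score in [0, 1]: 1.0 for an even split, 0.0 for a single source."""
    total = solar_capacity + wind_capacity
    if total <= 0:
        return 0.0  # nothing built yet, so there is no balance to reward
    return 1.0 - abs(solar_capacity - wind_capacity) / total
```

A score bounded in [0, 1] can be combined with the other components through a weighted sum without further normalization.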

## RL Model
- Model Choice: Given the size of the state space, a model-free deep RL algorithm (such as Deep Q-Networks or Actor-Critic methods) that relies on function approximation is suitable.
- Representation: The state representation should include the current status of each grid cell (whether it has a solar array, a wind turbine, or is vacant) along with its attributes.
- Sequence of Actions: The RL agent will sequentially choose actions (where to build next) based on the current state of the grid.
- Terminal State: An episode ends when the environment reaches a target level of energy capacity or after a fixed number of steps.
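
The representation described above can be sketched as a channel-first array that a convolutional network could consume. The 2-D grid layout, the status codes (0 = vacant, 1 = solar, 2 = wind), and the function name `encode_state` are assumptions made for illustration.

```python
import numpy as np


def encode_state(attributes: np.ndarray, installation: np.ndarray) -> np.ndarray:
    """Stack static cell attributes with a one-hot installation status.

    attributes:   float array of shape (n_attrs, rows, cols)
    installation: int array of shape (rows, cols); 0 = vacant, 1 = solar, 2 = wind
    returns:      float32 array of shape (n_attrs + 3, rows, cols)
    """
    # One channel per status value, each a 0/1 map over the grid
    status = np.stack([installation == k for k in (0, 1, 2)]).astype(np.float32)
    return np.concatenate([attributes.astype(np.float32), status], axis=0)
```

Only the three status channels change during an episode; the attribute channels stay fixed and could be cached.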

# Implementation Steps
## Environment Setup: 
- Implement the environment to reflect the grid and its dynamics, including applying the binary mask for unavailable cells.
- The step(action) method should update the grid state based on the chosen action and calculate the immediate reward or cost.
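
One common way to apply such a mask is at action-selection time, so that the agent never proposes an unavailable cell. The sketch below assumes a boolean `available` array aligned with the Q-values; `masked_argmax` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def masked_argmax(q_values: np.ndarray, available: np.ndarray) -> int:
    """Pick the highest-valued action among those flagged as available."""
    if not available.any():
        raise ValueError("no available actions")
    # Unavailable actions get -inf so they can never win the argmax
    masked = np.where(available, q_values, -np.inf)
    return int(np.argmax(masked))
```

Masking at selection time may be preferable to a penalty for invalid actions, since roughly 70% of cells are unavailable and random exploration would otherwise spend most steps on them.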

## Agent Development:
- Use PyTorch for implementing the neural network models for the agent.
- The agent needs to learn a policy that maximizes long-term reward, given the complex reward structure and large state space.

## Training and Evaluation:
- Set up a training loop where the agent interacts with the environment, receives feedback, and improves its policy.
- Periodically evaluate the agent's performance, possibly using separate evaluation episodes or metrics like total energy capacity achieved or adherence to environmental constraints.
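
A separate evaluation pass could look like the sketch below. It assumes the same interface the notebook uses elsewhere (`environment.reset()`, `environment.step(action)`, and `agent.select_action(state, epsilon)`); the `evaluate` function and its `max_steps` guard are additions for illustration.

```python
def evaluate(agent, environment, episodes=5, max_steps=1000):
    """Run greedy episodes (epsilon = 0) and return the mean total reward."""
    totals = []
    for _ in range(episodes):
        state = environment.reset()
        total_reward, done, steps = 0.0, False, 0
        # max_steps guards against episodes that never reach a terminal state
        while not done and steps < max_steps:
            action = agent.select_action(state, 0.0)  # no exploration
            state, reward, done, _ = environment.step(action)
            total_reward += reward
            steps += 1
        totals.append(total_reward)
    return sum(totals) / len(totals)
```

Other metrics, such as total energy capacity achieved, could be collected in the same loop.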

## Hyperparameter Tuning:
- Adjust learning rates, exploration rates, discount factors, and network architecture as needed to improve performance.
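
Keeping the tunable values in one place makes sweeps easier to manage. The sketch below is one possible layout; every default is an illustrative placeholder rather than a tuned value, and `Hyperparameters` and `epsilon_at` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Hyperparameters:
    # All defaults are illustrative placeholders, not tuned values
    learning_rate: float = 1e-4
    gamma: float = 0.99          # discount factor
    epsilon_start: float = 1.0   # initial exploration rate
    epsilon_end: float = 0.05    # exploration floor
    epsilon_decay: float = 0.995  # multiplicative decay per episode
    batch_size: int = 64


def epsilon_at(episode: int, hp: Hyperparameters) -> float:
    """Exploration rate after `episode` decay steps, floored at epsilon_end."""
    return max(hp.epsilon_end, hp.epsilon_start * hp.epsilon_decay ** episode)
```

`epsilon_at` gives the closed form of a per-episode multiplicative decay, which helps when checking how many episodes pass before exploration reaches its floor.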
 
## Scalability:
Due to the large state space, it may be necessary to:
- use function approximation for value functions
- prioritize important experiences in the replay buffer
- parallelize computation
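
Prioritizing experiences can be sketched with proportional sampling, where a transition's sampling probability grows with the magnitude of its TD error. The class below is a simplified illustration under stated assumptions: it uses a plain list with linear-time sampling, omits the importance-sampling correction that full prioritized replay applies, and its name and parameters (`alpha`, `eps`) are hypothetical.

```python
import random


class PrioritizedReplayBuffer:
    """Proportional prioritization: P(i) = p_i / sum_k p_k, with p_i = (|td_i| + eps) ** alpha."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []
        self.pos = 0  # next slot to overwrite once the buffer is full

    def store(self, transition):
        # New transitions get the current maximum priority so they are likely to be sampled soon
        priority = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sampling is with replacement, weighted by priority
        indices = random.choices(range(len(self.data)), weights=self.priorities, k=batch_size)
        return indices, [self.data[i] for i in indices]

    def update(self, indices, td_errors):
        # Refresh priorities after the agent recomputes TD errors for a sampled batch
        for i, err in zip(indices, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha

    def __len__(self):
        return len(self.data)
```

For a buffer large enough to matter at this scale, a sum-tree would typically replace the linear-time weighted sampling used here.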

## Visualization and Analysis:
- Develop tools to visualize the evolving grid layout and analyze the trade-offs made by the RL agent between different objectives (like energy maximization vs. environmental constraints).

In [12]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import random
import numpy as np

ModuleNotFoundError: No module named 'torch'

# Environment

In [14]:
import numpy as np
import geopandas as gpd
import random

class RenewableEnergyEnvironment:
    def __init__(self, grid_df):
        self.grid_df = grid_df
        self.state = None
        self.total_energy_output = 0  # Example attribute

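    def reset(self):
        # Reset the environment to the initial state
        self.state = self.grid_df.copy()
        # Make sure the bookkeeping columns used by step() exist
        if 'occupied' not in self.state.columns:
            self.state['occupied'] = False
        if 'installation_type' not in self.state.columns:
            self.state['installation_type'] = None
        self.total_energy_output = 0
        return self.state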

    def step(self, action):
        # Apply the action to the environment and return the result
        # action: Tuple (cell_index, action_type) where action_type could be 'solar' or 'wind'
        cell_index, action_type = action

        # Check if the action is valid
        if not self.is_valid_action(cell_index, action_type):
            reward = -1  # Penalty for invalid action
            done = self.is_terminal_state()
            return self.state, reward, done, {}

        # Apply the action
        self.apply_action(cell_index, action_type)

        # Calculate reward
        reward = self.calculate_reward(cell_index, action_type)

        # Update total energy output or other state attributes as needed
        self.total_energy_output += self.calculate_energy_output(cell_index, action_type)

        # Check if the state is terminal
        done = self.is_terminal_state()

        return self.state, reward, done, {}

    def is_valid_action(self, cell_index, action_type):
        # Implement logic to check if an action is valid
        # Example: Check if the cell is not masked and not already occupied
        # Label-based lookup (.loc) matches the .at indexing used in apply_action
        cell = self.state.loc[cell_index]
        return bool(not cell['masked'] and not cell['occupied'])

    def apply_action(self, cell_index, action_type):
        # Implement the changes to the environment based on the action
        # Example: Mark the cell as occupied and record the type of installation
        self.state.at[cell_index, 'occupied'] = True
        self.state.at[cell_index, 'installation_type'] = action_type

    def calculate_reward(self, cell_index, action_type):
        # Calculate the reward for the current action
        # Example: building density acts as a penalty, so it enters with a negative sign
        cell = self.state.loc[cell_index]
        reward = -cell['building_density']  # Simplified example
        return reward

    def calculate_energy_output(self, cell_index, action_type):
        # Calculate the energy output for the action
        # Example: Different output for solar and wind
        # Label-based lookup (.at) matches the indexing used in apply_action
        if action_type == 'solar':
            return self.state.at[cell_index, 'solar_output']
        elif action_type == 'wind':
            return self.state.at[cell_index, 'wind_output']
        return 0

    def is_terminal_state(self):
        # Define the terminal condition
        # Example: Terminal state when a certain total energy output is reached
        required_energy_output = 10000  # Example value
        return self.total_energy_output >= required_energy_output

    def render(self):
        # Optional: Implement a method to visualize the current state of the environment
        pass


# Neural Network Architecture

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super().__init__()
        # input_shape is (channels, height, width)
        self.conv1 = nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        self.input_shape = input_shape  # Store input_shape for feature size calculation
        self.fc1 = nn.Linear(self._feature_size(), 512)
        self.fc2 = nn.Linear(512, num_actions)

    def _feature_size(self):
        # Pass a dummy input through the conv stack to find the flattened feature size
        with torch.no_grad():  # No need to track gradients here
            x = torch.zeros(1, *self.input_shape)
            x = self.conv3(self.conv2(self.conv1(x)))
            return x.view(1, -1).size(1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)  # Flatten to (batch, features)
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # One Q-value per action

# DQN Agent

In [11]:
class DQNAgent:
    def __init__(self, state_space, action_space, learning_rate=1e-4, gamma=0.99):
        self.state_space = state_space
        self.action_space = action_space
        self.model = DQN(state_space.shape, action_space.n)
        # Example defaults; tune learning_rate and gamma for the task
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)
        self.gamma = gamma  # Discount factor for future rewards

    def select_action(self, state, epsilon):
        if random.random() > epsilon:
            # Choose the best action (exploitation)
            with torch.no_grad():
                # Add a batch dimension: (C, H, W) -> (1, C, H, W)
                state_tensor = torch.as_tensor(np.asarray(state), dtype=torch.float32).unsqueeze(0)
                q_values = self.model(state_tensor)
                action = q_values.argmax(dim=1).item()  # Select the action with the highest Q-value
        else:
            # Choose a random action (exploration)
            action = random.randrange(self.action_space.n)

        return action

    def learn(self, batch):
        states, actions, rewards, next_states, dones = batch

        # Convert to PyTorch tensors (stack through NumPy first to avoid slow list-of-array conversion)
        states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        actions = torch.as_tensor(np.asarray(actions), dtype=torch.long)
        rewards = torch.as_tensor(np.asarray(rewards), dtype=torch.float32)
        next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
        dones = torch.as_tensor(np.asarray(dones), dtype=torch.float32)

        # Compute Q values for the actions that were actually taken
        current_q_values = self.model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # Targets are treated as constants
            next_q_values = self.model(next_states).max(1)[0]
        # Discounted one-step target; terminal transitions keep only the reward
        expected_q_values = rewards + self.gamma * (1 - dones) * next_q_values

        # Compute loss
        loss = torch.nn.functional.mse_loss(current_q_values, expected_q_values)

        # Backpropagation
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
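
To make the target computation in `learn` easier to check by hand, the NumPy sketch below computes standard one-step Q-learning targets outside of PyTorch. The function name `td_targets` and the default `gamma` are assumptions for illustration.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step Q-learning targets: r + gamma * max_a' Q(s', a') for non-terminal transitions."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    # Best achievable value in each next state, shape (batch,)
    best_next = np.asarray(next_q_values, dtype=np.float64).max(axis=1)
    # (1 - dones) zeroes out the bootstrap term for terminal transitions
    return rewards + gamma * (1.0 - dones) * best_next
```

For example, with `gamma=0.5`, a reward of 1.0 and next-state Q-values `[0.0, 10.0]` give a target of 6.0, while a terminal transition with reward 2.0 keeps a target of 2.0.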


# Training Loop

In [13]:
def train(agent, environment, episodes, epsilon_start, epsilon_end, epsilon_decay, replay_buffer, batch_size):
    epsilon = epsilon_start
    for episode in range(episodes):
        state = environment.reset()
        done = False
        total_reward = 0  # To keep track of total reward per episode

        while not done:
            action = agent.select_action(state, epsilon)
            next_state, reward, done, _ = environment.step(action)

            # Store experience in replay buffer
            replay_buffer.store(state, action, reward, next_state, done)

            # Check if buffer is ready for sampling
            if len(replay_buffer) > batch_size:
                # Sample a batch from replay buffer
                batch = replay_buffer.sample(batch_size)
                # Learn from the sampled experiences
                agent.learn(batch)

            # Update state
            state = next_state
            total_reward += reward

        # Decay epsilon
        epsilon = max(epsilon_end, epsilon_decay * epsilon)  # Ensure epsilon doesn't go below the minimum

        # Optional: Log training progress
        print(f"Episode {episode + 1}/{episodes}, Total Reward: {total_reward}, Epsilon: {epsilon}")

    print("Training complete.")
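
The training loop relies on a `replay_buffer` object with `store`, `sample`, and `__len__`, which the notebook does not define. A minimal uniform-sampling buffer that matches that interface might look like the following; the class name `ReplayBuffer` and the `capacity` parameter are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer with uniform sampling; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement, then regroup into per-field tuples
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

The five tuples returned by `sample` line up with the unpacking at the top of `DQNAgent.learn`.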

# Calculating Final Reward

In [5]:
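def normalize(values, min_val=None, max_val=None):
    """Min-max scale values to [0, 1]; fall back to the data range when a bound is missing."""
    values = np.asarray(values, dtype=float)
    min_val = np.nanmin(values) if min_val is None else min_val
    max_val = np.nanmax(values) if max_val is None else max_val
    if max_val == min_val:
        return np.zeros_like(values)  # Constant component carries no information
    return np.clip((values - min_val) / (max_val - min_val), 0.0, 1.0)


def calculate_final_reward(grid_gdf, weights, normalization_bounds):
    """
    Calculate the final reward by combining various costs and rewards.

    :param grid_gdf: GeoDataFrame containing the grid cells with their respective costs and rewards.
    :param weights: Dictionary containing the weights for each cost/reward component.
    :param normalization_bounds: Dictionary containing min and max values for normalization of each component.
    :return: Array of final reward values for each grid cell.
    """
    final_reward = np.zeros(len(grid_gdf))

    for component, weight in weights.items():
        component_values = grid_gdf[component].to_numpy()
        min_val, max_val = normalization_bounds.get(component, (None, None))
        normalized_values = normalize(component_values, min_val, max_val)
        # Negative weights turn a component into a penalty
        final_reward += weight * normalized_values

    return final_reward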

# Example usage
weights = {
    'transmission_line_distance': -1.0,
    'wind_solar_balance': 1.0,
    'building_density': -0.5,
    'road_density': -0.3,
    'cyclone_risk': -0.7,
    'early_choice_reward': 1.0,
    'distributed_grid_reward': 1.0,
    # Add other components here
}

# Assumption: upper bounds are taken from the data (observed column maximum).
# Replace them with fixed domain-specific maxima if those are known.
normalization_bounds = {
    component: (0, grid_gdf[component].max()) for component in weights
}

grid_gdf['final_reward'] = calculate_final_reward(grid_gdf, weights, normalization_bounds)

In [6]:
def calculate_transmission_cost(state, cell_index, global_params):
    # Logic to calculate cost based on the entire state
    pass

def calculate_wind_solar_balance(state, cell_index, global_params):
    # Logic to calculate balance
    pass

# ... and so on for other functions


In [7]:
def calculate_costs_rewards(grid_gdf, global_params):
    # Component functions share the signature (state, cell_index, global_params).
    # calculate_building_density_cost is assumed to be defined alongside the stubs above.
    component_functions = [
        calculate_transmission_cost,
        calculate_wind_solar_balance,
        calculate_building_density_cost,
        # ... other components
    ]

    for idx in grid_gdf.index:
        # Calculate individual costs and rewards
        values = [fn(grid_gdf, idx, global_params) for fn in component_functions]

        # Aggregate costs and rewards; an unimplemented stub that returns None counts as 0
        total_cost_reward = sum(v for v in values if v is not None)

        # Store the result in the grid DataFrame
        grid_gdf.at[idx, 'total_cost_reward'] = total_cost_reward

    return grid_gdf
