## Capstone 
The goal is to reshape the load files from a wide layout (one row per date with 24 hourly columns) into a long, vertical layout (one row per date and hour).
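
As a quick illustration of the wide-to-long reshape, here is a minimal sketch on a made-up two-day sample. The values, the `Load` column name, and the use of only three hour columns are assumptions for brevity; the real sheets carry `H1` through `H24`.

```python
import pandas as pd

# Hypothetical sample: one row per date, one column per hour (only H1..H3 shown)
wide_df = pd.DataFrame({
    "Dt": pd.to_datetime(["2000-01-01", "2000-01-02"]),
    "H1": [193.0, 180.0],
    "H2": [193.0, 176.0],
    "H3": [184.0, 171.0],
})

# Unpivot the hour columns so each row is a single (date, hour) observation
long_df = wide_df.melt(id_vars=["Dt"], var_name="Hour", value_name="Load")

# Turn 'H1', 'H2', ... into integers and order by date, then hour
long_df["Hour"] = long_df["Hour"].str.lstrip("H").astype(int)
long_df = long_df.sort_values(["Dt", "Hour"]).reset_index(drop=True)

print(long_df)
```

Two dates with three hour columns become six rows, one per date and hour.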

### From sponsor
- Data is the load data we receive with solar included.
- DataLessSolar is the load data we receive without solar.
- SolarSanitized contains formulas that subtract DataLessSolar from Data, which gives the solar contribution.
- Data Simple is the "Data" sheet with the decimals removed for simplicity.
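
The relationship between the first three sheets (solar = Data minus DataLessSolar) could be spot-checked along these lines. This is only a sketch on invented numbers: the three small frames below are hypothetical stand-ins for the real sheets, and it assumes the rows of all three sheets are in the same date order.

```python
import numpy as np
import pandas as pd

hour_cols = ["H1", "H2", "H3"]  # the real sheets have H1..H24

# Hypothetical one-day stand-ins for the three sheets
data = pd.DataFrame({"Dt": ["2010-04-01"], "H1": [200.0], "H2": [210.5], "H3": [220.0]})
data_less_solar = pd.DataFrame({"Dt": ["2010-04-01"], "H1": [200.0], "H2": [208.0], "H3": [215.5]})
solar = pd.DataFrame({"Dt": ["2010-04-01"], "H1": [0.0], "H2": [2.5], "H3": [4.5]})

# Recompute the solar contribution and compare it with the provided sheet,
# using a tolerance because the spreadsheet values are floats
recomputed = data[hour_cols] - data_less_solar[hour_cols]
matches = bool(np.isclose(recomputed.to_numpy(), solar[hour_cols].to_numpy()).all())

print("Solar sheet matches Data - DataLessSolar:", matches)
```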

### We will not use Data Simple, since there is no reason to remove the decimals

In [None]:
import pandas as pd
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

# File path
file_path = 'LoadBFY2025.xlsx'

# Load specific sheets into separate DataFrames
basic_data = pd.read_excel(file_path, sheet_name='Data')
basic_data_no_solar = pd.read_excel(file_path, sheet_name='DataLessSolar')
solar_data = pd.read_excel(file_path, sheet_name='Solarsanitized') 

# Print the first few rows for verification
print("Basic Load Data:")
print(basic_data.head(2))

print("\nBasic Load Data w/out Solar:")
print(basic_data_no_solar.head(2))

print("\nSolar only Data:")
print(solar_data.head(2))

# Store all sheets in a dictionary for easy reference and analysis later
all_load_sheets = {
    "BasicLoad": basic_data,
    "BasicLoadNoSolar": basic_data_no_solar,
    "SolarData": solar_data,
}

# Print the keys of the dictionary to verify
print("Loaded Data Sheets:")
print(all_load_sheets.keys())


In [5]:
# List of sheets that require transformation
sheets_to_transform = ['BasicLoad', 'BasicLoadNoSolar', 'SolarData']

# Dictionary to store transformed DataFrames
transformed_load_sheets = {}

# Transform the data for each specified sheet
for sheet_name in sheets_to_transform:
    # Access the sheet from the previously loaded dictionary
    data = all_load_sheets[sheet_name]
    
    # Reshape the data from wide to long format
    data_long = data.melt(
        id_vars=['Dt'],  # Columns to keep as is
        value_vars=[f'H{i}' for i in range(1, 25)],  # Columns to unpivot
        var_name='Hour',  # Name of the new column for hours
        value_name=sheet_name  # Name of the new column for the values
    )
    
    # Convert the 'Hour' column from 'H1', 'H2', etc., to integers (1-24)
    # (raw string avoids an invalid '\d' escape; expand=False returns a Series)
    data_long['Hour'] = data_long['Hour'].str.extract(r'H(\d+)', expand=False).astype(int)
    
    # Sort the data by date (`Dt`) and hour (`Hour`)
    data_long = data_long.sort_values(by=['Dt', 'Hour']).reset_index(drop=True)
    
    # Store the transformed DataFrame
    transformed_load_sheets[sheet_name] = data_long

    # Print a preview of the transformed data for verification
    print(f"\nTransformed Data for {sheet_name}:")
    print(data_long.head(10))



Transformed Data for BasicLoad:
          Dt  Hour  BasicLoad
0 1989-01-01     1      142.0
1 1989-01-01     2      124.0
2 1989-01-01     3      117.0
3 1989-01-01     4      112.0
4 1989-01-01     5      112.0
5 1989-01-01     6      114.0
6 1989-01-01     7      117.0
7 1989-01-01     8      127.0
8 1989-01-01     9      140.0
9 1989-01-01    10      161.0

Transformed Data for BasicLoadNoSolar:
          Dt  Hour  BasicLoadNoSolar
0 1989-01-01     1             142.0
1 1989-01-01     2             124.0
2 1989-01-01     3             117.0
3 1989-01-01     4             112.0
4 1989-01-01     5             112.0
5 1989-01-01     6             114.0
6 1989-01-01     7             117.0
7 1989-01-01     8             127.0
8 1989-01-01     9             140.0
9 1989-01-01    10             161.0

Transformed Data for SolarData:
          Dt  Hour  SolarData
0 1989-01-01     1        0.0
1 1989-01-01     2        0.0
2 1989-01-01     3        0.0
3 1989-01-01     4        0.0
4 1989-

In [6]:
# Specify the start date for filtering
start_date = "2000-01-01"

# Filter and prepare all transformed data for merging
for sheet_name in transformed_load_sheets:
    # Filter rows where the date is >= January 1, 2000
    transformed_load_sheets[sheet_name] = transformed_load_sheets[sheet_name][
        transformed_load_sheets[sheet_name]['Dt'] >= start_date
    ]

# Start with the first sheet for merging
merged_data = transformed_load_sheets['BasicLoad']

# Merge all DataFrames on common columns: 'Dt' and 'Hour'
for sheet_name in ['BasicLoadNoSolar', 'SolarData']:
    merged_data = pd.merge(
        merged_data,
        transformed_load_sheets[sheet_name],
        on=['Dt', 'Hour'],
        how='inner'
    )

# Drop unnecessary columns after merging
columns_to_drop = ['CompID']  # Add any other irrelevant columns here if needed
merged_data = merged_data.drop(columns=columns_to_drop, errors='ignore')

# Display the first few rows of the updated data
print(merged_data.head())

# Save the output to a CSV file
merged_data.to_csv('merged_load_data_verticalFormat_cleaned.csv', index=False)


          Dt  Hour  BasicLoad  BasicLoadNoSolar  SolarData
0 2000-01-01     1      193.0             193.0        0.0
1 2000-01-01     2      193.0             193.0        0.0
2 2000-01-01     3      184.0             184.0        0.0
3 2000-01-01     4      175.0             175.0        0.0
4 2000-01-01     5      172.0             172.0        0.0
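
One sanity check that might be worth running on the merged long-format data is whether every date still has exactly 24 hours and no duplicate (date, hour) pairs. The sketch below uses a synthetic two-day frame with one hour deliberately removed; the frame and the variable names are hypothetical, and only the `Dt` and `Hour` column names come from the notebook.

```python
import pandas as pd

# Synthetic two-day frame with 24 hours per day
dates = pd.date_range("2000-01-01", periods=2, freq="D")
sample = pd.DataFrame(
    [(d, h) for d in dates for h in range(1, 25)],
    columns=["Dt", "Hour"],
)

# Remove hour 6 of the first day to simulate a gap
sample = sample.drop(index=5)

# Dates that do not have all 24 distinct hours
hours_per_day = sample.groupby("Dt")["Hour"].nunique()
incomplete = hours_per_day[hours_per_day != 24]

# Number of repeated (date, hour) pairs
duplicates = int(sample.duplicated(subset=["Dt", "Hour"]).sum())

print(incomplete)
print("Duplicate rows:", duplicates)
```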


### Solar did not start until later; the cell below finds the first date with a non-zero value

In [7]:
# Find the first row where the SolarData column is non-zero
first_non_zero_row = merged_data[merged_data['SolarData'] != 0].iloc[0]

# Extract the date corresponding to the first non-zero value
first_non_zero_date = first_non_zero_row['Dt']

# Display the result
print(f"The first non-zero value in SolarData occurs on: {first_non_zero_date}")


The first non-zero value in SolarData occurs on: 2010-04-01 00:00:00
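
A slightly more defensive version of this lookup is sketched below. The `first_nonzero_date` helper and the sample frame are hypothetical, not part of the notebook. It treats missing values as zero (a plain `!= 0` comparison counts `NaN` as non-zero) and returns `None` instead of raising when the column never leaves zero.

```python
import pandas as pd

# Hypothetical long-format sample
sample = pd.DataFrame({
    "Dt": pd.to_datetime(["2010-03-31", "2010-03-31", "2010-04-01", "2010-04-01"]),
    "Hour": [1, 2, 1, 2],
    "SolarData": [0.0, None, 0.0, 1.2],
})


def first_nonzero_date(df, column="SolarData"):
    """Return the earliest Dt with a non-zero, non-missing value, or None."""
    nonzero = df[df[column].fillna(0) != 0]
    if nonzero.empty:
        return None
    return nonzero["Dt"].min()


print(first_nonzero_date(sample))
```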


## Merging temperature and loads for Sam to utilize

In [8]:
# Load the load data file
load_data = pd.read_csv('merged_load_data_verticalFormat_cleaned.csv')

# Load the weather data file
# (confirm this relative path is correct now that the file lives in the 'weather_files' folder)
weather_data = pd.read_csv('../weather_files/merged_weather_data_verticalFormat_cleaned.csv')


# Merge the datasets on 'Dt' and 'Hour' columns
merged_data = pd.merge(load_data, weather_data[['Dt', 'Hour', 'Temp']], on=['Dt', 'Hour'], how='inner')

# Display the first few rows of the merged dataset
print(merged_data.head())

# Save the merged dataset to a new file
merged_data.to_csv('merged_load_weather_data.csv', index=False)


           Dt  Hour  BasicLoad  BasicLoadNoSolar  SolarData       Temp
0  2000-01-01     1      193.0             193.0        0.0  58.871429
1  2000-01-01     2      193.0             193.0        0.0  58.257143
2  2000-01-01     3      184.0             184.0        0.0  58.071429
3  2000-01-01     4      175.0             175.0        0.0  58.300000
4  2000-01-01     5      172.0             172.0        0.0  58.371429
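
Because an inner merge silently drops hours that are missing from either file, it may be worth checking which rows fail to match before handing the file over. The sketch below runs on two tiny made-up frames (the values are invented; only the column names follow the notebook). It parses `Dt` as datetimes so both sides compare as dates rather than strings, and uses the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for the load and weather CSVs
load = pd.DataFrame({
    "Dt": ["2000-01-01", "2000-01-01", "2000-01-02"],
    "Hour": [1, 2, 1],
    "BasicLoad": [193.0, 193.0, 188.0],
})
weather = pd.DataFrame({
    "Dt": ["2000-01-01", "2000-01-01"],
    "Hour": [1, 2],
    "Temp": [58.9, 58.3],
})

# Parse dates so the merge keys have the same dtype on both sides
for frame in (load, weather):
    frame["Dt"] = pd.to_datetime(frame["Dt"])

# Outer merge with an indicator column; validate raises if keys repeat on either side
check = load.merge(
    weather, on=["Dt", "Hour"], how="outer", indicator=True, validate="one_to_one"
)
unmatched = check[check["_merge"] != "both"]

print(unmatched[["Dt", "Hour", "_merge"]])
```

Rows tagged `left_only` are load hours without a temperature, and `right_only` rows are temperatures without a load value.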
