## Output
1. Aggregate net load of 171 households from the Australian EDP dataset, which is recorded 5-minutely from 24 Dec 2018 to 30 Sep 2024. Only datetime and netload_kW are kept, for 1 Jul 2021 to 30 Jun 2024. The households are located close to each other in NSW.

## Input
1. EDP data from Jan 2019 to Sep 2024 (71 files), 951 households, with varying first and last recording dates.
   
## Note
1. Input is in the form of csv, with columns edp_site_id,unix_time,datetime,edp_device_and_circuit,circuit_label,edp_circuit_label,real_energy,real_energy_negative,real_energy_positive,reactive_energy,current_avg,current_min,current_max,voltage_avg,voltage_min,voltage_max
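2. No missing-value handling for the raw files: real_energy was checked and contains no NaNs (see the Archive section). Timestamps missing from the aggregated series are forward-filled.
3. No outlier handling: the data are assumed to contain no outliers.
4. Steps
a. For each input csv, read only the relevant columns (edp_site_id, unix_time, edp_device_and_circuit, circuit_label, real_energy) and keep only the rows where circuit_label is ac_load_net.
b. Keep only the sites in the cluster list and sum real_energy per unix_time.
c. Convert to datetime and kW, trim to 1 Jul 2021 to 30 Jun 2024, drop duplicate timestamps, and forward-fill missing timestamps.
d. Export the 5-minute series and its 30-minute mean.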

# 1. Import Data

In [56]:
import pandas as pd
import glob
import os

# file_paths = glob.glob('../../data/1. raw/ACAP EDP/*.csv')
file_paths = glob.glob('../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun/*.csv')

In [57]:
# sites_to_include_paths = glob.glob('../../data/1. raw/ACAP EDP/Metadata/edp_sites_to_include.xlsx')
sites_to_include_paths = glob.glob('../../../../data/2. processed/aedp_cluster_2_3years.xlsx')

In [58]:
edp_site_ids_to_include = pd.read_excel(sites_to_include_paths[0])['edp_site_id'].to_list()

In [59]:
len(edp_site_ids_to_include)

171

In [60]:
aggregate_df = pd.DataFrame()
for file in file_paths:
      
    print(f"Processing {file}")
    
    # Include only relevant columns
    df = pd.read_csv(file, usecols = ['edp_site_id', 'unix_time', 'edp_device_and_circuit', 'circuit_label', 'real_energy'])
    
    # Filter only the net load
    df = df[df['circuit_label'] == 'ac_load_net']
    df.drop(columns=['circuit_label'], inplace=True)    
    
    # Filter only sites with data in cluster
    df = df[df['edp_site_id'].isin(edp_site_ids_to_include)]
    
    # Aggregate across sites, devices and circuits at each timestamp
    df = df.groupby(['unix_time'])['real_energy'].sum().reset_index()
    aggregate_df = pd.concat([aggregate_df, df], ignore_index=True)

Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2021_0685674287.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2021_0781620172.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2021_0866839036.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2021_099370549.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2021_1099129706.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2021_1139305266.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2021_1254143330.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2022_0190743126.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2022_0259555274.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 Jun\edp_data_2022_0344159374.csv
Processing ../../../../data/1. raw/ACAP EDP/2021 Jul to 2024 
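
The loop above grows aggregate_df with pd.concat on every iteration. A minimal sketch of the same filtering and per-timestamp sum, collecting the per-file results in a list and concatenating once, is shown below. The function name load_partial_sums and the two toy frames are hypothetical stand-ins for the raw csv files; the column names come from the notebook.

```python
import pandas as pd

def load_partial_sums(frames, site_ids):
    """Filter each frame to net load for the chosen sites and sum per timestamp."""
    partial_sums = []
    for df in frames:
        # Keep only the net-load circuit
        df = df[df["circuit_label"] == "ac_load_net"]
        # Keep only sites in the cluster
        df = df[df["edp_site_id"].isin(site_ids)]
        # Sum over sites, devices and circuits at each timestamp
        partial_sums.append(df.groupby("unix_time")["real_energy"].sum().reset_index())
    # Single concat at the end instead of one per file
    return pd.concat(partial_sums, ignore_index=True)

# Hypothetical toy inputs standing in for two raw csv files
file_a = pd.DataFrame({
    "edp_site_id": ["s1", "s2", "s9"],
    "unix_time": [0, 0, 0],
    "circuit_label": ["ac_load_net", "ac_load_net", "ac_load_net"],
    "real_energy": [100.0, 50.0, 999.0],
})
file_b = pd.DataFrame({
    "edp_site_id": ["s1", "s1"],
    "unix_time": [300, 300],
    "circuit_label": ["ac_load_net", "pv_site_net"],
    "real_energy": [80.0, -20.0],
})

result = load_partial_sums([file_a, file_b], site_ids=["s1", "s2"])
print(result)
```

Site s9 is excluded because it is not in the list, and the pv_site_net row is excluded by the circuit filter.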

In [61]:
aggregate_df['datetime'] = pd.to_datetime(aggregate_df['unix_time'], unit='s')
aggregate_df.drop(columns=['unix_time'], inplace=True)
aggregate_df.set_index('datetime', inplace=True)
aggregate_df.rename(columns={'real_energy': 'netload_kWh'}, inplace=True)
aggregate_df['netload_kWh'] = aggregate_df['netload_kWh'] / 1000  # Convert from Wh to kWh
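
As a small illustration of this cell, the sketch below applies the same steps to two made-up rows. It assumes, as the division by 1000 and the kWh column name suggest, that real_energy is recorded in Wh; the input values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: epoch seconds and energy in Wh (assumed unit)
raw = pd.DataFrame({
    "unix_time": [1625097600, 1625097900],
    "real_energy": [11177.17, 10832.638],
})

# Epoch seconds to timezone-naive timestamps
raw["datetime"] = pd.to_datetime(raw["unix_time"], unit="s")

tidy = (
    raw.drop(columns="unix_time")
       .set_index("datetime")
       .rename(columns={"real_energy": "netload_kWh"})
)
tidy["netload_kWh"] = tidy["netload_kWh"] / 1000  # Wh to kWh

print(tidy)
```

pd.to_datetime with unit='s' interprets the values as seconds since 1970-01-01 and returns timestamps without timezone information.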


In [6]:
# Filter datetime from 2021-07-01 to 2024-06-30
start_date = '2021-07-01 00:00:00'
end_date = '2024-06-30 23:59:59'


In [None]:
aggregate_df_3years = aggregate_df[(aggregate_df.index >= start_date) & (aggregate_df.index <= end_date)]

In [63]:
aggregate_df_3years

datetime,netload_kWh
2021-07-01 00:00:00,11.177170
2021-07-01 00:05:00,10.832638
2021-07-01 00:10:00,10.568127
2021-07-01 00:15:00,11.059183
2021-07-01 00:20:00,11.220170
...,...
2024-06-30 23:35:00,0.483461
2024-06-30 23:40:00,0.802184
2024-06-30 23:45:00,0.016977
2024-06-30 23:50:00,-0.183799


In [64]:
# Convert 5-minute energy (kWh) to mean power (kW): divide by the interval length in hours.
# assign/drop return a new DataFrame, which avoids the SettingWithCopyWarning raised
# when writing into the filtered slice in place.
aggregate_df_3years = aggregate_df_3years.assign(netload_kW=aggregate_df_3years['netload_kWh'] / (5/60))
aggregate_df_3years = aggregate_df_3years.drop(columns=['netload_kWh'])
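
The conversion in this cell is mean power = energy / interval length in hours. A worked example with the first reading shown earlier (11.177170 kWh over 5 minutes) follows; the helper name energy_to_mean_power_kw is hypothetical.

```python
def energy_to_mean_power_kw(energy_kwh, interval_minutes=5):
    """Mean power in kW for energy (kWh) delivered over interval_minutes."""
    interval_hours = interval_minutes / 60
    return energy_kwh / interval_hours

# First reading from the table above: 11.177170 kWh in a 5-minute interval
first_reading_kw = energy_to_mean_power_kw(11.177170)
print(round(first_reading_kw, 3))
```

Dividing by 5/60 is the same as multiplying by 12, so 11.177170 kWh corresponds to about 134.126 kW, which matches the first row of the netload_kW table.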


In [65]:
# Drop duplicate timestamps on the datetime index, keeping the first occurrence
aggregate_df_3years = aggregate_df_3years[~aggregate_df_3years.index.duplicated(keep='first')]
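
Before dropping duplicates it may be worth checking what they contain. The tables in this notebook show different values for 2024-06-30 23:50 before and after this step (-0.183799 kWh versus 126.077426 kW), which may indicate that a timestamp appears in more than one file with different partial totals. The sketch below uses made-up data to contrast keeping the first row with summing the rows; which one is right depends on how the raw files are split, which is not confirmed here.

```python
import pandas as pd

# Hypothetical partial sums: the 00:05 timestamp appears twice with different values
idx = pd.to_datetime(["2021-07-01 00:00", "2021-07-01 00:05", "2021-07-01 00:05"])
partial = pd.DataFrame({"netload_kW": [10.0, 8.0, 2.0]}, index=idx)

# Number of rows that repeat an earlier timestamp
n_duplicates = int(partial.index.duplicated(keep="first").sum())

# Option used in the notebook: keep the first row per timestamp
kept_first = partial[~partial.index.duplicated(keep="first")]

# Alternative: add the rows together, appropriate if each row is a partial total
summed = partial.groupby(level=0).sum()

ts = pd.Timestamp("2021-07-01 00:05")
print(n_duplicates)
print(kept_first.loc[ts, "netload_kW"], summed.loc[ts, "netload_kW"])
```

If the duplicated rows are exact copies, keeping the first is correct and summing would double count; if they are partial totals from different files, summing gives the full aggregate.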

In [66]:
aggregate_df_3years

datetime,netload_kW
2021-07-01 00:00:00,134.126042
2021-07-01 00:05:00,129.991660
2021-07-01 00:10:00,126.817520
2021-07-01 00:15:00,132.710191
2021-07-01 00:20:00,134.642041
...,...
2024-06-30 23:35:00,133.812538
2024-06-30 23:40:00,130.250219
2024-06-30 23:45:00,125.370684
2024-06-30 23:50:00,126.077426


In [None]:
aggregate_df_3years

Unnamed: 0,datetime,netload_kW
0,2021-07-01 00:00:00,134.126
1,2021-07-01 00:05:00,129.992
2,2021-07-01 00:10:00,126.818
3,2021-07-01 00:15:00,132.710
4,2021-07-01 00:20:00,134.642
...,...,...
315595,2024-06-30 23:35:00,133.813
315596,2024-06-30 23:40:00,130.250
315597,2024-06-30 23:45:00,125.371
315598,2024-06-30 23:50:00,126.077


In [10]:
aggregate_df_3years.index = pd.to_datetime(aggregate_df_3years['datetime'])
aggregate_df_3years.drop(columns=['datetime'], inplace=True)

In [11]:
# Create a complete 5-minute index from 2021-07-01 to 2024-06-30
# ('5min' replaces the deprecated '5T' alias)
date_range = pd.date_range(start=start_date, end=end_date, freq='5min')


In [12]:
aggregate_df_3years_complete = pd.DataFrame(index=date_range)
# merge the aggregate_df_3years with the date_range
aggregate_df_3years_complete = aggregate_df_3years_complete.merge(aggregate_df_3years, left_index=True, right_index=True, how='left')

In [13]:
aggregate_df_3years_complete

Unnamed: 0,netload_kW
2021-07-01 00:00:00,134.126
2021-07-01 00:05:00,129.992
2021-07-01 00:10:00,126.818
2021-07-01 00:15:00,132.710
2021-07-01 00:20:00,134.642
...,...
2024-06-30 23:35:00,133.813
2024-06-30 23:40:00,130.250
2024-06-30 23:45:00,125.371
2024-06-30 23:50:00,126.077
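
An equivalent way to align the series to a complete 5-minute index is DataFrame.reindex, which inserts NaN rows for timestamps that are absent. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical 5-minute series with the 00:10 reading missing
observed = pd.DataFrame(
    {"netload_kW": [134.1, 130.0, 132.7]},
    index=pd.to_datetime(["2021-07-01 00:00", "2021-07-01 00:05", "2021-07-01 00:15"]),
)

# Complete index covering the same period
full_index = pd.date_range("2021-07-01 00:00", "2021-07-01 00:15", freq="5min")

# Rows missing from `observed` appear as NaN
complete = observed.reindex(full_index)
n_missing = int(complete["netload_kW"].isna().sum())

print(complete)
print(n_missing)
```

reindex requires a unique index, so it fits here only after duplicate timestamps have been removed.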


In [14]:
# see if there are any missing values
missing_values = aggregate_df_3years_complete.isnull().sum()

# see missing values rows
missing_values_rows = aggregate_df_3years_complete[aggregate_df_3years_complete.isnull().any(axis=1)]

In [16]:
missing_values_rows

Unnamed: 0,netload_kW
2022-10-02 02:00:00,
2022-10-02 02:05:00,
2022-10-02 02:10:00,
2022-10-02 02:15:00,
2022-10-02 02:20:00,
2022-10-02 02:25:00,
2022-10-02 02:30:00,
2022-10-02 02:35:00,
2022-10-02 02:40:00,
2022-10-02 02:45:00,


In [17]:
# Fill missing values with the previous value
# (ffill() replaces the deprecated fillna(method='ffill'))
aggregate_df_3years_complete.ffill(inplace=True)
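
Forward-filling is reasonable for short gaps but can hide long ones. The sketch below, on made-up data, measures the length of each run of missing rows and caps how many consecutive rows are filled. The gap listed above falls on 2 Oct 2022 from 02:00, which is when daylight saving time started in NSW that year, so it may be a clock-change artifact; that reading is an assumption and is not confirmed by the notebook.

```python
import pandas as pd

# Hypothetical series with a run of three missing 5-minute readings
index = pd.date_range("2022-10-02 01:50", periods=6, freq="5min")
series = pd.Series([120.0, 121.0, None, None, None, 118.0], index=index, name="netload_kW")

is_gap = series.isna()
# A new run starts whenever the missing/present state changes
run_id = (is_gap != is_gap.shift(fill_value=False)).cumsum()
# Length of each run of consecutive missing rows
gap_lengths = is_gap[is_gap].groupby(run_id[is_gap]).size()

# Forward-fill at most 2 consecutive rows; longer gaps stay partly NaN
filled = series.ffill(limit=2)

print(gap_lengths.tolist())
print(int(filled.isna().sum()))
```

With limit=2 the third missing row stays NaN, which makes long gaps visible instead of silently filling them.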


# Export Data

In [18]:
# Export the complete (gap-filled) 5-minute series
aggregate_df_3years_complete.to_csv('../../../../to Github/data/ds6_aedp_cluster_5min.csv', index=True, float_format='%.3f')

# CREATE 30 MINUTELY DATA

In [76]:
aggregate_df_3years

datetime,netload_kW
2021-07-01 00:00:00,134.126042
2021-07-01 00:05:00,129.991660
2021-07-01 00:10:00,126.817520
2021-07-01 00:15:00,132.710191
2021-07-01 00:20:00,134.642041
...,...
2024-06-30 23:35:00,133.812538
2024-06-30 23:40:00,130.250219
2024-06-30 23:45:00,125.370684
2024-06-30 23:50:00,126.077426


In [78]:
aggregate_df_3years_30min = aggregate_df_3years_complete.resample('30min').mean()

In [81]:
# Name the index 'datetime' so the exported csv has a header for it
aggregate_df_3years_30min.index.name = 'datetime'

In [82]:
aggregate_df_3years_30min.to_csv('../../../../to Github/data/ds7_aedp_cluster_30min.csv', index=True, float_format='%.3f')
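
Because netload_kW is a power, the 30-minute value is the mean of the six 5-minute readings in each window, not their sum. A small sketch on made-up values shows how resample('30min').mean() groups and labels the windows:

```python
import pandas as pd

# Hypothetical hour of 5-minute readings with values 0.0 to 11.0
index = pd.date_range("2021-07-01 00:00", periods=12, freq="5min")
five_min = pd.DataFrame({"netload_kW": [float(v) for v in range(12)]}, index=index)

# Each 30-minute window is labelled by its start time and averages six readings
thirty_min = five_min.resample("30min").mean()
print(thirty_min)
```

The window labelled 00:00 covers the readings from 00:00 to 00:25, and the window labelled 00:30 covers 00:30 to 00:55.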

# ARCHIVE

In [None]:
import pandas as pd
aggregate_df_3years = pd.read_csv('../../../../to Github/data/ds6_aedp_cluster_5min.csv')

In [None]:
# Identify sites with NAs or missing data
sites_with_nas_list = set()

for file in file_paths:
   
    print(f"Processing {file}") 
    
    # Include only relevant columns
    df = pd.read_csv(file, usecols = ['edp_site_id', 'unix_time', 'edp_device_and_circuit', 'circuit_label', 'real_energy'])
    
    # Filter only the net load
    df = df[df['circuit_label'] == 'ac_load_net']
    df.drop(columns=['circuit_label'], inplace=True)    
    
    # Filter only sites with data in the correct cluster
    df = df[df['edp_site_id'].isin(edp_site_ids_to_include)]
    
    # Identify site id with missing real energy data
    sites_with_nas = set(df[df['real_energy'].isna()]['edp_site_id'].unique())
    sites_with_nas_list.update(sites_with_nas)

Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0160891405.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0278800434.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0322850795.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0433678557.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0529556122.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0627907281.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_079365040.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0862506807.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_0922561682.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_1014737568.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_1194606633.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2019_1293725212.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data_2020_0116987317.csv
Processing ../../../../data/1. raw/ACAP EDP\edp_data

In [None]:
sites_with_nas_list

set()

In [None]:
edp_site_ids_to_include = [site for site in edp_site_ids_to_include if site not in sites_with_nas_list]

In [None]:
import pandas as pd
import glob
import os

In [None]:
file_paths = glob.glob('../../data/3. cleaned/ds2_aedp_5min.csv')
df = pd.read_csv(file_paths[0], sep=',', header=0, index_col=0, parse_dates=True)

In [None]:
df.rename(columns={'netload_kW': 'netload_kWh'}, inplace=True)
df['netload_kW'] = df['netload_kWh'] / (5/60)
df.drop(columns=['netload_kWh'], inplace=True)

In [None]:
aggregate_df = df.copy()

# MESSY

datetime, netload_kW
30 minutely from 1 July 2010 to 30 June 2013

27 Dec 2018

Total 487 households

Create dataset of 
from 1 Jan 2019 to 30 Sep 2024
5 minutely

Cut into
from 1 Jul 2021 to 30 Jun 2024
convert to 30 minutely data

datetime
netload_kW