# Webscraper for AEMO wind data

# Table of contents
 1.    [Necessary packages for webscraper](#packages)
 <br><br>
 2.    [Downloading ZIP files](#dataprep)

       a. [Data information](#datainfo)
     
       b. [Webpage structure](#htmlcode)
       <br>
 3.   [Data Cleaning](#dataclean)
 
      a. [Extracting ZIP files](#extractzip)
       
      b. [Remove used ZIP files](#removezips)
       
      c. [Moving subfolders to main folder](#extractsubfolder)
       
      d. [Removing empty folders](#removenull1)
       
      e. [Extract CSV files](#extractcsv)
       
      f. [Removing empty folders](#removenull2)
       
      g. [Combining daily data](#combinedata)
       
      h. [Creating necessary variables](#avereading)
       
      i. [Saving wind dataset](#prelimdata)

      j. [Filtering DUIDs](#filterDUIDs)
       
      k. [Complete wind dataset](#winddata)
       
      l. [Combining price data](#mergedata)
       
      m. [Final dataset](#finaloutput)

## A. Necessary packages  <a name="packages"></a>

In [1]:
from bs4 import BeautifulSoup
import os
import pandas as pd
import requests
import shutil
from urllib.parse import urljoin
import zipfile

## B. Downloading ZIP files  <a name="dataprep"></a>

### 1. Data information <a name="datainfo"></a>

- **Period covered**: about 14 months (October 2022 - December 2023, matching the files downloaded below)

- **Purpose**: input for an OLS regression that estimates how wind generation affects spot prices

- **Source**: [AEMO Reports: Actual Next Day Generation](https://nemweb.com.au/Reports/Archive/Next_Day_Actual_Gen/)

#### A. Archive data <a name="archive"></a>


In [10]:
# Target URL for webscraper

url = 'https://nemweb.com.au/Reports/Archive/Next_Day_Actual_Gen/'
page = requests.get(url)

# Parsing the directory listing with BeautifulSoup; the file links sit inside a <pre> element

soup = BeautifulSoup(page.content, 'html.parser')
pre_element = soup.find('pre')

# Target directory 

target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Creating the directory if it does not already exist

os.makedirs(target_directory, exist_ok=True)

if pre_element:
    
    # Finding all anchor elements within the pre-element
    
    links = pre_element.find_all('a')

    # Downloading every link that points to a ZIP file
    
    for link in links:
        href = link.get('href')
        if href and href.endswith('.zip'):
            zip_url = urljoin(url, href)  # Constructing the absolute URL
            print(f"Downloading ZIP file: {zip_url}")

            # Downloading the ZIP file content
            
            zip_response = requests.get(zip_url)
            
            if zip_response.status_code == 200:
                
                # Saving the ZIP file to a local directory
                
                zip_filename = os.path.join(target_directory, os.path.basename(href))
                with open(zip_filename, 'wb') as zip_file:
                    zip_file.write(zip_response.content)

                print(f"ZIP file '{zip_filename}' downloaded successfully.")
                
            else:
                print(f"Failed to download ZIP file '{href}'. Status code: {zip_response.status_code}")
                print(f"Error content: {zip_response.text}")
else:
    print("Pre element not found on the page.")

Downloading ZIP file: https://nemweb.com.au/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_GEN_20221001.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221001.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_GEN_20221101.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_GEN_20221201.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221201.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_GEN_20230101.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230101.zip' downloaded successfully.
Download
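The archive listing holds every month that has been published, while the analysis only needs a fixed window. Below is a minimal sketch of narrowing the download list by the date embedded in each file name. It assumes the eight digits are a `YYYYMMDD` stamp, as the names in the log above suggest. `file_date` and `in_period` are hypothetical helpers that are not part of the original notebook, and the bounds are illustrative.

```python
import re
from datetime import date

# Assumption: the 8-digit token in each file name is a YYYYMMDD date stamp
DATE_PATTERN = re.compile(r"_(\d{4})(\d{2})(\d{2})(?:_|\.zip$)")

def file_date(filename):
    """Return the date embedded in an AEMO-style file name, or None."""
    match = DATE_PATTERN.search(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

def in_period(filename, start, end):
    """True when the file's date stamp falls within [start, end]."""
    stamp = file_date(filename)
    return stamp is not None and start <= stamp <= end

names = [
    "NEXT_DAY_ACTUAL_GEN_20220901.zip",
    "NEXT_DAY_ACTUAL_GEN_20221001.zip",
    "PUBLIC_NEXT_DAY_ACTUAL_GEN_20231013_0000000399586165.zip",
    "NEW TEXT DOCUMENT.TXT",
]

# Illustrative bounds; adjust to the period actually required
selected = [n for n in names if in_period(n, date(2022, 10, 1), date(2023, 12, 31))]
print(selected)
```

In the download loop, a check such as `in_period(os.path.basename(href), start, end)` could sit next to the existing `href.endswith('.zip')` test.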

#### B. Current data <a name="current"></a>

In [3]:
# Target URL for webscraper

url = 'https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/'
page = requests.get(url)

# Parsing the directory listing with BeautifulSoup; the file links sit inside a <pre> element

soup = BeautifulSoup(page.content, 'html.parser')
pre_element = soup.find('pre')

# Target directory 

target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Creating the directory if it does not already exist

os.makedirs(target_directory, exist_ok=True)

if pre_element:
    
    # Finding all anchor elements within the pre-element
    
    links = pre_element.find_all('a')

    # Downloading every link that points to a ZIP file
    
    for link in links:
        href = link.get('href')
        if href and href.endswith('.zip'):
            zip_url = urljoin(url, href)  # Constructing the absolute URL
            print(f"Downloading ZIP file: {zip_url}")

            # Downloading the ZIP file content
            
            zip_response = requests.get(zip_url)
            
            if zip_response.status_code == 200:
                
                # Saving the ZIP file to a local directory
                
                zip_filename = os.path.join(target_directory, os.path.basename(href))
                with open(zip_filename, 'wb') as zip_file:
                    zip_file.write(zip_response.content)

                print(f"ZIP file '{zip_filename}' downloaded successfully.")
                
            else:
                print(f"Failed to download ZIP file '{href}'. Status code: {zip_response.status_code}")
                print(f"Error content: {zip_response.text}")
else:
    print("Pre element not found on the page.")

Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231013_0000000399586165.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231013_0000000399586165.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231014_0000000399668664.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231014_0000000399668664.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231015_0000000399751824.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231015_0000000399751824.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231

ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231110_0000000402024638.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231111_0000000402127968.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231111_0000000402127968.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231112_0000000402231319.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231112_0000000402231319.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231113_0000000402323290.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231113_

ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231208_0000000404957730.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231209_0000000405066149.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231209_0000000405066149.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231210_0000000405173475.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231210_0000000405173475.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Current/Next_Day_Actual_Gen/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231211_0000000405286094.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_NEXT_DAY_ACTUAL_GEN_20231211_

### 2. Webpage structure <a name="htmlcode"></a>
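- Print the HTML of the webpage to inspect its structure
    - Both webpages (Archive and Current) have the same structure; the output below comes from the **Archive data** webpage, the last one loaded into `soup`
- The purpose is to identify which elements to extract: here, the `<a>` links inside the `<pre>` element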
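The selection rule that this inspection leads to (keep the `<a>` links inside `<pre>` whose `href` ends in `.zip`) can be tried offline on a short sample. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages. `ZipLinkParser` is a hypothetical name, and the sample HTML is a shortened imitation of the archive listing rather than the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ZipLinkParser(HTMLParser):
    """Collect href values of <a> tags that sit inside a <pre> element."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
        elif tag == "a" and self.in_pre:
            href = dict(attrs).get("href")
            # Keep only links that point to ZIP files
            if href and href.lower().endswith(".zip"):
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

base_url = "https://nemweb.com.au/Reports/Archive/Next_Day_Actual_Gen/"

# Shortened sample modelled on the archive directory listing
sample_html = (
    '<html><body><pre><a href="/Reports/Archive/">[To Parent Directory]</a><br>'
    '<a href="/Reports/Archive/Next_Day_Actual_Gen/NEW%20TEXT%20DOCUMENT.TXT">'
    'NEW TEXT DOCUMENT.TXT</a><br>'
    '<a href="/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_GEN_20221001.zip">'
    'NEXT_DAY_ACTUAL_GEN_20221001.zip</a><br></pre></body></html>'
)

parser = ZipLinkParser()
parser.feed(sample_html)

# Resolving the site-relative hrefs against the listing URL
zip_urls = [urljoin(base_url, href) for href in parser.hrefs]
print(zip_urls)
```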

In [11]:
print(soup.prettify())

<html>
 <head>
  <title>
   nemweb.com.au - /Reports/Archive/Next_Day_Actual_Gen/
  </title>
 </head>
 <body>
  <h1>
   nemweb.com.au - /Reports/Archive/Next_Day_Actual_Gen/
  </h1>
  <hr/>
  <pre><a href="/Reports/Archive/">[To Parent Directory]</a><br/><br/> Thursday, September 29, 2016  5:32 PM      1122563 <a href="/Reports/Archive/Next_Day_Actual_Gen/_ZA01492">_ZA01492</a><br/> Thursday, September 29, 2016  5:31 PM            0 <a href="/Reports/Archive/Next_Day_Actual_Gen/NEW%20TEXT%20DOCUMENT.TXT">NEW TEXT DOCUMENT.TXT</a><br/>   Thursday, December 1, 2022  1:02 AM      1472290 <a href="/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_GEN_20221001.zip">NEXT_DAY_ACTUAL_GEN_20221001.zip</a><br/>      Sunday, January 1, 2023  1:02 AM      1456025 <a href="/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_GEN_20221101.zip">NEXT_DAY_ACTUAL_GEN_20221101.zip</a><br/>  Wednesday, February 1, 2023  1:01 AM      1478854 <a href="/Reports/Archive/Next_Day_Actual_Gen/NEXT_DAY_ACTUAL_

## C. Data Cleaning <a name="dataclean"></a>

- **Structure of the downloaded ZIP files**:
    - ZIP file (monthly data) -> folder (monthly data) -> ZIP file (daily data) -> folders (daily data) -> CSV
    - To collect the CSV files, the code:
       1. Takes the ZIP files downloaded from the website
       2. Extracts each one into a folder of the same name (first level: monthly data)
       3. Extracts the daily ZIP files found inside those folders
       4. Removes the used ZIP files from the local directory
### 1. Extracting the ZIP files <a name="extractzip"></a>
- Each extracted ZIP file becomes a monthly folder of wind energy data
- Each monthly folder contains the daily wind data as ZIP files

#### a. Monthly data ZIP files

In [12]:
main_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Listing the ZIP files in the main directory
zip_files = [f for f in os.listdir(main_directory) if f.endswith('.zip')]

for zip_filename in zip_files:
    zip_path = os.path.join(main_directory, zip_filename)
    # Extracting into a folder named after the ZIP file (extension removed)
    extract_path = os.path.join(main_directory, os.path.splitext(zip_filename)[0])

    print(f"Extracting contents of '{zip_filename}' to '{extract_path}'")

    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)

    print("Extraction complete\n")

Extracting contents of 'NEXT_DAY_ACTUAL_GEN_20230201.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230201'
Extraction complete

Extracting contents of 'NEXT_DAY_ACTUAL_GEN_20230401.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230401'
Extraction complete

Extracting contents of 'NEXT_DAY_ACTUAL_GEN_20230601.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230601'
Extraction complete

Extracting contents of 'NEXT_DAY_ACTUAL_GEN_20230101.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230101'
Extraction complete

Extracting contents of 'NEXT_DAY_ACTUAL_GEN_20230301.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230301'
Extraction complete

Extracting contents of 'NEXT_DAY_ACTUAL_GEN_20230701.zip' to '/Users/cececarino/Desktop/PE/Spot price for

#### b. Daily data in subfolders

In [13]:
# Target directory for ZIPs
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Listing all subdirectories in the target directory
subdirectories = [d for d in os.listdir(target_directory) if os.path.isdir(os.path.join(target_directory, d))]

# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(target_directory, subdirectory)

    # Listing all files in the subdirectory
    zip_files = [f for f in os.listdir(subdirectory_path) if f.endswith('.zip')]

    # Iterating through zip files in the subdirectory
    for zip_filename in zip_files:
        zip_path = os.path.join(subdirectory_path, zip_filename)
        extract_path = os.path.join(subdirectory_path, zip_filename.replace('.zip', ''))

        print(f"Extracting contents of '{zip_filename}' to '{extract_path}'")

        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)

        print(f"Extraction complete for '{zip_filename}' in '{subdirectory}'\n")

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221128_0000000375895033.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101/PUBLIC_NEXT_DAY_ACTUAL_GEN_20221128_0000000375895033'
Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221128_0000000375895033.zip' in 'NEXT_DAY_ACTUAL_GEN_20221101'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221116_0000000375140645.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101/PUBLIC_NEXT_DAY_ACTUAL_GEN_20221116_0000000375140645'
Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221116_0000000375140645.zip' in 'NEXT_DAY_ACTUAL_GEN_20221101'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221112_0000000374882973.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101/PUBLIC_NEXT_DAY_ACTUAL_GEN_20221112_0000000374882973'
Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_2022111

Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230615_0000000389696818.zip' in 'NEXT_DAY_ACTUAL_GEN_20230601'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230622_0000000390239768.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230601/PUBLIC_NEXT_DAY_ACTUAL_GEN_20230622_0000000390239768'
Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230622_0000000390239768.zip' in 'NEXT_DAY_ACTUAL_GEN_20230601'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230623_0000000390319778.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230601/PUBLIC_NEXT_DAY_ACTUAL_GEN_20230623_0000000390319778'
Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230623_0000000390319778.zip' in 'NEXT_DAY_ACTUAL_GEN_20230601'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230616_0000000389775398.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_2023

Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230925_0000000398102154.zip' in 'NEXT_DAY_ACTUAL_GEN_20230901'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230926_0000000398188338.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230901/PUBLIC_NEXT_DAY_ACTUAL_GEN_20230926_0000000398188338'
Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230926_0000000398188338.zip' in 'NEXT_DAY_ACTUAL_GEN_20230901'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230927_0000000398272072.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20230901/PUBLIC_NEXT_DAY_ACTUAL_GEN_20230927_0000000398272072'
Extraction complete for 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230927_0000000398272072.zip' in 'NEXT_DAY_ACTUAL_GEN_20230901'

Extracting contents of 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230923_0000000397942347.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_2023

### 2. Removing used ZIPs <a name="removezips"></a>
- Removes the ZIP files from the main directory (wind folder)
- Removes the ZIP files from the subfolders in the wind folder

#### a. Removing ZIP files from wind folder

In [14]:
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Listing the ZIP files in the target directory
zip_files = [f for f in os.listdir(target_directory) if f.endswith('.zip')]

for zip_filename in zip_files:
    zip_path = os.path.join(target_directory, zip_filename)

    # Delete the ZIP file
    os.remove(zip_path)
    print(f"Deleted '{zip_filename}'\n")

Deleted 'NEXT_DAY_ACTUAL_GEN_20230201.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230401.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230601.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230101.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230301.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230701.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230501.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230901.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20231001.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20221201.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20221001.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20230801.zip'

Deleted 'NEXT_DAY_ACTUAL_GEN_20221101.zip'



#### b. Removing ZIP files from subfolders inside wind folder

In [15]:
def delete_zip_files(directory):
    # Listing all files in the current directory
    files = os.listdir(directory)

    for filename in files:
        file_path = os.path.join(directory, filename)

        if os.path.isdir(file_path):
            # Recursively call the function for subdirectories
            delete_zip_files(file_path)
        elif filename.endswith('.zip'):
            # Deleting the ZIP file
            os.remove(file_path)
            print(f"Deleted '{filename}' in '{directory}'\n")

# Target directory for ZIPs
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Call the function to delete zip files in the main directory and subdirectories
delete_zip_files(target_directory)

Deleted 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221128_0000000375895033.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101'

Deleted 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221116_0000000375140645.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101'

Deleted 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221112_0000000374882973.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101'

Deleted 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221122_0000000375519655.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101'

Deleted 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221106_0000000374485242.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_20221101'

Deleted 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221129_0000000375955463.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/NEXT_DAY_ACTUAL_GEN_202
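Since `delete_zip_files` starts at the wind folder and descends from there, one recursive pass covers both the top level and the subfolders. A compact variant using `pathlib` is sketched below. `delete_zip_files_recursive` is a hypothetical name, and the folder layout is made up for demonstration.

```python
import tempfile
from pathlib import Path

def delete_zip_files_recursive(directory):
    """Delete every .zip file under `directory`; return the deleted paths."""
    deleted = []
    # rglob walks the directory tree; sorting fixes the order before deleting
    for zip_path in sorted(Path(directory).rglob("*.zip")):
        if zip_path.is_file():
            zip_path.unlink()
            deleted.append(zip_path)
    return deleted

# Made-up folder layout for demonstration
root = Path(tempfile.mkdtemp())
(root / "monthly").mkdir()
(root / "top.zip").write_bytes(b"")
(root / "monthly" / "daily.zip").write_bytes(b"")
(root / "monthly" / "keep.CSV").write_text("D,1\n")

removed = delete_zip_files_recursive(root)
print(sorted(p.name for p in removed))
```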

### 3. Moving subfolders to main folder <a name="extractsubfolder"></a>

In [16]:
# Source directory
source_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Getting a list of all subdirectories in the source directory
subdirectories = [d for d in os.listdir(source_directory) if os.path.isdir(os.path.join(source_directory, d))]

# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Getting a list of all subdirectories within each subdirectory
    nested_subdirectories = [d for d in os.listdir(subdirectory_path) if os.path.isdir(os.path.join(subdirectory_path, d))]

    # Moving each nested subdirectory to the main 'wind' directory
    for nested_subdirectory in nested_subdirectories:
        nested_subdirectory_path = os.path.join(subdirectory_path, nested_subdirectory)
        target_path = os.path.join(source_directory, nested_subdirectory)

        # Moving the nested subdirectory
        shutil.move(nested_subdirectory_path, target_path)
        print(f"Moved '{nested_subdirectory}' to '{source_directory}'")

Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221124_0000000375647535' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221109_0000000374690648' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221119_0000000375330426' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221110_0000000374755918' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221120_0000000375392119' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221130_0000000376016627' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221107_0000000374552046' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221121_0000000375456841' to '/Users/cece

### 4. Removing empty folders <a name="removenull1"></a>

In [17]:
# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Checking if the subdirectory is empty
    if not os.listdir(subdirectory_path):
        # Removing the empty subdirectory
        os.rmdir(subdirectory_path)
        print(f"Removed empty folder: '{subdirectory}'")

Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20221101'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230501'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230101'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20221201'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230601'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230801'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230201'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230701'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230301'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230901'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20231001'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20221001'
Removed empty folder: 'NEXT_DAY_ACTUAL_GEN_20230401'
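The cell above checks only one level of folders. If empty folders could be nested, a bottom-up walk removes the deepest ones first so that their parents become empty in turn. This is a minimal sketch under that assumption. `remove_empty_dirs` is a hypothetical helper and the folder names are made up.

```python
import os
import tempfile

def remove_empty_dirs(root):
    """Remove empty directories below `root`, deepest first; return their paths."""
    removed = []
    # topdown=False visits children before their parents
    for current, _dirnames, _filenames in os.walk(root, topdown=False):
        if current != root and not os.listdir(current):
            os.rmdir(current)
            removed.append(current)
    return removed

# Made-up layout: one empty chain and one folder that still holds a file
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "month_a", "day_1"))
os.makedirs(os.path.join(root, "month_b"))
with open(os.path.join(root, "month_b", "data.CSV"), "w") as handle:
    handle.write("D,1\n")

removed = remove_empty_dirs(root)
print(sorted(os.path.relpath(p, root) for p in removed))
```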


### 5. Extract CSV files <a name="extractcsv"></a>

In [19]:
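# Naming source and target directories
source_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'

# Creating the target directory if it does not exist, so the move below has a destination
os.makedirs(target_directory, exist_ok=True)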

# Getting a list of all subdirectories in the source directory
subdirectories = [d for d in os.listdir(source_directory) if os.path.isdir(os.path.join(source_directory, d))]

# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Only the daily folders (names starting with "PUBLIC_") hold the CSV files
    if subdirectory.startswith("PUBLIC_"):
        csv_files = [f for f in os.listdir(subdirectory_path) if f.endswith('.CSV')]

        # Moving each CSV file to the target directory
        for csv_file in csv_files:
            csv_file_path = os.path.join(subdirectory_path, csv_file)
            target_path = os.path.join(target_directory, csv_file)

            # Moving the CSV file
            shutil.move(csv_file_path, target_path)
            print(f"Moved '{csv_file}' to '{target_directory}'")

Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230921_0000000397786443.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230501_0000000386207735.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221124_0000000375647535.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230121_0000000379271539.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230903_0000000396279331.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221229_0000000377853232.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230614_0000000389612386.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/dat
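The same collection step can be wrapped in a function that creates the target folder itself and matches the extension regardless of case. This is a sketch only. `collect_csv_files` is a hypothetical helper, and the temporary folders and file names are invented for the demonstration.

```python
import os
import shutil
import tempfile

def collect_csv_files(source_directory, target_directory, prefix="PUBLIC_"):
    """Move CSV files out of sub-folders whose names start with `prefix`."""
    os.makedirs(target_directory, exist_ok=True)
    moved = []
    for entry in sorted(os.listdir(source_directory)):
        folder = os.path.join(source_directory, entry)
        if not (os.path.isdir(folder) and entry.startswith(prefix)):
            continue
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(".csv"):  # matches both .CSV and .csv
                destination = os.path.join(target_directory, name)
                shutil.move(os.path.join(folder, name), destination)
                moved.append(destination)
    return moved

# Made-up layout to exercise the function
base = tempfile.mkdtemp()
source = os.path.join(base, "wind")
target = os.path.join(base, "collected_wind")
os.makedirs(os.path.join(source, "PUBLIC_SAMPLE_DAY"))
os.makedirs(os.path.join(source, "OTHER_FOLDER"))
samples = [("PUBLIC_SAMPLE_DAY", "a.CSV"), ("PUBLIC_SAMPLE_DAY", "b.csv"), ("OTHER_FOLDER", "c.CSV")]
for folder_name, file_name in samples:
    with open(os.path.join(source, folder_name, file_name), "w") as handle:
        handle.write("D,1\n")

moved = collect_csv_files(source, target)
print([os.path.basename(p) for p in moved])
```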

### 6. Remove empty folders <a name="removenull2"></a>

In [20]:
# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Checking if the subdirectory is empty
    if not os.listdir(subdirectory_path):
        # Removing the empty subdirectory
        os.rmdir(subdirectory_path)
        print(f"Removed empty folder: '{subdirectory}'")

Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230921_0000000397786443'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230501_0000000386207735'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221124_0000000375647535'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230121_0000000379271539'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230903_0000000396279331'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221229_0000000377853232'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230614_0000000389612386'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230806_0000000393943863'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230324_0000000383474675'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20221219_0000000377229764'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230303_0000000381998466'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230710_0000000391667128'
Removed empty folder: 'PUBLIC_NEXT_DAY_ACTUAL_GEN_20230403_0000000384156962'

### 7. Combining daily data <a name="combinedata"></a>

In [123]:
# Source directory containing the collected CSV files
source_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind/'

# List to store a DataFrame for each CSV file
dfs = []

# Iterating over CSV files in the source directory
for file_name in os.listdir(source_directory):
    if file_name.endswith('.CSV'):
        csv_path = os.path.join(source_directory, file_name)

        # Reading each CSV file into a DataFrame
        df = pd.read_csv(csv_path, skiprows=1)  # Skip the first row so the column names come from the second row

        # Appending the DataFrame to the list
        dfs.append(df)

In [124]:
# Combining all DataFrames into a single DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

In [125]:
combined_df.head()

| | I | METER_DATA | GEN_DUID | 1 | INTERVAL_DATETIME | DUID | MWH_READING | LASTCHANGED |
|---|---|---|---|---|---|---|---|---|
| 0 | D | METER_DATA | GEN_DUID | 1.0 | 2023/04/01 04:05:00 | BARCSF1 | 0.2 | 2023/04/01 04:00:04 |
| 1 | D | METER_DATA | GEN_DUID | 1.0 | 2023/04/01 04:05:00 | BUTLERSG | 7.199998 | 2023/04/01 04:00:04 |
| 2 | D | METER_DATA | GEN_DUID | 1.0 | 2023/04/01 04:05:00 | CAPTL_WF | 0.0 | 2023/04/01 04:00:04 |
| 3 | D | METER_DATA | GEN_DUID | 1.0 | 2023/04/01 04:05:00 | CHALLHWF | 16.4 | 2023/04/01 04:00:04 |
| 4 | D | METER_DATA | GEN_DUID | 1.0 | 2023/04/01 04:05:00 | CLOVER | -0.01 | 2023/04/01 04:00:04 |
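In the preview above, the first column (`I`) marks each record type, with `D` for data rows. The `1.0` values in column `1` suggest that at least one non-data row with missing fields was read as well. The sketch below keeps only the `D` rows right after reading. It assumes the file starts with one extra record before the column names and ends with a short `C` footer record. The sample content and the DUID names are made up.

```python
import io
import pandas as pd

# Made-up miniature file; the leading and trailing "C" records are assumptions
sample_csv = io.StringIO(
    "C,SAMPLE_HEADER_RECORD\n"
    "I,METER_DATA,GEN_DUID,1,INTERVAL_DATETIME,DUID,MWH_READING,LASTCHANGED\n"
    "D,METER_DATA,GEN_DUID,1,2023/04/01 04:05:00,DUID_A,0.2,2023/04/01 04:00:04\n"
    "D,METER_DATA,GEN_DUID,1,2023/04/01 04:05:00,DUID_B,7.2,2023/04/01 04:00:04\n"
    "C,END OF REPORT,4\n"
)

# skiprows=1 drops the first record so the "I" row supplies the column names
df = pd.read_csv(sample_csv, skiprows=1)

# Keeping only the data records; the footer row is dropped here
data_rows = df[df["I"] == "D"].copy()
print(data_rows[["DUID", "MWH_READING"]])
```

Filtering on the record type is more targeted than a later `dropna()`, which would also discard genuine data rows that happen to contain a missing value.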


### 8. Creating mean MWH_READING for 30T and INTERVAL variables <a name="avereading"></a>


In [126]:
# Converting INTERVAL_DATETIME to datetime
combined_df['INTERVAL_DATETIME'] = pd.to_datetime(combined_df['INTERVAL_DATETIME'])
# Sorting chronologically
combined_df = combined_df.sort_values(by='INTERVAL_DATETIME', ascending=True)
# Keeping a copy of the timestamp as a regular column before it becomes the index
combined_df['INTERVAL'] = combined_df['INTERVAL_DATETIME'].copy()
combined_df.set_index('INTERVAL_DATETIME', inplace=True)
combined_df.head()

| INTERVAL_DATETIME | I | METER_DATA | GEN_DUID | 1 | DUID | MWH_READING | LASTCHANGED | INTERVAL |
|---|---|---|---|---|---|---|---|---|
| 2022-10-01 04:05:00 | D | METER_DATA | GEN_DUID | 1.0 | RUBICON | 0.0 | 2022/10/01 04:00:04 | 2022-10-01 04:05:00 |
| 2022-10-01 04:05:00 | D | METER_DATA | GEN_DUID | 1.0 | HEZ1 | 0.0 | 2022/10/01 04:00:04 | 2022-10-01 04:05:00 |
| 2022-10-01 04:05:00 | D | METER_DATA | GEN_DUID | 1.0 | HUGSF1 | -0.172 | 2022/10/01 04:00:04 | 2022-10-01 04:05:00 |
| 2022-10-01 04:05:00 | D | METER_DATA | GEN_DUID | 1.0 | KEPBG1 | 0.0 | 2022/10/01 04:00:04 | 2022-10-01 04:05:00 |
| 2022-10-01 04:05:00 | D | METER_DATA | GEN_DUID | 1.0 | LRSF1 | 0.053 | 2022/10/01 04:00:04 | 2022-10-01 04:05:00 |


In [127]:
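# Resampling into 30-minute bins and taking the mean of MWH_READING
# ('30min' is the same frequency as the older alias '30T').
# Each bin is labelled by its start time and the mean is taken over all rows
# in the bin, i.e. across every DUID rather than per generator.
average_per_30min = combined_df['MWH_READING'].resample('30min').mean()

# Assignment aligns on the index, so only rows stamped exactly on a
# 30-minute boundary receive a value; every other row gets NaN
combined_df['AveMWH_READING'] = average_per_30min

# Dropping rows with missing values, which removes the rows off the 30-minute boundaries
combined_df.dropna(inplace=True)

combined_df.head()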

| INTERVAL_DATETIME | I | METER_DATA | GEN_DUID | 1 | DUID | MWH_READING | LASTCHANGED | INTERVAL | AveMWH_READING |
|---|---|---|---|---|---|---|---|---|---|
| 2022-10-01 04:30:00 | D | METER_DATA | GEN_DUID | 1.0 | CHALLHWF | 7.8 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 2022-10-01 04:30:00 | D | METER_DATA | GEN_DUID | 1.0 | BUTLERSG | 8.799998 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 2022-10-01 04:30:00 | D | METER_DATA | GEN_DUID | 1.0 | YSWF1 | 2.0 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 2022-10-01 04:30:00 | D | METER_DATA | GEN_DUID | 1.0 | WOOLNTH1 | 36.149998 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 2022-10-01 04:30:00 | D | METER_DATA | GEN_DUID | 1.0 | WAUBRAWF | 41.007999 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |


In [128]:
combined_df = combined_df.reset_index(drop=True)
combined_df.head()

| | I | METER_DATA | GEN_DUID | 1 | DUID | MWH_READING | LASTCHANGED | INTERVAL | AveMWH_READING |
|---|---|---|---|---|---|---|---|---|---|
| 0 | D | METER_DATA | GEN_DUID | 1.0 | CHALLHWF | 7.8 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 1 | D | METER_DATA | GEN_DUID | 1.0 | BUTLERSG | 8.799998 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 2 | D | METER_DATA | GEN_DUID | 1.0 | YSWF1 | 2.0 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 3 | D | METER_DATA | GEN_DUID | 1.0 | WOOLNTH1 | 36.149998 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 4 | D | METER_DATA | GEN_DUID | 1.0 | WAUBRAWF | 41.007999 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
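Note that `AveMWH_READING` is identical for every DUID within a timestamp: the resample averages all generators together, and each bin covers the 30 minutes that start at its label. If the intent is a per-generator mean for the half hour ending at each timestamp, a grouped resample is one option. The sketch below rests on that assumption about the intended alignment. The readings and DUID names are made up.

```python
import pandas as pd

# Hypothetical 5-minute readings for two made-up DUIDs
stamps = ["2023-01-01 00:05", "2023-01-01 00:10", "2023-01-01 00:15",
          "2023-01-01 00:20", "2023-01-01 00:25", "2023-01-01 00:30"]
readings = pd.DataFrame({
    "INTERVAL_DATETIME": pd.to_datetime(stamps * 2),
    "DUID": ["WF_A"] * 6 + ["WF_B"] * 6,
    "MWH_READING": [1, 2, 3, 4, 5, 6, 10, 10, 10, 10, 10, 16],
})

# closed='right', label='right': the bin (00:00, 00:30] is labelled 00:30,
# so each half-hour label averages the six 5-minute readings ending at it
half_hourly = (
    readings
    .groupby(["DUID", pd.Grouper(key="INTERVAL_DATETIME", freq="30min",
                                 closed="right", label="right")])["MWH_READING"]
    .mean()
    .reset_index(name="AveMWH_READING")
)
print(half_hourly)
```

This yields one row per DUID per half hour, and the result could then be merged on the timestamp in the same way as the existing `INTERVAL` column.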


### 9. Saving dataset `wind` <a name="prelimdata"></a>

In [129]:
# Saving the combined DataFrame to a new CSV file
output_csv_path = '/Users/cececarino/Desktop/PE/Spot price forecasting/finaldataset/NEM Wind Energy.csv'
combined_df.to_csv(output_csv_path, index=False)

print(f"Combined data saved to: {output_csv_path}")

Combined data saved to: /Users/cececarino/Desktop/PE/Spot price forecasting/finaldataset/NEM Wind Energy.csv


### 10. Filtering DUIDs <a name="filterDUIDs"></a>

- DUIDs listed below are already filtered to wind generators in NSW

In [130]:
# List of DUID values to filter (duplicate entries removed)
duid_values = ['BANGOWF1', 'BANGOWF2', 'BOCORWF1', 'BODWF1', 'CAPTL_WF', 'COLWF01', 'CROOKWF2',
               'CRURWF1', 'CULLRGWF', 'FLYCRKWF', 'GULLRWF1', 'GULLRWF2', 'GUNNING1', 'RYEPARK1',
               'SAPHWF1', 'STWF1', 'TARALGA1', 'WRWF1', 'WOODLWN1']

# Filtering the wind data based on the specified DUID values
wind_data = combined_df[combined_df['DUID'].isin(duid_values)]

# Previewing the filtered DataFrame
wind_data.head()

| | I | METER_DATA | GEN_DUID | 1 | DUID | MWH_READING | LASTCHANGED | INTERVAL | AveMWH_READING |
|---|---|---|---|---|---|---|---|---|---|
| 23 | D | METER_DATA | GEN_DUID | 1.0 | CULLRGWF | 20.98 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 26 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 64.551765 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 46 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 46.528053 | 2022/10/01 04:55:03 | 2022-10-01 05:00:00 | 10.212102 |
| 49 | D | METER_DATA | GEN_DUID | 1.0 | CULLRGWF | 18.16 | 2022/10/01 04:55:03 | 2022-10-01 05:00:00 | 10.212102 |
| 76 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 42.270176 | 2022/10/01 05:25:03 | 2022-10-01 05:30:00 | 10.105626 |
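Only two DUIDs appear in the preview above, so it may be worth checking which of the requested DUIDs never occur in the data. A set difference is enough for this. `missing_duids` is a hypothetical helper, and `EXAMPLE_WF` is a made-up identifier used only to show a miss.

```python
def missing_duids(requested, observed):
    """Return requested DUIDs that never appear in the observed values."""
    return sorted(set(requested) - set(observed))

# Illustrative inputs: two DUIDs from the list above plus a made-up one
requested = ["CAPTL_WF", "CULLRGWF", "EXAMPLE_WF"]
observed = ["CAPTL_WF", "CULLRGWF", "CAPTL_WF", "BARCSF1"]

print(missing_duids(requested, observed))
```

With the notebook's objects, the call would be `missing_duids(duid_values, combined_df['DUID'])`.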


### 11. Final wind dataset <a name="winddata"></a>

In [131]:
# Saving the filtered NSW wind DataFrame to a new CSV file
output_csv_path = '/Users/cececarino/Desktop/PE/Spot price forecasting/finaldataset/NSW Wind.csv'
wind_data.to_csv(output_csv_path, index=False)

print(f"Combined data saved to: {output_csv_path}")

Combined data saved to: /Users/cececarino/Desktop/PE/Spot price forecasting/finaldataset/NSW Wind.csv


In [132]:
wind_data.head()

| | I | METER_DATA | GEN_DUID | 1 | DUID | MWH_READING | LASTCHANGED | INTERVAL | AveMWH_READING |
|---|---|---|---|---|---|---|---|---|---|
| 23 | D | METER_DATA | GEN_DUID | 1.0 | CULLRGWF | 20.98 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 26 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 64.551765 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 |
| 46 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 46.528053 | 2022/10/01 04:55:03 | 2022-10-01 05:00:00 | 10.212102 |
| 49 | D | METER_DATA | GEN_DUID | 1.0 | CULLRGWF | 18.16 | 2022/10/01 04:55:03 | 2022-10-01 05:00:00 | 10.212102 |
| 76 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 42.270176 | 2022/10/01 05:25:03 | 2022-10-01 05:30:00 | 10.105626 |


### 12. Merging with price dataset  <a name="mergedata"></a>

In [133]:
# Path to the CSV files
wind_file_path = '/Users/cececarino/Desktop/PE/Spot price forecasting/finaldataset/NSW Wind.csv'
rrp_file_path = '/Users/cececarino/Desktop/PE/Spot price forecasting/[Final] Datasets/NSW RRP(2022-2023).csv'

# Read CSV files into DataFrames
wind_df = pd.read_csv(wind_file_path)
rrp_df = pd.read_csv(rrp_file_path)

# Convert 'INTERVAL' to datetime in wind_df
wind_df['INTERVAL'] = pd.to_datetime(wind_df['INTERVAL'])
# Convert 'SETTLEMENTDATE' to datetime in rrp_df
rrp_df['SETTLEMENTDATE'] = pd.to_datetime(rrp_df['SETTLEMENTDATE'])

# Perform the inner join on 'INTERVAL' (wind) and 'SETTLEMENTDATE' (price)
merged_df = pd.merge(wind_df, rrp_df, how='inner', left_on='INTERVAL', right_on='SETTLEMENTDATE')

# Previewing the merged DataFrame
merged_df.head()

| | I | METER_DATA | GEN_DUID | 1 | DUID | MWH_READING | LASTCHANGED | INTERVAL | AveMWH_READING | REGION | SETTLEMENTDATE | TOTALDEMAND | RRP | PERIODTYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | D | METER_DATA | GEN_DUID | 1.0 | CULLRGWF | 20.98 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 | NSW1 | 2022-10-01 04:30:00 | 6274.24 | 152.14 | TRADE |
| 1 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 64.551765 | 2022/10/01 04:25:03 | 2022-10-01 04:30:00 | 11.805815 | NSW1 | 2022-10-01 04:30:00 | 6274.24 | 152.14 | TRADE |
| 2 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 46.528053 | 2022/10/01 04:55:03 | 2022-10-01 05:00:00 | 10.212102 | NSW1 | 2022-10-01 05:00:00 | 6417.1 | 156.11 | TRADE |
| 3 | D | METER_DATA | GEN_DUID | 1.0 | CULLRGWF | 18.16 | 2022/10/01 04:55:03 | 2022-10-01 05:00:00 | 10.212102 | NSW1 | 2022-10-01 05:00:00 | 6417.1 | 156.11 | TRADE |
| 4 | D | METER_DATA | GEN_DUID | 1.0 | CAPTL_WF | 42.270176 | 2022/10/01 05:25:03 | 2022-10-01 05:30:00 | 10.105626 | NSW1 | 2022-10-01 05:30:00 | 6423.95 | 162.0 | TRADE |
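An inner join silently drops wind rows that have no matching price timestamp. To see how many rows that affects, one option is a left join with pandas' `indicator` and `validate` arguments, filtering afterwards. The sketch below uses small made-up frames that share only the key column names with the notebook.

```python
import pandas as pd

# Made-up data; only the key column names follow the notebook
wind = pd.DataFrame({
    "INTERVAL": pd.to_datetime(["2023-01-01 00:30", "2023-01-01 01:00", "2023-01-01 01:30"]),
    "MWH_READING": [5.0, 6.0, 7.0],
})
prices = pd.DataFrame({
    "SETTLEMENTDATE": pd.to_datetime(["2023-01-01 00:30", "2023-01-01 01:00"]),
    "RRP": [100.0, 110.0],
})

checked = pd.merge(
    wind, prices,
    how="left",
    left_on="INTERVAL", right_on="SETTLEMENTDATE",
    validate="many_to_one",  # raises if a price timestamp appears more than once
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Counting wind rows without a price, then keeping the matched rows only
unmatched = int((checked["_merge"] == "left_only").sum())
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(f"{unmatched} wind row(s) had no matching price")
```

The filtered `merged` frame holds the same rows an inner join would return, with the added benefit that the number of dropped rows is known.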


### 13. Final dataset to use for OLS  <a name="finaloutput"></a>

In [134]:
final_output_path = '/Users/cececarino/Desktop/PE/Spot price forecasting/[Final] Datasets/NSW Wind(2022-2023).csv'
merged_df.to_csv(final_output_path, index=False)

print(f"Combined data saved to: {final_output_path}")

Combined data saved to: /Users/cececarino/Desktop/PE/Spot price forecasting/[Final] Datasets/NSW Wind(2022-2023).csv
