# Webscraper for AEMO wind data

# Table of contents
 1.    [Necessary packages for webscraper](#packages)
 <br><br>
 2.    [Downloading ZIP files](#dataprep)

       a. [Data information](#datainfo)
     
       b. [Webpage structure](#htmlcode)
       <br>
 3.   [Data Cleaning](#dataclean)
 
       a. [Extracting ZIP files](#extractzip)
       
       b. [Remove used ZIP files](#removezips)
       
       c. [Moving subfolders to main folder](#extractsubfolder)
       
       d. [Removing empty folders](#removenull1)
       
       e. [Extract CSV files](#extractcsv)
       
       f. [Removing empty folders](#removenull2)
       
       g. [Combining daily data](#combinedata)
       
       h. [Filtering to NSW data](#filternsw)
       
       i. [Saving final dataset](#finaloutput)
       

## A. Necessary packages  <a name="packages"></a>

In [1]:
from bs4 import BeautifulSoup
import os
import pandas as pd
import requests
import shutil
from urllib.parse import urljoin
import zipfile

## B. Downloading ZIP files  <a name="dataprep"></a>

### 1. Data information <a name="datainfo"></a>

- **Period covered**: 13 months (October 2022 - October 2023)

- **Purpose**: input to an OLS regression that estimates how wind energy affects spot prices

- **Source**: [AEMO Reports](https://nemweb.com.au/Reports/Archive/WDR_CAPACITY_NO_SCADA/)

In [2]:
# Target URL for webscraper

url = 'https://nemweb.com.au/Reports/Archive/WDR_CAPACITY_NO_SCADA/'
page = requests.get(url)

# Parsing the page with BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
pre_element = soup.find('pre')

# Target directory 

target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Creating directory if not existing

if not os.path.exists(target_directory):
    os.makedirs(target_directory)

if pre_element:
    
    # Finding all anchor elements within the pre-element
    
    links = pre_element.find_all('a')

    # Downloading every linked ZIP file
    
    for link in links:
        href = link.get('href')
        if href and href.endswith('.zip'):
            zip_url = urljoin(url, href)  # Constructing the absolute URL
            print(f"Downloading ZIP file: {zip_url}")

            # Downloading the ZIP file content
            
            zip_response = requests.get(zip_url)
            
            if zip_response.status_code == 200:
                
                # Saving the ZIP file to a local directory
                
                zip_filename = os.path.join(target_directory, os.path.basename(href))
                with open(zip_filename, 'wb') as zip_file:
                    zip_file.write(zip_response.content)

                print(f"ZIP file '{zip_filename}' downloaded successfully.")
                
            else:
                print(f"Failed to download ZIP file '{href}'. Status code: {zip_response.status_code}")
                print(f"Error content: {zip_response.text}")
else:
    print("Pre element not found on the page.")

Downloading ZIP file: https://nemweb.com.au/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20221001.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20221001.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20221101.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20221101.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20221201.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20221201.zip' downloaded successfully.
Downloading ZIP file: https://nemweb.com.au/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20230101.zip
ZIP file '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/w

### 2. Webpage structure <a name="htmlcode"></a>
- Prints the HTML of the webpage so that its structure can be inspected
- The purpose is to identify which elements need to be extracted (here, the `.zip` links inside the `<pre>` element)

In [3]:
print(soup.prettify())

<html>
 <head>
  <title>
   nemweb.com.au - /Reports/Archive/WDR_CAPACITY_NO_SCADA/
  </title>
 </head>
 <body>
  <h1>
   nemweb.com.au - /Reports/Archive/WDR_CAPACITY_NO_SCADA/
  </h1>
  <hr/>
  <pre><a href="/Reports/Archive/">[To Parent Directory]</a><br/><br/>   Thursday, December 1, 2022  1:04 AM        15405 <a href="/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20221001.zip">PUBLIC_WDR_CAPACITY_NO_SCADA_20221001.zip</a><br/>      Sunday, January 1, 2023  1:06 AM        14902 <a href="/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20221101.zip">PUBLIC_WDR_CAPACITY_NO_SCADA_20221101.zip</a><br/>  Wednesday, February 1, 2023  1:05 AM        15375 <a href="/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20221201.zip">PUBLIC_WDR_CAPACITY_NO_SCADA_20221201.zip</a><br/>     Wednesday, March 1, 2023  1:09 AM        15399 <a href="/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20230101.zip">PUBLIC_WDR_CAP
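The listing above uses root-relative `href` values (they start with `/Reports/...`), which is why the download cell builds each address with `urljoin` rather than string concatenation. A minimal sketch of the difference, using one `href` copied from the listing; the variable names are illustrative only:

```python
from urllib.parse import urljoin

base_url = 'https://nemweb.com.au/Reports/Archive/WDR_CAPACITY_NO_SCADA/'

# Root-relative href, copied from the directory listing above
href = '/Reports/Archive/WDR_CAPACITY_NO_SCADA/PUBLIC_WDR_CAPACITY_NO_SCADA_20221001.zip'

# urljoin replaces the whole path of the base URL when the href starts with '/'
absolute_url = urljoin(base_url, href)

# Naive concatenation would repeat the directory path and give a wrong address
naive_url = base_url + href

print(absolute_url)
print(naive_url == absolute_url)
```

The first printed line matches the first URL in the download log of the previous cell; the second line shows that concatenation does not give the same result.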

## C. Data Cleaning <a name="dataclean"></a>

- **Structure of the downloaded ZIP files**:
    - ZIP file (monthly data) -> folder (monthly data) -> ZIP file (daily data) -> folder (daily data) -> CSV
    - To collect the CSV files, the code:
       1. Extracts the monthly ZIP files downloaded from the website
       2. Stores the extracted contents in one folder per month
       3. Extracts the daily ZIP files found inside each monthly folder
       4. Removes the used ZIP files from the local directory
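The nested layout described above (a ZIP inside a ZIP, with the CSV at the bottom) can be illustrated on a toy archive. The sketch below is not the approach used in the cells that follow, which keep extraction and deletion as separate steps; it is a compact alternative in which a hypothetical helper, `extract_nested`, unpacks an archive and then any archives found inside it. The file names are made up for the example.

```python
import io
import os
import tempfile
import zipfile


def extract_nested(zip_path, dest_dir):
    """Extract zip_path into dest_dir, then extract (and delete) any ZIPs found inside."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(dest_dir)
    for root, _dirs, files in os.walk(dest_dir):
        for name in files:
            if name.lower().endswith('.zip'):
                inner_path = os.path.join(root, name)
                # Each inner ZIP is unpacked into a folder named after it
                inner_dest = os.path.splitext(inner_path)[0]
                extract_nested(inner_path, inner_dest)
                os.remove(inner_path)


def make_zip_bytes(members):
    """Build an in-memory ZIP from a {name: bytes} mapping (toy data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buffer.getvalue()


# Toy archive: monthly ZIP -> daily ZIP -> CSV
daily_zip = make_zip_bytes({'DAILY_20221001.CSV': b'C,EXAMPLE\n'})
monthly_zip = make_zip_bytes({'DAILY_20221001.zip': daily_zip})

with tempfile.TemporaryDirectory() as workdir:
    monthly_path = os.path.join(workdir, 'MONTHLY_20221001.zip')
    with open(monthly_path, 'wb') as fh:
        fh.write(monthly_zip)

    extract_nested(monthly_path, os.path.join(workdir, 'MONTHLY_20221001'))

    # Collect every CSV path, relative to the working directory
    csv_files = sorted(
        os.path.relpath(os.path.join(root, name), workdir)
        for root, _dirs, files in os.walk(workdir)
        for name in files
        if name.endswith('.CSV')
    )

print(csv_files)
```

The single CSV ends up two folders deep (monthly folder, then daily folder), which is the structure the later cells flatten.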

### 1. Extracting the ZIP files <a name="extractzip"></a>
- The extracted monthly ZIP files become monthly folders of wind energy data
- Each folder contains the daily wind data as ZIP files

### a. Monthly data ZIP files

In [4]:
main_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# List all files in the target directory
zip_files = [f for f in os.listdir(main_directory) if f.endswith('.zip')]

for zip_filename in zip_files:
    zip_path = os.path.join(main_directory, zip_filename)
    extract_path = os.path.join(main_directory, zip_filename.replace('.zip', ''))

    print(f"Extracting contents of '{zip_filename}' to '{extract_path}'")

    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)

    print("Extraction complete\n")

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231001.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20231001'
Extraction complete

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230901.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230901'
Extraction complete

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221001.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20221001'
Extraction complete

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221201.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20221201'
Extraction complete

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230801.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230801'
Extraction complete

Extracting cont

### b. Daily data in subfolders

In [5]:
# Target directory for ZIPs
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Listing all subdirectories in the target directory
subdirectories = [d for d in os.listdir(target_directory) if os.path.isdir(os.path.join(target_directory, d))]

# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(target_directory, subdirectory)

    # Listing all files in the subdirectory
    zip_files = [f for f in os.listdir(subdirectory_path) if f.endswith('.zip')]

    # Iterating through zip files in the subdirectory
    for zip_filename in zip_files:
        zip_path = os.path.join(subdirectory_path, zip_filename)
        extract_path = os.path.join(subdirectory_path, zip_filename.replace('.zip', ''))

        print(f"Extracting contents of '{zip_filename}' to '{extract_path}'")

        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)

        print(f"Extraction complete for '{zip_filename}' in '{subdirectory}'\n")

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230716_0000000392203062.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701/PUBLIC_WDR_CAPACITY_NO_SCADA_20230716_0000000392203062'
Extraction complete for 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230716_0000000392203062.zip' in 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230722_0000000392736419.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701/PUBLIC_WDR_CAPACITY_NO_SCADA_20230722_0000000392736419'
Extraction complete for 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230722_0000000392736419.zip' in 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'

Extracting contents of 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230725_0000000392971521.zip' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701/PUBLIC_WDR_CAPACITY_NO_SCADA_20230725_0000000392971521

### 2. Removing used ZIPs <a name="removezips"></a>
- Removes the ZIP files from the main directory (the `wind` folder)
- Removes the ZIP files from the subfolders inside the `wind` folder
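Both removals can also be expressed as a single pass over the directory tree. The following is a minimal sketch using `pathlib`, run against a temporary folder with made-up file names; `delete_zips_recursive` is a hypothetical helper and is not one of the functions defined in this notebook.

```python
import tempfile
from pathlib import Path


def delete_zips_recursive(directory):
    """Delete every .zip file under directory, at any depth; return the deleted paths."""
    deleted = []
    # sorted() materialises the matches first, so files are not removed mid-iteration
    for zip_path in sorted(Path(directory).rglob('*.zip')):
        if zip_path.is_file():
            zip_path.unlink()
            deleted.append(zip_path)
    return deleted


with tempfile.TemporaryDirectory() as workdir:
    root = Path(workdir)
    (root / 'monthly').mkdir()
    (root / 'top.zip').write_bytes(b'')
    (root / 'monthly' / 'daily.zip').write_bytes(b'')
    (root / 'monthly' / 'data.CSV').write_text('C,EXAMPLE\n')

    deleted_names = [p.name for p in delete_zips_recursive(root)]
    remaining = sorted(p.name for p in root.rglob('*') if p.is_file())

print(deleted_names)
print(remaining)
```

Only the ZIP files are removed; the CSV in the subfolder is left in place.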


### a. Removing ZIP files from wind folder

In [6]:
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# List all files in the target directory
zip_files = [f for f in os.listdir(target_directory) if f.endswith('.zip')]

for zip_filename in zip_files:
    zip_path = os.path.join(target_directory, zip_filename)

    # Delete the ZIP file
    os.remove(zip_path)
    print(f"Deleted '{zip_filename}'\n")

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231001.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230901.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221001.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221201.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230801.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221101.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230201.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230601.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230401.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230301.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230101.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230501.zip'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230701.zip'



### b. Removing ZIP files from subfolders inside wind folder

In [7]:
def delete_zip_files(directory):
    # Listing all files in the current directory
    files = os.listdir(directory)

    for filename in files:
        file_path = os.path.join(directory, filename)

        if os.path.isdir(file_path):
            # Recursively call the function for subdirectories
            delete_zip_files(file_path)
        elif filename.endswith('.zip'):
            # Deleting the ZIP file
            os.remove(file_path)
            print(f"Deleted '{filename}' in '{directory}'\n")

# Target directory for ZIPs
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Call the function to delete zip files in the main directory and subdirectories
delete_zip_files(target_directory)

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230716_0000000392203062.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230722_0000000392736419.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230725_0000000392971521.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230713_0000000391923147.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230709_0000000391587216.zip' in '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind/PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'

Deleted 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230702_0000000391027632.zip' in '/Users/cececarino/Desktop/PE/Spo

### 3. Moving subfolders to main folder <a name="extractsubfolder"></a>

In [8]:
# Source directory
source_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'

# Getting a list of all subdirectories in the source directory
subdirectories = [d for d in os.listdir(source_directory) if os.path.isdir(os.path.join(source_directory, d))]

# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Getting a list of all subdirectories within each subdirectory
    nested_subdirectories = [d for d in os.listdir(subdirectory_path) if os.path.isdir(os.path.join(subdirectory_path, d))]

    # Moving each nested subdirectory to the main 'wind' directory
    for nested_subdirectory in nested_subdirectories:
        nested_subdirectory_path = os.path.join(subdirectory_path, nested_subdirectory)
        target_path = os.path.join(source_directory, nested_subdirectory)

        # Moving the nested subdirectory
        shutil.move(nested_subdirectory_path, target_path)
        print(f"Moved '{nested_subdirectory}' to '{source_directory}'")

Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230715_0000000392115013' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230708_0000000391509388' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230709_0000000391587216' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230718_0000000392386713' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230702_0000000391027632' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230711_0000000391751545' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230730_0000000393374655' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230727_0000000393131631'

### 4. Removing empty folders <a name="removenull1"></a>

In [9]:
# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Checking if the subdirectory is empty
    if not os.listdir(subdirectory_path):
        # Removing the empty subdirectory
        os.rmdir(subdirectory_path)
        print(f"Removed empty folder: '{subdirectory}'")

Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230701'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230901'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230301'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221001'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231001'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230401'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221101'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230501'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230101'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221201'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230601'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230201'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230801'


### 5. Extract CSV files <a name="extractcsv"></a>

In [10]:
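# Naming source and target directories
source_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/wind'
target_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'

# Creating the target directory if it does not exist, so that shutil.move does not fail
os.makedirs(target_directory, exist_ok=True)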

# Getting a list of all subdirectories in the source directory
subdirectories = [d for d in os.listdir(source_directory) if os.path.isdir(os.path.join(source_directory, d))]

# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Checking if the subdirectory starts with "PUBLIC" and contains CSV files
    if subdirectory.startswith("PUBLIC_"):
        csv_files = [f for f in os.listdir(subdirectory_path) if f.endswith('.CSV')]

        # Moving each CSV file to the target directory
        for csv_file in csv_files:
            csv_file_path = os.path.join(subdirectory_path, csv_file)
            target_path = os.path.join(target_directory, csv_file)

            # Moving the CSV file
            shutil.move(csv_file_path, target_path)
            print(f"Moved '{csv_file}' to '{target_directory}'")

Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231025_0000000400630916.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230922_0000000397868004.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231018_0000000400005603.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221128_0000000375896906.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230910_0000000396861050.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231030_0000000401101223.CSV' to '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind'
Moved 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230408_0000000384503349.CSV' to '/Users/cececarino/Desktop/PE/Spot price f

### 6. Removing empty folders <a name="removenull2"></a>

In [11]:
# Iterating through subdirectories
for subdirectory in subdirectories:
    subdirectory_path = os.path.join(source_directory, subdirectory)

    # Checking if the subdirectory is empty
    if not os.listdir(subdirectory_path):
        # Removing the empty subdirectory
        os.rmdir(subdirectory_path)
        print(f"Removed empty folder: '{subdirectory}'")

Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231025_0000000400630916'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230922_0000000397868004'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231018_0000000400005603'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221128_0000000375896906'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230910_0000000396861050'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20231030_0000000401101223'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230408_0000000384503349'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230924_0000000398022688'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230223_0000000381470057'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20221229_0000000377855116'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230316_0000000382927994'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_20230319_0000000383132463'
Removed empty folder: 'PUBLIC_WDR_CAPACITY_NO_SCADA_

### 7. Combining daily data <a name="combinedata"></a>

In [22]:
# Source directory containing the collected CSV files
source_directory = '/Users/cececarino/Desktop/PE/Spot price forecasting/datasets/collected_wind/'

# List to store the DataFrame read from each CSV file
dfs = []

# Iterating over CSV files in the source directory
for file_name in os.listdir(source_directory):
    if file_name.endswith('.CSV'):
        csv_path = os.path.join(source_directory, file_name)

        # Reading each CSV file into a DataFrame
        df = pd.read_csv(csv_path, skiprows=1)  # Skip the first row; the second row becomes the header

        # Appending the DataFrame to the list
        dfs.append(df)

# Combining all DataFrames into a single DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# Converting CALENDAR_DAY to datetime
combined_df['CALENDAR_DAY'] = pd.to_datetime(combined_df['CALENDAR_DAY'])
combined_df = combined_df.sort_values(by='CALENDAR_DAY', ascending=True)
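To illustrate what `skiprows=1` does in the cell above, here is a small sketch on made-up text. The layout is an assumption inferred from the column names in this notebook's output, not a copy of a real file: a leading `C` record, an `I` record carrying the column names, `D` data records, and a trailing `C,END OF REPORT` record. The capacity values are illustrative only.

```python
import io

import pandas as pd

# Made-up sample in the assumed layout (not real AEMO data)
sample_text = """C,NEMP.WORLD,WDR_CAPACITY_NO_SCADA,AEMO,PUBLIC,2022/10/02,01:00:00
I,DAILY,WDR_NO_SCADA,1,CALENDAR_DAY,REGIONID,MW_CAPACITY_NO_SCADA
D,DAILY,WDR_NO_SCADA,1,2022/10/01 00:00:00,NSW1,31
D,DAILY,WDR_NO_SCADA,1,2022/10/01 00:00:00,QLD1,0
C,END OF REPORT,4
"""

# skiprows=1 drops the first line, so the 'I' line supplies the column names
raw_df = pd.read_csv(io.StringIO(sample_text), skiprows=1)

# The short trailing record is padded with NaN; keeping only 'D' rows removes it
data_df = raw_df[raw_df['I'] == 'D'].copy()
data_df['CALENDAR_DAY'] = pd.to_datetime(data_df['CALENDAR_DAY'])

print(list(raw_df.columns))
print(len(raw_df), len(data_df))
```

If the real files do end with such a trailing record, filtering on the first column before concatenation would keep it out of the combined dataset; whether that is needed depends on the actual file contents.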

### 8. Filtering to NSW region `(REGIONID == NSW1)` <a name="filternsw"></a>

In [23]:
# Obtaining data where REGIONID == NSW1
nsw_data = combined_df[combined_df['REGIONID'] == 'NSW1']

# Displaying obtained data
nsw_data.head()

| | I | DAILY | WDR_NO_SCADA | 1 | CALENDAR_DAY | REGIONID | MW_CAPACITY_NO_SCADA |
|---|---|---|---|---|---|---|---|
| 1046 | D | DAILY | WDR_NO_SCADA | 1.0 | 2022-10-01 | NSW1 | 31.0 |
| 1418 | D | DAILY | WDR_NO_SCADA | 1.0 | 2022-10-02 | NSW1 | 31.0 |
| 751 | D | DAILY | WDR_NO_SCADA | 1.0 | 2022-10-03 | NSW1 | 31.0 |
| 992 | D | DAILY | WDR_NO_SCADA | 1.0 | 2022-10-04 | NSW1 | 31.0 |
| 254 | D | DAILY | WDR_NO_SCADA | 1.0 | 2022-10-05 | NSW1 | 31.0 |


### 9. Saving final dataset `wind` <a name="finaloutput"></a>

In [25]:
# Saving the combined DataFrame to a new CSV file
output_csv_path = '/Users/cececarino/Desktop/PE/Spot price forecasting/finaldataset/wind.csv'
combined_df.to_csv(output_csv_path, index=False)

print(f"Combined data saved to: {output_csv_path}")

Combined data saved to: /Users/cececarino/Desktop/PE/Spot price forecasting/finaldataset/wind.csv
