# Question 2 - "What is the shape of the intra-daily carbon emission intensity curve for the NEM and for the five regional markets? What can be concluded about trends in this pattern? Considering the typical consumption pattern of a household or company, which of them is more likely to have a lower emission intensity per kWh of electricity consumption?"

### 📦 Environment Setup: Installing Required Libraries

Before fetching and analyzing the emissions data, install the Python packages this notebook depends on. They are used to authenticate securely with the CSIRO Senaps platform, retrieve the emissions data, and handle large datasets efficiently:

- `requests`: makes HTTP requests to the CSIRO API.
- `requests-oauth2client`: handles OAuth2 authentication.
- `polars`: a high-performance DataFrame library used to parse and save large datasets.

Run the following cell to ensure all dependencies are installed:

In [1]:
# %pip installs into the environment of the running kernel
%pip install requests requests-oauth2client polars



### 📊 Justification for Dataset and Timeframe Selection

To analyze the **intra-daily carbon emission intensity patterns** for the NEM and its five regional markets (NSW, QLD, VIC, SA, TAS), I selected the **CSIRO regional emissions dataset** because:

- ✅ **High Temporal Resolution**: It provides 5-minute interval emissions data, enabling a detailed intra-day analysis.
- ✅ **Regional Breakdown**: Each state is recorded individually, aligning perfectly with the need to compare all five regional markets.
- ✅ **Official & Reliable**: Data is sourced via the Eratos platform from CSIRO, ensuring credibility and consistency.

### ⏳ Why 2019 to April 2025?

- 🔄 **Captures Seasonal and Policy Variations**: A span of more than six years helps smooth out short-term fluctuations and includes key changes in grid mix and renewable uptake.
- 🌐 **COVID-19 and Post-pandemic Trends**: The timeframe includes pre-COVID, lockdown, and recovery periods, offering valuable insight into behavioral shifts.
- 📈 **Improves Trend Analysis**: Extending beyond a single year avoids overfitting conclusions to temporary anomalies or weather effects.
- 🧮 **Enables Better User-Type Comparison**: The broader the data, the more representative it becomes for distinguishing between household and company emission intensity patterns across different hours.

This dataset and timeframe offer the granularity and context necessary to draw meaningful conclusions about daily carbon intensity patterns and their implications for household vs company energy consumption behavior.


### 📥 Download Emissions Data (2019–Apr 2025)

Extract and save 5-minute interval carbon emission data for all NEM regions using the CSIRO API.


In [2]:
import json
import polars as pl
import requests
import tempfile

from pathlib import Path
from requests_oauth2client import OAuth2Client, OAuth2ClientCredentialsAuth
from typing import List

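import os

# ✅ Read credentials from the environment instead of hardcoding secrets in the notebook.
# SENAPS_CLIENT_ID and SENAPS_CLIENT_SECRET are assumed variable names; set them before running.
CLIENT_ID = os.environ["SENAPS_CLIENT_ID"]
CLIENT_SECRET = os.environ["SENAPS_CLIENT_SECRET"]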

class MyEmissionsData(requests.Session):
    _auth_url = "https://login.microsoftonline.com/a815c246-a01f-4d10-bc3e-eeb6a48ef48a/oauth2/v2.0/token"
    _senaps_url = "https://senaps.eratos.com/api/sensor/v2/observations"

    def __init__(self, client_id: str = CLIENT_ID, client_secret: str = CLIENT_SECRET):
        super().__init__()
        oauth2client = OAuth2Client(self._auth_url, (client_id, client_secret))
        self.auth = OAuth2ClientCredentialsAuth(oauth2client, scope=f"{client_id}/.default")
        self.headers = {
            "accept": "*/*",
            "content-type": "application/json",
        }

    def download_and_parse_data(self, *, write_path: Path, regions: List[str], start: str, end: str, limit: int = 99_999_999):
        if not regions:
            raise ValueError("`regions` list cannot be empty")
        parser = self._parse_single_stream if len(regions) == 1 else self._parse_multiple_streams

        streamid = ",".join(
            f"csiro.energy.dch.agshop.regional_global_emissions.{region}" for region in regions
        )

        with tempfile.TemporaryDirectory() as tmpdir:
            fname = Path(tmpdir) / "response.json"
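            # stream=True lets iter_content write the body to disk in chunks
            # instead of loading the whole response into memory first.
            with self.get(
                url=self._senaps_url,
                params=dict(streamid=streamid, start=start, end=end, limit=limit),
                stream=True,
            ) as response: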
                response.raise_for_status()
                with open(fname, "wb") as fp:
                    for chunk in response.iter_content(chunk_size=1024):
                        fp.write(chunk)

            write_path.parent.mkdir(parents=True, exist_ok=True)
            with open(fname, "r") as fp:
                data = json.load(fp)
                parser(data, write_path)

    @staticmethod
    def _parse_single_stream(data, write_path):
        col_name = data["_embedded"]["stream"]["_links"]["self"]["id"]
        (
            pl.LazyFrame([
                {"timestamp": elem["t"], col_name: elem["v"]["v"]}
                for elem in data["results"]
            ])
            .with_columns(
                pl.col("timestamp")
                .str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.fZ", strict=True)
                .cast(pl.Datetime(time_unit="ms", time_zone="UTC"))
            )
            .sort("timestamp")
            .sink_parquet(write_path)
        )

    @staticmethod
    def _parse_multiple_streams(data, write_path):
        (
            pl.LazyFrame([
                {
                    "timestamp": key,
                    "struct": {k: v["v"] for k, v in val.items()}
                }
                for elem in data["results"]
                for key, val in elem.items()
            ])
            .unnest("struct")
            .with_columns(
                pl.col("timestamp")
                .str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.fZ", strict=True)
                .cast(pl.Datetime(time_unit="ms", time_zone="UTC"))
            )
            .sort("timestamp")
            .sink_parquet(write_path)
        )

# ✅ Download data from 2019 to April 2025
if __name__ == "__main__":
    e = MyEmissionsData()
    e.download_and_parse_data(
        regions=["nsw", "qld", "vic", "sa", "tas"],
        start="2019-01-01T00:00:00.000Z",
        end="2025-04-30T23:59:59.999Z",
        write_path=Path.home() / "Desktop" / "dataset_merge" / "Question 2" / "csiro_2019_to_2025.parquet"
    )
    print("✅ CSIRO data from 2019 to April 2025 downloaded successfully.")


✅ CSIRO data from 2019 to April 2025 downloaded successfully.
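
The multi-region parser above implies a particular response layout: each element of `results` maps one timestamp to a dictionary of stream IDs, and each stream ID holds its reading under a `"v"` key. The sketch below flattens that layout with plain Python to make the reshaping step easier to follow. The layout is an assumption inferred from `_parse_multiple_streams`, not taken from the Senaps documentation, and the sample payload is invented for illustration (the values are not real emissions readings). Unlike the parser, which keeps the full stream ID as the column name, this sketch shortens it to the region suffix.

```python
from typing import Any

# Hypothetical payload mirroring the shape that _parse_multiple_streams expects.
# The numbers are placeholders, not real CSIRO readings.
sample_response: dict[str, Any] = {
    "results": [
        {
            "2019-01-01T00:00:00.000Z": {
                "csiro.energy.dch.agshop.regional_global_emissions.nsw": {"v": 0.8},
                "csiro.energy.dch.agshop.regional_global_emissions.tas": {"v": 0.1},
            }
        },
        {
            "2019-01-01T00:05:00.000Z": {
                "csiro.energy.dch.agshop.regional_global_emissions.nsw": {"v": 0.7},
            }
        },
    ]
}


def flatten_multi_stream(data: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn the nested response into one flat row per timestamp."""
    rows = []
    for elem in data["results"]:
        for timestamp, streams in elem.items():
            row = {"timestamp": timestamp}
            # Use the last dotted component of the stream ID (the region) as the column name.
            for stream_id, reading in streams.items():
                row[stream_id.rsplit(".", 1)[-1]] = reading["v"]
            rows.append(row)
    return rows


rows = flatten_multi_stream(sample_response)
for row in rows:
    print(row)
```

A region that is absent for a given timestamp, as `tas` is in the second row, simply has no key in that row, so downstream code should expect missing values.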


### 🔄 Converting Parquet to CSV

While `.parquet` files are optimized for storage and speed, especially in big data environments, `.csv` files are human-readable and open directly in traditional data analysis tools such as Excel, and they load in pandas or Streamlit without an extra Parquet engine.

For the purpose of this project:
- 🔍 I may need to visually inspect and quickly browse data samples.
- 📊 Some visualization libraries or dashboards (e.g., Streamlit or web tools) integrate more smoothly with `.csv` inputs.
- 🤝 Easier sharing with collaborators who may not be familiar with `.parquet`.

Hence, converting the emissions dataset to `.csv` ensures better accessibility and usability for analysis and presentation.


In [3]:
import polars as pl
from pathlib import Path

# Load the .parquet file
parquet_path = Path.home() / "Desktop" / "dataset_merge" / "Question 2" / "csiro_2019_to_2025.parquet"
df = pl.read_parquet(parquet_path)

# Define the output path for CSV
output_path = Path.home() / "Desktop" / "dataset_merge" / "Question 2" / "csiro_2019_to_2025.csv"

# Save as CSV
df.write_csv(output_path)

print(f"✅ CSV file saved to: {output_path}")

✅ CSV file saved to: /Users/nafis/Desktop/dataset_merge/Question 2/csiro_2019_to_2025.csv


### ✅ Validating Latest Available Timestamp in Dataset

To confirm the most recent data point included in the dataset, the following command was used:

In [4]:
df['timestamp'].max()


datetime.datetime(2025, 4, 30, 23, 55, tzinfo=zoneinfo.ZoneInfo(key='UTC'))

This confirms that the most recent data point in the dataset is from April 30, 2025 at 23:55 UTC, which matches the intended retrieval range (January 2019 to the end of April 2025). In particular:

- ✅ The CSIRO API returned data up to the final 5-minute interval of April 2025 (in UTC).
- ✅ No trailing data from 2025 is missing, as might have been the case with earlier attempts.

The analysis therefore works with the most recent emissions data retrieved from the CSIRO platform for this period.
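
Checking the maximum timestamp validates only the end of the range. As a complementary sketch, the helpers below look for gaps in the expected 5-minute cadence and convert a UTC timestamp to NEM market time, which is AEST (UTC+10) all year with no daylight saving. The code uses only the standard library and assumes the timestamps have been pulled into a sorted Python list, for example via `df["timestamp"].to_list()`; the function names are illustrative and not part of the original notebook.

```python
from datetime import datetime, timedelta, timezone

EXPECTED_STEP = timedelta(minutes=5)
# NEM market time is AEST (UTC+10) year-round, so a fixed offset is sufficient.
MARKET_TZ = timezone(timedelta(hours=10), name="AEST")


def find_gaps(timestamps: list[datetime]) -> list[tuple[datetime, datetime]]:
    """Return (previous, current) pairs whose spacing exceeds the 5-minute step."""
    return [
        (prev, curr)
        for prev, curr in zip(timestamps, timestamps[1:])
        if curr - prev > EXPECTED_STEP
    ]


def to_market_time(ts: datetime) -> datetime:
    """Convert a timezone-aware timestamp to NEM market time."""
    return ts.astimezone(MARKET_TZ)


# Small hand-made series with the 00:10 reading deliberately left out.
start = datetime(2025, 4, 30, 0, 0, tzinfo=timezone.utc)
series = [start + i * EXPECTED_STEP for i in (0, 1, 3, 4)]
gaps = find_gaps(series)
print(gaps)

# The latest timestamp reported above, expressed in market time.
latest_utc = datetime(2025, 4, 30, 23, 55, tzinfo=timezone.utc)
latest_market = to_market_time(latest_utc)
print(latest_market.isoformat())  # 2025-05-01T09:55:00+10:00
```

Because the stored timestamps are in UTC, the hour of day used for any intra-daily profile should be taken after this conversion; otherwise the curve is shifted by ten hours relative to market time.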