### Extract and Filter Solar Proton Event Data (2010 and above)

This section loads a catalog of solar proton events from a NOAA SEP dataset stored in an AWS S3 bucket. It filters the data to include only events starting in the year 2010 or later, and extracts the **Begin Time** and **Flare Peak Time** columns. This cleaned dataset will be used to query HEK for matching flare metadata.

```python
import os
import csv
import json
import requests
import pandas as pd
from datetime import datetime, timedelta
```

#### Define Output Parameters
```python
OUTPUT_JSON_DIR = "hek_events"
CSV_OUTPUT_FILE = "flare_summary_final/flare_hek_peaks_data.csv"
DURATION_MINUTES = 15  # Time window (in minutes) for HEK event queries
```

#### Load SEP Catalog from S3
```python
url = "https://helioconverter-web-application.s3.amazonaws.com/sep_catalogs/2025-07-01T21-26-54_solar_proton_events.csv"
df = pd.read_csv(url, dtype=str)  # Load all fields as strings to preserve timestamp formatting
```

#### Filter Events to Include Only Year 2010 and Above
```python
# String comparison: this relies on every value starting with a four-digit year
df = df[df["Begin Time Yr M/D (UTC)"] >= "2010"]
```
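The string comparison above only works while every value begins with a four-digit year. As a minimal sketch of a more explicit check, the helper below (the name `begins_in_or_after` is hypothetical, and the sample values are made up) extracts the leading year and compares it numerically, treating missing or malformed cells as non-matching.

```python
import re

def begins_in_or_after(begin_time, min_year=2010):
    """Return True if the string starts with a four-digit year >= min_year."""
    if not isinstance(begin_time, str):  # e.g. NaN for a missing cell
        return False
    match = re.match(r"\s*(\d{4})", begin_time)
    return match is not None and int(match.group(1)) >= min_year

# Made-up sample values, for illustration only
samples = ["2012-01-23T05:30:00", "1997-11-04T08:30:00", "", None]
kept = [s for s in samples if begins_in_or_after(s)]
```

With the DataFrame loaded above, the same predicate could be applied as `df[df["Begin Time Yr M/D (UTC)"].map(begins_in_or_after)]`.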

#### Extract Begin and Flare Peak Times
```python
df_filtered = df[["Begin Time Yr M/D (UTC)", "Flare Peak Time (UTC)"]].copy()
df_filtered["Flare Peak Time (UTC)"] = df_filtered["Flare Peak Time (UTC)"].fillna("")
```

#### Save Cleaned Data to CSV
```python
os.makedirs("flare_summary_final", exist_ok=True)  # to_csv does not create missing folders
df_filtered.to_csv("flare_summary_final/noaa_flare_peaks_2010_and_above.csv", index=False)
```

This CSV now contains only the relevant timing data and will serve as input for HEK queries in the next step.


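```python
import os
import csv
import json
import requests
import pandas as pd
from datetime import datetime, timedelta, timezone

# NOTE: the URL assumes a catalog file prefixed with today's UTC date (YYYY-MM-DD).
# If the bucket only holds a fully timestamped snapshot, as in the walkthrough
# above, use that file name instead.
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")

# Load CSV without parsing dates (all fields stay as strings)
url = f"https://helioconverter-web-application.s3.amazonaws.com/sep_catalogs/{timestamp}_solar_proton_events.csv"
df = pd.read_csv(url, dtype=str)

# Filter by year (string comparison; relies on values starting with a four-digit year)
df = df[df["Begin Time Yr M/D (UTC)"] >= "2010"]

# Select relevant columns and replace missing flare peaks with empty strings
df_filtered = df[["Begin Time Yr M/D (UTC)", "Flare Peak Time (UTC)"]].copy()
df_filtered["Flare Peak Time (UTC)"] = df_filtered["Flare Peak Time (UTC)"].fillna("")

# Create the output folder first; to_csv does not create missing folders
os.makedirs("flare_summary_final", exist_ok=True)
df_filtered.to_csv("flare_summary_final/noaa_flare_peaks_2010_and_above.csv", index=False)
```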

### HEK Metadata Extraction Based on Flare Peaks

This section uses the NOAA flare peak times to extract additional metadata from the Heliophysics Event Knowledgebase (HEK) via the Helioviewer API.

---

#### 1. Convert Flare Peak Strings to Timestamps
Only flare peak entries with non-empty values are converted into `datetime` format. This is required for time-based queries.
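A minimal sketch of this conversion, assuming the peak times use the `%Y-%m-%dT%H:%M:%S` layout: the hypothetical helper `parse_peak_times` skips blank or missing entries and collects malformed strings separately instead of raising, so one bad row does not stop the run. The sample values are made up.

```python
from datetime import datetime

def parse_peak_times(values, fmt="%Y-%m-%dT%H:%M:%S"):
    """Parse non-empty strings into datetimes; return (parsed, skipped)."""
    parsed, skipped = [], []
    for value in values:
        if not isinstance(value, str) or not value.strip():
            continue  # blank or missing flare peak: nothing to query
        try:
            parsed.append(datetime.strptime(value.strip(), fmt))
        except ValueError:
            skipped.append(value)  # keep malformed entries for inspection
    return parsed, skipped

# Made-up sample values, for illustration only
parsed, skipped = parse_peak_times(["2012-03-07T00:24:00", "", "not a time", None])
```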

#### 2. Create Output Folder
An output directory is created (if it doesn't already exist) to store the HEK event data returned as `.json` files, so that responses already on disk can be reused instead of being downloaded again.

#### 3. Query the Helioviewer HEK API
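For each flare peak, the API is queried over a 15-minute window (`DURATION_MINUTES`) that starts at the listed peak time and extends forward. This helps capture flares that are recorded slightly after the listed peak time. As coded, the window is not centered on the peak, so covering the minutes before the peak as well would require moving the start time back by the same amount.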
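The query window can be built in more than one way. The sketch below contrasts a forward window (peak to peak + N minutes) with a symmetric one (peak ± N minutes). The helper `query_window` and its `symmetric` flag are hypothetical, and the peak time is a made-up value.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def query_window(peak, minutes=15, symmetric=False):
    """Return (start, end) strings for a forward or symmetric window."""
    delta = timedelta(minutes=minutes)
    start = peak - delta if symmetric else peak
    end = peak + delta
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)

peak = datetime(2012, 3, 7, 0, 24, 0)  # made-up peak time
forward = query_window(peak)                   # peak .. peak + 15 min
centered = query_window(peak, symmetric=True)  # peak - 15 min .. peak + 15 min
```

A symmetric window doubles the time span queried, so it may return more unrelated events that then have to be filtered out.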

#### 4. Save the Response Locally
Each response is saved as a JSON file named with the flare timestamp, allowing you to inspect individual results later and avoid repeated downloads during debugging.
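One way to express this caching step is a small load-or-fetch helper. The sketch below is illustrative only: `load_or_fetch` is a hypothetical name, and the download is passed in as a callable so the caching logic stays independent of any particular API.

```python
import json
import os

def load_or_fetch(path, fetch):
    """Return cached JSON from `path`, or call `fetch()`, cache it, and return it."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    data = fetch()  # only reached when no cached file exists
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    return data
```

Calling `load_or_fetch` twice with the same path triggers at most one download; the second call reads the file written by the first.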

#### 5. Parse the Returned Event Data
The returned JSON is parsed to extract solar flare records (`type = FL`). Only flares are kept from the HEK results, even if other event types are present.
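The filtering can be sketched as a generator that walks both layouts this notebook handles: records that nest events under `groups` → `data`, and flat records that carry `type` directly. The response shape is an assumption taken from this notebook's parsing logic rather than from the API documentation, and the sample data is made up.

```python
def iter_flares(records):
    """Yield every event with type "FL" from nested or flat records."""
    for record in records:
        if not isinstance(record, dict):
            continue  # ignore anything that is not an event record
        if "groups" in record:
            for group in record.get("groups", []):
                for item in group.get("data", []):
                    if item.get("type") == "FL":
                        yield item
        elif record.get("type") == "FL":
            yield record

# Made-up sample response, for illustration only
sample = [
    {"groups": [{"data": [
        {"type": "FL", "event_peaktime": "2012-03-07T00:24:00"},
        {"type": "AR"},
    ]}]},
    {"type": "FL", "event_peaktime": "2012-03-07T01:14:00"},
    {"type": "CH"},
]
flares = list(iter_flares(sample))
```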

#### 6. Extract Key Metadata Fields
The script pulls out the most relevant physical and observational properties of each flare, including:
- Start, peak, and end time
- Bounding boxes and coordinates
- Instrument and observatory metadata
- Peak flux (if available)

#### 7. Store the Full Flare Metadata to CSV
All extracted HEK flare records are written to a CSV file. Time fields are placed at the beginning of the file for clarity, followed by other metadata fields.
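The column ordering can be sketched with `csv.DictWriter`. The hypothetical helper `ordered_fieldnames` puts the three time fields first and then appends every other key in first-seen order, taking the union across all rows so that records with differing keys still fit one header. The rows are made-up examples written to an in-memory buffer.

```python
import csv
import io

TIME_FIELDS = ["event_starttime", "event_peaktime", "event_endtime"]

def ordered_fieldnames(rows):
    """Time fields first, then all remaining keys in first-seen order."""
    others = []
    for row in rows:
        for key in row:
            if key not in TIME_FIELDS and key not in others:
                others.append(key)
    return TIME_FIELDS + others

# Made-up rows, for illustration only
rows = [
    {"obs_instrument": "AIA", "event_peaktime": "2012-03-07T00:24:00",
     "event_starttime": "2012-03-07T00:02:00", "event_endtime": "2012-03-07T00:40:00"},
    {"event_starttime": "2012-03-07T01:05:00", "event_peaktime": "2012-03-07T01:14:00",
     "event_endtime": "2012-03-07T01:23:00", "fl_peakflux": None},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=ordered_fieldnames(rows), restval="")
writer.writeheader()
writer.writerows(rows)  # missing keys are filled with restval
header = buffer.getvalue().splitlines()[0]
```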

---

This HEK metadata will later be useful for identifying flare positions, bounding boxes, and cross-matching with SDO/AIA image data for visualization or ML applications.


```python
# Constants
OUTPUT_JSON_DIR = "hek_events"
CSV_OUTPUT_FILE = "flare_summary_final/flare_hek_peaks_data.csv"
DURATION_MINUTES = 15  # Length of the HEK query window, in minutes

# Convert flare peak times to datetime (skip blanks)
flare_peaks_dt = [
    datetime.strptime(t, "%Y-%m-%dT%H:%M:%S")
    for t in df_filtered["Flare Peak Time (UTC)"] if t
]

# Create output folders (JSON cache and CSV destination)
os.makedirs(OUTPUT_JSON_DIR, exist_ok=True)
os.makedirs(os.path.dirname(CSV_OUTPUT_FILE), exist_ok=True)

def fetch_hek_events(timestamp, filename, duration_minutes=DURATION_MINUTES):
    """Query HEK events from `timestamp` to `timestamp + duration_minutes`
    and save the JSON response to `filename`. Returns [] on failure."""
    url = "https://api.helioviewer.org/v2/events/"
    start_time_str = timestamp.strftime("%Y-%m-%dT%H:%M:%SZ")
    end_time_str = (timestamp + timedelta(minutes=duration_minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")

    response = requests.get(
        url,
        params={"startTime": start_time_str, "endTime": end_time_str, "sources": "HEK"},
        timeout=60,  # avoid hanging indefinitely on a stalled connection
    )
    if response.status_code == 200:
        data = response.json()
        with open(filename, "w") as f:
            json.dump(data, f, indent=2)
        print(f"HEK data saved to {filename} for {start_time_str}")
        return data

    print(f"Failed to fetch HEK events for {start_time_str} (HTTP {response.status_code})")
    return []

def parse_hek_events(parsed_data, filename):
    """Extract flare (type FL) records from an already-loaded HEK response.
    `filename` is used only for the log message."""
    shared_fields = [
        "boundbox_c1ll", "boundbox_c1ur", "boundbox_c2ll", "boundbox_c2ur",
        "event_starttime", "event_endtime", "hpc_coord", "hpc_bbox", "hpc_radius",
        "hv_hpc_x_final", "hv_hpc_y_final", "hv_labels_formatted", "frm_name",
        "obs_observatory", "obs_instrument", "obs_channelid", "frm_url", "event_type"
    ]
    flare_fields = shared_fields + ["fl_peakflux", "fl_peakfluxunit", "event_peaktime"]
    hek_flares = []

    for record in parsed_data:
        if not isinstance(record, dict):
            continue  # skip anything that is not an event record
        if "groups" in record:
            # Nested layout: record -> groups -> data -> events
            for group in record.get("groups", []):
                for item in group.get("data", []):
                    if item.get("type") == "FL":
                        hek_flares.append({k: item.get(k, None) for k in flare_fields})
        elif record.get("type") == "FL":
            # Flat layout: the record itself is the event
            hek_flares.append({k: record.get(k, None) for k in flare_fields})

    print(f"Found {len(hek_flares)} flares in {filename}")
    return hek_flares

# Process each flare peak, reusing cached JSON instead of downloading again
all_flares = []
for peak_time in flare_peaks_dt:
    iso_str = peak_time.strftime("%Y-%m-%dT%H-%M-%S")
    json_file = os.path.join(OUTPUT_JSON_DIR, f"hek_{iso_str}.json")

    if os.path.exists(json_file):
        print(f"Skipping download of {json_file} (already exists)")
        with open(json_file, "r") as f:
            data = json.load(f)
    else:
        data = fetch_hek_events(peak_time, json_file, duration_minutes=DURATION_MINUTES)

    # Parse the data already in memory; no second request is made
    flares = parse_hek_events(data, json_file)
    all_flares.extend(flares)

# Save to CSV, with the time fields first
if all_flares:
    time_fields = ["event_starttime", "event_peaktime", "event_endtime"]
    other_fields = [k for k in all_flares[0].keys() if k not in time_fields]
    fieldnames = time_fields + other_fields

    with open(CSV_OUTPUT_FILE, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_flares)

    print(f"Wrote flare metadata for {len(all_flares)} events to {CSV_OUTPUT_FILE}")
else:
    print("No flare metadata extracted.")
```