### JSOC Data Export: Why I Used the Web App Instead of SunPy/VSO

The **JSOC Data Export web application** is an online tool hosted by Stanford University that allows users to retrieve high-quality, calibrated solar data products—such as AIA Level 1 EUV images—from the Solar Dynamics Observatory (SDO).

**SunPy** can query and download SDO data programmatically, for example through the **VSO (Virtual Solar Observatory)**, but in my experience these routes often ran into limitations such as:
- **Unreliable API performance** (timeouts, failed connections, or partial downloads)
- **Limited bulk retrieval support** for large date ranges
- **Rate limiting** or internal server issues during peak times

#### Why I Chose the JSOC Web App
Sometimes, it is simply **more efficient and reliable** to use the JSOC Export tool manually—especially when:
- Retrieving data for well-defined time intervals (e.g., around solar flare events)
- Avoiding timeout errors or broken download links
- Confirming that every requested file is present before moving on, without silent batch failures

#### Manual Workflow Overview
To guarantee clean, complete downloads for each flare event, I followed these manual steps:

1. **Go to the JSOC Data Export web app**:  
   https://jsoc.stanford.edu/ajax/exportdata.html

2. **Fill in the input boxes**:  
   Enter the recordset string for the flare window, such as:  
   ```
   aia.lev1_euv_12s[2014-04-18T12:45:00/180m][131]
   ```

3. **Submit and wait for email**:  
   The app sends a unique export link to your email.

4. **Open the emailed link**:  
   This link leads to a temporary webpage listing all `.fits` files available for download.

5. **View page source**:  
   Right-click → “View Page Source” to reveal the raw HTML.

6. **Copy the entire HTML code**:  
   Save it into a local `.html` file for parsing.

7. **Parse and download**:  
   Use `BeautifulSoup` and Python to extract all non-"spikes" `.fits` links and download them using `requests`.

This method ensures **full control, reliability, and transparency**—key when working with high-volume solar datasets.
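
The recordset string from step 2 follows the pattern `series[start/duration][wavelength]`. As a minimal sketch, it can be assembled from a flare window as shown below. The helper names `build_recordset` and `export_url` are hypothetical, and the `ds` query parameter is assumed from the prefilled export link used in this notebook.

```python
from datetime import datetime
from urllib.parse import quote

EXPORT_PAGE = "https://jsoc.stanford.edu/ajax/exportdata.html"

def build_recordset(start, minutes, wavelength=131, series="aia.lev1_euv_12s"):
    """Return a recordset string of the form series[start/duration][wavelength]."""
    stamp = start.strftime("%Y-%m-%dT%H:%M:%S")
    return f"{series}[{stamp}/{minutes}m][{wavelength}]"

def export_url(recordset):
    """Prefill the export page via its ds query parameter (assumed behaviour)."""
    return f"{EXPORT_PAGE}?ds={quote(recordset, safe='[]:/')}"

# Hypothetical helper usage for the 180-minute window shown in step 2
recordset = build_recordset(datetime(2014, 4, 18, 12, 45), 180)
print(recordset)
print(export_url(recordset))
```

Generating the string in code avoids typos when the same window has to be entered for several events.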


### 🔗 Click the image below to open the JSOC Data Export Tool

[![Open JSOC Data Export Tool](https://helioconverter-web-application.s3.us-east-1.amazonaws.com/JSOC_Homepage.png)](http://jsoc.stanford.edu/ajax/exportdata.html?ds=aia.lev1_euv_12s[2020-01-01T00:00:00/20m][131])

### What This Function Does — Step-by-Step

1. **Load Event Data**  
   Reads a CSV file containing solar flare events, including their start, peak, and end times, as well as buffer times used for downloading data.

2. **Iterate Through Events**  
   For each row (flare event) in the CSV:
   - Extracts timestamps for the start/end of the flare and the buffered observation window.
   - Converts the flare date into a folder name for organizing downloaded files.
   - Skips any flare date whose HTML dump is missing. A specific date (e.g., January 7, 2014) can also be excluded by uncommenting the check in the code.

3. **Parse Corresponding HTML Dump**  
   For each flare date, the function opens a pre-downloaded HTML file (from JSOC) containing links to FITS files. It uses this file to extract only the links to valid `.fits` files, excluding any that contain “spikes.”

4. **Download FITS Files**  
   The extracted links are downloaded into the corresponding flare date folder.  
   - Skips any files that already exist.  
   - Skips files whose reported size is under 10,000 bytes, which is too small to be a valid image.  
   - Progress is shown using a progress bar.

5. **Sort Files by Time Phase**  
   Once downloaded, the files are grouped into three categories based on timestamps:
   - **Pre-flare**: From buffer start to flare start  
   - **Flare**: From flare start to flare end  
   - **Post-flare**: From flare end to buffer end

6. **Move Files into Subfolders**  
   Each group of files is moved into its corresponding subfolder (`pre`, `flare`, or `post`) within the flare’s main directory.

7. **Clean Up**  
   Any leftover `.fits` files still sitting in the parent folder (not sorted) are deleted, leaving only the three clean subdirectories.

8. **Repeat for All Events**  
   This process continues for every event listed in the CSV until all have been processed, downloaded, and neatly organized.
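
The phase split in step 5 can be illustrated by parsing the timestamp embedded in each filename and comparing it against the four boundaries. This is only a sketch: `timestamp_from_filename` and `phase_of` are hypothetical helpers, and the window times below are made-up values rather than entries from the event CSV.

```python
from datetime import datetime

def timestamp_from_filename(filename):
    # e.g. aia.lev1_euv_12s.2014-02-20T081811Z.131.image_lev1.fits
    stamp = filename.split(".")[2]
    return datetime.strptime(stamp, "%Y-%m-%dT%H%M%SZ")

def phase_of(filename, jsoc_start, flare_start, flare_end, jsoc_end):
    """Return 'pre', 'flare', 'post', or None if the file is outside the window."""
    t = timestamp_from_filename(filename)
    if jsoc_start <= t < flare_start:
        return "pre"
    if flare_start <= t <= flare_end:
        return "flare"
    if flare_end < t <= jsoc_end:
        return "post"
    return None

# Hypothetical window, for illustration only
window = (
    datetime(2014, 2, 20, 4, 45),   # buffer start
    datetime(2014, 2, 20, 7, 45),   # flare start
    datetime(2014, 2, 20, 8, 30),   # flare end
    datetime(2014, 2, 20, 11, 30),  # buffer end
)
print(phase_of("aia.lev1_euv_12s.2014-02-20T081811Z.131.image_lev1.fits", *window))
print(phase_of("aia.lev1_euv_12s.2014-02-20T074011Z.131.image_lev1.fits", *window))
```

Files that return `None` fall outside the buffered window, which corresponds to the leftovers removed in step 7.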


In [None]:
import os
import shutil
import requests
import pandas as pd
from tqdm import tqdm
from datetime import datetime
from bs4 import BeautifulSoup

def download_and_organize_fits(csv_path, html_dir, fits_root, wavelength="131"):
    def filename_from_timestamp(ts):
        return f"aia.lev1_euv_12s.{ts.strftime('%Y-%m-%dT%H%M%S')}Z.{wavelength}.image_lev1.fits"

    df = pd.read_csv(csv_path)

    for idx, row in df.iterrows():
        try:
            jsoc_start = datetime.strptime(row["jsoc_start_time"], "%Y-%m-%d %H:%M:%S")
            flare_start = datetime.strptime(row["event_starttime"], "%Y-%m-%d %H:%M:%S")
            flare_end = datetime.strptime(row["event_endtime"], "%Y-%m-%d %H:%M:%S")
            jsoc_end = datetime.strptime(row["jsoc_end_time"], "%Y-%m-%d %H:%M:%S")
        except Exception as e:
            print(f"Row {idx} skipped (timestamp error): {e}")
            continue

        folder_name = flare_start.strftime("%Y-%m-%d")
        # if folder_name == "2014-01-07":
            # continue

        html_file_path = os.path.join(html_dir, f"{folder_name}.html")
        if not os.path.exists(html_file_path):
            print(f"HTML file not found for {folder_name}")
            continue

        folder_path = os.path.join(fits_root, folder_name)
        os.makedirs(folder_path, exist_ok=True)

        # Step 1: Parse HTML and download links
        with open(html_file_path, "r") as file:
            soup = BeautifulSoup(file, "html.parser")

        links = [
            a['href'] for a in soup.find_all('a', href=True)
            if a['href'].endswith('.fits') and 'spikes' not in a['href']
        ]
        full_urls = [
            link if link.startswith("http") else f"https://jsoc1.stanford.edu{link}"
            for link in links
        ]

        print(f"\n Downloading {len(full_urls)} files for {folder_name}...")

        for url in tqdm(full_urls, desc=f"📥 {folder_name}"):
            filename = url.split("/")[-1]
            filepath = os.path.join(folder_path, filename)
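            if os.path.exists(filepath):
                continue
            tmp_path = filepath + ".part"
            try:
                # The context manager releases the connection even when a file is skipped
                with requests.get(url, stream=True, timeout=60) as response:
                    response.raise_for_status()
                    # Only apply the size check when the server reports a length
                    content_length = response.headers.get("Content-Length")
                    if content_length is not None and int(content_length) < 10_000:
                        print(f"Skipped small file: {filename} ({content_length} bytes)")
                        continue
                    # Write under a temporary name so an interrupted download is
                    # not mistaken for a complete file on the next run
                    with open(tmp_path, "wb") as f:
                        for chunk in response.iter_content(chunk_size=8192):
                            if chunk:
                                f.write(chunk)
                os.replace(tmp_path, filepath)
            except Exception as e:
                print(f"Failed to download {url}: {e}")
                if os.path.exists(tmp_path):
                    os.remove(tmp_path)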

        # Step 2: Partition into pre, flare, post
        all_files = sorted(f for f in os.listdir(folder_path) if f.endswith(".fits"))
        jsoc_start_file = filename_from_timestamp(jsoc_start)
        flare_start_file = filename_from_timestamp(flare_start)
        flare_end_file = filename_from_timestamp(flare_end)
        jsoc_end_file = filename_from_timestamp(jsoc_end)

        pre_files = [f for f in all_files if jsoc_start_file <= f < flare_start_file]
        flare_files = [f for f in all_files if flare_start_file <= f <= flare_end_file]
        post_files = [f for f in all_files if flare_end_file < f <= jsoc_end_file]

        for phase, file_list in [("pre", pre_files), ("flare", flare_files), ("post", post_files)]:
            target_dir = os.path.join(folder_path, phase)
            os.makedirs(target_dir, exist_ok=True)
            for fname in file_list:
                src = os.path.join(folder_path, fname)
                dst = os.path.join(target_dir, fname)
                if not os.path.exists(dst):
                    shutil.move(src, dst)

        # Step 3: Delete any .fits files left in the parent folder
        leftovers = [f for f in os.listdir(folder_path) if f.endswith(".fits")]
        for fname in leftovers:
            try:
                os.remove(os.path.join(folder_path, fname))
            except Exception as e:
                print(f"Failed to delete leftover file {fname}: {e}")

        print(f"Organized {folder_name}: {len(pre_files)} pre, {len(flare_files)} flare, {len(post_files)} post")

    print("\nAll events processed and sorted.")

    
download_and_organize_fits(
    csv_path="flare_summary_final/flare_selection/strongest_flares_2014_SDO_AIA_131.csv",
    html_dir="html_dump",
    fits_root="final_fits"
)

### Flare Duration Calculation (Discrete Minute Binning)

In this analysis, we define the duration of each solar flare event using **discrete minute-level bins**, rather than continuous time intervals. This approach aligns with how FITS image files are timestamped — down to the minute and second (e.g., `T081811Z`), with a consistent cadence of one image approximately every 12 seconds.

Rather than using floating-point duration (e.g., 103.4 minutes), we:

1. **Count the number of unique `hhmm` values** between the flare's official start and end timestamps.
2. **Include both the starting and ending minutes** (i.e., inclusive binning).
3. Use the resulting integer count to represent the number of expected image rows during the flare.

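Because both the starting and ending minutes are counted, the discrete count is never smaller than the continuous duration rounded up: it equals the ceiling of the true duration or exceeds it by one, depending on where the start and end fall within their minutes (an exactly 10.0-minute flare from 08:00:00 to 08:10:00 spans 11 bins). This guarantees complete coverage of the flare window. The discrete duration also better reflects how the data are stored and accessed, and keeps file-based image selection consistent.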

We also log both:
- The **continuous duration** in minutes (for reference).
- The **discretized count of unique minutes** (used for all sampling and model input).
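
As a worked illustration of the inclusive minute binning, the sketch below computes both the continuous duration and the discrete bin count for a single event, then adds the fixed pre-flare and post-flare counts. The start and end times are hypothetical, and `count_minute_bins` is an illustrative helper that assumes the flare lasts less than 24 hours.

```python
import math
from datetime import datetime

PRE_FLARE = 178
POST_FLARE = 178

def count_minute_bins(start, end):
    """Count minute bins from the start minute to the end minute, inclusive."""
    first = start.replace(second=0, microsecond=0)
    last = end.replace(second=0, microsecond=0)
    return int((last - first).total_seconds() // 60) + 1

# Hypothetical flare lasting 9 minutes 40 seconds
start = datetime(2014, 1, 1, 8, 0, 30)
end = datetime(2014, 1, 1, 8, 10, 10)

continuous_min = (end - start).total_seconds() / 60
bins = count_minute_bins(start, end)
total = PRE_FLARE + bins + POST_FLARE

print(f"continuous: {continuous_min:.2f} min, ceiling: {math.ceil(continuous_min)}")
print(f"discrete bins: {bins}, total images: {total}")
```

Here the flare touches minutes 08:00 through 08:10, so the count is 11 bins even though the continuous duration is under 10 minutes.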

Each flare's total image count is then calculated as:

```text
Total Images = Pre-Flare (178) + Flare (discrete) + Post-Flare (178)
```


In [None]:
import pandas as pd
import numpy as np
from tabulate import tabulate

# Load CSV
df = pd.read_csv("flare_summary_final/flare_selection/strongest_flares_2014_SDO_AIA_131.csv", parse_dates=[
    'event_starttime', 'event_endtime', 'jsoc_start_time', 'jsoc_end_time'])

# Continuous durations (still used for display mapping)
df['flare_duration_sec'] = (df['event_endtime'] - df['event_starttime']).dt.total_seconds()
df['flare_duration_min'] = df['flare_duration_sec'] / 60

# Discrete flare minute counter using unique hhmm values
def count_unique_minutes(start, end):
    times = pd.date_range(start=start.floor('min'), end=end.floor('min'), freq='min')
    unique_hhmms = {t.strftime("%H%M") for t in times}
    return len(unique_hhmms), unique_hhmms

# Apply function to get discrete durations
df['flare_duration_discrete'], df['flare_hhmm_bins'] = zip(*df.apply(
    lambda row: count_unique_minutes(row['event_starttime'], row['event_endtime']),
    axis=1
))
df['flare_duration_min_discrete'] = df['flare_duration_discrete']

# Add pre-flare and post-flare counts
df['pre_flare'] = 178
df['post_flare'] = 178
df['total_per_event'] = df['flare_duration_min_discrete'] + df['pre_flare'] + df['post_flare']

# Format for display: show float minutes → discrete count
df['flare_display'] = df.apply(
    lambda row: f"{row['flare_duration_min']:.1f} → {row['flare_duration_min_discrete']}",
    axis=1
)

# Create display table
summary_df = df[['event_starttime', 'flare_display', 'pre_flare', 'post_flare', 'total_per_event']].copy()
summary_df.columns = ['Event Date', 'Flare Images', 'Pre-Flare (0)', 'Post-Flare (0)', 'Total per Event']
summary_df['Event Date'] = summary_df['Event Date'].dt.date

# Add separator row
separator_row = pd.DataFrame([['—'] * len(summary_df.columns)], columns=summary_df.columns)

# Add TOTAL row using discrete durations
total_row = pd.DataFrame([[
    'TOTAL',
    df['flare_duration_min_discrete'].sum(),
    df['pre_flare'].sum(),
    df['post_flare'].sum(),
    df['total_per_event'].sum()
]], columns=summary_df.columns)

# Concatenate and print
summary_df = pd.concat([summary_df, separator_row, total_row], ignore_index=True)

print("Final Summary Per Event\n")
print(tabulate(summary_df, headers='keys', tablefmt='github', showindex=False))

### Stratified Random Sampling of Minute-Level FITS Files

To prepare a diverse and balanced training dataset for flare classification, we implemented a **stratified random sampling strategy** based on the `hhmm` timestamp of each FITS file.

#### Process Overview:
- Each flare folder contains dozens to hundreds of FITS files with timestamps in the format:  
  `aia.lev1_euv_12s.2014-02-20T081811Z.131.image_lev1.fits`
- The script extracts the `hhmm` (hour–minute) portion from each filename.
- We count how many files occur in each `hhmm` group using a `Counter`.
- For every unique `hhmm`, we:
  - Track the index position(s) of all files within that minute.
  - **Randomly select one FITS file** to represent that minute bin.
- The selected FITS file is copied into the central folder `data_selection/fits/raw`.

This process guarantees:
- One file per minute bin that contains at least one image.
- Balanced temporal representation across the flare duration.
- A uniform random choice **within each bin**; the sampling logs record every choice so the selection can be reconstructed later.
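
The one-per-minute selection can be sketched with the standard library alone. This is not the notebook's implementation: `pick_one_per_minute` is a hypothetical helper, and the fixed `seed` is an added assumption that makes repeated runs return the same picks.

```python
import random
from collections import defaultdict

def pick_one_per_minute(filenames, seed=0):
    """Group files by date and hhmm, then choose one file per group at random."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    bins = defaultdict(list)
    for name in sorted(filenames):
        stamp = name.split(".")[2]     # e.g. 2014-02-20T081811Z
        bins[stamp[:15]].append(name)  # key: date + 'T' + hhmm
    return {key: rng.choice(names) for key, names in bins.items()}

files = [
    "aia.lev1_euv_12s.2014-02-20T081811Z.131.image_lev1.fits",
    "aia.lev1_euv_12s.2014-02-20T081823Z.131.image_lev1.fits",
    "aia.lev1_euv_12s.2014-02-20T081835Z.131.image_lev1.fits",
    "aia.lev1_euv_12s.2014-02-20T081911Z.131.image_lev1.fits",
]
picks = pick_one_per_minute(files, seed=42)
for minute, chosen in picks.items():
    print(minute, chosen)
```

Keying each bin on the date as well as `hhmm` keeps bins distinct when an event crosses midnight.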

#### Sampling Logs:
- For transparency and reproducibility, we save a `.csv` log for each flare in the `sampling_logs` directory.
- Each log includes:
  - `hhmm` group
  - Number of files in the group (`count`)
  - Index positions of files in the group (`place`)
  - Which file was selected (`chosen_file` and `chosen_index`)

> **Example Output Log:**
> ```
> hhmm,count,place,chosen_index,chosen_file
> 0723,1,[0],0,aia.lev1...
> 0724,5,[1,2,3,4,5],3,aia.lev1...
> 0725,5,[6,7,8,9,10],8,aia.lev1...
> ```

#### Image Labeling:
- Each selected flare file is labeled `1`. Up to 178 pre-flare and 178 post-flare files per event (randomly sampled when more are available) are copied alongside them and labeled `0`.
- The same labels are written to both `image_labels.csv` (comma-separated) and `image_labels.txt` (tab-separated) in the `data_selection` folder for direct use in ML workflows.

> **Example Format (CSV; the TXT version uses tabs instead of commas):**  
> ```
> filename,label
> aia.lev1_euv_12s.2014-02-20T081811Z.131.image_lev1.fits,1
> aia.lev1_euv_12s.2014-02-20T074011Z.131.image_lev1.fits,0
> ```
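
For downstream use, the label file can be read back into a filename-to-label mapping. The sketch below uses the standard `csv` module and an in-memory buffer in place of the real files; `write_labels` and `read_labels` are hypothetical helpers, and the `delimiter` argument switches between the comma and tab variants.

```python
import csv
import io

def write_labels(rows, handle, delimiter=","):
    """Write (filename, label) pairs with a header row."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["filename", "label"])
    writer.writerows(rows)

def read_labels(handle, delimiter=","):
    """Return a dict mapping each filename to its integer label."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row["filename"]: int(row["label"]) for row in reader}

rows = [
    ("aia.lev1_euv_12s.2014-02-20T081811Z.131.image_lev1.fits", 1),
    ("aia.lev1_euv_12s.2014-02-20T074011Z.131.image_lev1.fits", 0),
]

# Round-trip through the tab-separated layout used for image_labels.txt
buffer = io.StringIO()
write_labels(rows, buffer, delimiter="\t")
buffer.seek(0)
labels = read_labels(buffer, delimiter="\t")
print(labels)
```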

In [None]:
import os
import random
import shutil
import pandas as pd
from collections import Counter

# Folders to process
flare_folders = [
    "/Users/indiajackson/PycharmProjects/1_NSF_Postdoc/FITSvsASDF/JSOC_Events/final_fits/2014-01-07",
    "/Users/indiajackson/PycharmProjects/1_NSF_Postdoc/FITSvsASDF/JSOC_Events/final_fits/2014-02-20",
    "/Users/indiajackson/PycharmProjects/1_NSF_Postdoc/FITSvsASDF/JSOC_Events/final_fits/2014-02-25",
    "/Users/indiajackson/PycharmProjects/1_NSF_Postdoc/FITSvsASDF/JSOC_Events/final_fits/2014-04-18",
    "/Users/indiajackson/PycharmProjects/1_NSF_Postdoc/FITSvsASDF/JSOC_Events/final_fits/2014-09-10"
]

data_selection = "data_selection/fits/raw"
data_selection_parent = "data_selection"

log_dir = "data_selection/sampling_logs"
os.makedirs(data_selection, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

file_labels = []

for base_dir in flare_folders:
    flare_dir = os.path.join(base_dir, "flare")
    pre_dir = os.path.join(base_dir, "pre")
    post_dir = os.path.join(base_dir, "post")

    print(f"\n Processing: {flare_dir}")
    fits_files = sorted([f for f in os.listdir(flare_dir) if f.endswith(".fits")])
    hhmm_counter = Counter()
    hhmm_places = {}  # hhmm -> indices into fits_files, in chronological order

    # Count by hhmm and record each file's index in the sorted listing
    for i, fname in enumerate(fits_files):
        try:
            timestamp = fname.split("T")[1]
        except IndexError:
            print(f" Skipping malformed filename: {fname}")
            continue
        hhmm = timestamp[:4]
        hhmm_counter[hhmm] += 1
        hhmm_places.setdefault(hhmm, []).append(i)

    # Stratified random selection for flare (label = 1).
    # Iterating in insertion order keeps each bin tied to its own indices,
    # even if a filename was skipped or the flare crosses midnight.
    rows = []

    for hhmm, place_list in hhmm_places.items():
        count = hhmm_counter[hhmm]
        chosen_index = random.choice(place_list)
        chosen_file = fits_files[chosen_index]

        src = os.path.join(flare_dir, chosen_file)
        dst = os.path.join(data_selection, chosen_file)
        if not os.path.exists(dst):
            shutil.copy2(src, dst)
        # Record the label even when the file was copied on an earlier run
        file_labels.append((chosen_file, 1))

        rows.append({
            "hhmm": hhmm,
            "count": count,
            "place": place_list,
            "chosen_index": chosen_index,
            "chosen_file": chosen_file
        })

    # Save stratified flare sample log
    flare_date = base_dir.split("/")[-1]
    csv_name = f"{flare_date}.csv"
    csv_path = os.path.join(log_dir, csv_name)

    df = pd.DataFrame(rows)
    df.to_csv(csv_path, index=False)
    print(f"Saved flare sampling log to `{csv_path}`")
    print(f"Copied {len(rows)} flare FITS files to `{data_selection}`")

    # Validation
    total_from_counter = sum(hhmm_counter.values())
    total_in_directory = len(fits_files)
    print(f"FITS in dir: {total_in_directory} | Counted: {total_from_counter}")
    if total_from_counter == total_in_directory:
        print("Count matches perfectly.")
    else:
        print("Mismatch detected!")

    # Copy 178 pre/post-flare files (label = 0)
    for tag, path in zip(['pre', 'post'], [pre_dir, post_dir]):
        tag_files = [f for f in os.listdir(path) if f.endswith(".fits")]
        selected = tag_files if len(tag_files) <= 178 else random.sample(tag_files, 178)
        print(f"Copying {len(selected)} {tag}-flare files...")

        for f in selected:
            src = os.path.join(path, f)
            dst = os.path.join(data_selection, f)
            if not os.path.exists(dst):
                shutil.copy2(src, dst)
            # Record the label even when the file was copied on an earlier run
            file_labels.append((f, 0))

# Save full label list
label_df = pd.DataFrame(file_labels, columns=["filename", "label"])
label_df = label_df.sort_values("filename")  # 🔍 Sort alphabetically
# label_df.to_csv("data_selection_labels.csv", index=False)
label_df.to_csv(os.path.join(data_selection_parent, "image_labels.csv"), index=False)
label_df.to_csv(os.path.join(data_selection_parent, "image_labels.txt"), index=False, sep="\t")

print("\nSaved `image_labels.csv` and `image_labels.txt` to data_selection/")
print(f"Data extraction complete. Total files: {len(file_labels)}")