In [2]:
#caution
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/Github/flare-cme/2024

Mounted at /content/drive
/content/drive/MyDrive/Github/flare-cme/2024


In [3]:
! pip install sunpy[all]
from IPython.display import clear_output
clear_output()

In [4]:
import warnings

warnings.filterwarnings("ignore")

# predicting coronal mass ejections using machine learning methods

A Coronal Mass Ejection (CME) throws magnetic flux and plasma from the Sun into interplanetary space. These eruptions are actually related to solar flares -- in fact, CMEs and solar flares are considered “a single magnetically driven event” ([Webb & Howard 2012](http://adsabs.harvard.edu/abs/2012LRSP....9....3W)), wherein a flare unassociated with a CME is called a confined or compact flare. <br>

In general, the more energetic a flare, the more likely it is to be associated with a CME ([Yashiro et al. 2005](http://adsabs.harvard.edu/abs/2005JGRA..11012S05Y)) -- but this is not, by any means, a rule. For example, [Sun et al. (2015)](http://adsabs.harvard.edu/abs/2015ApJ...804L..28S) found that the largest active region in the last 24 years produced 6 X-class flares but not a single observed CME.<br>

In this notebook, we will predict whether or not a flaring active region will also emit a CME using a Support Vector Machine, a machine learning algorithm implemented in the scikit-learn package.

The analysis that follows is published in [Bobra & Ilonidis, 2016, <i> Astrophysical Journal</i>, 821, 127](http://adsabs.harvard.edu/abs/2016ApJ...821..127B). If you use any of this code, we ask that you cite Bobra & Ilonidis (2016).

To do this analysis, we'll look at every active region observed by the Helioseismic and Magnetic Imager instrument on NASA's Solar Dynamics Observatory (SDO) satellite from May 2010 through December 2024. Each active region is characterized by a set of features that describe the magnetic field at the solar surface. One feature, for example, is the total energy contained within an active region. Another is the total flux through an active region. We have 18 features, all of which are calculated every 12 minutes throughout an active region's lifetime. See [Bobra et al., 2014](http://link.springer.com/article/10.1007%2Fs11207-014-0529-3) for more information on how we calculate these features. <br>

We'll then ascribe each active region to one of two classes:

1. The positive class contains flaring active regions that did produce a CME.
2. The negative class contains flaring active regions that did not produce a CME.

First, we'll import some modules.

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import scipy.stats
import requests
import urllib
import json
from datetime import datetime as dt_obj
from datetime import timedelta
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sunpy.time import TimeRange
from sunpy.net import Fido, attrs as a

pd.set_option('display.max_rows', 500)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Now we'll gather the data. The data come from three different places:

1. CME data from SOHO/LASCO and STEREO/SECCHI coronagraphs, which can be accessed from the [DONKI database](http://kauai.ccmc.gsfc.nasa.gov/DONKI/) at NASA Goddard. This tells us if an active region has produced a CME or not.
2. Flare data from the GOES flare catalog at NOAA, which can be accessed through the Heliophysics Event Knowledgebase (HEK) with `sunpy.net.Fido`. This tells us if an active region produced a flare or not.
3. Active region data from the Solar Dynamics Observatory's Helioseismic and Magnetic Imager instrument, which can be accessed from the [JSOC database](http://jsoc.stanford.edu/) via a JSON API. This gives us the features characterizing each active region.

### step 1: gathering data for the positive class

Let's first query the [DONKI database](http://kauai.ccmc.gsfc.nasa.gov/DONKI/) to get the data associated with the positive class. Be forewarned: there's a lot of data cleaning involved with building the positive class.
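As a minimal sketch, the query URL can be assembled with the standard library so the parameters are kept separate from the endpoint. The endpoint and date range are the ones used in this notebook; the helper name `build_donki_url` is hypothetical.

```python
from urllib.parse import urlencode

# Flare endpoint of the DONKI web service, as used in this notebook.
BASE_URL = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/FLR"


def build_donki_url(start_date, end_date):
    """Return the flare-query URL for two ISO dates (hypothetical helper)."""
    query = urlencode({"startDate": start_date, "endDate": end_date})
    return f"{BASE_URL}?{query}"


url = build_donki_url("2010-05-01", "2024-12-31")
print(url)
```

Keeping the dates as arguments makes it easy to re-run the notebook for a different time span.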

In [6]:
# request the data
baseurl = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/FLR?"
t_start = "2010-05-01"
t_end = "2024-12-31"
url = baseurl+"startDate="+t_start+"&endDate="+t_end

# if there's no response at this time, print warning
response = requests.get(url)
if response.status_code != 200:
    print('cannot successfully get an http response')

In [7]:
# read the data

print("Getting data from", url)
df = pd.read_json(url)

# select flares associated with a linked event (SEP or CME), and
# select only M or X-class flares
events_list = df.loc[df['classType'].str.contains("M|X") & ~df['linkedEvents'].isnull()]

# drop all rows that don't satisfy the above conditions
events_list = events_list.reset_index(drop=True)

Getting data from https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/FLR?startDate=2010-05-01&endDate=2024-12-31


In [8]:
# drop the rows that aren't linked to CME events
# (note: only the first linked event of each flare is inspected)
for i in range(events_list.shape[0]):
    value = events_list.loc[i, 'linkedEvents'][0]['activityID']
    if "CME" not in value:
        print(value, "not a CME, dropping row")
        events_list = events_list.drop([i])
events_list = events_list.reset_index(drop=True)

2011-08-09T08:40:00-SEP-001 not a CME, dropping row
2011-09-07T06:00:00-SEP-001 not a CME, dropping row
2011-09-24T20:45:00-SEP-001 not a CME, dropping row
2013-06-21T06:09:00-SEP-001 not a CME, dropping row
2015-03-06T19:20:00-SEP-001 not a CME, dropping row
2015-06-18T09:25:00-SEP-001 not a CME, dropping row
2015-06-21T20:35:00-SEP-001 not a CME, dropping row
2015-06-21T20:35:00-SEP-001 not a CME, dropping row
2023-07-18T01:00:00-SEP-002 not a CME, dropping row
2023-08-05T10:00:00-SEP-001 not a CME, dropping row
2024-01-22T13:00:00-SEP-001 not a CME, dropping row
2024-05-10T14:50:00-SEP-001 not a CME, dropping row
2024-09-09T20:55:00-SEP-002 not a CME, dropping row
2024-12-21T16:12:00-SEP-001 not a CME, dropping row
2024-12-21T16:12:00-SEP-001 not a CME, dropping row
2024-12-21T16:12:00-SEP-001 not a CME, dropping row
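The loop above inspects only the first linked event of each flare. A possible variant, sketched here on made-up records that mimic the structure of the DONKI JSON, keeps a flare if any of its linked events is a CME. The helper name `has_linked_cme` is hypothetical, and using this variant would change the event counts reported in this notebook.

```python
def has_linked_cme(flare):
    """Return True if any linked event ID of a flare record mentions a CME."""
    linked = flare.get("linkedEvents") or []
    return any("CME" in event["activityID"] for event in linked)


# Made-up records; only the dictionary layout follows the DONKI response.
flares = [
    {"flrID": "A", "linkedEvents": [{"activityID": "2011-02-15T02:25:00-CME-001"}]},
    {"flrID": "B", "linkedEvents": [{"activityID": "2011-08-09T08:40:00-SEP-001"}]},
    {"flrID": "C", "linkedEvents": [{"activityID": "2011-08-09T08:40:00-SEP-001"},
                                    {"activityID": "2011-08-09T08:48:00-CME-001"}]},
    {"flrID": "D", "linkedEvents": None},
]

kept = [flare["flrID"] for flare in flares if has_linked_cme(flare)]
print(kept)
```

Flare "C" is the case that differs: its first linked event is an SEP, but a CME follows in the list.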


Convert the `peakTime` column in the `events_list` dataframe from a string into a datetime object:

In [9]:
def parse_tai_string(tstr):
    # slice the fixed-width 'YYYY-MM-DDTHH:MM' prefix of the timestamp
    year = int(tstr[:4])
    month = int(tstr[5:7])
    day = int(tstr[8:10])
    hour = int(tstr[11:13])
    minute = int(tstr[14:16])
    return dt_obj(year, month, day, hour, minute)


# assign through .at to avoid chained assignment, which pandas may apply to a copy
for i in range(events_list.shape[0]):
    events_list.at[i, 'peakTime'] = parse_tai_string(events_list.at[i, 'peakTime'])
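The fixed-position slicing in `parse_tai_string` can also be written with `datetime.strptime`. This is a minimal sketch that assumes the DONKI peak times look like `2011-02-15T01:56Z`, the same layout as the `beginTime` and `endTime` columns; the helper name `parse_donki_time` is hypothetical.

```python
from datetime import datetime


def parse_donki_time(tstr):
    """Parse a timestamp such as '2011-02-15T01:56Z' into a naive datetime."""
    return datetime.strptime(tstr, "%Y-%m-%dT%H:%MZ")


peak = parse_donki_time("2011-02-15T01:56Z")
print(peak)
```

Unlike slicing, `strptime` raises a `ValueError` when a string does not follow the expected layout, which can surface malformed catalog entries early.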

Check for Case 1: In this case, the CME and flare exist but NOAA active region number does not exist in the DONKI database.

In [10]:
# Case 1: CME and Flare exist but NOAA active region number does not exist in DONKI database

number_of_donki_mistakes = 0  # count the number of DONKI mistakes
# create an empty array to hold row numbers to drop at the end
event_list_drops = []

for i in range(events_list.shape[0]):
    if np.isnan(events_list.at[i, 'activeRegionNum']):
        time = events_list.at[i, 'peakTime']
        time_range = TimeRange(time, time)
        listofresult = Fido.search(a.Time(time_range), a.hek.EventType("FL"), a.hek.FL.GOESCls >= "M1.0", a.hek.OBS.Observatory == "GOES")

        if len(listofresult["hek"]) == 0:
            print(events_list.at[i, 'classType'], "has no match in the GOES flare database ; dropping row.")
            event_list_drops.append(i)
            number_of_donki_mistakes += 1
            continue

        # use the active region number of the first matching GOES flare
        noaa_number = listofresult["hek"]['ar_noaanum'][0]
        if noaa_number == 0:
            print(events_list.at[i, 'activeRegionNum'], events_list.at[i, 'classType'],
                  "has no NOAA active region number in the GOES flare database ; dropping row.")
            event_list_drops.append(i)
            number_of_donki_mistakes += 1
            continue

        print("Missing NOAA number:", events_list.at[i, 'activeRegionNum'], events_list.at[i, 'classType'],
              events_list.at[i, 'peakTime'], "should be", noaa_number, "; changing now.")
        # assign through .at to avoid chained assignment
        events_list.at[i, 'activeRegionNum'] = noaa_number
        number_of_donki_mistakes += 1

# Drop the rows for which there is no active region number in both the DONKI and GOES flare databases
events_list = events_list.drop(event_list_drops)
events_list = events_list.reset_index(drop=True)
print('There are', number_of_donki_mistakes, 'DONKI mistakes so far.')

Missing NOAA number: nan X1.4 2011-09-22 11:01:00 should be 11302 ; changing now.
Missing NOAA number: nan X1.3 2012-03-07 01:14:00 should be 11430 ; changing now.
Missing NOAA number: nan M6.3 2012-03-09 03:53:00 should be 11429 ; changing now.
Missing NOAA number: nan M5.1 2012-05-17 01:47:00 should be 11476 ; changing now.
Missing NOAA number: nan X1.1 2012-07-06 23:08:00 should be 11515 ; changing now.
Missing NOAA number: nan M6.2 2012-07-28 20:56:00 should be 11532 ; changing now.
Missing NOAA number: nan M1.7 2012-11-08 02:23:00 should be 11611 ; changing now.
Missing NOAA number: nan M1.2 2013-03-15 06:58:00 should be 11692 ; changing now.
Missing NOAA number: nan X1.6 2013-05-13 02:17:00 should be 11748 ; changing now.
Missing NOAA number: nan X2.8 2013-05-13 16:05:00 should be 11748 ; changing now.
Missing NOAA number: nan X3.2 2013-05-14 01:11:00 should be 11748 ; changing now.
Missing NOAA number: nan X1.2 2013-05-15 01:48:00 should be 11748 ; changing now.
Missing NOAA num

Now we grab all the data from the GOES database in preparation for checking Cases 2 and 3.

In [None]:
# Grab all the data from the GOES database
t_start = "2010-05-01"
t_end = "2024-12-31"
time_range = TimeRange(t_start, t_end)
listofresults = Fido.search(a.Time(time_range),a.hek.EventType("FL"),a.hek.FL.GOESCls >= "M1.0",a.hek.OBS.Observatory == "GOES")
print('Grabbed all the GOES data; there are', len(listofresults["hek"]), 'events.')

Grabbed all the GOES data; there are 2035 events.


Check for Case 2: In this case, the NOAA active region number is wrong in the DONKI database.

In [12]:
# Case 2: NOAA active region number is wrong in DONKI database

# collect all the flare peak times in the NOAA database
peak_times_noaa = [item["event_peaktime"] for item in listofresults["hek"]]

for i in range(events_list.shape[0]):
    # check if a particular DONKI flare peak time is also in the NOAA database
    peak_time_donki = events_list['peakTime'].iloc[i]
    if peak_time_donki in peak_times_noaa:
        index = peak_times_noaa.index(peak_time_donki)
    else:
        continue
    # ignore NOAA active region numbers equal to zero
    if (listofresults["hek"][index]['ar_noaanum'] == 0):
        continue
    # if yes, check if the DONKI and NOAA active region numbers match up for this peak time
    # if they don't, flag this peak time and replace the DONKI number with the NOAA number
    if (listofresults["hek"][index]['ar_noaanum'] != int(events_list['activeRegionNum'].iloc[i])):
        print('Messed up NOAA number:', int(events_list['activeRegionNum'].iloc[i]), events_list['classType'].iloc[i],
              events_list['peakTime'].iloc[i], "should be", listofresults["hek"][index]['ar_noaanum'], "; changing now.")
        events_list.at[i, 'activeRegionNum'] = listofresults["hek"][index]['ar_noaanum']
        number_of_donki_mistakes += 1
print('There are', number_of_donki_mistakes, 'DONKI mistakes so far.')

Messed up NOAA number: 11943 X1.2 2014-01-07 18:32:00 should be 11944 ; changing now.
Messed up NOAA number: 12051 M1.2 2014-05-07 16:29:00 should be 12055 ; changing now.
Messed up NOAA number: 12160 M1.4 2014-07-01 11:23:00 should be 12106 ; changing now.
Messed up NOAA number: 12282 M2.4 2015-02-09 23:35:00 should be 12280 ; changing now.
Messed up NOAA number: 12321 M1.1 2015-04-23 10:07:00 should be 12322 ; changing now.
Messed up NOAA number: 12565 M7.6 2016-07-23 05:16:00 should be 12567 ; changing now.
Messed up NOAA number: 12565 M5.5 2016-07-23 05:31:00 should be 12567 ; changing now.
Messed up NOAA number: 13191 M6.0 2023-01-15 03:42:00 should be 13188 ; changing now.
Messed up NOAA number: 13312 M1.9 2023-05-22 13:37:00 should be 13314 ; changing now.
Messed up NOAA number: 13615 M9.4 2024-03-30 21:16:00 should be 13515 ; changing now.
Messed up NOAA number: 13730 M1.5 2024-07-03 07:41:00 should be 13729 ; changing now.
Messed up NOAA number: 13762 M2.0 2024-07-27 10:40:00 
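The Case 2 check can be sketched on toy data with plain dictionaries. The function name `reconcile_region_numbers` and the dictionary lookup are illustrative choices rather than the notebook's method, and the peak times are plain strings standing in for the time objects used above.

```python
def reconcile_region_numbers(donki_events, noaa_by_peak_time):
    """Replace DONKI active region numbers that disagree with the NOAA value.

    donki_events: list of dicts with 'peakTime' and 'activeRegionNum' keys.
    noaa_by_peak_time: dict mapping a peak time to a NOAA active region number.
    Returns the number of corrections made.
    """
    corrections = 0
    for event in donki_events:
        noaa_number = noaa_by_peak_time.get(event["peakTime"], 0)
        # An unknown peak time or a NOAA number of zero leaves nothing to compare.
        if noaa_number == 0:
            continue
        if noaa_number != event["activeRegionNum"]:
            event["activeRegionNum"] = noaa_number
            corrections += 1
    return corrections


donki = [
    {"peakTime": "2014-01-07 18:32", "activeRegionNum": 11943},
    {"peakTime": "2011-02-15 01:56", "activeRegionNum": 11158},
]
noaa = {"2014-01-07 18:32": 11944, "2011-02-15 01:56": 11158}
n_fixed = reconcile_region_numbers(donki, noaa)
print(n_fixed, donki[0]["activeRegionNum"])
```

A dictionary keyed by peak time avoids scanning the full NOAA list for every DONKI event, at the cost of keeping only one flare per peak time.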

Check for Case 3: In this case, the flare peak time is wrong in the DONKI database.

In [13]:
# Case 3: The flare peak time is wrong in the DONKI database.

# create an empty array to hold row numbers to drop at the end
event_list_drops = []

active_region_numbers_noaa = [item["ar_noaanum"]
                              for item in listofresults["hek"]]
flare_classes_noaa = [item["fl_goescls"] for item in listofresults["hek"]]

for i in range(events_list.shape[0]):
    # check if a particular DONKI flare peak time is also in the NOAA database
    peak_time_donki = events_list.at[i, 'peakTime']
    if peak_time_donki not in peak_times_noaa:
        active_region_number_donki = int(events_list.at[i, 'activeRegionNum'])
        flare_class_donki = events_list.at[i, 'classType']
        # indices of NOAA flares with the same GOES class and the same active region number
        flare_class_indices = [k for k, x in enumerate(flare_classes_noaa) if x == flare_class_donki]
        active_region_indices = [k for k, x in enumerate(active_region_numbers_noaa) if x == active_region_number_donki]
        common_indices = list(set(flare_class_indices).intersection(active_region_indices))
        if common_indices:
            print("Messed up time:", active_region_number_donki, flare_class_donki, peak_time_donki, "should be", peak_times_noaa[common_indices[0]], "; changing now.")
            # assign through .at to avoid chained assignment
            events_list.at[i, 'peakTime'] = peak_times_noaa[common_indices[0]]
            number_of_donki_mistakes += 1
        else:
            print("DONKI flare peak time", peak_time_donki, "has no match; dropping row.")
            event_list_drops.append(i)
            number_of_donki_mistakes += 1

# Drop the rows for which the NOAA active region number and flare class associated with
# the messed-up flare peak time in the DONKI database has no match in the GOES flare database
events_list = events_list.drop(event_list_drops)
events_list = events_list.reset_index(drop=True)

# Create a list of corrected flare peak times
peak_times_donki = [events_list['peakTime'].iloc[i]
                    for i in range(events_list.shape[0])]

print('There are', number_of_donki_mistakes, 'DONKI mistakes so far.')

Messed up time: 11429 X1.1 2012-03-05 04:05:00 should be 2012-03-05 04:09:00.000 ; changing now.
DONKI flare peak time 2012-03-10 17:27:00 has no match; dropping row.
Messed up time: 11745 M5.0 2013-05-22 13:38:00 should be 2013-05-22 13:32:00.000 ; changing now.
DONKI flare peak time 2014-02-09 16:14:00 has no match; dropping row.
DONKI flare peak time 2014-05-06 22:09:00 has no match; dropping row.
Messed up time: 12127 M1.5 2014-08-01 18:12:00 should be 2014-08-01 18:13:00.000 ; changing now.
Messed up time: 12146 M2.0 2014-08-25 15:10:00 should be 2014-08-25 15:11:00.000 ; changing now.
DONKI flare peak time 2014-09-03 13:53:00 has no match; dropping row.
DONKI flare peak time 2014-09-09 00:28:00 has no match; dropping row.
Messed up time: 12172 M2.3 2014-09-23 23:15:00 should be 2014-09-23 23:16:00.000 ; changing now.
Messed up time: 12242 X1.8 2014-12-20 00:24:00 should be 2014-12-20 00:28:00.000 ; changing now.
DONKI flare peak time 2014-12-21 12:17:00 has no match; dropping row
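The Case 3 lookup pairs a DONKI flare with a NOAA flare of the same GOES class in the same active region. Below is a minimal sketch on toy records that use the HEK field names from the cell above; the function name `find_noaa_peak_time` is hypothetical, and returning the first match in catalog order is an assumption made for this example.

```python
def find_noaa_peak_time(flare_class, region_number, noaa_flares):
    """Return the peak time of the first NOAA flare with this class and region, or None."""
    for flare in noaa_flares:
        if flare["fl_goescls"] == flare_class and flare["ar_noaanum"] == region_number:
            return flare["event_peaktime"]
    return None


# Toy catalog; times are plain strings standing in for time objects.
noaa_flares = [
    {"fl_goescls": "X1.1", "ar_noaanum": 11429, "event_peaktime": "2012-03-05 04:09"},
    {"fl_goescls": "M5.0", "ar_noaanum": 11745, "event_peaktime": "2013-05-22 13:32"},
]

corrected = find_noaa_peak_time("X1.1", 11429, noaa_flares)
missing = find_noaa_peak_time("M1.0", 99999, noaa_flares)
print(corrected, missing)
```

An active region can produce more than one flare of the same class, so a match on class and region alone may be ambiguous; comparing the candidate peak times against the DONKI time would be one way to tighten it.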

This is our final table of events that fall into the positive class:

In [14]:
events_list

Unnamed: 0,flrID,catalog,instruments,beginTime,peakTime,endTime,classType,sourceLocation,activeRegionNum,note,submissionTime,versionId,link,linkedEvents
0,2011-02-15T01:44:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-02-15T01:44Z,2011-02-15 01:56:00,2011-02-15T02:06Z,X2.2,S20W10,11158.0,,2015-07-16T19:24Z,2,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-02-15T02:25:00-CME-001'}]
1,2011-02-24T07:23:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-02-24T07:23Z,2011-02-24 07:35:00,2011-02-24T07:42Z,M3.5,N14E87,11163.0,,2015-07-16T19:28Z,2,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-02-24T08:00:00-CME-001'}]
2,2011-03-07T13:44:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-03-07T13:44Z,2011-03-07 14:30:00,2011-03-07T15:08Z,M2.0,N11E21,11166.0,,2013-07-18T13:28Z,1,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-03-07T14:40:00-CME-001'}]
3,2011-03-07T19:43:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-03-07T19:43Z,2011-03-07 20:12:00,2011-03-07T21:40Z,M3.7,N30W48,11164.0,,2013-07-18T18:41Z,1,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-03-07T20:12:00-CME-001'}]
4,2011-03-08T03:37:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-03-08T03:37Z,2011-03-08 03:58:00,2011-03-08T04:20Z,M1.5,S21E72,11171.0,,2013-07-18T19:11Z,1,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-03-08T05:00:00-CME-001'}]
5,2011-06-07T06:16:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-06-07T06:16Z,2011-06-07 06:41:00,2011-06-07T06:59Z,M2.5,S22W53,11226.0,,2013-07-19T14:28Z,1,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-06-07T06:50:00-CME-001'}...
6,2011-08-03T13:17:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-08-03T13:17Z,2011-08-03 13:48:00,2011-08-03T14:10Z,M6.0,N17W29,11261.0,,2015-07-16T19:33Z,2,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-08-03T13:55:00-CME-001'}]
7,2011-08-04T03:41:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-08-04T03:41Z,2011-08-04 03:57:00,2011-08-04T04:04Z,M9.3,N15W39,11261.0,,2015-07-16T19:37Z,2,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-08-04T04:10:00-CME-001'}...
8,2011-09-07T22:32:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-09-07T22:32Z,2011-09-07 22:38:00,2011-09-07T22:44Z,X1.8,N15W31,11283.0,,2015-07-17T15:07Z,3,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-09-07T23:24:00-CME-001'}]
9,2011-09-22T10:29:00-FLR-001,M2M_CATALOG,[{'displayName': 'GOES15: SEM/XRS 1.0-8.0'}],2011-09-22T10:29Z,2011-09-22 11:01:00,2011-09-22T11:44Z,X1.4,N09E89,11302.0,,2015-07-16T19:57Z,2,https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/FL...,[{'activityID': '2011-09-22T11:24:00-CME-001'}...


Now let's query the JSOC database to see if there are active region parameters at the time of the flare. First, read the following file, which maps NOAA active region numbers to HARPNUMs. A HARP, or HMI Active Region Patch, is the preferred numbering system for the HMI active regions, as they appear in the magnetic field data before NOAA observes them in white light:

In [15]:
answer = pd.read_csv('http://jsoc.stanford.edu/doc/data/hmi/harpnum_to_noaa/all_harps_with_noaa_ars.txt', sep=' ')
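As a hedged sketch of the lookup this table enables, the example below reads a few made-up rows in the assumed layout of the mapping file (a `HARPNUM` column and a `NOAA_ARS` column that can hold several comma-separated region numbers) and matches region numbers exactly. The helper name `harpnums_for` is hypothetical.

```python
import io

import pandas as pd

# Made-up rows in the assumed layout; the last row is deliberately artificial.
sample = io.StringIO("HARPNUM NOAA_ARS\n377 11158\n1449 11429,11430\n9999 111580\n")
mapping = pd.read_csv(sample, sep=" ", dtype={"NOAA_ARS": str})


def harpnums_for(noaa_number, table):
    """Return the HARPNUMs whose NOAA_ARS list contains exactly this region number."""
    target = str(int(noaa_number))
    is_match = table["NOAA_ARS"].str.split(",").apply(lambda ars: target in ars)
    return table.loc[is_match, "HARPNUM"].tolist()


print(harpnums_for(11158, mapping))
print(harpnums_for(11430.0, mapping))
```

A substring test such as `str.contains("11158")` would also match the artificial `111580` row. That is harmless as long as every region number has five digits, but an exact comparison does not depend on it.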

Now, let's determine at which time we'd like to predict CMEs. In general, many people try to predict a CME either 24 or 48 hours before it happens. We can report both in this study by setting a variable called `timedelayvariable`:

In [16]:
timedelayvariable = 24

Now we'll subtract `timedelayvariable` hours from the GOES peak time and re-format the datetime object into a string that JSOC can understand:

In [17]:
t_rec = [(events_list['peakTime'].iloc[i] - timedelta(hours=timedelayvariable)).strftime('%Y.%m.%d_%H:%M_TAI') for i in range(events_list.shape[0])]
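For a single event, the shift and the formatting look like this. It is a minimal sketch with a hypothetical helper name, `to_jsoc_time`, using the first flare in the table above.

```python
from datetime import datetime, timedelta


def to_jsoc_time(peak_time, delay_hours):
    """Shift a peak time back by delay_hours and format it as a JSOC record time."""
    return (peak_time - timedelta(hours=delay_hours)).strftime("%Y.%m.%d_%H:%M_TAI")


example = to_jsoc_time(datetime(2011, 2, 15, 1, 56), 24)
print(example)
```

Note that the format string only appends the `_TAI` label and does not convert between time scales. The resulting offset is a few tens of seconds, which is small compared with the 12-minute cadence of the features.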

Now we can grab the SDO data from the JSOC database by executing the JSON queries. We select data that satisfy several criteria: the data have to be [1] disambiguated with a version of the disambiguation module greater than 1.1, [2] taken while the magnitude of the spacecraft's radial velocity (`OBS_VR`) is less than 3500 m/s, [3] of a high quality, and [4] within 70 degrees of central meridian. If the data pass all these tests, they are stored in one of two lists: one for the positive class (called `CME_data`) and one for the negative class (called `no_CME_data`).
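Criteria [1] to [3] are expressed as a filter inside the JSOC record-set query. The sketch below assembles that query string from named parts. The helper name `build_sharp_query` is hypothetical; the series name, filter, and keyword list are the ones used in this notebook.

```python
JSOC_INFO = "http://jsoc.stanford.edu/cgi-bin/ajax/jsoc_info"

# The 18 SHARP keywords used as features.
KEYWORDS = ["USFLUX", "MEANGBT", "MEANJZH", "MEANPOT", "SHRGT45", "TOTUSJH",
            "MEANGBH", "MEANALP", "MEANGAM", "MEANGBZ", "MEANJZD", "TOTUSJZ",
            "SAVNCPP", "TOTPOT", "MEANSHR", "AREA_ACR", "R_VALUE", "ABSNJZH"]

# Filter on disambiguation code version, spacecraft velocity, and data quality.
RECORD_FILTER = "[? (CODEVER7 !~ '1.1 ') and (abs(OBS_VR)< 3500) and (QUALITY<65536) ?]"


def build_sharp_query(harpnum, t_rec):
    """Return the jsoc_info URL for one HARP at one record time."""
    record_set = f"hmi.sharp_720s[{harpnum}][{t_rec}]{RECORD_FILTER}"
    return f"{JSOC_INFO}?ds={record_set}&op=rs_list&key={','.join(KEYWORDS)}"


query_url = build_sharp_query(377, "2011.02.14_01:56_TAI")
print(query_url)
```

Criterion [4], the distance from central meridian, is not part of this filter; it is checked separately with keywords from the CEA series.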

In [18]:
def get_the_jsoc_data(event_count, t_rec):
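    """
    Query JSOC for the SHARP keywords of each event.

    Relies on three module-level variables: `answer` (the HARPNUM-to-NOAA table),
    `listofactiveregions`, and `listofgoesclasses`.

    Parameters
    ----------
    event_count: number of events
                 int

    t_rec:       list of times, one associated with each event in event_count
                 list of strings in JSOC format ('%Y.%m.%d_%H:%M_TAI')

    Returns
    -------
    catalog_data:   list of 18-element lists of keyword values, one per accepted event
    classification: list of [HARPNUM, NOAA number, GOES class, t_rec], one per accepted event
    """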

    catalog_data = []
    classification = []

    for i in range(event_count):

        print("=====", i, "=====")
        # next match NOAA_ARS to HARPNUM
        idx = answer[answer['NOAA_ARS'].str.contains(
            str(int(listofactiveregions[i])))]

        # if there's no HARPNUM, skip this event
        if idx.empty:
            print('skip: there are no matching HARPNUMs for',
                  str(int(listofactiveregions[i])))
            continue

        # construct jsoc_info queries and query jsoc database; we are querying for 18 keywords
        url = "http://jsoc.stanford.edu/cgi-bin/ajax/jsoc_info?ds=hmi.sharp_720s["+str(
            idx.HARPNUM.values[0])+"]["+t_rec[i]+"][? (CODEVER7 !~ '1.1 ') and (abs(OBS_VR)< 3500) and (QUALITY<65536) ?]&op=rs_list&key=USFLUX,MEANGBT,MEANJZH,MEANPOT,SHRGT45,TOTUSJH,MEANGBH,MEANALP,MEANGAM,MEANGBZ,MEANJZD,TOTUSJZ,SAVNCPP,TOTPOT,MEANSHR,AREA_ACR,R_VALUE,ABSNJZH"
        response = requests.get(url)

        # if there's no response at this time, quit
        if response.status_code != 200:
            print('skip: cannot successfully get an http response')
            continue

        # read the JSON output
        data = response.json()

        # if there are no data at this time, quit
        if data['count'] == 0:
            print('skip: there are no data for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        # check to see if the active region is too close to the limb
        # we can compute the longitude of an active region in stonyhurst coordinates as follows:
        # longitude_stonyhurst = CRVAL1 - CRLN_OBS
        # (the variable names and messages below call this quantity 'latitude')
        # for this we have to query the CEA series (but above we queried the other series as the CEA series does not have CODEVER5 in it)

        url = "http://jsoc.stanford.edu/cgi-bin/ajax/jsoc_info?ds=hmi.sharp_cea_720s["+str(
            idx.HARPNUM.values[0])+"]["+t_rec[i]+"][? (abs(OBS_VR)< 3500) and (QUALITY<65536) ?]&op=rs_list&key=CRVAL1,CRLN_OBS"
        response = requests.get(url)

        # if there's no response at this time, quit
        if response.status_code != 200:
            print('skip: failed to find CEA JSOC data for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        # read the JSON output
        latitude_information = response.json()

        # if there are no data at this time, quit
        if latitude_information['count'] == 0:
            print('skip: there are no data for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        CRVAL1 = float(latitude_information['keywords'][0]['values'][0])
        CRLN_OBS = float(latitude_information['keywords'][1]['values'][0])
        if (np.absolute(CRVAL1 - CRLN_OBS) > 70.0):
            print('skip: latitude is out of range for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        if ('MISSING' in str(data['keywords'])):
            print('skip: there are some missing keywords for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        print('accept NOAA Active Region number', str(int(
            listofactiveregions[i])), 'and HARPNUM', idx.HARPNUM.values[0], 'at time', t_rec[i])

        individual_flare_data = []
        for j in range(18):
            individual_flare_data.append(
                float(data['keywords'][j]['values'][0]))

        catalog_data.append(list(individual_flare_data))

        single_class_instance = [idx.HARPNUM.values[0], str(
            int(listofactiveregions[i])), listofgoesclasses[i], t_rec[i]]
        classification.append(single_class_instance)

    return catalog_data, classification
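The limb check in the function above compares the raw difference `CRVAL1 - CRLN_OBS` with 70 degrees. Assuming both keywords are Carrington longitudes in degrees, that difference can exceed 180 degrees in magnitude when one value has wrapped past 360 and the other has not. The sketch below is a possible variant, not the notebook's method, that wraps the difference first; both function names are hypothetical.

```python
def stonyhurst_longitude(crval1, crln_obs):
    """Longitude relative to central meridian, wrapped to the range [-180, 180)."""
    return (crval1 - crln_obs + 180.0) % 360.0 - 180.0


def near_disk_centre(crval1, crln_obs, limit=70.0):
    """Return True if the patch centre lies within `limit` degrees of central meridian."""
    return abs(stonyhurst_longitude(crval1, crln_obs)) <= limit


print(stonyhurst_longitude(5.0, 355.0))
print(near_disk_centre(5.0, 355.0))
print(near_disk_centre(100.0, 10.0))
```

With the raw difference, the first example would give -350 degrees and the region would be rejected, although it sits only 10 degrees from central meridian.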

Now we prepare the data to be fed into the function:

In [19]:
listofactiveregions = list(events_list['activeRegionNum'].values.flatten())
listofgoesclasses = list(events_list['classType'].values.flatten())

And call the function:

In [20]:
positive_result = get_the_jsoc_data(events_list.shape[0], t_rec)

===== 0 =====
accept NOAA Active Region number 11158 and HARPNUM 377 at time 2011.02.14_01:56_TAI
===== 1 =====
skip: there are no data for HARPNUM 392 at time 2011.02.23_07:35_TAI
===== 2 =====
accept NOAA Active Region number 11166 and HARPNUM 401 at time 2011.03.06_14:30_TAI
===== 3 =====
accept NOAA Active Region number 11164 and HARPNUM 393 at time 2011.03.06_20:12_TAI
===== 4 =====
skip: there are no data for HARPNUM 415 at time 2011.03.07_03:58_TAI
===== 5 =====
accept NOAA Active Region number 11226 and HARPNUM 637 at time 2011.06.06_06:41_TAI
===== 6 =====
accept NOAA Active Region number 11261 and HARPNUM 750 at time 2011.08.02_13:48_TAI
===== 7 =====
accept NOAA Active Region number 11261 and HARPNUM 750 at time 2011.08.03_03:57_TAI
===== 8 =====
accept NOAA Active Region number 11283 and HARPNUM 833 at time 2011.09.06_22:38_TAI
===== 9 =====
skip: there are no data for HARPNUM 892 at time 2011.09.21_11:01_TAI
===== 10 =====
skip: there are no data for HARPNUM 892 at time 20

Here is the number of events associated with the positive class:

In [21]:
CME_data = positive_result[0]
positive_class = positive_result[1]
print("There are", len(CME_data), "CME events in the positive class.")

There are 305 CME events in the positive class.


In [22]:
positive_df = pd.concat([pd.DataFrame(CME_data), pd.DataFrame(positive_class)], axis=1)
positive_df.columns = ["USFLUX", "MEANGBT", "MEANJZH", "MEANPOT", "SHRGT45", "TOTUSJH", "MEANGBH",
              "MEANALP", "MEANGAM", "MEANGBZ", "MEANJZD", "TOTUSJZ", "SAVNCPP", "TOTPOT",
              "MEANSHR", "AREA_ACR", "R_VALUE", "ABSNJZH", "HARPNUM", "NOAA", "Class", "Peak Time"]


In [23]:
positive_df

Unnamed: 0,USFLUX,MEANGBT,MEANJZH,MEANPOT,SHRGT45,TOTUSJH,MEANGBH,MEANALP,MEANGAM,MEANGBZ,...,SAVNCPP,TOTPOT,MEANSHR,AREA_ACR,R_VALUE,ABSNJZH,HARPNUM,NOAA,Class,Peak Time
0,2.246101e+22,107.736,0.021724,15533.1,54.882,2975.118,77.983,0.041413,59.541,117.582,...,12261480000000.0,7.077275e+23,50.329,945.314636,4.805,745.287,377,11158,X2.2,2011.02.14_01:56_TAI
1,2.235615e+22,93.863,-0.001943,10914.06,44.178,1636.71,56.639,-0.003942,50.68,99.163,...,3884153000000.0,4.733711e+23,43.479,961.502686,4.303,63.441,401,11166,M2.0,2011.03.06_14:30_TAI
2,6.039302e+22,91.233,0.007535,9912.353,29.044,4229.148,49.56,0.015028,41.15,100.01,...,34306470000000.0,1.074422e+24,35.873,1934.446167,4.865,615.008,393,11164,M3.7,2011.03.06_20:12_TAI
3,2.38494e+22,109.706,0.00962,3703.91,13.5,1505.331,47.602,0.030907,34.321,110.939,...,19693860000000.0,1.879337e+23,28.525,970.551636,3.971,367.53,637,11226,M2.5,2011.06.06_06:41_TAI
4,1.989983e+22,109.891,0.032849,12062.61,44.09,2547.054,73.681,0.079865,53.27,115.866,...,44606710000000.0,5.269515e+23,43.929,1259.099854,4.777,1080.528,750,11261,M6.0,2011.08.02_13:48_TAI
5,2.240558e+22,109.474,0.034282,12019.99,43.44,2862.191,70.452,0.079225,51.866,118.352,...,46375610000000.0,5.611343e+23,43.442,1285.486938,4.73,1205.011,750,11261,M9.3,2011.08.03_03:57_TAI
6,1.574615e+22,113.941,0.014583,10300.17,37.082,1630.318,69.928,0.035764,49.86,118.314,...,18908720000000.0,3.384889e+23,39.761,1212.428345,4.336,360.819,833,11283,X1.8,2011.09.06_22:38_TAI
7,3.790051e+22,61.66,-0.020737,14635.0,42.289,2375.819,42.751,-0.031655,49.339,79.403,...,35521030000000.0,8.881245e+23,43.869,1152.677368,4.436,947.592,1449,11429,X1.1,2012.03.04_04:09_TAI
8,5.332782e+22,89.149,-0.032106,18206.98,45.66,5119.345,59.904,-0.048339,51.177,106.736,...,76721450000000.0,1.604684e+24,45.734,1720.481201,5.152,2130.605,1449,11429,X5.4,2012.03.06_00:24_TAI
9,5.372936e+22,89.536,-0.033009,18189.7,46.386,5227.58,60.44,-0.049342,51.243,107.903,...,80033020000000.0,1.609369e+24,46.017,1736.178833,5.153,2198.94,1449,11430,X1.3,2012.03.06_01:14_TAI


### step 2: gathering data for the negative class

To gather the examples for the negative class, we only need to:

1. Query the GOES database for all the M- and X-class flares during our time of interest, and
2. Select the ones that are not associated with a CME.
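These two steps amount to splitting the GOES flare list by whether a flare's peak time appears among the CME-associated peak times. Here is a minimal sketch on toy records, with plain strings standing in for time objects and a hypothetical function name, `split_by_cme`.

```python
def split_by_cme(goes_flares, cme_peak_times):
    """Split flares into (with CME, without CME) by membership of their peak time."""
    cme_times = set(cme_peak_times)
    positive = [f for f in goes_flares if f["event_peaktime"] in cme_times]
    negative = [f for f in goes_flares if f["event_peaktime"] not in cme_times]
    return positive, negative


# Toy GOES flares using the HEK field names from this notebook.
goes_flares = [
    {"ar_noaanum": 11069, "fl_goescls": "M1.2", "event_peaktime": "2010-05-05 17:19"},
    {"ar_noaanum": 11158, "fl_goescls": "X2.2", "event_peaktime": "2011-02-15 01:56"},
    {"ar_noaanum": 11158, "fl_goescls": "M6.6", "event_peaktime": "2011-02-13 17:38"},
]
with_cme, without_cme = split_by_cme(goes_flares, ["2011-02-15 01:56"])
print(len(with_cme), len(without_cme))
```

Because the split relies on exact equality of peak times, it only works once the DONKI peak times have been reconciled with the GOES catalog, as in Case 3.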

In [24]:
# collect the peak times of all GOES flares, then split the flares by
# whether their peak time appears in the corrected DONKI (CME) list
all_peak_times = np.array([(listofresults["hek"][i]['event_peaktime'])
                           for i in range(len(listofresults["hek"]))])

negative_class_possibilities = []
counter_positive = 0
counter_negative = 0
for i in range(len(listofresults["hek"])):
    this_peak_time = all_peak_times[i]
    if (this_peak_time in peak_times_donki):
        counter_positive += 1
    else:
        counter_negative += 1
        this_instance = [listofresults["hek"][i]['ar_noaanum'],
                         listofresults["hek"][i]['fl_goescls'], listofresults["hek"][i]['event_peaktime']]
        negative_class_possibilities.append(this_instance)
print("There are", counter_positive, "events in the positive class.")
print("There are", counter_negative, "events in the negative class.")

There are 504 events in the positive class.
There are 1531 events in the negative class.


Again, we compute times that are `timedelayvariable` hours (here, one day) before each flare peak time and convert them into strings that JSOC can understand:

In [25]:
t_rec = np.array([(negative_class_possibilities[i][2] - timedelta(hours=timedelayvariable)).strftime('%Y.%m.%d_%H:%M_TAI') for i in range(len(negative_class_possibilities))])

And again, we query the JSOC database to see if these data are present:

In [26]:
listofactiveregions = list(negative_class_possibilities[i][0] for i in range(counter_negative))
listofgoesclasses = list(negative_class_possibilities[i][1] for i in range(counter_negative))

In [27]:
negative_result = get_the_jsoc_data(counter_negative, t_rec)

===== 0 =====
accept NOAA Active Region number 11069 and HARPNUM 8 at time 2010.05.04_17:19_TAI
===== 1 =====
skip: there are no data for HARPNUM 54 at time 2010.06.11_00:57_TAI
===== 2 =====
accept NOAA Active Region number 11112 and HARPNUM 211 at time 2010.10.15_19:12_TAI
===== 3 =====
skip: latitude is out of range for HARPNUM 245 at time 2010.11.03_23:58_TAI
===== 4 =====
skip: latitude is out of range for HARPNUM 245 at time 2010.11.03_23:58_TAI
===== 5 =====
skip: latitude is out of range for HARPNUM 245 at time 2010.11.05_15:36_TAI
===== 6 =====
accept NOAA Active Region number 11149 and HARPNUM 345 at time 2011.01.27_01:03_TAI
===== 7 =====
accept NOAA Active Region number 11153 and HARPNUM 362 at time 2011.02.08_01:31_TAI
===== 8 =====
accept NOAA Active Region number 11158 and HARPNUM 377 at time 2011.02.12_17:38_TAI
===== 9 =====
accept NOAA Active Region number 11158 and HARPNUM 377 at time 2011.02.13_17:26_TAI
===== 10 =====
accept NOAA Active Region number 11161 and HARP

Here is the number of events associated with the negative class:

In [28]:
no_CME_data = negative_result[0]
negative_class = negative_result[1]
print("There are", len(no_CME_data), "no-CME events in the negative class.")

There are 836 no-CME events in the negative class.


In [29]:
negative_df = pd.concat([pd.DataFrame(no_CME_data), pd.DataFrame(negative_class)], axis=1)
negative_df.columns = ["USFLUX", "MEANGBT", "MEANJZH", "MEANPOT", "SHRGT45", "TOTUSJH", "MEANGBH",
              "MEANALP", "MEANGAM", "MEANGBZ", "MEANJZD", "TOTUSJZ", "SAVNCPP", "TOTPOT",
              "MEANSHR", "AREA_ACR", "R_VALUE", "ABSNJZH", "HARPNUM", "NOAA", "Class", "Peak Time"]


In [30]:
negative_df

Unnamed: 0,USFLUX,MEANGBT,MEANJZH,MEANPOT,SHRGT45,TOTUSJH,MEANGBH,MEANALP,MEANGAM,MEANGBZ,...,SAVNCPP,TOTPOT,MEANSHR,AREA_ACR,R_VALUE,ABSNJZH,HARPNUM,NOAA,Class,Peak Time
0,5.987544e+21,112.002,-0.024196,5501.074,25.269,720.336,62.781,-0.067238,43.847,125.326,...,9.859024e+12,7.265489e+22,36.054,171.006271,4.072,240.634,8,11069,M1.2,2010.05.04_17:19_TAI
1,1.104408e+22,128.494,0.000534,2971.182,12.117,855.488,59.704,0.001917,35.838,128.672,...,7.092123e+12,7.395334e+22,27.692,703.612671,3.707,10.009,211,11112,M2.9,2010.10.15_19:12_TAI
2,1.812796e+22,61.379,0.005562,7593.361,24.858,878.336,32.508,0.012794,40.577,68.417,...,8.784931e+12,2.608914e+23,29.979,515.142334,4.015,143.879,345,11149,M1.3,2011.01.27_01:03_TAI
3,1.017101e+22,84.182,-0.002347,3373.081,11.460,536.272,40.112,-0.006268,32.987,88.180,...,1.490940e+12,6.688043e+22,26.689,240.660339,3.782,35.038,362,11153,M1.9,2011.02.08_01:31_TAI
4,5.466512e+21,121.227,-0.001571,5202.920,34.474,497.171,78.803,-0.004505,53.747,119.240,...,1.266448e+12,6.762531e+22,40.947,302.568970,3.473,15.378,377,11158,M6.6,2011.02.12_17:38_TAI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
831,6.049120e+22,113.651,-0.019093,13083.810,34.170,6069.883,66.119,-0.033447,45.709,126.779,...,3.504728e+13,1.329080e+24,39.289,3688.956055,5.127,1460.385,12471,13936,M5.0,2024.12.29_16:54_TAI
832,6.068721e+22,113.223,-0.018846,13089.460,33.902,6105.217,65.565,-0.032996,45.425,126.593,...,3.698424e+13,1.329793e+24,39.109,3663.961670,5.162,1441.706,12471,13936,M1.2,2024.12.29_17:30_TAI
833,6.058633e+22,112.725,-0.019508,13116.960,34.317,6120.661,65.513,-0.034247,45.580,126.133,...,3.834556e+13,1.335375e+24,39.362,3655.612793,5.180,1495.448,12471,13936,M1.6,2024.12.29_17:42_TAI
834,6.123191e+22,113.080,-0.018925,12900.890,33.518,6172.515,64.694,-0.033407,44.952,126.485,...,3.587639e+13,1.329003e+24,38.897,3638.466797,5.141,1467.962,12471,13936,M1.6,2024.12.29_18:24_TAI


In [31]:
positive_df.to_csv("positive_2024.csv")
negative_df.to_csv("negative_2024.csv")
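One way the two tables might later be combined for a classifier is to stack them and attach a class label. This is a sketch under stated assumptions: it uses two of the 18 feature columns with made-up values, and the function name `build_design_matrix` is hypothetical.

```python
import pandas as pd

FEATURES = ["USFLUX", "TOTPOT"]  # shortened feature list for illustration

# Made-up values standing in for the positive and negative tables.
positive = pd.DataFrame({"USFLUX": [2.2e22, 6.0e22], "TOTPOT": [7.1e23, 1.1e24]})
negative = pd.DataFrame({"USFLUX": [6.0e21, 1.1e22, 1.8e22],
                         "TOTPOT": [7.3e22, 7.4e22, 2.6e23]})


def build_design_matrix(positive, negative, features):
    """Stack both classes and return (X, y), with y = 1 for CME and 0 for no CME."""
    combined = pd.concat([positive.assign(label=1), negative.assign(label=0)],
                         ignore_index=True)
    X = combined[features].to_numpy(dtype=float)
    y = combined["label"].to_numpy()
    return X, y


X, y = build_design_matrix(positive, negative, FEATURES)
print(X.shape, y.tolist())
```

The features span many orders of magnitude, so some form of scaling is usually applied before training a Support Vector Machine.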