# Accessing Georgia State University (GSU) Data Mining Lab's S3 Datasets with the `dmLab` Module

The `dmLab` module gives Jupyter Notebook users access to datasets that Georgia State University's Data Mining Lab stores in an AWS S3 bucket. Each dataset is fetched by name, ready for analysis in your Python environment.

These modules were developed by India Jackson, PhD, as part of the requirements for a PhD in Astrophysics and an MS in Computer Science at Georgia State University.

## Available Datasets
The module provides access to the following datasets, grouped by the catalog they belong to:

### GSEP (Integrated Geostationary Solar Energetic Particle Events Catalog)
- **Dataset**: `gsep_list`
- **Source**: [GSEP Publication](https://iopscience.iop.org/article/10.3847/1538-4365/ac87ac)

### SOHO EIT (A Catalog of Solar Flare Events Observed by the SOHO/EIT)
- **Dataset**: `soho_eit_flare`
- **Source**: [SOHO EIT Publication](https://iopscience.iop.org/article/10.3847/1538-4365/ab9a42)

### Flare to CME Catalogs
- **Datasets**: `cdaw_cactus_lz_halo_integrated`, `cdaw_cactus_lz_nonhalo_integrated`, `cdaw_cactus_qkl_halo_integrated`, `cdaw_cactus_qkl_nonhalo_integrated`, `fl2cme_sdo`, `fl2cme_soho`, `soho_goes_flares`, `sdo_goes_flares`

### SWAN-SF (Multivariate time series dataset for space weather data analytics)
- **Datasets**: `goes_flares_integrated`, `Hinode_all`, `ssw_hpc`
- **Source**: [SWAN-SF Publication](https://www.nature.com/articles/s41597-020-0548-x)

## Usage
To use the module, import it and call the `get_dataset` function with the dataset name.

In [1]:
import dmLab as data

In [2]:
gsep = data.get_dataset('gsep_list')
print(gsep.columns)

Index(['sep_index', 'pp_index', 'cdaw_sep_id', 'timestamp', 'cdaw_start_time',
       'cdaw_max_time', 'cdaw_evn_max', 'cme_id', 'cme_launch_time',
       'cme_1st_app_time', 'lasco_cme_width', 'p_cme_width',
       'lasco_linear_speed', 'p_cme_speed', 'fl_id', 'fl_start_time',
       'fl_peak_time', 'fl_rise_time', 'fl_lon', 'fl_lat', 'fl_goes_class',
       'noaa_ar', 'noaa_ar_uncertain', 'harpnum', 'noaa_pf10MeV',
       'ppf_gt10MeV', 'ppf_gt30MeV', 'ppf_gt60MeV', 'ppf_gt100MeV',
       'fluence_gt10MeV', 'fluence_gt30MeV', 'fluence_gt60MeV',
       'fluence_gt100MeV', 'gsep_pf_gt10MeV', 'gsep_max_time',
       'gsep_fluence_gt10MeV', 'm_type2_onset_time', 'dh_type2_onset_time',
       'start_fr', 'noaa-sep_flag', 'Inst_category', 'Comments', 'Notes',
       'Fe_e_p_shock_notes', 'gsep_notes', 'slice_start', 'slice_end', 'Flag'],
      dtype='object')
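
Several of the columns listed above look like timestamps (for example `timestamp`, `fl_start_time`, and `fl_peak_time`). The output shows only column names, so whether these arrive as strings is an assumption. If they do, a sketch like the following could convert them with pandas. `parse_time_columns` is a hypothetical helper, not part of `dmLab`, and the toy frame uses made-up values in place of the real catalog.

```python
import pandas as pd


def parse_time_columns(df, columns):
    """Return a copy of df with the given columns converted to datetimes.

    Values that cannot be parsed become NaT instead of raising an error.
    """
    out = df.copy()
    for col in columns:
        if col in out.columns:  # skip names that are absent from this catalog
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


# Toy stand-in for a loaded catalog; the values are invented for illustration.
toy = pd.DataFrame({
    "fl_start_time": ["2012-03-07 00:02:00", "not a date"],
    "fl_goes_class": ["X5.4", "M1.0"],
})
parsed = parse_time_columns(toy, ["fl_start_time", "fl_peak_time"])
print(parsed.dtypes)
```

With the real data, the call might look like `parse_time_columns(gsep, ['fl_start_time', 'fl_peak_time'])`. Rows that come back as `NaT` are worth inspecting by hand before any time-based filtering.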


In [4]:
soho_eit_flare = data.get_dataset('soho_eit_flare')
print(soho_eit_flare.columns)

Index(['start_time_detection', 'end_time_detection', 'eit_fl_location',
       'goes_class', 'noaa_active_region', 'FLAG', 'Remarks'],
      dtype='object')


In [5]:
cdaw_cactus_lz_halo_integrated = data.get_dataset('cdaw_cactus_lz_halo_integrated')
print(cdaw_cactus_lz_halo_integrated.columns)

Index(['Instrument', 'Timestamp', 'Central_PA', 'Width', 'Linear_Speed',
       '2nd_order_speed_initial', '2nd_order_speed_final',
       '2nd_order_speed_20R', 'Accel', 'Mass', 'Kinetic_Energy', 'MPA',
       'Remarks', 'Acceleration_uncertainty', 'Kinetic_Energy_uncertainty',
       'Mass_uncertainty', 'cactus_id', 'lz_td', 'lz_pad', 'lz_wd', 'lz_vd',
       'cactus_pa', 'cactus_da', 'cactus_v', 'lz_tot_de', 'Unnamed: 25'],
      dtype='object')
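
The trailing `Unnamed: 25` column in the output above is the name pandas gives a column whose header cell is blank, which often comes from a trailing delimiter in the source file. Whether the column is actually empty here is an assumption, so the sketch below drops such columns only when every value is missing. `drop_empty_unnamed` is a hypothetical helper, not part of `dmLab`, and the toy frame uses made-up values.

```python
import pandas as pd


def drop_empty_unnamed(df):
    """Drop columns named 'Unnamed: N' that hold no data at all."""
    to_drop = [
        col for col in df.columns
        if str(col).startswith("Unnamed:") and df[col].isna().all()
    ]
    return df.drop(columns=to_drop)


# Toy stand-in for a loaded catalog; the values are invented for illustration.
toy = pd.DataFrame({
    "Width": [360, 120],
    "Linear_Speed": [950.0, 410.0],
    "Unnamed: 25": [None, None],
})
cleaned = drop_empty_unnamed(toy)
print(list(cleaned.columns))
```

An `Unnamed:` column that does contain values is left in place, so nothing is discarded silently.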


In [6]:
cdaw_cactus_lz_nonhalo_integrated = data.get_dataset('cdaw_cactus_lz_nonhalo_integrated')
print(cdaw_cactus_lz_nonhalo_integrated.columns)

Index(['Instrument', 'Timestamp', 'Central_PA', 'Width', 'Linear_Speed',
       '2nd_order_speed_initial', '2nd_order_speed_final',
       '2nd_order_speed_20R', 'Accel', 'Mass', 'Kinetic_Energy', 'MPA',
       'Remarks', 'Acceleration_uncertainty', 'Kinetic_Energy_uncertainty',
       'Mass_uncertainty', 'cactus_id', 'lz_td', 'lz_pad', 'lz_wd', 'lz_vd',
       'cactus_pa', 'cactus_da', 'cactus_v', 'lz_tot_de', 'Unnamed: 25'],
      dtype='object')


In [7]:
cdaw_cactus_qkl_halo_integrated = data.get_dataset('cdaw_cactus_qkl_halo_integrated')
print(cdaw_cactus_qkl_halo_integrated.columns)

Index(['Instrument', 'Timestamp', 'Central_PA', 'Width', 'Linear_Speed',
       '2nd_order_speed_initial', '2nd_order_speed_final',
       '2nd_order_speed_20R', 'Accel', 'Mass', 'Kinetic_Energy', 'MPA',
       'Remarks', 'Acceleration_uncertainty', 'Kinetic_Energy_uncertainty',
       'Mass_uncertainty', 'cactus_id', 'qkl_td', 'qkl_pad', 'qkl_wd',
       'qkl_vd', 'cactus_pa', 'cactus_da', 'cactus_v', 'qkl_tot_de',
       'Unnamed: 25'],
      dtype='object')


In [8]:
cdaw_cactus_qkl_nonhalo_integrated = data.get_dataset('cdaw_cactus_qkl_nonhalo_integrated')
print(cdaw_cactus_qkl_nonhalo_integrated.columns)

Index(['Instrument', 'Timestamp', 'Central_PA', 'Width', 'Linear_Speed',
       '2nd_order_speed_initial', '2nd_order_speed_final',
       '2nd_order_speed_20R', 'Accel', 'Mass', 'Kinetic_Energy', 'MPA',
       'Remarks', 'Acceleration_uncertainty', 'Kinetic_Energy_uncertainty',
       'Mass_uncertainty', 'cactus_id', 'qkl_td', 'qkl_pad', 'qkl_wd',
       'qkl_vd', 'cactus_pa', 'cactus_da', 'cactus_v', 'qkl_tot_de',
       'Unnamed: 25'],
      dtype='object')


In [10]:
fl2cme_sdo = data.get_dataset('fl2cme_sdo')
print(fl2cme_sdo.columns)

Index(['flare_id', 'start_time', 'peak_time', 'end_time', 'goes_class',
       'noaa_active_region', 'fl_lon', 'fl_lat', 'fl_loc_src', 'ssw_flare_id',
       'hinode_flare_id', 'primary_verified', 'secondary_verified',
       'candidate_ars', 'cme_id', 'fl_pa', 'cme_mpa', 'diff_a', 'cme_vel',
       'cme_width', 'cme_assoc_conf', 'cdaw_cme_id', 'cdaw_cme_width',
       'cdaw_cme_vel', 'cdaw_cme_pa', 'donki_cme_id', 'donki_cme_half_angle',
       'donki_cme_vel', 'lowcat_cme_id', 'lowcat_cme_width', 'lowcat_cme_vel',
       'lowcat_cme_pa', 'cme_valid_conf'],
      dtype='object')


In [11]:
fl2cme_soho = data.get_dataset('fl2cme_soho')
print(fl2cme_soho.columns)

Index(['flare_id', 'noaa_active_region', 'event_date', 'start_time',
       'peak_time', 'goes_class', 'goes_location', 'Unnamed: 7', 'end_time',
       'start_time_detection', 'end_time_detection', 'eit_location',
       'fl_location', 'y ', 'x', 'centroids', 'fl_lat', 'fl_lon', 'x_hpc',
       'y_hpc', 'hinode_fl_id', 'hinode_verified', 'hinode_x_hpc',
       'hinode_y_hpc', 'eit_lon', 'eit_lat', 'eit_x_hpc', 'eit_y_hpc',
       'eit_verified', 'cme_id', 'fl_pa', 'cme_mpa', 'diff_a', 'cme_vel',
       'cme_width', 'cme_assoc_conf'],
      dtype='object')
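
Note that one column name in the output above, `'y '`, carries a trailing space, so `fl2cme_soho['y']` would raise a `KeyError`. One way to normalize the names is sketched below. `strip_column_names` is a hypothetical helper, not part of `dmLab`, and the toy frame uses made-up values.

```python
import pandas as pd


def strip_column_names(df):
    """Return a copy of df whose string column names have no surrounding whitespace."""
    return df.rename(
        columns=lambda name: name.strip() if isinstance(name, str) else name
    )


# Toy stand-in for a loaded catalog; the values are invented for illustration.
toy = pd.DataFrame({"y ": [1.0], "x": [2.0]})
tidy = strip_column_names(toy)
print(list(tidy.columns))
```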


In [12]:
soho_goes_flares = data.get_dataset('soho_goes_flares')
print(soho_goes_flares.columns)

Index(['flare_id', 'noaa_active_region', 'event_date', 'start_time',
       'peak_time', 'goes_class', 'goes_location', 'Unnamed: 7', 'end_time',
       'start_time_detection', 'end_time_detection', 'eit_location',
       'fl_location', 'y ', 'x', 'centroids', 'fl_lat', 'fl_lon', 'x_hpc',
       'y_hpc', 'hinode_fl_id', 'hinode_verified', 'hinode_x_hpc',
       'hinode_y_hpc', 'eit_lon', 'eit_lat', 'eit_x_hpc', 'eit_y_hpc',
       'eit_verified'],
      dtype='object')


In [13]:
sdo_goes_flares = data.get_dataset('sdo_goes_flares')
print(sdo_goes_flares.columns)

Index(['flare_id', 'start_time', 'peak_time', 'end_time', 'goes_class',
       'noaa_active_region', 'fl_lon', 'fl_lat', 'fl_loc_src', 'ssw_flare_id',
       'hinode_flare_id', 'primary_verified', 'secondary_verified',
       'candidate_ars'],
      dtype='object')
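
The cells above repeat the same call once per dataset. When several catalogs are needed at once, a small loop can collect them in a dictionary and record any names that fail to load. In this sketch, `load_many` and `fake_loader` are hypothetical names, not part of `dmLab`. The stand-in loader is there only so the example runs without S3 access.

```python
def load_many(names, loader):
    """Call loader(name) for each name; return (loaded, failed) dictionaries."""
    loaded, failed = {}, {}
    for name in names:
        try:
            loaded[name] = loader(name)
        except Exception as exc:  # broad on purpose: the loader's error types are unknown
            failed[name] = exc
    return loaded, failed


# Stand-in for data.get_dataset so the sketch runs offline (hypothetical).
def fake_loader(name):
    if name == "missing_dataset":
        raise KeyError(name)
    return f"<table {name}>"


catalogs, failed = load_many(
    ["gsep_list", "soho_eit_flare", "missing_dataset"], fake_loader
)
print(sorted(catalogs))
print(sorted(failed))
```

With the module imported as in the first cell, the real call would pass `data.get_dataset` as the loader, for example `load_many(['fl2cme_sdo', 'fl2cme_soho'], data.get_dataset)`.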
