# Part 1: Data scraping and preparation

## Step 1: Scraping competitor's data

First, I need to pull the data from the competitor's website into a Python data structure that I can work with. I use the requests library to fetch the page's HTML as a string, and BeautifulSoup to inspect it and locate the relevant table. I then use the `read_html` function from Pandas to parse the tables on the page and load the first one into a Pandas data frame.

In [1]:
import requests as req
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re

In [2]:
web = req.get('https://cmsc320.github.io/files/top-50-solar-flares.html')

In [3]:
# names the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(web.text, 'html.parser')

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html class="fontawesome-i2svg-active fontawesome-i2svg-complete" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Top 50 solar flares | Solar activity | SpaceWeatherLive.com
  </title>
  <meta charset="utf-8"/>
  <meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/>
  <meta content="On this page you will find an overview of the strongest solar flares since June 1996 together with links to more information in our archive and a v..." name="description"/>
  <meta content="SpaceWeatherLive, Live, Aurora, Auroral activity, Aurora Australis, Aurora Borealis, northern lights, Solar wind, Kp-index, Space Weather, Space Weather Updates, Aurora forecast, Space Weather Alerts, Solar activity, Solar flares, Sunspots, Aurora alert, Auroral activity, The Sun, SDO, STEREO, EPAM, DSCOVR" name="keywords"/>
  <!-- Facebook meta -->
  <meta content="https://spaceweatherlive.com/en

In [5]:
table = soup.find('table')

In [6]:
# imports table from html to pandas data frame
df1 = pd.read_html(web.text, flavor="bs4")[0]

In [7]:
df1.columns = ['rank', 'x_class', 'date', 'region', 'start_time', 'max_time', 'end_time', 'movie']

df1

Unnamed: 0,rank,x_class,date,region,start_time,max_time,end_time,movie
0,1,X28+,2003/11/04,486,19:29,19:53,20:06,MovieView archive
1,2,X20+,2001/04/02,9393,21:32,21:51,22:03,MovieView archive
2,3,X17.2+,2003/10/28,486,09:51,11:10,11:24,MovieView archive
3,4,X17+,2005/09/07,808,17:17,17:40,18:03,MovieView archive
4,5,X14.4,2001/04/15,9415,13:19,13:50,13:55,MovieView archive
5,6,X10,2003/10/29,486,20:37,20:49,21:01,MovieView archive
6,7,X9.4,1997/11/06,8100,11:49,11:55,12:01,MovieView archive
7,8,X9.3,2017/09/06,2673,11:53,12:02,12:10,MovieView archive
8,9,X9,2006/12/05,930,10:18,10:35,10:45,MovieView archive
9,10,X8.3,2003/11/02,486,17:03,17:25,17:39,MovieView archive


## Step 2: Tidying the top 50 solar flare data

To tidy the data, I first removed the movie column using the `drop` method. Then I used `iterrows` to loop through the data frame and combine each row's date and time into a string that `to_datetime` can parse, which I used to convert each time entry into a datetime entry. I then dropped the date column and renamed the columns. To deal with missing data, which the site marks with a "-", I used the `replace` method from Pandas to replace each "-" with NaN.

In [8]:
# removes movie column
df1 = df1.drop(columns='movie')

df1

Unnamed: 0,rank,x_class,date,region,start_time,max_time,end_time
0,1,X28+,2003/11/04,486,19:29,19:53,20:06
1,2,X20+,2001/04/02,9393,21:32,21:51,22:03
2,3,X17.2+,2003/10/28,486,09:51,11:10,11:24
3,4,X17+,2005/09/07,808,17:17,17:40,18:03
4,5,X14.4,2001/04/15,9415,13:19,13:50,13:55
5,6,X10,2003/10/29,486,20:37,20:49,21:01
6,7,X9.4,1997/11/06,8100,11:49,11:55,12:01
7,8,X9.3,2017/09/06,2673,11:53,12:02,12:10
8,9,X9,2006/12/05,930,10:18,10:35,10:45
9,10,X8.3,2003/11/02,486,17:03,17:25,17:39


In [9]:
# loops through rows and converts time columns to datetime columns
for i, row in df1.iterrows():
    df1.at[i, 'start_time'] = pd.to_datetime(row.at['date'] + " " + row.at['start_time'])
    df1.at[i, 'max_time'] = pd.to_datetime(row.at['date'] + " " + row.at['max_time'])
    df1.at[i, 'end_time'] = pd.to_datetime(row.at['date'] + " " + row.at['end_time'])
    
df1 = df1.drop(columns='date')

df1.columns = ['rank', 'x_class', 'region', 'start_datetime', 'max_datetime', 'end_datetime']

In [10]:
# replaces missing data with NaN
df1 = df1.replace({'-' : np.nan})

df1

Unnamed: 0,rank,x_class,region,start_datetime,max_datetime,end_datetime
0,1,X28+,486,2003-11-04 19:29:00,2003-11-04 19:53:00,2003-11-04 20:06:00
1,2,X20+,9393,2001-04-02 21:32:00,2001-04-02 21:51:00,2001-04-02 22:03:00
2,3,X17.2+,486,2003-10-28 09:51:00,2003-10-28 11:10:00,2003-10-28 11:24:00
3,4,X17+,808,2005-09-07 17:17:00,2005-09-07 17:40:00,2005-09-07 18:03:00
4,5,X14.4,9415,2001-04-15 13:19:00,2001-04-15 13:50:00,2001-04-15 13:55:00
5,6,X10,486,2003-10-29 20:37:00,2003-10-29 20:49:00,2003-10-29 21:01:00
6,7,X9.4,8100,1997-11-06 11:49:00,1997-11-06 11:55:00,1997-11-06 12:01:00
7,8,X9.3,2673,2017-09-06 11:53:00,2017-09-06 12:02:00,2017-09-06 12:10:00
8,9,X9,930,2006-12-05 10:18:00,2006-12-05 10:35:00,2006-12-05 10:45:00
9,10,X8.3,486,2003-11-02 17:03:00,2003-11-02 17:25:00,2003-11-02 17:39:00
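
The row-by-row loop above can also be written column-wise. The following is a minimal sketch on a hypothetical two-row sample shaped like the scraped table; the "-" in the second row is an invented missing value, and marking missing entries before parsing is my assumption about a sensible order, not what the cells above do.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the scraped table; the "-" stands in for
# a missing value the way the site marks one.
sample = pd.DataFrame({
    'date': ['2003/11/04', '2001/04/02'],
    'start_time': ['19:29', '21:32'],
    'end_time': ['20:06', '-'],
})

# Mark missing entries first so they parse to NaT instead of raising.
sample = sample.replace({'-': np.nan})

for col in ['start_time', 'end_time']:
    # Concatenating a string with NaN yields NaN, which to_datetime maps to NaT.
    sample[col] = pd.to_datetime(sample['date'] + ' ' + sample[col],
                                 format='%Y/%m/%d %H:%M')

sample = sample.drop(columns='date')
print(sample)
```

Working on whole columns avoids writing Timestamp objects into string columns one cell at a time, and it leaves the converted columns with a proper datetime dtype.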


## Step 3: Scraping the NASA data

The NASA table was more difficult to convert to a data frame because it isn't inside an HTML table; it's just written out as lines of text. So I used BeautifulSoup to get to the `<pre>` tag that holds the "table", and then deleted the lines that weren't data. I put the remaining lines into the rows of a data frame and expanded them so that each data entry was in its own column. Finally, I dropped the junk columns produced by the way the site is formatted and gave appropriate names to the remaining columns.

In [11]:
web = req.get('https://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html')

In [12]:
# names the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(web.text, 'html.parser')

In [13]:
print(soup.prettify())

<html>
 <body>
  <h2>
   Wind/WAVES type II bursts and CMEs
  </h2>
  <a href="waves_type2_description.htm">
   A Brief Description
  </a>
  <pre>
NOTE: List includes DH type II bursts observed by Wind spacecraft, 
but after STEREO launch on Oct 2006 the start and end times and 
frequencies of bursts are determined using both Wind and STEREO 
observations

                DH Type II                       Flare                     CME                   
----------------------------------------   -----------------   --------------------------   Plots
Start            End          Frequency     Loc   NOAA  Imp    Date  Time CPA  Width  Spd        
(1)        (2)   (3)   (4)   (5)    (6)     (7)    (8)  (9)    (10)  (11) (12)  (13) (14)   (15) 
1997/04/01 14:00 04/01 14:15  <a href="https://cdaw.gsfc.nasa.gov/CME_list/daily_movies/1997/04/01/c2rdif_waves.html">8000</a>  <a href="https://cdaw.gsfc.nasa.gov/CME_list/daily_movies/1997/04/01/c3rdif_waves.html">4000</a>   S25E16  8026 M1.3   <a

In [14]:
# finds the <pre> containing the table and splits its text into lines
pre = soup.find('pre')
lines = pre.text.splitlines()

# removes excess strings at beginning (first 12 lines) and end (last line) of list
lines = lines[12:-1]

lines

['1997/04/01 14:00 04/01 14:15  8000  4000   S25E16  8026 M1.3   04/01 15:18   74   79  312   PHTX',
 '1997/04/07 14:30 04/07 17:30 11000  1000   S28E19  8027 C6.8   04/07 14:27 Halo  360  878   PHTX',
 '1997/05/12 05:15 05/14 16:00 12000    80   N21W08  8038 C1.3   05/12 05:30 Halo  360  464   PHTX',
 '1997/05/21 20:20 05/21 22:00  5000   500   N05W12  8040 M1.3   05/21 21:00  263  165  296   PHTX',
 '1997/09/23 21:53 09/23 22:16  6000  2000   S29E25  8088 C1.4   09/23 22:02  133  155  712   PHTX',
 '1997/11/03 05:15 11/03 12:00 14000   250   S20W13  8100 C8.6   11/03 05:28  240  109  227   PHTX',
 '1997/11/03 10:30 11/03 11:30 14000  5000   S16W21  8100 M4.2   11/03 11:11  233  122  352   PHTX',
 '1997/11/04 06:00 11/05 04:30 14000   100   S14W33  8100 X2.1   11/04 06:10 Halo  360  785   PHTX',
 '1997/11/06 12:20 11/07 08:30 14000   100   S18W63  8100 X9.4   11/06 12:10 Halo  360 1556   PHTX',
 '1997/11/27 13:30 11/27 14:00 14000  7000   N17E63  8113 X2.6   11/27 13:56   98   91  441

In [15]:
# puts the lines into a data frame
df2 = pd.DataFrame(lines, columns=['string'])

df2

Unnamed: 0,string
0,1997/04/01 14:00 04/01 14:15 8000 4000 S25...
1,1997/04/07 14:30 04/07 17:30 11000 1000 S28...
2,1997/05/12 05:15 05/14 16:00 12000 80 N21...
3,1997/05/21 20:20 05/21 22:00 5000 500 N05...
4,1997/09/23 21:53 09/23 22:16 6000 2000 S29...
...,...
513,2017/09/04 20:27 09/05 04:54 14000 210 S10...
514,2017/09/06 12:05 09/07 08:00 16000 70 S08...
515,2017/09/10 16:02 09/11 06:50 16000 150 S09...
516,2017/09/12 07:38 09/12 07:43 16000 13000 N08...


In [16]:
# splits strings on whitespace and expands each block into its own column
df2 = df2['string'].str.split(expand=True)

# drops columns caused by text to the right of table on website
df2 = df2.drop(range(14, 24), axis=1)

# assigns appropriate column names
df2.columns=['start_date', 'start_time', 'end_date', 'end_time', 'start_frequency', 'end_frequency', 'flare_location', 'flare_region', 'importance',  'cme_date', 'cme_time',  'cpa', 'width', 'speed']

In [17]:
df2

Unnamed: 0,start_date,start_time,end_date,end_time,start_frequency,end_frequency,flare_location,flare_region,importance,cme_date,cme_time,cpa,width,speed
0,1997/04/01,14:00,04/01,14:15,8000,4000,S25E16,8026,M1.3,04/01,15:18,74,79,312
1,1997/04/07,14:30,04/07,17:30,11000,1000,S28E19,8027,C6.8,04/07,14:27,Halo,360,878
2,1997/05/12,05:15,05/14,16:00,12000,80,N21W08,8038,C1.3,05/12,05:30,Halo,360,464
3,1997/05/21,20:20,05/21,22:00,5000,500,N05W12,8040,M1.3,05/21,21:00,263,165,296
4,1997/09/23,21:53,09/23,22:16,6000,2000,S29E25,8088,C1.4,09/23,22:02,133,155,712
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
513,2017/09/04,20:27,09/05,04:54,14000,210,S10W12,12673,M5.5,09/04,20:12,Halo,360,1418
514,2017/09/06,12:05,09/07,08:00,16000,70,S08W33,12673,X9.3,09/06,12:24,Halo,360,1571
515,2017/09/10,16:02,09/11,06:50,16000,150,S09W92,-----,X8.3,09/10,16:00,Halo,360,3163
516,2017/09/12,07:38,09/12,07:43,16000,13000,N08E48,12680,C3.0,09/12,08:03,124,96,252
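
As a small, self-contained illustration of the split-and-name step, here is a sketch that runs on two data lines copied from the `<pre>` output above. Keeping the first 14 columns by position, rather than dropping a fixed range, is my variation and not what the cell above does; the variable names `fields` and `nasa` are placeholders.

```python
import pandas as pd

# Two data lines copied from the <pre> block shown earlier.
lines = [
    '1997/04/01 14:00 04/01 14:15  8000  4000   S25E16  8026 M1.3   04/01 15:18   74   79  312   PHTX',
    '1997/04/07 14:30 04/07 17:30 11000  1000   S28E19  8027 C6.8   04/07 14:27 Halo  360  878   PHTX',
]

names = ['start_date', 'start_time', 'end_date', 'end_time',
         'start_frequency', 'end_frequency', 'flare_location', 'flare_region',
         'importance', 'cme_date', 'cme_time', 'cpa', 'width', 'speed']

# str.split() with no separator splits on any run of whitespace,
# so the uneven column padding does not produce empty fields.
fields = pd.Series(lines).str.split(expand=True)

# Keep only the first 14 positional columns; anything to the right
# (such as the trailing "PHTX" plot links) is discarded.
nasa = fields.iloc[:, :len(names)].copy()
nasa.columns = names
print(nasa)
```

Selecting by position this way does not depend on how many extra columns the trailing text happens to produce.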


## Step 4: Tidying the NASA table

To tidy the NASA table, I first replaced the many different symbols that represent missing data with NaN. I then added a new column to indicate whether a row corresponds to a halo flare, and another to indicate whether the width given is a lower bound, while changing the cpa and width columns to be homogeneous. After that, I converted the date and time columns into datetime columns. I had to account for the cases where the year changes between the start and the end or CME, and fix some values in the data that didn't fit the datetime format (24:00 -> 00:00 the next day).

In [18]:
# while working on this step I wanted to be able to see all rows of the table, this removes the display limit
pd.set_option('display.max_rows', None)

In [19]:
# replaces missing data with NaN
df2 = df2.replace({'????' : np.nan, '----' : np.nan , '-----' : np.nan, '------' : np.nan, '--/--' : np.nan, '--:--' : np.nan})

In [20]:
# adds column to indicate if row corresponds to a halo flare
df2.insert(len(df2.columns), 'is_halo', False)

# loops through rows and, if halo flare, sets cpa to NaN and is_halo to True
for i, row in df2.iterrows():
    if row.at['cpa'] == 'Halo':
        df2.at[i, 'cpa'] = np.nan
        df2.at[i, 'is_halo'] = True

In [21]:
# adds column to indicate if width is given as a lower bound
df2.insert(len(df2.columns), 'width_lower_bound', False)

# loops through rows and, if width is a lower bound, removes the '>' and sets width_lower_bound to True
for i, row in df2.iterrows():
    if str(row.at['width']).startswith('>'):
        df2.at[i, 'width'] = row.at['width'].replace('>', '')
        df2.at[i, 'width_lower_bound'] = True
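
The two flag columns can also be derived without explicit loops. This is a minimal sketch on a hypothetical three-row frame named `cme`; the `'>120'` width is an invented value that only illustrates the lower-bound marker.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one ordinary row, one halo row, one lower-bound width.
cme = pd.DataFrame({
    'cpa': ['74', 'Halo', np.nan],
    'width': ['79', '360', '>120'],
})

# Flag halo rows, then blank out their cpa so the column holds only angles.
cme['is_halo'] = cme['cpa'].eq('Halo')
cme['cpa'] = cme['cpa'].mask(cme['is_halo'])

# Flag widths given as lower bounds, then strip the leading '>'.
cme['width_lower_bound'] = cme['width'].str.startswith('>', na=False)
cme['width'] = cme['width'].str.lstrip('>')

# With the markers removed, both columns can be converted to numbers.
cme['cpa'] = pd.to_numeric(cme['cpa'])
cme['width'] = pd.to_numeric(cme['width'])
print(cme)
```

Once the text markers are moved into their own boolean columns, `cpa` and `width` hold only numeric values, which is what makes them homogeneous.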

In [22]:
# loops through rows and converts time columns to datetime columns
for i, row in df2.iterrows():
    
    # start
    inc = 0 # how many days to increment by (accounts for time = 24:00)
    date = row.at['start_date']
    time = row.at['start_time']
    if time == '24:00':
        time = '00:00'
        inc += 1
        
    datetime = pd.to_datetime(date + " " + time)
    datetime = datetime + pd.Timedelta(days=inc)
    df2.at[i, 'start_time'] = datetime
    
    # end
    inc = 0
    date = str(df2.at[i, 'start_time'].year) + "/" + row.at['end_date']
    time = row.at['end_time']
    if time == '24:00':
        time = '00:00'
        inc += 1
    
    datetime = pd.to_datetime(date + " " + time)
    datetime = datetime + pd.Timedelta(days=inc)
    if datetime < df2.at[i, 'start_time']: # if solar flare spanned new year
        new_year = datetime.year + 1
        datetime = datetime.replace(year=new_year) # replace returns a new Timestamp, so assign it
    
    df2.at[i, 'end_time'] = datetime
    
    # cme
    if row.at['cme_date'] != row.at['cme_date']: # if NaN, this is true
        datetime = np.nan
    else:
        inc = 0
        date = str(df2.at[i, 'start_time'].year) + "/" + row.at['cme_date']
        time = row.at['cme_time']
        if time == '24:00':
            time = '00:00'
            inc += 1

        datetime = pd.to_datetime(date + " " + time)
        datetime = datetime + pd.Timedelta(days=inc)
        if datetime < df2.at[i, 'start_time']: # if cme after new year
            new_year = datetime.year + 1
            datetime = datetime.replace(year=new_year) # replace returns a new Timestamp, so assign it
    
    df2.at[i, 'cme_time'] = datetime
    
df2 = df2.drop(columns=['start_date', 'end_date', 'cme_date'])
df2 = df2.rename(columns={'start_time': 'start_datetime', 'end_time': 'end_datetime', 'cme_time': 'cme_datetime'})

df2

Unnamed: 0,start_datetime,end_datetime,start_frequency,end_frequency,flare_location,flare_region,importance,cme_datetime,cpa,width,speed,is_halo,width_lower_bound
0,1997-04-01 14:00:00,1997-04-01 14:15:00,8000.0,4000.0,S25E16,8026,M1.3,1997-04-01 15:18:00,74.0,79,312.0,False,False
1,1997-04-07 14:30:00,1997-04-07 17:30:00,11000.0,1000.0,S28E19,8027,C6.8,1997-04-07 14:27:00,,360,878.0,True,False
2,1997-05-12 05:15:00,1997-05-14 16:00:00,12000.0,80.0,N21W08,8038,C1.3,1997-05-12 05:30:00,,360,464.0,True,False
3,1997-05-21 20:20:00,1997-05-21 22:00:00,5000.0,500.0,N05W12,8040,M1.3,1997-05-21 21:00:00,263.0,165,296.0,False,False
4,1997-09-23 21:53:00,1997-09-23 22:16:00,6000.0,2000.0,S29E25,8088,C1.4,1997-09-23 22:02:00,133.0,155,712.0,False,False
5,1997-11-03 05:15:00,1997-11-03 12:00:00,14000.0,250.0,S20W13,8100,C8.6,1997-11-03 05:28:00,240.0,109,227.0,False,False
6,1997-11-03 10:30:00,1997-11-03 11:30:00,14000.0,5000.0,S16W21,8100,M4.2,1997-11-03 11:11:00,233.0,122,352.0,False,False
7,1997-11-04 06:00:00,1997-11-05 04:30:00,14000.0,100.0,S14W33,8100,X2.1,1997-11-04 06:10:00,,360,785.0,True,False
8,1997-11-06 12:20:00,1997-11-07 08:30:00,14000.0,100.0,S18W63,8100,X9.4,1997-11-06 12:10:00,,360,1556.0,True,False
9,1997-11-27 13:30:00,1997-11-27 14:00:00,14000.0,7000.0,N17E63,8113,X2.6,1997-11-27 13:56:00,98.0,91,441.0,False,False
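
The two special cases in the datetime conversion (a time of 24:00, and an end date that falls in the following year) can be isolated in small helpers. This is a sketch only: the function names `to_timestamp` and `resolve_end` are hypothetical, and the new-year example uses invented dates rather than a row from the table.

```python
import pandas as pd

def to_timestamp(date, time):
    """Parse 'YYYY/MM/DD' and 'HH:MM', treating 24:00 as 00:00 on the next day."""
    extra_days = 0
    if time == '24:00':
        time, extra_days = '00:00', 1
    parsed = pd.to_datetime(date + ' ' + time, format='%Y/%m/%d %H:%M')
    return parsed + pd.Timedelta(days=extra_days)

def resolve_end(start, month_day, time):
    """Give a year-less 'MM/DD' date the start's year, rolling over at new year."""
    end = to_timestamp(f'{start.year}/{month_day}', time)
    if end < start:
        # Timestamp.replace returns a new object, so the result must be assigned.
        end = end.replace(year=end.year + 1)
    return end

# Invented example of an event that spans the new year.
start = to_timestamp('2001/12/31', '22:00')
end = resolve_end(start, '01/01', '03:00')
print(start, end)

# 24:00 rolls over to midnight of the following day.
print(to_timestamp('1997/11/04', '24:00'))
```

The same helper would serve for both the end and CME columns, since both carry only a month and day.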


# Part 2: Analysis

## Question 1: Replication

In [23]:
# creates new data frame with NASA data, removes all NaN importance rows
df2_top50 = df2.dropna(subset=['importance'])

# loops through rows, drops non X-class flares and makes importance a float so rows can be sorted
# (all of top 50 will be X-class, so this is okay)
for i, row in df2_top50.iterrows():
    if row.at['importance'][0] == 'X':
        df2_top50.at[i, 'importance'] = float(row.at['importance'].replace('X', ''))
    else:
        df2_top50 = df2_top50.drop(i)
df2_top50.rename(columns={'importance': 'x_class'}, inplace=True) # they're all X-class now

# sorts by X-class
df2_top50 = df2_top50.sort_values(by='x_class', ascending=False)

df2_top50.reset_index(inplace=True)

# drop all rows outside of the top 50
df2_top50 = df2_top50.drop(range(50, len(df2_top50)))
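
The same filter, convert, and sort sequence can be sketched without dropping rows inside a loop. The sample below reuses a few importance values from the NASA table plus one missing entry; the frame name `flares` and the choice of the top 2 instead of the top 50 are only for illustration.

```python
import pandas as pd

# A few importance values taken from the NASA table, plus one missing entry.
flares = pd.DataFrame({'importance': ['M1.3', 'X9.4', None, 'X2.1', 'C6.8', 'X9.3']})

# Keep only X-class rows; na=False treats missing importance as "not X".
is_x = flares['importance'].str.startswith('X', na=False)
top = flares[is_x].copy()

# Strip the leading letter and convert the magnitude to a float.
top['x_class'] = top['importance'].str[1:].astype(float)

# nlargest sorts by magnitude and keeps the requested number of rows in one step.
top = top.nlargest(2, 'x_class').reset_index(drop=True)
print(top)
```

Working on a copy of the filtered rows also avoids modifying a frame while it is being iterated.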

Top 50 from NASA table:

In [24]:
df2_top50

Unnamed: 0,index,start_datetime,end_datetime,start_frequency,end_frequency,flare_location,flare_region,x_class,cme_datetime,cpa,width,speed,is_halo,width_lower_bound
0,240,2003-11-04 20:00:00,2003-11-05 00:00:00,10000,200,S19W83,10486.0,28.0,2003-11-04 19:54:00,,360.0,2657.0,True,False
1,117,2001-04-02 22:05:00,2001-04-03 02:30:00,14000,250,N19W72,9393.0,20.0,2001-04-02 22:06:00,261.0,244.0,2505.0,False,False
2,233,2003-10-28 11:10:00,2003-10-30 00:00:00,14000,40,S16E08,10486.0,17.0,2003-10-28 11:30:00,,360.0,2459.0,True,False
3,126,2001-04-15 14:05:00,2001-04-16 13:00:00,14000,40,S20W85,9415.0,14.0,2001-04-15 14:06:00,245.0,167.0,1199.0,False,False
4,234,2003-10-29 20:55:00,2003-10-30 00:00:00,11000,500,S15W02,10486.0,10.0,2003-10-29 20:54:00,,360.0,2029.0,True,False
5,8,1997-11-06 12:20:00,1997-11-07 08:30:00,14000,100,S18W63,8100.0,9.4,1997-11-06 12:10:00,,360.0,1556.0,True,False
6,514,2017-09-06 12:05:00,2017-09-07 08:00:00,16000,70,S08W33,12673.0,9.3,2017-09-06 12:24:00,,360.0,1571.0,True,False
7,328,2006-12-05 10:50:00,2006-12-05 20:00:00,14000,250,S07E68,10930.0,9.0,,,,,False,False
8,237,2003-11-02 17:30:00,2003-11-03 01:00:00,12000,250,S14W56,10486.0,8.3,2003-11-02 17:30:00,,360.0,2598.0,True,False
9,515,2017-09-10 16:02:00,2017-09-11 06:50:00,16000,150,S09W92,,8.3,2017-09-10 16:00:00,,360.0,3163.0,True,False


SpaceWeatherLive top 50:

In [25]:
df1

Unnamed: 0,rank,x_class,region,start_datetime,max_datetime,end_datetime
0,1,X28+,486,2003-11-04 19:29:00,2003-11-04 19:53:00,2003-11-04 20:06:00
1,2,X20+,9393,2001-04-02 21:32:00,2001-04-02 21:51:00,2001-04-02 22:03:00
2,3,X17.2+,486,2003-10-28 09:51:00,2003-10-28 11:10:00,2003-10-28 11:24:00
3,4,X17+,808,2005-09-07 17:17:00,2005-09-07 17:40:00,2005-09-07 18:03:00
4,5,X14.4,9415,2001-04-15 13:19:00,2001-04-15 13:50:00,2001-04-15 13:55:00
5,6,X10,486,2003-10-29 20:37:00,2003-10-29 20:49:00,2003-10-29 21:01:00
6,7,X9.4,8100,1997-11-06 11:49:00,1997-11-06 11:55:00,1997-11-06 12:01:00
7,8,X9.3,2673,2017-09-06 11:53:00,2017-09-06 12:02:00,2017-09-06 12:10:00
8,9,X9,930,2006-12-05 10:18:00,2006-12-05 10:35:00,2006-12-05 10:45:00
9,10,X8.3,486,2003-11-02 17:03:00,2003-11-02 17:25:00,2003-11-02 17:39:00


I think I've replicated the top 50 somewhat well. The first three clearly match, based on classification and time. The third flare and some others have classifications that are rounded down in the NASA table (X17.2 -> X17). The fourth flare, rated X17 on SWL, is missing from my NASA top 50. I checked the NASA website to see whether I had lost the data somewhere, but it is not there either. However, there is a flare that occurred on the same date at around the same time that is classified as X1.7 instead of X17, which could be an error in the NASA data. The next 14 flares seem to match, but after that the NASA table is missing a couple of X5.4s.

I've also noticed that the NASA table seems to round the time data, while SWL has it to the minute.

## Question 2: Integration

In [26]:
df2.insert(len(df2.columns), 'rank', np.nan) # column for rank from SpaceWeatherLive

# loop through SWL flares and NASA flares, find the NASA flare with a starting datetime that's closest to
# the corresponding SWL one and set its rank to the corresponding SWL rank
for i1, row1 in df1.iterrows():
    lowest = abs(row1.at['start_datetime'] - df2.at[0, 'start_datetime']).total_seconds()
    lowest_i = 0
    for i2, row2 in df2.iterrows():
        current = abs(row1.at['start_datetime'] - row2.at['start_datetime']).total_seconds()
        rank = df2.at[i2, 'rank'] # rank already associated with row
        if rank != rank and current < lowest: # rank != rank is only true for NaN, i.e. the row is still unmatched
            lowest = current
            lowest_i = i2
    df2.at[lowest_i, 'rank'] = row1.at['rank']    

In [28]:
df2.sort_values(by='rank')

Unnamed: 0,start_datetime,end_datetime,start_frequency,end_frequency,flare_location,flare_region,importance,cme_datetime,cpa,width,speed,is_halo,width_lower_bound,rank
240,2003-11-04 20:00:00,2003-11-05 00:00:00,10000.0,200.0,S19W83,10486,X28.,2003-11-04 19:54:00,,360,2657.0,True,False,1.0
117,2001-04-02 22:05:00,2001-04-03 02:30:00,14000.0,250.0,N19W72,9393,X20.,2001-04-02 22:06:00,261.0,244,2505.0,False,False,2.0
233,2003-10-28 11:10:00,2003-10-30 00:00:00,14000.0,40.0,S16E08,10486,X17.,2003-10-28 11:30:00,,360,2459.0,True,False,3.0
316,2005-09-07 18:05:00,2005-09-08 00:00:00,12000.0,200.0,S11E77,10808,X1.7,,,,,False,False,4.0
126,2001-04-15 14:05:00,2001-04-16 13:00:00,14000.0,40.0,S20W85,9415,X14.,2001-04-15 14:06:00,245.0,167,1199.0,False,False,5.0
234,2003-10-29 20:55:00,2003-10-30 00:00:00,11000.0,500.0,S15W02,10486,X10.,2003-10-29 20:54:00,,360,2029.0,True,False,6.0
8,1997-11-06 12:20:00,1997-11-07 08:30:00,14000.0,100.0,S18W63,8100,X9.4,1997-11-06 12:10:00,,360,1556.0,True,False,7.0
514,2017-09-06 12:05:00,2017-09-07 08:00:00,16000.0,70.0,S08W33,12673,X9.3,2017-09-06 12:24:00,,360,1571.0,True,False,8.0
328,2006-12-05 10:50:00,2006-12-05 20:00:00,14000.0,250.0,S07E68,10930,X9.0,,,,,False,False,9.0
237,2003-11-02 17:30:00,2003-11-03 01:00:00,12000.0,250.0,S14W56,10486,X8.3,2003-11-02 17:30:00,,360,2598.0,True,False,10.0


I defined "best matching" as having the closest start datetime. I initially tried to make it prioritize the lowest time differences first, but that implementation became too complicated. The way it is now, the higher ranks get to find a match first, so if a lower rank is closest to a flare that's already been matched, it has to match with the flare with the second closest start datetime.

This solution works very well at first, as you can see by looking at the ranked flares in order and their classifications. The classifications match what's on SpaceWeatherLive and descend as rank descends. At the 4th-ranked flare you can see the possible error I mentioned in the last question: what should be an X17 flare according to SWL is listed as an X1.7 flare. The issues begin at rank 24, where a C2.9 flare somehow snuck into the 24th spot. Many more matching errors occur on the lower end of the top 50.
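
For reference, the "lowest time differences first" idea can be sketched fairly compactly by sorting every candidate pair by its gap and accepting pairs greedily. This is not the approach used above, only one possible way to do it; the tiny `swl` and `nasa` frames reuse start times from the tables, and the `max_gap` cutoff is a hypothetical addition whose value I have not tuned.

```python
import numpy as np
import pandas as pd

# Small samples reusing start times from the two tables above.
swl = pd.DataFrame({
    'rank': [1, 2],
    'start_datetime': pd.to_datetime(['2003-11-04 19:29', '2001-04-02 21:32']),
})
nasa = pd.DataFrame({
    'start_datetime': pd.to_datetime(
        ['2001-04-02 22:05', '1997-04-01 14:00', '2003-11-04 20:00']),
})
nasa['rank'] = np.nan

# Every (gap, SWL index, NASA index) combination, smallest gap first.
candidates = sorted(
    (abs(s_time - n_time), s_idx, n_idx)
    for s_idx, s_time in swl['start_datetime'].items()
    for n_idx, n_time in nasa['start_datetime'].items()
)

max_gap = pd.Timedelta(days=1)  # hypothetical cutoff: leave distant flares unmatched
used_swl, used_nasa = set(), set()
for gap, s_idx, n_idx in candidates:
    if gap > max_gap:
        break  # all remaining pairs are even further apart
    if s_idx in used_swl or n_idx in used_nasa:
        continue  # one side of this pair already has a closer match
    nasa.at[n_idx, 'rank'] = swl.at[s_idx, 'rank']
    used_swl.add(s_idx)
    used_nasa.add(n_idx)

print(nasa)
```

With 50 SWL flares and roughly 500 NASA rows this is only about 25,000 pairs, so sorting them all is cheap. A cutoff like `max_gap` might also keep unrelated flares, such as the C2.9 at rank 24, from being matched at all, though I have not checked that against the full data.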

## Question 3: Analysis