### Find duplicate samples
The interactive data checks below were run on an intermediate metadata file. It has the same layout as the published meta_data.csv plus three extra columns: "end", "peak" and "type" (the manually labeled class).

In [25]:
import datetime as dt
import pandas as pd

# Intermediate metadata: one row per sample, indexed by the sample id.
df = pd.read_csv('C:/Users/Roman Bolzern/Desktop/D4/neu/train.csv', sep=";", parse_dates=["start", "end", "peak"], index_col="id")
# Image offsets relative to the sample start: 0 h, 7 h, 10 h 30 min and 11 h 50 min.
timesteps = pd.TimedeltaIndex([dt.timedelta(minutes=d) for d in [0, 7*60, 10*60+30, 11*60+50]])
# Build one row per image by shifting every sample start by each offset.
all_image_times = []
for t in timesteps:
    new_df = df[['start', 'noaa_num']].copy()
    new_df['start'] = new_df['start'] + t
    all_image_times.append(new_df)
all_image_times = pd.concat(all_image_times)
# Count how often each (active region, image time) pair occurs.
res = all_image_times.groupby(by=['noaa_num', 'start']).size().reset_index(name='counts')
print(f"{len(res[res['counts'] > 1])} of {len(all_image_times)} images have a duplicate")
res[res['counts'] > 1]

26 of 33348 images have a duplicate


index,noaa_num,start,counts
417,11401,2012-01-22 01:33:43,2
1757,11450,2012-04-03 04:27:01,2
3308,11504,2012-06-19 04:28:01,2
4264,11528,2012-07-28 23:24:01,2
6628,11635,2012-12-21 11:31:00,2
6629,11635,2012-12-21 18:31:00,2
6630,11635,2012-12-21 22:01:00,2
6631,11635,2012-12-21 23:21:00,2
7485,11663,2013-01-30 03:04:01,2
9708,11743,2013-05-11 08:53:01,2
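
The table above shows which image times are duplicated, but not which samples they belong to. A minimal sketch of how one could recover the sample ids, assuming the same four offsets as above; the toy frame, its ids and the `image_time` column are made up for illustration and are not part of the real metadata:

```python
import pandas as pd

# Toy stand-in for the metadata; ids, AR numbers and times are hypothetical.
toy = pd.DataFrame(
    {
        "noaa_num": [1, 1, 1, 2],
        "start": pd.to_datetime(
            [
                "2012-01-01 00:00",
                "2012-01-01 00:00",  # same start as sample "a"
                "2012-01-01 07:00",  # starts exactly 7 h after "a"
                "2012-01-01 00:00",  # same start, but a different AR
            ]
        ),
    },
    index=pd.Index(["a", "b", "c", "d"], name="id"),
)

# Same four image offsets (in minutes) as in the cell above.
offsets = pd.to_timedelta([0, 7 * 60, 10 * 60 + 30, 11 * 60 + 50], unit="min")

# One row per (sample, offset), keeping the sample id as a regular column.
base = toy.reset_index()
images = pd.concat(
    [base.assign(image_time=base["start"] + off) for off in offsets],
    ignore_index=True,
)

# keep=False marks every member of a duplicate group, not only the repeats.
dup_mask = images.duplicated(subset=["noaa_num", "image_time"], keep=False)
dup_images = images[dup_mask].sort_values(["noaa_num", "image_time", "id"])
dup_ids = sorted(dup_images["id"].unique())
print(dup_ids)
```

In this toy data, sample "c" is flagged as well, because its first image coincides with the 7 h image of "a" and "b". So a duplicated image time does not necessarily mean that two samples share the same start.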


Not too wild, but AR 11635 looks suspicious:

In [26]:
df['peak_after_end'] = df['peak'] - df['end']
df[df['noaa_num'] == 11635].sort_values(by=['start'])

id,noaa_num,start,end,type,peak,peak_flux,peak_after_end
11635_2012_12_19_12_00_00_0,11635,2012-12-19 00:00:00,2012-12-19 12:00:00,B4.9,2012-12-20 09:46:00,5.764706e-07,21:46:00
11635_2012_12_19_12_00_00_1,11635,2012-12-19 16:52:40,2012-12-20 04:52:40,B4.9,2012-12-20 09:46:00,5.764706e-07,04:53:20
11635_2012_12_20_04_54_01_0,11635,2012-12-19 16:54:01,2012-12-20 04:54:01,B6.0,2012-12-21 04:54:00,7.058824e-07,23:59:59
11635_2012_12_20_17_08_01_0,11635,2012-12-20 05:08:01,2012-12-20 17:08:01,B5.4,2012-12-21 17:08:00,6.352941e-07,23:59:59
11635_2012_12_20_19_34_01_0,11635,2012-12-20 07:34:01,2012-12-20 19:34:01,B6.5,2012-12-21 19:34:00,7.647059e-07,23:59:59
11635_2012_12_20_20_41_01_0,11635,2012-12-20 08:41:01,2012-12-20 20:41:01,B6.4,2012-12-21 20:41:00,7.529412e-07,23:59:59
11635_2012_12_20_04_54_01_1,11635,2012-12-20 11:53:28,2012-12-20 23:53:28,B6.0,2012-12-21 04:54:00,7.058824e-07,05:00:32
11635_2012_12_20_23_55_01_0,11635,2012-12-20 11:55:01,2012-12-20 23:55:01,C1.9,2012-12-21 23:55:00,2.235294e-06,23:59:59
11635_2012_12_21_04_05_01_0,11635,2012-12-20 16:05:01,2012-12-21 04:05:01,C1.9,2012-12-22 04:05:00,2.235294e-06,23:59:59
11635_2012_12_20_23_55_01_1,11635,2012-12-21 11:31:00,2012-12-21 23:31:00,C1.9,2012-12-21 23:55:00,2.235294e-06,00:24:00


Two groups of rows catch my eye: the samples starting at 2012-12-21 11:31:00 and the ones starting on 2012-12-25.

In [27]:
df[(df['noaa_num'] == 11635) & (df['start'] == pd.Timestamp('2012-12-21 11:31:00'))].sort_values(by=['start'])

id,noaa_num,start,end,type,peak,peak_flux,peak_after_end
11635_2012_12_21_04_05_01_1,11635,2012-12-21 11:31:00,2012-12-21 23:31:00,C1.9,2012-12-22 04:05:00,2e-06,04:34:00
11635_2012_12_20_23_55_01_1,11635,2012-12-21 11:31:00,2012-12-21 23:31:00,C1.9,2012-12-21 23:55:00,2e-06,00:24:00


In [28]:
df[(df['noaa_num'] == 11635) & (df['start'].dt.day == 25)].sort_values(by=['start'])

id,noaa_num,start,end,type,peak,peak_flux,peak_after_end
11635_2012_12_26_04_56_47_0,11635,2012-12-25 16:56:47,2012-12-26 04:56:47,C1.6,2012-12-27 02:24:00,2e-06,21:27:13
11635_2012_12_26_04_57_30_0,11635,2012-12-25 16:57:30,2012-12-26 04:57:30,C1.6,2012-12-26 11:25:00,2e-06,06:27:30
11635_2012_12_26_04_57_59_0,11635,2012-12-25 16:57:59,2012-12-26 04:57:59,C1.6,2012-12-26 17:14:00,2e-06,12:16:01


It seems the first group covers two separate C1.9 flares (peaks at 2012-12-21 23:55 and 2012-12-22 04:05) whose samples were cut to the same 12-hour range. Maybe there are flares before and after this range, so that no other cut was possible. IMHO not worth further investigation.
The second group's ranges overlap heavily, even though their peaks differ. It seems to be the same issue as before. One might ask: are there many such overlapping samples in the dataset?
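
To put a number on "overlapping heavily", one could measure how much of each sample's range is shared with the preceding sample of the same AR. A minimal sketch on made-up toy data, assuming 12-hour samples as in the tables above; the `overlap_with_prev` column is a name I picked for illustration, and only direct neighbours are compared:

```python
import pandas as pd

# Toy samples; the first two starts mimic the 2012-12-25 group, the rest is made up.
toy = pd.DataFrame(
    {
        "noaa_num": [11635, 11635, 11635, 99999],
        "start": pd.to_datetime(
            [
                "2012-12-25 16:56:47",
                "2012-12-25 16:57:30",
                "2012-12-26 10:00:00",
                "2012-12-25 17:00:00",
            ]
        ),
    }
)
# Assumption: every sample spans 12 hours.
toy["end"] = toy["start"] + pd.Timedelta(hours=12)

# Sort so that shift() returns the previous sample of the same AR.
toy = toy.sort_values(["noaa_num", "start"]).reset_index(drop=True)
prev_end = toy.groupby("noaa_num")["end"].shift()

# Positive where the previous sample ends after this one starts.
raw = prev_end - toy["start"]
zero = pd.Timedelta(0)
# Negative gaps mean "no overlap"; NaT (first sample of an AR) is left untouched.
toy["overlap_with_prev"] = raw.where(raw.isna() | (raw > zero), zero)
print(toy[["noaa_num", "start", "overlap_with_prev"]])
```

Two 12-hour samples that start 43 seconds apart share all but 43 seconds of their range, which is what the second row of this toy frame shows.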

In [49]:
# Time between each sample start and the previous sample start in the same AR.
df['delta'] = df.sort_values(by=['start']).groupby('noaa_num')['start'].diff()

In [57]:
threshold = 30
overlapping_samples = df[df['delta'] < pd.Timedelta(minutes=threshold)]
print(f"{len(overlapping_samples)} of {len(df)} samples have a sample start that is followed directly by another sample start (same AR, max {threshold} minutes afterwards)")

583 of 8337 samples have a sample start that is followed directly by another sample start (same AR, max 30 minutes afterwards)


Tbh, 583 strongly overlapping samples are a few more than I expected. It's nothing too bad and their values seem correct, but it might be worth looking into reducing those overlaps at some point.
The 26 (almost) duplicate images aren't too wild either, and taking care of the 583 overlaps would take care of those 26 too.
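
One possible way to take care of the overlaps is a greedy filter that keeps a sample only if it starts at least 30 minutes after the last kept sample of the same AR. This is just a sketch of one option, not something that was applied to the dataset; `thin_overlaps`, `min_gap` and the toy data are hypothetical. Keeping the earliest sample of each cluster is an arbitrary choice, and the dropped samples may carry different peaks.

```python
import pandas as pd


def thin_overlaps(frame, min_gap=pd.Timedelta(minutes=30)):
    """Keep a sample only if it starts >= min_gap after the last kept sample of its AR."""
    keep = []
    for _, group in frame.sort_values("start").groupby("noaa_num"):
        last_kept = None
        for idx, start in group["start"].items():
            if last_kept is None or start - last_kept >= min_gap:
                keep.append(idx)
                last_kept = start
    return frame.loc[keep]


# Toy data with a cluster of close starts in AR 1; ids and times are made up.
toy = pd.DataFrame(
    {
        "noaa_num": [1, 1, 1, 1, 1, 2],
        "start": pd.to_datetime(
            [
                "2012-01-01 00:00",
                "2012-01-01 00:10",
                "2012-01-01 00:25",
                "2012-01-01 00:35",
                "2012-01-01 02:00",
                "2012-01-01 00:00",
            ]
        ),
    },
    index=pd.Index(["a", "b", "c", "d", "e", "f"], name="id"),
)

thinned = thin_overlaps(toy)
print(list(thinned.index))
```

Comparing against the last kept sample, rather than dropping every row with `delta` below the threshold, avoids removing a whole chain of closely spaced samples: here "d" survives because it starts 35 minutes after "a", even though it starts only 10 minutes after "c".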