## ALFABURST Event Buffer Feature Builder

The ALFABURST commensal FRB search survey searches for dedisperesed pulses above a signal to noise of 10 across of 56 MHz band. Data is processed in time windows of 2^15 * 256 microseconds (~8.4 seconds), 512 frequency channels. If a pulse is detected the entire time window is recorded to disk.

The vast majority of detected pulses are false-positive events due to human-made RFI. Only a small minority of events (less than 1%) is due to astrophysical sources, primarily bright pulses from pulsars. The RFI takes on a wide range of characteristics. In the processing pipeline the brightest RFI is clipped and replaced, but low-level RFI and spectra statistics still lead to an excess of false-positives.

In order to automate the processing the 150000+ recorded buffers a classifier model would be useful to ***probabilistically*** classify each event. Approximately 15000 events have been labelled into 10 different categories. We can use this *labelled* data set for training a model.

In [1]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cPickle as pickle
import os

%matplotlib inline

In [2]:
#BASE_DATA_PATH = '/local/griffin/data/alfaburst/priorityModel/' #AX
BASE_DATA_PATH = '/home/griffin/data/alfa/priorityModel/' #WATERMARK

#### Build buffer database

In [3]:
baseBufferPklFile = BASE_DATA_PATH + 'ALFAbuffers.pkl'

# load baseBufferPkl
df = pd.read_pickle(baseBufferPklFile)

# create a predicted label column with 'unlabelled' label
df = df.assign(predictLabel=-1)

The intial buffer dataframe contains a list of all buffers with meta-data such as time, beam ID, and buffer ID. There is also global statistics for each buffer usch as number of events in the buffer and the maximum SNR event. The label column is initially empty, we need to fill it with the labels.

In [4]:
print df.describe()
print df.columns.values

               Beam        Buffer      MJDstart        bestDM       bestSNR  \
count  73282.000000  73282.000000  73282.000000  73282.000000  73282.000000   
mean       3.556617    211.437952  57392.275750    994.132829     12.890546   
std        2.433997    272.753575    186.092812   2270.499049     79.265038   
min        0.000000      1.000000  57197.378446      0.000000      6.001704   
25%        1.000000     25.000000  57275.173537      7.000000     10.510364   
50%        4.000000    111.000000  57328.330289     13.000000     11.301913   
75%        6.000000    295.000000  57372.219326    273.000000     12.641252   
max        6.000000   2001.000000  57893.840891  10039.000000  20954.304688   

          BinFactor        Events         DMmax         DMmin        DMmean  \
count  73282.000000  7.328200e+04  73282.000000  73282.000000  73282.000000   
mean      14.171406  5.290837e+03   1843.879770    355.848537   1106.626399   
std       20.233713  3.448143e+04   3329.930528   1

#### Add additional buffer features

In [5]:
# metadata and features pickles
baseDedispDirs = [BASE_DATA_PATH + 'snr14_dm50/',
                  BASE_DATA_PATH + 'snr11-14_dm50/',
                  BASE_DATA_PATH + 'snr10-11_dm50/']
#baseDedispDirs = [BASE_DATA_PATH + 'test/']

for dDir in baseDedispDirs:
    for subDir in os.listdir(dDir):
        if os.path.isdir(dDir + '/' + subDir):
            metaPklFns = glob.glob(dDir + subDir + '/*.meta.pkl')
            if len(metaPklFns) > 0:
                print 'Found features in ', dDir + subDir
                
                for mIdx, metaPkl in enumerate(metaPklFns):
                    
                    # Event meta-data
                    baseMetaFn = os.path.basename(metaPkl)
                    bufID = int(baseMetaFn.split('.')[1].split('buffer')[-1])
                    metaDict = pickle.load(open(metaPkl, 'rb'))
                    idx = df.loc[(df['datfile']==metaDict['dat']) & (df['Buffer']==bufID)].index
                    
                    df.ix[idx, 'filterbank'] = metaDict['filterbank']
                        
                    # Percent of a time series which is 0
                    df.ix[idx, 'pctZero'] = metaDict.get('pctZero', 0.)
                    # take the 0-dm time series derivative, calculate the percent of time series with derivative=0
                    df.ix[idx, 'pctZeroDeriv'] = metaDict.get('pctZeroDeriv', 0.)
                    
                    # Overflow counter
                    # number of values which are above 1e20 threshold
                    ofDict = metaDict.get('overflows', {'ncount': 0, 'pct': 0.})
                    df.ix[idx, 'ofCount'] = ofDict['ncount']
                    df.ix[idx, 'ofPct'] = ofDict['pct']
                    
                    # Longest continuous run of a constant in the dedispersed time series
                    # tuple: (maxRun, maxVal, maxRun / float(arr.size))
                    longestRun = metaDict.get('longestRun', (0, 0., 0.))
                    df.ix[idx, 'longestRun0'] = longestRun[0]
                    df.ix[idx, 'longestRun1'] = longestRun[1]
                    df.ix[idx, 'longestRun2'] = longestRun[2]
                    
                    # Global statistics of the DM-0 time series
                    globalTimeStats = metaDict.get('globalTimeStats', {'std': 0., 'max': 0., 'posCount': 0, \
                                                                       'min': 0., 'negPct': 0., 'median': 0.,\
                                                                       'meanMedianRatio': 0., 'posPct': 0.,\
                                                                       'negCount': 0, 'maxMinRatio': 0.,\
                                                                       'mean': 0. })
                    
                    df.ix[idx, 'globtsStatsStd'] = globalTimeStats['std']
                    df.ix[idx, 'globtsStatsMax'] = globalTimeStats['max']
                    df.ix[idx, 'globtsStatsPosCnt'] = globalTimeStats['posCount']
                    df.ix[idx, 'globtsStatsMin'] = globalTimeStats['min']
                    df.ix[idx, 'globtsStatsNegPct'] = globalTimeStats['negPct']
                    df.ix[idx, 'globtsStatsMedian'] = globalTimeStats['median']
                    df.ix[idx, 'globtsStatsRatio0'] = globalTimeStats['meanMedianRatio']
                    df.ix[idx, 'globtsStatsPosPct'] = globalTimeStats['posPct']
                    df.ix[idx, 'globtsStatsNegCnt'] = globalTimeStats['negCount']
                    df.ix[idx, 'globtsStatsRatio1'] = globalTimeStats['maxMinRatio']
                    df.ix[idx, 'globtsStatsMean'] = globalTimeStats['mean']
                    
                    # Global statistics of the best DM time series
                    globalDedispTimeStats = metaDict.get('globalDedispTimeStats', {'std': 0., 'max': 0., \
                                                                       'posCount': 0,
                                                                       'min': 0., 'negPct': 0., 'median': 0.,\
                                                                       'meanMedianRatio': 0., 'posPct': 0.,\
                                                                       'negCount': 0, 'maxMinRatio': 0.,\
                                                                       'mean': 0. })
                    
                    df.ix[idx, 'globDedisptsStatsStd'] = globalDedispTimeStats['std']
                    df.ix[idx, 'globDedisptsStatsMax'] = globalDedispTimeStats['max']
                    df.ix[idx, 'globDedisptsStatsPosCnt'] = globalDedispTimeStats['posCount']
                    df.ix[idx, 'globDedisptsStatsMin'] = globalDedispTimeStats['min']
                    df.ix[idx, 'globDedisptsStatsNegPct'] = globalDedispTimeStats['negPct']
                    df.ix[idx, 'globDedisptsStatsMedian'] = globalDedispTimeStats['median']
                    df.ix[idx, 'globDedisptsStatsRatio0'] = globalDedispTimeStats['meanMedianRatio']
                    df.ix[idx, 'globDedisptsStatsPosPct'] = globalDedispTimeStats['posPct']
                    df.ix[idx, 'globDedisptsStatsNegCnt'] = globalDedispTimeStats['negCount']
                    df.ix[idx, 'globDedisptsStatsRatio1'] = globalDedispTimeStats['maxMinRatio']
                    df.ix[idx, 'globDedisptsStatsMean'] = globalDedispTimeStats['mean']
                    
                    # Statistics of 16 segments of the DM-0 time series
                    windZeros = np.zeros(16)
                    windTime = metaDict.get('windTimeStats',{'std':windZeros, 'max':windZeros, \
                                                             'min':windZeros, 'snr':windZeros, \
                                                             'mean':windZeros})
                    for i in range(16):
                        df.ix[idx, 'windTimeStatsStd'+str(i)] = windTime['std'][i]
                        df.ix[idx, 'windTimeStatsMax'+str(i)] = windTime['max'][i]
                        df.ix[idx, 'windTimeStatsMin'+str(i)] = windTime['min'][i]
                        df.ix[idx, 'windTimeStatsSnr'+str(i)] = windTime['snr'][i]
                        df.ix[idx, 'windTimeStatsMean'+str(i)] = windTime['mean'][i]
                        
                    # Statistics of 16 segments of the best DM time series
                    windDedispTime = metaDict.get('windDedispTimeStats',{'std':windZeros, 'max':windZeros,\
                                                                         'min':windZeros, 'snr':windZeros,\
                                                                         'mean':windZeros})
                    for i in range(16):
                        df.ix[idx, 'windDedispTimeStatsStd'+str(i)] = windDedispTime['std'][i]
                        df.ix[idx, 'windDedispTimeStatsMax'+str(i)] = windDedispTime['max'][i]
                        df.ix[idx, 'windDedispTimeStatsMin'+str(i)] = windDedispTime['min'][i]
                        df.ix[idx, 'windDedispTimeStatsSnr'+str(i)] = windDedispTime['snr'][i]
                        df.ix[idx, 'windDedispTimeStatsMean'+str(i)] = windDedispTime['mean'][i]
                    
                    # Statistics of the coarsely pixelized spectrogram
                    pixelZeros = np.zeros((16, 4))
                    pixels = metaDict.get('pixels',{'max':pixelZeros, 'min':pixelZeros, 'mean':pixelZeros})
                    for i in range(16):
                        for j in range(4):
                            df.ix[idx, 'pixelMax_%i_%i'%(i,j)] = pixels['max'][i][j]
                            df.ix[idx, 'pixelMin_%i_%i'%(i,j)] = pixels['max'][i][j]
                            df.ix[idx, 'pixelMean_%i_%i'%(i,j)] = pixels['max'][i][j]

Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.ak
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.am
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.ai
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.al
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.aa
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.aj
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.ad
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.ag
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.ab
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.af
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.ac
Found features in  /home/griffin/data/alfa/priorityModel/snr14_dm50/snr14_dm50.ah
Found features i

In [6]:
print df['pixelMin_1_0'].dropna()
#print df

60        0.000000
62        0.000000
63        0.000000
73        0.000000
74        0.000000
76       17.259674
77        0.000000
137       0.000000
183       0.000000
184       0.000000
185       0.000000
190      18.284214
191       0.000000
195      26.616144
196       0.000000
197       0.000000
198       0.000000
200       0.000000
201       0.000000
202       0.000000
203       0.000000
206       0.000000
207      11.540236
209       0.000000
210       0.000000
212       0.000000
213       0.000000
214       0.000000
215      11.769779
217       0.000000
           ...    
66198    60.269836
66199    35.591095
66200    38.080162
66201    39.586185
66202    59.194279
66203    40.457069
66204    58.731380
66205    46.999820
66206    17.956202
66207    41.805580
66208     0.000000
66209    47.248844
66210    18.441801
66211    20.029493
66212    51.569515
66213    45.532402
66214    18.825869
66215     7.698993
66216     6.647788
66217     0.000000
66218    27.258652
66219    26.

#### Add labels

In [7]:
# output of labelImg2.py
labelPKlFiles = glob.glob(BASE_DATA_PATH + 'allLabels/*.pkl')

# add assigned labels to main dataframe
for lPkl in labelPKlFiles:
    print 'Reading labels from', lPkl
    labelDict = pickle.load(open(lPkl, 'rb'))
    for key,val in labelDict.iteritems():
        fbFN = key.split('buffer')[0] + 'fil'
        bufID = int(key.split('.')[1].split('buffer')[-1])
        df.loc[(df['filterbank']==fbFN) & (df['Buffer']==bufID), 'Label'] = val

Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr11-14_dm50.al.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr10-11_dm50.ah.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr14_dm50.ah.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr11-14_dm50.ac.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr10-11_dm50.ad.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr14_dm50.am.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr11-14_dm50.af.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr11-14_dm50.ag.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr11-14_dm50.aa.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr14_dm50.af.pkl
Reading labels from /home/griffin/data/alfa/priorityModel/allLabels/snr14_dm50.ai.pkl
Reading labels from /home/griffin

In [8]:
print df['Label'].describe()

count    73282.000000
mean         0.186676
std          2.557381
min         -1.000000
25%         -1.000000
50%         -1.000000
75%         -1.000000
max          9.000000
Name: Label, dtype: float64


In [13]:
print df['Label'].value_counts()

-1    58212
 6     4649
 2     4159
 3     1898
 8     1594
 7      863
 9      685
 5      617
 4      448
 1      151
 0        6
Name: Label, dtype: int64


#### Save combined dataframe to file

This would be a good point to split into a new notebook as the previous setups have been run to combine the various labels and features into a single dataframe. We will likely not need to re-run this code often, and as it takes a few minutes to run we can just save the final dataframe to file. Then use that dataframe as the starting point for the model.

In [9]:
df.to_pickle('featureDataframe.pkl')