## ALFABURST Event Buffer Feature Builder

The ALFABURST commensal FRB search survey searches for dedispersed pulses with a signal-to-noise ratio above 10 across a 56 MHz band. Data are processed in time windows of 2^15 * 256 microseconds (~8.4 seconds) with 512 frequency channels. If a pulse is detected, the entire time window is recorded to disk.
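
As a quick check of the window size quoted above, the arithmetic can be written out directly. This is only a sketch using the numbers stated in the text; the variable names are illustrative and do not come from the pipeline.

```python
# Buffer geometry as stated in the text (names are illustrative)
N_SAMPLES = 2 ** 15      # time samples per buffer
T_SAMP_US = 256          # sampling interval in microseconds
N_CHAN = 512             # frequency channels

# Duration of one buffer in seconds
buffer_seconds = N_SAMPLES * T_SAMP_US * 1e-6

# Number of time-frequency values held in one buffer
values_per_buffer = N_SAMPLES * N_CHAN

print(round(buffer_seconds, 2))
print(values_per_buffer)
```

Each recorded buffer is therefore a spectrogram of 2^15 time samples by 512 channels, which is why the later cells reduce it to summary features rather than using the raw values.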

The vast majority of detected pulses are false-positive events due to human-made RFI. Only a small minority of events (less than 1%) are due to astrophysical sources, primarily bright pulses from pulsars. The RFI takes on a wide range of characteristics. In the processing pipeline the brightest RFI is clipped and replaced, but low-level RFI and spectra statistics still lead to an excess of false positives.

To automate the processing of the 150000+ recorded buffers, a classifier model would be useful to ***probabilistically*** classify each event. Approximately 15000 events have been labelled into 10 different categories. We can use this *labelled* data set to train a model.

In [1]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cPickle as pickle
import os

%matplotlib inline

In [2]:
BASE_DATA_PATH = '/data2/griffin/ALFABURST/'

#### Build buffer database

In [3]:
baseBufferPklFile = BASE_DATA_PATH + 'ALFAbuffers.pkl'

# load baseBufferPkl
df = pd.read_pickle(baseBufferPklFile)

# create a predicted label column with 'unlabelled' label
df = df.assign(predictLabel=-1)

The initial buffer dataframe contains a list of all buffers with meta-data such as time, beam ID, and buffer ID. There are also global statistics for each buffer, such as the number of events in the buffer and the maximum SNR event. The label column is initially empty, so we need to fill it with the labels.

In [4]:
print df.describe()
print df.columns.values #each column

               Beam        Buffer      MJDstart        bestDM       bestSNR  \
count  92453.000000  92453.000000  92453.000000  92453.000000  92453.000000   
mean       3.773150    218.654960  57495.859286   1826.484657     13.224055   
std        2.379155    298.492012    261.931401   2913.489521     99.728634   
min        0.000000      1.000000  57197.378446      0.000000      6.001704   
25%        1.000000     23.000000  57281.042274      9.000000     10.587121   
50%        5.000000    107.000000  57350.285694     18.000000     11.497226   
75%        6.000000    294.000000  57845.909094   2730.000000     13.102362   
max        6.000000   2025.000000  57995.132488  10039.000000  20954.304688   

          BinFactor        Events         DMmax         DMmin        DMmean  \
count  92453.000000  9.245300e+04  92453.000000  92453.000000  92453.000000   
mean      14.522752  6.196611e+03   3221.769508    602.744697   1940.321632   
std       18.036017  3.171940e+04   4171.687561   1

#### Add additional buffer features

In [5]:
# metadata and features pickles
baseDedispDirs = [BASE_DATA_PATH + 'snr14_dm50/',
                  BASE_DATA_PATH + 'snr11-14_dm50/',
                  BASE_DATA_PATH + 'snr10-11_dm50/']
#baseDedispDirs = [BASE_DATA_PATH + 'test/']

for dDir in baseDedispDirs:
    for subDir in os.listdir(dDir): # os.listdir(dDir) returns the names of all entries (files and directories) in dDir
        if os.path.isdir(dDir + '/' + subDir): # True only if this entry is itself a directory
            metaPklFns = glob.glob(dDir + subDir + '/*.meta.pkl') # glob.glob() returns a list of path names matching the pattern
            if len(metaPklFns) > 0: # at least one metadata pickle was found in this sub-directory
                print 'Found features in ', dDir + subDir # print to confirm existence
                
                for mIdx, metaPkl in enumerate(metaPklFns): # enumerate pairs each file name with an index starting from 0
                    
                    # Event meta-data
                    baseMetaFn = os.path.basename(metaPkl) # final component of the path, i.e. the file name without its directory
                    bufID = int(baseMetaFn.split('.')[1].split('buffer')[-1]) # str.split: take the 'bufferN' field between the dots, then keep the integer N
                    metaDict = pickle.load(open(metaPkl, 'rb')) # 'rb' = read binary, as required for pickle files
                    idx = df.loc[(df['datfile']==metaDict['dat']) & (df['Buffer']==bufID)].index # boolean mask: & is the element-wise AND of the two conditions; .index gives the matching row labels
                    
                    df.ix[idx, 'filterbank'] = metaDict['filterbank']
                        
                    # Percent of a time series which is 0
                    df.ix[idx, 'pctZero'] = metaDict.get('pctZero', 0.) # .get returns metaDict['pctZero'] if the key exists, otherwise the default 0.
                    # take the 0-dm time series derivative, calculate the percent of time series with derivative=0
                    df.ix[idx, 'pctZeroDeriv'] = metaDict.get('pctZeroDeriv', 0.)
                      
                    # Overflow counter
                    # number of values which are above 1e20 threshold
                    ofDict = metaDict.get('overflows', {'ncount': 0, 'pct': 0.})
                    df.ix[idx, 'ofCount'] = ofDict['ncount']
                    df.ix[idx, 'ofPct'] = ofDict['pct']
                    
                    # Longest continuous run of a constant in the dedispersed time series
                    # tuple: (maxRun, maxVal, maxRun / float(arr.size))
                    longestRun = metaDict.get('longestRun', (0, 0., 0.)) # the 2nd argument of .get is the default returned when the key is missing
                    df.ix[idx, 'longestRun0'] = longestRun[0]
                    df.ix[idx, 'longestRun1'] = longestRun[1]
                    df.ix[idx, 'longestRun2'] = longestRun[2]
                    
                    # Global statistics of the DM-0 time series
                    globalTimeStats = metaDict.get('globalTimeStats', {'std': 0., 'max': 0., 'posCount': 0, \
                                                                       'min': 0., 'negPct': 0., 'median': 0.,\
                                                                       'meanMedianRatio': 0., 'posPct': 0.,\
                                                                       'negCount': 0, 'maxMinRatio': 0.,\
                                                                       'mean': 0. }) # falls back to zeros for all metrics if 'globalTimeStats' is missing
                    
                    df.ix[idx, 'globtsStatsStd'] = globalTimeStats['std']
                    df.ix[idx, 'globtsStatsMax'] = globalTimeStats['max']
                    df.ix[idx, 'globtsStatsPosCnt'] = globalTimeStats['posCount']
                    df.ix[idx, 'globtsStatsMin'] = globalTimeStats['min']
                    df.ix[idx, 'globtsStatsNegPct'] = globalTimeStats['negPct']
                    df.ix[idx, 'globtsStatsMedian'] = globalTimeStats['median']
                    df.ix[idx, 'globtsStatsRatio0'] = globalTimeStats['meanMedianRatio']
                    df.ix[idx, 'globtsStatsPosPct'] = globalTimeStats['posPct']
                    df.ix[idx, 'globtsStatsNegCnt'] = globalTimeStats['negCount']
                    df.ix[idx, 'globtsStatsRatio1'] = globalTimeStats['maxMinRatio']
                    df.ix[idx, 'globtsStatsMean'] = globalTimeStats['mean']
                    
                    # Global statistics of the best DM time series
                    globalDedispTimeStats = metaDict.get('globalDedispTimeStats', {'std': 0., 'max': 0., \
                                                                       'posCount': 0,
                                                                       'min': 0., 'negPct': 0., 'median': 0.,\
                                                                       'meanMedianRatio': 0., 'posPct': 0.,\
                                                                       'negCount': 0, 'maxMinRatio': 0.,\
                                                                       'mean': 0. }) 
                    
                    #save all of the global stats of the dm timeseries separately
                    df.ix[idx, 'globDedisptsStatsStd'] = globalDedispTimeStats['std']
                    df.ix[idx, 'globDedisptsStatsMax'] = globalDedispTimeStats['max']
                    df.ix[idx, 'globDedisptsStatsPosCnt'] = globalDedispTimeStats['posCount']
                    df.ix[idx, 'globDedisptsStatsMin'] = globalDedispTimeStats['min']
                    df.ix[idx, 'globDedisptsStatsNegPct'] = globalDedispTimeStats['negPct']
                    df.ix[idx, 'globDedisptsStatsMedian'] = globalDedispTimeStats['median']
                    df.ix[idx, 'globDedisptsStatsRatio0'] = globalDedispTimeStats['meanMedianRatio']
                    df.ix[idx, 'globDedisptsStatsPosPct'] = globalDedispTimeStats['posPct']
                    df.ix[idx, 'globDedisptsStatsNegCnt'] = globalDedispTimeStats['negCount']
                    df.ix[idx, 'globDedisptsStatsRatio1'] = globalDedispTimeStats['maxMinRatio']
                    df.ix[idx, 'globDedisptsStatsMean'] = globalDedispTimeStats['mean']
                    
                    # Statistics of 16 segments of the DM-0 time series
                    windZeros = np.zeros(16) # length-16 array of zeros, used as the default for missing statistics
                    windTime = metaDict.get('windTimeStats',{'std':windZeros, 'max':windZeros, \
                                                             'min':windZeros, 'snr':windZeros, \
                                                             'mean':windZeros})
                    for i in range(16):
                        df.ix[idx, 'windTimeStatsStd'+str(i)] = windTime['std'][i]
                        df.ix[idx, 'windTimeStatsMax'+str(i)] = windTime['max'][i]
                        df.ix[idx, 'windTimeStatsMin'+str(i)] = windTime['min'][i]
                        df.ix[idx, 'windTimeStatsSnr'+str(i)] = windTime['snr'][i]
                        df.ix[idx, 'windTimeStatsMean'+str(i)] = windTime['mean'][i]
                        
                    # Statistics of 16 segments of the best DM time series
                    windDedispTime = metaDict.get('windDedispTimeStats',{'std':windZeros, 'max':windZeros,\
                                                                         'min':windZeros, 'snr':windZeros,\
                                                                         'mean':windZeros})
                    for i in range(16):
                        df.ix[idx, 'windDedispTimeStatsStd'+str(i)] = windDedispTime['std'][i] #concatenates each label with its corresponding number
                        df.ix[idx, 'windDedispTimeStatsMax'+str(i)] = windDedispTime['max'][i]
                        df.ix[idx, 'windDedispTimeStatsMin'+str(i)] = windDedispTime['min'][i]
                        df.ix[idx, 'windDedispTimeStatsSnr'+str(i)] = windDedispTime['snr'][i]
                        df.ix[idx, 'windDedispTimeStatsMean'+str(i)] = windDedispTime['mean'][i]
                    
                    # Statistics of the coarsely pixelized spectrogram
                    pixelZeros = np.zeros((16, 4))
                    pixels = metaDict.get('pixels',{'max':pixelZeros, 'min':pixelZeros, 'mean':pixelZeros})
                    for i in range(16):
                        for j in range(4):
                            df.ix[idx, 'pixelMax_%i_%i'%(i,j)] = pixels['max'][i][j] # %i formats an integer (same as %d)
                            df.ix[idx, 'pixelMin_%i_%i'%(i,j)] = pixels['min'][i][j]
                            df.ix[idx, 'pixelMean_%i_%i'%(i,j)] = pixels['mean'][i][j]

OSError: [Errno 2] No such file or directory: '/data2/griffin/ALFABURST/snr14_dm50/'
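
The per-file bookkeeping in the cell above (parsing the buffer ID from the file name, and falling back to defaults with `dict.get`) can be tried in isolation. The sketch below is not the notebook's code: the helper names `parse_buffer_id` and `flatten_meta`, the example path, and the example metadata are all hypothetical, and only two of the windowed statistics are included to keep it short.

```python
import os

def parse_buffer_id(path):
    """Extract the integer N from a '<name>.buffer<N>.meta.pkl' path.

    Assumes the file name contains no '.' before the 'buffer<N>' field.
    """
    base = os.path.basename(path)          # file name without its directory
    return int(base.split('.')[1].split('buffer')[-1])

def flatten_meta(meta, n_wind=16):
    """Turn a nested metadata dict into a flat {column name: value} row."""
    row = {}
    # dict.get(key, default) returns meta[key] if present, else the default
    row['pctZero'] = meta.get('pctZero', 0.)
    zeros = [0.] * n_wind
    wind = meta.get('windTimeStats', {'std': zeros, 'max': zeros})
    for i in range(n_wind):
        row['windTimeStatsStd%i' % i] = wind['std'][i]
        row['windTimeStatsMax%i' % i] = wind['max'][i]
    return row

# Hypothetical file name and metadata, for illustration only
example_path = '/tmp/example/Beam0_fb_D20150801T000000.buffer12.meta.pkl'
example_meta = {'pctZero': 0.25}   # 'windTimeStats' is deliberately missing

buf_id = parse_buffer_id(example_path)
row = flatten_meta(example_meta)
print(buf_id)
print(len(row))
```

One possible design choice, if the per-cell assignments turn out to be slow, is to collect one such flat row per buffer and build or merge a dataframe once at the end rather than writing each value individually.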

In [None]:
print df['pixelMin_1_0'].dropna()
#print df

#### Add labels

In [None]:
# output of labelImg2.py
labelPKlFiles = glob.glob(BASE_DATA_PATH + 'allLabels/*.pkl')

# add assigned labels to main dataframe
for lPkl in labelPKlFiles:
    print 'Reading labels from', lPkl
    labelDict = pickle.load(open(lPkl, 'rb'))
    for key,val in labelDict.iteritems():
        fbFN = key.split('buffer')[0] + 'fil'
        bufID = int(key.split('.')[1].split('buffer')[-1])
        df.loc[(df['filterbank']==fbFN) & (df['Buffer']==bufID), 'Label'] = val
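
The label assignment above relies on a boolean mask built from two conditions. A minimal sketch on a toy dataframe shows the same pattern; the file names, the `.png` key extension, and the label values are made up for illustration and are not taken from the `labelImg2.py` output.

```python
import pandas as pd

# Toy stand-in for the buffer dataframe (values are made up)
toy = pd.DataFrame({
    'filterbank': ['a.fil', 'a.fil', 'b.fil'],
    'Buffer': [1, 2, 1],
})
toy['Label'] = -1   # -1 marks an unlabelled buffer

# Hypothetical label dictionary keyed by '<name>.buffer<N>.<ext>'
labels = {'a.buffer2.png': 3, 'b.buffer1.png': 7}

for key, val in labels.items():
    fb_fn = key.split('buffer')[0] + 'fil'                 # 'a.' + 'fil'
    buf_id = int(key.split('.')[1].split('buffer')[-1])    # 'buffer2' -> 2
    # Each comparison gives a boolean Series; & combines them element-wise.
    # The parentheses are needed because & binds more tightly than ==.
    mask = (toy['filterbank'] == fb_fn) & (toy['Buffer'] == buf_id)
    toy.loc[mask, 'Label'] = val

print(toy)
```

Rows that match no key keep their placeholder value, so unlabelled buffers remain easy to filter out later.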

In [None]:
print df['Label'].describe()

#### Save combined dataframe to file

This would be a good point to split into a new notebook, as the previous steps have been run to combine the various labels and features into a single dataframe. We will likely not need to re-run this code often, and as it takes a few minutes to run we can just save the final dataframe to file, then use that dataframe as the starting point for the model.

In [None]:
df.to_pickle('featureDataframeInigo.pkl')
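
A follow-on notebook would start by reading the saved file back in. The sketch below shows the round trip on a small stand-in dataframe written to a temporary directory; the file name and contents are placeholders, not the real feature dataframe.

```python
import os
import tempfile
import pandas as pd

# Small stand-in for the combined feature dataframe
frame = pd.DataFrame({'Buffer': [1, 2], 'Label': [3, -1]})

# Write to a temporary location so the example leaves the working directory alone
tmp_dir = tempfile.mkdtemp()
pkl_path = os.path.join(tmp_dir, 'featureDataframe_example.pkl')

frame.to_pickle(pkl_path)            # save, as in the cell above
restored = pd.read_pickle(pkl_path)  # load, as a modelling notebook would

print(restored.equals(frame))
```

Pickle files are tied to the Python and pandas versions that wrote them to some degree, so reading the file in a very different environment may need care.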