## ALFABURST Event Buffer Feature Builder

The ALFABURST commensal FRB search survey searches for dedispersed pulses above a signal-to-noise ratio of 10 across a 56 MHz band. Data are processed in time windows of 2^15 * 256 microseconds (~8.4 seconds) with 512 frequency channels. If a pulse is detected, the entire time window is recorded to disk.
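
As a quick check of the numbers quoted above, the window length can be worked out directly. This is only a sketch: it assumes the window consists of 2^15 time samples at 256 microseconds each, and the variable names are illustrative rather than taken from the pipeline.

```python
# Assumption: one buffer holds 2^15 time samples, each 256 microseconds long,
# with 512 frequency channels per sample (names are illustrative only).
N_SAMPLES = 2 ** 15
T_SAMPLE_US = 256
N_CHANNELS = 512

# total duration of one buffer in seconds
window_seconds = N_SAMPLES * T_SAMPLE_US * 1e-6

# number of time-frequency values in one buffer under the same assumption
values_per_buffer = N_SAMPLES * N_CHANNELS

print('window length: %.3f s' % window_seconds)
print('values per buffer: %d' % values_per_buffer)
```

Under that assumption each recorded buffer is a 32768 x 512 time-frequency array, which is why compact per-buffer features are extracted instead of working with the raw arrays.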

The vast majority of detected pulses are false-positive events due to human-made RFI. Only a small minority of events (less than 1%) are due to astrophysical sources, primarily bright pulses from pulsars. The RFI takes on a wide range of characteristics. In the processing pipeline the brightest RFI is clipped and replaced, but low-level RFI and spectral statistics still lead to an excess of false positives.

To automate the processing of the 150000+ recorded buffers, a classifier model would be useful to ***probabilistically*** classify each event. Approximately 15000 events have been labelled into 10 different categories. We can use this *labelled* data set to train a model.

In [1]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cPickle as pickle
import os

%matplotlib inline

In [2]:
BASE_DATA_PATH = '/data2/alfaburstdata200218/'

#### Build buffer database

In [3]:
baseBufferPklFile = '/home/inigo/pulseClassifier/notebooks/newdf1.pkl'

# load baseBufferPkl
df = pd.read_pickle(baseBufferPklFile)

# create a predicted label column with 'unlabelled' label
df = df.assign(predictLabel=-1)

The initial buffer dataframe contains a list of all buffers with meta-data such as time, beam ID, and buffer ID. There are also global statistics for each buffer, such as the number of events in the buffer and the maximum SNR event. The label column is initially empty; we need to fill it with the labels.

In [4]:
print df.describe()
print df.columns.values #each column

                Beam         Buffer       MJDstart         bestDM  \
count  125439.000000  125439.000000  125439.000000  125439.000000   
mean        3.531262     216.217652   57429.620744    1349.748866   
std         2.090338     282.191040     236.046434    2608.874241   
min         0.000000       1.000000   57197.378446       0.000000   
25%         2.000000      22.000000   57275.185900       9.000000   
50%         3.000000     113.000000   57324.388785      16.000000   
75%         6.000000     301.000000   57414.833449    1236.000000   
max         6.000000    2025.000000   58098.268715   10039.000000   

             bestSNR      BinFactor        Events          DMmax  \
count  125439.000000  125439.000000  1.254390e+05  125439.000000   
mean       12.915690      14.184632  6.864074e+03    2373.811865   
std        60.700000      19.310866  4.732979e+04    3724.165909   
min         6.001111       1.000000  1.000000e+00       3.000000   
25%        10.541907       2.000000  6

#### Add additional buffer features

In [5]:
# metadata and features pickles
picklepath = BASE_DATA_PATH + 'output2/'
pickledir = os.listdir(picklepath) # one sub-directory per processed data set
#baseDedispDirs = [BASE_DATA_PATH + 'test/']

count = 0

for path in pickledir:
    
    metaPklFns = glob.glob(picklepath + path + '/*.meta.pkl')

    if len(metaPklFns) > 0: # at least one meta pickle was found in this directory
        print 'Found %d files in '%len((metaPklFns)) + picklepath + path # confirm which directory is being processed

        for mIdx, metaPkl in enumerate(metaPklFns): #enumerate just pairs an index starting from 0 to each metaPklFns value
    
            count += 1
        
            # Event meta-data
            baseMetaFn = os.path.basename(metaPkl) # final path component, i.e. the file name (empty if the path ends with '/')
            bufID = int(baseMetaFn.split('.')[1].split('buffer')[-1]) # second dot-separated field is 'buffer<N>'; keep the integer N
            with open(metaPkl, 'rb') as fh: # 'rb' = read binary, as required for pickle files
                metaDict = pickle.load(fh)
            idx = df.loc[(df['datfile']==metaDict['dat']) & (df['Buffer']==bufID)].index

            df.ix[idx, 'filterbank'] = metaDict['filterbank'] 


            # Percent of a time series which is 0
            #print metaDict
            df.ix[idx, 'pctZero'] = metaDict.get('pctZero', 0.)
            # take the 0-dm time series derivative, calculate the percent of time series with derivative=0
            df.ix[idx, 'pctZeroDeriv'] = metaDict.get('pctZeroDeriv', 0.)


            # Overflow counter
            # number of values which are above 1e20 threshold
            ofDict = metaDict.get('overflows', {'ncount': 0, 'pct': 0.})
            df.ix[idx, 'ofCount'] = ofDict['ncount']
            df.ix[idx, 'ofPct'] = ofDict['pct']


            # Longest continuous run of a constant in the dedispersed time series
            # dict of run statistics (an earlier version used a tuple: (maxRun, maxVal, maxRun / float(arr.size)))
            longestRun = metaDict.get('longestRun', {'maxRun':-1, 'maxVal':-1, 'maxRunpct':-1, 
                                                   'ddmaxRun':-1, 'ddmaxVal':-1})
            for key in longestRun:
                df.ix[idx,'longestRun' + key] = longestRun[key]
            
            
#             df.ix[idx, 'longestRun0'] = longestRun[0]
#             df.ix[idx, 'longestRun1'] = longestRun[1]
#             df.ix[idx, 'longestRun2'] = longestRun[2]


            # Global statistics of the DM-0 time series
            globalTimeStats = metaDict.get('globalTimeStats', {'std': 0., 'max': 0., 'posCount': 0, \
                                                               'min': 0., 'negPct': 0., 'median': 0.,\
                                                               'meanMedianRatio': 0., 'posPct': 0.,\
                                                               'negCount': 0, 'maxMinRatio': 0.,\
                                                               'mean': 0. }) # fall back to zeros for every metric if globalTimeStats is missing
            for key in globalTimeStats:
                df.ix[idx, 'globalTimeStats' + key] = globalTimeStats[key]



            # Global statistics of the best DM time series
            globalDedispTimeStats = metaDict.get('globalDedispTimeStats', {'std': 0., 'max': 0., \
                                                               'posCount': 0,
                                                               'min': 0., 'negPct': 0., 'median': 0.,\
                                                               'meanMedianRatio': 0., 'posPct': 0.,\
                                                               'negCount': 0, 'maxMinRatio': 0.,\
                                                               'mean': 0. }) 
            for key in globalDedispTimeStats:
                df.ix[idx, 'globalDedispTimeStats' + key] = globalDedispTimeStats[key]


            # Statistics of 16 segments of the DM-0 time series
            windZeros = np.zeros(16) # length-16 vector of zeros, used as the default for each windowed statistic
            windTime = metaDict.get('windTimeStats',{'std':windZeros, 'max':windZeros, \
                                                     'min':windZeros, 'snr':windZeros, \
                                                     'mean':windZeros})
            
            # NOTE: these keys start with a lower-case letter; columns were previously saved with a
            # capitalised first letter, so the names may not match dataframes built by older versions
            for key in windTime:
                windTimeval = windTime[key]
                if windTimeval.size != 1:
                    for i in range(16):
                        df.ix[idx, 'windTimeStats' + key + str(i)] = windTimeval[i]
                else:
                    df.ix[idx, 'windTimeStats' + key] = windTimeval
                
                        

            # Statistics of 16 segments of the best DM time series
            windDedispTime = metaDict.get('windDedispTimeStats',{'std':windZeros, 'max':windZeros,\
                                                                 'min':windZeros, 'snr':windZeros,\
                                                                 'mean':windZeros})
            
            for key in windDedispTime:
                windDedispTimeval = windDedispTime[key]
                if windDedispTimeval.size != 1:
                    for i in range(16):
                        df.ix[idx, 'windDedispTimeStats' + key + str(i)] = windDedispTimeval[i]  # append the segment index to the column name
                else:
                    df.ix[idx, 'windDedispTimeStats' + key] = windDedispTimeval
            
            
            # Statistics of the coarsely pixelized spectrogram
            pixelZeros = np.zeros((16, 4))
            pixels = metaDict.get('pixels',{'max':pixelZeros, 'min':pixelZeros, 'mean':pixelZeros})
            for i in range(16):
                for j in range(4):
                    df.ix[idx, 'pixelMax_%i_%i'%(i,j)] = pixels['max'][i][j] 
                    df.ix[idx, 'pixelMin_%i_%i'%(i,j)] = pixels['min'][i][j]
                    df.ix[idx, 'pixelMean_%i_%i'%(i,j)] = pixels['mean'][i][j]

            for key in pixels:
                if pixels[key].size == 1:
                    df.ix[idx, 'pixelstats' + key] = pixels[key]
            
            # Gaussian testing statistics
            GaussianTests = metaDict.get('GaussianTests', { 'kurtosis': 0, 'skew': 0, 'dpearsonomni': 0, 
                                                         'dpearsonp': 0, 'lsD': 0, 'lsp': 0, 'ks': np.zeros(2)})
            for key in GaussianTests:
                #print
                gausstestval = GaussianTests[key]
                if key != 'ks':
                    df.ix[idx, 'GaussianTests' + key] = gausstestval
                else:
                    for i in range(2):
                        df.ix[idx, 'GaussianTests' + key + str(i)] = gausstestval[i]


            #Segmented Gaussian testing statistics
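            # default 'dpearson' uses np.zeros rather than np.empty so that missing values are not uninitialised memory
            segGaussianTests = metaDict.get('segGaussianTests', { 'lillieforsmaxp': 0, 'lillieforsmaxD': 0, 
                                                               'lfDmin': 0, 'lillieforssum': 0, 'dpearson': np.zeros([8,2])
                                                               , 'dpearsonomnisum': 0, 'dpearsonpsum': 0})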
            for key in segGaussianTests:
                seggaussval = segGaussianTests[key]
                if key != 'dpearson': 
                    df.ix[idx, 'segGaussianTests' + key] = segGaussianTests[key]
                else:
                    for i in range(8):
                        for j in range(2):
                            df.ix[idx, 'segGaussianTests' + key + str(i) + str(j)] = seggaussval[i][j]

            
            if count % 100 == 0:
                print '%d loops completed' %(count)
            
            
            

SyntaxError: invalid syntax (<ipython-input-5-be26b94fb2f4>, line 47)
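
The cell above repeats one pattern many times: look up a group of statistics with a default, then write each scalar into its own column, appending an index for array-valued entries. A minimal sketch of that pattern is given here as a plain function. `flatten_stats` is a hypothetical helper, not part of the ALFABURST code, and it works on plain lists so that it runs without NumPy (arrays would need `.tolist()` first).

```python
def flatten_stats(prefix, stats):
    """Flatten a dict of scalars and (nested) lists into {column_name: scalar}."""
    flat = {}

    def _walk(name, value):
        # lists/tuples are expanded element by element, appending the index
        if isinstance(value, (list, tuple)):
            for i, item in enumerate(value):
                _walk(name + str(i), item)
        else:
            flat[name] = value

    for key, value in stats.items():
        _walk(prefix + key, value)
    return flat


# hypothetical metadata in which 'segGaussianTests' is missing
meta = {'windTimeStats': {'std': [0.5, 0.7], 'snr': 3.0}}

# same fallback idea as metaDict.get(...) in the cell above
wind_cols = flatten_stats('windTimeStats', meta.get('windTimeStats', {}))
seg_default = {'lillieforssum': 0, 'dpearson': [[1.0, 0.1], [2.0, 0.2]]}
seg_cols = flatten_stats('segGaussianTests', meta.get('segGaussianTests', seg_default))

print(sorted(wind_cols))
print(sorted(seg_cols))
```

The resulting names follow the `prefix + key + index` convention used for the windowed and segmented statistics; the pixel columns use a different, underscore-separated naming and are not covered by this sketch. Note that concatenating two indices without a separator becomes ambiguous once either index reaches 10, which is not an issue for the 8 x 2 array here.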

In [None]:
#print df['pixelMin_1_0'].dropna()
#print df

#### Add labels

In [None]:
# output of labelImg2.py
labelPKlFiles = glob.glob('/data2/griffin/ALFABURST/allLabels/*.pkl')

# add assigned labels to main dataframe
for lPkl in labelPKlFiles:
    print 'Reading labels from', lPkl
    labelDict = pickle.load(open(lPkl, 'rb'))
    for key,val in labelDict.iteritems():
        fbFN = key.split('buffer')[0] + 'fil'
        bufID = int(key.split('.')[1].split('buffer')[-1])
        df.loc[(df['filterbank']==fbFN) & (df['Buffer']==bufID), 'Label'] = val
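
To make the key parsing in the loop above concrete, the two string operations can be isolated and tried on a single key. The key shown here is hypothetical (the real naming scheme of the label files is not shown in this notebook), and `parse_label_key` is an illustrative helper rather than existing code.

```python
def parse_label_key(key):
    """Split a label key into (filterbank file name, buffer ID)."""
    # everything before the word 'buffer' (including the trailing dot) plus 'fil'
    filterbank = key.split('buffer')[0] + 'fil'
    # second dot-separated field, e.g. 'buffer12', reduced to the integer 12
    buffer_id = int(key.split('.')[1].split('buffer')[-1])
    return filterbank, buffer_id


# hypothetical key, assumed to look like '<name>.buffer<N>.<extension>'
example_key = 'Beam0_fb_D20150801T000000.buffer12.png'
fb_name, buf_id = parse_label_key(example_key)

print(fb_name)
print(buf_id)
```

This only works if the base name contains no other dots before the `buffer<N>` field and does not itself contain the word `buffer`; keys that break either assumption would need a stricter parser, for example a regular expression.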

In [None]:
print df['Label'].describe()

#### Save combined dataframe to file

This would be a good point to split into a new notebook, as the previous steps have been run to combine the various labels and features into a single dataframe. We will likely not need to re-run this code often, and as it takes a few minutes to run we can save the final dataframe to file and use it as the starting point for the model.

In [None]:
df.to_pickle('featureDataframe1.pkl')