## ALFABURST Event Buffer Feature Builder

The ALFABURST commensal FRB search survey searches for dedispersed pulses with a signal-to-noise ratio above 10 across a 56 MHz band. Data are processed in time windows of 2^15 * 256 microseconds (~8.4 seconds), each with 512 frequency channels. If a pulse is detected, the entire time window is recorded to disk.
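
As a quick check of the window length quoted above, the arithmetic can be sketched in a few lines. This assumes the window consists of 2^15 samples at 256 microseconds each; the variable names are illustrative and not taken from the ALFABURST pipeline.

```python
# Duration and size of one time window, from the numbers quoted above.
# Assumption: 2^15 time samples per window, each 256 microseconds long.
n_samples = 2 ** 15      # time samples per window
t_samp_us = 256.0        # sampling time in microseconds
n_chan = 512             # frequency channels per time sample

window_seconds = n_samples * t_samp_us * 1e-6
values_per_window = n_samples * n_chan

print('window length: %.3f s' % window_seconds)
print('values per window: %d' % values_per_window)
```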

The vast majority of detected pulses are false-positive events due to human-made RFI. Only a small minority of events (less than 1%) are due to astrophysical sources, primarily bright pulses from pulsars. The RFI takes on a wide range of characteristics. In the processing pipeline, the brightest RFI is clipped and replaced, but low-level RFI and spectra statistics still lead to an excess of false positives.

To automate the processing of the 150000+ recorded buffers, a classifier model would be useful to ***probabilistically*** classify each event. Approximately 15000 events have been labelled into 10 different categories. We can use this *labelled* data set to train a model.

In [1]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3
import os

%matplotlib inline

In [2]:
BASE_DATA_PATH = '/data2/griffin/ALFABURST/'

#### Build buffer database

In [3]:
baseBufferPklFile = BASE_DATA_PATH + 'ALFAbuffers.pkl'

# load baseBufferPkl
df = pd.read_pickle(baseBufferPklFile)

# create a predicted label column with 'unlabelled' label
df = df.assign(predictLabel=-1)

The initial buffer dataframe contains a list of all buffers with meta-data such as time, beam ID, and buffer ID. There are also global statistics for each buffer, such as the number of events in the buffer and the maximum SNR event. The label column is initially empty; we need to fill it with the labels.

In [4]:
print(df.describe())
print(df.columns.values) # name of each column

               Beam        Buffer      MJDstart        bestDM       bestSNR  \
count  92453.000000  92453.000000  92453.000000  92453.000000  92453.000000   
mean       3.773150    218.654960  57495.859286   1826.484657     13.224055   
std        2.379155    298.492012    261.931401   2913.489521     99.728634   
min        0.000000      1.000000  57197.378446      0.000000      6.001704   
25%        1.000000     23.000000  57281.042274      9.000000     10.587121   
50%        5.000000    107.000000  57350.285694     18.000000     11.497226   
75%        6.000000    294.000000  57845.909094   2730.000000     13.102362   
max        6.000000   2025.000000  57995.132488  10039.000000  20954.304688   

          BinFactor        Events         DMmax         DMmin        DMmean  \
count  92453.000000  9.245300e+04  92453.000000  92453.000000  92453.000000   
mean      14.522752  6.196611e+03   3221.769508    602.744697   1940.321632   
std       18.036017  3.171940e+04   4171.687561   1

#### Add additional buffer features
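
The cell in this section derives the buffer ID from each metadata file name and falls back to a default value when a feature is missing from the metadata dictionary. A minimal sketch of those two steps follows. The file name and dictionary contents are invented for illustration, and `parse_buffer_id` is a hypothetical helper, not part of the original code.

```python
import os

def parse_buffer_id(path):
    """Return the integer buffer ID from a name like '<prefix>.buffer<N>.meta.pkl'."""
    base = os.path.basename(path)          # drop the directory part
    field = base.split('.')[1]             # second dot-separated field, e.g. 'buffer12'
    return int(field.split('buffer')[-1])  # text after 'buffer', converted to int

# Hypothetical file name following the pattern the parsing code expects.
example_path = '/tmp/labelled/Beam0_fb_D20161119T002404.buffer12.meta.pkl'
buf_id = parse_buffer_id(example_path)

# Invented metadata dictionary that lacks the 'pctZero' key.
meta = {'dat': 'Beam0_dm_D20161119T002404.dat', 'nEvents': 19}
pct_zero = meta.get('pctZero', 0.)         # missing key, so the default 0.0 is used

print('buffer %d, pctZero %.1f' % (buf_id, pct_zero))
```

Using `dict.get` with a default keeps the loop running when older metadata files lack a feature, at the cost of silently mixing real zeros with missing values.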

In [10]:
# metadata and features pickles
picklepath = BASE_DATA_PATH + 'labelled/'
#baseDedispDirs = [BASE_DATA_PATH + 'test/']

metaPklFns = glob.glob(picklepath + '*.meta.pkl')

if len(metaPklFns) > 0: # at least one metadata pickle was found
    print('Found %d files in %s' % (len(metaPklFns), picklepath)) # confirm the files exist
                
    for mIdx, metaPkl in enumerate(metaPklFns): # enumerate pairs each file name with an index starting from 0

        # Event meta-data
        baseMetaFn = os.path.basename(metaPkl) # file name without its directory (empty string if the path ends with '/')
        # the second dot-separated field of the file name is 'buffer<N>'; keep the text after 'buffer' as the integer ID
        bufID = int(baseMetaFn.split('.')[1].split('buffer')[-1])
        with open(metaPkl, 'rb') as fh: # 'rb' = read binary, as required for pickle files
            metaDict = pickle.load(fh)
        # boolean mask selecting the row(s) with the same dat file and buffer ID
        idx = (df['datfile']==metaDict['dat']) & (df['Buffer']==bufID)

        df.loc[idx, 'filterbank'] = metaDict['filterbank']

        
        # Percent of a time series which is 0
        print(metaDict)
        df.loc[idx, 'pctZero'] = metaDict.get('pctZero', 0.)
#         # take the 0-dm time series derivative, calculate the percent of time series with derivative=0
#         df.loc[idx, 'pctZeroDeriv'] = metaDict.get('pctZeroDeriv', 0.)

        
#         # Overflow counter
#         # number of values which are above 1e20 threshold
#         ofDict = metaDict.get('overflows', {'ncount': 0, 'pct': 0.})
#         df.loc[idx, 'ofCount'] = ofDict['ncount']
#         df.loc[idx, 'ofPct'] = ofDict['pct']

        
#         # Longest continuous run of a constant in the dedispersed time series
#         # tuple: (maxRun, maxVal, maxRun / float(arr.size))
#         longestRun = metaDict.get('longestRun', (0, 0., 0.)) # the 2nd argument of .get is returned if the key doesn't exist (avoids a KeyError)
#         df.loc[idx, 'longestRun0'] = longestRun[0]
#         df.loc[idx, 'longestRun1'] = longestRun[1]
#         df.loc[idx, 'longestRun2'] = longestRun[2]

        
#         # Global statistics of the DM-0 time series
#         globalTimeStats = metaDict.get('globalTimeStats', {'std': 0., 'max': 0., 'posCount': 0, \
#                                                            'min': 0., 'negPct': 0., 'median': 0.,\
#                                                            'meanMedianRatio': 0., 'posPct': 0.,\
#                                                            'negCount': 0, 'maxMinRatio': 0.,\
#                                                            'mean': 0. }) # falls back to zeros for all metrics if globalTimeStats doesn't exist
#         for key in globalTimeStats:
#             df.loc[idx, 'globtsStats' + key] = globalTimeStats[key]
    
    
        
#         # Global statistics of the best DM time series
#         globalDedispTimeStats = metaDict.get('globalDedispTimeStats', {'std': 0., 'max': 0., \
#                                                            'posCount': 0,
#                                                            'min': 0., 'negPct': 0., 'median': 0.,\
#                                                            'meanMedianRatio': 0., 'posPct': 0.,\
#                                                            'negCount': 0, 'maxMinRatio': 0.,\
#                                                            'mean': 0. }) 
#         for key in globalDedispTimeStats:
#             df.loc[idx, 'globalDedisptsStats' + key] = globalDedispTimeStats[key]

        
        
        
#         # Statistics of 16 segments of the DM-0 time series
#         windZeros = np.zeros(16) # array of 16 zeros used as the default
#         windTime = metaDict.get('windTimeStats',{'std':windZeros, 'max':windZeros, \
#                                                  'min':windZeros, 'snr':windZeros, \
#                                                  'mean':windZeros})
#         for i in range(16):
#             # keys start with a lower-case letter; they were previously saved into the dataframe with a capital first letter, so column names may not match
#             for key in windTime:
#                 df.loc[idx, 'windTimeStats' + key + str(i)] = windTime[key][i] 
                

#         # Statistics of 16 segments of the best DM time series
#         windDedispTime = metaDict.get('windDedispTimeStats',{'std':windZeros, 'max':windZeros,\
#                                                              'min':windZeros, 'snr':windZeros,\
#                                                              'mean':windZeros})
#         for i in range(16):
#             for key in windDedispTime:
#                 df.loc[idx, 'windDedispTimeStats' + key + str(i)] = windDedispTime[key][i]  # column name is the statistic name followed by the segment number

        
#         # Statistics of the coarsely pixelized spectrogram
#         pixelZeros = np.zeros((16, 4))
#         pixels = metaDict.get('pixels',{'max':pixelZeros, 'min':pixelZeros, 'mean':pixelZeros})
#         for i in range(16):
#             for j in range(4):
#                 df.loc[idx, 'pixelMax_%i_%i'%(i,j)] = pixels['max'][i][j] 
#                 df.loc[idx, 'pixelMin_%i_%i'%(i,j)] = pixels['min'][i][j]
#                 df.loc[idx, 'pixelMean_%i_%i'%(i,j)] = pixels['mean'][i][j]


#         # Gaussian testing statistics
#         GaussianTests = metaDict.get('GaussianTests', { 'kurtosis': 0, 'skew': 0, 'dpearsonomni': 0, 
#                                                      'dpearsonp': 0, 'lsD': 0, 'lsp': 0, 'ks': 0})
#         for key in GaussianTests:
#             df.loc[idx, 'GaussianTests' + key] = GaussianTests[key]

        
#         # Segmented Gaussian testing statistics
#         segGaussianTests = metaDict.get('segGaussianTests', { 'lillieforsmaxp': 0, 'lillieforsmaxD': 0, 
#                                                            'lfDmin': 0, 'lillieforssum': 0, 'dpearson': 0
#                                                            , 'dpearsonomnisum': 0, 'dpearsonpsum': 0})
#         for key in segGaussianTests:
#             df.loc[idx, 'segGaussianTests' + key] = segGaussianTests[key]


Found 15070 files in /data2/griffin/ALFABURST/labelled/
{'DMmin': 5385, 'DMmedian': 5478.0, 'filterbank': 'Beam0_fb_D20161119T002404.fil', 'Dec': None, 'DMmean': 5661.6842105263158, 'maxSNR': 10.228535652161, 'dat': 'Beam0_dm_D20161119T002404.dat', 'nEvents': 19, 'RA': None, 'maxMJD': 57711.322394262002, 'DMmax': 9279, 'MJD0': '57711.322384259'}
{'DMmin': 6586, 'DMmedian': 7148.5, 'filterbank': 'Beam6_fb_D20160215T160004.fil', 'Dec': None, 'DMmean': 7220.4341500765695, 'maxSNR': 10.983736991882, 'dat': 'Beam6_dm_D20160215T160004.dat', 'nEvents': 1306, 'RA': None, 'maxMJD': 57433.834213712005, 'DMmax': 7987, 'MJD0': '57433.834155116'}
{'DMmin': 1920, 'DMmedian': 3507.0, 'filterbank': 'Beam1_fb_D20161022T091504.fil', 'Dec': None, 'DMmean': 3062.2693110647183, 'maxSNR': 11.420582771301, 'dat': 'Beam1_dm_D20161022T091504.dat', 'nEvents': 479, 'RA': None, 'maxMJD': 57683.552381397996, 'DMmax': 3764, 'MJD0': '57683.552355324'}
{'DMmin': 882, 'DMmedian': 1236.0, 'filterbank': 'Beam0_fb_D20161

{'DMmin': 12, 'DMmedian': 232.0, 'filterbank': 'Beam6_fb_D20151222T222903.fil', 'Dec': None, 'DMmean': 305.66492520138092, 'maxSNR': 31.608816146851, 'dat': 'Beam6_dm_D20151222T222903.dat', 'nEvents': 43450, 'RA': None, 'maxMJD': 57379.111184332003, 'DMmax': 1394, 'MJD0': '57379.111111135'}
{'DMmin': 4, 'DMmedian': 129.5, 'filterbank': 'Beam1_fb_D20150729T021504.fil', 'Dec': None, 'DMmean': 164.88913138367525, 'maxSNR': 38.650161743164, 'dat': 'Beam1_dm_D20150729T021504.dat', 'nEvents': 4582, 'RA': None, 'maxMJD': 57232.266377457003, 'DMmax': 614, 'MJD0': '57232.266335359'}
{'DMmin': 8, 'DMmedian': 130.0, 'filterbank': 'Beam6_fb_D20150920T224310.fil', 'Dec': None, 'DMmean': 149.46627810158202, 'maxSNR': 20.688077926636, 'dat': 'Beam6_dm_D20150920T224310.dat', 'nEvents': 1201, 'RA': None, 'maxMJD': 57286.113810196999, 'DMmax': 464, 'MJD0': '57286.113732639'}
{'DMmin': 1366, 'DMmedian': 4251.0, 'filterbank': 'Beam1_fb_D20160203T162304.fil', 'Dec': None, 'DMmean': 5356.193128913148, 'maxS

{'DMmin': 4, 'DMmedian': 2055.0, 'filterbank': 'Beam1_fb_D20151122T000803.fil', 'Dec': None, 'DMmean': 1929.4698544698545, 'maxSNR': 12.761337280273, 'dat': 'Beam1_dm_D20151122T000803.dat', 'nEvents': 2886, 'RA': None, 'maxMJD': 57348.203749777, 'DMmax': 2559, 'MJD0': '57348.20369213'}
{'DMmin': 37, 'DMmedian': 69.5, 'filterbank': 'Beam6_fb_D20150729T022511.fil', 'Dec': None, 'DMmean': 78.489361702127653, 'maxSNR': 13.300783157349, 'dat': 'Beam6_dm_D20150729T022511.dat', 'nEvents': 94, 'RA': None, 'maxMJD': 57232.267955804004, 'DMmax': 130, 'MJD0': '57232.267921007'}
{'DMmin': 1171, 'DMmedian': 1713.0, 'filterbank': 'Beam0_fb_D20160824T202304.fil', 'Dec': None, 'DMmean': 3553.9704641350213, 'maxSNR': 12.325090408325, 'dat': 'Beam0_dm_D20160824T202304.dat', 'nEvents': 237, 'RA': None, 'maxMJD': 57625.020246097003, 'DMmax': 6328, 'MJD0': '57625.020214144'}
{'DMmin': 354, 'DMmedian': 7432.0, 'filterbank': 'Beam1_fb_D20160702T220304.fil', 'Dec': None, 'DMmean': 7233.4851226221381, 'maxSNR'

{'DMmin': 376, 'DMmedian': 5434.5, 'filterbank': 'Beam6_fb_D20160211T164404.fil', 'Dec': None, 'DMmean': 5533.9194351427568, 'maxSNR': 21.749641418457, 'dat': 'Beam6_dm_D20160211T164404.dat', 'nEvents': 45392, 'RA': None, 'maxMJD': 57429.864321112, 'DMmax': 10039, 'MJD0': '57429.86431713'}
{'DMmin': 15, 'DMmedian': 121.0, 'filterbank': 'Beam1_fb_D20150727T015504.fil', 'Dec': None, 'DMmean': 148.054890921886, 'maxSNR': 22.384609222412, 'dat': 'Beam1_dm_D20150727T015504.dat', 'nEvents': 1421, 'RA': None, 'maxMJD': 57230.248869365001, 'DMmax': 428, 'MJD0': '57230.2488614'}
{'DMmin': 1936, 'DMmedian': 2002.0, 'filterbank': 'Beam0_fb_D20160824T104004.fil', 'Dec': None, 'DMmean': 2690.4651162790697, 'maxSNR': 10.32119178772, 'dat': 'Beam0_dm_D20160824T104004.dat', 'nEvents': 43, 'RA': None, 'maxMJD': 57624.735403625004, 'DMmax': 5415, 'MJD0': '57624.735325545'}
{'DMmin': 67, 'DMmedian': 1464.0, 'filterbank': 'Beam6_fb_D20160121T193004.fil', 'Dec': None, 'DMmean': 1550.8666765644718, 'maxSNR'

{'DMmin': 654, 'DMmedian': 6861.0, 'filterbank': 'Beam0_fb_D20160629T235504.fil', 'Dec': None, 'DMmean': 6848.8677455357147, 'maxSNR': 13.170910835266, 'dat': 'Beam0_dm_D20160629T235504.dat', 'nEvents': 12544, 'RA': None, 'maxMJD': 57569.355177004007, 'DMmax': 10038, 'MJD0': '57569.355117935'}
{'DMmin': 9, 'DMmedian': 2275.0, 'filterbank': 'Beam6_fb_D20151014T070604.fil', 'Dec': None, 'DMmean': 2202.0289548387095, 'maxSNR': 17.81856918335, 'dat': 'Beam6_dm_D20151014T070604.dat', 'nEvents': 19375, 'RA': None, 'maxMJD': 57309.463906904006, 'DMmax': 2559, 'MJD0': '57309.463888889'}
{'DMmin': 5, 'DMmedian': 88.0, 'filterbank': 'Beam6_fb_D20160112T211604.fil', 'Dec': None, 'DMmean': 122.84820261437909, 'maxSNR': 19.891860961914, 'dat': 'Beam6_dm_D20160112T211604.dat', 'nEvents': 6120, 'RA': None, 'maxMJD': 57400.062512722005, 'DMmax': 566, 'MJD0': '57400.06248845'}
{'DMmin': 2779, 'DMmedian': 9129.0, 'filterbank': 'Beam1_fb_D20161020T061603.fil', 'Dec': None, 'DMmean': 8667.5707620528774, '

{'DMmin': 4, 'DMmedian': 1076.0, 'filterbank': 'Beam0_fb_D20151125T023504.fil', 'Dec': None, 'DMmean': 1075.7777777777778, 'maxSNR': 11.214935302734, 'dat': 'Beam0_dm_D20151125T023504.dat', 'nEvents': 747, 'RA': None, 'maxMJD': 57351.306051717002, 'DMmax': 1678, 'MJD0': '57351.306041667'}
{'DMmin': 115, 'DMmedian': 231.0, 'filterbank': 'Beam0_fb_D20160821T101803.fil', 'Dec': None, 'DMmean': 224.96969696969697, 'maxSNR': 10.757969856262, 'dat': 'Beam0_dm_D20160821T101803.dat', 'nEvents': 33, 'RA': None, 'maxMJD': 57621.598526130001, 'DMmax': 247, 'MJD0': '57621.598482373'}
{'DMmin': 1507, 'DMmedian': 1512.5, 'filterbank': 'Beam0_fb_D20151215T221703.fil', 'Dec': None, 'DMmean': 1517.75, 'maxSNR': 10.05402469635, 'dat': 'Beam0_dm_D20151215T221703.dat', 'nEvents': 8, 'RA': None, 'maxMJD': 57372.167840362003, 'DMmax': 1541, 'MJD0': '57372.167789352'}
{'DMmin': 3, 'DMmedian': 816.0, 'filterbank': 'Beam4_fb_D20151124T051504.fil', 'Dec': None, 'DMmean': 784.52838673412032, 'maxSNR': 14.0713062

{'DMmin': 9473, 'DMmedian': 9908.0, 'filterbank': 'Beam0_fb_D20161119T002404.fil', 'Dec': None, 'DMmean': 9889.4788135593226, 'maxSNR': 11.69565486908, 'dat': 'Beam0_dm_D20161119T002404.dat', 'nEvents': 2124, 'RA': None, 'maxMJD': 57711.272001200996, 'DMmax': 10039, 'MJD0': '57711.27192132'}
{'DMmin': 5, 'DMmedian': 257.0, 'filterbank': 'Beam4_fb_D20151124T051504.fil', 'Dec': None, 'DMmean': 247.45714285714286, 'maxSNR': 12.142247200012, 'dat': 'Beam4_dm_D20151124T051504.dat', 'nEvents': 105, 'RA': None, 'maxMJD': 57350.428856212995, 'DMmax': 309, 'MJD0': '57350.428784722'}
{'DMmin': 7, 'DMmedian': 1220.0, 'filterbank': 'Beam1_fb_D20150908T194603.fil', 'Dec': None, 'DMmean': 1209.9822150363784, 'maxSNR': 12.493451118469, 'dat': 'Beam1_dm_D20150908T194603.dat', 'nEvents': 1237, 'RA': None, 'maxMJD': 57273.995094860999, 'DMmax': 2287, 'MJD0': '57273.995081019'}
{'DMmin': 5, 'DMmedian': 839.5, 'filterbank': 'Beam1_fb_D20151124T051504.fil', 'Dec': None, 'DMmean': 849.19077568134173, 'maxSN

{'DMmin': 4.5, 'DMmedian': 189.5, 'filterbank': 'Beam4_fb_D20150626T034219.fil', 'Dec': None, 'DMmean': 344.55914380703405, 'maxSNR': 22.587228775024, 'dat': 'Beam4_dm_D20150626T034219.dat', 'nEvents': 1086741, 'RA': None, 'maxMJD': 57199.446209585993, 'DMmax': 1279.5, 'MJD0': '57199.446209491'}
{'DMmin': 5, 'DMmedian': 207.0, 'filterbank': 'Beam5_fb_D20151124T051504.fil', 'Dec': None, 'DMmean': 202.31978107896794, 'maxSNR': 17.981990814209, 'dat': 'Beam5_dm_D20151124T051504.dat', 'nEvents': 1279, 'RA': None, 'maxMJD': 57350.396287154006, 'DMmax': 297, 'MJD0': '57350.396226852'}
{'DMmin': 5, 'DMmedian': 270.0, 'filterbank': 'Beam0_fb_D20151124T023003.fil', 'Dec': None, 'DMmean': 232.63902439024389, 'maxSNR': 12.165351867676, 'dat': 'Beam0_dm_D20151124T023003.dat', 'nEvents': 205, 'RA': None, 'maxMJD': 57350.310804672998, 'DMmax': 352, 'MJD0': '57350.310787037'}
{'DMmin': 4492, 'DMmedian': 4801.0, 'filterbank': 'Beam0_fb_D20160824T104004.fil', 'Dec': None, 'DMmean': 4790.0086956521736, 

{'DMmin': 3, 'DMmedian': 42.0, 'filterbank': 'Beam6_fb_D20151217T050412.fil', 'Dec': None, 'DMmean': 108.10580522220458, 'maxSNR': 25.617456436157, 'dat': 'Beam6_dm_D20151217T050412.dat', 'nEvents': 88162, 'RA': None, 'maxMJD': 57373.380563879007, 'DMmax': 1691, 'MJD0': '57373.380520833'}
{'DMmin': 162, 'DMmedian': 1806.0, 'filterbank': 'Beam6_fb_D20150729T222004.fil', 'Dec': None, 'DMmean': 1699.5283466597884, 'maxSNR': 16.43662071228, 'dat': 'Beam6_dm_D20150729T222004.dat', 'nEvents': 19385, 'RA': None, 'maxMJD': 57233.210497314998, 'DMmax': 2559, 'MJD0': '57233.210431134'}
{'DMmin': 4, 'DMmedian': 1755.0, 'filterbank': 'Beam6_fb_D20150927T034004.fil', 'Dec': None, 'DMmean': 1682.2224164724164, 'maxSNR': 15.938606262207, 'dat': 'Beam6_dm_D20150927T034004.dat', 'nEvents': 15444, 'RA': None, 'maxMJD': 57292.353884892997, 'DMmax': 2559, 'MJD0': '57292.353865741'}
{'DMmin': 5, 'DMmedian': 986.0, 'filterbank': 'Beam6_fb_D20151124T051503.fil', 'Dec': None, 'DMmean': 950.98920863309354, 'ma

{'DMmin': 5600, 'DMmedian': 5662.5, 'filterbank': 'Beam0_fb_D20160824T104004.fil', 'Dec': None, 'DMmean': 6554.1627906976746, 'maxSNR': 10.472749710083, 'dat': 'Beam0_dm_D20160824T104004.dat', 'nEvents': 86, 'RA': None, 'maxMJD': 57624.676689034997, 'DMmax': 8696, 'MJD0': '57624.676611714'}
{'DMmin': 7, 'DMmedian': 977.5, 'filterbank': 'Beam6_fb_D20150909T195304.fil', 'Dec': None, 'DMmean': 933.56756756756761, 'maxSNR': 10.530237197876, 'dat': 'Beam6_dm_D20150909T195304.dat', 'nEvents': 74, 'RA': None, 'maxMJD': 57275.061134344003, 'DMmax': 1104, 'MJD0': '57275.061116898'}
{'DMmin': 490, 'DMmedian': 4429.0, 'filterbank': 'Beam1_fb_D20160702T220304.fil', 'Dec': None, 'DMmean': 3577.5139092240115, 'maxSNR': 12.146995544434, 'dat': 'Beam1_dm_D20160702T220304.dat', 'nEvents': 683, 'RA': None, 'maxMJD': 57572.181437988002, 'DMmax': 9456, 'MJD0': '57572.181424334'}
{'DMmin': 3176, 'DMmedian': 6889.0, 'filterbank': 'Beam0_fb_D20160824T104004.fil', 'Dec': None, 'DMmean': 6948.7652173913048, 'm

{'DMmin': 9, 'DMmedian': 113.0, 'filterbank': 'Beam1_fb_D20150905T184608.fil', 'Dec': None, 'DMmean': 128.37966640190626, 'maxSNR': 23.373037338257, 'dat': 'Beam1_dm_D20150905T184608.dat', 'nEvents': 1259, 'RA': None, 'maxMJD': 57270.948952637998, 'DMmax': 337, 'MJD0': '57270.948946759'}
{'DMmin': 8, 'DMmedian': 219.5, 'filterbank': 'Beam5_fb_D20151125T051804.fil', 'Dec': None, 'DMmean': 224.90101522842639, 'maxSNR': 13.649489402771, 'dat': 'Beam5_dm_D20151125T051804.dat', 'nEvents': 394, 'RA': None, 'maxMJD': 57351.410746579008, 'DMmax': 374, 'MJD0': '57351.410671296'}
{'DMmin': 214, 'DMmedian': 1829.0, 'filterbank': 'Beam1_fb_D20150802T020512.fil', 'Dec': None, 'DMmean': 1733.97320471597, 'maxSNR': 14.288830757141, 'dat': 'Beam1_dm_D20150802T020512.dat', 'nEvents': 10263, 'RA': None, 'maxMJD': 57236.305592077995, 'DMmax': 2559, 'MJD0': '57236.305506366'}
{'DMmin': 13, 'DMmedian': 197.0, 'filterbank': 'Beam5_fb_D20151125T051804.fil', 'Dec': None, 'DMmean': 190.14893617021278, 'maxSNR'

{'DMmin': 423, 'DMmedian': 3436.5, 'filterbank': 'Beam0_fb_D20161014T022503.fil', 'Dec': None, 'DMmean': 3175.429347826087, 'maxSNR': 13.578004837036, 'dat': 'Beam0_dm_D20161014T022503.dat', 'nEvents': 1656, 'RA': None, 'maxMJD': 57675.268427064002, 'DMmax': 8164, 'MJD0': '57675.268425926'}
{'DMmin': 6, 'DMmedian': 192.0, 'filterbank': 'Beam6_fb_D20151127T044704.fil', 'Dec': None, 'DMmean': 197.22252911213201, 'maxSNR': 37.046817779541, 'dat': 'Beam6_dm_D20151127T044704.dat', 'nEvents': 72049, 'RA': None, 'maxMJD': 57353.418881284008, 'DMmax': 696, 'MJD0': '57353.418854167'}
{'DMmin': 9.0, 'DMmedian': 886.5, 'filterbank': 'Beam4_fb_D20150626T035947.fil', 'Dec': None, 'DMmean': 838.36403000395512, 'maxSNR': 45.872283935547, 'dat': 'Beam4_dm_D20150626T035947.dat', 'nEvents': 437409, 'RA': None, 'maxMJD': 57199.458867587004, 'DMmax': 1279.5, 'MJD0': '57199.458819444'}
{'DMmin': 950, 'DMmedian': 5007.5, 'filterbank': 'Beam6_fb_D20161121T171504.fil', 'Dec': None, 'DMmean': 4818.164438502673

KeyboardInterrupt: 

In [None]:
# 'pixelMin_1_0' only exists if the pixel-statistics block in the previous cell is uncommented
print(df['pixelMin_1_0'].dropna())
#print df

#### Add labels
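
The label cell in this section matches each labelled key to a dataframe row by filterbank name and buffer ID, then writes the label with a boolean mask. The sketch below shows the same pattern on a toy dataframe. All values are invented, and the `.png` key extension is an assumption about the output of `labelImg2.py`.

```python
import pandas as pd

# Toy stand-in for the buffer dataframe; all values are invented.
toy = pd.DataFrame({
    'filterbank': ['Beam0_fb_D20161119T002404.fil',
                   'Beam0_fb_D20161119T002404.fil',
                   'Beam6_fb_D20160215T160004.fil'],
    'Buffer': [12, 13, 12],
})

# Hypothetical label dictionary with keys of the form '<stem>.buffer<N>.<ext>'.
labels = {'Beam0_fb_D20161119T002404.buffer13.png': 3}

for key, val in labels.items():
    fb_name = key.split('buffer')[0] + 'fil'             # '<stem>.' + 'fil'
    buf_id = int(key.split('.')[1].split('buffer')[-1])  # integer after 'buffer'
    match = (toy['filterbank'] == fb_name) & (toy['Buffer'] == buf_id)
    toy.loc[match, 'Label'] = val                        # creates 'Label' on first use

print(toy)
```

Rows that never match a key keep `NaN` in `Label`, which makes the unlabelled buffers easy to select later with `isnull()`.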

In [None]:
# output of labelImg2.py
labelPKlFiles = glob.glob(BASE_DATA_PATH + 'allLabels/*.pkl')

# add assigned labels to main dataframe
for lPkl in labelPKlFiles:
    print('Reading labels from %s' % lPkl)
    with open(lPkl, 'rb') as fh:
        labelDict = pickle.load(fh)
    for key, val in labelDict.items():
        # the text before 'buffer' in the key gives the filterbank file name; the number after it is the buffer ID
        fbFN = key.split('buffer')[0] + 'fil'
        bufID = int(key.split('.')[1].split('buffer')[-1])
        df.loc[(df['filterbank']==fbFN) & (df['Buffer']==bufID), 'Label'] = val

In [None]:
print(df['Label'].describe())

#### Save combined dataframe to file

This would be a good point to split into a new notebook, as the previous steps have combined the various labels and features into a single dataframe. We will likely not need to re-run this code often, and as it takes a few minutes to run, we can save the final dataframe to file and use it as the starting point for the model.
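
A minimal sketch of that save-and-reload workflow, using a toy dataframe and a temporary directory so nothing real is overwritten. The file name `toyFeatures.pkl` and the column values are hypothetical.

```python
import os
import shutil
import tempfile
import pandas as pd

# Toy dataframe standing in for the combined feature table; values are invented.
toy = pd.DataFrame({'Buffer': [1, 2, 3],
                    'Label': [0.0, float('nan'), 4.0]})

tmp_dir = tempfile.mkdtemp()
pkl_path = os.path.join(tmp_dir, 'toyFeatures.pkl')

toy.to_pickle(pkl_path)              # save once, after the slow feature-building step
reloaded = pd.read_pickle(pkl_path)  # what a follow-up notebook would start from

# Keep only the rows that carry a label, e.g. for training.
labelled = reloaded[reloaded['Label'].notnull()]
print('%d of %d rows are labelled' % (len(labelled), len(reloaded)))

shutil.rmtree(tmp_dir)               # remove the temporary directory
```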

In [None]:
df.to_pickle('featureDataframeInigo.pkl')