# Exploring data quality flags

In this notebook, we will explore the numerous data flags that are produced by the LSST stack processing. This is not meant to provide a recommendation about which flags to use, but is rather an illustration of how one can examine the effects of flags on a data set.

(Based in part on notebooks originally written by Angelo Fausti and Sasha Brownsberger)

In [None]:
# LSST stack imports
from lsst.daf.persistence import Butler
import lsst.afw.display as afw_display

# Other python imports
%matplotlib inline
import matplotlib.pyplot as plt
import math
import pandas as pd
import numpy


We'll use results from coadds produced during the HSC weekly reprocessing. See more info here: https://confluence.lsstcorp.org/display/DM/S18+HSC+PDR1+reprocessing

There is currently no convenient way to learn from the butler what dataset types are available, or what tracts, patches, and filters have data. So, we have to know in advance the name of the dataset type (e.g., `deepCoadd_forced_src`), the tract, patch IDs, and filter, which we can get from the wiki page linked above (see also [this summary of HSC-SSP data](https://hsc-release.mtk.nao.ac.jp/doc/index.php/database/)). 

In [None]:
##### If you want to edit the data set, tract, and patch yourself, change the numbers below appropriately
data_set = 'UDEEP'# HSC/SSP survey data include WIDE, DEEP, UDEEP fields
datadir = '/datasets/hsc/repo/rerun/DM-13666/' + data_set 
butler = Butler(datadir)

# We selected the "Ultra-deep (UDEEP)" data, and will choose a tract from the SXDS field (tract 8765):
tract = 8765 #8766
patch = '1,2' #'8,3'  # patch selected at random

# All subsequent data structures will be stored in lists, with the elements of the lists
# corresponding to the filters you specify here, in the same order. So it's a good idea to remember the order!
filters = ['HSC-G','HSC-I','HSC-Y'] 
n_filters = len(filters)

# We'll focus our analysis on the forced photometry from the co-add data set:
data_type = 'deepCoadd_forced_src'

Now extract the data for this patch using the butler, looping over the filters defined above:

In [None]:
# Use 'filter_name' as the loop variable to avoid shadowing Python's built-in filter()
deep_coadds = [butler.get(data_type, tract = tract, patch = patch, dataId={'filter': filter_name}) for filter_name in filters]
n_raw_objects = len(deep_coadds[0])
for i in range(len(filters)):
    print ('Number of objects in filter ' + filters[i] + ' = ' + str(len(deep_coadds[i])))
print ('(Note that those numbers should all be the same.)')

# Convert the catalogs to Astropy tables so we can easily work with them:
deep_coadd_tables = [deep_coadd.asAstropy() for deep_coadd in deep_coadds]

What columns are in the catalogs?

In [None]:
table = deep_coadd_tables[0]
table.colnames

## Exploring data flags in the catalogs:

Here we select all columns whose names contain '_flag_' and collect their descriptions.

In [None]:
flags = [ (colname, table[colname].description) for colname in table.colnames if '_flag_' in colname]
pd.DataFrame(flags, columns=['Column Name', 'Description'])

Now let's calculate the fraction of objects rejected by each flag:

In [None]:
#tables = [catalog.asAstropy() for catalog in catalogs]

def compute_fraction(table, colname):
    """Return the percentage of rows with `colname` set, truncated to an integer."""
    size = len(table)
    fraction = int(len(table[table[colname]==True])/size*100)
    return fraction

fraction_rejected = [[colname] + ["{}%".format(compute_fraction(table, colname)) for table in deep_coadd_tables]
                     for colname, description in flags] 
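
Because the percentage is truncated to an integer, any flag set on fewer than 1% of objects is reported as 0%. The following is a minimal sketch of the same calculation with one decimal place kept. It uses a small synthetic boolean array in place of a real catalog column, and the names `percent_flagged` and `flag_column` are illustrative and not part of the LSST stack.

```python
import numpy

def percent_flagged(flag_column):
    """Return the percentage of True entries in a boolean array."""
    flag_column = numpy.asarray(flag_column, dtype=bool)
    return 100.0 * numpy.count_nonzero(flag_column) / flag_column.size

# Synthetic stand-in for a flag column: 3 of 1000 objects are flagged.
flag_column = numpy.zeros(1000, dtype=bool)
flag_column[:3] = True

truncated = int(percent_flagged(flag_column))  # what integer truncation reports
precise = percent_flagged(flag_column)         # untruncated percentage
print("truncated: {}%, precise: {:.1f}%".format(truncated, precise))
```

With a real catalog, the same helper could presumably be applied to a column such as `table[colname]`, assuming that column holds boolean values.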

In [None]:
# Display a table of the fraction of objects rejected by each flag:

df = pd.DataFrame(fraction_rejected, columns=['Flag'] + filters)

# By default, Pandas only displays some rows at the beginning and end. This will force it to display all rows:
with pd.option_context('display.max_rows', None):
    display(df)


## Are there any spatial patterns in the flagged objects?

Select a set of the flags to explore:

In [None]:
flags_to_reject_on = ['base_PixelFlags_flag_bright_objectCenter',
                      'base_PixelFlags_flag_saturated', 
                      'base_PixelFlags_flag_cr',
                      'base_GaussianFlux_flag_badShape',
                      'base_PixelFlags_flag_sensor_edge']

In [None]:
# Plot the spatial distribution of flagged objects in the first filter (in this case, 'HSC-G'):

for flag in flags_to_reject_on:
    index = deep_coadd_tables[0][flag]==True
    plt.figure()
    # Note that (RA, Dec) are in radians -- use numpy.rad2deg() to convert to degrees.
    plt.hist2d(numpy.rad2deg(deep_coadd_tables[0]['coord_ra'][index]), numpy.rad2deg(deep_coadd_tables[0]['coord_dec'][index]), bins=(100, 100), cmap='Blues')
    plt.colorbar(label='Counts')
    plt.xlabel('RA(deg)')
    plt.ylabel('Dec(deg)')
    plt.title("{} ({}%)".format(flag, compute_fraction(deep_coadd_tables[0], flag)))
    plt.show()

In [None]:
#Define all possible flags, and our recommended flags 
flags_to_reject_on_recommended= ['base_PixelFlags_flag_bad', 'base_PixelFlags_flag_cr', 'base_PixelFlags_flag_offimage','base_PixelFlags_flag_edge', 'base_PixelFlags_flag_saturated', 'base_PixelFlags_flag_rejected', 'base_PixelFlags_flag_interpolated']
flags_to_reject_on_all = [ (colname, deep_coadd_tables[0][colname].description) for colname in deep_coadd_tables[0].colnames if '_flag_' in colname]
#print (flags_to_reject_on_all)

In [None]:
print ('Choose the flags on which you wish to reject observations. ')
#print ('Here are your options: ')
#print (flags_to_reject_on_all)
#df_allflags = pd.DataFrame(flags_to_reject_on_all, columns=['Column Name', 'Description'])
#with pd.option_context('display.max_rows', None):
#    display(df_allflags)
    
print ('Here are our suggestions:')
print (flags_to_reject_on_recommended)


In [None]:
######ENTER YOUR CHOICE OF REJECTION FLAGS HERE##### 
flags_to_reject_on = [] #Leave empty to use "recommended" flags

# EXAMPLE:
#flags_to_reject_on = ['base_PixelFlags_flag_bright_objectCenter',
#                      'base_PixelFlags_flag_saturated', 
#                      'base_PixelFlags_flag_cr',
#                      'base_GaussianFlux_flag_badShape',
#                      'base_PixelFlags_flag_sensor_edge']

if len(flags_to_reject_on) == 0: 
    flags_to_reject_on = flags_to_reject_on_recommended[:]

In [None]:
# Combine the chosen flags with a logical OR, separately for each filter.
# If an object has ANY of the chosen flags set in a filter, it is rejected by the master flag for that filter.
total_rej_flags_by_filters = [[any([deep_coadd_table[flag][i] for flag in flags_to_reject_on]) for i in range(n_raw_objects)] 
                                              for deep_coadd_table in deep_coadd_tables] 

for i in range(n_filters):
    nok = len([flag for flag in total_rej_flags_by_filters[i] if flag == False])
    print ('For filter ' + str(filters[i]) + ', ' + str(nok) +\
           ' of ' + str(n_raw_objects) + " ({:.1%})".format((nok/n_raw_objects)) + ' objects were "good" based on your chosen flags.')
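
The per-object Python loop above can be slow for large catalogs. As a hedged sketch, the same "reject if ANY flag is set" logic can be written as a vectorized OR over boolean arrays. The dictionary `flag_columns` and the flag names `flag_a`, `flag_b`, and `flag_c` are synthetic stand-ins, not real catalog columns.

```python
import numpy

# Synthetic stand-ins for three flag columns over five objects.
flag_columns = {
    'flag_a': numpy.array([False, True, False, False, False]),
    'flag_b': numpy.array([False, False, False, True, False]),
    'flag_c': numpy.array([False, True, False, False, False]),
}
chosen_flags = ['flag_a', 'flag_b', 'flag_c']

# OR the columns together: an object is rejected if ANY chosen flag is set.
rejected = numpy.logical_or.reduce([flag_columns[name] for name in chosen_flags])

# Count the objects that pass, i.e. those with no chosen flag set.
n_good = numpy.count_nonzero(~rejected)
print(rejected, n_good)
```

Here `numpy.logical_or.reduce` stacks the columns and applies OR along the first axis, so the result has one boolean per object.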
 

### More flags ###

Those aren't the only data flags. The "ref" catalog for each coadd data set contains additional flags. Many of these do not have "flag" in their names but can be very important, for example "detect_isPrimary" and "base_ClassificationExtendedness_value".
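
As an illustration of how two such columns might be used together, here is a minimal sketch on synthetic arrays. It assumes that "detect_isPrimary" is boolean and that "base_ClassificationExtendedness_value" is 0 for point-like sources, 1 for extended sources, and NaN where classification failed. These assumptions should be checked against the column descriptions in the actual catalog. The arrays `is_primary` and `extendedness` are made up for the example.

```python
import numpy

# Synthetic stand-ins for two ref-catalog columns over six objects.
is_primary = numpy.array([True, True, False, True, True, False])
extendedness = numpy.array([0.0, 1.0, 0.0, numpy.nan, 1.0, 1.0])

# Comparisons with NaN evaluate to False, so unclassified objects
# fall into neither group.
primary_point_like = is_primary & (extendedness == 0.0)
primary_extended = is_primary & (extendedness == 1.0)

n_point_like = numpy.count_nonzero(primary_point_like)
n_extended = numpy.count_nonzero(primary_extended)
print(n_point_like, n_extended)
```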


In [None]:
data_type = 'deepCoadd_ref'

# Use 'filter_name' as the loop variable to avoid shadowing Python's built-in filter()
deep_coadd_refs = [butler.get(data_type, tract = tract, patch = patch, dataId={'filter': filter_name}) for filter_name in filters]

In [None]:
# Convert the catalogs to Astropy tables so we can easily work with them:
deep_coadd_reftables = [deep_coadd_ref.asAstropy() for deep_coadd_ref in deep_coadd_refs]

In [None]:
ref_table = deep_coadd_reftables[0]
ref_table.colnames

In [None]:
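# Compare the number of columns in the forced-photometry and ref catalogs.
# (numpy was imported without the 'np' alias, so use len() here.)
print(len(table.colnames))
print(len(ref_table.colnames))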

In [None]:
# Filter out the ones with "slot_" in them, since they are simply repeats of the values contained elsewhere:
rr_flags = [ (colname, ref_table[colname].description) for colname in ref_table.colnames if 'slot_' not in colname]
pd.DataFrame(rr_flags[600:650], columns=['Column Name', 'Description'])