# Exploring data quality flags

In this notebook, we will explore the numerous data flags that are produced by the LSST stack processing. As you will see below, there are nearly 200 flags that indicate various issues with the measurement algorithms. This notebook shows you how to access those flags and filter catalogs based on the flags.

This is not meant to provide a recommendation about which flags to use, but is rather an illustration of how one can examine the effects of flags on a data set. For more information about some useful flags, see:

https://pipelines.lsst.io/getting-started/multiband-analysis.html#
(in particular, there are sections about "Filtering for unique, deblended sources with the detect_isPrimary flag" and "Using measurement flags")

(Based in part on notebooks originally written by Angelo Fausti and Sasha Brownsberger)

In [None]:
# LSST stack imports
from lsst.daf.persistence import Butler
import lsst.afw.display as afw_display

# Other python imports
%matplotlib inline
import matplotlib.pyplot as plt
import math
import pandas as pd
import numpy

We'll use results from coadds produced during the HSC weekly reprocessing. See more info here: https://confluence.lsstcorp.org/display/DM/S18+HSC+PDR1+reprocessing

There is currently no convenient way to learn from the butler what dataset types are available, or what tracts, patches, and filters have data. So, we have to know in advance the name of the dataset type (e.g., `deepCoadd_forced_src`), the tract, patch IDs, and filter, which we can get from the wiki page linked above (see also [this summary of HSC-SSP data](https://hsc-release.mtk.nao.ac.jp/doc/index.php/database/)). 

In [None]:
##### If you want to edit the data set, tract, and patch yourself, change the numbers below appropriately
data_set = 'UDEEP'# HSC/SSP survey data include WIDE, DEEP, UDEEP fields
datadir = '/datasets/hsc/repo/rerun/DM-13666/' + data_set 
butler = Butler(datadir)

# We selected the "Ultra-deep (UDEEP)" data, and will choose a tract from the SXDS field (tract 8765):
tract = 8765 #8766
patch = '1,2' #'8,3'  # patch selected at random

# All subsequent data structures will be stored in lists, with the elements of the lists
# corresponding to the filters you specify here, in the same order. So it is a good idea to remember the order!
filters = ['HSC-G','HSC-I','HSC-Y'] 
n_filters = len(filters)

# We'll focus our analysis on the forced photometry from the co-add data set:
data_type = 'deepCoadd_forced_src'

Now extract the data for this patch using the butler. Note that we'll retrieve one catalog per filter by looping over the three filters defined above:

In [None]:
deep_coadds = [butler.get(data_type, tract = tract, patch = patch, dataId={'filter': filter}) for filter in filters]
n_raw_objects = len(deep_coadds[0])
for i in range(len(filters)):
    print ('Number of objects in filter ' + filters[i] + ' = ' + str(len(deep_coadds[i])))
print ('(Note that those numbers should all be the same, because forced photometry is performed on the same set of sources in all bands.)')

# Convert the catalogs to Astropy tables so we can easily work with them:
deep_coadd_tables = [deep_coadd.asAstropy() for deep_coadd in deep_coadds]

What columns are in the catalogs? We can use the 'colnames' attribute of the table to see a list:

In [None]:
table = deep_coadd_tables[0]
table.colnames

## Exploring data flags in the catalogs:

If you executed the previous cell, you saw that there are a _lot_ of columns in the data structures. Each of the measurements has associated flags to denote whether that measurement is reliable; we want to explore those flags and the brief descriptions of their meaning. To do so, we select all columns that have '\_flag\_' in their names and get their descriptions, placing them in a pandas dataframe.
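The name-based selection can be tried out on a handful of column names before running it on a real catalog. The sketch below uses a short, hand-written list (an illustrative stand-in, not the contents of any actual catalog) to show that the substring test matches specific failure modes such as `base_PixelFlags_flag_saturated`, but not a general failure flag whose name simply ends in `_flag`.

```python
# Illustrative stand-in for table.colnames; a real catalog has many more columns.
example_colnames = [
    'coord_ra',
    'coord_dec',
    'base_PsfFlux_flag',
    'base_PsfFlux_flag_edge',
    'base_PixelFlags_flag_saturated',
    'slot_PsfFlux_flag_edge',
]

# Same test as used in this notebook: keep names containing '_flag_'.
specific_flags = [name for name in example_colnames if '_flag_' in name]

# A looser test that also keeps names ending in '_flag'.
all_flags = [name for name in example_colnames
             if '_flag_' in name or name.endswith('_flag')]

print(specific_flags)
print(all_flags)
```

Whether the general `_flag` columns should be included depends on what you want to examine; the looser test is shown only as one option.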

In [None]:
flags = [ (colname, table[colname].description) for colname in table.colnames if '_flag_' in colname]
flags_df = pd.DataFrame(flags, columns=['Column Name', 'Description'])
display(flags_df)

Now let's say we want to know how common it is for measurements to be affected by each of these flags. Let's define a short function to calculate the fraction of objects rejected by each flag, then feed it the list of flags from the previous cell.

Note that these are boolean (True/False) flags, so we simply need to find where each of them is set to **True**.
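Because the flags are boolean, the flagged fraction is simply the mean of the flag column. The following sketch shows this on a small synthetic array (the values are made up for illustration). Unlike the `compute_fraction` function defined in the next cell, which truncates to an integer percentage, it keeps the decimal places.

```python
import numpy

# Synthetic boolean flag column: 3 of 8 objects are flagged (made-up values).
flag_values = numpy.array([True, False, False, True, False, False, True, False])

# True counts as 1 and False as 0, so the mean is the flagged fraction.
fraction_flagged = numpy.mean(flag_values)
n_flagged = numpy.count_nonzero(flag_values)

print("{} of {} flagged ({:.1%})".format(n_flagged, len(flag_values), fraction_flagged))
```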

In [None]:
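def compute_fraction(table, colname):
    # Return the percentage of rows in "table" for which the flag "colname" is True,
    #   truncated to an integer (so anything below 1% is reported as 0).
    size = len(table)
    fraction = int(len(table[table[colname]==True])/size*100)
    return fraction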

# This line wraps up a handful of tasks into one efficient process. It places the "colname" (i.e., the name of the flag) in the first column,
#   then loops over the three filters, calling the "compute_fraction" function for each of them, while formatting them into percentages,
#   and adding them to the structure:
fraction_rejected = [[colname] + ["{}%".format(compute_fraction(table, colname)) for table in deep_coadd_tables]
                     for colname, description in flags] 

In [None]:
# Let's check what a couple lines of that structure look like:
fraction_rejected[0:3]

In [None]:
# Display a table of the fraction of objects rejected by each flag, by first turning the list object into a pandas dataframe:

df = pd.DataFrame(fraction_rejected, columns=['Flag'] + filters)

# By default, Pandas only displays some rows at the beginning and end. This will force it to display all rows:
with pd.option_context('display.max_rows', None):
    display(df)


## Are there any spatial patterns in the flagged objects?

To answer this, we will explore the positions of the objects satisfying certain flags in the patch that we're considering. 

First, select a set of the flags to explore:

In [None]:
flags_to_reject_on = ['base_PixelFlags_flag_bright_objectCenter',
                      'base_PixelFlags_flag_saturated', 
                      'base_PixelFlags_flag_cr',
                      'base_GaussianFlux_flag_badShape',
                      'base_PixelFlags_flag_sensor_edge']

# Select those flags from the dataframe we made earlier, so we can see their descriptions:
display(flags_df.loc[flags_df['Column Name'].isin(flags_to_reject_on)])

We'll bin them (using the 2D histogram routine from matplotlib, `plt.hist2d`) to show the number of objects satisfying each flag as a function of position. We also include the percentage of objects affected by each flag in the plot title (using the 'compute_fraction' function we defined earlier).
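The binning step can also be looked at separately from the plotting. As a minimal sketch, `numpy.histogram2d` returns a 2D array of counts of the kind that is drawn as an image in the plots; the positions here are random numbers standing in for catalog coordinates, and the bin count is arbitrary.

```python
import numpy

# Random positions in radians, standing in for 'coord_ra' and 'coord_dec' (not real data).
rng = numpy.random.RandomState(42)
ra_rad = rng.uniform(0.600, 0.605, size=1000)
dec_rad = rng.uniform(-0.090, -0.085, size=1000)

# Convert to degrees, then count objects in a 10 x 10 grid of position bins.
counts, ra_edges, dec_edges = numpy.histogram2d(
    numpy.rad2deg(ra_rad), numpy.rad2deg(dec_rad), bins=(10, 10))

print(counts.shape)        # one entry per (RA bin, Dec bin)
print(int(counts.sum()))   # every object falls in exactly one bin
```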

In [None]:
# Plot the spatial distribution of flagged objects in the first filter (in this case, 'HSC-G'):

for flag in flags_to_reject_on:
    index = deep_coadd_tables[0][flag]==True
    plt.figure()
    # Note that (RA, Dec) are in radians -- use numpy.rad2deg() to convert to degrees.
    plt.hist2d(numpy.rad2deg(deep_coadd_tables[0]['coord_ra'][index]), numpy.rad2deg(deep_coadd_tables[0]['coord_dec'][index]), bins=(100, 100), cmap='Blues')
    plt.colorbar(label='Counts')
    plt.xlabel('RA(deg)')
    plt.ylabel('Dec(deg)')
    plt.title("{} ({}%)".format(flag, compute_fraction(deep_coadd_tables[0], flag)))
    plt.show()

These distributions look as you might expect. For example, the `sensor_edge` flag forms linear features that look like the edges of CCD sensors, and the `saturated` flag shows only a few clusters (presumably saturated stars) and a streak (presumably a bleed trail from a saturated star).

### Combine selection based on a set of flags:

In [None]:
#Define a list of all possible flags, and one with a "default" set of flags (selected for illustration purposes)
flags_to_reject_on_default = ['base_PixelFlags_flag_bad', 'base_PixelFlags_flag_cr', 'base_PixelFlags_flag_offimage','base_PixelFlags_flag_edge',
                              'base_PixelFlags_flag_saturated', 'base_PixelFlags_flag_rejected', 'base_PixelFlags_flag_interpolated']
flags_to_reject_on_all = [ (colname, deep_coadd_tables[0][colname].description) for colname in deep_coadd_tables[0].colnames if '_flag_' in colname]

In [None]:
print ('Choose the flags on which you wish to reject observations. Here is a set of default options we have defined:\n')
print (flags_to_reject_on_default)

In [None]:
######ENTER YOUR CHOICE OF REJECTION FLAGS HERE##### 
flags_to_reject_on = [] #Leave empty to use "default" flags. Otherwise, populate it as in the example below.

# EXAMPLE:
#flags_to_reject_on = ['base_PixelFlags_flag_bright_objectCenter',
#                      'base_PixelFlags_flag_saturated', 
#                      'base_PixelFlags_flag_cr',
#                      'base_GaussianFlux_flag_badShape',
#                      'base_PixelFlags_flag_sensor_edge']

if len(flags_to_reject_on) == 0: 
    flags_to_reject_on = flags_to_reject_on_default[:]

Now identify the objects in each catalog that have any of the selected flags set, and count how many objects would remain after rejecting them:
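For large catalogs, the per-object Python loop used in the next cell can be slow. One possible alternative, sketched here on synthetic boolean arrays rather than the real tables, combines the flag columns with `numpy.logical_or.reduce`. The dictionary `fake_table` and its values are invented for the example.

```python
import numpy

# Invented flag columns for five objects (not real catalog values).
fake_table = {
    'base_PixelFlags_flag_cr':        numpy.array([False, True,  False, False, False]),
    'base_PixelFlags_flag_saturated': numpy.array([False, False, False, True,  False]),
    'base_PixelFlags_flag_edge':      numpy.array([False, True,  False, False, False]),
}
flags_to_combine = list(fake_table.keys())

# An object is rejected if ANY of the chosen flags is set.
rejected = numpy.logical_or.reduce([fake_table[flag] for flag in flags_to_combine])

# Element-by-element version, for comparison with the vectorized result.
rejected_loop = [any(fake_table[flag][i] for flag in flags_to_combine)
                 for i in range(5)]

# Count the objects that survive the rejection.
n_good = numpy.count_nonzero(~rejected)
print(rejected, n_good)
```

Both versions give the same mask; the vectorized form avoids indexing the table once per object and per flag.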

In [None]:
# Combine the selected flags for each filter separately.
# If an object has ANY of the desired flags set in a given filter, that object is rejected by the combined flag for that filter.
total_rej_flags_by_filters = [[any([deep_coadd_table[flag][i] for flag in flags_to_reject_on]) for i in range(n_raw_objects)] 
                                              for deep_coadd_table in deep_coadd_tables] 

# Print a summary of the number of objects retained after filtering:
for i in range(n_filters):
    nok = len([flag for flag in total_rej_flags_by_filters[i] if flag == False])
    print ('For filter ' + str(filters[i]) + ', ' + str(nok) +\
           ' of ' + str(n_raw_objects) + " ({:.1%})".format((nok/n_raw_objects)) + ' objects were "good" based on your chosen flags.')
 

### More flags ###

Those aren't the only data flags. There are also flags contained in the "ref" catalogs for each coadd data set. Many of these do not have "flag" in their name, but can be very important. These include fields such as "detect_isPrimary" and "base_ClassificationExtendedness_value", among others. For more information about some useful flags, see:

https://pipelines.lsst.io/getting-started/multiband-analysis.html#
(in particular, there are sections about "Filtering for unique, deblended sources with the detect_isPrimary flag" and "Using measurement flags")

In [None]:
data_type = 'deepCoadd_ref'

deep_coadd_refs = [butler.get(data_type, tract = tract, patch = patch, dataId={'filter': filter}) for filter in filters]

In [None]:
# Convert the catalogs to Astropy tables so we can easily work with them:
deep_coadd_reftables = [deep_coadd_ref.asAstropy() for deep_coadd_ref in deep_coadd_refs]

In [None]:
ref_table = deep_coadd_reftables[0]
ref_table.colnames

In [None]:
print(numpy.size(table.colnames))
print(numpy.size(ref_table.colnames))

### First check out the entries with "flag" in the column names:

In [None]:
# Filter out the ones with "slot_" in them, since they are simply repeats of the values contained elsewhere:
ref_flags = [ (colname, ref_table[colname].description) for colname in ref_table.colnames if (('slot_' not in colname) and ('_flag_' in colname))]
pd.DataFrame(ref_flags, columns=['Column Name', 'Description'])

In [None]:
# This line wraps up a handful of tasks into one efficient process. It places the "colname" (i.e., the name of the flag) in the first column,
#   then loops over the three filters, calling the "compute_fraction" function for each of them, while formatting them into percentages,
#   and adding them to the structure:
fraction_rejected_ref = [[colname] + ["{}%".format(compute_fraction(table, colname)) for table in deep_coadd_reftables]
                     for colname, description in ref_flags] 

In [None]:
# Let's check what a couple lines of that structure look like:
fraction_rejected_ref[0:3]

In [None]:
# Display a table of the fraction of objects rejected by each flag, by first turning the list object into a pandas dataframe:

df_ref = pd.DataFrame(fraction_rejected_ref, columns=['Flag'] + filters)

# By default, Pandas only displays some rows at the beginning and end. This will force it to display all rows:
with pd.option_context('display.max_rows', None):
    display(df_ref)


### Now check some of the flags that don't have "flag" in their names. 

As mentioned above, this includes important fields such as "detect_isPrimary" and "base_ClassificationExtendedness_value", among others. For more information about some useful flags, see:

https://pipelines.lsst.io/getting-started/multiband-analysis.html#
(in particular, there are sections about "Filtering for unique, deblended sources with the detect_isPrimary flag" and "Using measurement flags")
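As one illustration of a non-boolean column that is used like a flag, `base_ClassificationExtendedness_value` is a floating-point number rather than True/False. The sketch below applies simple comparisons to invented values, assuming the convention that 0 marks point-like sources, 1 marks extended ones, and NaN marks a failed classification; check the column description in your own catalog before relying on this.

```python
import numpy

# Invented extendedness values: 0 = point-like, 1 = extended, NaN = not classified (assumed convention).
extendedness = numpy.array([0.0, 1.0, numpy.nan, 0.0, 1.0])

# Comparisons with NaN are always False, so unclassified objects end up in neither selection.
is_pointlike = extendedness < 1.0
is_extended = extendedness >= 1.0
is_unclassified = numpy.isnan(extendedness)

print(numpy.count_nonzero(is_pointlike),
      numpy.count_nonzero(is_extended),
      numpy.count_nonzero(is_unclassified))
```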

Let's check out the "detect_isPrimary" flag and its constituents. "detect_isPrimary" is the combination of requiring sources to be fully deblended (i.e., have no _children_), to not be a duplicate source in the overlaps of tracts or patches, and to not be one of the "sky" objects that is added during processing. 
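The combination described above can be sketched with made-up arrays standing in for the catalog columns. Comparing the masks element by element with `numpy.array_equal` is a stricter check than comparing counts, since two selections can contain the same number of objects without containing the same objects.

```python
import numpy

# Made-up columns for six objects, named after the catalog columns they stand in for.
deblend_nChild      = numpy.array([0, 2, 0, 0, 0, 0])
detect_isPatchInner = numpy.array([True, True, False, True, True, True])
detect_isTractInner = numpy.array([True, True, True, True, True, True])
merge_peak_sky      = numpy.array([False, False, False, True, False, False])

# Build the combined selection from its constituents.
combined = ((deblend_nChild == 0) & detect_isPatchInner
            & detect_isTractInner & ~merge_peak_sky)

# Stand-in for the 'detect_isPrimary' column (constructed here to agree with the combination).
detect_isPrimary = numpy.array([True, False, False, False, True, True])

# True only if both masks select exactly the same objects.
same_objects = numpy.array_equal(combined, detect_isPrimary)
print(same_objects, numpy.count_nonzero(combined))
```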

"detect_isPrimary" is defined by this combination: (deblend_nChild == 0) & detect_isPatchInner & detect_isTractInner & (merge_peak_sky == False)


In [None]:
# Print a summary of the number of objects retained by each selection:
for i in range(n_filters):
    ref = deep_coadd_reftables[i]
    isPrimary = ref['detect_isPrimary'] == True
    isDeblended = ref['deblend_nChild'] == 0
    inInnerRegions = ref['detect_isPatchInner'] & ref['detect_isTractInner']
    notSkyObject = ref['merge_peak_sky'] == False
    # Add in a "combined" selection to verify that combining the three cuts selects the same number of objects as isPrimary:
    combined = isDeblended & inInnerRegions & notSkyObject
    # Use the boolean masks directly (not wrapped in lists) and count the True entries in each:
    sel = [isPrimary, isDeblended, inInnerRegions, notSkyObject, combined]
    selname = ['isPrimary','isDeblended','inInnerRegions','notSkyObject','combined']
    for j in range(len(selname)):
        n_sel = numpy.count_nonzero(sel[j])
        print('For filter ' + str(filters[i]) + ', ' + str(n_sel) +\
              ' of ' + str(n_raw_objects) + " ({:.1%})".format(n_sel/n_raw_objects) +\
              ' objects were retained by the ' + selname[j] + ' cut.')
    print('\n')


This notebook has illustrated ways to explore the various data flags in the LSST Stack, including ways of combining flags (as in the above cell, where we created the "combined" filter). There are many more flags that could be applied; explore suggestions at https://pipelines.lsst.io/getting-started/multiband-analysis.html#, or modify the code in this notebook to explore some of the flags yourself!