
Commit

Reformat docs
yalsayyad committed Nov 18, 2019
1 parent 1d992fa commit 3d69605
Showing 2 changed files with 61 additions and 57 deletions.
88 changes: 46 additions & 42 deletions python/lsst/pipe/tasks/functors.py
@@ -23,13 +23,13 @@ def init_fromDict(initDict, basePath='lsst.pipe.tasks.functors', typeKey='functo
----------
initDict : dictionary
Dictionary describing object's initialization. Must contain
an entry keyed by `typeKey` that is the name of the object,
relative to `basePath`.
an entry keyed by ``typeKey`` that is the name of the object,
relative to ``basePath``.
basePath : str
Path relative to module in which ``initDict[typeKey]`` is defined.
typeKey : str
Key of `initDict` that is the name of the object
(relative to ``basePath``).
Key of ``initDict`` that is the name of the object
(relative to `basePath`).
"""
initDict = initDict.copy()
# TO DO: DM-21956 We should be able to define functors outside this module
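For illustration (not part of the diff), a minimal sketch of how `init_fromDict` might be called, assuming the remaining dictionary entries are forwarded as keyword arguments to the named functor class; the column and filter names here are hypothetical:

from lsst.pipe.tasks.functors import init_fromDict

# 'functor' is the default typeKey; the other keys are assumed to be
# passed through as keyword arguments to the Mag functor.
initDict = {'functor': 'Mag', 'col': 'modelfit_CModel', 'filt': 'HSC-G'}
magFunctor = init_fromDict(initDict)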
@@ -44,15 +44,17 @@ def init_fromDict(initDict, basePath='lsst.pipe.tasks.functors', typeKey='functo


class Functor(object):
"""Define and execute a calculation on a deepCoadd_obj ParquetTable
"""Define and execute a calculation on a ParquetTable
The `__call__` method accepts a `ParquetTable` object, and returns the result
of the calculation as a single column. Each functor defines what columns are needed
for the calculation, and only these columns are read from the ``ParquetTable``.
The `__call__` method accepts a `ParquetTable` object, and returns the
result of the calculation as a single column. Each functor defines what
columns are needed for the calculation, and only these columns are read
from the `ParquetTable`.
The action of `__call__` consists of two steps: first, loading the necessary
columns from disk into memory as a ``pandas.DataFrame` object; and second, performing
the computation on this dataframe and returning the result.
The action of `__call__` consists of two steps: first, loading the
necessary columns from disk into memory as a `pandas.DataFrame` object;
and second, performing the computation on this dataframe and returning the
result.
To define a new `Functor`, a subclass must define a `_func` method,
@@ -63,40 +65,39 @@ class Functor(object):
* `name`: A name appropriate for a figure axis label
* `shortname`: A name appropriate for use as a dictionary key
On initialization, a `Functor` should declare what filter (`filt` kwarg) and dataset
(e.g. `'ref'`, `'meas'`, `'forced_src'`) it is intended to be applied to.
This enables the `_get_cols` method to extract the proper columns from the parquet file.
If not specified, the dataset will fall back on the `_defaultDataset` attribute.
If filter is not specified and `dataset` is anything other than `'ref'`, then an error
will be raised when trying to perform the calculation.
As currently implemented, `Functor` is only set up to expect a `ParquetTable`
of the format of the `deepCoadd_obj` dataset; that is, a `MultilevelParquetTable`
with the levels of the column index being `filter`, `dataset`, and `column`.
This is defined in the `_columnLevels` attribute, as well as being implicit in
the role of the `filt` and `dataset` attributes defined at initialization.
In addition, the `_get_cols` method that
reads the dataframe from the `ParquetTable` will return a dataframe with column
index levels defined by the `_dfLevels` attribute; by default, this is `column`.
On initialization, a `Functor` should declare what filter (`filt` kwarg)
and dataset (e.g. `'ref'`, `'meas'`, `'forced_src'`) it is intended to be
applied to. This enables the `_get_cols` method to extract the proper
columns from the parquet file. If not specified, the dataset will fall back
on the `_defaultDataset` attribute. If filter is not specified and `dataset`
is anything other than `'ref'`, then an error will be raised when trying to
perform the calculation.
As currently implemented, `Functor` is only set up to expect a
`ParquetTable` of the format of the `deepCoadd_obj` dataset; that is, a
`MultilevelParquetTable` with the levels of the column index being `filter`,
`dataset`, and `column`. This is defined in the `_columnLevels` attribute,
as well as being implicit in the role of the `filt` and `dataset` attributes
defined at initialization. In addition, the `_get_cols` method that reads
the dataframe from the `ParquetTable` will return a dataframe with column
index levels defined by the `_dfLevels` attribute; by default, this is
`column`.
The `_columnLevels` and `_dfLevels` attributes should generally not need to
be changed, unless `_func` needs columns from multiple filters or datasets
to do the calculation.
An example of this is the ``lsst.pipe.tasks.functors.Color` functor, for which
`_dfLevels = ('filter', 'column')`, and `_func` expects the dataframe it gets to
have those levels in the column index.
While not currently implemented, it would be
relatively straightforward to generalize the base `Functor` class to be able to
accept arbitrary `ParquetTable` formats (other than that of `deepCoadd_obj`).
An example of this is the `lsst.pipe.tasks.functors.Color` functor, for
which `_dfLevels = ('filter', 'column')`, and `_func` expects the dataframe
it gets to have those levels in the column index.
Parameters
----------
filt : str
Filter upon which to do the calculation
dataset : str
Dataset upon which to do the calculation (e.g., 'ref', 'meas', 'forced_src').
Dataset upon which to do the calculation
(e.g., 'ref', 'meas', 'forced_src').
"""

@@ -203,12 +204,13 @@ def shortname(self):
class CompositeFunctor(Functor):
"""Perform multiple calculations at once on a catalog
The role of a `CompositeFunctor` is to group together computations from multiple
functors. Instead of returning ``pandas.Series`` a `CompositeFunctor` returns
a ``pandas.Dataframe``, with the column names being the keys of `funcDict`.
The role of a `CompositeFunctor` is to group together computations from
multiple functors. Instead of returning `pandas.Series` a
`CompositeFunctor` returns a `pandas.Dataframe`, with the column names
being the keys of `funcDict`.
The `columns` attribute of a `CompositeFunctor` is the union of all columns in all
the component functors.
The `columns` attribute of a `CompositeFunctor` is the union of all columns
in all the component functors.
A `CompositeFunctor` does not use a `_func` method itself; rather,
when a `CompositeFunctor` is called, all its columns are loaded
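A sketch of typical usage (illustrative; the file path and column names are hypothetical):

from lsst.pipe.tasks.functors import CompositeFunctor, Mag
from lsst.pipe.tasks.parquetTable import MultilevelParquetTable

parq = MultilevelParquetTable('deepCoadd_obj.parq')   # hypothetical path
funcDict = {'gmag': Mag('modelfit_CModel', filt='HSC-G'),
            'rmag': Mag('modelfit_CModel', filt='HSC-R')}
funcs = CompositeFunctor(funcDict)
df = funcs(parq)   # pandas.DataFrame with columns 'gmag' and 'rmag'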
@@ -513,7 +515,8 @@ class Mag(Functor):
col : `str`
Name of flux column from which to compute magnitude. Can be parseable
by `lsst.pipe.tasks.functors.fluxName` function---that is, you can pass
`'modelfit_CModel'` instead of `'modelfit_CModel_instFlux'`) and it will understand.
`'modelfit_CModel'` instead of `'modelfit_CModel_instFlux'`) and it will
understand.
calib : `lsst.afw.image.calib.Calib` (optional)
Object that knows zero point.
"""
@@ -639,7 +642,7 @@ class Color(Functor):
----------
col : str
Name of flux column from which to compute; same as would be passed to
``lsst.pipe.tasks.functors.Mag``.
`lsst.pipe.tasks.functors.Mag`.
filt2, filt1 : str
Filters from which to compute magnitude difference.
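A sketch of constructing a color functor (illustrative; the sense of the difference, filt2 minus filt1, is an assumption here):

from lsst.pipe.tasks.functors import Color

# Hypothetical g - r color from CModel fluxes.
gr = Color('modelfit_CModel', 'HSC-G', 'HSC-R')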
Expand Down Expand Up @@ -913,7 +916,8 @@ def getFilterAliasName(row):


class Photometry(Functor):
AB_FLUX_SCALE = (0 * u.ABmag).to_value(u.nJy) # AB to NanoJansky (3631 Jansky)
# AB to NanoJansky (3631 Jansky)
AB_FLUX_SCALE = (0 * u.ABmag).to_value(u.nJy)
LOG_AB_FLUX_SCALE = 12.56
FIVE_OVER_2LOG10 = 1.085736204758129569
# TO DO: DM-21955 Replace hard coded photometic calibration values
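For reference, a short check of the constant above (assuming astropy's AB magnitude unit): 0 mag AB corresponds to roughly 3631 Jy, i.e. about 3.631e12 nJy, whose base-10 logarithm is about 12.56.

import astropy.units as u
import numpy as np

ab_flux_scale = (0 * u.ABmag).to_value(u.nJy)
print(ab_flux_scale)              # ~3.631e12 nJy (3631 Jy)
print(np.log10(ab_flux_scale))    # ~12.56, matching LOG_AB_FLUX_SCALE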
30 changes: 15 additions & 15 deletions python/lsst/pipe/tasks/parquetTable.py
@@ -151,20 +151,22 @@ class MultilevelParquetTable(ParquetTable):
because there is not a convenient way to request specific table subsets
by level via Parquet through pyarrow, as there is with a `pandas.DataFrame`.
Additionally, pyarrow stores multilevel index information in a very strange way.
Pandas stores it as a tuple, so that one can access a single column from a pandas
dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`. However, for some reason
pyarrow saves these indices as "stringified" tuples, such that in order to read this
same column from a table written to Parquet, you would have to do the following:
Additionally, pyarrow stores multilevel index information in a very strange
way. Pandas stores it as a tuple, so that one can access a single column
from a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`. However, for
some reason pyarrow saves these indices as "stringified" tuples, such that
in order to read this same column from a table written to Parquet, you would
have to do the following:
pf = pyarrow.ParquetFile(filename)
df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])
See also https://github.com/apache/arrow/issues/1771, where I've raised this issue.
I don't know if this is a bug or intentional, and it may be addressed in the future.
See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.
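A small sketch of the behavior described above (results may depend on the pandas/pyarrow versions installed):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

columns = pd.MultiIndex.from_tuples([('ref', 'HSC-G', 'coord_ra')],
                                    names=['dataset', 'filter', 'column'])
df = pd.DataFrame([[1.0]], columns=columns)

# Writing through pyarrow stringifies the tuple column names...
pq.write_table(pa.Table.from_pandas(df), 'example.parq')

# ...so reading back a single column requires the stringified form.
pf = pq.ParquetFile('example.parq')
table = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])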
As multilevel-indexed dataframes can be very useful to store data like multiple filters'
worth of data in the same table, this case deserves a wrapper to enable easier access;
As multilevel-indexed dataframes can be very useful to store data like
multiple filters' worth of data in the same table, this case deserves a
wrapper to enable easier access;
that's what this object is for. For example,
parq = MultilevelParquetTable(filename)
@@ -175,8 +177,8 @@ class MultilevelParquetTable(ParquetTable):
will return just the coordinate columns; the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe,
but without having to load the whole frame into memory---this reads just those
columns from disk. You can also request a sub-table; e.g.,
but without having to load the whole frame into memory---this reads just
those columns from disk. You can also request a sub-table; e.g.,
parq = MultilevelParquetTable(filename)
columnDict = {'dataset':'meas',
@@ -185,14 +187,12 @@ class MultilevelParquetTable(ParquetTable):
and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.
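Putting the pieces above together, a consolidated sketch of the sub-table read (the filename is hypothetical):

from lsst.pipe.tasks.parquetTable import MultilevelParquetTable

parq = MultilevelParquetTable('deepCoadd_obj.parq')   # hypothetical path
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G',
              'column': ['coord_ra', 'coord_dec']}
df = parq.toDataFrame(columns=columnDict)   # equivalent of df['meas']['HSC-G'][['coord_ra', 'coord_dec']]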
Parameters
----------
filename : str
filename : str, optional
Path to Parquet file.
dataFrame : dataFrame, optional
"""

def __init__(self, *args, **kwargs):
super(MultilevelParquetTable, self).__init__(*args, **kwargs)

