This package provides a class with methods to perform special hierarchy-preserving set operations on hierarchically organized data.
Examples of such hierarchically organized data include:
- genus, species
- countries, regions, cities
- treatment/control group, subject, observation date
Such data are sometimes referred to as multilevel or nested and are commonly encountered in social sciences.
By 'hierarchy-preserving set operations', we mean the following:
- when a row is intersected with another row that contains partially matching but less specific data, the row with more specific data is the result
- when a row is unioned with another row that contains partially matching but less specific data, the row with less specific data is the result
For example, assume the following two datasets, where the first column specifies genus, and the second column species.
None
is used as a sentinel value to indicate "any".
A = [
('Felis', 'silvestris'),
('Canis', 'lupus'),
('Homo', None),
('Ovis', None),
]
B = [
('Felis', 'silvestris'),
('Homo', 'sapiens'),
('Homo', 'neanderthalensis'),
('Ovis', None),
]
Then we define A ∩ B
to be:
{
('Felis', 'silvestris'),
('Homo', 'sapiens'),
('Homo', 'neanderthalensis'),
('Ovis', None),
}
And A ∪ B
to be:
{
('Felis', 'silvestris'),
('Canis', 'lupus'),
('Homo', None),
('Ovis', None),
}
As expected, the order of operands does not matter: A ∩ B = B ∩ A
and A ∪ B = B ∪ A
.
However, the order of the rows in the result is not guaranteed.
See the included example for more information.
- Input data must be 2-dimenisonal integer or floating point
numpy.ndarray
arrays.- It is trivial to convert non-numeric data to this format; see the included example.
- The special value
NaN
is a sentinel indicating a wildcard; it will match any other value in the same hierarchical path. - Each input array's rows must be unique.
- The hierarchy is assumed to flow from left to right; i.e., the 0-th column must contain the highest level of the hierarchy.
- If a value in a row of the hierarchy is
NaN
, all remaining values in the row must beNaN
.- I.e.
(1., NaN, NaN)
is permitted, but(1., NaN, 1.)
is not.
- I.e.
- Operations must be performed on arrays with equal numbers of columns.
See the included example for usage examples.
Running make install
will pip install
the package into the current environment.
Run make develop
followed by make test
.
-
Decompose input arrays of width c into c arrays of widths
[1, c]
.- If
n ∈ [1, c]
, then the n-th array contains input rows of width n whose original values in columns > n contained onlyNaN
. - Some of these c arrays may be empty.
- If
-
Intersect or union the arrays of width c, preserving hierarchy by examining values in the arrays of width c - 1.
-
Concatenate and deduplicate these results as needed.
-
Recurse to step 2 with arrays of width c - 1. When the base case is reached, return.
This implementation can perform intersections and unions on datasets of shape (100,000, 3)
in approximately 1 second on a MacBook Pro (Retina, 13-inch, Early 2015).
Input data containing rows that "skip" an intervening level of the hierarchy is not correctly handled. Such cases are detected and raise an exception. See the test suite for an example.
This limitation only affects panels with > 2 columns.
© 2018 Gabriel J. Michael
This software is distributed under the MIT License. See the license file for details.