-
Notifications
You must be signed in to change notification settings - Fork 17
HierarchyEvaluation
Functions we have described so far treat Paritions as flat lists of rows. Hierarchy takes a hierarchy Structure argument that describes the levels of a Hierarchy in terms of columns in the partition; it then enables hierarchical navigation and computations based on sets of things imposed by the hierarchy.
Assume that we have a Sales Table in Hive with the following structure: (LineItem, Country, State, City, Part, Sales). And we want to compute the following result:
| Country | State | City | Sales | % Country | Top City | Avg. City |
|---|---|---|---|---|---|---|
- there are separate rows for Country, State and Cities (may have indicator fields to indicate the level of a row, like in the Grouping clause)
- % Country returns the % of Sales that a City or State contributes to its Country.
- Top City is the Top City within a Country or State.
- Avg. City is the Avg Sales per City within a State or across a Country.
We propose to implement a Table function with the following signature and behavior:
Signature: HierarchyEvaluate( ArrayList<String> hierarchyLevelColumns, ArrayList<GroovyExpression> hierExpressions)
- hierarchyLevelColumns: describes the hierarchy being considered. Initially this will have to match the Order specification; so this is informational only, and may not be included in the function at all.
- hierExpressions : This is a List, where each element is just a GroovyExpression or a pair of the form (GroovyExpression, alias). The GroovyExpression defines how to compute the value for this output column. In this expression hierarchy navigation and set operators are available.
So the above example would be expressed as:
from hierarchyEvaluate(
< select Country, State, City, sum(Sales)
from Sales
group by Country, State, City
>
partition by Country
order by Country, State, City,
<['Country', 'State', 'City']>,
<[
["Sales / Ancestor('Country', 'Sales') * 100.0", "% Country"],
["TopN(Descendants('City'), 'Sales', ,1)", "Top City"],
["Avg(Descendants('City'), Sales)", "Avg. City"]
]>)
select Country, State, City, '% Country', 'Top City', 'Avg. City'
into path='/tmp/wout'
- Assumes data coming is at the lowest grain of the hierarchy and sorted from highest level to lowest level. We use a HQL query to aggregate raw data, we use the partition & order clauses to get data to an instance of the function in the righ order. Assume level columns are Strings for now.
- Outputs a row for each Member of the hierarchy. So the output has all the rows in the input plus Output for members at higher levels in the hierarchy.
- when evaluating expressions, navigation functions like Parent, Ancestor, Sibling, Descendants are available. (exact list and semantics to be flushed out; but based on MDX functions)
- when evaluating expressions Set operations like Order, Sum, Avg, TopN will be available.
- Functions can evaluate to a value or a Set or to a Member.
So the output from the sample Query is something like this:
| Country | State | City | Sales | % Country | Top City | Avg. City |
|---|---|---|---|---|---|---|
| US | - | - | 100 | 100 | New York | |
| US | CA | - | 45 | 45 | San Fran | 15 |
| US | CA | San Fran | 25 | 55.5 | San Fran | 25 |
| US | CA | San Jose | 10 | 22.2 | San Jose | 10 |
| US | CA | LA | 10 | 22.2 | LA | 10 |
| US | NY | - | 55 | 55 | New York | 55 |
| US | NY | New York | 55 | 55 | New York | 55 |