Skip to content

HierarchyEvaluation

Harish Butani edited this page Apr 18, 2012 · 2 revisions

Hierarchical Evaluation

Functions we have described so far treat Paritions as flat lists of rows. Hierarchy takes a hierarchy Structure argument that describes the levels of a Hierarchy in terms of columns in the partition; it then enables hierarchical navigation and computations based on sets of things imposed by the hierarchy.

Assume that we have a Sales Table in Hive with the following structure: (LineItem, Country, State, City, Part, Sales). And we want to compute the following result:

Country State City Sales % Country Top City Avg. City
  • there are separate rows for Country, State and Cities (may have indicator fields to indicate the level of a row, like in the Grouping clause)
  • % Country returns the % of Sales that a City or State contributes to its Country.
  • Top City is the Top City within a Country or State.
  • Avg. City is the Avg Sales per City within a State or across a Country.

We propose to implement a Table function with the following signature and behavior:

Signature: HierarchyEvaluate( ArrayList<String> hierarchyLevelColumns, ArrayList<GroovyExpression> hierExpressions)

  • hierarchyLevelColumns: describes the hierarchy being considered. Initially this will have to match the Order specification; so this is informational only, and may not be included in the function at all.
  • hierExpressions : This is a List, where each element is just a GroovyExpression or a pair of the form (GroovyExpression, alias). The GroovyExpression defines how to compute the value for this output column. In this expression hierarchy navigation and set operators are available.

So the above example would be expressed as:

	  from hierarchyEvaluate( 
             < select Country, State, City, sum(Sales)
               from Sales
               group by Country, State, City
             >
             partition by Country
             order by Country, State, City,
             <['Country', 'State', 'City']>,
             <[
                ["Sales / Ancestor('Country', 'Sales') * 100.0", "% Country"],
                ["TopN(Descendants('City'), 'Sales', ,1)", "Top City"],
                ["Avg(Descendants('City'), Sales)", "Avg. City"]
             ]>)
       select Country, State, City, '% Country', 'Top City', 'Avg. City'
	  into path='/tmp/wout'

Function Details

  • Assumes data coming is at the lowest grain of the hierarchy and sorted from highest level to lowest level. We use a HQL query to aggregate raw data, we use the partition & order clauses to get data to an instance of the function in the righ order. Assume level columns are Strings for now.
  • Outputs a row for each Member of the hierarchy. So the output has all the rows in the input plus Output for members at higher levels in the hierarchy.
  • when evaluating expressions, navigation functions like Parent, Ancestor, Sibling, Descendants are available. (exact list and semantics to be flushed out; but based on MDX functions)
  • when evaluating expressions Set operations like Order, Sum, Avg, TopN will be available.
  • Functions can evaluate to a value or a Set or to a Member.

So the output from the sample Query is something like this:

Country State City Sales % Country Top City Avg. City
US - - 100 100 New York
US CA - 45 45 San Fran 15
US CA San Fran 25 55.5 San Fran 25
US CA San Jose 10 22.2 San Jose 10
US CA LA 10 22.2 LA 10
US NY - 55 55 New York 55
US NY New York 55 55 New York 55

Clone this wiki locally