In [1]:
import MacroTools: postwalk
import DataFrames: DataFrame

### Rewriting Expressions as Functions over Tuples or Named Tuples

Suppose that we want to offer an API like `@select(df, col3 = sin(col1) + cos(col2))` for working with DataFrames. One approach is to take the expression `sin(col1) + cos(col2)` and rewrite it into an anonymous function that maps tuples to scalars. This transformation is easiest to understand with named tuples, but the code in this notebook supports both tuples and named tuples.

In the named tuple case, the transformation looks like:

```
row -> sin(row.col1) + cos(row.col2)
```
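To make the target concrete, here is a hand-written version of the function the rewrite is meant to produce, applied to a few rows. The row values and the names `f_named`, `f_indexed`, and `rows` are made up for illustration; the indexed variant assumes `col1` is stored first and `col2` second.

```julia
# Named tuple form: columns are looked up by field name.
f_named = row -> sin(row.col1) + cos(row.col2)

# Plain tuple form: columns are looked up by position.
f_indexed = row -> sin(row[1]) + cos(row[2])

# Illustrative rows only.
rows = [(col1 = 0.0, col2 = 0.0), (col1 = pi / 2, col2 = pi)]

named_results = map(f_named, rows)
indexed_results = map(f_indexed, rows)  # named tuples also support integer indexing
```

Both calls produce the same values, since the two forms differ only in how each column is fetched from the row.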

In what follows, we'll show how to do this. Our approach can easily be extended to offer additional functionality such as automatic lifting of functions to ensure that they can process the `missing` value or even the introduction of three-valued logic.

Our approach will involve the following rules:

1. If a symbol in an expression occurs in a syntactic position that implies it is a function name, we'll assume it's a function name.
2. All other symbols are assumed to be column names.
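The "syntactic position" in rule 1 can be inspected directly: Julia parses both `sin(b)` and the infix `a + ...` into `Expr` objects whose head is `:call` and whose first argument is the thing being called. A small sketch (the variable names `e` and `inner` are just for illustration):

```julia
e = :(a + sin(b))

# The outer expression is a call to `+` with two arguments.
outer_head = e.head        # :call
outer_callee = e.args[1]   # :+

# The second operand is itself a call expression, this time to `sin`.
inner = e.args[3]
inner_head = inner.head      # :call
inner_callee = inner.args[1] # :sin
```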

Based on these rules, we'll rewrite expressions by making three passes through the expression:

1. Find all function names by finding all symbols that are syntactically treated like function names in the expression and accumulating them in a `Set{Symbol}`.
2. Pass through the expression again and find all column names by accumulating all symbols that are not function names in a `Set{Symbol}`.
3. Rewrite all column name symbols as either (a) numeric tuple indexing or (b) named tuple field access, depending on a Boolean flag.

In [2]:
function expr_to_tuple_function_expr(e::Any, named::Bool)
    function_names = find_function_names(e)
    column_names = find_column_names(e, function_names)
    column_name_to_index = Dict(column_names .=> 1:length(column_names))
    tuple_name = gensym()
    anon_func_body = postwalk(
        e′ -> symbol_to_tuple_index(
            e′,
            function_names,
            column_names,
            column_name_to_index,
            tuple_name,
            named,
        ),
        e,
    )
    (
        :($tuple_name -> $anon_func_body),
        collect(column_names),
    )
end

expr_to_tuple_function_expr (generic function with 1 method)

To make this work, we need to define the core functions: we'll start with `find_function_names`, which is easy to write using the `postwalk` function in the MacroTools package.

In [3]:
function find_function_names(e::Any)
    function_names = Set{Symbol}()
    postwalk(
        e′ -> update_function_names!(function_names, e′),
        e,
    )
    function_names
end

find_function_names (generic function with 1 method)

In [4]:
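function update_function_names!(function_names::Set{Symbol}, e::Any)
    # Record the callee of every call expression. Callees that are not plain
    # symbols (e.g. the `Base.sin` in `Base.sin(x)`) are skipped, since they
    # cannot be stored in a `Set{Symbol}`.
    if isa(e, Expr) && e.head == :call && isa(e.args[1], Symbol)
        push!(function_names, e.args[1])
    end
    e
end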

update_function_names! (generic function with 1 method)

Let's try it out:

In [5]:
find_function_names(:(a + sin(b)))

Set{Symbol} with 2 elements:
  :+
  :sin
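For readers who want to see the traversal without MacroTools, the same collection can be sketched with plain recursion over `Expr` arguments. The helper names `collect_call_names!` and `collect_call_names` are hypothetical and not part of this notebook's API. The sketch only records callees that are plain symbols, so a qualified call such as `Base.sin(b)`, whose callee is itself an `Expr`, contributes nothing.

```julia
# Hypothetical, dependency-free counterpart to `find_function_names`.
function collect_call_names!(names::Set{Symbol}, e)
    if e isa Expr
        # Only plain-symbol callees fit in a Set{Symbol}.
        if e.head == :call && e.args[1] isa Symbol
            push!(names, e.args[1])
        end
        # Recurse into every sub-expression.
        foreach(arg -> collect_call_names!(names, arg), e.args)
    end
    names
end

collect_call_names(e) = collect_call_names!(Set{Symbol}(), e)

simple_names = collect_call_names(:(a + sin(b)))
qualified_names = collect_call_names(:(Base.sin(b)))
```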

Now we'll implement the search for column names:

In [6]:
function find_column_names(e::Any, function_names::Set{Symbol})
    column_names = Set{Symbol}()
    postwalk(
        e′ -> update_column_names!(column_names, e′, function_names),
        e,
    )
    column_names
end

find_column_names (generic function with 1 method)

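In this case, we want to distinguish two cases: symbols that were already identified as function names, which we skip, and all other symbols, which we record as column names.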

In [7]:
function update_column_names!(column_names::Set{Symbol}, e::Any, function_names::Set{Symbol})
    if isa(e, Symbol) && !(e in function_names)
        push!(column_names, e)
    end
    e
end

update_column_names! (generic function with 1 method)

In [8]:
let e = :(a + sin(b))
    find_column_names(e, find_function_names(e))
end

Set{Symbol} with 2 elements:
  :a
  :b
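One consequence of rule 2 is worth keeping in mind. Numeric and Boolean literals are stored in an expression as values, so they are never mistaken for columns, but named constants such as `pi` and `missing` are ordinary symbols and would be classified as column names under the rules above. The sketch below only inspects how the parser represents each case; the variable names are illustrative.

```julia
# Literals appear as values in the expression tree.
numeric_arg = :(a + 1).args[3]
bool_arg = :(a & true).args[3]

# Named constants appear as symbols, just like column names do.
constant_arg = :(a + pi).args[3]
missing_arg = :(coalesce(a, missing)).args[3]

literal_is_symbol = numeric_arg isa Symbol    # false
constant_is_symbol = constant_arg isa Symbol  # true
```

Handling such names would need an extra rule, for example a list of reserved symbols, which this notebook does not implement.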

In [9]:
function symbol_to_tuple_index(
    e::Any,
    function_names::Set{Symbol},
    column_names::Set{Symbol},
    column_name_to_index::Dict{Symbol, Int},
    tuple_name::Symbol,
    named::Bool,
)
    if isa(e, Symbol) && e in column_names
        if !named
            :($(tuple_name)[$(column_name_to_index[e])])
        else
            :($(tuple_name).$e)
        end
    else
        e
    end
end

symbol_to_tuple_index (generic function with 1 method)

In [10]:
expr_to_tuple_function_expr(:(a + sin(b)), false)

(:(var"##254"->begin
          #= In[2]:18 =#
          var"##254"[1] + sin(var"##254"[2])
      end), [:a, :b])

In [11]:
expr_to_tuple_function_expr(:(a + sin(b)), true)

(:(var"##255"->begin
          #= In[2]:18 =#
          (var"##255").a + sin((var"##255").b)
      end), [:a, :b])
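To check that an expression of this shape behaves as intended, it can be evaluated into a function and called on a tuple. The sketch below builds the expression by hand rather than calling the functions defined above, so that it runs on its own; `t`, `ex`, and `f` are illustrative names.

```julia
# `t` stands in for the gensym'd tuple name used in the generated code.
t = gensym()
ex = :($t -> $(t)[1] + sin($(t)[2]))

# Turn the expression into a callable function.
f = eval(ex)

# `Base.invokelatest` sidesteps world-age errors if this runs inside a
# function; at the top level of a notebook cell, `f((1, 0.0))` works as well.
result = Base.invokelatest(f, (1, 0.0))
```

Here the tuple `(1, 0.0)` plays the role of one row with `a = 1` and `b = 0.0`.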

To get a sense of how we might use this, let's write a macro that performs a SQL-like select in which users write something like:

```
@select(df, c = a + sin(b), d = a - b)
```

To make this work, we'll do a few things:

1. We'll define a method to construct a tuple iterator from a DataFrame. The iterator yields one tuple per row, to which we can apply a tuple function.
2. For every expression in the list of macro arguments, we'll translate it into a tuple function, then we'll apply that function to the tuple iterator.
3. We'll construct a new DataFrame from the generated columns.

In [12]:
macro select(df, es...)
    kwargs = Any[]
    for assignment_e in es
        @assert isa(assignment_e, Expr) && assignment_e.head == :(=)
        res_name = assignment_e.args[1]
        e = assignment_e.args[2]
        anon_func_expr, column_names = expr_to_tuple_function_expr(e, false)
        res_column = quote
            map(
                $anon_func_expr,
                get_tuple_iterator($(esc(df)), $column_names),
            )
        end
        push!(kwargs, Expr(:kw, res_name, res_column))
    end
    quote
        DataFrame(
            $(kwargs...),
        )
    end
end

@select (macro with 1 method)
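The macro relies on the fact that keyword arguments in a call are represented as `Expr(:kw, name, value)` nodes, which can be splatted into a quoted call. A minimal sketch of that mechanism, using a made-up function name `make_table` that is never actually called here:

```julia
# Each keyword argument is an Expr with head :kw.
kwargs = Any[
    Expr(:kw, :c, :([1, 2])),
    Expr(:kw, :d, :([3, 4])),
]

# Splatting the list into a quoted call yields a keyword-argument call.
call_expr = :(make_table($(kwargs...)))

# The hand-built expression matches what the parser produces.
parsed_expr = :(make_table(c = [1, 2], d = [3, 4]))
same_expr = call_expr == parsed_expr
```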

In [13]:
function get_tuple_iterator(df::DataFrame, names::Vector{Symbol})
    requested_columns = [df[!, name] for name in names]
    zip(requested_columns...)
end

get_tuple_iterator (generic function with 1 method)
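The iterator is just a `zip` over the requested columns, so the idea can be tried on plain vectors without a DataFrame. In the sketch below, `a` and `b` are stand-ins for two columns, and the anonymous function is written by hand in the indexed form that the rewrite produces.

```julia
a = [1, 2, 3]
b = [2.1, 3.4, missing]

# Each element of the iterator is one row, as a tuple (a[i], b[i]).
rows = zip(a, b)
first_row = first(rows)

# Apply an indexed tuple function to every row.
c = map(t -> t[1] + sin(t[2]), rows)
```

The last entry of `c` is `missing` because `sin(missing)` and `3 + missing` both propagate `missing`.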

In [14]:
df = DataFrame(a = [1, 2, 3], b = [2.1, 3.4, missing])

| Row | a (Int64) | b (Float64?) |
|-----|-----------|--------------|
| 1 | 1 | 2.1 |
| 2 | 2 | 3.4 |
| 3 | 3 | missing |


In [15]:
@select(df, c = a + sin(b), d = a - b)

| Row | c (Float64?) | d (Float64?) |
|-----|--------------|--------------|
| 1 | 1.86321 | -1.1 |
| 2 | 1.74446 | -1.4 |
| 3 | missing | missing |


Possible extensions include:

1. Support for referencing columns introduced earlier in the same `@select` call, processed left to right.
2. Support for pulling in local variables with `$`.
3. Support for passing `*` as an argument that returns all existing columns.
4. Support for lifting functions to ensure they process `missing` correctly.
5. Support for rewriting everything in broadcasting form instead of using an iterator.
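
As a rough illustration of what lifting could look like, the sketch below wraps a function so that any `missing` argument short-circuits to `missing`. The helper name `lift` is hypothetical and is not something this notebook implements; the Missings.jl package offers a `passmissing` function along similar lines.

```julia
# Hypothetical helper: return `missing` if any argument is missing,
# otherwise call the wrapped function as usual.
lift(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

# `string` does not propagate `missing` on its own.
unlifted = string("a", missing)       # "amissing"

# The lifted version does.
lifted = lift(string)("a", missing)   # missing
regular = lift(string)("a", "b")      # "ab"
```

A rewrite pass could wrap each function name found in the first pass with such a helper, although many Base functions (like `sin` and `+` in the examples above) already propagate `missing` without help.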