# Programming guide

> ⚠️ If you want to get started analyzing data, see the [analysis guide](/guide_analysis.html). This guide gives an overview of how to implement and extend most parts of siuba. It also discusses the rationale behind siuba's architecture.

In [1]:
import pandas as pd

pd.set_option("display.max_rows", 5)

In [2]:
%%html

<style>
.table-container table {
  margin-left: 0;
}
</style>


<div class="table-container">

| feature                                    | siuba   | dplython   | pandas   |
|:-------------------------------------------|:--------|:-----------|:---------|
| Column operations are pandas Series methods | ✅      | ✅         | ✅       |
| Table verbs support user-defined functions | ✅      | ✅         | ✅       |
| pipe syntax (`>>`)                          | ✅      | ✅         | ❌       |
| concise, **lazy expressions** (`_.a + _.b`) | ✅      | ✅         | ❌       |    
| No more reset_index                        | ✅      | ✅         | ❌       |
| **unified API** over (un)grouped data      | ✅      | ✅         | ❌       |    
| generate fast grouped operations           | ✅      | ❌         | ✅       |
| **generate SQL queries**                   | ✅      | ❌         | ❌       |
| Abstract syntax trees for **transforming operations**   | ✅      | ❌         | ❌       |    
| handles nested data                        | ✅      | ❌         | ⚠️        |

</div>


Siuba aims to meet three goals:

* **pandas overlap:** allow users to re-use their knowledge of pandas methods.
* **dplython overlap:** provide more flexible, consistent syntax for operating on columns of data.
* **siuba specific:** support a broader set of data sources, like SQL.





## Column operations

For example, on ungrouped data, siuba's `mutate` function:

* takes a pandas DataFrame.
* uses Series operations just like the `DataFrame.assign()` method would.

In [3]:
from siuba.data import mtcars
from siuba import mutate

cars = mtcars[["hp", "mpg", "cyl"]]

# pandas assign method
cars.assign( demean = lambda d: d.mpg - d.mpg.mean())

# siuba mutate function
mutate(cars, demean = lambda d: d.mpg - d.mpg.mean())

Unnamed: 0,hp,mpg,cyl,demean
0,110,21.0,6,0.909375
1,110,21.0,6,0.909375
...,...,...,...,...
30,335,15.0,8,-5.090625
31,109,21.4,4,1.309375


This means that you can use and debug Series methods just like you would with pandas.

For **grouped data** or a **SQL database**, siuba can't call Series methods directly, because those objects don't have them.
For example, on grouped data, the same operation in pandas would be:


In [4]:
# create grouped data
g_cyl = cars.groupby('cyl')

# error: g_cyl doesn't have an .assign method! :/
# g_cyl.assign

cars.assign(demean = g_cyl.mpg.transform(lambda x: x - x.mean()))

Unnamed: 0,hp,mpg,cyl,demean
0,110,21.0,6,1.257143
1,110,21.0,6,1.257143
...,...,...,...,...
30,335,15.0,8,-0.100000
31,109,21.4,4,-5.263636


In this grouped data case, siuba runs the pandas methods needed to execute the operation. However, different pandas methods have different limitations. For example, the `.transform()` method can only operate on one column of data at a time. To allow the flexibility of pandas `.assign` operations, siuba needs a simple but robust strategy for representing grouped operations.
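As a rough illustration of the single-column limitation, the sketch below builds a grouped result that involves two columns out of separate `.transform()` calls. It uses a small hypothetical DataFrame named `toy` (not the `mtcars` data), and it is not necessarily how siuba implements grouped operations internally.

```python
import pandas as pd

# hypothetical toy data, standing in for the cars table
toy = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "hp": [60, 100, 110, 130],
    "mpg": [30.0, 20.0, 20.0, 10.0],
})
g_toy = toy.groupby("cyl")

# each .transform() call sees a single column,
# and returns a Series aligned to the rows of toy
mean_hp = g_toy["hp"].transform("mean")

# combine the aligned pieces with ordinary Series arithmetic
result = toy.assign(
    hp_demean=toy["hp"] - mean_hp,
    mpg_per_mean_hp=toy["mpg"] / mean_hp,
)
print(result)
```

Because every intermediate value is a plain Series, each piece can be inspected on its own before it is combined.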

The good news is that you can test the grouped approach piece by piece too!


In [5]:
# TODO

## Table verbs

You may be wondering how a siuba function, like mutate, could work on a SQL database.
It can because these functions are defined using a technique called single dispatch, where the implementation that runs is chosen by the type of the first argument.

In [6]:
from siuba.dply.verbs import singledispatch2

# DataFrame version of function ---

@singledispatch2(pd.DataFrame)
def head(__data, n = 5):
    return __data.head(n)

head(cars, 2)

Unnamed: 0,hp,mpg,cyl
0,110,21.0,6
1,110,21.0,6


In [7]:
# SQL version of function ---
from sqlalchemy import Table, Column, MetaData

@head.register(Table)
def _head_sql(__data, n = 5):
    return __data.select().limit(n)

table = Table("some_table", MetaData(), Column('a'), Column('b'))

print(
    head(table, 2)
)

SELECT some_table.a, some_table.b 
FROM some_table
 LIMIT :param_1


Why use singledispatch rather than a class method like `mtcars.head()`?

There are two big benefits:

1. **Anyone can cleanly define and package a function**. Using it is just a matter of importing it. With a method, you need to somehow put it onto the class representing your data. You end up with 300+ methods on a class.
2. Your function might do something that is **not the class's core responsibility**. In this case, it should not be part of the class definition.
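The same idea is available in the standard library as `functools.singledispatch`. The sketch below is a minimal illustration that uses plain lists and dicts in place of DataFrames and SQL tables. The `head` defined here is a stand-in written for this example, not siuba's function.

```python
from functools import singledispatch


@singledispatch
def head(data, n=5):
    # fallback for types with no registered implementation
    raise TypeError(f"head() does not support {type(data).__name__}")


@head.register(list)
def _head_list(data, n=5):
    return data[:n]


@head.register(dict)
def _head_dict(data, n=5):
    # keep the first n key-value pairs
    return dict(list(data.items())[:n])


first_two = head([1, 2, 3], 2)
first_pair = head({"a": 1, "b": 2}, n=1)
print(first_two, first_pair)
```

Note that neither `list` nor `dict` had to be modified: each implementation lives next to the function, and could ship in a separate package.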

One challenge with adding functions through singledispatch is that combining function calls results in a big [(dagwood) sandwich](http://wiki.c2.com/?ThickBreadSmell) of parentheses.

```python
from siuba import _, mutate, filter, head
from siuba.data import mtcars

# run filter, then mutate, then head
head(mutate(filter(mtcars, _.cyl == 4), demeaned = _.hp - _.hp.mean()))
```

Note that the order of execution is from inside to outside. These issues are resolved in siuba by using a pipe.

## Pipe syntax

In the previous section I discussed how siuba uses singledispatch. This allows people to define new functions that are easy to package and import, and that can handle both a pandas DataFrame and a SQLAlchemy table.

One challenge is finding a way to combine functions so that they execute from "top to bottom". In pandas this is done using method chaining. For example, the code below starts with `cars`, then runs `.assign()`, then runs `.head()`.

In [8]:
(cars
  .assign(hp_per_cyl = lambda d: d.hp / d.cyl)
  .head(2)
)

Unnamed: 0,hp,mpg,cyl,hp_per_cyl
0,110,21.0,6,18.333333
1,110,21.0,6,18.333333


Here is the siuba equivalent of the code above, first without and then with piping.

In [9]:
# without pipe ----
head(
    mutate(
        cars,
        hp_per_cyl = lambda d: d.hp / d.cyl),
    2
)

# with pipe ----
(cars
  >> mutate(hp_per_cyl = lambda d: d.hp / d.cyl)
  >> head(2)
)

Unnamed: 0,hp,mpg,cyl,hp_per_cyl
0,110,21.0,6,18.333333
1,110,21.0,6,18.333333


Notice how in this case we can import just the two functions we're using: head and mutate. Functions defined with `singledispatch2`--as described in the previous section--can be used in the same way. Indeed, this function is what created `head` and `mutate`.

Under the hood, function calls like the one below are turned into a Pipeable object.

In [10]:
mutate(hp_per_cyl = lambda d: d.hp / d.cyl)

<siuba.dply.verbs.Pipeable at 0x10eec1d30>

This happens when the function's first positional argument is not a known data source, like a DataFrame or SQL database. This is the case above.
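To make the mechanism concrete, here is a minimal sketch of how such a pipe could work. The `Pipeable` class, the `pipe_aware` decorator, and `KNOWN_SOURCES` are hypothetical names written for this example, with plain lists as the only "data source". siuba's real implementation may differ.

```python
from functools import wraps


class Pipeable:
    """Hypothetical stand-in: wraps a function that still needs its data."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # runs for `data >> pipeable` when data itself does not handle >>
        return self.func(data)


# types treated as data sources in this sketch
KNOWN_SOURCES = (list,)


def pipe_aware(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if args and isinstance(args[0], KNOWN_SOURCES):
            # data was passed directly, so call the function right away
            return func(*args, **kwargs)
        # otherwise, wait for the data to arrive through >>
        return Pipeable(lambda data: func(data, *args, **kwargs))

    return wrapper


@pipe_aware
def head(data, n=5):
    return data[:n]


@pipe_aware
def add_one(data):
    return [x + 1 for x in data]


direct = head([1, 2, 3], 2)
piped = [1, 2, 3] >> add_one() >> head(2)
print(direct, piped)
```

The key design choice is that the decision between "call now" and "wait for data" depends only on the type of the first positional argument.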

Alternatively, you can explicitly create a pipe by passing an instance of siuba's Symbolic class. This is shown in the code below, and explained in detail in the following section.

In [11]:
from siuba import _

mutate(_, hp_per_cyl = lambda d: d.hp / d.cyl)

<siuba.dply.verbs.Pipeable at 0x11801ceb8>

## Lazy expressions (`_.a + _.b`)

TODO: mention limitations (eg external function calls)

Up to this point, we've used lambda functions like the one below to express operations on a DataFrame's columns.

In [12]:
from siuba import summarize

summarize(cars, hp_mean = lambda d: d.hp.mean())

Unnamed: 0,hp_mean
0,146.6875


However, peppering our analysis with lambda functions creates two challenges:

1. writing `lambda d:` can take up as many characters as its operation `d.hp.mean()`.
2. lambdas can execute some operations, but cannot tell us **what** operations they will execute.

siuba handles these challenges by using a Symbolic operator.

In [13]:
from siuba import _

summarize(cars, avg_hp = _.hp.mean())

Unnamed: 0,avg_hp
0,146.6875


### Declaring **what** to perform

In [14]:
_.hp.mean()

█─'__call__'
└─█─.
  ├─█─.
  │ ├─_
  │ └─'hp'
  └─'mean'
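One way to picture how such a tree gets built is a class that records attribute access and calls instead of performing them. The `Sym` class and `evaluate` function below are hypothetical and written only for this illustration, using nested tuples for the tree. They are not siuba's Symbolic implementation.

```python
from types import SimpleNamespace


class Sym:
    """Hypothetical mini-Symbolic: records operations instead of running them."""

    def __init__(self, tree="_"):
        self.tree = tree

    def __getattr__(self, name):
        # only reached for attributes that normal lookup does not find
        if name.startswith("__"):
            raise AttributeError(name)
        return Sym((".", self.tree, name))

    def __call__(self, *args):
        return Sym(("__call__", self.tree) + args)


def evaluate(tree, data):
    """Run a recorded tree against some data, like calling a lambda."""
    if tree == "_":
        return data
    op = tree[0]
    if op == ".":
        return getattr(evaluate(tree[1], data), tree[2])
    if op == "__call__":
        return evaluate(tree[1], data)(*tree[2:])
    raise ValueError(f"unsupported node: {op!r}")


_ = Sym()
expr = _.name.upper()

print(expr.tree)
print(evaluate(expr.tree, SimpleNamespace(name="mazda")))
```

Unlike a lambda, `expr` can be inspected before anything runs, which is what makes later translation possible.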

### Translating from what to **how**

This is not just shorter to write; it also enables siuba to convert an operation to the right SQL code.

In [16]:
from siuba.data import cars_sql
from siuba import group_by, summarize, mutate, show_query

q = summarize(cars_sql, avg_hp = _.hp.mean()) >> show_query()

SELECT avg(cars.hp) AS avg_hp 
FROM cars


In [17]:
from siuba.dply.vector import n

q = summarize(cars_sql, ttl = n(_.hp)) >> show_query()

SELECT count(*) AS ttl 
FROM cars


Depending on whether we are using a verb like summarize, filter, or mutate, the exact query generated can take different forms.

For example, the code below calculates `demeaned` using an aggregate (`_.hp.mean()`) inside a grouped mutate, so it requires a window function with a `PARTITION BY` clause.

In [18]:
q = (cars_sql 
  >> group_by("cyl")
  >> mutate(
       demeaned = _.hp - _.hp.mean(),
       mpg_per_hp = _.mpg / _.hp,
  )
  >> show_query()
)

SELECT cars.cyl, cars.mpg, cars.hp, cars.hp - avg(cars.hp) OVER (PARTITION BY cars.cyl) AS demeaned, cars.mpg / cars.hp AS mpg_per_hp 
FROM cars
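The translation step can be sketched as a recursive walk over an expression tree. The example below is a simplified illustration under stated assumptions: the tree is written by hand as nested tuples, and `to_sql` and the `AGGREGATES` mapping are hypothetical names, not part of siuba. It only covers column access, arithmetic, and aggregates inside a mutate.

```python
# hypothetical mapping from method names to SQL aggregate functions
AGGREGATES = {"mean": "avg"}


def to_sql(tree, table, group_by=None):
    op = tree[0]
    if op == ".":
        # column access such as _.hp
        return f"{table}.{tree[2]}"
    if op in ("+", "-", "*", "/"):
        left = to_sql(tree[1], table, group_by)
        right = to_sql(tree[2], table, group_by)
        return f"{left} {op} {right}"
    if op == "__call__":
        # method call such as _.hp.mean(): tree[1] is (".", target, method)
        _dot, target, method = tree[1]
        partition = f"PARTITION BY {table}.{group_by}" if group_by else ""
        column = to_sql(target, table, group_by)
        return f"{AGGREGATES[method]}({column}) OVER ({partition})"
    raise ValueError(f"unsupported node: {op!r}")


hp = (".", "_", "hp")
mpg = (".", "_", "mpg")

# hand-written tree for _.hp - _.hp.mean()
demeaned = ("-", hp, ("__call__", (".", hp, "mean")))

print(to_sql(demeaned, "cars", group_by="cyl"))
print(to_sql(("/", mpg, hp), "cars", group_by="cyl"))
```

Only the aggregate picks up an `OVER` clause; the plain arithmetic in `mpg / hp` passes through unchanged, which matches the shape of the query above.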


siuba's SQL generation is discussed in more detail in (TODO: LINK TO) generate sql queries.

### With user defined functions

In [15]:
from siuba.dply.vector import n

n(_.hp)

█─'__call__'
├─█─'__custom_func__'
│ └─<function n at 0x117f87e18>
└─█─.
  ├─_
  └─'hp'

## Handling indexes


## Unified API over (un)grouped data

In general, grouped calculations in pandas use a different syntax from ungrouped calculations.

For example, the code below shows how we could perform an efficient assign (siuba's mutate) on both kinds of data.

In [1]:
g_cyl = cars.groupby("cyl")

# ungrouped assign ----
cars.assign(demeaned = lambda d: d.hp - d.hp.mean())

# grouped assign, with g_cyl only ----
demeaned = g_cyl.obj.hp - g_cyl.hp.transform("mean")
g_cyl.obj.assign(demeaned = demeaned)


### Separating groupings, verbs, and operations

Notice how much more work the grouped approach is. Ideally a user should be able to specify these three things independently:

1. any groupings--like `cars.groupby("cyl")`
2. table actions--like `mutate`, `summarize`, or `filter`
3. column operations--like `_.hp - _.hp.mean()`

In [24]:
from siuba import mutate, _

# ungrouped assign ----
mutate(cars, demeaned = lambda d: d.hp - d.hp.mean())

# grouped assign ----
mutate(g_cyl, demeaned = lambda d: d.hp - d.hp.mean())

Unnamed: 0,hp,mpg,cyl,demeaned
0,110,21.0,6,-12.285714
1,110,21.0,6,-12.285714
...,...,...,...,...
30,335,15.0,8,125.785714
31,109,21.4,4,26.363636


Note that for operations that return a boolean Series or array, we can just swap out mutate for filter to change the end result. Rather than create a new column, it will remove rows where the operation is False.

In [25]:
from siuba import filter

# swap out mutate, return rows where hp > average hp
filter(g_cyl, lambda d: d.hp > d.hp.mean())

Unnamed: 0,hp,mpg,cyl
2,93,22.8,4
6,245,14.3,8
...,...,...,...
30,335,15.0,8
31,109,21.4,4


### Operating over multiple columns

Another important improvement is that we can use multiple columns in our summarize operations. Since `DataFrame.agg()` only operates on single columns, this is a big improvement for functions like summarize.

In [26]:
summarize(g_cyl, avg_mpg_per_hp = lambda d: (d.mpg / d.hp).mean())

Unnamed: 0,cyl,avg_mpg_per_hp
0,4,0.351103
1,6,0.166018
2,8,0.076657


Such flexibility can come at the cost of speed. However, in the same way that siuba can generate SQL code, it can also execute grouped operations that use the fastest pandas code possible.
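For comparison, the flexible route in plain pandas is usually `groupby(...).apply(...)` with a Python function that runs once per group. The sketch below uses a small hypothetical DataFrame named `toy` rather than `mtcars`, and shows only the pandas side, not siuba's internals.

```python
import pandas as pd

# hypothetical toy data, standing in for the cars table
toy = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "hp": [50, 100, 100, 200],
    "mpg": [25.0, 25.0, 20.0, 20.0],
})

# select the value columns, so each group passed to the lambda
# is a DataFrame holding only mpg and hp
per_group = (
    toy.groupby("cyl")[["mpg", "hp"]]
    .apply(lambda d: (d["mpg"] / d["hp"]).mean())
)

# turn the Series (indexed by cyl) back into a regular table
summary = per_group.rename("avg_mpg_per_hp").reset_index()
print(summary)
```

The lambda is called once for every group, which is where the speed cost comes from when there are many groups.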

## Transforming operations



You may have noticed in previous sections that some features--like making grouped pandas operations fast and executing SQL--require declaring what you want to do with `_`. Ultimately, when you write an operation like `_.a + _.b`, it results in a new Symbolic object that can do two things.

1. be executed like a lambda
2. allow self-representation and transformation through an abstract syntax tree

For example, consider the representation from the code below.

In [31]:
symbol = _.a + _.b

symbol

█─+
├─█─.
│ ├─_
│ └─'a'
└─█─.
  ├─_
  └─'b'

In [32]:
from siuba.siu import strip_symbolic

call = strip_symbolic(symbol)
call

_.a + _.b

A call has three attributes:

* func: the function being called (eg `__add__` for addition)
* args: positional arguments passed to the call
* kwargs: keyword arguments passed to the call

This makes it very easy to inspect and modify calls. For example, we could change the function from addition to subtraction.

In [33]:
call.func = '__sub__'
call

_.a - _.b

Note that in practice, calls should not be modified in place like that. The `siuba.siu` module implements a common tool called a `TreeVisitor`. This is similar in spirit to the `NodeVisitor` class in Python's built-in `ast` module.
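The sketch below illustrates the non-mutating alternative: a function that walks a tree and returns a modified copy. The `Call` dataclass here is a hypothetical stand-in that mirrors the three attributes listed above, and `swap_func` is written for this example. Neither is siuba's actual class or visitor.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Call:
    """Hypothetical stand-in with the same three attributes as a call."""

    func: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


def swap_func(node, old, new):
    """Return a copy of the tree where calls to `old` become calls to `new`."""
    if not isinstance(node, Call):
        # leaves (such as names or constants) pass through unchanged
        return node
    new_args = tuple(swap_func(arg, old, new) for arg in node.args)
    new_kwargs = {k: swap_func(v, old, new) for k, v in node.kwargs.items()}
    return Call(new if node.func == old else node.func, new_args, new_kwargs)


# hand-built tree for _.a + _.b
original = Call(
    "__add__",
    (Call("__getattr__", ("_", "a")), Call("__getattr__", ("_", "b"))),
)
swapped = swap_func(original, "__add__", "__sub__")

print(original.func, swapped.func)
```

Because the original tree is left untouched, the same expression can safely be reused or translated in several different ways.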

To learn more, see TODO: link developer doc or ADR.

## Backends

### Fast grouped operations

The previous section discussed how siuba uses a consistent API to separate any groupings, operations to apply, and how to combine results. Thinking about these three concepts separately makes them easy to tweak individually.

(TODO probably move this to prev section)


In [27]:
from siuba import group_by, mutate, filter, _

grouping = group_by("cyl")

action1 = mutate
action2 = filter

operation = _.hp > _.hp.mean()

In [28]:
cars >> grouping >> action1(result = operation)

Unnamed: 0,hp,mpg,cyl,result
0,110,21.0,6,False
1,110,21.0,6,False
...,...,...,...,...
30,335,15.0,8,True
31,109,21.4,4,True


In [29]:
cars >> grouping >> action2(operation)

Unnamed: 0,hp,mpg,cyl
2,93,22.8,4
6,245,14.3,8
...,...,...,...
30,335,15.0,8
31,109,21.4,4


Critically, the same tools that allow siuba to generate SQL code also allow it to generate highly performant grouped operations.


To see what functions are supported in siuba's fast grouped operations, see this method support table. (TODO: link)

### Querying SQL database

A killer feature of siuba is that you can run your code locally in pandas, or against a SQL database.

TODO: use sqlite to demo? support later version of sqlite, so can use partitions etc?

In [30]:
from siuba import _, group_by, mutate, show_query

q = (cars_sql
  >> group_by("cyl")
  >> mutate(demeaned = _.hp - _.hp.mean(0))
  >> show_query()
)

SELECT cars.cyl, cars.mpg, cars.hp, cars.hp - avg(cars.hp, 0) OVER (PARTITION BY cars.cyl) AS demeaned 
FROM cars


Reference to...

* SQL UDF example
* ADR on call trees and SQL

## Nested data