# `bfill` and `ffill` Interoperability Example

I wrote about a `bfill` and `ffill` strategy on the Ibis Project blog.
You can read the post [here](https://ibis-project.org/docs/3.2.0/blog/ffill-and-bfill-using-ibis/).
In short, the strategy for forward-fill is:

Let `j` be an event group sorted by date and let `i` be a date within `j`.
```
If i is the first date in j, then continue.
If i is not the first date in j then:
    if measurement in i is null then replace it with measurement for i-1.
Otherwise, do nothing.
```
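
The steps above can be sketched in plain Python for a single event group. This is only an illustration of the logic, not the Ibis implementation; the function name `ffill_group` and the tuple layout are assumptions made for this example.

```python
from datetime import date


def ffill_group(rows):
    """Forward-fill missing measurements within one event group.

    `rows` is a list of (measured_on, measurement) tuples for a single
    group; a missing measurement is represented by None.
    """
    filled = []
    previous = None  # value carried over from date i-1
    for measured_on, measurement in sorted(rows, key=lambda r: r[0]):
        if measurement is None:
            # On the first date there is nothing to carry, so it stays None
            measurement = previous
        filled.append((measured_on, measurement))
        previous = measurement
    return filled


group = [
    (date(2021, 5, 8), 11.0),
    (date(2021, 5, 6), 42.0),
    (date(2021, 5, 7), None),
    (date(2021, 5, 9), None),
]
result = ffill_group(group)
print(result)
```

Backward-fill is the same loop run over the dates in descending order.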

Ibis allows you to take any tabular data in any backend supporting the relevant functions and apply this logic to it using the same function.

This notebook demonstrates this using the Pandas, DuckDB, Postgres, and SQLite backends.

This tutorial series does not cover setting up BigQuery, but if you already have `ibis-bigquery` set up, this notebook includes a section for it.
If you want to set up Ibis for BigQuery, there are guides for that on [Google's website](https://cloud.google.com/community/tutorials/bigquery-ibis).

(Disclosure: most of my wrangling for ibis-bigquery was running the connect function and following the prompts until it was error-free.)

`ibis-bigquery` is maintained by Google and is technically third-party (though some Ibis maintainers do contribute/support the `ibis-bigquery` project).

### Data Setup

Let's create some data with some gaps in it.
This data simulates some sets of measurements over a few weeks in 2021.

`event_id` lets us partition the data by event. For example, suppose each group comes from a specific sensor or measurement event, so we must not `ffill` or `bfill` values across groups.

`measured_on` is our order column.
`ffill` and `bfill` only make sense with some sort of order, and in this case we're using dates.

`measurement` is the value we're filling.
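
To see why the partition column matters, here is a small pandas-only sketch. The `toy` frame is hypothetical data made up for this illustration and is not the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: event 1 starts with a missing value
toy = pd.DataFrame({
    "event_id": [0, 0, 1, 1],
    "measured_on": pd.to_datetime(
        ["2021-06-01", "2021-06-02", "2021-06-03", "2021-06-04"]
    ),
    "measurement": [1.0, np.nan, np.nan, 7.0],
}).sort_values("measured_on")

# Filling over the whole column leaks event 0's value into event 1
leaky = toy["measurement"].ffill()

# Filling within each event leaves the gap at the start of event 1 alone
grouped = toy.groupby("event_id")["measurement"].ffill()

print(pd.DataFrame({"leaky": leaky, "grouped": grouped}))
```

In the `leaky` column the first row of event 1 picks up a value measured for event 0, which is exactly what grouping by `event_id` is meant to prevent.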

In [1]:
from datetime import date

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "event_id": [0] * 2 + [1] * 3 + [2] * 5 + [3] * 2
    ,"measured_on": map(
        date
        ,[2021] * 12, [6] * 4 + [5] * 6 + [7] * 2
        ,range(1, 13)
    )
    ,"measurement": np.nan
})

df.at[1, "measurement"] = 5.
df.at[4, "measurement"] = 42.
df.at[5, "measurement"] = 42.
df.at[7, "measurement"] = 11.

df.to_parquet("data.parquet")
df.head()

|   | event_id | measured_on | measurement |
|---|---|---|---|
| 0 | 0 | 2021-06-01 |  |
| 1 | 0 | 2021-06-02 | 5.0 |
| 2 | 1 | 2021-06-03 |  |
| 3 | 1 | 2021-06-04 |  |
| 4 | 1 | 2021-05-05 | 42.0 |


### Directional Fill Function

Here we make some minor modifications to our fill operation.
It now accepts any table expression and lets us choose which column to order by and which column (if any) to group by.

The original how-to guide on `bfill` and `ffill` can be found [here](https://ibis-project.org/docs/latest/how_to/ffill_bfill_w_window/).

In [2]:
import ibis

def dirfill_na(data, order_by='measured_on', group_by=None, value='measurement', fill_dir='f'):
    dirs = {
        'f': ibis.asc
        ,'b': ibis.desc
    }
    if fill_dir not in dirs:
        raise ValueError(f"fill_dir must be one of {list(dirs)}")
    strat_order = dirs[fill_dir](data[order_by])

    # Create a window that orders your series, default ascending
    win = ibis.window(group_by=None if group_by is None else data[group_by], order_by=strat_order, following=0)
    # Create a grouping that is a rolling count of non-null values
    # This creates a partition where each set has no more than one non-null value
    grouped = data.mutate(grouper=data[value].count().over(win))
    # Group by your newly-created grouping and, in each set,
    # set all values to the one non-null value in that set (if it exists)
    result = (
        grouped
        .group_by([grouped.grouper] if not group_by else [grouped[group_by], grouped.grouper])
        .mutate(filled=grouped[value].max())
        .relabel({'filled': f'{fill_dir}fill'})
    )
    # Return an unexecuted expression, sorted in case your backend shuffles rows;
    # call .execute() on it to get a pandas dataframe
    return result.sort_by((order_by, True))
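
The running-count trick in the comments above can be illustrated with plain pandas on a single series. This is a sketch of the idea only, under the assumption of one group already sorted ascending; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 42.0, np.nan, 11.0, np.nan, np.nan])

# Running count of non-null values: each non-null row bumps the counter,
# so a null row shares its counter with the closest earlier non-null row.
grouper = s.notna().cumsum()

# Each counter value now labels a set with at most one non-null entry,
# so max() picks that entry (or stays NaN when the set has none).
filled = s.groupby(grouper).transform("max")

print(pd.DataFrame({"value": s, "grouper": grouper, "filled": filled}))
```

Here `filled` matches what `s.ffill()` returns. Reversing the order before counting gives a backward fill instead, which is what the `'b'` direction does by ordering descending.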

### Pandas

Since our dataframe is already available, let's execute this against the pandas backend.

In [3]:
# Connect to Backend
conn = ibis.pandas.connect({'data': df})

# group by event id, order by measured_on, backfill
dirfill_na(
    conn.table("data")
    ,order_by='measured_on'
    ,group_by='event_id'
    ,value='measurement'
    ,fill_dir='b'
).sort_by(('measured_on', True)).execute()

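|   | event_id | measured_on | measurement | grouper | bfill |
|---|---|---|---|---|---|
| 0 | 1 | 2021-05-05 | 42.0 | 1 | 42.0 |
| 1 | 2 | 2021-05-06 | 42.0 | 2 | 42.0 |
| 2 | 2 | 2021-05-07 |  | 1 | 11.0 |
| 3 | 2 | 2021-05-08 | 11.0 | 1 | 11.0 |
| 4 | 2 | 2021-05-09 |  | 0 |  |
| 5 | 2 | 2021-05-10 |  | 0 |  |
| 6 | 0 | 2021-06-01 |  | 1 | 5.0 |
| 7 | 0 | 2021-06-02 | 5.0 | 1 | 5.0 |
| 8 | 1 | 2021-06-03 |  | 0 |  |
| 9 | 1 | 2021-06-04 |  | 0 |  |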
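
As a sanity check, the same backfill can be computed with pandas' own grouped `bfill`, without going through Ibis. This sketch rebuilds the sample data from the Data Setup cell under a new name, `check`, so that it runs on its own.

```python
from datetime import date

import numpy as np
import pandas as pd

# Same sample data as in the Data Setup cell
check = pd.DataFrame({
    "event_id": [0] * 2 + [1] * 3 + [2] * 5 + [3] * 2,
    "measured_on": list(
        map(date, [2021] * 12, [6] * 4 + [5] * 6 + [7] * 2, range(1, 13))
    ),
    "measurement": np.nan,
})
check.loc[[1, 4, 5, 7], "measurement"] = [5.0, 42.0, 42.0, 11.0]

# Order by date, then backfill within each event
check = check.sort_values("measured_on").reset_index(drop=True)
check["bfill"] = check.groupby("event_id")["measurement"].bfill()

print(check.head(10))
```

The `bfill` column of the first ten rows should agree with the `bfill` column produced by `dirfill_na` above.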


### DuckDB

In [4]:
# Connect to Backend
conn = ibis.connect('duckdb://:memory:')
conn.register('data.parquet', 'data')

# group by event id, order by measured_on, backfill
dirfill_na(
    conn.table("data")
    ,order_by='measured_on'
    ,group_by='event_id'
    ,value='measurement'
    ,fill_dir='b'
).sort_by(('measured_on', True)).execute()



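|   | event_id | measured_on | measurement | grouper | bfill |
|---|---|---|---|---|---|
| 0 | 1 | 2021-05-05 | 42.0 | 1 | 42.0 |
| 1 | 2 | 2021-05-06 | 42.0 | 2 | 42.0 |
| 2 | 2 | 2021-05-07 |  | 1 | 11.0 |
| 3 | 2 | 2021-05-08 | 11.0 | 1 | 11.0 |
| 4 | 2 | 2021-05-09 |  | 0 |  |
| 5 | 2 | 2021-05-10 |  | 0 |  |
| 6 | 0 | 2021-06-01 |  | 1 | 5.0 |
| 7 | 0 | 2021-06-02 | 5.0 | 1 | 5.0 |
| 8 | 1 | 2021-06-03 |  | 0 |  |
| 9 | 1 | 2021-06-04 |  | 0 |  |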


### Postgres

In [5]:
# Set up a .pgpass file to connect without a password; otherwise use
# 'postgres://username:password@host:port/database'
cstring = 'postgres://ibistutorials@localhost:5432/pg-ibis'

conn = ibis.connect(cstring)

# Decimal type
dec_type = ibis.backends.postgres.sa.types.DECIMAL()

# Integer type
int_type = ibis.backends.postgres.sa.types.INT()

# Text type
str_type = ibis.backends.postgres.sa.types.Text()

# date type
dte_type = ibis.backends.postgres.sa.types.Date()

schema = {
    'event_id': int_type
    ,'measured_on': dte_type
    ,'measurement': dec_type
}

df.to_sql(
    name='data'
    ,con=conn.con.connect()
    ,if_exists='replace'
    ,index=False
    ,dtype=schema
)

# group by event id, order by measured_on, backfill
dirfill_na(
    conn.table("data")
    ,order_by='measured_on'
    ,group_by='event_id'
    ,value='measurement'
    ,fill_dir='b'
).sort_by(('measured_on', True)).execute()

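|   | event_id | measured_on | measurement | grouper | bfill |
|---|---|---|---|---|---|
| 0 | 1 | 2021-05-05 | 42.0 | 1 | 42.0 |
| 1 | 2 | 2021-05-06 | 42.0 | 2 | 42.0 |
| 2 | 2 | 2021-05-07 |  | 1 | 11.0 |
| 3 | 2 | 2021-05-08 | 11.0 | 1 | 11.0 |
| 4 | 2 | 2021-05-09 |  | 0 |  |
| 5 | 2 | 2021-05-10 |  | 0 |  |
| 6 | 0 | 2021-06-01 |  | 1 | 5.0 |
| 7 | 0 | 2021-06-02 | 5.0 | 1 | 5.0 |
| 8 | 1 | 2021-06-03 |  | 0 |  |
| 9 | 1 | 2021-06-04 |  | 0 |  |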


### SQLite

In [6]:
conn = ibis.connect('sqlite://data.db')

# use the schema above to upload data to db file
df.to_sql(
    name='data'
    ,con=conn.con.connect()
    ,if_exists='replace'
    ,index=False
    ,dtype=schema
)

# group by event id, order by measured_on, backfill
dirfill_na(
    conn.table("data")
    ,order_by='measured_on'
    ,group_by='event_id'
    ,value='measurement'
    ,fill_dir='b'
).sort_by(('measured_on', True)).execute()

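|   | event_id | measured_on | measurement | grouper | bfill |
|---|---|---|---|---|---|
| 0 | 1 | 2021-05-05 | 42.0 | 1 | 42.0 |
| 1 | 2 | 2021-05-06 | 42.0 | 2 | 42.0 |
| 2 | 2 | 2021-05-07 |  | 1 | 11.0 |
| 3 | 2 | 2021-05-08 | 11.0 | 1 | 11.0 |
| 4 | 2 | 2021-05-09 |  | 0 |  |
| 5 | 2 | 2021-05-10 |  | 0 |  |
| 6 | 0 | 2021-06-01 |  | 1 | 5.0 |
| 7 | 0 | 2021-06-02 | 5.0 | 1 | 5.0 |
| 8 | 1 | 2021-06-03 |  | 0 |  |
| 9 | 1 | 2021-06-04 |  | 0 |  |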


### BigQuery

Setting up BigQuery is not covered in the Data Setup tutorial.

BigQuery's free tier covers the first TB of query data processed each month, which is enough to get started and see if it fits your project's needs.

In [7]:
import ibis_bigquery

PROJECT_ID = 'ibis-tutorials'
DATASET_ID = 'tutorial_data'
TABLE_ID = '.'.join([PROJECT_ID, DATASET_ID, 'data'])

# connect to get a client
conn = ibis_bigquery.connect(project_id=PROJECT_ID)

# use the client to create the dataset if it doesn't exist (if it exists then that's fine)
conn.client.create_dataset(DATASET_ID, exists_ok=True)
conn = ibis_bigquery.connect(project_id=PROJECT_ID, dataset_id=DATASET_ID)

# create the sample data using the client
conn.client.delete_table(TABLE_ID, not_found_ok=True)
conn.client.load_table_from_dataframe(df, TABLE_ID)

LoadJob<project=ibis-tutorials, location=US, id=96d3de60-077e-4558-97ec-7b6a3abf18f6>

Execute `dirfill_na`, this time with the default forward fill.
Note that BigQuery can take about half a minute to register a newly uploaded table, so wait briefly before running this cell.

In [8]:
dirfill_na(conn.table("data"), group_by='event_id').execute()

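|   | event_id | measured_on | measurement | grouper | ffill |
|---|---|---|---|---|---|
| 0 | 1 | 2021-05-05 | 42.0 | 1 | 42.0 |
| 1 | 2 | 2021-05-06 | 42.0 | 1 | 42.0 |
| 2 | 2 | 2021-05-07 |  | 1 | 42.0 |
| 3 | 2 | 2021-05-08 | 11.0 | 2 | 11.0 |
| 4 | 2 | 2021-05-09 |  | 2 | 11.0 |
| 5 | 2 | 2021-05-10 |  | 2 | 11.0 |
| 6 | 0 | 2021-06-01 |  | 0 |  |
| 7 | 0 | 2021-06-02 | 5.0 | 1 | 5.0 |
| 8 | 1 | 2021-06-03 |  | 1 | 42.0 |
| 9 | 1 | 2021-06-04 |  | 1 | 42.0 |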
