# Merging / joining dataframes where a key falls between two dates

Given a key in one dataframe, how can we join it onto another dataframe, where that key is between the values of two columns?

In SQL the solution would look like

```sql
SELECT *
FROM table1
JOIN table2
  ON date BETWEEN _from AND _to
```

However, in pandas this isn't straightforward, as you cannot pass conditions to the `join` or `merge` functions.

You can use a row-wise `apply` or loop through the data, but that is very slow. This notebook shows how the join can be done without either.
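
For reference, a minimal sketch of the slow row-wise baseline mentioned above. The sample frames and the helper `find_vacation` are illustrative assumptions, chosen to mirror the data built later in this notebook.

```python
import pandas as pd

# Hypothetical sample data, mirroring the frames created further down
vacations = pd.DataFrame({
    "vacation_name": ["Summer", "Spring"],
    "_from": pd.to_datetime(["2022-06-05", "2022-04-01"]),
    "_to": pd.to_datetime(["2022-07-03", "2022-04-08"]),
})
events = pd.DataFrame({
    "event_name": ["Park", "Hike", "forest walk"],
    "date": pd.to_datetime(["2022-06-01", "2022-06-05", "2022-04-05"]),
})

def find_vacation(date):
    """Return the name of the first vacation containing `date`, or None."""
    inside = vacations[(vacations["_from"] <= date) & (date <= vacations["_to"])]
    return inside["vacation_name"].iloc[0] if len(inside) else None

# One Python-level function call per event row: simple, but it scales poorly
slow_result = events.assign(vacation_name=events["date"].apply(find_vacation))
```

Every event triggers a full scan of `vacations`, so the work grows with the product of the two frame sizes.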

## links / sources
- https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range
- https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others

In [1]:
# imports

import pandas as pd

## 0: setting up the data

In [2]:
new_vacation = lambda: pd.DataFrame(dict(
  vacation_name=("Summer", "Spring"),
  _from=pd.to_datetime(["2022-06-05", "2022-04-01"]),
  _to=pd.to_datetime(["2022-07-03", "2022-04-08"])
))

In [3]:
new_events = lambda: pd.DataFrame(dict(
  event_name=("Park", "Hike", "waterski", "Flower watching", "forest walk"),
  date=pd.to_datetime(["2022-06-01", "2022-06-05", "2022-07-01", "2022-03-28", "2022-04-05"]),
))

In [4]:
new_vacation()

,vacation_name,_from,_to
0,Summer,2022-06-05,2022-07-03
1,Spring,2022-04-01,2022-04-08


In [5]:
new_events()

,event_name,date
0,Park,2022-06-01
1,Hike,2022-06-05
2,waterski,2022-07-01
3,Flower watching,2022-03-28
4,forest walk,2022-04-05


## 1: Finding which vacation an event takes place in

Given an event date, how can we find which vacation date interval it falls inside?

Roughly speaking, we want to write the pandas equivalent of the following SQL:

```sql
SELECT *
FROM events
JOIN vacations
  ON date BETWEEN _from AND _to
```

This can be solved using `pd.IntervalIndex`.

In pandas, when values from one dataframe are assigned to another, they are aligned implicitly, much like in a merge operation.
That is, rows are matched on index labels rather than on position. Unlike a real merge, assignment never adds rows to the left-hand dataframe, so this approach assumes that each left-hand row matches at most one row in the right-hand dataframe (here: that the vacation intervals do not overlap).
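
A small sketch of label-based alignment during assignment, using plain string labels instead of intervals. The frames `left` and `right` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])
# Same labels in a different order, plus a label ("d") that `left` does not have
right = pd.Series(["C", "A", "D"], index=["c", "a", "d"])

# Assignment aligns on index labels, not on position:
# "a" and "c" receive their values, "b" has no match and becomes NaN,
# and the extra label "d" is ignored
left["y"] = right
```

The row count of `left` stays at three regardless of what `right` contains.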

"Matches" works differently here than in the `.merge()` and `.join()` methods: those compare key values for equality, whereas a label lookup on a `pd.IntervalIndex` matches any value contained in one of its intervals.

We can take advantage of the fact that a date matches any `pd.Interval`-object with a start and end around that date.
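
To make the containment matching concrete, here is a short sketch that checks membership in a single `pd.Interval` and then looks dates up in a `pd.IntervalIndex`. The variable names are illustrative; the dates are the vacation dates used in this notebook.

```python
import pandas as pd

summer = pd.Interval(
    pd.Timestamp("2022-06-05"), pd.Timestamp("2022-07-03"), closed="both"
)
inside = pd.Timestamp("2022-07-01") in summer    # date within the interval
outside = pd.Timestamp("2022-06-01") in summer   # date before the interval starts

intervals = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2022-06-05", "2022-04-01"]),
    pd.to_datetime(["2022-07-03", "2022-04-08"]),
    closed="both",
)
names = pd.Series(["Summer", "Spring"], index=intervals)

# A scalar label lookup returns the entry whose interval contains the date
spring = names.loc[pd.Timestamp("2022-04-05")]

# get_indexer returns the position of the containing interval, or -1 if there is none
positions = intervals.get_indexer(
    pd.to_datetime(["2022-06-05", "2022-03-28", "2022-04-05"])
)
```

With `closed="both"`, a date equal to the start or end of a vacation counts as inside it, which is why `2022-06-05` resolves to the first interval.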

In [6]:
events, vacations = new_events(), new_vacation()

In [7]:
vacations = (vacations
    # Creates a closed interval between the start and end date for each vacation
    .assign(interval = pd.IntervalIndex.from_arrays(vacations._from, vacations._to, closed='both'))
    # Set this as the index, so that a later assign operation can align rows on it
    .set_index('interval')
)


In [8]:
(events
    # Set the index to `date`, so the assign operation can align on the date
    .set_index('date')
    # during assigning, rows are aligned and any `pd.Interval` from vacations that encompasses
    # a date from events is considered a match
    .assign(vacation_name = vacations['vacation_name'])
)

date,event_name,vacation_name
2022-06-01,Park,
2022-06-05,Hike,Summer
2022-07-01,waterski,Summer
2022-03-28,Flower watching,
2022-04-05,forest walk,Spring


In [9]:
# Note that the `assign`-method is basically just a fancy version of regular Pandas assignment
# and we can get the same result by mutating the events dataframe

events = events.set_index('date')
events['vacation_name'] = vacations['vacation_name']
events

date,event_name,vacation_name
2022-06-01,Park,
2022-06-05,Hike,Summer
2022-07-01,waterski,Summer
2022-03-28,Flower watching,
2022-04-05,forest walk,Spring


### 1.2 Wrapping the `IntervalIndex` approach up into a function

In [10]:
def left_join_interval(
        left: pd.DataFrame, 
        right: pd.DataFrame, 
        _from: str, 
        _to: str,
        _on: str = None,
        suffix = "_right"
    ) -> pd.DataFrame:
    """
    SQL-like left join of values (such as integers, floats or dates) onto intervals.
    
    Values in the `_on` column (or in the index) of the left dataframe are joined with rows in the
    right dataframe whose closed interval [`_from`, `_to`] contains the value.

    The returned dataframe is indexed by the `_on` column, or keeps the original index if `_on` is `None`.

    Parameters
    ----------
        left:   pandas dataframe
        right:  pandas dataframe containing intervals
        _from:  String. Name of the column in right containing the start of each interval.
        _to:    String. Name of the column in right containing the end of each interval.
        _on:    Name of the column in left containing the values to join on. If `None`, use the index.
        suffix: String appended to column names from right that already exist in left.
    """
    
    if _on is None:
        left = left.copy()
    else:
        left = left.set_index(_on)
    
    right = right.set_index(
        pd.IntervalIndex.from_arrays(right[_from], right[_to], closed='both')
    )

    # TO-DO: this fails if left dataframe has a column that is named s+suffix
    #        Look-up how Pandas actually handles this.
    add_suffix = lambda s: s+suffix if s in left.columns else s
    right_cols = [add_suffix(colname) for colname in right.columns]

    # We can do this safely without side-effects, as left has been copied
    left[right_cols] = right
    
    return left
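
Regarding the TO-DO in the function above: for comparison, this sketch shows how `DataFrame.merge` names overlapping columns through its `suffixes` argument. The two small frames are made up for illustration and are unrelated to the vacation data.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "name": ["a", "b"]})
right = pd.DataFrame({"key": [1, 2], "name": ["x", "y"]})

# `suffixes` is a (left_suffix, right_suffix) pair that is applied only to
# column names occurring in both frames (other than the join key)
merged = left.merge(right, on="key", suffixes=("", "_right"))
merged_columns = list(merged.columns)  # ['key', 'name', 'name_right']
```

Passing an empty string as the left suffix keeps the left-hand names untouched, which matches what `left_join_interval` does with its `suffix` parameter.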

In [11]:
events, vacations = new_events(), new_vacation()

In [12]:
(events
    .pipe(left_join_interval, vacations, '_from', '_to', 'date')
)

date,event_name,vacation_name,_from,_to
2022-06-01,Park,,NaT,NaT
2022-06-05,Hike,Summer,2022-06-05,2022-07-03
2022-07-01,waterski,Summer,2022-06-05,2022-07-03
2022-03-28,Flower watching,,NaT,NaT
2022-04-05,forest walk,Spring,2022-04-01,2022-04-08
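
The docstring notes that the same idea works for integers and floats. A minimal sketch with numeric intervals, written without the helper function so it runs on its own; the `grades` and `scores` frames are hypothetical.

```python
import pandas as pd

# Hypothetical grade bands, each a closed integer interval [low, high]
grades = pd.DataFrame({
    "grade": ["C", "B", "A"],
    "low": [0, 50, 80],
    "high": [49, 79, 100],
})
scores = pd.DataFrame({"student": ["Ann", "Bob", "Cy"], "score": [85, 42, 50]})

# Index the grade names by their interval, then let assignment align on it
lookup = grades.set_index(
    pd.IntervalIndex.from_arrays(grades["low"], grades["high"], closed="both")
)["grade"]
result = scores.set_index("score").assign(grade=lookup)
```

Each score picks up the grade of the band that contains it, in the same way each event date picked up its vacation.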


## 2: Converting vacation definitions to records of days

We can use the same method as before to expand date ranges into date records: create a dataframe with a record for each day between the earliest start date and the latest end date in vacations, and join the vacations onto it.

In [13]:
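# Build the frame first, so the interval is computed from this frame's own columns
# rather than from whatever `vacations` happened to hold in an earlier cell
vacations = new_vacation()
vacations = (vacations
    .assign(interval = pd.IntervalIndex.from_arrays(vacations._from, vacations._to, closed='both'))
    .set_index('interval')
)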
days = pd.date_range(
    start = vacations._from.min(),
    end= vacations._to.max()
)
daysdf = pd.DataFrame(index=days)

(daysdf
    # Add vacation identifier to each day that is inside the vacation interval
    .assign(vacation_name = vacations.vacation_name)
    # merge vacations onto the days, where the identifier matches
    # Note: We do not use the assign-method here, as we want _all_ columns in vacation to
    #       be preserved.
    .reset_index().rename(columns={'index': 'date'})
    .merge(vacations, on="vacation_name")
)

,date,vacation_name,_from,_to
0,2022-04-01,Spring,2022-04-01,2022-04-08
1,2022-04-02,Spring,2022-04-01,2022-04-08
2,2022-04-03,Spring,2022-04-01,2022-04-08
3,2022-04-04,Spring,2022-04-01,2022-04-08
4,2022-04-05,Spring,2022-04-01,2022-04-08
5,2022-04-06,Spring,2022-04-01,2022-04-08
6,2022-04-07,Spring,2022-04-01,2022-04-08
7,2022-04-08,Spring,2022-04-01,2022-04-08
8,2022-06-05,Summer,2022-06-05,2022-07-03
9,2022-06-06,Summer,2022-06-05,2022-07-03


In this case it is easier to use the mutating approach, which is already implemented in `left_join_interval()`.

In [14]:
vacations = new_vacation()

daysdf.pipe(left_join_interval, vacations, '_from', '_to').dropna()

,vacation_name,_from,_to
2022-04-01,Spring,2022-04-01,2022-04-08
2022-04-02,Spring,2022-04-01,2022-04-08
2022-04-03,Spring,2022-04-01,2022-04-08
2022-04-04,Spring,2022-04-01,2022-04-08
2022-04-05,Spring,2022-04-01,2022-04-08
2022-04-06,Spring,2022-04-01,2022-04-08
2022-04-07,Spring,2022-04-01,2022-04-08
2022-04-08,Spring,2022-04-01,2022-04-08
2022-06-05,Summer,2022-06-05,2022-07-03
2022-06-06,Summer,2022-06-05,2022-07-03
