## Inequality joins
By the end of this lecture you will be able to:
- do nearest neighbour joins
- do joins based on a condition

In [1]:
import polars as pl

## Nearest neighbour joins with `join_asof`
We have the following `DataFrame` of individuals and the year each was born.

In [2]:
# People with their birth years
people_df = pl.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlotte', 'David', 'Eve'],
    'Birth_Year': [1950, 1965, 1985, 1996, 2015]
})
people_df

Name,Birth_Year
str,i64
"""Alice""",1950
"""Bob""",1965
"""Charlotte""",1985
"""David""",1996
"""Eve""",2015


We also have this `DataFrame` that defines the starting birth year for each generation group.

In [3]:
generations_df = pl.DataFrame({
    'Generation': ['Baby Boomer', 'Gen X', 'Millennial', 'Gen Z', 'Gen Alpha'],
    'Start_Year': [1946, 1965, 1981, 1997, 2013]
})
generations_df

Generation,Start_Year
str,i64
"""Baby Boomer""",1946
"""Gen X""",1965
"""Millennial""",1981
"""Gen Z""",1997
"""Gen Alpha""",2013


We want to map each person in `people_df` to their generation group. By default, `join_asof` matches each birth year to the nearest generation start year that is less than or equal to it. Both `DataFrames` must be sorted by their join columns.

In [4]:
(
    people_df
    .join_asof(
        generations_df,
        left_on="Birth_Year",
        right_on="Start_Year",
    )
)

Name,Birth_Year,Generation,Start_Year
str,i64,str,i64
"""Alice""",1950,"""Baby Boomer""",1946
"""Bob""",1965,"""Gen X""",1965
"""Charlotte""",1985,"""Millennial""",1981
"""David""",1996,"""Millennial""",1981
"""Eve""",2015,"""Gen Alpha""",2013


Note that `Bob` is `Gen X` as his birth year equals the start year.
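The backward matching rule can be sketched in plain Python. This is only an illustration of the rule described above, not how Polars implements `join_asof`; the helper name `backward_match` is made up for this example.

```python
from bisect import bisect_right

# Generation start years, sorted ascending (same values as generations_df)
start_years = [1946, 1965, 1981, 1997, 2013]
generations = ["Baby Boomer", "Gen X", "Millennial", "Gen Z", "Gen Alpha"]


def backward_match(birth_year):
    """Hypothetical helper: latest generation whose start year is <= birth_year."""
    # bisect_right counts how many start years are <= birth_year
    idx = bisect_right(start_years, birth_year) - 1
    # idx is -1 when the birth year is before every start year, so no match
    return generations[idx] if idx >= 0 else None


people = {"Alice": 1950, "Bob": 1965, "Charlotte": 1985, "David": 1996, "Eve": 2015}
matches = {name: backward_match(year) for name, year in people.items()}
```

A birth year equal to a start year (as for `Bob`) matches that generation, and a birth year earlier than every start year has no match.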

The default strategy is `backward`, which looks for the nearest match on or before the value in the join column. We can also choose a `forward` or `nearest` strategy.

In [5]:
(
    people_df
    .join_asof(
        generations_df,
        left_on="Birth_Year",
        right_on="Start_Year",
        strategy="nearest"
    )
)

Name,Birth_Year,Generation,Start_Year
str,i64,str,i64
"""Alice""",1950,"""Baby Boomer""",1946
"""Bob""",1965,"""Gen X""",1965
"""Charlotte""",1985,"""Millennial""",1981
"""David""",1996,"""Gen Z""",1997
"""Eve""",2015,"""Gen Alpha""",2013


By choosing `nearest` we see that `David` is now in `Gen Z` which started one year after he was born.

To avoid matching across a very long gap we can set a `tolerance` that caps the difference between the birth year and the generation start year.

In [6]:
(
    people_df
    .join_asof(
        generations_df,
        left_on="Birth_Year",
        right_on="Start_Year",
        # strategy="nearest",
        tolerance=5
    )
)

Name,Birth_Year,Generation,Start_Year
str,i64,str,i64
"""Alice""",1950,"""Baby Boomer""",1946
"""Bob""",1965,"""Gen X""",1965
"""Charlotte""",1985,"""Millennial""",1981
"""David""",1996,,
"""Eve""",2015,"""Gen Alpha""",2013


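When no match is found within the tolerance, the columns from the right `DataFrame` (`Generation` and `Start_Year`) are `null`. Here `David` has no match because the nearest earlier start year (1981) is 15 years before his birth year.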

## Inequality joins
We can join `DataFrames` based on one or more conditions.

In this example we have a `DataFrame` of potential customers with their budget and the minimum bed length they require. We want to find hotels from our database that fit their budget and bed length requirement.

In [None]:
# Customers with their budgets
customers_df = pl.DataFrame({
    'customer': ['John', 'Anna', 'Ravi', 'Fatima'],
    'budget': [120, 60, 180, 210],
    'min_bed_length': [200, 180, 190, 180]
})

We have this `DataFrame` showing the potential hotels.

In [None]:
hotels_df = pl.DataFrame({
    'hotel': ['Budget Inn', 'Comfort Suites', 'Luxury Stay', 'Presidential Palace'],
    'price': [50, 100, 150, 200],
    'bed_length': [210, 185, 185, 185]
})


We first join to find all the hotels that meet each customer's budget. We provide a predicate that sets the inequality. Note that:
- the expressions in the predicate relate to columns in the left and right `DataFrames`
- the column names used in the expressions cannot be duplicated between the `DataFrames`, so that each name refers to exactly one side

In [None]:
(
    customers_df
    .join_where(
        hotels_df,
        pl.col("budget") >= pl.col("price"),
    )
    .select("customer","budget","hotel","price")
)

Note that the result is equivalent to:
- a cross `join` of both `DataFrames`
- a `filter` on the output

We can also provide multiple predicates, in which case a row is kept only if all of them hold. Here we add the bed length condition.

In [None]:
(
    customers_df
    .join_where(
        hotels_df,
        pl.col("budget") >= pl.col("price"),
        pl.col("bed_length") >= pl.col("min_bed_length")
    )
)

## Exercises
In the exercises you will develop your understanding of
- doing `asof` joins and
- doing inequality joins

We read in this dataset of taxi journeys in New York over 10 days and sort it by pickup time, as `join_asof` requires sorted join columns.

In [41]:
taxi_df = (
    pl.read_parquet("../data/nyc_trip_data_600k.parquet")
    .sort("pickup")
)
taxi_df.head()

VendorID,pickup,dropoff,passenger_count,trip_distance
i32,datetime[μs],datetime[μs],f32,f32
2,2022-01-01 00:00:08,2022-01-01 00:14:14,1.0,7.94
2,2022-01-01 00:00:11,2022-01-01 00:14:29,1.0,5.55
2,2022-01-01 00:00:17,2022-01-01 00:31:04,1.0,3.86
2,2022-01-01 00:00:18,2022-01-01 00:13:27,1.0,3.37
2,2022-01-01 00:00:20,2022-01-01 00:20:51,1.0,4.89


We also load this dataset of hourly weather data from New York, with temperature in degrees Celsius and precipitation in mm.

In [74]:
weather_df = (
    pl.read_parquet("../data/nyc_weather.parquet")
)
weather_df.head()

time,temperature,precipitation
datetime[μs],f64,f64
2022-01-01 00:00:00,8.7,0.0
2022-01-01 01:00:00,8.7,0.0
2022-01-01 02:00:00,8.6,0.0
2022-01-01 03:00:00,8.5,0.0
2022-01-01 04:00:00,8.2,0.0


Join the weather data to the taxi data with `join_asof`, using an appropriate strategy.

In [43]:
df = (
    taxi_df
    <blank>
)
df.head()

VendorID,pickup,dropoff,passenger_count,trip_distance,time,temperature,precipitation
i32,datetime[μs],datetime[μs],f32,f32,datetime[μs],f64,f64
2,2022-01-01 00:00:08,2022-01-01 00:14:14,1.0,7.94,2022-01-01 00:00:00,8.7,0.0
2,2022-01-01 00:00:11,2022-01-01 00:14:29,1.0,5.55,2022-01-01 00:00:00,8.7,0.0
2,2022-01-01 00:00:17,2022-01-01 00:31:04,1.0,3.86,2022-01-01 00:00:00,8.7,0.0
2,2022-01-01 00:00:18,2022-01-01 00:13:27,1.0,3.37,2022-01-01 00:00:00,8.7,0.0
2,2022-01-01 00:00:20,2022-01-01 00:20:51,1.0,4.89,2022-01-01 00:00:00,8.7,0.0


Make a bar chart of the relationship between `precipitation` and `trip_distance`:
- first group by precipitation (rounded off to one decimal place)
- take the mean of trip distance
- make a bar chart with precipitation on the x-axis and trip distance on the y-axis
- add an appropriate title to the chart

In [90]:
(
    df
    <blank>
)

## Solutions

### Solution to exercise 1

In [91]:
taxi_df = (
    pl.read_parquet("../data/nyc_trip_data_600k.parquet")
    .sort("pickup")
)
taxi_df.head()

VendorID,pickup,dropoff,passenger_count,trip_distance
i32,datetime[μs],datetime[μs],f32,f32
2,2022-01-01 00:00:08,2022-01-01 00:14:14,1.0,7.94
2,2022-01-01 00:00:11,2022-01-01 00:14:29,1.0,5.55
2,2022-01-01 00:00:17,2022-01-01 00:31:04,1.0,3.86
2,2022-01-01 00:00:18,2022-01-01 00:13:27,1.0,3.37
2,2022-01-01 00:00:20,2022-01-01 00:20:51,1.0,4.89


In [None]:
weather_df = (
    pl.read_parquet("../data/nyc_weather.parquet")
)
weather_df.head()

Join the weather data to the taxi data using an appropriate strategy. We use `nearest` because it pairs each pickup with the hourly weather reading closest in time, which is the most physically relevant value for each ride.

In [None]:
df = (
    taxi_df
    .join_asof(
        weather_df,
        left_on="pickup",
        right_on="time",
        strategy="nearest"
    )
)
df.head()

Make a bar chart of the relationship between `precipitation` and `trip_distance`:
- first group by precipitation (rounded off to one decimal place)
- take the mean of trip distance
- make a bar chart with precipitation on the x-axis and trip distance on the y-axis
- add an appropriate title to the chart

In [None]:
(
    df
    .group_by(
        pl.col("precipitation").round(1)
    )
    .agg(
        pl.col("trip_distance").mean()
    )
    .plot
    .bar(
        x="precipitation",
        y="trip_distance",
    )
    .properties(
        title="Mean trip distance by precipitation level in mm"
    )
)