# 6. Data Manipulation IV - Combining Data

The goal of this module is to learn to perform Polars queries that involve combining data. More specifically, we'll cover two new Query Statements:
1. Joining dataframes with `.join()`.
2. Concatenating dataframes with `pl.concat()`.

But first we import `polars`...

In [1]:
import polars as pl

... and load the data, this time naming the dataframe `rides_df` rather than `df`, since this module works with more than one dataframe.

In [11]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
rides_df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)

## 6.1. Joining Dataframes

In the previous module, we were interested in seeing some summary statistics about each of the different pickup location IDs:

In [31]:
location_id_summary_df = (
    rides_df
    .group_by("pu_location_id")
    .agg(
        pl.len().alias("count_trips"),
        pl.col("trip_distance").min().name.suffix("_min"),
        pl.col("trip_distance").max().name.suffix("_max"),
        pl.col("trip_distance").mean().name.suffix("_mean"),
    )
    .sort(pl.col("count_trips"), descending=True)
)
location_id_summary_df.head()

pu_location_id,count_trips,trip_distance_min,trip_distance_max,trip_distance_mean
i32,u32,f64,f64,f64
161,163269,0.0,51066.77,2.692728
132,157706,0.0,9211.95,15.76677
237,155631,0.0,44866.77,2.096025
236,146044,0.0,109619.96,3.481146
162,123805,0.0,57408.32,3.030781


This is nice, but if we try to present these results to any sort of business stakeholder, the first thing they'll say is "What are these location IDs? What is their actual name?" Thankfully, that information is stored in another file, `"taxi_zone_lookup.parquet"`. Let's load it in...

In [26]:
zones_df = pl.read_parquet("../data/taxi_zone_lookup.parquet")
zones_df.head()

LocationID,Borough,Zone,service_zone
i32,str,str,str
1,"""EWR""","""Newark Airport""","""EWR"""
2,"""Queens""","""Jamaica Bay""","""Boro Zone"""
3,"""Bronx""","""Allerton/Pelham Gardens""","""Boro Zone"""
4,"""Manhattan""","""Alphabet City""","""Yellow Zone"""
5,"""Staten Island""","""Arden Heights""","""Boro Zone"""


This dataframe maps each location ID to the `str` names of the borough, zone, and service zone it represents. Let's quickly rename the columns to make them snake_case like those of `rides_df` (`service_zone` already is):

In [27]:
zones_df = (
    zones_df
    .rename({
        "LocationID": "location_id",
        "Borough": "borough",
        "Zone": "zone",
    })
)
zones_df.head()

location_id,borough,zone,service_zone
i32,str,str,str
1,"""EWR""","""Newark Airport""","""EWR"""
2,"""Queens""","""Jamaica Bay""","""Boro Zone"""
3,"""Bronx""","""Allerton/Pelham Gardens""","""Boro Zone"""
4,"""Manhattan""","""Alphabet City""","""Yellow Zone"""
5,"""Staten Island""","""Arden Heights""","""Boro Zone"""


Perfect.

To combine it with the summary statistics we just computed, we can **join** it onto the summary statistics table using `pl.DataFrame.join()`:

In [32]:
(
    location_id_summary_df
    .join(
        zones_df,
        left_on="pu_location_id",
        right_on="location_id"
    )
    .head()
)

pu_location_id,count_trips,trip_distance_min,trip_distance_max,trip_distance_mean,borough,zone,service_zone
i32,u32,f64,f64,f64,str,str,str
1,372,0.0,35.75,1.34578,"""EWR""","""Newark Airport""","""EWR"""
2,8,0.0,20.23,11.98,"""Queens""","""Jamaica Bay""","""Boro Zone"""
3,151,0.0,35.1,7.159073,"""Bronx""","""Allerton/Pelham Gardens""","""Boro Zone"""
4,7586,0.0,37.6,2.742518,"""Manhattan""","""Alphabet City""","""Yellow Zone"""
5,1,0.0,0.0,0.0,"""Staten Island""","""Arden Heights""","""Boro Zone"""


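Nice! The data from `zones_df` has been joined into our location summary table, and we can see the `"borough"`, `"zone"`, and `"service_zone"` alongside the `"pu_location_id"`! Note that the rows no longer follow the `count_trips` ordering: a join does not necessarily preserve row order, so re-sort afterwards if the order matters. `polars`'s `.join()` offers the same functionality you may be familiar with from Pandas and/or SQL: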
1. `location_id_summary_df` specifies the **left** table for the join.
2. `.join(...)` starts the join.
3. `zones_df` specifies the **right** table for the join.
4. `left_on` specifies the column to join on from `location_id_summary_df`.
5. `right_on` specifies the column to join on from `zones_df`.

Just to be clear, though, we didn't have to aggregate into `location_id_summary_df` first; we can join directly into the original `rides_df`:

In [37]:
(
    rides_df
    .select(["tpep_pickup_datetime", "pu_location_id", "do_location_id", "trip_distance", "total_amount"])
    .join(
        zones_df,
        left_on="pu_location_id",
        right_on="location_id",
    )
    # Uncomment to drop `service_zone` from the result:
    # .select(["tpep_pickup_datetime", "pu_location_id", "do_location_id", "trip_distance", "total_amount", "borough", "zone"])
    .head()
)

tpep_pickup_datetime,pu_location_id,do_location_id,trip_distance,total_amount,borough,zone,service_zone
datetime[ns],i32,i32,f64,f64,str,str,str
2024-03-01 00:18:51,142,239,1.3,16.3,"""Manhattan""","""Lincoln Square East""","""Yellow Zone"""
2024-03-01 00:26:00,238,24,1.1,15.2,"""Manhattan""","""Upper West Side North""","""Yellow Zone"""
2024-03-01 00:09:22,263,75,0.86,10.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""
2024-03-01 00:33:45,164,162,0.82,14.19,"""Manhattan""","""Midtown South""","""Yellow Zone"""
2024-03-01 00:05:43,263,7,4.9,30.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""


But now that `"pu_location_id"` and `"do_location_id"` sit in the same dataframe, it's not clear whether `"borough"`, `"zone"`, and `"service_zone"` refer to the pickup or the dropoff location. We can quickly fix that by prefixing the zone columns with `pu_`:

In [40]:
zone_columns = ["borough", "zone", "service_zone"]
(
    rides_df
    .select(["tpep_pickup_datetime", "pu_location_id", "do_location_id", "trip_distance", "total_amount"])
    .join(
        zones_df,
        left_on="pu_location_id",
        right_on="location_id",
    )
    # Rename in place rather than copying the columns and dropping the originals.
    .rename({column: f"pu_{column}" for column in zone_columns})
    .head()
)

tpep_pickup_datetime,pu_location_id,do_location_id,trip_distance,total_amount,pu_borough,pu_zone,pu_service_zone
datetime[ns],i32,i32,f64,f64,str,str,str
2024-03-01 00:18:51,142,239,1.3,16.3,"""Manhattan""","""Lincoln Square East""","""Yellow Zone"""
2024-03-01 00:26:00,238,24,1.1,15.2,"""Manhattan""","""Upper West Side North""","""Yellow Zone"""
2024-03-01 00:09:22,263,75,0.86,10.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""
2024-03-01 00:33:45,164,162,0.82,14.19,"""Manhattan""","""Midtown South""","""Yellow Zone"""
2024-03-01 00:05:43,263,7,4.9,30.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""


We can also use the `on` argument when the join column has the same name in both tables. For example, imagine we wanted to join `zones_df` with the table we created in the prior module, of rides that both started and ended in the same location ID; then we don't need to specify pickup vs dropoff:

In [42]:
(
    rides_df
    .filter(pl.col("do_location_id").eq(pl.col("pu_location_id")))
    .select([
        "tpep_pickup_datetime",
        pl.col("pu_location_id").alias("location_id"),
        "trip_distance",
        "total_amount",
    ])
    .join(
        zones_df,
        on="location_id",
    )
    .head()
)

tpep_pickup_datetime,location_id,trip_distance,total_amount,borough,zone,service_zone
datetime[ns],i32,f64,f64,str,str,str
2024-03-01 00:12:31,158,0.65,15.48,"""Manhattan""","""Meatpacking/West Village West""","""Yellow Zone"""
2024-03-01 00:17:01,158,0.96,15.48,"""Manhattan""","""Meatpacking/West Village West""","""Yellow Zone"""
2024-03-01 01:00:25,143,0.06,8.7,"""Manhattan""","""Lincoln Square West""","""Yellow Zone"""
2024-03-01 00:18:13,236,0.4,11.28,"""Manhattan""","""Upper East Side North""","""Yellow Zone"""
2024-03-01 00:07:35,145,2.53,28.08,"""Queens""","""Long Island City/Hunters Point""","""Boro Zone"""


And, of course, as with the other Query Statements, we can join not only on a column name `str`, but also on a `pl.Expr` object. Consider the following toy example, where we match rows by parity: odd values of `a` join to odd values of `c`, and even values to even values:

In [48]:
left_df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["Danny", "Donna", "Lana", "Sauna"]})
right_df = pl.DataFrame({"c": [1, 2], "d": ["cool", "not cool"]})
display(left_df)
display(right_df)

joined_df = (
    left_df
    .join(
        right_df,
        left_on=pl.col("a") % 2,
        right_on=pl.col("c") % 2
    )
)
display(joined_df)

a,b
i64,str
1,"""Danny"""
2,"""Donna"""
3,"""Lana"""
4,"""Sauna"""


c,d
i64,str
1,"""cool"""
2,"""not cool"""


a,b,d
i64,str,str
1,"""Danny""","""cool"""
0,"""Donna""","""not cool"""
1,"""Lana""","""cool"""
0,"""Sauna""","""not cool"""
