# 6. Data Manipulation IV - Combining Data

The goal of this module is to learn to perform Polars queries that involve combining data. More specifically, we'll cover:
1. Joining dataframes with `.join()`.
2. Concatenating dataframes with `pl.concat()`.

But first we import `polars`...

In [2]:
import polars as pl

... and load the data, this time naming the dataframe `rides_df` instead of `df`, since this module works with more than one dataframe.

In [12]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
rides_df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
    .head()
)

## 6.1. Joining Dataframes

Sometimes the whole story can't be told with one dataframe or table. In the world of big data this is usually the case: storing everything in one dataframe can be quite expensive, so supporting data often lives in a separate table. Our dataset is no exception. A separate dataset describes the zones referenced by the `"pu_location_id"` and `"do_location_id"` columns.

Let's have a look:

In [13]:
zones_df = pl.read_csv("../data/taxi_zone_lookup.csv")
zones_df.head()

| LocationID | Borough | Zone | service_zone |
| --- | --- | --- | --- |
| i64 | str | str | str |
| 1 | "EWR" | "Newark Airport" | "EWR" |
| 2 | "Queens" | "Jamaica Bay" | "Boro Zone" |
| 3 | "Bronx" | "Allerton/Pelham Gardens" | "Boro Zone" |
| 4 | "Manhattan" | "Alphabet City" | "Yellow Zone" |
| 5 | "Staten Island" | "Arden Heights" | "Boro Zone" |


This table maps each location ID to the borough, zone, and service zone it represents, all stored as `str` columns. It is also much smaller than the rides data:

In [14]:
print(zones_df.shape)

(265, 4)


Now, in the previous module, we 