# 6. Data Manipulation IV - Combining Data

The goal of this module is to learn to perform Polars queries that involve combining data. More specifically, we'll cover two new Query Statements:
1. Joining dataframes with `.join()`.
2. Concatenating dataframes with `pl.concat()`.

But first we import `polars`...

In [1]:
import polars as pl

%run setup.py

/data/datasets/data/yellow_tripdata_2024-03.parquet already exists
/data/datasets/data/taxi_zone_lookup.csv already exists


... and load the data, this time naming it `march_yellow_rides_df` instead of `df`, since we'll be working with several dataframes in this module.

In [2]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
march_yellow_rides_df = pl.read_parquet(local_parquet).rename(
    yellow_rides_column_rename_mapping
)

## 6.1. Joining Dataframes

In the previous module, we were interested in seeing some summary statistics about each of the different pickup location IDs:

In [3]:
location_id_summary_df = (
    march_yellow_rides_df.group_by("pu_location_id")
    .agg(
        pl.len().alias("count_trips"),
        pl.col("trip_distance").min().name.suffix("_min"),
        pl.col("trip_distance").max().name.suffix("_max"),
        pl.col("trip_distance").mean().name.suffix("_mean"),
    )
    .sort(pl.col("count_trips"), descending=True)
)
location_id_summary_df.head()

pu_location_id,count_trips,trip_distance_min,trip_distance_max,trip_distance_mean
i32,u32,f64,f64,f64
161,163269,0.0,51066.77,2.692728
132,157706,0.0,9211.95,15.76677
237,155631,0.0,44866.77,2.096025
236,146044,0.0,109619.96,3.481146
162,123805,0.0,57408.32,3.030781


This is nice, but if we try to present these results to any sort of business stakeholder, the first thing they'll say is "What are these location IDs? What are their actual names?" Thankfully, that information is stored in another file, `"taxi_zone_lookup.csv"`. Let's load it in...

In [2]:
zones_df = pl.read_csv(taxi_zone_lookup_local)
zones_df.head()

LocationID,Borough,Zone,service_zone
i64,str,str,str
1,"""EWR""","""Newark Airport""","""EWR"""
2,"""Queens""","""Jamaica Bay""","""Boro Zone"""
3,"""Bronx""","""Allerton/Pelham Gardens""","""Boro Zone"""
4,"""Manhattan""","""Alphabet City""","""Yellow Zone"""
5,"""Staten Island""","""Arden Heights""","""Boro Zone"""


In [3]:
# `write_parquet()` returns None, so don't assign its result back to `zones_df`.
zones_df.write_parquet(
    "/data/datasets/data/taxi_zone_lookup.parquet"
)

This dataframe contains information about each location ID, and the `str` names for the Borough, Zone, and service zone that each location ID represents. Let's quickly rename the columns to make them snake_case like those of `march_yellow_rides_df`:

In [19]:
zones_df = zones_df.rename(
    {
        "LocationID": "location_id",
        "Borough": "borough",
        "Zone": "zone",
    }
)
zones_df.head()

location_id,borough,zone,service_zone
i64,str,str,str
1,"""EWR""","""Newark Airport""","""EWR"""
2,"""Queens""","""Jamaica Bay""","""Boro Zone"""
3,"""Bronx""","""Allerton/Pelham Gardens""","""Boro Zone"""
4,"""Manhattan""","""Alphabet City""","""Yellow Zone"""
5,"""Staten Island""","""Arden Heights""","""Boro Zone"""


Perfect.

If we want to now combine it with the summary statistics we just computed, we can do so by **joining** it into the summary statistics table, using `pl.DataFrame.join()`:

In [20]:
(
    location_id_summary_df.join(
        zones_df, left_on="pu_location_id", right_on="location_id"
    ).head()
)

pu_location_id,count_trips,trip_distance_min,trip_distance_max,trip_distance_mean,borough,zone,service_zone
i32,u32,f64,f64,f64,str,str,str
1,372,0.0,35.75,1.34578,"""EWR""","""Newark Airport""","""EWR"""
2,8,0.0,20.23,11.98,"""Queens""","""Jamaica Bay""","""Boro Zone"""
3,151,0.0,35.1,7.159073,"""Bronx""","""Allerton/Pelham Gardens""","""Boro Zone"""
4,7586,0.0,37.6,2.742518,"""Manhattan""","""Alphabet City""","""Yellow Zone"""
5,1,0.0,0.0,0.0,"""Staten Island""","""Arden Heights""","""Boro Zone"""


Nice! The data from `zones_df` has been joined into our location-summary table, and we can see the `"borough"`, `"zone"`, and `"service_zone"` alongside the `"pu_location_id"`! `polars`'s `.join()` works much like the joins you may be familiar with from Pandas and/or SQL:
1. `location_id_summary_df` specifies the **left** table for the join.
2. `.join(...)` starts the join.
3. `zones_df` specifies the **right** table for the join.
4. `left_on` specifies the column to join on from `location_id_summary_df`.
5. `right_on` specifies the column to join on from `zones_df`.

Just to be clear, though, we didn't have to aggregate into `location_id_summary_df` first; we can join directly into the original `march_yellow_rides_df`:

In [21]:
(
    march_yellow_rides_df.select(
        [
            "tpep_pickup_datetime",
            "pu_location_id",
            "do_location_id",
            "trip_distance",
            "total_amount",
        ]
    )
    .join(
        zones_df,
        left_on="pu_location_id",
        right_on="location_id",
    )
    #     .select(["tpep_pickup_datetime", "pu_location_id", "do_location_id", "trip_distance", "total_amount", "borough", "zone"])
    .head()
)

tpep_pickup_datetime,pu_location_id,do_location_id,trip_distance,total_amount,borough,zone,service_zone
datetime[ns],i32,i32,f64,f64,str,str,str
2024-03-01 00:18:51,142,239,1.3,16.3,"""Manhattan""","""Lincoln Square East""","""Yellow Zone"""
2024-03-01 00:26:00,238,24,1.1,15.2,"""Manhattan""","""Upper West Side North""","""Yellow Zone"""
2024-03-01 00:09:22,263,75,0.86,10.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""
2024-03-01 00:33:45,164,162,0.82,14.19,"""Manhattan""","""Midtown South""","""Yellow Zone"""
2024-03-01 00:05:43,263,7,4.9,30.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""


But now, it's a bit strange that we have the `"pu_location_id"` and `"do_location_id"` in the same dataframe, because it's not clear whether the `"borough"` and `"zone"` refer to the `"pu_location_id"` or the `"do_location_id"`. We can quickly fix that:

In [22]:
zone_columns = ["borough", "zone", "service_zone"]
(
    march_yellow_rides_df.select(
        [
            "tpep_pickup_datetime",
            "pu_location_id",
            "do_location_id",
            "trip_distance",
            "total_amount",
        ]
    )
    .join(
        zones_df,
        left_on="pu_location_id",
        right_on="location_id",
    )
    .with_columns([pl.col(zone_columns).name.prefix("pu_")])
    .drop(zone_columns)
    .head()
)

tpep_pickup_datetime,pu_location_id,do_location_id,trip_distance,total_amount,pu_borough,pu_zone,pu_service_zone
datetime[ns],i32,i32,f64,f64,str,str,str
2024-03-01 00:18:51,142,239,1.3,16.3,"""Manhattan""","""Lincoln Square East""","""Yellow Zone"""
2024-03-01 00:26:00,238,24,1.1,15.2,"""Manhattan""","""Upper West Side North""","""Yellow Zone"""
2024-03-01 00:09:22,263,75,0.86,10.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""
2024-03-01 00:33:45,164,162,0.82,14.19,"""Manhattan""","""Midtown South""","""Yellow Zone"""
2024-03-01 00:05:43,263,7,4.9,30.4,"""Manhattan""","""Yorkville West""","""Yellow Zone"""


We can also use the `on` argument when the join column has the same name in both tables. For example, imagine we wanted to join `zones_df` with the table we created in the prior module, of rides that both started and ended in the same location ID; then we don't need to specify pickup vs dropoff:

In [23]:
(
    march_yellow_rides_df.filter(
        pl.col("do_location_id").eq(pl.col("pu_location_id"))
    )
    .select(
        [
            "tpep_pickup_datetime",
            pl.col("pu_location_id").alias("location_id"),
            "trip_distance",
            "total_amount",
        ]
    )
    .join(
        zones_df,
        on="location_id",
    )
    .head()
)

tpep_pickup_datetime,location_id,trip_distance,total_amount,borough,zone,service_zone
datetime[ns],i32,f64,f64,str,str,str
2024-03-01 00:12:31,158,0.65,15.48,"""Manhattan""","""Meatpacking/West Village West""","""Yellow Zone"""
2024-03-01 00:17:01,158,0.96,15.48,"""Manhattan""","""Meatpacking/West Village West""","""Yellow Zone"""
2024-03-01 01:00:25,143,0.06,8.7,"""Manhattan""","""Lincoln Square West""","""Yellow Zone"""
2024-03-01 00:18:13,236,0.4,11.28,"""Manhattan""","""Upper East Side North""","""Yellow Zone"""
2024-03-01 00:07:35,145,2.53,28.08,"""Queens""","""Long Island City/Hunters Point""","""Boro Zone"""


`polars` also supports the usual range of join types: anti joins, semi joins, left joins, right joins, inner joins, outer joins (spelled `how="full"` in recent Polars versions), and cross joins. All you have to do is set the `how` argument. To this end, there's a curious difference between `zones_df` and `march_yellow_rides_df`...

In [24]:
print(
    f"Number of distinct pu_location_id in `march_yellow_rides_df`: {march_yellow_rides_df.select(pl.col('pu_location_id').n_unique())}"
)
print(
    f"Number of distinct location_id in `zones_df`: {zones_df.select(pl.col('location_id').n_unique())}"
)

Number of distinct pu_location_id in `march_yellow_rides_df`: shape: (1, 1)
┌────────────────┐
│ pu_location_id │
│ ---            │
│ u32            │
╞════════════════╡
│ 259            │
└────────────────┘
Number of distinct location_id in `zones_df`: shape: (1, 1)
┌─────────────┐
│ location_id │
│ ---         │
│ u32         │
╞═════════════╡
│ 265         │
└─────────────┘


There are exactly **6** more `location_id`s in `zones_df` than there are `pu_location_id`s in `march_yellow_rides_df`! This makes sense--`zones_df` serves as an index of **all** zones, and there's no guarantee that all zones had at least one ride. What if we wanted to see which zones didn't have any rides that started in them (i.e. `location_id`s with no matching `pu_location_id`)? For that, we can use an anti-join:

In [25]:
(
    zones_df.join(
        march_yellow_rides_df,
        right_on="pu_location_id",
        left_on="location_id",
        how="anti",
    )
)

location_id,borough,zone,service_zone
i64,str,str,str
99,"""Staten Island""","""Freshkills Park""","""Boro Zone"""
103,"""Manhattan""","""Governor's Island/Ellis Island…","""Yellow Zone"""
104,"""Manhattan""","""Governor's Island/Ellis Island…","""Yellow Zone"""
105,"""Manhattan""","""Governor's Island/Ellis Island…","""Yellow Zone"""
110,"""Staten Island""","""Great Kills Park""","""Boro Zone"""
204,"""Staten Island""","""Rossville/Woodrow""","""Boro Zone"""


We can confirm that the result has **6** rows, exactly the difference we saw in our `n_unique()` check.

And if we want to join on something beyond whether or not two columns are equal (i.e. non-equi joins), then we need to use a different join function, `.join_where()`. Consider, for example, that we are organizing a round-robin tournament, and we want to prepare a dataframe of all the match-ups that need to take place, given a dataframe of all the competitors:

In [26]:
competitors_df = pl.DataFrame(
    {
        "player_id": [1, 2, 3, 4],
        "player_name": ["Danny", "Donna", "Lana", "Sauna"],
    }
)

defense_competitors_df = competitors_df.select(
    pl.all().name.prefix("defense_")
)
offense_competitors_df = competitors_df.select(
    pl.all().name.prefix("offense_")
)

display(defense_competitors_df)
display(offense_competitors_df)

defense_player_id,defense_player_name
i64,str
1,"""Danny"""
2,"""Donna"""
3,"""Lana"""
4,"""Sauna"""


offense_player_id,offense_player_name
i64,str
1,"""Danny"""
2,"""Donna"""
3,"""Lana"""
4,"""Sauna"""


We can create a dataframe of all the matches that need to take place by joining `defense_competitors_df` to `offense_competitors_df` wherever the `defense_player_id` is less than the `offense_player_id` (because every matchup only needs to take place once, and players don't need to play themselves):

In [27]:
matchups_df = defense_competitors_df.join_where(
    offense_competitors_df,
    pl.col("defense_player_id").lt(pl.col("offense_player_id")),
)
display(matchups_df)

defense_player_id,defense_player_name,offense_player_id,offense_player_name
i64,str,i64,str
1,"""Danny""",2,"""Donna"""
1,"""Danny""",3,"""Lana"""
1,"""Danny""",4,"""Sauna"""
2,"""Donna""",3,"""Lana"""
2,"""Donna""",4,"""Sauna"""
3,"""Lana""",4,"""Sauna"""


Perfect! There are 6 matches that need to take place in total, as expected.

## 6.2. Concatenating Dataframes

The data we've been working with so far has been the file `"yellow_tripdata_2024-03.parquet"`, which contains all rides given by yellow taxis in NYC in March 2024. Amazingly, New York City records and publishes this data every month, going all the way back to 2009! What if we wanted to analyze two months' datasets together, in the same dataframe? Well, for that, we have `pl.concat()`.

Let's start by loading in February's yellow taxi trip data. It should have the same schema as the data from March, so we can use the same column name mapping dictionary.

In [5]:
yellow_tripdata_2024_02_parquet_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-02.parquet"

yellow_tripdata_2024_02_df = pl.read_parquet(
    yellow_tripdata_2024_02_parquet_url
)

yellow_tripdata_2024_02_df.write_parquet(
    "/data/datasets/data/yellow_tripdata_2024-02.parquet"
)

In [6]:
february_yellow_rides_df = pl.read_parquet(
    "/data/datasets/data/yellow_tripdata_2024-02.parquet"
).rename(yellow_rides_column_rename_mapping)
february_yellow_rides_df.head()

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64
2,2024-02-01 00:04:45,2024-02-01 00:19:58,1,4.39,1,"""N""",68,236,1,20.5,1.0,0.5,1.28,0.0,1.0,26.78,2.5,0.0
2,2024-02-01 00:56:31,2024-02-01 01:10:53,1,7.71,1,"""N""",48,243,1,31.0,1.0,0.5,9.0,0.0,1.0,45.0,2.5,0.0
2,2024-02-01 00:07:50,2024-02-01 00:43:12,2,28.69,2,"""N""",132,261,2,70.0,0.0,0.5,0.0,6.94,1.0,82.69,2.5,1.75
1,2024-02-01 00:01:49,2024-02-01 00:10:47,1,1.1,1,"""N""",161,163,1,9.3,3.5,0.5,2.85,0.0,1.0,17.15,2.5,0.0
1,2024-02-01 00:37:35,2024-02-01 00:51:15,1,2.6,1,"""N""",246,79,2,15.6,3.5,0.5,0.0,0.0,1.0,20.6,2.5,0.0


To combine these two dataframes, we just have to pass them in as a list to `pl.concat()`, and we can see the result:

In [7]:
all_yellow_rides_df = pl.concat(
    [
        february_yellow_rides_df,
        march_yellow_rides_df,
    ]
)
display(all_yellow_rides_df.head())
print(
    f"{february_yellow_rides_df.shape[0]} rides recorded in `february_yellow_rides_df`."
)
print(
    f"{march_yellow_rides_df.shape[0]} rides recorded in `march_yellow_rides_df`."
)
print(
    f"{all_yellow_rides_df.shape[0]} rides recorded in `all_yellow_rides_df`."
)

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64
2,2024-02-01 00:04:45,2024-02-01 00:19:58,1,4.39,1,"""N""",68,236,1,20.5,1.0,0.5,1.28,0.0,1.0,26.78,2.5,0.0
2,2024-02-01 00:56:31,2024-02-01 01:10:53,1,7.71,1,"""N""",48,243,1,31.0,1.0,0.5,9.0,0.0,1.0,45.0,2.5,0.0
2,2024-02-01 00:07:50,2024-02-01 00:43:12,2,28.69,2,"""N""",132,261,2,70.0,0.0,0.5,0.0,6.94,1.0,82.69,2.5,1.75
1,2024-02-01 00:01:49,2024-02-01 00:10:47,1,1.1,1,"""N""",161,163,1,9.3,3.5,0.5,2.85,0.0,1.0,17.15,2.5,0.0
1,2024-02-01 00:37:35,2024-02-01 00:51:15,1,2.6,1,"""N""",246,79,2,15.6,3.5,0.5,0.0,0.0,1.0,20.6,2.5,0.0


3007526 rides recorded in `february_yellow_rides_df`.
3582628 rides recorded in `march_yellow_rides_df`.
6590154 rides recorded in `all_yellow_rides_df`.


And just like that, we have a combined dataframe of all the yellow taxi rides from February and March! Yellow taxis aren't the only type of taxis in New York City, though; for example, there are also green taxis, whose rides the city also records. However, a slightly different taxi type means slightly different data... so how well will the two combine? Let's see...

Let's start by loading in `march_green_rides_df` from file:

In [8]:
green_tripdata_2024_03_parquet_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-03.parquet"

green_tripdata_2024_03_df = pl.read_parquet(green_tripdata_2024_03_parquet_url)

green_tripdata_2024_03_df.write_parquet(
    "/data/datasets/data/green_tripdata_2024-03.parquet"
)

In [9]:
march_green_rides_df = pl.read_parquet(
    "/data/datasets/data/green_tripdata_2024-03.parquet"
)
march_green_rides_df.head()

VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
i32,datetime[ns],datetime[ns],str,i64,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,f64
2,2024-03-01 00:10:52,2024-03-01 00:26:12,"""N""",1,129,226,1,1.72,12.8,1.0,0.5,3.06,0.0,,1.0,18.36,1,1,0.0
2,2024-03-01 00:22:21,2024-03-01 00:35:15,"""N""",1,130,218,1,3.25,17.7,1.0,0.5,0.0,0.0,,1.0,20.2,2,1,0.0
2,2024-03-01 00:45:27,2024-03-01 01:04:32,"""N""",1,255,107,2,4.58,23.3,1.0,0.5,3.5,0.0,,1.0,32.05,1,1,2.75
1,2024-03-01 00:02:00,2024-03-01 00:23:45,"""N""",1,181,71,1,0.0,22.5,0.0,1.5,0.0,0.0,,1.0,24.0,1,1,0.0
2,2024-03-01 00:16:45,2024-03-01 00:23:25,"""N""",1,95,135,1,1.15,8.6,1.0,0.5,1.0,0.0,,1.0,12.1,1,1,0.0


Already, we can see some slight differences between `all_yellow_rides_df` and `march_green_rides_df`. For example, `march_green_rides_df` has 20 columns, while we're accustomed to the 19 columns in `march_yellow_rides_df`. But there are also a lot of similarities. For one, `march_green_rides_df` has the same `datetime` columns (though with slightly different names), and the same TitleCase ID columns. We'll need to conform the column names so that the correct columns match up with each other upon concatenation.

In [10]:
green_rides_df_column_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "lpep_pickup_datetime": "tpep_pickup_datetime",
    "lpep_dropoff_datetime": "tpep_dropoff_datetime",
    # "Airport_fee": "airport_fee",  # Doesn't exist in `green_rides_df`
}
march_green_rides_df = march_green_rides_df.rename(
    green_rides_df_column_mapping
)

Great! Now, if we want to finally combine the yellow taxi dataframes and green taxi dataframe with `pl.concat()`, we just pass them in as a list again:

In [None]:
all_rides_df = pl.concat(
    [
        february_yellow_rides_df,
        march_yellow_rides_df,
        march_green_rides_df,
    ]
)

Even with the renaming, this raises an error! That's because, by default, all of the dataframes must share the same schema. This isn't the case for our dataframes, but we'd still like to concatenate them. Thankfully, `polars` gives us flexible control over this with the `how` argument. By default, `how` is set to `how="vertical"`, restricting the concatenation to a strict vertical stacking of the dataframes. However, we can set `how="diagonal"`, and the dataframes will be concatenated together, with nulls used to fill in the spaces of unshared columns.

First, though, let's add a `source` column to each dataframe, so that we can tell which dataframe each row came from after concatenating:

In [14]:
february_yellow_rides_df = february_yellow_rides_df.with_columns(
    pl.lit("february_yellow").alias("source")
)
march_yellow_rides_df = march_yellow_rides_df.with_columns(
    pl.lit("march_yellow").alias("source")
)
march_green_rides_df = march_green_rides_df.with_columns(
    pl.lit("march_green").alias("source")
)

In [18]:
all_rides_df = pl.concat(
    [
        february_yellow_rides_df,
        march_yellow_rides_df,
        march_green_rides_df,
    ],
    how="diagonal",
)
display(all_rides_df.tail())
print(all_yellow_rides_df.shape[0])

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,source,ehail_fee,trip_type
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,i64
2,2024-03-31 21:19:00,2024-03-31 21:30:00,,1.45,,,25,61,,12.08,0.0,0.0,2.52,0.0,1.0,15.6,,,"""march_green""",,
2,2024-03-31 22:30:00,2024-03-31 22:35:00,,1.13,,,41,42,,12.24,0.0,0.0,0.0,0.0,1.0,13.24,,,"""march_green""",,
2,2024-03-31 22:43:00,2024-03-31 22:48:00,,13062.08,,,223,7,,12.08,0.0,0.0,3.77,0.0,1.0,16.85,,,"""march_green""",,
2,2024-03-31 22:48:00,2024-03-31 23:12:00,,7.96,,,42,249,,40.52,0.0,0.0,8.75,0.0,1.0,53.02,,,"""march_green""",,
2,2024-03-31 22:08:00,2024-03-31 22:47:00,,10.7,,,7,211,,39.35,0.0,0.0,9.91,6.94,1.0,59.95,,,"""march_green""",,


6590154


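Scrolling to the right, we can see `ehail_fee` and `trip_type` appended after `source`. These two columns exist only in the green taxi data, so they are null for every yellow taxi row. Conversely, `airport_fee` is null for the green taxi rows shown above, since `march_green_rides_df` doesn't have that column.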

`pl.concat()` does offer a few more options, such as horizontal concatenation instead of vertical concatenation, but they aren't so useful for us here, so I'll leave them for you to explore in the documentation.

# Conclusion

In this module, we learned how to combine multiple dataframes together, using two new Query Statements, `.join()` and `pl.concat()`.