# 3. Data Manipulation I: Basics

The goal of this module is to become familiar with the Polars data manipulation API and to be able to perform sorting and filtering queries. We'll cover the following three topics:
1. An overview of the `polars` Query API, and its two main components--Column Expressions and Query Statements;
2. An introduction to Column Expressions with `polars.Expr` ("Expr" is short for "Expression");
3. An introduction to some Query Statements beyond just `select`; namely, `filter` and `sort`.

In [3]:
import polars as pl

%run setup.py

File /data/datasets/data/yellow_tripdata_2024-03.parquet already exists, skipping download.


In [1]:
%%writefile setup.py

import requests
import os

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-03.parquet"
local_parquet = "/data/datasets/data/yellow_tripdata_2024-03.parquet"
local_csv = "/data/datasets/data/yellow_tripdata_2024-03.csv"

if not os.path.exists(local_parquet):
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    with open(local_parquet, "wb") as file:
        file.write(response.content)
    print(f"Downloaded {local_parquet}")
else:
    print(f"File {local_parquet} already exists, skipping download.")


Writing setup.py


## 3.0. The Three Types of Tools in the Polars Query API

Although Polars has many tools for plotting, IO, and more, its core is a query language for interacting with dataframes. And just as a spoken language has a grammar, so does Polars; more specifically, there are **three main types of tools in the `polars` query language**:

1. **Column Expressions**: Anything that defines an operation (or chained operations) on a column (or combination of columns), such as adding, multiplying, subtracting, or renaming.
2. **Query Statements**: Anything that defines the retrieval or interaction of data, such as `select`, `filter`, `sort`, `group_by`, `join`, `concat`, etc.
3. **Miscellaneous Dataframe Operations**: Everything else, such as `value_counts` or `transpose`.

Don't be mistaken! There are of course other functions in the `polars` API, such as the `collect` function we learned about in the prior section, and `polars` even has `plot` functionality for generating quick plots. However, when it comes to data manipulation queries, these are the three main ingredients. Every `polars` query you write will be some combination of tools from these three boxes.

Let's get started familiarizing ourselves with the `polars` API by learning a bit more about **Column Expressions**.

## 3.1. Introducing Query Tool \#1 - Column Expressions

In the prior module, we saw how we could load some data from a CSV file and do a basic `select`:

In [6]:
df = pl.read_csv(
    f"{local_csv}",
    schema_overrides={
        "tpep_pickup_datetime": pl.Datetime,
        "tpep_dropoff_datetime": pl.Datetime,
    },
)

In [7]:
(df.select(["tpep_pickup_datetime", "tpep_dropoff_datetime"]).head())

tpep_pickup_datetime,tpep_dropoff_datetime
datetime[μs],datetime[μs]
2024-03-01 00:18:51,2024-03-01 00:23:45
2024-03-01 00:26:00,2024-03-01 00:29:06
2024-03-01 00:09:22,2024-03-01 00:15:24
2024-03-01 00:33:45,2024-03-01 00:39:34
2024-03-01 00:05:43,2024-03-01 00:26:22


Well, what if we wanted to do something more interesting? Let's take the columns `"total_amount"` and `"tip_amount"`, for instance:

In [8]:
(df.select(["total_amount", "tip_amount"]).head())

total_amount,tip_amount
f64,f64
16.3,2.7
15.2,3.0
10.4,0.0
14.19,1.29
30.4,0.0


Now suppose we want to know, for each taxi trip, whether any tip was paid at all. To do this, we have to crack open the **`polars` expression API**.

In [9]:
(
    df.select(
        [
            "total_amount",
            "tip_amount",
            pl.col("tip_amount").gt(0).alias("tip_paid"),
        ]
    ).head()
)

total_amount,tip_amount,tip_paid
f64,f64,bool
16.3,2.7,True
15.2,3.0,True
10.4,0.0,False
14.19,1.29,True
30.4,0.0,False


How did that work, `pl.col("tip_amount").gt(0).alias("tip_paid")`? Let's have a look.

While `polars` allows you to select columns by a simple string, such as `df.select(["total_amount"])`, any time you need to do a computation on a column (i.e. more than just a simple select), you have to instantiate a `polars.Expr` object with `pl.col()`. Let's dissect the expression we just saw, which checks whether any tip was paid:

```python
pl.col("tip_amount").gt(0).alias("tip_paid")
```

1. `pl.col("tip_amount")` creates the column expression object;
2. `.gt(0)` checks if the `"tip_amount"` was **g**reater **t**han 0;
3. `.alias("tip_paid")` renames the column.

Step 3 is important! If we didn't do that, `polars` would just keep the original name, `"tip_amount"`. Since `"tip_amount"` is already selected, that raises a duplicate-name error:

In [None]:
(
    df.select(
        [
            "total_amount",
            "tip_amount",
            pl.col("tip_amount").gt(0),  # .alias("tip_paid")
        ]
    ).head()
)

See? Even the `polars` error message tells us to *"try renaming the columns with `.alias("new_name")` to avoid duplicate column names."*

The Expression API is really the special sauce of `polars`; once you have created a `pl.Expr` object with `pl.col()`, you can do just about any kind of computation you might desire:
- Add or subtract two columns.
- Cast columns to different datatypes.
- Multiply columns by a constant value.
- And much more.

Let's go through some examples.

#### Example 1: Running a boolean check with `eq` - checking for trips without passengers

Taxi rides without passengers? How is that possible? Let's have a look:

In [10]:
(
    df.select(
        [
            "passenger_count",
            pl.col("passenger_count").eq(0).alias("had_zero_passengers"),
        ]
    ).head()
)

passenger_count,had_zero_passengers
i64,bool
0,True
0,True
1,False
1,False
0,True


#### Example 2: Subtracting two columns - measuring trip duration

In [11]:
(
    df.select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            (
                pl.col("tpep_dropoff_datetime")
                - pl.col("tpep_pickup_datetime")
            ).alias("trip_duration"),
        ]
    ).head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration
datetime[μs],datetime[μs],duration[μs]
2024-03-01 00:18:51,2024-03-01 00:23:45,4m 54s
2024-03-01 00:26:00,2024-03-01 00:29:06,3m 6s
2024-03-01 00:09:22,2024-03-01 00:15:24,6m 2s
2024-03-01 00:33:45,2024-03-01 00:39:34,5m 49s
2024-03-01 00:05:43,2024-03-01 00:26:22,20m 39s


That's right--`polars` even has an easy-to-use `pl.Duration` datatype. But more on that later 😉

#### Example 3: Multiplying a column by a constant - `"trip_distance"` in kilometers

If you check the [data schema](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf), you'll see that the trip distance is measured in miles. Let's convert to the more universal kilometers.

In [12]:
kilometers_per_mile = 1.61
(
    df.select(
        [
            pl.col("trip_distance").name.suffix("_miles"),
            (pl.col("trip_distance") * kilometers_per_mile).name.suffix(
                "_kilometers"
            ),
        ]
    ).head()
)

trip_distance_miles,trip_distance_kilometers
f64,f64
1.3,2.093
1.1,1.771
0.86,1.3846
0.82,1.3202
4.9,7.889


As you can see, `alias` isn't the only way to rename a column--there are also a few name manipulation tools made accessible under the `.name` property of `pl.Expr`.

#### Example 4: Maximum and minimum - the shortest and longest trip distance

With just column expressions and `select`, we can already do some basic aggregations, like checking the longest and shortest trip distances in the data:

In [13]:
(
    df.select(
        [
            pl.col("trip_distance").min().name.suffix("_min"),
            pl.col("trip_distance").max().name.suffix("_max"),
        ]
    )
)

trip_distance_min,trip_distance_max
f64,f64
0.0,176836.3


The shortest trip distance was `0` miles, which is certainly possible... but the longest trip distance was `176836` miles? That seems hard to believe! It would be nice to see the other top `trip_distance` values, but for that, we're going to need to jump into **Query Tool \#2: Query Statements**.

## 3.2. Introducing Query Tool \#2: Query Statements

So far, the only query tool we've used is `select`, but there's much more; in this module, we'll expand the Query Statement Toolbox with two more tools: `filter` and `sort`.

### 3.2.1. The `filter` Query Statement

In the previous examples, we saw the surprising occurrence that some taxi trips had zero passengers. But how often did this happen? We can answer this with the kind of simple aggregation we just learned:

In [14]:
(
    df.select(
        [
            pl.col("passenger_count")
            .eq(0)
            .sum()
            .alias("num_trips_zero_passengers"),
            pl.col("passenger_count").count().alias("num_trips"),
            (
                (
                    pl.col("passenger_count").eq(0).sum()
                    / pl.col("passenger_count").count()
                )
                * 100
            ).alias("percentage_rides_zero_passengers"),
            pl.col("passenger_count").mean().alias("avg_passengers"),
        ]
    )
)

num_trips_zero_passengers,num_trips,percentage_rides_zero_passengers,avg_passengers
u32,u32,f64,f64
40372,3156438,1.279037,1.337625


More than 40,000 trips had zero passengers?! That looks like bad data. Thankfully, it's only about 1.3% of the data, so we can get rid of those rows. We do that with `filter`, keeping only the rows where `"passenger_count"` is greater than `0`:

In [15]:
(
    df.filter(pl.col("passenger_count").gt(0))
    .select(["trip_distance", "passenger_count"])
    .head()
)

trip_distance,passenger_count
f64,i64
0.86,1
0.82,1
5.04,1
2.15,1
1.1,1


We can also combine filters with boolean operations. For example, if we want to check the trips that had more than 3 people and a `"trip_distance"` of more than 100 miles...

In [16]:
(
    df.filter(
        pl.col("passenger_count").gt(3) & pl.col("trip_distance").gt(100)
    )
    .select(["trip_distance", "passenger_count"])
    .head()
)

trip_distance,passenger_count
f64,i64
102.85,4
119.02,5
141.43,4


There weren't that many! We can also express the same thing without the `&` operator, using the built-in `pl.Expr.and_()` instead:

In [19]:
(
    df.filter(
        pl.col("passenger_count").gt(3).and_(pl.col("trip_distance").gt(100))
    )
    .select(["trip_distance", "passenger_count"])
    .head()
)

trip_distance,passenger_count
f64,i64
102.85,4
119.02,5
141.43,4


The same behavior exists for `|` and `or_()`. For example, if we want to see the largest tip paid on a ride that either had more than 3 passengers or incurred an airport fee:

In [20]:
(
    df.filter(pl.col("Airport_fee").gt(0).or_(pl.col("passenger_count").gt(3)))
    .select(pl.col("tip_amount").max())
    .head()
)

tip_amount
f64
300.0


That's a high tip!

### 3.2.2. The `sort` Query Statement

Known as `ORDER BY` in SQL, the `sort` statement sorts values in the dataframe, just as its name implies.

In the previous section, we saw that the maximum value of `"trip_distance"` was surprisingly high. Rather than looking at just the highest value of `"trip_distance"`, let's look at a few more of the highest values by using `sort`. It will also be interesting to see how much money was paid for those trips:

In [17]:
(
    df.select([pl.col("trip_distance"), pl.col("total_amount")])
    .sort("trip_distance", descending=True)
    .head(5)
)

trip_distance,total_amount
f64,f64
176836.3,32.74
176744.79,114.68
176329.23,34.51
138097.21,108.4
136660.1,19.89


We use the `descending` argument to sort from highest to lowest. This is not the default, however; by default, `sort` orders from lowest to highest (i.e. `descending=False`):

In [18]:
(
    df.select([pl.col("trip_distance"), pl.col("total_amount")])
    .sort(
        "trip_distance",
        # descending=True is omitted, so the default ascending order applies
    )
    .head(5)
)

trip_distance,total_amount
f64,f64
0.0,24.62
0.0,9.6
0.0,21.5
0.0,99.23
0.0,28.37


We can do more than just sort by a column, though--we can also sort by a combination of columns. For example, what if we don't just want to sort by `"trip_distance"`, nor by `"total_amount"`, but by the `"total_amount"` per mile? Let's have a look:

In [19]:
(
    df.sort(pl.col("total_amount") / pl.col("trip_distance"), descending=True)
    .select(["total_amount", "trip_distance"])
    .head(5)
)

total_amount,trip_distance
f64,f64
0.0,0.0
0.0,0.0
0.0,0.0
0.0,0.0
0.0,0.0


Hmm, why the zeroes? There must be some strangeness in the data. Let's first filter out every row where `"total_amount"` or `"trip_distance"` is less than or equal to 0, and then rerun the query:

In [22]:
(
    df.filter(pl.col("total_amount").gt(0).and_(pl.col("trip_distance").gt(0)))
    .sort(pl.col("total_amount") / pl.col("trip_distance"), descending=True)
    .select(["total_amount", "trip_distance"])
    .head(10)
)

total_amount,trip_distance
f64,f64
331.0,0.01
229.2,0.01
201.0,0.01
163.2,0.01
324.0,0.02
160.55,0.01
149.94,0.01
145.2,0.01
283.2,0.02
141.0,0.01


To display the ratio itself, we can add it as a new column using `with_columns`:

In [21]:
(
    df.filter(pl.col("total_amount").gt(0).and_(pl.col("trip_distance").gt(0)))
    .with_columns(
        (pl.col("total_amount") / pl.col("trip_distance")).alias(
            "total_per_distance"
        )
    )
    .sort("total_per_distance", descending=True)
    .select(["total_amount", "trip_distance", "total_per_distance"])
    .head()
)

total_amount,trip_distance,total_per_distance
f64,f64,f64
331.0,0.01,33100.0
229.2,0.01,22920.0
201.0,0.01,20100.0
163.2,0.01,16320.0
324.0,0.02,16200.0


Wow! Trips of just `0.01` miles that cost as much as `$331.00`? Only in New York City 🤪

## Conclusion

In this module, we got started learning how to work with the `polars.Expr` API, and built on our querying skills with `sort()` and `filter()`.