# 7. Data Manipulation V - Working With Datatypes

The goal of this module is to understand all the different datatypes in `polars`, and learn to construct datatype-specific column expressions that leverage those different datatypes.

But first we import `polars`...

In [None]:
%pip install -U polars

In [1]:
import polars as pl

%run setup.py

/data/datasets/data/yellow_tripdata_2024-03.parquet already exists
/data/datasets/data/taxi_zone_lookup.csv already exists


... and load the rides data, already joining with the `zones_df` dataframe as in the previous module.

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = pl.read_parquet(taxi_zone_lookup_parquet).rename(
    zone_column_rename_mapping
)

This time, we'll join `zones_df` into the rides dataframe twice--once for `pu_location_id`, and once for `do_location_id`.

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

zone_df_columns = [
    "borough",
    "zone",
    "service_zone",
]

rides_df = (
    pl.read_parquet(local_parquet)
    .rename(yellow_rides_column_rename_mapping)
    .join(zones_df, left_on="pu_location_id", right_on="location_id")
    .rename(
        {
            zone_df_column: f"pu_{zone_df_column}"
            for zone_df_column in zone_df_columns
        }
    )
    .join(zones_df, left_on="do_location_id", right_on="location_id")
    .rename(
        {
            zone_df_column: f"do_{zone_df_column}"
            for zone_df_column in zone_df_columns
        }
    )
)

rides_df.head()

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pu_borough,pu_zone,pu_service_zone,do_borough,do_zone,do_service_zone
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str
1,2024-03-01 00:18:51,2024-03-01 00:23:45,0,1.3,1,"""N""",142,239,1,8.6,3.5,0.5,2.7,0.0,1.0,16.3,2.5,0.0,"""Manhattan""","""Lincoln Square East""","""Yellow Zone""","""Manhattan""","""Upper West Side South""","""Yellow Zone"""
1,2024-03-01 00:26:00,2024-03-01 00:29:06,0,1.1,1,"""N""",238,24,1,7.2,3.5,0.5,3.0,0.0,1.0,15.2,2.5,0.0,"""Manhattan""","""Upper West Side North""","""Yellow Zone""","""Manhattan""","""Bloomingdale""","""Yellow Zone"""
2,2024-03-01 00:09:22,2024-03-01 00:15:24,1,0.86,1,"""N""",263,75,2,7.9,1.0,0.5,0.0,0.0,1.0,10.4,0.0,0.0,"""Manhattan""","""Yorkville West""","""Yellow Zone""","""Manhattan""","""East Harlem South""","""Boro Zone"""
2,2024-03-01 00:33:45,2024-03-01 00:39:34,1,0.82,1,"""N""",164,162,1,7.9,1.0,0.5,1.29,0.0,1.0,14.19,2.5,0.0,"""Manhattan""","""Midtown South""","""Yellow Zone""","""Manhattan""","""Midtown East""","""Yellow Zone"""
1,2024-03-01 00:05:43,2024-03-01 00:26:22,0,4.9,1,"""N""",263,7,2,25.4,3.5,0.5,0.0,0.0,1.0,30.4,2.5,0.0,"""Manhattan""","""Yorkville West""","""Yellow Zone""","""Queens""","""Astoria""","""Boro Zone"""


Let's go!

## 5.1. Datatypes - an Overview

In the previous modules, we've seen how `polars` prioritizes clarity and organization of datatypes, even showing each column's datatype beneath its name whenever a dataframe is displayed:

In [4]:
rides_df.head(0)

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pu_borough,pu_zone,pu_service_zone,do_borough,do_zone,do_service_zone
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str


We also saw in an earlier module that we can see a dataframe's datatypes by checking its schema:

In [5]:
rides_df.schema

Schema([('vendor_id', Int32),
        ('tpep_pickup_datetime', Datetime(time_unit='ns', time_zone=None)),
        ('tpep_dropoff_datetime', Datetime(time_unit='ns', time_zone=None)),
        ('passenger_count', Int64),
        ('trip_distance', Float64),
        ('ratecode_id', Int64),
        ('store_and_fwd_flag', String),
        ('pu_location_id', Int32),
        ('do_location_id', Int32),
        ('payment_type', Int64),
        ('fare_amount', Float64),
        ('extra', Float64),
        ('mta_tax', Float64),
        ('tip_amount', Float64),
        ('tolls_amount', Float64),
        ('improvement_surcharge', Float64),
        ('total_amount', Float64),
        ('congestion_surcharge', Float64),
        ('airport_fee', Float64),
        ('pu_borough', String),
        ('pu_zone', String),
        ('pu_service_zone', String),
        ('do_borough', String),
        ('do_zone', String),
        ('do_service_zone', String)])

In our dataframe alone, we have a few different datatypes:
- `pl.Int32`, `pl.Int64`
- `pl.Float64`
- `pl.Datetime`
- `pl.String`

But `polars` has so much more than this. Taking from [the docs](https://docs.pola.rs/py-polars/html/reference/datatypes.html), `polars` has numeric, temporal, nested, string, and other datatypes:

**Numeric**
- `pl.Decimal`: Decimal 128-bit type with an optional precision and non-negative scale.
- `pl.Float32`: 32-bit floating point type.
- `pl.Float64`: 64-bit floating point type.
- `pl.Int8`: 8-bit signed integer type.
- `pl.Int16`: 16-bit signed integer type.
- `pl.Int32`: 32-bit signed integer type.
- `pl.Int64`: 64-bit signed integer type.
- `pl.UInt8`: 8-bit unsigned integer type.
- `pl.UInt16`: 16-bit unsigned integer type.
- `pl.UInt32`: 32-bit unsigned integer type.
- `pl.UInt64`: 64-bit unsigned integer type.

**Temporal**
- `pl.Date`: Data type representing a calendar date.
- `pl.Datetime`: Data type representing a calendar date and time of day.
- `pl.Duration`: Data type representing a time duration.
- `pl.Time`: Data type representing the time of day.

**Nested**
- `pl.Array(inner[, shape, width])`: Fixed length list type.
- `pl.List(inner)`: Variable length list type.
- `pl.Struct(fields)`: Struct composite type.

**String**
- `pl.String`: UTF-8 encoded string type.
- `pl.Categorical`: A categorical encoding of a set of strings.
- `pl.Enum`: A fixed set categorical encoding of a set of strings.
- `pl.Utf8`: Alias of String.

**Other**
- `pl.Binary`: Binary type.
- `pl.Boolean`: Boolean type.
- `pl.Null`: Data type representing null values.
- `pl.Object`: Data type for wrapping arbitrary Python objects.
- `pl.Unknown`: Type representing DataType values that could not be determined statically.

But what does this really all mean for us, as the end users?

## 5.2. Working with `pl.String`

Up until now, we've covered mostly basic operations on columns, such as `eq()`, `count()`, `max()` etc, which work across all datatypes. However, there are many datatype-specific operations that can be done as well. To access those datatype-specific operations, we do so through that datatype's **namespace**.

For example, let's say that, as part of our data cleaning process, we want to confirm that every ride with an `airport_fee` actually involved a pickup or dropoff at an airport. A first step is to flag airport pickups, and for that we need a `pl.String`-specific operation, `.str.contains()`:

#### Example 1: Substring containment check.

In [6]:
(
    rides_df.select(
        ["tpep_pickup_datetime", "total_amount", "airport_fee", "pu_zone"]
    )
    .with_columns(
        pl.col("pu_zone").str.contains("Airport").alias("is_airport_pickup")
    )
    .filter("is_airport_pickup")
    .head()
)

tpep_pickup_datetime,total_amount,airport_fee,pu_zone,is_airport_pickup
datetime[ns],f64,f64,str,bool
2024-03-01 00:02:47,34.74,1.75,"""JFK Airport""",True
2024-03-01 00:33:43,91.0,1.75,"""JFK Airport""",True
2024-03-01 00:24:10,95.85,1.75,"""JFK Airport""",True
2024-03-01 00:06:35,80.19,1.75,"""JFK Airport""",True
2024-03-01 00:06:56,41.55,1.75,"""JFK Airport""",True


Nice! To access the string datatype's special operations, we just prefix the operation with the namespace keyword `str`, as in `pl.Expr.str.contains()`. Note that `.str.contains()`, like many other functions in the `str` namespace, interprets its pattern as a regular expression by default; pass `literal=True` to match the pattern as plain text instead.

Let's see what else can be done with `pl.String`s in `polars`:

#### Example 2: Splitting strings.

We can split strings with `pl.Expr.str.split()`:

In [7]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            pl.col("pu_service_zone")
            .str.split(by=" ")
            .name.suffix("_splitted"),
        ]
    ).head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_splitted
datetime[ns],str,list[str]
2024-03-01 00:18:51,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:26:00,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:09:22,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:33:45,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:05:43,"""Yellow Zone""","[""Yellow"", ""Zone""]"


The result is a `pl.List` datatype! We'll get more into that in a moment ;)

#### Example 3: Measuring string lengths.

In [None]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            pl.col("pu_service_zone")
            .str.len_chars()
            .name.suffix("_str_length"),
        ]
    ).head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_str_length
datetime[ns],str,u32
2024-03-01 00:18:51,"""Yellow Zone""",11
2024-03-01 00:26:00,"""Yellow Zone""",11
2024-03-01 00:09:22,"""Yellow Zone""",11
2024-03-01 00:33:45,"""Yellow Zone""",11
2024-03-01 00:05:43,"""Yellow Zone""",11


Notice that the resultant column `pu_service_zone_str_length` takes on the datatype `pl.UInt32`. It's an unsigned int because length can never be negative!

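#### Example 4: Converting to upper and lowercase, replacing values

We can also conveniently convert strings to upper or lower case in `polars`, or replace substrings (`.str.replace()` replaces only the first match in each string, which is enough here because each service zone contains at most one space):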

In [9]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            pl.col("pu_service_zone")
            .str.to_uppercase()
            .name.suffix("_uppercase"),
            pl.col("pu_service_zone")
            .str.to_lowercase()
            .name.suffix("_lowercase"),
            pl.col("pu_service_zone")
            .str.to_lowercase()
            .str.replace(" ", "_")
            .name.suffix("_lowercase_wo_space"),
        ]
    ).head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_uppercase,pu_service_zone_lowercase,pu_service_zone_lowercase_wo_space
datetime[ns],str,str,str,str
2024-03-01 00:18:51,"""Yellow Zone""","""YELLOW ZONE""","""yellow zone""","""yellow_zone"""
2024-03-01 00:26:00,"""Yellow Zone""","""YELLOW ZONE""","""yellow zone""","""yellow_zone"""
2024-03-01 00:09:22,"""Yellow Zone""","""YELLOW ZONE""","""yellow zone""","""yellow_zone"""
2024-03-01 00:33:45,"""Yellow Zone""","""YELLOW ZONE""","""yellow zone""","""yellow_zone"""
2024-03-01 00:05:43,"""Yellow Zone""","""YELLOW ZONE""","""yellow zone""","""yellow_zone"""


There are many more functions in the `str` namespace, but that's all we'll cover for now!

## 5.3. Working with `pl.List`

In the previous section, we saw that splitting a `pl.String` column produced a `pl.List` column as a result:

In [10]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            pl.col("pu_service_zone")
            .str.split(by=" ")
            .name.suffix("_splitted"),
        ]
    ).head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_splitted
datetime[ns],str,list[str]
2024-03-01 00:18:51,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:26:00,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:09:22,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:33:45,"""Yellow Zone""","[""Yellow"", ""Zone""]"
2024-03-01 00:05:43,"""Yellow Zone""","[""Yellow"", ""Zone""]"


Let's see what we can do with this new column!

#### Example 1: List length

From the brief `.head()` just above, it seems like every ride starts in a `pu_service_zone` that is exactly two words long (i.e. "Yellow" "Zone"). Does that hold across all ~3.6 million rows, though? Let's check with `pl.Expr.list.len()` inside a `group_by`.

In [11]:
(
    rides_df.group_by(
        pl.col("pu_service_zone")
        .str.split(by=" ")
        .list.len()
        .alias("pu_service_zone_num_words")
    )
    .agg(pl.len())
    .head()
)

pu_service_zone_num_words,len
u32,u32
1,279161
2,3303467


That's a surprise! Let's take a closer look by just grouping by the names directly:

In [None]:
(
    rides_df.group_by(pl.col("pu_service_zone"))
    .agg(pl.len())
    .head()
)

pu_service_zone,len
str,u32
"""EWR""",372
"""Boro Zone""",184029
"""Airports""",265141
"""Yellow Zone""",3119438
"""N/A""",13648


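That explains it: the one-word service zones (`EWR`, `Airports`, and `N/A`) together account for exactly the 279,161 rides with a list length of `1` (372 + 265,141 + 13,648).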

#### Example 2: Reversing a list

We can reverse a list easily with `.list.reverse()`:

In [14]:
(
    rides_df.with_columns(
        pl.col("pu_service_zone").str.split(by=" ").name.suffix("_splitted"),
    )
    .select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            pl.col("pu_service_zone_splitted")
            .list.reverse()
            .name.suffix("_reversed"),
        ]
    )
    .head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_splitted_reversed
datetime[ns],str,list[str]
2024-03-01 00:18:51,"""Yellow Zone""","[""Zone"", ""Yellow""]"
2024-03-01 00:26:00,"""Yellow Zone""","[""Zone"", ""Yellow""]"
2024-03-01 00:09:22,"""Yellow Zone""","[""Zone"", ""Yellow""]"
2024-03-01 00:33:45,"""Yellow Zone""","[""Zone"", ""Yellow""]"
2024-03-01 00:05:43,"""Yellow Zone""","[""Zone"", ""Yellow""]"


#### Example 3: Taking elements from each list.

If we want to take just the first element of each list, we can do so with `.list.first()`.

In [15]:
(
    rides_df.with_columns(
        pl.col("pu_service_zone").str.split(by=" ").name.suffix("_splitted"),
    )
    .select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            "pu_service_zone_splitted",
            pl.col("pu_service_zone_splitted")
            .list.first()
            .name.suffix("_first_element"),
        ]
    )
    .head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_splitted,pu_service_zone_splitted_first_element
datetime[ns],str,list[str],str
2024-03-01 00:18:51,"""Yellow Zone""","[""Yellow"", ""Zone""]","""Yellow"""
2024-03-01 00:26:00,"""Yellow Zone""","[""Yellow"", ""Zone""]","""Yellow"""
2024-03-01 00:09:22,"""Yellow Zone""","[""Yellow"", ""Zone""]","""Yellow"""
2024-03-01 00:33:45,"""Yellow Zone""","[""Yellow"", ""Zone""]","""Yellow"""
2024-03-01 00:05:43,"""Yellow Zone""","[""Yellow"", ""Zone""]","""Yellow"""


We can also take the last element of each list with `.list.last()`, an element of any index with `.list.get()`, or a sublist with `.list.gather()`:

In [30]:
(
    rides_df.with_columns(
        pl.col("pu_service_zone").str.split(by=" ").name.suffix("_splitted"),
    )
    .select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            "pu_service_zone_splitted",
            pl.col("pu_service_zone_splitted")
            .list.gather(0)
            .name.suffix("_first_as_sublist"),
        ]
    )
    .head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_splitted,pu_service_zone_splitted_first_as_sublist
datetime[ns],str,list[str],list[str]
2024-03-01 00:18:51,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""Yellow""]"
2024-03-01 00:26:00,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""Yellow""]"
2024-03-01 00:09:22,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""Yellow""]"
2024-03-01 00:33:45,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""Yellow""]"
2024-03-01 00:05:43,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""Yellow""]"


#### Example 4: Operating on each element in the list.

Finally, if we want to operate on every element of each list, we can use `.list.eval()` along with the `pl.element()` helper, which works much like `pl.col()` but refers to the elements within each list. For example, if we want to reverse each string within the list:

In [31]:
(
    rides_df.with_columns(
        pl.col("pu_service_zone").str.split(by=" ").name.suffix("_splitted"),
    )
    .select(
        [
            "tpep_pickup_datetime",
            "pu_service_zone",
            "pu_service_zone_splitted",
            pl.col("pu_service_zone_splitted")
            .list.eval(pl.element().str.reverse())
            .alias("pu_service_zone_reversed_strings"),
        ]
    )
    .head()
)

tpep_pickup_datetime,pu_service_zone,pu_service_zone_splitted,pu_service_zone_reversed_strings
datetime[ns],str,list[str],list[str]
2024-03-01 00:18:51,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""wolleY"", ""enoZ""]"
2024-03-01 00:26:00,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""wolleY"", ""enoZ""]"
2024-03-01 00:09:22,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""wolleY"", ""enoZ""]"
2024-03-01 00:33:45,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""wolleY"", ""enoZ""]"
2024-03-01 00:05:43,"""Yellow Zone""","[""Yellow"", ""Zone""]","[""wolleY"", ""enoZ""]"


And that's all for now! Let's move on to one final group of datatypes: the temporal datatypes.

## 5.4. Working with Temporal Datatypes

One of the nicest things about `polars` is the multitude of functionality that it offers for temporal datatypes. We'll cover a few instances of the basic functionality here.

At the beginning of this course, we showed an example of subtracting the pickup and dropoff columns to create a duration column:

In [32]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            (
                pl.col("tpep_dropoff_datetime")
                - pl.col("tpep_pickup_datetime")
            ).alias("trip_duration"),
        ]
    ).head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration
datetime[ns],datetime[ns],duration[ns]
2024-03-01 00:18:51,2024-03-01 00:23:45,4m 54s
2024-03-01 00:26:00,2024-03-01 00:29:06,3m 6s
2024-03-01 00:09:22,2024-03-01 00:15:24,6m 2s
2024-03-01 00:33:45,2024-03-01 00:39:34,5m 49s
2024-03-01 00:05:43,2024-03-01 00:26:22,20m 39s


So now we have `pl.Datetime` and `pl.Duration` columns in one dataframe! To add the other two temporal datatypes, `pl.Date` and `pl.Time`, we just need extraction operations from the `.dt` namespace:

In [33]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            (
                pl.col("tpep_dropoff_datetime")
                - pl.col("tpep_pickup_datetime")
            ).alias("trip_duration"),
            pl.col("tpep_pickup_datetime").dt.date().alias("tpep_pickup_date"),
            pl.col("tpep_pickup_datetime").dt.time().alias("tpep_pickup_time"),
        ]
    ).head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration,tpep_pickup_date,tpep_pickup_time
datetime[ns],datetime[ns],duration[ns],date,time
2024-03-01 00:18:51,2024-03-01 00:23:45,4m 54s,2024-03-01,00:18:51
2024-03-01 00:26:00,2024-03-01 00:29:06,3m 6s,2024-03-01,00:26:00
2024-03-01 00:09:22,2024-03-01 00:15:24,6m 2s,2024-03-01,00:09:22
2024-03-01 00:33:45,2024-03-01 00:39:34,5m 49s,2024-03-01,00:33:45
2024-03-01 00:05:43,2024-03-01 00:26:22,20m 39s,2024-03-01,00:05:43


That was easy! Let's see what else can be done with these datatypes...

#### Example 1: Comparing two datetimes - filtering impossible rides

Earlier in this course, we took a look at some rides with impossible values, i.e. rides that had zero passengers. Here let's look at a similar case--rides that have the dropoff before pickup. `polars` enables us to do that with a simple column comparison query:

In [48]:
(
    rides_df.filter(
        pl.col("tpep_dropoff_datetime").lt(pl.col("tpep_pickup_datetime"))
    ).head()
)

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pu_borough,pu_zone,pu_service_zone,do_borough,do_zone,do_service_zone
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str
1,2024-03-02 11:30:00,2024-03-02 11:28:04,1,1.1,99,"""N""",75,74,1,17.5,0.0,0.5,0.0,0.0,1.0,19.0,0.0,0.0,"""Manhattan""","""East Harlem South""","""Boro Zone""","""Manhattan""","""East Harlem North""","""Boro Zone"""
1,2024-03-12 18:00:00,2024-03-12 17:50:43,1,0.2,99,"""N""",39,39,1,14.5,0.0,0.5,0.0,0.0,1.0,16.0,0.0,0.0,"""Brooklyn""","""Canarsie""","""Boro Zone""","""Brooklyn""","""Canarsie""","""Boro Zone"""
1,2024-03-13 11:15:00,2024-03-13 11:09:33,1,0.3,99,"""N""",95,196,1,15.5,0.0,0.5,0.0,0.0,1.0,17.0,0.0,0.0,"""Queens""","""Forest Hills""","""Boro Zone""","""Queens""","""Rego Park""","""Boro Zone"""
1,2024-03-13 12:00:00,2024-03-13 11:46:19,1,3.2,99,"""N""",95,216,1,22.5,0.0,0.5,0.0,0.0,1.0,24.0,0.0,0.0,"""Queens""","""Forest Hills""","""Boro Zone""","""Queens""","""South Ozone Park""","""Boro Zone"""
1,2024-03-21 13:15:20,2024-03-21 13:15:04,1,17.1,2,"""N""",161,132,2,70.0,2.5,0.5,0.0,6.94,1.0,80.94,2.5,0.0,"""Manhattan""","""Midtown Center""","""Yellow Zone""","""Queens""","""JFK Airport""","""Airports"""


Well, there they are, the impossible rides! Let's quickly quantify exactly how big the problem is:

In [49]:
(
    rides_df.select(
        pl.col("tpep_dropoff_datetime")
        .lt(pl.col("tpep_pickup_datetime"))
        .mean()
        .alias("fraction_rides_with_do_before_pu")
    )
)

fraction_rides_with_do_before_pu
f64
3.3e-05


That's a small problem: only about 0.0033% of rides (roughly 1 in 30,000). When it comes time to train a machine learning model, we'll remove these.

#### Example 2: Subtracting two `pl.Datetime`s to get a `pl.Duration`, and extracting information

As we saw above, we can subtract two datetimes from one another to get a duration.

In [50]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            (
                pl.col("tpep_dropoff_datetime")
                - pl.col("tpep_pickup_datetime")
            ).alias("trip_duration"),
        ]
    ).head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration
datetime[ns],datetime[ns],duration[ns]
2024-03-01 00:18:51,2024-03-01 00:23:45,4m 54s
2024-03-01 00:26:00,2024-03-01 00:29:06,3m 6s
2024-03-01 00:09:22,2024-03-01 00:15:24,6m 2s
2024-03-01 00:33:45,2024-03-01 00:39:34,5m 49s
2024-03-01 00:05:43,2024-03-01 00:26:22,20m 39s


But what can we do with that `pl.Duration` object once we have it? What if we want to know e.g. the duration not as a `pl.Duration`, but as a number of some unit of time? Well, that's quite straightforward using functions readily available in the `dt` namespace:

In [52]:
(
    rides_df.with_columns(
        [
            (
                pl.col("tpep_dropoff_datetime")
                - pl.col("tpep_pickup_datetime")
            ).alias("trip_duration")
        ]
    )
    .select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            "trip_duration",
            pl.col("trip_duration")
            .dt.total_minutes()
            .name.suffix("_in_minutes"),
            pl.col("trip_duration")
            .dt.total_seconds()
            .name.suffix("_in_seconds"),
            pl.col("trip_duration")
            .dt.total_milliseconds()
            .name.suffix("_in_milliseconds"),
        ]
    )
    .head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration,trip_duration_in_minutes,trip_duration_in_seconds,trip_duration_in_milliseconds
datetime[ns],datetime[ns],duration[ns],i64,i64,i64
2024-03-01 00:18:51,2024-03-01 00:23:45,4m 54s,4,294,294000
2024-03-01 00:26:00,2024-03-01 00:29:06,3m 6s,3,186,186000
2024-03-01 00:09:22,2024-03-01 00:15:24,6m 2s,6,362,362000
2024-03-01 00:33:45,2024-03-01 00:39:34,5m 49s,5,349,349000
2024-03-01 00:05:43,2024-03-01 00:26:22,20m 39s,20,1239,1239000


That lets us work with integer values rather than durations, if we prefer.

#### Example 3: Extracting information from `pl.Datetime`s - checking for and understanding overnight taxi rides

Just as we could extract information from a `pl.Duration` column, we can extract information from a `pl.Datetime` column in the form of different datatypes. For example, we can get a `pl.Date`:

In [53]:
(
    rides_df.select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            pl.col("tpep_pickup_datetime").dt.date().alias("tpep_pickup_date"),
            pl.col("tpep_dropoff_datetime")
            .dt.date()
            .alias("tpep_dropoff_date"),
        ]
    ).head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,tpep_pickup_date,tpep_dropoff_date
datetime[ns],datetime[ns],date,date
2024-03-01 00:18:51,2024-03-01 00:23:45,2024-03-01,2024-03-01
2024-03-01 00:26:00,2024-03-01 00:29:06,2024-03-01,2024-03-01
2024-03-01 00:09:22,2024-03-01 00:15:24,2024-03-01,2024-03-01
2024-03-01 00:33:45,2024-03-01 00:39:34,2024-03-01,2024-03-01
2024-03-01 00:05:43,2024-03-01 00:26:22,2024-03-01,2024-03-01


Or with this, we can check for overnight rides (i.e. rides that had a `"tpep_dropoff_datetime"` with a `.dt.date()` that was after the `.dt.date()` of `"tpep_pickup_datetime"`):

In [54]:
(
    rides_df.filter(
        pl.col("tpep_pickup_datetime")
        .dt.date()
        .lt(pl.col("tpep_dropoff_datetime").dt.date())
    ).head()
)

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pu_borough,pu_zone,pu_service_zone,do_borough,do_zone,do_service_zone
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str
2,2024-02-29 23:59:33,2024-03-01 00:18:39,2,3.43,1,"""N""",68,148,1,19.8,1.0,0.5,3.0,0.0,1.0,27.8,2.5,0.0,"""Manhattan""","""East Chelsea""","""Yellow Zone""","""Manhattan""","""Lower East Side""","""Yellow Zone"""
2,2024-02-29 23:59:13,2024-03-01 00:13:55,1,8.92,1,"""N""",132,39,1,34.5,1.0,0.5,0.0,0.0,1.0,38.75,0.0,1.75,"""Queens""","""JFK Airport""","""Airports""","""Brooklyn""","""Canarsie""","""Boro Zone"""
2,2024-02-29 23:55:56,2024-03-01 00:05:45,1,1.16,1,"""N""",79,144,1,10.0,1.0,0.5,3.0,0.0,1.0,18.0,2.5,0.0,"""Manhattan""","""East Village""","""Yellow Zone""","""Manhattan""","""Little Italy/NoLiTa""","""Yellow Zone"""
2,2024-02-29 23:59:57,2024-03-01 00:05:54,1,0.91,1,"""N""",79,234,2,7.9,1.0,0.5,0.0,0.0,1.0,12.9,2.5,0.0,"""Manhattan""","""East Village""","""Yellow Zone""","""Manhattan""","""Union Sq""","""Yellow Zone"""
2,2024-02-29 23:52:21,2024-03-01 00:02:16,1,2.03,1,"""N""",161,137,1,12.1,1.0,0.5,3.42,0.0,1.0,20.52,2.5,0.0,"""Manhattan""","""Midtown Center""","""Yellow Zone""","""Manhattan""","""Kips Bay""","""Yellow Zone"""


And if we want to know which days of the week overnight rides usually happen on, especially in comparison with non-overnight rides, we can check by extracting the day of the week with `.dt.weekday()`:

In [55]:
is_overnight_by_dow_pivot = (
    rides_df.with_columns(
        [
            pl.col("tpep_pickup_datetime")
            .dt.date()
            .lt(pl.col("tpep_dropoff_datetime").dt.date())
            .alias("is_overnight_ride"),
            pl.col("tpep_pickup_datetime")
            .dt.weekday()
            .name.suffix("_day_of_week"),
        ]
    )
    .pivot(
        on="is_overnight_ride",
        index="tpep_pickup_datetime_day_of_week",
        values="tpep_pickup_datetime",
        aggregate_function="len",
        sort_columns=True,
    )
    .with_columns(
        [  # Normalize columns to sum to `1`.
            pl.col(col) / pl.col(col).sum() for col in ["false", "true"]
        ]
    )
    .sort("tpep_pickup_datetime_day_of_week")
)
display(is_overnight_by_dow_pivot)

tpep_pickup_datetime_day_of_week,false,true
i8,f64,f64
1,0.106651,0.074921
2,0.125901,0.074032
3,0.138011,0.092074
4,0.145888,0.147397
5,0.16737,0.252398
6,0.17467,0.264435
7,0.14151,0.094743


In the table above, each column sums to `1`, `day_of_week=1` corresponds to Monday, and `day_of_week=7` corresponds to Sunday. Both non-overnight and overnight rides peak on Friday and Saturday, but the peak is much more pronounced for overnight rides: `26%` of all overnight rides start on a Saturday, compared to `17%` of non-overnight rides.

## 5.5. Adding Constant Values to the Dataframe

What if we wanted to add a column to our dataframe that has a constant value? For example, it might be nice to add a column that records the date on which the data was ingested; this is a particularly good idea if we ever want to upload this data into a shared table as NYC publishes rides data for new months.

`polars` offers us a way to do that with the `pl.lit()` function ("lit" here meaning "literal"):

In [56]:
import datetime as dt

(
    rides_df.with_columns(
        [pl.lit(dt.datetime.now(tz=None)).alias("data_ingested_at_datetime")]
    )
    .select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            "data_ingested_at_datetime",
        ]
    )
    .head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,data_ingested_at_datetime
datetime[ns],datetime[ns],datetime[μs]
2024-03-01 00:18:51,2024-03-01 00:23:45,2025-01-05 16:20:54.905462
2024-03-01 00:26:00,2024-03-01 00:29:06,2025-01-05 16:20:54.905462
2024-03-01 00:09:22,2024-03-01 00:15:24,2025-01-05 16:20:54.905462
2024-03-01 00:33:45,2024-03-01 00:39:34,2025-01-05 16:20:54.905462
2024-03-01 00:05:43,2024-03-01 00:26:22,2025-01-05 16:20:54.905462


That's right: we can create `pl.Datetime` (or `pl.Date`) columns simply by passing objects from Python's built-in `datetime` module to `pl.lit()`. We can also do this with `pl.date` (or `pl.datetime`):

In [59]:
(
    rides_df.with_columns([pl.date(2024, 6, 1).alias("data_ingested_at_date")])
    .select(
        [
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            "data_ingested_at_date",
        ]
    )
    .head()
)

tpep_pickup_datetime,tpep_dropoff_datetime,data_ingested_at_date
datetime[ns],datetime[ns],date
2024-03-01 00:18:51,2024-03-01 00:23:45,2024-06-01
2024-03-01 00:26:00,2024-03-01 00:29:06,2024-06-01
2024-03-01 00:09:22,2024-03-01 00:15:24,2024-06-01
2024-03-01 00:33:45,2024-03-01 00:39:34,2024-06-01
2024-03-01 00:05:43,2024-03-01 00:26:22,2024-06-01


And the same approach works for every datatype we've seen so far:

In [60]:
(
    pl.DataFrame({"a": [1, 2, 3]}).with_columns(
        [
            pl.lit("b").alias("string_lit"),
            pl.lit(["mama", "dada"]).alias("list_lit"),
            pl.lit(5).alias("int_lit"),
            pl.lit(dt.date(2024, 6, 2)).alias("date_lit"),
        ]
    )
)

a,string_lit,list_lit,int_lit,date_lit
i64,str,list[str],i32,date
1,"""b""","[""mama"", ""dada""]",5,2024-06-02
2,"""b""","[""mama"", ""dada""]",5,2024-06-02
3,"""b""","[""mama"", ""dada""]",5,2024-06-02


# Conclusion

In this module, we learned how to work with different datatypes in `polars`, and how to create new columns using `pl.lit()`.