# 02. Getting Started

The goal of this module is to create `polars` dataframes in memory and by loading data; become familiar with `polars`'s lazy-mode versus in-memory mode; understand how to leverage `polars`'s query optimization; and become familiar with how to generate some basic summary statistics about a dataframe.

If you had any issues installing the environment, please write me directly: benfeifke@gmail.com.

Note: to set up your environment, head over to the GitHub page for this course: https://github.com/bfeif/polars-for-data-science-oreilly-course.

## 2.0. Imports

Import `polars`.

In [1]:
import polars as pl

## 2.1. Creating a `pl.DataFrame`

The main entrypoint to any work we'll do in `polars` is the dataframe. If you're familiar with creating a dataframe in `pandas`, creating a dataframe in `polars` is very similar:

In [2]:
scratch_df = pl.DataFrame({
    "customer_id": [1, 2, 3,],
    "first_name": ["dan", "stan", "jan",],
    "last_name": ["hanson", "flanson", "ransom",],
})

In a Jupyter Notebook, we can easily display the dataframe:

In [3]:
display(scratch_df)

| customer_id | first_name | last_name |
| --- | --- | --- |
| i64 | str | str |
| 1 | "dan" | "hanson" |
| 2 | "stan" | "flanson" |
| 3 | "jan" | "ransom" |


Alternatively, `polars` offers a clean, console-style output when we print the dataframe:

In [4]:
print(scratch_df)

shape: (3, 3)
┌─────────────┬────────────┬───────────┐
│ customer_id ┆ first_name ┆ last_name │
│ ---         ┆ ---        ┆ ---       │
│ i64         ┆ str        ┆ str       │
╞═════════════╪════════════╪═══════════╡
│ 1           ┆ dan        ┆ hanson    │
│ 2           ┆ stan       ┆ flanson   │
│ 3           ┆ jan        ┆ ransom    │
└─────────────┴────────────┴───────────┘


We already see some cool things, like the way that `polars` explicitly tells us the shape and dtypes of the dataframe when we print it. But you didn't come here for fake data, did you? Let's load some real data!

## 2.2. Reading Data From `csv`

The data used throughout this course will be data from taxi rides in New York City, offered publicly by the city: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page (see a detailed description of the schema of this dataset [here](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)).

There are many ways to load data in `polars`: from `csv` files, `parquet` files, `xlsx` files, or even directly from databases. We'll start with a `csv`, with the function `polars.read_csv()`.

In [6]:
!pwd

/Users/ben/code/polars-oreilly-course/notebooks


In [9]:
df = pl.read_csv("../data/yellow_tripdata_2024-03.csv")

We can take a quick look at the data with a call to `.head()`:

In [None]:
df.head()

One thing that's nice about the `polars.DataFrame.head()` operation is the way that it shows the datatype of every column, making the schema explicitly clear. In this case, we have 19 columns:
1. `'VendorID'`: `Int64`
2. `'tpep_pickup_datetime'`: `String`
3. `'tpep_dropoff_datetime'`: `String`
4. `'passenger_count'`: `Int64`
5. `'trip_distance'`: `Float64`
6. `'RatecodeID'`: `Int64`
7. `'store_and_fwd_flag'`: `String`
8. `'PULocationID'`: `Int64`
9. `'DOLocationID'`: `Int64`
10. `'payment_type'`: `Int64`
11. `'fare_amount'`: `Float64`
12. `'extra'`: `Float64`
13. `'mta_tax'`: `Float64`
14. `'tip_amount'`: `Float64`
15. `'tolls_amount'`: `Float64`
16. `'improvement_surcharge'`: `Float64`
17. `'total_amount'`: `Float64`
18. `'congestion_surcharge'`: `Float64`
19. `'Airport_fee'`: `Float64`

You might have noticed something strange--why are `tpep_pickup_datetime` and `tpep_dropoff_datetime` recognized as a `string` datatype? After all, they look a lot like datetimes...

Well, `polars.read_csv()` doesn't do much to detect datatypes beyond strings and numbers upon reading in data... unless of course you ask it to! This can be done with a simple inclusion of the `schema_overrides` argument to the `polars.read_csv()` call:

In [None]:
df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime},
)
df.head()

Now `"tpep_pickup_datetime"` and `"tpep_dropoff_datetime"` are parsed as datetimes, just as we wanted.

This is one of the greatest strengths of `polars`: easily managing data types and making them organized and transparent for users. This might not seem like a big deal right now, but when it comes time to do complex operations on columns, you'll very much appreciate this! But more on that in a few modules. For now, what else can we learn about this data?

We can see the shape of the whole dataframe with `pl.DataFrame.shape`:

In [None]:
df.shape

That's a lot of data! We can also see a brief set of summary statistics about the dataframe with `pl.DataFrame.describe()`:

In [None]:
df.describe()

Here we can see how many null and non-null values are in each column, the average and standard deviation of each column, and the key quantiles of each column.

Now, it's time to get lazy... let's try loading data with `polars` in lazy mode!

## 2.3. Scanning Data From `csv`

As we mentioned in the previous module, `polars` supports two ways of loading and processing data: the in-memory mode with `DataFrame`, and the lazy mode with `LazyFrame`. With `DataFrame`, all operations are executed as they are written; with `LazyFrame`, however, operations are optimized before they are executed.

The most common way to enter lazy mode is by starting with `polars.scan_csv()` instead of `polars.read_csv()`:

In [None]:
lf = pl.scan_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
)

That's right--we can use the `schema_overrides` argument with `polars.scan_csv()` as well! `polars.scan_csv()` supports (almost) all the same input arguments as `polars.read_csv()`, making it easy to work with.

As in the last section, let's do `head()` to take a look at the data.

In [None]:
lf.head()

Hey, that's not a dataframe!

When working with a `LazyFrame`, it's better to think of it as a "query" rather than as a "dataframe". To this end, displaying a `LazyFrame` object doesn't show the data result; rather, **it shows you what the computer is going to do to get you the data result**.

In this case (reading the graph in the image from bottom to top), polars will:
- `pl.scan_csv(...)`: scan the csv, selecting all columns ("π \*/19") and filtering None ("σ None").
- `.head()`: Take the first five rows ("SLICE offset: 0; len: 5").

_Note: If this didn't work for you, you need to install `graphviz`. You can read more about how to do that on the [`graphviz` website](https://graphviz.org/download/)._

_Another note: If the syntax here with π and σ looks weird to you, that's relational algebra, the formal algebra and notation underlying relational data (i.e. columnar data). For a deeper look into that, check out [the Wikipedia page about it](https://en.wikipedia.org/wiki/Relational_algebra)._

You might also notice that the graph instructs us to *"run **LazyFrame.show_graph()** to see the optimized version"*. Let's do that!

In [None]:
lf.head().show_graph()

Looks the same as the naive query plan--nothing to optimize here! Either way, we can see the result of the data with a call to `.collect()`:

In [None]:
lf.head().collect()

Now you can load CSVs both the in-memory way and the lazy way--great! Next, let's actually do something beyond just loading... something that will trigger the query optimization engine...

## 2.4. Selecting Data

Now that we have the data loaded--both as a `polars.DataFrame` and as a `polars.LazyFrame`--it's time to start querying it! Let's start with the most straightforward: selecting.

If you're coming from `pandas`, the notation for selecting columns is with brackets, like `pandas.DataFrame[]`; if you're coming from `SQL`, the notation is `SELECT`. `Polars` is more like `SQL`: the notation is `polars.DataFrame.select()`.

Suppose we only want to know the pickup and dropoff time of each taxi ride...

In [None]:
(
    df
    .select(["tpep_pickup_datetime", "tpep_dropoff_datetime"])
    .head()
)

Easy! Let's see what that looks like with the `LazyFrame`:

In [None]:
(
    lf
    .select(["tpep_pickup_datetime", "tpep_dropoff_datetime"])
    .head()
)

Again, you can see the naively sequential order of the query in the naive query plan:
- `pl.scan_csv(...)`: scan the csv, selecting all columns ("π \*/19") and filtering None ("σ None").
- `.select(...)`: select two columns ("π 2/2").
- `.head()`: Take the first five rows ("SLICE offset: 0; len: 5").

But let's see what `Polars` will *actually* do to execute the query:

In [None]:
(
    lf
    .select(["tpep_pickup_datetime", "tpep_dropoff_datetime"])
    .head()
    .show_graph()
)

It's different! In the naive query plan, all 19 columns are selected during data loading, only to be cut down to 2 columns immediately afterwards. In the optimized query plan, the two columns `"tpep_pickup_datetime"` and `"tpep_dropoff_datetime"` are selected **while** the CSV is being read! This means that **the other 17 columns are never loaded into memory**, which reduces memory usage and speeds up the read.

Let's run the query:

In [None]:
(
    lf
    .select(["tpep_pickup_datetime", "tpep_dropoff_datetime"])
    .head()
    .collect()
)

And the result is identical to doing it with `DataFrame`! But before moving on, let's see--how much of a boost to run-performance did we get?

In [None]:
%%timeit -n 3
(
    pl.read_csv(
        "../data/yellow_tripdata_2024-03.csv",
        schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
    )
    .select(["tpep_pickup_datetime", "tpep_dropoff_datetime"])
)

In [None]:
%%timeit -n 3
(
    pl.scan_csv(
        "../data/yellow_tripdata_2024-03.csv",
        schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
    )
    .select(["tpep_pickup_datetime", "tpep_dropoff_datetime"])
    .collect()
)

Almost a 3x speed increase!

You'll also notice, again, that the syntax for selecting columns with `polars.LazyFrame` is identical to the syntax with `polars.DataFrame`. This holds in almost all cases in `Polars`: the syntax for the two is the same, except that `polars.LazyFrame` requires a `.collect()` call to actually return data.

It's also worth mentioning here that, while we can easily convert a `pl.LazyFrame` to a `pl.DataFrame` with `.collect()`, the same can be done in the opposite direction with `pl.DataFrame.lazy()`. Let's check it out:

In [None]:
df_as_lazy = (
    pl.read_csv(
        "../data/yellow_tripdata_2024-03.csv",
        schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
    )
    .lazy()
)
type(df_as_lazy)

Before moving on to the next module, let's try loading in one more file type: `parquet`.

## 2.5. Reading and Scanning From `parquet`

In the course introduction, we saw that `polars` uses the Apache Arrow columnar memory format, which is optimized for modern analytics use-cases where columns are more important than rows. Apache Arrow achieves this by placing the values of a column next to each other in memory, rather than the values of a row.

Well, there's also a file format that stores data column by column on disk: `parquet`. It is a separate format from Arrow's in-memory layout, but because both are columnar, `polars` and `parquet` work very well together, and are a great toolkit for data science.

Let's try and read and scan data from parquet in the same way we did for `csv`:

In [None]:
df = pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
df.head()

Something's changed--the `"tpep_pickup_datetime"` and `"tpep_dropoff_datetime"` get automatically read in as `datetime` data type! But why is that?

Similar to how `polars` places a high importance on data types, so does `parquet`; with this, it's possible to store columns' data types alongside the data in a `parquet` file. This is something that `csv` just can't do!

And, of course, there's also `scan_parquet()`:

In [None]:
pl.scan_parquet("../data/yellow_tripdata_2024-03.parquet").head().collect()

## 2.6. Conclusion

In this module, we've learned:
- How to create a dataframe from scratch;
- How to load data to a dataframe, both from `csv` and from `parquet`, and display some basic information about it;
- How to work with in-memory mode vs lazy mode, and the performance difference that can be expected between the two;
- How to begin writing queries, with the `select` function, and how this is executed differently in in-memory vs lazy mode.

In the next module, "Data Manipulation I: Basics", we'll go deeper into manipulating dataframes. As a note, since this course has a more practical focus, only `DataFrame`s will be used from here on out, for syntactic brevity. As we saw in this module, however, `DataFrame` and `LazyFrame` can almost always be used interchangeably.