# 2. Getting Started

The goal of this module is to install `polars`, become familiar with its lazy mode versus its in-memory mode, and understand how to leverage its query optimization.

Functions we will cover:
- `polars.read_csv()`
- `polars.scan_csv()`
- `polars.read_parquet()`
- `polars.DataFrame.head()`
- `polars.DataFrame.select()`
- `polars.LazyFrame.select()`
- `polars.LazyFrame.show_graph()`

## 2.0. Installing and Importing Polars

If you haven't already, you can install `polars`. The preferred way is with `pip`.

```bash
pip install polars
```

You can do it right here in the notebook, if it's easiest:

In [1]:
!pip install polars



And of course, now it can be imported.

In [2]:
import polars as pl

## 2.1. Reading Data From `csv`

The data used throughout this course comes from taxi rides in New York City, published by the city: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

There are many ways to load data in `polars`: from `csv` files, `parquet` files, `xlsx` files, or even directly from databases. We'll start with a `csv`.

In [11]:
df = pl.read_csv("../data/yellow_tripdata_2024-03.csv")

As with `pandas`, we can take a quick look at it with a call to `.head()`:

In [12]:
df.head()

| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | f64 | i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 1 | "2024-03-01T00:18:51.000000000" | "2024-03-01T00:23:45.000000000" | 0 | 1.3 | 1 | "N" | 142 | 239 | 1 | 8.6 | 3.5 | 0.5 | 2.7 | 0.0 | 1.0 | 16.3 | 2.5 | 0.0 |
| 1 | "2024-03-01T00:26:00.000000000" | "2024-03-01T00:29:06.000000000" | 0 | 1.1 | 1 | "N" | 238 | 24 | 1 | 7.2 | 3.5 | 0.5 | 3.0 | 0.0 | 1.0 | 15.2 | 2.5 | 0.0 |
| 2 | "2024-03-01T00:09:22.000000000" | "2024-03-01T00:15:24.000000000" | 1 | 0.86 | 1 | "N" | 263 | 75 | 2 | 7.9 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 10.4 | 0.0 | 0.0 |
| 2 | "2024-03-01T00:33:45.000000000" | "2024-03-01T00:39:34.000000000" | 1 | 0.82 | 1 | "N" | 164 | 162 | 1 | 7.9 | 1.0 | 0.5 | 1.29 | 0.0 | 1.0 | 14.19 | 2.5 | 0.0 |
| 1 | "2024-03-01T00:05:43.000000000" | "2024-03-01T00:26:22.000000000" | 0 | 4.9 | 1 | "N" | 263 | 7 | 2 | 25.4 | 3.5 | 0.5 | 0.0 | 0.0 | 1.0 | 30.4 | 2.5 | 0.0 |


One nice thing about the `polars.DataFrame.head()` output is that it shows the datatype of every column, making the schema explicit. In this case, we have 19 columns:
1. `'VendorID'`: `Int64`
2. `'tpep_pickup_datetime'`: `String`
3. `'tpep_dropoff_datetime'`: `String`
4. `'passenger_count'`: `Int64`
5. `'trip_distance'`: `Float64`
6. `'RatecodeID'`: `Int64`
7. `'store_and_fwd_flag'`: `String`
8. `'PULocationID'`: `Int64`
9. `'DOLocationID'`: `Int64`
10. `'payment_type'`: `Int64`
11. `'fare_amount'`: `Float64`
12. `'extra'`: `Float64`
13. `'mta_tax'`: `Float64`
14. `'tip_amount'`: `Float64`
15. `'tolls_amount'`: `Float64`
16. `'improvement_surcharge'`: `Float64`
17. `'total_amount'`: `Float64`
18. `'congestion_surcharge'`: `Float64`
19. `'Airport_fee'`: `Float64`

You might have noticed something strange: why are `tpep_pickup_datetime` and `tpep_dropoff_datetime` recognized as a `String` datatype?

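Well, by default `polars.read_csv()` doesn't do much to detect datatypes beyond strings and numbers when reading data... unless of course you ask it to! This can be done with a simple addition to the `polars.read_csv()` call, the `dtypes` argument (newer `polars` releases rename it to `schema_overrides` and deprecate the old name):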

In [22]:
df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    dtypes={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
)
df.head()

| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | datetime[μs] | datetime[μs] | i64 | f64 | i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 1 | 2024-03-01 00:18:51 | 2024-03-01 00:23:45 | 0 | 1.3 | 1 | "N" | 142 | 239 | 1 | 8.6 | 3.5 | 0.5 | 2.7 | 0.0 | 1.0 | 16.3 | 2.5 | 0.0 |
| 1 | 2024-03-01 00:26:00 | 2024-03-01 00:29:06 | 0 | 1.1 | 1 | "N" | 238 | 24 | 1 | 7.2 | 3.5 | 0.5 | 3.0 | 0.0 | 1.0 | 15.2 | 2.5 | 0.0 |
| 2 | 2024-03-01 00:09:22 | 2024-03-01 00:15:24 | 1 | 0.86 | 1 | "N" | 263 | 75 | 2 | 7.9 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 10.4 | 0.0 | 0.0 |
| 2 | 2024-03-01 00:33:45 | 2024-03-01 00:39:34 | 1 | 0.82 | 1 | "N" | 164 | 162 | 1 | 7.9 | 1.0 | 0.5 | 1.29 | 0.0 | 1.0 | 14.19 | 2.5 | 0.0 |
| 1 | 2024-03-01 00:05:43 | 2024-03-01 00:26:22 | 0 | 4.9 | 1 | "N" | 263 | 7 | 2 | 25.4 | 3.5 | 0.5 | 0.0 | 0.0 | 1.0 | 30.4 | 2.5 | 0.0 |


This is one of the greatest strengths of `polars`: easily managing data types and making them organized and transparent for users. This might not seem like a big deal right now, but when it comes time to do complex operations on columns, you'll very much appreciate this! But more on that in a few modules. For now, let's try loading data with `polars` in a different way: the lazy way.

## 2.2. Scanning Data From `csv`

As we mentioned in the previous module, `polars` supports two ways of loading and processing data: the in-memory mode with `DataFrame`, and the lazy mode with `LazyFrame`. With `DataFrame`, all operations are executed as they are written; with `LazyFrame`, however, operations are optimized before they are executed.

The most common way to enter lazy mode is by starting with `polars.scan_csv()` instead of `polars.read_csv()`:

In [24]:
lf = pl.scan_csv(
    "../data/yellow_tripdata_2024-03.csv",
    dtypes={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
)

That's right--we can use the `dtypes` argument with `polars.scan_csv()` as well! `polars.scan_csv()` supports (almost) all the same input arguments as `polars.read_csv()`, making it easy to work with.

As in the last section, let's call `.head()` to take a look at the data.

In [31]:
lf.head()

Hey, that's not a dataframe!

When working with `LazyFrame`s, it's better to think of them as "queries" rather than as "dataframes". To this end, displaying a `LazyFrame` object doesn't show the data result; rather, **it shows you what the computer is going to do to get you the data result**.

In this case, 