There are two ways of running transformations:
* step by step in eager mode (imperative)
* as an integrated query in lazy mode


For analytics, running as an integrated query in lazy mode is superior:
* eager mode materializes an intermediate dataframe at every step, which is not memory efficient
* each eager step is unaware of what the other steps are doing, so it can miss out on optimizations
* integrated queries can use a query optimizer to identify inefficiencies
* the query optimizer can reduce memory usage, and the query produces a single output

In [1]:
import polars as pl
import pathlib

path_to_data = pathlib.Path("data/titanic.csv")

In [2]:
eager_df = pl.read_csv(path_to_data)
lazy_df = pl.scan_csv(path_to_data)

In [3]:
print(type(eager_df))
print(type(lazy_df))

<class 'polars.internals.dataframe.frame.DataFrame'>
<class 'polars.internals.lazyframe.frame.LazyFrame'>


If we try to grab the `head` of our `eager_df`, we'll get the actual data, since it has already been eagerly read into memory.

If we try to grab the `head` of the `lazy_df`, we'll only get the "naive" query execution plan. 

In [4]:
eager_df.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow...","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. ...","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


In [5]:
lazy_df.head(2)

In [6]:
lazy_df.describe_optimized_plan()

'  CSV SCAN data/titanic.csv\n  PROJECT */12 COLUMNS\n'

Polars passes your naive query plan to its query optimizer. 

The kinds of optimizations that are provided:
* projection pushdown - don't retrieve all columns
* predicate pushdown - apply filter conditions early on
* slice pushdown - only process a limited number of rows, if the query only asks for a limited number
* combine predicates - combine multiple filter conditions
* common subplan elimination - eliminate duplicated transformations


Polars also implements other optimizations, such as fast-path algorithms when it knows your data is sorted, but these are separate from the query optimizer.

# Lazy Evaluation
Eventually, you do want to run those transformations, and turn your lazy frame into an eager frame, so you can see the data and do things with it. 

There are two primary ways to turn your lazy frame into an eager frame:
* `collect()` - to get full output
* `fetch()` - to get partial output; use `fetch` when you don't want to run your query on the full dataset

In [7]:
(
    lazy_df
    .collect()
    .head()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow...","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. ...","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis...","""female""",26.0,0,0,"""STON/O2. 31012...",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs....","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. Wil...","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


In [9]:
(
    lazy_df
    .fetch(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow...","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. ...","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis...","""female""",26.0,0,0,"""STON/O2. 31012...",7.925,,"""S"""


# Going from Eager Mode Back to Lazy Mode
You might have a lazy dataframe that you evaluate and turn into an eager dataframe, but then you'd like to continue your query, so you want to turn your eager dataframe back into a lazy dataframe.


Or, you might have to perform transformations that can only be done in eager mode (like `pivot`).
You can't do a `pivot` in lazy mode, because a lazy query needs to know its output columns ahead of time, and the columns of a pivoted frame depend on the values in the data.

We can turn an eager frame into a lazy frame by calling the `.lazy()` method.

In [10]:
new_eager_df = pl.read_csv(path_to_data)
new_lazy_df = new_eager_df.lazy()
new_lazy_df

# PyArrow Data Types
***
All data types in a Polars `Series` or `DataFrame` come from the Apache Arrow project.

Apache Arrow is an open source, cross-language project that attempts to specify the best way to represent tabular data in memory. 

Apache Arrow is:
* a specification for how data should be represented in memory
* a set of libraries in various languages that implement the specification

Polars uses the implementation of the Arrow specification from the Rust library `Arrow2`.
## Benefits of using Apache Arrow
Apache Arrow provides:
* zero-copy sharing of data between processes
* faster vectorized operations
* a consistent representation for missing data (and NaNs)
## In Practice
In practice, you don't deal with Arrow or PyArrow directly, as Polars handles it for you.
You can convert your Polars dataframe into a PyArrow `Table` by running `df.to_arrow()`.
PyArrow is the Python library of the Apache Arrow project.

In [11]:
eager_df.schema

{'PassengerId': Int64,
 'Survived': Int64,
 'Pclass': Int64,
 'Name': Utf8,
 'Sex': Utf8,
 'Age': Float64,
 'SibSp': Int64,
 'Parch': Int64,
 'Ticket': Utf8,
 'Fare': Float64,
 'Cabin': Utf8,
 'Embarked': Utf8}

In [13]:
eager_df['Name'].dtype

Utf8