# Advanced pandas - Going Beyond the Basics

## polars
___

### Table of Contents
1. [Import dependencies](#section1)
2. [Read data into polars](#section2)
3. [Expressions and operations](#section3)

https://towardsdatascience.com/pandas-dataframe-but-much-faster-f475d6be4cd4

https://medium.com/towards-data-science/measuring-the-speed-of-new-pandas-2-0-against-polars-and-datatable-still-not-good-enough-e44dc78f6585

https://towardsdatascience.com/getting-started-with-the-polars-dataframe-library-6f9e1c014c5c

https://towardsdatascience.com/understanding-lazy-evaluation-in-polars-b85ccb864d0c

___
<a id='section1'></a>
# (1) Import dependencies

In [1]:
# Install dependencies (if not already installed)
# !pip install pandas==2.0.3
# !pip install polars==0.18.7

In [1]:
import numpy as np
import pandas as pd
import polars as pl

___
<a id='section2'></a>
# (2) Read data into `polars`
- Data Source: https://archive.ics.uci.edu/dataset/352/online+retail (CC BY 4.0 license)

`polars` supports reading from and writing to all common file formats (e.g. CSV, JSON, Parquet), cloud storage and services (e.g. S3, Azure Blob, BigQuery), and databases (e.g. PostgreSQL, MySQL).

In this example, we will use the fast CSV reading function of `polars`, as shown below:

In [16]:
# Read CSV in polars
df = pl.read_csv('https://raw.githubusercontent.com/kennethleungty/Educative-Advanced-Pandas/main/data/csv/online_retail_dataset.csv',
                 encoding='utf-8', # Ensure values are encoded appropriately
                 ignore_errors=True # Hide errors first since dtypes not specified
                )

# View output
df.head()

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
i64,str,str,i64,str,f64,i64,str
536365,"""85123A""","""WHITE HANGING …",6,"""1/12/2010 8:26…",2.55,17850,"""United Kingdom…"
536365,"""71053""","""WHITE METAL LA…",6,"""1/12/2010 8:26…",3.39,17850,"""United Kingdom…"
536365,"""84406B""","""CREAM CUPID HE…",8,"""1/12/2010 8:26…",2.75,17850,"""United Kingdom…"
536365,"""84029G""","""KNITTED UNION …",6,"""1/12/2010 8:26…",3.39,17850,"""United Kingdom…"
536365,"""84029E""","""RED WOOLLY HOT…",6,"""1/12/2010 8:26…",3.39,17850,"""United Kingdom…"


`polars` has a strict schema, meaning that data types should be known before running the query. In the above case, because we did not specify `dtypes`, `polars` inferred the data types automatically, and the output DataFrame shows the inferred `dtype` of each column.

To specify the data types appropriately for `polars` to work optimally, we can use the `dtypes` and `columns` parameters, as illustrated below:

In [15]:
# Specify column names
columns=["InvoiceNo", "StockCode", "Description", "Quantity", "InvoiceDate", 
         "UnitPrice", "CustomerID", "Country"]

# Define list of data types in the same sequence as the columns
dtypes_list = [pl.Utf8, pl.Utf8, pl.Utf8, pl.Int64, pl.Datetime,
               pl.Float64, pl.Int64, pl.Categorical]

# Read CSV in polars with dtypes specified
df = pl.read_csv('https://raw.githubusercontent.com/kennethleungty/Educative-Advanced-Pandas/main/data/csv/online_retail_dataset.csv',
                 encoding='utf-8', # Ensure values are encoded appropriately
                 dtypes=dtypes_list,
                 columns=columns
                 )

# View output
df.head()

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
str,str,str,i64,datetime[μs],f64,i64,cat
"""536365""","""85123A""","""WHITE HANGING …",6,,2.55,17850,"""United Kingdom…"
"""536365""","""71053""","""WHITE METAL LA…",6,,3.39,17850,"""United Kingdom…"
"""536365""","""84406B""","""CREAM CUPID HE…",8,,2.75,17850,"""United Kingdom…"
"""536365""","""84029G""","""KNITTED UNION …",6,,3.39,17850,"""United Kingdom…"
"""536365""","""84029E""","""RED WOOLLY HOT…",6,,3.39,17850,"""United Kingdom…"


> **Note**: `pl.Utf8` represents the UTF-8 encoded string type in `polars`. The complete list of `polars` data types can be found in the section below on the Educative lesson page.

If we do not wish to read the entire large dataset directly, we can scan it with `scan_csv()` instead. `scan_csv()` delays the parsing of the dataset: it reads the file lazily and returns a `LazyFrame` (rather than a DataFrame). A `LazyFrame` represents a lazy computation graph/query against a DataFrame.

This lets `polars` generate an optimized execution plan before actually executing the transformation, so that it can skip columns that are not needed in the computation. The actual computation takes place when `collect()` is called, as shown below:

In [19]:
lazy_df = pl.scan_csv('../data/csv/online_retail_dataset.csv', # Must be read from local file, not external URL
                      encoding='utf8',
                      ignore_errors=True)
type(lazy_df)

polars.lazyframe.frame.LazyFrame

In [25]:
# Retrieve the full data by calling collect()
df = lazy_df.collect(streaming=True)

# View output
df.head()

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
i64,str,str,i64,str,f64,i64,str
536365,"""85123A""","""WHITE HANGING …",6,"""1/12/2010 8:26…",2.55,17850,"""United Kingdom…"
536365,"""71053""","""WHITE METAL LA…",6,"""1/12/2010 8:26…",3.39,17850,"""United Kingdom…"
536365,"""84406B""","""CREAM CUPID HE…",8,"""1/12/2010 8:26…",2.75,17850,"""United Kingdom…"
536365,"""84029G""","""KNITTED UNION …",6,"""1/12/2010 8:26…",3.39,17850,"""United Kingdom…"
536365,"""84029E""","""RED WOOLLY HOT…",6,"""1/12/2010 8:26…",3.39,17850,"""United Kingdom…"


As opposed to the lazy execution in `scan_csv()`, the `read_csv()` method uses an eager execution mode.

For debugging purposes, it is sometimes useful to return just a few rows to examine the output. We can use the `fetch()` method to return the first n rows, as shown below:

In [24]:
lazy_df.fetch(3)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
i64,str,str,i64,str,f64,i64,str
536365,"""85123A""","""WHITE HANGING …",6,"""1/12/2010 8:26…",2.55,17850,"""United Kingdom…"
536365,"""71053""","""WHITE METAL LA…",6,"""1/12/2010 8:26…",3.39,17850,"""United Kingdom…"
536365,"""84406B""","""CREAM CUPID HE…",8,"""1/12/2010 8:26…",2.75,17850,"""United Kingdom…"


> **Note**: To showcase the lazy evaluation capabilities of `polars`, we will load the dataset locally for the subsequent examples, instead of retrieving it from the GitHub repo link.

___
<a id='section3'></a>
# (3) Expressions and operations

`polars` has a powerful concept called expressions, and it is central to its fast performance. Expressions are at the core of many data science operations as they are used to represent operations performed on one or more columns in a DataFrame. 

They can include basic arithmetic, aggregations, comparisons, and other more complex transformations. When using `polars`, expressions allow users to create concise, readable code for data manipulation.

In this lesson, let us focus on four methods that work with expressions, namely:
1. `select()`
2. `filter()`
3. `groupby()`
4. `with_columns()`

## `select()`

## `filter()`

## `groupby()`

## `with_columns()`

https://pola-rs.github.io/polars-book/getting-started/expressions/

## Chaining expressions

The power of expressions is that every expression produces a new expression, which means that they can be chained together into a pipeline.


We generally want to stay in lazy mode for as long as possible (ideally for our entire query) so that Polars can apply query optimisation. 

For an example of explicit lazy evaluation with `lazy()`, see: https://towardsdatascience.com/understanding-lazy-evaluation-in-polars-b85ccb864d0c#:~:text=Explicit%20Lazy%20Evaluation

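`scan_csv()` also accepts a glob pattern, so many CSV files can be scanned as a single `LazyFrame`. This is not only much faster and more scalable than reading the files one by one, it’s also much easier to read and write (from: https://www.rhosignal.com/posts/polars-glob-csvs/):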

In [None]:
polars_df = (
    pl.scan_csv("data_files/*.csv")
    # Select a subset of columns
    .select(["date", "temperature", "humidity"])
    # ... (further steps elided)
    .collect(streaming=True)
)

Calling `collect()` with `streaming=True` tells `polars` to evaluate the query in chunks rather than loading the whole dataset into memory at once. There is a caveat here:

- Streaming does not work for all operations (although it does for core operations such as `filter`, `groupby`, and `join`). If streaming is not available for one of the operations in the query, `polars` falls back to non-streaming execution, and we may run out of memory with a large dataset.
