### Introduction to Polars

https://realpython.com/python-gil/

https://pola.rs/

![alt text](img/polars.png)

The versatility of Python and its simple syntax are certainly the strong points of this high-level, general-purpose programming language. However, one of its greatest weaknesses, if not the greatest (especially considering how the computing ecosystem has evolved), is the Global Interpreter Lock (GIL).

The Python Global Interpreter Lock, or GIL, is a mutex (lock) that permits only one thread to control the Python interpreter at a time. In simple terms, only one thread can be executing Python bytecode at any given moment. This constraint is not noticeable in single-threaded programs, but it becomes a performance bottleneck in CPU-bound, multi-threaded code.

The GIL's reputation as an "infamous" feature stems from exactly this restriction: only one thread runs at a time, even on machines with multiple CPU cores. The rest of this introduction looks at why the GIL exists, how it affects the performance of Python programs, and how its impact can be mitigated.

To understand the problem the GIL addresses, it's crucial to delve into Python's memory management using reference counting. Objects in Python have a reference count variable that tracks the number of references pointing to the object. When this count reaches zero, the object's memory is released. The GIL addresses the challenge of protecting this reference count variable from race conditions where two threads might simultaneously increase or decrease its value. Without proper protection, such scenarios could lead to memory leaks or, worse, incorrect release of memory while references to the object still exist, resulting in crashes or unpredictable bugs.

In [1]:
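import sys

a = []  # first reference to the list
b = a   # second reference to the same list
# The third reference is the temporary one created by passing `a` as an argument
sys.getrefcount(a)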

3

While one solution could be adding locks to shared data structures to safeguard the reference count variable, it introduces the risk of deadlocks and performance degradation due to repeated lock acquisition and release. The GIL takes a different approach by acting as a single lock on the interpreter itself. This ensures that executing any Python bytecode necessitates acquiring the interpreter lock. Although this approach avoids deadlocks and minimizes performance overhead, it effectively confines CPU-bound Python programs to single-threaded execution.
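
The effect on CPU-bound code can be observed with a small timing experiment. The following is only a sketch: the function `countdown` and the workload size `N` are hypothetical names chosen for illustration, and the measured times depend on the machine and on the Python build. On a standard CPython build with the GIL, the threaded run is usually not faster than the serial one.

```python
import threading
import time


def countdown(n: int) -> None:
    """Pure-Python, CPU-bound loop."""
    while n > 0:
        n -= 1


N = 2_000_000  # hypothetical workload size, chosen only for illustration

# Run the work twice on a single thread
start = time.perf_counter()
countdown(N)
countdown(N)
serial = time.perf_counter() - start

# Run the same amount of work split across two threads
threads = [threading.Thread(target=countdown, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"serial:   {serial:.3f} s")
print(f"threaded: {threaded:.3f} s")
```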

Python is a high-level programming language that prioritizes ease of use and readability. However, this focus on simplicity and readability can impact its performance. Recognizing the need for optimized execution in certain scenarios, an interface between Python and C has been developed. This integration allows developers to leverage the efficiency of C, a low-level language known for its speed, in performance-critical sections of their Python programs. By doing so, they can strike a balance between Python's simplicity and C's performance, optimizing their applications for specific tasks. Interestingly, the GIL was not only designed to ensure better performance in single-threaded programs but also to facilitate the integration of C libraries that were not thread-safe. The same interface also offers a way around the lock: an extension written in C can release the GIL while it runs native code, so its threads can execute in parallel, and the same applies to extensions written in other compiled languages.

The creators of Polars wrote the library in Rust, with a strong focus on parallelization and efficiency. Given the excellent results, Polars will probably replace Pandas for many workloads within a few years. Therefore, in this course, we explore a bit of this library.

In [2]:
try:
    import polars as pl
except ImportError:
    # Install Polars from conda-forge if it is missing, then retry the import
    print("The 'polars' package is not installed. Installing...")
    %conda install -c conda-forge polars -y
    print("Installation complete.")
    import polars as pl


### Data types

Polars is entirely based on Arrow data types and backed by Arrow memory arrays. From this point of view, there isn't much new, but it's worth listing the types we will use the most:

* ```pl.Int32``` and ```pl.Int64```
* ```pl.Float32``` and ```pl.Float64```
* ```pl.Date``` and ```pl.Datetime```
* ```pl.Boolean``` and ```pl.Categorical```

Categorical data represents string data where the values in the column come from a finite set. Storing these values as plain strings wastes memory, so Polars encodes them in dictionary format: each distinct string is stored once and the rows hold integer codes.

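Keep in mind that Polars does not use ```NaN``` to mark missing data; missing values of any type are represented by ```null``` (```pl.Null``` is the data type of a column that contains only nulls). ```NaN``` still exists as an ordinary floating-point value, but it is not counted as missing.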

### Data structures

Regarding the structures, they are similar to those in Pandas, so essentially, we will be working with:
* ```pl.Series``` and  ```pl.DataFrame```

Several methods of ```pl.DataFrame``` work much like their Pandas counterparts:

* ```.head()``` shows the first 5 rows by default
* ```.tail()``` shows the last 5 rows by default
* ```.sample()``` shows random rows (1 by default; pass ```n``` to request more)
* ```.describe()``` returns summary statistics

### I/O

Polars supports reading from and writing to common file formats (e.g. csv, json, parquet), cloud storage (S3, Azure Blob, BigQuery) and databases (e.g. postgres, mysql). For this course, we revisit the dear old Titanic dataset:

In [3]:
df = pl.read_csv("titanic.csv")

In general, the syntax is quite simple: use ```pl.read_<filetype>()``` for reading and the DataFrame method ```.write_<filetype>()``` for writing. See the documentation for the parameters of these functions. A non-exhaustive list of file types includes:
* ```.read_json()``` and ```.write_json()```
* ```.read_parquet()``` and ```.write_parquet()```

### Contexts

Contexts and expressions constitute the language through which Polars performs operations on data. An expression describes a computation on one or more columns; a context determines how and on which data that expression is evaluated. Let's look at some examples:

##### Select

In the ```select``` context, expressions are applied over columns. The expressions in this context must produce Series that are all the same length or have a length of 1 (a length-1 result is broadcast to match the others):

In [14]:
out = df.select(pl.col("Name"),
                pl.col("Age"),
                pl.col("Survived")
                ).limit(5)
print(out)

shape: (5, 3)
┌───────────────────────────────────┬──────┬──────────┐
│ Name                              ┆ Age  ┆ Survived │
│ ---                               ┆ ---  ┆ ---      │
│ str                               ┆ f64  ┆ i64      │
╞═══════════════════════════════════╪══════╪══════════╡
│ Braund, Mr. Owen Harris           ┆ 22.0 ┆ 0        │
│ Cumings, Mrs. John Bradley (Flor… ┆ 38.0 ┆ 1        │
│ Heikkinen, Miss. Laina            ┆ 26.0 ┆ 1        │
│ Futrelle, Mrs. Jacques Heath (Li… ┆ 35.0 ┆ 1        │
│ Allen, Mr. William Henry          ┆ 35.0 ┆ 0        │
└───────────────────────────────────┴──────┴──────────┘


```select``` is very similar to ```with_columns```. The main difference is that ```with_columns``` retains the original columns and adds the new ones, while ```select``` keeps only the columns produced by its expressions:

In [15]:
out = out.with_columns(pl.col("Survived").cast(pl.Boolean).alias("As bool"))
print(out)

shape: (5, 4)
┌───────────────────────────────────┬──────┬──────────┬─────────┐
│ Name                              ┆ Age  ┆ Survived ┆ As bool │
│ ---                               ┆ ---  ┆ ---      ┆ ---     │
│ str                               ┆ f64  ┆ i64      ┆ bool    │
╞═══════════════════════════════════╪══════╪══════════╪═════════╡
│ Braund, Mr. Owen Harris           ┆ 22.0 ┆ 0        ┆ false   │
│ Cumings, Mrs. John Bradley (Flor… ┆ 38.0 ┆ 1        ┆ true    │
│ Heikkinen, Miss. Laina            ┆ 26.0 ┆ 1        ┆ true    │
│ Futrelle, Mrs. Jacques Heath (Li… ┆ 35.0 ┆ 1        ┆ true    │
│ Allen, Mr. William Henry          ┆ 35.0 ┆ 0        ┆ false   │
└───────────────────────────────────┴──────┴──────────┴─────────┘


##### Filter

In the ```filter``` context, you filter the existing DataFrame based on an arbitrary expression that evaluates to a Boolean value for each row:

In [21]:
out = df.filter((pl.col("Age") < 10) & (pl.col("Pclass") == 3)).limit(5)
print(out)

shape: (5, 12)
┌─────────────┬──────────┬────────┬──────────────────┬───┬─────────┬─────────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name             ┆ … ┆ Ticket  ┆ Fare    ┆ Cabin ┆ Embarked │
│ ---         ┆ ---      ┆ ---    ┆ ---              ┆   ┆ ---     ┆ ---     ┆ ---   ┆ ---      │
│ i64         ┆ i64      ┆ i64    ┆ str              ┆   ┆ str     ┆ f64     ┆ str   ┆ str      │
╞═════════════╪══════════╪════════╪══════════════════╪═══╪═════════╪═════════╪═══════╪══════════╡
│ 8           ┆ 0        ┆ 3      ┆ Palsson, Master. ┆ … ┆ 349909  ┆ 21.075  ┆ null  ┆ S        │
│             ┆          ┆        ┆ Gosta Leonard    ┆   ┆         ┆         ┆       ┆          │
│ 11          ┆ 1        ┆ 3      ┆ Sandstrom, Miss. ┆ … ┆ PP 9549 ┆ 16.7    ┆ G6    ┆ S        │
│             ┆          ┆        ┆ Marguerite Rut   ┆   ┆         ┆         ┆       ┆          │
│ 17          ┆ 0        ┆ 3      ┆ Rice, Master.    ┆ … ┆ 382652  ┆ 29.125  ┆ null  ┆ Q        │
│    

##### Group by with aggregation

In the ```group_by``` context, expressions work on groups and thus can yield results of any length:

In [46]:
out = df.group_by("Pclass").agg(
    pl.count("PassengerId").alias("#"),
    pl.col("PassengerId").filter(pl.col("Name").str.contains("John")).alias("ids of Johns"),
    pl.sum("Survived")
)
print(out)

shape: (3, 4)
┌────────┬─────┬─────────────────┬──────────┐
│ Pclass ┆ #   ┆ ids of Johns    ┆ Survived │
│ ---    ┆ --- ┆ ---             ┆ ---      │
│ i64    ┆ u32 ┆ list[i64]       ┆ i64      │
╞════════╪═════╪═════════════════╪══════════╡
│ 3      ┆ 491 ┆ [9, 46, … 889]  ┆ 119      │
│ 2      ┆ 184 ┆ [42, 99, … 865] ┆ 87       │
│ 1      ┆ 216 ┆ [2, 169, … 823] ┆ 136      │
└────────┴─────┴─────────────────┴──────────┘
