# 100 pandas x polars puzzles

Inspired by [100 Pandas Puzzles](https://github.com/ajcr/100-pandas-puzzles) by [Alex Riley](https://www.linkedin.com/in/ajcriley/), here are **100\* short data manipulation puzzles**, solved **side-by-side in Pandas and Polars**.

This is for people who are familiar with pandas and want to adopt polars to reap the performance and simplicity benefits that it offers.

## Why should you even care about Polars?
The Polars philosophy can be found in the [user guide](https://docs.pola.rs/). It is summarized below, with pandas comparisons added for context:
- Utilizes all available cores on your machine.
    - pandas mostly uses a single CPU core, whereas Polars is **designed for parallel execution** and can utilize multiple CPU cores automatically for many operations.
- Optimizes queries to reduce unneeded work/memory allocations.
    - pandas operations are eagerly evaluated: each operation materializes immediately, which leads to repeated memory allocation and intermediate DataFrames.
    - Polars supports a **LazyFrame** API, where transformations like:
    `.select()`, `.filter()`, `.with_columns()`, `.group_by()`, `.join()`
    build a **query plan** instead of executing immediately.
    - Execution is triggered only when an **action** such as `.collect()` is called (`.fetch()` also exists but is deprecated in recent Polars releases).
    - Before execution, Polars optimizes the entire plan (projection pushdown, predicate pushdown, reordering, parallelization). 
- Handles datasets much larger than your available RAM.
    - Polars can process datasets **larger than available RAM** in many scenarios by streaming data and avoiding unnecessary materialization.
- A consistent and predictable API.
    - Polars emphasizes **explicit, expression-based transformations**.
    - Also, Polars syntax is similar to PySpark, so if you later go down the Spark route, understanding Polars will be helpful.
- Adheres to a strict schema (data-types should be known before running the query).
    - Polars enforces known data types before query execution. This reduces data type surprises and bugs.

PS: Polars offers both eager execution (a regular DataFrame, typically faster than pandas) and lazy execution (a LazyFrame, which is often faster still). Since these puzzles are designed for interactive learning where we inspect results at every step, we will primarily use the eager (DataFrame) API, though many solutions work identically in lazy mode. [Read more about Polars LazyFrame and DataFrame](https://stuffbyyuki.com/lazyframe-vs-dataframe-in-polars-performance-comparison/)

Every Polars solution includes a Quick Info note about the logic used. Polars follows a **functional, expression-based model** built around three core concepts:
1. Constructors (e.g., `pl.col()`, `pl.lit()`)
2. Expressions (e.g., `.sum()`, `.is_between()`, `.replace()`)
3. Contexts (e.g., `.select()`, `.filter()`, `.with_columns()`)

Most work in Polars happens by combining **Expressions inside a Context**.  
This allows Polars to analyze and optimize the full expression chain **before any data is processed** — one of the key reasons it performs so well.

The exercises are loosely divided into sections. Each section has a difficulty rating; these ratings are subjective, of course, but should be seen as a rough guide as to how inventive the required solution is.

Enjoy the puzzles!

\* *the list of exercises is not yet complete! Pull requests or suggestions for additional exercises, corrections and improvements are welcomed.*

## Importing pandas / polars

Difficulty: *easy* 

##### Before you begin

These first puzzles focus on **environment awareness**, not data manipulation.

You’ll be:
- importing Pandas and Polars
- inspecting their versions
- and printing detailed environment information

At first glance, this may seem trivial. In real-world data work, it isn’t.

Understanding how to identify **what version of a library you’re running**, and **what that library depends on**, is often the first step in:
- debugging unexpected behavior
- reproducing results
- collaborating with others
- and comparing how different tools behave under the same setup

Both Pandas and Polars expose similar APIs here.  


Think of these puzzles as learning how to **check your tools before using them**.


**1.a** Import pandas under the alias `pd`.<br>
**1.b** Import polars under the alias `pl`.

In [None]:
# import pandas here


In [None]:
# import polars here

**2.** Print the versions of pandas and polars that have been imported.

In [None]:
# print the pandas version here 


2.3.3


In [None]:
# print the polars version here 


1.36.1


**3.** Print out all the *version* information of the libraries that are required by pandas and polars.

In [None]:
#pandas required libraries here



INSTALLED VERSIONS
------------------
commit                : 9c8bc3e55188c8aff37207a74f1dd144980b8874
python                : 3.11.9
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.22000
machine               : AMD64
processor             : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_United Kingdom.1252

pandas                : 2.3.3
numpy                 : 2.4.0
pytz                  : 2025.2
dateutil              : 2.9.0.post0
pip                   : 24.0
Cython                : None
sphinx                : None
IPython               : 9.8.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib            

In [None]:
#polars required libraries here


--------Version info---------
Polars:              1.36.1
Index type:          UInt32
Platform:            Windows-10-10.0.22000-SP0
Python:              3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Runtime:             rt32

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.4.0
openpyxl             <not installed>
pandas               2.3.3
polars_cloud         <not installed>
pyarrow              <not installed>
pydantic             <not

## DataFrame basics

### A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

Difficulty: *easy*

Before solving the puzzles below, remember to import NumPy (used here only for representing missing values in the numeric column):
```python
import numpy as np
```
You will be working with the same underlying data for both Pandas and Polars puzzles.

Consider the following Python dictionary `data` and list of row labels `labels`:


``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some made-up data with the theme of animals and trips to a vet.)

> Important note: Pandas has a first-class Index that is separate from the data columns. Polars does not maintain a persistent row index; if row labels are needed, they must be represented explicitly as a column. This difference will matter in several puzzles below.

> ***A useful mental model is that Pandas behaves more like a spreadsheet, where rows have persistent labels, while Polars behaves more like a database or query engine, where data is primarily treated as a collection of columns and rows have no inherent identity unless explicitly added.*** - Gemini, last night

> Not having to manage a global index removes the need for row-level alignment and bookkeeping. This makes it easier for Polars to reorder, chunk, and parallelize operations, contributing to its performance. 



**4.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

In [None]:
# pandas

  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no


In [None]:
# polars

shape: (10, 5)
┌───────┬────────┬─────┬────────┬──────────┐
│ index ┆ animal ┆ age ┆ visits ┆ priority │
│ ---   ┆ ---    ┆ --- ┆ ---    ┆ ---      │
│ str   ┆ str    ┆ f64 ┆ i64    ┆ str      │
╞═══════╪════════╪═════╪════════╪══════════╡
│ a     ┆ cat    ┆ 2.5 ┆ 1      ┆ yes      │
│ b     ┆ cat    ┆ 3.0 ┆ 3      ┆ yes      │
│ c     ┆ snake  ┆ 0.5 ┆ 2      ┆ no       │
│ d     ┆ dog    ┆ NaN ┆ 3      ┆ yes      │
│ e     ┆ dog    ┆ 5.0 ┆ 2      ┆ no       │
│ f     ┆ cat    ┆ 2.0 ┆ 3      ┆ no       │
│ g     ┆ snake  ┆ 4.5 ┆ 1      ┆ no       │
│ h     ┆ cat    ┆ NaN ┆ 1      ┆ yes      │
│ i     ┆ dog    ┆ 7.0 ┆ 2      ┆ no       │
│ j     ┆ dog    ┆ 3.0 ┆ 1      ┆ no       │
└───────┴────────┴─────┴────────┴──────────┘


**5.** Display a summary of the basic information about this DataFrame and its data (*hint: there is a single method that can be called on the DataFrame*).

In [None]:
#pandas

Unnamed: 0,age,visits
count,8.0,10.0
mean,3.4375,1.9
std,2.007797,0.875595
min,0.5,1.0
25%,2.375,1.0
50%,3.0,2.0
75%,4.625,2.75
max,7.0,3.0


In [None]:
#polars

shape: (9, 6)
┌────────────┬───────┬────────┬──────┬──────────┬──────────┐
│ statistic  ┆ index ┆ animal ┆ age  ┆ visits   ┆ priority │
│ ---        ┆ ---   ┆ ---    ┆ ---  ┆ ---      ┆ ---      │
│ str        ┆ str   ┆ str    ┆ f64  ┆ f64      ┆ str      │
╞════════════╪═══════╪════════╪══════╪══════════╪══════════╡
│ count      ┆ 10    ┆ 10     ┆ 10.0 ┆ 10.0     ┆ 10       │
│ null_count ┆ 0     ┆ 0      ┆ 0.0  ┆ 0.0      ┆ 0        │
│ mean       ┆ null  ┆ null   ┆ NaN  ┆ 1.9      ┆ null     │
│ std        ┆ null  ┆ null   ┆ NaN  ┆ 0.875595 ┆ null     │
│ min        ┆ a     ┆ cat    ┆ 0.5  ┆ 1.0      ┆ no       │
│ 25%        ┆ null  ┆ null   ┆ 2.5  ┆ 1.0      ┆ null     │
│ 50%        ┆ null  ┆ null   ┆ 4.5  ┆ 2.0      ┆ null     │
│ 75%        ┆ null  ┆ null   ┆ 7.0  ┆ 3.0      ┆ null     │
│ max        ┆ j     ┆ snake  ┆ 7.0  ┆ 3.0      ┆ yes      │
└────────────┴───────┴────────┴──────┴──────────┴──────────┘


**6.** Return the first 3 rows of the DataFrame `df`.

In [None]:
#pandas

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no


In [None]:
# polars

index,animal,age,visits,priority
str,str,f64,i64,str
"""a""","""cat""",2.5,1,"""yes"""
"""b""","""cat""",3.0,3,"""yes"""
"""c""","""snake""",0.5,2,"""no"""


**7.** Select just the 'animal' and 'age' columns from the DataFrame `df`.

In [None]:
#pandas

Unnamed: 0,animal,age
a,cat,2.5
b,cat,3.0
c,snake,0.5
d,dog,
e,dog,5.0
f,cat,2.0
g,snake,4.5
h,cat,
i,dog,7.0
j,dog,3.0


In [None]:
#polars

animal,age
str,f64
"""cat""",2.5
"""cat""",3.0
"""snake""",0.5
"""dog""",
"""dog""",5.0
"""cat""",2.0
"""snake""",4.5
"""cat""",
"""dog""",7.0
"""dog""",3.0


**8.** Select the data in rows `[3, 4, 8]` *and* in columns `['animal', 'age']`.

In [None]:
#pandas

Unnamed: 0,animal,age
d,dog,
e,dog,5.0
i,dog,7.0


In [None]:
#polars

animal,age
str,f64
"""dog""",
"""dog""",5.0
"""dog""",7.0


**9.** Select only the rows where the number of visits is greater than 3.

In [None]:
#pandas

Unnamed: 0,animal,age,visits,priority


In [None]:
#polars

index,animal,age,visits,priority
str,str,f64,i64,str


**10.** Select the rows where the age is missing, i.e. it is `NaN`.

In [None]:
#pandas

Unnamed: 0,animal,age,visits,priority
d,dog,,3,yes
h,cat,,1,yes


In [None]:
#polars

index,animal,age,visits,priority
str,str,f64,i64,str
"""d""","""dog""",,3,"""yes"""
"""h""","""cat""",,1,"""yes"""


**11.** Select the rows where the animal is a cat *and* the age is less than 3.

In [None]:
#pandas

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
f,cat,2.0,3,no


In [None]:
#polars

index,animal,age,visits,priority
str,str,f64,i64,str
"""a""","""cat""",2.5,1,"""yes"""
"""f""","""cat""",2.0,3,"""no"""


**12.** Select the rows where the age is between 2 and 4 (inclusive).

In [None]:
#pandas

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
f,cat,2.0,3,no
j,dog,3.0,1,no


In [None]:
#polars

index,animal,age,visits,priority
str,str,f64,i64,str
"""a""","""cat""",2.5,1,"""yes"""
"""b""","""cat""",3.0,3,"""yes"""
"""f""","""cat""",2.0,3,"""no"""
"""j""","""dog""",3.0,1,"""no"""


**13.** Change the age in row 'f' to 1.5.

In [None]:
#pandas

In [None]:
#polars

index,animal,age,visits,priority
str,str,f64,i64,str
"""a""","""cat""",2.5,1,"""yes"""
"""b""","""cat""",3.0,3,"""yes"""
"""c""","""snake""",0.5,2,"""no"""
"""d""","""dog""",,3,"""yes"""
"""e""","""dog""",5.0,2,"""no"""
"""f""","""cat""",1.5,3,"""no"""
"""g""","""snake""",4.5,1,"""no"""
"""h""","""cat""",,1,"""yes"""
"""i""","""dog""",7.0,2,"""no"""
"""j""","""dog""",3.0,1,"""no"""


**14.** Calculate the sum of all visits in `df` (i.e. find the total number of visits).

In [None]:
#pandas

np.int64(19)

In [None]:
#polars

19

**15.** Calculate the mean age for each different animal in `df`.

In [None]:
#pandas

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

In [None]:
#polars

shape: (3, 2)
┌────────┬──────────┐
│ animal ┆ age      │
│ ---    ┆ ---      │
│ str    ┆ f64      │
╞════════╪══════════╡
│ cat    ┆ 2.333333 │
│ snake  ┆ 2.5      │
│ dog    ┆ 5.0      │
└────────┴──────────┘


**16.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.

In [None]:
# pandas 


Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,1.5,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


In [None]:
# Polars


index,animal,age,visits,priority
str,str,f64,i64,str
"""a""","""cat""",2.5,1,"""yes"""
"""b""","""cat""",3.0,3,"""yes"""
"""c""","""snake""",0.5,2,"""no"""
"""d""","""dog""",,3,"""yes"""
"""e""","""dog""",5.0,2,"""no"""
…,…,…,…,…
"""g""","""snake""",4.5,1,"""no"""
"""h""","""cat""",,1,"""yes"""
"""i""","""dog""",7.0,2,"""no"""
"""j""","""dog""",3.0,1,"""no"""


**17.** Count the number of each type of animal in `df`.

In [None]:
# pandas

animal
dog      5
cat      4
snake    2
Name: count, dtype: int64

In [None]:
# Polars


animal,len
str,u32
"""dog""",5
"""snake""",2
"""cat""",4


**18.** Sort `df` first by the values in the 'age' column in *descending* order, then by the values in the 'visits' column in *ascending* order (so row `i` should be first, and row `d` should be last).

In [None]:
# pandas

Unnamed: 0,animal,age,visits,priority
i,dog,7.0,2,no
k,dog,5.5,2,no
e,dog,5.0,2,no
g,snake,4.5,1,no
j,dog,3.0,1,no
b,cat,3.0,3,yes
a,cat,2.5,1,yes
f,cat,1.5,3,no
c,snake,0.5,2,no
h,cat,,1,yes


In [None]:
# polars

index,animal,age,visits,priority
str,str,f64,i64,str
"""h""","""cat""",,1,"""yes"""
"""d""","""dog""",,3,"""yes"""
"""i""","""dog""",7.0,2,"""no"""
"""k""","""dog""",5.5,2,"""no"""
"""e""","""dog""",5.0,2,"""no"""
…,…,…,…,…
"""j""","""dog""",3.0,1,"""no"""
"""b""","""cat""",3.0,3,"""yes"""
"""a""","""cat""",2.5,1,"""yes"""
"""f""","""cat""",1.5,3,"""no"""


**19.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be `True` and 'no' should be `False`.

In [None]:
#pandas

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,True
b,cat,3.0,3,True
c,snake,0.5,2,False
d,dog,,3,True
e,dog,5.0,2,False
f,cat,1.5,3,False
g,snake,4.5,1,False
h,cat,,1,True
i,dog,7.0,2,False
j,dog,3.0,1,False


In [None]:
# polars 

index,animal,age,visits,priority
str,str,f64,i64,bool
"""a""","""cat""",2.5,1,true
"""b""","""cat""",3.0,3,true
"""c""","""snake""",0.5,2,false
"""d""","""dog""",,3,true
"""e""","""dog""",5.0,2,false
…,…,…,…,…
"""g""","""snake""",4.5,1,false
"""h""","""cat""",,1,true
"""i""","""dog""",7.0,2,false
"""j""","""dog""",3.0,1,false


**20.** In the 'animal' column, change the 'snake' entries to 'python'.

In [None]:
#pandas

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,True
b,cat,3.0,3,True
c,python,0.5,2,False
d,dog,,3,True
e,dog,5.0,2,False
f,cat,1.5,3,False
g,python,4.5,1,False
h,cat,,1,True
i,dog,7.0,2,False
j,dog,3.0,1,False


In [None]:
# polars

index,animal,age,visits,priority
str,str,f64,i64,str
"""a""","""cat""",2.5,1,"""yes"""
"""b""","""cat""",3.0,3,"""yes"""
"""c""","""python""",0.5,2,"""no"""
"""d""","""dog""",,3,"""yes"""
"""e""","""dog""",6.0,2,"""no"""
…,…,…,…,…
"""g""","""python""",4.5,1,"""no"""
"""h""","""cat""",,1,"""yes"""
"""i""","""dog""",7.0,2,"""no"""
"""j""","""dog""",3.0,1,"""no"""


**21.** For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (*hint: use a pivot table*).

In [None]:
# pandas

visits,1,2,3
animal,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
cat,2.5,,2.25
dog,3.0,5.833333,
python,4.5,0.5,


In [None]:
# polars 

animal,1,3,2
str,f64,f64,f64
"""cat""",2.5,2.25,
"""python""",4.5,,0.5
"""dog""",3.0,,6.166667
