# EDA, the Polars Way

What is [polars](https://docs.pola.rs/)?

- Library for manipulating tabular data, based on Apache Arrow
- Contender to [pandas](https://pandas.pydata.org/) (and many more similar tools)
- Since 2020, started by Ritchie Vink

### Why polars?

- Performance (rust)
- Clean(er) API
- Lazy evaluation & query optimization
- Cool kid on the block

### Why not polars?

- Less stable
- Less functionality
- Less known
- Sometimes lengthy code
- Copilot tends to suggest pandas code ;-)

## Let's start

```shell
jupyter lab
```

or Visual Studio Code or PyCharm if you prefer those.

Open "exercises.ipynb"

In [None]:
# Most basic import
import polars as pl

# Other useful imports
from datetime import date, datetime
import polars.selectors as cs


## Fundamental data structures

See https://docs.pola.rs/user-guide/concepts/data-structures/

### DataFrame

- a "table"?
- a "spreadsheet" table?
- a "dict of columns"?

In [None]:
# Load some data
un = pl.read_csv("data/un_basic.csv", try_parse_dates=True)
un


In [None]:
# What is it?
type(un)


### Series

- a "list"?
- a "column"?
- an "array of X"?

A bit of everything...

In [None]:
# Select one column from a DataFrame
un["country"]


In [None]:
type(un["country"])

### Closer look at the table

In [None]:
un.shape

In [None]:
un.columns

A random selection of rows using [`.sample`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sample.html#polars.DataFrame.sample)

In [None]:
un.sample(10)

Look at basic statistical properties using [`.describe`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.describe.html#polars.DataFrame.describe):

In [None]:
un.describe()

Other useful methods to obtain a selection of rows:
- [`.head`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html)
- [`.tail`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.tail.html)

### D(ata) types

- each column holds objects of the same type (unlike Python collections!)
- distinct from (but convertible to/from) Python classes
- the types are nullable => each value can be missing

In [None]:
# List of all data types in a DataFrame
un.dtypes


In [None]:
# More useful (dict)
{col: un[col].dtype for col in un.columns}


#### Common types

- Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64
- Float32, Float64
- Decimal
- Date, Datetime, Time
- String, Categorical, Enum
- Array, List, Struct
- Boolean
- Object, Null, Binary, Unknown

See https://docs.pola.rs/user-guide/concepts/data-types/overview/

In [None]:
# Construct a Series from an object
pl.Series("city", ["Firenze", "Berlin", "Pittsburgh", "Prague"], dtype=pl.String)


In [None]:
# Construct a DataFrame from dictionary of lists.

pl.DataFrame(
    {
        "event": ["PyCon Italia", "PyCon.DE & PyData Berlin", "PyCon US", "EuroPython"],
        "city": ["Firenze", "Berlin", "Pittsburgh", "Prague"],
        "country": ["Italy", "Germany", "United States of America", "Czechia"],
        "start_date": [
            date(2024, 5, 22),
            date(2024, 4, 22),
            date(2024, 5, 15),
            date(2024, 7, 8),
        ],
    }
)


In [None]:
# Pandas window

import pandas as pd

# Note: this is a pandas DataFrame (not a Series)
pandas_df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": pd.Series([1, 2, 3, 4, 5], dtype="float64")}
)
pl.DataFrame(pandas_df)  # pl.from_pandas(pandas_df) works as well


### Indices

Polars, unlike pandas, does not have indices.
End of story.

## Basic plotting

Choose any library you want:

- [plotly](https://plotly.com/python/)
- [matplotlib](https://matplotlib.org/)
- [seaborn](https://seaborn.pydata.org/)
- [hvplot](https://hvplot.holoviz.org/)
- ...

### "Built-in" hvplot support

Note: hvplot must be installed

```python
df.plot()
df.plot.bar()
df.plot.scatter()
...
```


In [None]:
un.plot.scatter(
    x="area",
    y="population",
    # logx=True,
    # logy=True,
    # color="region",
    # title="Countries of the World",
    # hover_cols=["country"],
)


Slightly more info here:
- https://docs.pola.rs/py-polars/html/reference/dataframe/plot.html
- https://docs.pola.rs/user-guide/misc/visualization/ 
- https://hvplot.holoviz.org/user_guide/Pandas_API.html - pandas API for hvplot. It works almost the same with polars

**Exercise**: 
1. Load the list of cities from an external file.
2. Draw the "poor man's map of the world", based on the "lat" and "lng" columns of the table.
Optionally: You can embellish it with any aesthetics you want.

In [None]:
# Exercise load-cities
# Load the data:
# - the file is called "data/worldcities.parquet"
# - find the appropriate loading method from https://docs.pola.rs/user-guide/io/
cities = ...
cities


In [None]:
# Exercise world-map
# Plot the cities
# - find the appropriate plotting method
# - use proper arguments for the call
cities.plot....(
    x=...,
    y=...,
)

## Sorting

- [.sort](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sort.html)

In [None]:
un.sort("admission_date")

Note: We receive a new DataFrame, as with all other manipulations. There is no "inplace" in polars. (unlike pandas, where this is only discouraged)

In [None]:
un.sort("population", descending=True)

In [None]:
un.sort("region", "subregion")

**Exercise** Create a bar plot of 10 countries with the lowest population.

Hints:
- [`.sort`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sort.html)
- [`.head`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html) 
- [`.plot.bar`](https://hvplot.holoviz.org/reference/tabular/bar.html) - docs from the hvplot pages
- Use `hover_cols` and `x` args to describe the plot properly (optional)

In [None]:
# Exercise ten-smallest
sorted_population = ...
ten_smallest = ...

# Plot it
ten_smallest.plot....

## Expressions & selection

In [None]:
un["country"]

In [None]:
# Find the difference
un.select("country")


Expression, representing column(s) in a dataframe:

In [None]:
pl.col("country")

In [None]:
pl.lit("country")

In [None]:
# Select a column
un.select(pl.col("country"))  # Or pl.col.country


In [None]:
# Literal value
un.select(pl.lit("country"))


In [None]:
# The opposite of selecting: drop a column and keep the rest
un.drop("country")

Note: You can pass an expression to `.sort` too.

## Filtering

When you want to select rows based on some criteria... Well, a short detour:

### Contexts

Every expression can only be executed within one of the following contexts:

1. Selection (`.select`, `.with_columns`) - we already saw this
2. Filtering (`.filter`)
3. Aggregation

See https://docs.pola.rs/user-guide/concepts/contexts/.


Pass any boolean expression to the [`.filter`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.filter.html) method:

In [None]:
un.filter(country="Italy")

In [None]:
un.filter(pl.col("population") > 1e9)

**Exercise:** Show how the share of electricity coming from different sources evolved over time in Italy (or in some other country).

In [None]:
# Exercise energy-it
el_source = pl.read_csv(
    "data/our_world_in_data/electricity-source.csv", infer_schema_length=5000
)  # Note the infer_schema_length
el_source_italy = ...
el_source_italy


**Exercise:** Create a line (or stacked area) plot of the fractions.

Hints:
- `.plot.area` is your friend
- you can supply multiple column names in the `y` argument
- An example is here: https://hvplot.holoviz.org/user_guide/Pandas_API.html#area-plot

In [None]:
# Exercise energy-it
el_source_italy.plot(..., ...)


## Operations

In [None]:
un["population"] / un["area"]

In [None]:
date.today() - un["admission_date"]

In [None]:
type(pl.col("population") / pl.col("area"))

In [None]:
un.select(
    "country",
    "population",
    "area",
    (pl.col("population") / pl.col("area")).alias("density"),
).sort("density", descending=True)


In [None]:
un.with_columns(
    density=pl.col("population") / pl.col("area"),
)


## Aggregations

When you want to find the summary statistics, you can use one of several Series/DataFrame functions, such as `.max`, `.mean`, `.median`, `.quantile`, ...

In [None]:
un.max()

In [None]:
un.median()

In [None]:
un.quantile(0.05)

**Exercise** Find all founding members of the U.N. 

Hints:
- find the admission date of the first member first
- you might want to find the proper minimizing function (min)

In [None]:
# Exercise founding-members
first_date = ...
founding_members = ...
founding_members


### Grouped aggregations

[.group_by](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html)

- it lets you select the columns (or expressions) to group by.
- combined with one of a few shortcut methods (such as `.len` or `.sum`) or with `.agg`, it enters a new context for operations on the groups

In [None]:
un.group_by("region").len()

In [None]:
un.group_by("subregion").len()

In [None]:
un.group_by("region", "subregion").len().sort("region", "subregion")

In [None]:
un.group_by("subregion").sum()

In [None]:
un.group_by("region", "subregion").agg(
    pl.col("population").sum().alias("total_population"),
    pl.col("area").sum().alias("total_area"),
    pl.col("area").count().alias("num_countries"),
).sort("region", "subregion")


In [None]:
forest_area = pl.read_csv("data/our_world_in_data/forest-area-km.csv")
forest_area


**Exercise:** Find the relative change in forest area for each country over the available range of years.

- group by an appropriate column
- find the proper aggregation functions (https://docs.pola.rs/user-guide/expressions/aggregation)
- optionally exclude the infinite and NA values ([`.is_finite`](https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.is_finite.html)) 

In [None]:
# Exercise forest-change
first_and_last_forest_area = forest_area.group_by(...).agg(
    ...
)
relative_change = first_and_last_forest_area....
# Filter finite and sort
relative_change = relative_change....

### Time aggregations

In [None]:
weather = pl.read_parquet("data/florence-meteostat.parquet")
weather


What about timezones? Let's forget about them for now... but they would deserve their own workshop. See https://docs.pola.rs/user-guide/transformations/time-series/timezones/

In [None]:
weather.plot(y="temp")

In [None]:
weather.plot(x="time", y="temp")

In [None]:
# We can group by the year extracted from the time column
yearly_mean = weather.group_by(
    pl.col("time").dt.year().alias("year"), maintain_order=True
).agg(avg_temp=pl.col("temp").drop_nans().mean())
yearly_mean


In [None]:
yearly_mean.plot(x="year", y="avg_temp")

But grouping by calendar month? That might be trickier.

And hence the [`.group_by_dynamic`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html) method.

- it takes one column/expression to serve as a time index
- the data must be sorted in that column (either by previous `.sort` or [`.set_sorted`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.set_sorted.html))
- you can specify the aggregation period in a number of ways

In [None]:
monthly = (
    weather.set_sorted("time")
    .group_by_dynamic("time", every="1mo")
    .agg(avg_temp=pl.col("temp").drop_nans().mean())
)
monthly.plot(x="time", y="avg_temp")


**Exercise:** Which day had the highest minimum temperature (probably meaning the hottest night) in Florence in the last 10 years?

Hints:
- Keep only the recent values
- Group by an appropriate time period
- Find the minimum
- You might need to `.drop_nans` and `.drop_nulls` to work with reasonable values (at appropriate moment)
- Work with the minima to find the top value (or a few of them)

In [None]:
# Exercise hottest-night
recent_weather = weather....
min_daily_temperatures = recent_weather.set_sorted("time").group_by_dynamic(...).agg(...)
top_nights = min_daily_temperatures....
top_nights

Similarly, you can group by a rolling window using [`.rolling`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.rolling.html), ...

## Joining

We have data coming from two (or more) tables and want to combine them. We call the [`.join`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.join.html) method on one of them and specify:

- which column(s) should match (`on`, `left_on`, `right_on` arguments)
- how to deal with missing values (`how`)
- how to deal with non-matching columns that are in both dataframes (`suffix`)
- whether to restrict the relationship somehow (`validate`)

More on that in the [User guide](https://docs.pola.rs/user-guide/transformations/joins/).

(Pandas note: the different, almost equivalent ways of joining in pandas are a big source of confusion)

In [None]:
current_forest_area = forest_area.group_by("Code").last()

**Example:** Find the total forest area per region. (Bonus: find also the percentage)

In [None]:
current_forest_area.join(
    un, left_on="Code", right_on="iso3", how="inner"
).sum().select("Forest area", "area").with_columns(
    forest_area_ratio=pl.col("Forest area") / pl.col("area") / 100 / 0.71
)


**Exercise** Find the number of cities over 1 million inhabitants per region / subregion. (Bonus: include the largest of those cities)

Hints:
- Join with the "un" table on appropriate columns
- Aggregate over appropriate columns and use [`.len`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.dataframe.group_by.GroupBy.len.html)

In [None]:
# Exercise million-cities
million_cities = cities....
million_cities_with_country = million_cities....
million_cities_per_region = (
    million_cities_with_country.group_by(...)
    ...
)


## Wide / long table format

### Wide -> long

Sometimes we have a 2D table where the column names do not describe independent attributes but rather one condition under many circumstances, like in the following:

In [None]:
wb_pop_wide = pl.read_csv("data/world_bank-population.csv")
wb_pop_wide


We can transform the column labels to one column and convert the table to "long format":

[.melt](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.melt.html#polars.DataFrame.melt)

In [None]:
wb_pop = (
    wb_pop_wide.melt(
        id_vars=["Country Code", "Country Name"],
        variable_name="year",
        value_vars=cs.numeric(),
        value_name="population",
    )
    .cast({"year": pl.Int64})
    .rename({"Country Code": "iso3", "Country Name": "country"})
)
wb_pop


In [None]:
world_pop = wb_pop.filter(iso3="WLD").drop(
    "iso3", "country"
)  # .plot(x="year", y="population")
world_pop


In [None]:
# Alternative
wb_pop_wide.filter(pl.col("Country Code") == "WLD").select(cs.numeric()).transpose(
    header_name="year", include_header=True
).rename({"column_0": "population"})


### Long -> wide (pivoting)

As the opposite of melting, we want to aggregate something over a pair of columns and create a "2D" pivot table.

[.pivot](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.pivot.html)

**Example:** Plot the daily pattern of temperatures in Florence for each month of the year (since 2020).

In [None]:
month_day_data = (
    weather.drop_nulls()
    .filter(
        pl.col("temp").is_not_nan(),
        pl.col("time") >= date(2020, 1, 1),
    )
    .select("time", "temp")
    .with_columns(
        pl.col("time").dt.month().alias("month"),
        pl.col("time").dt.strftime("%B").alias("month_name"),
        pl.col("time").dt.hour().alias("hour"),
    )
)
month_day_data


In [None]:
month_day_table = month_day_data.pivot(
    values="temp",
    index="hour",
    columns="month_name",
    aggregate_function="mean",
)
month_day_table


In [None]:
month_day_table.plot(x="hour")

**Exercise:** Find how the total forested area developed for different regions year by year and draw a stacked area plot.

Hints:
- Join the `un` and `forest_area` dataframes, using appropriate columns
- Pivot using the region as `columns` and the year as the row `index`. Find the appropriate `values` column.
- Find the appropriate value for the `aggregate_function` argument.

Question: You should see something weird in the output - can you explain?

In [None]:
# Exercise forest-region
forest_area_by_region = forest_area.join(...).pivot(..., aggregate_function=...)
forest_area_by_region.plot....

## Lazy operations

One of the best attributes of polars is its ability to optimise queries and perform them in chunks.

We can start lazy mode by
- lazy input functions ([`scan_csv`](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html), [`scan_parquet`](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_parquet.html), ...)
- calling [`.lazy`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.lazy.html) on a dataframe

All operations on the lazy dataframes are added to the query.

The final results are obtained by a [`.collect`](https://docs.pola.rs/py-polars/html/reference/lazyframe/api/polars.LazyFrame.collect.html) or some other method call.

In [None]:
lazy_cities = pl.scan_parquet("data/worldcities.parquet")
lazy_cities

In [None]:
lazy_mean_population = (
    lazy_cities.filter(pl.col("population") > 1e6)
    .group_by("country")
    .agg(pl.col("population").mean().alias("mean_population"))
    .sort("mean_population", descending=True)
)
lazy_mean_population


In [None]:
print(lazy_mean_population.explain(optimized=True))

In [None]:
lazy_mean_population.collect()

Find more on that in the [User Guide](https://docs.pola.rs/user-guide/lazy/). We do not have any larger-than-memory data file in our repo...