# EDA, the Polars Way

What is [polars](https://docs.pola.rs/)?

- Library for manipulating tabular data, based on Apache Arrow
- Contender to [pandas](https://pandas.pydata.org/) (and many more similar tools)
- Since 2020, started by Ritchie Vink

### Why polars?

- Performance (rust)
- Clean(er) API
- SQL-like concepts (?)
- Lazy evaluation & query optimization
- Cool kid on the block

### Why not polars?

- Less stable (?, version 1.0 in July 2024)
- Less functionality 
- Less known
- Sometimes lengthy code
- Copilot tends to suggest pandas code ;-)

## Let's start

```shell
jupyter lab
```

or Visual Studio Code or PyCharm if you prefer those.

Open "exercises.ipynb"

In [1]:
# Uncomment and run this if you're using deepnote
# !pip install -r requirements.txt

In [2]:
# Most basic import
import polars as pl

# Other useful imports
from datetime import date, datetime
import polars.selectors as cs
import altair as alt


## Fundamental data structures

See https://docs.pola.rs/user-guide/concepts/data-structures/

### DataFrame

- a "table"?
- a "spreadsheet" table?
- a "dict of columns"?

In [3]:
# Load some data
un = pl.read_csv("data/un_basic.csv", try_parse_dates=True)
un


iso3,country,population,area,admission_date,region,subregion
str,str,i64,f64,date,str,str
"""AFG""","""Afghanistan""",41128771,652860.0,1946-11-19,"""Asia""","""Southern Asia"""
"""ALB""","""Albania""",2777689,28750.0,1955-12-14,"""Europe""","""Southern Europe"""
"""DZA""","""Algeria""",44903225,2.381741e6,1962-10-08,"""Africa""","""Northern Africa"""
"""AND""","""Andorra""",79824,470.0,1993-07-28,"""Europe""","""Southern Europe"""
"""AGO""","""Angola""",35588987,1.2467e6,1976-12-01,"""Africa""","""Sub-Saharan Africa"""
…,…,…,…,…,…,…
"""VEN""","""Venezuela""",28301696,912050.0,1945-11-15,"""Americas""","""Latin America and the Caribbea…"
"""VNM""","""Vietnam""",98186856,331340.0,1977-09-20,"""Asia""","""South-eastern Asia"""
"""YEM""","""Yemen""",33696614,527970.0,1947-09-30,"""Asia""","""Western Asia"""
"""ZMB""","""Zambia""",20017675,752610.0,1964-12-01,"""Africa""","""Sub-Saharan Africa"""


In [4]:
# What is it?
type(un)


polars.dataframe.frame.DataFrame

### Series

- a "list"?
- a "column"?
- an "array of X"?

A bit of everything...

In [5]:
# Select one column from a DataFrame
un["country"]


country
str
"""Afghanistan"""
"""Albania"""
"""Algeria"""
"""Andorra"""
"""Angola"""
…
"""Venezuela"""
"""Vietnam"""
"""Yemen"""
"""Zambia"""


In [6]:
type(un["country"])

polars.series.series.Series

### Closer look at the table

In [7]:
un.shape

(193, 7)

In [8]:
un.columns

['iso3',
 'country',
 'population',
 'area',
 'admission_date',
 'region',
 'subregion']

A random selection of rows using [`.sample`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sample.html#polars.DataFrame.sample)

In [9]:
un.sample(10)

iso3,country,population,area,admission_date,region,subregion
str,str,i64,f64,date,str,str
"""GTM""","""Guatemala""",17357886,108890.0,1945-11-21,"""Americas""","""Latin America and the Caribbea…"
"""GIN""","""Guinea""",13859341,245860.0,1958-12-12,"""Africa""","""Sub-Saharan Africa"""
"""ARE""","""United Arab Emirates""",9441129,98647.9,1971-12-09,"""Asia""","""Western Asia"""
"""VCT""","""Saint Vincent and the Grenadin…",103948,390.0,1980-09-16,"""Americas""","""Latin America and the Caribbea…"
"""TUR""","""Türkiye""",84979913,785350.0,1945-10-24,"""Asia""","""Western Asia"""
"""LBN""","""Lebanon""",5489739,10450.0,1945-10-24,"""Asia""","""Western Asia"""
"""ETH""","""Ethiopia""",123379924,1136200.0,1945-11-13,"""Africa""","""Sub-Saharan Africa"""
"""ISL""","""Iceland""",382003,103000.0,1946-11-19,"""Europe""","""Northern Europe"""
"""NOR""","""Norway""",5457127,624500.0,1945-11-27,"""Europe""","""Northern Europe"""
"""JAM""","""Jamaica""",2827377,10990.0,1962-09-18,"""Americas""","""Latin America and the Caribbea…"


Look at basic statistical properties using [`.describe`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.describe.html#polars.DataFrame.describe):

In [10]:
un.describe()

statistic,iso3,country,population,area,admission_date,region,subregion
str,str,str,f64,f64,str,str,str
"""count""","""193""","""193""",193.0,193.0,"""193""","""193""","""193"""
"""null_count""","""0""","""0""",0.0,0.0,"""0""","""0""","""0"""
"""mean""",,,40973000.0,694604.42958,"""1965-09-09 17:32:01.244000""",,
"""std""",,,149010000.0,1918400.0,,,
"""min""","""AFG""","""Afghanistan""",11312.0,20.0,"""1945-10-24""","""Africa""","""Australia and New Zealand"""
"""25%""",,,2105566.0,25710.0,"""1945-12-27""",,
"""50%""",,,9228071.0,130370.0,"""1960-09-28""",,
"""75%""",,,30547580.0,549086.87,"""1977-09-20""",,
"""max""","""ZWE""","""Zimbabwe""",1417200000.0,17098250.0,"""2011-07-14""","""Oceania""","""Western Europe"""


Other useful methods to obtain a selection of rows:
- [`.head`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html)
- [`.tail`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.tail.html)

### D(ata) types

- each column holds objects of the same type (unlike Python collections!)
- distinct from (but convertible to/from) Python classes
- the types are nullable => any value can be missing (a difference from pandas)

In [11]:
# List of all data types in a DataFrame
un.dtypes


[String, String, Int64, Float64, Date, String, String]

In [12]:
# More useful (dict)
{col: un[col].dtype for col in un.columns}


{'iso3': String,
 'country': String,
 'population': Int64,
 'area': Float64,
 'admission_date': Date,
 'region': String,
 'subregion': String}

#### Common types

- Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64
- Float32, Float64
- Decimal
- Date, Datetime, Time
- String, Categorical, Enum
- Array, List, Struct
- Boolean
- Object, Null, Binary, Unknown

See https://docs.pola.rs/user-guide/concepts/data-types/overview/

In [13]:
# Construct a Series from an object
pl.Series("city", ["Firenze", "Berlin", "Pittsburgh", "Prague"], dtype=pl.String)


city
str
"""Firenze"""
"""Berlin"""
"""Pittsburgh"""
"""Prague"""


In [14]:
# Construct a DataFrame from a dictionary of lists.

pl.DataFrame(
    {
        "event": ["PyCon Italia", "PyCon.DE & PyData Berlin", "PyCon US", "EuroPython"],
        "city": ["Firenze", "Berlin", "Pittsburgh", "Prague"],
        "country": ["Italy", "Germany", "United States of America", "Czechia"],
        "start_date": [
            date(2024, 5, 22),
            date(2024, 4, 22),
            date(2024, 5, 15),
            date(2024, 7, 8),
        ],
    }
)


event,city,country,start_date
str,str,str,date
"""PyCon Italia""","""Firenze""","""Italy""",2024-05-22
"""PyCon.DE & PyData Berlin""","""Berlin""","""Germany""",2024-04-22
"""PyCon US""","""Pittsburgh""","""United States of America""",2024-05-15
"""EuroPython""","""Prague""","""Czechia""",2024-07-08


In [15]:
# Pandas window: a pandas DataFrame can be converted to polars

import pandas as pd

pandas_df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": pd.Series([1, 2, 3, 4, 5], dtype="float64")}
)
pl.DataFrame(pandas_df)


a,b
i64,f64
1,1.0
2,2.0
3,3.0
4,4.0
5,5.0


### Indices

Polars, unlike pandas (with its complex index types, multi-indices etc.), does not have indices.
End of story.

## Basic plotting

Choose any library you want:

- [altair](https://altair-viz.github.io/)
- [plotly](https://plotly.com/python/)
- [matplotlib](https://matplotlib.org/)
- [seaborn](https://seaborn.pydata.org/)
- [hvplot](https://hvplot.holoviz.org/)
- ...

### "Built-in" altair support

Note: altair needs to be installed for this to work.

```python
df.plot()
df.plot.bar()
df.plot.scatter()
...
```


In [16]:
un.plot.point(
    # x="area",
    # y="population",
    x=alt.X("area", scale=alt.Scale(type='log')),
    y=alt.Y("population", scale=alt.Scale(type='log')),    
    color="region",
    tooltip=["country", "population", "area"],
).properties(
    title="Countries of the world"
)


Slightly more info here:
- https://docs.pola.rs/py-polars/html/reference/dataframe/plot.html
- https://docs.pola.rs/user-guide/misc/visualization/
- https://altair-viz.github.io/ - altair documentation
- https://hvplot.holoviz.org/user_guide/Pandas_API.html - pandas API for hvplot. It works almost the same with polars

**Exercise**: 
1. Load the list of cities from an external file.
2. Draw the "poor man's map of the world", based on the "lat" and "lng" columns of the table.
Optionally: You can embellish it with any aesthetics you want.

In [17]:
# Exercise load-cities
# Load the data:
# - the file is called "data/worldcities.parquet"
# - find the appropriate loading method from https://docs.pola.rs/user-guide/io/
cities = pl.read_parquet("data/worldcities.parquet")
cities


city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
str,str,f64,f64,str,str,str,str,str,i64,i64
"""Tokyo""","""Tokyo""",35.6897,139.6922,"""Japan""","""JP""","""JPN""","""Tōkyō""","""primary""",37732000,1392685764
"""Jakarta""","""Jakarta""",-6.175,106.8275,"""Indonesia""","""ID""","""IDN""","""Jakarta""","""primary""",33756000,1360771077
"""Delhi""","""Delhi""",28.61,77.23,"""India""","""IN""","""IND""","""Delhi""","""admin""",32226000,1356872604
"""Guangzhou""","""Guangzhou""",23.13,113.26,"""China""","""CN""","""CHN""","""Guangdong""","""admin""",26940000,1156237133
"""Mumbai""","""Mumbai""",19.0761,72.8775,"""India""","""IN""","""IND""","""Mahārāshtra""","""admin""",24973000,1356226629
…,…,…,…,…,…,…,…,…,…,…
"""Munha-dong""","""Munha-dong""",39.3813,127.2517,"""Korea, North""","""KP""","""PRK""","""Kangwŏn""",,,1408979215
"""Sil-li""","""Sil-li""",39.488,125.464,"""Korea, North""","""KP""","""PRK""","""P’yŏngnam""",,,1408767958
"""Muan""","""Muan""",34.9897,126.4714,"""Korea, South""","""KR""","""KOR""","""Jeonnam""","""admin""",,1410001061
"""Hongseong""","""Hongseong""",36.6009,126.665,"""Korea, South""","""KR""","""KOR""","""Chungnam""","""admin""",,1410822139


In [18]:
# Exercise world-map
# Plot the cities
# - find the appropriate plotting method
# - use proper arguments for the call
cities.head(4999).plot.scatter(
    x="lng",
    y="lat",
    color="country",
)

## Sorting

- [.sort](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sort.html)

In [None]:
un.sort("admission_date")

Note: As with all other manipulations, we receive a new DataFrame. There is no "inplace" in polars (unlike pandas, where it is merely discouraged).

In [None]:
un.sort("population", descending=True)

In [None]:
un.sort("region", "subregion")

**Exercise:** Create a bar plot of 10 countries with the lowest population.

Hints:
- [`.sort`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sort.html)
- [`.head`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html) 
- [`.plot.bar`](https://hvplot.holoviz.org/reference/tabular/bar.html) - docs from the hvplot pages
- Use `hover_cols` and `x` args to describe the plot properly (optional)

In [None]:
# Exercise ten-smallest
sorted_population = un.sort("population")
ten_smallest = sorted_population.head(10)

# Plot it
ten_smallest.plot.bar(x="country", y="population", hover_cols=["region", "subregion"])

## Expressions & selection

To select one column, we can use the bracket notation:


In [None]:
un["country"]

Something very similar can be achieved by using [`.select`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.select.html)

In [None]:
# Find the difference
un.select("country")


And now we explicitly construct an expression ("choose the value in column country"):

In [None]:
# Select a column
un.select(pl.col("country"))  # Or pl.col.country

Expression, representing column(s) in a dataframe:

In [None]:
pl.col("country")

Expressions describe operations (or trees of operations) that work on dataframes and produce one or more Series.

- they are evaluated only when needed
- they can be executed in chunks
- polars can optimize the evaluation of complex expression trees

See [User guide](https://docs.pola.rs/user-guide/concepts/expressions/) for more.

In [None]:
# Literal value
un.select(pl.lit("country"))

Finally, we get to computing.

First, direct operations on series:

In [None]:
un["population"] / un["area"]

In [None]:
date.today() - un["admission_date"]

However, let's construct this as an expression:

In [None]:
pl.col("population") / pl.col("area")

We can use expressions in `.select`:

In [None]:
un.select(
    "country",
    "population",
    "area",
    (pl.col("population") / pl.col("area")).alias("density"),
).sort("density", descending=True)


In [None]:
un.top_k(10, by=(pl.col("population") / pl.col("area")))

By the way, you can use SQL in polars!

See https://docs.pola.rs/api/python/stable/reference/sql/python_api.html.

In [None]:
pl.sql(
    "SELECT country, population, area, population / area AS density "
    "FROM un ORDER BY density DESC LIMIT 10"
).collect()  # pl.sql returns a LazyFrame by default, so we collect it

To extend a dataframe with new columns:

[`.with_columns`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.with_columns.html)

In [None]:
un.with_columns(
    density=pl.col("population") / pl.col("area"),
)

Some other methods to change the columns:
- [`.drop`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.drop.html) to remove
- [`.rename`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.rename.html) to change column names
- [`.cast`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.cast.html) to change column data types

Note: You can pass an expression to `.sort` too.

**Exercise**: Add a column "membership_in_years", giving how long each country has been a member of the U.N.

- Use the difference from above
- Use [`polars.Expr.dt.total_days`](https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.dt.total_days.html)
- Divide to get years
- (Optionally) If you want to export "complete" years, use integer division and `cast(pl.Int64)` to convert to integers

In [None]:
# Exercise membership-years
un...(
    
)

## Filtering

When you want to select rows based on some criteria... Well, a short detour:

### Contexts

Every expression can only be executed within one of the following contexts:

1. Selection (`.select`, `.with_columns`) - we already saw this
2. Filtering (`.filter`)
3. Aggregation - we will see this

See https://docs.pola.rs/user-guide/concepts/contexts/.


Pass any boolean expression to the [`.filter`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.filter.html) method:

In [None]:
un.filter(pl.col("population") > 1e9)

In [None]:
un.filter(pl.col("country")=="Italy")

In [None]:
un.filter(country="Czechia")

**Exercise:** Show how the share of electricity coming from different sources evolved over time in Czechia (or in some other country).

In [None]:
# Exercise energy-cz
el_source = pl.read_csv(
    "data/our_world_in_data/electricity-source.csv", infer_schema_length=5000
)  # Note the infer_schema_length
el_source 
# el_source_cz = ...
# el_source_cz


**Exercise:** Create a line (or stacked area) plot of the fractions.

Hints:
- `.plot.area` is your friend
- you can supply multiple column names in the `y` argument
- An example is here: https://hvplot.holoviz.org/user_guide/Pandas_API.html#area-plot

In [None]:
# Exercise energy-cz
el_source_cz.plot(..., ...)


## Aggregations

When you want to find the summary statistics, you can use one of several Series/DataFrame functions, such as `.max`, `.mean`, `.median`, `.quantile`, ...

In [None]:
un.max()

In [None]:
un.describe()

In [None]:
un.max()  # Careful, it is per-column

In [None]:
un.median()

It works on series too:

In [None]:
un["area"].quantile(0.05)

**Exercise:** Find all founding members of the U.N.

Hints:
- find the admission date of the first member first
- you can call statistics methods on the Series directly
- you might want to find the proper minimizing function (min)

In [None]:
# Exercise founding-members
first_date = ...
founding_members = ...
founding_members


### Grouped aggregations

[.group_by](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html)

- it lets you select the columns (or expressions) to group by.
- in combination with a few selected functions or the `.agg` method, it enters a new context for operations on the groups

In [None]:
un.group_by("region")

In [None]:
un.group_by("region").min()

In [None]:
un.group_by("subregion").len()

In [None]:
un.group_by("region", "subregion").len().sort("region", "subregion")

In [None]:
un.group_by("subregion").sum()

The `.agg` method works similarly to `.select`, except that the expressions are evaluated separately within each group and the results are then combined, together with the group keys.

In [None]:
un.group_by("region", "subregion").agg(
    pl.sum("*"),
    total_population=pl.col("population").sum(),
    total_area=pl.col("area").sum(),
    num_countries=pl.col("area").count(),
    
).sort("region", "subregion")


In [None]:
forest_area = pl.read_csv("data/our_world_in_data/forest-area-km.csv")
forest_area.filter(Code="CZE")

**Exercise:** Find the relative change of forestation for each country over the available range of years.

- group by an appropriate column
- find the proper aggregation functions (https://docs.pola.rs/user-guide/expressions/aggregation) - first, last might be your friends
- optionally exclude the infinite and NA values ([`.is_finite`](https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.is_finite.html)) 

In [None]:
# Exercise forest-change
first_and_last_forest_area = forest_area.group_by(...).agg(
    ...
)
relative_change = first_and_last_forest_area....
# Filter finite and sort
relative_change = relative_change....
relative_change

### Time aggregations

In [None]:
weather = pl.read_parquet("data/prague-meteostat.parquet")
weather


What about timezones? Let's forget about them for now... but they would deserve their own workshop. See https://docs.pola.rs/user-guide/transformations/time-series/timezones/

In [None]:
weather.plot(y="temp")

In [None]:
weather.plot(x="time", y="temp")

#### Intermezzo: Missing values

- Find them: [`.is_null`](https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.is_null.html)
- Remove them: `.filter` is your friend, or `.drop_nulls`
- Replace them: [`.fill_null`](https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.fill_null.html)

More on that in [User Guide](https://docs.pola.rs/user-guide/expressions/missing-data/)

In [None]:
weather.filter(pl.col("temp").is_null())

In [None]:
weather.drop_nulls()

Don't confuse nulls with NaN (pandas users!) and/or infinities:

In [None]:
pl.Series([1, 0, None]) / pl.Series([0, 0, 0])

Back to time series

In [None]:
# We can group by the year extracted from the "time" column
yearly_mean = weather.group_by(
    pl.col("time").dt.year().alias("year"), maintain_order=True
).agg(avg_temp=pl.col("temp").drop_nans().mean())
yearly_mean


In [None]:
yearly_mean.plot(x="year", y="avg_temp")

But grouping by calendar month? That might be trickier.

And hence the [`.group_by_dynamic`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html) method.

- it takes one column/expression to serve as a time index
- the data must be sorted in that column (either by previous `.sort` or [`.set_sorted`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.set_sorted.html))
- you can specify the aggregation period in a number of ways

In [None]:
monthly = (
    weather
    .group_by_dynamic("time", every="1mo")
    .agg(avg_temp=pl.col("temp").mean() ) #.drop_nans().mean())
)
monthly.plot(x="time", y="avg_temp")


**Exercise:** Which day had the highest minimum temperature (probably meaning the hottest night) in Prague in the last 10 years?

Hints:
- Keep only the recent values
- Group by an appropriate time period
- Find the minimum
- You might need to `.drop_nans` and `.drop_nulls` to work with reasonable values (at appropriate moment)
- Work with the minima to find the top value (or a few of them)

In [None]:
# Exercise hottest-night
recent_weather = weather.filter(...)
min_daily_temperatures = recent_weather.group_by_dynamic(...).agg(...)
top_nights = min_daily_temperatures.....head(10)
top_nights

Similarly, you can group by a rolling window using [`.rolling`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.rolling.html), ...

**Exercise:** Can you find the highest rainfall within a 48-hour (rolling) window? (Not every 2 days)

In [None]:
# Exercise highest-rainfall
weather....agg(...).top_k(10, by=...)

## Joining

We have data coming from two (or more) tables. To combine them, we call the [`.join`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.join.html) method on one of them and specify:

- which column(s) should match (`on`, `left_on`, `right_on` arguments)
- how to deal with rows that have no match in the other table (`how`)
- how to rename columns that appear in both dataframes but are not join keys (`suffix`)
- whether to restrict the relationship somehow (`validate`)

More on that in the [User guide](https://docs.pola.rs/user-guide/transformations/joins/).

(Pandas note: the different, almost equivalent ways of joining in pandas are a big source of confusion)

In [None]:
current_forest_area = forest_area.group_by("Code").last()
current_forest_area

In [None]:
un

**Example:** Find the total forest area per region.

In [None]:
current_forest_area.join(
    un, left_on="Code", right_on="iso3", how="inner"
).group_by("region").agg(pl.col("Forest area").sum(), pl.col("area").sum()).with_columns(
    forest_area_ratio=pl.col("Forest area") / pl.col("area") / 100 / 0.71
)


In [None]:
cities

**Exercise:** Find the number of cities over 1 million inhabitants per region / subregion. (Bonus: include the largest of those cities)

Hints:
- Join with the "un" table on appropriate columns
- Aggregate over appropriate columns and use [`.len`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.dataframe.group_by.GroupBy.len.html)

In [None]:
# Exercise million-cities
million_cities = cities.filter(pl.col("population") > 1e6)
million_cities_with_country = million_cities.join(...)
million_cities_per_region = (
    million_cities_with_country.group_by(...)...
)
million_cities_per_region


## Wide / long table format

### Wide -> long

Sometimes we have a 2D table where the column names do not describe independent attributes but rather one condition under many circumstances, like in the following:

In [None]:
wb_pop_wide = pl.read_csv("data/world_bank-population.csv")
wb_pop_wide


We can transform the column labels to one column and convert the table to "long format":

[.unpivot](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unpivot.html#polars.DataFrame.unpivot)

In [None]:
wb_pop = (
    wb_pop_wide.unpivot(
        on=cs.numeric(),
        index=["Country Code", "Country Name"],
        variable_name="year",
        value_name="population",
    )
    .cast({"year": pl.Int64})
    .rename({"Country Code": "iso3", "Country Name": "country"})
)
wb_pop.sample(10)


In [None]:
world_pop = wb_pop.filter(iso3="WLD").drop(
    "iso3", "country"
)  # .plot(x="year", y="population")
world_pop.plot(x="year", y="population")


In [None]:
# Alternative
wb_pop_wide.filter(pl.col("Country Code") == "WLD").select(cs.numeric()).transpose(
    header_name="year", include_header=True
).rename({"column_0": "population"})


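### Long -> wide (pivoting)

As the opposite of unpivoting (melting), we want to aggregate something over a pair of columns and create a "2D" pivot table.

[.pivot](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.pivot.html)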

**Example:** Plot the daily pattern of temperatures in Prague for each month of the year (since 2000).

In [None]:
month_day_data = (
    weather
    .drop_nulls("temp")
    .filter(
        pl.col("temp").is_not_nan(),
        pl.col("time") >= date(2000, 1, 1),
    )
    .select("time", "temp")
    .with_columns(
        month=pl.col("time").dt.month(),
        month_name=pl.col("time").dt.strftime("%B"),
        hour=pl.col("time").dt.hour(),
    )
).drop("month", "time")
month_day_data


In [None]:
month_day_data.pivot(
    on="month_name",
    values="temp",
    index="hour",
    aggregate_function="mean",
)

In [None]:
month_day_table = month_day_data.pivot(
    on="month_name",
    values="temp",
    index="hour",
    aggregate_function="mean",
)
month_day_table


In [None]:
month_day_table.plot(x="hour")

**Exercise:** Find how the total forested area developed for different regions year by year and draw a stacked area plot.

Hints:
- Join the `un` and `forest_area` dataframes, using appropriate columns
- Pivot using the region as `on` (the new columns) and the year as row `index`. Find the appropriate `values` column.
- Find the appropriate value for the `aggregate_function` argument.

Question: You should see something weird in the output - can you explain?

In [None]:
# Exercise forest-region
forest_area_by_region = forest_area.join(un, ...).pivot(...)
forest_area_by_region

## Lazy operations

One of the best attributes of polars is its ability to optimise queries and perform them in chunks.

We can start lazy mode by
- lazy input functions ([`scan_csv`](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html), [`scan_parquet`](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_parquet.html), ...)
- calling [`.lazy`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.lazy.html) on a dataframe

All operations on the lazy dataframes are added to the query.

The final results are obtained by a [`.collect`](https://docs.pola.rs/py-polars/html/reference/lazyframe/api/polars.LazyFrame.collect.html) or some other method call.

In [None]:
lazy_cities = pl.scan_parquet("data/worldcities.parquet")  # lazy: nothing is read yet
lazy_cities

In [None]:
lazy_mean_population = (
    lazy_cities.lazy().filter(pl.col("population") > 1e6)
    .group_by("country")
    .agg(pl.col("population").mean().alias("mean_population"))
    .sort("mean_population", descending=True)
)
lazy_mean_population


In [None]:
print(lazy_mean_population.explain(optimized=True))

In [None]:
pop = lazy_mean_population.collect()
pop

Find more on that in the [User Guide](https://docs.pola.rs/user-guide/lazy/). We do not have any larger-than-memory data file in our repo...

In [None]:
pop.write_excel("pop.xlsx")