# EDA, the Polars Way

What is [polars](https://docs.pola.rs/)?

- Library for manipulating tabular data, based on Apache Arrow
- Contender to [pandas](https://pandas.pydata.org/) (and many more similar tools)
- Since 2020, started by Ritchie Vink

### Why polars?

- Performance (written in Rust)
- Clean(er) API
- Lazy evaluation & query optimization
- Cool kid on the block

### Why not polars?

- Less stable
- Less functionality
- Less known
- Sometimes lengthy code

## Let's start

```shell
jupyter lab
```

TODO: Open "exercises/10-exploration.ipynb"

In [None]:
import polars as pl
import polars.selectors as cs

## Basic types

See https://docs.pola.rs/user-guide/concepts/data-structures/

In [None]:
# Other useful imports
from datetime import date, datetime

### DataFrame

- a "table"?
- a "spreadsheet" table?
- a "dict of columns"?

In [None]:
# Load some data
un = pl.read_csv("data/un_basic.csv", try_parse_dates=True)
un

In [None]:
# What is it?
type(un)

### Series

- a "list"?
- a "column"?
- an "array of X"?

A bit of everything...

In [None]:
# Select one column from a DataFrame
un["country"]

In [None]:
type(un["country"])

In [None]:
un.shape

In [None]:
un.columns

In [None]:
un.sample(10)

In [None]:
un.describe()

### D(ata) types

- each column holds values of a single type (unlike Python collections!)
- distinct from (but convertible to/from) Python classes
- the types are nullable => each value can be missing

In [None]:
# List of all data types in a DataFrame
un.dtypes

#### Common types

- Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64
- Float32, Float64
- Decimal
- Date, Datetime, Time
- String, Categorical, Enum
- Array, List, Struct
- Boolean
- Object, Null, Binary, Unknown

See https://docs.pola.rs/user-guide/concepts/data-types/overview/

In [None]:
# More useful: a dict mapping column names to their data types
{col: un[col].dtype for col in un.columns}

In [None]:
# Construct a Series from an object
pl.Series("city", ["Firenze", "Berlin", "Pittsburgh", "Prague"], dtype=pl.String)

In [None]:
# Construct a DataFrame
pl.DataFrame({
    "event": ["PyCon Italia", "PyCon.DE & PyData Berlin", "PyCon US", "EuroPython"],
    "city": ["Firenze", "Berlin", "Pittsburgh", "Prague"],
    "country": ["Italy", "Germany", "United States of America", "Czechia"],
    "start_date": [date(2024, 5, 22), date(2024, 4, 22), date(2024, 5, 15), date(2024, 7, 8)]
})

In [None]:
# Pandas window: convert a pandas DataFrame to polars

import pandas as pd
pandas_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": pd.Series([1, 2, 3, 4, 5], dtype="float64")
})
pl.DataFrame(pandas_df)

## Basic plotting

Choose any library you want:

- [plotly](https://plotly.com/python/)
- [matplotlib](https://matplotlib.org/)
- [seaborn](https://seaborn.pydata.org/)
- [hvplot](https://hvplot.holoviz.org/)
- ...

### "Built-in" hvplot support

Note: hvplot must be installed

```python
df.plot()
df.plot.bar()
df.plot.scatter()
```


In [None]:
un.plot.scatter(
    x="area",
    y="population",
    logx=True,
    logy=True,
    color="region",
    title="Countries of the World",
    hover_cols=["country"],
)

**Exercise:** Load the list of cities and draw a "poor man's map of the world" based on the "lat" and "lng" columns of the table.

In [None]:
cities = pl.read_parquet("data/simple_maps/worldcities.parquet")
cities

In [None]:
cities.plot.scatter(
    x="lng",
    y="lat",
    hover_cols=["city"],
    color="country",
    title="Cities of the World",
    height=500,
    width=1000,
    legend=False,
    grid=True
)

## Sorting

[polars.DataFrame.sort](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sort.html)

In [None]:
un.sort("admission_date")

In [None]:
un.sort("population", descending=True)

In [None]:
un.sort("region", "subregion")

**Exercise:** Create a bar plot of the 10 countries with the lowest population.

Hints:
- `.sort`
- [`.head`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html) 
- [`.plot.bar`](https://hvplot.holoviz.org/reference/tabular/bar.html) - docs from the hvplot pages
- Use `hover_cols` and `x` args to describe the plot properly

In [None]:
# Exercise

## Expressions & selection

An expression represents one or more columns of a DataFrame:

In [None]:
pl.col("country")

In [None]:
un.select("country")

In [None]:
# Select a column
un.select(pl.col("country"))

In [None]:
# Literal value
un.select(pl.lit("country"))

## Filtering

When you want to select rows based on some criteria... Well, a short detour:

### Contexts

Every expression can only be executed within one of the following contexts:

1. Selection (`.select`, `.with_columns`) - we already saw this
2. Filtering (`.filter`)
3. Aggregation

See https://docs.pola.rs/user-guide/concepts/contexts/.


Pass any boolean expression to the `.filter` method:

In [None]:
un.filter(country="Italy")

In [None]:
un.filter(pl.col("population") > 1e9)

**Exercise:** Show how the share of electricity coming from different sources evolved over time in Italy (or in any other country).

In [None]:
el_source = pl.read_csv("data/our_world_in_data/electricity-source.csv", infer_schema_length=5000)
# Exercise

**Exercise:** Create a line plot of the fractions

Hints:
- `plot()` called directly is a line plot
- you can supply multiple column names in the `y` argument

In [None]:
# Assumes `el_source_italy` was created in the previous exercise
el_source_italy.plot(x="year", y=["nuclear", "hydro", "fossil", "renewables"])

**Exercise:** Find all founding members of the U.N.

In [None]:
# Exercise

In [None]:
el_source.describe()

## Operations

In [None]:
un["population"] / un["area"]

In [None]:
date.today() - un["admission_date"]

In [None]:
type(pl.col("population") / pl.col("area"))

In [None]:
un.select(
    "country",
    "population",
    "area",
    (pl.col("population") / pl.col("area")).alias("density")
).sort("density", descending=True)

In [None]:
un.with_columns(
    density=pl.col("population") / pl.col("area"),
)

## Aggregations (groupby)

In [None]:
un.group_by("region").len()

In [None]:
un.group_by("subregion").len()

In [None]:
un.group_by("region", "subregion").len().sort("region", "subregion")

In [None]:
un.group_by("subregion").sum()

In [None]:
un.group_by("region", "subregion").agg(
    pl.col("population").sum().alias("total_population"),
    pl.col("area").sum().alias("total_area"),
    pl.col("area").count().alias("num_countries"),
).sort("region", "subregion")

In [None]:
forest_area = pl.read_csv("data/our_world_in_data/forest-area-km.csv")
forest_area

**Exercise:** Find the relative change in forest area for each country over the available range of years.

In [None]:
# Exercise

### Time aggregations

In [None]:
weather = pl.read_parquet("data/florence-meteostat.parquet")
weather

What about timezones? Let's forget about them for now... they would deserve their own workshop. See https://docs.pola.rs/user-guide/transformations/time-series/timezones/

In [None]:
weather.plot(y="temp")

In [None]:
weather.plot(x="time", y="temp")

In [None]:
# We can group by the year extracted from the time column
yearly_mean = weather.group_by(
    pl.col("time").dt.year().alias("year"), maintain_order=True
).agg(avg_temp=pl.col("temp").drop_nans().mean())
yearly_mean


In [None]:
yearly_mean.plot(x="year", y="avg_temp")

In [None]:
# But grouping by month?
monthly = (
    weather.set_sorted("time")
    .group_by_dynamic("time", every="1mo")
    .agg(avg_temp=pl.col("temp").drop_nans().mean())
)
monthly.plot(x="time", y="avg_temp")

**Exercise:** Which day had the highest minimum temperature (probably the hottest night) in Florence since the beginning of the measurements?

In [None]:
# Exercise

## Joining

In [None]:
forest_area.group_by("Code").last()

**Exercise:** Find the total forest area per region. (Bonus: also find the percentage.)

In [None]:
# Exercise

**Exercise:** Find the number of cities over 1 million inhabitants per region / subregion.

In [None]:
# Exercise

## Wide / long table format

In [None]:
wb_pop_wide = pl.read_csv("data/world_bank-population.csv")
wb_pop_wide

In [None]:
wb_pop = wb_pop_wide.melt(
    id_vars=["Country Code", "Country Name"],
    variable_name="year",
    value_vars=cs.numeric(),
    value_name="population"
).cast(
    {"year": pl.Int64}
).rename(
    {"Country Code": "iso3", "Country Name": "country"}
)
wb_pop

In [None]:
world_pop = wb_pop.filter(iso3="WLD").drop("iso3", "country")   #.plot(x="year", y="population")
world_pop

**Exercise:** Plot the daily pattern of temperatures in Florence for each month of the year (since 2020).

In [None]:
# Exercise

In [None]:
# Assumes `month_day_data` was created in the previous exercise
month_day_table = month_day_data.pivot(
    values="temp",
    index="hour",
    columns="month_name",
    aggregate_function="mean",
)
month_day_table

In [None]:
month_day_table.plot(x="hour")

## Lazy operations

In [None]:
cities.lazy()

In [None]:
print(
    cities.lazy()
    .filter(pl.col("population") > 1e6)
    .group_by("country")
    .agg(pl.col("population").mean().alias("mean_population"))
    .sort("mean_population", descending=True)
    .explain(optimized=True)
)

In [None]:
pl.scan_csv(
    "data/simple_maps/worldcities.csv",
    infer_schema_length=100000,
    null_values="",
).cast({"population": pl.Int64}).collect()

## TODO

- The joins!!!
- Handling missing values
- Interoperability with pandas