# EDA, the Polars Way

What is [polars](https://docs.pola.rs/)?

- Library for manipulating tabular data, based on Apache Arrow
- Contender to [pandas](https://pandas.pydata.org/) (and many more similar tools)
- Since 2020, started by Ritchie Vink

### Why polars?

- Performance (rust)
- Clean(er) API
- Lazy evaluation & query optimization
- Cool kid on the block

### Why not polars?

- Less stable
- Less functionality
- Less known
- Sometimes lengthy code
- Copilot tends to suggest pandas code ;-)

## Let's start

```shell
jupyter lab
```

or Visual Studio Code or PyCharm if you prefer those.

Open "exercises.ipynb"

In [1]:
# Most basic import
import polars as pl

# Other useful imports
from datetime import date, datetime
import polars.selectors as cs


## Fundamental data structures

See https://docs.pola.rs/user-guide/concepts/data-structures/

### DataFrame

- a "table"?
- a "spreadsheet" table?
- a "dict of columns"?

In [3]:
# Load some data
un = pl.read_csv("data/un_basic.csv", try_parse_dates=True)
un


iso3,country,population,area,admission_date,region,subregion
str,str,i64,f64,date,str,str
"""AFG""","""Afghanistan""",41128771,652860.0,1946-11-19,"""Asia""","""Southern Asia"""
"""ALB""","""Albania""",2777689,28750.0,1955-12-14,"""Europe""","""Southern Europ…"
"""DZA""","""Algeria""",44903225,2.381741e6,1962-10-08,"""Africa""","""Northern Afric…"
"""AND""","""Andorra""",79824,470.0,1993-07-28,"""Europe""","""Southern Europ…"
"""AGO""","""Angola""",35588987,1.2467e6,1976-12-01,"""Africa""","""Sub-Saharan Af…"
"""ATG""","""Antigua and Ba…",93763,440.0,1981-11-11,"""Americas""","""Latin America …"
"""ARG""","""Argentina""",46234830,2.7804e6,1945-10-24,"""Americas""","""Latin America …"
"""ARM""","""Armenia""",2780469,29740.0,1992-03-02,"""Asia""","""Western Asia"""
"""AUS""","""Australia""",26005540,7.74122e6,1945-11-01,"""Oceania""","""Australia and …"
"""AUT""","""Austria""",9041851,83879.0,1955-12-14,"""Europe""","""Western Europe…"


In [None]:
# What is it?
type(un)


### Series

- a "list"?
- a "column"?
- an "array of X"?

A bit of everything...

In [None]:
# Select one column from a DataFrame
un["country"]


In [None]:
type(un["country"])

### Closer look at the table

In [None]:
un.shape

In [None]:
un.columns

A random selection of rows using [`.sample`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sample.html#polars.DataFrame.sample)

In [None]:
un.sample(10)

Look at basic statistical properties using [`.describe`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.describe.html#polars.DataFrame.describe):

In [None]:
un.describe()

Other useful methods to obtain a selection of rows:
- [`.head`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html)
- [`.tail`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.tail.html)

### D(ata) types

- each column holds objects of the same type (unlike Python collections!)
- distinct from (but convertible to/from) Python classes
- the types are nullable => each value can be missing

In [None]:
# List of all data types in a DataFrame
un.dtypes


In [None]:
# More useful (dict)
{col: un[col].dtype for col in un.columns}


#### Common types

- Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64
- Float32, Float64
- Decimal
- Date, Datetime, Time
- String, Categorical, Enum
- Array, List, Struct
- Boolean
- Object, Null, Binary, Unknown

See https://docs.pola.rs/user-guide/concepts/data-types/overview/

In [None]:
# Construct a Series from an object
pl.Series("city", ["Firenze", "Berlin", "Pittsburgh", "Prague"], dtype=pl.String)


In [None]:
# Construct a DataFrame from a dictionary of lists.

pl.DataFrame(
    {
        "event": ["PyCon Italia", "PyCon.DE & PyData Berlin", "PyCon US", "EuroPython"],
        "city": ["Firenze", "Berlin", "Pittsburgh", "Prague"],
        "country": ["Italy", "Germany", "United States of America", "Czechia"],
        "start_date": [
            date(2024, 5, 22),
            date(2024, 4, 22),
            date(2024, 5, 15),
            date(2024, 7, 8),
        ],
    }
)


In [None]:
# Pandas window: a pandas DataFrame can be passed to the constructor

import pandas as pd

pandas_df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": pd.Series([1, 2, 3, 4, 5], dtype="float64")}
)
pl.DataFrame(pandas_df)


### Indices

Polars, unlike pandas, does not have indices.
End of story.

## Basic plotting

Choose any library you want:

- [plotly](https://plotly.com/python/)
- [matplotlib](https://matplotlib.org/)
- [seaborn](https://seaborn.pydata.org/)
- [hvplot](https://hvplot.holoviz.org/)
- ...

### "Built-in" hvplot support

Note: hvplot must be installed

```python
df.plot()
df.plot.bar()
df.plot.scatter()
...
```


In [None]:
un.plot.scatter(
    x="area",
    y="population",
    # logx=True,
    # logy=True,
    # color="region",
    # title="Countries of the World",
    # hover_cols=["country"],
)


Slightly more info here:
- https://docs.pola.rs/py-polars/html/reference/dataframe/plot.html
- https://docs.pola.rs/user-guide/misc/visualization/ 
- https://hvplot.holoviz.org/user_guide/Pandas_API.html - the pandas API for hvplot, which works almost the same with polars

**Exercise**: 
1. Load the list of cities from an external file.
2. Draw the "poor man's map of the world", based on the "lat" and "lng" columns of the table.
Optionally: You can embellish it with any aesthetics you want.

In [None]:
# Exercise load-cities
# Load the data:
# - the file is called "data/worldcities.parquet"
# - find the appropriate loading method from https://docs.pola.rs/user-guide/io/
cities = ...
cities


In [None]:
# Exercise world-map
# Plot the cities
# - find the appropriate plotting method
# - use proper arguments for the call
cities.plot....(
    x=...,
    y=...,
)

## Sorting

- [.sort](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sort.html)

In [None]:
un.sort("admission_date")

Note: As with all other manipulations, we receive a new DataFrame. There is no "inplace" in polars (unlike pandas, where it is merely discouraged).

In [None]:
un.sort("population", descending=True)

In [None]:
un.sort("region", "subregion")

**Exercise** Create a bar plot of the 10 countries with the lowest population.

Hints:
- [`.sort`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.sort.html)
- [`.head`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html) 
- [`.plot.bar`](https://hvplot.holoviz.org/reference/tabular/bar.html) - docs from the hvplot pages
- Use `hover_cols` and `x` args to describe the plot properly (optional)

In [None]:
# Exercise ten-smallest
sorted_population = ...
ten_smallest = ...

# Plot it
ten_smallest.plot....

## Expressions & selection

In [None]:
un["country"]

In [None]:
# Find the difference
un.select("country")


Expression, representing column(s) in a dataframe:

In [None]:
pl.col("country")

In [None]:
pl.lit("country")

In [None]:
# Select a column
un.select(pl.col("country"))  # Or pl.col.country


In [None]:
# Literal value
un.select(pl.lit("country"))


In [None]:
# Remove a column
un.drop("iso3")

Note: You can pass an expression to `.sort` too.

## Filtering

When you want to select rows based on some criteria... Well, a short detour:

### Contexts

Every expression can only be executed within one of the following contexts:

1. Selection (`.select`, `.with_columns`) - we already saw this
2. Filtering (`.filter`)
3. Aggregation

See https://docs.pola.rs/user-guide/concepts/contexts/.


Pass any boolean expression to the [`.filter`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.filter.html) method:

In [None]:
un.filter(country="Italy")

In [None]:
un.filter(pl.col("population") > 1e9)

**Exercise:** Show how the share of electricity coming from different sources evolved over time in Italy (or in another country).

In [None]:
# Exercise energy-it
el_source = pl.read_csv(
    "data/our_world_in_data/electricity-source.csv", infer_schema_length=5000
)  # Note the infer_schema_length
el_source_italy = ...
el_source_italy


**Exercise:** Create a line (or stacked area) plot of the fractions.

Hints:
- `.plot.area` is your friend
- you can supply multiple column names in the `y` argument
- An example is here: https://hvplot.holoviz.org/user_guide/Pandas_API.html#area-plot

In [None]:
# Exercise energy-it
el_source_italy.plot(..., ...)


**Exercise** Find all founding members of the U.N. 

Hints:
- find the admission date of the first member first
- you might want to find the proper minimizing function (min)

In [None]:
# Exercise founding-members
first_date = ...
founding_members = ...
founding_members


## Operations

In [None]:
un["population"] / un["area"]

In [None]:
date.today() - un["admission_date"]

In [None]:
type(pl.col("population") / pl.col("area"))

In [None]:
un.select(
    "country",
    "population",
    "area",
    (pl.col("population") / pl.col("area")).alias("density"),
).sort("density", descending=True)


In [None]:
un.with_columns(
    density=pl.col("population") / pl.col("area"),
)


## Aggregations (group_by)

In [None]:
un.group_by("region").len()

In [None]:
un.group_by("subregion").len()

In [None]:
un.group_by("region", "subregion").len().sort("region", "subregion")

In [None]:
un.group_by("subregion").sum()

In [None]:
un.group_by("region", "subregion").agg(
    pl.col("population").sum().alias("total_population"),
    pl.col("area").sum().alias("total_area"),
    pl.col("area").count().alias("num_countries"),
).sort("region", "subregion")


In [None]:
forest_area = pl.read_csv("data/our_world_in_data/forest-area-km.csv")
forest_area


**Exercise:** Find the relative change in forest area for each country over the available range of years.

- group by an appropriate column
- find the proper aggregation functions (https://docs.pola.rs/user-guide/expressions/aggregation)
- optionally exclude the infinite and NA values ([`.is_finite`](https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.is_finite.html)) 

In [None]:
# Exercise forest-change
first_and_last_forest_area = forest_area.group_by(...).agg(
    ...
)
relative_change = first_and_last_forest_area....
# Filter finite and sort
relative_change = relative_change....

### Time aggregations

In [None]:
weather = pl.read_parquet("data/florence-meteostat.parquet")
weather


What about timezones? Let's forget about them for now... but it would deserve its own workshop. See https://docs.pola.rs/user-guide/transformations/time-series/timezones/

In [None]:
weather.plot(y="temp")

In [None]:
weather.plot(x="time", y="temp")

In [None]:
# We can group by the year extracted from the time column
yearly_mean = weather.group_by(
    pl.col("time").dt.year().alias("year"), maintain_order=True
).agg(avg_temp=pl.col("temp").drop_nans().mean())
yearly_mean


In [None]:
yearly_mean.plot(x="year", y="avg_temp")

In [None]:
# But grouping by month?
monthly = (
    weather.set_sorted("time")
    .group_by_dynamic("time", every="1mo")
    .agg(avg_temp=pl.col("temp").drop_nans().mean())
)
monthly.plot(x="time", y="avg_temp")


**Exercise:** Which day had the highest minimum temperature (probably meaning the hottest night) in Florence in the last 10 years?

Hints:
- Keep only the recent values
- Group by an appropriate time period
- Find the minimum
- You might need to `.drop_nans` and `.drop_nulls` to work with reasonable values (at the appropriate moment)
- Work with the minima to find the top value (or a few of them)

In [None]:
# Exercise hottest-night
recent_weather = weather....
min_daily_temperatures = recent_weather.set_sorted("time").group_by_dynamic(...).agg(...)
top_nights = min_daily_temperatures....
top_nights

## Joining

Data coming from two (or more) tables.

In [48]:
forest_area.group_by("Code").last()

Code,Entity,Year,Forest area
str,str,i64,f64
"""JOR""","""Jordan""",2020,97500.0
"""KEN""","""Kenya""",2020,3.61109e6
"""BTN""","""Bhutan""",2020,2.72508e6
"""LKA""","""Sri Lanka""",2020,2.11302e6
"""ZWE""","""Zimbabwe""",2020,1.744458e7
…,…,…,…
"""COM""","""Comoros""",2020,32920.0
"""NPL""","""Nepal""",2020,5.96203e6
"""CAN""","""Canada""",2020,3.469281e8
"""PCN""","""Pitcairn""",2020,3500.0


**Example:** Find the total forest area per region. (Bonus: also find the percentage.)

In [None]:
forest_area.group_by("Code").last().join(
    un, left_on="Code", right_on="iso3", how="inner"
).group_by("region").agg(
    pl.col("Forest area").sum(), pl.col("area").sum()
).with_columns(
    forest_area_ratio=pl.col("Forest area") / pl.col("area") / 100 / 0.71
)


**Exercise** Find the number of cities over 1 million inhabitants per region / subregion. (Bonus: include the largest of those cities)

Hints:
- Join with the "un" table on appropriate columns
- Aggregate over appropriate columns and use [`.len`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.dataframe.group_by.GroupBy.len.html)

In [63]:
# Exercise million-cities
million_cities = cities....
million_cities_with_country = million_cities....
million_cities_per_region = (
    million_cities_with_country.group_by(...)
    ...
)



## Wide / long table format

### Wide -> long

In [51]:
wb_pop_wide = pl.read_csv("data/world_bank-population.csv")
wb_pop_wide


Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68_level_0
str,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,str
"""Aruba""","""ABW""","""Population, total""","""SP.POP.TOTL""",54608,55811,56682,57475,58178,58782,59291,59522,59471,59330,59106,58816,58855,59365,60028,60715,61193,61465,61738,62006,62267,62614,63116,63683,64174,64478,64553,64450,64332,64596,65712,67864,70192,72360,74710,77050,79417,81858,84355,86867,89101,90691,91781,92701,93540,94483,95606,96787,97996,99212,100341,101288,102112,102880,103594,104257,104874,105439,105962,106442,106585,106537,106445,"""""",
"""Africa Eastern and Southern""","""AFE""","""Population, total""","""SP.POP.TOTL""",130692579,134169237,137835590,141630546,145605995,149742351,153955516,158313235,162875171,167596160,172475766,177503186,182599092,187901657,193512956,199284304,205202669,211120911,217481420,224315978,230967858,237937461,245386717,252779730,260209149,267938123,276035920,284490394,292795186,301124880,309890664,318544083,326933522,335625136,344418362,353466601,362985802,372352230,381715600,391486231,401600588,412001885,422741118,433807484,445281555,457153837,469508516,482406426,495748900,509410477,523459657,537792950,552530654,567892149,583651101,600008424,616377605,632746570,649757148,667242986,685112979,702977106,720859132,"""""",
"""Afghanistan""","""AFG""","""Population, total""","""SP.POP.TOTL""",8622466,8790140,8969047,9157465,9355514,9565147,9783147,10010030,10247780,10494489,10752971,11015857,11286753,11575305,11869879,12157386,12425267,12687301,12938862,12986369,12486631,11155195,10088289,9951449,10243686,10512221,10448442,10322758,10383460,10673168,10694796,10745167,12057433,14003760,15455555,16418912,17106595,17788819,18493132,19262847,19542982,19688632,21000256,22645130,23553551,24411191,25442944,25903301,26427199,27385307,28189672,29249157,30466479,31541209,32716210,33753499,34636207,35643418,36686784,37769499,38972230,40099462,41128771,"""""",
"""Africa Western and Central""","""AFW""","""Population, total""","""SP.POP.TOTL""",97256290,99314028,101445032,103667517,105959979,108336203,110798486,113319950,115921723,118615741,121424797,124336039,127364044,130563107,133953892,137548613,141258400,145122851,149206663,153459665,157825609,162323313,167023385,171566640,176054495,180817312,185720244,190759952,195969722,201392200,206739024,212172888,217966101,223788766,229675775,235861484,242200260,248713095,255482918,262397030,269611898,277160097,284952322,292977949,301265247,309824829,318601484,327612838,336893835,346475221,356337762,366489204,376797999,387204553,397855507,408690375,419778384,431138704,442646825,454306063,466189102,478185907,490330870,"""""",
"""Angola""","""AGO""","""Population, total""","""SP.POP.TOTL""",5357195,5441333,5521400,5599827,5673199,5736582,5787044,5827503,5868203,5928386,6029700,6177049,6364731,6578230,6802494,7032713,7266780,7511895,7771590,8043218,8330047,8631457,8947152,9276707,9617702,9970621,10332574,10694057,11060261,11439498,11828638,12228691,12632507,13038270,13462031,13912253,14383350,14871146,15366864,15870753,16394062,16941587,17516139,18124342,18771125,19450959,20162340,20909684,21691522,22507674,23364185,24259111,25188292,26147002,27128337,28127721,29154746,30208628,31273533,32353588,33428486,34503774,35588987,"""""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Kosovo""","""XKX""","""Population, total""","""SP.POP.TOTL""",947000,966000,994000,1022000,1050000,1078000,1106000,1135000,1163000,1191000,1219000,1247000,1278000,1308000,1339000,1369000,1400000,1430000,1460000,1491000,1521000,1552000,1582000,1614000,1647000,1682000,1717000,1753000,1791000,1827000,1862000,1898000,1932000,1965000,1997000,2029000,2059000,2086000,1966000,1762000,1700000,1701154,1702310,1703466,1704622,1705780,1719536,1733404,1747383,1761474,1775680,1791000,1807106,1818117,1812771,1788196,1777557,1791003,1797085,1788878,1790133,1786038,1761985,"""""",
"""Yemen, Rep.""","""YEM""","""Population, total""","""SP.POP.TOTL""",5542459,5646668,5753386,5860197,5973803,6097298,6228430,6368014,6515904,6673981,6843607,7024196,7215835,7417736,7630190,7855657,8094985,8348182,8615301,8899922,9204938,9529105,9872292,10237391,10625687,11036918,11465444,11915563,12387238,12872362,13375121,13895851,14433771,14988047,15553171,16103339,16614326,17108681,17608133,18114552,18628700,19143457,19660653,20188799,20733406,21320671,21966298,22641538,23329004,24029589,24743946,25475610,26223391,26984002,27753304,28516545,29274002,30034389,30790513,31546691,32284046,32981641,33696614,"""""",
"""South Africa""","""ZAF""","""Population, total""","""SP.POP.TOTL""",16520441,16989464,17503133,18042215,18603097,19187194,19789771,20410677,21050540,21704214,22368306,23031441,23698507,24382513,25077016,25777964,26480300,27199838,27943445,28697014,29463549,30232561,31022417,31865176,32768207,33752964,34877834,36119333,37393853,38668684,39877570,40910959,41760755,42525440,43267982,43986084,44661603,45285048,45852166,46364681,46813266,47229714,47661514,48104048,48556071,49017147,49491756,49996094,50565812,51170779,51784921,52443325,53145033,53873616,54729551,55876504,56422274,56641209,57339635,58087055,58801927,59392255,59893885,"""""",
"""Zambia""","""ZMB""","""Population, total""","""SP.POP.TOTL""",3119430,3219451,3323427,3431381,3542764,3658024,3777680,3901288,4029173,4159007,4281671,4399919,4523581,4653289,4789038,4931249,5079672,5233292,5391355,5553462,5720438,5897481,6090818,6291070,6488072,6686449,6890967,7095185,7294325,7491275,7686401,7880466,8074337,8270917,8474216,8684135,8902019,9133156,9372430,9621238,9891136,10191964,10508294,10837973,11188040,11564870,11971567,12402073,12852966,13318087,13792086,14265814,14744658,15234976,15737793,16248230,16767761,17298054,17835893,18380477,18927715,19473125,20017675,"""""",


[.melt](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.melt.html#polars.DataFrame.melt)

In [52]:
wb_pop = (
    wb_pop_wide.melt(
        id_vars=["Country Code", "Country Name"],
        variable_name="year",
        value_vars=cs.numeric(),
        value_name="population",
    )
    .cast({"year": pl.Int64})
    .rename({"Country Code": "iso3", "Country Name": "country"})
)
wb_pop


iso3,country,year,population
str,str,i64,i64
"""ABW""","""Aruba""",1960,54608
"""AFE""","""Africa Eastern and Southern""",1960,130692579
"""AFG""","""Afghanistan""",1960,8622466
"""AFW""","""Africa Western and Central""",1960,97256290
"""AGO""","""Angola""",1960,5357195
…,…,…,…
"""XKX""","""Kosovo""",2022,1761985
"""YEM""","""Yemen, Rep.""",2022,33696614
"""ZAF""","""South Africa""",2022,59893885
"""ZMB""","""Zambia""",2022,20017675


In [53]:
world_pop = wb_pop.filter(iso3="WLD").drop(
    "iso3", "country"
)  # .plot(x="year", y="population")
world_pop


year,population
i64,i64
1960,3031474234
1961,3072421801
1962,3126849612
1963,3193428894
1964,3260441925
…,…
2018,7660371127
2019,7741774583
2020,7820205606
2021,7888305693


In [56]:
# Alternative
wb_pop_wide.filter(pl.col("Country Code") == "WLD").select(cs.numeric()).transpose(
    header_name="year", include_header=True
).rename({"column_0": "population"})


year,population
str,i64
"""1960""",3031474234
"""1961""",3072421801
"""1962""",3126849612
"""1963""",3193428894
"""1964""",3260441925
…,…
"""2018""",7660371127
"""2019""",7741774583
"""2020""",7820205606
"""2021""",7888305693


### Long -> wide (pivoting)

**Example:** Plot the daily pattern of temperatures in Florence for each month of the year (since 2020).

In [49]:
month_day_data = (
    weather.drop_nulls()
    .filter(
        pl.col("temp").is_not_nan(),
        pl.col("time") >= date(2020, 1, 1),
    )
    .select("time", "temp")
    .with_columns(
        pl.col("time").dt.month().alias("month"),
        pl.col("time").dt.strftime("%B").alias("month_name"),
        pl.col("time").dt.hour().alias("hour"),
    )
)
month_day_data


time,temp,month,month_name,hour
datetime[ns],f64,i8,str,i8
2020-01-01 00:00:00,-1.2,1,"""January""",0
2020-01-01 01:00:00,-2.2,1,"""January""",1
2020-01-01 02:00:00,-2.7,1,"""January""",2
2020-01-01 03:00:00,-2.6,1,"""January""",3
2020-01-01 04:00:00,-3.6,1,"""January""",4
…,…,…,…,…
2024-05-15 20:00:00,13.6,5,"""May""",20
2024-05-15 21:00:00,11.1,5,"""May""",21
2024-05-15 22:00:00,10.6,5,"""May""",22
2024-05-15 23:00:00,10.2,5,"""May""",23


[.pivot](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.pivot.html)

In [57]:
month_day_table = month_day_data.pivot(
    values="temp",
    index="hour",
    columns="month_name",
    aggregate_function="mean",
)
month_day_table


hour,January,February,March,April,May,June,July,August,September,October,November,December
i8,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0,2.687742,2.96338,4.074839,6.991333,10.410714,14.1825,18.110484,17.83629,14.51,10.33629,8.044167,5.321774
1,2.560645,2.747183,3.871613,6.768,10.23741,13.9125,17.462097,17.11129,14.17,9.945968,7.869167,5.318548
2,2.363871,2.41338,3.564516,6.378667,9.752518,13.4175,16.893548,16.521774,13.691667,9.543548,7.5525,5.083871
3,1.92129,1.939437,3.143226,5.654,9.069065,12.553333,16.317742,16.06371,13.290833,9.345161,7.171667,4.648387
4,1.837419,2.016197,3.668387,7.128667,11.260432,14.973333,18.727419,17.912903,14.474167,9.933065,7.246667,4.637903
…,…,…,…,…,…,…,…,…,…,…,…,…
19,4.903871,6.11831,7.299355,11.078,15.38777,19.553333,23.760484,22.56129,18.800833,13.935484,10.149167,7.332258
20,4.357419,5.33662,6.549032,10.111333,14.203597,18.238333,22.366129,21.362097,17.758333,13.032258,9.566667,6.867742
21,3.707097,4.416197,5.81871,9.091333,12.892086,16.984167,20.966129,20.26129,16.575,12.124194,9.185833,6.214516
22,3.5,4.045775,5.272903,8.467333,12.257554,16.2025,20.006452,19.356452,15.890833,11.525,8.806667,6.000806


In [58]:
month_day_table.plot(x="hour")

## Lazy operations

In [None]:
cities.lazy()

In [None]:
print(
    cities.lazy()
    .filter(pl.col("population") > 1e6)
    .group_by("country")
    .agg(pl.col("population").mean().alias("mean_population"))
    .sort("mean_population", descending=True)
    .explain(optimized=True)
)

In [None]:
pl.scan_csv(
    "data/simple_maps/worldcities.csv", infer_schema_length=100000, null_values=""
).cast({"population": pl.Int64}).collect()