# Agenda

1. What is Polars? What is its relationship with Pandas and the rest of the data ecosystem in Python?
2. Series
3. Data frames
4. Reading in CSV files
5. dtypes
6. Expressions
7. Selecting rows
8. Selecting columns with `df.select`
9. `df.with_columns`
10. `df.filter`
11. Sorting
12. Grouping
13. Optimizing queries and "lazy frames"

# What is Polars?

Data frames are everywhere, but few languages support them natively. Instead, a growing number of libraries provide data frames outside of the language core. The bad news is that in Python, you have to choose (and install) a data-frame package. The good news is that we have a few great options to choose from.

Pandas is the 800-pound gorilla in this space. Polars is a relatively new entry, designed for (a) speed of execution, (b) built-in multithreading, (c) lazy loading of data, and (d) a very minimalist, elegant API.

The URL is pola.rs, a nod to the fact that Polars is written in Rust, whose source files use the `.rs` extension. The Python API that Polars exposes is very similar to the Pandas API, making it fairly easy for someone to move from Pandas to Polars.

I personally see Polars as great for when you have tons of data and/or execution speed is critical. If you're a snob who wants a very elegant API, then it's great, too. 



# Installing Polars

You can install it from PyPI using `pip`:

    pip install -U polars

You'll probably want to install the complete Polars package, including its optional dependencies and extras, using

    pip install -U 'polars[all]'

If you're using a Mac with Apple Silicon, Polars might die on you because your copy of Python was compiled for Intel and is running in compatibility mode. If you're like me, and can't/won't/don't know how to recompile Python for Apple Silicon, you can just install a different Polars package from PyPI that takes this into account:

    pip install -U 'polars-lts-cpu[all]'
   

In [1]:
import polars as pl    # just as we use Pandas as pd, we use Polars as pl

# Series

A series is a 1D data structure, just like in Pandas. (There, behind the scenes, we have a NumPy array. That is *not* true in Polars!) 

We can create a Polars series just by invoking `pl.Series` on a list of values. Polars will (like Pandas) figure out what kinds of values we have, and set the dtype.

In [8]:
s = pl.Series(
    values=[10, 20, 30, 40, 50],
    name='numbers'
)

In [9]:
s

numbers
i64
10
20
30
40
50


# Some differences

1. No index! There is no index in Polars, period.
2. No name by default, although we can give it one by passing the `name` argument. Positionally, `name` comes *before* the data; if you pass both `name` and `values` as keyword arguments, they can be in any order.
3. The shape is always displayed at the top of the series (or data frame)
4. The dtype is displayed at the top of the series. And here, we see that the dtype is `i64`, aka `int64`

In [10]:
s = pl.Series(
    values=[10.5, 20.5, 30.5, 40.5, 50.5],
    name='numbers'
)

s

numbers
f64
10.5
20.5
30.5
40.5
50.5


In [11]:
# let's try another dtype

s = pl.Series(
    values=[10.5, 20.5, 30.5, 40.5, 50.5],
    name='numbers',
    dtype=pl.Float32   # use Polars dtypes, not NumPy dtypes and not strings
)

s

numbers
f32
10.5
20.5
30.5
40.5
50.5


In [18]:
# What if I have a mixture of types?

s = pl.Series(
    values=[10, 20.5, 30, 40.5, 50],
    dtype=pl.Int64
)

s

TypeError: unexpected value while building Series of type Int64; found value of type Float64: 20.5

Hint: Try setting `strict=False` to allow passing data with mixed types.

In [19]:
s = pl.Series(
    values=[10, 20.5, 30, 40.5, 50],
    dtype=pl.Int64,
    strict=False
)

s

10
20
30
40
50


In [21]:
s = pl.Series(
    'hello out there from polars world!'.split()
)

s

"hello"
"out"
"there"
"from"
"polars"
"world!"


In [22]:
s.dtype

String

In [26]:
s1 = pl.Series(
    values=[10, 20, 30, 40, 50]
)

s1
    

10
20
30
40
50


In [27]:
s2 = pl.Series(
    values=[100, 200, 300, 400, 500]
)

s2
    

100
200
300
400
500


In [29]:
# if we perform an operation on two series, assuming that they are both the same 
# length (i.e., shape), then the operations will be performed across the same
# indexes (yes, even though we don't have an official "index")

s1 + s2

110
220
330
440
550


In [30]:
s1 + s2.head(2)

InvalidOperationError: cannot do arithmetic operation on series of different lengths: got 5 and 2

In [31]:
# can we do broadcast operations, arithmetic operations with a series and a scalar value?

s1 + 5

15
25
35
45
55


In [32]:
s1 + 5.5

15.5
25.5
35.5
45.5
55.5


In [33]:
s1 % 2 == 0

true
true
true
true
true


In [34]:
s1[2:4]

30
40


# Methods we can run on our series

Just as Pandas provides a lot of methods for analyzing data on a series, Polars provides similar (or identical) methods.

In [36]:
s = s1    # go back to the numeric series; s was last assigned a series of strings
s.mean()

30.0

In [37]:
s.std()

15.811388300841896

In [38]:
s.min()

10

In [39]:
s.max()

50

In [40]:
s.median()

30.0

In [41]:
s.quantile(0.25)

20.0

In [42]:
s.quantile(0.75)

40.0

In [43]:
s.count()

5

In [44]:
s.describe()

statistic,value
str,f64
"count",5.0
"null_count",0.0
"mean",30.0
"std",15.811388
"min",10.0
"25%",20.0
"50%",30.0
"75%",40.0
"max",50.0


In [45]:
s = pl.Series('this is a bunch of words for my Polars class'.split())

In [46]:
s.describe()

statistic,value
str,str
"count","10"
"null_count","0"
"min","Polars"
"max","words"


# Exercise: Polars series

1. Create a series containing the forecast high temperatures for where you live over the next 10 days.
2. What dtype does the series have? Force it to be ints. Force it to be floats.
3. Get the descriptive statistics for these values.
4. Calculate by how much each day's forecast high temp will differ from the mean and the median. 

In [47]:
high_temps = pl.Series(
    [36, 35, 34, 32, 32, 32, 32, 30, 30, 30]
)



In [48]:
high_temps

36
35
34
32
32
32
32
30
30
30


In [51]:
high_temps = pl.Series(
    values=[36, 35, 34, 32, 32, 32, 32, 30, 30, 30],
    dtype=pl.Int8
)

high_temps

36
35
34
32
32
32
32
30
30
30


In [52]:
high_temps = pl.Series(
    values=[36, 35, 34, 32, 32, 32, 32, 30, 30, 30],
    dtype=pl.Float32
)

high_temps

36.0
35.0
34.0
32.0
32.0
32.0
32.0
30.0
30.0
30.0


In [53]:
high_temps.describe()

statistic,value
str,f64
"count",10.0
"null_count",0.0
"mean",32.299999
"std",2.110819
"min",30.0
"25%",30.0
"50%",32.0
"75%",34.0
"max",36.0


In [54]:
# broadcast operation
high_temps - high_temps.mean()

3.700001
2.700001
1.700001
-0.299999
-0.299999
-0.299999
-0.299999
-2.299999
-2.299999
-2.299999


In [55]:
# broadcast operation
high_temps - high_temps.median()

4.0
3.0
2.0
0.0
0.0
0.0
0.0
-2.0
-2.0
-2.0


In [56]:
high_temps.describe()

statistic,value
str,f64
"count",10.0
"null_count",0.0
"mean",32.299999
"std",2.110819
"min",30.0
"25%",30.0
"50%",32.0
"75%",34.0
"max",36.0


# Data frames

A data frame is a 2D collection of data:

- Once again, no index
- As in Pandas, every column is a series
- Every column needs to have a unique name

We can create a data frame in a few ways, including passing a dict whose keys are strings (the column names) and whose values are lists (or Polars series). Every list/series must contain the same number of values.

In [57]:
df = pl.DataFrame(
    {'high_temps': [36, 35, 34, 32, 32, 32, 32, 30, 30, 30],
     'low_temps': [24, 24, 23, 24, 24, 25, 23, 21, 21, 22]}
)

In [58]:
df

high_temps,low_temps
i64,i64
36,24
35,24
34,23
32,24
32,24
32,25
32,23
30,21
30,21
30,22


In [59]:
df.dtypes   # what are the dtypes of our columns?

[Int64, Int64]

In [60]:
df.columns

['high_temps', 'low_temps']

In [61]:
# I can retrieve a column using [] and the column name
df['high_temps']

high_temps
i64
36
35
34
32
32
32
32
30
30
30


In [62]:
# I can pass a list of columns, and get all of them back

df[['high_temps', 'low_temps']]

high_temps,low_temps
i64,i64
36,24
35,24
34,23
32,24
32,24
32,25
32,23
30,21
30,21
30,22


In [67]:
import numpy as np
np.random.seed(0)

df = pl.DataFrame(
    {'high_temps': [36, 35, 34, 32, 32, 32, 32, 30, 30, 30],
     'low_temps': [24, 24, 23, 24, 24, 25, 23, 21, 21, 22], 
     'random': np.random.randint(0, 1000, 10)}
)

In [68]:
df

high_temps,low_temps,random
i64,i64,i64
36,24,684
35,24,559
34,23,629
32,24,192
32,24,835
32,25,763
32,23,707
30,21,359
30,21,9
30,22,723


In [69]:
df[3]

high_temps,low_temps,random
i64,i64,i64
32,24,192


In [70]:
df[2:7]

high_temps,low_temps,random
i64,i64,i64
34,23,629
32,24,192
32,24,835
32,25,763
32,23,707


# How can we retrieve data?

We'll need *expressions* to do this. Among other things, expressions fill the role in Polars that boolean indexing plays in Pandas.