# Getting Started with Polars in Python


_Data Umbrella Talk • 08 October 2024 • Kimberly Fessel, [Dr Kim Data](https://www.drkimdata.com)_

## Import Libraries

In [1]:
import polars as pl

## Creating and Reading Data

### Create a DataFrame

Polars' core object is the **DataFrame**.

You can easily create your own by passing a dictionary that maps column names to lists of values to the `pl.DataFrame()` constructor.

In [2]:
df = pl.DataFrame(
    {
        'student': ['Angel', 'Brendan', 'Chelsea'],
        'grade': [10, 11, 9],
        'score': [93.5, 87.0, 79.5],
        'subject': ['Math', 'Math', 'English'],
    }
)

When you view the dataframe, you'll see the column names and data types. You will NOT see row index labels since Polars doesn't use them.

In [4]:
df

student,grade,score,subject
str,i64,f64,str
"""Angel""",10,93.5,"""Math"""
"""Brendan""",11,87.0,"""Math"""
"""Chelsea""",9,79.5,"""English"""


### Load a CSV

You can also easily read data into Polars. The `read_csv()` function loads CSV data:
- From your computer if you provide a file path, or
- From the internet if you provide a URL.

In [5]:
cereal = pl.read_csv('https://raw.githubusercontent.com/kimfetti/Projects/master/Etc/cereal.csv')

In [6]:
cereal

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193


## Selecting Data

Many of the same functions and methods you may already know from pandas also work in Polars. For example,
- `.head()` shows the first few rows of a dataframe
- `.sample()` gives you a random sample of rows

In [7]:
cereal.head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [8]:
cereal.sample(10, seed=44)

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Basic 4""","""G""","""C""",130,3,2,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
"""Double Chex""","""R""","""C""",100,2,0,190,1.0,18.0,5,80,25,3,1.0,0.75,44.330856
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""Frosted Flakes""","""K""","""C""",110,1,0,200,1.0,14.0,11,25,25,1,1.0,0.75,31.435973
"""Puffed Rice""","""Q""","""C""",50,1,0,0,0.0,13.0,0,15,0,3,0.5,1.0,60.756112
"""Post Nat. Raisin Bran""","""P""","""C""",120,3,1,200,6.0,11.0,14,260,25,3,1.33,0.67,37.840594
"""Cinnamon Toast Crunch""","""G""","""C""",120,1,3,210,0.0,13.0,9,45,25,2,1.0,0.75,19.823573
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193
"""Raisin Nut Bran""","""G""","""C""",100,3,2,140,2.5,10.5,8,140,25,3,1.0,0.5,39.7034


Selecting a specific column with Polars, however, looks different from pandas. Here, use the `.select()` method with the `col()` function.

In [9]:
cereal.select(pl.col('fiber'))

fiber
f64
10.0
2.0
9.0
14.0
1.0
…
0.0
0.0
3.0
3.0


In [10]:
cereal.select(pl.col('name', 'fiber'))

name,fiber
str,f64
"""100% Bran""",10.0
"""100% Natural Bran""",2.0
"""All-Bran""",9.0
"""All-Bran with Extra Fiber""",14.0
"""Almond Delight""",1.0
…,…
"""Triples""",0.0
"""Trix""",0.0
"""Wheat Chex""",3.0
"""Wheaties""",3.0


## Filtering Data

Polars dataframes have a dedicated method for filtering; it's called `.filter()`. 

_HINT: Just remember to keep using `pl.col()` to reference the dataframe columns._

In [11]:
cereal.sample(5, seed=23)

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
"""Cocoa Puffs""","""G""","""C""",110,1,1,180,0.0,12.0,13,55,25,2,1.0,1.0,22.736446
"""Apple Jacks""","""K""","""C""",110,2,0,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094
"""Basic 4""","""G""","""C""",130,3,2,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
"""Honey-comb""","""P""","""C""",110,1,0,180,0.0,14.0,11,35,25,1,1.0,1.33,28.742414


In [12]:
cereal.filter(pl.col('mfr') == 'K')

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Apple Jacks""","""K""","""C""",110,2,0,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094
"""Corn Flakes""","""K""","""C""",100,2,0,290,1.0,21.0,2,35,25,1,1.0,1.0,45.863324
"""Corn Pops""","""K""","""C""",110,1,0,90,1.0,13.0,12,20,25,2,1.0,1.0,35.782791
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Raisin Bran""","""K""","""C""",120,3,1,210,5.0,14.0,12,240,25,2,1.33,0.75,39.259197
"""Raisin Squares""","""K""","""C""",90,2,0,0,2.0,15.0,6,110,25,3,1.0,0.5,55.333142
"""Rice Krispies""","""K""","""C""",110,2,0,290,0.0,22.0,3,35,25,1,1.0,1.0,40.560159
"""Smacks""","""K""","""C""",110,2,1,70,1.0,9.0,15,40,25,2,1.0,0.75,31.230054


As in pandas, use `&` (and) when every condition must be true, or use `|` (or) when at least one must be true. Wrap each condition in parentheses.

In [13]:
cereal.filter((pl.col('mfr') == 'K') & (pl.col('sugars') >= 10))

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Apple Jacks""","""K""","""C""",110,2,0,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094
"""Corn Pops""","""K""","""C""",110,1,0,90,1.0,13.0,12,20,25,2,1.0,1.0,35.782791
"""Froot Loops""","""K""","""C""",110,2,1,125,1.0,11.0,13,30,25,2,1.0,1.0,32.207582
"""Frosted Flakes""","""K""","""C""",110,1,0,200,1.0,14.0,11,25,25,1,1.0,0.75,31.435973
"""Fruitful Bran""","""K""","""C""",120,3,0,240,5.0,14.0,12,190,25,3,1.33,0.67,41.015492
"""Mueslix Crispy Blend""","""K""","""C""",160,3,2,150,3.0,17.0,13,160,25,3,1.5,0.67,30.313351
"""Raisin Bran""","""K""","""C""",120,3,1,210,5.0,14.0,12,240,25,2,1.33,0.75,39.259197
"""Smacks""","""K""","""C""",110,2,1,70,1.0,9.0,15,40,25,2,1.0,0.75,31.230054


## Adding Columns

The `.with_columns()` method allows you to add one or more columns to your dataframes.

In [14]:
cereal.tail()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193
"""Wheaties Honey Gold""","""G""","""C""",110,2,1,200,1.0,16.0,8,60,25,1,1.0,0.75,36.187559


In [15]:
cereal.with_columns( (pl.col('calories')/pl.col('cups')) )

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,f64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",212.121212,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120.0,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",212.121212,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",100.0,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",146.666667,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",146.666667,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
"""Trix""","""G""","""C""",110.0,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
"""Wheat Chex""","""R""","""C""",149.253731,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
"""Wheaties""","""G""","""C""",100.0,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193


**Don't forget to alias the columns you create!** If you don't provide an alias, the result keeps the name of the first column in the expression (here, `calories`) and overwrites that column in the output, as the previous result shows.

In [16]:
cereal.with_columns( (pl.col('calories')/pl.col('cups')).alias('cal_per_cup') )

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,cal_per_cup
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,212.121212
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,120.0
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,212.121212
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,100.0
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,146.666667
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174,146.666667
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301,110.0
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445,149.253731
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193,100.0


You can add multiple columns; just separate them with commas in the `.with_columns()` method.

_TIP: Instead of computing these new columns one after another, Polars can compute them in parallel for much speedier calculations!_

In [17]:
cereal.with_columns( 
    (pl.col('calories')/pl.col('cups')).alias('cal_per_cup'),
    (pl.col('sugars') >= 10).alias('high_sugar')
)

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,cal_per_cup,high_sugar
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64,f64,bool
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,212.121212,false
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,120.0,false
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,212.121212,false
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,100.0,false
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,146.666667,false
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174,146.666667,false
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301,110.0,true
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445,149.253731,false
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193,100.0,false


None of the columns we've created appear in our `cereal` dataframe. Changes don't stick around unless you save the output of `.with_columns()`.

In [18]:
cereal

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193


## Missing Values

You can look at descriptive statistics, along with the number of missing values, using the `.describe()` method.

In [19]:
cereal.describe()

statistic,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""77""","""77""","""77""",77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0
"""null_count""","""0""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,,,106.883117,2.545455,1.012987,159.675325,2.151948,14.597403,6.922078,96.077922,28.246753,2.207792,1.02961,0.821039,42.665705
"""std""",,,,19.484119,1.09479,1.006473,83.832295,2.383364,4.278956,4.444885,71.286813,22.342523,0.832524,0.150477,0.232716,14.047289
"""min""","""100% Bran""","""A""","""C""",50.0,1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,0.5,0.25,18.042851
"""25%""",,,,100.0,2.0,0.0,130.0,1.0,12.0,3.0,40.0,25.0,1.0,1.0,0.67,33.174094
"""50%""",,,,110.0,3.0,1.0,180.0,2.0,14.0,7.0,90.0,25.0,2.0,1.0,0.75,40.400208
"""75%""",,,,110.0,3.0,2.0,210.0,3.0,17.0,11.0,120.0,25.0,3.0,1.0,1.0,50.828392
"""max""","""Wheaties Honey Gold""","""R""","""H""",160.0,6.0,5.0,320.0,14.0,23.0,15.0,330.0,100.0,3.0,1.5,1.5,93.704912


It appears there are no null values in this dataset; however, this dataset uses -1 to mark missing values.

In [20]:
cereal.filter(pl.col('potass') == -1)

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
"""Cream of Wheat (Quick)""","""N""","""H""",100,3,0,80,1.0,21.0,0,-1,0,2,1.0,1.0,64.533816


The `read_csv()` function offers many options to handle your data upon import, including a way to specify which values should be treated as missing.

Let's reload the data but specify that -1 indicates a missing value.

In [21]:
cereal = pl.read_csv(
    'https://raw.githubusercontent.com/kimfetti/Projects/master/Etc/cereal.csv',
    null_values = '-1'
)

Now we correctly observe a "null" for the potassium value for "Almond Delight."

In [22]:
cereal.head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280.0,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135.0,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320.0,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330.0,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,,25,3,1.0,0.75,34.384843


To find missing values, use `.is_null()`.

In [23]:
cereal.filter(pl.col('potass').is_null())

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,,25,3,1.0,0.75,34.384843
"""Cream of Wheat (Quick)""","""N""","""H""",100,3,0,80,1.0,21.0,0,,0,2,1.0,1.0,64.533816


To impute missing values, use `.fill_null()`.

In [26]:
cereal.fill_null(0).head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,0,25,3,1.0,0.75,34.384843


In [28]:
cereal.fill_null(strategy='mean').head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,98,25,3,1.0,0.75,34.384843


## Joining DataFrames

To demonstrate merging dataframes, here's another small dataframe with the current manufacturers of each cereal. _("D" means the cereal has been discontinued.)_

In [24]:
cereal_current = pl.DataFrame(
    {
        'name': cereal.get_column('name'),  # the 'name' column as a Series
        'current_mfr': ["K", "D", "K", "D", "D", "G", "K", "G", "G", "K", 
                        "Q", "G", "G", "D", "G", "G", "K", "K", "G", "K", 
                        "B", "K", "D", "D", "K", "K", "K", "D", "P", "P", 
                        "P", "G", "P", "P", "P", "Q", "G", "P", "D", "D",
                        "G", "Q", "G", "H", "D", "D", "K", "G", "D", "D", 
                        "D", "D", "P", "D", "Q", "Q", "Q", "Q", "K", "G", 
                        "D", "G", "K", "P", "D", "P", "K", "K", "D", "G", 
                        "G", "G", "D", "G", "G", "G", "D"]
    }
)

cereal_current

name,current_mfr
str,str
"""100% Bran""","""K"""
"""100% Natural Bran""","""D"""
"""All-Bran""","""K"""
"""All-Bran with Extra Fiber""","""D"""
"""Almond Delight""","""D"""
…,…
"""Triples""","""D"""
"""Trix""","""G"""
"""Wheat Chex""","""G"""
"""Wheaties""","""G"""


To perform table joins in Polars, use the `.join()` method. 

This method allows for inner, full outer, left, right, and cross joins.

In [None]:
cereal_update = cereal.join(cereal_current, on='name', how='inner')

cereal_update

## Data Analysis

Polars operations allow for powerful analysis. Here are some simple questions and answers, and there's much more you can do as well.

> Which cereals have now been discontinued?

In [28]:
cereal_update.filter(pl.col('current_mfr') == 'D')

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,current_mfr
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64,str
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,"""D"""
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,"""D"""
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,,25,3,1.0,0.75,34.384843,"""D"""
"""Clusters""","""G""","""C""",110,3,2,140,2.0,13.0,7,105,25,3,1.0,0.5,40.400208,"""D"""
"""Crispy Wheat & Raisins""","""G""","""C""",100,2,1,140,2.0,11.0,10,120,25,3,1.0,0.75,36.176196,"""D"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Raisin Squares""","""K""","""C""",90,2,0,0,2.0,15.0,6,110,25,3,1.0,0.5,55.333142,"""D"""
"""Shredded Wheat 'n'Bran""","""N""","""C""",90,3,0,0,4.0,19.0,0,140,0,1,1.0,0.67,74.472949,"""D"""
"""Strawberry Fruit Wheats""","""N""","""C""",90,2,0,15,3.0,15.0,5,90,25,2,1.0,1.0,59.363993,"""D"""
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174,"""D"""


> Which current cereals have switched manufacturers?

In [30]:
cereal_update.filter((pl.col('mfr') != pl.col('current_mfr')) & (pl.col('current_mfr') != 'D'))

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,current_mfr
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64,str
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,"""K"""
"""Bran Chex""","""R""","""C""",90,2,1,200,4.0,15.0,6,125,25,1,1.0,0.67,49.120253,"""G"""
"""Bran Flakes""","""P""","""C""",90,3,0,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813,"""K"""
"""Corn Chex""","""R""","""C""",110,2,0,280,0.0,22.0,3,25,25,1,1.0,1.0,41.445019,"""G"""
"""Cream of Wheat (Quick)""","""N""","""H""",100,3,0,80,1.0,21.0,0,,0,2,1.0,1.0,64.533816,"""B"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Maypo""","""A""","""H""",100,4,1,0,0.0,16.0,3,95,25,2,1.0,1.0,54.850917,"""H"""
"""Rice Chex""","""R""","""C""",110,1,0,240,0.0,23.0,2,30,25,1,1.0,1.13,41.998933,"""G"""
"""Shredded Wheat""","""N""","""C""",80,2,0,0,3.0,16.0,0,95,0,1,0.83,1.0,68.235885,"""P"""
"""Shredded Wheat spoon size""","""N""","""C""",90,3,0,0,3.0,20.0,0,120,0,1,1.0,0.67,72.801787,"""P"""


## Writing to a CSV

Once you have new data you'd like to save, Polars allows you to write it to a CSV with `.write_csv()`.

In [32]:
cereal_update.write_csv('cereal_updated.csv')

In [33]:
!ls

01_Getting_Started.ipynb  cereal_updated.csv  images
