# Polars: Data Manipulation

## Import Library and Load Data

In [1]:
import polars as pl

In [2]:
cereal = pl.read_csv('https://raw.githubusercontent.com/kimfetti/Projects/master/Etc/cereal.csv')

In [3]:
cereal.tail()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193
"""Wheaties Honey Gold""","""G""","""C""",110,2,1,200,1.0,16.0,8,60,25,1,1.0,0.75,36.187559


## Adding Columns

The `.with_columns()` method allows you to add one or more columns to your dataframes.

In [4]:
cereal.with_columns( (pl.col('calories')/pl.col('cups')) )

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,f64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",212.121212,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120.0,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",212.121212,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",100.0,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",146.666667,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",146.666667,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
"""Trix""","""G""","""C""",110.0,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
"""Wheat Chex""","""R""","""C""",149.253731,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
"""Wheaties""","""G""","""C""",100.0,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193


**IMPORTANT: Don't forget to alias the columns you create!** 

If you don't provide an alias, the result takes the name of the first column in the expression (here `calories`) and overwrites that column, as in the output above.

In [5]:
cereal.with_columns( (pl.col('calories')/pl.col('cups')).alias('cal_per_cup') )

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,cal_per_cup
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,212.121212
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,120.0
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,212.121212
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,100.0
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,146.666667
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174,146.666667
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301,110.0
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445,149.253731
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193,100.0


You can add multiple columns at once: just separate the expressions with commas inside the `.with_columns()` method.

_TIP: Instead of creating these new columns in succession, Polars can create them in parallel for much speedier calculations!_

In [6]:
cereal.with_columns( 
    (pl.col('calories')/pl.col('cups')).alias('cal_per_cup'),
    (pl.col('sugars') >= 10).alias('high_sugar')
)

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,cal_per_cup,high_sugar
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64,f64,bool
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,212.121212,false
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,120.0,false
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,212.121212,false
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,100.0,false
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,146.666667,false
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174,146.666667,false
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301,110.0,true
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445,149.253731,false
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193,100.0,false


None of the columns we've created appear in our `cereal` dataframe. `.with_columns()` returns a new dataframe, so changes don't stick around unless you save the output.

In [7]:
cereal

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193


## Missing Values

### Finding Missings

You can look at descriptive statistics, along with the number of missing values, using the `.describe()` method.

In [8]:
cereal.describe()

statistic,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""77""","""77""","""77""",77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0
"""null_count""","""0""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,,,106.883117,2.545455,1.012987,159.675325,2.151948,14.597403,6.922078,96.077922,28.246753,2.207792,1.02961,0.821039,42.665705
"""std""",,,,19.484119,1.09479,1.006473,83.832295,2.383364,4.278956,4.444885,71.286813,22.342523,0.832524,0.150477,0.232716,14.047289
"""min""","""100% Bran""","""A""","""C""",50.0,1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,0.5,0.25,18.042851
"""25%""",,,,100.0,2.0,0.0,130.0,1.0,12.0,3.0,40.0,25.0,1.0,1.0,0.67,33.174094
"""50%""",,,,110.0,3.0,1.0,180.0,2.0,14.0,7.0,90.0,25.0,2.0,1.0,0.75,40.400208
"""75%""",,,,110.0,3.0,2.0,210.0,3.0,17.0,11.0,120.0,25.0,3.0,1.0,1.0,50.828392
"""max""","""Wheaties Honey Gold""","""R""","""H""",160.0,6.0,5.0,320.0,14.0,23.0,15.0,330.0,100.0,3.0,1.5,1.5,93.704912


The `null_count` row reports no null values in this dataset; however, the minimum of -1 for `carbo`, `sugars`, and `potass` shows that -1 is used to signify missing values.

In [9]:
cereal.filter(pl.col('potass') == -1)

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
"""Cream of Wheat (Quick)""","""N""","""H""",100,3,0,80,1.0,21.0,0,-1,0,2,1.0,1.0,64.533816


The `read_csv()` function offers many options to handle your data upon import, including a `null_values` argument that names the strings to read as missing values.

Let's reload the data but specify that -1 indicates a missing value.

In [10]:
cereal = pl.read_csv(
    'https://raw.githubusercontent.com/kimfetti/Projects/master/Etc/cereal.csv',
    null_values = '-1'
)

With this option, the potassium value for "Almond Delight" is read as null instead of -1.

To find missing values, use `.is_null()`.

In [12]:
cereal.filter(pl.col('potass').is_null())

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,,25,3,1.0,0.75,34.384843
"""Cream of Wheat (Quick)""","""N""","""H""",100,3,0,80,1.0,21.0,0,,0,2,1.0,1.0,64.533816


### Imputing (Filling) Missings

To impute or fill in missing values, use `.fill_null()`.

In [13]:
cereal.fill_null(0).head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,0,25,3,1.0,0.75,34.384843


In [14]:
cereal.fill_null(strategy='mean').head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,98,25,3,1.0,0.75,34.384843


## Joining DataFrames

To demonstrate merging dataframes, here's another small dataframe with the current manufacturers of each cereal. _("D" means the cereal has been discontinued.)_

In [15]:
cereal_current = pl.DataFrame(
    {
        'name': cereal['name'],  # the name column as a Series, in row order
        'current_mfr': ["K", "D", "K", "D", "D", "G", "K", "G", "G", "K", 
                        "Q", "G", "G", "D", "G", "G", "K", "K", "G", "K", 
                        "B", "K", "D", "D", "K", "K", "K", "D", "P", "P", 
                        "P", "G", "P", "P", "P", "Q", "G", "P", "D", "D",
                        "G", "Q", "G", "H", "D", "D", "K", "G", "D", "D", 
                        "D", "D", "P", "D", "Q", "Q", "Q", "Q", "K", "G", 
                        "D", "G", "K", "P", "D", "P", "K", "K", "D", "G", 
                        "G", "G", "D", "G", "G", "G", "D"]
    }
)

cereal_current

name,current_mfr
str,str
"""100% Bran""","""K"""
"""100% Natural Bran""","""D"""
"""All-Bran""","""K"""
"""All-Bran with Extra Fiber""","""D"""
"""Almond Delight""","""D"""
…,…
"""Triples""","""D"""
"""Trix""","""G"""
"""Wheat Chex""","""G"""
"""Wheaties""","""G"""


To perform table joins in Polars, use the `.join()` method. This method allows for inner, full outer, left, right, and cross joins.

In [16]:
cereal_update = cereal.join(cereal_current, on='name', how='inner')

cereal_update

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,current_mfr
str,str,str,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,f64,f64,str
"""100% Bran""","""N""","""C""",70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,"""K"""
"""100% Natural Bran""","""Q""","""C""",120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,"""D"""
"""All-Bran""","""K""","""C""",70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,"""K"""
"""All-Bran with Extra Fiber""","""K""","""C""",50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,"""D"""
"""Almond Delight""","""R""","""C""",110,2,2,200,1.0,14.0,8,,25,3,1.0,0.75,34.384843,"""D"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Triples""","""G""","""C""",110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174,"""D"""
"""Trix""","""G""","""C""",110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301,"""G"""
"""Wheat Chex""","""R""","""C""",100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445,"""G"""
"""Wheaties""","""G""","""C""",100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193,"""G"""


## Data Analysis

Polars operations allow for powerful analysis. Here are some simple questions and answers, and there's much more you can do as well.

> **QUESTION:** Which cereals have now been discontinued?

In [17]:
cereal_update.filter(pl.col('current_mfr') == 'D').select(pl.col('name', 'current_mfr'))

name,current_mfr
str,str
"""100% Natural Bran""","""D"""
"""All-Bran with Extra Fiber""","""D"""
"""Almond Delight""","""D"""
"""Clusters""","""D"""
"""Crispy Wheat & Raisins""","""D"""
…,…
"""Raisin Squares""","""D"""
"""Shredded Wheat 'n'Bran""","""D"""
"""Strawberry Fruit Wheats""","""D"""
"""Triples""","""D"""


> **QUESTION:** Which current cereals have switched manufacturers?

In [18]:
(cereal_update
 .filter((pl.col('mfr') != pl.col('current_mfr')) & (pl.col('current_mfr') != 'D'))
 .select(pl.col('name', 'mfr', 'current_mfr'))
)

name,mfr,current_mfr
str,str,str
"""100% Bran""","""N""","""K"""
"""Bran Chex""","""R""","""G"""
"""Bran Flakes""","""P""","""K"""
"""Corn Chex""","""R""","""G"""
"""Cream of Wheat (Quick)""","""N""","""B"""
…,…,…
"""Maypo""","""A""","""H"""
"""Rice Chex""","""R""","""G"""
"""Shredded Wheat""","""N""","""P"""
"""Shredded Wheat spoon size""","""N""","""P"""


## Writing to a CSV

Once you have new data you'd like to save, Polars allows you to write it to a CSV with `.write_csv()`.

In [19]:
!mkdir -p data

In [20]:
cereal_update.write_csv('data/cereal_updated.csv')

In [21]:
!ls data

cereal_updated.csv
