## Selecting columns 4: Transforming and adding a column
By the end of this lecture you will be able to:
- transform an existing column in place using `with_columns`
- add a new column with an expression
- add a new column with column arithmetic
- add a column with constant values using `pl.lit`

In [1]:
import polars as pl

In [2]:
csv_file = "data_titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


## Transforming an existing column

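We can transform an existing column by passing an expression on that column to `with_columns`. Because the output keeps the original column name, it replaces the existing column.

In this example we round `Fare` to 0 decimal places, i.e. to the nearest whole number.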

In [4]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col("Fare").round(0)
        )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.0,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.0,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",8.0,,"""S"""


## Adding a new column from an existing column
We can create a new column from an existing column by giving the output of the expression a new name with `alias`.

In [5]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').round(0).alias('roundFare')
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,roundFare
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",7.0
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",71.0
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",8.0


Instead of using `alias` we can also create the new column by passing the new column name as a keyword argument whose value is the expression (in Polars this approach is referred to as kwargs assignment).

In [6]:
(
    pl.read_csv(csv_file)
    .with_columns(
        roundFare = pl.col('Fare').round(0)
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,roundFare
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",7.0
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",71.0
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",8.0


## Difference between `with_columns` and `select`
- The `select` method returns only the columns we ask for, whereas `with_columns` returns all of the existing columns plus any new or transformed ones
- `with_columns` accepts expressions only - no strings

## Adding or transforming a column with column arithmetic

We can transform columns with arithmetic in an expression.

In this example we double the values in the `Fare` column in a new column called `doubleFare`

In [7]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Fare") * 2).alias("doubleFare")
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,doubleFare
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",14.5
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",142.5666
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",15.85


We can also do arithmetic with multiple columns in an expression.

In this example we add the values in the `Fare` and `Age` columns.

In [10]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Fare") + pl.col("Age")).alias("farePlusAge")
    )
    .head(5)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,farePlusAge
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",29.25
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",109.2833
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",33.925
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",88.1
5,0,3,"""Allen, Mr. Wil…","""male""",35.0,0,0,"""373450""",8.05,,"""S""",43.05


Some people find named arithmetic methods more readable than operators.

We do the same example as above but with the `.add` method rather than the `+` operator.

In [11]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').add(pl.col('Age')).alias('farePlusAge')
    )
    .head(2)
)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,farePlusAge
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",29.25
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",109.2833


The mapping from Python operators to expression methods is:
- `+` to `add`
- `==` to `eq`
- `//` to `floordiv`
- `>` to `gt`
- `>=` to `ge`
- `<` to `lt`
- `<=` to `le`
- `%` to `mod`
- `!=` to `ne`
- `-` to `sub`
- `/` to `truediv`
- `^` to `xor`
- `*` to `mul`

## Adding a new column with a constant value

Use the literal function `pl.lit` to specify a constant value in Polars.

Here we add a new column called `Aboard` with the value `yes` for all passengers.

In [13]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit('yes').alias('Aboard')
    )
    .select(['Name','Aboard'])
    .head(10)
)

Name,Aboard
str,str
"""Braund, Mr. Ow…","""yes"""
"""Cumings, Mrs. …","""yes"""
"""Heikkinen, Mis…","""yes"""
"""Futrelle, Mrs.…","""yes"""
"""Allen, Mr. Wil…","""yes"""
"""Moran, Mr. Jam…","""yes"""
"""McCarthy, Mr. …","""yes"""
"""Palsson, Maste…","""yes"""
"""Johnson, Mrs. …","""yes"""
"""Nasser, Mrs. N…","""yes"""


## Exercises

In the exercises you will develop your understanding of:
- transforming an existing column
- adding a new column from existing columns
- adding a new column with a constant value

### Exercise 1

Add a new column called `familySize` which is the sum of the number of siblings or spouses (`SibSp` column), the number of parents or children (`Parch` column) plus one for the passenger themselves.

Print out the first 3 rows.

Hint: Add the two columns inside `()` and then apply `.alias`

In [14]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col('SibSp')+pl.col('Parch')+1).alias('familySize')
    )
).head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,familySize
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,i64
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",2
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",2
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",1
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",2
5,0,3,"""Allen, Mr. Wil…","""male""",35.0,0,0,"""373450""",8.05,,"""S""",1
6,0,3,"""Moran, Mr. Jam…","""male""",,0,0,"""330877""",8.4583,,"""Q""",1
7,0,1,"""McCarthy, Mr. …","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S""",1
8,0,3,"""Palsson, Maste…","""male""",2.0,3,1,"""349909""",21.075,,"""S""",5
9,1,3,"""Johnson, Mrs. …","""female""",27.0,0,2,"""347742""",11.1333,,"""S""",3
10,1,2,"""Nasser, Mrs. N…","""female""",14.0,1,0,"""237736""",30.0708,,"""C""",2


### Exercise 2 

Add a new column called `decade` that converts the `Age` column to the passenger's age in decades, e.g. 15.2 goes to 10, where 10 is an integer. Add the new column using the kwargs approach.

Print out the first 3 rows.

Hint: use `cast` to convert the dtype

In [17]:
(
    pl.read_csv(csv_file)
    .with_columns(
        ((pl.col('Age')/10).cast(pl.Int32)*10).alias('decade')
    )
).head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,decade
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,i32
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",20
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",30
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",20
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",30
5,0,3,"""Allen, Mr. Wil…","""male""",35.0,0,0,"""373450""",8.05,,"""S""",30
6,0,3,"""Moran, Mr. Jam…","""male""",,0,0,"""330877""",8.4583,,"""Q""",
7,0,1,"""McCarthy, Mr. …","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S""",50
8,0,3,"""Palsson, Maste…","""male""",2.0,3,1,"""349909""",21.075,,"""S""",0
9,1,3,"""Johnson, Mrs. …","""female""",27.0,0,2,"""347742""",11.1333,,"""S""",20
10,1,2,"""Nasser, Mrs. N…","""female""",14.0,1,0,"""237736""",30.0708,,"""C""",10


### Exercise 3
Create a new literal column

Add a new binary column called `Aboard` that has the value `1` for all passengers.

Print out the first 3 rows.

In [18]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit(1).alias('Aboard')
    )
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Aboard
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,i32
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",1
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",1
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",1
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",1
5,0,3,"""Allen, Mr. Wil…","""male""",35.0,0,0,"""373450""",8.05,,"""S""",1
…,…,…,…,…,…,…,…,…,…,…,…,…
887,0,2,"""Montvila, Rev.…","""male""",27.0,0,0,"""211536""",13.0,,"""S""",1
888,1,1,"""Graham, Miss. …","""female""",19.0,0,0,"""112053""",30.0,"""B42""","""S""",1
889,0,3,"""Johnston, Miss…","""female""",,1,2,"""W./C. 6607""",23.45,,"""S""",1
890,1,1,"""Behr, Mr. Karl…","""male""",26.0,0,0,"""111369""",30.0,"""C148""","""C""",1


### Exercise 4

Add a new Boolean column `overThirty` that captures whether a passenger's age is 30 years or older.

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

## Solutions

### Solution to exercise 1

Add a new column for family size

In [19]:
(
    pl.read_csv(csv_file)
    .with_columns( 
        (
        pl.col('SibSp') + pl.col('Parch') + 1
        ).alias('familySize')
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,familySize
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,i64
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",2
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",2
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",1


### Solution to exercise 2

Create a decades column

In [20]:
(
    pl.read_csv(csv_file)
    .with_columns( 
        # floor(Age / 10) counts whole decades; multiplying by 10 gives 10, 20, 30, ...
        decade = ((pl.col('Age')/10).floor() * 10).cast(pl.Int64)
    )
    .select(['Age','decade'])
    .head(3)
)

Age,decade
f64,i64
22.0,20
38.0,30
26.0,20


### Solution to exercise 3

Create a new literal column

In [21]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit(1).alias('Aboard')
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Aboard
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,i32
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",1
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",1
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",1


### Solution to exercise 4

Add a new Boolean column based on an expression

In [22]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Age") >= 30).alias("overThirty")
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,overThirty
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,bool
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",False
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",True
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S""",False
