## Selecting columns 6: Adding a new column based on a mapping or condition
In this lecture we learn how to:
- add a new column with a dict mapping from an existing column
- add a new column with an `if-else` condition using `pl.when`
- add a new column with a condition on multiple columns
- add a new column with multiple `if-elif` conditions

In [None]:
import polars as pl

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

## Add a new column based on a mapping from another column
We can add a new column from a Python `dict` that maps values in an existing column to new values using the `replace` expression. Values that do not appear in the mapping are left unchanged.

In [None]:
(
    df
    .with_columns(
        embarked_full = pl.col("Embarked").replace(
            {
                "S":"Southampton",
                "C":"Cherbourg"
            },
        )
    )
    .select('Embarked','embarked_full')
    .head(2)
)

If we want to return a column with a different dtype to the input column we use `replace_strict` and specify the `return_dtype`. In this example we map the (integer) class number to a (string) class name

In [None]:
(
    df
    .with_columns(
        class_name = pl.col("Pclass").replace_strict(
            {
                1:'first',
                2:'second',
                3:'third'
            },
            return_dtype=pl.String
        )
    )
    .select('Pclass','class_name')
    .head(2)
)

We can think of `replace` as a left join (see the section on combining `DataFrames` if you are not familiar with joins) and in fact Polars implements this as a join by converting the mapping dictionary to a `DataFrame` under-the-hood.

## Add a new column based on a condition on another column

We use the `pl.when().then().otherwise()` chain to define a new column based on a condition on one or more other columns.

We want to add a new binary column called `first_class` based on the `Pclass` column where the value is `1` for first class passengers and `0` for second and third class passengers:

`first_class = 1 if Pclass == 1 else 0`


In [None]:
(
    df
    .select(
        "Pclass",
        pl.when(
             pl.col("Pclass") == 1
        )
        .then(1)
        .otherwise(0)
        .alias("first_class")
    )
    .select("Pclass","first_class")
    .head(2)
)

## Syntax of `pl.when`

The syntax is:
```python

    pl.when(**predicate**)
    .then(**Value if True**)
    .otherwise(**Value if False**)
    .alias(**New Column Name**)
```
so we pass:
- a predicate expression (e.g. `pl.col("Pclass") == 1`) to `pl.when`
- an expression giving the value where the predicate is True to `.then`
- an expression giving the value for all other rows to `.otherwise`
- a name for the output to `alias`. If we don't call `alias` the output takes its name from the expression passed to `then`, which is `literal` when that expression is a literal value such as `1`

Note: if the value passed to `then` or `otherwise` is a string Polars interprets it as a column name. To pass a string as a value it must be passed as a Polars literal as seen in the following example

In [None]:
(
    df
    .select(
        "Pclass",
        pl.when(
             pl.col("Pclass") == 1
        )
        .then(pl.lit("first"))
        .otherwise(pl.lit("not_first"))
        .alias("first_class")
    )
    .select("Pclass","first_class")
    .head(2)
)

In this example the string in `then` is interpreted as a column name - though more generally we can pass expressions inside `then` and `otherwise`

In [None]:
(
    df
    .select(
        "Pclass",
        pl.when(
             pl.col("Pclass") == 1
        )
        .then("Fare")
        .otherwise(
            pl.col("Fare").min()
        )
        .alias("fare_approximation")
    )
    .select("Pclass","fare_approximation")
    .head(2)
)

## Add a new column based on a condition on multiple other columns

We can base conditions on multiple other columns. Here we want to add a new binary column called `young_first_class` where first class passengers with age under 30 are `1` and all other passengers are `0`.

`young_first_class = 1 if (Pclass == 1 and Age < 30) else 0`


We need an `AND` condition to combine the two filter conditions. We can do this by passing the predicates as a comma-separated list to `pl.when`

In [None]:
(
    df
    .select(
        pl.col("Pclass"),
        pl.col("Age"),
        pl.when(
            pl.col("Pclass") == 1,
            pl.col("Age") < 30
        )
        .then(1)
        .otherwise(0)
        .alias("young_first_class")
    )
    .tail(5)
)

## Add a new column based on an `if-elif-else` condition
We are not limited to a single `if-else` condition. 

For example we want to create 3 categories with:
- `1` for young first class passengers 
- `2` for older first class passengers and 
- `0` for all other passengers

`age_class = 1 if (Pclass == 1 and Age < 30),
age_class = 2 if (Pclass == 1 and Age >= 30),
otherwise 0`

We do this by repeating the `.when.then` cycle for each sub-condition.

The syntax here for a single extra condition is:
```python

    pl.when(**Boolean Expression**)
    .then(**Value if True**)
    .when(**Boolean Expression**)
    .then(**Value if True**)
    .otherwise(**Value if False**)
    .alias(**New Column Name**)
    
```


In [None]:
(
    df
    .select(
        pl.col("Pclass"),
        pl.col("Age"),
        pl.when(
            pl.col("Pclass") == 1,
            pl.col("Age")<30
        )
        .then(1)
        .when(
            pl.col("Pclass") == 1,
            pl.col("Age")>=30
        )
        .then(2)
        .otherwise(0)
        .alias("age_class")
    )
    .head(5)
)

The cycles of `.when.then` can be repeated indefinitely.

## Exercises

In the exercises you will develop your understanding of:
- adding a column based on a mapping of another column
- adding a binary column based on a condition on multiple columns
- adding a column based on a nested `if-elif` condition on another column

### Exercise 1 
Create a binary column for whether a passenger is female or male.

Add a column called `is_female` that maps rows with a female passenger to 1 and rows with a male passenger to 0. Ensure the column has an integer dtype

In [None]:
(
    pl.read_csv(csv_file)
    .select(['Sex',<blank>])
    .head()
)

### Exercise 2 
Create a binary column called `young_female_first_class` for whether a passenger is:
- female
- in first class and
- under 30

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head()
)

### Exercise 3 
Create a column called `embarked_categories` where
- female and embarked in Southampton then has value "FS"
- female and did not embark in Southampton then has value "NFS"
- male and embarked in Southampton then has value "MS"
- male and did not embark in Southampton then has value "NMS"

In [None]:
(
    pl.read_csv(csv_file)
    .select(
            "Sex",
            "Embarked",
            <blank>
    )
)


## Solutions

### Solution to Exercise 1 

Add a column called `is_female` that maps rows with a female passenger to 1 and rows with a male passenger to 0. Ensure the column has an integer dtype

In [None]:
(
    pl.read_csv(csv_file)
    .select(
        'Sex',
        pl.col("Sex").replace_strict({"female":1,"male":0},return_dtype=pl.Int64).alias("is_female")
    )
    .head()
)

### Solution to Exercise 2 
Create a binary column called `young_female_first_class` for whether a passenger is:
- female
- in first class and
- under 30

In [None]:
(
    pl.read_csv(csv_file)
    .select(
        "Pclass",
        "Sex",
        "Age",
        pl.when(
            pl.col("Sex")=="female",
            pl.col("Pclass")==1,
            pl.col("Age") <30
        )
        .then(1)
        .otherwise(0)
        .alias("young_female_first_class")
    )
    .filter(
        pl.col("young_female_first_class")==1
    )
    .head()
)

### Solution to Exercise 3 

Create a column called `embarked_categories` where
- female and embarked in Southampton then has value "FS"
- female and did not embark in Southampton then has value "NFS"
- male and embarked in Southampton then has value "MS"
- male and did not embark in Southampton then has value "NMS"

In [None]:
(
    pl.read_csv(csv_file)
    .select(
            "Sex",
            "Embarked",
            pl.when(
                pl.col("Sex")=="female",
                pl.col("Embarked") == "S"
            )
            .then(pl.lit("FS"))
            .when(
                pl.col("Sex")=="female",
                pl.col("Embarked") != "S"
            )
            .then(pl.lit("NFS"))
            .when(
                pl.col("Sex")=="male",
                pl.col("Embarked") == "S"
            )
            .then(pl.lit("MS"))
            .when(
                pl.col("Sex")=="male",
                pl.col("Embarked") != "S"
            )
            .then(pl.lit("NMS"))
            .otherwise(pl.lit('O'))
            .alias("embarked_categories")
    )
    .head()
)