## Transforming a `DataFrame`
In this lecture you will learn how to:
- rename, drop and re-order the columns of a `DataFrame`
- change the dtypes of columns with `cast`
- transform a `DataFrame` in a function using `pipe`

In [1]:
import polars as pl
import polars.selectors as cs
# Set the number of rows to be printed to 6
pl.Config.set_tbl_rows(6)

polars.config.Config

In [2]:
csv_file = "data_titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


## Renaming columns
We can rename columns by passing a `dict` that maps old names to new names.

In [4]:
(
    df
    .rename({"PassengerId":"ID"})
    .head(2)
)

ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


## Dropping columns

We can drop columns by passing a `list` of column names

In [5]:
(
    df
    .drop(["PassengerId","Pclass"])
    .head(2)
)

Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,str,str,f64,i64,i64,str,f64,str,str
0,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


Or we can pass the column names as separate, comma-separated arguments

In [6]:
(
    df
    .drop("PassengerId","Pclass")
    .head(2)
)

Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,str,str,f64,i64,i64,str,f64,str,str
0,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


## Re-ordering columns
We can re-order columns with a `list` in `select`.

In this example we re-order the columns in alphabetical order

In [7]:
(
    df
    .select(sorted(df.columns))
    .head(2)
)

Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
f64,str,str,f64,str,i64,i64,i64,str,i64,i64,str
22.0,,"""S""",7.25,"""Braund, Mr. Ow…",0,1,3,"""male""",1,0,"""A/5 21171"""
38.0,"""C85""","""C""",71.2833,"""Cumings, Mrs. …",0,2,1,"""female""",1,1,"""PC 17599"""


## Changing dtypes
We can change dtypes within an expression using `pl.col(...).cast()` but we can also call `cast` with a `dict` argument on a `DataFrame`.

In this example we cast the `Survived` column from integer to string

In [9]:
(
    df
    .cast(
        {
            "Survived":pl.Utf8
        }
    )
    .head(2)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,str,i64,str,str,f64,i64,i64,str,f64,str,str
1,"""0""",3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,"""1""",1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


We can also cast an entire `DataFrame` to a single dtype

In [11]:
(
    df
    .cast(pl.Utf8)
    .head(2)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,str,str,str,str,str,str,str,str,str,str,str
"""1""","""0""","""3""","""Braund, Mr. Ow…","""male""","""22.0""","""1""","""0""","""A/5 21171""","""7.25""",,"""S"""
"""2""","""1""","""1""","""Cumings, Mrs. …","""female""","""38.0""","""1""","""0""","""PC 17599""","""71.2833""","""C85""","""C"""


Or we can use selectors to cast a group of columns, here all of the numeric columns

In [12]:
(
    df
    .cast(
        {
            cs.numeric():pl.Utf8
        }
    )
    .head(2)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,str,str,str,str,str,str,str,str,str,str,str
"""1""","""0""","""3""","""Braund, Mr. Ow…","""male""","""22.0""","""1""","""0""","""A/5 21171""","""7.25""",,"""S"""
"""2""","""1""","""1""","""Cumings, Mrs. …","""female""","""38.0""","""1""","""0""","""PC 17599""","""71.2833""","""C85""","""C"""


## Transforming `DataFrames` in a function

We may want to capture some `DataFrame` transformations in a function. This can be to:
- re-use the same transformations multiple times
- make code easier to read or
- make the transformations testable

If our function:
- takes a `DataFrame` (and some other optional arguments) as an input and
- outputs a `DataFrame`

then we can use the `pipe` method.

In this example we define a function that makes all string columns uppercase

In [13]:
def uppercase_all_strings(df):
    return (
        df
        .with_columns(
            pl.col(pl.Utf8).str.to_uppercase()
        )
    )

We can pipe the `DataFrame` to this function as follows

In [14]:
(
    df
    .pipe(uppercase_all_strings)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""BRAUND, MR. OW…","""MALE""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""CUMINGS, MRS. …","""FEMALE""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""HEIKKINEN, MIS…","""FEMALE""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
889,0,3,"""JOHNSTON, MISS…","""FEMALE""",,1,2,"""W./C. 6607""",23.45,,"""S"""
890,1,1,"""BEHR, MR. KARL…","""MALE""",26.0,0,0,"""111369""",30.0,"""C148""","""C"""
891,0,3,"""DOOLEY, MR. PA…","""MALE""",32.0,0,0,"""370376""",7.75,,"""Q"""


One advantage of the `pipe` method is that it lets us access `DataFrame` metadata, such as the column names, in the middle of a method chain where no variable holds the intermediate `DataFrame`.

In the following example the query starts by scanning a CSV file in lazy mode, so we are working with a `LazyFrame` rather than a `DataFrame`. We want to re-order the columns alphabetically without breaking the method chain.

We can do this with `pipe`.

The `pipe` method passes the current `DataFrame` or `LazyFrame` to a function, where we can refer to it by a temporary name.

In this example we sort the columns alphabetically inside a `lambda` function using `pipe`

In [15]:
(
    pl.scan_csv(csv_file)
    .pipe(
        lambda temp_df: temp_df.select(sorted(temp_df.columns))
    )
    .columns
)

['Age',
 'Cabin',
 'Embarked',
 'Fare',
 'Name',
 'Parch',
 'PassengerId',
 'Pclass',
 'Sex',
 'SibSp',
 'Survived',
 'Ticket']

The transformations in `pipe` are passed to the query optimiser in lazy mode.

In this example we only use the first three columns in the `select`

In [16]:
print(
    pl.scan_csv(csv_file)
    .pipe(
        lambda temp_df: temp_df.select(sorted(temp_df.columns[:3]))
    )
    .explain()
)

FAST_PROJECT: [PassengerId, Pclass, Survived]

    Csv SCAN data_titanic.csv
    PROJECT 3/12 COLUMNS


The query optimiser sees that only 3 columns are required

### Function arguments using `pipe`
The key points about a function used with `pipe` are that:
- it takes a `DataFrame` as its first argument and
- it returns a `DataFrame`, so that the method chain can continue

We can pass additional arguments to the function through `pipe`

In [19]:
def _multiply_floats(df: pl.DataFrame, multiplication_factor: int) -> pl.DataFrame:
    return df.select(pl.col(pl.Float64)) * multiplication_factor

(
    df
    .pipe(
        _multiply_floats,
        multiplication_factor=3,
    )
    .head(3)
)


Age,Fare
f64,f64
66.0,21.75
114.0,213.8499
78.0,23.775


## Exercises
In the exercises you will develop your understanding of:
- renaming columns
- dropping columns
- changing dtypes
- transformations using `pipe`

### Exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

Cast all of the integer columns to 16-bit integers

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

### Exercise 2
Rename the `Age` column to `age`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

Rename all column names to lower case. Expand the cell below if you would like a hint

In [None]:
#Hint: do the renaming inside .pipe
#Hint: use the Python method .lower() on column name strings

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

## Solutions

### Solution to exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [20]:
(
    pl.read_csv(csv_file)
    .drop(["Age","Fare"])
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Cabin,Embarked
i64,i64,i64,str,str,i64,i64,str,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",1,0,"""A/5 21171""",,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",1,0,"""PC 17599""","""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",0,0,"""STON/O2. 31012…",,"""S"""


Cast all of the integer columns to 16-bit integers

In [21]:
(
    pl.read_csv(csv_file)
    .cast(
        {
            cs.integer():pl.Int16
        }
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i16,i16,i16,str,str,f64,i16,i16,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


### Solution to exercise 2

Rename the `Age` column to `age`

In [22]:
(
    pl.read_csv(csv_file)
    .rename({"Age":"age"})
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


Rename all column names to lower case. Expand the cell below if you would like a hint

In [23]:
#Hint: do the renaming inside .pipe
#Hint: use the Python method .lower() on column name strings

In [24]:
(
    pl.read_csv(csv_file)
    .pipe(
        lambda df: df.rename({old_col: old_col.lower() for old_col in df.columns})
    )
    .head(3)
)

passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
