# Selecting columns 2: using `select` and expressions
By the end of this lecture you will be able to:
- select a column or columns with `select`
- transform a column while selecting it
- select a column in lazy mode

Selecting columns with expressions is key to performant analysis because:
- this approach works in lazy mode
- when we select and transform multiple columns Polars will run these selections in parallel

In this lecture we introduce the range of methods we can use to select columns with an expression.

In [1]:
import polars as pl

In [2]:
csv_file = "data_titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


## Selecting a single column with a string

We can choose a column by passing its name as a string to the `select` method.

In [4]:
(
    df
    .select('Age')
    .head(3)
)

Age
f64
22.0
38.0
26.0


Note that the output of `select` is always a `DataFrame` rather than a `Series` even if just one column is selected.

We can use `to_series` if we want a `Series`

In [6]:
(
    df
    .select('Age')
    .to_series()
    .head(3)
)

Age
f64
22.0
38.0
26.0


### Selecting multiple columns

We can select multiple columns with comma-separated strings

In [7]:
(
    df
    .select(
        'Survived','Age'
    )
    .head(3)
)

Survived,Age
i64,f64
0,22.0
1,38.0
1,26.0


Or we can pass a list of column names to `select`

In [8]:
(
    df
    .select(
        ['Survived','Age']
    )
    .head(3)
)

Survived,Age
i64,f64
0,22.0
1,38.0
1,26.0


There is no performance difference between using comma-separated column names or a list of column names.

## Differences between using `select` and `[]`

- `[]` indexing can only be used in eager mode, but **`select` can also be used in lazy mode**
- expressions in `select` can be **optimised** in lazy mode by the query optimiser
- multiple expressions in `select` can be run in *parallel*

Get in the habit of using `select` as your default.

## Selecting columns with an expression

We can select a column with an expression in the `select` method

In [9]:
(
    df
    .select(
        pl.col('Age')
    )
    .head(3)
)

Age
f64
22.0
38.0
26.0


In practice, when we pass a string instead of an expression (as in the earlier examples), Polars uses that string to create an expression internally.

## Selecting and transforming a column with an expression
We can apply a transformation to a column before we output it.

In this example we use the `round` expression to round the values of the `Fare` column

In [10]:
(
    df
    .select(
        pl.col('Fare').round(0)
    )
    .head(3)
)

Fare
f64
7.0
71.0
8.0


We will see many more examples where we use expressions to transform data as we go through the course.

### Selecting multiple columns with a list of expressions

We can also pass multiple expressions separated by commas or in a `list`.

In this case we use the `alias` expression to change the name of one column in the output

In [11]:
(
    df
    .select(
        pl.col('Fare'),
        pl.col('Fare').round(0).alias('roundedFare')
    )
    .head(3)
)

Fare,roundedFare
f64,f64
7.25,7.0
71.2833,71.0
7.925,8.0


Recall that when you have multiple expressions Polars runs them in parallel.

## Returning a single value
If you have a `DataFrame` with a single value, that is a `DataFrame` with shape `(1, 1)`, you can
return that value with square brackets or the `item` method.

In [12]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col('Name').first()
    )
    .item()
)

'Braund, Mr. Owen Harris'

## Selecting columns in lazy mode

If we apply `select` in lazy mode it changes the `PROJECT` part of the optimised query plan

In [13]:
df = (
    pl.scan_csv(csv_file)
    .select(['Survived','Age'])
)
print(df.explain())

FAST_PROJECT: [Survived, Age]

    Csv SCAN data_titanic.csv
    PROJECT 2/12 COLUMNS


The optimised query plan now has:

`PROJECT 2/12 COLUMNS`

This means that Polars only loads the `Survived` and `Age` columns into memory when reading the CSV.

Reducing the number of columns reduces time and memory usage.

The `FAST_PROJECT` part of the query plan doesn't have any implications for users, but it is described here if you are curious:

> The `FAST_PROJECT` happens when `select` is applied to `scan_csv` but **no transformations are applied** to any columns.

> In this simpler case, with column selections and no transformations, Polars skips its standard parallel approach and does the column selection in serial. This is faster than the standard parallel method, which is why it is called `FAST_PROJECT`.

## Exercises

In the exercises you will develop your understanding of:
- selecting columns using the `select` method
- transforming columns within the `select` method
- using `select` in lazy mode

### Exercise 1

Select the `Age` and `Survived` columns using the Expression API

Do this twice:
- once using strings
- once using expressions

In [15]:
df = pl.read_csv(csv_file)
df.select(pl.col('Age'),pl.col('Survived')).head(3)


Age,Survived
f64,i64
22.0,0
38.0,1
26.0,1


### Exercise 2
Select all rows where `Age` is greater than 30 and output the `Age` and `Survived` columns

In [20]:
df = pl.read_csv(csv_file)
df.filter(pl.col('Age')>30).select(['Age','Survived']).head(3)

Age,Survived
f64,i64
38.0,1
35.0,1
35.0,0


### Exercise 3
Output a one-column DataFrame where the column is the `min` of the `Age` column

In [24]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col('Age').min().alias('min_age')
    )
)

min_age
f64
0.42


Add another line onto the query to output this single value as a float

In [31]:
df.select('Age').min().item()

0.42

Output a one-row DataFrame where the first column is the `min` of the `Age` column and the second column is the `max` of the `Age` column

Expand the following cell if you want a hint

In [None]:
# Hint: you cannot have two columns with the same name so you will have to use the `alias` expression

In [34]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col('Age').min().alias('min_age'),
        pl.col('Age').max().alias('max_age')
    )
)

min_age,max_age
f64,f64
0.42,80.0


### Exercise 4 
Using lazy mode, create a query that has the following query plan

```
FAST_PROJECT: [Age, Pclass, Survived]

    CSV SCAN ../data/titanic.csv
    PROJECT 3/12 COLUMNS
```

In [37]:
print(
    pl.scan_csv(csv_file)
    .select(
        pl.col(['Age','Pclass','Survived'])
    )
    .explain()
)

FAST_PROJECT: [Age, Pclass, Survived]

    Csv SCAN data_titanic.csv
    PROJECT 3/12 COLUMNS


## Solutions

### Solution to Exercise 1
Select the `Age` and `Survived` columns using the Expression API

Do this twice:
- once using strings
- once using expressions

In [None]:
df = pl.read_csv(csv_file)
df.select(['Age','Survived']).head(3)
df.select([pl.col('Age'),pl.col('Survived')]).head(3)

### Solution to Exercise 2
Select all rows where `Age` is greater than 30 and output the `Age` and `Survived` columns

In [None]:
df = pl.read_csv(csv_file)
df.filter(pl.col('Age')>30).select(['Age','Survived']).head(3)

### Solution to Exercise 3
Output a one-column DataFrame where the column is the `min` of the `Age` column

In [25]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col('Age').min()
    )
)

Age
f64
0.42


Add another line onto the query to output this single value as a float

In [None]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col('Age').min()
    )
    .item()
)

Output a one-row DataFrame where the first column is the `min` of the `Age` column and the second column is the `max` of the `Age` column


In [35]:
(
    pl.read_csv(csv_file)
    .select(
        [
            pl.col('Age').min().alias('age_min'),
            pl.col('Age').max().alias('age_max')
        ]
    )
)

age_min,age_max
f64,f64
0.42,80.0


### Solution to Exercise 4
```
FAST_PROJECT: [Age, Pclass, Survived]

    CSV SCAN ../data/titanic.csv
    PROJECT 3/12 COLUMNS
```

In [38]:
print(pl.scan_csv(csv_file).select(['Age','Pclass','Survived']).explain())

FAST_PROJECT: [Age, Pclass, Survived]

    Csv SCAN data_titanic.csv
    PROJECT 3/12 COLUMNS
