# Selecting columns 3: selecting multiple columns
By the end of this lecture you will be able to:
- select columns based on a regex
- select columns based on dtype
- use selectors

Polars has two ways of selecting multiple columns:
- the expression API with `pl.col` or `pl.all`
- the selectors API with polars selectors such as `cs.contains`

Here we import the `polars.selectors` sub-module separately as `cs`

In [6]:
import polars as pl
import polars.selectors as cs

In [7]:
csv_file = "../Files/Sample_Superstore.csv"

In [8]:
df = pl.read_csv(csv_file)
df.head(3)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


### Selecting all columns from a `DataFrame`

We can select all columns by replacing `pl.col` with `pl.all`

In [10]:
df.select(pl.all()).head(3)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


We can select all but a subset of columns with the `exclude` expression

In [12]:
df.select(pl.exclude('Postal_Code','Sub_Category','Quantity')).head(3)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Region,Product_ID,Category,Product_Name,Sales,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-BO-10001798""","""Furniture""","""Bush Somerset Collection Bookc…",261.96,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-CH-10000454""","""Furniture""","""Hon Deluxe Fabric Upholstered …",731.94,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""","""West""","""OFF-LA-10000240""","""Office Supplies""","""Self-Adhesive Address Labels f…",14.62,0.0,6.8714


This is a shorthand for `pl.all().exclude(...)`

### Selecting columns with a regex
We can select columns by passing a regex in place of a column name, provided the regex starts with `^` and ends with `$`. Without both anchors Polars treats the string as a literal column name. We meet an easier approach to doing this with selectors below.

The following regex looks for columns starting with `P` and uses the regex *wildcard* `.*` to show `P` can be followed by any characters.

In [13]:
(
    df
    .select(
        "^P.*$"
    )
    .head(3)
)

Postal_Code,Product_ID,Product_Name,Profit
i64,str,str,f64
42420,"""FUR-BO-10001798""","""Bush Somerset Collection Bookc…",41.9136
42420,"""FUR-CH-10000454""","""Hon Deluxe Fabric Upholstered …",219.582
90036,"""OFF-LA-10000240""","""Self-Adhesive Address Labels f…",6.8714


We can pass this regex to `pl.col` to apply transformations to these columns. In this example we take the `max` of each column

In [14]:
(
    df
    .select(
        pl.col("^P.*$").max()
    )
    .head(3)
)

Postal_Code,Product_ID,Product_Name,Profit
i64,str,str,f64
99301,"""TEC-PH-10004977""","""netTALK DUO VoIP Telephone Ser…",8399.976


### Selecting columns based on dtype
We can select all of the columns that have a particular dtype by passing the dtype to `pl.col`. I use this approach **a lot** in my Polars pipelines.

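Here we select all the string columns with `pl.Utf8` - the string dtype object. In recent Polars versions this dtype is also available as `pl.String`, with `pl.Utf8` kept as an alias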

In [15]:
(
    df
    .select(
        pl.col(pl.Utf8)
    )
    .head(3)
)

Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Region,Product_ID,Category,Sub_Category,Product_Name
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…"
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …"
"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""","""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…"


We can also pass a list of dtypes to `pl.col`. In this case we select both 64-bit integer and float columns

In [16]:
(
    df
    .select(
        pl.col([pl.Int64,pl.Float64])
    )
    .head(3)
)

Row_ID,Postal_Code,Sales,Quantity,Discount,Profit
i64,i64,f64,i64,f64,f64
1,42420,261.96,2,0.0,41.9136
2,42420,731.94,3,0.0,219.582
3,90036,14.62,2,0.0,6.8714


## Using the selectors API
The selectors API aims to make selecting multiple columns less verbose. 

For simple cases it mirrors the expression API. For example, to select all columns we use `cs.all`

In [17]:
(
    df
    .select(
        cs.all()
    )
    .head(3)
)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


In most Polars examples you see online, the selectors sub-module is imported separately as `cs` (and I follow this practice below). However, in my own pipelines I find it easier to skip that extra import and access selectors through the main `pl` import

In [18]:
(
    df
    .select(
        pl.selectors.all()
    )
    .head(3)
)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


We can also do selection by position with `first` or `last`

In [19]:
(
    df
    .select(
        cs.first()
    )
    .head(3)
)

Row_ID
i64
1
2
3


The output of a selector is a standard Polars expression so we can follow it up with standard expression chaining

In [20]:
(
    df
    .select(
        cs.all().max()
    )
)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
9994,"""US-2017-169551""","""9/9/2017""","""9/9/2017""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Home Office""","""United States""","""Yuma""","""Wyoming""",99301,"""West""","""TEC-PH-10004977""","""Technology""","""Tables""","""netTALK DUO VoIP Telephone Ser…",22638.48,14,0.8,8399.976


The selectors API works well in lazy mode and for streaming queries just as expressions do.

We can select columns by groups of dtype - including a group of all integer and floating point dtypes with `cs.numeric`

In [21]:
(
    df
    .select(
        cs.numeric()
    )
    .head(3)
)

Row_ID,Postal_Code,Sales,Quantity,Discount,Profit
i64,i64,f64,i64,f64,f64
1,42420,261.96,2,0.0,41.9136
2,42420,731.94,3,0.0,219.582
3,90036,14.62,2,0.0,6.8714


We can select by name with `cs.by_name`. In this example we add the `~` operator to invert the selection, so the columns listed are excluded

In [22]:
(
    df
    .select(
        ~cs.by_name("Postal_Code","Quantity")
    )
    .head(3)
)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""","""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,0.0,6.8714


As a simpler alternative to the regex example we saw earlier we can use string methods such as:
- `contains`
- `starts_with`
- `ends_with`
- `matches`

In this example we select all columns beginning with P

In [23]:
(
    df
    .select(
        cs.starts_with("P")
    )
    .head(3)
)

Postal_Code,Product_ID,Product_Name,Profit
i64,str,str,f64
42420,"""FUR-BO-10001798""","""Bush Somerset Collection Bookc…",41.9136
42420,"""FUR-CH-10000454""","""Hon Deluxe Fabric Upholstered …",219.582
90036,"""OFF-LA-10000240""","""Self-Adhesive Address Labels f…",6.8714


We can apply an OR condition by passing multiple strings. No column in this dataset begins with `A`, so only the columns beginning with `P` are returned

In [24]:
(
    df
    .select(
        cs.starts_with("P","A")
    )
    .head(3)
).columns

['Postal_Code', 'Product_ID', 'Product_Name', 'Profit']

With the `matches` method we can pass a regex without the `^` and `$` we need for the expression API

In [26]:
(
    df
    .select(
        cs.matches("Customer_Name|Profit")
    )
    .head(3)
)

Customer_Name,Profit
str,f64
"""Claire Gute""",41.9136
"""Claire Gute""",219.582
"""Darrin Van Huff""",6.8714


### Union of selectors
To do a union operation we use the `|` operator to say at least one of the conditions must be satisfied. In this example we select every column that is a string column or has a `P` in its name

In [28]:
(
    df
    .select(
        cs.string() | cs.contains("P") 
    )
    .head(3)
)

Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Profit
str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",41.9136
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",219.582
"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",6.8714


### Difference of selectors
To do a difference operation we use a minus operator `-`.

In this example we select all string columns other than any column beginning with `T`. No column in this dataset begins with `T`, so all of the string columns are returned

In [29]:
(
    df
    .select(
        cs.string() - cs.starts_with("T") 
    )
    .head(3)
)

Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Region,Product_ID,Category,Sub_Category,Product_Name
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…"
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …"
"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""","""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…"
