# Sorting and fast-track algorithms
By the end of this lecture you will be able to:
- sort a `DataFrame`
- sort a column with an expression 
- take advantage of fast-track algorithms with `set_sorted`
- find the largest and smallest values

In [2]:
import polars as pl

In [3]:
df = pl.read_csv("../../Files/Sample_Superstore.csv")

In [4]:
df.head(3)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


## Sorting a `DataFrame`

### Using the `sort` method on `DataFrame`

We can sort a `DataFrame` on a column with the `sort` method.

In [6]:
df.sort("Customer_ID").head()

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1160,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-AP-10000576""","""Office Supplies""","""Appliances""","""Belkin 325VA UPS Surge Protect…",362.94,3,0.0,90.735
1161,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-BI-10004654""","""Office Supplies""","""Binders""","""Avery Binding System Hidden Ta…",11.54,2,0.0,5.77
1300,"""CA-2015-121391""","""10/4/2015""","""10/7/2015""","""First Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""OFF-ST-10001590""","""Office Supplies""","""Storage""","""Tenex Personal Project File wi…",26.96,2,0.0,7.0096
2230,"""CA-2014-128055""","""3/31/2014""","""4/5/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94122,"""West""","""OFF-BI-10004390""","""Office Supplies""","""Binders""","""GBC DocuBind 200 Manual Bindin…",673.568,2,0.2,252.588
2231,"""CA-2014-128055""","""3/31/2014""","""4/5/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94122,"""West""","""OFF-AP-10002765""","""Office Supplies""","""Appliances""","""Fellowes Advanced Computer Ser…",52.98,2,0.0,14.8344


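By default `null` values are placed at the start of the sorted output. We can move them to the end by setting the `nulls_last` argument to `True`. The first rows of the output below match the previous sort, which tells us there are no `null` values in `Customer_ID` at the start of the default sort.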

In [7]:
df.sort("Customer_ID",nulls_last=True).head()

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1160,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-AP-10000576""","""Office Supplies""","""Appliances""","""Belkin 325VA UPS Surge Protect…",362.94,3,0.0,90.735
1161,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-BI-10004654""","""Office Supplies""","""Binders""","""Avery Binding System Hidden Ta…",11.54,2,0.0,5.77
1300,"""CA-2015-121391""","""10/4/2015""","""10/7/2015""","""First Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""OFF-ST-10001590""","""Office Supplies""","""Storage""","""Tenex Personal Project File wi…",26.96,2,0.0,7.0096
2230,"""CA-2014-128055""","""3/31/2014""","""4/5/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94122,"""West""","""OFF-BI-10004390""","""Office Supplies""","""Binders""","""GBC DocuBind 200 Manual Bindin…",673.568,2,0.2,252.588
2231,"""CA-2014-128055""","""3/31/2014""","""4/5/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94122,"""West""","""OFF-AP-10002765""","""Office Supplies""","""Appliances""","""Fellowes Advanced Computer Ser…",52.98,2,0.0,14.8344


We can sort in reverse order with the `descending` argument. Here the `nulls_last` argument keeps its default of `False`, so any `null` rows would come first.

In [8]:
df.sort("Customer_ID",descending=True).head()

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
19,"""CA-2014-143336""","""8/27/2014""","""9/1/2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""OFF-AR-10003056""","""Office Supplies""","""Art""","""Newell 341""",8.56,2,0.0,2.4824
20,"""CA-2014-143336""","""8/27/2014""","""9/1/2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""TEC-PH-10001949""","""Technology""","""Phones""","""Cisco SPA 501G IP Phone""",213.48,3,0.2,16.011
21,"""CA-2014-143336""","""8/27/2014""","""9/1/2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""OFF-BI-10002215""","""Office Supplies""","""Binders""","""Wilson Jones Hanging View Bind…",22.72,4,0.2,7.384
3041,"""US-2016-147991""","""5/5/2016""","""5/9/2016""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Chattanooga""","""Tennessee""",37421,"""South""","""FUR-FU-10004270""","""Furniture""","""Furnishings""","""Eldon Image Series Desk Access…",16.72,5,0.2,3.344
3815,"""CA-2016-152471""","""7/8/2016""","""7/8/2016""","""Same Day""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Jacksonville""","""Florida""",32216,"""South""","""TEC-PH-10002824""","""Technology""","""Phones""","""Jabra SPEAK 410 Multidevice Sp…",823.96,5,0.2,51.4975


To make sure the largest values come first, ahead of any `null` rows, we combine `descending=True` with `nulls_last=True`

In [9]:
df.sort("Customer_ID",descending=True,nulls_last=True).head()

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
19,"""CA-2014-143336""","""8/27/2014""","""9/1/2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""OFF-AR-10003056""","""Office Supplies""","""Art""","""Newell 341""",8.56,2,0.0,2.4824
20,"""CA-2014-143336""","""8/27/2014""","""9/1/2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""TEC-PH-10001949""","""Technology""","""Phones""","""Cisco SPA 501G IP Phone""",213.48,3,0.2,16.011
21,"""CA-2014-143336""","""8/27/2014""","""9/1/2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""OFF-BI-10002215""","""Office Supplies""","""Binders""","""Wilson Jones Hanging View Bind…",22.72,4,0.2,7.384
3041,"""US-2016-147991""","""5/5/2016""","""5/9/2016""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Chattanooga""","""Tennessee""",37421,"""South""","""FUR-FU-10004270""","""Furniture""","""Furnishings""","""Eldon Image Series Desk Access…",16.72,5,0.2,3.344
3815,"""CA-2016-152471""","""7/8/2016""","""7/8/2016""","""Same Day""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Jacksonville""","""Florida""",32216,"""South""","""TEC-PH-10002824""","""Technology""","""Phones""","""Jabra SPEAK 410 Multidevice Sp…",823.96,5,0.2,51.4975


## Sort on multiple columns
We can sort based on multiple columns with either a list...

In [11]:
df.sort(["Customer_ID","Profit"]).head()

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
5199,"""CA-2016-103982""","""3/3/2016""","""3/8/2016""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Round Rock""","""Texas""",78664,"""Central""","""OFF-SU-10000151""","""Office Supplies""","""Supplies""","""High Speed Automatic Electric …",3930.072,3,0.2,-786.0144
5200,"""CA-2016-103982""","""3/3/2016""","""3/8/2016""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Round Rock""","""Texas""",78664,"""Central""","""OFF-FA-10001332""","""Office Supplies""","""Fasteners""","""Acco Banker's Clasps, 5 3/4""-L…",2.304,1,0.2,0.7776
5202,"""CA-2016-103982""","""3/3/2016""","""3/8/2016""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Round Rock""","""Texas""",78664,"""Central""","""TEC-AC-10002857""","""Technology""","""Accessories""","""Verbatim 25 GB 6x Blu-ray Sing…",41.72,7,0.2,5.7365
1161,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-BI-10004654""","""Office Supplies""","""Binders""","""Avery Binding System Hidden Ta…",11.54,2,0.0,5.77
7470,"""CA-2014-138100""","""9/15/2014""","""9/20/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""New York City""","""New York""",10011,"""East""","""FUR-FU-10002456""","""Furniture""","""Furnishings""","""Master Caster Door Stop, Large…",14.56,2,0.0,6.2608


...or with comma-separated strings

In [12]:
df.sort("Customer_ID","Profit").head()

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
5199,"""CA-2016-103982""","""3/3/2016""","""3/8/2016""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Round Rock""","""Texas""",78664,"""Central""","""OFF-SU-10000151""","""Office Supplies""","""Supplies""","""High Speed Automatic Electric …",3930.072,3,0.2,-786.0144
5200,"""CA-2016-103982""","""3/3/2016""","""3/8/2016""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Round Rock""","""Texas""",78664,"""Central""","""OFF-FA-10001332""","""Office Supplies""","""Fasteners""","""Acco Banker's Clasps, 5 3/4""-L…",2.304,1,0.2,0.7776
5202,"""CA-2016-103982""","""3/3/2016""","""3/8/2016""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Round Rock""","""Texas""",78664,"""Central""","""TEC-AC-10002857""","""Technology""","""Accessories""","""Verbatim 25 GB 6x Blu-ray Sing…",41.72,7,0.2,5.7365
1161,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-BI-10004654""","""Office Supplies""","""Binders""","""Avery Binding System Hidden Ta…",11.54,2,0.0,5.77
7470,"""CA-2014-138100""","""9/15/2014""","""9/20/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""New York City""","""New York""",10011,"""East""","""FUR-FU-10002456""","""Furniture""","""Furnishings""","""Master Caster Door Stop, Large…",14.56,2,0.0,6.2608


## Sorting a column with an expression

We can transform a column into sorted order within an expression.

In this example we sort the values in every column independent of other columns

Within an expression we can also sort all columns with respect to another column using `sort_by`

In [14]:
(
    df
    .select(
        pl.all().sort_by("Customer_ID")
    ).head()
)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1160,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-AP-10000576""","""Office Supplies""","""Appliances""","""Belkin 325VA UPS Surge Protect…",362.94,3,0.0,90.735
1161,"""CA-2017-147039""","""6/29/2017""","""7/4/2017""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""Minneapolis""","""Minnesota""",55407,"""Central""","""OFF-BI-10004654""","""Office Supplies""","""Binders""","""Avery Binding System Hidden Ta…",11.54,2,0.0,5.77
1300,"""CA-2015-121391""","""10/4/2015""","""10/7/2015""","""First Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""OFF-ST-10001590""","""Office Supplies""","""Storage""","""Tenex Personal Project File wi…",26.96,2,0.0,7.0096
2230,"""CA-2014-128055""","""3/31/2014""","""4/5/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94122,"""West""","""OFF-BI-10004390""","""Office Supplies""","""Binders""","""GBC DocuBind 200 Manual Bindin…",673.568,2,0.2,252.588
2231,"""CA-2014-128055""","""3/31/2014""","""4/5/2014""","""Standard Class""","""AA-10315""","""Alex Avila""","""Consumer""","""United States""","""San Francisco""","""California""",94122,"""West""","""OFF-AP-10002765""","""Office Supplies""","""Appliances""","""Fellowes Advanced Computer Ser…",52.98,2,0.0,14.8344


It seems like `sort_by` in this case has just replicated the functionality of 
```python
df.sort("Customer_ID")
```
However, because `sort_by` is an expression it can be used in other contexts, such as a `group_by` aggregation. For example, to get the `Customer_ID` and `Profit` of the row with the largest `Profit` in each region we can do the following

In [15]:

(
    df
    .group_by("Region")
    .agg(
        pl.col("Customer_ID").sort_by("Profit").last(),
        pl.col("Profit").sort_by("Profit").last()
        
    )
)

Region,Customer_ID,Profit
str,str,f64
"""East""","""HL-15040""",5039.9856
"""West""","""RB-19360""",6719.9808
"""South""","""CM-12385""",3177.475
"""Central""","""TC-20980""",8399.976


### Filtering for the largest/smallest values
If we just want to find the largest or smallest values we could do `sort` followed by `head` or `tail`. For example, here we find the rows with the largest `Customer_ID` values

In [16]:
(
    df
    .sort("Customer_ID")
    .tail(3)
)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
5898,"""CA-2016-167682""","""4/3/2016""","""4/9/2016""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Richmond""","""Indiana""",47374,"""Central""","""FUR-FU-10003799""","""Furniture""","""Furnishings""","""Seth Thomas 13 1/2"" Wall Clock""",71.12,4,0.0,22.0472
5899,"""CA-2016-167682""","""4/3/2016""","""4/9/2016""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Richmond""","""Indiana""",47374,"""Central""","""TEC-PH-10000673""","""Technology""","""Phones""","""Plantronics Voyager Pro HD - Bluetooth Headset""",259.96,4,0.0,124.7808
8342,"""CA-2017-141481""","""6/11/2017""","""6/14/2017""","""First Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-AP-10004532""","""Office Supplies""","""Appliances""","""Kensington 6 Outlet Guardian Standard Surge Protector""",61.44,3,0.0,16.5888


A faster approach is to use `top_k`, which does not sort the full `DataFrame`. Instead it searches through the rows for the largest/smallest values and then sorts this small subset of rows. This method always places `null` values last

In [18]:
(
    df
    .top_k(
        # Number of records to return
        k=5,
        # Column/expression to sort by
        by="Customer_ID",
        # reverse=False returns the largest records
        reverse=False,
    ).head()
)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
20,"""CA-2014-143336""","""8/27/2014""","""9/1/2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""TEC-PH-10001949""","""Technology""","""Phones""","""Cisco SPA 501G IP Phone""",213.48,3,0.2,16.011
8342,"""CA-2017-141481""","""6/11/2017""","""6/14/2017""","""First Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-AP-10004532""","""Office Supplies""","""Appliances""","""Kensington 6 Outlet Guardian Standard Surge Protector""",61.44,3,0.0,16.5888
3041,"""US-2016-147991""","""5/5/2016""","""5/9/2016""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Chattanooga""","""Tennessee""",37421,"""South""","""FUR-FU-10004270""","""Furniture""","""Furnishings""","""Eldon Image Series Desk Accessories, Burgundy""",16.72,5,0.2,3.344
3815,"""CA-2016-152471""","""7/8/2016""","""7/8/2016""","""Same Day""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Jacksonville""","""Florida""",32216,"""South""","""TEC-PH-10002824""","""Technology""","""Phones""","""Jabra SPEAK 410 Multidevice Speakerphone""",823.96,5,0.2,51.4975
3816,"""CA-2016-152471""","""7/8/2016""","""7/8/2016""","""Same Day""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Jacksonville""","""Florida""",32216,"""South""","""OFF-PA-10004965""","""Office Supplies""","""Paper""","""Xerox 1921""",15.984,2,0.2,4.995


Some good news: if you call `.sort()` followed by `.head()` or `.tail()` in lazy mode, Polars applies a `top_k` optimization under the hood

In [19]:
(
    df
    .lazy()
    .sort("Customer_ID")
    .tail(3)
    .collect()
)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
5898,"""CA-2016-167682""","""4/3/2016""","""4/9/2016""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Richmond""","""Indiana""",47374,"""Central""","""FUR-FU-10003799""","""Furniture""","""Furnishings""","""Seth Thomas 13 1/2"" Wall Clock""",71.12,4,0.0,22.0472
5899,"""CA-2016-167682""","""4/3/2016""","""4/9/2016""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Richmond""","""Indiana""",47374,"""Central""","""TEC-PH-10000673""","""Technology""","""Phones""","""Plantronics Voyager Pro HD - Bluetooth Headset""",259.96,4,0.0,124.7808
8342,"""CA-2017-141481""","""6/11/2017""","""6/14/2017""","""First Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-AP-10004532""","""Office Supplies""","""Appliances""","""Kensington 6 Outlet Guardian Standard Surge Protector""",61.44,3,0.0,16.5888


## Taking advantage of sorted data

For some operations Polars can use a fast-track algorithm if it knows the data in a column is sorted.

For example, if we want the `max` value on a sorted column a fast-track algorithm would just take the last (non-`null`) value.

### Checking the sorted status
You can check if Polars **thinks** a column is sorted with the `flags` attribute on a column or a `Series`

In [20]:
df["Customer_ID"].flags

{'SORTED_ASC': False, 'SORTED_DESC': False}

In this case both the ASC and DESC values are `False`, so Polars doesn't think the `Customer_ID` column is sorted. The flags only record what Polars has been told or has worked out for itself; they are not a check of the data.

You can check the status of all columns at once with the `flags` attribute on a `DataFrame`

In [21]:
df.flags

{'Row_ID': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Order_ID': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Order_Date': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Ship_Date': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Ship_Mode': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Customer_ID': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Customer_Name': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Segment': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Country': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'City': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'State': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Postal_Code': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Region': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Product_ID': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Category': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Sub_Category': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Product_Name': {'SORTED_ASC': False, 'SORTED_DESC': False},
 ...

We can check if a column is actually sorted with the `is_sorted` method:

In [22]:
df["Customer_ID"].is_sorted()

False

### Setting the sorted status
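If we know that a column is sorted then we can let Polars know using `set_sorted`. Polars does not check the data when we do this, so `set_sorted` should only be used on a column that really is sorted; otherwise later operations can return wrong results. The cell below only illustrates how the flag changes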

In [23]:
df = (
    pl.read_csv("../../Files/Sample_Superstore.csv")
    .with_columns(
        pl.col("Customer_ID").set_sorted()
    )
)
df["Customer_ID"].flags

{'SORTED_ASC': True, 'SORTED_DESC': False}

Looking at the output of `flags` we now see `'SORTED_ASC': True`

In the exercises we see the major effect `set_sorted` can have on performance.

If we transform a column with a sorting operation, Polars will automatically update the `flags` attribute for that column

In [24]:
df = (
    pl.read_csv("../../Files/Sample_Superstore.csv")
    .sort("Customer_ID")
)
df["Customer_ID"].flags

{'SORTED_ASC': True, 'SORTED_DESC': False}

If the data is sorted descending we tell Polars this by passing the `descending` argument:
```python
pl.col("Customer_ID").set_sorted(descending=True)
```

### `set_sorted` in an expression
We can use `set_sorted` within an expression. 

For example, if we know a column is sorted we can flag it with `set_sorted` inside the expression so that `max` can use the fast-track algorithm

In [25]:
(
    df
    .select(
        pl.col("Customer_ID").set_sorted().max()
    )
)

Customer_ID
str
"""ZD-21925"""


We can check if a column is sorted using `is_sorted`. This method:
- checks the flags to see if they are `True`
- checks the data if the flags are not `True` - this can be fast as it returns as soon as it finds a non-sorted value

In [26]:
df["Customer_ID"].is_sorted()

True

### Operations with fast-track algorithms
The set of operations that have sorted fast-track algorithms is evolving but includes:
- min
- max
- quantile
- median (a special case of quantile)
- filter
- group_by (see the groupby lectures)
- join (see the join lectures)