## List dtype 2: using expressions on List columns

By the end of this lecture you will be able to:
- use expressions on list columns
- do set operations on list columns
- call expressions on each row in a `pl.List` column


In [1]:
import polars as pl

We create a `DataFrame` with a `pl.List` column

In [3]:
df = (
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3,4],
                [4,5,6,7,8]
            ],
        }
    )
)
df

values
list[i64]
"[0, 1]"
"[2, 3, 4]"
"[4, 5, … 8]"


## The list expression namespace
Polars has a `.list` namespace for expressions that work on `pl.List` columns
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/list.html

## Selecting data in lists
We can use list expressions from the `list` namespace to select data from:
- the start and end of the list on each row with `first`,`last`,`head` and `tail`
- slices with `slice`

In [4]:
(
    df
    .with_columns(
        [
            pl.col("values").list.first().alias("first"),
            pl.col("values").list.last().alias("last"),
            pl.col("values").list.head(2).alias("head"),
            pl.col("values").list.tail(2).alias("tail"),
            pl.col("values").list.slice(1,2).alias("slice"),

        ]
    )
)

values,first,last,head,tail,slice
list[i64],i64,i64,list[i64],list[i64],list[i64]
"[0, 1]",0,1,"[0, 1]","[0, 1]",[1]
"[2, 3, 4]",2,4,"[2, 3]","[3, 4]","[3, 4]"
"[4, 5, … 8]",4,8,"[4, 5]","[7, 8]","[5, 6]"


To get a specific element from each list we use `list.get`, which selects a value by its position index in each list (negative indices count from the end of the list)

In [5]:
(
    df
    .with_columns(
        [
            pl.col("values").list.get(0).alias("first"),
            pl.col("values").list.get(1).alias("second"),
            pl.col("values").list.get(-1).alias("last"),

        ]
    )
)

values,first,second,last
list[i64],i64,i64,i64
"[0, 1]",0,1,1
"[2, 3, 4]",2,3,4
"[4, 5, … 8]",4,5,8


We can check whether each list contains a value with `list.contains`

In [10]:
(
    df
    .with_columns(
        # Check whether the list contains 0,1,2 in new columns
        [
            pl.col("values").list.contains(i).alias(str(i)) for i in range(3)
        ]
    )
)

values,0,1,2
list[i64],bool,bool,bool
"[0, 1]",True,True,False
"[2, 3, 4]",False,False,True
"[4, 5, … 8]",False,False,False


In the example above `list.contains` takes a scalar value, but it can also take an expression based on another column.

In this example we add a column called `four` that has the constant value of 4. We then test the list on each row to see if it contains the value from column `four`

In [11]:
(
    df
    .with_columns(
        four = pl.lit(4)
    )
    .with_columns(
        pl.col("values").list.contains(pl.col("four")).alias("has_four")
    )
)

values,four,has_four
list[i64],i32,bool
"[0, 1]",4,False
"[2, 3, 4]",4,True
"[4, 5, … 8]",4,True


### Set operations on lists
We define a new `DataFrame` with two list columns that have some similarities and some differences

In [12]:
df2 = (
    pl.DataFrame(
        {
            'values':[ 
                [0,1,0], 
                [2,3],
                [4,5,6,7,8]
            ],
            'values_2':[ 
                [0], 
                [2,3,4],
                [4,5,9]
            ],

        }
    )
)
df2

values,values_2
list[i64],list[i64]
"[0, 1, 0]",[0]
"[2, 3]","[2, 3, 4]"
"[4, 5, … 8]","[4, 5, 9]"


We can find all unique values in a list with `list.unique`

In [15]:
(
    df2
    .with_columns(
        pl.col("values").list.unique().alias("unique")
    )
)

values,values_2,unique
list[i64],list[i64],list[i64]
"[0, 1, 0]",[0],"[0, 1]"
"[2, 3]","[2, 3, 4]","[2, 3]"
"[4, 5, … 8]","[4, 5, 9]","[4, 5, … 8]"


We can do set operations on two list columns with the `list.set_*` expressions, which produce a new list where:
- `set_intersection` gets the values common to both lists
- `set_difference` gets the values that are in the left list but not in the right list
- `set_symmetric_difference` gets the values that are in exactly one of the two lists
- `set_union` gets the unique values from both lists

In [16]:
pl.Config.set_fmt_table_cell_list_len(6)
(
    df2
    .with_columns(
        pl.col("values").list.set_intersection(pl.col("values_2")).alias("intersection"),
        pl.col("values").list.set_difference(pl.col("values_2")).alias("difference"),
        pl.col("values").list.set_symmetric_difference(pl.col("values_2")).alias("symmetric_difference"),
        pl.col("values").list.set_union(pl.col("values_2")).alias("union")

    )
)

values,values_2,intersection,difference,symmetric_difference,union
list[i64],list[i64],list[i64],list[i64],list[i64],list[i64]
"[0, 1, 0]",[0],[0],[1],[1],"[0, 1]"
"[2, 3]","[2, 3, 4]","[2, 3]",[],[4],"[2, 3, 4]"
"[4, 5, 6, 7, 8]","[4, 5, 9]","[4, 5]","[8, 7, 6]","[6, 7, 8, 9]","[4, 5, 6, 7, 8, 9]"


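Note that in the second row the `difference` column is empty because every value in `values` also appears in `values_2`, but the `symmetric_difference` column has one element because 4 appears only in `values_2`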

### Re-ordering values in each list

We return to the initial `DataFrame` with one list column for the following examples.

We can re-order values in each list:
- `reverse` reverses the order of each list
- `sort` sorts each list
- `shift` moves the values in each list along by a given number of positions without wrapping around, so the vacated positions at the start are filled with `null`

In [17]:
(
    df
    .with_columns(
        pl.col("values").list.reverse().alias("reverse"),
        pl.col("values").list.sort().alias("sort"),
        pl.col("values").list.shift(1).alias("shift"),
    )
)

values,reverse,sort,shift
list[i64],list[i64],list[i64],list[i64]
"[0, 1]","[1, 0]","[0, 1]","[null, 0]"
"[2, 3, 4]","[4, 3, 2]","[2, 3, 4]","[null, 2, 3]"
"[4, 5, 6, 7, 8]","[8, 7, 6, 5, 4]","[4, 5, 6, 7, 8]","[null, 4, 5, 6, 7]"


### List aggregations
We can use list expressions like `len` or `mean` to aggregate lists

In [22]:
(
    df
    .with_columns(
        pl.col("values").list.len().alias("lengths"),
        pl.col("values").list.min().alias("min"),
        pl.col("values").list.mean().alias("mean"),
        pl.col("values").list.max().alias("max"),
    )
)

values,lengths,min,mean,max
list[i64],u32,i64,f64,i64
"[0, 1]",2,0,0.5,1
"[2, 3, 4]",3,2,3.0,4
"[4, 5, 6, 7, 8]",5,4,6.0,8


## Calling expressions on each list
Each row in a `pl.List` column is a `Series`. We can call the same expressions on each `Series` that we would call on a standalone `Series` or column in a `DataFrame`.

To do this we:
- call `list.eval` on the `pl.List` column and inside this 
- call `pl.element` to select the list on each row and then call expressions

The call to `pl.element` inside `list.eval` is like calling `pl.col` on a column in a `DataFrame`

In this example we `rank` the elements of each list

In [23]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [4,3,2]
            ],
        }
    )
    .with_columns(
        pl.col("values").list.eval(
            pl.element().rank(method="ordinal")
        ).alias("eval")
    )
)

values,eval
list[i64],list[u32]
"[0, 1]","[1, 2]"
"[4, 3, 2]","[3, 2, 1]"


If we call `pl.element` with no further expressions it returns the full list on each row

In [24]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [4,3,2]
            ],
        }
    )
    .with_columns(
        pl.col("values").list.eval(
            pl.element()
        ).alias("eval")
    )
)

values,eval
list[i64],list[i64]
"[0, 1]","[0, 1]"
"[4, 3, 2]","[4, 3, 2]"


We can do more complicated operations with repeated calls to `pl.element`

In this example we want to remove `null` values in the lists using a `filter` inside `list.eval`

In [26]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,None,1], 
                [2,3,None]
            ],
        }
    )
    .with_columns(
        pl.col("values").list.eval(
            pl.element().filter(
                pl.element().is_not_null()
            )
        ).alias("eval")
    )
)

values,eval
list[i64],list[i64]
"[0, null, 1]","[0, 1]"
"[2, 3, null]","[2, 3]"


As noted in the previous lecture, using `explode` may be easier to write and faster to run than a `list.eval` approach!

## Exercises
In the exercises you will develop your understanding of:
- splitting a string into a list
- extracting elements of a list
- slicing a list
- indexing into a list using expressions

### Exercise 1
We need to parse the following address strings to get columns with the:
- number
- street
- city
- state
- zipcode

In [None]:
pl.Config.set_fmt_str_lengths(150)
addresses = [
    '93 NORTH 9TH STREET, BROOKLYN NY 11211',
    '380 WESTMINSTER ST, PROVIDENCE RI 02903',
    '177 MAIN STREET, LITTLETON NH 03561'
]
df = (
    pl.DataFrame(
        {"address":addresses}
    )
)
df

Add a column called `split` with the string split by whitespace (using `str.split`) into a list column

In [None]:
(
    df
    .with_columns(
        <blank>
    )
)

In an additional `with_columns` statement add a 32-bit integer column called `number` using the `first` element of each list

The street component of the address runs from the second element of the list to the element of the list that contains a comma.

Add a list column called `contains_comma` where we check if each element in the lists in `split` contains a comma. Use `list.eval` to run the `str.contains` expression on each element in the list

With a new call to `with_columns` slice each list in `split` from the second element to the index of the element that contains a comma.

Hint 1: there is a `list.arg_max` expression that finds the index of the largest value in a list. Use this to find the index of the `True` value in `contains_comma`

In [None]:
(
    pl.DataFrame(
        {
            "values":
            [
                [0,1],
                [3,2]
            ]
        }
    )
    .with_columns(
        pl.col("values").list.arg_max().alias("arg_max")
    )
)

Hint 2: you can pass an expression to `list.slice` if you want the `slice` to depend on values in another column

Python has a `join` method to combine a list of strings into a single string

In [None]:
';'.join(['a','b'])

Polars has a similar method called `list.join`.

Join the string lists in `street` using `list.join` (with a " " separating the strings)

Extract the `city` from `split` by slicing. The slice should start one position after the `arg_max` value in `contains_comma` and have a length of 1 (here we are taking advantage of all three city names being a single word!)

Get the `zipcode` as the last element in `split`

### Exercise 2

Create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

- Keep one row for each unique track (with uniqueness defined by the title and artist columns)
- Create a list column called `artists` by splitting the `artist` column

In [None]:
(
    spotify_df
    .unique(<blank>)
    .with_columns(<blank>)
    .select("title","rank","date","artist","artists","streams")
)

Continue by finding the 10 tracks with the largest number of artists

This can be done with a `sort` but recall a faster method to find the largest values introduced in the Sorting lecture in Section 3

Apply a `pl.Config` setting to ensure we can read all of the list elements and then display the results again

Create a new column called `lead_artist` with the first listed artist from each list. Return only the `title`,`artist` and `lead_artist` columns

In [None]:
(
    spotify_df
    <blank>
)

Get the top 10 artists ranked by their maximum number of streams for a track
- Explode the `artists` list column so each artist is on their own row
- Group by the exploded artists
- Aggregate to get the maximum of the `streams`

If you are not familiar with `group_by` then come back to this exercise after Section 5 of the course

## Solutions
### Solution to exercise 1
We need to parse the following address strings to get columns with the:
- number
- street
- city
- state
- zipcode

In [27]:
pl.Config.set_fmt_str_lengths(150)
addresses = [
    '93 NORTH 9TH STREET, BROOKLYN NY 11211',
    '380 WESTMINSTER ST, PROVIDENCE RI 02903',
    '177 MAIN STREET, LITTLETON NH 03561'
]
df = (
    pl.DataFrame(
        {"address":addresses}
    )
)
df

address
str
"""93 NORTH 9TH STREET, BROOKLYN NY 11211"""
"""380 WESTMINSTER ST, PROVIDENCE RI 02903"""
"""177 MAIN STREET, LITTLETON NH 03561"""


Add a column called `split` with the string split by whitespace (using `str.split`) into a list column

In [28]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
)

address,split
str,list[str]
"""93 NORTH 9TH STREET, BROOKLYN NY 11211""","[""93"", ""NORTH"", ""9TH"", ""STREET,"", ""BROOKLYN"", … ""11211""]"
"""380 WESTMINSTER ST, PROVIDENCE RI 02903""","[""380"", ""WESTMINSTER"", ""ST,"", ""PROVIDENCE"", ""RI"", ""02903""]"
"""177 MAIN STREET, LITTLETON NH 03561""","[""177"", ""MAIN"", ""STREET,"", ""LITTLETON"", ""NH"", ""03561""]"


Add a 32-bit integer column called `number` using the `first` element of each list

In [29]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        pl.col("split").list.first().cast(pl.Int32).alias("number")
    )
)

address,split,number
str,list[str],i32
"""93 NORTH 9TH STREET, BROOKLYN NY 11211""","[""93"", ""NORTH"", ""9TH"", ""STREET,"", ""BROOKLYN"", … ""11211""]",93
"""380 WESTMINSTER ST, PROVIDENCE RI 02903""","[""380"", ""WESTMINSTER"", ""ST,"", ""PROVIDENCE"", ""RI"", ""02903""]",380
"""177 MAIN STREET, LITTLETON NH 03561""","[""177"", ""MAIN"", ""STREET,"", ""LITTLETON"", ""NH"", ""03561""]",177


The street component of the address runs from the second element of the list to the element of the list that contains a comma.

Add a list column called `contains_comma` where we check if each element in the lists in `split` contains a comma. Use `list.eval` to run the `str.contains` expression on each element in the list

In [30]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(
                pl.element().str.contains(",")
            ).alias("contains_comma")
        ]
    )
)

address,split,number,contains_comma
str,list[str],i32,list[bool]
"""93 NORTH 9TH STREET, BROOKLYN NY 11211""","[""93"", ""NORTH"", ""9TH"", ""STREET,"", ""BROOKLYN"", … ""11211""]",93,"[false, false, false, true, false, … false]"
"""380 WESTMINSTER ST, PROVIDENCE RI 02903""","[""380"", ""WESTMINSTER"", ""ST,"", ""PROVIDENCE"", ""RI"", ""02903""]",380,"[false, false, true, false, false, false]"
"""177 MAIN STREET, LITTLETON NH 03561""","[""177"", ""MAIN"", ""STREET,"", ""LITTLETON"", ""NH"", ""03561""]",177,"[false, false, true, false, false, false]"


With a new call to `with_columns` slice each list in `split` from the second element to the index of the element that contains a comma.

Hint 1: there is a `list.arg_max` expression that finds the index of the largest value in a list. Use this to find the index of the `True` value in `contains_comma`

In [31]:
(
    pl.DataFrame(
        {
            "values":
            [
                [0,1],
                [3,2]
            ]
        }
    )
    .with_columns(
        pl.col("values").list.arg_max().alias("arg_max")
    )
)

values,arg_max
list[i64],u32
"[0, 1]",1
"[3, 2]",0


Hint 2: you can pass an expression to `list.slice` if you want the `slice` to depend on values in another column

In [32]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).alias("street")
    )
    
)

address,split,number,contains_comma,street
str,list[str],i32,list[bool],list[str]
"""93 NORTH 9TH STREET, BROOKLYN NY 11211""","[""93"", ""NORTH"", ""9TH"", ""STREET,"", ""BROOKLYN"", … ""11211""]",93,"[false, false, false, true, false, … false]","[""NORTH"", ""9TH"", ""STREET,""]"
"""380 WESTMINSTER ST, PROVIDENCE RI 02903""","[""380"", ""WESTMINSTER"", ""ST,"", ""PROVIDENCE"", ""RI"", ""02903""]",380,"[false, false, true, false, false, false]","[""WESTMINSTER"", ""ST,""]"
"""177 MAIN STREET, LITTLETON NH 03561""","[""177"", ""MAIN"", ""STREET,"", ""LITTLETON"", ""NH"", ""03561""]",177,"[false, false, true, false, false, false]","[""MAIN"", ""STREET,""]"


Join the string lists in `street` using `list.join` (with a " " separating the strings)

In [33]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).list.join(" ").alias("street")
    )
    
)

address,split,number,contains_comma,street
str,list[str],i32,list[bool],str
"""93 NORTH 9TH STREET, BROOKLYN NY 11211""","[""93"", ""NORTH"", ""9TH"", ""STREET,"", ""BROOKLYN"", … ""11211""]",93,"[false, false, false, true, false, … false]","""NORTH 9TH STREET,"""
"""380 WESTMINSTER ST, PROVIDENCE RI 02903""","[""380"", ""WESTMINSTER"", ""ST,"", ""PROVIDENCE"", ""RI"", ""02903""]",380,"[false, false, true, false, false, false]","""WESTMINSTER ST,"""
"""177 MAIN STREET, LITTLETON NH 03561""","[""177"", ""MAIN"", ""STREET,"", ""LITTLETON"", ""NH"", ""03561""]",177,"[false, false, true, false, false, false]","""MAIN STREET,"""


Extract the `city` from `split` by slicing. The slice should start one position after the `arg_max` value in `contains_comma` and have a length of 1 (here we are taking advantage of all three city names being a single word!)

In [34]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).list.join(" ").alias("street")
    )
    .with_columns(
        pl.col("split").list.slice(
            pl.col("contains_comma").list.arg_max()+1,1
        ).alias("city")
    )
)

address,split,number,contains_comma,street,city
str,list[str],i32,list[bool],str,list[str]
"""93 NORTH 9TH STREET, BROOKLYN NY 11211""","[""93"", ""NORTH"", ""9TH"", ""STREET,"", ""BROOKLYN"", … ""11211""]",93,"[false, false, false, true, false, … false]","""NORTH 9TH STREET,""","[""BROOKLYN""]"
"""380 WESTMINSTER ST, PROVIDENCE RI 02903""","[""380"", ""WESTMINSTER"", ""ST,"", ""PROVIDENCE"", ""RI"", ""02903""]",380,"[false, false, true, false, false, false]","""WESTMINSTER ST,""","[""PROVIDENCE""]"
"""177 MAIN STREET, LITTLETON NH 03561""","[""177"", ""MAIN"", ""STREET,"", ""LITTLETON"", ""NH"", ""03561""]",177,"[false, false, true, false, false, false]","""MAIN STREET,""","[""LITTLETON""]"


Get the `zipcode` as the last element in `split`

In [36]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).list.join(" ").alias("street")
    )
    .with_columns(
        [
            pl.col("split").list.slice(
                pl.col("contains_comma").list.arg_max()+1,1
            ).alias("city"),
            # keep the zipcode as a string so that leading zeros (e.g. 02903) are preserved
            pl.col("split").list.last().alias("zipcode")
        ]
    )
)

address,split,number,contains_comma,street,city,zipcode
str,list[str],i32,list[bool],str,list[str],str
"""93 NORTH 9TH STREET, BROOKLYN NY 11211""","[""93"", ""NORTH"", ""9TH"", ""STREET,"", ""BROOKLYN"", … ""11211""]",93,"[false, false, false, true, false, … false]","""NORTH 9TH STREET,""","[""BROOKLYN""]","""11211"""
"""380 WESTMINSTER ST, PROVIDENCE RI 02903""","[""380"", ""WESTMINSTER"", ""ST,"", ""PROVIDENCE"", ""RI"", ""02903""]",380,"[false, false, true, false, false, false]","""WESTMINSTER ST,""","[""PROVIDENCE""]","""02903"""
"""177 MAIN STREET, LITTLETON NH 03561""","[""177"", ""MAIN"", ""STREET,"", ""LITTLETON"", ""NH"", ""03561""]",177,"[false, false, true, false, false, false]","""MAIN STREET,""","[""LITTLETON""]","""03561"""


### Solution to exercise 2

Create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

- Keep one row for each unique track (with uniqueness defined by the title and artist columns)
- Create a list column called `artists` by splitting the `artist` column

In [None]:
(
    spotify_df
    .unique(subset=["title", "artist"])
    .with_columns(
        artists = pl.col("artist").str.split(",")
    )
    .select("title","rank","date","artist","artists","streams")
)

Continue by finding the 10 tracks with the largest number of artists (this can be done with a `sort` but recall a faster method introduced in the Sorting lecture in Section 3)

In [None]:
(
    spotify_df
    .unique(subset=["title", "artist"])
    .with_columns(
        artists = pl.col("artist").str.split(",")
    )
    .select("title","rank","date","artist","artists","streams")
    .top_k(
        k=10,
        by=pl.col("artists").list.len()
    )
)

Apply a `pl.Config` setting to ensure we can read all of the list elements and then display the results again

In [None]:
pl.Config.set_fmt_table_cell_list_len(20)

In [None]:
(
    spotify_df
    .unique(subset=["title", "artist"])
    .with_columns(
        artists = pl.col("artist").str.split(",")
    )
    .select("title","rank","date","artist","artists","streams")
    .top_k(
        k=10,
        by=pl.col("artists").list.len()
    )
)

Create a new column called `lead_artist` with the first listed artist from each list

In [None]:
(
    spotify_df
    .select("title","artist",pl.col("artist").str.split(",").list.get(0).alias("lead_artist"))
)

Get the top 10 artists ranked by their maximum number of streams for a track
- Explode the `artists` list column so each artist is on their own row
- Group by the exploded artists
- Aggregate to get the maximum of the `streams`

If you are not familiar with `group_by` then come back to this exercise after Section 5 of the course

In [None]:
(
    spotify_df
    .with_columns(
        artists = pl.col("artist").str.split(",")
    )
    .select("artists","streams")
    .explode("artists")
    .group_by("artists")
    .agg(
        pl.col("streams").max()
    )
    .top_k(k=10,by="streams")
)