## List dtype 2: using expressions on List columns

By the end of this lecture you will be able to:
- select data in lists
- re-order data in lists
- aggregate data in lists
- call expressions on each row in a `pl.List` column

Recall that each row in a Polars `pl.List` column is a Polars `Series`. We refer to the object on each row as a list for consistency with the dtype name.

In [None]:
import polars as pl

We create a `DataFrame` with a `pl.List` column

In [None]:
df = (
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3,4],
                [4,5,6,7,8]
            ],
        }
    )
)
df

Note that the length of the list does not have to be the same on each row

## The list expression namespace
Polars has a `.list` namespace for expressions that work on `pl.List` columns
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/list.html

## Selecting data in each list
We can use list expressions from the `list` namespace to select data from:
- the start and end of the list on each row with `first`, `last`, `head` and `tail`
- slices with `slice`

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").list.first().alias("first"),
            pl.col("values").list.last().alias("last"),
            pl.col("values").list.head(2).alias("head"),
            pl.col("values").list.tail(2).alias("tail"),
            pl.col("values").list.slice(1,2).alias("slice"),

        ]
    )
)

More generally, we use `list.get` to select a value by a position index in each list

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").list.get(0).alias("first"),
            pl.col("values").list.get(1).alias("second"),
            pl.col("values").list.get(-1).alias("last"),

        ]
    )
)

### Finding values in lists
- We can check whether a value is in a list with `list.contains`
- We can find all unique values in a list with `list.unique`

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").list.contains(i).alias(str(i)) for i in range(3)
        ]
    )
    .with_columns(
        pl.col("values").list.unique().alias("unique")
    )
)

### Re-ordering values in each list
We can re-order values in each list:
- `reverse` reverses the order of the list
- `sort` sorts each list
- `shift` moves values in each list (in a non-periodic way) so the first values are `null`

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").list.reverse().alias("reverse"),
            pl.col("values").list.sort().alias("sort"),
            pl.col("values").list.shift(1).alias("shift"),
        ]
    )
)

### List aggregations
We can use list expressions to aggregate the lists

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").list.len().alias("lengths"),
            pl.col("values").list.min().alias("min"),
            pl.col("values").list.mean().alias("mean"),
            pl.col("values").list.max().alias("max"),
        ]
    )
)

## Calling expressions on each list
Each row in a `pl.List` column is a `Series`. We can call the same expressions on each `Series` that we would call on a standalone `Series` or column in a `DataFrame`.

To do this we:
- call `list.eval` on the `pl.List` column and inside this 
- call `pl.element` to select the list on each row and then call expressions

The call to `pl.element` inside `list.eval` is like calling `pl.col` on a column in a `DataFrame`

In this example we `rank` the elements of each list

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [4,3,2]
            ],
        }
    )
    .with_columns(
        pl.col("values").list.eval(
            pl.element().rank(method="ordinal")
        ).alias("eval")
    )
)

If we call `pl.element` with no further expressions it returns the full list on each row

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [4,3,2]
            ],
        }
    )
    .with_columns(
        pl.col("values").list.eval(
            pl.element()
        ).alias("eval")
    )
)

We can do more complicated operations with repeated calls to `pl.element`

In this example we want to remove `null` values in the lists using a `filter` inside `list.eval`

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,None,1], 
                [2,3,None]
            ],
        }
    )
    .with_columns(
        pl.col("values").list.eval(
            pl.element().filter(
                pl.element().is_not_null()
            )
        ).alias("eval")
    )
)

## Exercises
In the exercises you will develop your understanding of:
- splitting a string into a list
- extracting elements of a list
- slicing a list
- indexing into a list using expressions

### Exercise 1
We need to parse the following address strings to get columns with the:
- number
- street
- city
- state
- zipcode

In [None]:
pl.Config.set_fmt_str_lengths(150)
addresses = [
    '93 NORTH 9TH STREET, BROOKLYN NY 11211',
    '380 WESTMINSTER ST, PROVIDENCE RI 02903',
    '177 MAIN STREET, LITTLETON NH 03561'
]
df = (
    pl.DataFrame(
        {"address":addresses}
    )
)
df

Add a column called `split` with the string split by whitespace (using `str.split`) into a list column

In [None]:
(
    df
    .with_columns(
        <blank>
    )
)

In an additional `with_columns` statement add a 32-bit integer column called `number` using the `first` element of each list

The street component of the address runs from the second element of the list to the element of the list that contains a comma.

Add a list column called `contains_comma` where we check if each element in the lists in `split` contains a comma. Use `eval` to run the `str.contains` expression on each element in the list

With a new call to `with_columns` slice each list in `split` from the second element to the index of the element that contains a comma.

Hint 1: there is a `list.arg_max` expression that finds the index of the largest value in a list. Use this to find the index of the `True` value in `contains_comma`

In [None]:
(
    pl.DataFrame(
        {
            "values":
            [
                [0,1],
                [3,2]
            ]
        }
    )
    .with_columns(
        pl.col("values").list.arg_max().alias("arg_max")
    )
)

Hint 2: you can pass an expression to `list.slice` if you want the `slice` to depend on values in another column

Python has a `join` method to combine a list of strings into a single string

In [None]:
';'.join(['a','b'])

Polars has a similar method called `list.join`.

Join the string lists in `street` using `list.join` (with a " " separating the strings)

Extract the `city` from `split` by slicing. The slice should start one element after the `arg_max` value in `contains_comma` and have a length of 1 (here we are taking advantage of 3 one-word city names!)

Get the `zipcode` as the last element in `split`

### Solution to exercise 1
We need to parse the following address strings to get columns with the:
- number
- street
- city
- state
- zipcode

In [None]:
pl.Config.set_fmt_str_lengths(150)
addresses = [
    '93 NORTH 9TH STREET, BROOKLYN NY 11211',
    '380 WESTMINSTER ST, PROVIDENCE RI 02903',
    '177 MAIN STREET, LITTLETON NH 03561'
]
df = (
    pl.DataFrame(
        {"address":addresses}
    )
)
df

Add a column called `split` with the string split by whitespace (using `str.split`) into a list column

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
)

Add a 32-bit integer column called `number` using the `first` element of each list

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        pl.col("split").list.first().cast(pl.Int32).alias("number")
    )
)

The street component of the address runs from the second element of the list to the element of the list that contains a comma.

Add a list column called `contains_comma` where we check if each element in the lists in `split` contains a comma. Use `eval` to run the `str.contains` expression on each element in the list

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(
                pl.element().str.contains(",")
            ).alias("contains_comma")
        ]
    )
)

With a new call to `with_columns` slice each list in `split` from the second element to the index of the element that contains a comma.

Hint 1: there is a `list.arg_max` expression that finds the index of the largest value in a list. Use this to find the index of the `True` value in `contains_comma`

In [None]:
(
    pl.DataFrame(
        {
            "values":
            [
                [0,1],
                [3,2]
            ]
        }
    )
    .with_columns(
        pl.col("values").list.arg_max().alias("arg_max")
    )
)

Hint 2: you can pass an expression to `list.slice` if you want the `slice` to depend on values in another column

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).alias("street")
    )
    
)

Join the string lists in `street` using `list.join` (with a " " separating the strings)

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).list.join(" ").alias("street")
    )
    
)

Extract the `city` from `split` by slicing. The slice should start one element after the `arg_max` value in `contains_comma` and have a length of 1 (here we are taking advantage of 3 one-word city names!)

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).list.join(" ").alias("street")
    )
    .with_columns(
        pl.col("split").list.slice(
            pl.col("contains_comma").list.arg_max()+1,1
        ).alias("city")
    )
)

Get the `zipcode` as the last element in `split`

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").list.first().cast(pl.Int32).alias("number"),
            pl.col("split").list.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").list.slice(1,pl.col("contains_comma").list.arg_max()).list.join(" ").alias("street")
    )
    .with_columns(
        [
            pl.col("split").list.slice(
                pl.col("contains_comma").list.arg_max()+1,1
            ).alias("city"),
            # Keep the zipcode as a string so leading zeros (e.g. 02903) are preserved
            pl.col("split").list.last().alias("zipcode")
        ]
    )
)