## List dtype 1: Creating and transforming List columns
By the end of this lecture you will be able to:
- select `pl.List` columns
- explode a `pl.List` column into rows
- convert a `pl.List` column to a `pl.Struct` or a Numpy array

In [None]:
import polars as pl

We create a `DataFrame` with integer, floating point and string `pl.List` columns. We set `strict=False` so that we can mix integers and floats in the input data for the `floats` column.

In [None]:
df_lists = pl.DataFrame(
    {
        'ints':[ 
            [0,1], 
            [2,3]
        ],
        'floats':[ 
            [0.0,1], 
            [2,3]
        ],
        'strings':[ 
            ["0","1"],
            ["2","3"]
        ]
    },strict=False
)
df_lists

In the printed representation we see a list on each row.

In reality the data on each row of a `pl.List` column is a Polars `Series`.

We can see the underlying `Series` by selecting a row in a `pl.List` column

In [None]:
df_lists[0,"ints"]

We can configure how many list elements are printed with a `pl.Config` setting

In [None]:
pl.Config.set_fmt_table_cell_list_len(20)

## Selecting `pl.List` columns by dtype
We cannot select all `pl.List` columns by passing the bare `pl.List` dtype without an inner dtype

In [None]:
(
    df_lists
    .select(
        pl.col(pl.List)
    )
)

Instead we must pass the dtype for the lists in that column.

In this example we select the 64-bit integer `pl.List` column 

In [None]:
(
    df_lists
    .select(
        pl.col(pl.List(pl.Int64))
    )
)

### Length of List columns
The length of the list does not have to be the same on each row

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3,4],
                [4,5,6,7,8]
            ],
        }
    )
)

Polars also has a `pl.Array` dtype where the length of the arrays does have to be the same on each row. We create this dtype by specifying the length of the arrays and the inner dtype

In [None]:
(
    df_lists
    .with_columns(
        ints_array = pl.col('ints').cast(pl.Array(shape=2,inner=pl.Int64))
    )
    .select('ints','ints_array')
)

At present the functionality for the `pl.Array` dtype is limited so we focus on the `pl.List` dtype here

## Turning `pl.List` columns into rows
We use `explode` to expand each list element into its own row. To illustrate this we first create a simpler `DataFrame` with an `id` column and a `values` `pl.List` column

In [None]:
df_list = pl.DataFrame(
    {
        "id":["a","b"],
        'values':[ 
            [0,1], 
            [2,3,4]
        ],
    }
)
df_list

We can convert each list element into its own row by calling `explode` on the `DataFrame`

In [None]:
(
    df_list
    .explode("values")
)

### Using `explode` to do operations on `pl.List` columns
In the next lecture we see how to apply expressions to `pl.List` columns. However, using `explode` instead is often both more performant and easier to write than using list expressions (particularly where the list expressions require the `eval` approach we see in the next lecture).

For example, here we want to create a new column that has the rank of the values in each list (with the lowest value having rank 1 etc).

We do this first by:
- doing `explode` on the list column
- doing the `rank` expression (using `over` to rank within each `id`, that is within each original list: we see more on `over` in the next section)

In [None]:
(
    df_list
    .explode("values")
    .with_columns(
        pl.col("values").rank().over("id").cast(pl.Int32).alias("rank")
    )
)

We can then gather the data back up into `pl.List` columns by using `group_by` on the `id` column and aggregating the `values` and `rank` columns by `id`.

We continue the example above with `group_by`

In [None]:
(
    df_list
    .explode("values")
    .with_columns(
        pl.col("values").rank().over("id").cast(pl.Int32).alias("rank")
    )
    .group_by("id")
    .agg(
        pl.col("values"),
        pl.col("rank")
    )
)

> This next idea is more advanced, you may want to come back to it later if you find there are too many new concepts in it

This approach can be memory intensive if you have many columns and/or a list column with many elements. One way to reduce memory use is to do the `explode` in a `pipe` function where you only select the columns you need to do the `explode`/`group_by` operations.

To illustrate this we create a new `DataFrame` with an extra column

In [None]:
df_list_extra_column = pl.DataFrame(
    {
        "id":["a","b"],
        "id2":["c","d"],
        'values':[ 
            [0,1], 
            [2,3,4]
        ],
    }
)
df_list_extra_column

We now (from inside-out in the code below):
- `pipe` the `df_list_extra_column` into a function where we call it `df`
- in the function we calculate the rank column
- join the results back to `df_list_extra_column` (see the section on joins and concatenations if you are not familiar with joins)

In [None]:
(
    df_list_extra_column
    .join(
        # Join the outputs of the ranking back onto the DataFrame
        df_list_extra_column.pipe(
            # Pass df_list_extra_column into the function as "df"
            lambda df: (
                df
                # Restrict the number of columns before exploding
                .select("id","values")
                # Do the same ranking operations as above
                .explode("values")
                .with_columns(
                    pl.col("values").rank().over("id").cast(pl.Int32).alias("rank")
                )
                .group_by("id")
                .agg(
                    # Only return the rank column as the values column is already in df_list_extra_column
                    pl.col("rank")
                )
            )
        ),
        # Join the results back on the id column
        on="id",
        how="left",
        coalesce=True
    )
)

## Convert a `pl.List` column to a `pl.Struct` column
Polars has a `list` namespace with expressions that work on `pl.List` columns (we see more of this in the next lecture).

We convert a `pl.List` column to a `pl.Struct` column with `list.to_struct`.

> Recall that the data is stored in memory differently in a List column compared to a Struct column. In a List column each row horizontally is a Series. In a Struct column each nested column vertically is a Series

In this example we use `list.to_struct` to turn the `pl.List` column into a `pl.Struct` column

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3],
                [4,5]
            ],
        }
    )
    # Convert the lists to a struct
    .with_columns(
        pl.col("values").list.to_struct().alias("value_struct")
    )
)

We can then create full `DataFrame` columns from the `pl.Struct` column by calling `unnest` on the `DataFrame`

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3],
                [4,5]
            ],
        }
    )
    # Convert the lists to a struct
    .with_columns(
        pl.col("values").list.to_struct().alias("value_struct")
    )
    # Un-nest the struct to DataFrame columns
    .unnest("value_struct")
)

While a `pl.List` column can have a variable number of elements on each row, a `pl.Struct` has a fixed number of fields on each row. The number of nested struct columns is set by the length of the first row of the `pl.List` column:
- if subsequent `pl.List` rows are shorter then the missing elements become `null` values in the `pl.Struct` nested columns
- if subsequent `pl.List` rows are longer then the extra elements are dropped from the `pl.Struct` nested columns

Try this yourself by adding or removing values from the second row of the lists above

## Convert a `pl.List` column to a Numpy array
A `pl.List` column is a natural way to hold array data that we may need in Numpy. 

Here we create a `pl.List(pl.Float64)` column that holds `embeddings` from a machine learning model

In [None]:
df_embeddings = (
    pl.DataFrame(
        {
            'embeddings':[ 
                [0.0,1.0], 
                [2.0,3.0],
                [4.0,5.0]
            ],
        }
    )
)
df_embeddings

If we call `to_numpy` directly on the `embeddings` column we get a **one-dimensional `object`** Numpy array where each element is an array

In [None]:
df_embeddings["embeddings"].to_numpy()

To get the `embeddings` column as a two-dimensional Numpy array we can `explode` the column and then reshape it in Numpy

In [None]:
(
    df_embeddings
    ["embeddings"]
    .explode()
    .to_numpy()
    .reshape(len(df_embeddings),-1)
)

Alternatively, the fixed-width `pl.Array` dtype converts directly to a Numpy two-dimensional array without a reshape

In [None]:
(
    df_embeddings
    .with_columns(
        pl.col("embeddings").cast(pl.Array(shape=2,inner=pl.Float64))
    )
    ["embeddings"]
    .to_numpy()
)

## Exercises
In the exercises you will develop your understanding of:
- selecting list columns
- exploding list columns
- counting occurrences in a list column

### Exercise 1
We create a `DataFrame` with `pl.List` columns

In [None]:
df_lists = pl.DataFrame(
    {
        'ints':[ 
            [0,1], 
            [2,3]
        ],
        'floats':[ 
            [0.0,1], 
            [2,3]
        ],
        'strings':[ 
            ["0","1"],
            ["2","3"]
        ]

    },
    strict=False
)
df_lists

Select the floating point list column from `df_lists`

In [None]:
(
    df_lists
    <blank>
)

Select the floating point **and** integer list columns from `df_lists`

In [None]:
(
    df_lists
    <blank>
)

### Exercise 2
We create a `pl.List` column from the Titanic dataset by splitting the `Name` column on each space character

In [None]:
csv_file = "../data/titanic.csv"
df = (
    pl.read_csv(csv_file)
    .select(
        [
            "PassengerId",
            "Pclass",
            "Name",
            pl.col("Name").str.split(" ").alias("Name_list")
        ]
    )
)
df.head(2)

Expand the `Name_list` column into separate rows

In [None]:
(
    df
    <blank>
    .head()
)

Filter to remove rows with the titles: "Mr.","Mrs.","Miss.","Master." from the output

Find the most common names:

After filtering out the titles, count the occurrences of each name in the `Name_list` column using `.value_counts(sort=True)`

## Solutions

### Solution to exercise 1
We create a `DataFrame` with `pl.List` columns. We use `strict=False` so that we can pass a mix of floats and integers for the `floats` column

In [None]:
df_lists = pl.DataFrame(
    {
        'ints':[ 
            [0,1], 
            [2,3]
        ],
        'floats':[ 
            [0.0,1], 
            [2,3]
        ],
        'strings':[ 
            ["0","1"],
            ["2","3"]
        ]

    },
    strict=False
)
df_lists

Select the floating point list column from `df_lists`

In [None]:
(
    df_lists
    .select(
        pl.col(pl.List(pl.Float64))
    )
)

Select the floating point **and** integer list columns from `df_lists`

In [None]:
(
    df_lists
    .select(
        pl.col([pl.List(pl.Float64),pl.List(pl.Int64)])
    )
)

### Solution to exercise 2
We create a `pl.List` column from the Titanic dataset by splitting the `Name` column on each space character

In [None]:
csv_file = "../data/titanic.csv"
df = (
    pl.read_csv(csv_file)
    .select(
        "PassengerId",
        "Pclass",
        "Name",
        pl.col("Name").str.split(" ").alias("Name_list")
    )
)
df.head(2)

Expand the `Name_list` column into separate rows

In [None]:
(
    df
    .explode("Name_list")
    .head()
)

Filter to remove rows with the titles: "Mr.","Mrs.","Miss.","Master." from the output

In [None]:
(
    df
    .explode("Name_list")
    .filter(~pl.col("Name_list").is_in(["Mr.","Mrs.","Miss.","Master."]))
    .head()
)

Find the most common names:

After filtering out the titles, count the occurrences of each name in the `Name_list` column using `.value_counts(sort=True)`

In [None]:
(
    df
    .explode("Name_list")
    .filter(~pl.col("Name_list").is_in(["Mr.","Mrs.","Miss.","Master."]))
    ["Name_list"]
    .value_counts(sort=True)
    .head()
)