# Introduction to nested dtypes: List, Array, Object and Struct
By the end of this lecture you will be able to:
- create columns with List, Array, Struct and Object dtypes
- explain the difference between the List, Array, Struct and Object dtypes
- unnest the fields in a Struct dtype

In [1]:
import polars as pl

## `pl.List` dtype
With a `pl.List` dtype each row is a `Series` and each `Series` has the same dtype.

We can create a `pl.List` column manually from Python `list`s *where all elements of each `list` have the same type or can be cast to a common type, e.g. `int` to `float`*.

In [2]:
df_lists = pl.DataFrame({
    'ints':[ 
        [0,1], 
        [2,3]
    ],
    'floats':[ 
        [0.0,1], 
        [2,3]
    ],
    'strings':[ 
        ["0","1"],
        ["2","3"]
    ]
})
df_lists

ints,floats,strings
list[i64],list[f64],list[str]
"[0, 1]","[0.0, 1.0]","[""0"", ""1""]"
"[2, 3]","[2.0, 3.0]","[""2"", ""3""]"


We cover the `pl.List` dtype in more detail in the lectures that follow.

The `pl.List` dtype can have a variable number of elements per row. There is also a `pl.Array` dtype optimised for cases where all rows have the same number of elements.

In [3]:
(
    df_lists
    .with_columns(
        ints_array = pl.col("ints").cast(pl.Array(width=2,inner=pl.Int64))
    )
)

ints,floats,strings,ints_array
list[i64],list[f64],list[str],"array[i64, 2]"
"[0, 1]","[0.0, 1.0]","[""0"", ""1""]","[0, 1]"
"[2, 3]","[2.0, 3.0]","[""2"", ""3""]","[2, 3]"


Functionality for the `pl.Array` dtype is still limited so our focus will be on the `pl.List` dtype.

## `pl.Object` dtype
We create a column with a `pl.Object` dtype when the lists cannot be cast to a homogeneous type.

In [5]:
df_object = pl.DataFrame({
    'mixed':[ 
        ['a',0],
        ['b',1]
    ]
})
df_object

mixed
object
"['a', 0]"
"['b', 1]"


The "list" on each row in a **`pl.Object`** column is a standard Python `list` under the hood.

In [6]:
df_object[0,0]

['a', 0]

In [7]:
type(df_object[0,0])

list

Operations on a `pl.Object` column are slow because they work on Python `list`s rather than fast Polars `Series`.

We generally want to avoid working with a `pl.Object` dtype if possible. For example, it may be better to cast integers to strings to have a string `pl.List` column rather than a `pl.Object` column.

## `pl.Struct` dtype
The `pl.Struct` dtype also has a collection of data on each row. The fields of a `pl.Struct` dtype are similar to regular columns of a `DataFrame`, but they are nested inside the struct column and accessed by field name.

We create a `pl.Struct` column by passing a list of `dicts` where:
- the `dict` on each row has the same keys
- the values for each key on each row have the same dtype

In [8]:
df_struct = (
    pl.DataFrame(
        {
            "year":[2020,2021],
            "trades":[
                {"exporter":"India","importer":"USA","quantity":0.0},
                {"exporter":"India","importer":"USA","quantity":1.5},
            ]
          }
    )
)
df_struct

year,trades
i64,struct[3]
2020,"{""India"",""USA"",0.0}"
2021,"{""India"",""USA"",1.5}"


The keys in a struct column are called fields.

We can list the field names with `struct.fields` on a `Series`.

In [9]:
df_struct["trades"].struct.fields

['exporter', 'importer', 'quantity']

## Accessing `pl.Struct` fields

We access fields within a struct column in an expression.

In [10]:
(
    df_struct
    .select(
        pl.col("trades").struct.field("exporter")
    )
)

exporter
str
"""India"""
"""India"""


## Extracting data from a `pl.Struct`

We can convert a struct `Series` into its own multi-column `DataFrame`.

In [11]:
df_struct["trades"].struct.unnest()

exporter,importer,quantity
str,str,f64
"""India""","""USA""",0.0
"""India""","""USA""",1.5


We can also un-nest a `pl.Struct` column so that its fields become columns in the `DataFrame`.

In [12]:
df_struct.unnest("trades")

year,exporter,importer,quantity
i64,str,str,f64
2020,"""India""","""USA""",0.0
2021,"""India""","""USA""",1.5


We can have more than one level of nesting in a struct column.

In this example we keep the `quantity` field at the top level of the `pl.Struct` but move the `importer`/`exporter` fields into a second nested level within the `pl.Struct`.

In [13]:
df_struct_deep = pl.DataFrame({'trades':[
        {
            "countries":{"exporter":"India","importer":"USA"},
            "quantity":0.0
        },
        {
            "countries":{"exporter":"India","importer":"USA"},
            "quantity":1.5
        },
    ]
  })
df_struct_deep

trades
struct[2]
"{{""India"",""USA""},0.0}"
"{{""India"",""USA""},1.5}"


In [15]:
df_struct_deep.unnest("trades").unnest("countries")

exporter,importer,quantity
str,str,f64
"""India""","""USA""",0.0
"""India""","""USA""",1.5


We can do fast operations on a `pl.Struct` dtype because we are working with Polars objects rather than Python `list`s.

## Exercises
In the quiz in this section you will develop your understanding of:
- creating `pl.List` columns
- creating `pl.Object` columns
- creating `pl.Struct` columns