# Introduction to Data Types
By the end of this lecture you will be able to:
- get the data type schema of a `DataFrame`
- get the data type of a `Series`
- explain the relationship between Polars and Apache Arrow


We look at the different data types in more detail in the Section on Data types and missing values.

In [1]:
import polars as pl

In [2]:
csvFile = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csvFile)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


## Data type schema

Every column in a `DataFrame` has a data type called a `dtype`.

You can get a `dict`-like mapping from column names to dtypes with the `.schema` attribute.

In [4]:
df.schema

OrderedDict([('PassengerId', Int64),
             ('Survived', Int64),
             ('Pclass', Int64),
             ('Name', Utf8),
             ('Sex', Utf8),
             ('Age', Float64),
             ('SibSp', Int64),
             ('Parch', Int64),
             ('Ticket', Utf8),
             ('Fare', Float64),
             ('Cabin', Utf8),
             ('Embarked', Utf8)])

There is also a `dtypes` attribute (as in Pandas), but it returns a `list` of dtypes without the column names.

In [5]:
df.dtypes

[Int64,
 Int64,
 Int64,
 Utf8,
 Utf8,
 Float64,
 Int64,
 Int64,
 Utf8,
 Float64,
 Utf8,
 Utf8]

A `Series` also has a data type attribute called `dtype`.

In [6]:
df['Name'].dtype

Utf8

## Apache Arrow

A Pandas `DataFrame` has underlying NumPy arrays where the data is stored. In Polars the data is stored in an Arrow Table.

We can see this Arrow Table by calling `to_arrow` - this is a cheap operation as it is just accessing the underlying data.

In [7]:
df.to_arrow()

pyarrow.Table
PassengerId: int64
Survived: int64
Pclass: int64
Name: large_string
Sex: large_string
Age: double
SibSp: int64
Parch: int64
Ticket: large_string
Fare: double
Cabin: large_string
Embarked: large_string
----
PassengerId: [[1,2,3,4,5,...,887,888,889,890,891]]
Survived: [[0,1,1,1,0,...,0,1,0,1,0]]
Pclass: [[3,1,3,1,3,...,2,1,3,1,3]]
Name: [["Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Thayer)","Heikkinen, Miss. Laina","Futrelle, Mrs. Jacques Heath (Lily May Peel)","Allen, Mr. William Henry",...,"Montvila, Rev. Juozas","Graham, Miss. Margaret Edith","Johnston, Miss. Catherine Helen "Carrie"","Behr, Mr. Karl Howell","Dooley, Mr. Patrick"]]
Sex: [["male","female","female","female","male",...,"male","female","female","male","male"]]
Age: [[22,38,26,35,35,...,27,19,null,26,32]]
SibSp: [[1,1,0,1,0,...,0,0,1,0,0]]
Parch: [[0,0,0,0,0,...,0,0,2,0,0]]
Ticket: [["A/5 21171","PC 17599","STON/O2. 3101282","113803","373450",...,"211536","112053","W./C. 6607","1113

### What is Apache Arrow?
Apache Arrow is an open source cross-language project to store tabular data in memory. Apache Arrow is both:
- a specification for how data should be represented in memory
- a set of libraries in different languages that implement that specification

Polars uses the implementation of the Arrow specification from the Rust library [Arrow2](https://docs.rs/arrow2/latest/arrow2/).

### Why does `Polars` use `Apache Arrow`?
Arrow allows for:
- sharing data without copying ("zero-copy")
- faster vectorised calculations
- working with larger-than-memory data in chunks
- consistent representation of missing data

Overall, Polars can process data more quickly and with less memory usage because of Arrow.

### How do we use Arrow in practice?
In practice **we rarely need to deal with Arrow directly** - Polars handles that for us.

The main time I call `to_arrow` is when passing data to another library that supports Arrow. This can allow us to pass data between libraries without copying.

For example, [in this blog post I show how you can pass data from Polars to XGBoost by calling `to_arrow`](https://www.rhosignal.com/posts/polars-arrow-xgboost/).

### So what is a Polars `DataFrame`?
One important consequence of using Arrow is that a Polars `DataFrame` doesn't hold data directly. Instead a Polars `DataFrame` holds references to an Arrow table.

This means that when we add a new column using `with_columns` (see the Selecting and Transforming dataframes section for more) we create a new `DataFrame`.

In [None]:
(
    df
    .with_columns(
        pl.lit(0).alias("zeroes")
    )
)

However, creating a new `DataFrame` is a **cheap** operation as we are not copying the existing data to the new `DataFrame` - we are just copying **references** to the existing data along with the reference to the new column.

## Exercises
In the exercises you will develop your understanding of:
- getting the dtypes of a `DataFrame`
- getting the dtype of a `Series`

### Exercise 1 

What are the dtypes of this `DataFrame`?

In [11]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
df.schema

OrderedDict([('a', Int64), ('b', Float64)])

### Exercise 2
Create a `Series` by selecting the `a` column of `df`

In [12]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})

In [21]:
df['a']

a
i64
0
1
2


In [22]:
df.select('a').to_series().head(3)

a
i64
0
1
2


What is the dtype of `a`?
What is the dtype of `b`?

## Solutions

### Solution to Exercise 1
What are the dtypes of this `DataFrame`?

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
df.schema

### Solution to Exercise 2
Create a `Series` by selecting the `a` column of `df`

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
s = df["a"]

In [None]:
s

`s` has the 64-bit integer dtype (`Int64`).

In [None]:
s2 = df["b"]
s2

`s2` has the 64-bit floating point dtype (`Float64`).