# Conversion to & from Numpy and Pandas
By the end of this lecture you will be able to:
- convert between Polars and Numpy
- convert between Polars and Pandas

Key functionality in this notebook requires that your Pandas version is 1.5+, Polars is 0.16.4+ and PyArrow is 11+.

Use `pl.show_versions()` to check your installation.

In [1]:
import polars as pl
import numpy as np
import pandas as pd

In [2]:
csv_file = "data_titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


## Convert a `DataFrame` to Numpy

To convert a `DataFrame` to Numpy use the `to_numpy` method. This clones (copies) the data.

In [4]:
arr = df.to_numpy()
arr

array([[1, 0, 3, ..., 7.25, None, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, None, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, None, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, None, 'Q']], dtype=object)

This conversion turns each row into a Numpy `ndarray` and vertically stacks these row-arrays.

As the `DataFrame` has a mix of types the Numpy array has an `object` dtype.

If the columns have uniform numeric dtype then the Numpy array has the corresponding dtype.

In this example we use `select` to choose the 64-bit floating point columns only for conversion to Numpy. 

> We cover `select` in more detail in the Section on Selecting columns and transforming dataframes.

In [5]:
floats_array = (
    df
    .select(
        pl.col(pl.Float64)
    )
    .to_numpy()
)
floats_array

array([[22.    ,  7.25  ],
       [38.    , 71.2833],
       [26.    ,  7.925 ],
       ...,
       [    nan, 23.45  ],
       [26.    , 30.    ],
       [32.    ,  7.75  ]])

In [6]:
floats_array.dtype

dtype('float64')

It is usually better to convert to Numpy as late as possible in your data processing pipeline, because processing in Polars is often faster and more memory efficient.

## Convert Numpy to a `DataFrame`

We can create a Polars `DataFrame` from a Numpy array.

In [7]:
rand_array = np.random.standard_normal((5,3))
(
    pl.DataFrame(
        rand_array,
    )
)

column_0,column_1,column_2
f64,f64,f64
-0.669349,2.225443,0.382977
-1.83319,0.027975,1.118817
-1.494478,-0.081568,1.692512
1.230934,0.953518,1.069817
0.889615,-0.299069,-0.908479


We can optionally pass a list of column names to `pl.DataFrame` if we want to specify these.

If we have a **1D** Numpy array we can create a Polars `Series` or `DataFrame` with zero-copy. We start by creating a 1D array.

In [8]:
arr = np.ones(int(1e3))
arr.shape

(1000,)

We can then create a zero-copy `Series` or `DataFrame`

In [9]:
# zero copy series conversion
pl.Series("a", arr)

# zero copy DataFrame conversion
pl.DataFrame(
    {
       "a": arr,
    }
)

a
f64
1.0
1.0
1.0
1.0
1.0
…
1.0
1.0
1.0
1.0


## Convert a `Series` to Numpy
Converting a `Series` to Numpy has more options than converting an entire `DataFrame`.

To do a simple conversion where the data is cloned, use `to_numpy` on the `Series`.

In [10]:
df['Age'].head().to_numpy()

array([22., 38., 26., 35., 35., nan, 54.,  2., 27., 14.])

### Convert a `Series` to Numpy with zero-copy
In some cases we can convert a `Series` to Numpy without copying ("zero-copy"). 

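Zero-copy is only possible if the `Series` contains no `null` values, because a `null` has to be replaced (for example by `NaN` in a float column) and that requires a copy.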

In [11]:
arr = (
    df['Survived']
    .head()
    .to_numpy(allow_copy=False)
)
arr

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], dtype=int64)

With zero-copy conversion the Numpy array is read-only so you cannot change the values in the Numpy array.

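In the following example, assigning to the array after a zero-copy conversion would raise a `ValueError`, so the assignment is left commented out.

In [14]:
arr = (
    df['Survived']
    .head()
    .to_numpy(allow_copy=False)
)
# Uncommenting the next line raises "ValueError: assignment destination is read-only"
# arr[0] = 100
arr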

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], dtype=int64)

## Convert a `DataFrame` to Pandas

### Convert to a Numpy-backed Pandas DataFrame
Pandas has historically used Numpy arrays to represent its data in memory.

To convert a `DataFrame` to a Numpy-backed Pandas `DataFrame`, use the `to_pandas` method. This clones the data, similar to calling `to_numpy` on a `DataFrame` above.

> This conversion to Pandas requires that you have `PyArrow` installed with `pip` or `conda`.

In [15]:
(
    df
    .to_pandas()
    .head(2)
)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


### Convert to a PyArrow-backed Pandas `DataFrame`
Since Pandas release 1.5.0 and Polars release 0.16.4 you can have a Pandas `DataFrame` backed by an Arrow Table. You can create a Pandas `DataFrame` that references the same Arrow Table as your Polars `DataFrame`. This means that you can use (some) Pandas code on your data without copying the data.

In [16]:
(
    df
    .to_pandas(use_pyarrow_extension_array=True)
    .head(2)
)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


The advantage of using the pyarrow extension array is that creating the Pandas `DataFrame` is very cheap as it does not require copying data. 

If there is a function you want from Pandas, you can do a quick conversion to Pandas, apply the function and convert back to Polars. This works in eager mode only.

This PyArrow conversion is a new feature in both libraries, so there may be bugs with trickier features such as categorical or nested columns.

Note that when you do **not** use the PyArrow extension approach, the dtypes of the columns in Pandas are the standard Pandas dtypes. When you do use the PyArrow extension approach, the dtypes of the columns in Pandas are PyArrow dtypes.

In [17]:
# Without PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=False).dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [18]:
# With PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=True).dtypes

PassengerId           int64[pyarrow]
Survived              int64[pyarrow]
Pclass                int64[pyarrow]
Name           large_string[pyarrow]
Sex            large_string[pyarrow]
Age                  double[pyarrow]
SibSp                 int64[pyarrow]
Parch                 int64[pyarrow]
Ticket         large_string[pyarrow]
Fare                 double[pyarrow]
Cabin          large_string[pyarrow]
Embarked       large_string[pyarrow]
dtype: object

### Calling `pd.DataFrame` on a Polars `DataFrame`
With an up-to-date version of Pandas you can call `pd.DataFrame` on a Polars `DataFrame`. But there may still be bugs such as the column names not being converted!

In [19]:
dfp = (
    pd.DataFrame(df)
    .head()
)

In [20]:
dfp

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Hopefully this conversion will be easier when both libraries have adopted the [dataframe interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html).

### Conversion from Pandas to Polars
You can convert from Pandas to Polars by calling `pl.DataFrame` on the Pandas `DataFrame`.

In [21]:
(
    pl.DataFrame(
        df.to_pandas()
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


Alternatively, call `pl.from_pandas` on the Pandas `DataFrame`.

In [22]:
(
    pl.from_pandas(
        df.to_pandas()
    ).head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


## Convert a `Series` to Pandas
You can convert a `Series` to Pandas with a call that clones the data.

In [23]:
(
    df['Age']
    .to_pandas()
    .head()
)

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

Or you can again use the PyArrow extension type in Pandas for a zero-copy operation.

In [24]:
(
    df['Age']
    .to_pandas(use_pyarrow_extension_array=True)
    .head()
)

0   22.0
1   38.0
2   26.0
3   35.0
4   35.0
Name: Age, dtype: double[pyarrow]

## Exercises

No exercises for this lecture!