# Visualisation
By the end of this lecture you will be able to:
- create charts from Polars with a variety of plotting libraries
- understand how these libraries support Polars

We import Vegafusion along with Altair below. Vegafusion is not necessary but reduces the burden on your browser for visualising larger datasets. See my blog post here for more on this: https://www.rhosignal.com/posts/polars-and-altair/

Up-to-date versions of the visualisation libraries are typically required for maximum compatibility.

In [None]:
import polars as pl

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import altair as alt
import vegafusion as vf

# Enable vegafusion for Altair
vf.enable()

In [None]:
csv_file = '../data/titanic.csv'

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

We first look at whether we can pass a Polars `DataFrame` directly to each plotting library by creating a simple bar chart. Below we consider some other points to be aware of for working with each library from Polars.

## Bar chart

We begin by getting a count of the number of passengers in each passenger class. See the section of the course on Statistics and Aggregation for more on the methods used here.

In [None]:
passenger_class_counts_df = (
    df['Pclass']
    .value_counts()
    .sort("Pclass")
)
passenger_class_counts_df

### Matplotlib
We can pass the columns of `passenger_class_counts_df` directly to Matplotlib.

In [None]:
plt.bar(
    x=passenger_class_counts_df["Pclass"],
    height=passenger_class_counts_df["counts"]
)

Note that the `Pclass` column is an integer column in Polars but the x-axis in the chart is a float axis. One way to make this appear as an integer axis is to cast the integers to strings so that Matplotlib treats them as categories.

In [None]:
passenger_class_counts_string_column_df = (
    passenger_class_counts_df
    .with_columns(
        pl.col("Pclass").cast(pl.Utf8)
    )
)
plt.bar(
    x=passenger_class_counts_string_column_df["Pclass"],
    height=passenger_class_counts_string_column_df["counts"]
)

Matplotlib does not have explicit support for Polars. However, Matplotlib can accept a Polars `Series` because it only needs sequence-type objects that it can iterate through using standard Python methods, which a `Series` supports.

### Seaborn
We can pass a Polars `DataFrame` to Seaborn for many charts. Note that Seaborn then typically copies the data to Pandas internally as it makes extensive use of Pandas-specific features such as the index. With a large `DataFrame` you may want to only pass the columns needed for the plot to avoid the whole `DataFrame` being copied!

In [None]:
sns.barplot(
    passenger_class_counts_df,
    x="Pclass",
    y="counts"
)

Some more complicated Seaborn charts, such as `sns.scatterplot` or `sns.jointplot`, also accept a Polars `DataFrame` directly. In this example we do a joint plot of (log) Age and (log) Fare coloured by passenger class.

In [None]:
sns.jointplot(
    data=(
        df
        .with_columns(
            [
                pl.col(pl.Float64).log(),
                pl.col("Pclass").cast(pl.Utf8)
            ]
        )
    ),
    x="Age", 
    y="Fare", 
    hue="Pclass",
)

### Plotly
We can pass a `DataFrame` directly to Plotly. Note that we again use the `DataFrame` with a string column so that the x-axis is treated as categorical.

In [None]:
px.bar(
    passenger_class_counts_string_column_df,
    x="Pclass",
    y="counts",
    color="Pclass",
    width=400
)

### The Dataframe Interchange Protocol
Seaborn, Plotly and Altair (below) support Polars via the **Dataframe Interchange Protocol** ([read more here](https://data-apis.org/dataframe-protocol/latest/index.html)). This protocol is a way for third-party packages (e.g. visualisation libraries) to work with different dataframe libraries without explicitly supporting each library.

What the use of the interchange protocol means in practice is that we can use Polars `DataFrames` directly with Plotly for many charts. However, as Polars does not have native support from Plotly there are no guarantees all plots will work with a Polars `DataFrame`. You may need to convert to Pandas in some cases.

If you are curious about how the interchange protocol works this is a simplified version:
- Plotly checks the type of the data object passed to it and finds that it is not a Pandas `DataFrame`
- Plotly then checks to see if the object passed to it has a `__dataframe__` method
- if Plotly finds the object has a `__dataframe__` method, it calls it and uses the generic methods on the returned object to do what it needs (e.g. extract a named column from the `DataFrame`, check the dtype of the column and iterate through the column)

You can see the methods available on the object returned by `__dataframe__` for a Polars `DataFrame` here:

In [None]:
[el for el in dir(df.__dataframe__()) if not el.startswith("__")]

These methods are wrappers for the standard Polars methods we learn on this course. The dataframe interchange protocol is a rapidly developing project in its own right, so expect its functionality to grow.

### Altair
We can pass a `DataFrame` directly to Altair.

In [None]:
alt.Chart(
    passenger_class_counts_df,
    width=600
).mark_bar().encode(
    x="Pclass:N",
    y="counts:Q",
    color="Pclass:N",
)

As with Plotly, Altair supports Polars via the Dataframe Interchange Protocol. The same caveats apply as for Plotly.

## Exercises
In the exercises you will develop your understanding of:
- creating charts via Pandas or directly from Polars

### Exercise 1
We first create a `DataFrame` of bike sales and replace spaces in the string column names with `_` (see the lecture on Transforming DataFrames in the section on selecting and transforming data for more on `pipe`).

In [None]:
df_bike = (
    pl.read_parquet("../data/bike_sales.parquet")
    .pipe(lambda df: df.rename({col:col.replace(" ","_") for col in df.columns}))
)
df_bike.head(2)

We need to do a `group_by` first to get the number of sales for each customer age.

In [None]:
customer_count_df = (
    df_bike
    .group_by("customer_age")
    .count()
    .sort("customer_age")
)

Using your preferred visualisation library, make a bar chart from `customer_count_df` showing the number of bikes sold for each value of `customer_age`.

<blank>

## Solutions

### Matplotlib

In [None]:
plt.bar(
    x=customer_count_df["customer_age"],
    height=customer_count_df["count"],
)

### Seaborn

In [None]:
sns.barplot(
    customer_count_df.to_pandas(use_pyarrow_extension_array=True),
    x="customer_age",
    y="count"
)

### Plotly

In [None]:
px.bar(
    customer_count_df.with_columns(pl.col("customer_age").cast(pl.Utf8)),
    x="customer_age",
    y="count",
)

### Altair

In [None]:
(
    alt.Chart(
        customer_count_df,
        width=600
    )
    .mark_bar()
    .encode(
        x="customer_age:N",
        y="count:Q"
    )
)