# Advanced pandas - Going Beyond the Basics

## polars
___

### Table of Contents
1. [Import dependencies](#section1)
2. [Import dataset](#section2)

https://towardsdatascience.com/pandas-dataframe-but-much-faster-f475d6be4cd4

https://medium.com/towards-data-science/measuring-the-speed-of-new-pandas-2-0-against-polars-and-datatable-still-not-good-enough-e44dc78f6585

https://www.linkedin.com/in/marcogorelli

https://towardsdatascience.com/getting-started-with-the-polars-dataframe-library-6f9e1c014c5c

___
<a id='section1'></a>
# (1) Import dependencies

In [1]:
# Install dependencies (if not already installed)
# !pip install pandas==2.0.3
# !pip install polars==0.18.7

In [4]:
import numpy as np
import pandas as pd
import polars as pl

___
<a id='section2'></a>
# (2) Import dataset
- Data Source: https://archive.ics.uci.edu/dataset/352/online+retail (CC BY 4.0 license)

`polars` supports reading from and writing to all common file formats (e.g. CSV, JSON, Parquet), cloud storage (S3, Azure Blob, BigQuery), and databases (e.g. PostgreSQL, MySQL).

To load the dataset, we will use the fast CSV reading function of `polars`, `pl.read_csv`, as shown below:

In [9]:
# Read CSV in polars, letting it infer the dtype of each column
df = pl.read_csv('https://raw.githubusercontent.com/kennethleungty/Educative-Advanced-Pandas/main/data/csv/online_retail_dataset.csv',
                  ignore_errors=True # Keep reading past parse errors for now, since dtypes are not specified yet
                )

# View output
df.head()

| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | str | f64 | i64 | str |
| 536365 | "85123A" | "WHITE HANGING …" | 6 | "1/12/2010 8:26…" | 2.55 | 17850 | "United Kingdom…" |
| 536365 | "71053" | "WHITE METAL LA…" | 6 | "1/12/2010 8:26…" | 3.39 | 17850 | "United Kingdom…" |
| 536365 | "84406B" | "CREAM CUPID HE…" | 8 | "1/12/2010 8:26…" | 2.75 | 17850 | "United Kingdom…" |
| 536365 | "84029G" | "KNITTED UNION …" | 6 | "1/12/2010 8:26…" | 3.39 | 17850 | "United Kingdom…" |
| 536365 | "84029E" | "RED WOOLLY HOT…" | 6 | "1/12/2010 8:26…" | 3.39 | 17850 | "United Kingdom…" |


`polars` uses a strict schema: the data type of every column should be known before a query runs. Because we did not specify `dtypes` in the call above, `polars` inferred them automatically, and the row beneath the column names in the output shows the inferred `dtype` of each column.

To set the data types explicitly so that `polars` can work optimally, we can pass them to the `dtypes` parameter of `pl.read_csv`, as illustrated below:

In [10]:
# Define the data types in the same order as the eight columns:
# InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
# InvoiceDate stays as text here because values such as "1/12/2010 8:26" are not in ISO 8601 form
dtypes_list = [pl.Int64, pl.Utf8, pl.Utf8, pl.Int64, pl.Utf8,
               pl.Float64, pl.Int64, pl.Categorical]

# Read CSV in polars with dtypes specified
df = pl.read_csv('https://raw.githubusercontent.com/kennethleungty/Educative-Advanced-Pandas/main/data/csv/online_retail_dataset.csv',
                  dtypes=dtypes_list,
                  ignore_errors=True # Keep reading if a value does not fit its declared dtype
                )

# View output
df.head()

> **Note**: The complete list of `polars` data types can be found [here](https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html)