# Polars & DuckDB: DataFrames and SQL For Python Without Pandas
--------------------------

__[1. Introduction](#first-bullet)__

__[2. Getting Set Up On AWS with Docker](#second-bullet)__

__[3. Intro To Polars DataFrames](#third-bullet)__

__[4. DuckDB To The Rescue For SQL](#fourth-bullet)__

__[5. Conclusions](#fifth-bullet)__


## Introduction <a class="anchor" id="first-bullet"></a>
------

There is a plethora of dataframe alternatives to [Pandas](https://pandas.pydata.org/) due to its [limitations](https://insightsndata.com/what-are-the-limitations-of-pandas-35d462990c43). Even the original author, Wes McKinney, wrote a blog post about [10 Things I Hate About Pandas](https://wesmckinney.com/blog/apache-arrow-pandas-internals/).

My biggest complaints about Pandas are:

1. Memory usage
2. Limited multi-core algorithms
3. No ability to execute SQL statements (like [SparkSQL & DataFrame](https://spark.apache.org/sql/))
4. No query planning/lazy-execution
5. [NULL values only exist for floats not ints](https://pandas.pydata.org/docs/user_guide/integer_na.html) (this changed in Pandas 1.0+)
6. Using [strings is inefficient](https://pandas.pydata.org/docs/user_guide/text.html) (this too changed in Pandas 1.0+)
    
Many of these have been addressed by the [Pandas 2.0 release](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html). Over the years there have been many replacements for Pandas that, in my opinion, have failed to gain traction. And while there has been a steady march towards replacing the [NumPy](https://numpy.org/) backend with [Apache Arrow](https://arrow.apache.org/), I still feel the lack of SQL and the overall API design are major weaknesses.

For context, I have been using [Apache Spark](https://spark.apache.org/) since 2017 and love it not just from a performance point of view, but for how well the API is designed. The syntax makes sense coming from a SQL user's perspective. If I want to group by a column and count in SQL or on a Spark DataFrame, I get what I expect either way. For instance, using this data set from [NYC Open Data on Motor Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95), a groupby-count expression in Pandas gives me:

In [1]:
import pandas as pd
pd_df = pd.read_csv("https://data.cityofnewyork.us/resource/h9gi-nx95.csv")
pd_df.groupby("borough").count()

borough,crash_date,crash_time,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
BRONX,107,107,107,107,107,107,59,59,48,107,...,81,5,0,0,107,106,65,4,0,0
BROOKLYN,247,247,247,245,245,245,155,155,92,247,...,192,24,7,2,247,242,157,22,7,2
MANHATTAN,98,98,98,96,96,96,52,52,46,98,...,65,6,1,1,98,96,57,5,1,0
QUEENS,154,154,153,150,150,150,98,98,56,154,...,120,9,2,0,154,154,97,7,2,0
STATEN ISLAND,27,27,27,26,26,26,18,18,9,27,...,21,2,2,1,27,27,19,2,2,1


Notice this is the number of non-null values in every column. Not exactly what I wanted.

To get what I want I have to use the syntax:

In [2]:
pd_df.groupby("borough").size() # or pd_df["borough"].value_counts()

borough
BRONX            107
BROOKLYN         247
MANHATTAN         98
QUEENS           154
STATEN ISLAND     27
dtype: int64

But this returns a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). It seems like a trivial difference, but counting duplicates in a column is easy in Spark because we can use method chaining. To do the equivalent in Pandas I have to convert back to a dataframe:

In [3]:
pd_df.groupby("borough").size().to_frame("counts").reset_index().query("counts > 0")

,borough,counts
0,BRONX,107
1,BROOKLYN,247
2,MANHATTAN,98
3,QUEENS,154
4,STATEN ISLAND,27
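
The difference between `count()` and `size()` is easier to see on a tiny frame. The sketch below uses made-up toy data (not the NYC dataset) with one missing `zip_code`, and assumes only that Pandas is installed:

```python
import pandas as pd

# Toy data (hypothetical values): one BRONX row has a missing zip_code.
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "QUEENS"],
    "zip_code": [10451.0, None, 11101.0],
})

# count() reports the number of non-null values per remaining column.
non_null = toy.groupby("borough").count()

# size() reports the number of rows per group and returns a Series.
rows = toy.groupby("borough").size()

# Converting the Series back to a dataframe allows further chaining.
counts_df = rows.to_frame("counts").reset_index().query("counts > 1")
print(counts_df)
```

Here `non_null` reports one non-null `zip_code` for BRONX while `rows` reports two BRONX rows, which is the gap between the two outputs above.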


For years I have been using Spark for large datasets, but for smaller ones sticking with Pandas. Recently though, I heard lots of hype about [Polars](https://www.pola.rs/) and [DuckDB](https://duckdb.org/), decided to try them myself, and was immediately impressed.

In this blog post I go over my first interactions with both libraries and call out things I like and don't like, but first let's get set up to run this notebook on an AWS EC2 instance using [Docker](https://www.docker.com/).

## Getting Set Up On AWS with Docker <a class="anchor" id="second-bullet"></a>

I have mostly used [Google Cloud](https://cloud.google.com/) for my prior personal projects, but for this project I wanted to use [Amazon Web Services](https://aws.com/). The first thing I did was create an [Elastic Compute Cloud (EC2) instance](https://aws.amazon.com/ec2/). I created it from the console using a `t2.medium` by signing on to [aws.com](https://aws.com/), clicking on EC2, scrolling down and clicking the orange `Launch instance` button:

![images/launch.png](images/launch.png)

I had to make sure I created a `keypair` file called "mikeskey.pem" that I downloaded.

![images/keypair.png](images/keypair.png)

Notice that in the security group I allowed SSH traffic from "Anywhere". Once I launched it I could see the instance running and clicked on `Instance ID` as shown below:

![images/instance.png](images/instance.png)

and clicked on the pop-up choice of `Connect`. This took me to another page where I got the command at the bottom to SSH onto my machine using the keypair:

![images/connect.png](images/connect.png)

I opened a terminal from my MacBook and ran:

    ssh -i <path-to-key>/mikeskey.pem ec2-user@<dns-address>.compute-1.amazonaws.com

Note that I didn't create a user name so it defaulted to `ec2-user`.

Next I set up git SSH keys so I could develop on the instance as described [here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent) and cloned the repo. I then set up Docker as discussed [here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-container-image.html). I then built the image and called it `polars_nb` with the command:

    sudo docker build -t polars_nb . 

I could then start up the container from this image using port forwarding and loading the current directory as the volume:

    sudo docker run -ip 8888:8888 -v `pwd`:/home/jovyan/ -t polars_nb

This shows a link that I can copy and paste into my web browser, but at this stage it won't work. Since Jupyter is running on a remote EC2 server I need to set up [SSH tunneling](https://linuxize.com/post/how-to-setup-ssh-tunneling/) as described [here](https://towardsdatascience.com/setting-up-and-using-jupyter-notebooks-on-aws-61a9648db6c5). I can do this by opening a new terminal on my Mac and running the command:

    ssh -i <path-to-key>/mikeskey.pem -L 8888:localhost:8888 ec2-user@<dns-address>.compute-1.amazonaws.com

Now I can reload the notebook address from before and voilà, it works!

## Intro To Polars DataFrames <a class="anchor" id="third-bullet"></a>

Now that we're set up with a notebook we can start to discuss [Polars](https://www.pola.rs/) dataframes. The Polars library is written in Rust with Python bindings. Polars uses multi-core processing, making it fast, and the authors smartly used [Apache Arrow](https://arrow.apache.org/), making it efficient for cross-language in-memory dataframes as there is no serialization between Rust and Python.

We can import Polars and read in a dataset from [NYC Open Data on Motor Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) using the [read_csv](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html) function:

In [4]:
import polars as pl
df = pl.read_csv("https://data.cityofnewyork.us/resource/h9gi-nx95.csv")
df.head(2)

crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
str,str,str,i64,f64,f64,str,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,str,i64,str,str,str,str,str
"""2021-09-11T00:…","""2:39""",,,,,,"""WHITESTONE EXP…","""20 AVENUE""",,2,0,0,0,0,0,2,0,"""Aggressive Dri…","""Unspecified""",,,,4455765,"""Sedan""","""Sedan""",,,
"""2022-03-26T00:…","""11:45""",,,,,,"""QUEENSBORO BRI…",,,1,0,0,0,0,0,1,0,"""Pavement Slipp…",,,,,4513547,"""Sedan""",,,,


The initial reading of CSVs is the same as in Pandas, and the [head](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.head.html) dataframe method returns the top `n` rows as Pandas does. In addition, the shape of the dataframe is shown, as well as the datatypes of the columns. I can get the names and datatypes of each column using the [schema](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.schema.html) attribute, similar to Spark:

In [5]:
df.schema

{'crash_date': Utf8,
 'crash_time': Utf8,
 'borough': Utf8,
 'zip_code': Int64,
 'latitude': Float64,
 'longitude': Float64,
 'location': Utf8,
 'on_street_name': Utf8,
 'off_street_name': Utf8,
 'cross_street_name': Utf8,
 'number_of_persons_injured': Int64,
 'number_of_persons_killed': Int64,
 'number_of_pedestrians_injured': Int64,
 'number_of_pedestrians_killed': Int64,
 'number_of_cyclist_injured': Int64,
 'number_of_cyclist_killed': Int64,
 'number_of_motorist_injured': Int64,
 'number_of_motorist_killed': Int64,
 'contributing_factor_vehicle_1': Utf8,
 'contributing_factor_vehicle_2': Utf8,
 'contributing_factor_vehicle_3': Utf8,
 'contributing_factor_vehicle_4': Utf8,
 'contributing_factor_vehicle_5': Utf8,
 'collision_id': Int64,
 'vehicle_type_code1': Utf8,
 'vehicle_type_code2': Utf8,
 'vehicle_type_code_3': Utf8,
 'vehicle_type_code_4': Utf8,
 'vehicle_type_code_5': Utf8}

We can see that the datatypes of Polars are built on top of [Arrow's datatypes](https://arrow.apache.org/docs/python/api/datatypes.html) which is great.

The first command I tried with Polars was looking for duplicates in the dataframe. I found I could do this with the syntax:

In [6]:
test = (df.groupby("collision_id")
           .count()
           .filter(pl.col("count") > 1))

test

collision_id,count
i64,u32


Right away from the syntax I was onboard. Then I saw statements return a dataframe!

In [7]:
type(test)

polars.dataframe.frame.DataFrame

This is exactly what I want! I don't want a series! You can even print the dataframes:

In [8]:
print(test)

shape: (0, 2)
┌──────────────┬───────┐
│ collision_id ┆ count │
│ ---          ┆ ---   │
│ i64          ┆ u32   │
╞══════════════╪═══════╡
└──────────────┴───────┘


This turns out to be helpful when you have lazy execution (which I'll go over later). The next thing I tried was to access a column of the dataframe by using the dot operator:

In [9]:
df.crash_date

AttributeError: 'DataFrame' object has no attribute 'crash_date'

I was actually happy to see this as to me, a column in a dataframe should not be accessed this way. Instead we can access it like a dictionary's key:

In [10]:
df["crash_date"].is_null().any()

False

The crash dates are strings that I want to convert to a datetime type. I can see the format of the string:

In [11]:
df['crash_date'][0] # the .loc method doesn't exist!

'2021-09-11T00:00:00.000'

Now I can extract the year-month-day from the string and assign it to a new column called `crash_date_str`. Note the syntax to create a new column is the [with_columns](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.with_columns.html) method (similar to [withColumn](https://sparkbyexamples.com/pyspark/pyspark-withcolumn/)), and I have to use the [col](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.col.html) function, similar to Spark! I can get the first 10 characters using `str` methods similar to Pandas. Finally, I name the new column `crash_date_str` using the [alias](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.alias.html) function, like Spark! The default is to give the new column the same name as the old column. The second `with_columns` call below parses the `crash_date_str` column into a Polars datetime and names it `crash_date`, replacing the original string column. The results are below:

In [12]:
df = df.with_columns(
            pl.col("crash_date").str.slice(0, length=10).alias("crash_date_str")
      ).with_columns(
            pl.col("crash_date_str").str.strptime(
                pl.Datetime, "%Y-%m-%d", strict=False).alias("crash_date")
)

df.head()

crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5,crash_date_str
datetime[μs],str,str,i64,f64,f64,str,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,str,i64,str,str,str,str,str,str
2021-09-11 00:00:00,"""2:39""",,,,,,"""WHITESTONE EXP…","""20 AVENUE""",,2,0,0,0,0,0,2,0,"""Aggressive Dri…","""Unspecified""",,,,4455765,"""Sedan""","""Sedan""",,,,"""2021-09-11"""
2022-03-26 00:00:00,"""11:45""",,,,,,"""QUEENSBORO BRI…",,,1,0,0,0,0,0,1,0,"""Pavement Slipp…",,,,,4513547,"""Sedan""",,,,,"""2022-03-26"""
2022-06-29 00:00:00,"""6:55""",,,,,,"""THROGS NECK BR…",,,0,0,0,0,0,0,0,0,"""Following Too …","""Unspecified""",,,,4541903,"""Sedan""","""Pick-up Truck""",,,,"""2022-06-29"""
2021-09-11 00:00:00,"""9:35""","""BROOKLYN""",11208.0,40.667202,-73.8665,""" , (40.66720…",,,"""1211 LORI…",0,0,0,0,0,0,0,0,"""Unspecified""",,,,,4456314,"""Sedan""",,,,,"""2021-09-11"""
2021-12-14 00:00:00,"""8:13""","""BROOKLYN""",11233.0,40.683304,-73.917274,""" , (40.68330…","""SARATOGA AVENU…","""DECATUR STREET…",,0,0,0,0,0,0,0,0,,,,,,4486609,,,,,,"""2021-12-14"""


Notice the [col](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.col.html) function in Polars lets me access derived columns that are not in the original dataframe. In Pandas, to do the same operation I would have to use a lambda function within an assign function:

    df.assign(crash_date=lambda d: pd.to_datetime(d["crash_date_str"]))
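
A runnable sketch of that Pandas pattern on a toy frame follows. The two rows are made up but use the same timestamp format as the dataset, and parsing is done with `pd.to_datetime`. Each lambda receives the intermediate dataframe as its argument, which is what lets the second `assign` see the column created by the first:

```python
import pandas as pd

# Toy frame (hypothetical rows) in the same string format as crash_date.
toy = pd.DataFrame({
    "crash_date": ["2021-09-11T00:00:00.000", "2022-03-26T00:00:00.000"],
})

parsed = (
    toy
    # Keep the first 10 characters, i.e. the YYYY-MM-DD part.
    .assign(crash_date_str=lambda d: d["crash_date"].str.slice(0, 10))
    # Parse the derived column and overwrite crash_date with datetimes.
    .assign(crash_date=lambda d: pd.to_datetime(d["crash_date_str"], format="%Y-%m-%d"))
)
print(parsed.dtypes)
```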

I can see the number of crashes in each borough of NYC with the query:

In [13]:
df.groupby("borough").count()

borough,count
str,u32
"""BROOKLYN""",247
"""MANHATTAN""",98
,367
"""QUEENS""",154
"""STATEN ISLAND""",27
"""BRONX""",107


There is a borough value of NULL. I can filter this out with the command:

In [14]:
nn_df = df.filter(pl.col("borough").is_not_null())

Now I can get just the unique values of non-null boroughs with the query: 

In [15]:
df.filter(pl.col("borough").is_not_null()).select("borough").unique()

borough
str
"""BRONX"""
"""BROOKLYN"""
"""QUEENS"""
"""MANHATTAN"""
"""STATEN ISLAND"""


Notice that I can use the [select](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.select.html) method in Polars. This is actually pretty powerful, as I can select columns and run queries on them similar to [selectExpr](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.selectExpr.html) in Spark:

In [16]:
(df.filter(pl.col("borough").is_not_null())
   .select([
       "borough", 
       (pl.col("number_of_persons_injured")  + 1).alias("number_of_persons_injured_plus1")
    ]).head()
)

borough,number_of_persons_injured_plus1
str,i64
"""BROOKLYN""",1
"""BROOKLYN""",1
"""BRONX""",3
"""BROOKLYN""",1
"""MANHATTAN""",1


Doing the above in Pandas in one command is a little convoluted:

In [17]:
(pd_df[~pd_df["borough"].isnull()]
      .assign(number_of_persons_injured_plus1=pd_df["number_of_persons_injured"] + 1)
      [["borough", "number_of_persons_injured_plus1"]]
      .head()
)

,borough,number_of_persons_injured_plus1
3,BROOKLYN,1
4,BROOKLYN,1
7,BRONX,3
8,BROOKLYN,1
9,MANHATTAN,1


To me, the Polars query is so much easier to read. And it's actually more efficient. The Pandas version transforms the whole dataset, then subsets the columns to return just two. Polars, on the other hand, subsets the columns first and then transforms just those two columns.

Now I can create a dataframe the exact same way as in Pandas:

In [18]:
borough_df = pl.DataFrame({
                "borough": ["BROOKLYN", "BRONX", "MANHATTAN", "STATEN ISLAND", "QUEENS"],
                "population": [2590516, 1379946, 1596273, 2278029, 378977],
                "area":[179.7, 109.2, 58.68, 281.6, 149.0]
})

print(borough_df)

shape: (5, 3)
┌───────────────┬────────────┬───────┐
│ borough       ┆ population ┆ area  │
│ ---           ┆ ---        ┆ ---   │
│ str           ┆ i64        ┆ f64   │
╞═══════════════╪════════════╪═══════╡
│ BROOKLYN      ┆ 2590516    ┆ 179.7 │
│ BRONX         ┆ 1379946    ┆ 109.2 │
│ MANHATTAN     ┆ 1596273    ┆ 58.68 │
│ STATEN ISLAND ┆ 2278029    ┆ 281.6 │
│ QUEENS        ┆ 378977     ┆ 149.0 │
└───────────────┴────────────┴───────┘


Now let's go over a more complicated query. I can get the total number of injuries per borough from the crashes dataframe, join that to the borough dataframe above to get the injuries per population, and sort the result by borough name:

In [19]:
(df.filter(pl.col("borough").is_not_null())
   .select(["borough", "number_of_persons_injured"])
   .groupby("borough")
   .sum()
   .join(borough_df, on=["borough"])
   .select([
       "borough", 
       (pl.col("number_of_persons_injured") / pl.col("population")).alias("injuries_per_population")
   ])
   .sort(pl.col("borough"))
)

borough,injuries_per_population
str,f64
"""BRONX""",3.3e-05
"""BROOKLYN""",4.5e-05
"""MANHATTAN""",2.5e-05
"""QUEENS""",0.000193
"""STATEN ISLAND""",7e-06


This is really cool as it's very easy to use method chaining and reads pretty close to SQL! Doing the same thing in the Pandas API would be an awkward mess.
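
For comparison, here is a rough sketch of how the same chain could look in Pandas. It runs on small made-up tables (the populations are placeholders, not real figures) and is just one plausible translation of the Polars query:

```python
import pandas as pd

# Hypothetical toy inputs standing in for the crash and borough tables.
crashes = pd.DataFrame({
    "borough": ["BRONX", "BRONX", None, "QUEENS"],
    "number_of_persons_injured": [1, 2, 5, 4],
})
boroughs = pd.DataFrame({
    "borough": ["BRONX", "QUEENS"],
    "population": [1000, 2000],
})

per_capita = (
    crashes[crashes["borough"].notna()]                       # drop NULL boroughs
    .groupby("borough", as_index=False)["number_of_persons_injured"]
    .sum()                                                     # injuries per borough
    .merge(boroughs, on="borough")                             # inner join
    .assign(injuries_per_population=lambda d:
            d["number_of_persons_injured"] / d["population"])
    [["borough", "injuries_per_population"]]
    .sort_values("borough")
    .reset_index(drop=True)
)
print(per_capita)
```

The row filter, the `as_index=False` flag, the lambda inside `assign` and the trailing `reset_index` are the parts that have no counterpart in the Polars chain.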

Which brings me to something that was super exciting to see in Polars: [SQLContext](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.SQLContext.execute.html). A SQLContext registers a Polars dataframe as a table that SQL can then be run against, as shown below:

In [20]:
ctx = pl.SQLContext(crashes=df)

Now I can write a query for the total number of injuries in each borough per day and execute it!

In [29]:
daily_df = ctx.execute("""
    SELECT
        borough,
        crash_date AS day,
        SUM(number_of_persons_injured)
    FROM 
        crashes
    WHERE 
        borough IS NOT NULL
    GROUP BY 
        borough, crash_date
    ORDER BY 
        borough, day
""")

daily_df.collect().head()

borough,day,number_of_persons_injured
str,datetime[μs],i64
"""BRONX""",2021-02-26 00:00:00,0
"""BRONX""",2021-04-06 00:00:00,0
"""BRONX""",2021-04-08 00:00:00,0
"""BRONX""",2021-04-10 00:00:00,4
"""BRONX""",2021-04-11 00:00:00,0


Notice I had to use the `collect()` method to get the results; that's because by default the SQLContext uses lazy execution.
You can see this since printing the result actually prints the query plan:

In [30]:
print(daily_df)

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

SORT BY [col("borough"), col("day")]
   SELECT [col("borough"), col("crash_date").alias("day"), col("number_of_persons_injured")] FROM
    AGGREGATE
    	[col("number_of_persons_injured").sum()] BY [col("borough"), col("crash_date")] FROM
      FILTER col("borough").is_not_null() FROM
      DF ["crash_date", "crash_time", "borough", "zip_code"]; PROJECT */30 COLUMNS; SELECTION: "None"


To get back a Polars dataframe I would have to use the `eager=True` parameter in the execute method.

Now I can register this new dataframe as a table called `daily_crashes` in the SQLContext:

In [31]:
ctx = ctx.register("daily_crashes", daily_df)

I can see the tables in the context using the command:

In [32]:
ctx.tables()

['crashes', 'daily_crashes']

Now say I want to get the current day's number of injured people and the prior day's; I could use the `LAG` function in SQL to do so:

In [48]:
ctx.execute("""
    SELECT
        borough,
        day,
        number_of_persons_injured,
        LAG(1,number_of_persons_injured) 
            OVER (
            PARTITION BY borough 
            ORDER BY day ASC
            ) AS prior_day_injured
FROM
    daily_crashes
ORDER BY 
    borough,
    day DESC
""", eager=True)

InvalidOperationError: unsupported SQL function: lag

I finally hit a snag in Polars: there doesn't seem to be a lot of support for window functions in its SQL interface. This was disappointing since the library was so promising!

Luckily there is another library that supports blazingly fast SQL queries and integrates with Polars (and Pandas) directly: DuckDB.

## DuckDB To The Rescue For SQL <a class="anchor" id="fourth-bullet"></a>

I heard about [DuckDB](https://duckdb.org/) when I saw someone star it on GitHub and thought it was "Yet Another SQL Engine". While DuckDB is a SQL engine, it is much more! First, it's a parallel query processing library written in C++. From the website:

        DuckDB is designed to support analytical query workloads, also known as Online analytical processing (OLAP). These workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables.
        ...
        DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) are processed in one operation.

In other words, DuckDB can be used for fast query execution across large datasets. It can also query the `daily_df` variable from the Python session directly by name in the `FROM` clause:

In [49]:
import duckdb

query = duckdb.sql("""
    SELECT
        borough,
        day,
        number_of_persons_injured,
        LAG(1, number_of_persons_injured) 
            OVER (
                PARTITION BY borough 
                ORDER BY day ASC
                ) as prior_day_injured
FROM
    daily_df
ORDER BY 
    borough,
    day DESC
LIMIT 5
""")

Now we can see the output of the query:

In [50]:
query

┌─────────┬─────────────────────┬───────────────────────────┬───────────────────┐
│ borough │         day         │ number_of_persons_injured │ prior_day_injured │
│ varchar │      timestamp      │           int64           │       int32       │
├─────────┼─────────────────────┼───────────────────────────┼───────────────────┤
│ BRONX   │ 2022-04-24 00:00:00 │                         0 │                 1 │
│ BRONX   │ 2022-03-26 00:00:00 │                         7 │                 1 │
│ BRONX   │ 2022-03-25 00:00:00 │                         1 │                 1 │
│ BRONX   │ 2022-03-24 00:00:00 │                         1 │                 1 │
│ BRONX   │ 2022-03-22 00:00:00 │                         1 │                 1 │
└─────────┴─────────────────────┴───────────────────────────┴───────────────────┘

Then we can return the result as a Polars dataframe:

In [51]:
day_prior_df = query.pl()
day_prior_df.head(5)

borough,day,number_of_persons_injured,prior_day_injured
str,datetime[μs],i64,i32
"""BRONX""",2022-04-24 00:00:00,0,1
"""BRONX""",2022-03-26 00:00:00,7,1
"""BRONX""",2022-03-25 00:00:00,1,1
"""BRONX""",2022-03-24 00:00:00,1,1
"""BRONX""",2022-03-22 00:00:00,1,1


Now we can see another cool part of DuckDB: you can execute SQL directly on local files!

First we save the daily crash dataframe as a parquet file, remembering that it's a lazy dataframe:

In [52]:
daily_df

It turns out you can't write a lazy dataframe to parquet with `write_parquet`. So first we'll collect it and then write it to parquet:

In [54]:
daily_df.collect().write_parquet("daily_crashes.parquet")

[Apache Parquet](https://parquet.apache.org/) is a compressed, columnar file format designed for analytical queries. Column-based formats are particularly good for OLAP queries since entire columns can be read contiguously and have aggregations performed on them. The datatypes for each column are known, allowing for compression. Since the columns and datatypes are stored in the file, we can read the schema with the following query:

In [65]:
duckdb.sql("SELECT * FROM parquet_schema('daily_crashes.parquet')").pl()

file_name,name,type,type_length,repetition_type,num_children,converted_type,scale,precision,field_id,logical_type
str,str,str,str,str,i64,str,i64,i64,i64,str
"""daily_crashes.…","""root""",,,,3.0,,,,,
"""daily_crashes.…","""borough""","""BYTE_ARRAY""",,"""OPTIONAL""",,"""UTF8""",,,,"""StringType()"""
"""daily_crashes.…","""day""","""INT64""",,"""OPTIONAL""",,,,,,"""TimestampType(…"
"""daily_crashes.…","""number_of_pers…","""INT64""",,"""OPTIONAL""",,,,,,


Then we can perform queries on the actual files without having to resort to dataframes at all:

In [66]:
query = duckdb.sql("""
    SELECT
        borough,
        day,
        number_of_persons_injured,
        SUM(number_of_persons_injured) 
            OVER (
                PARTITION BY borough 
                ORDER BY day ASC
                ) AS cumulative_injuried
    FROM 
        read_parquet('daily_crashes.parquet')
    ORDER BY
        borough,
        day ASC
""")

In [73]:
query.pl().head(10)

borough,day,number_of_persons_injured,cumulative_injuried
str,datetime[μs],i64,f64
"""BRONX""",2021-02-26 00:00:00,0,0.0
"""BRONX""",2021-04-06 00:00:00,0,0.0
"""BRONX""",2021-04-08 00:00:00,0,0.0
"""BRONX""",2021-04-10 00:00:00,4,4.0
"""BRONX""",2021-04-11 00:00:00,0,4.0
"""BRONX""",2021-04-12 00:00:00,0,4.0
"""BRONX""",2021-04-13 00:00:00,3,7.0
"""BRONX""",2021-04-14 00:00:00,3,10.0
"""BRONX""",2021-04-15 00:00:00,4,14.0
"""BRONX""",2021-04-16 00:00:00,6,20.0


Pretty cool!!!

## Conclusions <a class="anchor" id="fifth-bullet"></a>

In this post I quickly covered what I view as the limitations of the Pandas library. Next I covered how to get set up with JupyterLab using [Docker](https://www.docker.com/) on [AWS](https://aws.amazon.com/) and covered some basics of [Polars](https://www.pola.rs/), [DuckDB](https://duckdb.org/) and how to use the two in combination. The benefits of Polars are that:

* It allows for fast parallel querying on dataframes.
* It uses Apache Arrow for backend datatypes, making it memory efficient.
* It has both lazy and eager execution modes.
* It allows for SQL queries directly on dataframes.
* Its API is similar to Spark's API and allows for highly readable queries using method chaining.

I am still new to both libraries, but I'm looking forward to learning more about them.

Hope you enjoyed reading this!