# Comparing DuckDB, Polars, and Pandas

I am going to use duckdb, polars and pandas to do some analysis on some csv files and I am going to compare the performance.

In [1]:
# first we are going to import the python libraries

import pandas as pd
import polars as pl
import duckdb

## Car Price Dataset

The first csv file that I am going to use for benchmarking purposes can be found here: https://www.kaggle.com/datasets/asinow/car-price-dataset

I am specifically going to try to find the second most expensive model for each brand for every year.

In [2]:
def pandas_car_price_func():
    df = pd.read_csv('car_price_dataset.csv')
    out = df.sort_values('Price', ascending=False).groupby(['Brand','Year'], as_index=False).nth(2) [["Brand", "Model", "Year", "Price"]]

In [7]:
%%timeit -n1 -r1
pandas_car_price_func()

25.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [8]:
def polars_car_price_func():
    df = pl.read_csv('car_price_dataset.csv')
    df = df.select('Brand', 'Model', 'Year', 'Price', pl.col('Price').rank('ordinal', descending=True).over('Brand','Year').alias('price_rank'))
    df = df.filter(pl.col("price_rank") == 2).select('Brand', 'Model', 'Year', 'Price')

In [9]:
%%timeit -n1 -r1
polars_car_price_func()

6.66 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [10]:
def duckdb_car_price_func():
    duckdb.read_csv("car_price_dataset.csv")
    duckdb.sql('''
    with temp_data as (
        SELECT
            Brand,
            Model,
            Year,
            Price,
            row_number() over (partition by Brand, Year order by Price desc) as rn
        FROM 'car_price_dataset.csv'
    )
    SELECT
        Brand,
        Model,
        Year,
        Price
    FROM temp_data
    WHERE rn = 2
    ''')
    

In [11]:
%%timeit -n1 -r1
duckdb_car_price_func()

74 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


DuckDB gives the worst results and polars gives the best for this dataset

Now we are going to look at the synthetic fraud dataset and we are going to find the second most used device for fraud by country. the dataset can be found here: https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset