## `duckdb` Window Operation

`duckdb`'s window operations do not support out-of-core processing and are susceptible to out-of-memory (OOM) errors. Moreover, it appears that results remain cached in memory after the query has finished.


In [7]:
import duckdb
import psutil

In [8]:
!du -h data/perf.parquet

24G	data/perf.parquet


In [9]:
%load_ext memory_profiler

def print_memory_util():
    """Print the share of system memory that is currently available."""
    vm = psutil.virtual_memory()  # take a single snapshot so both figures agree
    percent_available = vm.available / vm.total
    print(f" Available {percent_available*100:0.0f}% of total  {vm.total/float(1<<30):,.0f} GB of memory")

In [11]:
db = duckdb.connect("test.db")

In [15]:
db.execute("select count(*) as total_rows from 'data/perf.parquet';").fetchdf()

   total_rows
0  1890353680


In [16]:
db.execute("select count(distinct loan_id) as cardinality from 'data/perf.parquet';").fetchdf()

   cardinality
0     37015214
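
For scale, the two counts above imply an average partition size for any window partitioned by `loan_id`. A quick calculation, using only the numbers reported by the two queries (the variable names are just for illustration):

```python
# Figures copied from the two query results above.
total_rows = 1_890_353_680
distinct_loans = 37_015_214

# Average number of rows per loan_id, i.e. per window partition.
avg_rows_per_loan = total_rows / distinct_loans
print(f"{avg_rows_per_loan:.1f} rows per loan_id on average")
```

This comes to about 51 rows per `loan_id`, so on average the data consists of roughly 37 million small partitions. The average says nothing about how skewed the individual partition sizes are.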


The following query finishes, but the result of the subquery appears to stay cached in RAM, since the memory is not freed afterwards.

In [31]:
sql = """ select count(*) from (select *, RANK() OVER (PARTITION BY loan_id) as age from 'data/perf.parquet'); """

The following query throws an **OOM** error:

```sql
SELECT count(*)
FROM (
    SELECT *
        ,RANK() OVER (
            PARTITION BY loan_id ORDER BY monthly_reporting_period
            ) AS age
    FROM 'data/perf.parquet'
    );
```
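
One thing that may be worth trying before rerunning the failing query is to cap DuckDB's memory and point it at a spill directory. The sketch below issues DuckDB's `memory_limit` and `temp_directory` settings. The helper name `configure_limits`, the `64GB` value, and the `duckdb_tmp` path are assumptions chosen for illustration. Given the behaviour described in this section, the window operator may not spill at all, in which case the cap would only turn an uncontrolled OOM into an earlier error.

```python
def configure_limits(con, memory_limit="64GB", temp_directory="duckdb_tmp"):
    """Apply a memory cap and a spill directory to a DuckDB connection.

    `con` is any object with an `execute(sql)` method, such as the
    notebook's `db`. The default values are illustrative assumptions.
    """
    statements = [
        f"SET memory_limit='{memory_limit}'",
        f"SET temp_directory='{temp_directory}'",
    ]
    for statement in statements:
        con.execute(statement)
    return statements
```

In the notebook this would be called as `configure_limits(db)` before executing the query.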

A query without the `ORDER BY` clause finishes, but ends up using about 85 GB of RAM:

In [10]:
print_memory_util()

 Available 95% of total  126 GB of memory


In [32]:
%%memit
db.execute(""" select count(*) from (select *, RANK() OVER (PARTITION BY loan_id) as age from 'data/perf.parquet'); """).fetchdf()

peak memory: 87494.90 MiB, increment: 85713.16 MiB
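
`memit` reports mebibytes, so a quick conversion helps compare its output with the memory figures quoted elsewhere in this section. A minimal sketch using the two numbers from the output above (the helper name `mib_to_gib` is just for illustration):

```python
def mib_to_gib(mib: float) -> float:
    """Convert mebibytes to gibibytes (1 GiB = 1024 MiB)."""
    return mib / 1024


peak_gib = mib_to_gib(87494.90)       # peak memory reported by memit
increment_gib = mib_to_gib(85713.16)  # increment reported by memit
print(f"peak: {peak_gib:.1f} GiB, increment: {increment_gib:.1f} GiB")
```

The peak is about 85.4 GiB and the increment about 83.7 GiB, which matches the roughly 85 GB figure quoted for this query.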


Moreover, the RAM is not freed after the query completes. It seems that the subquery result is cached in RAM and kept for subsequent operations.

In [33]:
print_memory_util()

 Available 28% of total  126 GB of memory
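
The drop from 95% to 28% available is roughly 0.67 × 126 ≈ 84 GB, in line with the `memit` increment. Because `psutil.virtual_memory()` is system-wide, it may help to look at the resident set size (RSS) of the notebook process itself and see whether it shrinks once the connection is released. The following is a sketch of such a check, not a confirmed diagnosis: the helper names `rss_gib` and `report_release` are hypothetical, and the `/proc/self/statm` fallback assumes Linux.

```python
import gc
import os

try:
    import psutil  # already imported in the notebook
except ImportError:
    psutil = None  # assumption: fall back to /proc on Linux


def rss_gib() -> float:
    """Resident set size of the current process in GiB."""
    if psutil is not None:
        rss_bytes = psutil.Process(os.getpid()).memory_info().rss
    else:
        # /proc/self/statm: the second field is the resident page count.
        with open("/proc/self/statm") as statm:
            rss_bytes = int(statm.read().split()[1]) * os.sysconf("SC_PAGE_SIZE")
    return rss_bytes / float(1 << 30)


def report_release(release, label="release"):
    """Call `release()`, force a GC pass, and return the RSS drop in GiB."""
    before = rss_gib()
    release()
    gc.collect()
    after = rss_gib()
    print(f"{label}: RSS {before:.2f} GiB -> {after:.2f} GiB")
    return before - after
```

In the notebook, `report_release(db.close, "db.close()")` would show whether closing the connection returns the memory. If RSS stays high even then, the memory is probably held somewhere other than the connection, for example by the allocator.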


In contrast, the following query does not materialize the subquery. Is the window operation caching its results unnecessarily?

In [5]:
%%memit
db.execute(""" select count(*) from (select *, loan_age+ 1 from 'data/perf.parquet'); """).fetchdf()

peak memory: 849.61 MiB, increment: 775.07 MiB
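
One possible explanation for the difference during execution, offered as a general property of these operators rather than a statement about DuckDB's internals: a projection such as `loan_age + 1` can be computed row by row, whereas a ranking window needs every row of a partition before it can emit a result for it. The pure-Python sketch below illustrates that contrast. It is not DuckDB's implementation, and the function names and sample rows are made up.

```python
from collections import defaultdict


def streaming_projection(rows):
    """Yield loan_age + 1 one row at a time; nothing is buffered."""
    for row in rows:
        yield {**row, "loan_age_plus_1": row["loan_age"] + 1}


def blocking_rank(rows, order_key=None):
    """Rank rows within each loan_id; every input row is buffered first."""
    partitions = defaultdict(list)
    for row in rows:  # the whole input is held in memory here
        partitions[row["loan_id"]].append(row)
    for part in partitions.values():
        if order_key is None:
            # Without ORDER BY all rows in a partition are peers: rank 1.
            for row in part:
                yield {**row, "age": 1}
            continue
        part.sort(key=lambda r: r[order_key])
        rank, previous = 0, object()
        for position, row in enumerate(part, start=1):
            if row[order_key] != previous:  # ties keep the earlier rank
                rank, previous = position, row[order_key]
            yield {**row, "age": rank}


rows = [
    {"loan_id": "A", "loan_age": 2},
    {"loan_id": "B", "loan_age": 0},
    {"loan_id": "A", "loan_age": 1},
]
print([r["loan_age_plus_1"] for r in streaming_projection(rows)])
print([(r["loan_id"], r["age"]) for r in blocking_rank(rows, "loan_age")])
```

This would account for high memory use while the window query runs, but not for memory staying allocated after it completes. Note also that under standard SQL semantics `RANK()` without `ORDER BY` treats all rows of a partition as peers, so `age` is 1 for every row in that variant of the query.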


In [18]:
explain_plan = db.execute(""" explain select count(*) from (select *, RANK() OVER (PARTITION BY loan_id) as age from 'data/perf.parquet'); """).fetchall()

In [22]:
for result in explain_plan[0]:
    print(result)

physical_plan
┌───────────────────────────┐
│      SIMPLE_AGGREGATE     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        count_star()       │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             42            │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           WINDOW          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  RANK() OVER(PARTITION BY │
│          loan_id)         │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          loan_id          │
└───────────────────────────┘                             

