### Testing polars

There are multiple ways to have polars load and save data from S3. This section tests several of them to decide which approach to use in the Datalake class defined above.
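
The cells in this section time each approach with a single `time.time()` delta, which can be noisy for calls that take well under a second. As a rough sketch (the helper name `time_call` and the `repeats` parameter are hypothetical and not part of the notebook), repeating a call and taking the median might give a steadier comparison:

```python
import time
from statistics import median


def time_call(fn, repeats=5):
    """Run fn() `repeats` times; return (last_result, median_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()  # monotonic clock, better suited to short intervals
        result = fn()
        durations.append(time.perf_counter() - t0)
    return result, median(durations)


# Stand-in workload; in the notebook this could wrap one of the read approaches.
result, seconds = time_call(lambda: sum(range(1000)))
print(result, seconds)
```

Note that repeated S3 reads may benefit from caching, so a median over several runs is still only an indication.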

In [None]:
import time
from io import BytesIO

import polars as pl
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import s3fs

In [None]:
fs = s3fs.S3FileSystem()
bucket = lake.bucket

#### Reading parquet file I

Reads the whole file eagerly with `pq.ParquetDataset`. This approach doesn't work with lazy scanning.

In [None]:
path = 'test/example1/data.parquet'

t0 = time.time()

dataset = pq.ParquetDataset(f"s3://{bucket}/{path}", filesystem=fs)
df = pl.from_arrow(dataset.read())
print(time.time() - t0)

0.45589590072631836


In [None]:
df

col1,col2
i64,i64
1,3
2,4


#### Reading parquet file II

Using a pyarrow dataset with the format specified explicitly, which allows a lazy scan.

In [None]:
t0 = time.time()
dataset2 = ds.dataset(f"s3://{bucket}/{path}", filesystem=fs, format='parquet')
df_parquet = pl.scan_pyarrow_dataset(dataset2)

print(df_parquet.collect().head())
print(time.time() - t0)

shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 3    │
│ 2    ┆ 4    │
└──────┴──────┘
0.3325178623199463


#### Reading parquet file III

Using boto3 `get_object`.

This doesn't allow lazy scanning, but the same approach works for CSV and JSON files too. It also appeared to be quicker than the two approaches above in these single-run timings.

This is the first choice and an easy switch.

In [None]:
t0 = time.time()

obj = lake.get_object('test/example2/data.parquet')
df = pl.read_parquet(BytesIO(obj.read()))

print(time.time() - t0)

0.2166750431060791


#### Writing parquet file

In [None]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pl.DataFrame(data=d)
df

col1,col2
i64,i64
1,3
2,4


In [None]:
t0 = time.time()
with fs.open(f'{bucket}/test/example3/data.parquet', mode='wb') as f:
    df.write_parquet(f)

print(time.time() - t0)

0.09860610961914062


In [None]:
path = 'test/example3/data.parquet'
dataset2 = ds.dataset(f"s3://{bucket}/{path}", filesystem=fs, format='parquet')
df_parquet = pl.scan_pyarrow_dataset(dataset2)
print(df_parquet.collect().head())

shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 3    │
│ 2    ┆ 4    │
└──────┴──────┘
