# Profile Formats

Before running this test, load the COVID dataset into `duck.db`:

```bash
python crawler/crawl_metadata.py --include 'dataset_5220'
python crawler/full_text_index.py
```

To start from scratch, you may want to remove the existing database with `rm duck.db` before running the commands above. Then start the API with `make api`.

### COVID dataset:

- shape: (197042, 36)
- dataframe size: 78 MB
- S3 size: 11 MB
- DuckDB size: ~24 MB

```python
import time
import pandas as pd
import requests

# Dataset URL without a format suffix; each request below appends one.
url_wo_format = 'http://127.0.0.1:8000/v1/dataset/data/backport/owid/latest/dataset_5220_covid_19__johns_hopkins_university/dataset_5220_covid_19__johns_hopkins_university'

t = time.time()
r = requests.get(url_wo_format + '.json')
assert r.ok
print(f'JSON format: {time.time() - t:.2f}s')

t = time.time()
r = requests.get(url_wo_format + '.csv')
assert r.ok
print(f'CSV format: {time.time() - t:.2f}s')

t = time.time()
r = requests.get(url_wo_format + '.feather')
assert r.ok
print(f'Feather format: {time.time() - t:.2f}s')

t = time.time()
r = requests.get(url_wo_format + '.feather_direct')
assert r.ok
print(f'Feather format (direct): {time.time() - t:.2f}s')
```

```text
JSON format: 15.94s
CSV format: 4.80s
Feather format: 0.41s
Feather format (direct): 24.24s
```
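
Each timing above comes from a single request, so it can be skewed by caching or a cold start. A minimal sketch of a repeated measurement is shown below. The helper `profile_formats` and its `fetch` and `repeats` parameters are hypothetical names introduced for illustration and are not part of the API being profiled.

```python
import time
from typing import Callable, Dict, Iterable


def profile_formats(
    base_url: str,
    suffixes: Iterable[str],
    fetch: Callable[[str], object],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each format suffix.

    `fetch` is any callable that downloads a URL (an assumption of this
    sketch); passing it in keeps the helper independent of the HTTP client.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    timings: Dict[str, float] = {}
    for suffix in suffixes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fetch(base_url + suffix)
            # Keep the fastest run to reduce the effect of one-off delays.
            best = min(best, time.perf_counter() - start)
        timings[suffix] = best
    return timings
```

With the notebook's variables, this could be called as `profile_formats(url_wo_format, ['.json', '.csv', '.feather', '.feather_direct'], lambda u: requests.get(u).raise_for_status())`. Taking the minimum over several runs is one common choice; a mean or median would also be reasonable depending on what the profile is meant to show.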


```python
# dataset info
df = pd.read_feather(url_wo_format + '.feather')
pd.Series({
    "Size in MB": df.memory_usage(deep=True).sum() / 1e6,
    "Shape": df.shape,
})
```

```text
Size in MB       73.290722
Shape         (197042, 36)
dtype: object
```