# Save to Parquet

The BACI dataset is distributed as one CSV file per year. This notebook concatenates these files into a single table and saves it as a Parquet file.

Place the folder downloaded from BACI in `raw/`. Set `hs` and `release` below to correspond to the downloaded version of BACI. The Parquet file is saved in `final/` under the same name as the downloaded folder.

In [1]:
hs = 'HS17'             # Change this!
release = '202301'      # Change this!
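
The folder layout described above can be checked before any data is read. The following is a minimal sketch, not part of the original notebook: the helper names `baci_paths` and `check_layout` and the `root` parameter are hypothetical, and the naming scheme `BACI_{hs}_V{release}` is taken from the cell that builds `folder`.

```python
from pathlib import Path


def baci_paths(hs, release, root="."):
    """Return (raw_folder, parquet_path) for one BACI version.

    Hypothetical helper: mirrors the notebook's convention that the
    download lives in raw/BACI_{hs}_V{release} and the output is
    final/BACI_{hs}_V{release}.parquet.
    """
    folder = f"BACI_{hs}_V{release}"
    root = Path(root)
    return root / "raw" / folder, root / "final" / f"{folder}.parquet"


def check_layout(hs, release, root="."):
    """Fail early if the raw folder is missing; create final/ if needed."""
    raw_folder, parquet_path = baci_paths(hs, release, root)
    if not raw_folder.is_dir():
        raise FileNotFoundError(f"Expected the BACI download at {raw_folder}")
    parquet_path.parent.mkdir(parents=True, exist_ok=True)
    return raw_folder, parquet_path


# Show where the notebook would look for input and write output
raw_folder, parquet_path = baci_paths("HS17", "202301")
print(raw_folder)
print(parquet_path)
```

Calling `check_layout(hs, release)` right after setting the two variables would surface a mistyped `hs` or `release` before the slower CSV loading step starts.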

In [2]:
import pandas as pd
import os
import re
import duckdb

In [4]:
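folder = f'BACI_{hs}_V{release}'
# Sorted list of the yearly CSV files in the downloaded folder
filelist = sorted(
    file for file in os.listdir(f'raw/{folder}') if file.startswith('BACI')
)

# Read each yearly file, then concatenate once at the end
# (concatenating inside the loop re-copies the growing table every iteration)
frames = []
for file in filelist:
    year = re.search(r'[0-9]{4}', file).group()
    frames.append(pd.read_csv(f'raw/{folder}/{file}'))
    print(f'{year} done')

df = pd.concat(frames, ignore_index=True)

os.makedirs('final', exist_ok=True)  # Make sure the output directory exists
df.to_parquet(f'final/{folder}.parquet', index=False)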

2017 done
2018 done
2019 done
2020 done
2021 done
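
The loop above takes the first run of four digits in each file name as the year. A stricter variant could anchor on the year marker itself. The sketch below assumes file names of the form `BACI_HS17_Y2017_V202301.csv`; that pattern is an assumption about the download, and `year_from_filename` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumed file-name pattern: BACI_<HS revision>_Y<year>_V<release>.csv
YEAR_PATTERN = re.compile(r"_Y(\d{4})_")


def year_from_filename(filename):
    """Return the year encoded in a BACI file name as an int."""
    match = YEAR_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"No '_Y<year>_' marker found in {filename!r}")
    return int(match.group(1))


# Example names following the assumed pattern
names = ["BACI_HS17_Y2018_V202301.csv", "BACI_HS17_Y2017_V202301.csv"]
for name in sorted(names, key=year_from_filename):
    print(year_from_filename(name), name)
```

Raising an error on a non-matching name makes a stray file in the folder visible instead of silently reading it into the table.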


## Load and view Parquet file

In [5]:
duckdb.sql(
    f"""
    SELECT * 
    FROM read_parquet('final/{folder}.parquet')
    LIMIT 10
    """
).df()

,t,i,j,k,v,q
0,2017,4,12,130120,5.946,1.4
1,2017,4,12,130190,5.125,2.32
2,2017,4,12,401031,0.087,0.002
3,2017,4,12,853890,0.303,0.019
4,2017,4,36,71320,3.242,2.446
5,2017,4,36,80212,8.354,2.194
6,2017,4,36,80290,4.754,1.06
7,2017,4,36,80420,1.218,1.547
8,2017,4,36,80620,48.061,56.845
9,2017,4,36,81310,13.641,17.025
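
In the preview above, some values of `k` (for example `71320`) have only five digits. Assuming `k` holds six-digit HS product codes, as described in the BACI documentation, this is what happens when a code with a leading zero is read as an integer. The sketch below shows one way to restore the six-character form; `format_hs6` is a hypothetical helper, not part of the notebook.

```python
def format_hs6(code):
    """Return an HS product code as a six-character, zero-padded string.

    Assumes the code is a 6-digit HS code that may have lost its
    leading zero when stored as an integer.
    """
    text = str(int(code))
    if len(text) > 6:
        raise ValueError(f"Not a 6-digit HS code: {code!r}")
    return text.zfill(6)


# Values of k taken from the preview above
for k in (130120, 71320, 80212):
    print(k, "->", format_hs6(k))
```

On the full table, the same padding could be applied with `df['k'].astype(str).str.zfill(6)` before joining against a product code list that stores codes as strings.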
