# Usage

All functions are in `baci.py`.

```python
from baci import baci_to_parquet, aggregate_baci
```

## Initial processing

From the [BACI website](http://www.cepii.fr/CEPII/en/bdd_modele/bdd_modele_item.asp?id=37), select the version of the dataset to work on. Each version is identified by an HS edition and a release date. Downloading and unzipping it yields a folder of yearly CSV files. Store this folder under `raw`.

Next, run `baci_to_parquet()` to convert the CSV files into a single Parquet file, which is saved in `final`.

```python
hs = 'HS17'             # Change this!
release = '202301'      # Change this!

baci_to_parquet(hs=hs, release=release)
```

```
'BACI_HS17_V202301.parquet' successfully saved in 'final'.
```


The `raw` and `final` folders can be changed through the `input_folder` and `output_folder` arguments. For example, to keep all files in the project's root directory, run

```python
baci_to_parquet(hs=hs, release=release, input_folder=None, output_folder=None)
```

Preview the saved file with DuckDB:

```python
import duckdb

duckdb.sql("SELECT * FROM 'final/BACI_HS17_V202301.parquet' LIMIT 10").df()
```

|   | t | i | j | k | v | q |
|---|------|---|----|--------|---------|-------|
| 0 | 2018 | 4 | 24 | 845420 | 112.734 | 26.0 |
| 1 | 2018 | 4 | 24 | 848180 | 2.632 | 0.007 |
| 2 | 2018 | 4 | 31 | 570110 | 1.596 | 0.037 |
| 3 | 2018 | 4 | 32 | 340319 | 8.743 | 0.114 |
| 4 | 2018 | 4 | 32 | 391739 | 0.164 | |
| 5 | 2018 | 4 | 32 | 610910 | 1.098 | 0.013 |
| 6 | 2018 | 4 | 32 | 710310 | 0.142 | 0.007 |
| 7 | 2018 | 4 | 32 | 710399 | 1.577 | 0.108 |
| 8 | 2018 | 4 | 32 | 854232 | 0.332 | 0.002 |
| 9 | 2018 | 4 | 32 | 902519 | 0.511 | 0.004 |


## Aggregations

Products in the BACI dataset are recorded at the 6-digit HS level. Depending on the use case, it may be preferable to work with an aggregated version of the dataset. The function `aggregate_baci()` aggregates the dataset to the 4-digit, 2-digit, or country level and saves the result as a new Parquet file. The `aggregation` argument accepts `country`, `2digit`, and `4digit`; the default is `country`.
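To make the three levels concrete, here is a minimal pure-Python sketch of the grouping logic, applied to the ten rows from the preview above. It is not the code inside `aggregate_baci()`; the helper name `aggregate` is hypothetical, and it assumes that `k` is stored as an integer, so that integer division by 100 or 10,000 yields the 4-digit or 2-digit prefix.

```python
from collections import defaultdict

# Rows copied from the preview above, as (t, i, j, k, v):
# year, exporter, importer, 6-digit HS product code, trade value.
rows = [
    (2018, 4, 24, 845420, 112.734),
    (2018, 4, 24, 848180, 2.632),
    (2018, 4, 31, 570110, 1.596),
    (2018, 4, 32, 340319, 8.743),
    (2018, 4, 32, 391739, 0.164),
    (2018, 4, 32, 610910, 1.098),
    (2018, 4, 32, 710310, 0.142),
    (2018, 4, 32, 710399, 1.577),
    (2018, 4, 32, 854232, 0.332),
    (2018, 4, 32, 902519, 0.511),
]

def aggregate(rows, level="country"):
    """Hypothetical helper: sum v per group at the requested level."""
    # Assumption: k is an integer, so dropping trailing digits is an
    # integer division (845420 // 10_000 == 84).
    divisors = {"4digit": 100, "2digit": 10_000}
    totals = defaultdict(float)
    for t, i, j, k, v in rows:
        if level == "country":
            key = (t, i, j)                       # product dimension removed
        else:
            key = (t, i, j, k // divisors[level])  # truncated product code
        totals[key] += v
    return dict(totals)

two_digit = aggregate(rows, "2digit")
four_digit = aggregate(rows, "4digit")
by_country = aggregate(rows, "country")
```

In this sample, the two chapter-84 flows from exporter 4 to importer 24 collapse into one 2-digit row, while the country level keeps a single row per exporter-importer-year triple.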

Summing the column `q` (weight of trade flows in metric tons) can give misleading results because a substantial number of its values are missing, so the aggregated output omits it.
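The problem is already visible in the ten-row preview above. The sketch below is an illustration only, with variable names chosen here: it takes the seven flows from exporter 4 to importer 32 and shows that a sum which skips missing weights silently understates the total.

```python
# (k, q) pairs for exporter 4 to importer 32 in 2018, copied from the
# preview above; None marks the flow whose weight is missing.
flows = [
    (340319, 0.114),
    (391739, None),
    (610910, 0.013),
    (710310, 0.007),
    (710399, 0.108),
    (854232, 0.002),
    (902519, 0.004),
]

# Product codes with no reported weight.
missing = [k for k, q in flows if q is None]

# A sum that skips missing values covers only part of the trade.
reported_total = sum(q for _, q in flows if q is not None)
share_missing = len(missing) / len(flows)

print(f"reported weight: {reported_total:.3f} t")
print(f"flows without weight: {len(missing)} of {len(flows)}")
```

Here one of seven flows has no weight, so the summed weight for this country pair is a lower bound, not a total. At the 2-digit level, chapter 39 would have no usable weight at all.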

```python
aggregate_baci(
    input=f'final/BACI_{hs}_V{release}.parquet',
    output=f'final/BACI_{hs}_V{release}-2digit.parquet',
    aggregation='2digit'
)
```

View the results:

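```python
duckdb.sql("SELECT * FROM 'final/BACI_HS17_V202301-2digit.parquet' LIMIT 10").df()
```

|   | t | i | j | k2 | v |
|---|------|-----|-----|----|----------|
| 0 | 2017 | 300 | 682 | 6 | 14.706 |
| 1 | 2017 | 300 | 682 | 12 | 4998.85 |
| 2 | 2017 | 300 | 682 | 56 | 903.204 |
| 3 | 2017 | 300 | 682 | 68 | 6775.087 |
| 4 | 2017 | 300 | 682 | 76 | 6495.575 |
| 5 | 2017 | 300 | 682 | 83 | 4623.852 |
| 6 | 2017 | 300 | 686 | 62 | 11.718 |
| 7 | 2017 | 300 | 686 | 64 | 1.01 |
| 8 | 2017 | 300 | 686 | 85 | 35.533 |
| 9 | 2017 | 300 | 686 | 87 | 307.69 |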
