# Aggregate BACI

Products in the BACI dataset are classified at the 6-digit HS level. Depending on the use case, it may be preferable to work with an aggregated version of the dataset. This notebook contains scripts for aggregating to the country, 2-digit, and 4-digit levels.

Because the column `q` (weight of trade flows in metric tons) has a substantial number of missing values, it is left out of the aggregated outputs.
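
One way to see why carrying `q` through an aggregation is risky: SQL `SUM` skips missing values, so a group that mixes reported and missing weights would yield a partial total that looks complete. The sketch below mimics that behaviour in plain Python on made-up rows; it is an illustration only, not a query on BACI itself.

```python
# Made-up flows for one exporter-importer pair; None marks a missing weight.
rows = [
    {"k": "010121", "v": 10.0, "q": 2.5},
    {"k": "010129", "v": 4.0, "q": None},
    {"k": "010221", "v": 6.0, "q": 1.0},
]

# Value is reported for every flow, so its sum is complete.
total_v = sum(r["v"] for r in rows)

# Like SQL SUM, skip missing values: the result covers only 2 of the 3 flows.
partial_q = sum(r["q"] for r in rows if r["q"] is not None)
n_missing = sum(r["q"] is None for r in rows)

print(total_v, partial_q, n_missing)
```

The aggregated weight here silently omits one flow, which is the kind of understatement that dropping `q` avoids.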

Run `save-to-parquet.ipynb` before running this notebook. Set `hs` and `release` below to correspond to the selected version of BACI.

In [1]:
hs = 'HS17'             # Change this!
release = '202301'      # Change this!

filename = f'BACI_{hs}_V{release}'
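
Since this notebook depends on the output of `save-to-parquet.ipynb`, it can help to confirm that the input file exists before running any query. A minimal sketch, assuming the `final/` directory layout and file suffixes used by the queries in this notebook; the helper name `baci_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def baci_paths(hs: str, release: str, root: str = "final") -> dict[str, Path]:
    """Return the input path and the three output paths for one BACI version."""
    stem = f"BACI_{hs}_V{release}"
    base = Path(root)
    paths = {"input": base / f"{stem}.parquet"}
    for level in ("country", "2digit", "4digit"):
        paths[level] = base / f"{stem}-{level}.parquet"
    return paths


paths = baci_paths("HS17", "202301")
if not paths["input"].exists():
    # Assumption: the input is produced by save-to-parquet.ipynb.
    print(f"Missing {paths['input']}: run save-to-parquet.ipynb first.")
```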

In [2]:
import duckdb

## Country level

In [5]:
duckdb.sql(
    f"""
    COPY (
        -- Sum trade value v over all products for each year t, exporter i, importer j
        SELECT t, i, j, SUM(v) AS v
        FROM read_parquet('final/{filename}.parquet')
        GROUP BY t, i, j
        ORDER BY t
    ) TO 'final/{filename}-country.parquet'
    """
)

In [6]:
duckdb.sql(
    f"""
    SELECT * 
    FROM read_parquet('final/{filename}-country.parquet')
    LIMIT 10
    """
).df()

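|   | t | i | j | v |
|---|---|---|---|---|
| 0 | 2017 | 534 | 764 | 8.171 |
| 1 | 2017 | 540 | 480 | 33.759 |
| 2 | 2017 | 534 | 752 | 84.263 |
| 3 | 2017 | 534 | 826 | 130.097 |
| 4 | 2017 | 548 | 348 | 35.453 |
| 5 | 2017 | 548 | 699 | 0.528 |
| 6 | 2017 | 548 | 752 | 0.937 |
| 7 | 2017 | 548 | 392 | 64318.959 |
| 8 | 2017 | 548 | 826 | 235.869 |
| 9 | 2017 | 548 | 764 | 14306.937 |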


## 2-digit level

In [7]:
duckdb.sql(
    f"""
    COPY (
        SELECT t, i, j, k2, SUM(v) AS v
        FROM (
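            -- k2: first two digits of the 6-digit product code k (the HS chapter);
            -- the negative start index counts from the end of the string
            SELECT t, i, j, SUBSTRING(k, -6, 2) AS k2, v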
            FROM read_parquet('final/{filename}.parquet')
        )
        GROUP BY t, i, j, k2
        ORDER BY t
    ) TO 'final/{filename}-2digit.parquet'
    """
)

In [8]:
duckdb.sql(
    f"""
    SELECT * 
    FROM read_parquet('final/{filename}-2digit.parquet')
    LIMIT 10
    """
).df()

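|   | t | i | j | k2 | v |
|---|---|---|---|---|---|
| 0 | 2017 | 4 | 36 | 12 | 8.043 |
| 1 | 2017 | 4 | 36 | 56 | 5.144 |
| 2 | 2017 | 4 | 36 | 76 | 2.213 |
| 3 | 2017 | 4 | 40 | 9 | 2.375 |
| 4 | 2017 | 4 | 40 | 52 | 0.16 |
| 5 | 2017 | 4 | 40 | 62 | 7.134 |
| 6 | 2017 | 4 | 40 | 64 | 17.022 |
| 7 | 2017 | 4 | 40 | 65 | 0.097 |
| 8 | 2017 | 4 | 40 | 85 | 0.545 |
| 9 | 2017 | 4 | 40 | 94 | 0.821 |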
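
The HS nomenclature is hierarchical: the first two digits of a 6-digit code identify the chapter and the first four identify the heading, which is what the prefix extraction relies on. In the preview above, chapter codes are displayed without leading zeros (for instance `9` rather than `09`). The sketch below shows one way to derive zero-padded prefixes in plain Python. The helper `hs_prefix` is hypothetical, and zero-padding is a choice made here for illustration, not what the notebook's query does.

```python
def hs_prefix(code, digits: int) -> str:
    """Return the leading `digits` digits of an HS code, zero-padded to 6 digits first."""
    if digits not in (2, 4, 6):
        raise ValueError("digits must be 2, 4, or 6")
    # Restore a leading zero that may have been lost if the code was stored as a number.
    padded = str(code).strip().zfill(6)
    if len(padded) != 6 or not padded.isdigit():
        raise ValueError(f"not a 6-digit HS code: {code!r}")
    return padded[:digits]


print(hs_prefix("090111", 2))  # chapter
print(hs_prefix(90111, 4))     # heading, from a code that lost its leading zero
```

Keeping the prefix as a zero-padded string makes it straightforward to match against published HS chapter and heading lists, at the cost of a string column instead of a numeric one.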


## 4-digit level

In [9]:
duckdb.sql(
    f"""
    COPY (
        SELECT t, i, j, k4, SUM(v) AS v
        FROM (
            -- k4: first four digits of the 6-digit product code k (the HS heading)
            SELECT t, i, j, SUBSTRING(k, -6, 4) AS k4, v
            FROM read_parquet('final/{filename}.parquet')
        )
        GROUP BY t, i, j, k4
        ORDER BY t
    ) TO 'final/{filename}-4digit.parquet'
    """
)

In [10]:
duckdb.sql(
    f"""
    SELECT * 
    FROM read_parquet('final/{filename}-4digit.parquet')
    LIMIT 10
    """
).df()

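|   | t | i | j | k4 | v |
|---|---|---|---|---|---|
| 0 | 2017 | 40 | 270 | 8541 | 0.316 |
| 1 | 2017 | 40 | 275 | 3822 | 11.177 |
| 2 | 2017 | 40 | 275 | 9026 | 4.861 |
| 3 | 2017 | 40 | 275 | 9031 | 6.455 |
| 4 | 2017 | 40 | 276 | 103 | 575.414 |
| 5 | 2017 | 40 | 276 | 202 | 53119.864 |
| 6 | 2017 | 40 | 276 | 403 | 61407.933 |
| 7 | 2017 | 40 | 276 | 702 | 2328.075 |
| 8 | 2017 | 40 | 276 | 710 | 7520.397 |
| 9 | 2017 | 40 | 276 | 805 | 1552.557 |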
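
Because each aggregation level only regroups the same flows, the total trade value should be identical across the country, 2-digit, and 4-digit outputs, up to floating-point rounding. The following is a minimal sketch of that invariant on made-up rows, written in plain Python rather than DuckDB. The function `aggregate` is hypothetical and only mirrors the `GROUP BY` logic of the queries above.

```python
import math
from collections import defaultdict

# Made-up rows: (t, i, j, k, v) with k a 6-digit HS code stored as a string.
rows = [
    (2017, 4, 36, "120740", 8.043),
    (2017, 4, 36, "120810", 1.500),
    (2017, 4, 36, "560811", 5.144),
    (2017, 4, 40, "090111", 2.375),
]


def aggregate(rows, digits):
    """Sum v by (t, i, j, first `digits` digits of k); digits=0 drops the product dimension."""
    totals = defaultdict(float)
    for t, i, j, k, v in rows:
        key = (t, i, j) if digits == 0 else (t, i, j, k[:digits])
        totals[key] += v
    return dict(totals)


country, two_digit, four_digit = (aggregate(rows, d) for d in (0, 2, 4))

# Every level must add up to the same grand total as the raw rows.
grand_total = sum(v for *_, v in rows)
for level in (country, two_digit, four_digit):
    assert math.isclose(sum(level.values()), grand_total)
```

The same comparison could be run against the three Parquet outputs as a quick check after the notebook finishes.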
