# Deduplicate

AEMO publishes files every 5 minutes. After a few days it aggregates them into daily files, one per "package" (a group of tables). Later still it aggregates them into monthly files: one set covering all tables, and one monthly file for each table. Some files are supposed to contain updated data that supersedes earlier files. We therefore need to deduplicate on each table's primary key columns, and when two rows clash we need to choose intelligently which one to keep.

The primary keys come from AEMO's schema documentation, which we web-scraped in an earlier script.
When several rows share a primary key, we keep one row by applying these rules in order:

* choose the largest `SCHEMA_VERSION` (a metadata field we added in the earlier splitting stage; it is not present in AEMO's schema)
* choose the largest `TOP_TIMESTAMP` (i.e. the file which AEMO generated most recently)
* if there is a `LASTCHANGED` column (most tables have one), choose the largest value. This is another measure of when AEMO generated the data, so the largest value is the most recent data.
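
The rule above can be sketched in plain Python, independent of polars, to make the tie-breaking explicit. This is only an illustration: the `deduplicate` function is hypothetical, the `DUID` and `REGIONID` columns and all values are made-up sample data, and it assumes no sort key is missing (Python cannot order `None` against a value).

```python
def deduplicate(rows, primary_keys, sort_keys):
    """Keep one row per primary key: the row with the largest sort-key tuple.

    Tuples compare left to right, so the first sort key dominates and later
    keys only break ties, which matches applying the rules in order.
    """
    best = {}
    for row in rows:
        pk = tuple(row[k] for k in primary_keys)
        rank = tuple(row[k] for k in sort_keys)
        if pk not in best or rank > best[pk][0]:
            best[pk] = (rank, row)
    return [row for _, row in best.values()]


# Made-up sample rows (illustrative only)
rows = [
    {"DUID": "GEN1", "SCHEMA_VERSION": 1, "TOP_TIMESTAMP": "2023-01-02",
     "LASTCHANGED": "2023-01-01", "REGIONID": "NSW1"},
    {"DUID": "GEN1", "SCHEMA_VERSION": 2, "TOP_TIMESTAMP": "2023-01-01",
     "LASTCHANGED": "2023-01-01", "REGIONID": "VIC1"},
    {"DUID": "GEN2", "SCHEMA_VERSION": 1, "TOP_TIMESTAMP": "2023-01-01",
     "LASTCHANGED": "2023-01-01", "REGIONID": "QLD1"},
    {"DUID": "GEN2", "SCHEMA_VERSION": 1, "TOP_TIMESTAMP": "2023-01-01",
     "LASTCHANGED": "2023-01-03", "REGIONID": "SA1"},
]

result = deduplicate(
    rows,
    primary_keys=["DUID"],
    sort_keys=["SCHEMA_VERSION", "TOP_TIMESTAMP", "LASTCHANGED"],
)
```

For `GEN1` the higher `SCHEMA_VERSION` wins even though its `TOP_TIMESTAMP` is older. For `GEN2` the first two keys tie, so `LASTCHANGED` decides.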

## Small files

For sets of files small enough to fit into memory, we'll use [`polars`](https://docs.pola.rs/), which is like pandas but faster, and which can represent missing values for every datatype.

The small tables are typically reference data, such as which region each generator is in. This gets republished each month, so we end up with more than 90% duplicated data.

## Large files

If the parquet files for a table are together too large to fit into memory, we'll need a fancier technique. Deduplication normally involves either sorting the whole dataset, or grouping by the key columns and then sorting within each group. Either way, the key columns for every row must be in memory at once, which we can't do here. Polars claims to be able to do this in 'streaming' mode, but in practice that ends up using all available memory.

So we need a different approach, which I don't have time to implement for now. I've checked, and I think only about 10% of the data in these datasets is duplicated. That is probably from the week before we downloaded the data (when the daily and 5-minute files overlap), which was not near a DST transition.

Suppose we want to deduplicate on primary key columns `A` and `B`, sorting by `C` and `D`, and keeping data column `E`. The process I want to use is: for each input file, write the rows out to new parquet files with hive partitioning, e.g. `A=1/file.parquet`, `A=2/file.parquet`. We keep `B` inside the parquet files, because partitioning on it as well would produce too many tiny files, which is not performant. By splitting into one file per `A`, each resulting folder is small enough to fit into memory. We can then deduplicate each folder on column `B` only (sorting by `C` and `D`), and finally recombine the folders (although that last step is not strictly necessary).
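
The partition-then-deduplicate idea can be sketched with the standard library alone. This is a stand-in under loud assumptions, not the eventual polars implementation: it writes JSON-lines files instead of parquet, the helpers `partition_rows` and `dedup_partition` are hypothetical, and the data is made up. It keeps the same column roles (`A`, `B` as keys, `C`, `D` for ranking, `E` as data) and the same `A=<value>/` folder layout.

```python
import json
import os
import tempfile
from collections import defaultdict


def partition_rows(rows, partition_key, out_dir, file_name):
    """Write each row to <out_dir>/<partition_key>=<value>/<file_name>."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[partition_key]].append(row)
    for value, bucket in buckets.items():
        folder = os.path.join(out_dir, f"{partition_key}={value}")
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, file_name), "w") as f:
            for row in bucket:
                # The partition column lives in the folder name, not the file.
                kept = {k: v for k, v in row.items() if k != partition_key}
                f.write(json.dumps(kept) + "\n")


def dedup_partition(folder, key, sort_keys):
    """Load one partition folder and keep the top-ranked row per key."""
    best = {}
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name)) as f:
            for line in f:
                row = json.loads(line)
                rank = tuple(row[k] for k in sort_keys)
                if row[key] not in best or rank > best[row[key]][0]:
                    best[row[key]] = (rank, row)
    return [row for _, row in best.values()]


# Two made-up input files that overlap on (A=1, B="x")
input_files = {
    "file1.jsonl": [
        {"A": 1, "B": "x", "C": 1, "D": 1, "E": "old"},
        {"A": 2, "B": "x", "C": 1, "D": 1, "E": "only"},
    ],
    "file2.jsonl": [
        {"A": 1, "B": "x", "C": 1, "D": 2, "E": "new"},
        {"A": 1, "B": "y", "C": 1, "D": 1, "E": "other"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    # Pass 1: only one input file is held in memory at a time.
    for name, rows in input_files.items():
        partition_rows(rows, "A", tmp, name)
    # Pass 2: only one partition folder is held in memory at a time.
    deduplicated = {}
    for folder_name in sorted(os.listdir(tmp)):
        a_value = folder_name.split("=", 1)[1]  # parsed from the path, so a string
        deduplicated[a_value] = dedup_partition(
            os.path.join(tmp, folder_name), key="B", sort_keys=["C", "D"]
        )
```

Because rows with the same `A` always land in the same folder, deduplicating each folder on `B` alone gives the same result as deduplicating the whole dataset on `(A, B)`. Peak memory is bounded by the largest single input file or partition folder.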

## Notes

Polars uses multithreading for us. So we won't use `multiprocessing.Pool`.

In [None]:
import os
import json
import psutil
import importlib

import polars as pl
from tqdm import tqdm

# utils is our local utility module
# if we change utils.py, and re-run a normal 'import'
# python won't reload it by default. (Since it's already loaded.)
# So we force a reload
import utils
importlib.reload(utils)

## Constants

In [None]:
repo_data_dir = '/home/matthew/Documents/TSE/AppliedEconometrics/repo/data/'
laptop_data_dir = '/home/matthew/data/'

# result of the previous script
source_dir = os.path.join(laptop_data_dir, '01-D-parquet-batches')

# result of this script
dest_dir = os.path.join(laptop_data_dir, '01-E-parquet-deduplicated')

schema_path = os.path.join(repo_data_dir, 'schemas.json')

In [None]:
# assume that if a parquet file is x bytes
# once loaded into memory it will be x * compression_factor bytes
compression_factor = 36

## Preparation

In [None]:
logger = utils.Logger(os.path.join(repo_data_dir, 'logs.txt'), flush=True)
logger.info("Initialising Logger")

In [None]:
utils.renice()

In [None]:
with open(schema_path, 'r') as f:
    schemas = json.load(f)

## Check size of each table

In [None]:
# we have this many bytes of memory free
avail_memory = psutil.virtual_memory().available

In [None]:
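def get_dir_size(dir_path):
    # total size in bytes of every file under dir_path
    # (the argument is not called `dir`, to avoid shadowing the builtin)
    return sum(os.path.getsize(path) for path in utils.walk(dir_path))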

tables = os.listdir(source_dir)
table_sizes = {table: get_dir_size(os.path.join(source_dir, table)) for table in tables}

# a table is "big" if its estimated in-memory size exceeds half of the free memory
big_tables = [t for t in tables if table_sizes[t] * compression_factor > avail_memory / 2]
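
As a worked example of the size check, here is a small standalone sketch. The function name `exceeds_memory_budget` and the parameter `budget_fraction` are hypothetical. It assumes the `/ 2` in the cell above is meant as "use at most half of the free memory", and it reuses the rough `compression_factor = 36` guess from earlier.

```python
def exceeds_memory_budget(parquet_bytes, available_bytes,
                          compression_factor=36, budget_fraction=0.5):
    """Return True if the table is estimated to be too big to load.

    The in-memory size is estimated as on-disk size * compression_factor,
    and compared against a fraction of the currently available memory.
    """
    estimated_bytes = parquet_bytes * compression_factor
    return estimated_bytes > available_bytes * budget_fraction


GiB = 1024 ** 3
MiB = 1024 ** 2

# With 16 GiB free, the budget is 8 GiB.
small = exceeds_memory_budget(100 * MiB, 16 * GiB)  # ~3.5 GiB estimated
large = exceeds_memory_budget(1 * GiB, 16 * GiB)    # 36 GiB estimated
```

With these numbers a 100 MiB table counts as small and a 1 GiB table counts as big. The factor of 36 is only a heuristic, so the boundary is approximate.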

In [None]:
utils.create_dir(dest_dir)

In [None]:
for table in tqdm(tables):
    logger.info(f"Starting {table}")
    table_source_dir = os.path.join(source_dir, table)
    source_paths = list(utils.walk(table_source_dir))
    dest_path = os.path.join(dest_dir, table + '.parquet')

    logger.info(f"Source files {source_paths}")
    ds = pl.scan_parquet(source_paths)
    
    primary_keys = schemas[table]['primary_keys']
    sort_keys = ['SCHEMA_VERSION', 'TOP_TIMESTAMP']
    if 'LASTCHANGED' in ds.columns:
        sort_keys.append('LASTCHANGED')

    assert ds.fetch().shape[0] > 0, f"No rows in {len(source_paths)} files for {table}"
    
    if table in big_tables:
        # for now, don't deduplicate
        # do merge into one
        ds.sink_parquet(dest_path)
    else:
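        # sort so the preferred row comes first within each primary key,
        # then keep that first row (the default keep='any' may keep an arbitrary row)
        (
            ds.sort(sort_keys, descending=True)
            .unique(subset=primary_keys, keep='first')
            .select(pl.exclude("SCHEMA_VERSION", "TOP_TIMESTAMP"))
            .sink_parquet(dest_path)
        )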