
[CSV][GQLDF] If you delete CSV rows, and update raw, you get duplicate RAW rows... #1087

idiom-bytes opened this issue May 23, 2024 · 3 comments
idiom-bytes commented May 23, 2024

Issue: CSV Deletion/Raw Table Duplication

TL;DR: just drop the raw and ETL tables before updating and fetching new CSVs.

Checks in save_to_storage against DuckDB raw tables

The reason the problem above happens is that GQLDF uses save_to_storage, which doesn't do any checks against the DuckDB table when doing the insert. This is because it assumes the DuckDB table will be in the same state as the CSV: clean and empty.

So it's currently the user's responsibility: if they delete the CSV, they also need to trim the DuckDB table (the lake) with the drop commands that already exist.
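
To make the failure mode concrete, here is a minimal sketch of an unconditional append-style insert like the one described above; the function, table, and column names are made up and this is not the actual save_to_storage code:

import duckdb
import pandas as pd

def append_csv_to_raw(db_path: str, table: str, csv_path: str) -> None:
    con = duckdb.connect(db_path)
    df = pd.read_csv(csv_path)  # whatever was (re)fetched into the CSV
    con.register("incoming", df)
    # no min/max or duplicate check: the CSV contents are appended as-is,
    # so re-fetched rows land in the table a second time
    con.execute(f"INSERT INTO {table} SELECT * FROM incoming")
    con.close()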

And, let's compartmentalise the problem...

  • The whole data pipeline works on the assumption that users will frequently be iterating and dropping their database tables (DuckDB) to rebuild from scratch using the CSVs.
    - Right now, we do not want users to delete their CSV, or expose them to managing this.
  • The whole data pipeline works on the assumption that they only have to fetch this data once. Read: low touch/velocity.

This is a low-priority issue.

[CSV Management - Considerations / Potential Solutions]

  • Keep track of each table's min/max timestamp.
  • Checks are made in GQLDF -> append_to_table -> insert_to_duckdb (verify the row to be inserted has a timestamp less than the min or greater than the max). This way, we update the CSV but don't insert double records into the lake; see the sketch after this list.
  • We continue to extend CLI commands so that you can also manage the CSV data easily.
  • We continue to extend CLI commands so that you can drop across all CSV/RAW/ETL easily.
  • Even right now, if you do delete some of your CSVs, you can get your tables up and running again in just a few seconds.
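
A rough sketch of the min/max guard described above, assuming a DuckDB connection, a timestamp column, and made-up helper names (this is not the actual GQLDF code):

import duckdb
import pandas as pd

def insert_if_outside_window(con, table: str, df: pd.DataFrame) -> None:
    # current window of data already in the lake table
    min_ts, max_ts = con.execute(
        f"SELECT MIN(timestamp), MAX(timestamp) FROM {table}"
    ).fetchone()
    if min_ts is not None:
        # keep only rows strictly outside [min_ts, max_ts] -> no double records
        df = df[(df["timestamp"] < min_ts) | (df["timestamp"] > max_ts)]
    if df.empty:
        return
    con.register("new_rows", df)
    con.execute(f"INSERT INTO {table} SELECT * FROM new_rows")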
kdetry commented May 27, 2024

I have the following ideas as potential solutions:

Unique constraints in DuckDB:
Add a unique constraint (on ID) in DuckDB to enforce data integrity. This automatically prevents duplicate entries from being inserted into the table.
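
A minimal sketch of this idea, assuming DuckDB's PRIMARY KEY and ON CONFLICT support; the table and column names are made up:

import duckdb

con = duckdb.connect("lake.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_predictions (
        id        VARCHAR PRIMARY KEY,
        timestamp BIGINT,
        payload   VARCHAR
    )
""")
# inserting the same id again becomes a no-op instead of a duplicate row
con.execute("""
    INSERT INTO raw_predictions VALUES ('pred-1', 1716400000, 'example')
    ON CONFLICT (id) DO NOTHING
""")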
Incremental appends:
Track the last update point and only append records newer than that point to both the CSV and the DuckDB table.

Cleanup scripts:
For the whole CSV/raw/ETL process, we could create cleaner drop scripts.

calina-c commented Jun 3, 2024

Why complicate with timestamps or IDs? Can't we do a simple differential of identical rows? Before saving/updating from the CSV, remove rows that already exist in the table.
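
For reference, a sketch of what that differential could look like in DuckDB (names are illustrative; note the set-difference scans the existing table on every insert):

import duckdb
import pandas as pd

def append_only_new_rows(con, table: str, df: pd.DataFrame) -> None:
    con.register("incoming", df)
    # set-difference against the existing table, then append what remains
    con.execute(f"""
        INSERT INTO {table}
        SELECT * FROM incoming
        EXCEPT
        SELECT * FROM {table}
    """)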

idiom-bytes commented Jun 7, 2024

Add a unique constraint (on ID) in DuckDB to enforce data integrity. This automatically prevents duplicate entries from being inserted into the table.
Why complicate with timestamps or IDs? Can't we do a simple differential of identical rows? Before saving/updating from the CSV, remove rows that already exist in the table.

Because we're working with timeseries data and can leverage those two things to build a much more efficient system...

You guys are advocating for vanilla fetch-and-join strategies that lead to N<>N joins with multiple O(n^2) passes. Please go ahead and try doing this with 1 million rows of anything.

The checks you are suggesting, which look for duplicates by scanning the whole database on every insert, add overhead and are extremely expensive.

There is a reason we've worked on this for months: to build a pipeline that processes records efficiently, does as few scans and joins as possible, and can scale.

[Users should not be deleting their lake.]

If anything, focus on this: enforce the SLAs.

PROPOSED/EXAMPLE solution: do a simple check at the CLI level when "lake raw update" or "lake etl update" is called.

Warn the user they'll get duplicates because their csv.head < max(raw_tables.head, etl_tables.head) (i.e. they deleted things). Then prompt the user with a hard stop, so they are forced to drop their records until csv.head == table.head.

pseudo code

def cli_helper_prepare_tables_for_processing():
    # checkpoints: latest timestamp in the CSVs vs. latest in the raw/etl tables
    start = gql_df.get_csvs_max_timestamp()
    end = max(gql_df.get_tables_max_timestamp(), etl.get_tables_max_timestamp())

    # check if tables diverged from the csv (i.e. csv rows were deleted)
    if start < end:
        user_answer = prompt(
            "CSV and tables diverged, this will cause duplicates. Trim the tables?",
            options=["y"],
        )
        if user_answer is None:
            return

        # trim raw + etl tables back to the csv checkpoint, then resume updates
        db.drop_tables(raw, start, end)
        db.drop_tables(etl, start, end)
        _do_both_raw_and_etl_update(csv.head, lake_ss.end_ts)

In this example, we simply check whether csv.head diverged from raw_tables.head and etl_tables.head. Once, when the CLI is first run.

Notice that we're prompting the user to drop records rather than just doing it for them. No black boxes. Nothing "smart". Just simple checks, leveraging the system, and then resuming things.

This is a much more efficient way of dealing with this. It does everything through timestamps: we look up our checkpoint, then use the drop() functionality to bring the CSVs, raw tables, and ETL tables to the same mark.

This is a better way of doing things because we're not adding wasteful checks and computation. We're simply looking up our checkpoints (int timestamps) and leveraging the existing functionality to manage the lake. Further, we're stressing the SLAs to the user.

Also, this is only done once, at the very beginning of the pipeline, outside its main loop and operations, so it carries no additional overhead.
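
As a sketch of that trim step, assuming a timestamp column and hypothetical table names (the existing drop commands would be the real mechanism):

import duckdb

def trim_tables_to_checkpoint(db_path: str, tables, checkpoint_ts: int) -> None:
    con = duckdb.connect(db_path)
    for table in tables:
        # delete every lake row newer than the CSV checkpoint,
        # so csv.head == table.head before the update resumes
        con.execute(f"DELETE FROM {table} WHERE timestamp > ?", [checkpoint_ts])
    con.close()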

[Repeating Patterns]
The changes being recommended are not efficient. Again, they add too much "defensive code to handle edge cases" rather than relying on the pattern/algorithm. Please do not configure the DB to auto-resolve conflicts, and do not do multiple N<>N row joins with historical data.

These do not scale.
