
[CSV][GQLDF] If you delete CSV rows, and update raw, you get duplicate RAW rows... #1087

idiom-bytes opened this issue May 23, 2024 · 3 comments
idiom-bytes commented May 23, 2024

Issue: CSV Deletion/Raw Table Duplication

TL;DR: just drop the raw and ETL tables before updating and fetching new CSVs.

Checks in save_to_storage against DuckDB raw tables

The reason the problem above happens is that GQLDF uses save_to_storage, which doesn't do any checks against the DuckDB table when doing the insert. This is because it assumes the DuckDB table will be in the same state as the CSV: clean and empty.

So it's currently the user's responsibility: if they delete the CSV, they also need to trim the DuckDB table (the lake) with the drop commands that already exist.
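
To make the failure mode concrete, here is a minimal sketch of an unconditional append-style insert like the one described above; the function, table, and column names are made up and this is not the actual save_to_storage code:

import duckdb
import pandas as pd

def append_csv_to_raw(db_path: str, table: str, csv_path: str) -> None:
    con = duckdb.connect(db_path)
    df = pd.read_csv(csv_path)  # whatever was (re)fetched into the CSV
    con.register("incoming", df)
    # no min/max or duplicate check: the CSV contents are appended as-is,
    # so re-fetched rows land in the table a second time
    con.execute(f"INSERT INTO {table} SELECT * FROM incoming")
    con.close()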

And, let's compartmentalise the problem...

  • The whole data pipeline works on the assumption that users will frequently be iterating and dropping their database tables (DuckDB) to rebuild from scratch using the CSVs.
    - Right now, we do not want users to delete their CSV, or expose them to managing this.
  • The whole data pipeline works on the assumption that they only have to fetch this data once. Read: low touch/velocity.

This is a low-priority issue.

[CSV Management - Considerations / Potential Solutions]

  • Keep track of each table's min/max timestamp.
  • Checks are made in GQLDF -> append_to_table -> insert_to_duckdb (verify the row to be inserted has a timestamp less than the min or greater than the max). This way, we update the CSV but don't insert double records into the lake; see the sketch after this list.
  • We continue to extend CLI commands so that you can also manage the CSV data easily.
  • We continue to extend CLI commands so that you can drop across all CSV/RAW/ETL easily.
  • Even right now, if you do delete some of your CSVs, you can get your tables up and running again in just a few seconds.
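
A rough sketch of the min/max guard described above, assuming a DuckDB connection, a timestamp column, and made-up helper names (this is not the actual GQLDF code):

import duckdb
import pandas as pd

def insert_if_outside_window(con, table: str, df: pd.DataFrame) -> None:
    # current window of data already in the lake table
    min_ts, max_ts = con.execute(
        f"SELECT MIN(timestamp), MAX(timestamp) FROM {table}"
    ).fetchone()
    if min_ts is not None:
        # keep only rows strictly outside [min_ts, max_ts] -> no double records
        df = df[(df["timestamp"] < min_ts) | (df["timestamp"] > max_ts)]
    if df.empty:
        return
    con.register("new_rows", df)
    con.execute(f"INSERT INTO {table} SELECT * FROM new_rows")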
kdetry commented May 27, 2024

I have the following ideas as potential solutions:

Unique constraints in DuckDB:
Add a unique constraint (on ID) in DuckDB to enforce data integrity. This automatically prevents duplicate entries from being inserted into the table.
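
A minimal sketch of this idea, assuming DuckDB's PRIMARY KEY and ON CONFLICT support; the table and column names are made up:

import duckdb

con = duckdb.connect("lake.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_predictions (
        id        VARCHAR PRIMARY KEY,
        timestamp BIGINT,
        payload   VARCHAR
    )
""")
# inserting the same id again becomes a no-op instead of a duplicate row
con.execute("""
    INSERT INTO raw_predictions VALUES ('pred-1', 1716400000, 'example')
    ON CONFLICT (id) DO NOTHING
""")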
Incremental appends:
Track the last update point and only append records newer than that point to both the CSV and the DuckDB table.

Cleanup scripts:
For the whole CSV/raw/ETL process, we could create cleaner drop scripts.

calina-c commented Jun 3, 2024

Why complicate with timestamps or IDs? Can't we do a simple differential of identical rows? Before saving/updating from the CSV, remove rows that already exist in the table.
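
For reference, a sketch of what that differential could look like in DuckDB (names are illustrative; note the set-difference scans the existing table on every insert):

import duckdb
import pandas as pd

def append_only_new_rows(con, table: str, df: pd.DataFrame) -> None:
    con.register("incoming", df)
    # set-difference against the existing table, then append what remains
    con.execute(f"""
        INSERT INTO {table}
        SELECT * FROM incoming
        EXCEPT
        SELECT * FROM {table}
    """)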

idiom-bytes commented Jun 7, 2024

Add a unique constraint (on ID) in DuckDB to enforce data integrity. This automatically prevents duplicate entries from being inserted into the table.
Why complicate with timestamps or IDs? Can't we do a simple differential of identical rows? Before saving/updating from the CSV, remove rows that already exist in the table.

Because we're working with timeseries data and can leverage those two things to build a much more efficient system...

You guys are advocating for vanilla fetch-and-join strategies that lead to N<>N joins with multiple O(n^2) passes. Please go ahead and try doing this with 1 million rows of anything.

The checks you are suggesting, which look for duplicates by scanning the whole database on every insert, add overhead and are extremely expensive.

There is a reason we've worked on this for months: to build a pipeline that processes records efficiently, does as few scans and joins as possible, and can scale.

[Users should not be deleting their lake.]

If anything, focus on this: enforce the SLAs.

PROPOSED/EXAMPLE solution: do a simple check at the CLI level when "lake raw update" or "lake etl update" is called.

Warn the user they'll get duplicates because their csv.head < max(raw_tables.head, etl_tables.head) (i.e. they deleted things). Then prompt the user with a hard stop, so they are forced to drop their records until csv.head == table.head.

pseudo code

def cli_helper_prepare_tables_for_processing():
    # checkpoints: latest timestamp in the CSVs vs. latest in the raw/etl tables
    start = gql_df.get_csvs_max_timestamp()
    end = max(gql_df.get_tables_max_timestamp(), etl.get_tables_max_timestamp())

    # check if tables diverged from the csv (i.e. csv rows were deleted)
    if start < end:
        user_answer = prompt(
            "CSV and tables diverged, this will cause duplicates. Trim the tables?",
            options=["y"],
        )
        if user_answer is None:
            return

        # trim raw + etl tables back to the csv checkpoint, then resume updates
        db.drop_tables(raw, start, end)
        db.drop_tables(etl, start, end)
        _do_both_raw_and_etl_update(csv.head, lake_ss.end_ts)

In this example, we simply check whether csv.head diverged from raw_tables.head and etl_tables.head. Once, when the CLI is first run.

Notice that we're prompting the user to drop records rather than just doing it for them. No black boxes. Nothing "smart". Just simple checks, leveraging the system, and then resuming things.

This is a much more efficient way of dealing with this. It does everything through timestamps: we look up our checkpoint, then use the drop() functionality to bring the CSVs, raw tables, and ETL tables to the same mark.

This is a better way of doing things because we're not adding wasteful checks and computation. We're simply looking up our checkpoints (int timestamps) and leveraging the existing functionality to manage the lake. Further, we're stressing the SLAs to the user.

Also, this is only done once, at the very beginning of the pipeline, outside its main loop and operations, so it carries no additional overhead.
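
As a sketch of that trim step, assuming a timestamp column and hypothetical table names (the existing drop commands would be the real mechanism):

import duckdb

def trim_tables_to_checkpoint(db_path: str, tables, checkpoint_ts: int) -> None:
    con = duckdb.connect(db_path)
    for table in tables:
        # delete every lake row newer than the CSV checkpoint,
        # so csv.head == table.head before the update resumes
        con.execute(f"DELETE FROM {table} WHERE timestamp > ?", [checkpoint_ts])
    con.close()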

[Repeating Patterns]
The changes being recommended are not efficient. Again, they add too much "defensive code to handle edge cases" rather than relying on the pattern/algorithm. Please do not configure the DB to auto-resolve conflicts, and do not do multiple N<>N row joins with historical data.

These do not scale.
