
fix: Removed unnecessary and problematic column caching #2352

Merged
merged 6 commits into meltano:main from raulbonet:rb/2325/deprecate-caching on Apr 16, 2024

Conversation

raulbonet
Contributor

@raulbonet raulbonet commented Mar 30, 2024

fixes #2325


📚 Documentation preview 📚: https://meltano-sdk--2352.org.readthedocs.build/en/2352/


codspeed-hq bot commented Mar 30, 2024

CodSpeed Performance Report

Merging #2352 will not alter performance

Comparing raulbonet:rb/2325/deprecate-caching (3f50ae4) with main (1d1afbe)

Summary

✅ 6 untouched benchmarks


codecov bot commented Mar 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.97%. Comparing base (4f07b67) to head (3f50ae4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2352      +/-   ##
==========================================
- Coverage   88.98%   88.97%   -0.01%     
==========================================
  Files          54       54              
  Lines        4767     4763       -4     
  Branches      926      924       -2     
==========================================
- Hits         4242     4238       -4     
  Misses        364      364              
  Partials      161      161              

☔ View full report in Codecov by Sentry.

@pnadolny13
Contributor

pnadolny13 commented Apr 1, 2024

I agree that this is a bug and that it should be fixed, but I also worry about reverting back to the poor performance we were seeing in target-postgres MeltanoLabs/target-postgres#192 (i.e. prepare table took 30s per table) as a result. I haven't dug into this at all, but I wonder if there's a simple alternative way to fix the cache rather than removing it completely. cc @edgarrmondragon in case you have ideas.

@edgarrmondragon
Collaborator

edgarrmondragon commented Apr 1, 2024

> I agree that this is a bug and that it should be fixed, but I also worry about reverting back to the poor performance we were seeing in target-postgres MeltanoLabs/target-postgres#192 (i.e. prepare table took 30s per table) as a result. I haven't dug into this at all, but I wonder if there's a simple alternative way to fix the cache rather than removing it completely. cc @edgarrmondragon in case you have ideas.

Yeah, I worry too about reverting without more context. A closed issue is linked from this PR, so I'd prefer that a new issue be opened with clear instructions on how to reproduce the bad behavior.

@raulbonet
Contributor Author

Hi guys,

Thank you for your answers. I think either I am failing to understand the code, or we are not understanding each other.

  1. The column caching is currently not even being used in target-postgres.

The method get_table_columns() of the SDK, where the caching was implemented, is overridden there. This is precisely why your current tests are not failing in target-postgres:

def get_table_columns(  # type: ignore[override]
    self,
    schema_name: str,
    table_name: str,
    connection: sa.engine.Connection,
    column_names: list[str] | None = None,
) -> dict[str, sa.Column]:
    """Return a mapping of table columns.

    Overridden to support schema_name.

    Args:
        schema_name: schema name.
        table_name: table name to get columns for.
        connection: database connection.
        column_names: A list of column names to filter to.

    Returns:
        A mapping of column names to column objects.
    """
    inspector = sa.inspect(connection)
    columns = inspector.get_columns(table_name, schema_name)

    return {
        col_meta["name"]: sa.Column(
            col_meta["name"],
            col_meta["type"],
            nullable=col_meta.get("nullable", False),
        )
        for col_meta in columns
        if not column_names
        or col_meta["name"].casefold() in {col.casefold() for col in column_names}
    }
  2. I am precisely trying to deprecate some of these custom methods in target-postgres in order to use the built-in ones from the SDK and, in doing so, I started to leverage the caching and tests started to fail.

  3. The issue was closed by mistake: I merged the changes into my personal fork, which also closed the issue automatically; I have since re-opened it.

@edgarrmondragon, can you please tell me what context you are missing? As for reproducing the error, that is also covered in the original issue.
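For what it's worth, the failure mode in #2325 can be reduced to a few lines. This is a minimal sketch with hypothetical names (`Connector`, `_tables` as a stand-in for the real database), not the SDK's actual implementation: a connector shared between sinks caches a table's columns on first access, and a later schema change made through the same connector is never reflected in the cache.

```python
# Hypothetical minimal reproduction of the stale-cache problem this PR fixes.
class Connector:
    def __init__(self):
        self._column_cache = {}  # table name -> cached column names
        self._tables = {"users": ["id", "name"]}  # stand-in for the real DB

    def get_table_columns(self, table):
        # The problematic pattern: populate the cache once, never invalidate.
        if table not in self._column_cache:
            self._column_cache[table] = list(self._tables[table])
        return self._column_cache[table]

    def add_column(self, table, column):
        # DDL goes straight to the "database", bypassing the cache.
        self._tables[table].append(column)


shared = Connector()
before = shared.get_table_columns("users")  # cache is populated here
shared.add_column("users", "email")         # another sink evolves the schema
after = shared.get_table_columns("users")   # still returns the stale cache

assert "email" not in after  # the column exists in the DB but not in the cache
```

With the caching removed, `get_table_columns` would re-inspect the table and pick up the new column on every call, at the cost of the extra round trips discussed above.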

@raulbonet
Contributor Author

raulbonet commented Apr 2, 2024

@pnadolny13 regarding ideas on how to fix this while still leveraging caching, if I may, it comes back to what I was saying here:

I think the separation of concerns between SQLConnection and SQLSink is currently a bit unclear, which in my opinion is why target-postgres ended up diverging so much from the Meltano SDK: the methods in SQLConnection try both to encapsulate wrappers for common operations and to manage the optimization of those operations.

BUT, at the same time, the SQLConnection is reused, while what gets created every time there is a schema change is a SQLSink, not a SQLConnection.

And whenever you want to do something slightly out of the box, you need to override the whole method instead of being able to call super() and then do your custom thing afterwards (which is what happened in target-postgres).

If we maintain this logic (a sink is supposed to have a stable schema), then all caching logic has to live on the Sink, including the management of transactions (what you do in Postgres).

However, I am not sure the problems you had in target-postgres came from this. In my opinion, the biggest problem I currently see in target-postgres (I mentioned it somewhere, though I don't remember where) is that prepare_table() is called inside process_batch(), which drastically increases the number of calls.

But maybe I am just not understanding the code, I am the new guy here! :)
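The call-count concern above can be made concrete with a toy sketch (hypothetical functions, a counter standing in for the expensive inspection/DDL work): preparing the table inside `process_batch()` pays the cost once per batch, while preparing it once when the sink is created pays it once per (stream, schema) pair, since a sink's schema is assumed stable for its lifetime.

```python
# Toy comparison of per-batch vs. once-per-sink table preparation.
calls = {"per_batch": 0, "per_sink": 0}


def prepare_table(counter_key):
    calls[counter_key] += 1  # stands in for expensive inspection/DDL


def process_batches_naive(n_batches):
    # Pattern described for target-postgres: prepare inside every batch.
    for _ in range(n_batches):
        prepare_table("per_batch")


def process_batches_prepared_once(n_batches):
    # Prepare once up front; the schema is stable for the sink's lifetime.
    prepare_table("per_sink")
    for _ in range(n_batches):
        pass  # just load the batch


process_batches_naive(100)
process_batches_prepared_once(100)
# calls["per_batch"] == 100, calls["per_sink"] == 1
```

Under this framing, the growth in preparation calls is linear in the number of batches, which is already enough to explain multi-second `prepare table` times on wide tables without any caching at the connector level.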

@edgarrmondragon edgarrmondragon changed the title fix: deprecate column caching fix: Removed unnecessary and problematic column caching Apr 12, 2024
@edgarrmondragon
Collaborator

edgarrmondragon commented Apr 12, 2024

I think I understand the context a bit better now. It seems that #1864 made the sort of caching introduced by #1779 unnecessary?

Note: I think this is the same cause behind MeltanoLabs/target-snowflake#165 so we may wanna apply these changes there too?

@edgarrmondragon
Collaborator

I'm happy to merge this and ship a prerelease that we can test against MeltanoLabs/target-postgres :)

@edgarrmondragon edgarrmondragon added this pull request to the merge queue Apr 16, 2024
Merged via the queue into meltano:main with commit 834ea2d Apr 16, 2024
30 checks passed

Successfully merging this pull request may close these issues.

bug: Cache SQL columns and schemas does not work with a shared connector
4 participants