
Lots of calls to the cache to get Schema during the same query #6193

Closed · SebC99 opened this issue Nov 6, 2019 · 25 comments
Labels: type:feature New feature or improvement of existing feature

@SebC99 (Contributor) commented Nov 6, 2019

Similar to #6060, we instrumented our Parse Server production deployment with AWS X-Ray (we're on an Elastic Beanstalk stack), and it seems we make a lot of requests to the cache to load the schema for any single request (the count is very high for some of our cloud functions).

When looking at the code, a weird thing is that we often call the loadSchema() method and, once it's fulfilled, call getOneSchema() on the resulting schemaController; both of them fetch the full schema from the cache, which is not a free call with a big schema (around 5 ms each in our case).

Is there a specific reason for that? Is there room for improvement?

@SebC99 (Contributor, Author) commented Nov 6, 2019

To give an example, each time the server receives a request, there are at least 4 cache.get operations for the MAIN_SCHEMA key, 2 for the _Session class, then 2 for _User.

Would there be a way to save the schema on the request or on the database operation, to minimize the number of calls during one request/cloud function?
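
A minimal sketch of that idea, assuming an Express-style request object. The helper and property names are hypothetical, not Parse Server's actual wiring (though databaseController.loadSchema() and getOneSchema() match the controller calls mentioned above):

```js
// Hypothetical helper: memoize the schemaController promise on the request
// so every schema lookup after the first is served from memory, not Redis.
function loadSchemaOnce(req, databaseController) {
  if (!req._schemaControllerPromise) {
    // First call in this request: keep the promise itself, so parallel
    // callers share the same in-flight lookup instead of racing.
    req._schemaControllerPromise = databaseController.loadSchema();
  }
  return req._schemaControllerPromise;
}

// Illustrative usage inside a request handler:
// const schemaController = await loadSchemaOnce(req, config.database);
// const userSchema = await schemaController.getOneSchema('_User');
```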

I have a cloud function where we run 3 queries in parallel, each including the same pointer field.
And for that cloud function alone, we see 24 calls to the MAIN_SCHEMA key through Redis...

Last question: why is it so important to preserve the exact order of operations through a queue in the cacheAdapter? In which cases would an out-of-order operation really be a problem? A sessionToken that has been invalidated but is still readable for a few milliseconds?
Thanks!

@dplewis (Member) commented Nov 7, 2019

I had an idea a while ago about a global singleton schema cache that gets updated every time a class or field is added or removed. I don't believe this is a problem for me personally, because I use enableSingleSchemaCache.

I believe the cache wasn't built that way in order to minimize side effects; nobody wanted too many changes to the controller.

We could leverage Redis pub/sub or a MongoDB change stream (real-time Mongo updates, 3.6+) to reduce side effects. Edit: we could use LiveQuery to watch the _Schema class... if only the schema weren't stored in the DB. This is an open source framework, after all.

I agree with you: although the schema cache has been improving, a better solution is needed.

How is your schema so big?

@davimacedo @acinader @TomWFox Thoughts? I mentioned this on Slack before.

@SebC99 (Contributor, Author) commented Nov 7, 2019

Not that big, but each schema find operation takes around 2-3 ms, so in the case above that's 50 ms on a cloud function that takes 80 ms overall...

@dplewis (Member) commented Nov 7, 2019

Feel free to add test cases so we know how many cache lookups you are doing: #6060 (comment)

Besides that, we need a solid solution. This problem is easily solved if a developer runs a single Parse Server instance; multiple instances running against the same DB is what I'm trying to figure out.

@dplewis (Member) commented Nov 7, 2019

@SebC99 How does this sound: we remove the schema cache in favor of a singleton object.

The singleton object would always be in sync with the _Schema class in the DB by leveraging the PubSub adapter. This mimics LiveQuery subscribed to the _Schema class.

It would support multiple instances if Redis is passed in, remove all side effects and memory leaks, and queries would be much, much faster because there would be no schema cache lookup.
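
A minimal sketch of that design, assuming the current node-redis client API; the channel name, the loadSchemaFromDB callback, and the module shape are illustrative, not the PubSub adapter's actual interface:

```js
const { createClient } = require('redis');

// Process-wide singleton: one schema copy per server instance,
// invalidated whenever any instance publishes a change notification.
let cachedSchema = null;

async function startSchemaSync(loadSchemaFromDB) {
  const subscriber = createClient({ url: process.env.REDIS_URL });
  await subscriber.connect();
  // Any message on the channel means "_Schema changed somewhere":
  // drop the local copy so the next read reloads from the DB.
  await subscriber.subscribe('schema-changed', () => {
    cachedSchema = null;
  });
  cachedSchema = await loadSchemaFromDB();
}

async function getSchema(loadSchemaFromDB) {
  if (!cachedSchema) {
    cachedSchema = await loadSchemaFromDB();
  }
  return cachedSchema;
}

// The instance that adds/removes a class or field notifies the others.
async function publishSchemaChange(publisher) {
  await publisher.publish('schema-changed', Date.now().toString());
}
```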

@acinader @davimacedo Thoughts?

@acinader (Contributor) commented Nov 7, 2019

sounds reasonable to me

@SebC99 (Contributor, Author) commented Nov 7, 2019

So with multiple servers we would still rely on Redis? I don't think Redis is an issue anyway.

But in which cases does the order of operations really matter? The only one I can think of is the user cache; sessions, roles, and schema shouldn't be problematic even with a slight desync, I guess.
So we could use a queue only for the user cache.

And then, queuing the get commands seems useless: we could simply take a kind of lock for set/put/del commands, and no lock for a get.
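
A sketch of that locking idea, purely illustrative and not the actual RedisCacheAdapter queue: writes are chained so they stay ordered relative to each other, while reads bypass the chain entirely:

```js
// Illustrative cache wrapper: ordered writes, lock-free reads.
class WriteOrderedCache {
  constructor(client) {
    this.client = client; // any client exposing promise-based get/set/del
    this.writeChain = Promise.resolve();
  }

  // Reads skip the queue and go straight to the backing store.
  get(key) {
    return this.client.get(key);
  }

  // put/del are appended to one chain, preserving their relative order.
  put(key, value) {
    const op = this.writeChain.then(() => this.client.set(key, value));
    // Keep the chain alive even if an individual write fails.
    this.writeChain = op.catch(() => {});
    return op;
  }

  del(key) {
    const op = this.writeChain.then(() => this.client.del(key));
    this.writeChain = op.catch(() => {});
    return op;
  }
}
```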

It's just a thought; I wouldn't know how to implement that at all, and I'm really not familiar with pub/sub, sorry!

(Typically, in our case we use a Redis cluster with a master and read-only replicas, so we are not fully synced between write and read operations, and we haven't seen any issue so far.)

@davimacedo (Member) commented:

It sounds like a good idea, and I believe it will be a huge improvement in terms of performance. In most applications the schema does not change that much, and we are paying a high cost to make sure it is always up to date. The PubSub approach should solve this problem.

@dplewis (Member) commented Nov 7, 2019

> So with multiple servers we would still rely on Redis?

Yes, for the schema fix I'm proposing. Most users already run LiveQuery with Redis. We could make it required for multiple instances; a single instance would just use event emitters.

> But in which cases does the order of operations really matter?

I honestly haven't looked into session, user, and roles cache optimization. The queue was built specifically for schema caching (and will be removed). We can definitely look into it at some point.

> I'm really not familiar with pub/sub, sorry!

Here is how it works: you subscribe to a channel, and you get updates whenever someone publishes to that channel.
https://docs.parseplatform.org/parse-server/guide/#scalability
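
For instance, a bare-bones pub/sub round trip with the node-redis client (the channel name is illustrative):

```js
const { createClient } = require('redis');

async function demo() {
  const sub = createClient();
  const pub = createClient();
  await sub.connect();
  await pub.connect();

  // Subscribe to a channel...
  await sub.subscribe('demo-channel', message => {
    console.log('received:', message);
  });

  // ...and every publish on that channel reaches all subscribers.
  await pub.publish('demo-channel', 'schema changed');
}

demo().catch(console.error);
```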

I have a question. Both LiveQuery and the RedisCacheAdapter take a redisURL, and it's recommended to use two different databases. Why? Does Redis pub/sub store anything in the database? I don't think so.

@acinader (Contributor) commented Nov 7, 2019

Ideally, we wouldn't need to require Redis. Postgres has NOTIFY/LISTEN and, as you point out, Mongo has a mechanism for this too, so we should be able to do this without bringing Redis into the mix?

@dplewis (Member) commented Nov 7, 2019

I'm sure GraphQL has a similar feature too.

@davimacedo (Member) commented:

I think GraphQL will not help in this context since it sits in the API layer. I'd go with our PubSubAdapter, and we could, in the long run, provide multiple options. In the beginning we could offer Redis for multi-process deployments and an in-memory one for single-process deployments (not requiring Redis to be installed). I think Postgres and MongoDB adapters are also a good idea, but I wouldn't rely only on them in the beginning, because only the most recent versions have the live-stream functionality available.

@dplewis (Member) commented Nov 8, 2019

Postgres has supported LISTEN/NOTIFY since version 7 (we test against 9.5).
MongoDB has supported real-time updates since 3.6.

I'll have some time next week to work on this.

@vitaly-t For Postgres I'm looking to listen for changes made to the _SCHEMA table (maybe we need a trigger for that?). Do you have example code for LISTEN/NOTIFY in pg-promise?
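
A hedged sketch of what that could look like: a trigger that calls pg_notify on every change to the schema table, plus pg-promise's documented direct-connection pattern for LISTEN. The table, function, and channel names are illustrative:

```js
const pgp = require('pg-promise')();
const db = pgp(process.env.DATABASE_URL);

// One-time setup: fire a NOTIFY whenever "_SCHEMA" changes.
const setupSQL = `
  CREATE OR REPLACE FUNCTION notify_schema_change() RETURNS trigger AS $$
  BEGIN
    PERFORM pg_notify('schema_change', TG_OP);
    RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

  DROP TRIGGER IF EXISTS schema_change_trigger ON "_SCHEMA";
  CREATE TRIGGER schema_change_trigger
    AFTER INSERT OR UPDATE OR DELETE ON "_SCHEMA"
    FOR EACH ROW EXECUTE PROCEDURE notify_schema_change();
`;

async function listenForSchemaChanges(onChange) {
  await db.none(setupSQL);
  // A "direct" connection is pg-promise's way to hold a socket for LISTEN.
  const conn = await db.connect({ direct: true });
  conn.client.on('notification', msg => onChange(msg.payload));
  await conn.none('LISTEN $1~', 'schema_change');
}
```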

@dplewis added the enhancement and type:question (Support or code-level question) labels Nov 8, 2019
@vitaly-t (Contributor) commented Nov 8, 2019

@dplewis (Member) commented Dec 24, 2019

@vitaly-t I just saw that; I will probably add it in the future.

Quick update: I got this working, but I'm going to leverage enableSingleSchemaCache for backwards compatibility. Will try to get it in ASAP.

@Moumouls (Member) commented Apr 4, 2020

An optimization can also be done with the GraphQL DataLoader tool:
https://github.com/graphql/dataloader
Parallel cache requests on the same server are then batched into one during the same Node.js event-loop tick.
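
For example (the fetchSchemas batch function below is a stub standing in for a real cache/DB round trip):

```js
const DataLoader = require('dataloader');

// Stub batch function: one round trip for all class names in the batch.
async function fetchSchemas(classNames) {
  console.log('one batched fetch for:', classNames);
  return Object.fromEntries(classNames.map(n => [n, { className: n }]));
}

const schemaLoader = new DataLoader(async classNames => {
  const byName = await fetchSchemas(classNames);
  // DataLoader requires results in the same order as the requested keys.
  return classNames.map(name => byName[name]);
});

// Three parallel loads in the same tick trigger a single fetchSchemas call,
// and the two '_User' loads are deduplicated.
Promise.all([
  schemaLoader.load('_User'),
  schemaLoader.load('_User'),
  schemaLoader.load('_Session'),
]).then(schemas => console.log(schemas.length));
```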

@SebC99 (Contributor, Author) commented Jun 6, 2020

@dplewis Have you had time to think about this?
To illustrate further, we've improved our tracing, and this is what a simple cloud function looks like for us with enableSingleSchemaCache activated:
[X-Ray trace screenshot]

The number of calls to the __SCHEMA__MAIN_SCHEMA key is incredible.
And since we use Redis for the cache adapter, all these calls go to Redis, whereas we could keep an in-memory cache for the schema (the session cache has to be shared between multiple servers, but the schema cache?).

I don't know why the calls to Redis sometimes take longer, as the Redis cluster is large enough to handle this load, so it may be the queue mechanism... Not sure about this.

Happy to discuss this further!
Thanks

@dplewis (Member) commented Jun 6, 2020

Yeah, I’m trying to figure out what to remove from the current implementation. Since the solution to sharing across multiple servers has been changed so many times over the years, most of it can be removed.

I’ll spend some time next week on a PR

@SebC99 (Contributor, Author) commented Jun 18, 2020

I was looking at the way MongoDB's Collection.watch works, and it seems almost easy to use.
In MongoSchemaCollection, we could start the watch stream in the constructor and save the latest version of the schema in a static variable (we just need to ensure we don't start multiple streams while instantiating multiple MongoSchemaCollections).

Then the SchemaController could ignore the cache completely and directly call the storageAdapter's getAllClasses method, which would return the value stored by MongoSchemaCollection.
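
A minimal sketch of that approach with the Node MongoDB driver (collection and variable names are illustrative; change streams require a replica set and MongoDB 3.6+):

```js
const { MongoClient } = require('mongodb');

let schemaCache = null;  // latest copy of all class schemas
let changeStream = null; // guard against opening multiple streams

async function initSchemaWatch(uri, dbName) {
  const client = await MongoClient.connect(uri);
  const coll = client.db(dbName).collection('_SCHEMA');

  schemaCache = await coll.find({}).toArray();

  if (!changeStream) {
    changeStream = coll.watch();
    changeStream.on('change', async () => {
      // Simplest strategy: reload everything on any change to _SCHEMA.
      schemaCache = await coll.find({}).toArray();
    });
  }
}

// What storageAdapter.getAllClasses() could return, without touching Redis.
function getAllClasses() {
  return schemaCache;
}
```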

I don't really see the use of the PubSub adapter here... Am I missing something?
The only thing I don't really know is where to store the schema: in the MongoSchemaCollection, in the storageAdapter, or in the SchemaController?

@dplewis (Member) commented Jun 18, 2020

We don't need pub/sub anymore, since we require MongoDB 3.6 as a minimum.

  1. Initialize the DatabaseController once (it is currently created for every request).
  2. Create a new SingleCache and refactor the existing schema controller (see the sketch below).
  3. Minimal cleanup.
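
A rough sketch of what such a SingleCache could look like: one mutable in-memory copy of the schema per process. The shape and names are hypothetical, not the exact API that shipped, though the fix that eventually closed this issue (#7214) moved in this direction:

```js
// Illustrative module-level singleton.
const SingleCache = {
  allClasses: null, // array of class schemas, or null when stale/unloaded

  all() {
    return this.allClasses;
  },

  get(className) {
    return (this.allClasses || []).find(s => s.className === className);
  },

  put(allClasses) {
    this.allClasses = allClasses;
  },

  // Called whenever a class or field is added/removed.
  clear() {
    this.allClasses = null;
  },
};

module.exports = SingleCache;
```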

@SebC99 (Contributor, Author) commented Jun 19, 2020

Well then, that's a bit more complicated, as it means moving the databaseController initialization... I'm afraid I won't be of much help.
I thought a simpler approach would work, something like this:
hulab@253708b

@dplewis (Member) commented Jun 19, 2020

Good work! I started looking at past PRs (from when it was singleCache). Changing the databaseController back should be simple.

09bd9e3#diff-d5ca9e73131b4f7750feeb9b51c43efb

If you want to open a PR, I can have a look at it and add Postgres support as well.

@SebC99 (Contributor, Author) commented Jun 19, 2020

Sure. The only thing that's harder to change is the clearSchemaCache method of the PromiseRouter, as the databaseController no longer has access to the schema storage...
But it's only used without the enableSingleSchemaCache flag.

@SebC99 (Contributor, Author) commented Jun 19, 2020

Here it is: #6743
I'm not a big fan of the plain object used for the singleSchemaCache in the SchemaController, but it was the simplest thing to do.

The stale bot added the stale label Nov 8, 2020
@mtrezza removed the type:question (Support or code-level question) label Nov 8, 2020
@dplewis (Member) commented Mar 16, 2021

Closing via #7214

@dplewis closed this as completed Mar 16, 2021
@mtrezza added the type:feature label and removed the type:improvement label Dec 6, 2021