Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option for keeping local cache of recent data. #585

Merged
merged 15 commits into from
Dec 27, 2023

Conversation

trueleo
Copy link
Contributor

@trueleo trueleo commented Dec 22, 2023

Description

This PR adds local cache / hot tier for Parseable.

This option can be enabled by setting following env vars
P_CACHE_DIR - Local Path for file cache
P_CACHE_SIZE - Size for cache in human readable size ( mb/mib/gib/gb )

When these flags are set, sync flow will move the parquet from staging into the specified cache directory instead of deleting it. Any LRU cached entry is deleted to satisfy the cache constraint upon insertion. LocalCacheManager is responsible for updating and persisting this data structure.


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

@trueleo trueleo changed the title Localcache Add option for keeping local cache of recent data. Dec 22, 2023
server/src/option.rs Outdated Show resolved Hide resolved
server/src/option.rs Outdated Show resolved Hide resolved
@@ -1,140 +0,0 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is unused now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create a separate cleanup PR with details on why files are not used. this change is clearly unrelated to caching.

Copy link
Contributor Author

@trueleo trueleo Dec 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is clearly unrelated to caching.

This file was being used to combine two execution plans ( remote and memory ), but now because of conditional addition of caching there are now three plans to combine. I had to refactor related code in the schema provider file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix commit message with details

Copy link
Member

@nitisht nitisht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the commit message with more details on why we added this and how it works.

@nitisht
Copy link
Member

nitisht commented Dec 24, 2023

Missing items

  • Prometheus stats on cache as we discussed.
  • Cache info printed on the startup banner under storage.

@nitisht
Copy link
Member

nitisht commented Dec 24, 2023

What does this mapping in the .cache.json file mean?

[ 
"home/nitish/Development/parseablehq/parseable/data/teststream/date=2023-12-24/hour=04/minute=57/nitish-x1.data.parquet",
"/home/nitish/Development/parseablehq/parseable/cache/teststream/date=2023-12-24.hour=04.minute=57.nitish-x1.data.parquet"
]

@nitisht
Copy link
Member

nitisht commented Dec 24, 2023

If I restart the server and change the cache size via P_CACHE_SIZE, the .cache.json still has old value. We need to update the .cache.json so current specified size is honoured.

@theteachr
Copy link
Contributor

theteachr commented Dec 24, 2023

What does this mapping in the .cache.json file mean?

@nitisht It's a mapping from store path to where the file is stored in the local cache.

@nitisht
Copy link
Member

nitisht commented Dec 24, 2023

What does this mapping in the .cache.json file mean?

[ 
"home/nitish/Development/parseablehq/parseable/data/teststream/date=2023-12-24/hour=04/minute=57/nitish-x1.data.parquet",
"/home/nitish/Development/parseablehq/parseable/cache/teststream/date=2023-12-24.hour=04.minute=57.nitish-x1.data.parquet"
]

What does this mapping in the .cache.json file mean?

@nitisht It's a mapping from store path to where the file is stored in the local cache.

That is clear, but

  • Both files are local, still there is difference in path format - leading / vs no /
  • Where do we use this mapping - that info should be added as comments and commit message

@trueleo
Copy link
Contributor Author

trueleo commented Dec 25, 2023

Missing items

* Prometheus stats on cache as we discussed.

* Cache info printed on the startup banner under storage.

I will do this in a separate PR.

@nitisht
Copy link
Member

nitisht commented Dec 25, 2023

Missing items

* Prometheus stats on cache as we discussed.

* Cache info printed on the startup banner under storage.

I will do this in a separate PR.

Let's finish in this PR

This PR adds local cache / hot tier for Parseable.

This option can be enabled by setting following env vars
P_CACHE_DIR - Local Path for file cache
P_CACHE_SIZE - Size for cache in human readable size ( mb/mib/gib/gb )

When these flags are set, sync flow will move the parquet from staging into
the specified cache directory instead of deleting it. Any LRU cached entry is
deleted to satisfy the cache constraint upon insertion. LocalCacheManager is
responsible for updating and persisting this data structure.
server/src/option.rs Outdated Show resolved Hide resolved
server/src/storage/s3.rs Outdated Show resolved Hide resolved
version: "v1".to_string(),
current_size: 0,
capacity,
files: Cache::new(100),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 100 only here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backing hashlru is capacity based. The capacity is arbitrary and is increased when needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who increases? user or the backing cache?

@nitisht
Copy link
Member

nitisht commented Dec 25, 2023

  • Prometheus stats on cache as we discussed.
  • Cache info printed on the startup banner under storage.
  • Both files are local, still there is difference in path format - leading / vs no /
  • Where do we use this mapping - that info should be added as comments and commit message
  • If I restart the server and change the cache size via P_CACHE_SIZE, the .cache.json still has old value. We need to update the .cache.json so current specified size is honoured.
  • What is the minimum allowed value for P_CACHE_SIZE? We need to fail if smaller values are set. IMO 1GiB is a good number.

@nitisht nitisht merged commit 631c8f4 into parseablehq:main Dec 27, 2023
6 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants