Cache mechanism is not working related to metadata.json in s3 #14221

Closed
1 task done
mehmetbutgul opened this issue Mar 31, 2024 · 3 comments · Fixed by #14224 or #14227
@mehmetbutgul (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

@danilojsl

What are you working on?

I am working on developing code.

Current Behavior

The cache mechanism for metadata.json in S3 is not working.
Whenever the pretrained() method is invoked, metadata.json is downloaded again and again.
For example, I want to download a model that is roughly 3 MB in size,
but the source code downloads the ~10 MB metadata.json for every model, every time pretrained() is used.
The underlying problem is that a cache exists, but it does not work and is never updated.

Expected Behavior

When metadata.json is downloaded, keep it in the cache, and check the cache before every download.
The cache duration also matters; it could perhaps be increased to 10 minutes.

Steps To Reproduce

Debugging is required to reproduce this.

AAA.pretrained(),
BBB.pretrained(),
....

private val repoFolder2Metadata: mutable.Map[String, RepositoryMetadata] =

This variable is never updated in the source code.


if (!needToRefresh) { // The condition is always false !!!

Different folder possibilities, such as public/models and clinical/models, also need to be considered.
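As a rough illustration of what a fix could look like, here is a minimal Scala sketch of a refresh flow that actually updates repoFolder2Metadata, keyed per repo folder so that public/models and clinical/models are cached separately. The RepositoryMetadata fields, refreshIntervalMs, and the downloadMetadataJson parameter are placeholder names for illustration only, not the actual Spark NLP internals.

```scala
import scala.collection.mutable

// Hypothetical shape of a cached entry; field names are assumptions,
// not the real Spark NLP RepositoryMetadata definition.
case class RepositoryMetadata(metadataFile: String, lastDownloadedAt: Long)

object MetadataCacheSketch {

  // Mirrors the map mentioned above, keyed by repo folder
  // (e.g. public/models, clinical/models).
  private val repoFolder2Metadata: mutable.Map[String, RepositoryMetadata] =
    mutable.Map.empty

  // Assumed refresh interval; the issue suggests something like 10 minutes.
  private val refreshIntervalMs: Long = 10 * 60 * 1000

  // downloadMetadataJson stands in for the real S3 download call.
  def getMetadata(repoFolder: String, downloadMetadataJson: String => String): String = {
    val now    = System.currentTimeMillis()
    val cached = repoFolder2Metadata.get(repoFolder)

    // Refresh when there is no entry for this folder, or the entry is stale.
    val needToRefresh = cached.forall(m => now - m.lastDownloadedAt > refreshIntervalMs)

    if (!needToRefresh) {
      // Cache hit: reuse the metadata already downloaded for this repo folder.
      cached.get.metadataFile
    } else {
      // Cache miss or stale entry: download and, crucially, update the map,
      // which (per this report) the current code never does.
      val fresh = downloadMetadataJson(repoFolder)
      repoFolder2Metadata.update(repoFolder, RepositoryMetadata(fresh, now))
      fresh
    }
  }
}
```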

Spark NLP version and Apache Spark

sparknlp==5.3.1 / 5.3.2
The Apache Spark version is not important for reproducing this.

Type of Spark Application

No response

Java Version

Java 8

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

@maziyarpanahi (Member)

Do you have a PR to suggest a caching mechanism? It has to be:

  • time-based (expires after a duration)
  • session-based (if the session dies and we are in a new session, we MUST download a new metadata.json)
  • there must be a force_download boolean so users can override it, either enabling caching or disabling it to keep the current default behavior. (Only the code suggests we cache metadata.json; we have never mentioned this in our docs or anywhere else. As far as anybody knows, if we make a change in metadata.json you see it immediately!)

That said, the very best solution is to check metadata.json: if it was updated, we MUST download it; if the file hasn't changed, we can skip it for that session/application cycle.
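A minimal sketch of that last idea, assuming the remote check is something like an S3 HEAD request that returns an ETag or last-modified value. The names fetchRemoteETag, downloadMetadata, and cachedETags are hypothetical and not Spark NLP's actual API; the force_download override from the list above is wired in as a parameter.

```scala
import scala.collection.mutable

object ConditionalMetadataDownload {

  // Cached ETag (or last-modified value) per repo folder for the current session.
  private val cachedETags: mutable.Map[String, String] = mutable.Map.empty

  def metadataIfChanged(
      repoFolder: String,
      forceDownload: Boolean,                // the force_download override
      fetchRemoteETag: String => String,     // cheap HEAD-style check against S3
      downloadMetadata: String => String     // full metadata.json download
  ): Option[String] = {
    val remoteETag = fetchRemoteETag(repoFolder)
    val unchanged  = !forceDownload && cachedETags.get(repoFolder).contains(remoteETag)

    if (unchanged) {
      // File has not changed: skip the download for this session/application cycle.
      None
    } else {
      // File changed (or download forced): fetch it and remember the new ETag.
      cachedETags.update(repoFolder, remoteETag)
      Some(downloadMetadata(repoFolder))
    }
  }
}
```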

@mehmetbutgul (Contributor, Author)

Hi @maziyarpanahi,
Thanks for your comments.
Following your comments, I opened a PR for this issue and implemented your last idea:

"That said, the very best solution is to check metadata.json: if it was updated, we MUST download it; if the file hasn't changed, we can skip it for that session/application cycle."

I agree with you; this seems to be the best solution.
PR --> #14224

@mehmetbutgul linked a pull request on Apr 3, 2024 that will close this issue.
@maziyarpanahi (Member)

Many thanks @mehmetbutgul - I left it to Danilo to review; I'll make sure to include it in tomorrow's release. Thanks again for your contribution. 🚀

@maziyarpanahi linked a pull request on Apr 5, 2024 that will close this issue.