Cache mechanism is not working related to metadata.json in s3 #14221

mehmetbutgul · 2024-03-31T14:40:47Z

Is there an existing issue for this?

I have searched the existing issues and did not find a match.

Who can help?

What are you working on?

I am working on developing code.

Current Behavior

The cache for metadata.json in S3 is not working.
Every time pretrained() is invoked, metadata.json is downloaded again.
For example, I want to download a model of approximately 3 MB,
but the code downloads the ~10 MB metadata.json for every model whenever pretrained() is used.
The main problem is that a cache exists, but it does not work because it is never updated.

Expected Behavior

When metadata.json is downloaded, keep it in the cache and check the cache on every download.
The cache duration also matters. Maybe it could be increased to 10 minutes.

Steps To Reproduce

Reproducing this requires stepping through the code in a debugger while calling pretrained() on several annotators, for example:

AAA.pretrained(),
BBB.pretrained(),
....

private val repoFolder2Metadata: mutable.Map[String, RepositoryMetadata] =

spark-nlp/src/main/scala/com/johnsnowlabs/nlp/pretrained/S3ResourceDownloader.scala

Line 40 in 6b181a6

private val repoFolder2Metadata: mutable.Map[String, RepositoryMetadata] =

The variable above is never updated in the source code.

spark-nlp/src/main/scala/com/johnsnowlabs/nlp/pretrained/S3ResourceDownloader.scala

Line 57 in 6b181a6

if (!needToRefresh) {

if (!needToRefresh) { // The condition is always false !!!

Different repository folders, such as public/models and clinical/models, also need to be considered.
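To make the two points above concrete (a map that is actually updated, and one entry per repository folder), here is a minimal sketch of a time-based cache. It is not the Spark NLP implementation: `CachedMetadata`, `FolderMetadataCache`, and the injected `download` and `now` functions are hypothetical names chosen so the example runs without S3.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the real RepositoryMetadata class; fields are illustrative only.
final case class CachedMetadata(folder: String, body: String, fetchedAtMillis: Long)

// Keeps one metadata entry per repository folder and downloads again only
// after the time-to-live has elapsed. `download` and `now` are injected
// (assumptions of this sketch) so no network access is needed.
class FolderMetadataCache(
    ttlMillis: Long,
    download: String => String,
    now: () => Long = () => System.currentTimeMillis()) {

  private val cache = mutable.Map.empty[String, CachedMetadata]

  def get(folder: String): CachedMetadata = synchronized {
    cache.get(folder) match {
      case Some(entry) if now() - entry.fetchedAtMillis < ttlMillis =>
        entry // still fresh: no download
      case _ =>
        val fresh = CachedMetadata(folder, download(folder), now())
        cache.update(folder, fresh) // store the result so the next call can reuse it
        fresh
    }
  }
}
```

With a 10-minute `ttlMillis`, two consecutive calls to `get("public/models")` would trigger a single download, while `get("clinical/models")` would be fetched and cached separately.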

Spark NLP version and Apache Spark

sparknlp==5.3.1 / 5.3.2
The Apache Spark version does not matter for reproducing this.

Type of Spark Application

No response

Java Version

Java 8

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response


maziyarpanahi · 2024-04-01T16:55:06Z

Do you have any PR to suggest some caching mechanism? It has to be:

time-based (expires after a duration)
session-based (if the session dies and we are in a new session, we MUST download a new metadata.json)
There must be a force_download Boolean that lets users override this, either enabling caching or disabling it to keep the current default behavior. (Only the code suggests that we cache metadata.json; we have never mentioned this in our docs or anywhere else. As far as anybody knows, if we make a change in metadata.json you'll see it immediately!)

That said, the very best solution is to check metadata.json, if it was updated, we MUST download it. If the file hasn't changed, we shall skip it in that session/application cycle.
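The "check whether it changed" idea, combined with the session scope and the override flag from the list above, could look roughly like the sketch below. Everything here is an assumption made for illustration: `MetadataSource`, `ChangeAwareMetadataCache`, and `forceDownload` are hypothetical names, and `version()` stands for any cheap change marker (for example an ETag or last-modified value). It does not describe what PR #14224 actually does.

```scala
// Hypothetical remote interface: `version` returns a cheap change marker,
// `fetch` downloads the full metadata.json body.
trait MetadataSource {
  def version(): String
  def fetch(): String
}

class ChangeAwareMetadataCache(source: MetadataSource) {
  // Held in memory only, so a new session/application cycle starts empty.
  private var cached: Option[(String, String)] = None // (version, body)

  def get(forceDownload: Boolean = false): String = synchronized {
    val remoteVersion = source.version()
    cached match {
      case Some((version, body)) if !forceDownload && version == remoteVersion =>
        body // unchanged since the last download: skip the transfer
      case _ =>
        val body = source.fetch()
        cached = Some((remoteVersion, body))
        body
    }
  }
}
```

The design trades one small version check per call for avoiding the large download; `get(forceDownload = true)` always fetches again, which matches the uncached behavior users currently expect.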

mehmetbutgul · 2024-04-03T14:38:56Z

Hi @maziyarpanahi,
Thanks for your comments.
After your comments, I made a PR for the issue.
I implemented your last idea:
"That said, the very best solution is to check metadata.json, if it was updated, we MUST download it. If the file hasn't changed, we shall skip it in that session/application cycle."
I agree with you. This idea seems to be the best solution.
PR: #14224

maziyarpanahi · 2024-04-03T16:36:50Z

Many thanks @mehmetbutgul - I've left it to Danilo to review, and I'll make sure to include it in tomorrow's release. Thanks again for your contribution. 🚀

mehmetbutgul added the bug label Mar 31, 2024

mehmetbutgul assigned maziyarpanahi and danilojsl Mar 31, 2024

maziyarpanahi added Feature request and removed bug labels Apr 1, 2024

mehmetbutgul mentioned this issue Apr 3, 2024

Cache mechanism implementation for metadata.json #14224

Merged

mehmetbutgul linked a pull request Apr 3, 2024 that will close this issue

Cache mechanism implementation for metadata.json #14224

Merged

maziyarpanahi linked a pull request Apr 5, 2024 that will close this issue

release/533-release-candidate #14227

Merged

maziyarpanahi closed this as completed in #14227 Apr 5, 2024
