Do you have a PR to suggest for a caching mechanism? It has to be:
time-based (expires after a duration)
session-based (if the session dies and we are in a new session, we MUST download a new metadata.json)
There must also be a force_download Boolean so users can override the behavior, either enabling caching or disabling it to keep the current default. (Only the code suggests that we cache metadata.json; we have never mentioned this in our docs or anywhere else. As far as anybody knows, if we make a change in metadata.json, you'll see it immediately!)
That said, the very best solution is to check metadata.json: if it was updated, we MUST download it. If the file hasn't changed, we shall skip it in that session/application cycle.
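The requirements above can be sketched roughly as follows. This is only an illustration, not the Spark NLP implementation: `ChangeAwareMetadataStore`, `Snapshot`, `remoteVersion`, and `download` are hypothetical names, and `remoteVersion` is assumed to be a cheap call that returns some change token for metadata.json (for example an ETag or a last-modified timestamp). The map lives in memory only, so a new session starts with an empty cache.

```scala
import scala.collection.mutable

// Hypothetical snapshot of one downloaded metadata.json.
final case class Snapshot(version: String, content: String)

// Sketch: re-download only when the remote change token differs,
// or when the caller explicitly forces a download.
class ChangeAwareMetadataStore(
    remoteVersion: String => String, // assumption: cheap lookup of a change token
    download: String => String       // assumption: full download of metadata.json
) {
  // In-memory only: a new JVM session always starts empty.
  private val snapshots = mutable.Map.empty[String, Snapshot]

  def metadata(folder: String, forceDownload: Boolean = false): String = synchronized {
    val current = remoteVersion(folder)
    snapshots.get(folder) match {
      // Unchanged and not forced: reuse the cached copy.
      case Some(s) if !forceDownload && s.version == current => s.content
      // Missing, changed, or forced: download and update the cache.
      case _ =>
        val content = download(folder)
        snapshots(folder) = Snapshot(current, content)
        content
    }
  }
}
```

With this shape, a change to metadata.json is still picked up on the next call, while repeated pretrained() calls in the same session would avoid the full download when nothing has changed.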
Hi @maziyarpanahi,
Thanks for your comments.
After your comments, I made a PR for the issue.
I implemented your last idea: check metadata.json, and if it was updated, download it; if the file hasn't changed, skip it in that session/application cycle.
I agree with you. This idea seems to be the best solution.
PR --> #14224
Many thanks @mehmetbutgul - I left it to Danilo to review, and I'll make sure to include it in tomorrow's release. Thanks again for your contribution. 🚀
Is there an existing issue for this?
Who can help?
@danilojsl
What are you working on?
I am working on developing code.
Current Behavior
The cache mechanism for metadata.json in S3 is not working.
Every time the pretrained() method is invoked, metadata.json is downloaded again.
For example, I want to download a model of approximately 3 MB,
but the source code downloads the ~10 MB metadata.json for every model whenever pretrained() is used.
The main problem is that a cache exists, but it does not work and is never updated.
Expected Behavior
Once metadata.json is downloaded, keep it in the cache, and check the cache on every download.
The cache duration also matters. Maybe it can be increased to 10 minutes.
Steps To Reproduce
Debugging is needed to reproduce this:
AAA.pretrained(),
BBB.pretrained(),
....
private val repoFolder2Metadata: mutable.Map[String, RepositoryMetadata] =
spark-nlp/src/main/scala/com/johnsnowlabs/nlp/pretrained/S3ResourceDownloader.scala
Line 40 in 6b181a6
The variable above is never updated in the source code.
spark-nlp/src/main/scala/com/johnsnowlabs/nlp/pretrained/S3ResourceDownloader.scala
Line 57 in 6b181a6
if (!needToRefresh) { // The condition is always false !!!
Different folder possibilities also need to be considered, such as public/models and clinical/models.
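As a rough illustration of the expected behavior, here is a minimal sketch of a per-folder cache that is actually written to after each download and expires after a fixed duration. It is not the code in S3ResourceDownloader.scala: `MetadataCache`, `CachedMetadata`, and the `download` function are hypothetical, and the metadata is simplified to a String.

```scala
import scala.collection.mutable

// Hypothetical cache entry: the metadata content plus the time it was fetched.
final case class CachedMetadata(content: String, fetchedAtMillis: Long)

// Sketch of a time-based cache keyed by repository folder
// (for example "public/models" or "clinical/models").
class MetadataCache(ttlMillis: Long, now: () => Long = () => System.currentTimeMillis()) {
  private val byFolder = mutable.Map.empty[String, CachedMetadata]

  def get(folder: String)(download: String => String): String = synchronized {
    byFolder.get(folder) match {
      // Entry exists and is still fresh: no download.
      case Some(entry) if now() - entry.fetchedAtMillis < ttlMillis => entry.content
      // Missing or expired: download, then update the map so later calls hit the cache.
      case _ =>
        val content = download(folder)
        byFolder(folder) = CachedMetadata(content, now())
        content
    }
  }
}
```

A 10-minute duration, as suggested under Expected Behavior, would correspond to `new MetadataCache(10 * 60 * 1000L)`. Keying the map by folder keeps entries for different repositories separate.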
Spark NLP version and Apache Spark
sparknlp==5.3.1 / 5.3.2
The Apache Spark version does not matter for reproducing this.
Type of Spark Application
No response
Java Version
Java 8
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
No response