- 
                Notifications
    You must be signed in to change notification settings 
- Fork 532
update countDownloads for PaddleOCR #1812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| docsUrl: "https://www.paddleocr.ai/", | ||
| snippets: snippets.paddleocr, | ||
| filter: true, | ||
| countDownloads: `path_extension:"safetensors" OR path_extension:"pdiparams"`, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for info, if the user runs the https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main/PaddleOCR-VL-0.9B model this will work fine. However, if they run the two-stage pipeline, this rule will count one download for https://huggingface.co/PaddlePaddle/PaddleOCR-VL/blob/main/PaddleOCR-VL-0.9B/model.safetensors, and another one for https://huggingface.co/PaddlePaddle/PaddleOCR-VL/blob/main/PP-DocLayoutV2/inference.pdiparams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the reminder. We will reconsider how to implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I have updated the codes. It now only counts file whose full path is exactly inference.pdiparams, instead of ending with the pdiparams suffix.
Could you please review it again? If there are any issues or suggestions, please feel free to let me know and I will make further improvements.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for iterating! But I think this change will still result in the same situation described in my previous comment. Sorry my message was not clear, let me try to summarize:
- If the user only downloads PaddleOCR-VL-0.9Band runs it, then the download count will be incremented once.
- If they use the full pipeline to run both PP-DocLayoutV2andPaddleOCR-VL-0.9Bone after the other, then the download count will be incremented twice: one for each model. This is because files from both directories will match the rule.
If PP-DocLayoutV2 is never intended to be used without PaddleOCR-VL-0.9B, then perhaps it would make sense to add a rule for just PaddleOCR-VL-0.9B. But I don't know if that's the case, or if PP-DocLayoutV2 may be used independently. I'm pointing this out just to give you information about how to interpret the download counts; depending on the use cases you support, you may want to restrict counts to one of the model, or accept duplicate counts when both are used as part of the same pipeline.
As discussed offline, we think that using one repo per model is usually clearer and provides a lot more benefits in addition to cleaner download counts. For example, this repo is a copy of your PaddleOCR-VL-0.9B directory, and it can easily be used with the from_pretrained API as shown in the snippet I added to the model card. It also conveys useful metadata: it's an image-text-to-text model, it works with transformers, but it requires custom code to do so (as indicated by the custom_code tag, which was automatically inferred). It provides some starter code snippets when users click on the "Use this model" button, which are also inferred automatically based on the model characteristics.
If you need more time to work on a single model-per-repo structure, that's totally ok - we can merge this PR meanwhile, but keep in mind what the download rule means, and if possible try to make it match the use cases your models support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your comments. Please allow me to clarify our plan:
- Both the PaddleOCR-VL-0.9BandPP-DocLayoutV2models can be used independently. Therefore, we agree with your suggestion that the two models should be stored in separate repositories.
- The PaddleOCR-VL-0.9Bmodel will continue to be hosted in the PaddlePaddle/PaddleOCR-VL repository. However, we will move the relevant files to the root directory of the repository and remove the PaddleOCR-VL-0.9B subdirectory. This change will be completed as soon as possible, most likely by tomorrow.
- As for the PP-DocLayoutV2model currently present in the PaddlePaddle/PaddleOCR-VL repository, we will remove it at an appropriate time in the future. After that, thePP-DocLayoutV2model will only be available in the PaddlePaddle/PP-DocLayoutV2 repository.
Given the plan described above, would it be possible to proceed with merging the current code?
Thank you very much for your patience, guidance, and suggestions. If there is anything inappropriate in our approach, please feel free to point it out.
b16e36d    to
    4725e1e      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine for me, cc @Wauplin in case there's more feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic seems good to me! Thanks a lot @TingquanGao @pcuenca for the discussions. Also agree splitting the models in 2 repos is the right move.
No description provided.