TLD cache filelock error on read-only systems #116
Comments
Thank you @LaundroMat for reporting this issue. Could you send me the traceback? I am thinking about a solution, so I would like to know whether there is a Python exception I can catch. I will try to provide a fix in the next release. Right now I am thinking of adding a flag/parameter that tells urlextract to disable TLD updates.
Thanks for the quick follow-up! The full exception is
So the problem is that (as the
@LaundroMat And would you be OK using TLDs that might not be up to date on this read-only file system? Note: if you are not looking for URLs with exotic TLDs, you should be fine. Usually only those get removed/added.
Oh, sure, no problem. I'm happy to manually update the file once in a while.
I've been looking through the code, and I believe it's not a problem with updating the file. The issue is that the `_load_cached_tlds` function tries to get a file lock on the data file in order to read it, and a read-only environment won't allow that.
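To see why a read-only environment breaks this: a soft file lock works by *creating* a `.lock` file next to the data file, and creation is exactly what a read-only filesystem forbids, even when reading the data file itself would be fine. The following is a minimal stdlib sketch of the mechanism only, not urlextract's actual implementation (which delegates to a locking library):

```python
import os
import tempfile

def acquire_soft_lock(lock_path: str) -> int:
    """Create the lock file exclusively. O_EXCL makes creation atomic,
    so exactly one process can succeed; a second caller gets
    FileExistsError. On a read-only filesystem the os.open() call
    itself raises PermissionError, which is the failure in this issue."""
    return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)

def release_soft_lock(fd: int, lock_path: str) -> None:
    """Release the lock by closing and removing the lock file."""
    os.close(fd)
    os.remove(lock_path)

# Demo: a second acquisition fails while the first one holds the lock.
with tempfile.TemporaryDirectory() as d:
    lock_path = os.path.join(d, "tlds-alpha-by-domain.txt.lock")
    fd = acquire_soft_lock(lock_path)
    try:
        acquire_soft_lock(lock_path)
    except FileExistsError:
        print("lock already held")
    release_soft_lock(fd, lock_path)
```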
Yes, I agree. But I do not want to remove the file lock, because it prevents collisions when you run multiple instances of urlextract. It happened that one instance was updating the TLDs (it had just created the file) while another one read it at that moment, so the second instance loaded an empty, not-yet-updated file and could not find anything, because it had not loaded any TLDs. @LaundroMat Right now I am thinking about a different solution that is already in place: do you have any directory that you can write to?
And I see it is not in the documentation. I have to update it.
As you probably saw, I removed the file lock in a fork, and that works in the serverless environment. I fully understand and agree with the need for the lock, but instances are isolated in my serverless setup, so there's no risk. But of course, serverless is a very specific use case... The configurable cache directory is a great addition, but I can't write anywhere in the serverless filesystem. I could write to an S3 bucket or something similar, but that's a whole new can of worms and, I believe, too far out of the scope of your library. (Which, by the way, I'm very grateful for!)
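One middle ground between "remove the lock" and "keep the lock" would be to take the lock where the filesystem allows it and fall back to a lock-free read where it doesn't. This is only a sketch of that idea, not urlextract's code; `optional_lock` is a hypothetical helper:

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def optional_lock(lock_path):
    """Take a soft file lock when the filesystem allows it, and fall back
    to no lock on a read-only filesystem. Yields True if the lock was taken.

    A lock already held by another process (FileExistsError) still
    propagates, so the collision protection discussed above is preserved
    wherever locking is actually possible.
    """
    fd = None
    try:
        try:
            # O_EXCL: atomic creation, exactly one process can hold the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except (PermissionError, FileNotFoundError):
            pass  # read-only or missing cache dir: proceed without a lock
        yield fd is not None
    finally:
        if fd is not None:
            os.close(fd)
            os.remove(lock_path)

# In a writable directory the lock is taken; on a read-only path the
# context manager yields False instead of crashing.
with tempfile.TemporaryDirectory() as d:
    with optional_lock(os.path.join(d, "tlds.txt.lock")) as locked:
        print(locked)  # True
```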
OK, I think I have all the info I need.
Just a thought, but maybe adding a flag to download the file and keep it in memory (instead of writing it to a cache) might be a solution too. |
I was thinking about this as well. Maybe I will go this way: download the list into memory to have up-to-date TLDs. On the other hand, I would not encourage using it when you create new instances in quick succession, because each one would download the updated list. I do not want to "DDoS" IANA, where the list is downloaded from, because I do not have any control over how this library is used. I will go through the code and think about a solution.
Oh, that's a very good point indeed. |
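The in-memory idea discussed above could look roughly like this. The helpers (`parse_tlds`, `get_tlds`, the module-level cache) are hypothetical, not urlextract's API; the IANA list format is real: one uppercase TLD per line, with `#` comment lines at the top. The module-level cache addresses the rate-limiting concern by downloading at most once per process:

```python
import urllib.request

TLD_LIST_URL = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

def parse_tlds(text: str) -> set[str]:
    """Turn the raw IANA list into a lowercase set of TLDs,
    skipping blank lines and '# ...' comment lines."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }

_tlds_cache: "set[str] | None" = None  # download at most once per process

def get_tlds() -> set[str]:
    """Download the TLD list into memory on first call only, so that
    creating many extractor instances does not hammer IANA."""
    global _tlds_cache
    if _tlds_cache is None:
        with urllib.request.urlopen(TLD_LIST_URL) as resp:
            _tlds_cache = parse_tlds(resp.read().decode("ascii"))
    return _tlds_cache
```

Nothing touches the filesystem here, so it works on a read-only serverless runtime; the trade-off is one network call per cold start.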
I'm trying to use URLExtract in a serverless function, but locking the cached TLD file provokes an error on this read-only system.
`cachefile.py` tries to lock the file (URLExtract/urlextract/cachefile.py, line 236 in 638c0e2), but the read-only system won't allow it:
`2023-05-30 06:38:53,125 |DEBUG| Attempting to acquire lock 2240053224272 on C:\Program Files\Python311\Lib\site-packages\urlextract\data\tlds-alpha-by-domain.txt.lock`
Is there a way around this?