TLD cache filelock error on read-only systems #116

LaundroMat · 2022-01-12T07:16:38Z

I'm trying to use URLExtract in a serverless function, but locking the cached TLD file provokes an error on this read-only system.

cachefile.py tries to lock the file

URLExtract/urlextract/cachefile.py

Line 236 in 638c0e2

with filelock.FileLock(self._get_cache_lock_file_path()):

but the read-only system won't allow it:

[Errno 30] Read-only file system: '/home/site/wwwroot/.python_packages/lib/site-packages/urlextract/data/tlds-alpha-by-domain.txt.lock'

Is there a way around this?

The text was updated successfully, but these errors were encountered:

lipoja · 2022-01-12T07:37:52Z

Thank you @LaundroMat for reporting this issue.
I am sorry but I do not think that there is workaround for this yet. Since I did not encounter this error yet.

Could you send me the Traceback? I am thinking about solution so I would like to know if there is some Python Exception I can catch. I will try to provide a solution for this in next release. Right now I am thinking of adding some flag/parameter which will tell urlxtract to disable updates of TLDs.

LaundroMat · 2022-01-12T07:43:31Z

Thanks for the quick followup!

The full exception is

OSError: [Errno 30] Read-only file system: '/home/site/wwwroot/.python_packages/lib/site-packages/urlextract/data/tlds-alpha-by-domain.txt.lock'
  File "django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "**********/****.py", line 35, in ****
    extractor = URLExtract()
  File "urlextract/urlextract_core.py", line 103, in __init__
    self._reload_tlds_from_file()
  File "urlextract/urlextract_core.py", line 136, in _reload_tlds_from_file
    tlds = sorted(self._load_cached_tlds(), key=len, reverse=True)
  File "urlextract/cachefile.py", line 236, in _load_cached_tlds
    with filelock.FileLock(self._get_cache_lock_file_path()):
  File "filelock/_api.py", line 213, in __enter__
    self.acquire()
  File "filelock/_api.py", line 169, in acquire
    self._acquire()
  File "filelock/_unix.py", line 30, in _acquire
    fd = os.open(self._lock_file, open_mode)

So the problem is that (as the _acquire method in filelock/_unix.py's UnixFileLock shows) locking the file requires read access.

    def _acquire(self) -> None:
        open_mode = os.O_RDWR | os.O_CREAT | os.O_TRUNC
        fd = os.open(self._lock_file, open_mode)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
        else:
            self._lock_file_fd = fd

lipoja · 2022-01-12T07:48:59Z

@LaundroMat and would you be OK to use the TLDs which might be not up to date on this read only file system?
If I disable the update it will use the TLDs which were added to the release of urlextract you are using.
Or do you need a way to load up to dayte list of TLDs?

Note: If you are not looking for URLs with exotic TLDs then you should be fine. Usually only those are removed/added.

LaundroMat · 2022-01-12T07:56:48Z

Oh, sure, no problem. I'm happy to manually update the file once in a while.

LaundroMat · 2022-01-12T08:07:46Z

I've been looking through the code, and I believe it's not a problem with updating the file. The issue is that the _load_cached_tlds function tries to get a file lock on the data file in order to read it, and a read-only environment won't allow that.

def _load_cached_tlds(self):
    """
    Loads TLDs from cached file to set.
    :return: Set of current TLDs
    :rtype: set
    """

    # check if cached file is readable
    if not os.access(self._tld_list_path, os.R_OK):
        self._logger.error(
            "Cached file is not readable for current "
            "user. ({})".format(self._tld_list_path)
        )
        raise CacheFileError("Cached file is not readable for current user.")

    set_of_tlds = set()

    with filelock.FileLock(self._get_cache_lock_file_path()): # <-- This triggers an error in a read-only environment
        with open(self._tld_list_path, "r") as f_cache_tld:
            for line in f_cache_tld:
                tld = line.strip().lower()
                # skip empty lines
                if not tld:
                    continue
                # skip comments
                if tld[0] == "#":
                    continue

                set_of_tlds.add("." + tld)
                set_of_tlds.add("." + idna.decode(tld))

    return

lipoja#116

lipoja · 2022-01-12T08:24:11Z

Yes I agree. But I do not want to remove the filelock because it prevents other collisions when you run multiple instances of urlextract. It happened that one instance was updating the TLDs (just created a file) and another one read the file in that moment. Which lead to loading empty file (not fully updated yet). So urlextract could not find anything because it did not load any TLDs.

@LaundroMat And right now I think about different solution that is already in place. Do you have any directory that you can write to?
If yes you should be able to set cache_dir

extractor = URLExtract(cache_dir='/path/to/directory/with/write/access')

And I see it is not in documentation. I have to update it.

LaundroMat · 2022-01-12T08:29:41Z

As you probably saw, I removed the file lock in a fork, and that works in the serverless environment.

I fully understand and agree with the need for the lock, but instances are isolated in my serverless setup, so there's no risk. But of course, serverless is a very specific use-case...

The configurable cache directory is a great addition, but I can't write anywhere in the serverless filesystem. I could write to an S3 bucket or something similar. But that's a whole new can of worms and I believe too far out of the scope of your library. (Which, by the way, I'm very grateful for!)

lipoja · 2022-01-12T08:41:21Z

OK, I think I have all the info I need.
I will think about a solution for this.

LaundroMat · 2022-01-12T08:45:31Z

Just a thought, but maybe adding a flag to download the file and keep it in memory (instead of writing it to a cache) might be a solution too.

lipoja · 2022-01-12T08:46:23Z

I was thinking about this as well. Maybe I will go this way. Download to memory to have up to date TLDs.

On the other hand I would not encourage to use it when you are creating new instance quickly (which lead to downloading updated list). I do not want to "DDOS" IANA from where the list is downloaded from. Because I do not have any control on his library. I will go thought the code and think about solution.

LaundroMat · 2022-01-12T08:59:13Z

Oh, that's a very good point indeed.
I think for my case (and again, it's a fringe case, so don't prioritize it too much), the ability to save the cache file in a cloud environment (such as S3 or B2) or to be able to specifically state the file should not be locked when reading it would be ideal then.

hemangisaupure · 2023-06-02T10:53:45Z

I'm trying to use URLExtract in a serve function, but locking the cached TLD file provokes an error Attempting to acquire lock

2023-05-30 06:38:53,125 |DEBUG| Attempting to acquire lock 2240053224272 on C:\Program Files\Python311\Lib\site-packages\urlextract\data\tlds-alpha-by-domain.txt.lock
2023-05-30 06:38:53,125 |DEBUG| Lock 2240053224272 acquired on C:\Program Files\Python311\Lib\site-packages\urlextract\data\tlds-alpha-by-domain.txt.lock
2023-05-30 06:38:53,144 |DEBUG| Attempting to release lock 2240053224272 on C:\Program Files\Python311\Lib\site-packages\urlextract\data\tlds-alpha-by-domain.txt.lock
2023-05-30 06:38:53,144 |DEBUG| Lock 2240053224272 released on C:\Program Files\Python311\Lib\site-packages\urlextract\data\tlds-alpha-by-domain.txt.lock

Is there a way around this?

LaundroMat added a commit to LaundroMat/URLExtract that referenced this issue Jan 12, 2022

Removed file lock when reading cached data file

40fb6d8

lipoja#116

LaundroMat mentioned this issue Apr 24, 2022

Add ability to read local TLDs file rather than download #18

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TLD cache filelock error on read-only systems #116

TLD cache filelock error on read-only systems #116

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022 •

edited

Loading

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022 •

edited

Loading

LaundroMat commented Jan 12, 2022 •

edited

Loading

hemangisaupure commented Jun 2, 2023

TLD cache filelock error on read-only systems #116

TLD cache filelock error on read-only systems #116

Comments

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022 • edited Loading

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022

LaundroMat commented Jan 12, 2022

lipoja commented Jan 12, 2022 • edited Loading

LaundroMat commented Jan 12, 2022 • edited Loading

hemangisaupure commented Jun 2, 2023

LaundroMat commented Jan 12, 2022 •

edited

Loading

lipoja commented Jan 12, 2022 •

edited

Loading

LaundroMat commented Jan 12, 2022 •

edited

Loading