Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow setting of a base dir for url cache #24

Merged
merged 2 commits into from
Oct 8, 2018

Conversation

keyz182
Copy link
Contributor

@keyz182 keyz182 commented Oct 5, 2018

I'm using this library in an application deployed on AWS Elastic Beanstalk. In it's environment, os.path.dirname(__file__) would resolve to an unwriteable directory. I've added the optional cache_dir parameter to the constructor, defaulting to the value it was previously, to allow code to explicitly set a base cache directory.

urlextract.py Outdated Show resolved Hide resolved
@lipoja
Copy link
Owner

lipoja commented Oct 5, 2018

Hi Kieran,
thank you for the pull request.

I would like to know your opinion about logging. When it has the ability to change the cache_dir should we warn user when the cache_dir is not accessible (around line 86: if not os.access(dir_path, os.W_OK):).

I think you might be interested to see if the path that you provided is not writable.
How do you see from users point of view?

@keyz182
Copy link
Contributor Author

keyz182 commented Oct 7, 2018

While logging may have helped us find the issue a little sooner, the eventual error thrown (/home/wsgi/.urlextract_tlds not writeable or similar) guided us directly to the issue pretty quickly. Though, from a good practice point-of-view, logging that the code has reached a condition where it needs to fall back may be the best approach.

For us, on Elastic Beanstalk, both the current directory, and the home directory were not writeable (~ resolved to /home/wsgi which I don't think even existed. Would falling back to the temp directory be an option if the current dir is unreadable, rather than the home directory? Or make a temp directory the 3rd fallback?

The way we're using my branch currently is with the temp dir shown below, and is a cross platform compatible way of finding a temp directory.

import tempfile
extractor = URLExtract(cache_dir=tempfile.gettempdir())

@keyz182
Copy link
Contributor Author

keyz182 commented Oct 7, 2018

I suppose you loose the cache then across reboots potentially, so may not be the most desirable, but as a fallback, may be acceptable.

@lipoja
Copy link
Owner

lipoja commented Oct 8, 2018

Thank you for your time, opinions and ideas! I really appreciate it. It is really good idea with the temp dir so I've crated issue (#25) for it. If you have time you and are interested in helping with it just write comment to the issue otherwise so I know you (or somebody else) will do it or I will work on it when I have free time.

Thank you and have a nice day.

@lipoja lipoja merged commit cae76b7 into lipoja:master Oct 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants