Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: private tlds can be used at call-time #207

Merged

Conversation

brycedrennan
Copy link
Collaborator

@brycedrennan brycedrennan commented Oct 8, 2020

This is a second attempt at doing what was done in #144.

Addresses #66

  • Adds include_psl_private_domains to the __call__ method. This is now something you can choose on a per-call basis. The object level argument now is only a default value for each call.
  • The entire dataset from publicsuffix.org is saved to cache
  • Ensured no weird cache issues happen when using with different suffix_list_urls by using different filenames per suffix_list_urls
  • Use filelock to support multiprocessing and multithreading use cases
  • Updates the bundled snapshot to be the raw publicsuffix data. Need to look at performance impact of this.
  • various other cleanups
  • Breaking change cache_file => cache_dir

- Adds `include_psl_private_domains` to the `__call__` method.  This is now something you can choose on a per-call basis.  The object level argument now is only a default value for each call.
- The entire dataset from publicsuffix.org is saved to cache
- Ensured no weird cache issues happen when using with different `suffix_list_urls` by using different filenames per `suffix_list_urls`
- Use filelock to support multiprocessing and multithreading use cases
- Updates the bundled snapshot to be the raw publicsuffix data. Need to look at performance impact of this.
- Breaking change cache_file => cache_dir
- various other cleanups
@brycedrennan brycedrennan marked this pull request as ready for review October 8, 2020 17:19
@brycedrennan
Copy link
Collaborator Author

@john-kurkowski for your consideration. This replaces #144. I think it's cleaner than the first attempt.
@hangtwenty might also want to take a look

Hopefully we can merge this faster than the last PR 😄

@john-kurkowski
Copy link
Owner

Let's! 👍 😅 Will look later today.

Copy link
Owner

@john-kurkowski john-kurkowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Clean! Can merge in the next couple days.

tldextract/cache.py Outdated Show resolved Hide resolved
tldextract/cache.py Outdated Show resolved Hide resolved
tldextract/cache.py Outdated Show resolved Hide resolved
tldextract/cache.py Show resolved Hide resolved
@brycedrennan
Copy link
Collaborator Author

Nice catches. Addressed your feedback. Left as a separate commit but will squash it once you've looked at it

@brycedrennan
Copy link
Collaborator Author

Oh one thing I should point out is I deleted that snapshot diff functionality. Admittedly was just feeling a little lazy at figuring out how to implement that again. Can figure it out if it's important though.

The lock needed to wrap the retrieval, parsing, and storage of the suffix lists
@john-kurkowski
Copy link
Owner

Glad you pointed that out. I think it's fine without. I'll note in the CHANGELOG.

@john-kurkowski john-kurkowski merged commit 2eba0e5 into john-kurkowski:master Oct 10, 2020
@john-kurkowski
Copy link
Owner

Thank you very much for this much requested feature, and seeing it through years!

@brycedrennan
Copy link
Collaborator Author

Thanks! I hope i didn’t sneak that multiprocessing test and related fix by you ( the last two commits)

@brycedrennan brycedrennan deleted the runtime-select-private-tlds branch October 10, 2020 02:10
@john-kurkowski
Copy link
Owner

Tests and fixes always welcome. I see no sneaking. 🙏

bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request Oct 21, 2020
https://build.opensuse.org/request/show/843031
by user mia + dimstar_suse
- Update to 3.0.0:
  This release fixes the long standing bug that public and private
  suffixes were generated separately and could not be switched at
  runtime, john-kurkowski/tldextract#66
  * Breaking Changes
    + Rename `cache_file` to `cache_dir` as it is no longer a
      single file but a directory
      (john-kurkowski/tldextract#207)
    + Rename CLI arg also, from `--cache_file` to `--cache_dir`
    + Remove Python 2.7 support
  * Features
    + Can pass `include_psl_private_domains` on call, not only on
      construction
    + Use filelocking to support multi-processing and
      multithreading environments
  * Bugfixes
    + Select public or private suffixes at runtime
      (https://github.com/john-kurkowski/tldextract/issu
@evrial
Copy link

evrial commented Oct 21, 2020

Fantastic changes, great job

In [1]: from tldextract import TLDExtract                                                                                                                                         

In [2]: TLDExtract.tlds                                                                                                                                                           
Out[2]: <property at 0x7faa2da45c50>

In [3]: t = TLDExtract()                                                                                                                                                          

In [4]: t.tlds                                                                                                                                                                    
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-2bd415795d39> in <module>
----> 1 t.tlds

~/anaconda3/envs/ocean/lib/python3.7/site-packages/tldextract/tldextract.py in tlds(self)
    243     @property
    244     def tlds(self):
--> 245         return self._get_tld_extractor().tlds
    246 
    247     def _get_tld_extractor(self):

AttributeError: '_PublicSuffixListTLDExtractor' object has no attribute 'tlds'

@brycedrennan
Copy link
Collaborator Author

🤨 hmm I suspect sarcasm

Yes it does seem we missed that. Given that we now track public and private tlds we probably won't be able to support this ambiguous property.

Perhaps a public_tlds and a private_tlds?

@evrial
Copy link

evrial commented Oct 21, 2020

Yes, thats sensible names.

@evrial
Copy link

evrial commented Oct 21, 2020

Are these results correct?

        (
            'runners.blogspot.com',
            {'subdomain': '', 'domain': 'runners', 'suffix': 'blogspot.com'},
        ),
        (
            'http://www.ocean.io',
            {'subdomain': 'www', 'domain': 'ocean', 'suffix': 'io'},
        ),
        (
            'blogspot.com',
            {'subdomain': '', 'domain': 'blogspot', 'suffix': 'com'},
        ),

Also this:

In [6]: tldextract.extract('runners.blogspot.com')                                                                                                                                
Out[6]: ExtractResult(subdomain='runners', domain='blogspot', suffix='com')

In [7]: tldextract.extract('runners.blogspot.com', include_psl_private_domains=True)                                                                                              
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-0e1c98efa124> in <module>
----> 1 tldextract.extract('runners.blogspot.com', include_psl_private_domains=True)

TypeError: extract() got an unexpected keyword argument 'include_psl_private_domains'

@brycedrennan
Copy link
Collaborator Author

I can't know whether those results are correct without knowing what calls produced them.

The second issue you pointed out is another oversight.

I've made a PR to address these issues here: #210

@evrial
Copy link

evrial commented Oct 21, 2020

We were using wrapped patched version of tldextract to properly handle private domains and invoke tldextract with private parameter if the domain is private and vice versa. Above are our tests

    def is_private_domain(self, result):
        return (
            result.suffix in self.psl_private_tld_list and
            (
                not result.domain or result.domain == 'www'
            )
        )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants