Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tld parsers optimize access #84

Merged
merged 6 commits into from
May 26, 2024
Merged

Conversation

Anvil
Copy link
Contributor

@Anvil Anvil commented May 24, 2024

Hey there.

While looking at the code, it appeared there was a giant if/elif combo in parse_tld.py, which implied that to reach the "ve" tld parser you need to test all other combinations (linear complexity)

I've moved the parsers to their specific files, and replaced the nearly 150 lines of if/elif by a getattr, which should prove efficient enough to allow access to any RegexFOO class in constant time.

Also, I've made all RegexFOO respect the same interface, so that there's no need for a specific __init__ in each class.. which allowed the removal of another 200 lines. This should be easier to maintain.

There are a few other simplifications, in the branch to reduce the memory consumption.

Hope this helps.

@pogzyb pogzyb self-requested a review May 24, 2024 18:17
Copy link
Owner

@pogzyb pogzyb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow yeah this is so much better!

The if/else block was originally a big dictionary, but that was even worse for memory usage. Can't believe I never thought to leverage getattr here.

I will try to get this rolled out in 1.1.3 sometime this weekend - hopefully I don't mess it up again lol. I will probably refactor the tests and maybe reorganize the "parse" related files into a separate dir or something too.

Again, really appreciate your contributions to this library.

@Anvil
Copy link
Contributor Author

Anvil commented May 25, 2024

Happy to help.

Do you want to keep the python 3.9 compat ? If so, I'll restore the Union i've dropped (Union[t1, t2] replaced by t1 | t2).

(Doing less imports from typing speeds up loading type:

  • Dict, List, Set, etc. which are deprecated since 3.9 can be replaced by dict, list, set, and so on.
  • Union which is not deprecated can be by replaced | syntax since 3.10
    )

@pogzyb
Copy link
Owner

pogzyb commented May 25, 2024

Happy to help.

Do you want to keep the python 3.9 compat ? If so, I'll restore the Union i've dropped (Union[t1, t2] replaced by t1 | t2).

(Doing less imports from typing speeds up loading type:

* Dict, List, Set, etc. which are deprecated since 3.9 can be replaced by dict, list, set, and so on.

* Union which is _not_ deprecated can be by replaced `|` syntax since 3.10
  )

Yes thank you - Python 3.9 won't be EOL until October 2025, so I'd vote that we keep the old-style typing for the near future.

Other than that some of the parser tests are failing:

  • RegexNU: This is due to my comment above about tld_specific_expressions not being present in the specific parser classes for RegexPT nor RegexNU.
  • RegexGQ: Even though this class is inheriting from RegexTK, it's still carrying over it's empty expression dictionary. I added a possible fix below.
class RegexGQ(RegexTK):
    ...
    # tld_specific_expressions: ExpressionDict = {}  # comment out or remove, so parent class expressions are used

After these changes, we should be good to merge!

@pogzyb pogzyb merged commit b6bdd26 into pogzyb:main May 26, 2024
12 checks passed
@pogzyb pogzyb mentioned this pull request May 26, 2024
@Anvil Anvil deleted the tld-parsers-optimize-access branch May 26, 2024 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants