Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser recompiles regex every loop #130

Closed
bdraco opened this issue Apr 2, 2023 · 15 comments
Closed

Parser recompiles regex every loop #130

bdraco opened this issue Apr 2, 2023 · 15 comments

Comments

@bdraco
Copy link
Contributor

bdraco commented Apr 2, 2023

pattern = re.compile(signature, re.DOTALL)

It would be nice if this cached the compile of the regex and didn't recompile it every loop

@lowdef
Copy link
Contributor

lowdef commented Apr 2, 2023

But the signature is different for every traversal of the loop:

for signature, parser in self.telegram_specification['objects'].items():
            pattern = re.compile(signature, re.DOTALL)
            matches = pattern.findall(telegram_data)

As this is done again for every new telegram offered to the parser, you might still be right that there might be some unnecessary overhead here.

I noticed on bulk processing that the parser is very slow. Not sure how much improvement this will bring.
Could be an interesting experiment.

@bdraco
Copy link
Contributor Author

bdraco commented Apr 2, 2023

A quick fix would be an lru_cache around the re.compile

_cached_re_compile = lru_cache(maxsize=512)(re.compile). -- max size may be too high

for signature, parser in self.telegram_specification['objects'].items():
            pattern = _cached_re_compile(signature, re.DOTALL)
            matches = pattern.findall(telegram_data)

A better fix would be to store the re.compile objects for each signature

@ndokter
Copy link
Owner

ndokter commented Apr 3, 2023

Another option could be to do it in the init

@ndokter
Copy link
Owner

ndokter commented Apr 7, 2023

@bdraco i tested it and it seems to improve the performance by about 5-6%. Do you have a preference for the chosen solution? Im impartial and also fine wit the lru approach.

https://github.com/ndokter/dsmr_parser/pull/131/files

I suspect there is a big(ger) gain to be made in TelegramBuffer.get_all. It's a method that is called for every x bytes received, after which it performs a regex to see if a full telegram is received. Maybe it can be optimized to more smartly look at the end of a telegram which often is something like !56DD\r\n

@bdraco
Copy link
Contributor Author

bdraco commented Apr 7, 2023

Building them ahead of time and storing them is cleaner than using an LRU 👍

@bdraco
Copy link
Contributor Author

bdraco commented Apr 7, 2023

I suspect there is a big(ger) gain to be made in TelegramBuffer.get_all. It's a method that is called for every x bytes received, after which it performs a regex to see if a full telegram is received. Maybe it can be optimized to more smartly look at the end of a telegram which often is something like !56DD\r\n

import re

# - Match all characters after start of telegram except for the start
# itself again '^\/]+', which eliminates incomplete preceding telegrams.
# - Do non greedy match using '?' so start is matched up to the first
# checksum that's found.
# - The checksum is optional '{0,4}' because not all telegram versions
# support it.
_FIND_TELEGRAMS_REGEX = re.compile(r"\/[^\/]+?\![A-F0-9]{0,4}\0?\r\n", re.DOTALL)


class TelegramBuffer(object):
    """
    Used as a buffer for a stream of telegram data. Constructs full telegram
    strings from the buffered data and returns it.
    """

    def __init__(self):
        self._buffer = ""

    def get_all(self):
        """
        Remove complete telegrams from buffer and yield them.
        :rtype generator:
        """
        for telegram in _FIND_TELEGRAMS_REGEX.findall(self._buffer):
            self._remove(telegram)
            yield telegram

    def append(self, data):
        """
        Add telegram data to buffer.
        :param str data: chars, lines or full telegram strings of telegram data
        """
        self._buffer += data

    def _remove(self, telegram):
        """
        Remove telegram from buffer and incomplete data preceding it. This
        is easier than validating the data before adding it to the buffer.
        :param str telegram:
        :return:
        """
        # Remove data leading up to the telegram and the telegram itself.
        index = self._buffer.index(telegram) + len(telegram)

        self._buffer = self._buffer[index:]

Maybe something like this.

I don't actually use this lib so I can't test it. I was only submitting this on behalf of a user that I was helping with a performance issue with their system.

@pimw1
Copy link

pimw1 commented Apr 8, 2023

Yup, that user is me ;) Thank you all for looking into this!

@ndokter
Copy link
Owner

ndokter commented Apr 12, 2023

Thank you, i have put these performance improvements in version 1.2.2

@ndokter ndokter closed this as completed Apr 12, 2023
@pimw1
Copy link

pimw1 commented Apr 16, 2023

Great work! Would it be possible to merge the update into Home Assistant?

@ndokter
Copy link
Owner

ndokter commented Apr 16, 2023

I don't know how to test the changes in HA. It seems that the version listed here is quite old as well (v0.33):
https://github.com/home-assistant/core/blob/dev/homeassistant/components/dsmr/manifest.json

@bdraco
Copy link
Contributor Author

bdraco commented Apr 16, 2023

home-assistant/core#84097 looks like it's waiting for a reviewer that can test this

@dupondje
Copy link
Collaborator

I've made a HACS integration for it now, as the PR keeps hanging in HA:
https://github.com/dupondje/homeassistant-dsmr

@pimw1
Copy link

pimw1 commented Apr 18, 2023

Cool, thanks! I'll give it a try today/tomorrow. I'm curous to see the impact of the efficiency-improvement on the cpu-utlization.

@pimw1
Copy link

pimw1 commented Apr 18, 2023

By the way, since home-assistant/core#84097 is from dec 2022, i guess the performance improvement of version 1.2.2 is not in there. Would it be an idea to add the latest improvement in it?

@pimw1
Copy link

pimw1 commented Apr 20, 2023

Cool, thanks! I'll give it a try today/tomorrow. I'm curous to see the impact of the efficiency-improvement on the cpu-utlization.

I've tested with the new version and compared the performance with the vanila dsmr of home assistant. From performance perspective, i don't see a clear pattern. It might be that vanilla home assistant dsmr performs slightly better, but the difference is too small to tell from the graphs. I've done two runs and rescaled the y-axis for both runs.
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants