Skip to content
This repository has been archived by the owner on Jan 30, 2024. It is now read-only.
/ regexy Public archive

〰️ Linear time regex matching supporting streams and other goodies

License

Notifications You must be signed in to change notification settings

nitely/regexy

Repository files navigation

regexy

Build Status Coverage Status pypi licence

A Python library for parsing, compiling, and executing regular expressions. All searches execute in linear time with respect to the size of the regular expression and search text.

Initial work inspired by Thompson's NFA paper.
APIs inspired by RE2 and Python's re module.
Implementation is my own and it's not influenced by any existing solution.

Be aware! This is nothing more than an experiment for researching purposes.

Status

  • Matcher
  • Basic operators: *, ?, + and |
  • Capturing groups
  • Symbols escaping
  • Shorthands: \w, \d, \s, \W, \D, \S
  • Sets [...] (+ ranges and shorthands)
  • Repetition ranges {n, m}
  • non-capturing groups
  • Greedy and non-greedy match
  • ^ and $ symbols
  • \b word boundary
  • Match any (dot)
  • Sets complement
  • Lookahead assertion (?=...) and (?!...) (limited to a single char)
  • Assertions \A, \z, \B
  • Named capturing groups
  • search
  • full_match
  • Flags
  • User friendly compiling errors
  • ... ?

Compatibility

  • Python +3.5

Install

$ pip install regexy

Usage

Notice regexy returns all capturing groups specified within a repeated sub-expression

import regexy

regexy.match(regexy.compile(r'((a)*b)'), 'aab')
# Match<('aab', ('a', 'a'))>

regexy.match(regexy.compile(r'a'), 'b')
# None

regexy.match(regexy.compile(r'a'), 'a')
# Match<()>

Streams are supported (i.e: network and files)

Note: Capturing may take as much RAM as all of the data in worst case when the full regex is captured

import io
import regexy


def stream_gen():
    stream = io.BytesIO(b'Im a stream')
    stream_wrapper = io.TextIOWrapper(stream, encoding='utf-8', write_through=True)

    while True:
        chars = stream_wrapper.read(5)

        if not chars:
            break

        yield from chars

regexy.match(regexy.compile(r'(\w+| +)*'), stream_gen())
# (('Im', ' ', 'a', ' ', 'stream'),)

Here is a (undocumented) way to print the generated NFA for debugging purposes:

import regexy

str(regexy.compile(r'a*').state)
# ('*', [('a', [('*', [...])]), ('EOF', [])])
# The [...] thing means it's recursive

Docs

Read The Docs

Tests

$ make test

License

MIT

About

〰️ Linear time regex matching supporting streams and other goodies

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages