# 01_pattern

> Hyphenation patterns

In [None]:
#| default_exp pattern

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| export
import re, string
import itertools as it
from collections.abc import Iterable, Mapping

TeX patterns look like `2a1ly4`: letters interleaved with digits. Each digit
is the weight of the slot in which it appears. Slots lie between adjacent
letters, before the first letter, and after the last letter:

| | | | | | | |
|-|-|-|-|-|-|-|
| |a| |l| |y| |
|2| |1| |0| |4|

A slot without a number has weight zero.
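
As a rough cross-check of the table above, the weights can also be read off by splitting the pattern at every non-digit character. The helper name `slot_weights_sketch` is made up for this illustration and is not part of the module.

```python
import re

def slot_weights_sketch(pattern: str) -> tuple[int, ...]:
    """Hypothetical helper: weight of each slot in a TeX-style pattern."""
    # Splitting at every letter leaves the (possibly empty) digit strings
    # that sit in the slots before, between, and after the letters.
    slots = re.split(r'[^0-9]', pattern)
    # An empty slot carries no number, which means weight zero.
    return tuple(int(s or 0) for s in slots)

print(slot_weights_sketch('2a1ly4'))  # (2, 1, 0, 4)
```

A pattern with three letters always yields four weights, one per slot, whether or not any digits are present.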

In [None]:
#| exporti
def _cvt(
    pattern: str  # pattern as read from the TeX patterns file
) -> tuple[int, ...]:  # position i has the weight of the slot before character i
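    # one slot before each character plus one after the last; sizing the list
    # this way keeps the trailing slot even for a pattern without digits
    res = [0] * (len(pattern) + 1)
    pos = 0  # index of the slot we are currently in
    for ch in pattern:
        if ch in string.digits:
            res[pos] = int(ch)
        else:
            pos += 1  # a letter moves us to the next slot
    return tuple(res[:pos+1])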

In [None]:
show_doc(_cvt)

---

[source](https://github.com/jkseppan/shyster/blob/main/shyster/pattern.py#L12){target="_blank" style="float:right; font-size:smaller"}

### _cvt

>      _cvt (pattern:str)

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| pattern | str | pattern as read from the TeX patterns file |
| **Returns** | **tuple** | **position i has the weight of the slot before character i** |

The following function turns many patterns into one regular expression.
The expression is a lookahead, so it matches zero-width strings, and its
first group captures the pattern that matched. (The matches are
zero-width because `re.findall` only finds non-overlapping matches,
whereas patterns can overlap within a word.) Along with that regexp, the
function returns a mapping from matched patterns to their weights.
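
To see why the lookahead matters, the sketch below compares a plain alternation with the same alternation wrapped in `(?=(...))`. The word and the two alternatives are arbitrary strings chosen for illustration, not real hyphenation patterns.

```python
import re

word = 'banana'

# A plain alternation consumes what it matches, so the 'nan' and the
# second 'ana' that overlap the first match are never reported.
plain = re.findall('ana|nan', word)

# Inside a lookahead nothing is consumed: the regex is tried at every
# position and group 1 reports which alternative matched there.
overlapping = [(m.start(), m.group(1))
               for m in re.finditer('(?=(ana|nan))', word)]

print(plain)        # ['ana']
print(overlapping)  # [(1, 'ana'), (2, 'nan'), (3, 'ana')]
```

One thing worth keeping in mind with this approach: at any given position the group captures only the first listed alternative that matches, so `re.findall('(?=(a|ab))', 'ab')` yields `['a']` and never reports `'ab'`.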

In [None]:
#| export
def convert_patterns(
    patterns: Iterable[str]  # patterns as read from the TeX patterns file
) -> tuple[re.Pattern, Mapping[str, tuple[int, ...]]]:  # regex for patterns, and mapping from pattern to weights
    regexes = []
    mapping = {}
    for p in patterns:
        # replace dot with a control character unlikely to appear in words (ASCII unit separator)
        p = p.replace('.', '\x1f')
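        s = re.sub('[0-9]', '', p)  # the letters alone, which is what the regex captures
        regexes.append(re.escape(s))  # escape so that any regex metacharacters match literally
        mapping[s] = _cvt(p)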
    return re.compile(f"(?=({'|'.join(regexes)}))"), mapping

In [None]:
show_doc(convert_patterns)

---

[source](https://github.com/jkseppan/shyster/blob/main/shyster/pattern.py#L25){target="_blank" style="float:right; font-size:smaller"}

### convert_patterns

>      convert_patterns (patterns:collections.abc.Iterable[str])

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| patterns | Iterable | patterns as read from the TeX patterns file |
| **Returns** | **tuple** | **regex for patterns, and mapping from pattern to weights** |

In [None]:
assert convert_patterns(['1ba', '1be', 'ch2r', '.ä2']) == (
    re.compile('(?=(ba|be|chr|\x1Fä))'),
    {'ba': (1,0,0), 'be': (1,0,0), 'chr': (0,0,2,0), '\x1Fä': (0,0,2)})
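
This notebook does not show how the returned pair is consumed, so the following is only a guess at one plausible use. It assumes the word is wrapped in the same `\x1f` marker that replaces the TeX dot, and that overlapping patterns are combined by taking the maximum weight per slot, as in the hyphenation algorithm TeX uses. The name `combine_weights` is hypothetical, and the regex and mapping are copied literally from the assertion above so that the sketch runs on its own.

```python
import re

# Stand-ins for convert_patterns(['1ba', '1be', 'ch2r', '.ä2']).
regex = re.compile('(?=(ba|be|chr|\x1fä))')
mapping = {'ba': (1, 0, 0), 'be': (1, 0, 0),
           'chr': (0, 0, 2, 0), '\x1fä': (0, 0, 2)}

def combine_weights(word: str) -> list[int]:
    """Hypothetical helper: per-slot maximum over all matching patterns."""
    marked = f'\x1f{word}\x1f'       # assumed word-boundary markers
    slots = [0] * (len(marked) + 1)  # one slot before each character, one after the last
    for m in regex.finditer(marked):
        # Align the pattern's weights with the position where it matched.
        for offset, weight in enumerate(mapping[m.group(1)]):
            i = m.start() + offset
            slots[i] = max(slots[i], weight)
    return slots

print(combine_weights('baba'))  # [0, 1, 0, 1, 0, 0, 0]
```

Slot `i` in the result sits before character `i` of the marked word, so the two nonzero weights here fall before each `b`.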

TeX exceptions are simply words with hyphens where hyphenation should happen.

In [None]:
#| export
def convert_exceptions(
    exceptions: Iterable[str]  # words with hyphens marking where hyphenation should happen
) -> Mapping[str, str]:  # mapping from word to word with hyphens
    return {w.replace('-', ''): w for w in exceptions}

In [None]:
show_doc(convert_exceptions)

---

[source](https://github.com/jkseppan/shyster/blob/main/shyster/pattern.py#L39){target="_blank" style="float:right; font-size:smaller"}

### convert_exceptions

>      convert_exceptions (exceptions:collections.abc.Iterable[str])

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| exceptions | Iterable | words with hyphens marking where hyphenation should happen |
| **Returns** | **Mapping** | **mapping from word to word with hyphens** |

In [None]:
assert convert_exceptions(['saippua-kauppias', 'xyzzy']) == {'saippuakauppias': 'saippua-kauppias', 'xyzzy': 'xyzzy'}
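
A minimal sketch of how such a mapping might be queried follows. The helper `exception_parts` is hypothetical, and the dictionary is written out literally (it is the value from the assertion above) so that the example runs on its own.

```python
from typing import Optional

# Stand-in for convert_exceptions(['saippua-kauppias', 'xyzzy']).
exceptions = {'saippuakauppias': 'saippua-kauppias', 'xyzzy': 'xyzzy'}

def exception_parts(word: str) -> Optional[list[str]]:
    """Hypothetical helper: the word's parts if it is a listed exception, else None."""
    hyphenated = exceptions.get(word)
    if hyphenated is None:
        return None  # not an exception; the caller would fall back to the patterns
    return hyphenated.split('-')

print(exception_parts('saippuakauppias'))  # ['saippua', 'kauppias']
print(exception_parts('xyzzy'))            # ['xyzzy']
print(exception_parts('kissa'))            # None
```

An exception without any hyphen, such as `xyzzy`, comes back as a single part, meaning the word is never broken.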

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()