# 02_hyphenation

> Hyphenator

In [None]:
#| default_exp hyphenation

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| export
import re
import itertools as it
from collections.abc import Sequence, Mapping, Callable

First, a simple helper that inserts hyphens into a word at given positions:

In [None]:
#| exporti
def add_hyphens(
    s: str,  # word to hyphenate
    positions: Sequence[int],  # positions to insert hyphens (increasing order)
    hyphen: str='-'  # hyphen character
) -> str:  # word with hyphens
    i0, i1 = it.tee(iter(positions))
    i0 = it.chain((0,), i0)
    i1 = it.chain(i1, (len(s),))
    substrings = (s[p0:p1] for (p0,p1) in zip(i0, i1))
    return hyphen.join(substrings).strip(hyphen)

In [None]:
show_doc(add_hyphens)

---

[source](https://github.com/jkseppan/shyster/blob/main/shyster/hyphenation.py#L7){target="_blank" style="float:right; font-size:smaller"}

### add_hyphens

>      add_hyphens (s:str, positions:collections.abc.Sequence[int],
>                   hyphen:str='-')

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| s | str |  | word to hyphenate |
| positions | Sequence |  | positions to insert hyphens (increasing order) |
| hyphen | str | - | hyphen character |
| **Returns** | **str** |  | **word with hyphens** |

In [None]:
assert add_hyphens('saippuakauppias', ()) == 'saippuakauppias'
assert add_hyphens('saippuakauppias', (7,)) == 'saippua-kauppias'
assert add_hyphens('saippuakauppias', (4, 7, 11)) == 'saip-pua-kaup-pias'
assert add_hyphens('', ()) == ''
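
To make the slicing in `add_hyphens` easier to follow, here is a minimal sketch of the same `tee`/`chain`/`zip` pairing that returns the pieces instead of joining them. The name `split_at` is hypothetical and is not part of the module.

```python
import itertools as it

def split_at(s: str, positions) -> list[str]:
    """Return the pieces of s delimited by the given increasing positions."""
    starts, ends = it.tee(positions)
    starts = it.chain((0,), starts)    # 0, p1, p2, ...
    ends = it.chain(ends, (len(s),))   # p1, p2, ..., len(s)
    return [s[a:b] for a, b in zip(starts, ends)]

pieces = split_at('saippuakauppias', (4, 7, 11))
print(pieces)                  # ['saip', 'pua', 'kaup', 'pias']
print('\u00ad'.join(pieces))   # the same pieces joined with soft hyphens (U+00AD)
print(split_at('abc', (0,)))   # ['', 'abc']
```

The last call shows that a position at the very start of the word yields an empty first piece. Joining such pieces would produce a leading hyphen, which is the case that the final `strip(hyphen)` in `add_hyphens` removes.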

The following function implements the Liang hyphenation algorithm,
given the patterns and exceptions. A word found among the exceptions
is returned in its stored form. Otherwise, for each possible
hyphenation slot, we take the maximum of all weights that the matching
patterns assign to that slot, and if the maximum is odd, we insert a
hyphen. TeX has parameters called `\lefthyphenmin` and
`\righthyphenmin`, with default values 2 and 3 (respectively), meaning
that a hyphen needs at least two letters to its left and at least three
to its right. The default patterns produce hyphens that violate these
limits, so we must also filter them out.
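
The rule described above can be illustrated with a small standalone sketch. It is not the implementation used in this module: it scans all substrings instead of using a precompiled regex, it marks word boundaries with `.` as in Liang's pattern notation, and the names `parse_pattern`, `liang_positions` and `TOY_PATTERNS` are hypothetical. The patterns are a small hand-picked set for illustration only.

```python
import re

def parse_pattern(pattern: str) -> tuple[str, tuple[int, ...]]:
    """Split a Liang pattern such as 'hy3ph' into ('hyph', (0, 0, 3, 0, 0))."""
    letters = re.sub(r'\d', '', pattern)
    weights = [0] * (len(letters) + 1)
    slot = 0
    for ch in pattern:
        if ch.isdigit():
            weights[slot] = int(ch)  # a digit applies to the slot before the next letter
        else:
            slot += 1
    return letters, tuple(weights)

def liang_positions(word, patterns, lefthyphenmin=2, righthyphenmin=3):
    """Return the hyphen positions in word (indices where a hyphen is inserted)."""
    rules = dict(parse_pattern(p) for p in patterns)
    padded = f'.{word}.'
    weights = [0] * (len(padded) + 1)  # weights[j] is the slot before padded[j]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            rule = rules.get(padded[start:end])
            if rule is None:
                continue
            for i, w in enumerate(rule):
                # Keep the maximum weight that any matching pattern assigns to a slot
                weights[start + i] = max(weights[start + i], w)
    # Slot j of the padded word is position j-1 of the original word;
    # keep odd weights that respect the left and right minimums
    return [j - 1 for j, w in enumerate(weights)
            if w % 2 == 1 and lefthyphenmin <= j - 1 <= len(word) - righthyphenmin]

TOY_PATTERNS = ['hy3ph', 'he2n', 'hena4', 'hen5at', '1na', 'n2at', '1tio', '2io', 'o2n']

print(liang_positions('hyphenation', TOY_PATTERNS))                   # [2, 6], i.e. hy-phen-ation
print(liang_positions('hyphenation', TOY_PATTERNS, righthyphenmin=6)) # [2]
print(liang_positions('hyphenation', TOY_PATTERNS, lefthyphenmin=3))  # [6]
```

In this example the slot between `n` and `a` receives the weights 5 (from `hen5at`) and 2 (from `n2at`); the maximum 5 is odd, so a hyphen is allowed there. The slot between `a` and `t` receives 4 and 1; the maximum 4 is even, so the hyphen that `1tio` would allow is suppressed.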

In [None]:
#| export
def hyphenator(
    regex: re.Pattern,  # first return value from `pattern.convert_patterns`
    mapping: Mapping[str, tuple[int,...]],  # second return value from `pattern.convert_patterns`
    exceptions: Mapping[str, str],  # return value from `pattern.convert_exceptions`
    hyphen: str='-', # hyphen character
    lefthyphenmin: int=2,  # at least this many characters before the first hyphen
    righthyphenmin: int=3,  # at least this many characters after the last hyphen
) -> Callable[[str], str]:  # function that hyphenates words
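    def fun(word):
        # Words listed as exceptions bypass the patterns entirely
        if (result := exceptions.get(word)):
            return result
        # Delimit the word with \x1f boundary markers
        word = f'\x1f{word}\x1f'
        # weights[i] is the weight of the slot before character i of the original word
        weights = bytearray(len(word))
        for match in regex.finditer(word):
            # pos is -1 when the match starts at the leading marker; that weight
            # wraps around to the last slot, which is never a valid hyphen position
            pos = match.start()-1
            key = match.group(1)
            rule = mapping[key]
            for i, w in enumerate(rule):
                weights[pos+i] = max(weights[pos+i], w)
        # Keep slots with an odd weight that respect lefthyphenmin and righthyphenmin
        positions = (i for (i,w) in enumerate(weights)
                     if w&1==1 and i>=lefthyphenmin and i<=len(word)-2-righthyphenmin)
        return add_hyphens(word[1:-1], positions, hyphen=hyphen)
    return fun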

In [None]:
show_doc(hyphenator)

---

[source](https://github.com/jkseppan/shyster/blob/main/shyster/hyphenation.py#L15){target="_blank" style="float:right; font-size:smaller"}

### hyphenator

>      hyphenator (regex:re.Pattern,
>                  mapping:collections.abc.Mapping[str,tuple[int,...]],
>                  exceptions:collections.abc.Mapping[str,str], hyphen:str='-',
>                  lefthyphenmin:int=2, righthyphenmin:int=3)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| regex | Pattern |  | first return value from `pattern.convert_patterns` |
| mapping | Mapping |  | second return value from `pattern.convert_patterns` |
| exceptions | Mapping |  | return value from `pattern.convert_exceptions` |
| hyphen | str | - | hyphen character |
| lefthyphenmin | int | 2 | at least this many characters before the first hyphen |
| righthyphenmin | int | 3 | at least this many characters after the last hyphen |
| **Returns** | **Callable** |  | **function that hyphenates words** |
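
The loop inside `hyphenator` reads the start of each match and `match.group(1)`, so it needs `regex.finditer` to report matches at every start position, including overlapping ones. One way a regex can do that is a zero-width lookahead wrapped around a capturing group. This is an assumption about the shape of the regex, since the output of `pattern.convert_patterns` is not shown in this notebook; the regex and word below are made up for illustration.

```python
import re

# A lookahead consumes no characters, so finditer tries every start position
# and the capturing group still records which alternative matched there.
overlapping = re.compile(r'(?=(ana|nan))')
found = [(m.start(), m.group(1)) for m in overlapping.finditer('banana')]
print(found)  # [(1, 'ana'), (2, 'nan'), (3, 'ana')]

# Without the lookahead, each match consumes its characters and overlaps are lost.
consuming = re.compile(r'(ana|nan)')
print([(m.start(), m.group(1)) for m in consuming.finditer('banana')])  # [(1, 'ana')]
```

Note that a plain alternation like this reports only one alternative per start position, so it is not enough on its own when several patterns begin at the same character.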

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()