In [2]:
%%html
<style>
table {float:left}
</style>

In [3]:
import regex as re

# Word Boundary ```\b```

* [Word Boundaries](https://www.regular-expressions.info/wordboundaries.html)

> The metacharacter ```\b``` matches at a position that is called a **word boundary**. This match is zero-length. Simply put: ```\b``` allows a “whole words only” match with the pattern ```\bword\b```.
> 
> There are three different positions that qualify as word boundaries:
> 1. Before the first character in the string, if the first character is a word character.
> 2. After the last character in the string, if the last character is a word character.
> 3. Between two characters in the string, where one is a word character and the other is not a word character.
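
The three positions listed above can be checked with a minimal sketch. It uses the standard-library ```re``` module (an assumption; this notebook imports ```regex as re```, which should behave the same for this pattern), and ```boundary_positions``` is a hypothetical helper name introduced only for illustration.

```python
import re

def boundary_positions(text):
    """Return every index at which the word-boundary assertion matches."""
    return [m.start() for m in re.finditer(r'\b', text)]

# Cases 1 and 2: before the first and after the last word character.
print(boundary_positions("tako"))
# Case 3: between a word character and a non-word character.
print(boundary_positions("a-b"))
# A leading or trailing non-word character is not a boundary by itself.
print(boundary_positions("-a-"))
```

The helper only collects ```m.start()``` because every ```\b``` match is zero-length, so start and end are always equal.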

## Python defines a word boundary in terms of **non-word characters**

* [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#more-pattern-power)

> ```\b```: Word boundary. This is a zero-width assertion that **matches only at the beginning or end of a word**. A word is defined as a sequence of alphanumeric characters, so **the begin/end of a word is indicated by whitespace or a non-alphanumeric character**.

* [re: Regular expression operations](https://docs.python.org/3/library/re.html)

> Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, ```\b``` is defined as the boundary 
> 
> 1. between ```\w``` and a ```\W``` character (or vice versa), or 
> 2. between ```\w``` and the beginning/end of the string. 
> 
> This means that ```r'\bfoo\b'``` matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
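
The claim in the quote above can be verified directly. This is a small sketch using the standard-library ```re``` module (an assumption; the notebook itself imports ```regex as re```), and the variable names are illustrative only.

```python
import re

pattern = re.compile(r'\bfoo\b')
candidates = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']

# search() finds the pattern anywhere in the string, as the docs example implies.
results = {text: bool(pattern.search(text)) for text in candidates}
for text, matched in results.items():
    print(f"{text!r:<15} matched: {matched}")
```

```foo3``` fails because ```3``` is a word character, so there is no boundary between ```o``` and ```3```.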

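Because ```\b``` treats every non-word character as a boundary, not just whitespace, ```\bfoo\b``` also matches the ```foo``` inside ```foo.``` or ```(foo)```. To match ```foo``` only when it stands alone (delimited by whitespace or the string edges), use lookarounds such as ```(?<!\S)foo(?!\S)```.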
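
One reading of "match the word only" is "match only when the word is delimited by whitespace or the string edges". The sketch below compares ```\b``` with a pair of lookarounds under that assumption. It uses the standard-library ```re``` module, and the names ```loose``` and ```strict``` are illustrative.

```python
import re

loose = re.compile(r'\bfoo\b')           # any non-word character is a boundary
strict = re.compile(r'(?<!\S)foo(?!\S)')  # no non-whitespace character on either side

text = "foo (foo) foo. bar-foo foo"

loose_hits = [m.start() for m in loose.finditer(text)]
strict_hits = [m.start() for m in strict.finditer(text)]

print("loose :", loose_hits)
print("strict:", strict_hits)
```

The negative lookarounds ```(?<!\S)``` and ```(?!\S)``` also succeed at the start and end of the string, which a positive ```(?<=\s)``` would not.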


# Warning

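```\b``` cannot be used as a word boundary inside a character class, because a word boundary is a position, not a character. Inside ```[...]```, Python interprets ```\b``` as the backspace character (```\x08```), so ```[\b]``` matches a literal backspace.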

* [Can word boundary \b not be used in a character class?](https://stackoverflow.com/a/77252779/4281353)

> You cannot include \b in a character class as it is not a character. But, you could use an alternation instead of the attempted character class: ```(?:\b|\s)cool\b```
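
The difference between the character class and the suggested alternation can be seen in a short sketch. It uses the standard-library ```re``` module (an assumption; the notebook imports ```regex as re```), and the sample strings are made up for illustration.

```python
import re

# Inside [...], \b is the backspace character (\x08), not a word boundary.
char_class = re.compile(r'[\b\s]cool\b')
# The alternation keeps \b as a zero-width assertion.
alternation = re.compile(r'(?:\b|\s)cool\b')

samples = ["cool", " cool", "\x08cool", "uncool"]
class_hits = [bool(char_class.search(s)) for s in samples]
alt_hits = [bool(alternation.search(s)) for s in samples]

for s, c, a in zip(samples, class_hits, alt_hits):
    print(f"{s!r:<12} character class: {c!s:<6} alternation: {a}")
```

The character class needs a real backspace or whitespace character in front of ```cool```, so the bare string ```cool``` only matches with the alternation.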

In [63]:
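# Inside a character class, \b is the backspace character (\x08), not a word
# boundary, so [\b\s] requires a literal backspace or whitespace before "cool".
pattern = r'[\b\s]cool\b'

text = "cool"

matches = re.finditer(
    pattern=pattern, 
    string=text.strip(), 
    flags=re.IGNORECASE
)
# finditer always returns an iterator (which is truthy), so iterate directly.
# Nothing is printed here because the pattern does not match "cool".
for match in matches:
    print(f"match: {match.group(0): <20} start: {match.start():<5} end: {match.end():<5} pos: {match.endpos:<5}")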

# Examples

```\b``` matches with:

```
&tako  [sushi]
 ^   ^  ^    ^
 1   2  3    4    
```

1. In-between ```&``` and ```t``` because it is the start boundary of the word **tako**.
2. After ```o``` of the word ```tako``` as it is the end boundary of the word **tako**.
3. In-between ```[``` and ```s``` because it is the start boundary of the word **sushi**.
4. After ```i``` of the word ```sushi``` as it is the end boundary of the word **sushi**. 

In [51]:
pattern = r'\b'
text = "&tako  [sushi]"

matches = re.finditer(
    pattern=pattern, 
    string=text, 
    flags=re.IGNORECASE
)
for m in matches:
    print(f"match: {m.group(0)} start: {m.start(0):<3} end: {m.end(0):<3} string: {text[m.start(0):]}")

match:  start: 1   end: 1   string: tako  [sushi]
match:  start: 5   end: 5   string:   [sushi]
match:  start: 8   end: 8   string: sushi]
match:  start: 13  end: 13  string: ]


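```[^\b]``` matches every character in the text. Inside a character class, ```\b``` is not a word boundary but the backspace character (```\x08```), so ```[^\b]``` means "any character except backspace". The text contains no backspace, so every character matches, **even the non-word ```\W``` characters**.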

In [53]:
pattern = r'[^\b]'
text = "&tako  [sushi]"

matches = re.finditer(
    pattern=pattern, 
    string=text, 
    flags=re.IGNORECASE
)
for m in matches:
    print(f"match: {m.group(0)} start: {m.start(0):<3} end: {m.end(0):<3} string: {text[m.start(0):]}")

match: & start: 0   end: 1   string: &tako  [sushi]
match: t start: 1   end: 2   string: tako  [sushi]
match: a start: 2   end: 3   string: ako  [sushi]
match: k start: 3   end: 4   string: ko  [sushi]
match: o start: 4   end: 5   string: o  [sushi]
match:   start: 5   end: 6   string:   [sushi]
match:   start: 6   end: 7   string:  [sushi]
match: [ start: 7   end: 8   string: [sushi]
match: s start: 8   end: 9   string: sushi]
match: u start: 9   end: 10  string: ushi]
match: s start: 10  end: 11  string: shi]
match: h start: 11  end: 12  string: hi]
match: i start: 12  end: 13  string: i]
match: ] start: 13  end: 14  string: ]
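
To confirm that ```[^\b]``` excludes only the backspace character, here is a minimal sketch with a backspace placed in the input. It uses the standard-library ```re``` module (an assumption; the notebook imports ```regex as re```), and the variable names are illustrative.

```python
import re

not_backspace = re.compile(r'[^\b]')

# "\x08" is the backspace character; it is the only character skipped.
with_backspace = "a\x08b"
hits = [m.group(0) for m in not_backspace.finditer(with_backspace)]
print(hits)

# Without a backspace in the text, every character matches.
text = "&tako  [sushi]"
all_hits = not_backspace.findall(text)
print(len(all_hits) == len(text))
```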


In [54]:
pattern = r'\B'
text = "&tako  [sushi]"

matches = re.finditer(
    pattern=pattern, 
    string=text, 
    flags=re.IGNORECASE
)
for m in matches:
    print(f"match: {m.group(0)} start: {m.start(0):<3} end: {m.end(0):<3} string: {text[m.start(0):]}")

match:  start: 0   end: 0   string: &tako  [sushi]
match:  start: 2   end: 2   string: ako  [sushi]
match:  start: 3   end: 3   string: ko  [sushi]
match:  start: 4   end: 4   string: o  [sushi]
match:  start: 6   end: 6   string:  [sushi]
match:  start: 7   end: 7   string: [sushi]
match:  start: 9   end: 9   string: ushi]
match:  start: 10  end: 10  string: shi]
match:  start: 11  end: 11  string: hi]
match:  start: 12  end: 12  string: i]
match:  start: 14  end: 14  string: 




```\bfoo\b``` matches:

```
#foo foo foobar bar tako) foo!@#$%^
 ^   ^                    ^
 1   2                    3
```

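1. ```foo``` in ```#foo``` because ```\b``` matches between ```#``` and ```f``` (the start of the word ```foo```) and again between ```o``` and the following space.
2. The standalone ```foo``` because it has a space on both sides, so ```\b``` matches before ```f``` and after ```o```.
3. ```foo``` in ```foo!@#$%^``` because ```\b``` matches between the preceding space and ```f```, and between ```o``` and ```!```.

```foobar``` does not match because there is no word boundary between ```o``` and ```b```.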

In [52]:
pattern = r'\bfoo\b'
text = "#foo foo foobar bar tako) foo!@#$%^"

matches = re.finditer(
    pattern=pattern, 
    string=text, 
    flags=re.IGNORECASE
)
for m in matches:
    print(f"match: {m.group(0)} start: {m.start(0):3} end: {m.end(0):3} string: {text[m.start(0):]}")

match: foo start:   1 end:   4 string: foo foo foobar bar tako) foo!@#$%^
match: foo start:   5 end:   8 string: foo foobar bar tako) foo!@#$%^
match: foo start:  26 end:  29 string: foo!@#$%^


---
# Custom boundary

* [How to write word boundary inside character class in python without losing its meaning? I wish to add underscore(_) in definition of word boundary(\b)](https://stackoverflow.com/questions/41460829/how-to-write-word-boundary-inside-character-class-in-python-without-losing-its-m)

In [85]:
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there and #word."
print(re.findall(rx,s))

['word', 'word', 'word']
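
The pattern above works because ```_``` is a word character, so ```\b``` does not match next to it; the lookbehind ```(?<=_)``` and the lookahead ```(?=\b|_)``` add the underscore as an extra delimiter. A possible generalisation is sketched below. The helper name ```underscore_bounded``` is hypothetical, the extra test words are made up, and the standard-library ```re``` module is assumed.

```python
import re

def underscore_bounded(word):
    """Compile a pattern that accepts '_' as a delimiter in addition to a word boundary."""
    # re.escape keeps any regex metacharacters in the word literal.
    return re.compile(rf"(?:\b|(?<=_)){re.escape(word)}(?=\b|_)")

s = "some_word_here and a word there and #word. swordfish words"
found = underscore_bounded("word").findall(s)
print(found)
```

```swordfish``` and ```words``` are not matched because a letter, not a boundary or an underscore, sits next to ```word``` on one side.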
