# SCC.413 Applied Data Mining
# Week 17
# Regular Expressions in Python

## Contents
- [Introduction](#intro)
- [Compiling](#compiling)
- [Matching and searching](#matching)
- [Search and replace](#replace)
- [Splitting](#split)
- [Unicode and regex](#unicode)
    - [Emoji](#emoji)

<a name="intro"></a>
## Introduction

Before completing this workbook, you should consult the slides and webpage on regular expressions. This workbook will show you how to use regular expressions within Python. First we will use Python's default `re` package to show basic functionality. Further on, we will use the `regex` package, which allows us to deal with Unicode more easily.

We import `re` below and set up a string to use as the target for our regular expression matches and searches.

In [None]:
import re

some_text = "This is some text that needs processing. #regex #fun"

print(some_text)

<a name="compiling"></a>
## Compiling regular expressions

In Python, a regular expression is compiled into a Pattern object, which is then used for regex operations (searching, replacing, etc.). Compiling a regex once (e.g. outside of a loop) is good practice for the sake of efficiency.

We use the `r` prefix to denote a "raw string" when creating regex patterns. This avoids having to escape every backslash, and makes the patterns easier to read and write.
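As a small illustration of why the raw-string prefix matters, the sketch below compares `\b` written with and without it. The sample text and variable names are made up for this example and are not part of the workbook.

```python
import re

sample = "a word in a sentence"  # illustrative text, not from the workbook

# In a normal string "\b" is one character (a backspace);
# in a raw string r"\b" is two characters (backslash, b),
# which the regex engine reads as a word boundary.
print(len("\b"), len(r"\b"))

raw_matches = re.findall(r"\bword\b", sample)       # word boundaries
escaped_matches = re.findall("\\bword\\b", sample)  # same pattern, escaped by hand
plain_matches = re.findall("\bword\b", sample)      # searches for backspace characters

print(raw_matches, escaped_matches, plain_matches)
```

The raw and hand-escaped patterns behave identically; the raw form is simply easier to read.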

There are also various flags that can be set during compilation: https://docs.python.org/3/howto/regex.html#compilation-flags. For example, `re.IGNORECASE` can be used to make the pattern case-insensitive.
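To make the effect of flags concrete, here is a minimal sketch comparing a case-sensitive and a case-insensitive pattern, and combining two flags with `|`. The sample strings and pattern names are assumptions chosen for illustration.

```python
import re

sample = "Regex REGEX regex"  # illustrative text

case_sensitive = re.compile(r"regex")
case_insensitive = re.compile(r"regex", re.IGNORECASE)

sensitive_matches = case_sensitive.findall(sample)
insensitive_matches = case_insensitive.findall(sample)
print(sensitive_matches)
print(insensitive_matches)

# Flags are combined with |. re.VERBOSE lets the pattern contain
# whitespace and comments (a # inside a character class stays literal).
verbose_regex = re.compile(r"""
    [@#]   # an at-sign or a hash
    \w+    # followed by one or more word characters
""", re.VERBOSE | re.IGNORECASE)
verbose_matches = verbose_regex.findall("Try #Regex with @Python")
print(verbose_matches)
```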

Two regex patterns are created below, one to find basic words, the other to find @mentions or #hashtags. Check your understanding of the patterns, and what you expect them to match.

In [None]:
word_regex = re.compile(r"[a-z]+", re.IGNORECASE)
ht_at_regex = re.compile(r"([@#])\w+")

<a name="matching"></a>
## Matching and searching

Now that we have some patterns, we can perform various operations. `match()` is the most basic: it checks whether the regex matches at the start of the string, e.g. if we match on our `some_text`:

In [None]:
match = word_regex.match(some_text)

The first word is returned in a match object. Several methods are available on it, such as:

In [None]:
print(match.group())  # return the string matched
print(match.start())  # start position of the match in the string
print(match.end())    # end position of the match in the string
print(match.span())   # tuple with start and end positions

The match has to be at index 0 of the string, so the other regex will not find a match:

In [None]:
no_match = ht_at_regex.match(some_text)
print(no_match)

If we want to check whether a regex matches, we normally do something like this (switch to `word_regex` to see a successful match):

In [None]:
m = ht_at_regex.match(some_text)
if m:
    print("Match: ", m.group(), m.span())
else:
    print("No match found")

The `search()` function does the same as `match()`, except it will scan through the string to find the first match, e.g.:

In [None]:
m = ht_at_regex.search(some_text)
if m:
    print("Match: ", m.group(), m.span())
else:
    print("No match found")
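The difference between the two functions can be seen side by side in the sketch below. It uses a simplified, hashtag-only pattern (an assumption for this example) on the same text as above.

```python
import re

text = "This is some text that needs processing. #regex #fun"
hashtag_regex = re.compile(r"#\w+")  # illustrative pattern, hashtags only

at_start = hashtag_regex.match(text)   # only tries at index 0
anywhere = hashtag_regex.search(text)  # scans forward for the first match

print(at_start)
print(anywhere.group(), anywhere.span())
```

`match()` returns `None` here because the text does not begin with a hashtag, while `search()` finds the first one further along.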

`findall()` returns a list of matching strings (not match objects) for a given string. Note that this is basic tokenisation:

In [None]:
matches = word_regex.findall(some_text)
print(matches)

Note that if grouping `()` is included in the regex (as with our hashtag regex), then `findall()` returns just the group matches, not the full match. Use `finditer()` instead to get the whole match object.

In [None]:
matches = ht_at_regex.findall(some_text)
print(matches)

In [None]:
matches = ht_at_regex.finditer(some_text)
for match in matches:
    print(match.group())
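How `findall()` behaves depends on the number of capturing groups in the pattern. The sketch below, using an illustrative sentence and three variants of the pattern (all assumptions for this example), shows the three cases: one group, several groups, and no capturing groups.

```python
import re

text = "Thanks @alice and @bob! #regex #fun"  # illustrative text

one_group = re.compile(r"([@#])\w+")      # one capturing group
two_groups = re.compile(r"([@#])(\w+)")   # two capturing groups
no_capture = re.compile(r"(?:[@#])\w+")   # non-capturing group

one_group_result = one_group.findall(text)    # list of group strings
two_groups_result = two_groups.findall(text)  # list of tuples, one item per group
no_capture_result = no_capture.findall(text)  # list of full matches

print(one_group_result)
print(two_groups_result)
print(no_capture_result)
```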

The `group()` function on a match allows you to access captured group matches, e.g.:

In [None]:
match = ht_at_regex.search(some_text)
print(match)

In [None]:
print(match.group(0))

In [None]:
print(match.group(1))
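Groups can also be given names with the `(?P<name>...)` syntax, which can be easier to read than numeric indices. The sketch below is an illustrative variant of the hashtag pattern; the group names `symbol` and `name` are assumptions chosen for this example.

```python
import re

text = "This is some text that needs processing. #regex #fun"
tag_regex = re.compile(r"(?P<symbol>[@#])(?P<name>\w+)")  # illustrative pattern

m = tag_regex.search(text)
print(m.group(0))         # the whole match
print(m.group("symbol"))  # named group, same as m.group(1)
print(m.group("name"))    # named group, same as m.group(2)
print(m.groups())         # all captured groups as a tuple
print(m.groupdict())      # named groups as a dictionary
```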

**Task**: write some code to iterate through the word_regex matches, printing the matching token and its location in the text.

In [None]:
# Answer



<a name="replace"></a>
## Search and replace

You can use the same regular expressions to replace matches, using the `sub()` function:

In [None]:
replaced = word_regex.sub("word", some_text)
print(replaced)

`subn()` provides the replaced string, along with the number of replacements made:

In [None]:
replaced = word_regex.subn("word", some_text)
print(replaced)

The original string can be included in the replacement using a call to the group match, as below (you will see annotation/tagging like this):

In [None]:
tagged = word_regex.sub(r"\g<0>_word", some_text)  # raw string, so \g is not treated as a string escape
print(tagged)
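Numbered groups can be referenced in the replacement in the same way, either as `\1`, `\2`, ... or as `\g<1>`, `\g<2>`, .... As a minimal sketch, the example below reorders the parts of a date; the pattern and sample text are assumptions for illustration only.

```python
import re

date_regex = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # year, month, day
dates = "Submitted 2024-03-15, marked 2024-04-02"     # illustrative text

# Rebuild each match as day/month/year using the captured groups
reordered = date_regex.sub(r"\3/\2/\1", dates)
same_result = date_regex.sub(r"\g<3>/\g<2>/\g<1>", dates)

print(reordered)
```

The `\g<N>` form is the safer choice when the reference is immediately followed by a digit.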

You can also pass a function to do whatever you like with the match. The example below reverses each match:

In [None]:
def reverse(match):
    return match.group()[::-1]

reversed_text = word_regex.sub(reverse, some_text)  # avoid shadowing the built-in reversed()
print(reversed_text)
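Because the function receives the full match object, the replacement can depend on the match itself. The sketch below is one illustrative possibility (the function name and length threshold are assumptions): it upper-cases only words longer than four letters.

```python
import re

word_regex = re.compile(r"[a-z]+", re.IGNORECASE)

def shout_long_words(match):
    """Return the matched word in upper case if it has more than four letters."""
    word = match.group()
    return word.upper() if len(word) > 4 else word

shouted = word_regex.sub(shout_long_words, "This is some text that needs processing.")
print(shouted)
```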

**Task**: Write some code that will replace the hashtags in the text with "#hashtag". Think about how to do this if the text contained #hashtags and @mentions:

In [None]:
# answer 


<a name="split"></a>
## Splitting

`split()` finds all matches, and returns the surrounding text. Using this with our word regex, we find the surrounding non-words:

In [None]:
split = word_regex.split(some_text)
print(split)

A whitespace tokeniser (i.e. that finds all text separated by spaces) could look like this:

In [None]:
whitespace_regex = re.compile(r"\s+")
split = whitespace_regex.split(some_text)
print(split)
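A few details of `split()` are worth seeing in isolation: empty strings appear when a match sits at the very start of the text, `maxsplit` limits the number of splits, and a capturing group in the pattern keeps the separators in the result. The strings below are illustrative assumptions.

```python
import re

word_regex = re.compile(r"[a-z]+", re.IGNORECASE)
whitespace_regex = re.compile(r"\s+")

# A match at the start of the text leaves an empty string in front of it
around_words = word_regex.split("Hello, world!")

# maxsplit limits how many splits are made
first_only = whitespace_regex.split("one two three", maxsplit=1)

# A capturing group keeps the separators in the returned list
with_separators = re.compile(r"(\s+)").split("one two  three")

print(around_words)
print(first_only)
print(with_separators)
```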

**Task**: Write some code that will split on all non-word characters, i.e. will not include punctuation and hashtags in the returned list of tokens:

In [None]:
# answer



<a name="unicode"></a>
## Unicode and regex
Python's re module is Unicode aware, but limited.

In [None]:
new_text = "The café served pizzas with jalapeños"
word_regex = re.compile(r"\w+")
word_regex.findall(new_text)

é and ñ are included in the \w character set as part of Python's Unicode awareness, i.e. they are treated as letters.

To see what happens without this, we can set the `re.ASCII` compilation flag, which makes the regex engine perform ASCII-only matching instead of full Unicode matching:

In [None]:
ascii_word_regex = re.compile(r"\w+", re.ASCII)
ascii_word_regex.findall(new_text)

Check your understanding of why *jalape* and *os* are separate matches.
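The same flag affects the other shorthand classes too. As a small sketch with `\d`, the example below uses an illustrative string containing both ASCII digits and Arabic-Indic digits (written as escapes so they are unambiguous).

```python
import re

# "12" followed by ARABIC-INDIC DIGIT THREE and ARABIC-INDIC DIGIT FOUR
mixed_digits = "12 \u0663\u0664"

unicode_digits = re.findall(r"\d+", mixed_digits)                # Unicode-aware (default for str patterns)
ascii_digits = re.findall(r"\d+", mixed_digits, flags=re.ASCII)  # ASCII-only: [0-9]

print(unicode_digits)
print(ascii_digits)
```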

Many other options are available for regular expressions with Unicode: https://www.regular-expressions.info/unicode.html. However, most of them are not available in Python's standard `re` module: https://www.regular-expressions.info/refunicode.html, especially character classes based on Unicode character sets.

Fortunately, another regex library is available: https://pypi.org/project/regex/. This is backwards-compatible with re (so you can use the same functions), but offers lots more functionality.

We import the `regex` package below and use it in place of `re`, getting the same results.

In [None]:
import regex
word_regex = regex.compile(r"\w+")
word_regex.findall(new_text)

We now have access to Unicode [character sets](https://www.regular-expressions.info/unicode.html). For example, the pattern below uses the `\p{L}` character set, which matches any letter in any script:

In [None]:
letters_regex = regex.compile(r"\p{L}+")
letters_regex.findall(new_text)

A problem occurs, though, with *combining marks*, which are not part of the letters character set. An ñ can also be written as two *codepoints* that together create one *grapheme* (displayed character). Here the combining tilde (`\u0303`) is added to the preceding character (n).

In [None]:
combined_text = "The café served pizzas with jalapen\u0303os"
print(combined_text, len(combined_text))
print(new_text, len(new_text))

Note the text is displayed exactly the same as the previous text, but is actually one character longer (`len` counts codepoints).
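The two spellings can be converted into one another with Unicode normalisation. The sketch below uses the standard library's `unicodedata` module, which is not otherwise used in this workbook, so treat it as an optional aside; the variable names are assumptions for illustration.

```python
import unicodedata

composed = "jalape\u00f1os"     # ñ as a single codepoint
decomposed = "jalapen\u0303os"  # n followed by a combining tilde

print(composed == decomposed)          # the strings differ codepoint by codepoint
print(len(composed), len(decomposed))

# NFC composes base letter + mark where possible; NFD decomposes
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

Normalising text to one form before matching is one way to sidestep the issue, though it will not help for letter and mark combinations that have no single-codepoint equivalent.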

In [None]:
letters_regex.findall(combined_text)

Now the n is matched as part of the word *jalapen*, but the combining tilde is not recognised as a letter, so the match ends there and the next match (*os*) starts after it.

There is a special character set for these combining marks:

In [None]:
combiner_regex = regex.compile(r"\p{M}+")

In [None]:
combiner_regex.findall(combined_text)

Note this is printed horribly, as the ~ is combined with the previous character, which here is an apostrophe.

We can treat a letter followed by optional combining marks as a single letter, using the regex pattern below.

In [None]:
letters_combiners_regex = regex.compile(r"(?:\p{L}\p{M}*)+")
letters_combiners_regex.findall(combined_text)

Check your understanding of this pattern. `(?: )` is a non-capturing group: it groups part of the pattern without storing it as a separate captured group, so `findall()` still returns the whole match. The grouping allows us to repeat the letter-plus-marks unit with `+` to find multiple letters. We include 0 or more (`*`) combining marks, as it is possible to have multiple marks on the same letter, e.g.:

In [None]:
print("o\u0308\u0337")

This may not display as one character on some browsers (e.g. Safari), though it seems to display okay on Google Chrome. This is a font display issue.

<a name="emoji"></a>
### Emoji
Lots and lots available: http://www.unicode.org/emoji/charts/full-emoji-list.html

We can represent these as codepoints, or with the Unicode character directly:

In [None]:
print("\U0001F643\U0001F596")
print("🙃🖖")

Emoji have their own combining modifiers, e.g. we can represent skin tones (note, [not all will combine](https://www.unicode.org/emoji/charts/full-emoji-modifiers.html)):

In [None]:
print("\U0001F596\U0001F3FE")
print("🖖\U0001F3FD🖖\U0001F3FC🖖\U0001F3FB")
print("🙃\U0001F3FE")

The \u200D ([zero-width-joiner, ZWJ](https://blog.emojipedia.org/emoji-zwj-sequences-three-letters-many-possibilities/)) allows for multiple emojis to be combined into one.

In [None]:
print("\U0001F469") #a woman
print("\U0001F680") #a rocket
print("\U0001F469\u200D\U0001F680") #put them together, a woman astronaut.

For some emoji, a special codepoint, ["variation selector-16"](https://unicode-table.com/en/FE0F/), is required to indicate that the character should be displayed as an emoji. The fonts installed on your system will dictate how the examples below are displayed: some require the `\uFE0F`, others do not, and some will not display as an emoji regardless. The important point is to be aware that `\uFE0F` may optionally appear in some emoji.

In [None]:
print("\u2600")
print("\u2600\uFE0F")
print("\u2639")
print("\u2639\uFE0F")

In a string containing emoji, each emoji may consist of a different number of codepoints, depending on how many modifiers and zero-width special characters are involved.

Each emoji is displayed as a single character/grapheme, but is made up of multiple underlying codepoints. These can be seen when listing the string (the individual codepoints are displayed), and the length reflects this.

In [None]:
emoji = "🙃🖖🖖🏽🖖🏼🖖🏻👩\u200D🚀\u2639\uFE0F\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(emoji)
print(list(emoji))
print(len(emoji))
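To see exactly which codepoints make up a sequence, it can help to print each one with its code and official Unicode name. The helper below is a hypothetical convenience function written for this example (it is not part of `re` or `regex`) and relies on the standard library's `unicodedata` module.

```python
import unicodedata

def describe(text):
    """Return (code, name) pairs for every codepoint in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

astronaut = "\U0001F469\u200D\U0001F680"  # woman + zero-width joiner + rocket
astronaut_parts = describe(astronaut)

for code, name in astronaut_parts:
    print(code, name)
```

One displayed emoji here corresponds to three codepoints, with the zero-width joiner sitting invisibly in the middle.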

We can use the special regex `\X`, which is the Unicode version of `.`, and will match any **grapheme** regardless of the number of codepoints that make it up.

In [None]:
grapheme_regex = regex.compile(r"\X")
graphemes = grapheme_regex.findall(emoji)

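(If we print the list of matches, some are not displayed properly. This is down to the way lists are displayed: each string is shown using its `repr()`, which escapes non-printable codepoints such as the zero-width joiner.)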

In [None]:
print(graphemes)

Printing each grapheme in turn shows us that the correct matches are made.

In [None]:
print(len(graphemes))
for grapheme in graphemes:
    print(grapheme)

More Unicode fun:

- https://norasandler.com/2017/11/02/Around-the-with-Unicode.html
- https://blog.jonnew.com/posts/poo-dot-length-equals-two
- https://pypi.org/project/emoji/