# **Regular Expressions**

A **regular expression** (shortened as **regex**) is a sequence of characters that specifies a search pattern in text. In other words, a regex uses a specialised syntax to find or extract the strings that match given criteria.





## Abstract

This notebook explores regular expressions as a preprocessing tool for NLP pipelines. It covers the core regular expression syntax and introductory usage of Python's `re` module.

## Formal Language Theory

<font color="hotpink">Regular expressions</font> describe <font color="hotpink">regular languages</font> in formal language theory. They consist of: 
- Constants, which denote sets of strings.
- Operator symbols, which denote operations over these sets. 

The following <font color="hotpink">definition</font> is standard, and found as such in most textbooks on formal language theory. Given a finite alphabet Σ, the following constants are defined as regular expressions:

- (Empty set) ∅ denoting the set ∅.
- (Empty string) ε denoting the set containing only the "empty" string, which has no characters at all.
- (Literal character) $a$ in Σ denoting the set containing only the character $a$.

Given regular expressions $R$ and $S$, the following operations over them are defined to produce regular expressions:

- (Concatenation) ($RS$) denotes the set of strings obtained by concatenating a string accepted by $R$ and another accepted by $S$ (in that order).
  
  For example, let $R$ denote {"ab", "c"} and $S$ denote {"d", "ef"}. Then, ($RS$) denotes {"abd", "abef", "cd", "cef"}.

- (Alternation) ($R|S$) denotes the union of the sets denoted by $R$ and $S$.
  
  For example, if $R$ denotes {"ab", "c"} and $S$ denotes {"ab", "d", "ef"}, then ($R|S$) denotes {"ab", "c", "d", "ef"}.

- (Kleene star) ($R^*$) denotes the set of all strings that can be made by concatenating any finite number (including zero) of strings from the set defined by $R$.

  For example, if $R$ denotes {"0", "1"}, ($R^*$) denotes the set of all finite binary strings (including the empty string).
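
The three operations can be tried out on small finite sets of strings. The following is only a sketch: the function names `concat`, `alternate`, and `kleene_star`, as well as the `max_parts` bound, are illustrative assumptions and not part of any library. Since $R^*$ is infinite for any $R$ containing a non-empty string, the sketch only builds the strings made of at most `max_parts` pieces.

```python
from itertools import product


def concat(R, S):
    """Concatenation (RS): each string of R followed by each string of S."""
    return {r + s for r, s in product(R, S)}


def alternate(R, S):
    """Alternation (R|S): plain set union."""
    return R | S


def kleene_star(R, max_parts):
    """Finite slice of R*: concatenations of 0..max_parts strings from R."""
    result = {""}  # concatenating zero strings yields the empty string
    layer = {""}
    for _ in range(max_parts):
        layer = concat(layer, R)  # strings built from one more piece
        result |= layer
    return result


# The examples from the definitions above
concat_RS = concat({"ab", "c"}, {"d", "ef"})
union_RS = alternate({"ab", "c"}, {"ab", "d", "ef"})
binary_up_to_2 = kleene_star({"0", "1"}, max_parts=2)

print(sorted(concat_RS))
print(sorted(union_RS))
print(sorted(binary_up_to_2))
```

Representing languages as Python sets keeps the correspondence with the set-based definitions direct, at the cost of only handling finite languages.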



## Regular Expression Syntax







- ### Wildcard

  The wildcard `.` matches any single character. In Python's `re` module it does not match a newline unless the `re.DOTALL` flag is set.
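
As a small illustration of the wildcard in Python's `re` module, the sketch below searches a made-up sample string (the string and variable names are assumptions chosen for this example). It also shows that `.` skips newlines by default and matches them once `re.DOTALL` is passed.

```python
import re

sample = "cat cot c\nt c9t"

# By default `.` matches any single character except a newline
dot_matches = re.findall(r"c.t", sample)

# With re.DOTALL the newline is matched as well
dotall_matches = re.findall(r"c.t", sample, flags=re.DOTALL)

print(dot_matches)
print(dotall_matches)
```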

- ### Disjunction
  
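  A character set `[]` expresses a disjunction over single characters: it matches any one of the characters listed between the brackets. For example, `[abc]` matches `a`, `b`, or `c`.
  - Ranges: `[0-9]` matches any single digit, and `[a-z]` matches any single lowercase ASCII letter.
  - Negation caret `^`: if the first character of the set is `^`, the set matches any single character that is not listed. For example, `[^0-9]` matches any character that is not a digit.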
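
The following sketch tries out plain sets, ranges, and negated sets with `re.findall`. The sample strings and variable names are assumptions made up for the illustration.

```python
import re

sample = "Room 42B, floor 3"

# Range: any single digit
digits = re.findall(r"[0-9]", sample)

# Range: any single uppercase ASCII letter
uppercase = re.findall(r"[A-Z]", sample)

# Plain set versus negated set on a short word
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")

print(digits, uppercase, vowels, non_vowels)
```

Note that each set matches exactly one character, so `re.findall` returns the digits `4` and `2` separately rather than `42`.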

- ### Quantification

  A quantifier after an element (such as a token, character, or group) specifies how many times the preceding element is allowed to repeat.
  - `?` indicates zero or one occurrences of the preceding element. 
  - `*` indicates zero or more occurrences of the preceding element. 
  - `+` indicates one or more occurrences of the preceding element.
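  - `{n}` indicates exactly `n` occurrences of the preceding element.
  - `{n,}` indicates at least `n` occurrences of the preceding element.
  - `{,m}` indicates at most `m` occurrences of the preceding element.
  - `{n,m}` indicates at least `n` and at most `m` occurrences of the preceding element. Python's `re` module expects no spaces inside the braces; a pattern such as `{n, m}` is read as literal text rather than as a quantifier.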
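
To compare the quantifiers side by side, the sketch below checks which strings of repeated `a` characters are matched in full by each pattern. The helper `matches` and the `candidates` list are hypothetical names introduced for this example. The braces are written without spaces, because Python's `re` module reads a form such as `a{2, 3}` as literal text.

```python
import re


def matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]


candidates = ["", "a", "aa", "aaa", "aaaa"]

optional = matches(r"a?", candidates)        # zero or one
star = matches(r"a*", candidates)            # zero or more
plus = matches(r"a+", candidates)            # one or more
exactly_two = matches(r"a{2}", candidates)   # exactly 2
at_least_two = matches(r"a{2,}", candidates)  # 2 or more
at_most_two = matches(r"a{,2}", candidates)   # 0 to 2
two_to_three = matches(r"a{2,3}", candidates)  # 2 to 3

# With a space inside the braces the pattern is no longer a quantifier
spaced_matches_aa = re.fullmatch(r"a{2, 3}", "aa") is not None
spaced_matches_literal = re.fullmatch(r"a{2, 3}", "a{2, 3}") is not None

for name, result in [("?", optional), ("*", star), ("+", plus),
                     ("{2}", exactly_two), ("{2,}", at_least_two),
                     ("{,2}", at_most_two), ("{2,3}", two_to_three)]:
    print(name, result)
```

`re.fullmatch` is used instead of `re.search` so that each result reflects the quantifier's bounds on the whole string, not just on some substring.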

- ### Anchors

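  Anchors don't match any characters; instead, they assert something about the position at which a match occurs.

  - `^` matches at the start of the string and, in multiline mode (`re.MULTILINE` in Python), also at the start of each line.
  - `$` matches at the end of the string (or just before a trailing newline) and, in multiline mode, also at the end of each line.
  - `\b` matches at a word boundary, i.e. between a word character and a non-word character, or at the start or end of the string next to a word character.
  - `\B` matches at any position that is not a word boundary.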
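
Because anchors match positions rather than characters, the sketch below reports the start offset of each match instead of the matched text. The helper `starts` and the three-line sample string are assumptions made up for this illustration.

```python
import re


def starts(pattern, text, flags=0):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]


# Three lines: "cat" (offset 0), "catalog" (offset 4), "bobcat" (offset 12)
text = "cat\ncatalog\nbobcat"

start_default = starts(r"^cat", text)                  # start of string only
start_multiline = starts(r"^cat", text, re.MULTILINE)  # start of each line
end_default = starts(r"cat$", text)                    # end of string only
end_multiline = starts(r"cat$", text, re.MULTILINE)    # end of each line
whole_word = starts(r"\bcat\b", text)                  # "cat" as a full word
inside_word = starts(r"\Bcat", text)                   # "cat" not at a word start

print(start_default, start_multiline)
print(end_default, end_multiline)
print(whole_word, inside_word)
```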

- ### Grouping

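  Parentheses `()` define the scope and precedence of operators; for example, `(ab)+` repeats the whole sequence `ab`, whereas `ab+` repeats only `b`. They also capture the text matched inside them, so that it can be referred to later (for example with a backreference such as `\1`).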
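
The two roles of parentheses, scoping and capturing, can be sketched with Python's `re` module as follows. The sample strings and variable names are assumptions chosen for the example.

```python
import re

# Scope: without a group, `+` applies to `b` only
no_group = re.fullmatch(r"ab+", "abab")
with_group = re.fullmatch(r"(ab)+", "abab")

# Capture: each parenthesised part is available by its index
date = re.search(r"(\d{4})-(\d{2})", "Released 2021-07")
year, month = date.group(1), date.group(2)

# Backreference: `\1` must repeat exactly what group 1 matched
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
repeated_word = doubled.group(1)

print(no_group, with_group is not None)
print(year, month, repeated_word)
```

The leading `\b` in the last pattern keeps the match from starting in the middle of `this`.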

## Python's `re` Module

The [re](https://docs.python.org/3/library/re.html) module provides regular expression matching operations similar to those found in Perl.

```python
import re
```

### String Pattern Matching

```python
text = "NLP is extremely interesting; I love it. I also love music."
regex = r"love"

# Returns a list of all non-overlapping matches, in the order found
re.findall(regex, text)
```

['love', 'love']
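
Besides `re.findall`, the `re` module offers `re.search`, `re.match`, and `re.finditer`, which return match objects carrying the position of each match. The sketch below reuses the same sentence; the variable names are assumptions introduced for the example.

```python
import re

text = "NLP is extremely interesting; I love it. I also love music."

# re.search scans the whole string and returns the first match (or None)
first = re.search(r"love", text)

# re.match only succeeds if the pattern matches at the very beginning
at_start = re.match(r"love", text)
nlp_at_start = re.match(r"NLP", text)

# re.finditer yields one match object per occurrence
spans = [m.span() for m in re.finditer(r"love", text)]

# With a capturing group, re.findall returns the captured text only
loved_things = re.findall(r"love (\w+)", text)

print(first.span(), at_start, nlp_at_start.group(0))
print(spans)
print(loved_things)
```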

### String Pattern Substitution

```python
# Replaces every match of the pattern with the replacement string
re.sub(r"love", "like", text)
```

'NLP is extremely interesting; I like it. I also like music.'
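
`re.sub` also accepts a `count` argument, backreferences in the replacement string, and a function as the replacement, while `re.subn` additionally reports how many substitutions were made. The sketch below applies these to the same sentence; the variable names are assumptions introduced for the example.

```python
import re

text = "NLP is extremely interesting; I love it. I also love music."

# Limit the number of replacements with `count`
once = re.sub(r"love", "like", text, count=1)

# Refer to a captured group in the replacement with `\1`
emphasised = re.sub(r"love (\w+)", r"really love \1", text)

# Use a function to compute each replacement from its match object
shouted = re.sub(r"love", lambda m: m.group(0).upper(), text)

# re.subn returns the new string together with the replacement count
replaced, n_replaced = re.subn(r"love", "like", text)

print(once)
print(emphasised)
print(shouted)
print(n_replaced)
```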