# A guide to regular expressions

### BSc AI & CS Mentor Program 2023 - 2024. 
>>>>>
### In this guide, we get acquainted with Python regular expressions level-by-level, proceeding from the most basic cases to more and more expressive patterns.

>>>>>>>>>>>
### Part I: The raw syntax of specifying regular expressions

#### **A regular expression.**
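A regular expression is a pattern of characters used to search text. In JavaScript and in tools such as regexr.com, it is written between two forward slashes / /, with optional modifiers (flags) placed after the closing slash; this guide uses that notation. In Python, the pattern is passed to the re module as a string, without the slashes. The expressiveness of regular expressions is achieved through the use of 11 special characters (each bracket pair counted once).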
> [] . ^ $ * + ? {} () \ | 
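
The slash notation above is not used in Python code: there, a pattern is an ordinary string handed to a function of the standard re module. A minimal sketch, in which the sample text and variable names are made up for illustration:

```python
import re

# A raw string (r"...") keeps backslashes intact, so they reach the
# regex engine unchanged.
pattern = r"\d+"
text = "I was born in 1993"

match = re.search(pattern, text)          # first match, or None if there is none
found = match.group() if match else None
print(found)  # 1993
```

Writing patterns as raw strings is a common habit, since most special sequences start with a backslash.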

### **Character classes**

**Character sets allow for the most basic form of matches. They are defined using square brackets [ ].** 

1. [abc]       matching any character in the set: a, b, or c. 

2. [^abc]      negates a character set and matches anything BUT the characters in the set. It is somewhat similar to the logical operator NOT. 
> !! Note: outside square brackets, ^ denotes the beginning of a line. As the first character inside square brackets [], however, it means NOT; anywhere else inside the brackets it is a literal ^. 

3. [a-m]       matching a range of characters, including a and m. 
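
The three kinds of character set can be tried out with re.findall, which returns every match as a list. This is only a sketch; the sample words are arbitrary choices for illustration:

```python
import re

in_set = re.findall(r"[abc]", "cabbage")       # characters that are a, b or c
not_in_set = re.findall(r"[^abc]", "cabbage")  # everything except a, b and c
in_range = re.findall(r"[a-m]", "hamster")     # characters from a to m inclusive

print(in_set)      # ['c', 'a', 'b', 'b', 'a']
print(not_in_set)  # ['g', 'e']
print(in_range)    # ['h', 'a', 'm', 'e']

# A ^ that is not the first character in the brackets is a literal caret.
caret = re.findall(r"[a^]", "a^b")
print(caret)       # ['a', '^']
```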

**Element types: 3 types & their negations.**
Let us consider the following input sentence: **I was born in 1993 and had 2 dogs growing up.**

1. /\w/        denotes a word character: a letter, a digit, or an underscore. 
>**Matches:** ['I', 'w', 'a', 's', 'b', 'o', 'r', 'n', 'i', 'n', '1', '9', '9', '3', 'a', 'n', 'd' .....and so on ].   

2. /\W/        denotes NOT a word character.
>**Matches:** [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']. 

3. /\d/        denotes a digit. 
>**Matches:** ['1', '9', '9', '3', '2']. 

4. /\D/        denotes NOT a digit.
>**Matches:** ['I', ' ', 'w', 'a', 's', ' ', 'b', 'o', 'r', 'n', ' ', 'i', 'n', ' ', 'a', 'n', 'd' .....and so on ].   

5. /\s/        denotes a whitespace character (space, tab, newline). 
>**Matches:** [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

6. /\S/        denotes NOT a whitespace character.
>**Matches:** ['I', 'w', 'a', 's', 'b', 'o', 'r', 'n', 'i', 'n', '1', '9', '9', '3', 'a', 'n', 'd' .....and so on ].   
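
The six element types can be checked against the example sentence with re.findall. The sketch below only assumes Python's standard re module; note that every match comes back as a string, digits included:

```python
import re

sentence = "I was born in 1993 and had 2 dogs growing up."

word_chars = re.findall(r"\w", sentence)   # letters, digits and underscore
non_word = re.findall(r"\W", sentence)     # here: the spaces and the final '.'
digits = re.findall(r"\d", sentence)
non_digits = re.findall(r"\D", sentence)
spaces = re.findall(r"\s", sentence)
non_spaces = re.findall(r"\S", sentence)

print(digits)        # ['1', '9', '9', '3', '2']
print(len(spaces))   # 10
print(non_word[-1])  # .
```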

### **Operators**

**Quantifiers are a type of operator. There are 4 of them:**

1. ' + '      matches when the preceding character occurs >= 1 time in a row.
>Example: /e+/ matches 'e' in '**e**ntropy', 'f**ee**ling', 'Y**eeee**ah', etc.
>> !! Note: /e/ would also match 'f**ee**ling', but with 2 separate '**e**' matches instead of one '**ee**' that /e+/ would produce. 

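2. ' ? '      makes the preceding character optional: it matches whether that character occurs ONCE or DOES NOT occur at all. 

>Example: /colou?r/ matches both '**color**' and '**colour**', but not 'colouur'. 
>> !! Note: on its own, /o?/ finds a match in every string, because zero occurrences also count. It is only useful as part of a longer pattern.  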

3. ' * '      matches when the preceding character occurs any number of times in a row, including zero. If it does occur, the whole run is returned as one match.  
>Example: /go*gle/ matches '**ggle**', '**gogle**' and '**google**'. 

4. ' {n} '     matches when the preceding character occurs exactly 'n' times in a row. Think of set notation: we are looking for {3} occurrences specifically. 

>Example: /o{3}/ matches 'w**ooo**w', but not 'boohoo' or 'Colorado', where no three o's are adjacent. 
>> !! Note: you can, however, specify a range, like {4,5} (no space after the comma). The lookup will then match runs of 4 - 5 of the preceding character. 
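
A short sketch of the four quantifiers in Python follows. The sample strings are chosen for illustration only, and each result in the comments is what re.findall returns for that input:

```python
import re

# +  one or more in a row: each run of e's is a single match
runs = re.findall(r"e+", "feeling Yeeeeah entropy")
singles = re.findall(r"e", "feeling")   # without +, every e is its own match

# ?  zero or one: the u is optional
optional = re.findall(r"colou?r", "color colour colouur")

# *  zero or more in a row
star = re.findall(r"go*gle", "ggle gogle google")
# On its own, * also matches the empty string at every position.
empties = re.findall(r"o*", "sea")

# {n} exactly n in a row, {m,n} between m and n in a row
exactly_three = re.findall(r"o{3}", "wooow boohoo Colorado")
two_to_three = re.findall(r"o{2,3}", "wooow boohoo")

print(runs)           # ['ee', 'eeee', 'e']
print(singles)        # ['e', 'e']
print(optional)       # ['color', 'colour']
print(star)           # ['ggle', 'gogle', 'google']
print(empties)        # ['', '', '', '']
print(exactly_three)  # ['ooo']
print(two_to_three)   # ['ooo', 'oo', 'oo']
```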

**Anchors are operators that match positions (boundaries) in the text rather than characters.**

1. ' \b '        denotes a word boundary. Respectively, \B stands for NOT a word boundary. 

>Example: \bexam\b matches 'exam' but not 'examiner' or 'pre-examination'. 

2. ' ^ '         denotes the beginning of the text, or of every line when the multiline flag m is set.

>Example: ^Exam matches 'Exam' only if it is found at the beginning of a line.

3. ' $ '         denotes the end of the text, or of every line when the multiline flag m is set.

>Example: similarly, exam$ matches 'exam' only if it is found at the end of a line.
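
The anchors can be sketched in Python as follows. The sample strings are invented for illustration, and re.MULTILINE is the Python spelling of the m flag:

```python
import re

# \b: 'exam' must stand alone as a word.
whole_words = re.findall(r"\bexam\b", "exam examiner pre-examination")
print(whole_words)  # ['exam']

# ^ and $ match positions, not characters.
starts = re.search(r"^Exam", "Exam on Monday") is not None   # True
not_start = re.search(r"^Exam", "The Exam") is not None      # False
ends = re.search(r"exam$", "final exam") is not None         # True

# Without re.MULTILINE, ^ only matches at the very start of the string.
two_lines = "Exam one\nExam two"
first_only = re.findall(r"^Exam", two_lines)
every_line = re.findall(r"^Exam", two_lines, flags=re.MULTILINE)
print(first_only, every_line)  # ['Exam'] ['Exam', 'Exam']
```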

**Groups** 

1. (    )    The so-called capturing group: 1) any operator outside the parentheses is applied to the full expression inside them, 2) the expression inside is 'captured' = returned when matched. It helps to think of it as being similar to parentheses ( ) in basic math.
>Example: (abc)+ would have matches in '**abc**', '**abc**isalphabet', '**abcabcabc**isalphabet', etc. ---> Refer back to Quantifiers 1.

2. (?:  )    The so-called "Non-capturing group". Groups the regular expression following ?: so that it takes part in the match, but DOES NOT capture (return) it separately. 
> Example: (?:Emergency )(warning) matches the whole of 'Emergency warning', but captures only 'warning'.
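
The difference between the whole match and what a group captures can be inspected on the match object. This is a sketch under the assumption that the two words are separated by a single space, which the pattern spells out explicitly:

```python
import re

# Capturing group with a quantifier: + applies to the whole of 'abc'.
m = re.search(r"(abc)+", "abcabcabcisalphabet")
whole = m.group(0)      # the full match
captured = m.group(1)   # what the group captured (its last repetition)
print(whole, captured)  # abcabcabc abc

# Non-capturing group: 'Emergency ' is matched but not captured.
m2 = re.search(r"(?:Emergency )(warning)", "Emergency warning")
full_match = m2.group(0)
groups = m2.groups()
print(full_match)  # Emergency warning
print(groups)      # ('warning',)
```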

**Special characters. A functionally diverse group of exceptional vocabulary.** 

1.  .       Matches any single (!) character except newline characters (\n). 
>Example: /r.n/ matches include 'ran', 'run', 'r@n', but it does not match 'r\nn', where a newline sits between the letters, nor 'ra' or 'ru', which are not followed by 'n'.  

2. \        Cancels out a special character, treating it as a literal character. The characters that would otherwise be used as operators are now considered regular characters in the expression. 
>Example: \. matches '.' instead of any character, as per point 1. 
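
Both special characters can be tried out as below. The sketch also uses re.DOTALL, a flag of Python's re module that is not covered in this guide, to show that the newline restriction on the dot can be lifted:

```python
import re

dot_matches = re.findall(r"r.n", "ran run r@n rn")
print(dot_matches)  # ['ran', 'run', 'r@n']

# "r\nn" is a normal (non-raw) string, so \n is a real newline here.
no_newline = re.search(r"r.n", "r\nn")                     # None: . skips newlines
with_dotall = re.search(r"r.n", "r\nn", flags=re.DOTALL)   # matches

# Escaped dot: only literal dots match.
literal_dots = re.findall(r"\.", "a.b.c")
any_chars = re.findall(r".", "a.b")
print(literal_dots)  # ['.', '.']
print(any_chars)     # ['a', '.', 'b']
```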

**Logical operators. Essentially, there is one: the logical OR.**

1. |        The logical OR. 
>Example: Albert|albert matches both 'Albert' and 'albert'.

2. The semantics of NOT are represented by the character-set operator ^ (see Character classes 2). 

3. The semantics of AND are implicit: patterns written one after another must all match in sequence, and grouping lets you combine them hierarchically. 
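
A brief sketch of the logical OR in Python follows. The 'gray/grey' strings are an added illustration, not part of the guide's examples; they show why grouping matters, since | otherwise splits the entire pattern:

```python
import re

names = re.findall(r"Albert|albert", "Albert met albert")
same = re.findall(r"[Aa]lbert", "Albert met albert")   # equivalent with a set
print(names)  # ['Albert', 'albert']

# Inside a group, | only chooses between 'a' and 'e'.
grouped = re.findall(r"gr(?:a|e)y", "gray grey groy")
# Without the group, the alternatives are 'gra' OR 'ey'.
ungrouped = re.findall(r"gra|ey", "gray grey groy")
print(grouped)    # ['gray', 'grey']
print(ungrouped)  # ['gra', 'ey']
```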

**Getting more abstract...**
1. (?=  )   The so-called "Positive lookahead". Matches a pattern only if it is immediately followed by another pattern, without including that second pattern in the match. 

> Example: \b\w+(?=\.com\b) would match 'google' in 'google.com'
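
The lookahead example can be reproduced in Python as sketched below. The second domain in the sample text is an added assumption, included to show a non-match:

```python
import re

text = "google.com and example.org"
pattern = r"\b\w+(?=\.com\b)"

names = re.findall(pattern, text)
print(names)  # ['google']

# The lookahead checks for '.com' but does not consume it:
m = re.search(pattern, text)
rest = text[m.end():]
print(rest)   # .com and example.org
```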

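**Flags, or Modifiers, per their name, modify the general properties of the regular expression pattern. In slash notation they are placed after the closing slash, and their order does not matter. In Python they are passed as a separate argument, for example re.IGNORECASE.**

1. g:    global matching. With it, we retrieve all occurrences of the expression. Without it, only the first one is matched. Python has no g flag: re.search returns the first match, while re.findall and re.finditer return all of them. 

2. i:    case-insensitive modifier: matches both lowercase and uppercase expressions.

3. m:    multiline matching: ^ and $ match at the beginning and end of every line instead of only at the beginning and end of the whole text. 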


>Examples: 
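> - /Prof/gi matches every occurrence of 'prof' regardless of case, for example in 'prof.', 'Prof.' and 'Professor', throughout the text. /Prof/i would match only whichever of those occurs first. /Prof/g would match all instances of 'Prof' with exactly that capitalisation. 
> - /([A-Z])\w+/ matches any uppercase letter between A and Z followed by one or more word characters, for example 'Amsterdam'; the uppercase letter is captured by the group. 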
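
In Python the same effects are obtained by choosing a function (first match versus all matches) and passing flags as an argument. The sketch below uses made-up sample text and names:

```python
import re

text = "Prof. Smith met professor Jones and PROF Lee."

# /Prof/i   -> first match only, ignoring case
first = re.search(r"Prof", text, flags=re.IGNORECASE).group()
# /Prof/gi  -> all matches, ignoring case
all_ci = re.findall(r"Prof", text, flags=re.IGNORECASE)
# /Prof/g   -> all matches, case-sensitive
all_cs = re.findall(r"Prof", text)
print(first, all_ci, all_cs)  # Prof ['Prof', 'prof', 'PROF'] ['Prof']

# Flags are combined with |; here ^ matches on every line, ignoring case.
per_line = re.findall(r"^prof", "Prof A\nprof B", flags=re.IGNORECASE | re.MULTILINE)
print(per_line)  # ['Prof', 'prof']

# With a capturing group, findall returns the captured part only;
# finditer gives access to the whole match.
names = "Ada met Bob in Paris"
initials = re.findall(r"([A-Z])\w+", names)
words = [m.group() for m in re.finditer(r"([A-Z])\w+", names)]
print(initials, words)  # ['A', 'B', 'P'] ['Ada', 'Bob', 'Paris']
```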

#### Complex regular expressions. 
> Example: /\b\w+@(?:student\.vu\.nl)\b/. This example may be challenging to understand, so let us unpack it piece by piece: 
>> - \b : a word boundary
>> - \w+ : one or more word characters
>> - @ : treated literally as @
>> - (?:  ) : a non-capturing group. Inside it: student, then \. (the backslash cancels the special meaning of the dot, so it is treated as a literal '.'), vu, another \. with the same effect, and nl
>> - \b : a closing word boundary
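
The pattern can be tried out in Python as sketched here. The e-mail addresses are hypothetical and serve only as test input:

```python
import re

pattern = r"\b\w+@(?:student\.vu\.nl)\b"

# Hypothetical addresses, invented for this example.
text = "Write to abc123@student.vu.nl or admin@vu.nl."

# The group is non-capturing, so findall returns the whole match.
addresses = re.findall(pattern, text)
print(addresses)  # ['abc123@student.vu.nl']

# The escaped dots only match literal dots.
wrong_dots = re.search(pattern, "abc123@studentXvuXnl")
print(wrong_dots)  # None
```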

>>>>>>>>>>>
### The next part is on the functions of Python's re module. TBC. 

##### Created by Ilona Masiuk, Vrije Universiteit Amsterdam, 30th March 2024. 

Sources: 
    https://regexr.com/