# Regular Expressions

## What are Regular Expressions?

Regular expressions are special sequences of (meta)characters that define patterns for string matching. They are sometimes referred to as regexes, REs, or regex patterns. Suppose you have a dataset of strings, such as sentences or email addresses, and you want to check whether the data contains a specific pattern. You can do this quickly (and usually easily) with a regex, which identifies and retrieves the specified pattern if it exists.


## Why use it? 
Because it allows quick identification and retrieval of patterns in strings! Regex makes it easy to ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”.

You can also use regex to modify a string or to split it apart in various ways.

**Note**: As with anything, there are limitations in certain scenarios. For example, the regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions.

In some cases, writing out the full code is better and more understandable, especially for others who are unfamiliar with regex characters.


## Let's Get Started
### Determine if a string matches

There are a few ways to check if a string contains a specific substring. Let's take, for example, the substring `'31'` in the string `'cod31srad'`.

1. The `in` operator lets you check if a string contains a particular substring/pattern.

~~~ python
t = 'cod31srad'
'31' in t
# output: True
~~~
Now try `'312'`. What happens and why?


2. The `.find()` and `.index()` methods return the character position of the substring within a string.

~~~ python
t = 'cod31srad'
t.find('31')
# output: 3
~~~

~~~ python
t = 'cod31srad'
t.index('31')
# output: 3
~~~

**Note**: both methods return only the position of the first character of the first occurrence of the substring. For example, searching for `'31'` returns index `3`; to find the position of `'1'`, pass only `'1'` inside the parentheses, which returns index `4`.
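
The two methods differ when the substring is missing: `.find()` returns `-1`, while `.index()` raises a `ValueError`. Here is a minimal sketch, where `'99'` and `'banana'` are only illustrative values chosen for this example:

```python
t = 'cod31srad'

# .find() signals "not found" by returning -1
missing_position = t.find('99')
print(missing_position)  # -1

# .index() raises ValueError instead of returning -1
try:
    t.index('99')
    index_raised = False
except ValueError:
    index_raised = True
print(index_raised)  # True

# both methods report only the first occurrence
first_a = 'banana'.find('a')
print(first_a)  # 1
```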

But what if you don't want to search for a fixed substring, perhaps because you aren't too familiar with the data? How would you determine whether a string contains ***any*** two consecutive digit characters, not just `'31'`?

This is where regex can start to be helpful! 

## Access the `re` module

To access the regex functions and methods in Python, you'll use the module named **`re`**.

**Import the `re` module into your environment.**
~~~ python 
import re
~~~

One widely used function in the `re` module is **`search()`**. This function scans a string for the first location of a defined pattern (i.e., it finds the first location where the pattern **`<regex>`** matches).

A call to **`re.search()`** has four parts: 1.) the **`re`** module prefix, 2.) the **`search()`** function, 3.) the **`<regex>`** pattern, and 4.) the string to search.
    
**Basic syntax** 
~~~ python 
re.search(<regex>, <string>)
~~~ 
 
If there is a match, a match object is returned. If no match is found, `None` is returned. **`search()`** can also take optional **`<flags>`** arguments; we'll touch on these later, so stay tuned.
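
As a rough sketch of both behaviors, the snippet below shows a search that returns `None` and one that uses `re.IGNORECASE`, one of the standard flags in the `re` module. The patterns `'99'` and `'COD'` are illustrative values chosen for this example.

```python
import re

t = 'cod31srad'

# no match: search() returns None
no_match = re.search('99', t)
print(no_match)  # None

# matching is case-sensitive by default
case_sensitive = re.search('COD', t)
print(case_sensitive)  # None

# re.IGNORECASE is an optional flag that makes matching case-insensitive
case_insensitive = re.search('COD', t, re.IGNORECASE)
print(case_insensitive.group())  # cod
```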


## Match a pattern with `re.search()`

### Import the `search()` function
You can import the **`search()`** function in one of two ways. 


The first is to import the entire module, then use the module name as a prefix when calling the **`search()`**. 

1.) First approach
~~~ python 
import re
re.search(...)
~~~

The second is to import the function from the module by name, then refer to it without the module name prefix **`re.`**.

2.) Second approach
~~~ python 
from re import search
search(...)
~~~

Neither is better than the other; it really comes down to preference. I find the first approach easier to follow. Either way, you must import **`search()`** by one means or another before you can use it.

### Trying it out

Take the example below. We have the **`<regex>`** search pattern `'31'` and the **`<string>`** `t`. The output is a match object, which means the **`<regex>`** search pattern `'31'` is present in the **`<string>`** `t`. If there were ***no*** match, the output would be **`None`**.

~~~ python
# remember to always import re!
import re

t = 'cod31srad'

re.search('31', t)

# output: <re.Match object; span=(3, 5), match='31'>
~~~

Since **`search()`** returns **`None`** when no match is found, it's easy to test whether there is a match using a simple if statement:

~~~ python
if re.search('31', t):
    print('Found a match.')
else:
    print('No match.')

# output: Found a match.
~~~ 

Back in the first example, the output displays the match object as <kbd>`<re.Match object; span=(3, 5), match='31'>`</kbd>. <br>

1.) <kbd>`span=(3, 5)`</kbd> gives the portion of **`<string>`** where the match was found (i.e., the same characters you would get with the slice notation <kbd>`t[3:5]`</kbd>) <br>
2.) <kbd>`match='31'`</kbd> shows which characters from the **`<string>`** matched.
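
The same details can be read from the match object in code. This is a small sketch using the match object's `span()` and `group()` methods; the variable names are chosen here for illustration.

```python
import re

t = 'cod31srad'
m = re.search('31', t)

# span() returns the (start, end) positions of the match
start, end = m.span()
print(start, end)  # 3 5

# group() returns the matched characters
matched_text = m.group()
print(matched_text)  # 31

# slicing the original string with the span gives the same characters
print(t[start:end])  # 31
```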

### Practice 1

~~~python
# try on your own!

string = "The quick brown fox jumps over the lazy dog."

re.search('fox', string)
~~~

## Create a matching pattern

Metacharacters are special characters that have unique meanings and enhance the capability of the regex matching search. 

Let's reconsider our original question about whether a string contains any two consecutive digits. We can check our **`<string>`** by placing a set of characters in square brackets (`[]`) to create a **character class**.

In the example below, we start building an expression with `[0-9]`, which matches any single digit character (i.e., any character between '0' and '9'). The full expression `[0-9][0-9]` matches any sequence of two digit characters.

### Regex metacharacters
#### Matching a specific set of characters using character class
1.) creating a character class (`[]`)
~~~python
import re

t = 'cod31srad'
re.search('[0-9][0-9]', t)

# output: <re.Match object; span=(3, 5), match='31'>
~~~

We can see the expression also works for the following strings, but it only matches where **two** **consecutive** digits appear.
~~~python
import re

re.search('[0-9][0-9]', 'book55')
# output: <re.Match object; span=(4, 6), match='55'>

re.search('[0-9][0-9]', 'quote22')
# output: <re.Match object; span=(5, 7), match='22'>
~~~

1a.) character class limitation
~~~python
import re

# the digits in this string are never next to each other, so there is no match
print(re.search('[0-9][0-9]', 'c0dei5fory0u'))
# output: None
~~~
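
To see why the search above fails, it can help to compare it with a single `[0-9]` on the same string. A minimal sketch (the variable names are illustrative):

```python
import re

s = 'c0dei5fory0u'

# two digits in a row never occur in this string
two_digits = re.search('[0-9][0-9]', s)
print(two_digits)  # None

# a single digit class finds the first digit on its own
one_digit = re.search('[0-9]', s)
print(one_digit.span())   # (1, 2)
print(one_digit.group())  # 0
```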
#### dot `.` 
If we want to match any character in a given position, not only a digit, we can use the dot `.` metacharacter. The dot `.` functions like a wildcard and matches any single character except a newline.

2.) creating regex search using a dot `.` metacharacter
~~~python
import re

t = 'cod315rad'
re.search('3.5', t)
# output: <re.Match object; span=(3, 6), match='315'>

# the dot must match exactly one character, so '35' alone does not match
t = 'cod35rad'
print(re.search('3.5', t))
# output: None
~~~
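
The dot is not limited to digits. The sketch below tries a few illustrative strings (chosen for this example) and also shows the newline exception:

```python
import re

# the dot matches a letter, a digit, a space, or punctuation
samples = ['3a5', '315', '3 5', '3-5']
results = [bool(re.search('3.5', text)) for text in samples]
print(results)  # [True, True, True, True]

# but it does not match a newline character
newline_match = re.search('3.5', '3\n5')
print(newline_match)  # None
```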

### Practice 2

~~~python
### Try this on your own!
string = "The quick brown fox jumps over the lazy dog."

# the r prefix makes a raw string, so the backslash reaches the regex engine unchanged
re.search(r'\.', string)
~~~

## Matching single characters

~~~python
[]

#Specifies a specific set of characters to match
~~~

Character sequences contained in square brackets (`[]`) represent a character class and will match any single character contained in the class.

The metacharacter sequence `[artz]` will match any single 'a', 'r', 't', or 'z' character. Below, the regex `ba[artz]` matches both 'bat' and 'baz' (and would also match 'baa' and 'bar').

~~~python
re.search('ba[artz]', 'c0d3bats')
# output: <re.Match object; span=(4, 7), match='bat'>

re.search('ba[artz]', 'c0d3bazz')
# output: <re.Match object; span=(4, 7), match='baz'>
~~~

A hyphen can also be used to specify a range of characters, matching any single character within the range. For example, `[a-z]` matches any lowercase alphabetic character between 'a' and 'z'. In the same way, `[0-9]` matches any digit character between '0' and '9'.

~~~python
re.search('[a-z]', 'C0d3bar')
# output: <re.Match object; span=(2, 3), match='d'>
~~~

Let's create a more complex character class.
~~~python
re.search('[0-9a-fA-F]', '--- a0 ---')  # play around with this
# output: <re.Match object; span=(4, 5), match='a'>
~~~

The `[0-9a-fA-F]` expression matches the first [hexadecimal](https://www.lifewire.com/what-is-hexadecimal-2625897) digit character in the search string, which is 'a'.

**Note**: Regular expressions always return the leftmost possible match. For example, `re.search()` scans the search string from left to right, and as soon as it locates a match for `<regex>`, it stops scanning and returns the match.
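
A quick way to see the leftmost rule is to compare `re.search()` with `re.findall()`. Note that `re.findall()` is not covered in this workshop; it is a standard `re` function that returns every non-overlapping match, used here only for contrast. The string is an illustrative value.

```python
import re

text = '--- a0 f9 ---'

# search() stops at the leftmost match
first = re.search('[0-9a-fA-F]', text)
print(first.span(), first.group())  # (4, 5) a

# findall() keeps scanning and collects every match
all_matches = re.findall('[0-9a-fA-F]', text)
print(all_matches)  # ['a', '0', 'f', '9']
```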

#### Complementing character classes using `^`

If `^` is the first character in a character class, then the class matches any character that **is not** in the set. For example, the character class `[^0-9]` matches any character that is not a digit.

~~~python
re.search('[^0-9]', '12345code')
# output: <re.Match object; span=(5, 6), match='c'>
~~~
The first character in the string that isn’t a digit is 'c'.

**Note**: if `^` is not the first character in a character class, then it has no special meaning and matches only a **literal** `^` character (see below).
~~~python
re.search('[#:^]', 'code^bats:baz')
# output: <re.Match object; span=(4, 5), match='^'>
~~~
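
Putting the two cases side by side may make the difference clearer. In this sketch the only change between the two patterns is the position of `^`; the string `'ab^12'` is an illustrative value.

```python
import re

text = 'ab^12'

# ^ first: the class is complemented, so the first non-digit matches
negated = re.search('[^0-9]', text)
print(negated.span(), negated.group())  # (0, 1) a

# ^ not first: it is a literal caret, so the class means "a digit or ^"
literal = re.search('[0-9^]', text)
print(literal.span(), literal.group())  # (2, 3) ^
```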

### Practice 3

~~~python
# try these two on your own!
string = "The quick brown fox jumps over the lazy dog."

re.search(r'q[\w]ck', string)
re.search('[^aeiou]', string)
~~~

#### Match metacharacters literally
When using metacharacters, think about whether you want 1.) the character's special meaning or 2.) a literal match. To match a metacharacter literally, you may need to escape it with a backslash (`\`).

For example, to include a literal `]` in a character class, either place it first in the class or escape it with a backslash (`\`).

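~~~python
# ] placed first in the class is treated literally
re.search('[]]', 'code[1]')
# output: <re.Match object; span=(6, 7), match=']'>

# ] escaped with a backslash (the raw string keeps the backslash intact)
re.search(r'[ab\]cd]', 'code[1]')
# output: <re.Match object; span=(0, 1), match='c'>
~~~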
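
Escaping works the same way outside a character class. The sketch below contrasts an unescaped dot with an escaped one, and also uses `re.escape()`, a standard `re` helper not covered in this workshop, which adds the backslashes for you. The string is an illustrative value.

```python
import re

text = 'price: 315 or 3.5'

# unescaped dot: acts as a wildcard, so '315' matches first
wildcard = re.search('3.5', text)
print(wildcard.group())  # 315

# escaped dot: matches only a literal '.'
literal = re.search(r'3\.5', text)
print(literal.group())  # 3.5

# re.escape() builds the escaped pattern from plain text
escaped_pattern = re.escape('3.5')
print(escaped_pattern)  # 3\.5
```

Writing patterns as raw strings (the `r` prefix) keeps Python from interpreting the backslash before the regex engine sees it.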

### Practice 4

~~~python
### Try on your own!
string = "The quick brown fox jumps over the lazy dog."

re.search(r'\$', string)
~~~

#### Match any word characters
`\w` matches any alphanumeric word character: uppercase and lowercase letters, digits, and the underscore (`_`). For ASCII text, `\w` is shorthand for `[a-zA-Z0-9_]`, so the output will be the same if you use the longer version `[a-zA-Z0-9_]`.

~~~python
# short version
re.search(r'\w', '#(.a$@&')
# output: <re.Match object; span=(3, 4), match='a'>


# long version
re.search('[a-zA-Z0-9_]', '#(.a$@&')
# output: <re.Match object; span=(3, 4), match='a'>
~~~

#### Match non-word characters
`\W` is the opposite of `\w`. `\W` matches any **non-word** character. The equivalent long version of `\W` is `[^a-zA-Z0-9_]`.

~~~python
re.search(r'\W', '#(.a$@&')
# output: <re.Match object; span=(0, 1), match='#'>
~~~
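
One caveat worth knowing: for ordinary Python 3 strings, `\w` also matches letters outside the ASCII range, so it is only identical to `[a-zA-Z0-9_]` on ASCII text. The sketch below uses the standard `re.ASCII` flag to show the difference; the sample string is an illustrative value.

```python
import re

# '\u00e9' is the letter e with an acute accent
text = '#\u00e9_'

# default behavior: the accented letter counts as a word character
unicode_word = re.search(r'\w', text)
print(unicode_word.span())  # (1, 2)

# with re.ASCII, \w is restricted to [a-zA-Z0-9_], so the underscore matches first
ascii_word = re.search(r'\w', text, re.ASCII)
print(ascii_word.span(), ascii_word.group())  # (2, 3) _
```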

#### Match any whitespace characters
Unlike the dot `.` wildcard metacharacter, `\s` matches any whitespace character, including a newline.
~~~python
re.search(r'\s', 'code\nfun bats')
# output: <re.Match object; span=(4, 5), match='\n'>
~~~

The opposite of `\s` is `\S`, which matches any character that **is not** a whitespace character. See the example below.

~~~python
re.search(r'\S', '  \n code  \n  ')
# output: <re.Match object; span=(4, 5), match='c'>
~~~
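
Spaces are not the only whitespace. As a small sketch, the loop below checks a few common whitespace characters against `\s`; the list of characters is an illustrative selection, not a complete one.

```python
import re

# space, tab, newline, and carriage return all count as whitespace
whitespace_chars = [' ', '\t', '\n', '\r']
is_whitespace = [bool(re.search(r'\s', ch)) for ch in whitespace_chars]
print(is_whitespace)  # [True, True, True, True]

# \S skips all of them and lands on the first visible character
first_visible = re.search(r'\S', '\t \n code')
print(first_visible.span(), first_visible.group())  # (4, 5) c
```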

### Practice 5

~~~python
### Try out these two on your own!
string = "The quick brown fox jumps over the lazy dog."

re.search(r'\w', string)
re.search(r'\W', string)
~~~

#### Match any digit characters
`\d` matches any decimal digit character and, for ASCII text, is equivalent to `[0-9]`. The opposite of `\d` is `\D`, which matches any character that **is not** a decimal digit.

~~~python
re.search(r'\d', 'abc4def')
# output: <re.Match object; span=(3, 4), match='4'>
~~~
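
With `\d` available, the two-digit search from earlier can be written more compactly. A minimal sketch comparing the short and long forms on the same string (variable names are illustrative):

```python
import re

t = 'cod31srad'

# \d\d and [0-9][0-9] find the same two consecutive digits
short_form = re.search(r'\d\d', t)
long_form = re.search('[0-9][0-9]', t)
print(short_form.span(), short_form.group())  # (3, 5) 31
print(short_form.span() == long_form.span())  # True

# \D finds the first character that is not a digit
non_digit = re.search(r'\D', '123abc')
print(non_digit.span(), non_digit.group())  # (3, 4) a
```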

### Practice 6

~~~python
# Give it a try!
string = "The quick brown fox jumps over the lazy dog."

re.search(r'\d', string)
re.search(r'\D', string)
re.search(r'\D+', string)  # matches the longest run of non-digit characters
~~~

## Additional Metacharacters supported by `re`

- **<kbd>.</kbd>**: Matches any single character except newline <br>
- **<kbd>^</kbd>**: Anchors a match at the start of a string **AND** Complements a character class <br>
- **<kbd>$</kbd>**: Anchors a match at the end of a string<br>
- **<kbd>*</kbd>**: Matches zero or more repetitions<br>
- **<kbd>+</kbd>**: Matches one or more repetitions<br>
- **<kbd>?</kbd>**: Matches zero or one repetition **AND** Specifies the non-greedy versions of `*`, `+`, and `?` **AND** Introduces a lookahead or lookbehind assertion **AND** Creates a named group<br>
- **<kbd>{}</kbd>**: Matches an explicitly specified number of repetitions<br>
- **<kbd>`\`</kbd>**: Escapes a metacharacter, removing its special meaning **AND** Introduces a special character class **AND** Introduces a grouping backreference<br>
- **<kbd>[]</kbd>**: Specifies a character class<br>
- **<kbd>|</kbd>**: Designates alternation<br>
- **<kbd>()</kbd>**: Creates a group<br>
- **<kbd>:</kbd>**: Designate a specialized group<br>
- **<kbd>#</kbd>**: Designate a specialized group<br>
- **<kbd>=</kbd>**: Designate a specialized group<br>
- **<kbd>!</kbd>**: Designate a specialized group<br>
- **<kbd>`<>`</kbd>**: Creates a named group<br>
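
As a brief taste of a few entries in this list (not a full treatment), the sketch below applies anchors, quantifiers, alternation, and groups to the string used throughout this workshop. The patterns are illustrative choices.

```python
import re

s = 'cod31srad'

starts_with = re.search('^cod', s)         # ^ anchors the match at the start
ends_with = re.search('rad$', s)           # $ anchors the match at the end
one_or_more = re.search('[0-9]+', s)       # + means one or more repetitions
exactly_two = re.search('[a-z]{2}', s)     # {2} means exactly two repetitions
either = re.search('cat|rad', s)           # | tries each alternative
groups = re.search('([a-z]+)([0-9]+)', s)  # () creates numbered groups

print(starts_with.group())   # cod
print(ends_with.group())     # rad
print(one_or_more.group())   # 31
print(exactly_two.group())   # co
print(either.group())        # rad
print(groups.group(1), groups.group(2))  # cod 31
```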

## Until next time! 

There is a lot more to regular expressions that is not covered in this introductory workshop, such as more on escaping metacharacters, anchors, quantifiers, grouping constructs and backreferences, lookahead and lookbehind assertions, and much, much more.

#### Reach out to me with questions! 
Email: lbeltran@andrew.cmu.edu <br>
Get coding help by scheduling a consultation by visiting the [Data & Code Support](https://www.library.cmu.edu/service/data-code-support) page.  