## Regular Expressions

Regular expressions (often shortened as RegEx or regex) are tools, and like all tools, regular expressions are designed to solve a very specific problem. The best way to understand regular expressions and what they do is to understand the problem they solve.

Consider the following scenarios:

- You are searching for a file containing the text car (regardless of case) but do not want to also locate car in the middle of a word (for example, scar, carry, and incarcerate).

- You are generating a Web page and need to display text retrieved from a database. Text may contain URLs, and you want those URLs to be clickable in the generated page (so that instead of generating just text, you generate a valid HTML `<a href></a>`).

- You create an app with a form that prompts for user information including e-mail address. You need to verify that specified addresses are formatted correctly (that they are syntactically valid).

- You are editing source code and need to replace all occurrences of size with iSize, but only size and not size as part of another word.

- You are displaying a list of all files in your computer file system and want to filter so that you locate only files containing the text Application.

- You are importing data into an application. The data is tab delimited and your application supports CSV format files (one row per line, comma-delimited values, each possibly enclosed with quotes).

- You need to search a file for some specific text, but only at a specific location (perhaps at the start of a line or at the end of a sentence).


**Regular expressions are written using the following rules:**

<img src='character_classes.png'>

#### Here is a list of all the regex operators
![](regex.png)

### Matching a single set of characters

To match a single set of characters, we just need to define the pattern in terms of the characters to be matched.

For example, imagine our text is:

```
Hello, my name is Ben. Please visit
my website at http://www.forta.com/.

```
Suppose we want to match `Ben`. Let's see how we can do that.


In [1]:
import re

In [2]:
## Variation 1: exact, case-sensitive pattern
pattern = re.compile(r'Ben')
text = '''Hello, my name is Ben. Please visit
my website at http://www.forta.com/.'''
re.findall(pattern,text)

['Ben']

In [3]:
## Variation 2: lowercase pattern, so the case-sensitive search finds nothing
pattern = re.compile(r'ben')
text = '''Hello, my name is Ben. Please visit
my website at http://www.forta.com/.'''
re.findall(pattern,text)

[]

In [4]:
## Variation 3: re.IGNORECASE makes the match case-insensitive
pattern = re.compile(r'Ben',re.IGNORECASE)
text = '''Hello, my name is Ben. Please visit
my website at http://www.forta.com/.'''
re.findall(pattern,text)

['Ben']
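
As a small sketch of the same idea, a compiled pattern object also has its own `findall` method, and with `re.IGNORECASE` even the lowercase pattern from Variation 2 finds `Ben`. The variable names below are illustrative only.

```python
import re

text = '''Hello, my name is Ben. Please visit
my website at http://www.forta.com/.'''

# Without a flag the match is case-sensitive, so 'ben' finds nothing.
case_sensitive = re.compile(r'ben').findall(text)

# With re.IGNORECASE the same lowercase pattern matches 'Ben'.
case_insensitive = re.compile(r'ben', re.IGNORECASE).findall(text)

print(case_sensitive)    # []
print(case_insensitive)  # ['Ben']
```

Calling `pattern.findall(text)` is equivalent to `re.findall(pattern, text)`; which form to use is mostly a matter of taste.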

Using the same text, match `my`:

In [5]:
pattern = re.compile(r'my',re.IGNORECASE)
re.findall(pattern,text)

['my', 'my']

### Using regex operator '.' 
Suppose we have the following list of files:
```
sales1.xls
orders3.xls
sales2.xls
sales3.xls
apac1.xls
europe2.xls
na1.xls
na2.xls
sa1.xls
```
We want to match all the files that start with `sales`.

In [6]:
text = '''sales1.xls
orders3.xls
sales2.xls
sales3.xls
apac1.xls
europe2.xls
na1.xls
na2.xls
sa1.xls'''
pattern = re.compile(r'sales.',re.IGNORECASE)
re.findall(pattern,text)

['sales1', 'sales2', 'sales3']

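The pattern `.` matches any single character except a newline (by default), so `sales.` matches `sales` followed by exactly one more character.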
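
The newline exception is easy to overlook, so here is a minimal sketch of it. The sample string is made up for illustration; `re.DOTALL` is the standard flag that lets `.` match a newline too.

```python
import re

sample = 'a-b a\nb'  # hypothetical text: one 'a?b' on a line, one split by a newline

# By default '.' matches any character except a newline.
default_hits = re.findall(r'a.b', sample)

# With re.DOTALL, '.' also matches the newline.
dotall_hits = re.findall(r'a.b', sample, re.DOTALL)

print(default_hits)  # ['a-b']
print(dotall_hits)   # ['a-b', 'a\nb']
```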

Next, match the letter `a` in the file names, together with one character on either side of it.

A good resource for experimenting with regular expressions is [this website](https://regex101.com).

In [7]:
pattern = re.compile(r'.a.')
re.findall(pattern, text)

['sal', 'sal', 'sal', 'pac', 'na1', 'na2', 'sa1']
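
Note that `.a.` returns three-character fragments rather than whole file names, and it skips the leading `a` of `apac1.xls` because `.` needs a character before it on the same line. If the goal is the full names that contain an `a`, one possible sketch (an assumption about the intent, not the only approach) is to test each name separately with `re.search`:

```python
import re

text = '''sales1.xls
orders3.xls
sales2.xls
sales3.xls
apac1.xls
europe2.xls
na1.xls
na2.xls
sa1.xls'''

# Check every file name on its own; re.search succeeds if 'a' occurs anywhere in it.
names_with_a = [name for name in text.splitlines() if re.search(r'a', name)]

print(names_with_a)
# ['sales1.xls', 'sales2.xls', 'sales3.xls', 'apac1.xls', 'na1.xls', 'na2.xls', 'sa1.xls']
```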

#### Matching `.`
Because `.` is a metacharacter, we use `\.` to match a literal `.`. Let's see an example. Imagine we need to match file names in which `a` has exactly one character before it and one character after it, followed by a literal `.`.

In [8]:
pattern = re.compile(r'.a.\.')
re.findall(pattern, text)

['na1.', 'na2.', 'sa1.']

In [9]:
pattern = re.compile(r'.a.\.xls')
re.findall(pattern, text)

['na1.xls', 'na2.xls', 'sa1.xls']
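
To make the difference between `.` and `\.` concrete, here is a hedged sketch. The list `names` (including the odd entry `na1-xls`) is invented for the comparison; `re.escape` is the standard-library helper that escapes metacharacters for you.

```python
import re

names = ['na1.xls', 'na1-xls', 'sa1.xls']  # hypothetical names for illustration

# Unescaped: '.' matches any character, so 'na1-xls' slips through.
loose = re.compile(r'na1.xls')
# Escaped: '\.' matches only a literal dot.
strict = re.compile(r'na1\.xls')

loose_hits = [n for n in names if loose.fullmatch(n)]
strict_hits = [n for n in names if strict.fullmatch(n)]
escaped = re.escape('na1.xls')

print(loose_hits)   # ['na1.xls', 'na1-xls']
print(strict_hits)  # ['na1.xls']
print(escaped)      # na1\.xls
```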

### Matching set of characters

Consider the following code snippet. 

```javascript
var myArray = new Array();
∙∙∙
if (myArray[0] == 0) {
∙∙∙
}

```
Now, imagine your task is to find all the references to `myArray[0]`, `myArray[1]`, `myArray[2]`... `myArray[N]`.

We can use metacharacters to define that.

Let's start with a simple example of matching digits.

#### Matching digits

We can use `\d` or `[0-9]` to match digits in text.

In [10]:
text = "This text contains a digit 99 here and another digit 111 here"
pattern = re.compile(r'\d')
re.findall(pattern,text)

['9', '9', '1', '1', '1']

`\d` matches only a single digit, but what we want is to match one or more digits. We can use `+` to signify that the preceding element must occur one or more times.

In [11]:
pattern = re.compile(r'\d+')
re.findall(pattern,text)

['99', '111']

In [12]:
pattern = re.compile(r'[0-9]+')
re.findall(pattern,text)

['99', '111']
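
The matches returned by `findall` are strings. A short sketch of turning them into numbers (the variable names are illustrative):

```python
import re

text = "This text contains a digit 99 here and another digit 111 here"

single_digits = re.findall(r'\d', text)    # one digit per match
whole_numbers = re.findall(r'\d+', text)   # runs of one or more digits
as_ints = [int(n) for n in whole_numbers]  # convert the matched strings to integers

print(single_digits)  # ['9', '9', '1', '1', '1']
print(as_ints)        # [99, 111]
```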

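#### Matching letters

We can use character classes to match letters as well. `[a-zA-Z]` matches only ASCII letters, while `\w` matches any word character: letters, digits, and the underscore. Note that a class written as `[A-Z a-z]` would also match a space.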

In [15]:
pattern = re.compile(r'\w+')
re.findall(pattern,text)

['This',
 'text',
 'contains',
 'a',
 'digit',
 '99',
 'here',
 'and',
 'another',
 'digit',
 '111',
 'here']

In [18]:
pattern = re.compile(r'[a-zA-Z]+')
re.findall(pattern,text)

['This',
 'text',
 'contains',
 'a',
 'digit',
 'here',
 'and',
 'another',
 'digit',
 'here']
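
The difference between these classes shows up as soon as the text contains digits, underscores, or spaces. A minimal sketch on a made-up sample string:

```python
import re

sample = 'snake_case 99 ok'  # hypothetical text for illustration

word_chars = re.findall(r'\w+', sample)            # letters, digits, and underscore
letters_only = re.findall(r'[a-zA-Z]+', sample)    # ASCII letters only
with_space = re.findall(r'[A-Z a-z]+', sample)     # the space inside the class is matched too

print(word_chars)    # ['snake_case', '99', 'ok']
print(letters_only)  # ['snake', 'case', 'ok']
print(with_space)    # ['snake', 'case ', ' ok']
```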

Match alphanumeric characters in the following codes:
```
11213
A1C2E3
48075
48237
M1B4F2
90046
H1H2H2
```


In [19]:
text = '''11213
A1C2E3
48075
48237
M1B4F2
90046
H1H2H2'''
pattern = re.compile(r'\w+')
re.findall(pattern,text)

['11213', 'A1C2E3', '48075', '48237', 'M1B4F2', '90046', 'H1H2H2']

In [20]:
pattern = re.compile(r'\w\d\w\d\w\d')
re.findall(pattern,text)

['A1C2E3', 'M1B4F2', 'H1H2H2']

In [21]:
pattern = re.compile(r'\d{5}')
re.findall(pattern,text)

['11213', '48075', '48237', '90046']
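
Because `\w` also matches digits, `\w\d\w\d\w\d` would accept a six-digit number as well. The sketch below shows this on a shortened copy of the data with one extra line, `123456`, added purely as a hypothetical test case, and contrasts it with an explicit letter class.

```python
import re

text = '''11213
A1C2E3
48075
123456'''

# \w accepts digits, so the all-digit line also matches.
loose = re.findall(r'\w\d\w\d\w\d', text)

# [A-Z] insists on an uppercase letter in the odd positions.
strict = re.findall(r'[A-Z]\d[A-Z]\d[A-Z]\d', text)

print(loose)   # ['A1C2E3', '123456']
print(strict)  # ['A1C2E3']
```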

Consider the following code snippet,

```javascript
var myArray = new Array();
∙∙∙
if (myArray[0] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[1] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[2] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[3] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[4] == 0) {
∙∙∙
}

```
You need to extract `myArray[0], myArray[1] ... myArray[4]`.

In [22]:
text = '''
var myArray = new Array();
∙∙∙
if (myArray[0] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[1] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[2] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[3] == 0) {
∙∙∙
}
var myArray = new Array();
∙∙∙
if (myArray[4] == 0) {
∙∙∙
}
'''
pattern = re.compile(r'myArray\[[0-9]\]')
re.findall(pattern, text)

['myArray[0]', 'myArray[1]', 'myArray[2]', 'myArray[3]', 'myArray[4]']
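
The class `[0-9]` matches exactly one digit, so an index such as `myArray[10]` would be missed. A sketch of the more general `myArray[N]` case, using an invented one-line sample:

```python
import re

sample = 'myArray[0] = 1; myArray[10] = 2; myArray[i] = 3;'  # hypothetical code line

# One digit only: the two-digit index is not found.
single_digit = re.findall(r'myArray\[[0-9]\]', sample)

# One or more digits: any numeric index is found, but not the variable index 'i'.
any_index = re.findall(r'myArray\[[0-9]+\]', sample)

print(single_digit)  # ['myArray[0]']
print(any_index)     # ['myArray[0]', 'myArray[10]']
```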

Consider the text below:
```
Send personal email to ben@forta.com or
ben.forta@forta.com. For questions about a
book use support@forta.com. If your message
is urgent try ben@urgent.forta.com. Feel
free to send unsolicited email to
spam@forta.com (wouldn't it be nice if
it were that simple, huh?).
```
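Extract the email addresses from this text. The simple pattern `\w+@\w+\.\w+` used below is only a first attempt: because `\w` does not match `.`, it truncates `ben.forta@forta.com` and `ben@urgent.forta.com`.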

In [24]:
text = '''Send personal email to ben@forta.com or
ben.forta@forta.com. For questions about a
book use support@forta.com. If your message
is urgent try ben@urgent.forta.com. Feel
free to send unsolicited email to
spam@forta.com (wouldn't it be nice if
it were that simple, huh?).'''
pattern = re.compile(r'\w+@\w+\.\w+')
re.findall(pattern,text)

['ben@forta.com',
 'forta@forta.com',
 'support@forta.com',
 'ben@urgent.forta',
 'spam@forta.com']
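
One way to recover the full addresses is to allow dots inside the name and domain parts with the class `[\w.]`. This is a sketch that works for this sample text, not a complete email validator.

```python
import re

text = '''Send personal email to ben@forta.com or
ben.forta@forta.com. For questions about a
book use support@forta.com. If your message
is urgent try ben@urgent.forta.com. Feel
free to send unsolicited email to
spam@forta.com (wouldn't it be nice if
it were that simple, huh?).'''

# [\w.]+ allows dots in the user and domain parts; the final \.\w+ keeps
# the sentence-ending period out of the match.
emails = re.findall(r'[\w.]+@[\w.]+\.\w+', text)

print(emails)
# ['ben@forta.com', 'ben.forta@forta.com', 'support@forta.com',
#  'ben@urgent.forta.com', 'spam@forta.com']
```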

Consider the data given below:

```shell
1001: $496.80
1002: $1290.69
1003: $26.43
1004: $613.42
1005: $7.61
1006: $414.90
1007: $25.00
```
Extract the dollar amount only.

In [26]:
pattern = re.compile(r'\$\d+\.\d+')
text = '''
1001: $496.80
1002: $1290.69
1003: $26.43
1004: $613.42
1005: $7.61
1006: $414.90
1007: $25.00
'''
re.findall(pattern,text)

['$496.80', '$1290.69', '$26.43', '$613.42', '$7.61', '$414.90', '$25.00']
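
If the amounts are needed as numbers rather than strings, parentheses can capture just the parts of interest. The sketch below is one possible approach; the names `amounts` and `total` are illustrative, and `Decimal` is used to avoid floating-point rounding in the sum.

```python
import re
from decimal import Decimal

text = '''
1001: $496.80
1002: $1290.69
1003: $26.43
1004: $613.42
1005: $7.61
1006: $414.90
1007: $25.00
'''

# With two capture groups, findall returns (order_id, amount) tuples.
pairs = re.findall(r'(\d+): \$(\d+\.\d+)', text)
amounts = {order_id: Decimal(value) for order_id, value in pairs}
total = sum(amounts.values())

print(amounts['1005'])  # 7.61
print(total)            # 2874.85
```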

Consider the following HTML code:

```html
This offer is not available to customers
living in <b>AK</b> and <b>HI</b>.
```
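Extract the `<b></b>` elements. The first two patterns below use the greedy quantifiers `*` and `+`, which match as much as possible and therefore span from the first `<b>` to the last `</b>`. The third uses the lazy quantifier `*?`, which stops at the first `</b>` it finds.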

In [27]:
text = '''This offer is not available to customers
living in <b>AK</b> and <b>HI</b>.'''
pattern = re.compile(r'<b>.*</b>')
re.findall(pattern,text)

['<b>AK</b> and <b>HI</b>']

In [28]:
pattern = re.compile(r'<b>.+</b>')
re.findall(pattern,text)

['<b>AK</b> and <b>HI</b>']

In [31]:
pattern = re.compile(r'<b>.*?</b>')
re.findall(pattern,text)

['<b>AK</b>', '<b>HI</b>']
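
The three cells above contrast greedy and lazy matching. As a further sketch, wrapping the lazy part in a capture group returns only the text between the tags (the variable names are illustrative):

```python
import re

text = '''This offer is not available to customers
living in <b>AK</b> and <b>HI</b>.'''

greedy = re.findall(r'<b>.*</b>', text)     # runs to the last </b> on the line
lazy = re.findall(r'<b>.*?</b>', text)      # stops at the first </b>
inner = re.findall(r'<b>(.*?)</b>', text)   # findall returns only the captured group

print(greedy)  # ['<b>AK</b> and <b>HI</b>']
print(lazy)    # ['<b>AK</b>', '<b>HI</b>']
print(inner)   # ['AK', 'HI']
```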

### Positional Matching

You’ve now learned how to match all sorts of characters in all sorts of combinations and repetitions, anywhere within the text. However, it is sometimes necessary to match only at specific locations within a block of text. This requires position matching, which is explained below.

Position matching is used to specify where within a string of text a match should occur. To understand the need for position matching, consider the following example:

Text:

`The cat scattered his food all over the room.`

Regex 

`cat`

In [32]:
text = 'The cat scattered his food all over the room.'
pattern = re.compile(r'cat')
re.findall(pattern,text)

['cat', 'cat']

The pattern `cat` matches all occurrences of cat, even cat within the word `scattered`. This may, in fact, be the desired outcome, but more than likely it is not. If you were performing the search to replace all occurrences of `cat` with `dog`, you would end up with the following nonsense:

`The dog sdogtered his food all over the room.`

**Using Word Boundaries**

The first boundary (and one of the most commonly used) is the word boundary specified as `\b`. As its name suggests, `\b` is used to match the start or end of a word.

To demonstrate the use of `\b`, here is the previous example again, this time with the boundaries specified:

Text:
`The cat scattered his food all over the room.`

Regex:
`\bcat\b`

In [33]:
pattern = re.compile(r'\bcat\b')
re.findall(pattern, text)

['cat']

The word `cat` has a space before and after it, so it matches `\bcat\b`: `\b` matches the position between a word character (a letter, digit, or underscore) and a non-word character such as a space. The `cat` in `scattered`, however, does not match, because the character before it is `s` and the character after it is `t`, both word characters, so there is no word boundary on either side.
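
The replacement scenario described earlier can be sketched with `re.sub`, first without and then with word boundaries. The variable names are illustrative.

```python
import re

text = 'The cat scattered his food all over the room.'

# Without boundaries, the 'cat' inside 'scattered' is replaced as well.
naive = re.sub(r'cat', 'dog', text)

# With \b on both sides, only the standalone word is replaced.
bounded = re.sub(r'\bcat\b', 'dog', text)

print(naive)    # The dog sdogtered his food all over the room.
print(bounded)  # The dog scattered his food all over the room.
```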