# What Is a Regex?

A regular expression (also regex or regexp, sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. String-searching algorithms usually use this pattern for "find" or "find and replace" operations on strings, or for input validation.

### Regex in Python

Python provides a built-in module for regular expressions called `re`.

We are going to use the `findall` and `sub` functions.

* sub
<p>Used to replace a word (or any pattern) in a string with another word.</p>

In [3]:
# Example: replace "Ahmed" in a string with "Ali"
import re
text = "my name is Ahmed"
new_text = re.sub("Ahmed", "Ali", text)
print(new_text)

my name is Ali


* findall
<p>Used to find all occurrences of a specific word (or pattern) in a string.</p>

In [4]:
text = "my name is Ahmed"
results = re.findall("Ahmed", text)
print(results)

['Ahmed']
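
Both functions act on every match, not only the first one. The following is a minimal sketch of that behavior; the sample sentence is made up for illustration, and `count` is the optional argument of `re.sub` that limits the number of replacements.

```python
import re

# Hypothetical sample text in which the same name appears twice.
text = "Ahmed met Ahmed's brother"

# findall returns a list with one entry per non-overlapping match.
all_matches = re.findall("Ahmed", text)
print(all_matches)

# sub replaces every match by default...
replaced_all = re.sub("Ahmed", "Ali", text)
print(replaced_all)

# ...unless the optional count argument limits the number of replacements.
replaced_once = re.sub("Ahmed", "Ali", text, count=1)
print(replaced_once)
```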


### Special Characters (Regular Expressions)

Some characters have special meanings rather than standing for themselves. For example, `\n` indicates a newline and `\t` a tab.

* Basic patterns that match single chars

| Character | Function |
| ------------- | ------------- |
| a-z, 0-9 | Ordinary characters match themselves exactly |
| . (dot) | Matches any single character except newline '\n' |
| \w | Matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_] |
| \W | Matches any non-word character |
| \b | Matches the boundary between a word and a non-word character |
| \s | Matches a single whitespace character: space, newline, return, tab |
| \S | Matches any non-whitespace character |
| \t, \n, \r | Match tab, newline, return |
| \d | Matches a decimal digit [0-9] |
| ^ | Matches the start of the string |
| $ | Matches the end of the string |

Let's mix them with normal characters.

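> Note that we put `r` before the pattern string to make it a raw string, so Python does not process its backslash escapes. For example, `\n` is passed to `re` as a backslash followed by `n` rather than being replaced by a newline character.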
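
The effect of the `r` prefix is easiest to see with `\b`, which a normal Python string turns into a backspace character. This is a small sketch, with a made-up sample sentence, comparing the two spellings:

```python
import re

# In a normal string literal, Python turns "\b" into a backspace character
# before the regex engine ever sees it.
normal = "\bcat\b"
# In a raw string, the backslashes are kept, so re reads \b as a word boundary.
raw = r"\bcat\b"

print(len(normal))  # 5 characters: backspace, c, a, t, backspace
print(len(raw))     # 7 characters: both backslashes are kept

text = "the cat sat"
normal_matches = re.findall(normal, text)  # looks for literal backspaces
raw_matches = re.findall(raw, text)        # looks for the whole word "cat"
print(normal_matches)
print(raw_matches)
```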


In [5]:
import re
text = """regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using regular expression pattern. regular expressions are widely used in UNIX world."""
results = re.findall(r"^regular", text)
print(results)

['regular']


In [6]:
import re
text = """regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using regular expression pattern. regular expressions are widely used in UNIX world."""
results = re.sub(r"^regular", "Regular", text)
print(results)

Regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using regular expression pattern. regular expressions are widely used in UNIX world.
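
A few more of the special characters from the table can be tried in the same way. This is only an illustrative sketch; the sample string is invented and contains a tab so that `\s` has something other than spaces to match.

```python
import re

# Hypothetical sample text: two spaces and one tab separate the words.
text = "room 42\tis open"

digits = re.findall(r"\d", text)         # every single digit
spaces = re.findall(r"\s", text)         # every whitespace character
o_pairs = re.findall(r"o.", text)        # "o" followed by any one character
word_start = re.findall(r"\bo", text)    # "o" at the start of a word
at_end = re.findall(r"open$", text)      # "open" only at the end of the string

print(digits)
print(spaces)
print(o_pairs)
print(word_start)
print(at_end)
```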


### Complex Regex
You can mix and match multiple expressions and use more than one instance of each.

* Grouping characters

Let's take a look at some examples:

| Example | Description |
| --- | --- |
| [Pp]ython | Match "Python" or "python" |
| rub[ye] | Match "ruby" or "rube" |
| [aeiou] | Match any one lowercase vowel |
| [0-9] | Match any digit; same as [0123456789] |
| [a-z] | Match any lowercase ASCII letter |
| [A-Z] | Match any uppercase ASCII letter |
| [a-zA-Z0-9] | Match any ASCII letter or digit |
| [^aeiou] | Match anything other than a lowercase vowel |
| [^0-9] | Match anything other than a digit |
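
Some of the bracket expressions above can be sketched in code as follows. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text mixing upper case, lower case, digits and a hyphen.
text = "Python 3 and python 2 share Ruby-like ideas"

names = re.findall(r"[Pp]ython", text)   # either capitalization
digits = re.findall(r"[0-9]", text)      # any single digit
capitals = re.findall(r"[A-Z]", text)    # any uppercase ASCII letter

# A leading ^ inside the brackets negates the set: replace everything
# that is NOT an ASCII letter or digit with an underscore.
cleaned = re.sub(r"[^a-zA-Z0-9]", "_", text)

print(names)
print(digits)
print(capitals)
print(cleaned)
```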

We can use the **OR** operator `|` to combine multiple patterns.


In [7]:
# Example
import re
texts = [
  "python is a great language",
  "i love to write in py",
  "what a cool language Python is",
  "the pyramids of giza are so huge!"
]
for text in texts:
  python_detected = re.findall(r"[Pp]ython|[Pp]y\b", text)
  if python_detected:
    print("talking about python")
  else:
    print("something else")

talking about python
talking about python
talking about python
something else
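
One caveat about the pattern above: `[Pp]y\b` only requires a word boundary after "py", so it also matches the end of a word such as "happy". A possible tightening, shown here as a sketch rather than the only fix, is to require a boundary on both sides. The `(?:...)` syntax is a non-capturing group, which this tutorial has not introduced; it is used here only to apply `\b` to both alternatives. The sample sentence is invented.

```python
import re

# The pattern from the example above, and a stricter variant (an assumption,
# not part of the original example) with a word boundary on both sides.
pattern_loose = r"[Pp]ython|[Pp]y\b"
pattern_strict = r"\b(?:[Pp]ython|[Pp]y)\b"

text = "i am happy today"
loose_matches = re.findall(pattern_loose, text)    # picks up the "py" in "happy"
strict_matches = re.findall(pattern_strict, text)  # finds nothing

print(loose_matches)
print(strict_matches)

# The stricter pattern still detects a standalone "py".
still_found = re.findall(pattern_strict, "i love to write in py")
print(still_found)
```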


* Repetition Cases

| Example | Description |
| --- | --- |
| ruby? | Match "rub" or "ruby": the y is optional |
| ruby* | Match "rub" plus 0 or more y(s) |
| ruby+ | Match "rub" plus 1 or more y(s) |
| \d{3} | Match exactly 3 digits |
| \d{3,} | Match 3 or more digits |
| \d{3,5} | Match 3, 4, or 5 digits |
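
The rows of this table can be checked directly. This is a minimal sketch using made-up test strings:

```python
import re

# Hypothetical sample text: "rub" followed by zero, one and three y's.
text = "rub ruby rubyyy"

optional_y = re.findall(r"ruby?", text)  # at most one y is consumed
any_y = re.findall(r"ruby*", text)       # zero or more y's
some_y = re.findall(r"ruby+", text)      # at least one y is required

# {3,5} takes between 3 and 5 digits, as many as it can get.
digit_runs = re.findall(r"\d{3,5}", "12 123 1234567")

print(optional_y)
print(any_y)
print(some_y)
print(digit_runs)
```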


In [8]:
# Example
import re
text = "it's 2018, happy new year!"
years = re.findall(r"\d{4}", text)
print(years)

['2018']


In [9]:
# One more example
import re
text = "this is a text tweet that contains multiple numbers 011121314, 012121212"
new_text = re.sub(r"[0-9]+", "NUM", text)
print(new_text)

this is a text tweet that contains multiple numbers NUM, NUM


* Search groups
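<p>You can create search groups by wrapping parts of the pattern in parentheses, then retrieve each one with the <code>group()</code> method of the match object returned by <code>search()</code>.</p>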

In [11]:
import re
email_address = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
if match:
  print(match.group()) # The whole matched text
  print(match.group(1)) # The username (group 1)
  print(match.group(2)) # The host (group 2)

support@datacamp.com
support
datacamp.com
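
Groups also change what `findall` returns: when the pattern contains more than one group, each match becomes a tuple with one entry per group. This sketch reuses the pattern above; the second address is a made-up placeholder.

```python
import re

# The second address is hypothetical and only there to produce two matches.
text = "Write to support@datacamp.com or info@example.org"
pattern = r"([\w\.-]+)@([\w\.-]+)"

# With two groups in the pattern, findall yields (username, host) tuples.
pairs = re.findall(pattern, text)
print(pairs)

# groups() on a match object returns all groups of that match at once.
first = re.search(pattern, text)
first_groups = first.groups()
print(first_groups)
```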


* `match` vs `search`

The `match()` function checks for a match only at the beginning of the string whereas the `search()` function checks for a match anywhere in the string.
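
A short sketch of the difference, reusing the sentence from the earlier examples:

```python
import re

text = "my name is Ahmed"

# match() only looks at the start of the string, so "Ahmed" is not found.
at_start = re.match(r"Ahmed", text)
print(at_start)

# search() scans the whole string and finds it further along.
anywhere = re.search(r"Ahmed", text)
print(anywhere.group(), anywhere.start())

# match() succeeds when the pattern does occur at the very beginning.
prefix = re.match(r"my", text)
print(prefix.group())
```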

* Greedy vs Non-Greedy Matching

When a quantifier such as `*` or `+` matches as much of the string as possible, it is said to be a "greedy match". This is the default behavior of a regular expression, but it is not always what you want.

Example

```python
import re
heading = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()
```

```raw
'<h1>TITLE</h1>'
```

The pattern `<.*>` matched the whole string, right up to the last `>` (here, the second one).

However, if you only want to match the first `<h1>` tag, you can use the non-greedy quantifier `*?`, which matches as little text as possible.

Adding `?` after a quantifier makes it match in a non-greedy, or minimal, fashion; that is, as few characters as possible are matched. When you run `<.*?>`, you only get a match with `<h1>`.

Example

```python
import re
heading = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()
```

```raw
'<h1>'
```
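
The same contrast shows up with `findall`. As a sketch, the block below also includes a negated character class, `[^>]*`, which is another common way to stop at the first `>` without relying on a non-greedy quantifier; it is offered here as an alternative, not as something the example above requires.

```python
import re

heading = "<h1>TITLE</h1>"

# Greedy: one match that runs to the last ">".
greedy = re.findall(r"<.*>", heading)

# Non-greedy: each tag is matched separately.
lazy = re.findall(r"<.*?>", heading)

# Alternative: match anything except ">" between the angle brackets.
negated = re.findall(r"<[^>]*>", heading)

print(greedy)
print(lazy)
print(negated)
```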