| Topic | Details |
|-------|---------|
| **Definition** | A sequence of characters that defines a search pattern |
| **Also Known As** | Regex, regexp, rational expression |
| **Common Uses** | String searching algorithms for "find" or "find and replace" operations, input validation |
| **Origin** | 1950s - formalized by American mathematician Stephen Cole Kleene |
| **Popular Usage** | Unix text-processing utilities |
| **Syntax Standards** | POSIX standard, Perl syntax (widely used since 1980s) |
| **Field** | Theoretical computer science and formal language theory |


## Regex in Python

In [1]:
import re
text = "my name is mohamed"

re.sub("mohamed", "Ahmed", text)

'my name is Ahmed'

In [3]:
re.findall("mohamed", text)

['mohamed']

--------------

## Regular Expressions

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. The concept came into common use with Unix text-processing utilities. Since the 1980s, different syntaxes for writing regular expressions have existed, one being the POSIX standard and another, widely used, being the Perl syntax.

## Regex in Python

Python provides a module for regular expressions called `re`.

We are going to use its `findall` and `sub` functions.

### Special Characters (Regular Expressions)

Some characters have a special function instead of standing for themselves, for example `\n`, which indicates a newline, and `\t`, which is a tab.

### Basic patterns that match single chars

| Character | Function |
| ------------- | ------------- |
| a-z, 0-9 | ordinary characters match themselves exactly |
| . (dot) | matches any single character except newline '\n' |
| \w | matches a "word" character: a letter, digit, or underscore ([a-zA-Z0-9_] for ASCII text; Python 3 also accepts Unicode letters and digits) |
| \W | matches any non-word character |
| \b | matches the boundary between a word and a non-word character |
| \s | matches a single whitespace character: space, newline, return, tab, form feed, or vertical tab |
| \S | matches any non-whitespace character |
| \t, \n, \r | tab, newline, return |
| \d | matches a decimal digit [0-9] |
| ^ | matches the start of the string |
| $ | matches the end of the string |
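The single-character patterns in the table can be tried out with `findall`. This is a minimal sketch; the `sample` string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing digits, a tab, and an underscore.
sample = "Order 66 shipped\tto Cairo_EG on day 7."

digits = re.findall(r"\d", sample)             # every single decimal digit
whitespace = re.findall(r"\s", sample)         # six spaces and one tab
non_word = re.findall(r"\W", sample)           # whitespace plus the final "."
starts_with_order = re.findall(r"^Order", sample)
any_char_at_end = re.findall(r"7.$", sample)   # "7", then any one character, then end
whole_word_to = re.findall(r"\bto\b", sample)  # "to" as a separate word only

print(digits)          # ['6', '6', '7']
print(len(whitespace)) # 7
print(whole_word_to)   # ['to']
```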

### Let's mix them with normal characters

> Note that we put `r` before the pattern string to make it a raw string, so Python leaves the backslashes alone, for example it does not turn `\n` into a newline before the regex engine sees it.
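The effect of the `r` prefix is easiest to see with `\b`, which means "backspace" in a normal Python string but "word boundary" in a regex. The snippet below is only an illustration; the example text is made up.

```python
import re

plain = "\n"   # one character: a real newline
raw = r"\n"    # two characters: a backslash followed by the letter n
print(len(plain), len(raw))  # 1 2

text = "py and python"

# Without r, "\b" becomes a backspace character, so the pattern looks for
# backspace + "py" + backspace and finds nothing.
without_raw = re.findall("\bpy\b", text)

# With r, the regex engine receives \b and treats it as a word boundary.
with_raw = re.findall(r"\bpy\b", text)

print(without_raw)  # []
print(with_raw)     # ['py']
```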

In [4]:
text = """regular expression is a special sequence of characters \
that helps you match or find other strings or sets of strings, \
using regular expression pattern. regular expressions are widely used in UNIX world."""

re.findall(r"^regular", text)


['regular']

### Replace the first "regular" with a capitalized one

In [5]:
text = """regular expression is a special sequence of characters \
that helps you match or find other strings or sets of strings, \
using regular expression pattern. regular expressions are widely used in UNIX world."""

re.sub(r"^regular", "Regular", text)

'Regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using regular expression pattern. regular expressions are widely used in UNIX world.'
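The `^` anchor only works here because the text happens to start with "regular". As a hedged alternative, the `count` argument of `re.sub` replaces just the first occurrence wherever it is; the shortened `text` below is made up for the comparison.

```python
import re

# Hypothetical text that does NOT start with "regular".
text = "using regular expression pattern. regular expressions are widely used."

# Anchored pattern: nothing is replaced because the string starts with "using".
anchored = re.sub(r"^regular", "Regular", text)

# count=1 limits the substitution to the first match anywhere in the string.
first_only = re.sub(r"regular", "Regular", text, count=1)

print(anchored == text)  # True
print(first_only)
```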

### Ends with number

Using a regular expression, write a script that checks whether a string ends with a number (that is, whether its last character is a digit).

In [6]:
text = input("enter a string:").strip()

result = re.findall(r"[0-9]$", text)

if result:
    print("string ends with a number")
else:
    print("it is not")

string ends with a number
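The same check can be wrapped in a function so it can be tested without typing input. This is a sketch only: `ends_with_digit` is a hypothetical helper name, and it uses `re.search`, which returns a match object or `None`, instead of `findall`.

```python
import re

def ends_with_digit(text: str) -> bool:
    """Return True if the stripped text ends with a digit 0-9."""
    return re.search(r"[0-9]$", text.strip()) is not None

print(ends_with_digit("room 101"))   # True
print(ends_with_digit("101 rooms"))  # False
print(ends_with_digit(""))           # False
```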


--------

### Only Valid text

Write a script that uses a regular expression to check whether the user input consists only of **`letters`**, **`digits`**, and **`_`**.

In [9]:
user_name = input("enter username: ").strip()

invalid_in = re.findall(r"\W", user_name)

if invalid_in:
    print("invalid username")
else:
    print("username {} is valid".format(user_name))

username mohamed_1234 is valid
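Two edge cases are worth knowing about the check above: an empty input contains no `\W` character, so it passes, and in Python 3 `\w` also accepts non-ASCII letters. A possible stricter variant, assuming usernames should be non-empty and ASCII-only, uses `re.fullmatch` with the `re.ASCII` flag; `is_valid_username` is a hypothetical helper name.

```python
import re

def is_valid_username(name: str) -> bool:
    """Accept only non-empty names made of ASCII letters, digits, and "_"."""
    return re.fullmatch(r"\w+", name, flags=re.ASCII) is not None

print(is_valid_username("mohamed_1234"))    # True
print(is_valid_username("mohamed zahran"))  # False
print(is_valid_username(""))                # False

# The findall-based check finds no invalid characters in an Arabic name,
# while the ASCII-only variant rejects it.
print(re.findall(r"\W", "محمد"))            # []
print(is_valid_username("محمد"))            # False
```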


---------

### Building a bigger regular expression

You can combine multiple expressions and use more than one instance of them.

| Example | Description |
| --- | --- |
| [Pp]ython | Match "Python" or "python" |
| rub[ye] | Match "ruby" or "rube" |
| [aeiou] | Match any one lowercase vowel |
| [0-9] | Match any digit; same as [0123456789] |
| [a-z] | Match any lowercase ASCII letter |
| [A-Z] | Match any uppercase ASCII letter |
| [a-zA-Z0-9] | Match any ASCII letter or digit |
| [^aeiou] | Match anything other than a lowercase vowel |
| [^0-9] | Match anything other than a digit |
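A few of the character classes from the table, run against a made-up sentence. The variable names are only for illustration.

```python
import re

sentence = "Python and python beat Ruby 3 to 1"

either_case = re.findall(r"[Pp]ython", sentence)  # "Python" or "python"
uppercase = re.findall(r"[A-Z]", sentence)        # single uppercase letters
digits = re.findall(r"[0-9]", sentence)           # single digits
vowels = re.findall(r"[aeiou]", "regex")          # lowercase vowels only
not_digit_or_space = re.findall(r"[^0-9 ]", "a1 b2")  # ^ inside [] negates

print(either_case)         # ['Python', 'python']
print(uppercase)           # ['P', 'R']
print(digits)              # ['3', '1']
print(vowels)              # ['e', 'e']
print(not_digit_or_space)  # ['a', 'b']
```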


-> We can use the **OR** operator `|` to combine multiple regex patterns: the match succeeds if any one of the alternatives matches.


In [12]:
texts = [
  "python is a great language",
  "i love to write in py",
  "what a cool language Python is",
  "the pyramids of giza are so huge!",
  "we are so happy to be here",
  "i can code with python"
]

for text in texts:
    python_detected = re.findall(r"[Pp]ython|\b[Pp]y\b", text)
    if python_detected:
        print("talking about python")
    else:
        print("didn't talk about python")

talking about python
talking about python
talking about python
didn't talk about python
didn't talk about python
talking about python
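In the pattern above, `\b` had to be written around `[Pp]y` explicitly because `|` splits the whole pattern into alternatives. A small sketch of that behavior follows; the `(?:...)` non-capturing group is not covered in this notebook and is used here only to apply `\b` to both alternatives at once.

```python
import re

pets = "I have a cat, a dog and a catfish"

# Plain alternation: "cat" is also found inside "catfish".
anywhere = re.findall(r"cat|dog", pets)

# Grouping the alternatives lets the word boundaries apply to both of them.
whole_words = re.findall(r"\b(?:cat|dog)\b", pets)

print(anywhere)     # ['cat', 'dog', 'cat']
print(whole_words)  # ['cat', 'dog']
```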


-------

### Repetition Cases

| Example | Description |
| --- | --- |
| ruby? | Match "rub" or "ruby": the y is optional |
| ruby* | Match "rub" followed by 0 or more y(s) |
| ruby+ | Match "rub" followed by 1 or more y(s) |
| \d{3} | Match exactly 3 digits |
| \d{3,} | Match 3 or more digits |
| \d{3,5} | Match 3, 4, or 5 digits |

In [15]:
text = "it's 2024, happy new year"

re.findall(r"\d{4}", text)

['2024']

In [16]:
re.findall(r"\d+", text)

['2024']
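The remaining rows of the repetition table can be checked the same way. This is a sketch with made-up input strings; `\b` is added so that each pattern has to cover a whole word or number.

```python
import re

words = "rub ruby rubyyy"
optional_y = re.findall(r"\bruby?\b", words)    # y zero or one time
zero_or_more = re.findall(r"\bruby*\b", words)  # y any number of times
one_or_more = re.findall(r"\bruby+\b", words)   # y at least once

numbers = "7 42 123 12345 1234567"
three_to_five = re.findall(r"\b\d{3,5}\b", numbers)  # whole numbers of 3-5 digits
three_or_more = re.findall(r"\d{3,}", numbers)       # runs of 3 or more digits

print(optional_y)     # ['rub', 'ruby']
print(zero_or_more)   # ['rub', 'ruby', 'rubyyy']
print(one_or_more)    # ['ruby', 'rubyyy']
print(three_to_five)  # ['123', '12345']
print(three_or_more)  # ['123', '12345', '1234567']
```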

-----------

### Let's replace numbers with placeholders

In [23]:
text = "this text contain numbers \
        the numbers are 01234567810, 01235446"

new_text = re.sub(r"\d{11}", "phone", text)
new_text = re.sub(r"\d+", "num", new_text)
print(new_text)

this text contain numbers         the numbers are phone, num
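The order of the two substitutions matters: the specific 11-digit pattern has to run before the general `\d+`, otherwise nothing is left for it to match. A short illustration with a made-up string:

```python
import re

text = "call 01234567810 or ext 0123"

# Specific pattern first, general pattern second (same order as the cell above).
specific_first = re.sub(r"\d+", "num", re.sub(r"\d{11}", "phone", text))

# General pattern first: every number is already "num" when \d{11} runs.
generic_first = re.sub(r"\d{11}", "phone", re.sub(r"\d+", "num", text))

print(specific_first)  # call phone or ext num
print(generic_first)   # call num or ext num
```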


-------------------

### Search groups

You can create groups in a regex and retrieve each one from the match object returned by the `search()` function. Groups are defined using parentheses `()` in your regex pattern. Each group can be accessed individually using the match object's `group()` method.

- `group(0)` or `group()` returns the entire match
- `group(1)` returns the first group
- `group(2)` returns the second group, and so on

For example, if you have a pattern like `r'(\w+)@(\w+)\.com'` to match an email, you can extract the username and domain separately by accessing `group(1)` and `group(2)`.
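A minimal run of that exact pattern, using a made-up address. Note that `.com` sits outside the second pair of parentheses, so it is part of the whole match but not of `group(2)`.

```python
import re

# Hypothetical input; the address is invented for this sketch.
match = re.search(r"(\w+)@(\w+)\.com", "contact: sara_k@example.com")

if match:
    print(match.group(0))  # sara_k@example.com
    print(match.group(1))  # sara_k
    print(match.group(2))  # example
    print(match.groups())  # ('sara_k', 'example')
```

Because `\w` does not match `.` or `-`, this simple pattern would cut off a username such as `first.last`.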

In [26]:
email = "my mail : mohamedzahran3008@gmail.com"
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email)

if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))

mohamedzahran3008@gmail.com
mohamedzahran3008
gmail.com


-------

### `match` vs `search`

The `match()` function checks for a match only at the beginning of the string, whereas the `search()` function checks for a match anywhere in the string.

**`match()`:**
- Only looks at the start of the string
- Returns a match object if the pattern is found at the beginning
- Returns `None` if the pattern is not at the start, even if it exists elsewhere in the string
- Useful when you want to validate that a string starts with a specific pattern

**`search()`:**
- Scans through the entire string looking for the first location where the pattern matches
- Returns a match object for the first occurrence found anywhere in the string
- Returns `None` only if the pattern is not found anywhere in the string
- More flexible and commonly used when you need to find a pattern regardless of its position

**Example:**
```python
import re

text = "Hello Python"
pattern = r"Python"

match_result = re.match(pattern, text)  # Returns None (Python not at start)
search_result = re.search(pattern, text)  # Returns match object (Python found)
```
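To see where each function found its match, the match object's `span()` method can be printed. This is a small sketch on the same text; the variable names are only for illustration.

```python
import re

text = "Hello Python"

at_start = re.match(r"Hello", text)       # pattern is at the beginning
not_at_start = re.match(r"Python", text)  # pattern exists, but not at index 0
anywhere = re.search(r"Python", text)     # scans the whole string

print(at_start.span())  # (0, 5)
print(not_at_start)     # None
print(anywhere.span())  # (6, 12)
```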

-----

### `re.compile()`

We can compile a regular expression pattern once with `re.compile()` and reuse the resulting pattern object instead of passing the pattern string to every call. This mainly improves readability and can also save a little time.

**Benefits of using `re.compile()`:**
- **Performance**: When the same pattern is used many times, the compiled object skips the pattern lookup on every call. The `re` module already caches recently used patterns, so the gain is usually small
- **Readability**: You can store the compiled pattern in a variable with a descriptive name, making your code clearer
- **Reusability**: The compiled pattern object offers the same methods (`findall`, `search`, `sub`, and so on) and can be reused throughout your code


In [28]:
mail_re = re.compile(r"[\w\.-]+@[\w\.-]+\.[\w\.-]+")

mails = [
    "this is a message with an email of:example.name@company.org",
    "my-email55@yahoo.com is the email you would like to use",
    "send me an email at:shortmail@long-company.net"    
]

for mail in mails:
    print(mail_re.findall(mail))

['example.name@company.org']
['my-email55@yahoo.com']
['shortmail@long-company.net']
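A compiled pattern is not limited to `findall`; it has the same methods as the module-level functions. The sketch below reuses the pattern from the cell above on a made-up line of text.

```python
import re

mail_re = re.compile(r"[\w\.-]+@[\w\.-]+\.[\w\.-]+")

# Hypothetical input line containing two addresses.
line = "write to example.name@company.org or my-email55@yahoo.com"

all_mails = mail_re.findall(line)          # every address in the line
first_mail = mail_re.search(line).group()  # only the first one
masked = mail_re.sub("<email>", line)      # replace each address

print(all_mails)   # ['example.name@company.org', 'my-email55@yahoo.com']
print(first_mail)  # example.name@company.org
print(masked)      # write to <email> or <email>
```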


----------------

### Extract Hashtags

Write a script to extract the hashtags from a tweet.

In [30]:
tweet = "a tweet with no #hashtag, but a #HASHTAG and another #cool one #هاشتاج"

hashtags = re.findall(r"#\S+", tweet)

if hashtags:
    print(hashtags)
else:
    print("no hashtags found")

['#hashtag,', '#HASHTAG', '#cool', '#هاشتاج']
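Because `\S` matches any non-whitespace character, the first result keeps the trailing comma. One possible variant, assuming a hashtag consists only of word characters, uses `\w+` instead; in Python 3 `\w` also matches Arabic letters, so the last hashtag is still found.

```python
import re

tweet = "a tweet with no #hashtag, but a #HASHTAG and another #cool one #هاشتاج"

# \w+ stops at punctuation such as "," so it is not included in the result.
hashtags = re.findall(r"#\w+", tweet)

print(hashtags)
```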


---------

> ## `Great Job`

--------------