# Regular Expressions

In [2]:
import re

re.search(pattern, string, flags=0): Scans the string for the first location where the pattern matches and returns a match object, or None if there is no match.

re.match(pattern, string, flags=0): Matches the pattern only at the beginning of the string and returns a match object, or None if there is no match.

re.fullmatch(pattern, string, flags=0): Matches only if the entire string matches the pattern and returns a match object, or None otherwise.

re.findall(pattern, string, flags=0): Returns all non-overlapping matches of the pattern in the string as a list of strings. If the pattern contains one capturing group, the list holds that group's text instead; with several groups, it holds tuples.

re.finditer(pattern, string, flags=0): Returns an iterator yielding match objects for all non-overlapping occurrences of the pattern in the string.

re.split(pattern, string, maxsplit=0, flags=0): Splits the string by the occurrences of the pattern and returns a list of substrings.

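re.sub(pattern, repl, string, count=0, flags=0): Replaces occurrences of the pattern in the string with repl, which may be a replacement string or a function that receives each match object. The default count=0 replaces every occurrence.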

re.subn(pattern, repl, string, count=0, flags=0): Performs the same function as re.sub() but returns a tuple containing the modified string and the number of substitutions made.

re.compile(pattern, flags=0): Compiles a regular expression pattern into a pattern object, which can be used for matching or searching operations.

re.escape(string): Escapes special characters in the string, making it suitable for use as a literal in a regular expression pattern.
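The differences between these functions are easier to see side by side. The following is a minimal sketch; the sample string and all variable names are made up for illustration.

```python
import re

text = "cat bat cat"

# search scans the whole string; match only tries position 0.
found_anywhere = re.search("bat", text)            # match object
found_at_start = re.match("bat", text)             # None: "bat" is not at the start
whole_string = re.fullmatch("cat bat cat", text)   # match: the pattern covers everything

# finditer yields match objects, so positions are available as well as text.
cat_spans = [m.span() for m in re.finditer("cat", text)]

# subn returns a tuple: (new_string, number_of_replacements).
replaced, n_subs = re.subn("cat", "dog", text)

# A compiled pattern exposes the same operations as methods.
word = re.compile(r"[a-z]+")
words = word.findall(text)

# escape turns text containing metacharacters into a literal pattern.
literal_dot = re.escape("3.14")
price_hits = re.findall(literal_dot, "3.14 and 3x14")

print(found_anywhere, found_at_start, cat_spans, replaced, n_subs, words, price_hits)
```

Without re.escape(), the dot in "3.14" would match any character, so "3x14" would be found too.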

In [3]:
quote = "I saved a life. My own. Am I a hero? I really can't say, but yes. - Michael Scott" 

### re.search() only finds the first match

In [4]:
re.search("hero", quote)

<re.Match object; span=(31, 35), match='hero'>

### re.findall() finds all the matches

In [5]:
re.findall("I", quote)

['I', 'I', 'I']

### if we want to split on the period, we need to escape it with a backslash

In [6]:
re.split(r"\.", quote)

['I saved a life',
 ' My own',
 " Am I a hero? I really can't say, but yes",
 ' - Michael Scott']
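Splitting on the period alone leaves leading spaces and ignores the question mark. As a possible refinement (a sketch, not the only way to do it), a character class can list several delimiters and `\s*` can absorb the whitespace that follows them. The name `sentences` is chosen here for illustration.

```python
import re

quote = "I saved a life. My own. Am I a hero? I really can't say, but yes. - Michael Scott"

# Split on "." or "?" and swallow any whitespace right after the delimiter.
# Inside [...] the dot is literal, so it needs no backslash.
sentences = re.split(r"[.?]\s*", quote)
print(sentences)
# ['I saved a life', 'My own', 'Am I a hero', "I really can't say, but yes", '- Michael Scott']
```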

In [7]:
re.split(r"\.", quote, maxsplit=1)

['I saved a life',
 " My own. Am I a hero? I really can't say, but yes. - Michael Scott"]

### re.sub()

In [8]:
re.sub('I', 'You', quote, count = 1)

"You saved a life. My own. Am I a hero? I really can't say, but yes. - Michael Scott"

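Two variations on the substitution above may be useful: replacing every standalone "I" rather than just the first, and reusing captured text in the replacement. This is a small sketch; the variable names are illustrative only.

```python
import re

quote = "I saved a life. My own. Am I a hero? I really can't say, but yes. - Michael Scott"

# count defaults to 0, which replaces every match.
# \b keeps the pattern from matching an "I" inside a longer word.
all_replaced, n_replaced = re.subn(r"\bI\b", "You", quote)

# Captured groups can be reused in the replacement as \1, \2, ...
# Here the last two words of the string are swapped.
swapped = re.sub(r"(\w+) (\w+)$", r"\2, \1", quote)

print(all_replaced)
print(n_replaced)
print(swapped)
```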
## Regex Metacharacters and Quantifiers

[ ]: Defines a character class (matches any single character within the brackets).

\ : Escapes a metacharacter so that it is matched literally, and introduces special sequences such as \d.

"." (dot): Matches any character except a newline.

^ : Anchors the regex at the start of the string.

$ : Anchors the regex at the end of the string.

* : Matches 0 or more occurrences of the preceding element (a character, character class, or group).

+ : Matches 1 or more occurrences of the preceding element.

? : Matches 0 or 1 occurrence of the preceding element.

{ } : Matches a specified number of occurrences of the preceding element. {m} means exactly m; {m,n} means between m and n.

| : Acts like a logical OR. A|B matches either A or B.
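The quantifiers and alternation above can be compared on small inputs. The strings and variable names below are invented for this sketch.

```python
import re

# * allows zero repetitions, + needs at least one, ? allows at most one.
star = re.findall(r"ab*", "a ab abb")               # ['a', 'ab', 'abb']
plus = re.findall(r"ab+", "a ab abb")               # ['ab', 'abb']
optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']

# {m} repeats exactly m times; {m,n} repeats between m and n times.
exact = re.findall(r"\b\d{3}\b", "12 123 1234")     # ['123']
ranged = re.findall(r"\b\d{2,3}\b", "12 123 1234")  # ['12', '123']

# | chooses between alternatives.
pets = re.findall(r"cat|dog", "cat, dog, bird")     # ['cat', 'dog']

# A quantifier after a group applies to the whole group, not one character.
repeated = re.findall(r"(?:ha){2,}", "ha haha hahaha")  # ['haha', 'hahaha']

print(star, plus, optional, exact, ranged, pets, repeated)
```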

## Regex Character Classes and Special Sequences

[abc]: Matches any single character that is either a, b, or c.

[^abc]: Matches any single character that is not a, b, or c. The ^ inside the square brackets negates the character class.

[a-z]: Matches any single lowercase letter from a to z.

[A-Z]: Matches any single uppercase letter from A to Z.

[0-9]: Matches any single digit from 0 to 9.

[a-zA-Z]: Matches any single letter, either lowercase or uppercase.

\d: Matches any single digit. With re.ASCII it is equivalent to [0-9]; by default, for str patterns, it also matches other Unicode decimal digits.

\D: Matches any single non-digit character (the complement of \d).

\w: Matches any single word character (letter, digit, or underscore). With re.ASCII it is equivalent to [a-zA-Z0-9_]; by default it also matches Unicode letters and digits.

\W: Matches any single non-word character (the complement of \w).

\s: Matches any whitespace character (space, tab, newline, etc.).

\S: Matches any non-whitespace character.

\b: Matches at a word boundary, the zero-width position between a word character (\w) and a non-word character (\W) or the start or end of the string. Inside square brackets, [\b] means the backspace character instead.

\B: Matches at a position that is not a word boundary.

\Z: Matches only at the end of the string.
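A few of these sequences are easiest to understand by their match positions. The sketch below uses made-up sample strings; `\b`, `\B`, and `\Z` are written outside square brackets because they are zero-width assertions rather than character classes.

```python
import re

text = "cat concat cats"

# \b requires a word boundary on both sides, so only the standalone word matches.
whole_word_spans = [m.span() for m in re.finditer(r"\bcat\b", text)]   # [(0, 3)]

# \B requires the absence of a boundary, so this finds "cat" ending a longer word.
inner_spans = [m.span() for m in re.finditer(r"\Bcat\b", text)]        # [(7, 10)]

# A negated class: every character that is neither a vowel nor whitespace.
consonants = re.findall(r"[^aeiou\s]", "hello world")

# \s+ treats any run of spaces, tabs, or newlines as one separator.
fields = re.split(r"\s+", "a  b\tc\nd")

# \w is Unicode-aware by default for str patterns; re.ASCII narrows it to [a-zA-Z0-9_].
accented = "na\u00efve caf\u00e9"   # two words containing accented letters
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, flags=re.ASCII)

# \Z anchors at the very end of the string.
trailing_number = re.findall(r"\d+\Z", "12 34")   # ['34']

print(whole_word_spans, inner_spans, consonants, fields, unicode_words, ascii_words, trailing_number)
```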

## Match Object Methods

group(): Returns the matched string.

start(): Returns the starting position of the match.

end(): Returns the ending position of the match.

span(): Returns a tuple containing the starting and ending positions of the match.

groups(): Returns a tuple containing all the captured groups in the match.

group(index): Returns the captured group at the specified index.

groupdict(): Returns a dictionary containing named captured groups, where the keys are the group names.

expand(template): Returns the string obtained by replacing backreferences in the template (such as \1 or \g<name>) with the corresponding captured groups, the same way re.sub() does.
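The methods above can be tried on a single match. This sketch assumes a phone-number pattern with named groups; the group names `area`, `prefix`, and `line` are hypothetical and chosen only for this example.

```python
import re

m = re.search(
    r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})",
    "Call 123-456-7899 now",
)

whole = m.group()                 # '123-456-7899', same as m.group(0)
area = m.group(1)                 # '123', groups are numbered from 1
line = m.group("line")            # '7899', groups can also be fetched by name
parts = m.groups()                # ('123', '456', '7899')
named = m.groupdict()             # {'area': '123', 'prefix': '456', 'line': '7899'}
position = (m.start(), m.end())   # (5, 17), equal to m.span()

# expand fills a template with the captured groups.
formatted = m.expand(r"(\g<area>) \g<prefix>-\g<line>")   # '(123) 456-7899'

print(whole, area, line, parts, named, position, formatted)
```

Because re.search() returns None when nothing matches, real code would normally check `if m:` before calling these methods.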

## Modifiers

Modifiers (also known as flags) are used to modify the behavior of the pattern matching. They are added as optional parameters to the regular expression functions or as inline modifiers within the pattern itself. Here is a list of the main modifiers in Python's re module:

re.IGNORECASE or re.I: Ignores case when matching characters. For example, re.search("abc", "AbC", re.I) finds a match, as does the inline form (?i)abc.

re.MULTILINE or re.M: Enables multi-line mode. Changes the behavior of ^ and $ to match the start and end of each line in addition to the start and end of the string.

re.DOTALL or re.S: Makes the . metacharacter match any character, including newline characters (\n).

re.VERBOSE or re.X: Allows the use of whitespace and comments within the regular expression pattern for better readability. Whitespace and comments are ignored unless they are within square brackets or escaped.

re.ASCII or re.A: Makes \w, \W, \b, \B, \d, \D, \s, and \S perform ASCII-only matching instead of full Unicode matching. It is only meaningful for str patterns.

re.UNICODE or re.U: Enables Unicode matching for \w, \W, \b, etc. In Python 3 this is already the default for str patterns, so the flag is redundant and kept for backward compatibility.

re.DEBUG: Prints debugging information about the pattern compilation.
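A short sketch of how the most common flags change matching. The sample text and variable names are invented for illustration; flags can be passed through the `flags` argument, combined with `|`, or written inline as `(?i)` at the start of the pattern.

```python
import re

text = "First line\nsecond LINE\nthird line"

# re.I: case-insensitive matching.
any_case = re.findall(r"line", text, flags=re.I)          # ['line', 'LINE', 'line']

# re.M: ^ and $ also match at the start and end of each line.
first_words = re.findall(r"^\w+", text, flags=re.M)       # ['First', 'second', 'third']
without_m = re.findall(r"^\w+", text)                     # ['First']

# re.S: the dot also matches newlines.
across_lines = re.search(r"First.*third", text, flags=re.S)   # match object
same_line_only = re.search(r"First.*third", text)             # None

# re.X: whitespace and comments inside the pattern are ignored.
phone = re.compile(r"""
    \d{3}   # area code
    -
    \d{3}   # prefix
    -
    \d{4}   # line number
""", re.X)
phone_hit = phone.search("call 123-456-7899").group()    # '123-456-7899'

# An inline flag combined with a flag argument.
line_ends = re.findall(r"(?i)line$", text, flags=re.M)    # ['line', 'LINE', 'line']

print(any_case, first_words, without_m, phone_hit, line_ends)
```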



In [9]:
string = 'I enjoy fajitas in the winter'

In [10]:
re.findall("[a-mA-M]", string)

['I', 'e', 'j', 'f', 'a', 'j', 'i', 'a', 'i', 'h', 'e', 'i', 'e']

In [11]:
string = 'I have 123,456 cats'
re.findall("[0-9]", string)

['1', '2', '3', '4', '5', '6']

In [12]:
re.findall(r"\d", string)

['1', '2', '3', '4', '5', '6']
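Matching one digit at a time splits the number apart. Adding a quantifier keeps runs of digits together, and a character class can allow the thousands separator. This is a sketch with illustrative variable names.

```python
import re

string = 'I have 123,456 cats'

# \d+ matches each run of consecutive digits.
digit_runs = re.findall(r"\d+", string)          # ['123', '456']

# A digit followed by any mix of digits and commas keeps the number whole.
with_commas = re.findall(r"\d[\d,]*", string)    # ['123,456']

# Strip the separators before converting to an integer.
as_int = int(with_commas[0].replace(",", ""))    # 123456

print(digit_runs, with_commas, as_int)
```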

In [13]:
string = 'hellp hero helpomng helpo'

In [14]:
re.findall("he.o", string)

['hero']

In [15]:
re.findall("he..o", string)

['helpo', 'helpo']

In [16]:
string = 'sunflowers, sundrop, santa clause, and mushrooms are lovely in the spring'

re.findall("^sun", string)

['sun']

In [17]:
re.findall("^mush", string)

[]

In [18]:
string = "I love this python course!"

re.findall("!$", string)

['!']

In [19]:
string = 'This Thing called a Thimble is Thick'

re.findall('Thi.*s', string)

['This Thing called a Thimble is']

In [20]:
string2 = 'Helping someone is kind'

re.findall('H.*ing', string2)

# re.findall('Help+', string2)

['Helping']
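Both examples above rely on `.*` being greedy: it consumes as much as possible and then backs off until the rest of the pattern fits. Adding `?` after a quantifier makes it lazy, so it stops at the earliest possible point. A minimal comparison, reusing the sentence from the earlier cell:

```python
import re

string = 'This Thing called a Thimble is Thick'

# Greedy: .* runs to the last "s" that still lets the pattern match.
greedy = re.findall(r"Thi.*s", string)
# ['This Thing called a Thimble is']

# Lazy: .*? stops at the first "s" it can reach, then the search resumes.
lazy = re.findall(r"Thi.*?s", string)
# ['This', 'Thing called a Thimble is']

print(greedy)
print(lazy)
```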

## Regex Use Cases

In [21]:
import re

random_text = """
My name is Mr. Wang. My phone number is 123-456-7899. My email is OscarWang@gmail.com.
My name is Mr. Freberg. My phone number is 542-234-2819. My email is AlexFreberg@yahoo.com.
My name is Mrs.Rosie. My phone number is 285-036-5215. My email is GoldenGirl@apple.com.
"""

In [22]:
re.findall(r"@([a-zA-Z]+)", random_text)

['gmail', 'yahoo', 'apple']

In [23]:
re.findall(r"@([\w\.+]+)", random_text)

['gmail.com.', 'yahoo.com.', 'apple.com.']

In [24]:
re.findall(r"[\w\.+]+@[\w\.+]+", random_text)

['OscarWang@gmail.com.', 'AlexFreberg@yahoo.com.', 'GoldenGirl@apple.com.']
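The matches above include the period that ends each sentence, because `.` is allowed anywhere in the domain part. One possible way to avoid this (a sketch only, and far from a complete email validator) is to require every dot in the domain to be followed by more word characters. The name `email_pattern` is hypothetical.

```python
import re

random_text = """
My name is Mr. Wang. My phone number is 123-456-7899. My email is OscarWang@gmail.com.
My name is Mr. Freberg. My phone number is 542-234-2819. My email is AlexFreberg@yahoo.com.
My name is Mrs.Rosie. My phone number is 285-036-5215. My email is GoldenGirl@apple.com.
"""

# Local part: word characters, dots, plus signs, hyphens.
# Domain: one label, then one or more ".label" parts, so a trailing "." is left out.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

emails = email_pattern.findall(random_text)
print(emails)
# ['OscarWang@gmail.com', 'AlexFreberg@yahoo.com', 'GoldenGirl@apple.com']
```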

In [26]:
re.findall(r"\d{3}-\d{3}-\d{4}", random_text)

['123-456-7899', '542-234-2819', '285-036-5215']

In [27]:
re.findall(r" (\d{3})", random_text)

['123', '542', '285']
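Relying on a leading space works for this text but would also pick up any other three-digit number that follows a space. A slightly stricter sketch ties the captured area code to the rest of the phone-number shape; the variable name `area_codes` is illustrative.

```python
import re

random_text = """
My name is Mr. Wang. My phone number is 123-456-7899. My email is OscarWang@gmail.com.
My name is Mr. Freberg. My phone number is 542-234-2819. My email is AlexFreberg@yahoo.com.
My name is Mrs.Rosie. My phone number is 285-036-5215. My email is GoldenGirl@apple.com.
"""

# Only the parenthesised part is returned, but the whole pattern must match,
# so the three digits have to be followed by -ddd-dddd.
area_codes = re.findall(r"(\d{3})-\d{3}-\d{4}", random_text)
print(area_codes)
# ['123', '542', '285']
```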

In [28]:
email_list = ['OscarWang@gmail.com', 'AlexFreberg@yahoo.com', 'sss@apple.com']

In [29]:
domain = [re.findall(r"@([a-zA-Z]+)", email)[0] for email in email_list]

domain

['gmail', 'yahoo', 'apple']
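Indexing the result of re.findall() with `[0]` raises an IndexError when an entry has no `@`. A more defensive sketch uses re.search() and checks for None. The list below is a modified copy of the one above with an invalid entry added for illustration, and `domain_pattern` and `domains` are names chosen here.

```python
import re

email_list = ['OscarWang@gmail.com', 'not-an-email', 'sss@apple.com']

# Compile once, since the same pattern is applied to every entry.
domain_pattern = re.compile(r"@([a-zA-Z]+)")

domains = []
for email in email_list:
    m = domain_pattern.search(email)
    # search returns None when there is no match, so guard before calling group().
    domains.append(m.group(1) if m else None)

print(domains)
# ['gmail', None, 'apple']
```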