
# Regular Expressions

### What are regular expressions?
A regular expression (regex) is a sequence of characters that forms a search pattern. Regular expressions are used in many programming languages, text editors, and other tools to determine whether a string matches a specified pattern.
Common cases where regular expressions are used include:
- Search in search engines.
- Validation of the format of email addresses, phone numbers or passwords during registration in a web portal.
- Manipulation of textual data in data science projects.

### Regular Expressions in Python
In Python, regular expressions are supported by the **re module**. To use regex in a Python script, we first import the module, as shown below:

In [None]:
import re

This library provides several functions that make it possible to search a string for a match. Some of the most common ones include:
- **re.compile() -**
Compiles a regular expression pattern into a regular expression object, which can be used for matching. It is more efficient when the expression will be used several times in a single program.
- **re.search() -**
Scans through a string looking for the first location where the regular expression pattern produces a match, and returns a corresponding match object. Otherwise, it returns None.
- **re.findall() -**
Returns all matches of a pattern in a string as a list of strings.
- **re.finditer() -**
Returns an iterator containing match objects of all matches of a pattern in a string. 
- **re.sub() -**
Returns the string obtained by replacing occurrences of a pattern in the string with a replacement string. If the pattern isn’t found, the string is returned unchanged.
- **re.split() -**
Splits a string at each occurrence of a specified pattern.

More functions can be found in the [Python documentation](https://docs.python.org/3/library/re.html).


### Below is a simple multi-line string on which we shall use regular expressions.

In [None]:
# Raw string, so the backslash on the metacharacter line is kept literally
text_to_search = r'''This is just a simple line of text
abcdefghijklmnopqurtuvwxyz
abc
abcabcabc
abcabcabcabcabc
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I love cats.
.com
1234567890
La LaLa
. ^ $ * + ? { } [ ] \ | ( )
This marks the end'''

#### re.search()

We can perform simple regex using literals as illustrated below:

In [None]:
match_object = re.search('ABC', text_to_search) #case sensitive
print (match_object)


<re.Match object; span=(92, 95), match='ABC'>


This returns a match object for the first occurrence of "ABC". The search is case sensitive, so the lowercase "abc" earlier in the text is skipped.

In [None]:
print (text_to_search[92:95])

ABC


#### Working with Match Objects

The Match object has properties and methods:

- **span() -** returns a tuple containing the start and end positions of the match.
- **string -** returns the string passed into the function.
- **group() -** returns the part of the string where there was a match

In [None]:
print (f"{match_object}\n")
print (f"The start and end indices: {match_object.span()}\n")
print ("The pattern matched: " + match_object.group() + "\n")
print ("\tThe entire string: \n ___________________________________\n" + match_object.string)


<re.Match object; span=(92, 95), match='ABC'>

The start and end indices: (92, 95)

The pattern matched: ABC

	The entire string: 
 ___________________________________
This is just a simple line of text
abcdefghijklmnopqurtuvwxyz
abc
abcabcabc
abcabcabcabcabc
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I love cats.
.com
1234567890
La LaLa
. ^ $ * + ? { } [ ] \ | ( )
This marks the end


#### re.compile()

In [None]:
pattern = re.compile(r'abc')
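
A compiled pattern object exposes the same operations as the module-level functions, so the pattern is written once and then reused. The sketch below illustrates this on a short hypothetical string, `sample`, which is not part of the notebook's data:

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "abc abcabc ABC"

# Compile once, then call the methods on the pattern object.
abc_pattern = re.compile(r"abc")

first = abc_pattern.search(sample)          # like re.search(r"abc", sample)
all_matches = abc_pattern.findall(sample)   # like re.findall(r"abc", sample)
replaced = abc_pattern.sub("xyz", sample)   # like re.sub(r"abc", "xyz", sample)

print(first.span())   # (0, 3)
print(all_matches)    # ['abc', 'abc', 'abc']
print(replaced)       # xyz xyzxyz ABC
```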

#### re.findall()

In [None]:
letters = pattern.findall(text_to_search)  # returns all occurrences as a list of strings
for match in letters:
    print (match)

abc
abc
abc
abc
abc
abc
abc
abc
abc
abc


#### re.finditer()

In [None]:
letters_object = pattern.finditer(text_to_search)  # yields a match object for every occurrence
for match in letters_object:  # avoid naming the loop variable "object", which shadows a built-in
    print(match)

<re.Match object; span=(35, 38), match='abc'>
<re.Match object; span=(62, 65), match='abc'>
<re.Match object; span=(66, 69), match='abc'>
<re.Match object; span=(69, 72), match='abc'>
<re.Match object; span=(72, 75), match='abc'>
<re.Match object; span=(76, 79), match='abc'>
<re.Match object; span=(79, 82), match='abc'>
<re.Match object; span=(82, 85), match='abc'>
<re.Match object; span=(85, 88), match='abc'>
<re.Match object; span=(88, 91), match='abc'>


#### re.sub()

In [None]:
substitute = re.sub("cats","dogs",text_to_search) #replace the word cats with dogs.
print(substitute)

This is just a simple line of text
abcdefghijklmnopqurtuvwxyz
abc
abcabcabc
abcabcabcabcabc
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I love dogs.
.com
1234567890
La LaLa
. ^ $ * + ? { } [ ] \ | ( )
This marks the end


#### re.split()

In [None]:
split_string = re.split(r"\s", text_to_search)  # split the string at each whitespace character
print(split_string)

['This', 'is', 'just', 'a', 'simple', 'line', 'of', 'text', 'abcdefghijklmnopqurtuvwxyz', 'abc', 'abcabcabc', 'abcabcabcabcabc', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'I', 'love', 'cats.', '.com', '1234567890', 'La', 'LaLa', '.', '^', '$', '*', '+', '?', '{', '}', '[', ']', '\\', '|', '(', ')', 'This', 'marks', 'the', 'end']


### Metacharacters
These are characters that are interpreted in a special way by a RegEx engine.

##### \\ - Escapes special characters or denotes character classes.

In [None]:
periods = re.finditer(r"\.",text_to_search)
for period in periods:
    print(period)


<re.Match object; span=(130, 131), match='.'>
<re.Match object; span=(132, 133), match='.'>
<re.Match object; span=(156, 157), match='.'>


##### . - Matches any character except line terminators like \n (new line).


In [None]:
period_char = re.findall(r".",text_to_search)
print(period_char)


['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'j', 'u', 's', 't', ' ', 'a', ' ', 's', 'i', 'm', 'p', 'l', 'e', ' ', 'l', 'i', 'n', 'e', ' ', 'o', 'f', ' ', 't', 'e', 'x', 't', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'u', 'r', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'I', ' ', 'l', 'o', 'v', 'e', ' ', 'c', 'a', 't', 's', '.', '.', 'c', 'o', 'm', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'L', 'a', ' ', 'L', 'a', 'L', 'a', '.', ' ', '^', ' ', '$', ' ', '*', ' ', '+', ' ', '?', ' ', '{', ' ', '}', ' ', '[', ' ', ']', ' ', '\\', ' ', '|', ' ', '(', ' ', ')', 'T', 'h', 'i', 's', ' ', 'm', 'a', 'r', 'k', 's', ' ', 't', 'h', 'e', ' ', 'e', 'n', 'd']
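
Note that the output above contains no `'\n'` entries, because `.` skips newlines by default. As a small sketch of that behaviour, the example below uses a hypothetical two-line string and the `re.DOTALL` flag, which makes `.` match newlines as well:

```python
import re

# Hypothetical two-line sample, used only for this illustration.
two_lines = "ab\ncd"

default_dot = re.findall(r".", two_lines)                  # newline is skipped
dotall_dot = re.findall(r".", two_lines, flags=re.DOTALL)  # newline is matched too

print(default_dot)  # ['a', 'b', 'c', 'd']
print(dotall_dot)   # ['a', 'b', '\n', 'c', 'd']
```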


##### ^ - Matches at the start of the string, so it checks whether the string starts with a given pattern.

In [None]:
start_string = re.search(r"^This", text_to_search)
print (start_string)

<re.Match object; span=(0, 4), match='This'>


##### \$ - Matches at the end of the string, so it checks whether the string ends with a given pattern.

In [None]:
end_string = re.search("end$", text_to_search)
print (end_string)

<re.Match object; span=(199, 202), match='end'>
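
By default `^` and `$` anchor to the start and end of the whole string, not of each line. The sketch below uses a hypothetical two-line string to show how the `re.MULTILINE` flag changes this so that both anchors also apply at line breaks:

```python
import re

# Hypothetical two-line sample, used only for this illustration.
sample = "This starts here\nThis marks the end"

start_default = re.findall(r"^This", sample)                    # start of string only
start_multi = re.findall(r"^This", sample, flags=re.MULTILINE)  # start of every line

end_default = re.findall(r"here$", sample)                      # "here" is not at the end of the string
end_multi = re.findall(r"here$", sample, flags=re.MULTILINE)    # "here" ends the first line

print(start_default)  # ['This']
print(start_multi)    # ['This', 'This']
print(end_default)    # []
print(end_multi)      # ['here']
```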


In [None]:
print(text_to_search)

This is just a simple line of text
abcdefghijklmnopqurtuvwxyz
abc
abcabcabc
abcabcabcabcabc
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I love cats.
.com
1234567890
La LaLa
. ^ $ * + ? { } [ ] \ | ( )
This marks the end


##### ? - Matches zero or one occurrence of the pattern to its left.

In [None]:
phrase = "pan paan in Japan and paaan or pn is not a pun "


qmark_re = re.finditer(r"pa?n", phrase)
for qmark in qmark_re:
    print (qmark)

print(len(phrase))


<re.Match object; span=(0, 3), match='pan'>
<re.Match object; span=(14, 17), match='pan'>
<re.Match object; span=(31, 33), match='pn'>
47


In [None]:
split_phrase = re.split(r"\s", phrase)  # split the string at each whitespace character
print(split_phrase)

['pan', 'paan', 'in', 'Japan', 'and', 'paaan', 'or', 'pn', 'is', 'not', 'a', 'pun', '']


##### * - Matches zero or more occurrences of the pattern to its left.

In [None]:
phrase = "pan paan in Japan and paaan or pn is not a pun "

star_re = re.finditer(r"pa*n", phrase)
for star in star_re:
    print (star)

<re.Match object; span=(0, 3), match='pan'>
<re.Match object; span=(4, 8), match='paan'>
<re.Match object; span=(14, 17), match='pan'>
<re.Match object; span=(22, 27), match='paaan'>
<re.Match object; span=(31, 33), match='pn'>


##### + - Matches one or more occurrences of the pattern to its left.

In [None]:
phrase = "pan paan in Japan and paaan or pn is not a pun "

plus_re = re.finditer(r"pa+n", phrase)
for p in plus_re:
    print (p)

<re.Match object; span=(0, 3), match='pan'>
<re.Match object; span=(4, 8), match='paan'>
<re.Match object; span=(14, 17), match='pan'>
<re.Match object; span=(22, 27), match='paaan'>


##### {n} - Matches exactly n occurrences of the preceding expression

In [None]:
alphabet = re.finditer(r"(abc){5}", text_to_search)
for a in alphabet:
    print (a)


<re.Match object; span=(76, 91), match='abcabcabcabcabc'>


##### {n,} - Matches n or more occurrences of the preceding expression

In [None]:
alphabet = re.finditer(r"(abc){1,}", text_to_search)
for a in alphabet:
    print (a)

<re.Match object; span=(35, 38), match='abc'>
<re.Match object; span=(62, 65), match='abc'>
<re.Match object; span=(66, 75), match='abcabcabc'>
<re.Match object; span=(76, 91), match='abcabcabcabcabc'>


##### {n,m} - Matches at least n and at most m occurrences of the preceding expression

In [None]:
alphabet = re.finditer(r"(abc){2,4}", text_to_search)
for a in alphabet:
    print (a)

<re.Match object; span=(66, 75), match='abcabcabc'>
<re.Match object; span=(76, 88), match='abcabcabcabc'>


##### | - The "or" operator. Matches either the pattern on its left or the pattern on its right.

In [None]:
matches = re.finditer(r"ABCD|abcd", text_to_search)  # case sensitive
for obj in matches:
    print(obj)

<re.Match object; span=(35, 39), match='abcd'>
<re.Match object; span=(92, 96), match='ABCD'>


##### () - Used to capture and group sub-patterns
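
As a minimal sketch of capturing, the example below pulls two parts out of a hypothetical phone-like string with `group()` and `groups()`. It also shows a common surprise: when a pattern contains a capturing group, `re.findall()` returns the group's text rather than the whole match. The non-capturing form `(?:...)` is included for comparison.

```python
import re

# Hypothetical sample text, used only for this illustration.
match = re.search(r"(\d{3})-(\d{4})", "Call 555-1234 now")

whole = match.group(0)   # the entire match
first = match.group(1)   # text captured by the first pair of parentheses
parts = match.groups()   # all captured groups as a tuple

# With a capturing group, findall returns the group (its last repetition).
capturing = re.findall(r"(abc)+", "abcabc")
# A non-capturing group only groups, so findall returns the whole match.
non_capturing = re.findall(r"(?:abc)+", "abcabc")

print(whole, first, parts)  # 555-1234 555 ('555', '1234')
print(capturing)            # ['abc']
print(non_capturing)        # ['abcabc']
```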

##### [...] - Matches any single character in brackets.
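
A short sketch of character sets on hypothetical inputs: a plain set, a set built from ranges, and a set containing metacharacters, which are treated as literals inside the brackets.

```python
import re

# Any one of the listed lowercase vowels (the uppercase "I" is not in the set).
vowels = re.findall(r"[aeiou]", "I love cats.")

# Ranges: any digit, or a lowercase letter from a to c.
ranged = re.findall(r"[0-9a-c]", "abcd 1234")

# Inside brackets, characters such as . and + lose their special meaning.
literals = re.findall(r"[.+]", "1.5+2")

print(vowels)    # ['o', 'e', 'a']
print(ranged)    # ['a', 'b', 'c', '1', '2', '3', '4']
print(literals)  # ['.', '+']
```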

##### [^...] - Matches any single character not in brackets
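
A short sketch of negated sets, again on hypothetical inputs. A `^` placed first inside the brackets inverts the set, so the pattern matches every character that is not listed.

```python
import re

# Everything that is not a lowercase letter.
not_lower = re.findall(r"[^a-z]", "La LaLa")

# Everything that is not a digit.
not_digit = re.findall(r"[^0-9]", "0721-675 429")

print(not_lower)  # ['L', ' ', 'L', 'L']
print(not_digit)  # ['-', ' ']
```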

### Special Sequences

Special sequences consist of a \ followed by a specific character. They make commonly used patterns easier to write.

***It is advisable to use raw strings with special sequences.***

In [None]:
print ("\n \t Newline")
print (r"\n \t Newline") #rawstring


 	 Newline
\n \t Newline


##### \\w - Matches alphanumeric characters and the underscore (_).

In [None]:
alphanumeric = re.findall(r"\w", text_to_search)
print(alphanumeric)

['T', 'h', 'i', 's', 'i', 's', 'j', 'u', 's', 't', 'a', 's', 'i', 'm', 'p', 'l', 'e', 'l', 'i', 'n', 'e', 'o', 'f', 't', 'e', 'x', 't', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'u', 'r', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'I', 'l', 'o', 'v', 'e', 'c', 'a', 't', 's', 'c', 'o', 'm', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'L', 'a', 'L', 'a', 'L', 'a', 'T', 'h', 'i', 's', 'm', 'a', 'r', 'k', 's', 't', 'h', 'e', 'e', 'n', 'd']


##### \\d  - Matches digits, which means 0-9.

In [None]:
digits = re.findall(r"\d", text_to_search)
print(digits)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']


##### \\s - Matches whitespace characters.

In [None]:
whitespaces = re.finditer(r"\s", text_to_search)
for whitespace in whitespaces:
    print(whitespace)

<re.Match object; span=(4, 5), match=' '>
<re.Match object; span=(7, 8), match=' '>
<re.Match object; span=(12, 13), match=' '>
<re.Match object; span=(14, 15), match=' '>
<re.Match object; span=(21, 22), match=' '>
<re.Match object; span=(26, 27), match=' '>
<re.Match object; span=(29, 30), match=' '>
<re.Match object; span=(34, 35), match='\n'>
<re.Match object; span=(61, 62), match='\n'>
<re.Match object; span=(65, 66), match='\n'>
<re.Match object; span=(75, 76), match='\n'>
<re.Match object; span=(91, 92), match='\n'>
<re.Match object; span=(118, 119), match='\n'>
<re.Match object; span=(120, 121), match=' '>
<re.Match object; span=(125, 126), match=' '>
<re.Match object; span=(131, 132), match='\n'>
<re.Match object; span=(136, 137), match='\n'>
<re.Match object; span=(147, 148), match='\n'>
<re.Match object; span=(150, 151), match=' '>
<re.Match object; span=(155, 156), match='\n'>
<re.Match object; span=(157, 158), match=' '>
<re.Match object; span=(159, 160), match=' '>
<re.Ma

In [None]:
print(text_to_search)

This is just a simple line of text
abcdefghijklmnopqurtuvwxyz
abc
abcabcabc
abcabcabcabcabc
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I love cats.
.com
1234567890
La LaLa
. ^ $ * + ? { } [ ] \ | ( )
This marks the end


##### \\b - Matches the empty string at a word boundary, so the pattern matches only when the specified characters are at the beginning or end of a word.

In [None]:
words = re.finditer(r"\bLa",text_to_search)
for word in words:
    print(word)

<re.Match object; span=(148, 150), match='La'>
<re.Match object; span=(151, 153), match='La'>


***The uppercase versions of these sequences (\\W, \\D, \\S, \\B) match the opposite of their lowercase counterparts.***
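
As a quick illustration of the uppercase sequences, the sketch below runs `\D`, `\W`, `\S` and `\B` against a short hypothetical string:

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "La LaLa 42"

non_digits = re.findall(r"\D", sample)    # everything except 0-9
non_word = re.findall(r"\W", sample)      # everything except letters, digits and _
non_space = re.findall(r"\S+", sample)    # runs of non-whitespace characters
# \B requires that there is NO word boundary, so only the "La" inside "LaLa" matches.
inner_la = [m.span() for m in re.finditer(r"\BLa", sample)]

print(non_digits)  # ['L', 'a', ' ', 'L', 'a', 'L', 'a', ' ']
print(non_word)    # [' ', ' ']
print(non_space)   # ['La', 'LaLa', '42']
print(inner_la)    # [(5, 7)]
```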

### Practical Examples
In these examples we retrieve information from a text file, *contacts.txt*.
***Disclaimer:*** None of the names and contacts are real; they were generated using the Faker library.

##### Getting the phone numbers.

In [None]:
with open("contacts.txt") as f:
    # Kenyan numbers: an optional +, then 254 and nine digits, or 07 and eight digits
    phonenumbers = re.findall(r"[+]?254\d{9}|07\d{8}", f.read())
    print(f"Kenyan phone numbers:{phonenumbers}")


Kenyan phone numbers:['254745328975', '0721675429', '+254723432321']
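
Since *contacts.txt* is not reproduced here, the sketch below applies the same pattern to an inline string instead. The names, the layout of each line, and the last entry are assumptions made up for this illustration; only the three matching numbers come from the output above.

```python
import re

# Hypothetical stand-in for the contents of contacts.txt.
sample_contacts = """Jane: 254745328975
John: 0721675429
Mary: +254723432321
Office: 020123"""

# Same pattern as above: optional +, then 254 and nine digits, or 07 and eight digits.
kenyan_pattern = re.compile(r"[+]?254\d{9}|07\d{8}")
kenyan_numbers = kenyan_pattern.findall(sample_contacts)

print(kenyan_numbers)  # ['254745328975', '0721675429', '+254723432321']
```

The short office number is skipped because it matches neither alternative.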


In [None]:
with open("contacts.txt") as f:
    # American numbers: an optional +1 prefix, then 3-3-4 digits separated by ".", "-" or parentheses
    numbers = re.findall(r"[+]?1?-?[(]?\d{3}[).-]\d{3}[.-]\d{4}", f.read())
    print("American phone numbers:")
    for n in numbers:
        print(n)

American phone numbers:
741.676.2796
(972)245-1462
746-718-5427
+1-239-380-7394
907-548-9426
(449)494-7467
710.762.9359
(339)546-7970
(865)215-8166
123.555.1234
817-555-1234
(725)826-1520
+1-338-411-7644
364-070-8802
(839)754-8681
(826)493-4218


##### Printing the names of professors.

In [None]:
with open("contacts.txt") as f:
    # Escape the dot so it matches a literal period after "prof" rather than any character
    names = re.findall(r"[Pp]rof\.?\s?[a-zA-Z]+", f.read())
    print(names)

['profJohn', 'Prof.Angela', 'Prof. George', 'prof monica', 'Prof.Lucy', 'Prof. Casey', 'prof. Colleen']


##### Getting the email addresses.

In [None]:
with open("contacts.txt") as f:
    emails = re.findall(r"[a-zA-Z0-9_-]+@[a-zA-Z-]+\.[A-Za-z]{2,6}", f.read())
    for e in emails:
        print(e)

zthompson23@hotmail.com
juliesullivan@hotmail.com
robertgutierrez@cook.net
tingram@yahoo.com
kelly25@yahoo.com
shawn01@johnson-mata.info
kristi60@hotmail.com
jamess-immons@hotmail.com
chorn@washington-leonard.biz
obonilla@gmail.com
davisabigail@hotmail.com
ypalmer@yahoo.com
charles07@horn.com
courtneyday@weaver-stanton.com
g@IT.org
ess_wambui@gmail.com
lucy@yahoo.co
j-l@gmail.edu
d4l@hotmail.com
12345@ymail.com
cory88@gmail.com
timothy94@anderson.com
efields@potter.com
steven06@gmail.com
sandra_rodriguez@hotmail.com
john12@bradley-gonzalez.org
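
The email pattern used above is deliberately simple, and it helps to know what it leaves out. The sketch below runs it on three hypothetical addresses: it does not allow dots in the part before the `@`, and it stops after the first domain suffix. The wider pattern shown next to it is only an assumption for illustration, not a complete email validator.

```python
import re

# Hypothetical addresses, used only for this illustration.
sample = "ess_wambui@gmail.com, first.last@example.org, lucy@yahoo.co.ke"

# The pattern from the cell above.
simple = re.findall(r"[a-zA-Z0-9_-]+@[a-zA-Z-]+\.[A-Za-z]{2,6}", sample)

# A wider sketch: dots allowed before the @, and one or more domain suffixes.
wider = re.findall(r"[\w.-]+@[\w-]+(?:\.[A-Za-z]{2,6})+", sample)

print(simple)  # ['ess_wambui@gmail.com', 'last@example.org', 'lucy@yahoo.co']
print(wider)   # ['ess_wambui@gmail.com', 'first.last@example.org', 'lucy@yahoo.co.ke']
```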


In [None]:
phrase_to_search = '''The quick brown fox jumped over the lazy dog
                      Nairobi city an all tourist destination
                      lets go to the city in the sun.
                      It is also the capital city of Kenya.
                      Nairobi has some of the best caffees in the face 
                      of the world
                      acbacbdefghijekpps
                      bbi imetupiliwa mbali na hiyo ndio raha yetu
                      #$%&^*)({}[]:"
                      +254712393400
                      7208971450
                      i said bbi was overruled by the supreme court
                      076802892
                      +14786980012
                      this marks the end of this string'''


In [None]:
search_s1 = re.search('bbi', phrase_to_search)
print(search_s1)
print(len(phrase_to_search))

<re.Match object; span=(391, 394), match='bbi'>
732


In [None]:
findall_f1 = re.findall('bbi', phrase_to_search)
print(findall_f1)
#print(len(phrase_to_search))

['bbi', 'bbi']


In [None]:
finditer_f1 = re.finditer('bbi', phrase_to_search)
for word in finditer_f1:
  print(word)

    
#print(len(phrase_to_search))

<re.Match object; span=(391, 394), match='bbi'>
<re.Match object; span=(571, 574), match='bbi'>


In [None]:
pattern_1 = re.compile(r"Nairobi")

In [None]:
city = pattern_1.findall(phrase_to_search)
for c in city:
  print(c)

Nairobi
Nairobi


In [None]:
city = pattern_1.finditer(phrase_to_search)
for c in city:
  print(c)

<re.Match object; span=(67, 74), match='Nairobi'>
<re.Match object; span=(243, 250), match='Nairobi'>
