### Regular Expressions

Regular expressions (regex) are patterns used to search for and match text within strings.

<pre>
.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)
</pre>
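
As a quick illustration of a few entries in the table above, here is a minimal sketch using `re.findall`. The string `sample` is a made-up example for illustration only and is not part of the document used in the cells below.

```python
import re

# Hypothetical sample text, chosen only to exercise a few table entries
sample = "Room 42, floor 3"

digits = re.findall(r'\d', sample)        # every single digit
numbers = re.findall(r'\d+', sample)      # runs of one or more digits
words = re.findall(r'\w+', sample)        # runs of word characters
first_word = re.findall(r'^\w+', sample)  # word anchored at the start of the string

print(digits)
print(numbers)
print(words)
print(first_word)
```

`re.findall` returns plain strings rather than match objects, which is convenient when the positions of the matches are not needed.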

In [1]:
import re # import regular expressions module

document = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
123abc

Hello HelloHello Hola

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

utexas.edu
2a 2b 3_ _
321-555-4321
123.555.1234
4a 4b 4c
lisa@myorg.edu
a4 b5 c6
Mr. Johnson
Mr Smith
Ms Davis
Mrs. Robinson
Mr. Lewis
Mr T
'''

### Searching literals

In [2]:
# Regular expression to find the pattern 'abc'

pattern = re.compile(r'abc') # r - raw string to avoid escaping backslashes

matches = pattern.finditer(document) # finditer returns an iterator yielding match objects
print(type(matches))

# Iterate through the matches and print them
print("\nMatches found:")
for mat in matches:
    print(mat)

# Demonstrating string slicing
print("\nhere is the document:")
print("location 1 to 4:",document[1:4])  # This will print the substring 'abc' from the document
print("location 69 to 72:",document[69:72])  # This will print the substring 'abc' from the document

<class 'callable_iterator'>

Matches found:
<re.Match object; span=(1, 4), match='abc'>
<re.Match object; span=(69, 72), match='abc'>

here is the document:
location 1 to 4: abc
location 69 to 72: abc
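
One detail worth keeping in mind about `finditer`: the iterator it returns can be consumed only once. The sketch below, on a short made-up string, shows a second pass over the same iterator yielding nothing; wrapping the call in `list(...)` avoids this.

```python
import re

# Hypothetical short text with two occurrences of 'abc'
text = 'abc 123abc'

matches = re.finditer(r'abc', text)
first_pass = [m.span() for m in matches]   # consumes the iterator
second_pass = [m.span() for m in matches]  # iterator is already exhausted

# Converting to a list keeps the match objects available for reuse
stored = list(re.finditer(r'abc', text))

print(first_pass)
print(second_pass)
print(len(stored))
```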


In [3]:
# finding a different pattern 'cba'

pattern = re.compile(r'cba')
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern 'cba'.")

No matches found for pattern 'cba'.


### Searching special characters

In [4]:
# finding a literal '.': an unescaped dot matches any character except a newline, so escape it with a backslash
pattern = re.compile(r'\.')
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '.'.")

<re.Match object; span=(134, 135), match='.'>
<re.Match object; span=(169, 170), match='.'>
<re.Match object; span=(201, 202), match='.'>
<re.Match object; span=(205, 206), match='.'>
<re.Match object; span=(230, 231), match='.'>
<re.Match object; span=(246, 247), match='.'>
<re.Match object; span=(277, 278), match='.'>
<re.Match object; span=(290, 291), match='.'>
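
When the text to search for contains several metacharacters, escaping each one by hand is error-prone. A possible alternative, sketched here on a made-up string, is `re.escape`, which backslash-escapes the special characters in a literal string.

```python
import re

# Hypothetical sample text containing several metacharacters
text = "price: $5.00 (approx.)"

literal = "$5.00"
pattern = re.compile(re.escape(literal))  # '$' and '.' are escaped automatically
found = [m.group(0) for m in pattern.finditer(text)]

# For contrast: an unescaped dot matches any character, not just '.'
unescaped = re.findall(r'5.00', "5x00 5.00")

print(found)
print(unescaped)
```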


In [5]:
# to find a digit, use \d
pattern = re.compile(r'\d')
matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No digits found in the document.")


<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(60, 61), match='6'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(62, 63), match='8'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='0'>
<re.Match object; span=(66, 67), match='1'>
<re.Match object; span=(67, 68), match='2'>
<re.Match object; span=(68, 69), match='3'>
<re.Match object; span=(174, 175), match='2'>
<re.Match object; span=(177, 178), match='2'>
<re.Match object; span=(180, 181), match='3'>
<re.Match object; span=(185, 186), match='3'>
<re.Match object; span=(186, 187), match='2'>
<re.Match object; span=(187, 188), match='1'>
<re.Match object; span=(189, 190), match='5'>
<re.Match object; span=(190, 191), match='5'>
<re.Match object; span=(191, 192), match='5'>
<re.Match obje

In [6]:
# to find a non-digit character, use \D
pattern = re.compile(r'\D')
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:
    print("No non-digit characters found in the document.")


<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Ma

In [7]:
# finds a digit followed by a word character, \w is used for word characters (alphanumeric + underscore)
pattern = re.compile(r'\d\w')
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '\\d\\w'.")

<re.Match object; span=(55, 57), match='12'>
<re.Match object; span=(57, 59), match='34'>
<re.Match object; span=(59, 61), match='56'>
<re.Match object; span=(61, 63), match='78'>
<re.Match object; span=(63, 65), match='90'>
<re.Match object; span=(66, 68), match='12'>
<re.Match object; span=(68, 70), match='3a'>
<re.Match object; span=(174, 176), match='2a'>
<re.Match object; span=(177, 179), match='2b'>
<re.Match object; span=(180, 182), match='3_'>
<re.Match object; span=(185, 187), match='32'>
<re.Match object; span=(189, 191), match='55'>
<re.Match object; span=(193, 195), match='43'>
<re.Match object; span=(195, 197), match='21'>
<re.Match object; span=(198, 200), match='12'>
<re.Match object; span=(202, 204), match='55'>
<re.Match object; span=(206, 208), match='12'>
<re.Match object; span=(208, 210), match='34'>
<re.Match object; span=(211, 213), match='4a'>
<re.Match object; span=(214, 216), match='4b'>
<re.Match object; span=(217, 219), match='4c'>


In [8]:
# finds a digit followed by a whitespace character, \s is used for whitespace characters (space, tab, newline)
pattern = re.compile(r'\d\s')
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:    
    print("No matches found for pattern digit followed by whitespace '\\d\\s'.")


<re.Match object; span=(64, 66), match='0\n'>
<re.Match object; span=(196, 198), match='1\n'>
<re.Match object; span=(209, 211), match='4\n'>
<re.Match object; span=(236, 238), match='4 '>
<re.Match object; span=(239, 241), match='5 '>
<re.Match object; span=(242, 244), match='6\n'>


### Word boundary

In [9]:
# finds every occurrence of 'Hello' (the document contains 'Hello HelloHello')
pattern = re.compile(r'Hello')
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern 'Hello'.")

<re.Match object; span=(74, 79), match='Hello'>
<re.Match object; span=(80, 85), match='Hello'>
<re.Match object; span=(85, 90), match='Hello'>


In [10]:
# Hello HelloHello with word boundary
# \b is a word boundary
pattern = re.compile(r'Hello\b') 
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern 'Hello\\b'.")

<re.Match object; span=(74, 79), match='Hello'>
<re.Match object; span=(85, 90), match='Hello'>


In [11]:
pattern = re.compile(r'\bHello\b')
matches = list(pattern.finditer(document))
if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '\\bHello\\b'.")

<re.Match object; span=(74, 79), match='Hello'>


In [12]:
# Hello HelloHello with non-word boundary
# \B is a non-word boundary
# \b is a word boundary

pattern = re.compile(r'\BHello\b')
matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '\\BHello\\b'.")


<re.Match object; span=(85, 90), match='Hello'>


In [13]:
# finds a digit at the start of a word
# \b is used to denote a word boundary
# \d is used to denote a digit

pattern = re.compile(r'\b\d')
matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '\\b\\d'.")
    

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(66, 67), match='1'>
<re.Match object; span=(174, 175), match='2'>
<re.Match object; span=(177, 178), match='2'>
<re.Match object; span=(180, 181), match='3'>
<re.Match object; span=(185, 186), match='3'>
<re.Match object; span=(189, 190), match='5'>
<re.Match object; span=(193, 194), match='4'>
<re.Match object; span=(198, 199), match='1'>
<re.Match object; span=(202, 203), match='5'>
<re.Match object; span=(206, 207), match='1'>
<re.Match object; span=(211, 212), match='4'>
<re.Match object; span=(214, 215), match='4'>
<re.Match object; span=(217, 218), match='4'>


In [14]:
# finds a whitespace character at the start of the string
# ^ denotes the start of the string; it matches at the start of every line only when re.MULTILINE is set

pattern = re.compile(r'^\s')
matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '^\\s'.")
    

<re.Match object; span=(0, 1), match='\n'>
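
The single match above reflects the default behavior of `^`, which anchors at the start of the whole string. A minimal sketch of the difference the `re.MULTILINE` flag makes, using a made-up two-line string:

```python
import re

# Hypothetical two-line sample text
text = "one\ntwo"

# Default: ^ matches only at the very start of the string
default_matches = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches immediately after each newline
multiline_matches = re.findall(r'^\w+', text, re.MULTILINE)

print(default_matches)
print(multiline_matches)
```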


### Character sets

In [15]:
# any digits in [123] followed by a word character
# \w is used to denote a word character (alphanumeric + underscore) 

pattern = re.compile(r'[123]\w')
matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '[123]\\w'.")


<re.Match object; span=(55, 57), match='12'>
<re.Match object; span=(57, 59), match='34'>
<re.Match object; span=(66, 68), match='12'>
<re.Match object; span=(68, 70), match='3a'>
<re.Match object; span=(174, 176), match='2a'>
<re.Match object; span=(177, 179), match='2b'>
<re.Match object; span=(180, 182), match='3_'>
<re.Match object; span=(185, 187), match='32'>
<re.Match object; span=(194, 196), match='32'>
<re.Match object; span=(198, 200), match='12'>
<re.Match object; span=(206, 208), match='12'>
<re.Match object; span=(208, 210), match='34'>


In [16]:
# find 2 consecutive lowercase letters
# [a-z] is used to denote lowercase letters

pattern = re.compile(r'[a-z][a-z]')
matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '[a-z][a-z]'.")
    

<re.Match object; span=(1, 3), match='ab'>
<re.Match object; span=(3, 5), match='cd'>
<re.Match object; span=(5, 7), match='ef'>
<re.Match object; span=(7, 9), match='gh'>
<re.Match object; span=(9, 11), match='ij'>
<re.Match object; span=(11, 13), match='kl'>
<re.Match object; span=(13, 15), match='mn'>
<re.Match object; span=(15, 17), match='op'>
<re.Match object; span=(17, 19), match='qu'>
<re.Match object; span=(19, 21), match='rt'>
<re.Match object; span=(21, 23), match='uv'>
<re.Match object; span=(23, 25), match='wx'>
<re.Match object; span=(25, 27), match='yz'>
<re.Match object; span=(69, 71), match='ab'>
<re.Match object; span=(75, 77), match='el'>
<re.Match object; span=(77, 79), match='lo'>
<re.Match object; span=(81, 83), match='el'>
<re.Match object; span=(83, 85), match='lo'>
<re.Match object; span=(86, 88), match='el'>
<re.Match object; span=(88, 90), match='lo'>
<re.Match object; span=(92, 94), match='ol'>
<re.Match object; span=(98, 100), match='et'>
<re.Match object; 

In [17]:
# [a-zA-Z0-9][a-zA-Z-] finds a letter or digit followed by a letter or hyphen
# [a-zA-Z0-9] is used to denote letters and digits, [a-zA-Z-] is used to denote letters and hyphen

pattern = re.compile(r'[a-zA-Z0-9][a-zA-Z-]')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '[a-zA-Z0-9][a-zA-Z-]'.")

<re.Match object; span=(1, 3), match='ab'>
<re.Match object; span=(3, 5), match='cd'>
<re.Match object; span=(5, 7), match='ef'>
<re.Match object; span=(7, 9), match='gh'>
<re.Match object; span=(9, 11), match='ij'>
<re.Match object; span=(11, 13), match='kl'>
<re.Match object; span=(13, 15), match='mn'>
<re.Match object; span=(15, 17), match='op'>
<re.Match object; span=(17, 19), match='qu'>
<re.Match object; span=(19, 21), match='rt'>
<re.Match object; span=(21, 23), match='uv'>
<re.Match object; span=(23, 25), match='wx'>
<re.Match object; span=(25, 27), match='yz'>
<re.Match object; span=(28, 30), match='AB'>
<re.Match object; span=(30, 32), match='CD'>
<re.Match object; span=(32, 34), match='EF'>
<re.Match object; span=(34, 36), match='GH'>
<re.Match object; span=(36, 38), match='IJ'>
<re.Match object; span=(38, 40), match='KL'>
<re.Match object; span=(40, 42), match='MN'>
<re.Match object; span=(42, 44), match='OP'>
<re.Match object; span=(44, 46), match='QR'>
<re.Match object; s

In [18]:
# finds a letter followed by a non-letter character
# [a-zA-Z] is used to denote letters, [^a-zA-Z] is used to denote non-letter characters

pattern = re.compile(r'[a-zA-Z][^a-zA-Z]')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '[a-zA-Z][^a-zA-Z]'.")


<re.Match object; span=(26, 28), match='z\n'>
<re.Match object; span=(53, 55), match='Z\n'>
<re.Match object; span=(71, 73), match='c\n'>
<re.Match object; span=(78, 80), match='o '>
<re.Match object; span=(89, 91), match='o '>
<re.Match object; span=(94, 96), match='a\n'>
<re.Match object; span=(110, 112), match='s '>
<re.Match object; span=(116, 118), match='d '>
<re.Match object; span=(119, 121), match='o '>
<re.Match object; span=(122, 124), match='e '>
<re.Match object; span=(130, 132), match='d)'>
<re.Match object; span=(168, 170), match='s.'>
<re.Match object; span=(172, 174), match='u\n'>
<re.Match object; span=(175, 177), match='a '>
<re.Match object; span=(178, 180), match='b '>
<re.Match object; span=(212, 214), match='a '>
<re.Match object; span=(215, 217), match='b '>
<re.Match object; span=(218, 220), match='c\n'>
<re.Match object; span=(223, 225), match='a@'>
<re.Match object; span=(229, 231), match='g.'>
<re.Match object; span=(233, 235), match='u\n'>
<re.Match object; 

### Character groups

In [19]:
# \b is a word boundary
# finds 'abc', 'edu', or 'texas' when immediately followed by a word boundary (i.e. at the end of a word)
# (abc|edu|texas) is used to denote a group of alternatives

pattern = re.compile(r'(abc|edu|texas)\b')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '(abc|edu|texas)\\b'.")
    

<re.Match object; span=(69, 72), match='abc'>
<re.Match object; span=(164, 169), match='texas'>
<re.Match object; span=(170, 173), match='edu'>
<re.Match object; span=(231, 234), match='edu'>
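
Parentheses do two things: they group alternatives and they capture the text that matched. The sketch below, on a made-up string, contrasts a capturing group with a non-capturing group `(?:...)`. With `re.findall`, a capturing group changes what is returned.

```python
import re

# Hypothetical sample text
text = "Mr. Lee and Ms. Kim"

# Capturing group: findall returns only the captured part
captured = re.findall(r'(Mr|Ms)\. \w+', text)

# Non-capturing group: findall returns the whole match
whole = re.findall(r'(?:Mr|Ms)\. \w+', text)

print(captured)
print(whole)
```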


### Quantifiers

In [20]:
# \. is used to denote a literal dot
# ? is used to denote zero or one occurrence of the preceding character
# \s is used to denote a whitespace character
# [A-Z] is used to denote uppercase letters

pattern = re.compile(r'Mr\.?\s[A-Z]')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern 'Mr\\.?\\s[A-Z]'.")

<re.Match object; span=(244, 249), match='Mr. J'>
<re.Match object; span=(256, 260), match='Mr S'>
<re.Match object; span=(288, 293), match='Mr. L'>
<re.Match object; span=(298, 302), match='Mr T'>


In [21]:
# \. is used to denote a literal dot
# ? is used to denote zero or one occurrence of the preceding character
# \s is used to denote a whitespace character
# [A-Z][a-z]* is used to denote an uppercase letter followed by zero or more lowercase letters
# * is used to denote zero or more occurrences of the preceding character

pattern = re.compile(r'Mr\.?\s[A-Z][a-z]*')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern 'Mr\\.?\\s[A-Z][a-z]*'.")

<re.Match object; span=(244, 255), match='Mr. Johnson'>
<re.Match object; span=(256, 264), match='Mr Smith'>
<re.Match object; span=(288, 297), match='Mr. Lewis'>
<re.Match object; span=(298, 302), match='Mr T'>


In [22]:
# \. is used to denote a literal dot
# ? is used to denote zero or one occurrence of the preceding character
# \s is used to denote a whitespace character
# [A-Z][a-z]+ is used to denote an uppercase letter followed by one or more lowercase letters
# + is used to denote one or more occurrences of the preceding character

pattern = re.compile(r'Mr\.?\s[A-Z][a-z]+')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern 'Mr\\.?\\s[A-Z][a-z]+'.")

<re.Match object; span=(244, 255), match='Mr. Johnson'>
<re.Match object; span=(256, 264), match='Mr Smith'>
<re.Match object; span=(288, 297), match='Mr. Lewis'>


In [23]:
# finds 'Ms' or 'Mrs' followed by an optional dot, a whitespace character, and an uppercase letter followed by zero or more lowercase letters
# M(s|rs) is used to denote 'M' followed by a group of alternatives, 's' or 'rs'
# \.?\s is used to denote an optional dot followed by a whitespace character
# [A-Z][a-z]* is used to denote an uppercase letter followed by zero or more lowercase letters

pattern = re.compile(r'M(s|rs)\.?\s[A-Z][a-z]*')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern 'M(s|rs)\\.?\\s[A-Z][a-z]*'.")
    

<re.Match object; span=(265, 273), match='Ms Davis'>
<re.Match object; span=(274, 287), match='Mrs. Robinson'>


In [24]:
# finds a phone number pattern in the format '123-456-7890' or '123.456.7890'
# \d{3} is used to denote three digits, [.-] is used to denote a hyphen or dot, \d{4} is used to denote four digits
# \d{3}[.-]\d{3}[.-]\d{4} is used to denote the phone number pattern; a separator is required, so '1234567890' does not match

pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '\\d{3}[.-]\\d{3}[.-]\\d{4}'.") 

<re.Match object; span=(185, 197), match='321-555-4321'>
<re.Match object; span=(198, 210), match='123.555.1234'>
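
The pattern above requires a hyphen or dot between the digit groups. If unseparated numbers such as '1234567890' should match as well, one possible variation is to make each separator optional with `?` and add word boundaries so that longer digit runs are not partially matched. This is a sketch on a made-up string, not a complete phone number validator.

```python
import re

# Hypothetical sample text with separated, unseparated, and invalid numbers
text = "321-555-4321 123.555.1234 1234567890 12-34"

# [.-]? makes each separator optional; \b keeps the match on word boundaries
pattern = re.compile(r'\b\d{3}[.-]?\d{3}[.-]?\d{4}\b')
phones = [m.group(0) for m in pattern.finditer(text)]

print(phones)
```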


In [25]:
# [a-zA-Z0-9_]+ is used to denote one or more alphanumeric characters or underscores
# \. is used to denote a literal dot, [a-z]{3} is used to denote exactly three lowercase letters
# This pattern matches names such as domains or filenames: word characters, a dot, then three lowercase letters

pattern = re.compile(r'[a-zA-Z0-9_]+\.[a-z]{3}')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '[a-zA-Z0-9_]+\\.[a-z]{3}'.")
    

<re.Match object; span=(163, 173), match='utexas.edu'>
<re.Match object; span=(225, 234), match='myorg.edu'>


In [26]:
# finds an email address pattern
# [a-zA-Z0-9_.+-]+ is used to denote one or more alphanumeric characters, underscores, dots, pluses, or hyphens
# @ is used to denote the at symbol
# [a-zA-Z0-9-]+ is used to denote one or more alphanumeric characters or hyphens
# \. is used to denote a literal dot, [a-zA-Z0-9-.]+ is used to denote one or more alphanumeric characters, dots or hyphens

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

matches = list(pattern.finditer(document))

if matches:
    for mat in matches:
        print(mat)
else:
    print("No matches found for pattern '[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+'.")

<re.Match object; span=(220, 234), match='lisa@myorg.edu'>


### Accessing information in the Match object

In [27]:
# finds an email address pattern with a specific top-level domain length
# [a-zA-Z0-9_.+-]+ is used to denote one or more alphanumeric characters, underscores, dots, pluses, or hyphens
# @ is used to denote the at symbol
# [a-zA-Z0-9-]+ is used to denote one or more alphanumeric characters or hyphens
# \. is used to denote a literal dot
# [a-zA-Z0-9-.]{2,4} is used to denote two to four alphanumeric characters, dots or hyphens

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,4}')
matches = list(pattern.finditer(document))
for mat in matches:
    print(mat.span(0))
    print(mat.group(0))
    print(document[mat.span(0)[0]:mat.span(0)[1]])
    print(mat.string[mat.span(0)[0]:mat.span(0)[1]])


(220, 234)
lisa@myorg.edu
lisa@myorg.edu
lisa@myorg.edu
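
Besides `span()` and `group(0)`, a match object also exposes `start()`, `end()`, and named groups. The following sketch uses a simplified email pattern with the group names `user` and `domain`, which are chosen here for illustration, on a made-up string.

```python
import re

# Hypothetical sample text
text = "contact: lisa@myorg.edu"

# (?P<name>...) defines a named group; the names are illustrative assumptions
pattern = re.compile(r'(?P<user>[a-zA-Z0-9_.+-]+)@(?P<domain>[a-zA-Z0-9-]+\.[a-zA-Z]{2,4})')
mat = pattern.search(text)  # search returns the first match, or None

user = mat.group('user')
domain = mat.group('domain')
start, end = mat.start(), mat.end()

print(user, domain)
print(text[start:end])
print(mat.groupdict())
```

Named groups can make longer patterns easier to read than numeric indexes such as `group(2)`.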


In [28]:
urls = r'''
https://www.google.com
http://yahoo.com
https://www.whitehouse.gov
https://craigslist.org
'''

In [29]:
# finds URLs in the format 'https://www.example.com' or 'http://example.com'
# https? is used to denote 'http' or 'https'
# (www\.)? is used to denote an optional 'www.' prefix
# \w+ is used to denote one or more word characters (alphanumeric + underscore)
# \.\w+ is used to denote a dot followed by one or more word characters (alphanumeric + underscore)

pattern = re.compile(r'https?://(www\.)?\w+\.\w+')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat)

<re.Match object; span=(1, 23), match='https://www.google.com'>
<re.Match object; span=(24, 40), match='http://yahoo.com'>
<re.Match object; span=(41, 67), match='https://www.whitehouse.gov'>
<re.Match object; span=(68, 90), match='https://craigslist.org'>


In [30]:
# same URL pattern, with the domain name and top-level domain captured as groups
# https? is used to denote 'http' or 'https'
# (www\.)? is group 1, an optional 'www.' prefix
# (\w+) is group 2, one or more word characters (alphanumeric + underscore)
# (\.\w+) is group 3, a dot followed by one or more word characters (alphanumeric + underscore)

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat.group(2)+mat.group(3))

google.com
yahoo.com
whitehouse.gov
craigslist.org


In [31]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat.group(0))
    print(urls[mat.span(2)[0]:mat.span(2)[1]]+urls[mat.span(3)[0]:mat.span(3)[1]])

https://www.google.com
google.com
http://yahoo.com
yahoo.com
https://www.whitehouse.gov
whitehouse.gov
https://craigslist.org
craigslist.org
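
The captured groups can also be used in a replacement string. As a sketch, `pattern.sub` with the backreferences `\2` and `\3` rewrites each URL to just its domain; the shortened `urls` string below is a stand-in for the one defined earlier.

```python
import re

# Shortened stand-in for the urls string used above
urls = '''
https://www.google.com
http://yahoo.com
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# \2 and \3 refer to the second and third captured groups
domains_text = pattern.sub(r'\2\3', urls)
domains = domains_text.split()

print(domains)
```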
