# Regular Expressions

The term `Regular Expression` refers to sequences of characters that form search parameters for texts patterns. Python has a package `re`, containing functions used in Regular Expressions. This package has 4 main functions:

- `findall` - returns a list containing all matches
- `search` - returns a `match object` if a match exists
- `split` - returns a list where the string has been split at each match
- `sub` - replaces all matches in a string with other specified characters

Regular expressions allow you to carry out this four basic functions using special sequences and metacharacters to help in pattern matching. Instead of matching a single word, you can use special characters shown below to match a array of possible words matching certain patterns.

More information about the metacharacters and special sequences can be found here: [Regex in Python](https://docs.python.org/3/howto/regex.html)

| Character | Description                                    | Example         |
|-----------|------------------------------------------------|-----------------|
| []        | A set of characters                           | "[a-m]"         |
| \         | Signals a special sequence                    | "\\d"           |
| .         | Any character (except newline character)      | "he..o"         |
| ^         | Starts with                                   | "^hello"        |
| $         | Ends with                                     | "planet\\$"     |
| *         | Zero or more occurrences                      | "he.*o"         |
| +         | One or more occurrences                       | "he.+o"         |
| ?         | Zero or one occurrences                       | "he.?o"         |
| {}        | Exactly the specified number of occurrences  | "he.{2}o"       |
| \|        | Either or                                     | "falls\|stays"  |
| ()        | Capture and group                             |                 |
| \d        | Matches any decimal digit                     |                 |
| \D        | Matches any non-digit character               |                 |
| \s        | Matches any whitespace character              |                 |
| \S        | Matches any non-whitespace character          |                 |
| \w        | Matches any alphanumeric character            |                 |
| \W        | Matches any non-alphanumeric character        |                 |


The regular expression functionality is found in the package `re`

In [2]:
import re

## Finding a match
`findall` returns a list of all the matches that correspond to the regular expression pattern.

In [3]:
str1 = "The rain in Spain falls mainly on the plains"
pat1 = r"\w+ain\w*"
x1 = re.findall(pat1,str1)
print(x1)

['rain', 'Spain', 'mainly', 'plains']


In [4]:
str2="Denise and Dennis deny denouncing their dentist. Their denials denominated the news."
pat2=r"[Dd]en\w*"
x2 = re.findall(pat2,str2)
print(x2)

['Denise', 'Dennis', 'deny', 'denouncing', 'dentist', 'denials', 'denominated']


## Search for only the first match
The only significant difference between `findall` and `search` is that `search` will only find **the first match** in the string of interest. 

In [5]:
str1 = "The rain in Spain falls mainly on the plains"
pat1 = r"\w+a(in)\w*"
x = re.search(pat1,str1)
print(x)

<re.Match object; span=(4, 8), match='rain'>


In [6]:
x.span()

(4, 8)

In [12]:
x.group(1)

'in'

There are several methods that you can use to access the information in the Match object produced by `search`. These methods can give you the indices of the match, the match, and the original string.

In [13]:
print(x.start())
print(x.end())
print(x.span())
print(x.string)
print(x.group())

4
8
(4, 8)
The rain in Spain falls mainly on the plains
rain


## Splitting apart
The `split` function in `re` allows you to split your string on any character(s) you wish. There are also optional arguments allowing you to specify the exact number of splits you wish to make in a given string

In [15]:
getty = "But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground.";
pat3 = r'[\s.—,]+'
pat4 = r'[—]'
x3=re.split(pat3,getty)
x4=re.split(pat4,getty,maxsplit=2)
print(x3)
print(x4)

['But', 'in', 'a', 'larger', 'sense', 'we', 'can', 'not', 'dedicate', 'we', 'can', 'not', 'consecrate', 'we', 'can', 'not', 'hallow', 'this', 'ground', '']
['But, in a larger sense, we can not dedicate', 'we can not consecrate', 'we can not hallow—this ground.']


## Substitution
The function `sub` will return the altered string after it substitutes a replacement substring.

In [16]:
x5 = re.sub(r"[Dd]en",r"ur",str2)
print(x5)

urise and urnis ury urouncing their urtist. Their urials urominated the news.


# Web Scraping
We can use regular expression to scrap things from the internet. In fact, nefarious actors on the internet use this method so egregiously, organizations now bury email addresses and phone numbers in images and other places that are not so easily accessible.

In [17]:
import urllib.request as ur

In [18]:
#grab a webpage
page = ur.urlopen('https://www.senate.gov/senators/')
print(page.getcode())

200


In [19]:
html = page.read()
strhtml = str(html,'utf-8')

In [21]:
html

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<!-- [if lt IE 7]> <html class="ie6 oldie"> <![endif] --><!-- [if IE 7]>    <html class="ie7 oldie"> <![endif] --><!-- [if IE 8]>    <html class="ie8 oldie"> <![endif] --><!-- [if gt IE 8]> <! --><html class="">\n<!-- <![endif] -->\n<head>\n<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<meta name="object" content="index.xml">\n<meta name="version" content="70.1">\n<meta name="path" content="/Company Home/Sites/senategov/documentLibrary/Senate.gov/senators">\n<meta name="date" content="Tuesday, September 12, 2023">\n<meta name="time" content="9:22:08 AM EDT">\n<meta name="keywords" content="">\n<meta name="bucket" content="senators">\n<meta name="description" content="">\n<title>U.S. Senate: Senators</title>\n<link type="image/x-icon" rel="shor

In [26]:
pattern = r'https:\/\/www\.[a-zA-Z0-9.-]+\.senate\.gov'
sitelist = re.findall(pattern,strhtml)

In [24]:
sitelist

['https://www.baldwin.senate.gov',
 'https://www.barrasso.senate.gov',
 'https://www.bennet.senate.gov',
 'https://www.blackburn.senate.gov',
 'https://www.blumenthal.senate.gov',
 'https://www.booker.senate.gov',
 'https://www.boozman.senate.gov',
 'https://www.braun.senate.gov',
 'https://www.britt.senate.gov',
 'https://www.brown.senate.gov',
 'https://www.budd.senate.gov',
 'https://www.butler.senate.gov',
 'https://www.cantwell.senate.gov',
 'https://www.capito.senate.gov',
 'https://www.cardin.senate.gov',
 'https://www.carper.senate.gov',
 'https://www.casey.senate.gov',
 'https://www.cassidy.senate.gov',
 'https://www.collins.senate.gov',
 'https://www.coons.senate.gov',
 'https://www.cornyn.senate.gov',
 'https://www.cortezmasto.senate.gov',
 'https://www.cotton.senate.gov',
 'https://www.cramer.senate.gov',
 'https://www.crapo.senate.gov',
 'https://www.cruz.senate.gov',
 'https://www.daines.senate.gov',
 'https://www.duckworth.senate.gov',
 'https://www.durbin.senate.gov',
 

In [None]:
sitelist

In [29]:
emailpat = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
phonepat = r'\(\d{3}\) \d{3}-\d{4}'
pattern = r'https:\/\/www\.([a-zA-Z0-9.-]+)\.senate\.gov'
#run through sites
senatedict = {}
for address in sitelist:
    #get senators name
    senator = re.search(pattern,address)
    senator = senator.group(1)
    #grab their website
    try:
        site = ur.urlopen(address)     
    except:
        phonenum = site.getcode()
        email =  ''
    else:
        #make sure it's a string
        strhtml = str(site.read(),'utf-8')
        #find phone numbers and emails
        phonenum = re.findall(phonepat,strhtml)
        email = re.findall(emailpat,strhtml) 
    subdict={'phone':phonenum,'email':email}
    senatedict[senator]=subdict

In [30]:
senatedict

{'baldwin': {'phone': ['(000) 000-0000',
   '(000) 000-0000',
   '(715) 832-8424',
   '(000) 000-0000',
   '(000) 000-0000',
   '(608) 264-5338',
   '(000) 000-0000',
   '(000) 000-0000',
   '(715) 832-8424',
   '(000) 000-0000',
   '(000) 000-0000',
   '(920) 498-2668',
   '(000) 000-0000',
   '(000) 000-0000',
   '(414) 297-4451',
   '(000) 000-0000',
   '(000) 000-0000',
   '(608) 796-0045',
   '(000) 000-0000',
   '(000) 000-0000',
   '(202) 224-5653',
   '(715) 832-8424',
   '(608) 264-5338',
   '(715) 832-8424',
   '(920) 498-2668',
   '(414) 297-4451',
   '(608) 796-0045',
   '(202) 224-5653'],
  'email': []},
 'barrasso': {'phone': [], 'email': []},
 'bennet': {'phone': 200, 'email': ''},
 'blackburn': {'phone': ['(901) 527-9199',
   '(901) 527-9515',
   '(731) 660-3971',
   '(731) 660-3978',
   '(629) 800-6600',
   '(615) 298-2148',
   '(423) 541-2939',
   '(423) 541-2944',
   '(865) 540-3781',
   '(865) 540-7952',
   '(423) 753-4009',
   '(423) 788-0250',
   '(202) 224-3344',

In [None]:
#let's look back in time - 2008
site = 'https://web.archive.org/web/20080701043126/http://www.senate.gov/general/contact_information/senators_cfm.cfm'
page = ur.urlopen(site)
print(page.getcode())

In [None]:
html = page.read()
strhtml = str(html,'utf-8')

In [None]:
strhtml

In [None]:
phonepat = r'\(\d{3}\) \d{3}-\d{4}'
namepat =r'>[A-Z][a-z]+, [A-Z][a-z]+<''
phonenum = re.findall(phonepat,strhtml)

In [None]:
phonenum