## Regex Basics (Python)

Regular expressions (regex) are fascinating and useful from an applied linguistics standpoint. A regular expression is a kind of 'meta' language for specifying exactly which pieces of written text we mean, precisely enough that a computer program can automate the search.

Consistent patterns or formats, such as phone numbers and email addresses, can easily be described using regular expressions.

## How & What

Using a regular expression starts with making a "pattern"
(which in Python is a string, usually written as a raw string with an r prefix).

For example, this pattern looks for individual digit characters:

```
r"\d"
```
or this one, which looks for runs of exactly 3 consecutive digits (inside a longer run of digits it matches the first 3):
```
r"\d{3}"
```
or this one, which looks for runs of digits of any length (one or more):
```
r"\d+"
```

It is also important to decide what you are asking about.
For example, you may want to know WHERE in the text your item is,
or you may want to know WHAT the item is. So be sure you are asking the right question
with your regular expression.
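
The difference between the two questions can be sketched with a match object. This is a minimal illustration, and the `sample` sentence is a made-up string, not part of the examples in this notebook: `span()` answers WHERE, and `group()` answers WHAT.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "Order 66 shipped on day 12."

match = re.search(r"\d+", sample)

# WHERE: start and end offsets of the first match
where = match.span()

# WHAT: the matched text itself
what = match.group()

print(where, what)  # (6, 8) 66
```

Slicing the original string with the span, `sample[6:8]`, gives back the same text as `group()`.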

Note: people usually look up regex commands in tables and charts, as opposed to trying to memorize them all.

# Characters
Common abstract character types for regex include
(Note the pattern of lower case and upper case):

#### \d  digit
#### \D  NOT a digit

#### \w  word character (letter, digit, or underscore)
#### \W  NOT a word character

#### \s  white space
#### \S  NOT white space
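
As a quick check of these character types, here is a small sketch using a made-up `sample` string (an assumption for illustration only). Note that in Python `\w` matches digits and the underscore as well as letters, which is why `Room_7` comes back as one piece.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "Room_7 opens at 9 am"

digits = re.findall(r"\d", sample)      # ['7', '9']
word_runs = re.findall(r"\w+", sample)  # ['Room_7', 'opens', 'at', '9', 'am']
spaces = re.findall(r"\s", sample)      # four single spaces
non_word = re.findall(r"\W", sample)    # here, the same four spaces

print(digits, word_runs, len(spaces))
```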


# Quantifiers

#### +      Occurs one or more times
#### {#}    Occurs exactly # times
#### {#,#}  Occurs from # to # times
#### {#,}   Occurs # OR more times
#### *      Occurs zero OR more times
#### ?      Occurs once or never [binary, 1, 0]
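
The quantifiers above can be tried out one by one. The strings below are invented for illustration; the comments show what `re.findall` returns for each pattern.

```python
import re

# Hypothetical sample string, chosen only to illustrate each quantifier.
runs = "a aa aaa aaaa"

one_or_more = re.findall(r"a+", runs)       # ['a', 'aa', 'aaa', 'aaaa']
exactly_two = re.findall(r"a{2}", runs)     # ['aa', 'aa', 'aa', 'aa']
two_to_three = re.findall(r"a{2,3}", runs)  # ['aa', 'aaa', 'aaa']
three_or_more = re.findall(r"a{3,}", runs)  # ['aaa', 'aaaa']

# * allows zero 'b' characters after the 'a'; ? makes the 'u' optional
zero_or_more = re.findall(r"ab*", "a ab abb")      # ['a', 'ab', 'abb']
optional = re.findall(r"colou?r", "color colour")  # ['color', 'colour']
```

Notice that `a{2}` finds two separate matches inside `aaaa`, because matching resumes right after the previous match ends.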

In [1]:
# import the regex library
import re

In [2]:
# digits \d

In [25]:
text_1 = "The phone number of the agent is 328-289-2838. Call hello world! Is?"

In [26]:
"phone" in text_1

True

In [27]:
# span tells you where in the text the pattern starts, and where it stops.

pattern = r"\d+"
text = text_1

my_match = re.search(pattern, text)

my_match.span()

(33, 36)

In [7]:
# re.search returns only the first match
# re.findall returns every match, as a list

In [29]:
pattern = r"\d"
text = text_1

my_match = re.findall(pattern, text)

my_match

['3', '2', '8', '2', '8', '9', '2', '8', '3', '8']

In [28]:
pattern = r"\d{3}"
text = text_1

my_match = re.findall(pattern, text)

my_match

['328', '289', '283']
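
The `'283'` in this result is the first three digits of `2838`: `\d{3}` does not care whether more digits follow. If only runs of exactly three digits are wanted, one option is the word boundary `\b`, which is not covered in the tables above. This is a sketch of that idea, not the only way to do it.

```python
import re

text = "The phone number of the agent is 328-289-2838. Call hello world! Is?"

# \d{3} also matches the first three digits of the longer run 2838
loose = re.findall(r"\d{3}", text)       # ['328', '289', '283']

# \b requires a word boundary on both sides, so 2838 is skipped entirely
strict = re.findall(r"\b\d{3}\b", text)  # ['328', '289']

print(loose, strict)
```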

In [8]:
pattern = r"\d+"
text = text_1

my_match = re.findall(pattern, text)

my_match

['328', '289', '2838']

In [9]:
for match in re.finditer("phone",text):
    print(match.span())

(4, 9)


In [10]:
# here you are looking for the location of every individual digit
for match in re.finditer(r"\d",text):
    print(match.span())

(33, 34)
(34, 35)
(35, 36)
(37, 38)
(38, 39)
(39, 40)
(41, 42)
(42, 43)
(43, 44)
(44, 45)


In [30]:
# Here you are looking for the location of each run of 3 consecutive digits
for match in re.finditer(r"\d{3}",text):
    print(match.span())

(33, 36)
(37, 40)
(41, 44)


In [11]:
# this is one way of specifying a phone number
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [12]:
phone_numbers = re.findall(pattern, text)

In [13]:
phone_numbers

['328-289-2838']

In [14]:
#quantifiers

# here is another way of specifying a phone number
pattern = r'\d{3}-\d{3}-\d{4}'

In [15]:
phone_numbers = re.findall(pattern, text)
phone_numbers

['328-289-2838']

In [16]:
# grouping
pattern = r'(\d{3})-(\d{3})-(\d{4})'
my_match = re.search(pattern, text)


In [17]:
my_match

<re.Match object; span=(33, 45), match='328-289-2838'>

In [18]:
my_match.group()

'328-289-2838'

In [19]:
# return just one group
# (re.findall returns plain lists, not match objects; with groups in the pattern it returns a list of tuples)
my_match.group(1)

'328'
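
To get groups from every match rather than only the first, two approaches are sketched below. With groups in the pattern, `re.findall` returns tuples of the groups, while `re.finditer` yields match objects that still support `.group(n)`.

```python
import re

text = "The phone number of the agent is 328-289-2838. Call hello world! Is?"
pattern = r"(\d{3})-(\d{3})-(\d{4})"

# findall with groups: one tuple per match, one entry per group
as_tuples = re.findall(pattern, text)  # [('328', '289', '2838')]

# finditer keeps full match objects, so .group(1) works for each match
first_groups = [m.group(1) for m in re.finditer(pattern, text)]  # ['328']

print(as_tuples, first_groups)
```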

In [20]:
text_2 = "cat sat hat fish dish"

In [21]:
# wildcard

text = text_2

pattern = r'.at'

In [22]:
my_match = re.findall(pattern, text)

In [23]:
my_match

['cat', 'sat', 'hat']
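
The wildcard `.` matches any single character except a newline, including spaces and punctuation, which can give surprising results. The sample strings below are made up for illustration; escaping the dot as `\.` makes it match a literal period only.

```python
import re

# '.' accepts '@' and even a space in front of "at"
any_char = re.findall(r".at", "cat bat @at at")     # ['cat', 'bat', '@at', ' at']

# \w restricts the first character to a word character
word_char = re.findall(r"\wat", "cat bat @at at")   # ['cat', 'bat']

# an unescaped dot matches the '8' in 385; an escaped dot does not
unescaped = re.findall(r"3.5", "3.5 385")           # ['3.5', '385']
escaped = re.findall(r"3\.5", "3.5 385")            # ['3.5']
```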

In [24]:
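# ^ anchors the start and $ anchors the end
# by default they refer to the whole string; with re.MULTILINE they apply to each line
# no match here because the text ends with '?', not '!'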

text = text_1

pattern = r'\!$'

my_match = re.findall(pattern, text)
my_match

[]
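
The whole-string versus per-line behavior of `^` and `$` can be seen with a multi-line string. The `lines` text below is an invented example; `re.MULTILINE` is the standard flag that switches the anchors to per-line matching.

```python
import re

# Hypothetical three-line string, used only for illustration.
lines = "first line!\nsecond line\nthird line!"

# Default: $ matches only at the end of the whole string
end_default = re.findall(r"!$", lines)                     # ['!']

# MULTILINE: $ also matches at the end of each line
end_multi = re.findall(r"!$", lines, flags=re.MULTILINE)   # ['!', '!']

# Default: ^ matches only at the start of the whole string
start_default = re.findall(r"^\w+", lines)                 # ['first']

# MULTILINE: ^ also matches at the start of each line
start_multi = re.findall(r"^\w+", lines, flags=re.MULTILINE)  # ['first', 'second', 'third']
```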

In [35]:
# [] defines a character class: match any ONE of the listed characters
# inside [], $ is a literal dollar sign, not the end anchor
# so this pattern matches any digit or '$' (there is no '$' in this text)

text = text_1

pattern = r'[\d$]'

my_match = re.findall(pattern, text)
my_match

['3', '2', '8', '2', '8', '9', '2', '8', '3', '8']
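
Inside square brackets, `$` and `^` lose their anchor meaning. A short sketch with made-up strings: `$` is always literal inside a class, and `^` negates the class only when it comes first.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "Total: $45"

# $ inside [] is a literal dollar sign
digit_or_dollar = re.findall(r"[\d$]", sample)   # ['$', '4', '5']

# a leading ^ negates the class: anything that is not a digit or '$'
everything_else = re.findall(r"[^\d$]+", sample)  # ['Total: ']

# ^ anywhere but first is a literal caret
digit_or_caret = re.findall(r"[\d^]", "2^3")      # ['2', '^', '3']
```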

In [36]:
# a leading ^ inside [] negates the class
# [^\d] matches any single character that is NOT a digit

text = text_1

pattern = r'[^\d]'

my_match = re.findall(pattern, text)
my_match

['T',
 'h',
 'e',
 ' ',
 'p',
 'h',
 'o',
 'n',
 'e',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 ' ',
 'o',
 'f',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'a',
 'g',
 'e',
 'n',
 't',
 ' ',
 'i',
 's',
 ' ',
 '-',
 '-',
 '.',
 ' ',
 'C',
 'a',
 'l',
 'l',
 ' ',
 'h',
 'e',
 'l',
 'l',
 'o',
 ' ',
 'w',
 'o',
 'r',
 'l',
 'd',
 '!',
 ' ',
 'I',
 's',
 '?']

In [41]:
# adding + groups consecutive non-digit characters into runs

text = text_1

pattern = r'[^\d]+'

my_match = re.findall(pattern, text)
my_match

['The phone number of the agent is ', '-', '-', '. Call hello world! Is?']

In [43]:
text = text_1
# runs of characters that are not '!', '.', '?' or a space (roughly, the words)
pattern = r'[^!.? ]+'

In [44]:
my_match = re.findall(pattern, text)
my_match

['The',
 'phone',
 'number',
 'of',
 'the',
 'agent',
 'is',
 '328-289-2838',
 'Call',
 'hello',
 'world',
 'Is']

In [45]:
' '.join(my_match)

'The phone number of the agent is 328-289-2838 Call hello world Is'

In [54]:
text = text_1
pattern = r"[\w]+-[\w]+-[\w]+"

In [55]:
my_match = re.findall(pattern, text)
' '.join(my_match)

'328-289-2838'

In [None]:
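# regex to search for email addresses
# note: page_two_text is not defined in this section; it is assumed to be loaded elsewhere

import re

text = page_two_text
# the dot is escaped (\.) so it matches a literal '.' rather than any character
pattern = r'[\w]+@[\w]+\.[\w]+'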

re.findall(pattern, text)
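
Here is a self-contained sketch of an email search on a made-up `sample` string (a stand-in, since `page_two_text` is not shown here). The `strict_pattern` below is one possible pattern and an assumption of this example; real email address rules are more complicated than any short regex.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Contact ann@example.com or bob_smith@mail.example.org, not ann@examplexcom."

# Local part: word characters plus '.', '+', '-'
# Domain: one label followed by one or more '.label' parts (dot escaped)
strict_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
emails = re.findall(strict_pattern, sample)
# ['ann@example.com', 'bob_smith@mail.example.org']

# With an unescaped dot, '.' can match any character,
# so a string with no dot in the domain is still accepted
loose_pattern = r"[\w]+@[\w]+.[\w]+"
false_hit = re.findall(loose_pattern, "ann@examplexcom")
# ['ann@examplexcom']

print(emails, false_hit)
```

The group in `strict_pattern` is non-capturing (`(?:...)`), so `re.findall` returns the whole address rather than just the last domain part.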