In [33]:
"dog" in "my dog i love"

True

We already know that we can search for substrings within a larger string with the in operator.

But this has severe limitations.

We need to know the exact string we're looking for, and we need to perform additional operations to account
for things like capitalization and punctuation.
So what if we only know the general structure of the string we're looking for?
That is to say, what if we know that we're looking for an email, but we don't know the exact email?
We're just looking for all the emails within a document. Or we're looking for something like a phone
number, where we know the general structure of the way a phone number is presented, but we don't
know the exact phone number.

This is where regular expressions, or regex for short, come in: they allow us to search for general patterns in text
data.

If I wanted to find all the emails within
a very large document of text, I know I'm looking for some pattern like text@text.com

So part of what we're looking for is something we don't know exactly.
That is, I know there's going to be some text for the username and some text for the domain
name, but not what that text is.

And then I have the things that I'll assume I do know.
In this case, I will assume that there's an @ in between the user and the domain.
And I will assume the domain name ends in .com

So what I can use regular expressions for is to construct a generalized pattern to search for something
like this.

Python comes with a built-in regular expressions library called re (you import it as re).
The library allows us to create specialized pattern strings and then search for matches within
text.
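Returning to the email example, here is a rough sketch of what such a generalized pattern could look like. It treats the username and the domain as runs of word characters and keeps the @ and the .com as literal text. The sample sentence and variable names are made up for illustration, and the pattern is deliberately simplistic (it ignores dots, hyphens and other domain endings).

```python
import re

# Hypothetical sample text, not from the original notebook.
sample = "contact alice@example.com or bob@test.com for details"

# \w+ means one or more word characters; \. means a literal dot.
email_pattern = r"\w+@\w+\.com"

emails = re.findall(email_pattern, sample)
print(emails)  # ['alice@example.com', 'bob@test.com']
```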

So here's a simple example of what a generalized regular expression could look like.
Let's imagine I'm looking for a phone number, and I know that it's in this format somewhere within
a very large document.

(555)-555-5555

I know it's going to have three digits in parentheses, a dash, another set of three digits, a dash, and then a set of four digits.

An example regular expression pattern could look like this.

r"\(\d\d\d\)-\d\d\d-\d\d\d\d"

So let's focus on what's happening here.
Notice the r in front of the string. It marks this as a raw string, which tells Python to leave the
backslashes alone instead of treating them as escape sequences, so they reach the regular expression library intact.
Inside the string there's a bunch of backslashes, and each backslash followed by a letter
is an individual identifier.

So these identifiers are essentially just placeholders, almost like wild cards, waiting for a match
based off a particular type of character.
And in this particular case, backslash d stands for digit.
So three of them in a row is essentially saying I'm looking for three digits in a row.
It doesn't actually care what the digits are because we don't know what the digits are yet.
We just know that they're going to be in this format.

And then you'll notice the other characters that are present are the formatting characters themselves.
I know there's going to be parentheses.
I know there's going to be a dash.

The dashes don't have a backslash associated with them because they're not an identifier
for a regular expression.
We are actually looking for that exact character.
The parentheses are different: on their own they have a special meaning in regular expressions
(they create groups, which we'll see later), so to match an actual parenthesis we put a backslash in front of it.

So again, we have these general identifiers and then the exact characters we're looking for.
And this in turn constructs a general regular expression pattern.
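Here is a minimal sketch of running a pattern for the (555)-555-5555 format. Parentheses normally mean grouping in a regular expression, so to match literal parentheses they are escaped with a backslash. The sample text and variable names are invented for illustration.

```python
import re

sample = "call me at (555)-555-5555 tomorrow"  # hypothetical text

# \( and \) match literal parentheses; \d matches any single digit.
phone_pattern = r"\(\d\d\d\)-\d\d\d-\d\d\d\d"

found = re.search(phone_pattern, sample)
print(found.group())  # (555)-555-5555

# Without the backslashes the parentheses only form a group and match no
# characters themselves, so this pattern does not fit the parenthesized format.
unescaped = re.search(r"(\d\d\d)-\d\d\d-\d\d\d\d", sample)
print(unescaped)  # None
```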

In [34]:
text = "my phone number is 4327672628"
pattern = "phone"

In [35]:
import re

In [36]:
re.search(pattern, text)

<re.Match object; span=(3, 8), match='phone'>

This match object reports not just that there was a match for phone, but also where
the match is located in the text.
So it starts at index 3 and ends just before index 8.

In [37]:
pattern = "not in text"

In [38]:
re.search(pattern,text)

We get back None (so the notebook shows no output), which means there is no match.
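Because search returns None when nothing matches, calling a method such as span() on the result would raise an error, so it is common to check the result first. A small sketch, where describe_match is a hypothetical helper name used only for this example:

```python
import re

text = "my phone number is 4327672628"

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:  # no match anywhere in the text
        return "no match"
    return f"found {match.group()!r} at {match.span()}"

print(describe_match("phone", text))        # found 'phone' at (3, 8)
print(describe_match("not in text", text))  # no match
```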

In [39]:
# resetting back
pattern = "phone"

In [40]:
match = re.search(pattern,text)

In [41]:
match

<re.Match object; span=(3, 8), match='phone'>

In [42]:
match.span()

(3, 8)

In [43]:
match.start()

3

In [44]:
match.end()

8

If we had multiple matches inside the string, search would only give us back the first match.
If I want to find multiple matches or all the matches, I can use the findall function instead.

In [45]:
text = "my phone once, my phone twice"

In [46]:
re.search(pattern,text).span() # search again on the new text: only the first match

(3, 8)

In [47]:
matches = re.findall(pattern,text)

In [48]:
matches

['phone', 'phone']

And if I want to get back the actual match objects for every match, then I use the finditer iterator.

In [49]:
len(matches)

2

In [50]:
for match in re.finditer(pattern,text):
    print(match)
    print(match.span())
    print(match.group()) # if you wanted the actual text that matched

<re.Match object; span=(3, 8), match='phone'>
(3, 8)
phone
<re.Match object; span=(18, 23), match='phone'>
(18, 23)
phone


What this does is iterate through the text and return each match object that's found.

So to recap: when we used search, we got back only the first match object,
indicating where we first matched.

If we use findall, it just returns a list of the matched strings themselves.

If I want to combine these two, iterating through the text and getting a match object for every single
match, I use finditer().
It returns an iterator, so you'll typically use a for loop with it.
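To put the three functions side by side, here is a short sketch on the same text. The variable names first, strings and spans are just for this example.

```python
import re

text = "my phone once, my phone twice"
pattern = "phone"

first = re.search(pattern, text)     # a single match object (or None)
strings = re.findall(pattern, text)  # a list of the matched strings
# finditer yields one match object per match, so we can collect every position
spans = [m.span() for m in re.finditer(pattern, text)]

print(first.span())  # (3, 8)
print(strings)       # ['phone', 'phone']
print(spans)         # [(3, 8), (18, 23)]
```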

### Character identifiers

Character identifiers are special codes that start with a backslash, followed by
a letter indicating what kind of character you're referencing.

<table>
    <tr>
    <th>character</th>
    <th>description</th>
    <th>example pattern</th>
    <th>example match</th>
    </tr>
    <tr>
        <td>\d</td>
        <td>a digit</td>   
        <td>file_\d\d</td>    
        <td>file_25</td>   
    </tr>
    <tr>
        <td>\w</td>
        <td>alphanumeric or underscore</td>
        <td>\w-\w\w\w</td>    
        <td>A-b_1</td>   
    </tr>
    <tr>
        <td>\s</td>
        <td>white space</td>   
        <td>a\sb\sc</td>    
        <td>a b c</td>   
    </tr>
    <tr>
        <td>\D</td>
        <td>non digit</td>   
        <td>\D\D\D</td>    
        <td>ABC</td>   
    </tr>
    <tr>
        <td>\W</td>
        <td>non alphanumeric (and not underscore)</td>
        <td>\W\W\W\W\W</td>    
        <td>*-+=)</td>   
    </tr>
    <tr>
        <td>\S</td>
        <td>non white space</td>   
        <td>\S\S\S\S</td>    
        <td>Yoyo</td>   
    </tr>
</table>

In [51]:
text = "my phone number is 333-456-7836"

In [52]:
phone = re.search(r"\d\d\d-\d\d\d-\d\d\d\d",text)

In [53]:
phone

<re.Match object; span=(19, 31), match='333-456-7836'>

In [54]:
phone.group() # this is how you could actually grab that phone number itself

'333-456-7836'
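To double check the character identifiers from the table above, each example pattern can be tested against its example match. This sketch uses re.fullmatch, which is not used elsewhere in this notebook; it only succeeds when the whole string fits the pattern.

```python
import re

# (pattern, example text) pairs taken from the character identifier table.
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# fullmatch returns a match object only if the entire string fits the pattern
results = [re.fullmatch(p, example) is not None for p, example in examples]
print(results)  # [True, True, True, True, True, True]
```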

For every single digit in this phone number, we wrote a \d. But what if we were looking for a pattern that included 20 digits or 100 digits?
We wouldn't want to have to write \d 20, 50 or 100 times.

For this reason we can use quantifiers to indicate repetition of the same character.
So we already know our character identifiers.
Now let's learn about the quantifiers, which save us from writing these multiple times.

<table>
    <tr>
    <th>quantifier</th>
    <th>description</th>
    <th>example pattern</th>
    <th>example match</th>
    </tr>
    <tr>
        <td>+</td>
        <td>occurs one or more times</td>
        <td>Version \w-\w+</td>
        <td>Version A-b1_1</td>
    </tr>
    <tr>
        <td>{3}</td>
        <td>occurs exactly 3 times</td>
        <td>\D{3}</td>
        <td>abc</td>
    </tr>
    <tr>
        <td>{2,4}</td>
        <td>occurs 2 to 4 times</td>
        <td>\d{2,4}</td>
        <td>123</td>
    </tr>
    <tr>
        <td>{3,}</td>
        <td>occurs 3 or more times</td>
        <td>\w{3,}</td>
        <td>anycharacter</td>
    </tr>
    <tr>
        <td>*</td>
        <td>occurs 0 or more times</td>
        <td>A*B*C*</td>
        <td>AAAACCC</td>
    </tr>
    <tr>
        <td>?</td>
        <td>once or none</td>
        <td>plurals?</td>
        <td>plural</td>
    </tr>
</table>

The way the syntax works is that you write your character identifier and then, immediately after
it, the quantifier that indicates how many times it should repeat.
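A few quick checks of how quantifiers behave, as a sketch. The sample strings are made up and chosen to show where each quantifier starts and stops matching.

```python
import re

# {2,4}: between two and four digits; a lone digit is skipped,
# and a run of five digits yields only its first four.
two_to_four = re.findall(r"\d{2,4}", "1 22 333 4444 55555")

# {3,}: three or more word characters
three_plus = re.findall(r"\w{3,}", "a an the word")

# ?: the preceding character is optional
optional_u = re.findall(r"colou?r", "color colour")

# +: one or more digits
one_or_more = re.findall(r"\d+", "a1b22c333")

print(two_to_four)  # ['22', '333', '4444', '5555']
print(three_plus)   # ['the', 'word']
print(optional_u)   # ['color', 'colour']
print(one_or_more)  # ['1', '22', '333']
```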



In [59]:
phone = re.search(r"\d{3}-\d{3}-\d{4}",text)

In [60]:
phone

<re.Match object; span=(19, 31), match='333-456-7836'>

In [61]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [63]:
phone_pattern

re.compile(r'(\d{3})-(\d{3})-(\d{4})', re.UNICODE)

In [64]:
result = re.search(phone_pattern, text)

In [65]:
result.group()

'333-456-7836'

What compile does is turn a regular expression pattern string into a reusable pattern object.
The parentheses are what matter here: each pair of parentheses marks part of the pattern as a group.
The pattern still matches the whole phone number as a single expression,
just like before.

But what's nice about the parentheses is that the match still understands that these were three separate
groupings.

So you can call the groupings individually.

Here is a little bit of a difference
from normal Python indexing: group numbering starts at one, not zero.
Group 0 (the same as calling group() with no argument) is the entire match.
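As a small sketch of group numbering: the groups come from the parentheses in the pattern, so they are available whether or not the pattern was compiled first. The groups() method, which returns all groups as a tuple, is not used elsewhere in this notebook.

```python
import re

text = "my phone number is 333-456-7836"

# A plain pattern string with three parenthesized groups, no compile step.
result = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)

print(result.group(0))  # 333-456-7836  (group 0 is the whole match)
print(result.group(2))  # 456
print(result.groups())  # ('333', '456', '7836')
```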

In [66]:
result.group(1)

'333'

In [67]:
result.group(3)

'7836'

In [68]:
result.group(4)

IndexError: no such group

If I ask for a group that is outside of this, such as group four, it's going to say sorry, there
is no such group, because here we only had three sets of parentheses: first group, second group and third
group.

That's why parentheses in the pattern, along with the group method, are so powerful.

If we want to grab subsections or subgroups of our entire pattern, we can separate each group
with parentheses, and compile the pattern if we plan to reuse it.
Then we can call group() on the search result to grab the whole match.

But if we only want to grab a subgroup, we can call group with
that group's position, where group positions start at 1.
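Where compile tends to pay off is reuse: the compiled pattern object has its own search method, so the same pattern can be applied to many texts. A sketch, with a hypothetical list of texts:

```python
import re

phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

# Hypothetical texts for illustration.
texts = [
    "my phone number is 333-456-7836",
    "no number here",
    "call 212-555-0199",
]

area_codes = []
for t in texts:
    result = phone_pattern.search(t)  # equivalent to re.search(phone_pattern, t)
    if result is not None:
        area_codes.append(result.group(1))  # first group: the first three digits

print(area_codes)  # ['333', '212']
```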

### Additional regex syntax

The OR operator.
Sometimes you're going to want to search for multiple terms.
You can use the OR operator to do this.

In [69]:
re.search(r'cat|dog', 'the cat is here')

<re.Match object; span=(4, 7), match='cat'>

The pipe operator looks like |
It stands for or, meaning you can search for cat or dog, and we'll get a match if there's a
dog there or if there's a cat there.

So that's the OR operator.
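The same pattern also works with findall, which returns whichever alternative matched at each position. A quick sketch with made-up sentences:

```python
import re

sentence = "the cat chased the dog and another cat"  # hypothetical text

animals = re.findall(r"cat|dog", sentence)
print(animals)  # ['cat', 'dog', 'cat']

# Neither alternative appears here, so search returns None.
nothing = re.search(r"cat|dog", "the bird is here")
print(nothing)  # None
```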

The wild card operator.
The wild card acts as a placeholder that will match any character placed there.

In [70]:
re.findall(r'at', 'the cat is in the hat at mat')

['at', 'at', 'at', 'at']

But what if I wanted to grab the letter in front of at as well?
In that case I can provide a period.

The period stands for a wild card,
meaning any single character (except a newline, by default) followed by at.

In [72]:
re.findall(r'.at', 'the cat is in the hat at mat also splat')

['cat', 'hat', ' at', 'mat', 'lat']

In [73]:
re.findall(r'...at', 'the cat is in the hat at mat also splat') # also going to grab other letters and other spaces

['e cat', 'e hat', 't mat', 'splat']
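Since the period also matches spaces, fixed runs of periods grab pieces of neighbouring words. One possible alternative, sketched here, is to combine the non white space identifier with the plus quantifier so that only whole words ending in at come back. The second part shows that a backslash turns the period back into a literal dot; its sample string is made up.

```python
import re

text = "the cat is in the hat at mat also splat"

# \S+ means one or more non-whitespace characters before "at".
# The standalone word "at" is not returned, because \S+ needs at least one character.
words = re.findall(r"\S+at", text)
print(words)  # ['cat', 'hat', 'mat', 'splat']

# Escaped period: only a real dot matches between the digits.
literal_dot = re.findall(r"\d\.\d", "version 3.7 or 3x7")
any_char = re.findall(r"\d.\d", "version 3.7 or 3x7")
print(literal_dot)  # ['3.7']
print(any_char)     # ['3.7', '3x7']
```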

Caret symbol (^): if I want to check whether a text starts with a number

In [74]:
re.findall(r'^\d','1 is number')

['1']

So what this checks is whether the string I'm searching through starts with a digit.
It returns a match for 1 because the text itself starts with that digit.

Keep in mind, this only applies to the start of the entire text, not to a number somewhere inside it.

In [75]:
re.findall(r'^\d','the 2 is number')

[]

I'm not going to get any matches here because it's only checking
whether the entire text itself starts with a number.

In [76]:
re.findall(r'\d$','number is 2')

['2']

Dollar sign ($): ends with

Caret sign (^): starts with
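The two anchors can also be combined to test a whole string rather than just one end of it. A sketch, where is_all_digits is a hypothetical helper name:

```python
import re

def is_all_digits(s):
    """Check that s is made up of digits from start to end."""
    # ^ anchors the start, \d+ is one or more digits, $ anchors the end.
    # (Note: $ also matches just before a trailing newline.)
    return re.search(r"^\d+$", s) is not None

print(is_all_digits("4555"))      # True
print(is_all_digits("45a55"))     # False
print(is_all_digits("number 2"))  # False
```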

In [77]:
phrase = "there are 3 numbers inside 34 and 6 and lets not forget 4555 here"
pattern = r'[^\d]'

In [78]:
re.findall(pattern, phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 'l',
 'e',
 't',
 's',
 ' ',
 'n',
 'o',
 't',
 ' ',
 'f',
 'o',
 'r',
 'g',
 'e',
 't',
 ' ',
 ' ',
 'h',
 'e',
 'r',
 'e']

What I want to do here is get back everything that isn't a number in this sentence.
So I want to exclude digits. Inside square brackets, a leading caret means exclude: [^\d] matches any single character that is not a digit.

If we want to get the words back together, I can add a plus sign after the brackets, because remember, if we look back at our quantifiers, a plus sign means occurs
one or more times.

In [79]:
pattern = r'[^\d]+'
re.findall(pattern, phrase)

['there are ', ' numbers inside ', ' and ', ' and lets not forget ', ' here']

So why would you use this exclusion syntax?

This is a really common way to get rid of punctuation from a sentence.

In [80]:
test_phrase = "this is hiii!!! let's remove, 'punctuations'. we can right?"

In [83]:
re.findall(r"[^!.'?]+", test_phrase)

['this is hiii', ' let', 's remove, ', 'punctuations', ' we can right']

And something else we can do is add a comma and a space to the excluded characters, so that the commas are removed and the text is split on spaces as well.

In [86]:
clean = re.findall(r"[^!,.'? ]+", test_phrase)

In [87]:
clean

['this',
 'is',
 'hiii',
 'let',
 's',
 'remove',
 'punctuations',
 'we',
 'can',
 'right']

In [88]:
' '.join(clean)

'this is hiii let s remove punctuations we can right'
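Another way to strip punctuation, not shown in this notebook, is re.sub, which replaces every match of a pattern with a replacement string. A sketch on the same phrase; note that the result differs slightly from the join above, because the apostrophe is deleted rather than used as a split point.

```python
import re

test_phrase = "this is hiii!!! let's remove, 'punctuations'. we can right?"

# Replace every listed punctuation character with an empty string.
no_punct = re.sub(r"[!,.'?]", "", test_phrase)
print(no_punct)  # this is hiii lets remove punctuations we can right
```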

In [91]:
text = "only find hyphen-words. like long-ish"
pattern = r'[\w]+'

In [92]:
re.findall(pattern, text)

['only', 'find', 'hyphen', 'words', 'like', 'long', 'ish']

In [95]:
pattern = r'[\w]+-[\w]+'
re.findall(pattern, text)

['hyphen-words', 'long-ish']

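The first pattern matches every run of word characters, so the hyphenated words get split in two.
Putting a literal hyphen between the two runs returns only the hyphenated words.

Parentheses also allow you to combine the or statement we saw earlier with other
pieces of text and provide multiple options: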

In [97]:
text = "this is catfish"
text2 = "this is catnap"
text3 = "this is caterpillar"

In [100]:
re.search(r'cat(fish|nap|erpillar)',text)

<re.Match object; span=(8, 15), match='catfish'>

In [101]:
re.search(r'cat(fish|nap|erpillar)',text2)

<re.Match object; span=(8, 14), match='catnap'>

In [102]:
re.search(r'cat(fish|nap|erpillar)',text3)

<re.Match object; span=(8, 19), match='caterpillar'>
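Because the options sit inside parentheses, they also form a group, so we can ask which option matched. A sketch; the fourth text is an added example showing that cat on its own does not match, since one of the options is required.

```python
import re

animal_pattern = r"cat(fish|nap|erpillar)"

texts = ["this is catfish", "this is catnap", "this is caterpillar", "this is cat"]

suffixes = []
for t in texts:
    result = re.search(animal_pattern, t)
    # group(1) is the option that matched inside the parentheses
    suffixes.append(result.group(1) if result else None)

print(suffixes)  # ['fish', 'nap', 'erpillar', None]
```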