## Getting Started with RegEx

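#### Matching a string: `re.search()` returns a match object if the string matches the regular expression and `None` if it does not, so it can be used as a True/False test

```python
import re
re.search('From:', 'From: Using the : characters')  # match object, or None if there is no match
```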
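A minimal sketch of how the result of `re.search()` can be tested. It returns a match object when the pattern is found and `None` when it is not; the sample line and the variable names are assumptions made up for illustration.

```python
import re

line = 'From: Using the : characters'  # sample line, chosen for illustration

hit = re.search('From:', line)   # pattern is present -> match object
miss = re.search('To:', line)    # pattern is absent  -> None

print(hit is not None)
print(miss is None)

# Wrapping the result in bool() gives a plain True/False.
found = bool(re.search('From:', line))
print(found)
```

In an `if` statement the match object can be used directly, since a match object is truthy and `None` is falsy.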


#### If we actually want the matching strings to be extracted, we use `re.findall()`
```python
re.findall('[0-9]+', 'My 2 favourite numbers are 19 and 42')  # list of the matching strings
```

### Curious Case of Square Brackets:

Everything inside a pair of square brackets matches a single character.

The brackets hold a set of allowed characters:
`[AEIOU]` matches any one uppercase vowel,
and `[0-9]` matches any one digit.
So, bracket zero dash nine bracket is a single digit.
Adding a plus, as in `[0-9]+`, says one or more digits.


```python
re.findall( <some expr here>, <string> )

```
What it does is it runs
all the way through the texts that you've asked it to look for,
checking to see when this matches,
and it gives us back a list of the matches.
So, it extracts out the pieces. 


`re.findall()` gives back a list of all the matches. If nothing matches, the list is empty. If the pattern uses an asterisk, which also allows zero repetitions, the list can contain empty strings.
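The difference between `+` and `*` in `re.findall()` can be sketched as follows. The sample strings and variable names are assumptions chosen for illustration.

```python
import re

text = 'a1b22'  # hypothetical sample string

# '+' needs at least one digit, so only real runs of digits are returned.
one_or_more = re.findall('[0-9]+', text)
print(one_or_more)   # ['1', '22']

# '*' also accepts zero digits, so empty strings show up
# at positions where no digit is present.
zero_or_more = re.findall('[0-9]*', text)
print(zero_or_more)

# With no match at all, findall returns an empty list.
no_match = re.findall('[0-9]+', 'no digits here')
print(no_match)      # []
```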

In [None]:
import re
x = 'My 2 favourite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
y

In [2]:
z = re.findall('[AEIOU]+',x)
z

[]
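The empty list is expected: `x` contains no uppercase vowels, and character sets are case-sensitive. A small sketch of how the set could be widened, with variable names chosen only for illustration:

```python
import re

x = 'My 2 favourite numbers are 19 and 42'

# Uppercase vowels only: nothing in x qualifies.
upper = re.findall('[AEIOU]+', x)
print(upper)   # []

# Lowercase vowels: each run of adjacent vowels is one match.
lower = re.findall('[aeiou]+', x)
print(lower)   # ['a', 'ou', 'i', 'e', 'u', 'e', 'a', 'e', 'a']

# Listing both cases in the set matches either.
either = re.findall('[AEIOUaeiou]+', x)
print(either == lower)   # True, because x has no uppercase vowels
```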

### Warning: Greedy Matching

The repeat characters (`*` and `+`) push outward in both directions (greedy) to match the largest possible string.

In [3]:
import re
x = 'From: Using the : characters'
y = re.findall('^F.+:', x)
y

['From: Using the :']

What if we wanted only the string 'From:', and not everything else that comes along with it?

#### Non-Greedy Matching

Not all regular expression repeat codes are greedy! If you add a `?` character after them, the `+` and `*` stop being pushy (i.e. they match the shortest possible string instead of the largest).

In [4]:
### Example of non-greedy matching

## First character in the match is an F
## .+? matches one or more characters, as few as possible
## Last character in the match is a colon (:)
z = re.findall('F.+?:', x)
z

['From:']

In [5]:
### Extracting an email address in a greedy and a non-greedy way
example = 'From nprithviraj24@gmail.com to asdf246'
greedy = re.findall(r'\S+@\S+', example); print(greedy)
non_greedy = re.findall(r'[\S]+?@[\S]+?', example); print(non_greedy)

['nprithviraj24@gmail.com']
['nprithviraj24@g']


### Fine Tuning String Extraction

Parentheses are not part of the match, but they tell where to start and stop what string to extract. The following examples show this.



In [6]:
fullmail = re.findall(r'^From (\S+@\S+)', example)
fullmail

['nprithviraj24@gmail.com']

In [7]:
re.findall(r'@(\S+@\S+)', example)

[]

```
'^From (\S+@\S+)'
```


What I'm doing here is saying:
start extracting after the space.
`^From ` is part of the match, but the extracted part starts at the open parenthesis, `(\S+ ...`, and ends at the closing parenthesis.


So, this says which part I
want to extract, even though I demand `^From ` to match.
I'm extracting less than what I'm matching.
I'm using the matching to be very precise about the lines I want,
and then I'm using the parentheses that I add to pull out what I want.
So, here I get back exactly the email address.
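To make the difference between matching and extracting concrete, the sketch below runs the same pattern with and without parentheses on the `example` string used above. The variable names are chosen only for illustration.

```python
import re

example = 'From nprithviraj24@gmail.com to asdf246'

# Without parentheses, findall returns everything that matched,
# including the leading 'From '.
whole_match = re.findall(r'^From \S+@\S+', example)
print(whole_match)   # ['From nprithviraj24@gmail.com']

# With parentheses, the same text must match,
# but only the parenthesized part is returned.
group_only = re.findall(r'^From (\S+@\S+)', example)
print(group_only)    # ['nprithviraj24@gmail.com']
```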

In [8]:
mail = re.findall('@([^ ]*)', example); mail
### Match a non-blank character - `[^ ]`
### Match many of them - `*`

['gmail.com']

```
'@([^ ]*)'
```
Find me an `@` sign,
followed by some number of non-blank characters.
I don't want to extract the at sign
(see where I put the parentheses);
I want to start extracting after the at sign
and continue to the end of those non-blank characters.

Also `[^ ]*`

Match a non-blank character: that's what the brackets do.
This is another bracket syntax.
It still matches a single character, but if the first character
of the set inside the brackets is a caret (`^`),
that means not, i.e. everything but.
So, `[^ ]` means everything but a space, that is, non-blank.
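A short sketch of negated sets in isolation. The sample strings and variable names are assumptions made up for illustration.

```python
import re

# [^ ] matches any single character except a space,
# so [^ ]+ picks out runs of non-blank characters.
words = re.findall('[^ ]+', 'user@host.example rest')
print(words)        # ['user@host.example', 'rest']

# The same idea works with any set: [^0-9] is "anything but a digit".
non_digits = re.findall('[^0-9]+', 'abc123def')
print(non_digits)   # ['abc', 'def']
```

Note that the caret only means "not" as the first character inside the brackets; at the start of a pattern, as in `^From`, it anchors the match to the beginning of the line.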

### Fine tune 

```
'^From .*@([^ ]*)'
```

We can fine tune this by saying: I want the line to start with `^From`,
then I want a space,
then any number of characters, ` .*`, up to an `@`, i.e. `^From .*@`,
then I want to begin extracting all the non-blank characters, `([^ ]*`,
and then end extracting, i.e. `([^ ]*)` (closed parenthesis).

Important NOTE:

So, if you didn't have a `From .... ` line,
you would get nothing back.
You're not finding email addresses in the middle of text;
you're just finding email addresses on lines that start with `From` followed by a space.
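The behaviour described in the note can be checked with a small sketch. The two sample lines and the variable names are hypothetical, made up for illustration.

```python
import re

# Hypothetical sample lines: only the first starts with 'From '.
lines = [
    'From stephen@example.org Sat Jan  5 09:14:16 2008',
    'Reply to someone@example.com please',
]

# Apply the fine-tuned pattern to each line and collect the results.
results = [re.findall('^From .*@([^ ]*)', line) for line in lines]

print(results[0])   # ['example.org']
print(results[1])   # []  (the address is there, but the line does not start with 'From ')
```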

#### Escape Character

If we want a special regular expression character to just behave normally, we prefix it with a backslash, `\`.

In [9]:
dollar = ' Ten dollar is $10'
y = re.findall(r'\$[0-9]+', dollar)
y

['$10']
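The same applies to other special characters. As a sketch, the dot normally matches any character, and escaping it restricts the match to a literal dot. The sample string and variable names are assumptions chosen for illustration; the patterns use raw strings (`r'...'`) so that Python passes the backslash through to the regular expression unchanged.

```python
import re

text = 'versions 3.14 and 3x14'  # hypothetical sample string

# Unescaped dot: matches any single character, so both candidates match.
any_char = re.findall(r'3.14', text)
print(any_char)      # ['3.14', '3x14']

# Escaped dot: matches only a literal '.'.
literal_dot = re.findall(r'3\.14', text)
print(literal_dot)   # ['3.14']
```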