# Regular expressions


---



A regular expression is a pattern that is matched against a text in order to find strings of a particular form, such as web links, email addresses, telephone numbers, or dates.

The Python **re** library allows us to use regular expressions in a simple way.

Below, we study some of the most useful functions of this library:

- The **search** function takes the pattern as its first argument and the text as its second argument, and returns a match object for the first location where the regular expression produces a match.

- The **findall** function returns the list of all non-overlapping matches.

- The **match** function only checks for a match at the beginning of the string.

The **search** and **match** functions return None if there is no match, while **findall** returns an empty list.

In [22]:
import re
pattern=r'is'
text='Candela is very smart'
print('search:',re.search(pattern,text))
print('findall:',re.findall(pattern,text))
print('match:',re.match(pattern,text))

print('match:',re.match(r'Candela',text))



search: <_sre.SRE_Match object; span=(8, 10), match='is'>
findall: ['is']
match: None
match: <_sre.SRE_Match object; span=(0, 7), match='Candela'>
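
The match object shown in the output above can be queried for the matched text and its position. The following is a minimal sketch, reusing the same text as the previous cell. The None check is there because using the result of a failed search directly would raise an error.

```python
import re

text = 'Candela is very smart'
m = re.search(r'is', text)

# search returns None when nothing matches, so check before using the result
if m is not None:
    print(m.group())                 # the matched text: is
    print(m.span())                  # (start, end) positions: (8, 10)
    print(text[m.start():m.end()])   # slicing the text with those positions: is
```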


If the pattern will be used several times in the same program, it can be more efficient to save it as an object. 
To do this, use the **compile** function, which compiles the pattern into a regular expression object. 


In [23]:
pattern = re.compile(r"\d+")
text = "My number is 666666666 or 999999999"
#return the first occurrence
print(pattern.search(text).group())

666666666
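
As an illustration of reuse, a compiled pattern can be applied to several texts without being written out again. This is only a sketch: the sample lines in `lines` are made up for the example and do not come from the cell above.

```python
import re

# compile the pattern once, then reuse the resulting object
digits = re.compile(r"\d+")

# hypothetical sample lines, invented for this illustration
lines = [
    "My number is 666666666 or 999999999",
    "No digits here",
    "Room 12, floor 3",
]

for line in lines:
    # a compiled pattern offers the same search/findall/match methods
    print(digits.findall(line))
```

Calling `.group()` directly on the result of `search`, as in the cell above, fails when there is no match, so `findall` or an explicit None check is safer for texts that may not contain the pattern.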


## Creating patterns


You can use special characters to build general patterns:

| Special pattern| Meaning |
|------|------|
|  .  | matches any character except a newline (*)|
|  \w  | matches any single letter, digit or underscore|
|  \W  | matches any character not matched by \w|
|  \d  | matches any decimal digit (0-9)|
| [abc]  | matches a or b or c|
| [a-zA-Z0-9]  | matches any letter (lowercase or uppercase) or any digit (0-9)|
| \s | matches a single whitespace character such as space, newline, tab or carriage return|
|\t | matches a tab|
|\n | matches a new line|
|\r | matches a carriage return|
|^|  anchors the pattern at the start of the string|
|$|  anchors the pattern at the end of the string|

(*) Note: To match a literal '.' character in a pattern, it must be preceded by the character '\', that is, written as '\\.'
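
A short sketch of some of these special characters in action. The sample string and variable names are invented for the illustration.

```python
import re

# hypothetical sample text, chosen only to exercise the patterns
text = "Version 3.10 was released.\nIt is fast"

words = re.findall(r'\w+', text)          # runs of letters, digits or underscores
version = re.findall(r'\d\.\d+', text)    # digit, literal dot, one or more digits
starts = re.search(r'^Version', text) is not None   # anchored at the start
ends = re.search(r'fast$', text) is not None        # anchored at the end

# an unescaped '.' matches any character, an escaped one only a real dot
loose = re.findall(r'3.10', '3.10 3x10')
strict = re.findall(r'3\.10', '3.10 3x10')

print(words)
print(version)
print(starts, ends)
print(loose, strict)
```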

The following special characters allow us to handle repetitions in the patterns:


|Character| Meaning |
|------|------|
|*|0 or more repetitions of the element to its left|
|+|1 or more repetitions of the element to its left|
|?|0 or 1 repetition of the element to its left|

You can also specify the number of repetitions explicitly. For example, the following patterns:
- **\d{9}** indicates that the digit must be repeated exactly 9 times. 
- **\d{6,9}** indicates that the digit must be repeated at least 6 times but not more than 9 times. 
- **\d{6,}** indicates that the digit must be repeated at least 6 times. 
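
The repetition operators can be checked with a small experiment. This sketch uses **fullmatch** (a function of the **re** library not covered above, which requires the whole string to match) so that the counts are tested exactly. The sample strings are invented for the illustration.

```python
import re

# *, + and ? applied to the element on their left
star = re.findall(r'ab*', 'a ab abbb')            # 'a' followed by 0 or more 'b'
plus = re.findall(r'ab+', 'a ab abbb')            # 'a' followed by 1 or more 'b'
optional = re.findall(r'colou?r', 'color colour') # the 'u' may appear 0 or 1 times
print(star, plus, optional)

# explicit repetition counts, tested against whole strings
candidates = ['12345', '123456', '123456789', '1234567890']
results = {}
for s in candidates:
    results[s] = (
        bool(re.fullmatch(r'\d{9}', s)),     # exactly 9 digits
        bool(re.fullmatch(r'\d{6,9}', s)),   # between 6 and 9 digits
        bool(re.fullmatch(r'\d{6,}', s)),    # 6 digits or more
    )
    print(s, results[s])
```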




## Example

Create a pattern to find email addresses in a text: 

In [24]:
text = 'If you have any doubt, please send me an email to: isegura@inf.uc3m.es or isegurabe@gmail.com'

emails=re.findall(r'[\w\.-]+@[\w\.-]+',text)
for x in emails:
  print(x)

isegura@inf.uc3m.es
isegurabe@gmail.com


## Exercise
When creating an email account of the form 'username@domain.dom', the following rules can be used to avoid creating invalid email addresses. The 'username' part of the email address should follow these rules:

- Use only alphanumeric characters. That is, use only “A” through “Z” and “0 (zero)” through “9”.
- Do not use the following characters: < > ( ) [ ] ; : , @ \
- As long as they are not the first character in the e-mail address, hyphens ( - ), underscores ( _ ), periods ( . ), and numeric characters (“0” through “9”) are acceptable characters to use within the address.

For example:
1. isegura-teacher@domain.com is a correct email. However, _isegura_teacher@domain.com is an incorrect email because it starts with a non-standard character.
2. isegura.teacher@domain.com is a correct email. However, .isegura.teacher@domain.com is an incorrect email, because it starts with a non-standard character.
3. isegura_teacher@domain.com is a correct email. However, isegura teacher@domain.com is an incorrect email, because it contains a whitespace.

Modify the previous code so that it follows these rules.

In [0]:
#Include your code 

## Using regular expressions to replace text

The **re** library can also be used to replace text. In particular, the **sub** function returns the string obtained by replacing every non-overlapping occurrence of the pattern in the text, scanning from left to right, with a new text. 

It takes three arguments: the pattern, the new text that will replace the text found by the pattern, and the whole text. If the pattern is not found, it returns the original text unchanged. 



In [21]:
pattern=r'\d{3}-\d{3}-\d{3}'
text='My telephone number is 666-123-458'
result=re.sub(pattern,'XXX-XXX-XXX',text)
print(result)

My telephone number is XXX-XXX-XXX
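
A minimal sketch of how **sub** behaves when the text contains several matches or none. The second phone number and the text without a number are made up for the example. The optional `count` argument of `re.sub` limits how many occurrences are replaced.

```python
import re

pattern = r'\d{3}-\d{3}-\d{3}'
# hypothetical text with two phone numbers
text = 'Call 666-123-458 or 777-321-654'

all_replaced = re.sub(pattern, 'XXX-XXX-XXX', text)             # every occurrence
first_only = re.sub(pattern, 'XXX-XXX-XXX', text, count=1)      # only the first one
unchanged = re.sub(pattern, 'XXX-XXX-XXX', 'No phone here')     # pattern not found

print(all_replaced)
print(first_only)
print(unchanged)
```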
