# Regular Expressions (RegEx)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

![regex](https://miro.medium.com/max/1200/1*ZVlIZ1ZYC6rASz-dYPzhZQ.jpeg)

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

For most cases the standard library module is enough: **import re**. If you need the third-party `regex` module instead, **pip3 install regex** should install it.

In [4]:
import re # this is the default way to import it
import numpy as np
import pandas as pd

## Syntax
### Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
- NOTE: `re.M` (`re.MULTILINE`) enables multiline mode, in which `^` and `$` also match at the start and end of each line.

https://docs.python.org/3/library/re.html#re-syntax
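
As a quick illustration of the special characters listed above, here is a minimal sketch on a made-up three-line string. The sample text and variable names are illustrative and not part of the original notebook.

```python
import re

# Made-up sample text with three lines
text = "cat\ncot\ncoat"

# `.` matches any single character except a newline
dot_matches = re.findall(r"c.t", text)

# `?` makes the preceding `a` optional
optional_matches = re.findall(r"coa?t", text)

# `*` allows zero or more `o`, `+` requires at least one
star_matches = re.findall(r"co*t", "ct cot coot")
plus_matches = re.findall(r"co+t", "ct cot coot")

# Without re.M, `^` only matches at the very start of the string
start_only = re.findall(r"^c\w+", text)

# With re.M, `^` also matches right after every newline
every_line = re.findall(r"^c\w+", text, flags=re.M)

print(dot_matches)       # ['cat', 'cot']
print(optional_matches)  # ['cot', 'coat']
print(star_matches)      # ['ct', 'cot', 'coot']
print(plus_matches)      # ['cot', 'coot']
print(start_only)        # ['cat']
print(every_line)        # ['cat', 'cot', 'coat']
```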



### Special Sequences:

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]`
- **Wildcards** `.`
- **Escape special characters** `\` (`?`, `*`, `+`, `^`, `$`)
- **Ranges** `[a-d]`, `[1-9]`

- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`

**\w** - Matches any word character: letters, digits, and the underscore `_`. Equivalent to `[a-zA-Z0-9_]` for ASCII text; for `str` patterns Python also matches Unicode letters and digits unless `re.ASCII` is set.

**\d** - Matches any digit. Equivalent to `[0-9]` for ASCII text.

**\s** - Matches any whitespace character. Equivalent to `[ \t\n\r\f\v]` for ASCII text.

**\W** - Matches any non-word character. Equivalent to `[^a-zA-Z0-9_]` for ASCII text.

**\D** - Matches any non-digit. Equivalent to `[^0-9]` for ASCII text.

**\S** - Matches any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]` for ASCII text.
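
The character classes above can be tried on a small made-up string. This is only a sketch with illustrative names. It also shows one caveat: with `str` patterns, `\w` matches non-ASCII letters unless the `re.ASCII` flag is passed.

```python
import re

# Made-up sample containing letters, digits, an underscore, spaces and a tab
sample = "Room 42, floor_3\t(open)"

words = re.findall(r"\w+", sample)     # runs of word characters
digits = re.findall(r"\d+", sample)    # runs of digits
spaces = re.findall(r"\s", sample)     # single whitespace characters
non_words = re.findall(r"\W", sample)  # everything that is not a word character

print(words)      # ['Room', '42', 'floor_3', 'open']
print(digits)     # ['42', '3']
print(spaces)     # [' ', ' ', '\t']
print(non_words)  # [' ', ',', ' ', '\t', '(', ')']

# "a\u00f1o" is the Spanish word for "year", written with an escape for the n-tilde
unicode_word = re.findall(r"\w+", "a\u00f1o")
ascii_word = re.findall(r"\w+", "a\u00f1o", flags=re.ASCII)

print(len(unicode_word))  # 1 (the whole word is a single match)
print(ascii_word)         # ['a', 'o']
```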



### Methods

### re.sub(pattern, repl, string, count=0)
Replaces one or many matches with a string

In [6]:
txt = "gabriel, Dio & Clara are TA's??"

In [8]:
#re.sub
#Literals
re.sub('g','G',txt)

"Gabriel, Dio & Clara are TA's??"

In [9]:
#Ranges
re.sub('[A-Z]','',txt)

"gabriel, io & lara are 's??"

In [10]:
#Escape special character, quantifiers
re.sub(r'\?{2}','.',txt)

"gabriel, Dio & Clara are TA's."

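The `count` argument in the signature is not exercised in the cells above. Here is a minimal sketch of it, together with group backreferences in `repl`, on the same sample string. The variable names are illustrative.

```python
import re

txt = "gabriel, Dio & Clara are TA's??"

# count=1 stops after the first replacement, so only the `D` is removed
first_only = re.sub(r"[A-Z]", "", txt, count=1)

# \1 and \2 in the replacement refer to the first and second captured groups
swapped = re.sub(r"(\w+) & (\w+)", r"\2 & \1", txt)

print(first_only)  # gabriel, io & Clara are TA's??
print(swapped)     # gabriel, Clara & Dio are TA's??
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.
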
### re.search(pattern, string, flags=0)
Scans through a string, looking for any location where this RE matches. If the search is successful, `re.search()` returns a match object. Otherwise, it returns `None`.

In [11]:
#re.search
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt) 
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [12]:
print(x)

<re.Match object; span=(0, 17), match='The rain in Spain'>


In [13]:
txt = "The rain in Spain"
#\b whole words only
x = re.search(r"\bS\w+", txt)
print(x)
print(x.span())
#returns a tuple containing the start-, and end positions of the match
print(x.start())
#contains the start position of the match
print(x.end())
#contains the end position of the match
print(x.string)
#print the string passed into the function (variable 'txt')
print(x.group())
#Print the part of the string where there was a match

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain in Spain
Spain


In [14]:
print(re.search(r'r\w*', txt))
print(re.search(r'R\w*', txt))
print(re.search(r'^T\w*', txt))
print(re.search(r'^t\w*', txt))

<re.Match object; span=(4, 8), match='rain'>
None
<re.Match object; span=(0, 3), match='The'>
None
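
The `flags` argument in the signature can make the failing uppercase search succeed. A minimal sketch, assuming case-insensitive matching is what you want. It also guards against the `None` result before calling `.group()`.

```python
import re

txt = "The rain in Spain"

# re.IGNORECASE lets `R` match the lowercase `r` in "rain"
match = re.search(r"R\w*", txt, flags=re.IGNORECASE)

# re.search returns None when nothing matches, so check before calling .group()
word = match.group() if match else None
missing = re.search(r"Z\w*", txt)

print(word)     # rain
print(missing)  # None
```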


### re.match(pattern, string)
Determine if the RE matches at the beginning of the string.

In [15]:
# re.match only reports a match when the pattern is found at the very beginning of the string

In [16]:
#re.match
pattern = r"Cookie"
sequence = "I want a Cookie"
sequence2= "Cookie, I want you!"
if re.match(pattern, sequence2):
    print("Match!")
else: 
    print("Not a match!")

Match!


In [17]:
txt = "The rain in Spain"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))
print(re.match(r'^T\w*', txt))
print(re.match(r'T\w*', txt))

None
None
<re.Match object; span=(0, 3), match='The'>
<re.Match object; span=(0, 3), match='The'>


In [18]:
# Every pair of parentheses in the pattern creates a group that can be selected on its own
email_address = 'Please contact us at: support@thebridge.com'
match = re.search(r'(\w+)@([\w\.]+)', email_address)
if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@thebridge.com
support
thebridge.com
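
Groups can also be given names, which tends to read better than numeric indexes. A minimal sketch on the same sample string. The group names `user` and `host` are made up for this example.

```python
import re

email_address = 'Please contact us at: support@thebridge.com'

# (?P<name>...) defines a named group
match = re.search(r'(?P<user>\w+)@(?P<host>[\w.]+)', email_address)

if match:
    print(match.group('user'))  # support
    print(match.group('host'))  # thebridge.com
    print(match.groups())       # ('support', 'thebridge.com')
    print(match.groupdict())    # {'user': 'support', 'host': 'thebridge.com'}
```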


### re.fullmatch(pattern, string)

In [19]:
class_names = ["Andrea", "Estela", "Anais", "Xeles", "Maria", "Mar"]
for name in class_names:
    if re.fullmatch("Maria", name):
        print(f"{name} is desired name")
    else:
        print(f"{name} is not desired name")

Andrea is not desired name
Estela is not desired name
Anais is not desired name
Xeles is not desired name
Maria is desired name
Mar is not desired name
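
To make the difference between the three functions concrete, here is a small sketch that runs `re.match`, `re.search` and `re.fullmatch` with the same pattern. The names in the list are made up for illustration.

```python
import re

pattern = "Maria"
comparison = {}
for name in ["Maria", "Mariana", "Ana Maria"]:
    comparison[name] = (
        bool(re.match(pattern, name)),      # must match at the start
        bool(re.search(pattern, name)),     # may match anywhere
        bool(re.fullmatch(pattern, name)),  # must match the whole string
    )

for name, result in comparison.items():
    print(name, result)
# Maria (True, True, True)
# Mariana (True, True, False)
# Ana Maria (False, True, False)
```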


### re.findall(pattern, string)
Finds all substrings where the RE matches and returns them as a list.

In [20]:
#re.findall
email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

#'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.]+@[\w\.-]+', email_address)
addresses

['support.data@data-science.com', 'xyz@thebridge.com']

In [21]:
print(re.findall(r'[^aeiou\s]',email_address))
print(re.findall(r'\sc\w*',email_address))
print(re.findall(r'^P\w*',email_address))

['P', 'l', 's', 'c', 'n', 't', 'c', 't', 's', 't', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 's', 'c', 'n', 'c', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 't', 'h', 'b', 'r', 'd', 'g', '.', 'c', 'm']
[' contact']
['Please']
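
Two related behaviours are worth a quick sketch: when the pattern contains groups, `re.findall` returns the groups rather than the whole match, and `re.finditer` yields match objects when positions are needed. The variable names are illustrative.

```python
import re

email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

# With two groups, findall returns (user, host) tuples instead of whole matches
pairs = re.findall(r'([\w.]+)@([\w.-]+)', email_address)
print(pairs)  # [('support.data', 'data-science.com'), ('xyz', 'thebridge.com')]

# finditer yields match objects, so the span of each match is available too
located = [(m.group(), m.span()) for m in re.finditer(r'[\w.]+@[\w.-]+', email_address)]
for text, (start, end) in located:
    print(text, start, end)
```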


In [22]:
with open('info.txt', 'r') as file:
    client_info = file.read()

In [23]:
emails_clients=re.findall(r"[\w.]+@+[\w.]+", client_info)
print(emails_clients[:5])

['sit.amet.metus@egestasnunc.ca', 'arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk', 'Nulla.eu.neque@Fuscealiquetmagna.com', 'Nullam.velit@non.ca', 'In@gravidamolestiearcu.co.uk']


In [24]:
client_numbers = re.findall(r"[0-9]{2}-\d{3}-\d{3}-\d{3}", client_info)

In [25]:
print(client_numbers)

['34-739-941-941', '34-278-870-242', '34-999-876-292', '34-345-887-949', '34-905-089-682', '34-773-463-479', '34-017-915-525', '34-274-204-840', '34-575-459-881', '34-249-358-256', '34-299-478-659', '34-094-099-748', '34-236-498-114', '34-541-455-803', '34-274-768-546', '34-850-484-655', '34-193-830-599', '34-768-704-320', '34-960-058-312', '34-835-461-291', '34-524-499-405', '34-553-655-405', '34-193-752-726', '34-165-726-657', '34-172-146-895', '34-309-707-078', '34-289-368-945', '34-432-424-781', '34-880-153-396', '34-876-903-767', '34-508-574-378', '34-219-498-365', '34-413-279-781', '34-789-736-506', '34-701-997-370', '34-912-146-256', '34-550-871-297', '34-818-230-259', '34-707-183-700', '34-006-975-807', '34-975-336-347', '34-208-023-425', '34-810-185-675', '34-318-817-026', '34-229-857-982', '34-415-168-417', '34-595-803-021', '34-620-827-404', '34-711-768-527', '34-159-329-878', '34-619-616-824', '34-865-861-872', '34-294-644-638', '34-439-853-222', '34-215-852-041', '34-425-5

### re.split(pattern, string, maxsplit=0)
Returns a list where the string has been split at each match

In [26]:
#re.split
sente = "Hello,\n Please, contact me the sooner.\n Thank you,\n Me"

In [27]:
reg = re.split("\n", sente)
reg

['Hello,', ' Please, contact me the sooner.', ' Thank you,', ' Me']

In [28]:
"".join(reg)

'Hello, Please, contact me the sooner. Thank you, Me'

In [29]:
client_list = re.split(r"(?<=[0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:5])

['Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941', ' Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. 34-278-870-242', ' Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. 34-999-876-292', ' Scarlett Ortiz Nullam.velit@non.ca 6082 Massa Road 34-345-887-949', ' Ocean Bell In@gravidamolestiearcu.co.uk P.O. Box 370, 440 Suspendisse Rd. 34-905-089-682']
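
The `maxsplit` argument from the signature, and the effect of a capturing group in the pattern, can be sketched on a short made-up string. The sample data and names are illustrative.

```python
import re

# Made-up row with inconsistent delimiters
row = "a, b;c,   d"

# Split on a comma or semicolon followed by optional whitespace
parts = re.split(r"[,;]\s*", row)

# maxsplit=1 performs only the first split and keeps the rest intact
first_cut = re.split(r"[,;]\s*", row, maxsplit=1)

# A capturing group keeps the delimiters in the result
with_delims = re.split(r"([,;])\s*", row)

print(parts)        # ['a', 'b', 'c', 'd']
print(first_cut)    # ['a', 'b;c,   d']
print(with_delims)  # ['a', ',', 'b', ';', 'c', ',', 'd']
```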


### re.compile(pattern, flags=0)
Compiles a RE into a regular expression object.

In [30]:
name_check = re.compile(r"[^A-Za-z ]")

In [31]:
name = input("Please insert your name:")
while name_check.search(name):
    # keep looping while the name contains a character that is not a letter or a space
    print("Please enter your name correctly!")
    name = input("Please insert your name:")
print("Finally mate, I thought you'd never do it")

Please enter your name correctly!
Finally mate, I thought you'd never do it
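
The same compiled pattern can be reused without the interactive loop. A minimal sketch, where `is_valid_name` is a hypothetical helper introduced only for this example:

```python
import re

# Compile once, reuse for every name
name_check = re.compile(r"[^A-Za-z ]")

def is_valid_name(name):
    """Return True when the name contains only ASCII letters and spaces."""
    return name_check.search(name) is None

candidates = ["Ada Lovelace", "R2D2", "Grace_Hopper"]
results = {name: is_valid_name(name) for name in candidates}
print(results)  # {'Ada Lovelace': True, 'R2D2': False, 'Grace_Hopper': False}
```

A compiled pattern object offers the same methods as the module-level functions (`search`, `match`, `fullmatch`, `findall`, `split`, `sub`), without passing the pattern each time.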


-----------------------------------------------------------------------------------------------------------


## Some practice 
Now it's your turn.

#### Simple validation of a username
https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023
    
Write a simple regex to validate a username. Allowed characters are:

- lowercase letters,
- numbers,
- underscore

Length should be between 4 and 16 characters (both included).


In [32]:
# your solution

In [33]:
def validate_usr(username):
    #your code here
    pass
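
If you get stuck, here is one possible solution, offered as a sketch rather than the reference answer for the kata. It assumes the rules stated above (lowercase letters, digits, underscore, length 4 to 16) and uses `fullmatch` so that the whole string must comply. `USERNAME_RE` is a name chosen for this example.

```python
import re

# Lowercase letters, digits or underscore, between 4 and 16 characters
USERNAME_RE = re.compile(r"[a-z0-9_]{4,16}")

def validate_usr(username):
    """Return True when the whole username satisfies the rules."""
    return USERNAME_RE.fullmatch(username) is not None

print(validate_usr("asd43_34"))  # True
print(validate_usr("Hass"))      # False (uppercase letter)
print(validate_usr("a"))         # False (too short)
```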

#### Regex validate PIN code 

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132

ATM machines allow 4 or 6 digit PIN codes and PIN codes cannot contain anything but exactly 4 digits or exactly 6 digits.

If the function is passed a valid PIN string, return `True`, else return `False`.

Examples:
```python
"1234"   -->  True
"12345"  -->  False
"a234"   -->  False
```

In [34]:
# your solution
pin = input('Enter your pin')
# matches any character that is not a digit
pin_check = re.compile(r'[^0-9]')
while pin_check.search(pin):
    print("Don't be silly, enter digits only...")
    pin = input('Enter your pin')
print("That's better")

In [39]:
import re

def validate_pin(pin):
    # valid PINs are exactly 4 or exactly 6 digits
    pin_check = re.compile(r"^[0-9]{4}$|^[0-9]{6}$")
    while not pin_check.search(pin):
        print("'Incorrect PIN'")
        print(pin)
        pin = input('Give me a PIN')
    print("Correct PIN")


validate_pin(pin=input("Insert a number"))

'Incorrect PIN'
55
Correct PIN
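
The kata asks for a function that returns `True` or `False` rather than a prompt loop. A possible sketch of that version follows. `is_valid_pin` is a name made up for this example, and `fullmatch` is used because `$` also matches just before a trailing newline.

```python
import re

# Exactly 4 digits or exactly 6 digits, nothing else
PIN_RE = re.compile(r"[0-9]{4}|[0-9]{6}")

def is_valid_pin(pin):
    """Return True when the whole string is a 4-digit or 6-digit PIN."""
    return PIN_RE.fullmatch(pin) is not None

print(is_valid_pin("1234"))    # True
print(is_valid_pin("12345"))   # False
print(is_valid_pin("a234"))    # False
print(is_valid_pin("1234\n"))  # False (trailing newline is rejected)
```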


-----------------------------------------------------------------------------------------------------------

# The FBI challenge

- https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

It's your first day at the FBI office and your boss has just sent you a `txt` file, `emails.txt`. She has asked you to run some analysis, but first you need to build a dataframe like the following. You'll need some Python knowledge and some regex to get there.

In [68]:
df.head()

Unnamed: 0,sender_email,sender_name,date_sent,time_sent,subject
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",31 Oct 2002,02:38:20,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",31 Oct 2002,05:10:00,URGENT ASSISTANCE /RELATIONSHIP (P)
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,22:17:55,GOOD DAY TO YOU
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,22:44:20,GOOD DAY TO YOU
4,m_abacha03@www.com,"""Maryam Abacha""",1 Nov 2002,01:45:04,I Need Your Assistance.


---------------------------------------------------------------------------------------------------------------------------

#### Since we are good people, here is a proposed solution

In [51]:
emails_info={}

In [52]:
with open("emails.txt", "r") as file:
    fh = file.read()

In [53]:
fh.count("From r")

3977

In [54]:
contents = re.split(r"From r", fh)

In [55]:
contents[0]

''

In [56]:
contents.pop(0)

''

The fields to extract from each email are:

- sender_email
- sender_name
- date_sent
- time_sent
- subject

### Info Sender

In [57]:
info_sender=[]
for e in contents:
    # keep the whole "From:" line, or a placeholder when the header is missing
    found = re.search("From:.*", e)
    info_sender.append(found.group() if found else "not found")

In [58]:
len(info_sender)

3977

In [59]:
#sender_email
emails_info['sender_email']=[]
for line in info_sender:
    res=re.findall(r'[\w\.]+@[\w\.-]+', line)
    if res:
        emails_info['sender_email'].append(res[0])
    else:
        emails_info['sender_email'].append(np.nan)
        
len(emails_info['sender_email'])

3977

In [60]:
#sender name
emails_info['sender_name']=[]
for line in info_sender:
    res=re.findall(r':.*<', line)
    if res:
        emails_info['sender_name'].append(res[0][1:-1])
    else:
        emails_info['sender_name'].append(np.nan)
len(emails_info['sender_name'])

3977

### Info Dates

In [61]:
#DATES
dates=[]
for e in contents:
    # keep the whole "Date:" line, or a placeholder when the header is missing
    found = re.search("Date:.*", e)
    dates.append(found.group() if found else "not found")
len(dates)

3977

In [62]:
#email date
emails_info['date_sent']=[]
for dat in dates:
    res=re.findall(r"\d+\s\w{3}\s\d+", dat)
    if res:
        emails_info['date_sent'].append(res[0])
    else:
        emails_info['date_sent'].append(np.nan)

len(emails_info['date_sent'])

3977

In [63]:
emails_info['time_sent']=[]
for dat in dates:
    res=re.findall(r"\d{2}:\d{2}:\d{2}", dat)
    if res:
        emails_info['time_sent'].append(res[0])
    else:
        emails_info['time_sent'].append(np.nan)

len(emails_info['time_sent'])

3977

### Subject

In [64]:
subject=[]
for e in contents:
    # keep the whole "Subject:" line, or a placeholder when the header is missing
    found = re.search("Subject:.*", e)
    subject.append(found.group() if found else "not found")
len(subject)

3977

In [65]:
emails_info['subject']=[]
for sub in subject:
    res=re.findall(r":.*", sub)
    if res:
        emails_info['subject'].append(res[0][2:])
    else:
        emails_info['subject'].append(np.nan)

len(emails_info['subject'])

3977

### Creating DataFrame

In [71]:
df=pd.DataFrame(emails_info)
df.isnull().sum()

sender_email    476
sender_name     837
date_sent       614
time_sent       618
subject          27
dtype: int64

In [72]:
df.head()

Unnamed: 0,sender_email,sender_name,date_sent,time_sent,subject
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",31 Oct 2002,02:38:20,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",31 Oct 2002,05:10:00,URGENT ASSISTANCE /RELATIONSHIP (P)
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,22:17:55,GOOD DAY TO YOU
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,22:44:20,GOOD DAY TO YOU
4,m_abacha03@www.com,"""Maryam Abacha""",1 Nov 2002,01:45:04,I Need Your Assistance.
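
The per-field loops above could also be condensed into one helper that applies a compiled pattern per field. This is only a sketch under assumptions: `FIELD_PATTERNS` and `parse_email` are hypothetical names, the sample header is invented for illustration (the real corpus may be formatted differently), and missing fields come back as `None` instead of `np.nan`.

```python
import re

# Invented header text, for illustration only
sample = (
    'From: "Jane Doe" <jane.doe@example.com>\n'
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: Hello there\n"
)

# One pattern per column; group 1 holds the value we want.
# re.M lets ^ match at the start of each header line.
FIELD_PATTERNS = {
    "sender_email": re.compile(r"^From:.*?([\w.]+@[\w.-]+)", re.M),
    "sender_name": re.compile(r"^From:\s*(.*?)\s*<", re.M),
    "date_sent": re.compile(r"^Date:.*?(\d+\s\w{3}\s\d+)", re.M),
    "time_sent": re.compile(r"^Date:.*?(\d{2}:\d{2}:\d{2})", re.M),
    "subject": re.compile(r"^Subject:\s*(.*)", re.M),
}

def parse_email(text):
    """Return a dict with one entry per field, None when a field is missing."""
    row = {}
    for field, pattern in FIELD_PATTERNS.items():
        found = pattern.search(text)
        row[field] = found.group(1) if found else None
    return row

parsed = parse_email(sample)
print(parsed["sender_email"])  # jane.doe@example.com
print(parsed["date_sent"])     # 31 Oct 2002
print(parsed["time_sent"])     # 02:38:20
print(parsed["subject"])       # Hello there
```

A list of such dicts, one per entry in `contents`, can be passed straight to `pd.DataFrame`. Unlike the solution above, this sketch strips the space around the sender name.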


### Now you can start your analysis!