
## Regular Expressions
---

In computing, a **regular expression**, also called a **"regex"** or **"regexp"**, provides a **concise** and **flexible** means of **matching strings of text**, such as particular characters, words, or patterns of characters.<br>
A **regular expression** is written in a **formal language** that can be interpreted by a **regular expression processor**.

<br>
In practice, you can think of it as a really smart "Find" or "Search" (Control+F).

### Regular Expressions Quick Guide:

'^' - Matches the **beginning** of a line<br>
'$' - Matches the **end** of a line<br>
'.' - Matches **any** character (except a newline, by default)<br>
'\s' - Matches a **whitespace** character<br>
'\S' - Matches any **non-whitespace** character<br>
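
The short sketch below tries each symbol from the quick guide on a few made-up strings. The sample strings and variable names are assumptions chosen for illustration only; they are not part of the original notebook.

```python
import re

# Hypothetical sample lines, used only to illustrate the quick guide
lines = ['Hello World!', 'Say Hello', 'Hello  World']

# '^Hello' matches only when 'Hello' is at the beginning of the line
starts = [s for s in lines if re.search(r'^Hello', s)]

# 'Hello$' matches only when 'Hello' is at the end of the line
ends = [s for s in lines if re.search(r'Hello$', s)]

# In 'H.llo' the '.' stands for any single character
dot = re.findall(r'H.llo', 'Hello Hallo Hullo')

# '\s' matches one whitespace character, '\S' one non-whitespace character
spaces = re.findall(r'\s', 'a b\tc')
non_spaces = re.findall(r'\S', 'a b\tc')

print(starts)       # ['Hello World!', 'Hello  World']
print(ends)         # ['Say Hello']
print(dot)          # ['Hello', 'Hallo', 'Hullo']
print(spaces)       # [' ', '\t']
print(non_spaces)   # ['a', 'b', 'c']
```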



### Regular Expressions Module

* Before you can use regular expressions in your program, you must import the library using **"import re"**

* You can use **re.search()** to see if a string matches a regular expression, similar to using the **find()** method for strings

* You can use **re.findall()** to extract portions of a string that match your regular expression, similar to a combination of **find()** and slicing: var[5:10]
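
As a rough side-by-side of the points above, the sketch below runs the string methods and their regex counterparts on one made-up string. The text and variable names are hypothetical, chosen only for this example.

```python
import re

# Hypothetical sample text
text = 'Order 66 shipped'

# find() returns an index (or -1); re.search() returns a match object (or None)
found_with_find = text.find('shipped') >= 0
found_with_search = re.search('shipped', text) is not None

# find() plus slicing extracts a substring once you know its position and length...
start = text.find('66')
sliced = text[start:start + 2]

# ...while re.findall() extracts every substring that matches the pattern
numbers = re.findall('[0-9]+', text)

print(found_with_find, found_with_search)   # True True
print(sliced)                               # 66
print(numbers)                              # ['66']
```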


### Using re.search() like find()


In [None]:
#connecting to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
xfile = open('/content/drive/MyDrive/Colab Notebooks/test.txt')

for line in xfile:
    line = line.rstrip()
    if line.find('Hello') >=0:
        print(line)

Hello World!
Hello World! 2


In [None]:
#Using re.search like find()

import re

xfile = open('/content/drive/MyDrive/Colab Notebooks/test.txt')

for line in xfile:
    line = line.rstrip()
    if re.search('Hello', line):
        print(line)


Hello World!
Hello World! 2


### Using re.search() like startswith()


In [None]:
xfile = open('/content/drive/MyDrive/Colab Notebooks/test.txt')

for line in xfile:
    line = line.rstrip()
    if line.startswith('Hello'):
        print(line)

Hello World!
Hello World! 2


In [None]:
#using re.search like startswith

import re
xfile = open('/content/drive/MyDrive/Colab Notebooks/test.txt')

for line in xfile:
    line = line.rstrip()
    if re.search('^Hello', line):
        print(line)

Hello World!
Hello World! 2


## Matching and Extracting Data
---

* **re.search()** returns a **match object** if the string matches the regular expression and **None** otherwise, so it behaves like **True/False** in an **if** statement

* If we actually want the matching string to be extracted, we use **re.findall()**
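
Strictly speaking, **re.search()** hands back a match object (or None) rather than a literal True or False. The minimal sketch below, reusing the sentence from the next cell, shows what that object contains and how it compares with **re.findall()**. The variable names are assumptions for illustration.

```python
import re

x = 'My 2 favorite numbers are 8 and 27'

# re.search() stops at the first match and returns a match object
m = re.search('[0-9]+', x)
print(m is not None)        # True
print(m.group())            # 2  (the text of the first match)
print(m.start(), m.end())   # 3 4  (where that match sits in x)

# With no match, re.search() returns None, which counts as False in an if
no_match = re.search('[AEIOU]+', x)
print(no_match)             # None

# re.findall() returns every match as a list of strings
all_numbers = re.findall('[0-9]+', x)
print(all_numbers)          # ['2', '8', '27']
```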

In [1]:
import re
x = 'My 2 favorite numbers are 8 and 27'

#[0-9]+ means one or more digits
y = re.findall('[0-9]+',x)
print(y)

['2', '8', '27']


* When we use **re.findall()**, it returns a list of **zero** or **more sub-strings** that **match** the **regular expression**


In [2]:
import re
x = 'My 2 favorite numbers are 8 and 27'
y = re.findall('[0-9]+',x)
print(y)

['2', '8', '27']


In [5]:
#[AEIOU]+ means one or more consecutive uppercase vowels (A, E, I, O or U); x contains none, so the list is empty
y = re.findall('[AEIOU]+', x)
print(y)

[]


### Warning: Greedy Matching

The **repeat** characters ('*' and '+') push **outward** in both directions (greedy) to match the largest possible string.

In [8]:
import re
x = 'From: Using The : character'

#First character in the match is 'F'
#Last character in the match is a ':'
# '.+' means one or more of any character
y = re.findall('^F.+:', x)
print(y)

['From: Using The :']


In [21]:
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall('\\S+@\\S+', s)
print(lst)

['csev@umich.edu', 'cwen@iupui.edu']


### Non-Greedy Matching

* **Not all regular expression** repeat codes are **greedy**! If you add a **'?'** character after them, the '+' and '*' chill out a bit and match the shortest possible string...

In [10]:
import re

x = 'From: Using the : character'

#First character in the match is 'F'
#Last character in the match is a ':'
# '.+?' means one or more of any character, but not greedy!
y = re.findall('^F.+?:', x)
print(y)

['From:']
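
To see the two behaviours side by side, here is a small sketch that runs greedy and non-greedy versions of the same pattern. The sample strings are hypothetical and only serve the comparison.

```python
import re

# Hypothetical sample text with several '<...>' pieces
html = '<b>bold</b> and <i>italic</i>'

# Greedy: '.+' grabs as much as it can, so the match runs to the LAST '>'
greedy = re.findall('<.+>', html)

# Non-greedy: '.+?' stops at the FIRST '>' it can reach
lazy = re.findall('<.+?>', html)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# The same idea applies to '*' and '*?'
print(re.findall('a.*c', 'abcabc'))    # ['abcabc']
print(re.findall('a.*?c', 'abcabc'))   # ['abc', 'abc']
```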


### Fine-Tuning String Extraction

* You can **refine** the **match** for **re.findall()** and **separately** determine which **portion of the match** is to be **extracted** by **using** **parentheses**

In [13]:
import re
x = 'From joaopedromarques76@gmail.com Sat Jan 5 09:15:32 2022'

#\S+ = one or more non-whitespace characters
#The r'' prefix (raw string) keeps Python from treating '\S' as a string escape
y = re.findall(r'\S+@\S+', x)
print(y)

['joaopedromarques76@gmail.com']


* **Parentheses** are **not part of the match** - but they mark where the **string to extract** **starts** and **stops**

In [17]:
import re
x = 'From joaopedromarques76@gmail.com Sat Jan 5 09:15:32 2022'
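#Without parentheses, the whole match is returned
y = re.findall(r'^From \S+@\S+', x)
print(y)

#With parentheses, only the part inside them is returned
y = re.findall(r'^From (\S+@\S+)', x)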
print(y)

['From joaopedromarques76@gmail.com']
['joaopedromarques76@gmail.com']
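
The sketch below applies the same parenthesized pattern to a few lines, to show that the text outside the parentheses still has to match even though it is not extracted. The sample lines and addresses are made up for illustration. The last part uses two pairs of parentheses, in which case **re.findall()** returns tuples.

```python
import re

# Hypothetical sample lines (the addresses are invented)
lines = [
    'From alice@example.com Sat Jan 5 09:15:32 2022',
    'Subject: hello bob@example.org',
    'From carol@example.net Sun Jan 6 10:00:00 2022',
]

senders = []
for line in lines:
    # The line must start with 'From ', but only the part in ( ) is returned
    senders.extend(re.findall(r'^From (\S+@\S+)', line))

print(senders)   # ['alice@example.com', 'carol@example.net']

# With two pairs of parentheses, each match becomes a tuple of the two pieces
parts = re.findall(r'^From (\S+)@(\S+)', lines[0])
print(parts)     # [('alice', 'example.com')]
```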


## String Parsing Examples...
---

* Let's go back to the way we first tore apart strings

In [22]:
#Extracting a host name - using find and string slicing

#Looking for the '@' position
data = 'From joaopedromarques76@gmail.com Sat Jan 5 09:15:32 2022'
atpos = data.find('@')
print(atpos)

23


In [23]:
#Looking for the whitespace position after '@'
sppos = data.find(' ', atpos)
print(sppos)

33


In [24]:
#Printing the characters between '@' and the first whitespace after '@'
host = data[atpos+1 : sppos]
print(host)

gmail.com


### The Double Split Pattern

* Sometimes we split a line one way, and then grab one of the pieces of the line and split that piece again

In [25]:
line = 'From joaopedromarques76@gmail.com Sat Jan 5 09:15:32 2022'

#Breaking into words, From = 0, joaopedromarques76@gmail.com = 1, Sat = 2....
words = line.split()
email = words[1]
#breaking words[1] at the '@'
pieces = email.split('@')
#printing the host name
print(pieces[1])

gmail.com


### The Regex Version (Regular Expressions)

In [27]:
import re

x = 'From joaopedromarques76@gmail.com Sat Jan 5 09:15:32 2022'

#Look through the string until you find an '@' sign
#[^ ] -> match a non-blank character
# * -> match zero or more of them
# ( ) -> mark where the extracted string starts and stops
y = re.findall('@([^ ]*)', x)
print(y)

['gmail.com']


### Even Cooler Regex Version (Regular Expressions)

In [28]:
import re

#^ -> Starting at the beginning of the line, look for the string 'From'
x = 'From joaopedromarques76@gmail.com Sat Jan 5 09:15:32 2022'
y = re.findall('^From .*@([^ ]*)', x)
print(y)

['gmail.com']
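
To check that the approaches in this section agree, the sketch below wraps each one in a small function and runs them on the same line. The function names are hypothetical helpers invented for this comparison; the logic inside each one follows the cells above.

```python
import re

def host_by_find(line):
    # find() plus slicing: locate '@', then the first space after it
    atpos = line.find('@')
    sppos = line.find(' ', atpos)
    return line[atpos + 1:sppos]

def host_by_split(line):
    # Double split: first on whitespace, then on '@'
    email = line.split()[1]
    return email.split('@')[1]

def host_by_regex(line):
    # Regex: the line must start with 'From ', extract what follows '@'
    found = re.findall('^From .*@([^ ]*)', line)
    return found[0] if found else None

line = 'From joaopedromarques76@gmail.com Sat Jan 5 09:15:32 2022'
results = [host_by_find(line), host_by_split(line), host_by_regex(line)]
print(results)   # ['gmail.com', 'gmail.com', 'gmail.com']

# The regex version also reports cleanly when a line does not fit the pattern
print(host_by_regex('Subject: hello'))   # None
```

One reason to prefer the regex version is that the check ("does this line start with 'From '?") and the extraction happen in a single step.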


### Escape Character

* If you want a **special regular expression character** to just **behave normally** (most of the time), you prefix it with **'\\'**

In [30]:
import re

#\$ - a real dollar sign
#[0-9.] - a digit or a period
#+ - one or more
x = 'We just received $10.00 for apples'
y = re.findall(r'\$[0-9.]+', x)
print(y)

['$10.00']
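
The sketch below contrasts escaped and unescaped special characters on the same kind of text. The second sample string is made up for illustration. It also uses **re.escape()**, a standard `re` function not covered above, which adds the backslashes for you.

```python
import re

x = 'We just received $10.00 for apples'

# Unescaped, '$' means "end of line", so nothing can follow it and nothing matches
unescaped = re.findall(r'$[0-9.]+', x)

# Escaped, '\$' is a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

print(unescaped)   # []
print(escaped)     # ['$10.00']

# Same idea for '.': unescaped it matches any character, escaped only a period
sample = 'We got 10x00 and 10.00'   # hypothetical sample text
any_char = re.findall(r'10.00', sample)
literal_dot = re.findall(r'10\.00', sample)

print(any_char)      # ['10x00', '10.00']
print(literal_dot)   # ['10.00']

# re.escape() escapes the special characters in a literal string
pattern = re.escape('$10.00')
print(re.findall(pattern, x))   # ['$10.00']
```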
