# Regular expressions in Python

In [1]:
import re

## Pattern searching

Let's start off by searching for patterns in text.

In [2]:
patterns = ['python', 'fun']

In [3]:
sent = 'This is a sentence which says python is easy'

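To search, "re.search()" is used. The first argument is the pattern to look for and the second is the string to search in. It returns a match object for the first occurrence, or None if the pattern is not found.

Below we search for "is" in "python is fun" and get a match object back.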

In [4]:
re.search('is', 'python is fun')

<re.Match object; span=(7, 9), match='is'>

Using this, we will search the sentence for each pattern in our list.

In [5]:
for pattern in patterns:
    print('Searching for "{}" in: \n"{}"'.format(pattern, sent))
    
    if re.search(pattern,sent):
        print("\nPattern found\n")
    else:
        print("\nPattern not found\n")

Searching for "python" in: 
"This is a sentence which says python is easy"

Pattern found

Searching for "fun" in: 
"This is a sentence which says python is easy"

Pattern not found



Now let's take a closer look at this match object.

In [6]:
match = re.search(patterns[0], sent)

In [7]:
type(match)

re.Match

This match object has its own methods. start() returns the index where the match begins, and end() returns the index just past the last matched character.

In [8]:
match.start()

30

In [9]:
match.end()

36
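
As a small sketch of how these pieces fit together (the variable names `found`, `start`, `end` and `missing` are only for illustration), span() gives start and end as a pair, group() gives the matched text, and a failed search gives None, which is why the if test in the loop above works.

```python
import re

sent = 'This is a sentence which says python is easy'

match = re.search('python', sent)
if match is not None:
    # group() returns the matched text, span() the (start, end) pair
    found = match.group()
    start, end = match.span()
    print(found, start, end)
    # Slicing the original string with the span gives back the matched text
    print(sent[start:end])

# When nothing matches, re.search returns None rather than a match object
missing = re.search('fun', sent)
print(missing)
```

Checking for None before calling start() or end() avoids an AttributeError when the pattern is absent.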

## Splitting

In [10]:
splitter = ','

text = 'Hey, is this your book, No'

In [11]:
re.split(splitter, text)

['Hey', ' is this your book', ' No']
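
Notice the leading spaces left in the pieces above. Since the splitter is itself a regex pattern, one possible way to drop them is to let the pattern consume the whitespace after each comma. This is only a sketch; the names `parts` and `first_only` are made up for the example.

```python
import re

text = 'Hey, is this your book, No'

# ',\s*' matches a comma plus any whitespace that follows it
parts = re.split(r',\s*', text)
print(parts)       # ['Hey', 'is this your book', 'No']

# maxsplit limits how many splits are performed
first_only = re.split(r',\s*', text, maxsplit=1)
print(first_only)  # ['Hey', 'is this your book, No']
```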

## Finding instances of pattern

"re.findall()" takes the pattern you want to match and the text to search, and returns a list of all non-overlapping matches.

In [12]:
re.findall('python', 'python is fun and python is easy')

['python', 'python']
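
findall() only gives back the matched text. If the positions are needed as well, re.finditer() yields a match object per occurrence. A minimal sketch, with `positions` as an illustrative name:

```python
import re

text = 'python is fun and python is easy'

# finditer yields one match object per occurrence, so start() and end()
# are available for every match, not just the first one
positions = [(m.start(), m.end()) for m in re.finditer('python', text)]
print(positions)  # [(0, 6), (18, 24)]
```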

## Finding specific patterns

Metacharacters let us say how many times a pattern may repeat. A quantifier applies only to the element right before it, so in 'jd*' it is the d that repeats, not the j.

A pattern followed by the metacharacter * is repeated zero or more times.

A pattern followed by the metacharacter + must appear at least once.

Using ? means the pattern appears zero or one time.

For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat.

Use {m,n} where m is the minimum and n is the maximum number of repetitions. {m,} means the pattern appears at least m times with no maximum.

In [13]:
def re_find(patterns, phrase):
    '''
    Input: list of regex patterns and the phrase to search
    Output: None; prints the matches found for each pattern
    '''
    for pattern in patterns:
        print('Searching the phrase "{}"'.format(pattern))
        print(re.findall(pattern, phrase))
        print('\n')

### Repetition

In [14]:
txt = 'jdjd..jjjddd...jdddjddd...djdj...djjjjj...jdddd'

patterns = ['jd*', 'jd+', 'jd?', 'jd{3}', 'jd{2,3}']

re_find(patterns,txt)

Searching the phrase "jd*"
['jd', 'jd', 'j', 'j', 'jddd', 'jddd', 'jddd', 'jd', 'j', 'j', 'j', 'j', 'j', 'j', 'jdddd']


Searching the phrase "jd+"
['jd', 'jd', 'jddd', 'jddd', 'jddd', 'jd', 'jdddd']


Searching the phrase "jd?"
['jd', 'jd', 'j', 'j', 'jd', 'jd', 'jd', 'jd', 'j', 'j', 'j', 'j', 'j', 'j', 'jd']


Searching the phrase "jd{3}"
['jddd', 'jddd', 'jddd', 'jddd']


Searching the phrase "jd{2,3}"
['jddd', 'jddd', 'jddd', 'jddd']
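
The open-ended form {m,} is not in the pattern list above. As a quick sketch on the same text (the names `bounded` and `open_ended` are only illustrative), compare it with {2,3}: the only difference shows up at the final 'jdddd', where the bounded version stops after three d's.

```python
import re

txt = 'jdjd..jjjddd...jdddjddd...djdj...djjjjj...jdddd'

# j followed by two or three d's: stops after the third d
bounded = re.findall('jd{2,3}', txt)
print(bounded)     # ['jddd', 'jddd', 'jddd', 'jddd']

# j followed by two or more d's, with no upper limit
open_ended = re.findall('jd{2,}', txt)
print(open_ended)  # ['jddd', 'jddd', 'jddd', 'jdddd']
```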




### Character set

A character set is used when we want to match any one of a group of characters at a given position. Square brackets are used to construct these character sets.

[ab] matches a single occurrence of either a or b.

In [15]:
txt = 'jdjd..jjjddd...jdddjddd...djdj...djjjjj...jdddd'

patterns = ['[jd]', 
            'j[jd]+']  # j followed by one or more j or d

re_find(patterns,txt)

Searching the phrase "[jd]"
['j', 'd', 'j', 'd', 'j', 'j', 'j', 'd', 'd', 'd', 'j', 'd', 'd', 'd', 'j', 'd', 'd', 'd', 'd', 'j', 'd', 'j', 'd', 'j', 'j', 'j', 'j', 'j', 'j', 'd', 'd', 'd', 'd']


Searching the phrase "j[jd]+"
['jdjd', 'jjjddd', 'jdddjddd', 'jdj', 'jjjjj', 'jdddd']




### Exclusion

To exclude characters, we put ^ as the first character inside the bracket syntax.

[^...] will match any single character not in the brackets.

In [16]:
txt = 'Hello! My name is liza, what is yours? Sir. '

We check for matches that are not !, ., ,, ? or a space. The + sign requires one or more such characters in a row, so each word comes back as one match. This way we can remove the punctuation.

In [17]:
re.findall('[^!,.? ]+', txt)

['Hello', 'My', 'name', 'is', 'liza', 'what', 'is', 'yours', 'Sir']
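
To get a cleaned-up string back rather than a list of words, one option is to join the matches, and another is to delete the punctuation with re.sub(). This is a sketch; `joined` and `stripped` are illustrative names.

```python
import re

txt = 'Hello! My name is liza, what is yours? Sir. '

# Option 1: collect the runs of allowed characters and join them with spaces
words = re.findall('[^!,.? ]+', txt)
joined = ' '.join(words)
print(joined)    # Hello My name is liza what is yours Sir

# Option 2: replace each punctuation character with an empty string;
# the original spaces (including the trailing one) are kept
stripped = re.sub('[!,.?]', '', txt)
print(repr(stripped))
```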

### Character ranges

A range lets you define a character set that includes every character between a start point and a stop point.

[b-j] matches any single letter from b to j, inclusive.

In [18]:
txt = 'Hey. I love coding in python. It is fun and easy.'

patterns = ['[a-z]+', '[A-Z]+', 
            '[a-zA-Z]+',    # lower or upper case 
            '[A-Z][a-z]+']  # one upper followed by lower case

re_find(patterns, txt)

Searching the phrase "[a-z]+"
['ey', 'love', 'coding', 'in', 'python', 't', 'is', 'fun', 'and', 'easy']


Searching the phrase "[A-Z]+"
['H', 'I', 'I']


Searching the phrase "[a-zA-Z]+"
['Hey', 'I', 'love', 'coding', 'in', 'python', 'It', 'is', 'fun', 'and', 'easy']


Searching the phrase "[A-Z][a-z]+"
['Hey', 'It']
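
The [b-j] range mentioned above is not used in the patterns, so here is a small sketch of it on a made-up phrase (the phrase and the name `runs` are assumptions for illustration only).

```python
import re

# Runs of one or more letters between b and j; any other character
# (a, s, the spaces) ends the current run
runs = re.findall('[b-j]+', 'big jade fish')
print(runs)  # ['big', 'j', 'de', 'fi', 'h']
```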




### Escape codes

They are used to match specific classes of characters in data.
 
\d - digit

\D - non-digit

\s - whitespace (space, tab, newline)

\S - non-whitespace

\w - alphanumeric (letters, digits and the underscore)

\W - non-alphanumeric

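Python string literals also use the backslash for their own escapes (for example, "\n" is a newline). To keep Python from interpreting the backslashes before the regex engine sees them, write the pattern as a raw string by putting an "r" in front of the opening quote.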
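
A short sketch of what the r prefix changes (the variable names here are only illustrative): in a normal string the backslash is processed by Python first, while in a raw string it is passed through untouched.

```python
import re

plain = '\n'   # Python turns this into a single newline character
raw = r'\n'    # two characters: a backslash followed by n
print(len(plain), len(raw))  # 1 2

# Without the r prefix the backslash has to be doubled to reach the regex engine
same_pattern = ('\\d+' == r'\d+')
print(same_pattern)  # True

digits = re.findall(r'\d+', 'using it since 123 months')
print(digits)  # ['123']
```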

In [20]:
txt = 'Python is a #easy language and I am using it since 123 months'

patterns = [r'\d+', r'\D+', r'\s+', r'\S+', r'\w+', r'\W+']

re_find(patterns, txt)

Searching the phrase "\d+"
['123']


Searching the phrase "\D+"
['Python is a #easy language and I am using it since ', ' months']


Searching the phrase "\s+"
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase "\S+"
['Python', 'is', 'a', '#easy', 'language', 'and', 'I', 'am', 'using', 'it', 'since', '123', 'months']


Searching the phrase "\w+"
['Python', 'is', 'a', 'easy', 'language', 'and', 'I', 'am', 'using', 'it', 'since', '123', 'months']


Searching the phrase "\W+"
[' ', ' ', ' #', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
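
One detail the text above does not show is that \w also matches the underscore. A minimal sketch on a made-up phrase (the phrase and the names `word_runs` and `other_runs` are assumptions):

```python
import re

phrase = 'snake_case and kebab-case'

# The underscore counts as a word character, the hyphen does not
word_runs = re.findall(r'\w+', phrase)
print(word_runs)   # ['snake_case', 'and', 'kebab', 'case']

other_runs = re.findall(r'\W+', phrase)
print(other_runs)  # [' ', ' ', '-']
```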


