## Pattern Matching and Replacement in Python

We will try to illustrate the power of regular expressions in describing patterns and their use in searching and replacing parts of strings. This is by no means an exhaustive study of Python's regular expression library; as with other packages, you have to read the documentation for all its features.

In [1]:
import re
print(re.search("also","This is a sentence. This is also a sentence."))

<re.Match object; span=(28, 32), match='also'>


It searches for the first match and returns a **match** object with details of the match.
If there is no match it returns **None**.

In [2]:
print(re.search("That","This is a sentence. This is also a sentence"))

None


You can extract the location and the exact substring that matched.

In [6]:
m = re.search("This","This is a sentence. This is also a sentence.")
print(m.span())
print(m.group())

(0, 4)
This
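
As a small sketch of how the pieces of a match object relate to each other, the following cell slices the original string with the span. The variable names `text`, `start` and `end` are chosen here only for illustration.

```python
import re

text = "This is a sentence. This is also a sentence."
m = re.search(r"also", text)

if m is not None:              # re.search returns None when nothing matches
    start, end = m.span()      # same pair as (m.start(), m.end())
    print(start, end)          # 28 32
    print(text[start:end])     # also
    # Slicing the string with the span gives back exactly m.group()
    print(text[start:end] == m.group())   # True
```

The span is a half-open interval, so `end` is the index just past the last matched character.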


If you want to find all matches you can use **re.findall**, which returns the matched substrings as a list.

In [7]:
ls = re.findall("This","This is a sentence. This is also a sentence.")
print(ls)

['This', 'This']


If you want all matches returned to you as **match** objects use **re.finditer**

In [10]:
ms = re.finditer("This","This is a sentence. This is also a sentence.")

In [11]:
for m in ms:
    print(m.span(),m.group())

(0, 4) This
(20, 24) This


#### Regular expressions 

Regular expressions are a rich language for describing patterns. 

In their simplest form they offer shorthand for common classes of characters: 

  * \s ---> matches a whitespace character
  * \S ---> matches any character that \s does not
  * \d ---> matches a digit
  * \D ---> matches any character that \d does not
  * \w ---> matches an alphanumeric character or an underscore, i.e. {a,...,z,A,...,Z,0,...,9,_} (by default also non-ASCII letters and digits)
  * ...

In [14]:
ms = re.finditer(r"\w\s\w","This is a sentence. This is also a sentence.")
for m in ms:
    print(m.span())
    print(m.group())

(3, 6)
s i
(6, 9)
s a
(23, 26)
s i
(26, 29)
s a
(31, 34)
o a


Matches are disjoint. So, 'a s' is not part of the matches above.
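
If overlapping matches are wanted, one possible sketch is to wrap the pattern in a lookahead (a construct described later in these notes) with a group inside it. The lookahead itself consumes nothing, so the search can restart at the very next position. The name `overlapping` is only illustrative.

```python
import re

text = "This is a sentence. This is also a sentence."

# The outer (?=...) matches an empty string wherever \w\s\w starts;
# the inner group records the three characters seen at that position.
overlapping = re.findall(r"(?=(\w\s\w))", text)
print(overlapping)
# ['s i', 's a', 'a s', 's i', 's a', 'o a', 'a s']
```

Both occurrences of 'a s' now appear, because no characters were consumed by the earlier matches.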

You can also match one of a set of letters by enclosing them between [ and ]

In [15]:
ms = re.finditer(r"\w[aeiou]\w","This is a sentence. This is also a sentence.")
for m in ms:
    print(m.span())
    print(m.group())

(1, 4)
his
(10, 13)
sen
(13, 16)
ten
(21, 24)
his
(35, 38)
sen
(38, 41)
ten


You can complement a character class such as [aeiou] by adding a ^ at the beginning.

In [16]:
print(re.findall("[^aeiou]","This is a sentence. This is also a sentence."))

['T', 'h', 's', ' ', 's', ' ', ' ', 's', 'n', 't', 'n', 'c', '.', ' ', 'T', 'h', 's', ' ', 's', ' ', 'l', 's', ' ', ' ', 's', 'n', 't', 'n', 'c', '.']


There are special patterns that identify the beginning and the end of a line (and similarly of words).
^ matches at the beginning of a line. It matches an empty string. Similarly, $ matches an empty string at the end of a line.

In [3]:
ms = re.finditer("^","This is the first Line.\nWhat about This? Second?\n Third Line",re.MULTILINE)
for m in ms:
    print(m.span())

(0, 0)
(24, 24)
(49, 49)


In [20]:
ms = re.finditer("$","This is the first Line.\nWhat about This? Second?\n Third Line")
for m in ms:
    print(m.span())

(60, 60)


Without the option MULTILINE it simply treats the entire input as one line.

In [25]:
ms = re.findall("^","This is the first Line.\nWhat about this? Second?\n Third Line",re.MULTILINE)
print(ms)

['', '', '']


Observe that the pattern ^ matches an empty string. It matches empty strings at the beginning of a line. 
Similarly the pattern $ matches an empty string at the end of the line.  (Incidentally, . matches any character except a newline.)

In [28]:
ms = re.findall("^.","This is the first Line.\nWhat about this? Second?\n Third Line",re.MULTILINE)
print(ms) 

['T', 'W', ' ']
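
To round this off, here is a short sketch of `$` under **re.MULTILINE** and of `.` with and without **re.DOTALL**, using the same three-line string. The variable names are made up for this example.

```python
import re

text = "This is the first Line.\nWhat about This? Second?\n Third Line"

# With MULTILINE, $ matches just before each newline and at the very end.
line_ends = [m.span() for m in re.finditer(r"$", text, re.MULTILINE)]
print(line_ends)        # [(23, 23), (48, 48), (60, 60)]

# The last character of every line.
last_chars = re.findall(r".$", text, re.MULTILINE)
print(last_chars)       # ['.', '?', 'e']

# By default . does not match a newline; re.DOTALL lifts that restriction.
no_dotall = re.findall(r"Line\..What", text)
with_dotall = re.findall(r"Line\..What", text, re.DOTALL)
print(no_dotall)        # []
print(with_dotall)      # ['Line.\nWhat']
```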


#### Searching for repeated patterns

Adding a **\*** after a pattern denotes 0 or more successive occurrences of the pattern.

Adding a **+** after a pattern denotes 1 or more successive occurrences of the pattern.

In [29]:
ms = re.finditer(r"\d+","30 is a number and so is 400")
for m in ms:
    print(m.group())

30
400


Why did it not match just **3** or **4** or **40**? 

Because the quantifiers **\*** and **+** in **re** are *greedy*: starting at a position, they consume as many repetitions as possible. The search then skips to the end of this match and continues looking for the next match, and so on.
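
The greedy behaviour can be contrasted with the non-greedy quantifiers `+?` and `*?`, which take as few repetitions as possible. The cell below is a minimal sketch; the HTML-like string is invented for illustration.

```python
import re

s = "30 is a number and so is 400"

greedy = re.findall(r"\d+", s)     # as many digits as possible
lazy = re.findall(r"\d+?", s)      # as few digits as possible (one at a time)
print(greedy)   # ['30', '400']
print(lazy)     # ['3', '0', '4', '0', '0']

tagged = "<b>bold</b>"
print(re.search(r"<.+>", tagged).group())    # <b>bold</b>
print(re.search(r"<.+?>", tagged).group())   # <b>

# Greediness concerns repetition only. Alternatives are tried left to
# right, so the first one that succeeds wins even if it is shorter.
first_alt = re.search(r"a|ab", "ab").group()
print(first_alt)   # a
```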

In [5]:
ms = re.finditer(r"\d*","30 is 400")
for m in ms:
    print(m.group())

30




400



What happened there? 
Since **\*** denotes 0 or more occurrences, **\d\*** also matches the empty string at every position where no digit starts, such as just before a space or a letter.

In [31]:
print(re.findall(r"\d*","30 is 400"))

['30', '', '', '', '', '400', '']


In [32]:
ms = re.finditer(r"\d*","30 is 400")
for m in ms:
    print(m.group(),m.span())

30 (0, 2)
 (2, 2)
 (3, 3)
 (4, 4)
 (5, 5)
400 (6, 9)
 (9, 9)


The other 5 matches are empty-string matches at positions 2, 3, 4, 5 and 9. Remember **\*** matches 0 occurrences as well.

### Redoing words and findInts using re

Finding words is easy. It is just what we get if we greedily match sequences of non-whitespace letters.

In [33]:
print(re.findall(r"\S+"," This is a long and boring sentence. \nThis is another, and   so on."))

['This', 'is', 'a', 'long', 'and', 'boring', 'sentence.', 'This', 'is', 'another,', 'and', 'so', 'on.']


#### Compiling Regular Expressions

It turns out that converting a regular expression into an appropriate state machine is an expensive operation. So, if we plan to search the same pattern several times then it is best to *store* this intermediate step and use it for the search. This is done by **compiling** the expression. 

In [34]:
words = re.compile(r"\S+")
print(words.findall(" This is a long and boring sentence. \nThis is another, and   so on."))

['This', 'is', 'a', 'long', 'and', 'boring', 'sentence.', 'This', 'is', 'another,', 'and', 'so', 'on.']


In [35]:
words.findall("Another sentence. ")

['Another', 'sentence.']
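
A typical use of a compiled pattern is inside a loop over many strings. The sketch below counts the words on each line of a small list; the list `lines` is invented here for illustration.

```python
import re

words = re.compile(r"\S+")   # compiled once, reused for every line

lines = ["first line here", "  second   line ", ""]
counts = [len(words.findall(line)) for line in lines]
print(counts)   # [3, 2, 0]
```

The compiled object offers the same operations as the module-level functions (search, findall, finditer, sub), with the pattern argument left out.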

Finding integers is slightly more involved --- we still have to learn how to peep into the next letter in **re**

#### Look Aheads in Python 

The pattern **(?=re)** tests if the letters following this position match the pattern **re** and matches the empty string if they do. Thus, it looks ahead to see if there is a match of **re** following this position. 

The pattern **(?!re)** verifies that the letters following this position do NOT match the pattern **re** and matches the empty string in this case.

In [7]:
ms = re.findall("T(?=hi)","This and That but only This Matches")
print(ms)

['T', 'T']


In [39]:
ms = re.finditer("T(?=hi)","This and That but only This Matches")
for m in ms:
    print(m.group(), m.span())

T (0, 1)
T (23, 24)


In [11]:
ms = re.finditer("T(?!hi)","This That This That")
for m in ms:
    print(m.group(), m.span())

T (5, 6)
T (15, 16)


Now we can formulate **integers** as used in the previous lecture as:

    maximal sequences of digits, 
    not immediately following a . and
    not immediately followed by a . followed by 1 or more digits.
    
The obvious strategy fails!

In [15]:
intsRE = re.compile(r'[^.]\d+(?![.]\d)')
intsRE.findall('''This is a long sentence, of more than 20 letters and 8 words. I may have real numbers such as 3.14 in this, but it should not print them out. But
integers can be the last word in a sentence such as 30. The word a39b contains the integer 39..45 and .382. That was just to confuse the issue. Does it work with .45?''')

[' 20', ' 8', '14', ' 30', 'a39', ' 39', '45', '382', '45']

There are several issues. To start with
    * The pattern includes the character preceding the "supposed" integer we wish to pick. (see " 20")
    

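We may get around this by grouping the part we need and leaving the rest out. This is done by placing the relevant part inside ( ). When the pattern contains one group, **findall** returns the text matched by that group instead of the whole match.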

In [41]:
intsRE = re.compile(r'[^.](\d+(?![.]\d))')
intsRE.findall('''This is a long sentence, of more than 20 letters and 8 words. I may have real numbers such as 3.14 in this, but it should not print them out. But
integers can be the last word in a sentence such as 30. The word a39b contains the integer 39..45 and .382. That was just to confuse the issue. Does it work with .45?''')

['20', '8', '4', '30', '39', '39', '5', '82', '5']

Do you now realise why you got a "14" earlier (which is part of 3.14 and should not have been there) and why you get a "4" now (again part of the same 3.14)? 

In the first attempt, the *14* got picked because 
    1) *1* matched **[^.]** 
    2) *4* matched **\d+** and 
    3) the space that follows satisfies **(?![.]\d)**


In the second attempt, the *1* got dropped but the *4* remains.

The problem is that a digit (such as *1*) may match **[^.]**. This was not our intention. We wanted that the letter preceding this entire sequence of digits is NOT . 

One way to arrange this is to demand that the previous letter should not be a digit either.

In [None]:
intsRE = re.compile(r'[^.\d](\d+(?![.]\d))')
intsRE.findall('''This is a long sentence, of more than 20 letters and 8 words. I may have real numbers such as 3.14 in this, but it should not print them out. But
integers can be the last word in a sentence such as 30. The word a39b contains the integer 39..45 and .382. That was just to confuse the issue. Does it work with .45?''')

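Python also allows lookbehinds, which test whether the text immediately preceding the current position matches some re, without consuming it. The pattern inside a lookbehind must match strings of a fixed length. The syntax is
**(?<=re)** and **(?<!re)**. Unlike **[^.\d]**, a lookbehind does not need a preceding character to exist, so an integer at the very start of the string is also found.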

In [42]:
intsRE = re.compile(r'(?<![.\d])\d+(?![.]\d)')
intsRE.findall('''This is a long sentence, of more than 20 letters and 8 words. I may have real numbers such as 3.14 in this, but it should not print them out. But
integers can be the last word in a sentence such as 30. The word a39b contains the integer 39..45 and .382. That was just to confuse the issue. Does it work with .45?''')

['20', '8', '30', '39', '39']
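
One caveat worth checking: when the lookahead fails, **\d+** backtracks and gives up digits, so a real number with two or more digits before the point (such as 12.5) can still leak a shorter prefix. The sketch below shows this on a made-up sentence and one possible tightening, which also forbids a digit directly after the match. The names `loose` and `strict` are illustrative and not part of the original notes.

```python
import re

text = "Versions 12.5 and 3.14 are reals; 20, a39b and 30. are integers."

# Pattern from the notes: for 12.5, "12" is rejected by the lookahead,
# but after backtracking "1" is accepted because it is followed by "2".
loose = re.compile(r'(?<![.\d])\d+(?![.]\d)')
print(loose.findall(text))    # ['1', '20', '39', '30']

# Tightened sketch: the match may be followed neither by a digit
# nor by a . and a digit, so backtracking cannot produce a prefix.
strict = re.compile(r'(?<![.\d])\d+(?![.]?\d)')
print(strict.findall(text))   # ['20', '39', '30']
```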

### More on Grouping

Grouping allows you to break up the matched part of the string into pieces. Given a line from a CSV file with user ID, first name and last name, here is a way to extract the different fields using re.

In [12]:
getUIDLastName = re.compile(r'([^,]*),([^,]*),([^,]*)')
m = getUIDLastName.search("gvr,Guido,von Rossum")
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))

gvr,Guido,von Rossum
gvr
Guido
von Rossum
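
Groups can also be given names with the syntax **(?P<name>...)**, which tends to read better than numeric indices. The following is a small sketch on the same record; the group names `uid`, `first` and `last` are chosen here for illustration.

```python
import re

record = re.compile(r'(?P<uid>[^,]*),(?P<first>[^,]*),(?P<last>[^,]*)')
m = record.search("gvr,Guido,von Rossum")

print(m.group("uid"))    # gvr
print(m.groups())        # ('gvr', 'Guido', 'von Rossum')
print(m.groupdict())     # {'uid': 'gvr', 'first': 'Guido', 'last': 'von Rossum'}
```

Named groups are still numbered, so `m.group(1)` and `m.group("uid")` return the same string.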


#### Extracting domain names 

Let us consider a simplified version of email addresses. We will require that every email id be of the form 

     username@domainname
     
where *username* is a sequence of characters that are alphanumeric or . or +, and neither . nor + may be the first character. The *domainname* is a sequence of words separated by . where each word uses alphanumeric characters only. 

We would like to read a file and print out all the email addresses that appear in it. In the pattern below, group 1 is the username and group 2 is the domain name. 

In [9]:
emailid = re.compile(r'([a-zA-Z0-9][a-zA-Z0-9.+]*)[@](([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+)')
with open("InputFile","r") as f:
    s = f.read()
    ms = emailid.finditer(s)
    for m in ms:
        print(m.group(0))

guy.v.rossum@whatever.com
gvr@whatever.org
kernighan@att.com
sergey.brin+pythoncourse@gmail.com
kumar@cmi.ac.in
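
To get the domain names alone, one can read group 2 of each match. The sketch below works on an in-memory string rather than a file; the string `sample` is invented for illustration.

```python
import re

emailid = re.compile(r'([a-zA-Z0-9][a-zA-Z0-9.+]*)[@](([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+)')

sample = "Write to gvr@whatever.org or kumar@cmi.ac.in, not to @nobody."

# group(1) is the username, group(2) the whole domain name
domains = [m.group(2) for m in emailid.finditer(sample)]
print(domains)   # ['whatever.org', 'cmi.ac.in']
```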


In [18]:
def breakup(s):
    # Compile both patterns once, outside the loop.
    words = re.compile(r"\S+")
    emailid = re.compile(r'([a-zA-Z0-9][a-zA-Z0-9.+]*)[@](([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+)')
    idlist = []
    domainlist = []
    for m in words.finditer(s):
        for mq in emailid.finditer(m.group()):
            idlist.append(mq.group(1))      # username
            domainlist.append(mq.group(2))  # domain name
    for i in idlist:
        print(i)
    for j in domainlist:
        print(j)

breakup("My id is kumar@cmi.ac.in while @smi.ac.in is not valid but junk mail like t@cmi.ac.in.")

kumar
t
cmi.ac.in
cmi.ac.in


**Exercise:** In an HTML file links usually appear as follows:
                       
                 <sometag href="https://somedomainname"> Something here which will serve as the link </sometag>

Write a Python function using **re** which identifies all the links that appear in a given string. A domainname itself is as described above. For instance in the above example the link is:

                  https://somedomainname

###  Substituting using re

In addition to searching a string, **re** also provides the ability to replace the matched substrings with something else that you desire. Here is an example, which replaces all the email ids in a given string with postmaster@noname.com

In [None]:
emailidre = r'[a-zA-Z0-9][a-zA-Z0-9.+]*[@]([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+'
with open("InputFile","r") as f:
    s = f.read()
    print(re.sub(emailidre,"postmaster@noname.com",s),end="")

**re.sub(regexp,result,string)** replaces all non-overlapping matches of regexp in string with result and returns the new string.
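
The same substitution can be tried without an input file. The sketch below uses an invented string `s` and also shows the optional `count` argument and **re.subn**, which additionally reports how many replacements were made.

```python
import re

emailidre = r'[a-zA-Z0-9][a-zA-Z0-9.+]*[@]([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+'
s = "Mail gvr@whatever.org and kumar@cmi.ac.in today."

replaced_all = re.sub(emailidre, "postmaster@noname.com", s)
print(replaced_all)
# Mail postmaster@noname.com and postmaster@noname.com today.

# count limits the number of replacements, starting from the left
replaced_first = re.sub(emailidre, "postmaster@noname.com", s, count=1)
print(replaced_first)
# Mail postmaster@noname.com and kumar@cmi.ac.in today.

# subn returns a pair: (new string, number of replacements)
new_s, n = re.subn(emailidre, "postmaster@noname.com", s)
print(n)   # 2
```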

#### Using back references

Suppose we wanted to replace every address of the form username@domainname by postmaster@domainname? Here, we need to extract a part of the matched string and use it as part of the *result*. This can be done using groups. One can refer to the ith group using **\\g&lt;i&gt;** as illustrated below. Such references to matched substrings are called back references.


In [None]:
emailidre = r'([a-zA-Z0-9][a-zA-Z0-9.+]*)[@](([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+)'
with open("InputFile","r") as f:
    s = f.read()
    print(re.sub(emailidre,r"postmaster@\g<2>",s),end="")

If instead you wanted to send the mails to the username@noname.com you can do that as follows

In [None]:
emailidre = r'([a-zA-Z0-9][a-zA-Z0-9.+]*)[@](([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+)'
with open("InputFile","r") as f:
    s = f.read()
    print(re.sub(emailidre,r"\g<1>@noname.com",s),end="")
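
Here is a self-contained sketch of back references in the replacement, run on an invented string instead of a file. It also shows that the replacement may be a function that receives each match object; the helper `upper_domain` is hypothetical and only for illustration.

```python
import re

emailidre = r'([a-zA-Z0-9][a-zA-Z0-9.+]*)[@](([a-zA-Z0-9]+[.])*[a-zA-Z0-9]+)'
s = "Mail gvr@whatever.org and kumar@cmi.ac.in today."

# Raw strings keep the backslash of \g<2> intact for re to interpret.
to_postmaster = re.sub(emailidre, r"postmaster@\g<2>", s)
print(to_postmaster)
# Mail postmaster@whatever.org and postmaster@cmi.ac.in today.

# \1 is a shorter way of writing \g<1> in the replacement.
to_noname = re.sub(emailidre, r"\1@noname.com", s)
print(to_noname)
# Mail gvr@noname.com and kumar@noname.com today.

def upper_domain(m):
    # Called once per match; the returned string replaces the match.
    return m.group(1) + "@" + m.group(2).upper()

shouted = re.sub(emailidre, upper_domain, s)
print(shouted)
# Mail gvr@WHATEVER.ORG and kumar@CMI.AC.IN today.
```

The longer form `\g<1>` is needed when a digit follows the reference, since `\10` would be read as group 10.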

Back references can also be used in simple matching of strings (not just in substitution).

One can refer to the ith group within the same regular expression as **\\i**

The following identifies all those users whose userID is of the form firstname.lastname in the file UserData.csv, which lists user ID, first name and last name on each line.

In [None]:
# [.] matches a literal dot; a bare . would match any character
parse = r'([^.,]+)[.]([^.,]+),\1,\2'
with open("UserData.csv","r") as f:
    for l in f:
        if re.search(parse,l):
            print(l,end="")
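
The same idea can be tried on a few in-memory records. In the sketch below the sample lines are invented, and the pattern is additionally anchored with ^ and $ (an assumption about the intended strictness) so that the whole line must have the required shape. A second pattern uses a back reference to find doubled words.

```python
import re

# \1 and \2 must repeat exactly what groups 1 and 2 matched earlier.
parse = re.compile(r'^([^.,]+)[.]([^.,]+),\1,\2$')

lines = [
    "guido.rossum,guido,rossum",      # user ID is firstname.lastname
    "gvr,guido,rossum",               # no dot in the user ID
    "brian.kernighan,brian,ritchie",  # last name differs
]
matching = [l for l in lines if parse.search(l)]
print(matching)   # ['guido.rossum,guido,rossum']

# A word followed by whitespace and the same word again.
doubled = re.findall(r'\b(\w+)\s+\1\b', "this is is a a test")
print(doubled)    # ['is', 'a']
```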

Regular expressions are a very useful and powerful formalism for describing and processing patterns in words. They are not specific to Python and appear in some form or the other in alm