# Using Regular Expressions in Python for String Manipulations

By Katherine Wuestney, PhD, RN

## How do we filter strings?

Basic string manipulation using built-in string methods

Where **s** is a string variable:

- s.find() -> returns the lowest index where the substring occurs in the string (-1 if it is not found)
- s.split() -> returns a list of strings split along the delimiter (default is whitespace)
- s[:] -> basic string slicing using index values
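
The three tools listed above can be sketched on a made-up sample string. The string and the variable names are illustrative assumptions, not part of any dataset used here.

```python
# Illustrative sample string (not from any real dataset)
s = "pt001 visit 2023-05-14"

# find() returns the lowest index of the substring, or -1 if it is absent
visit_idx = s.find("visit")
missing_idx = s.find("xyz")

# split() with no argument splits on runs of whitespace
parts = s.split()

# slicing with index values
prefix = s[:2]
number = s[2:5]

print(visit_idx, missing_idx, parts, prefix, number)
```

Slicing only works here because the id always sits in the first five characters; that assumption is what breaks once the strings stop being standardized.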

## Example 1

Each data entry in a dataset has a unique id: 

- "pt001"
- 2-letter prefix and 3-digit suffix

Get the numeric portion using string slicing:

In [1]:
ptid = "pt001"
ptnum = ptid[2:]
print(ptnum)

001


## What if the string isn't standardized?

## Regular Expressions

- "Regex" type of syntax (not unique to Python)
- advanced pattern matching in strings

Built-in Python module -> re

In [2]:
import re

## RegEx Functions

**re.search**(*pattern, string*) 

Returns a Match object for the first location in the string that matches the pattern (None if there is no match).


In [3]:
fruit = "bananas, apples, pineapples, pears, crabapples, mangos"

apples = re.search("apples", fruit)
print(apples)

<re.Match object; span=(9, 15), match='apples'>


In [None]:
fruit[9:15]
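
The Match object carries the position and the matched text, so the slice does not have to be typed by hand. A minimal sketch using the same `fruit` string; the names `m`, `start`, and `end` are illustrative.

```python
import re

fruit = "bananas, apples, pineapples, pears, crabapples, mangos"
m = re.search("apples", fruit)

# span() returns (start, end); start() and end() return each value separately
start, end = m.span()
print(m.start(), m.end())

# group() returns the matched text, which equals the slice of the original string
print(m.group())
print(fruit[start:end] == m.group())
```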

## RegEx Functions

**re.findall**(*pattern, string*) -> Return all non-overlapping matches of pattern in string, as a list of strings or tuples


In [4]:
fruit = "bananas, apples, pineapples, pears, crabapples, mangos"

apples = re.findall("apples", fruit)
print(apples)

['apples', 'apples', 'apples']
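
The "strings or tuples" part depends on whether the pattern contains parentheses (capture groups, which appear again in Example 2). The sketch below is a preview under that assumption, using only literal text inside the groups.

```python
import re

fruit = "bananas, apples, pineapples, pears, crabapples, mangos"

# no group: list of the full matches
no_group = re.findall("pineapples", fruit)

# one group: list of only the text inside the parentheses
one_group = re.findall("(pine)apples", fruit)

# two groups: list of tuples, one item per group
two_groups = re.findall("(pine)(apples)", fruit)

print(no_group, one_group, two_groups)
```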


## RegEx Functions

**re.match()** -> same as search but only matches at the beginning of the string  
**re.finditer()** -> like findall, but returns an iterator of Match objects instead of a list of strings  

**match()** and **search()** return None when there is no match. Can test with bool

In [10]:
fruit = "bananas, apples, pineapples, pears, crabapples, mangos"
print(re.match("(apples)", fruit))

None
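
A short sketch of testing for a match and of looping with finditer(), again on the `fruit` string. The variable names are illustrative.

```python
import re

fruit = "bananas, apples, pineapples, pears, crabapples, mangos"

# match() only succeeds at position 0, so "apples" fails and "bananas" succeeds
starts_with_apples = bool(re.match("apples", fruit))
starts_with_bananas = bool(re.match("bananas", fruit))

# a Match object is truthy and None is falsy, so a plain `if` also works
if re.search("pears", fruit):
    print("found pears")

# finditer() yields Match objects one at a time, so positions are available too
spans = [m.span() for m in re.finditer("apples", fruit)]
print(spans)
```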


## Building a RegEx Pattern
Instead of specifying an exact string to match, we can specify sets of string characters in particular orders that may be acceptable. 

- Use "[ ]" to specify acceptable characters to match


### In a set "[ ]": 
- list characters individually
    - "[amp]" will find "a", "m", or "p"
- use "-" to indicate range of characters
    - "[a-z]" will find any lower case ASCII letter
- use "^" in front of set to negate the set
    - "[^amp]" will find any character that ISN'T "a", "m", or "p"
- use "\" to escape special characters

In [11]:
twodigits = "03, 68, 14, 56, 29, 91"
#find all two digit numbers between 00-59
less_than_60 = re.findall("[0-5][0-9]", twodigits)
less_than_60

['03', '14', '56', '29']
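
Two more set features from the list above, combining ranges and escaping, sketched on a made-up string. The raw-string prefix `r"..."` is an addition here; it keeps Python from interpreting the backslashes before the pattern reaches `re`.

```python
import re

# Illustrative string with letters, digits, brackets, a hyphen, and an underscore
codes = "A1, b2, [c3], d-4, E_5"

# several ranges can share one set: any ASCII letter or digit
alnum = re.findall("[A-Za-z0-9]", codes)

# a backslash makes "]", "[", and "-" literal inside a set
specials = re.findall(r"[\]\[\-]", codes)

print(alnum)
print(specials)
```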

### Allowing multiple characters in the match

- Use '\*', '+', and '?' to specify variable number of instances to match
- \* means 0 or more instances
- \+ means 1 or more instances
- \? means 0 or 1 instances

In [13]:
fruit = "bananas, apples, pineapples, pears, crabapples, mangos"

apples_all = re.findall("[a-z]*apples", fruit)
print(apples_all)

#must have at least 1 letter before apples
apples_other = re.findall("[a-z]+apples", fruit)
print(apples_other)

['apples', 'pineapples', 'crabapples']
['pineapples', 'crabapples']
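
The cell above covers '\*' and '+'. A comparable sketch for '?', using a made-up string of spelling variants:

```python
import re

# Illustrative text with optional-letter spelling variants
text = "mango, mangos, mangoes, color, colour"

# "u?" makes the "u" optional
colors = re.findall("colou?r", text)

# "e?" makes the "e" optional, but the final "s" is still required
plurals = re.findall("mangoe?s", text)

print(colors)
print(plurals)
```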


In [14]:
amp = re.findall("[^amp]", "ample")
print(amp)

['l', 'e']


### Greedy vs Non-Greedy

- "\*", "+", and "?" will match as much text as possible

In [15]:
#find every "a" followed by one or more "n" (the + applies only to the "n", not to the whole "an")
fruit = "bananas, apples, pineapples, pears, crabapples, mangos"
an_fruits = re.findall("an+", fruit)
print("an+ results:", an_fruits)

an+ results: ['an', 'an', 'an']


In [16]:
#try to find three-letter grams starting with 'a'; the greedy + keeps matching letters
a_trigrams = re.findall("a[a-z][a-z]+", fruit)
print("a_trigrams:", a_trigrams)

a_trigrams: ['ananas', 'apples', 'apples', 'ars', 'abapples', 'angos']


### Greedy vs Non-Greedy

- "\*", "+", and "?" will match as much text as possible
    - Greedy
- Use '\*?', '+?', or '??' to match as few characters as possible
    - Non-greedy

In [18]:
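#non-greedy +? stops as soon as one more letter has matched
a_trigrams = re.findall("a[a-z][a-z]+?", fruit)
print("a_trigrams:", a_trigrams)

#{2} is an exact count, so the trailing ? has no effect on it
a_trigrams = re.findall("a[a-z]{2}?", fruit)
print("a_trigrams:", a_trigrams)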

a_trigrams: ['ana', 'app', 'app', 'ars', 'aba', 'ang']
a_trigrams: ['ana', 'app', 'app', 'ars', 'aba', 'ang']
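
The difference is easiest to see on very small inputs. A minimal sketch with made-up strings:

```python
import re

# greedy "+" takes the whole run; non-greedy "+?" takes one character at a time
greedy_plus = re.findall("a+", "aaaa")
lazy_plus = re.findall("a+?", "aaaa")

# greedy "*" takes every "a" after the "b"; non-greedy "*?" is satisfied with none
greedy_star = re.findall("ba*", "baaa")
lazy_star = re.findall("ba*?", "baaa")

print(greedy_plus, lazy_plus)
print(greedy_star, lazy_star)
```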


- Can use '{}' to specify the number of repetitions of the preceding RE
    - {m} exactly *m* copies
    - {m,n} from *m* to *n* copies
    - {,n} from 0 to *n* copies
    - {m,} *m* or more copies
    - {m,n}? from *m* to *n* copies, non-greedy
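
Each form in the list above can be tried on the same input. The ids below are made up for illustration, loosely following the "pt001" style from Example 1.

```python
import re

# Illustrative ids with 1 to 4 digits after the prefix
ids = "pt1, pt22, pt333, pt4444"

exactly_three = re.findall("pt[0-9]{3}", ids)     # {m}
two_to_three = re.findall("pt[0-9]{2,3}", ids)    # {m,n}
at_most_two = re.findall("pt[0-9]{,2}", ids)      # {,n}
at_least_two = re.findall("pt[0-9]{2,}", ids)     # {m,}
lazy_two_three = re.findall("pt[0-9]{2,3}?", ids) # {m,n}? stops at m when it can

for result in (exactly_three, two_to_three, at_most_two, at_least_two, lazy_two_three):
    print(result)
```

Note that nothing in these patterns stops a match from ending in the middle of a longer number, which is why "pt4444" can contribute a shorter match.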

## Other Special Characters

- '.' matches any character except newline (\n)
- '^' matches the start of the string (unless inside a set [], then it negates)
- '$' matches the end of the string (or just before a newline at the very end of the string)
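
A minimal sketch of the three special characters on a made-up line of text; the variable names are illustrative.

```python
import re

line = "apples and pears"

# "." stands for any one character except a newline
p_any_a = re.findall("p.a", line)

# "^" anchors the pattern to the start of the string
starts_with_apples = bool(re.search("^apples", line))
starts_with_pears = bool(re.search("^pears", line))

# "$" anchors the pattern to the end of the string
ends_with_pears = bool(re.search("pears$", line))

print(p_any_a, starts_with_apples, starts_with_pears, ends_with_pears)
```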

## Special Sequences

Shorthand for sets of characters

- \d any decimal digit character (i.e., [0-9] for ASCII text)
- \D any character NOT a decimal digit 
- \s any whitespace character ( \t\n\r\f\v)
- \S NOT whitespace
- \w any word character (alphanumeric and '_', not whitespace) 
- \W NOT word characters
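
These sequences give one possible answer to the earlier question about ids that are not standardized. The ids below are hypothetical, and the patterns are written as raw strings (`r"..."`) so that Python passes the backslashes through to `re` unchanged.

```python
import re

# Hypothetical ids that do not share one fixed layout
ptids = ["pt001", "patient-12", "PT 7", "id_0345x"]

# \d+ grabs the first run of digits wherever it occurs in the id
numbers = [re.search(r"\d+", p).group() for p in ptids]

# \w+ grabs runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", "PT 7, id_0345x")

# \S+ grabs runs of anything that is not whitespace
chunks = re.findall(r"\S+", "a  b\tc\n")

print(numbers)
print(words)
print(chunks)
```

Unlike `ptid[2:]`, the `\d+` pattern does not care how long the prefix is. It does assume every id contains at least one digit; otherwise search() returns None and `.group()` raises an error.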

## Example 2 HTML Tags

HTML code for a hyperlink behind a piece of text.

In [28]:
html = """<a href="https://www.freecodecamp.org/contribute/">The freeCodeCamp Contribution Page</a>"""

#extract url from html
url = re.search("""href="(.+?)">""", html)
print(url.group(0))  #entire match
print(url.group(1))  #text captured by the first ( ) group
print(re.findall("""href="(.+?)">""", html))  #findall returns only the captured group

href="https://www.freecodecamp.org/contribute/">
https://www.freecodecamp.org/contribute/
['https://www.freecodecamp.org/contribute/']
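
The same idea extends to more than one group. The sketch below adds a second pair of parentheses to capture the link text as well; the pattern and the name `link` are assumptions made for illustration, not a general-purpose HTML parser.

```python
import re

html = """<a href="https://www.freecodecamp.org/contribute/">The freeCodeCamp Contribution Page</a>"""

# group 1 captures the url, group 2 captures the text between the tags
link = re.search("""href="(.+?)">(.+?)</a>""", html)

link_url = link.group(1)
link_text = link.group(2)

# groups() returns all captured groups as a tuple
print(link.groups())
```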


In [None]:
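#build the pattern up step by step
url = re.findall("<.+?>", html)       #every tag in the string
url = re.findall("href=.+?>", html)   #href attribute through the closing ">"
url = re.search("href=(.+?)>", html)  #capture the attribute value, quotes included
url.group(0)
url.group(1)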

## Example 3 Parsing Bibliographic Entries From Scratch



- "pubmed-results.txt" contains 2 results from PubMed search
- entries in PubMed format -> tags at beginning of line
- Use "|" for OR matching logic within the RE


In [29]:
#find all keyword and index terms
#tags -> "MH" or "OT"

keywords = []
with open("pubmed-results.txt", encoding='utf8') as fhand:
    for line in fhand:
        if re.match("MH |OT ", line):
            kw = re.search("- (.+)", line)
            keywords.append(kw.group(1))
print(keywords)

['Humans', 'Aged', '*Frailty/diagnosis', 'Machine Learning', 'Algorithms', 'Artificial intelligence', 'Cognitive frailty', 'Elderly', 'Frailty', 'Frailty index', 'Fried frailty phenotype', 'Machine learning', 'Systematic review', 'Humans', 'Aged', '*Artificial Intelligence', 'Frail Elderly', '*Frailty', 'Machine Learning', 'Area Under Curve', 'accuracy', 'aging', 'artificial intelligence', 'biological variability', 'detection', 'diagnosis', 'frail older adult', 'frailty', 'identification', 'older adults', 'review', 'screening', 'sensitivity', 'tool']
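
For readers without "pubmed-results.txt" at hand, the same logic can be tried on an in-memory string. The sample lines below are made up to resemble the tag layout described above and are not taken from the actual file; the `if kw:` guard is an addition for lines that carry a tag but no value.

```python
import re

# Made-up lines laid out like PubMed-format records (tag, padding, "- ", value)
sample = """PMID- 00000000
TI  - An example title
MH  - Humans
MH  - Aged
OT  - Frailty
AU  - Doe J
"""

keywords = []
for line in sample.splitlines():
    # "MH |OT " matches either tag at the start of the line
    if re.match("MH |OT ", line):
        kw = re.search("- (.+)", line)
        if kw:  # skip a tagged line that has no value after "- "
            keywords.append(kw.group(1))

print(keywords)
```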
