# DSC200 - Lecture 5

## Regular Expressions

String manipulation is a common problem when preparing data for analysis.

We previously came across the `"".index()` method, and there is also a `"".split()` method, among others. But what if we want to find more complex patterns?

One example might be searching for phone numbers (e.g. 999-999-9999) in a text database. What other string patterns might we want to find?

**class[buzz]: `lec5-1`**

Regular expressions are text matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation.

Regular expressions can include a variety of rules, from finding repetition to matching specific text, and much more.

As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).

If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using Python's `re` module for this lecture.

## Searching for Patterns in Text

One of the most common uses for the `re` module is for finding patterns in text.

Let's do a quick example of using the `search` method in the `re` module to find some text:

In [1]:
import re

pattern = 'term2'

text = 'This is a string with term1, but it does not have the other term.'

print(re.search(pattern,  text) is not None)

False


So, we've seen that `re.search()` takes the pattern and the text to scan, and returns **None** if no match is found.

But what if a pattern is found?

In [2]:
pattern = 'term1'

text = 'This is a string with term1, but it does not have the other term.'

match = re.search(pattern,  text)
match

<re.Match object; span=(22, 27), match='term1'>

The **Match** object returned by the `search()` method is more than just a **Boolean** or **None**. It contains information about the match, including the original input string, the regular expression that was used, and the location of the match.

Let's see the methods we can use on the match object:

In [3]:
# Index of start of match
match.start()

22

In [4]:
# Index of the end
match.end()

27
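
Beyond `start()` and `end()`, the match object exposes the other pieces of information mentioned above. The cell below is a small sketch (not from the original notebook) that reuses the same `text` and pattern. It checks the result before using it, since `re.search()` returns **None** when nothing matches.

```python
import re

text = 'This is a string with term1, but it does not have the other term.'
match = re.search('term1', text)

# re.search() returns None when nothing matches, so check before using the result
if match:
    print(match.group())         # the text that matched: term1
    print(match.span())          # (start, end) as one tuple: (22, 27)
    print(match.re.pattern)      # the pattern that was used: term1
    print(match.string == text)  # the original input string: True
```

Note that `text[match.start():match.end()]` gives the same string as `match.group()`.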

## Split with regular expressions

Let's see how we can split with the `re` syntax. This should look similar to how you used the `split()` method with strings.

In [5]:
# Term to split on
split_term = '@'

phrase = 'What is the domain name of someone with the email: hello@gmail.com'

# Split the phrase
re.split(split_term, phrase)

['What is the domain name of someone with the email: hello', 'gmail.com']

Note how `re.split()` returns a list of the pieces of the string, with the term we split on removed.

Create a couple more examples for yourself to make sure you understand!
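
Here are a few more to get you started. This is a sketch with made-up strings. The last example uses `+` (one or more repetitions), which is covered in the repetition section later in this lecture, to show where `re.split()` goes beyond the plain string `split()` method.

```python
import re

# Every occurrence of the split term produces a new piece
parts = re.split('@', 'a@b@c')
print(parts)        # ['a', 'b', 'c']

# maxsplit limits how many splits are made
first_only = re.split('@', 'a@b@c', maxsplit=1)
print(first_only)   # ['a', 'b@c']

# Because the split term is a pattern, ' +' splits on runs of one or more spaces
words = re.split(' +', 'too    many   spaces')
print(words)        # ['too', 'many', 'spaces']
```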

## Finding all instances of a pattern

You can use `re.findall()` to find all the instances of a pattern in a string. For example:

In [6]:
# Returns a list of all matches
re.findall('match','test phrase match is in middle')

['match']
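
The example above only has one match, so here is a small sketch with a made-up phrase. It shows that `re.findall()` returns every non-overlapping match, and an empty list (rather than **None**) when there is none.

```python
import re

phrase = 'match one, match two, and no more'

found = re.findall('match', phrase)
print(found)        # ['match', 'match']
print(len(found))   # 2, a quick way to count occurrences

missing = re.findall('absent', phrase)
print(missing)      # []
```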

## `re` Pattern Syntax

This will be the bulk of this lecture on using `re` with Python. Regular expressions support a huge variety of patterns, not just simply finding where a single string occurred.

We can use *metacharacters* along with `re` to find specific types of patterns.

### Repetition Syntax

There are five ways to express repetition in a pattern:

    1.) A pattern followed by the meta-character * is repeated zero or more times.
    2.) Replace the * with + and the pattern must appear at least once.
    3.) Using ? means the pattern appears zero or one time.
    4.) For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.
    5.) Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.


Now let's see an example of each of these:

In [7]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

print(re.findall('sd*', # s followed by zero or more d's
                 test_phrase))

['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


In [8]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

print(re.findall('sd+', # s followed by one or more d's
                 test_phrase))

['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


How many matches will this find?

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head"></th>
<th class="head">Number of matches</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">A</span></tt></td>
<td>0</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">B</span></tt></td>
<td>5</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">C</span></tt></td>
<td>7</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">E</span></tt></td>
<td>I don't know</td>
</tr>
</tbody>
</table>



**class[buzz]: `lec5-2`**

In [9]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

print(re.findall('sd?',          # s followed by zero or one d's
                 test_phrase))

['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


In [10]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

print(re.findall('sd{3}',        # s followed by three d's
                 test_phrase))

['sddd', 'sddd', 'sddd', 'sddd']


In [11]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd+'

print(re.findall(r'd\+',     # a literal d followed by a literal + (the backslash escapes the metacharacter)
                 test_phrase))

['d+']
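
The `{m,n}` form from the list of repetition rules can be sketched the same way. The phrase below is made up for illustration (it is not the `test_phrase` used above), so that runs of one, two, three and four d's each appear once.

```python
import re

demo_phrase = 'sd sdd sddd sdddd'

# s followed by two to three d's
two_to_three = re.findall('sd{2,3}', demo_phrase)
print(two_to_three)   # ['sdd', 'sddd', 'sddd']

# s followed by at least two d's, with no maximum
two_or_more = re.findall('sd{2,}', demo_phrase)
print(two_or_more)    # ['sdd', 'sddd', 'sdddd']
```

Notice that `sd{2,3}` still matches inside `sdddd`, but it stops after three d's.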




## Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input.

Brackets are used to construct character set inputs. For example: the input `[ab]` searches for occurrences of either a or b.

Let's see some examples.

In [12]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

print(re.findall('[sd]',        # either s or d
                 test_phrase))

['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


In [13]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

print(re.findall('s[sd]+',        # s followed by one or more s or d
                 test_phrase))

['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']


## Exclusion

We can use `^` to exclude terms by incorporating it into the bracket syntax notation.

For example: `[^...]` will match any single character not in the brackets.

Let's see some more examples:

In [14]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

Use `[^!.? ]` to match any single character that is not a `!`, `.`, `?`, or space.

Add the `+` to require one or more such characters in a row. This basically translates into finding the words.

In [15]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']
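
To answer the question in the test phrase, one possible approach (a sketch, not the only way) is to join the words that `re.findall()` returned back together with single spaces:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Keep only the runs of characters that are not !, ., ? or a space
words = re.findall('[^!.? ]+', test_phrase)

# Glue the words back together, now without the punctuation
cleaned = ' '.join(words)
print(cleaned)   # This is a string But it has punctuation How can we remove it
```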

## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious.

A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is `[start-end]`.

A common use case is to search for a specific range of letters in the alphabet. For example, `[a-f]` matches any single letter between a and f.

In [16]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

print(re.findall('[a-z]+',      # sequences of lower case letters
                 test_phrase))

['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


In [17]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

print(re.findall('[A-Z]+',      # sequences of upper case letters
                 test_phrase))

['T', 'L']


In [18]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

print(re.findall('[a-zA-Z]+',   # sequences of lower or upper case letters
                 test_phrase))

['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


In [19]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

print(re.findall('[A-Z][a-z]+',  # one upper case letter followed by lower case letters
                 test_phrase))

['This', 'Lets']
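
Ranges are not limited to the whole alphabet. The sketch below uses made-up strings to try the `[a-f]` range mentioned above, and a digit range `[0-9]`.

```python
import re

# Only the letters a through f match; g and h are outside the range
a_to_f = re.findall('[a-f]', 'abcdefgh')
print(a_to_f)    # ['a', 'b', 'c', 'd', 'e', 'f']

# Digits form a range too: sequences of one or more digits
numbers = re.findall('[0-9]+', 'Room 101 is on floor 3 of building F.')
print(numbers)   # ['101', '3']
```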


## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric (anything <tt class="docutils literal"><span class="pre">\w</span></tt> does not match)</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash (`\`). Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read.

Writing regular expressions as raw strings, created by prefixing the string literal with `r`, eliminates this problem and keeps the patterns readable.
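
A quick sketch of the difference, using a made-up test string: the normal string needs a doubled backslash, while the raw string is written exactly as the pattern reads. Both produce the same three characters.

```python
import re

normal = '\\d+'   # normal string: the backslash itself has to be escaped
raw = r'\d+'      # raw string: backslashes are kept as written

print(normal == raw)   # True, they are the same pattern
print(len(raw))        # 3 characters: backslash, d, +

digits = re.findall(raw, 'abc 123 def 45')
print(digits)          # ['123', '45']
```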

These symbols, and especially the escaping, can be quite off-putting if you're not familiar with regex.

Hopefully after seeing these examples this syntax will become clear.

In [20]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

print(re.findall(r'\d+', # sequence of digits
                 test_phrase))

['1233']


In [21]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

print(re.findall(r'\w+', # alphanumeric characters
                 test_phrase))

['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']
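
We can now come back to the phone number example from the start of the lecture. The sketch below combines `\d` with the `{m}` repetition syntax. The text and numbers are made up, and the pattern assumes the 999-999-9999 format only, so other ways of writing a phone number would not be found.

```python
import re

text = 'Call 999-999-9999 or 123-456-7890; extension 42 is not a phone number.'

# three digits, a dash, three digits, a dash, four digits
phone_pattern = r'\d{3}-\d{3}-\d{4}'

phones = re.findall(phone_pattern, text)
print(phones)   # ['999-999-9999', '123-456-7890']
```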


## Conclusion

You should now have a solid understanding of how to use the regular expression module in Python. There are many more special characters and patterns, but it would be unreasonable to go through every single use case.

Instead, take a look at the full [documentation](https://docs.python.org/3/library/re.html) if you ever need to look up a particular case. ChatGPT is also pretty good at this...!

On Friday, numpy.

# Backup slides

## You can also indicate the start/end of a line

Two ‘positional’ matches:
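 - $ : matches at the end of the string (or just before a trailing newline); with the `re.MULTILINE` flag it also matches right before each newline
 - ^ : matches at the start of the string; with the `re.MULTILINE` flag it also matches right after each newline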

In [22]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

print(re.findall(r'\w+$', # alphanumeric characters at the end of the string
                 test_phrase))

['hashtag']
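
The same idea works for the start of the string. In the sketch below, `two_lines` is a made-up string. By default `^` only matches at the very start of the string; passing the `re.MULTILINE` flag makes it also match right after each newline.

```python
import re

test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

# alphanumeric characters at the start of the string
first_word = re.findall(r'^\w+', test_phrase)
print(first_word)      # ['This']

two_lines = 'first line\nsecond line'

# Default behaviour: only the start of the whole string counts
default_mode = re.findall(r'^\w+', two_lines)
print(default_mode)    # ['first']

# With re.MULTILINE the start of every line counts
multiline_mode = re.findall(r'^\w+', two_lines, re.MULTILINE)
print(multiline_mode)  # ['first', 'second']
```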
