<div style="text-align: center;">

<h2 style="margin-bottom: 2rem; margin-top: 2rem;">CRMDA Python Workgroup</h2>

<h1 style="margin-bottom: 2rem; margin-top: 2rem;">Regular Expressions</h1>

<h4 style="margin-bottom: 2rem; margin-top: 2rem;">(<em>Automate the Boring Stuff with Python</em> Chapter 7)</h4>

<p style="margin-bottom: 2rem; margin-top: 2rem; text-align: center;">Matt Menzenski</p>

</div>

<img style="margin: 0 auto;" src="http://imgs.xkcd.com/comics/regular_expressions.png">

<div style="margin: 0 auto; max-width: 80%">
<h1>What are Regular Expressions?</h1>

<p>Regular expressions, at their most basic, are sequences of characters that define <strong>search patterns</strong>. If you can define a pattern using a regular expression (or <em>regex</em> for short), you can use it to search a text for character sequences that fit that pattern, and do something with them.</p>

<p>Regular expressions date back to the 1950s and are implemented in many programming languages. Some, like Perl, JavaScript, and Ruby, have them built into the language, while others, like Python, Java, C++, and C (via POSIX), provide them through libraries. In Python, you need to call <code>import re</code> in order to use them, but the <code>re</code> module is included with every Python installation.</p>
</div>

<div style="margin: 0 auto; max-width: 80%">

<p>Regular expressions in Python have a few extra features that aren't found in every other regex implementation, but the vast majority of regular expression syntax is similar, if not identical, across implementations. So if you go on to program in Ruby, or C, or Java, the knowledge from this chapter should transfer over.</p>

</div>

<div style="margin: 0 auto; max-width: 80%">

<h1>When should you use regular expressions?</h1>

<p>Only use regular expressions if you have no other options.</p>

</div>

<img style="margin: 0 auto;" src="http://imgs.xkcd.com/comics/perl_problems.png">

<div style="margin: 0 auto; max-width: 80%">

<blockquote style="margin: 0 auto;">
    <p>Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.</p>
    <footer>Jamie Zawinski (1997)</footer>
</blockquote>

</div>

<div style="margin: 0 auto; max-width: 80%">

<h1>Case Study: Finding Phone Numbers in a String</h1>

<p>The <code>isPhoneNumber</code> function steps through a supplied text character by character and returns <code>False</code> as soon as it finds anything incompatible with the phone number format (e.g., <code>415-555-4242</code>). If every check passes, it returns <code>True</code>.</p>

</div>

In [10]:
def isPhoneNumber(text):
    """Return True if text is a valid phone number, False otherwise."""
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdigit():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdigit():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdigit():
            return False
    return True
    

In [12]:
print('"415-555-4242" is a phone number:')
print(isPhoneNumber('415-555-4242'))

print('\n"moshi-moshi" is a phone number:')
print(isPhoneNumber('moshi-moshi'))

print('\n"415-555-4c42" is a phone number:')
print(isPhoneNumber('415-555-4c42'))

"415-555-4242" is a phone number:
True

"moshi-moshi" is a phone number:
False

"415-555-4c42" is a phone number:
False


In [13]:
message = "Call me at 415-555-1011 tomorrow. 415-555-9999 is my office."

for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done.')

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done.


<div style="margin: 0 auto; max-width: 80%">

<h1>Regular Expression Syntax</h1>

<p><strong>Special characters</strong></p>

<ul>
    <li><code>\</code> &mdash; escape character</li>
    <li><code>.</code> &mdash; match any character except a newline</li>
    <li><code>^</code> &mdash; match beginning of string</li>
    <li><code>$</code> &mdash; match end of string</li>
    <li><code>[5b-d]</code> &mdash; match any of '5', 'b', 'c', 'd'</li>
    <li><code>[^a-c6]</code> &mdash; match any <em>except</em> 'a', 'b', 'c', '6'</li>
    <li><code>R|S</code> &mdash; match regex R or regex S</li>
    <li><code>()</code> &mdash; create a capture group</li>

</ul>

</div>
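
<div style="margin: 0 auto; max-width: 80%">

<p>The special characters above can be tried out one at a time. The following is a minimal sketch using Python's <code>re</code> module; the sample strings and variable names are made up for illustration.</p>

</div>

```python
import re

# Sample strings below are invented purely for illustration.
anyChar = re.findall(r'c.t', 'cat cot c-t ct')        # . stands for exactly one character
startsWith = re.search(r'^Hello', 'Hello, world')     # ^ anchors the match at the start
endsWith = re.search(r'world$', 'Hello, world')       # $ anchors the match at the end
inClass = re.findall(r'[5b-d]', 'abcde 456')          # any of '5', 'b', 'c', 'd'
notInClass = re.findall(r'[^a-c6]', 'abcd56')         # anything except 'a', 'b', 'c', '6'
either = re.findall(r'cat|dog', 'cat, dog, bird')     # alternation: cat or dog
version = re.search(r'([0-9]+)\.([0-9]+)', 'Python 3.11')  # two groups; \. is a literal dot

print(anyChar)                  # ['cat', 'cot', 'c-t']
print(startsWith is not None)   # True
print(endsWith is not None)     # True
print(inClass)                  # ['b', 'c', 'd', '5']
print(notInClass)               # ['d', '5']
print(either)                   # ['cat', 'dog']
print(version.groups())         # ('3', '11')
```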

<div style="margin: 0 auto; max-width: 80%">

<h1>Regular Expression Syntax</h1>

<p><strong>Quantifiers</strong></p>

<ul>
    <li><code>*</code> &mdash; zero or more</li>
    <li><code>*?</code> &mdash; zero or more (non-greedy)</li>
    <li><code>+</code> &mdash; one or more</li>
    <li><code>+?</code> &mdash; one or more (non-greedy)</li>
    <li><code>{3}</code> &mdash; exactly 3 occurrences</li>
    <li><code>{2,4}</code> &mdash; from two to four occurrences</li>
    <li><code>{,4}</code> &mdash; from zero to four occurrences</li>
    <li><code>{4,}</code> &mdash; four or more occurrences</li>
    <li><code>{2,4}?</code> &mdash; from two to four occurrences (non-greedy)</li>
</ul>

<p>A regex will generally try to match as much of a string as possible. "Non-greedy" means matching as little as possible instead.</p>

</div>
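
<div style="margin: 0 auto; max-width: 80%">

<p>The difference between greedy and non-greedy matching is easiest to see side by side. This is a small sketch with invented sample strings, not an example from the book.</p>

</div>

```python
import re

# Invented sample text containing several angle-bracketed tags.
text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as it can, so the match runs to the last '>'.
greedy = re.findall(r'<.+>', text)

# Non-greedy: .+? stops at the first '>' that completes a match.
lazy = re.findall(r'<.+?>', text)

digits = '1234567'
upToFour = re.search(r'\d{2,4}', digits).group()     # greedy: takes four digits
atLeastTwo = re.search(r'\d{2,4}?', digits).group()  # non-greedy: takes only two
triples = re.findall(r'\d{3}', digits)               # exactly three at a time

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']
print(upToFour)    # 1234
print(atLeastTwo)  # 12
print(triples)     # ['123', '456']
```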

<div style="margin: 0 auto; max-width: 80%">

<h1>Regular Expression Syntax</h1>

<p><strong>Escaped characters</strong></p>

<ul>
    <li><code>\A</code> &mdash; start of string</li>
    <li><code>\b</code> &mdash; empty string at word boundary</li>
    <li><code>\B</code> &mdash; empty string not at word boundary</li>
    <li><code>\d</code> &mdash; digit character</li>
    <li><code>\D</code> &mdash; non-digit character</li>
    <li><code>\s</code> &mdash; whitespace (= <code>[ \t\n\r\f\v]</code> for ASCII text)</li>
    <li><code>\S</code> &mdash; non-whitespace</li>
    <li><code>\w</code> &mdash; word character: letter, digit, or underscore (= <code>[0-9a-zA-Z_]</code> for ASCII text)</li>
    <li><code>\W</code> &mdash; non-word character</li>
    <li><code>\Z</code> &mdash; end of string</li>
</ul>

</div>
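
<div style="margin: 0 auto; max-width: 80%">

<p>A few of the escaped characters above in action. This is a sketch with made-up sample strings; the variable names are illustrative only.</p>

</div>

```python
import re

sample = 'Room 101 costs $45 per night'   # invented sample sentence

numbers = re.findall(r'\d+', sample)      # runs of digit characters
words = re.findall(r'\w+', sample)        # runs of word characters ('$' is excluded)
pieces = re.split(r'\s+', 'a  b\tc\nd')   # split on any run of whitespace

# \b requires a word boundary, so the 'cat' inside 'concat' is skipped.
wholeWord = re.findall(r'\bcat\b', 'cat concat cat.')

# \B requires the absence of a boundary, so only the 'cat' inside 'concat' matches.
insideWord = re.findall(r'\Bcat', 'cat concat')

print(numbers)     # ['101', '45']
print(words)       # ['Room', '101', 'costs', '45', 'per', 'night']
print(pieces)      # ['a', 'b', 'c', 'd']
print(wholeWord)   # ['cat', 'cat']
print(insideWord)  # ['cat']
```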

<div style="margin: 0 auto; max-width: 80%">

<h1>Regex Patterns and Raw Strings</h1>

<p>Our phone number pattern can be specified with the regex pattern <code>\d\d\d-\d\d\d-\d\d\d\d</code>: three digits, hyphen, three digits, hyphen, four digits.</p>

<p>We can shorten this to <code>\d{3}-\d{3}-\d{4}</code> using quantifiers.</p>

<p>But there's a problem: in Python strings, the backslash is the <strong>escape character</strong>. It signals that the next character shouldn't be interpreted literally: e.g., the Python string <code>'\n'</code> is <strong>not</strong> a backslash character followed by an <code>n</code> character. It is a single newline character.</p>

</div>
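
<div style="margin: 0 auto; max-width: 80%">

<p>A quick way to see the escape character at work is to compare string lengths. This sketch uses illustrative variable names that are not part of the original notebook.</p>

</div>

```python
newline = '\n'    # normal string: backslash + n collapses into one newline character
rawText = r'\n'   # raw string: the backslash and the n stay as two separate characters

print(len(newline))       # 1
print(len(rawText))       # 2
print(rawText == '\\n')   # True: an escaped backslash spells the same two characters
```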

<div style="margin: 0 auto; max-width: 80%">

<h1>Regex Patterns and Raw Strings</h1>

<p>One way to get around this is to escape each backslash with a second backslash (don't do this): <pre>'\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'</pre></p>

<p>Another way (the most common) is to use a <strong>raw string</strong>. A raw string is prefixed by <code>r</code> (before the opening quote); backslashes in raw strings are just backslashes. <pre>r'\d\d\d-\d\d\d-\d\d\d\d'</pre></p>

</div>
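
<div style="margin: 0 auto; max-width: 80%">

<p>To check that these spellings really describe the same pattern, the sketch below compares the escaped and raw strings and runs both, plus the shorter quantifier version, against the same sample sentence. The variable names are illustrative.</p>

</div>

```python
import re

escaped = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'   # every backslash doubled
raw = r'\d\d\d-\d\d\d-\d\d\d\d'                # raw string: backslashes left alone
short = r'\d{3}-\d{3}-\d{4}'                   # same pattern written with quantifiers

print(escaped == raw)   # True: both spell the identical sequence of characters

text = 'My phone number is 415-555-4242.'
found = [re.search(pattern, text).group() for pattern in (escaped, raw, short)]
print(found)            # ['415-555-4242', '415-555-4242', '415-555-4242']
```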

In [22]:
import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My phone number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


<div style="margin: 0 auto; max-width: 80%">

<h1>Regex Matching</h1>

<ol>
    <li>Import the regex module with <code>import re</code>.</li>
    <li>Create a regex object by calling <code>re.compile()</code> on a raw string.</li>
    <li>Pass the string to be searched into the regex object's <code>.search()</code> method, which returns a match object (or <code>None</code> if nothing matches).</li>
    <li>Call the match object's <code>.group()</code> method to return a string of the matched text.</li>
</ol>

</div>
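
<div style="margin: 0 auto; max-width: 80%">

<p>The four steps above can be wrapped in a small helper that also copes with text containing no phone number, since <code>.search()</code> returns <code>None</code> in that case. This is a sketch: <code>findFirstPhoneNumber</code> is a hypothetical name, and <code>.findall()</code> is shown as one way to collect every match instead of only the first.</p>

</div>

```python
import re                                              # step 1: import the module

phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')       # step 2: compile a raw string


def findFirstPhoneNumber(text):
    """Return the first phone number in text, or None if there is none.

    Hypothetical helper, not part of the original notebook.
    """
    mo = phoneNumRegex.search(text)                    # step 3: search the text
    if mo is None:                                     # no match: nothing to call .group() on
        return None
    return mo.group()                                  # step 4: extract the matched text


message = "Call me at 415-555-1011 tomorrow. 415-555-9999 is my office."

first = findFirstPhoneNumber(message)
missing = findFirstPhoneNumber('moshi-moshi')
everyNumber = phoneNumRegex.findall(message)           # list of all matching strings

print(first)        # 415-555-1011
print(missing)      # None
print(everyNumber)  # ['415-555-1011', '415-555-9999']
```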

<div style="text-align: center; margin: 0 auto; max-width: 80%">

<p style="text-align: center;">Are regular expressions complicated? Yes.</p>

<p style="text-align: center;">Don't worry about memorizing all of the escape sequences and special patterns. You can always look up those details on a case-by-case basis, as you need them. The general patterns are what's important.</p>

</div>

<div style="margin: 0 auto; max-width: 80%">

<h1>Real-life regex usage</h1>

</div>

<div style="margin: 0 auto; max-width: 80%">

<p>Find numerals in a text...</p>

</div>

<div style="margin: 0 auto; max-width: 80%">

<p>...and replace them with their spelled-out equivalents...</p>

</div>

<div style="margin: 0 auto; max-width: 80%">

<p>...and by 'numerals' we mean cardinal numbers (0-9999), ordinal numbers (1st-9999th), decimals (.1, .11, .111), and times (12:34)...</p>

</div>

<div style="margin: 0 auto; max-width: 80%">

<p>...and by 'text' we mean 'Uyghur text'...</p>

</div>

<div style="margin: 0 auto; max-width: 80%">

<p>...and of course I don't speak Uyghur.</p>

</div>

<div style="margin: 0 auto; max-width: 80%">

<h2>Input</h2>

<p><code>Aldinqi esirning 80-Yillirining otturiliridin bashlap, xitayda qiz-Oghul nisbiti arisidiki tengpungsizliq barghanséri éship bériwatqan bolup, 2020-Yilidin kéyin %10 yash erler jora tapalmaydiken, qanche kéyin tughulghan bolsa, bu ehwal shunche éghir bolidiken.</code></p>

<h2>Output</h2>

<p><code>Aldinqi esirning sekseninchi Yillirining otturiliridin bashlap, xitayda qiz-Oghul nisbiti arisidiki tengpungsizliq barghanséri éship bériwatqan bolup, ikki ming yigirminchi Yilidin kéyin %10 yash erler jora tapalmaydiken, qanche kéyin tughulghan bolsa, bu ehwal shunche éghir bolidiken.</code></p>

</div>
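
<div style="margin: 0 auto; max-width: 80%">

<p>One way such a replacement could be wired up is with <code>re.sub()</code> and a replacement function. The sketch below is not the code that produced the output above. It assumes, judging only from the sample, that digits followed by a hyphen and a letter are ordinals, and its lookup table is a hypothetical stand-in holding just the two spellings visible in the sample; a real converter would have to compute them.</p>

</div>

```python
import re

# Hypothetical lookup table: contains only the two spellings that appear in
# the Output sample. A real converter would generate these from the digits.
ORDINALS = {'80': 'sekseninchi', '2020': 'ikki ming yigirminchi'}

# Assumption: digits followed by a hyphen and then a letter mark an ordinal,
# as in "80-Yillirining". The lookahead checks for the letter without consuming it.
ordinalRegex = re.compile(r'\b(\d+)-(?=[A-Za-z])')


def spellOut(mo):
    """Return the spelled-out ordinal, or the original text if it is unknown."""
    digits = mo.group(1)
    if digits in ORDINALS:
        return ORDINALS[digits] + ' '
    return mo.group(0)


# Shortened excerpt of the Input sample.
text = 'Aldinqi esirning 80-Yillirining otturiliridin bashlap, 2020-Yilidin kéyin %10 yash erler'
result = ordinalRegex.sub(spellOut, text)
print(result)
```

<div style="margin: 0 auto; max-width: 80%">

<p>Note that <code>%10</code> is left alone, as in the sample output, because it is not followed by a hyphen and a letter.</p>

</div>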