# Tutorial: Python Regex (Regular Expressions) for Data Scientists

In this tutorial, we’ll use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. They’re pretty entertaining to read.

In [1]:
import re
import pandas as pd

## Introducing Python’s Regex Module


In [4]:
# Read the whole file into one string; the with block closes the file afterwards
with open(r"dataset/test_emails.txt", "r") as f:
    fh = f.read()

Now, suppose we want to find out who the emails are from. We could try raw Python on its own:

In [5]:
for line in fh.split("n"):
    if "From:" in line:
        print(line)

der.com>
Message-Id: <200210311310.g9VDANt24674@bloodwork.mr.itd.UM>
From: "Mr. Be
g_715@epatra.com>
Message-Id: <200210312227.g9VMQvDj017948@bluewhale.cs.CU>
From: "PRINCE OBONG ELEME" <obo


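But that’s not giving us what we want. The loop splits the text on the letter "n" rather than on the newline character "\n", so it prints fragments that happen to contain "From:" instead of whole lines. We could fix that, but instead, let’s use Python’s re module and do it with regular expressions!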

In [6]:
for line in re.findall("From:.*", fh):
    print(line)

From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
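For comparison, the plain-Python loop can be repaired as well. The sketch below assumes the only problem is the separator: `str.splitlines()` breaks the text at real line boundaries, whereas `"n"` is just the letter n. The `sample` string is a hypothetical stand-in for the contents of `fh`, built from the lines shown above.

```python
# Hypothetical stand-in for the contents of `fh`
sample = (
    'Message-Id: <200210311310.g9VDANt24674@bloodwork.mr.itd.UM>\n'
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\n'
    'Message-Id: <200210312227.g9VMQvDj017948@bluewhale.cs.CU>\n'
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>\n'
)

# splitlines() splits at line breaks, so each element is one complete line
from_lines = [line for line in sample.splitlines() if line.startswith("From:")]

for line in from_lines:
    print(line)
```

Both approaches work on this small sample. The regex version becomes more useful once we need only part of each line.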


## Common Python Regex Patterns
The pattern we used with re.findall() above contains a fully spelled-out string, "From:". This is useful when we know precisely what we’re looking for, right down to the actual letters and whether they’re upper or lower case. If we didn’t know the exact format of the strings we want, we’d be lost. Fortunately, regex has basic patterns that account for this scenario. Let’s look at the ones we use in this tutorial:

- \w matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, but not the dash, -.
- \d matches digits, which means 0-9.
- \s matches whitespace characters, which include the tab, new line, carriage return, and space characters.
- \S matches non-whitespace characters.
- . matches any character except the new line character \n.

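With these patterns in hand, the code above is easier to follow: "From:.*" matches the literal text From: followed by .*, which stands for any run of zero or more characters other than the new line. Each match therefore runs from From: to the end of its line.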
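To see each of these patterns in isolation, here is a minimal sketch that runs them against a short string. The `text` value is a hypothetical sample chosen to contain letters, digits, an underscore, a tab, and a dash; it is not part of the email corpus.

```python
import re

# Hypothetical sample string, not taken from the email corpus
text = "user_42 paid\t$7 on-time"

words = re.findall(r"\w+", text)       # runs of letters, digits and underscores
digits = re.findall(r"\d", text)       # individual digits
spaces = re.findall(r"\s", text)       # the two spaces and the tab
chunks = re.findall(r"\S+", text)      # runs of non-whitespace characters
no_newline = re.findall(r".", "a\nb")  # the dot skips the newline

print(words)       # ['user_42', 'paid', '7', 'on', 'time']
print(digits)      # ['4', '2', '7']
print(spaces)      # [' ', '\t', ' ']
print(chunks)      # ['user_42', 'paid', '$7', 'on-time']
print(no_newline)  # ['a', 'b']
```

Note that "on-time" is split in two by \w+ because the dash is not a word character, while \S+ keeps it whole.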

## Working with Regex Patterns

We can go further and isolate only the name. As before, we use re.findall() to return a list of the lines that match "From:.*" and assign it to the variable match for neatness. Next, we iterate through the list. In each cycle, we call re.findall() again with a pattern that matches everything from the first quotation mark to the last one, which picks out just the name:

In [7]:
match = re.findall("From:.*", fh)

for line in match:
    print(re.findall(r'".*"', line))

['"Mr. Ben Suleman"']
['"PRINCE OBONG ELEME"']
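The results above still include the surrounding quotation marks. One possible way to drop them, sketched here as an illustration rather than as part of the original walkthrough, is to wrap the part we want in parentheses. When a pattern contains a single capturing group, re.findall() returns only the text matched by that group. The `from_lines` list is a hypothetical stand-in for `match`.

```python
import re

# Hypothetical stand-in for the `match` list built from the email file
from_lines = [
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
]

# The parentheses capture only what sits between the quotation marks
names = [re.findall(r'"(.*)"', line) for line in from_lines]

for name in names:
    print(name)
```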


What if we want the email address instead?

In [13]:
match = re.findall("From:.*", fh)

for line in match:
    # Raw string keeps the backslashes intact; \. matches a literal dot
    print(re.findall(r"\w\S*@\w*\.\w*", line))

['bensul2004nng@spinfinder.com']
['obong_715@epatra.com']


Looks simple enough, doesn’t it? Only the pattern is different. Let’s walk through it.

Here’s how we match just the front part of the email address:

In [14]:
for line in match:
    print(re.findall(r"\w\S*@", line))

['bensul2004nng@']
['obong_715@']


Now for the pattern after the @ symbol:

In [15]:
for line in match:
    print(re.findall("@.*", line))

['@spinfinder.com>']
['@epatra.com>']


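If we look at the line closely, we see that each email address is enclosed in angle brackets, < and >. Because .* matches every character up to the end of the line, our pattern also picks up the closing bracket, >. Appending \w forces the match to end on an alphanumeric character, which drops the bracket: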

In [16]:
for line in match:
    print(re.findall(r"@.*\w", line))

['@spinfinder.com']
['@epatra.com']
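The two halves can be joined into a single pattern. The sketch below is an illustration assembled from the pieces above, not necessarily the exact pattern used earlier, and `line` is a hypothetical stand-in for one element of `match`. It also shows the effect of the trailing \w side by side.

```python
import re

# Hypothetical stand-in for one element of `match`
line = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'

greedy = re.findall(r"@.*", line)        # .* runs to the end of the line
trimmed = re.findall(r"@.*\w", line)     # the match must end on a word character
full = re.findall(r"\w\S*@.*\w", line)   # front part and back part combined

print(greedy)   # ['@spinfinder.com>']
print(trimmed)  # ['@spinfinder.com']
print(full)     # ['bensul2004nng@spinfinder.com']
```

This works here because the address is the last thing on the line. If more text followed the closing bracket, .* would reach into it, so a stricter pattern would be needed.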
