# Tutorial: Python Regex (Regular Expressions) for Data Scientists

In this tutorial, we’ll use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. They’re pretty entertaining to read.

In [1]:
import re
import pandas as pd
import email as em

## Introducing Python’s Regex Module


In [2]:
fh = open(r"dataset/test_emails.txt", "r").read()

Now, suppose we want to find out who the emails are from. We could try raw Python on its own:

In [3]:
for line in fh.split("n"):
    if "From:" in line:
        print(line)

der.com>
Message-Id: <200210311310.g9VDANt24674@bloodwork.mr.itd.UM>
From: "Mr. Be
g_715@epatra.com>
Message-Id: <200210312227.g9VMQvDj017948@bluewhale.cs.CU>
From: "PRINCE OBONG ELEME" <obo


But that’s not giving us exactly what we want. The loop splits the text on the letter "n" rather than on the newline character "\n", so each piece is cut off mid-line. We could fix that, but instead, let’s use Python’s re module and do it with regular expressions!

In [4]:
for line in re.findall("From:.*", fh):
    print(line)

From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>


## Common Python Regex Patterns
The pattern we used with re.findall() above contains a fully spelled-out string, "From:". This is useful when we know precisely what we’re looking for, right down to the actual letters and whether they’re upper or lower case. If we don’t know the exact format of the strings we want, we’d be lost. Fortunately, regex has basic patterns that account for this scenario. Let’s look at the ones we use in this tutorial:

- \w matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, but not the dash, -.
- \d matches digits, which means 0-9.
- \s matches whitespace characters, which include the tab, new line, carriage return, and space characters.
- \S matches non-whitespace characters.
- . matches any character except the new line character, \n.

With these regex patterns in hand, you’ll quickly understand our code above as we go on to explain it.
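To see these patterns in action, here is a minimal sketch that runs each one against a short string. The sample text is invented for illustration and is not taken from the corpus.

```python
import re

# Invented sample string containing a space, a tab, and a trailing newline.
sample = "Order_42 shipped\ton 2002-10-31\n"

words = re.findall(r"\w+", sample)       # runs of letters, digits, and underscores
digits = re.findall(r"\d+", sample)      # runs of digits
spaces = re.findall(r"\s", sample)       # individual whitespace characters
non_spaces = re.findall(r"\S+", sample)  # runs of non-whitespace characters
any_chars = re.findall(r".+", sample)    # everything up to, but not including, the newline

print(words)       # ['Order_42', 'shipped', 'on', '2002', '10', '31']
print(digits)      # ['42', '2002', '10', '31']
print(spaces)      # [' ', '\t', ' ', '\n']
print(non_spaces)  # ['Order_42', 'shipped', 'on', '2002-10-31']
print(any_chars)   # ['Order_42 shipped\ton 2002-10-31']
```

Notice that \w+ stops at the dash in the date, while \S+ keeps the date in one piece.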

## Working with Regex Patterns

We might even go further and isolate only the name. Let’s use re.findall() to return a list of lines containing the pattern "From:.*" as we’ve done before. We’ll assign it to the variable match for neatness. Next, we’ll iterate through the list. In each cycle, we’ll execute re.findall again, matching the first quotation mark to pick out just the name:

In [5]:
match = re.findall("From:.*", fh)

for line in match:
    print(re.findall('\".*\"', line))

['"Mr. Ben Suleman"']
['"PRINCE OBONG ELEME"']
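The pattern above works because each line holds a single quoted name. As a hedged aside, here is a sketch of what happens when a line holds more than one quoted string. The trailing "urgent" is invented for illustration; the corpus lines shown above do not contain it.

```python
import re

# Hypothetical line with two quoted strings.
line = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com> "urgent"'

# Greedy: .* runs to the LAST quotation mark on the line.
greedy = re.findall(r'".*"', line)

# Lazy: .*? stops at the NEXT quotation mark.
lazy = re.findall(r'".*?"', line)

# Capture group: return only the text between the quotation marks.
inner = re.findall(r'"([^"]*)"', line)

print(greedy)  # ['"Mr. Ben Suleman" <bensul2004nng@spinfinder.com> "urgent"']
print(lazy)    # ['"Mr. Ben Suleman"', '"urgent"']
print(inner)   # ['Mr. Ben Suleman', 'urgent']
```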


What if we want the email address instead?

In [6]:
match = re.findall("From:.*", fh)

for line in match:
    print(re.findall(r"\w\S*@\w*.\w*", line))

['bensul2004nng@spinfinder.com']
['obong_715@epatra.com']


Looks simple enough, doesn’t it? Only the pattern is different. Let’s walk through it.

Here’s how we match just the front part of the email address:

In [7]:
for line in match:
    print(re.findall(r"\w\S*@", line))

['bensul2004nng@']
['obong_715@']


Now for the pattern behind the @ symbol:

In [8]:
for line in match:
    print(re.findall("@.*", line))

['@spinfinder.com>']
['@epatra.com>']


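If we look at the line closely, we see that each email is encapsulated within angle brackets, < and >. Our pattern, .*, includes the closing bracket, >. Let’s remedy it by adding \w to the end of the pattern, which forces the match to end on a letter, digit, or underscore and so leaves out the closing bracket: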

In [9]:
for line in match:
    print(re.findall(r"@.*\w", line))

['@spinfinder.com']
['@epatra.com']
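One caveat worth sketching: because .* is greedy, "@.*\w" runs to the last word character on the line, so it only trims the bracket when nothing else follows the address. The example below uses a hypothetical line with extra trailing text, and the character class [\w.-]+ is one possible alternative, not the pattern used elsewhere in this tutorial.

```python
import re

# Hypothetical line: the "(resent)" suffix is invented for illustration.
line = 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com> (resent)'

greedy_all = re.findall(r"@.*", line)         # keeps everything after the @
greedy_word = re.findall(r"@.*\w", line)      # ends at the LAST word character on the line
bounded = re.findall(r"@[\w.-]+", line)       # only word characters, dots, and dashes

print(greedy_all)   # ['@epatra.com> (resent)']
print(greedy_word)  # ['@epatra.com> (resent']
print(bounded)      # ['@epatra.com']
```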


## Common Python Regex Functions

re.findall() is undeniably useful, but it’s not the only built-in function that’s available to us in re:

- re.search()
- re.split()
- re.sub()

Let’s look at these one by one before using them to bring some order to our data set.

### re.search()
While re.findall() matches all instances of a pattern in a string and returns them in a list, re.search() matches the first instance of a pattern in a string, and returns it as a re match object.

In [10]:
match = re.search("From:.*", fh)
print(type(match))
print(type(match.group()))
print(match)
print(match.group())

<class 're.Match'>
<class 'str'>
<re.Match object; span=(204, 258), match='From: "Mr. Ben Suleman" <bensul2004nng@spinfinder>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>


Because re.search() returns a match object, we can’t display the name and email address by printing it directly. Instead, we have to call its group() method first. We’ve printed both types in the code above. As we can see, group() returns the matched text as a string.
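A match object carries more than the matched text, and re.search() returns None when nothing matches. The sketch below shows both behaviors on a short invented string (the address and headers are placeholders, not corpus data).

```python
import re

# Invented two-line header for illustration.
text = 'Subject: hello\nFrom: "A" <a@example.com>\n'

found = re.search(r"From:.*", text)
missing = re.search(r"Cc:.*", text)   # there is no Cc: line in the text

print(found.group())                  # From: "A" <a@example.com>
print(found.start(), found.end())     # positions of the match within text
print(text[found.start():found.end()] == found.group())  # True
print(missing)                        # None

# Calling missing.group() would raise AttributeError, so check for None first.
cc_line = missing.group() if missing is not None else None
```

This is why the larger script later in the tutorial checks "is not None" before every call to group().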

### re.split()

Suppose we need a quick way to get the domain name of the email addresses. We could do it with three regex operations, like so:

In [11]:
address = re.findall("From:.*", fh)
for item in address:
    for line in re.findall(r"\w\S*@.*\w", item):
        username, domain_name = re.split("@", line)
        print("{}, {}".format(username, domain_name))

bensul2004nng, spinfinder.com
obong_715, epatra.com
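re.split() accepts any pattern as the separator, plus an optional maxsplit argument. Here is a small sketch of both, using one address from the output above and one invented string.

```python
import re

address = "bensul2004nng@spinfinder.com"

# Split on a literal @, as in the loop above.
at_parts = re.split(r"@", address)

# Split on either @ or a dot by using a character class.
all_parts = re.split(r"[@.]", address)

# maxsplit=1 stops after the first separator (invented input with two @ signs).
first_only = re.split(r"@", "a@b@c", maxsplit=1)

print(at_parts)    # ['bensul2004nng', 'spinfinder.com']
print(all_parts)   # ['bensul2004nng', 'spinfinder', 'com']
print(first_only)  # ['a', 'b@c']
```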


### re.sub()
Another handy re function is re.sub(). As the function name suggests, it substitutes parts of a string. An example:

In [12]:
sender = re.search("From:.*", fh)
address = sender.group()
email = re.sub("From", "Email", address)
print(address)
print(email)

From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
Email: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
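re.sub() can do more than swap one fixed word for another. The sketch below shows three common variations: reusing captured text with \1, collapsing runs of whitespace, and limiting the number of replacements with count. The second and third input strings are invented for illustration.

```python
import re

address = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'

# Keep the text inside the angle brackets, drop the brackets themselves.
no_brackets = re.sub(r"<(.*)>", r"\1", address)

# Collapse any run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "URGENT   BUSINESS\tASSISTANCE")

# Replace only the first occurrence.
first_n = re.sub("n", "N", "spinfinder", count=1)

print(no_brackets)  # From: "Mr. Ben Suleman" bensul2004nng@spinfinder.com
print(tidy)         # URGENT BUSINESS ASSISTANCE
print(first_n)      # spiNfinder
```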


## Regex with Pandas

Now we have the basics of Python regex in hand. But for data tasks, we’re often working with the pandas library rather than raw Python. Let’s take our regex skills to the next level by bringing them into a pandas workflow.

### Sorting Emails with Python Regex and Pandas

Our corpus is a single text file containing thousands of emails. We’ll use regex and pandas to sort the parts of each email into appropriate categories so that the corpus can be more easily read or analyzed.

We’ll sort each email into the following categories:

- sender_email
- sender_name
- recipient_email
- recipient_name
- date_sent
- subject
- email_body

Each of these categories will become a column in our pandas dataframe (i.e., our table). This will make it easier for us to work on and analyze each column individually.

### Preparing the Script

In [13]:
emails = []

fh = open(r"dataset/test_emails.txt", "r").read()

In [14]:
contents = re.split("From r", fh)
contents.pop(0)

''
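To make the split step concrete, here is a minimal sketch on a tiny invented mailbox. It assumes, as the code above does, that every email in the file begins with the separator "From r"; the dates, addresses, and bodies below are placeholders.

```python
import re

# Invented two-email sample that mimics the separator used above.
raw = (
    "From r  Wed Oct 30 2002\n"
    "From: a@x.com\n"
    "\n"
    "body one\n"
    "From r  Thu Oct 31 2002\n"
    "From: b@y.com\n"
    "\n"
    "body two\n"
)

parts = re.split(r"From r", raw)
print(len(parts))       # 3
print(repr(parts[0]))   # '' (nothing comes before the first separator)

# Drop the empty first element, leaving one string per email.
parts.pop(0)
print(len(parts))       # 2
```

The header lines that start with "From:" are left alone, because the separator requires a space and an "r" after "From".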

### Getting Every Name and Address With a For Loop

In [15]:
for item in contents:
    emails_dict = {}

    # Find sender's email address and name.

    # Step 1: find the whole line beginning with "From:".
    sender = re.search(r"From:.*", item)
    
    # Step 2: find the email address and name.
    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None
        
    print("sender type: " + str(type(sender)))
    print("sender.group() type: " + str(type(sender.group())))
    print("sender: " + str(sender))
    print("sender.group(): " + str(sender.group()))
    print()
    
    # Step 3A: assign email address as string to a variable.
    if s_email is not None:
        sender_email = s_email.group()
    else:
        sender_email = None
    # Add email address to dictionary.
    emails_dict["sender_email"] = sender_email
    
    # Step 3B: remove unwanted substrings, assign to variable.
    if s_name is not None:
        sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", s_name.group()))
    else:
        sender_name = None

    # Add sender's name to dictionary.
    emails_dict["sender_name"] = sender_name

    print(sender_email)
    print(sender_name)
    print()

sender type: <class 're.Match'>
sender.group() type: <class 'str'>
sender: <re.Match object; span=(198, 252), match='From: "Mr. Ben Suleman" <bensul2004nng@spinfinder>
sender.group(): From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>

bensul2004nng@spinfinder.com
"Mr. Ben Suleman"

sender type: <class 're.Match'>
sender.group() type: <class 'str'>
sender: <re.Match object; span=(180, 229), match='From: "PRINCE OBONG ELEME" <obong_715@epatra.com>>
sender.group(): From: "PRINCE OBONG ELEME" <obong_715@epatra.com>

obong_715@epatra.com
"PRINCE OBONG ELEME"



Perfect. We’ve isolated the email address and the sender’s name. We’ve also added them to the dictionary, which will come into play soon.
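The search-then-check-for-None steps repeat for every field, so they could be wrapped in a small helper. This is only a sketch: first_match is a hypothetical name introduced here, not a function from re or from the code above, and the sample item is a shortened stand-in for a real email.

```python
import re

def first_match(pattern, text):
    """Return the first match of pattern in text as a string, or None."""
    if text is None:
        return None
    match = re.search(pattern, text)
    return match.group() if match is not None else None

# Shortened stand-in for one email from the corpus.
item = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\nTo: R@M\n'

sender_line = first_match(r"From:.*", item)
sender_email = first_match(r"\w\S*@.*\w", sender_line)

# A text with no From: line yields None at every step instead of an error.
no_sender_line = first_match(r"From:.*", "no headers here")
no_sender_email = first_match(r"\w\S*@.*\w", no_sender_line)

print(sender_line)      # From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
print(sender_email)     # bensul2004nng@spinfinder.com
print(no_sender_email)  # None
```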

Now that we’ve found the sender’s email address and name, we do exactly the same set of steps to acquire the recipient’s email address and name for the dictionary.

In [16]:
recipient = re.search(r"To:.*", item)
recipient

<re.Match object; span=(236, 260), match='To: obong_715@epatra.com'>

### Getting all data

In [17]:
emails = []
fh = open(r"dataset/fradulent_emails.txt", "r").read()
contents = re.split("From r", fh)
contents.pop(0)

for item in contents:
    emails_dict = {}
    
    # Sender
    sender = re.search(r"From:.*", item)

    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None

    if s_email is not None:
        sender_email = s_email.group()
    else:
        sender_email = None

    emails_dict["sender_email"] = sender_email
    
    if s_name is not None:
        sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", s_name.group()))
    else:
        sender_name = None

    emails_dict["sender_name"] = sender_name
    
    print("sender email: " + str(sender_email))
    print("sender name: " + str(sender_name))
    
    # Recipient
    recipient = re.search(r"To: .*", item)
    if recipient is not None:
        r_email = re.search(r"\w\S*@\w*.\w*", recipient.group())
        r_name = re.search(r":.*<", recipient.group())
    else:
        r_email = None
        r_name = None
    
    if r_email is not None:
        recipient_email = r_email.group()
    else:
        recipient_email = None

    emails_dict["recipient_email"] = recipient_email

    if r_name is not None:
        recipient_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", r_name.group()))
    else:
        recipient_name = None

    emails_dict["recipient_name"] = recipient_name
    
    print("recipient email: " + str(recipient_email))
    print("recipient name: " + str(recipient_name))
    
    # Date
    date_field = re.search(r"Date:.*", item)
    
    if date_field is not None:
        date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    else:
        date = None
        
    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None

    emails_dict["date_sent"] = date_sent
        
    print("date sent: "+ str(date_sent))
    
    # Subject
    subject_field = re.search(r"Subject: .*", item)
    
    if subject_field is not None:
        subject = re.sub(r"Subject: ", "", subject_field.group())
    else:
        subject = None
    
    print("subject: "+ str(subject))
    
    # Body
    full_email = em.message_from_string(item)
    body = full_email.get_payload()
    emails_dict["email_body"] = body
    print()
    
    emails.append(emails_dict)

sender email: james_ngola2002@maktoob.com
sender name:  "MR. JAMES NGOLA." 
recipient email: james_ngola2002@maktoob.com
recipient name: None
date sent: 31 Oct 2002
subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP

sender email: bensul2004nng@spinfinder.com
sender name:  "Mr. Ben Suleman" 
recipient email: R@M
recipient name: None
date sent: 31 Oct 2002
subject: URGENT ASSISTANCE /RELATIONSHIP (P)

sender email: obong_715@epatra.com
sender name:  "PRINCE OBONG ELEME" 
recipient email: obong_715@epatra.com
recipient name: None
date sent: 31 Oct 2002
subject: GOOD DAY TO YOU

sender email: obong_715@epatra.com
sender name:  "PRINCE OBONG ELEME" 
recipient email: webmaster@aclweb.org
recipient name: None
date sent: 31 Oct 2002
subject: GOOD DAY TO YOU

sender email: m_abacha03@www.com
sender name:  "Maryam Abacha" 
recipient email: m_abacha03@www.com
recipient name: None
date sent: 1 Nov 2002
subject: I Need Your Assistance.

sender email: davidkuta@postmark.net
sender name:  Kuta Davi
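The body step relies on Python’s built-in email package rather than regex. As a hedged sketch of how get_payload() behaves: for a simple message it returns the body as a string, while for a multipart message it returns a list of sub-messages. The two raw messages and the body_text helper below are invented for illustration.

```python
import email

# Invented single-part message.
raw_plain = "From: a@example.com\nSubject: hi\n\nDear friend,\nplease reply.\n"

# Invented multipart message with two plain-text parts.
raw_multipart = (
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "part one\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "part two\n"
    "--XX--\n"
)

def body_text(raw):
    """Hypothetical helper: always return the body as a single string."""
    msg = email.message_from_string(raw)
    if not msg.is_multipart():
        return msg.get_payload()
    # walk() visits the container and every part; keep only the leaf parts.
    parts = [p.get_payload() for p in msg.walk() if not p.is_multipart()]
    return "\n".join(parts)

plain_body = body_text(raw_plain)
multi_body = body_text(raw_multipart)
is_multi = email.message_from_string(raw_multipart).is_multipart()

print(plain_body)
print(multi_body)
print(is_multi)  # True
```

A helper along these lines would keep list-valued payloads out of the email_body field when an email turns out to be multipart.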

In [18]:
# Print number of dictionaries, and hence, emails, in the list.
print("Number of emails: " + str(len(emails)))

print()

# Print first item in the emails list to see how it looks.
for key, value in emails[0].items():
    print(str(key) + ": " + str(emails[0][key]))

Number of emails: 3977

sender_email: james_ngola2002@maktoob.com
sender_name:  "MR. JAMES NGOLA." 
recipient_email: james_ngola2002@maktoob.com
recipient_name: None
date_sent: 31 Oct 2002
email_body: FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WHILE WE WERE HOLDING MEETING WITH HIS EXCELLENCY OVER THE FINANCIAL RETURNS FROM THE DIAMOND SALES IN THE AREAS CONTROLLED BY (D.R.C.) DEMOCRATIC REPUBLIC OF CONGO FORCES AND THEIR FOREIGN ALLIES ANGOLA AND ZIMBABWE, HAVING RECEIVED THE PREVIOUS DAY (USD$100M) ONE HUNDRED MILLION UNITED STATES DOLLARS, CASH IN THREE DIPLOMATIC BOXES ROUTED THROUGH ZIMBABWE.

MY PURPOSE OF WRITING YOU THIS LETTER IS TO SOLICIT FOR YOUR ASSISTANCE AS TO BE A COV

## Manipulating Data with Pandas

With dictionaries in a list, we’ve made it very easy for the pandas library to do its job. Each key will become a column title, and each dictionary becomes a row, with its values filling the columns.

All we have to do is apply the following code:

In [19]:
emails_df = pd.DataFrame(emails)

With this single line, we turn the emails list of dictionaries into a dataframe using the pandas DataFrame() constructor and assign it to a variable.

That’s it. We now have a pandas dataframe: a neat, clean table containing all the information we’ve extracted from the emails.
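Here is a small sketch of the same conversion on three invented records, showing how keys map to columns and how a None value or a missing key ends up as a missing value in the table.

```python
import pandas as pd

# Invented records; the third one has no "sender_name" key at all.
records = [
    {"sender_email": "a@example.com", "sender_name": "A", "date_sent": "31 Oct 2002"},
    {"sender_email": None, "sender_name": "B", "date_sent": "1 Nov 2002"},
    {"sender_email": "c@example.com", "date_sent": "2 Nov 2002"},
]

df = pd.DataFrame(records)

columns = list(df.columns)
shape = df.shape
missing_counts = df.isna().sum()

print(columns)         # ['sender_email', 'sender_name', 'date_sent']
print(shape)           # (3, 3)
print(missing_counts)  # one missing sender_email, one missing sender_name
```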

Let’s look at the first few rows.

In [20]:
emails_df.head()

| | sender_email | sender_name | recipient_email | recipient_name | date_sent | email_body |
|---|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | james_ngola2002@maktoob.com | | 31 Oct 2002 | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | R@M | | 31 Oct 2002 | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | obong_715@epatra.com | | 31 Oct 2002 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | webmaster@aclweb.org | | 31 Oct 2002 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | m_abacha03@www.com | "Maryam Abacha" | m_abacha03@www.com | | 1 Nov 2002 | Dear sir, \n \nIt is with a heart full of hope... |


In [22]:
emails_df.isna().sum()

sender_email        473
sender_name         837
recipient_email     491
recipient_name     3558
date_sent           614
email_body            0
dtype: int64

We can also find precisely what we want. For instance, we can find all the emails sent from a particular domain name. First, let’s learn a new regex pattern to improve our precision in finding the items we want.

The pipe symbol, |, means "or": it matches the pattern on either side of itself. For instance, a|b matches either a or b.

Now, let’s use | to find all the emails sent from one or another domain name.

In [23]:
emails_df[emails_df["sender_email"].str.contains("epatra|spinfinder",na=False)]

| | sender_email | sender_name | recipient_email | recipient_name | date_sent | email_body |
|---|---|---|---|---|---|---|
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | R@M | | 31 Oct 2002 | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | obong_715@epatra.com | | 31 Oct 2002 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | webmaster@aclweb.org | | 31 Oct 2002 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 567 | kaladah@epatra.com | | webmaster@aclweb.org | | 24 Nov 2003 | MR. KALADA HART\nORIENT BANK (NIG.) LTD.\nIDUM... |
| 584 | rharare1@spinfinder.com | robert harare | R@M | | 01 Dec 2003 | [[Content-Type, Content-Transfer-Encoding]] |
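Because str.contains() treats its argument as a regex by default, the plain pattern "epatra|spinfinder" matches those words anywhere in the address. The sketch below compares it with a stricter pattern anchored to the domain part. The small dataframe is invented, and the stricter pattern is one possible refinement rather than part of the original filter.

```python
import pandas as pd

# Invented addresses; the last one has "epatra" in the username, not the domain.
df = pd.DataFrame({
    "sender_email": [
        "obong_715@epatra.com",
        "rharare1@spinfinder.com",
        None,
        "someone@maktoob.com",
        "epatra_fan@other.com",
    ]
})

# Loose: matches the words anywhere; na=False treats missing values as no match.
loose = df["sender_email"].str.contains(r"epatra|spinfinder", na=False)

# Strict: the words must form the domain right after the @, followed by .com.
# (?:...) groups the alternatives without creating a capture group.
strict = df["sender_email"].str.contains(r"@(?:epatra|spinfinder)\.com$", na=False)

print(loose.tolist())   # [True, True, False, False, True]
print(strict.tolist())  # [True, True, False, False, False]
print(df[strict])
```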
