### Python Regex (Regular Expressions)

This notebook explains how regular expressions work in Python and how to use the basic patterns and functions in Python’s regex module, re, to analyze text strings.

In [1]:
# The r prefix makes the path a raw string, so backslashes
# (as in Windows directory paths) are not treated as escape characters.
with open(r'Data/test_emails.txt', 'r') as f:
    fh = f.read()

In [2]:
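# Find out who the emails are from. We could try plain Python on its own.
# Note: "n" below is the literal letter n, not the newline character "\n",
# so the text is cut at every n and the output comes out in fragments.

for line in fh.split("n"):
    if "From" in line:
        print(line)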

From r  Thu Oct 31 08:11:39 2002
Retur
der.com>
Message-Id: <200210311310.g9VDANt24674@bloodwork.mr.itd.UM>
From: "Mr. Be


From r  Thu Oct 31 17:27:16 2002
Retur
g_715@epatra.com>
Message-Id: <200210312227.g9VMQvDj017948@bluewhale.cs.CU>
From: "PRINCE OBONG ELEME" <obo
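The fragments above come from cutting the text at the letter n. As a minimal sketch of line-by-line filtering in plain Python, the example below uses `str.splitlines()` on a made-up string (`sample_text` and the `example.com` address are hypothetical, not the contents of test_emails.txt):

```python
# Hypothetical sample text standing in for the contents of the email file.
sample_text = (
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Return-Path: <sender@example.com>\n"
    'From: "Sender Name" <sender@example.com>\n'
    "Subject: Hello\n"
)

# str.splitlines() splits on line boundaries, so each header line stays intact.
from_lines = [line for line in sample_text.splitlines() if line.startswith("From:")]

for line in from_lines:
    print(line)
```

This keeps whole lines together, but every new field still needs its own hand-written string check, which is where regular expressions become more convenient.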


### Python’s 're' module

In [3]:
# Import Python’s re module, then use re.findall(), which returns a list
# of all matches of the pattern we define in the string we’re looking at.

import re

# re.findall(pattern, string)
for line in re.findall("From:.*", fh): # .* matches any run of characters up to the end of the line
    print(line)

From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>


### Common Python Regex Patterns

\w : alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, but not the dash, -. (For str patterns, Python 3 also matches Unicode letters and digits.)

\d : digits, which means 0-9.

\s : whitespace characters, which include the tab, new line, carriage return, and space characters.

\S : non-whitespace characters.

. : any character except the new line character \n.

\* : zero or more instances of a pattern on its left. This means it looks for repeating patterns.

\+ : one or more instances of a pattern on its left.
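To see these patterns in action, here is a small sketch on made-up strings (the `sample` text is hypothetical and chosen only to contain letters, digits, an underscore, dashes, and spaces):

```python
import re

# Hypothetical sample string.
sample = "Order 66 shipped_on 2002-10-31"

words = re.findall(r"\w+", sample)    # runs of word characters; the dash is not included
numbers = re.findall(r"\d+", sample)  # runs of digits
chunks = re.findall(r"\S+", sample)   # runs of non-whitespace characters
spaces = re.findall(r"\s", sample)    # each single whitespace character

# * allows zero repetitions of "b"; + requires at least one.
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")

# . matches any character except the newline.
dots = re.findall(r".", "a\nb")

print(words, numbers, chunks, spaces, star, plus, dots, sep="\n")
```

Writing the patterns as raw strings (the r prefix) keeps the backslashes from being interpreted by Python before the regex engine sees them.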

In [4]:
# pick out just the name between the quotation marks:

match = re.findall("From:.*", fh)

for line in match:
    print(re.findall('\".*\"', line)) # \" is used to escape quotation marks
    
# The name is also printed within square brackets because re.findall returns matches in a list.

['"Mr. Ben Suleman"']
['"PRINCE OBONG ELEME"']


In [5]:
# pick out just the email address:

for line in match:
    print(re.findall(r"\w\S*@.*\w", line)) # raw string, so the backslashes reach the regex engine unchanged

['bensul2004nng@spinfinder.com']
['obong_715@epatra.com']



##### Let's analyze the pattern ("\w\S*@.*\w") used to find the email address:

First part (\w\S*@) covers everything up to the @ character:
The part of the email before the @ symbol might contain alphanumeric characters, so \w is required. However, because some emails contain a period or a dash, that’s not enough. We add \S to look for non-whitespace characters. But \w\S matches only two characters, so we add * to repeat \S.

Second part (@.*\w) is the domain name:
It usually contains alphanumeric characters, periods, and sometimes a dash, so a . will do. We extend the search with a *, which lets us match any character up to the end of the line. After the @ symbol we have .*\w, which means that the pattern we want is a group of any characters ending with an alphanumeric character. This excludes the closing >.
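The build-up described above can be sketched step by step. The `line` value is one of the "From" lines printed earlier in this notebook; the `step1` to `step4` names are only illustrative:

```python
import re

# One of the "From" lines printed earlier in this notebook.
line = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'

step1 = re.findall(r"\w\S@", line)       # one word character and one non-space character before @
step2 = re.findall(r"\w\S*@", line)      # * repeats \S, so the whole part before @ is captured
step3 = re.findall(r"\w\S*@.*", line)    # .* runs to the end of the line, including the >
step4 = re.findall(r"\w\S*@.*\w", line)  # the trailing \w drops the closing >

for step in (step1, step2, step3, step4):
    print(step)
```

Note that this pattern is tuned to header lines of this shape; it is a convenience for this dataset rather than a general email validator.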



#### re.search function

re.search() finds the first instance of a pattern in a string and returns it as a re match object (or None if there is no match).
The group() method returns the matched text as a string.

In [7]:
match = re.search("Subject:.*", fh)

print(type(match))

# group() returns the matched text as a string.
print(type(match.group()))
print(match)
print(match.group())

<class 're.Match'>
<class 'str'>
<re.Match object; span=(299, 343), match='Subject: URGENT ASSISTANCE /RELATIONSHIP (P)'>
Subject: URGENT ASSISTANCE /RELATIONSHIP (P)
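Because re.search() returns None when nothing matches, calling group() directly on the result can raise an AttributeError. A minimal sketch of guarding against that, using a made-up header string (`text` and the `example.com` address are hypothetical):

```python
import re

# Hypothetical header text; a "Cc:" line is deliberately absent.
text = "From: a@example.com\nSubject: Hello there\n"

found = re.search(r"Subject:.*", text)
missing = re.search(r"Cc:.*", text)

# Only call group() when a match object was actually returned.
subject_line = found.group() if found is not None else None
cc_line = missing.group() if missing is not None else None

print(subject_line)
print(cc_line)
print(found.span())  # start and end offsets of the match within text
```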


#### re.split() function

re.split() splits a string at every match of the delimiter pattern and returns the pieces as a list.

In [8]:
from_section = re.findall("From:.*", fh)

for item in from_section:
    for email in re.findall(r"\w\S*@.*\w", item):
        username, domainname = re.split('@', email)
        print("username: {}, domain name: {}".format(username, domainname))
    


username: bensul2004nng, domain name: spinfinder.com
username: obong_715, domain name: epatra.com
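Since the delimiter is itself a regex, re.split() can handle separators that a plain str.split() cannot. A short sketch with made-up addresses (all names and addresses here are hypothetical):

```python
import re

# Hypothetical recipient list with inconsistent separators.
recipients = "a@example.com; b@example.org,c@example.net"

# Split on a comma or semicolon, followed by optional whitespace.
addresses = re.split(r"[;,]\s*", recipients)

# maxsplit=1 splits only at the first @ and leaves the rest untouched.
username, domain = re.split("@", "user@mail.example.com", maxsplit=1)

print(addresses)
print(username, domain)
```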


#### re.sub() function
re.sub(pattern, repl, string) replaces every match of pattern in string with repl.


In [9]:
from_section = re.search('From:.*\"', fh).group()

email_sender = re.sub('From','Sender', from_section)

print(from_section)
print(email_sender)


From: "Mr. Ben Suleman"
Sender: "Mr. Ben Suleman"
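By default re.sub() replaces every match, which can touch more text than intended. The sketch below, on a made-up message (`text` is hypothetical), shows two ways to narrow the substitution: the `count` argument and a line-start anchor with `re.MULTILINE`:

```python
import re

# Hypothetical text where "From" appears both in headers and in the body.
text = 'From: "Ann"\nRegards From Ann\nFrom: "Bob"\n'

# Every occurrence is replaced, including the one in the body.
everywhere = re.sub("From", "Sender", text)

# count=1 replaces only the first occurrence.
first_only = re.sub("From", "Sender", text, count=1)

# ^ with re.MULTILINE matches at the start of each line, so only headers change.
headers_only = re.sub(r"^From:", "Sender:", text, flags=re.MULTILINE)

print(everywhere, first_only, headers_only, sep="---\n")
```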


## Regex with Pandas

In [10]:
import re
import pandas as pd
import email

emails = []

fex = open(r'Data/fradulent_emails.txt', 'r').read()

In [11]:
# Since each email in the file starts with 'From r', we split the text on 'From r'.
# Then we get rid of the first, empty string in the contents list.
contents = re.split(r'From r', fex)
print(type(contents))
print(len(contents))
contents.pop(0)

<class 'list'>
3978


''
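The leading empty string appears because the text starts with the delimiter itself, so there is nothing before the first match. A tiny sketch with a made-up string (`raw` is hypothetical):

```python
import re

# Hypothetical text that begins with the delimiter, like the email file does.
raw = "From r one\nFrom r two\n"

parts = re.split(r"From r", raw)

# The first piece is the (empty) text before the first delimiter.
messages = [part for part in parts if part]

print(parts)
print(len(messages))
```

Filtering out empty pieces, as in the last step, is an alternative to `contents.pop(0)` that also works when the text does not start with the delimiter.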

In [12]:
for item in contents:

    emails_dict ={} # holds the info about emails, later we turn it to pandas dataframe.

    # Step 1: find the From section
    from_section = re.search(r'From:.*', item)
    
    # Step 2: Check if from section exists. If exists, get the name and email, if not assign the variables to None to prevent errors.
    if from_section is not None:
        
        sender_email = re.search(r'\w\S*@.*\w', from_section.group())
        sender_name = re.search(r':.*<', from_section.group())
    else:
        sender_email = None
        sender_name = None
        
    # Step 3A: assign the sender's email as a string to a variable.
        # Don't forget that re.search() returns a match object,
        # so we still need to check whether the email exists.
    
    if sender_email is not None:
        sender_email = sender_email.group()
    else:
        sender_email = None
    
    # add sender's email to the dictionary
    
    emails_dict["sender_email"] = sender_email
    
    # Step 3B: handle the sender's name in much the same way as the sender's email.
        # If the sender's name isn't None, first get rid of the ':' and the whitespace in front of the name,
        # and then get rid of the whitespace and '<' after the name.
    if sender_name is not None:
        
        sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", sender_name.group()))
    else:
        sender_name = None
    
    # add sender's name to the dictionary
    
    emails_dict["sender_name"] = sender_name
    
    # we do exactly the same set of steps to acquire the recipient’s email address and name for the dictionary.
    
    # Step 1: find the To section
    to_section = re.search(r'To:.*', item)
    
    # Step 2: Check if to section exists. If exists, get the name and email, if not assign the variables to None to prevent errors.
    if to_section is not None:
        
        recipient_email = re.search(r'\w\S*@.*\w', to_section.group())
        recipient_name = re.search(r':.*<', to_section.group())
    else:
        recipient_email = None
        recipient_name = None
        
    # Step 3A: assign the recipient's email as a string to a variable.
        # Don't forget that re.search() returns a match object,
        # so we still need to check whether the email exists.
    
    if recipient_email is not None:
        recipient_email = recipient_email.group()
    else:
        recipient_email = None
    
    # add recipient's email to the dictionary
    
    emails_dict["recipient_email"] = recipient_email
    
    # Step 3B: handle the recipient's name in much the same way as the recipient's email.
        # If the recipient's name isn't None, first get rid of the ':' and the whitespace in front of the name,
        # and then get rid of the whitespace and '<' after the name.
    if recipient_name is not None:
        
        recipient_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", recipient_name.group()))
    else:
        recipient_name = None
    
    # add recipient's name to the dictionary
    
    emails_dict["recipient_name"] = recipient_name

    # The Date of the email
        
    # Step 1: find the date section
    date_section = re.search(r'Date:.*', item)
    
    
    # Step 2: Check if date section exists. If exists, get only the DD MMM YYYY part.
        
    if date_section is not None:
        
        date = re.search(r"\d+\s\w+\s\d+", date_section.group())
    else:
        date = None
    
    # Step 3: check if the date exists, and then assign it as string to a variable.
    
    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None
    
    # add date to the dictionary
    
    emails_dict["date_sent"] = date_sent
       
    
    # similar steps for the email's subject
    
    # Step 1: find the Subject section
    subject_section = re.search(r'Subject:.*', item)
    
    # Step 2: Check if subject section exists. If exists, get rid of the 'Subject: ' and 
        # get only the subject itself.
        
    if subject_section is not None:
        
        subject = re.sub(r"Subject: ", "", subject_section.group())
    else:
        subject = None
    
    # add subject to the dictionary
    
    emails_dict["subject"] = subject
    
    
    # body of the e-mail
    
    # turn the string to email message
    full_email = email.message_from_string(item)
    
    # get the body of the email
    body = full_email.get_payload()
    
    # add the body to the dictionary
    emails_dict["email_body"] = 'email body' # placeholder used instead of the body variable, to avoid storing every full body
    
    # Finally append the dictionary to the emails list.
    emails.append(emails_dict)
    
    

In [13]:
print("Number of emails: "+ str(len(emails)))

Number of emails: 3977
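The loop above repeats the same "search, then check for None, then call group()" steps for every field. One way to shorten it, shown here only as a sketch, is a small helper. The name `first_match` and the sample `item` are hypothetical and not part of the original notebook:

```python
import re

def first_match(pattern, text):
    """Return the text of the first match of pattern in text, or None."""
    if text is None:
        return None
    match = re.search(pattern, text)
    return match.group() if match is not None else None

# Hypothetical email chunk with a From and a Date header but no To header.
item = 'From: "Ann Lee" <ann@example.com>\nDate: Thu, 31 Oct 2002 02:38:20 +0000\n'

from_line = first_match(r"From:.*", item)
sender_email = first_match(r"\w\S*@.*\w", from_line)
date_sent = first_match(r"\d+\s\w+\s\d+", first_match(r"Date:.*", item))

# A missing section propagates as None instead of raising an error.
to_line = first_match(r"To:.*", item)
recipient_email = first_match(r"\w\S*@.*\w", to_line)

print(sender_email, date_sent, recipient_email)
```

Letting the helper accept None as input means the two-stage lookups (section first, then field) can be chained without extra if/else blocks.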


In [14]:
# print a sample item 

for key, value in emails[1957].items():
    print(f"{key}: {value}")

sender_email: vivianmutan11@yahoo.com
sender_name:  vivian mutan 
recipient_email: R@M
recipient_name: None
date_sent: 17 Nov 2005
subject: From vivian mutan
email_body: email body
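The body itself comes from Python's standard email package rather than from a regex. A minimal sketch of what `email.message_from_string()` and `get_payload()` return for a simple, non-multipart message (the `raw_message` text is made up for illustration):

```python
import email

# Hypothetical plain-text message: headers, a blank line, then the body.
raw_message = (
    "From: ann@example.com\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
)

message = email.message_from_string(raw_message)

subject = message["Subject"]   # headers can be looked up by name
body = message.get_payload()   # for a non-multipart message this is the body string

print(subject)
print(body)
```

For multipart messages, get_payload() returns a list of sub-messages instead of a string, so real-world data may need an `is_multipart()` check first.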


### Manipulating Data with Pandas

In [15]:
# turn the list into a pandas DataFrame

emails_df = pd.DataFrame(emails)
emails_df

| | sender_email | sender_name | recipient_email | recipient_name | date_sent | subject | email_body |
|---|---|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | james_ngola2002@maktoob.com | | 31 Oct 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | email body |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | R@M | | 31 Oct 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | email body |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | obong_715@epatra.com | | 31 Oct 2002 | GOOD DAY TO YOU | email body |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | webmaster@aclweb.org | | 31 Oct 2002 | GOOD DAY TO YOU | email body |
| 4 | m_abacha03@www.com | "Maryam Abacha" | m_abacha03@www.com | | 1 Nov 2002 | I Need Your Assistance. | email body |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3972 | michealagu0255@zipmail.com.br | | | | | =?iso-8859-1?Q?CONTACT=20GLOBAL=20MAX=20SHIPIN... | email body |
| 3973 | ali_sherif252@hotmail.fr | ali sherif | ali_sherif105@yahoo.co.uk | | 17 Sep 2007 | TREAT AS URGENT. | email body |
| 3974 | drusmanibrahimtg08@hotmail.fr | Dr Usman Ibrahim Danko | drusmanibrahim.tg@homs.cc | | 18 Sep 2007 | From Dr Usman Ibrahim / Mr Wahid Yoffe property. | email body |
| 3975 | motherdorisk61@hotmail.com | Mother Doris Killam | motherdorisk9@yahoo.com.hk | | 19 Sep 2007 | My Beloved In Christ. | email body |


In [49]:
# To find emails sent from one domain name or another, we use '|' (the regex "or").

emails_df[emails_df['sender_email'].str.contains('epatra|hotmail', na=False)] # na=False treats missing (None) values as non-matches

| | sender_email | sender_name | recipient_email | recipient_name | date_sent | subject | email_body |
|---|---|---|---|---|---|---|---|
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | obong_715@epatra.com | | 31 Oct 2002 | GOOD DAY TO YOU | email body |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | webmaster@aclweb.org | | 31 Oct 2002 | GOOD DAY TO YOU | email body |
| 16 | anayoawka@hotmail.com | " DR. ANAYO AWKA " | webmaster@aclweb.org | | 15 Nov 2002 | REQUEST FOR YOUR UNRESERVED ASSISTANCE | email body |
| 17 | anayoawka@hotmail.com | " DR. ANAYO AWKA " | webmaster@aclweb.org | | 15 Nov 2002 | REQUEST FOR YOUR UNRESERVED ASSISTANCE | email body |
| 102 | marabac011@hotmail.com | "Dr mariam abacha" | marabac121@ny.com | | 7 Feb 2003 | urgent and confidential. | email body |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3965 | kennethobia5@hotmail.com | KENNETH OBI | kennethobi@gmail.com | | 12 Sep 2007 | Attention please | email body |
| 3969 | motherdorisk93@hotmail.com | Mother Doris Killam | motherdorisk9@yahoo.com.hk | | 14 Sep 2007 | My Beloved In Christ. | email body |
| 3973 | ali_sherif252@hotmail.fr | ali sherif | ali_sherif105@yahoo.co.uk | | 17 Sep 2007 | TREAT AS URGENT. | email body |
| 3974 | drusmanibrahimtg08@hotmail.fr | Dr Usman Ibrahim Danko | drusmanibrahim.tg@homs.cc | | 18 Sep 2007 | From Dr Usman Ibrahim / Mr Wahid Yoffe property. | email body |
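To make the role of `na=False` concrete, here is a small sketch on a made-up DataFrame (the `df` contents are hypothetical and assume pandas is installed). The boolean mask returned by `str.contains()` is what selects the rows:

```python
import pandas as pd

# Hypothetical data with one missing sender address.
df = pd.DataFrame(
    {"sender_email": ["a@epatra.com", None, "b@hotmail.fr", "c@yahoo.com"]}
)

# The pattern is a regex by default, so '|' means "either domain".
# na=False marks rows with a missing value as non-matches in the mask.
mask = df["sender_email"].str.contains(r"epatra|hotmail", na=False)

matches = df[mask]

print(mask.tolist())
print(matches)
```

Without `na=False`, the missing row would leave a missing value in the mask, which cannot be used reliably for boolean indexing.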


In [55]:
# using regex to find emails sent from a particular email domain
# (the dot is escaped as \. so that it matches a literal period only)

index = emails_df[emails_df['sender_email'].str.contains(r'\w\S*@spinfinder\.com', na=False)].index.values
index

array([  1, 584], dtype=int64)

In [62]:
adress_df = emails_df.loc[index, ['sender_email', 'email_body']]
adress_df

| | sender_email | email_body |
|---|---|---|
| 1 | bensul2004nng@spinfinder.com | email body |
| 584 | rharare1@spinfinder.com | email body |


source : [Dataquest - Tutorial: Python Regex (Regular Expressions) for Data Scientists](https://www.dataquest.io/blog/regular-expressions-data-scientists/)