# Assignment goal: clean and process data with Python regular expressions

This assignment uses text from a fraudulent email corpus to practice cleaning and processing.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Load the text data
Because the original corpus is large, we first practice on an excerpt, **sample_emails.txt**.

In [1]:
import re

In [2]:
# Read the text data (f.read() already returns a str)
with open('sample_emails.txt', 'r') as f:
    sample_corpus = f.read()

### Extract the sender information
Inspecting the text shows that every sender line follows this format:

`From: <sender name> <sender email>`

In [3]:
#<your code>#
#match
pattern_obj = r"From:.*"
match = re.findall(pattern_obj, sample_corpus)

### Extract only the sender name

In [4]:
#<your code>#
for m in match:
    print(re.search(r'\".*\"', m).group())

"MR. JAMES NGOLA."
"Mr. Ben Suleman"
"PRINCE OBONG ELEME"
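
The names above still carry their surrounding quotes. A capture group can return only the text between them, and checking the match before calling `.group()` avoids an error on lines without a quoted name. This is a minimal sketch: the `lines` list is rebuilt from the outputs in this notebook under the assumption that every line has the same shape, and `sender_name` is a hypothetical helper, not part of the assignment.

```python
import re

# Illustrative input, reconstructed from the names and addresses printed in
# this notebook (assumed to share the format of the first email's header).
lines = [
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
]

def sender_name(line):
    """Return the text between the first pair of double quotes, or None."""
    found = re.search(r'"([^"]*)"', line)
    return found.group(1) if found else None

names = [sender_name(line) for line in lines]
print(names)
```

Using `[^"]*` instead of `.*` keeps the match from running past the closing quote if a line ever contains more than one quoted part.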


### Extract only the sender email address

In [5]:
#<your code>#
pattern_obj = r'\w\S*@.*\b'
for m in match:
    print(re.search(pattern_obj, m).group())

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com
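
The pattern `\w\S*@.*\b` works on these three lines because the address is the last thing on each line; the greedy `.*` would otherwise run on past it. The sketch below compares it with a stricter pattern. The trailing `(sent via relay)` text is invented for the illustration, and the `EMAIL` pattern is one reasonable approximation of an address, not a full RFC 5322 matcher.

```python
import re

# Stricter pattern: local part, "@", then at least two dot-separated labels.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

# Hypothetical header line with extra text after the address.
line = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com> (sent via relay)'

loose = re.search(r'\w\S*@.*\b', line).group()
strict = EMAIL.search(line).group()

print(loose)   # includes everything up to the last word boundary on the line
print(strict)  # only the address
```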


### Extract only the sending organization from the email address
e.g. james_ngola2002@maktoob.com --> maktoob

In [6]:
#<your code>#
pattern_obj = r'(?<=@).+(?=\.com)'  # escape the dot so it matches a literal "."
for m in match:
    print(re.search(pattern_obj, m).group())

maktoob
spinfinder
epatra
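
A lookahead for `.com` only handles `.com` addresses. For an address such as davidkuta@postmark.net, which appears later in this corpus, `re.search` returns None and calling `.group()` raises an AttributeError. One possible alternative, sketched here under the assumption that "organization" means the first label after the `@`, is to stop at the first dot instead:

```python
import re

# Everything after "@" up to the first dot, whitespace, or ">".
ORG = re.compile(r'(?<=@)[^.\s>]+')

addresses = ['james_ngola2002@maktoob.com', 'davidkuta@postmark.net']
orgs = [ORG.search(address).group() for address in addresses]
print(orgs)
```

For a multi-label domain this returns the leftmost label, so m_abacha03@www.com would yield `www`.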


### Combine the patterns above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [7]:
pattern_obj = r'(?<=<).+(?=\.com)'  # escape the dot so it matches a literal "."
for m in match:
    print(re.search(pattern_obj, m).group().split('@'))

['james_ngola2002', 'maktoob']
['bensul2004nng', 'spinfinder']
['obong_715', 'epatra']
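
The same pair can be read straight from the match with named groups, without a separate `split('@')`. This is a sketch only: `ADDRESS` and `account_and_org` are hypothetical names, and the pattern assumes the address is wrapped in angle brackets as in the sender lines above.

```python
import re

# <account@org.rest> with the account and first domain label as named groups.
ADDRESS = re.compile(r'<(?P<account>[^@<>\s]+)@(?P<org>[^.<>\s]+)\.[^<>\s]+>')

def account_and_org(line):
    """Return [account, organization] for the bracketed address, or None."""
    found = ADDRESS.search(line)
    if found is None:
        return None
    return [found.group('account'), found.group('org')]

line = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'
print(account_and_org(line))
```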


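### Process the email data with regular expressions
This part also uses other Python packages (e.g. pandas, email). Focus on the regular expressions; the other packages are only there to organize and process the data.

### Read and split the emails
The emails are read in as one long string. Use a regular expression to split it into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]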

In [9]:
import re
import pandas as pd
import email

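# Read the text data (all_emails.txt)
with open('all_emails.txt', 'r', encoding='windows-1252') as f:
    emails = f.read()
# Split the text into individual emails and store them in a list.
# Look closely at the sample data to see how consecutive emails are separated:
# here every message is preceded by a "From r" separator.
# (re.M has no effect without a ^ or $ anchor, so it is omitted.)
emails = re.split(r'From r', emails)
emails = emails[1:]  # drop whatever precedes the first separator
print(len(emails))  # number of emails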

3977
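
Because `From r` is not anchored, the split also fires wherever that text occurs inside a message body. Anchoring it with `^` and `re.M` restricts it to line starts. The sketch below shows the difference on a tiny invented sample. The second date and both bodies are made up, and whether the full corpus contains such mid-line occurrences is not checked here, so the count above may or may not change.

```python
import re

# Invented two-message sample; the first body contains "From r" mid-line.
raw = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: first\n"
    "\n"
    "I got your contact From reliable sources.\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: second\n"
    "\n"
    "Hello.\n"
)

unanchored = re.split(r'From r', raw)[1:]
anchored = re.split(r'^From r', raw, flags=re.M)[1:]

print(len(unanchored))  # the body text causes an extra split
print(len(anchored))    # one piece per message
```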


In [11]:
emails[0]

'  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDENT OCC

### Extract the names and addresses of all senders and recipients from the text

In [12]:
emails_list = []  # list that collects one dict per email

for mail in emails[:20]:  # only the first 20 emails, to keep processing fast
    emails_dict = dict()  # holds the fields extracted from this email
    ### Sender name and address ###

    # Step 1: get the sender line (hint: From:)
    sender = re.search(r'From:.*', mail)

    # Step 2: get the name and address (hint: sometimes there is no match)
    if sender is not None:
        sender_mail = re.search(r'\w\S*@.*\b', sender.group())
        sender_name = re.search(r'(?<=\").*(?=\")', sender.group())
        #print(sender.group())
    else:
        sender_mail = None
        sender_name = None
    # Step 3: store the address in the dict (sender_name is extracted but not stored)
    if sender_mail is not None:
        emails_dict["sender_email"] = sender_mail.group()    
    else:
        emails_dict["sender_email"] = sender_mail

    ### Recipient name and address ###
    # Step 1: get the recipient line (hint: To:)
    # NOTE: the pattern is unanchored, so it also matches the "To:" inside "Reply-To:"
    recipient = re.search(r'To:.*', mail)

    # Step 2: get the name and address (hint: sometimes there is no match)
    if recipient is not None:
        r_email = re.search(r'\w\S*@.*\b', recipient.group())
        r_name = re.search(r'(?<=\").*(?=\")', recipient.group())
        print(r_name)  # debug output: None when the line has no quoted name
    else:
        r_email = None
        r_name = None
    # Step 3: store the name and address in the dict
    if r_email is not None:
        emails_dict['recipient_email'] = r_email.group()
    else:
        emails_dict['recipient_email'] = r_email
    if r_name is not None:
        emails_dict['recipient_name'] = r_name.group()
    else:
        # NOTE: this stores r_email (a Match object or None), not r_name;
        # assign r_name instead to store None when no name is found
        emails_dict['recipient_name'] = r_email
        
    ### Date sent ###
    # Step 1: get the date line (hint: Date:)
    dates = re.search(r'Date:.*', mail)
    #print(dates)
    # Step 2: get the date itself (only DD MMM YYYY is needed)
    if dates is not None:
        date = re.search(r'\d+\s\w+\s\d+', dates.group())
    else:
        date = None
    #print(date)
    # Step 3: store the date in the dict
    if date is not None:
        emails_dict['date_sent'] = date.group()
    else:
        emails_dict['date_sent'] = date
        
    ### Subject ###
    # Step 1: get the subject line (hint: Subject:)
    subjects = re.search(r'Subject: .*', mail)
    
    # Step 2: remove the unneeded prefix (hint: Subject: )
    if subjects is not None:
        subject = re.sub(r'Subject: ', '', subjects.group())
    else:
        subject = None
    
    # Step 3: store the subject in the dict
    emails_dict['subject'] = subject
    
    ### Email body ###
    # The email package extracts the body here (no need to dig into it;
    # this chapter is about regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None
    
    ### Append the dict to the list ###
    emails_list.append(emails_dict)

None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
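
The repeated `if ... is not None` blocks in the cell above all do the same thing: search, then take the text or fall back to None. They can be folded into a small helper built around a capture group. The following is a minimal sketch, not the assignment's reference solution: `first_group` is a hypothetical name, the sample headers are copied from `emails[0]`, and the patterns are anchored with `^` and `re.M` as an added assumption.

```python
import re

def first_group(pattern, text, flags=0):
    """Return group 1 of the first match, or None when nothing matches."""
    found = re.search(pattern, text, flags)
    return found.group(1) if found else None

# Two header lines taken from the first email in the corpus.
mail = (
    'Date: Thu, 31 Oct 2002 02:38:20 +0000\n'
    'Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n'
)

record = {
    'date_sent': first_group(r'^Date:.*?(\d{1,2}\s\w{3}\s\d{4})', mail, re.M),
    'subject': first_group(r'^Subject: (.*)', mail, re.M),
    'recipient_name': first_group(r'^To:.*"(.*)"', mail, re.M),  # no To: line here
}
print(record)
```

Capturing the wanted part directly also removes the second `re.search` or `re.sub` pass that each field currently needs.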


In [13]:
# Convert the results to a DataFrame
import pandas as pd
emails_df = pd.DataFrame(emails_list)
emails_df

| | sender_email | recipient_email | recipient_name | date_sent | subject | email_body |
|---|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | james_ngola2002@maktoob.com | `<re.Match object; span=(4, 31), match='james_n...` | 31 Oct 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | bensul2004nng@spinfinder.com | R@M | `<re.Match object; span=(4, 7), match='R@M'>` | 31 Oct 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | obong_715@epatra.com | obong_715@epatra.com | `<re.Match object; span=(4, 24), match='obong_7...` | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | obong_715@epatra.com | webmaster@aclweb.org | `<re.Match object; span=(4, 24), match='webmast...` | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | m_abacha03@www.com | m_abacha03@www.com | `<re.Match object; span=(4, 22), match='m_abach...` | 1 Nov 2002 | I Need Your Assistance. | Dear sir, \n \nIt is with a heart full of hope... |
| 5 | davidkuta@postmark.net | davidkuta@yahoo.com | `<re.Match object; span=(4, 23), match='davidku...` | 02 Nov 2002 | Partnership | ATTENTION: ... |
| 6 | tunde_dosumu@lycos.com | tunde_dosumu@lycos.com | `<re.Match object; span=(4, 26), match='tunde_d...` | | Urgent Attention | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
| 7 | william2244drallo@maktoob.com | william2244drallo@maktoob.com | `<re.Match object; span=(4, 33), match='william...` | 3 Nov 2002 | URGENT BUSINESS PRPOSAL | FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2... |
| 8 | abdul_817@rediffmail.com | R@M | `<re.Match object; span=(4, 7), match='R@M'>` | 04 Nov 2002 | THANK YOU | CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\... |
| 9 | barrister_td@lycos.com | barrister_td@lycos.com | `<re.Match object; span=(4, 26), match='barrist...` | | Urgent Assistance | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
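
Two things stand out in this table. In row 0 the recipient address equals the sender address, although `emails[0]` shows `To: webmaster@aclweb.org`: the unanchored pattern `To:.*` matches the tail of the earlier `Reply-To:` line first. The `recipient_name` column holds Match objects because the fallback branch stores `r_email` rather than None. The sketch below reproduces the first effect on headers copied from `emails[0]` and shows one possible fix, anchoring the pattern with `^` and `re.M`; the variable names are illustrative.

```python
import re

# Header lines copied from the first email, in their original order.
headers = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'Reply-To: james_ngola2002@maktoob.com\n'
    'To: webmaster@aclweb.org\n'
)

unanchored = re.search(r'To:.*', headers).group()
anchored = re.search(r'^To:.*', headers, flags=re.M).group()

print(unanchored)  # the tail of the Reply-To line
print(anchored)    # the real To line

# Store None, not the address match, when the line has no quoted name.
name_match = re.search(r'(?<=").*(?=")', anchored)
recipient_name = name_match.group() if name_match else None
print(recipient_name)
```

The same anchoring applies to `From:`, `Date:` and `Subject:`, since those strings can also occur inside message bodies.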
