# 作業目標: 使用python正規表達式對資料進行清洗處理

這份作業我們會使用詐欺郵件的文本資料來作為清洗與處理的操作。
[資料集](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### 讀入資料文本
因原始文本較大，先使用部份擷取的**sample_emails.txt**來進行練習

In [1]:
#讀取文本資料
with open('sample_emails.txt','r') as f:
    sample_corpus = f.read() 
#<your code>#

### 讀取寄件者資訊
觀察文本資料可以發現, 寄件者資訊都符合以下格式

`From: <收件者姓名> <收件者電子郵件>`

In [2]:
#<your code>#
import re
patt = re.compile(r'^From:.*>$',re.M)
match = re.findall(patt,sample_corpus)

In [3]:
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

### 只讀取寄件者姓名

In [4]:
#<your code>#
receiver = re.compile(r'^From: "(.*)"',re.M)
match = re.findall(receiver,sample_corpus)

match

['MR. JAMES NGOLA.', 'Mr. Ben Suleman', 'PRINCE OBONG ELEME']

### 只讀取寄件者電子信箱

In [5]:
#<your code>#
email = re.compile(r'^From:.*<(.*)>',re.M)
match = re.findall(email,sample_corpus)
match

['james_ngola2002@maktoob.com',
 'bensul2004nng@spinfinder.com',
 'obong_715@epatra.com']

### 只讀取電子信箱中的寄件機構資訊
ex: james_ngola2002@maktoob.com --> 取maktoob

In [6]:
#<your code>#
sender = re.compile(r'^From:.*@(.*)\.com',re.M)
match = re.findall(sender,sample_corpus)
match

['maktoob', 'spinfinder', 'epatra']

### 結合上面的配對方式, 將寄件者的帳號與機構訊返回
ex: james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [7]:
#<your code>#
sender = re.compile(r'^From:.*<(.*).com',re.M)
result = '\n'.join(re.findall(sender,sample_corpus)).replace('@',', ')
print(result)

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra


### 使用正規表達式對email資料進行處理
這裡我們會使用到python其他的套件協助處理(ex: pandas, email, etc)，這裡我們只需要專注在正規表達式上即可，其他的套件是方便我們整理與處理資料。

### 讀取與切分Email
讀入的email為一個長字串，利用正規表達式切割讀入的資料成一封一封的email，並將結果以list表示。

輸出: [email_1, email_2, email_3, ....]

In [167]:
import re
import pandas as pd
import email

###讀取文本資料:fradulent_emails.txt###
#<your code>#
with open('all_emails.txt','r',encoding='utf8',errors='ignore') as f:
    corpus = f.read()
###切割讀入的資料成一封一封的email###
###我們可以使用list來儲存每一封email###
###注意！這裡請仔細觀察sample資料，看資料是如何切分不同email###
#<your code>#

#查看有多少封email
emails = re.split(r"From r",corpus)
emails = emails[1:]
len(emails)

sender = re.search(r'Subject:.*',emails[10])
re.search(r'(?<=Subject:\s).*',sender.group()).group()


' IMPORTANT'

### 從文本中擷取所有寄件者與收件者的姓名和地址

In [201]:
emails_list = [] #創建空list來儲存所有email資訊

for mail in emails[:20]: #只取前20筆資料 (處理速度比較快)
    emails_dict = dict() #創建空字典儲存資訊
    ###取的寄件者姓名與地址###
    
    #Step1: 取的寄件者資訊 (hint: From:)
    #<your code>#
    sender = re.search(r'From:.*',mail).group() if re.search(r'From:.*',mail) is not None else 'None'
    
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    #<your code>#

    sender_name = re.search(r'\w\S*@.*\b',sender).group() if re.search(r'\w\S*@.*\b',sender) is not None else None 
    sender_address = re.search(r'(?<=From: ).*<(.*)>',sender).group(1) if re.search(r'(?<=From: ).*<(.*)>',sender,flags=re.M) is not None else None
    
    
    #Step3: 將取得的姓名與地址存入字典中
    #<your code>#
    emails_dict['sender_email'] = sender_address
    emails_dict['sender_name'] = sender_name
    
    ###取的收件者姓名與地址###
    #Step1: 取的寄件者資訊 (hint: To:)
    #<your code>#
    recipient = re.search(r'To:.*',mail).group() if re.search(r'To:.*',mail) is not None else 'None'
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    #<your code>#
    recipient_address = re.search(r'\w\S*@.*\b',recipient).group() if re.search(r'\w\S*@.*\b',recipient,flags=re.M) is not None else None 
    recipient_name = re.search(r'(?<=From: ).*<(.*)>',recipient).group(1) if re.search(r'(?<=From: ).*<(.*)>',recipient,flags=re.M) is not None else None
    #Step3: 將取得的姓名與地址存入字典中
    #<your code>#
    emails_dict['recipient_name'] = recipient_name
    emails_dict['recipient_address'] = recipient_address
    ###取得信件日期###
    #Step1: 取得日期資訊 (hint: To:)
    #<your code>#
    date_info = re.search(r'Date:.*',mail).group() if re.search(r'Date:.*',mail) is not None else 'None'
    
    #Step2: 取得詳細日期(只需取得DD MMM YYYY)
    #<your code>#
    from datetime import datetime as dt
    date__ = re.search(r'(?<=Date:\s).*(?=\s-)',date_info).group() if re.search(r'(?<=Date:\s).*(?=\s-)',date_info) is not None else None
    date_ = dt.strptime(date__,'%a, %d %b %Y %H:%M:%S') if date__ is not None else None
    date = f'{date_:%d %m %Y}' if date_ is not None else None
    #Step3: 將取得的日期資訊存入字典中
    #<your code>#
    emails_dict['date'] = date
        
    ###取得信件主旨###
    #Step1: 取得主旨資訊 (hint: Subject:)
    #<your code>#
    subject = re.search(r'Subject:.*',mail).group() if re.search(r'Subject:.*',mail) else None
    #Step2: 移除不必要文字 (hint: Subject: )
    #<your code>#
    subject = re.search(r'(?<=Subject:\s).*',subject).group() if re.search(r'(?<=Subject:\s).*',subject) else None
    #Step3: 將取得的主旨存入字典中
    #<your code>#
    emails_dict['subject'] = subject
    
    ###取得信件內文###
    #這裡我們使用email package來取出email內文 (可以不需深究，本章節重點在正規表達式)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except:
        emails_dict["email_body"] = None
    
    ###將字典加入list###
    #<your code>#
    emails_list.append(emails_dict)

In [202]:
#將處理結果轉化為dataframe
emails_df = pd.DataFrame(emails_list)
emails_df

Unnamed: 0,date,email_body,recipient_address,recipient_name,sender_email,sender_name,subject
0,,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,james_ngola2002@maktoob.com,,james_ngola2002@maktoob.com,james_ngola2002@maktoob.com,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,,"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",R@M,,bensul2004nng@spinfinder.com,bensul2004nng@spinfinder.com,URGENT ASSISTANCE /RELATIONSHIP (P)
2,,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,obong_715@epatra.com,,obong_715@epatra.com,obong_715@epatra.com,GOOD DAY TO YOU
3,,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,webmaster@aclweb.org,,obong_715@epatra.com,obong_715@epatra.com,GOOD DAY TO YOU
4,,"Dear sir, \n \nIt is with a heart full of hope...",m_abacha03@www.com,,m_abacha03@www.com,m_abacha03@www.com,I Need Your Assistance.
5,,ATTENTION: ...,davidkuta@yahoo.com,,davidkuta@postmark.net,davidkuta@postmark.net,Partnership
6,,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)...",tunde_dosumu@lycos.com,,tunde_dosumu@lycos.com,tunde_dosumu@lycos.com,Urgent Attention
7,,FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2...,william2244drallo@maktoob.com,,william2244drallo@maktoob.com,william2244drallo@maktoob.com,URGENT BUSINESS PRPOSAL
8,,"CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\...",R@M,,abdul_817@rediffmail.com,abdul_817@rediffmail.com,THANK YOU
9,,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)...",barrister_td@lycos.com,,barrister_td@lycos.com,barrister_td@lycos.com,Urgent Assistance
