# Assignment goal: cleaning data with Python regular expressions

In this assignment we use the text of fraudulent emails to practice cleaning and processing data.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Reading the text data
Because the original corpus is large, we first practice on a partial extract, **sample_emails.txt**.

In [1]:
# Read the text data
with open('sample_emails.txt', 'r') as f:
    sample_corpus = f.read()

In [2]:
#sample_corpus[:2000]

---
### Extracting sender information
Inspecting the text shows that every sender line follows this format:

`From: <sender name> <sender email address>`

In [3]:
import re

In [4]:
pattern_sender = r'From: .*'
match = re.findall(pattern_sender, sample_corpus)
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']
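Each match stops at the end of its header line because `.` does not match a newline by default. The sketch below contrasts this with `re.DOTALL` on a made-up two-line string that is not taken from the corpus.

```python
import re

# Made-up two-line header block, used only for illustration.
text = 'From: "A" <a@example.com>\nTo: b@example.com\n'

# By default '.' stops at a newline, so the match ends with the From line.
per_line = re.findall(r'From: .*', text)

# With re.DOTALL '.' also matches newlines, so the match runs to the end.
whole_text = re.findall(r'From: .*', text, flags=re.DOTALL)

print(per_line)
print(whole_text)
```

This is why the plain `From: .*` pattern is enough to pull out one header line at a time.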

---
### Extracting only the sender's name

In [5]:
pattern_senderName = r'From: (.*) <(.*)>'
match_senderName = re.finditer(pattern_senderName, sample_corpus)
for ma in match_senderName:
    print(ma.group(1))

"MR. JAMES NGOLA."
"Mr. Ben Suleman"
"PRINCE OBONG ELEME"


In [6]:
pattern_NameEmail = r'\"(.*)\" <(.*)>'
for info in match:
    print(re.search(pattern_NameEmail, info).group(1))

MR. JAMES NGOLA.
Mr. Ben Suleman
PRINCE OBONG ELEME


---
### Extracting only the sender's email address

In [7]:
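# \w+@\w+ locates the address; the greedy .* then runs to the end of the line
# and the final \b backs off to the last word character (just before '>').
p_email = r'\b\w+@\w+.*\b'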
for info in match:
    print(re.search(p_email, info).group())

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com


In [8]:
for info in match:
    print(re.search(pattern_NameEmail, info).group(2))

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com
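The numbered groups used above can also be given names, which may be easier to read than `group(1)` and `group(2)`. This is a minimal sketch using one header line from the earlier output. The group names `name` and `address` are illustrative choices, not part of the original notebook.

```python
import re

# One header line copied from the findall output earlier in the notebook.
header = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'

# Same shape as pattern_NameEmail, with named groups (names are illustrative).
sender_re = re.compile(r'"(?P<name>.*)" <(?P<address>.*)>')

m = sender_re.search(header)
if m is not None:
    print(m.group('name'))
    print(m.group('address'))
    print(m.groupdict())
```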


---
### Extracting only the sender's organization from the email address
e.g. james_ngola2002@maktoob.com --> take maktoob

In [9]:
p_senderInfo = r'(?<=@)\w+(?=\.)'
for info in match:
    print(re.search(p_senderInfo, info).group())

maktoob
spinfinder
epatra
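The lookbehind `(?<=@)` and lookahead `(?=\.)` check the surrounding characters without including them in the match. The sketch below compares that with an ordinary capture group on one address from the sample output. The variable names are illustrative.

```python
import re

address = 'james_ngola2002@maktoob.com'

# Lookarounds assert that '@' precedes and '.' follows, without consuming them.
with_lookaround = re.search(r'(?<=@)\w+(?=\.)', address)

# A plain pattern consumes '@' and '.'; the capture group isolates the middle.
with_group = re.search(r'@(\w+)\.', address)

print(with_lookaround.group())   # maktoob
print(with_group.group())        # @maktoob.
print(with_group.group(1))       # maktoob
```

Both approaches return the same organization name. The difference is whether the delimiters end up in `group()`.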


---
### Combining the patterns above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [10]:
# 1) Use two patterns with search
p_emailName = r'\b\w+(?=@)'
p_senderInfo = r'(?<=@)\w+(?=\.)'
for info in match:
    print(f'{re.search(p_emailName, info).group()}, {re.search(p_senderInfo, info).group()}')

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra


In [11]:
# 2) Use alternation (|) with findall (returns a list)
pat = r'\w+(?=@)|(?<=@)\w+(?=\.)'
for info in match:
    print(re.findall(pat, info))

['james_ngola2002', 'maktoob']
['bensul2004nng', 'spinfinder']
['obong_715', 'epatra']


In [12]:
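# 3) split approach, adapted from the reference solution
# (the original (?<=@) after the literal '@' was redundant and is dropped)
pat = r'\w+@\w+(?=\.)'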
for info in match:
    for line in re.findall(pat, info):
        name, domain = re.split('@', line)
        print(f'{name}, {domain}')

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra
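A fourth variant, offered as a sketch and not as part of the original solutions, uses a single pattern with two capture groups so that one `search` call yields both pieces. The header lines are copied from the earlier output. The names `account_org_re` and `pairs` are illustrative.

```python
import re

# Header lines copied from the findall output earlier in the notebook.
headers = [
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
]

# Group 1 is the account before '@', group 2 the label up to the next '.'.
account_org_re = re.compile(r'(\w+)@(\w+)\.')

pairs = []
for header in headers:
    m = account_org_re.search(header)
    if m is not None:  # skip lines without an address
        pairs.append(m.groups())

print(pairs)
```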


---
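### Processing email data with regular expressions
We also use other Python packages here (e.g. pandas, email). Focus on the regular expressions; the other packages are only there to help organize and process the data.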
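As a cross-check for the regex results, the standard library's `email` package can parse a header line directly. This is a minimal sketch on a made-up message in which only the `From` line mirrors the corpus format. It is not a replacement for the regex exercises.

```python
from email import message_from_string
from email.utils import parseaddr

# Made-up minimal message; only the From line mirrors the corpus format.
raw = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'Subject: example\n'
    '\n'
    'body text\n'
)

msg = message_from_string(raw)

# parseaddr splits a header value into (display name, address).
name, address = parseaddr(msg['From'])

print(name)
print(address)
```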

---
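### Reading and splitting the emails
The emails are read in as one long string. Use a regular expression to split it into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]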

In [13]:
### Read the full corpus. The original note names the file fradulent_emails.txt; this cell opens all_emails.txt ###
with open('all_emails.txt', 'r', encoding='utf8', errors='ignore') as f:
    data = f.read()

In [14]:
len(data)

17330528

In [15]:
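### Split the loaded data into individual emails ###
### A list can be used to store each email ###
### Note: inspect the sample data carefully to see how the emails are separated ###

#data[:5000]  # inspect where each email starts and ends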

In [16]:
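# Split on the 'From r' text that begins each message.
# Note: re.M only changes how ^ and $ behave, so it has no effect on this pattern.
emails = re.split('From r', data, flags=re.M)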
len(emails)

3978

In [17]:
emails = emails[1:]  # drop the first, empty entry
len(emails)

3977
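Because the split pattern is not anchored, any body text that happens to contain `From r` would also trigger a split. The sketch below uses a made-up two-message string (not corpus data) to show how `^` together with `re.M` restricts the split to the start of a line. Whether this changes the count on the real corpus is not verified here.

```python
import re

# Made-up two-message mailbox. The first body contains "From r" mid-line.
mbox = (
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: first\n"
    "\n"
    "Greetings. From reliable sources I learned of you.\n"
    "From r  Thu Oct 31 17:27:16 2002\n"
    "Subject: second\n"
    "\n"
    "Hello.\n"
)

# Unanchored: also splits inside "From reliable".
unanchored = re.split('From r', mbox)

# Anchored: with re.M, '^' matches at the start of every line.
anchored = re.split(r'^From r', mbox, flags=re.M)

print(len(unanchored))  # 4 pieces, including the leading empty string
print(len(anchored))    # 3 pieces, including the leading empty string
```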

---
### Extracting the names and addresses of all senders and recipients from the text

In [18]:
import re
import pandas as pd
import email

In [19]:
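emails_list = []  # list collecting one dict per email

for mail in emails[:20]:  # only the first 20 emails, to keep processing fast
    # Dict for this email. Keys used below: 寄件者E-mail (sender email),
    # 寄件者名稱 (sender name), 收件者E-mail (recipient email),
    # 收件者名稱 (recipient name), 日期 (date), 主旨 (subject), 內文 (body).
    emails_dict = dict()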
    
    ##### Sender
    match_sender = re.search(r'From:.*', mail)
    if match_sender is not None:  # check that the match succeeded; otherwise .group() below would raise
        sender_name = re.search(r'(?<=\").*(?=\")', match_sender.group())
        sender_address = re.search(r'\w+@.*\b', match_sender.group())
    else:
        sender_name = None
        sender_address = None
    
    if sender_address is not None:
        emails_dict['寄件者E-mail'] = sender_address.group()
    else:
        emails_dict['寄件者E-mail'] = sender_address
    if sender_name is not None:
        emails_dict['寄件者名稱'] = sender_name.group()
    else:
        emails_dict['寄件者名稱'] = sender_name
    
    ##### Recipient       
    match_to = re.search(r'To:.*', mail)
    if match_to is not None:
        to_name = re.search(r'(?<=\").*(?=\")', match_to.group())
        to_email = re.search(r'\w+@.*\b', match_to.group())
    else:
        to_name = None
        to_email = None
   
    if to_email is not None:
        emails_dict['收件者E-mail'] = to_email.group()
    else:
        emails_dict['收件者E-mail'] = to_email
    if to_name is not None:
        emails_dict['收件者名稱'] = to_name.group()
    else:
        emails_dict['收件者名稱'] = to_name
    
    ##### Date
    match_date = re.search(r'Date:.*', mail)
    if match_date is not None:
        date_info = re.search(r'\d+\s\w+\s\d+', match_date.group())
    else:
        date_info = None
    
    if date_info is not None:
        emails_dict['日期'] = date_info.group()
    else:
        emails_dict['日期'] = date_info
    
    ##### Subject
    subject = re.search(r'(?<=Subject:).*', mail)    
    if subject is not None:
        emails_dict['主旨'] = subject.group()
    else:
        emails_dict['主旨'] = subject
    
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["內文"] = body
    except Exception:  # fall back to None if the message cannot be parsed
        emails_dict["內文"] = None
    
    emails_list.append(emails_dict)
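The loop above repeats the same "search, then check for `None`" step for every field. One possible way to reduce that repetition is a small helper, sketched here with the notebook's own patterns. The function names `first_match` and `parse_headers`, the English dictionary keys, and the sample message are all illustrative and not part of the original notebook.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if text is None or nothing matches."""
    if text is None:
        return None
    m = re.search(pattern, text)
    return m.group() if m is not None else None

def parse_headers(mail):
    """Extract the header fields handled in the loop (keys are illustrative)."""
    sender_line = first_match(r'From:.*', mail)
    to_line = first_match(r'To:.*', mail)
    date_line = first_match(r'Date:.*', mail)
    return {
        'sender_email': first_match(r'\w+@.*\b', sender_line),
        'sender_name': first_match(r'(?<=\").*(?=\")', sender_line),
        'to_email': first_match(r'\w+@.*\b', to_line),
        'to_name': first_match(r'(?<=\").*(?=\")', to_line),
        'date': first_match(r'\d+\s\w+\s\d+', date_line),
        'subject': first_match(r'(?<=Subject:).*', mail),
    }

# Made-up message in the same shape as the corpus entries.
sample = (
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\n'
    'To: R@M\n'
    'Date: Thu, 31 Oct 2002 05:10:00\n'
    'Subject: URGENT ASSISTANCE\n'
    '\n'
    'Dear Friend,\n'
)

record = parse_headers(sample)
print(record)
```

Because `first_match` returns `None` both for a missing header line and for a failed inner search, every key is always assigned and no variable can be left over from a previous iteration.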

In [20]:
# Convert the results into a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df

Column names are the dictionary keys from the cell above: sender email, sender name, recipient email, recipient name, date, subject, body.

| | 寄件者E-mail | 寄件者名稱 | 收件者E-mail | 收件者名稱 | 日期 | 主旨 | 內文 |
|---|---|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | MR. JAMES NGOLA. | james_ngola2002@maktoob.com | | 31 Oct 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | bensul2004nng@spinfinder.com | Mr. Ben Suleman | R@M | | 31 Oct 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | obong_715@epatra.com | PRINCE OBONG ELEME | obong_715@epatra.com | | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | obong_715@epatra.com | PRINCE OBONG ELEME | webmaster@aclweb.org | | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | m_abacha03@www.com | Maryam Abacha | m_abacha03@www.com | | 1 Nov 2002 | I Need Your Assistance. | Dear sir, \n \nIt is with a heart full of hope... |
| 5 | davidkuta@postmark.net | | davidkuta@yahoo.com | | 02 Nov 2002 | Partnership | ATTENTION: ... |
| 6 | tunde_dosumu@lycos.com | Barrister tunde dosumu | tunde_dosumu@lycos.com | | | Urgent Attention | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
| 7 | william2244drallo@maktoob.com | William Drallo | william2244drallo@maktoob.com | | 3 Nov 2002 | URGENT BUSINESS PRPOSAL | FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2... |
| 8 | abdul_817@rediffmail.com | MR USMAN ABDUL | R@M | | 04 Nov 2002 | THANK YOU | CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\... |
| 9 | barrister_td@lycos.com | Tunde Dosumu | barrister_td@lycos.com | | | Urgent Assistance | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
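In several rows the recipient address equals the sender address. One possible cause, stated here as an assumption that has not been checked against the corpus, is that an unanchored pattern such as `To:.*` can match inside a `Reply-To:` header that appears before the real `To:` line. The sketch below reproduces the effect on a made-up message and shows how anchoring with `^` and `re.M` avoids it.

```python
import re

# Made-up message: a Reply-To header precedes the To header.
mail = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'Reply-To: james_ngola2002@maktoob.com\n'
    'To: webmaster@aclweb.org\n'
)

# Unanchored: the first "To:" found is the tail of "Reply-To:".
loose = re.search(r'To:.*', mail).group()

# Anchored: '^' with re.M only matches "To:" at the start of a line.
strict = re.search(r'^To:.*', mail, flags=re.M).group()

print(loose)   # To: james_ngola2002@maktoob.com
print(strict)  # To: webmaster@aclweb.org
```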
