# Assignment goal: clean data with Python regular expressions

In this assignment we use a corpus of fraudulent emails as the text to clean and process.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Load the text data
The original text is large, so we first practice on a partial extract, **sample_emails.txt**.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Read the text data
with open('/content/drive/MyDrive/NLP/sample_emails.txt','r',encoding='utf-8') as f:
  sample_corpus = f.read()

In [4]:
sample_corpus.split('\n')[:25]

['From r  Wed Oct 30 21:41:56 2002',
 'Return-Path: <james_ngola2002@maktoob.com>',
 'X-Sieve: cmu-sieve 2.0',
 'Return-Path: <james_ngola2002@maktoob.com>',
 'Message-Id: <200210310241.g9V2fNm6028281@cs.CU>',
 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'Reply-To: james_ngola2002@maktoob.com',
 'To: webmaster@aclweb.org',
 'Date: Thu, 31 Oct 2002 02:38:20 +0000',
 'Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP',
 'X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM',
 'MIME-Version: 1.0',
 'Content-Type: text/plain; charset="us-ascii"',
 'Content-Transfer-Encoding: 8bit',
 'X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311',
 'Status: O',
 '',
 'FROM:MR. JAMES NGOLA.',
 'CONFIDENTIAL TEL: 233-27-587908.',
 'E-MAIL: (james_ngola2002@maktoob.com).',
 '',
 'URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.',
 '',
 '',
 'DEAR FRIEND,']

### Extract sender information
Looking at the text, the sender information always follows this format:

`From: <sender name> <sender email>`

In [5]:
import re

In [6]:
pattern = r"(From:\s.?[\w\.\s]+.?\s<[\w\.]+@[A-Za-z\.\-]+>)"
match = re.findall(pattern,sample_corpus,flags=re.M)
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']
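
The pattern above packs several parts into one line. As a reading aid, the same idea can be spelled out with `re.VERBOSE` and named groups. This is only a sketch, not the notebook's own pattern: it assumes the name is always wrapped in double quotes, and the group names `name` and `address` and the two-line `sample` string are chosen here for illustration.

```python
import re

# Two header lines in the same shape as the output shown above.
sample = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\n'
)

sender_re = re.compile(
    r"""
    ^From:\s                             # header keyword at the start of a line
    "(?P<name>[^"]+)"                    # display name between double quotes
    \s
    <(?P<address>[\w.]+@[A-Za-z.\-]+)>   # address between angle brackets
    """,
    flags=re.M | re.VERBOSE,
)

senders = [(m.group("name"), m.group("address")) for m in sender_re.finditer(sample)]
print(senders)
```

Named groups return the name and the address separately from one match, which avoids splitting the matched string afterwards.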

### Extract only the sender name

In [7]:
pattern = r'((?<=From:\s).?[\w\.\s]+")'
match = re.findall(pattern,sample_corpus,flags=re.M)
match

['"MR. JAMES NGOLA."', '"Mr. Ben Suleman"', '"PRINCE OBONG ELEME"']
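
The lookbehind version keeps the surrounding quotes in the result. If the bare name is preferred, a capturing group is one alternative. The sketch below assumes the name is optionally quoted and is followed by `<address>`; the `sample` lines (including the unquoted one) are made up for illustration.

```python
import re

# Hypothetical header lines: one quoted name, one unquoted name.
sample = (
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>\n'
    'From: Kuta David <davidkuta@postmark.net>\n'
)

# The group captures only the name; the optional quotes stay outside it.
name_re = re.compile(r'^From:\s"?([^"<\n]+?)"?\s<', flags=re.M)
names = name_re.findall(sample)
print(names)
```

With a single capturing group, `re.findall` returns the group contents rather than the whole match, so the quotes and the `From:` prefix are dropped.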

### Extract only the sender email address

In [8]:
pattern = r'(?<="\s)(<[\w\.]+@[A-Za-z\.\-]+>)'
match = re.findall(pattern,sample_corpus)
for m in match:
  print(m)  # each m is a matched string

<james_ngola2002@maktoob.com>
<bensul2004nng@spinfinder.com>
<obong_715@epatra.com>


### Extract only the sending organization from the email address
ex: james_ngola2002@maktoob.com --> take maktoob

In [9]:
pattern = r'(?<="\s<)([\w\.]+@[A-Za-z\.\-]+)'
pattern2 = r"((?<=@)[\w\.\-]+(?=[\.]))"
emails = re.findall(pattern,sample_corpus,flags=re.M)
for email in emails:
  org = re.findall(pattern2,email,flags=re.M)
  print(org[0])

maktoob
spinfinder
epatra
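
Note that `[\w\.\-]+` is greedy, so `pattern2` returns everything between `@` and the last dot of the domain. The short sketch below uses hypothetical addresses (not taken from the corpus) to show how that differs from a pattern that stops at the first dot. Which part counts as the organization depends on the data, so treat both patterns as options rather than a fixed rule.

```python
import re

greedy = r"(?<=@)[\w.\-]+(?=\.)"   # everything up to the last dot
first_label = r"(?<=@)[\w\-]+"     # everything up to the first dot

# Hypothetical addresses: a simple domain and one with several labels.
addresses = ["user@maktoob.com", "user@mail.example.co.uk"]
results = {a: (re.findall(greedy, a), re.findall(first_label, a)) for a in addresses}

for address, (g, f) in results.items():
    print(address, g, f)
```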


### Combining the patterns above, return the sender's account and organization
ex: james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [10]:
pattern = r'(?<="\s<)([\w\.]+@[A-Za-z\.\-]+)'
pattern2 = r"((?<=@)[\w\.\-]+(?=[\.]))"
pattern3 = r"([\w\.]+(?=@))"
emails = re.findall(pattern,sample_corpus)
for email in emails:
  acc = re.findall(pattern3,email)
  org = re.findall(pattern2,email)
  print(f"{acc[0]},{org[0]}")

james_ngola2002,maktoob
bensul2004nng,spinfinder
obong_715,epatra
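
The three patterns can also be merged into one pattern with named groups, so that account and organization come from the same match. This is a minimal sketch that assumes the same `"name" <account@org.tld>` layout; the group names `account` and `org` and the `sample` string are illustrative.

```python
import re

sample = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# One pass: account before "@", organization up to the last dot, then the suffix.
combined = re.compile(r'(?<="\s<)(?P<account>[\w.]+)@(?P<org>[\w.\-]+)\.[A-Za-z]+>')
pairs = [[m.group("account"), m.group("org")] for m in combined.finditer(sample)]
print(pairs)
```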


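### Process the email data with regular expressions
Here we also use other Python packages (e.g. pandas, email) to help with processing. We only need to focus on the regular expressions; the other packages simply make it easier to organize and process the data.

### Read and split the emails
The emails are read in as one long string. Use a regular expression to split it into individual emails, and represent the result as a list.

Output: [email_1, email_2, email_3, ....]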

In [11]:
import re
import pandas as pd
import email

### Read the text data: fradulent_emails.txt ###
with open('/content/drive/MyDrive/NLP/all_emails.txt','r',encoding='windows-1252') as f:
  corpus = f.read()

### Split the text into individual emails ###
### A list can hold each email ###
pattern = r"From r"
emails = re.split(pattern,corpus)[1:]


### Note! Look closely at the sample data to see how the individual emails are separated ###

len(emails)  # Check how many emails there are

3977
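
Because `From r` is matched anywhere in the text, a body line that happens to contain those characters also triggers a split. One possible tightening, shown here on a tiny made-up corpus, anchors the separator to the start of a line with `re.M` and requires whitespace after it. The sketch assumes every message starts with a `From r  <date>` line as in the sample. It has not been checked against the full corpus, so the resulting count may differ from the one above.

```python
import re

# Made-up corpus: two messages, the first with a body line starting "From r...".
toy_corpus = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: A\n\n"
    "From reliable sources I learned of your company.\n"
    "From r  Thu Oct 31 00:00:00 2002\n"
    "Subject: B\n\n"
    "Dear friend,\n"
)

# Unanchored split: also cuts inside "From reliable ...".
loose = re.split(r"From r", toy_corpus)[1:]

# Anchored split: only lines that begin with "From r" followed by whitespace.
anchored = re.split(r"^From r\s+", toy_corpus, flags=re.M)[1:]

print(len(loose), len(anchored))
```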

### Extract the names and addresses of all senders and recipients from the text

In [12]:
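emails_list = []  # Empty list to hold the information for every email

for mail in emails[:20]:  # Only the first 20 emails (faster to process)
    emails_dict = dict()  # Empty dict to hold one email's information
    ### Get the sender's name and address ###

    # Step 1: get the sender information (hint: From:)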
    pattern = r"(From:\s.?[\w\.\s]+.?\s<[\w\.]+@[A-Za-z\.\-]+>)"
    pattern2 = r"From:\s[\w\.]+@[\w\-\.]+|From.[\.\w\s][^\n]+"

    from_match = re.findall(pattern,mail)
    if from_match == []:
      from_match = re.findall(pattern2,mail)

    # Step 2: get the name and address (hint: sometimes nothing matches)
    # Step 3: store the name and address in the dict
    if len(from_match)==1:
      name = from_match[0].split(':')[1].split('<')[0].replace("\"","").strip()
      address = from_match[0].split('<')[1].strip('>')
      emails_dict['sender_name']=name
      emails_dict['sender_add']=address

    elif len(from_match)==2:
      p_add = r'[\w\.]+@[A-Za-z\.\-]+'
      p_name = r"(From.[\s]*)"

      address = re.findall(p_add,from_match[0])
      name = re.sub(p_name,"",from_match[1])
      emails_dict['sender_name']=name
      emails_dict['sender_add']=address[0]

    else:  # no match, or an unexpected number of matches
      emails_dict['sender_name']=None
      emails_dict['sender_add']=None

    ### Get the recipient's name and address ###
    # Step 1: get the recipient information (hint: To:)
    pattern = r"((?<!Reply-)To:[<\w\.\s]+@[\w\-\.]+)"
    to_match = re.findall(pattern,mail)

    # Step 2: get the name and address (hint: sometimes nothing matches)
    # Step 3: store the name and address in the dict
    if to_match != []:
      to_match = re.sub('To:','',to_match[0]).strip()
      emails_dict['receiver_name'] = None
      emails_dict['receiver_add'] = to_match
    else:
      emails_dict['receiver_name'] = None
      emails_dict['receiver_add'] = None
           
    ### Get the email date ###
    # Step 1: get the date information (hint: Date:)
    pattern = r'Date:.[^\n]*'
    pattern2 = r"[\d]{1,2}\s[A-Z][a-z]{2}\s[\d]{4}"
    date_match = re.findall(pattern,mail)

    # Step 2: get the detailed date (only DD MMM YYYY is needed)
    if date_match != []:
      date = re.search(pattern2,date_match[0])
      date = date.group() if date else None  # None when the date has another format
    else:
      date = None

    # Step 3: store the date in the dict
    emails_dict['date'] = date
        
        
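    ### Get the email subject ###
    # Step 1: get the subject information (hint: Subject:)
    pattern = r"(?<=Subject:)[\w\s][^\n]+"
    subject_match = re.findall(pattern,mail)

    # Step 2: remove unneeded text (hint: Subject: )
    subject = subject_match[0].strip() if subject_match else None  # None when there is no Subject line

    # Step 3: store the subject in the dict
    emails_dict['Subject'] = subject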
    
    
    ### Get the email body ###
    # Here we use the email package to pull out the body (no need to dig into it; this chapter is about regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None

    ### Append the dict to the list ###
    emails_list.append(emails_dict)
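
The hints in the loop stress that a pattern may find nothing. One way to keep that handling in one place is a small helper that returns `None` when there is no match. The helper name `first_group` and the `header` string are hypothetical, and the patterns are simplified variants of the ones used in the loop, so this is a sketch rather than a drop-in replacement.

```python
import re

def first_group(pattern, text, flags=0):
    """Return group 1 of the first match, stripped, or None if nothing matches."""
    m = re.search(pattern, text, flags)
    return m.group(1).strip() if m else None

# Hypothetical header block with no Date line.
header = (
    "Reply-To: james_ngola2002@maktoob.com\n"
    "To: webmaster@aclweb.org\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
)

# (?<!Reply-) rejects the "To:" that is part of "Reply-To:".
receiver = first_group(r"(?<!Reply-)To:\s*([\w.\-]+@[\w.\-]+)", header)
subject = first_group(r"Subject:([^\n]+)", header)
date = first_group(r"Date:([^\n]+)", header)  # no Date line, so this is None

print(receiver, subject, date)
```

Each field then needs a single call, and a missing header shows up as `None` in the dict instead of raising an `IndexError`.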

In [49]:
# Convert the processed results into a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df

| | sender_name | sender_add | receiver_name | receiver_add | date | Subject | email_body |
|---|---|---|---|---|---|---|---|
| 0 | MR. JAMES NGOLA. | james_ngola2002@maktoob.com | | webmaster@aclweb.org | 31 Oct 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | Mr. Ben Suleman | bensul2004nng@spinfinder.com | | R@M | 31 Oct 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | PRINCE OBONG ELEME | obong_715@epatra.com | | webmaster@aclweb.org | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | PRINCE OBONG ELEME | obong_715@epatra.com | | webmaster@aclweb.org | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | Maryam Abacha | m_abacha03@www.com | | R@M | 1 Nov 2002 | I Need Your Assistance. | Dear sir, \n \nIt is with a heart full of hope... |
| 5 | Kuta David | davidkuta@postmark.net | | davidkuta@yahoo.com | 02 Nov 2002 | Partnership | ATTENTION: ... |
| 6 | Barrister tunde dosumu | tunde_dosumu@lycos.com | | | | Urgent Attention | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
| 7 | William Drallo | william2244drallo@maktoob.com | | webmaster@aclweb.org | 3 Nov 2002 | URGENT BUSINESS PRPOSAL | FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2... |
| 8 | MR USMAN ABDUL | abdul_817@rediffmail.com | | R@M | 04 Nov 2002 | THANK YOU | CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\... |
| 9 | Tunde Dosumu | barrister_td@lycos.com | | | | Urgent Assistance | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
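
The `date` values are still plain strings such as `31 Oct 2002`. If real dates are needed later, they could be parsed with the standard library. The sketch below assumes English month abbreviations and the default locale; `parse_date` is a hypothetical helper, and the first header line is taken from the sample shown earlier.

```python
import re
from datetime import datetime

def parse_date(header_line):
    """Extract 'DD Mon YYYY' from a Date header and parse it, or return None."""
    m = re.search(r"\d{1,2}\s[A-Z][a-z]{2}\s\d{4}", header_line)
    return datetime.strptime(m.group(), "%d %b %Y").date() if m else None

parsed = parse_date("Date: Thu, 31 Oct 2002 02:38:20 +0000")
missing = parse_date("Date: unknown")  # no DD Mon YYYY part, so None
print(parsed, missing)
```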
