# Week 2 Assignment

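## Email structure

- The stored format of each email is divided into the following parts:
   - 'Metadata': information extracted from the MIME headers, such as 'From', 'Receive', 'Send_time', 'Subject', and so on
   - 'Content': all of the content, such as 'Body_text', 'Body_html', 'Recite', 'Attachment', 'Signature', and so on
   - 'Entities': the entities mentioned in the email, such as 'Name', 'Organization', 'Time', 'Position', 'Tel'
   - 'Relation': the relations involving the email, such as relations between emails and semantic relations within an email's content.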
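
As a rough illustration, the four sections could be held in one nested dictionary per message. This is only a sketch of a possible layout, not a final schema: the helper `empty_record` is hypothetical, the field names are the ones listed above, and the shape of 'Relation' is left open because it has not been decided yet.

```python
import json


def empty_record():
    """Return a blank record with the four top-level sections (hypothetical helper)."""
    return {
        'Metadata': {'From': None, 'Receive': None, 'Send_time': None, 'Subject': None},
        'Content': {'Body_text': None, 'Body_html': None, 'Recite': None,
                    'Attachment': [], 'Signature': None},
        'Entities': {'Name': [], 'Organization': [], 'Time': [], 'Position': [], 'Tel': []},
        'Relation': {},  # shape left open; relations are not specified yet
    }


record = empty_record()
record['Metadata']['Subject'] = 'Re: example subject'  # made-up value
# ensure_ascii=False keeps Chinese text readable in the JSON output.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
```

Keeping every section present, even when empty, means later stages can fill in fields without checking whether a key exists.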

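## Approach

- Use flanker to parse the MIME headers for a first pass of extraction; this covers the 'Metadata' part.
- Use regular expressions to break up the email content, extracting paragraphs, quotes, attachments, and the signature in turn.
- Use tokenization tools such as NLTK and jieba to refine the result further and extract the individual entities.
- Finally, carry out deeper relation analysis and extraction.
- In this way the extraction proceeds layer by layer, going progressively deeper.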


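## Difficulties

- Paragraph segmentation may be messy; this may take some time.
- Quotes are identified by the '>' and '>>' prefixes.
- Signatures are complicated: the format varies, some emails have none, and some are so simple that the information is incomplete.
- Representing relations, both within an email and between emails.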
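
The quote rule above ('>' and '>>' prefixes) can be sketched with a small regular expression. This is a minimal sketch for plain-text bodies only; the names `QUOTE_PREFIX`, `quote_depth`, and `split_quoted` are made up for illustration, and HTML bodies that use `<blockquote>` are not handled.

```python
import re

# One or more '>' markers at the start of a line, each optionally followed by a space.
QUOTE_PREFIX = re.compile(r'^\s*((?:>\s?)+)')


def quote_depth(line):
    """Count the leading '>' markers on a line (0 means the line is not quoted)."""
    match = QUOTE_PREFIX.match(line)
    return match.group(1).count('>') if match else 0


def split_quoted(body):
    """Split a plain-text body into the author's own lines and the quoted lines."""
    own, quoted = [], []
    for line in body.splitlines():
        (quoted if quote_depth(line) else own).append(line)
    return own, quoted


sample = "Thanks!\n> first reply\n>> original message\nBye"  # made-up body
own_lines, quoted_lines = split_quoted(sample)
```

Counting the markers rather than only testing for them keeps the nesting level, which could later help link a reply to the message it quotes.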

## Tips

- Every email ends with --boundary--
    - There is a pitfall here
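
The note above does not say what the pitfall is. One plausible candidate, offered only as an assumption, is that the closing line exists only for multipart messages and its text depends on the `boundary` parameter of each message's Content-Type header. The sketch below uses the standard library `email` module rather than flanker; `RAW` is a made-up message and both helper names are hypothetical.

```python
import email

# Made-up multipart message with boundary "b1".
RAW = (
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/alternative; boundary="b1"\r\n'
    "\r\n"
    "--b1\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
    "hello\r\n"
    "--b1\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<p>hello</p>\r\n"
    "--b1--\r\n"
)


def closing_delimiter(raw):
    """Return the closing delimiter line, or None for a message with no boundary."""
    boundary = email.message_from_string(raw).get_boundary()
    return None if boundary is None else "--" + boundary + "--"


def ends_with_closing_delimiter(raw):
    """Check whether the raw text ends with its own closing delimiter."""
    closing = closing_delimiter(raw)
    return closing is not None and raw.rstrip().endswith(closing)
```

Part separators are `--` plus the boundary, and only the final line carries the extra trailing `--`, so a search for the bare boundary string matches both.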


In [5]:
import os
import re
import json
import flanker

from bs4 import BeautifulSoup
from flanker import mime

### Reading the data

In [117]:
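def prepare_data(filename='2013-11.mbx'):
    # The with-block closes the file on exit, so no explicit f.close() is needed.
    with open(filename, 'r') as f:
        data = f.read()
    # Split the mailbox on "From <address>" separator lines. The pattern has
    # capturing groups, so re.split also returns the captured address fragments.
    pieces = re.split(r'From\s([\w+.?]+@(\w+\.)+(\w+))', data)
    # Keep only real messages: drop None, empty strings and the short captured
    # fragments. Building a new list avoids calling remove() on a list while
    # iterating over it, which skips elements and leaves short items behind.
    return [piece for piece in pieces if piece and len(piece) >= 500]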
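
Removing items from a list inside a `for` loop over that same list skips elements, so one pass can leave short fragments behind. Here is a minimal sketch with made-up data that compares that pattern with a list comprehension; `remove_in_loop` and `keep_long` are hypothetical names.

```python
def remove_in_loop(items, min_len=500):
    """Remove short items while iterating over the same list (skips elements)."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if len(item) < min_len:
            # After a removal the next element shifts into the current index,
            # and the loop then moves past it without checking it.
            items.remove(item)
    return items


def keep_long(items, min_len=500):
    """Build a new list containing only the items that are long enough."""
    return [item for item in items if len(item) >= min_len]


pieces = ['a', 'b', 'x' * 600, 'c', 'd']  # made-up data: four short items, one long
leftover = remove_in_loop(pieces)   # 'b' and 'd' survive a single pass
filtered = keep_long(pieces)        # only the long item remains
```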

In [118]:
data_list = prepare_data('2013-11.mbx')

In [120]:
def extract_headers(msg_string):
    mime_dict = {}
    msg = mime.from_string(msg_string)
    msg_list = msg.headers.items()
    mime_keys = ['From', 'Date', 'Cc', 'To', 'Subject', ]
    for item in msg_list:
        if item[0] in mime_keys:
            mime_dict[item[0]] = item[1]
    return mime_dict

In [121]:
extract_headers(data_list[3])

{'Cc': u'\u4e2d\u6587HTML5\u540c\u6a02\u6703ML <public-html-ig-zh@w3.org>',
 'Date': u'Wed, 6 Nov 2013 20:09:11 +0800',
 'From': u'\u8463\u798f\u8208 Bobby Tung <bobbytung@wanderer.tw>',
 'Subject': u'Re: \u95dc\u65bc<cite>\u5143\u7d20\u6700\u8fd1\u7684\u5b9a\u7fa9\u4fee\u6539',
 'To': u'Yijun Chen <ethantw@me.com>'}

In [184]:
def create_name_list(data_list):
    name_list = []
    p = re.compile(ur'\"?([\w\s\(\)]+|[\x80-\xff]+)\"?\s<')
    for message_string in data_list:
        msg = mime.from_string(message_string)
        for item in msg.headers.items():
            if item[0] == 'From':
                match = p.search(item[1].encode('utf-8'))
                if match:  # skip From headers that the pattern cannot parse
                    name_list.append(match.group(1))
    name_list = list(set(name_list))
    name_list += ['Cindy', 'Kenny', 'Chen Yijun', 'Chunming', '-ambrose']
    name_list.remove('com')
    name_list.remove(' Chunming')
    name_list.remove(' Bobby Tung')
    name_list.remove('Hawkeyes Wind')
    return name_list
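
As an alternative to the hand-written pattern, the standard library can split a From header into a display name and an address. This sketch swaps in `email.utils.parseaddr` and is not the method used above; `display_names` is a hypothetical helper and the headers are sample values.

```python
from email.utils import parseaddr


def display_names(from_headers):
    """Collect the distinct display names found in a list of From header values."""
    names = set()
    for header in from_headers:
        name, _address = parseaddr(header)  # name is '' when the header has none
        if name:
            names.add(name.strip())
    return sorted(names)


headers = [
    '"Yijun Chen" <ethantw@me.com>',
    'Bobby Tung <bobbytung@wanderer.tw>',
    'Yijun Chen <ethantw@me.com>',
    'nobody@example.org',  # bare address, no display name
]
names = display_names(headers)
```

Because quoting and surrounding whitespace are handled by the parser, fewer manual `remove` fix-ups should be needed, though names that appear only inside message bodies would still have to be added by hand.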

In [204]:
def extract_signature(message_string, name_list):
    signature_list = []
    for name in name_list:
        p_name = re.compile(r'^%s.+'%name, re.MULTILINE | re.DOTALL)
        msg = mime.from_string(message_string)
        for part in msg.parts:
            if not isinstance(part.body, (type(None), str)):
                if p_name.findall(part.body.encode('utf-8')):
                    signature_list += p_name.findall(part.body.encode('utf-8'))
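    signature = None
    for item in signature_list:
        if len(item) < 300:
            signature = item  # in this dataset only one match is known to be shorter than 300
    if not signature:
        return None
    elif 'Hawkeyes Wind' in signature or 'Zhiqiang' in signature:  # ad hoc exclusions; more rules have to be added as cases appear
        return None
    elif '<' in signature:
        # Strip HTML from the chosen signature, not from the loop variable `item`,
        # which holds the last match and may be a different, longer one.
        soup = BeautifulSoup(signature, 'html.parser')
        return soup.get_text()
    else:
        return signature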
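
Some mail clients separate the signature from the body with a line containing only `-- `. Where that convention holds, the signature can be split off without a list of names. This sketch uses a different technique from the name-based matching above and only works for messages that follow the convention; `SIG_DELIMITER` and `split_signature` are hypothetical names, and the sample body is made up.

```python
import re

# A line consisting of two dashes, optionally followed by one space.
SIG_DELIMITER = re.compile(r'^-- ?$', re.MULTILINE)


def split_signature(body_text):
    """Return (body, signature); signature is None when no delimiter line exists."""
    text = body_text.replace('\r\n', '\n')  # normalise line endings first
    matches = list(SIG_DELIMITER.finditer(text))
    if not matches:
        return text, None
    last = matches[-1]  # use the last delimiter so quoted signatures stay in the body
    return text[:last.start()].rstrip(), text[last.end():].strip()


sample = "Hi Ethan,\n\nSee the thread.\n\n-- \nBobby Tung\nbobbytung@wanderer.tw\n"
body, signature = split_signature(sample)
```

Messages without the delimiter return `None` for the signature, so the name-based rules remain necessary as a fallback.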
    

In [210]:
def extract_content(message_string, name_list):
    # Known limitations: the keys are the raw str(part) values, parentheses
    # included (e.g. '(text/html)'); attachments are not separated out, each part
    # is simply added under whatever content type it has; and the encoding may
    # not be handled correctly.
    content_dict = {}
    msg = mime.from_string(message_string)
    for part in msg.parts:
        if not isinstance(part.body, (type(None), str)):
            content_dict[str(part)] = part.body
    content_dict['Signature'] = extract_signature(message_string, name_list)
    return content_dict

In [208]:
name_list = create_name_list(data_list)

In [213]:
extract_content(data_list[4], name_list)

{'(text/html)': u'<div dir="ltr">\u6700\u8fd1\u597d\u591a\u4e89\u8bae\u6027\u7684\u6539\u52a8\u554a\u3002<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/11/6 \u8463\u798f\u8208 Bobby Tung <span dir="ltr">&lt;<a href="mailto:bobbytung@wanderer.tw" target="_blank">bobbytung@wanderer.tw</a>&gt;</span><br>\r\n<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>Hi Ethan</div><div><br></div><div>\u627e\u4e86\u4e00\u4e0bHTMLWG\u7684\u8a0e\u8ad6\uff0c\u9019\u5152\u6709\u4e00\u4e32\u516b\u6708\u7684\u5c0d\u8a71\uff1a</div><div><br></div>\r\n<div><a href="http://lists.w3.org/Archives/Public/public-html/2013Aug/0067.html" target="_blank">http://lists.w3.org/Archives/Public/public-html/2013Aug/0067.html</a></div><div><br></div><div>\u611f\u89ba\u4e0a\u5c31\u662f\u5168\u4e16\u754c\u90fd\u505a\u932f\uff0c\u8207\u5176\u77ef\u6b63\u9019\u4e16\u754c\uff0c\u4e0d\u5982\u5c07\u932f\