# Week 2 Assignment

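## Email structure

- The stored form of each email is divided into the following parts:
   - 'Metadata': information extracted from the MIME headers, such as 'From', 'Receive', 'Send_time', 'Subject', and so on
   - 'Content': all of the content, such as 'Body_text', 'Body_html', 'Recite', 'Attachment', 'Signature', and so on
   - 'Entities': the entities mentioned in the email, such as 'Name', 'Organization', 'Time', 'Position', 'Tel'
   - 'Relation': the relations involving the email, such as relations between emails and semantic relations within the email content.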
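
The four parts above could be held in a nested dictionary that serializes directly to JSON. The following is only a sketch: the helper name `empty_record`, the default values, and the empty shape of 'Relation' are assumptions for illustration, since the notes do not fix them.

```python
import json


def empty_record():
    """Return a fresh record with the four top-level parts (hypothetical helper)."""
    return {
        "Metadata": {"From": None, "Receive": None, "Send_time": None, "Subject": None},
        "Content": {
            "Body_text": None,
            "Body_html": None,
            "Recite": [],
            "Attachment": [],
            "Signature": None,
        },
        "Entities": {"Name": [], "Organization": [], "Time": [], "Position": [], "Tel": []},
        # The shape of the relation data is not specified in the notes; left empty here.
        "Relation": {},
    }


record = empty_record()
record["Metadata"]["Subject"] = "Re: example"
print(json.dumps(record, indent=2, sort_keys=True))
```

Building the record inside a function gives every email its own lists, so appending an entity to one record does not leak into another.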

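## Approach

- Use flanker to parse the MIME headers for a first pass of extraction; this covers the 'Metadata' part.
- Use regular expressions to break down the message content, extracting paragraphs, quotes, attachments, and the signature in turn.
- Use tokenizers such as NLTK and jieba to refine further and extract the individual entities.
- Finally, carry out the deeper relation analysis and extraction.
- Each layer builds on the previous one, going progressively deeper.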


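## Difficulties

- Paragraph segmentation can be messy, so this step may take some time
- Quotes are identified by the '>' and '>>' prefixes
- Signatures are complicated: formats vary, some emails have none, and some are so short that they carry little information
- Representing relations, both within an email and between emails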
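
As an illustration of the quote rule, the sketch below counts the leading '>' markers on each line and separates quoted lines from the author's own text. The function names `quote_depth` and `split_quoted` are hypothetical, and the assumption that quote markers may be separated by a single space ('> >') is mine, not something stated in the notes.

```python
import re

# One or more '>' markers at the start of a line, each optionally followed by a space.
QUOTE_PREFIX = re.compile(r'^\s*((?:>\s?)+)')


def quote_depth(line):
    """Return how many '>' markers prefix the line (0 means not quoted)."""
    match = QUOTE_PREFIX.match(line)
    if not match:
        return 0
    return match.group(1).count('>')


def split_quoted(body):
    """Split a plain-text body into (own_lines, quoted_lines)."""
    own, quoted = [], []
    for line in body.splitlines():
        (quoted if quote_depth(line) else own).append(line)
    return own, quoted


own, quoted = split_quoted("Thanks!\n> first reply\n>> original message\n")
print(own)
print(quoted)
```

A '>' that appears in the middle of a line is ignored, because the pattern is anchored at the start of the line.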

## Tips

- Every multipart message ends with a closing `--boundary--` delimiter
    - There is a pitfall here
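
To make the closing delimiter concrete, here is a small sketch using Python 3's standard-library `email` package (not flanker, which the notebook uses). The helper names and the sample message are invented for illustration. The boundary string comes from the `Content-Type` header, parts are separated by `--` plus the boundary, and the last line adds a trailing `--`.

```python
from email import message_from_string

SAMPLE = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>hello</p>\n"
    "--XYZ--\n"
)


def closing_delimiter(raw):
    """Return the closing delimiter line, or None for a non-multipart message."""
    boundary = message_from_string(raw).get_boundary()
    if boundary is None:
        return None
    return "--" + boundary + "--"


def ends_with_closing_delimiter(raw):
    """Check whether the raw text ends with its own closing delimiter."""
    delimiter = closing_delimiter(raw)
    return delimiter is not None and raw.rstrip().endswith(delimiter)


print(closing_delimiter(SAMPLE))
print(ends_with_closing_delimiter(SAMPLE))
```

A message without a `boundary` parameter has no closing delimiter, so `closing_delimiter` returns `None` in that case.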


In [224]:
import os
import re
import json
import flanker
import jieba.posseg as pseg

from bs4 import BeautifulSoup
from flanker import mime
from nltk.tag import StanfordNERTagger

In [225]:
# Add models of NLTK  
os.environ["CLASSPATH"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09"  
os.environ["STANFORD_MODELS"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09/models"

In [226]:
# Add Tagger
st = StanfordNERTagger('/Users/xpgeng/Github/kg-beijing/class1/week1/homework/xpgeng/english.all.3class.distsim.crf.ser.gz')

### Reading the data

In [117]:
def prepare_data(filename='2013-11.mbx'):
    # The with-block closes the file, so no explicit close() is needed.
    with open(filename, 'r') as f:
        data = f.read()
    # Split on the mbox "From user@host" separators. The pattern has capturing
    # groups, so re.split also returns the captured pieces (or None).
    pieces = re.split(r'From\s([\w+.?]+@(\w+\.)+(\w+))', data)
    # Keep only chunks long enough to be a whole message. Building a new list
    # avoids removing items while iterating, which skips elements.
    return [piece for piece in pieces if piece and len(piece) >= 500]

In [118]:
data_list = prepare_data('2013-11.mbx')
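
Removing items from a list while looping over that same list skips the element that follows each removal, so a single pass of `remove()` leaves some short chunks behind. The toy example below (hypothetical function names, made-up data) contrasts that with filtering into a new list.

```python
def remove_short_in_place(items, min_len):
    """Buggy on purpose: mutates the list it is iterating over."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if len(item) < min_len:
            items.remove(item)
    return items


def keep_long(items, min_len):
    """Filter into a new list; every element is examined exactly once."""
    return [item for item in items if len(item) >= min_len]


chunks = ['a', 'b', 'long enough', 'c', 'd']
print(remove_short_in_place(chunks, 5))  # ['b', 'long enough', 'd']
print(keep_long(chunks, 5))              # ['long enough']
```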

### Extracting the MIME headers, message content, and signature

In [120]:
def extract_headers(msg_string):
    mime_dict = {}
    msg = mime.from_string(msg_string)
    msg_list = msg.headers.items()
    mime_keys = ['From', 'Date', 'Cc', 'To', 'Subject', ]
    for item in msg_list:
        if item[0] in mime_keys:
            mime_dict[item[0]] = item[1]
    return mime_dict

In [121]:
extract_headers(data_list[3])

{'Cc': u'\u4e2d\u6587HTML5\u540c\u6a02\u6703ML <public-html-ig-zh@w3.org>',
 'Date': u'Wed, 6 Nov 2013 20:09:11 +0800',
 'From': u'\u8463\u798f\u8208 Bobby Tung <bobbytung@wanderer.tw>',
 'Subject': u'Re: \u95dc\u65bc<cite>\u5143\u7d20\u6700\u8fd1\u7684\u5b9a\u7fa9\u4fee\u6539',
 'To': u'Yijun Chen <ethantw@me.com>'}
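
For comparison, the same selection of headers can be sketched with Python 3's standard-library `email` package instead of flanker. This is a swapped-in technique, not the notebook's method; the name `extract_headers_stdlib` is hypothetical. `decode_header` and `make_header` turn RFC 2047 encoded words (common in Chinese subjects and display names) into readable text.

```python
from email import message_from_string
from email.header import decode_header, make_header

WANTED = ('From', 'Date', 'Cc', 'To', 'Subject')


def extract_headers_stdlib(raw):
    """Return the wanted headers as decoded text, skipping absent ones."""
    msg = message_from_string(raw)
    result = {}
    for key in WANTED:
        value = msg.get(key)
        if value is not None:
            # Decode any =?charset?encoding?...?= words into plain text.
            result[key] = str(make_header(decode_header(value)))
    return result


sample = (
    "From: Alice <alice@example.com>\n"
    "Subject: =?utf-8?b?5L2g5aW9?=\n"
    "X-Other: ignored\n"
    "\n"
    "body\n"
)
print(extract_headers_stdlib(sample))
```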

In [184]:
def create_name_list(data_list):
    name_list = []
    p = re.compile(ur'\"?([\w\s\(\)]+|[\x80-\xff]+)\"?\s<')
    for message_string in data_list:
        msg = mime.from_string(message_string)
        for item in msg.headers.items():
            if item[0] == 'From':
                match = p.search(item[1].encode('utf-8'))
                if match:  # skip From headers without a display name
                    name_list.append(match.group(1))
    name_list = list(set(name_list))
    # Manual corrections for this particular archive
    name_list += ['Cindy', 'Kenny', 'Chen Yijun', 'Chunming', '-ambrose']
    for bad_name in ['com', ' Chunming', ' Bobby Tung', 'Hawkeyes Wind']:
        if bad_name in name_list:  # list.remove raises ValueError if absent
            name_list.remove(bad_name)
    return name_list
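
As an alternative to the hand-written regex, the standard library's `email.utils.parseaddr` splits an address header into a display name and an address. The sketch below is a different technique from the one above, with a hypothetical function name and invented sample headers. It assumes the header text has already been decoded; RFC 2047 encoded names would need decoding first.

```python
from email.utils import parseaddr


def display_names(from_headers):
    """Collect the unique, non-empty display names from From header values."""
    names = set()
    for header in from_headers:
        name, _address = parseaddr(header)
        name = name.strip()
        if name:  # bare addresses have no display name
            names.add(name)
    return sorted(names)


print(display_names([
    'Bobby Tung <bobbytung@wanderer.tw>',
    '"Yijun Chen" <ethantw@me.com>',
    'nobody@example.com',
    'Bobby Tung <bobbytung@wanderer.tw>',
]))
```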

In [204]:
def extract_signature(message_string, name_list):
    signature_list = []
    msg = mime.from_string(message_string)  # parse the message once
    for name in name_list:
        # re.escape keeps names containing '(' or ')' from breaking the pattern
        p_name = re.compile(r'^%s.+' % re.escape(name), re.MULTILINE | re.DOTALL)
        for part in msg.parts:
            if not isinstance(part.body, (type(None), str)):
                signature_list += p_name.findall(part.body.encode('utf-8'))
    signature = None
    for item in signature_list:
        if len(item) < 300:
            signature = item  # only one candidate is known to be under 300 characters
    if not signature:
        return None
    elif 'Hawkeyes Wind' in signature or 'Zhiqiang' in signature:  # rules have to be added case by case...
        return None
    elif '<' in signature:
        # strip HTML tags from the chosen signature
        soup = BeautifulSoup(signature, 'html.parser')
        return soup.get_text()
    else:
        return signature
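
A complementary heuristic, sketched here under my own assumptions rather than taken from the notebook, is to look for a delimiter line made of dashes (such as `---` or the conventional `-- `) and treat the short block after the last one as the signature. The function name is hypothetical; the 300-character limit reuses the threshold from the code above.

```python
import re

# A line consisting only of two or more dashes, optionally followed by whitespace.
SIG_DELIMITER = re.compile(r'^-{2,}\s*$')


def signature_after_delimiter(text, max_len=300):
    """Return the text after the last dash-only line, if it is short enough."""
    lines = text.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if SIG_DELIMITER.match(lines[i]):
            candidate = '\n'.join(lines[i + 1:]).strip()
            if candidate and len(candidate) < max_len:
                return candidate
            return None
    return None


body = "Hi friends,\n\nSee you at TPAC.\n\n---\nZi Bin Cheah\nHTML5 Chinese IG chair\n"
print(signature_after_delimiter(body))
```

This only helps for messages that use such a delimiter; emails with no signature, or with one that is not set off by dashes, still return `None`.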
    

In [210]:
def extract_content(message_string, name_list):
    content_dict = {}
    msg = mime.from_string(message_string)
    for part in msg.parts:
        if not isinstance(part.body, (type(None), str)):
            content_dict[str(part)] = part.body
    signature = extract_signature(message_string, name_list)
    content_dict['Signature'] = signature
    # The keys are not normalized yet: the parenthesized str(part) value is used
    # directly, attachments are not distinguished, and an entry is added for
    # whatever content types are present.
    return content_dict

In [208]:
name_list = create_name_list(data_list)

In [214]:
extract_content(data_list[0], name_list)

{'(text/html)': u'<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi friends,<br><br>In light of the upcoming TPAC, I\'d like to suggest a joint meeting between the two IGs. <br><br>There have been some discussion in the Chinese IG on things related to publishing and I thought it will be nice for us to catch up with the Digital Publishing IG.<br><br>Agenda<br><br>0. Mutual introduction<br>1. CSS3 text (some discussion on our side)<br>2. digital publishing requirement for chinese language (Bobby has written a spec/requirement and it\'d be nice to know how everyone thinks)<br>3. anything else?<br><br>This discussion won\'t be an exhaustive one, rather it is to put names to faces, discuss the agendas, and hopefully drive future online discussions.<br><br>If everyone is cool, maybe we can do a 90 minute on Thursday during TPAC? <br><br>---<br>Zi Bin Cheah<br>HTML5 Chinese IG chair<br><br><div apple-content-edited="tr
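
The keys in the output above come straight from `str(part)`. One possible way to normalize them into the 'Body_text', 'Body_html', and 'Attachment' fields is sketched below with Python 3's standard-library `email` package rather than flanker. The function name, the mapping table, and the rule that any part with a filename or an unmapped content type counts as an attachment are all assumptions for illustration.

```python
from email import message_from_string

# Hypothetical mapping from content type to the field names used in these notes.
KEY_BY_TYPE = {'text/plain': 'Body_text', 'text/html': 'Body_html'}


def extract_content_stdlib(raw):
    """Group the leaf parts of a message into body text, body HTML, and attachments."""
    msg = message_from_string(raw)
    content = {'Body_text': None, 'Body_html': None, 'Attachment': []}
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        filename = part.get_filename()
        ctype = part.get_content_type()
        if filename or ctype not in KEY_BY_TYPE:
            # Record only the name (or type) of the attachment, not its bytes.
            content['Attachment'].append(filename or ctype)
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or 'utf-8'
        key = KEY_BY_TYPE[ctype]
        if content[key] is None:  # keep the first body of each kind
            content[key] = payload.decode(charset, errors='replace')
    return content
```

Calling `extract_content_stdlib(raw_message)` on a raw message string returns a dictionary whose keys no longer depend on how a part happens to print.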

### Extracting entities

In [222]:
msg = mime.from_string(data_list[60])
msg_list = msg.headers.items()
for item in  msg_list:
    print item

('Received', u'from lisa.w3.org ([128.30.52.41])\tby frink.w3.org with esmtp (Exim 4.72)\t(envelope-from <hawkeyes0.cn@gmail.com>)\tid 1Vg4I3-00055D-Aq\tfor public-html-ig-zh@listhub.w3.org; Tue, 12 Nov 2013 03:05:19 +0000')
('Received', u'from mail-pb0-f53.google.com ([209.85.160.53])\tby lisa.w3.org with esmtps (TLS1.0:RSA_ARCFOUR_SHA1:16)\t(Exim 4.72)\t(envelope-from <hawkeyes0.cn@gmail.com>)\tid 1Vg4I2-000462-9t\tfor public-html-ig-zh@w3.org; Tue, 12 Nov 2013 03:05:19 +0000')
('Received', u'by mail-pb0-f53.google.com with SMTP id up7so6125577pbc.26        for <public-html-ig-zh@w3.org>; Mon, 11 Nov 2013 19:04:51 -0800 (PST)')
('Dkim-Signature', u'v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com; s=20120113;        h=message-id:date:from:user-agent:mime-version:to:cc:subject         :references:in-reply-to:content-type;        bh=PzjpVCFu3us5lBXmTxuAs7UwX5dTHDlsSGmQS1I6c00=;        b=mJ6VNzBsbulgOb9ZMcHj7/+XH2HcESKcy1hCxdgzhnFRrDSWLs+NDLWDSfoaUzePTc         4fHY70swPObsi9Aj/