# week2 Assignment

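## Email Structure

- Each email is stored as a record with the following parts:
   - 'Metadata': information extracted from the MIME headers, such as 'From', 'Receive', 'Send_time', 'Subject', and so on.
   - 'Content': all of the content, such as 'Body_text', 'Body_html', 'Recite', 'Attachment', 'Signature', and so on.
   - 'Entities': the entities mentioned in the email, such as 'Name', 'Organization', 'Time', 'Position', 'Tel'.
   - 'Relation': the relations found in the email, such as relations between emails and semantic relations within the content.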
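
The four-part layout above can be pictured as a nested dictionary. This is only a sketch of one possible shape: the field names come from the list, while the choice of `None` versus empty lists as defaults and the helper name `empty_record` are assumptions made for illustration. It is written for Python 3 with the standard library only.

```python
import json


def empty_record():
    """Return a blank record with the four top-level parts (hypothetical helper)."""
    return {
        "Metadata": {"From": None, "Receive": None, "Send_time": None, "Subject": None},
        "Content": {
            "Body_text": None,
            "Body_html": None,
            "Recite": [],
            "Attachment": [],
            "Signature": None,
        },
        "Entities": {"Name": [], "Organization": [], "Time": [], "Position": [], "Tel": []},
        "Relation": {},
    }


record = empty_record()
record["Metadata"]["Subject"] = "Re: example"  # made-up value for illustration
print(json.dumps(record, indent=2, sort_keys=True))
```

Building the record from a function rather than a shared constant gives every email its own lists, so appending an entity to one record cannot leak into another.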

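## Approach

- Use flanker to parse the MIME headers and extract the basic information; this covers the 'Metadata' part.
- Use regular expressions to break up the message body, extracting paragraphs, quotes, attachments, and the signature in turn.
- Use word-segmentation tools such as NLTK and jieba to refine the result and extract the individual entities.
- Finally, carry out deeper relation analysis and extraction.
- Each layer builds on the previous one, so the analysis deepens step by step.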


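## Difficulties

- Paragraph boundaries can be messy, so segmentation may take some time.
    - Paragraph segmentation is not done yet.
- Quoted text is identified by the '>' and '>>' prefixes.
    - Quote extraction is not done yet either.
- Signatures are complicated: the format varies, some emails have none, and some are so short that they carry little information.
    - Signatures can be extracted roughly, but the extracted format is not yet normalized.
- Relation representation, both within an email and between emails.
    - I do not know how to represent the relations yet.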
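
Quote detection by the '>' prefix could look roughly like the sketch below. It is not part of the notebook's code: the function name `split_quotes` and the idea of recording a nesting depth are assumptions, and it only handles plain-text bodies. Python 3, standard library only.

```python
import re

# One or more '>' markers at the start of a line, each optionally followed by a space.
QUOTE_RE = re.compile(r'^\s*((?:>\s?)+)')


def split_quotes(body):
    """Split a plain-text body into the author's own lines and quoted lines.

    Returns (own_lines, quoted), where quoted is a list of (depth, text)
    and depth is the number of '>' markers in front of the line.
    """
    own, quoted = [], []
    for line in body.splitlines():
        match = QUOTE_RE.match(line)
        if match:
            depth = match.group(1).count('>')
            quoted.append((depth, line[match.end():]))
        else:
            own.append(line)
    return own, quoted


own, quoted = split_quotes("Hi\n> first\n>> second\n> > third\nBye")
```

Keeping the depth makes it possible to tell a direct quote ('>') from a quote of a quote ('>>'), which may be useful later when relating messages to each other.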

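## Tips

- Every email ends with --boundary--
    - Pitfall: when writing the regex, the boundary string ends with '\_', and it also clashes with '!--' in the HTML. Splitting the emails on the boundary cost a lot of time.
    - Splitting directly on 'From .....@....' works better.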
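
As a rough illustration of splitting on the 'From .....@....' line, the sketch below anchors the separator to the start of a line so that the word "From" inside a body is less likely to trigger a split. It assumes the archive uses mbox-style separator lines; the sample text and the name `split_mbox` are made up for the example. Python 3, standard library only (the standard `mailbox` module is another option for real mbox files).

```python
import re

# A separator line starts with 'From ', followed by an address containing '@'.
SEP_RE = re.compile(r'^From \S+@\S+.*$', re.MULTILINE)


def split_mbox(text):
    """Return the messages in text, each starting at its 'From addr' line."""
    starts = [m.start() for m in SEP_RE.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


# Made-up two-message sample.
sample = (
    "From a@example.com Wed Nov  6 20:09:11 2013\n"
    "Subject: one\n\nbody one\n"
    "From b@example.org Thu Nov  7 08:00:00 2013\n"
    "Subject: two\n\nFrom the start, body two\n"
)
messages = split_mbox(sample)
```

The body line "From the start, body two" is not treated as a separator because the token after "From " contains no '@'.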


In [1]:
import os
import re
import json
import flanker
import jieba.posseg as pseg

from bs4 import BeautifulSoup
from flanker import mime
from nltk.tag import StanfordNERTagger

In [2]:
# Add models of NLTK  
os.environ["CLASSPATH"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09"  
os.environ["STANFORD_MODELS"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09/models"

In [3]:
# Add Tagger
st = StanfordNERTagger('/Users/xpgeng/Github/kg-beijing/class1/week1/homework/xpgeng/english.all.3class.distsim.crf.ser.gz')

### Reading the Data

In [4]:
def prepare_data(filename='2013-11.mbx'):
    """Read the mailbox file and return the raw text of each message."""
    with open(filename, 'r') as f:  # the with block closes the file
        data = f.read()
    # Split on the 'From user@host' separator. The groups are non-capturing
    # so that re.split() returns only the text between separators.
    chunks = re.split(r'From\s(?:[\w+.?]+@(?:\w+\.)+\w+)', data)
    # Drop fragments shorter than 500 characters. Building a new list avoids
    # calling remove() on a list while iterating over it, which skips items.
    return [chunk for chunk in chunks if len(chunk) >= 500]

In [5]:
data_list = prepare_data('2013-11.mbx')

### Extracting MIME Headers, Message Content, and Signature

In [6]:
def extract_headers(msg_string):
    """Return the selected MIME headers of one message as a dict."""
    mime_keys = ('From', 'Date', 'Cc', 'To', 'Subject')
    msg = mime.from_string(msg_string)
    mime_dict = {}
    for key, value in msg.headers.items():
        if key in mime_keys:
            mime_dict[key] = value
    return mime_dict

In [7]:
extract_headers(data_list[3])

{'Cc': u'\u4e2d\u6587HTML5\u540c\u6a02\u6703ML <public-html-ig-zh@w3.org>',
 'Date': u'Wed, 6 Nov 2013 20:09:11 +0800',
 'From': u'\u8463\u798f\u8208 Bobby Tung <bobbytung@wanderer.tw>',
 'Subject': u'Re: \u95dc\u65bc<cite>\u5143\u7d20\u6700\u8fd1\u7684\u5b9a\u7fa9\u4fee\u6539',
 'To': u'Yijun Chen <ethantw@me.com>'}

In [8]:
def create_name_list(data_list):
    """Collect sender display names from the 'From' headers."""
    names = set()
    p = re.compile(ur'\"?([\w\s\(\)]+|[\x80-\xff]+)\"?\s<')
    for message_string in data_list:
        msg = mime.from_string(message_string)
        for key, value in msg.headers.items():
            if key == 'From':
                match = p.search(value.encode('utf-8'))
                if match:  # skip values that are not in 'Name <address>' form
                    names.add(match.group(1))
    # Hand-tuned corrections for this particular archive.
    names.update(['Cindy', 'Kenny', 'Chen Yijun', 'Chunming', '-ambrose'])
    names -= set(['com', ' Chunming', ' Bobby Tung', 'Hawkeyes Wind'])
    return list(names)
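
As an alternative to the regular expression, the standard library's `email.utils.parseaddr` can separate a display name from an address. The sketch below is only an illustration under that assumption and is not what the notebook does; the helper name `display_name` and the fallback to the local part of the address are choices made for the example. Python 3.

```python
from email.utils import parseaddr


def display_name(from_header):
    """Return the display name of a From header, or the local part if there is none."""
    name, address = parseaddr(from_header)
    return name or address.split('@')[0]


print(display_name('"Bobby Tung" <bobbytung@wanderer.tw>'))
print(display_name('ethantw@me.com'))
```

Falling back to the local part means every message yields some name, at the cost of occasionally returning a mailbox name rather than a person's name.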

In [9]:
def extract_signature(message_string, name_list):
    """Return the signature block of a message, or None if none is found."""
    msg = mime.from_string(message_string)  # parse once, not once per name
    signature_list = []
    for name in name_list:
        # A signature is assumed to start with a known name at the beginning
        # of a line and to run to the end of the part.
        p_name = re.compile(r'^%s.+' % re.escape(name), re.MULTILINE | re.DOTALL)
        for part in msg.parts:
            if not isinstance(part.body, (type(None), str)):
                signature_list += p_name.findall(part.body.encode('utf-8'))
    signature = None
    for item in signature_list:
        if len(item) < 300:
            signature = item  # only one match is known to be shorter than 300
    if not signature:
        return None
    if 'Hawkeyes Wind' in signature or 'Zhiqiang' in signature:  # rules have to be added case by case...
        return None
    if '<' in signature:
        # Strip the HTML tags from the chosen signature.
        return BeautifulSoup(signature, 'html.parser').get_text()
    return signature
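
Another heuristic that might complement the name-based search is to look for a delimiter line. The first message in this archive, for example, puts '---' right before the sender's name. The sketch below assumes a plain-text body and a delimiter of two or three dashes on a line of its own; the function name and the `max_lines` cut-off are hypothetical. Python 3, standard library only.

```python
import re

# A line made of two or three dashes, optionally followed by whitespace.
DELIM_RE = re.compile(r'^-{2,3}\s*$')


def signature_after_delimiter(text, max_lines=8):
    """Return the non-empty lines after the last delimiter line, or None."""
    lines = text.splitlines()
    for i in range(len(lines) - 1, -1, -1):  # search from the bottom up
        if DELIM_RE.match(lines[i]):
            tail = [line.strip() for line in lines[i + 1:] if line.strip()]
            if 0 < len(tail) <= max_lines:
                return '\n'.join(tail)
            return None  # nothing after the delimiter, or too long to be a signature
    return None


body = "Hi friends,\n\nSee you at TPAC.\n\n---\nZi Bin Cheah\nHTML5 Chinese IG chair\n"
print(signature_after_delimiter(body))
```

Messages without any delimiter return `None`, so this would only be one rule among several rather than a replacement for the name list.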
    

In [10]:
def extract_content(message_string, name_list):
    """Return the body parts of a message, keyed by content type, plus the signature."""
    # Known limitations: the keys are the raw str(part) values, parentheses
    # included (e.g. '(text/html)'), and attachments are not separated out;
    # every part is stored under whatever content type it declares.
    content_dict = {}
    msg = mime.from_string(message_string)
    for part in msg.parts:
        if not isinstance(part.body, (type(None), str)):
            content_dict[str(part)] = part.body
    content_dict['Signature'] = extract_signature(message_string, name_list)
    return content_dict

In [11]:
name_list = create_name_list(data_list)

In [12]:
extract_content(data_list[0], name_list)

{'(text/html)': u'<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi friends,<br><br>In light of the upcoming TPAC, I\'d like to suggest a joint meeting between the two IGs. <br><br>There have been some discussion in the Chinese IG on things related to publishing and I thought it will be nice for us to catch up with the Digital Publishing IG.<br><br>Agenda<br><br>0. Mutual introduction<br>1. CSS3 text (some discussion on our side)<br>2. digital publishing requirement for chinese language (Bobby has written a spec/requirement and it\'d be nice to know how everyone thinks)<br>3. anything else?<br><br>This discussion won\'t be an exhaustive one, rather it is to put names to faces, discuss the agendas, and hopefully drive future online discussions.<br><br>If everyone is cool, maybe we can do a 90 minute on Thursday during TPAC? <br><br>---<br>Zi Bin Cheah<br>HTML5 Chinese IG chair<br><br><div apple-content-edited="tr

### Extracting Entities

In [13]:
def observe_data(data_list):
    for data in data_list:
        content_dict = extract_content(data, name_list)
        for k, v in content_dict.items():
            if k == '(text/plain)':
                print v

In [31]:
def extract_entities(data):
    """Extract person and organization names from the plain-text part."""
    organizations = []
    names = set()
    for k, v in extract_content(data, name_list).items():
        if k == '(text/plain)':
            for word, flag in pseg.cut(v):
                if flag == 'nt':    # jieba tag for organization names
                    organizations.append(word)
                elif flag == 'nr':  # jieba tag for person names
                    names.add(word)
    # Words that jieba mislabels as person names in this archive.
    remove_list = [u'於', u'後', u'大大增加', u'索引', u'關於', u'麼', u'安', u'明白', u'连']
    return {
        'Organization': organizations,
        'Name': [name for name in names if name not in remove_list],
    }

In [32]:
extract_entities(data_list[4])

{'Name': [u'\u5361\u5217\u5c3c',
  u'\u5b89\u5a1c',
  u'\u9b6f\u8fc5',
  u'\u7b1b\u5361\u723e',
  u'\u5927\u76f8',
  u'\u675c\u9b6f\u9580',
  u'\u6625\u79cb\u5de6\u6c0f',
  u'\u8463\u798f\u8208',
  u'\u65af\u5927\u6797'],
 'Organization': []}

### Extracting Relations

In [33]:
def extract_relations(data):
    relations_dict = {}
    msg = mime.from_string(data)
    for item in msg.headers.items():
        if item[0] == 'In-Reply-To':
            relations_dict[item[0]] = item[1]
    return relations_dict

In [34]:
extract_relations(data_list[44])

{'In-Reply-To': u'<A8DD11E7EBEF4EF0AA731A864157B84F@gmail.com>'}
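
One simple way to represent the relation between emails is a list of edges, one per reply. The sketch below assumes each message's headers are available as a dict containing 'Message-Id' and, for replies, 'In-Reply-To'. The notebook's `extract_headers` does not collect 'Message-Id', so that part is an assumption, and the IDs in the example are made up. Python 3, standard library only.

```python
def build_reply_edges(messages):
    """Turn header dicts into reply edges: source --reply_to--> target."""
    known_ids = {m['Message-Id'] for m in messages}
    edges = []
    for m in messages:
        parent = m.get('In-Reply-To')
        if parent:
            edges.append({
                'source': m['Message-Id'],
                'type': 'reply_to',
                'target': parent,
                # False when the parent message is not in this archive.
                'target_in_archive': parent in known_ids,
            })
    return edges


# Made-up message IDs for illustration.
headers = [
    {'Message-Id': '<1@example.com>'},
    {'Message-Id': '<2@example.com>', 'In-Reply-To': '<1@example.com>'},
    {'Message-Id': '<3@example.com>', 'In-Reply-To': '<0@elsewhere.org>'},
]
edges = build_reply_edges(headers)
```

An edge list like this serializes directly to JSON and can later be loaded into a graph library if thread-level analysis is needed.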

### Generating JSON Output

In [46]:
with open('W3C.json', 'a') as f:  # the with block closes the file
    for data in data_list:
        result = {  # build a fresh record for each message
            'headers': extract_headers(data),
            'content': extract_content(data, name_list),
            'entity': extract_entities(data),
            'relation': extract_relations(data),
        }
        f.write(json.dumps(result, indent=4, sort_keys=True))
        f.write('\n\n')
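
The cell above writes several indented JSON objects back to back, which `json.load` cannot read in a single call. If the file needs to be loaded again, one option is the JSON Lines convention: one compact object per line. The sketch below shows the idea with an in-memory stream; the function names and the sample records are made up for illustration. Python 3, standard library only.

```python
import io
import json


def write_json_lines(records, stream):
    """Write one compact JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False, sort_keys=True))
        stream.write('\n')


def read_json_lines(stream):
    """Read back a stream written by write_json_lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Made-up records standing in for the per-message results.
records = [
    {'headers': {'Subject': 'Re: example'}, 'relation': {}},
    {'headers': {'Subject': 'Another'}, 'relation': {'In-Reply-To': '<1@example.com>'}},
]
buffer = io.StringIO()
write_json_lines(records, buffer)
buffer.seek(0)
restored = read_json_lines(buffer)
```

Passing `ensure_ascii=False` keeps Chinese text readable in the output file instead of `\uXXXX` escapes; in that case the file should be opened with an explicit UTF-8 encoding.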