# Assignment_2

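## Problem
- Data: one month of mail archives from the W3C HTML5 Chinese Interest Group, in mbox format
- Locate the signature block in each email and extract as many fields from it as possible
- Encoding is a problem: the emails do not share a single encoding.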


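## Thoughts
- Should I start with a round of brute-force extraction?
- There is a lot of garbled text. Does this call for transcoding?
- What marks the start and end of each email?
    - Answer: From ....@......
- How can I extract as much information as possible?
    - How can the regex be made more general?
- New idea: use word segmentation to extract all the names, then use them to extract the signature from the body of each email
- The extracted names come in several different forms, e.g. mixed Chinese and English, or parentheses in the middle of the name.
    - So first build a cleaned-up name_list, then compile that name_list into regular expressions.
    - This will probably make the program slow, especially when name_list is very long.
    - Looking at name_list, many entries are duplicates, since people mail each other back and forth. In the spirit of keeping it minimal, try it first!
- Another idea:
    - Run word segmentation over the full text, extract the person names, and merge them with the names already extracted through Flanker to get one name list.
- Signatures have no fixed format, so there seems to be no single rule that extracts them. Frustrating!
    - Do I have to write a hugely fault-tolerant regex (complex and long) to match them, and then filter the results?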
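
The name_list idea above can be sketched with a single compiled alternation instead of one regex per name. This is a minimal Python 3 sketch, not the notebook's code: `build_name_pattern` is a hypothetical helper, and it assumes a signature starts with a known name at the beginning of a line.

```python
import re

def build_name_pattern(names):
    """Compile many names into one regex anchored at the start of a line."""
    # Deduplicate and drop blanks; try longer names first so that
    # 'Bobby Tung' is preferred over a shorter entry such as 'Bobby'.
    unique = sorted({n.strip() for n in names if n.strip()},
                    key=lambda n: (-len(n), n))
    # re.escape makes characters such as '(' and ')' match literally.
    alternation = "|".join(re.escape(n) for n in unique)
    # (?!\w) rejects matches that continue into a longer word.
    return re.compile(r"^(?:%s)(?!\w)" % alternation, re.MULTILINE)

names = ["Bobby Tung", " Bobby Tung", "Hao (Kenny) Lu", "梁海", "Bobby"]
pattern = build_name_pattern(names)
body = "Hi all,\nsome text\nHao (Kenny) Lu\nW3C\n梁海\nBobby Tung\nBobbyX\n"
found = pattern.findall(body)
print(found)
```

Compiling once keeps the cost of scanning each body independent of how many duplicate names were collected, which addresses the speed concern for a long name_list.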


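## Process
- Split the archive into individual emails according to the mbox format
- Print each email's content with flanker and look for a general extraction pattern
    - Ran into encoding problems along the way: bodies come back as NoneType, str, or Unicode.
- The content alone turned out not to be enough for extraction, so I thought of using the sender's name
- How to extract the sender's name?
    - Use flanker to find 'From' in the headers, which gives the sender's name and email address.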
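
The first and last steps above can be sketched with the Python 3 standard library alone. Note that this swaps in the stdlib `email` package for flanker, and that `split_mbox` and `sender_of` are hypothetical helper names. It also assumes every message starts with a `From <address> <date>` separator line and that no body line has the same shape.

```python
import re
from email import message_from_string
from email.header import decode_header, make_header
from email.utils import parseaddr

def split_mbox(text):
    """Split mbox text on its 'From <address> ...' separator lines."""
    parts = re.split(r"^From \S+@\S+.*\n", text, flags=re.MULTILINE)
    return [p for p in parts if p.strip()]

def sender_of(raw_message):
    """Return (display name, address) from the From header."""
    msg = message_from_string(raw_message)
    raw_from = msg.get("From", "")
    # Decode RFC 2047 encoded words such as =?UTF-8?B?...?= into text.
    decoded = str(make_header(decode_header(raw_from)))
    return parseaddr(decoded)

sample = (
    "From alice@example.org Fri Nov  1 00:00:00 2013\n"
    "From: Alice Example <alice@example.org>\n"
    "Subject: hello\n"
    "\n"
    "Body one\n"
    "From bob@example.org Sat Nov  2 00:00:00 2013\n"
    "From: =?UTF-8?B?5qKB5rW3?= <bob@example.org>\n"
    "\n"
    "Body two\n"
)
messages = split_mbox(sample)
senders = [sender_of(m) for m in messages]
print(senders)
```

Decoding the header before parsing it sidesteps part of the mixed-encoding problem for sender names, because each encoded word carries its own charset.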


## References
- [mbox-wikipedia](https://en.wikipedia.org/wiki/Mbox)

In [2]:
# -*- coding: utf-8 -*-
import re
import flanker
from flanker import mime

In [3]:
with open('2013-11.mbx', 'r') as f:
    data = f.read()

In [4]:
data_list_2 = re.split(r'From.+ 2013', data)

In [5]:
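# Split on the mbox separator "From <address>". The pattern contains three
# capturing groups, so re.split also returns the captured text of every separator.
data_list = filter(None, re.split(r'From\s([\w+.?]+@(\w+\.)+(\w+))', data))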

In [6]:
len(data_list)

672

In [7]:
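# Keep only the chunks that flanker parses as messages with headers;
# every other chunk produced by the split is discarded.
info_list = []
for chunk in data_list:
    msg = mime.from_string(chunk)
    if len(msg.headers.items()) != 0:
        info_list.append(chunk)
print len(info_list)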

168
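
The drop from 672 chunks to 168 messages is consistent with how `re.split` treats capturing groups: 672 is exactly 4 × 168, i.e. one message body plus three captured groups per separator. The Python 3 sketch below illustrates that behaviour on a made-up two-message string. The addresses are placeholders, and the non-capturing variant is a suggested alternative, not the notebook's code.

```python
import re

text = "From a@x.org one\nbody A\nFrom b@y.org two\nbody B\n"

# Same pattern as in the notebook: three capturing groups.
capturing = r"From\s([\w+.?]+@(\w+\.)+(\w+))"
# Same match, but (?:...) groups are not returned by re.split.
non_capturing = r"From\s(?:[\w+.?]+@(?:\w+\.)+\w+)"

with_groups = [p for p in re.split(capturing, text) if p]
without_groups = [p for p in re.split(non_capturing, text) if p]

print(len(with_groups), len(without_groups))
print(with_groups[:4])
```

With non-capturing groups each separator contributes only the text that follows it, so the later filtering on headers would no longer be needed to remove address fragments.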


In [8]:
p1 = re.compile(r'([\x80-\xff]+)')
name = p1.findall('董福興 Bobby Tung')
print name[0]

董福興
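
The byte-range class `[\x80-\xff]` works here because in Python 2 the string literal is UTF-8 bytes, and every byte of a multi-byte UTF-8 character is at or above 0x80. As a hedged Python 3 comparison (not part of the original notebook), the same idea can be written either on bytes or on text. The text version below assumes the names use the CJK Unified Ideographs block only.

```python
import re

sample = "董福興 Bobby Tung"

# Bytes version: mirrors the notebook's pattern on UTF-8 encoded data.
high_bytes = re.compile(rb"[\x80-\xff]+")
from_bytes = high_bytes.findall(sample.encode("utf-8"))[0].decode("utf-8")

# Text version: match a run of CJK Unified Ideographs (U+4E00 to U+9FFF).
cjk_run = re.compile(r"[\u4e00-\u9fff]+")
from_text = cjk_run.findall(sample)[0]

print(from_bytes, from_text)
```

The bytes version matches any non-ASCII text, including accented Latin letters, while the text version is limited to the chosen code point range.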


In [70]:
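name_list = []
# Display name in front of "<address>": ASCII words and parentheses,
# or a run of high (UTF-8) bytes, optionally wrapped in double quotes.
p = re.compile(ur'\"?([\w\s\(\)]+|[\x80-\xff]+)\"?\s<')
for message_string in info_list:
    msg = mime.from_string(message_string)
    for item in msg.headers.items():
        if item[0] == 'From':
            match = p.search(item[1].encode('utf-8'))
            if match:  # skip From headers that are not "name <address>"
                name = match.group(1)
                name_list.append(name)
                print name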
    #p_name = re.compile(r'(^%s)|(.+)'%name)
    #for part in msg.parts:
    #    if not isinstance(part.body, (type(None), str)):
    #        print p_name.findall(part.body.encode('utf-8'))

Zi Bin Cheah
Zi Bin Cheah
Yijun Chen
 Bobby Tung
John Hax
Yijun Chen
John Hax
Zi Bin Cheah
一丝
Hawkeyes Wind
Ambrose LI
Hao (Kenny) Lu
Hao (Kenny) Lu
Bobby Tung
Bobby Tung
梁海
Bobby Tung
梁海
Bobby Tung
梁海
Hao (Kenny) Lu
Bobby Tung
Hao (Kenny) Lu
Hawkeyes Wind
Hawkeyes Wind
octw chen
梁海
梁海
Hawkeyes Wind
octw chen
octw chen
octw chen
梁海
octw chen
梁海
 Chunming
Hawkeyes Wind
octw chen
octw chen
octw chen
Bobby Tung
Bobby Tung
梁海
octw chen
Bobby Tung
octw chen
Bobby Tung
octw chen
octw chen
梁海
Bobby Tung
octw chen
梁海
octw chen
梁海
octw chen
梁海
Bobby Tung
John Hax
John Hax
Hawkeyes Wind
Hawkeyes Wind
Hawkeyes Wind
Xiaoqian Cindy Wu
Bobby Tung
Xidorn Quan
octw chen
octw chen
Doris Wang
Doris Wang
John Hax
John Hax
octw chen
John Hax
John Hax
John Hax
John Hax
octw chen
Doris Wang
John Hax
梁海
梁海
Bobby Tung
梁海
John Hax
Bobby Tung
梁海
octw chen
Bobby Tung
octw chen
octw chen
com
Hao (Kenny) Lu
John Hax
Ambrose LI
Hawkeyes Wind
Hawkeyes Wind
Sunruinan
John Hax
com
John Hax
octw chen
John Hax
梁海
梁海
Hao

In [71]:
name_list = list(set(name_list))

In [72]:
name_list += ['Cindy', 'Kenny', 'Chen Yijun']

In [73]:
name_list.append('Chunming')

In [74]:
name_list.append('-ambrose')

In [75]:
name_list.remove('com')
name_list.remove(' Chunming')
name_list.remove(' Bobby Tung')

In [76]:
name_list

['Ambrose LI',
 'Xiaoqian(Cindy) Wu',
 '\xe6\xa2\x81\xe6\xb5\xb7',
 'Zi Bin Cheah',
 'octw chen',
 'Xiaoqian Cindy Wu',
 'John Hax',
 'Hawkeyes Wind',
 '\xe4\xb8\x80\xe4\xb8\x9d',
 'Sunruinan',
 'Doris Wang',
 'Bobby Tung',
 'Xidorn Quan',
 'Yijun Chen',
 '\xe5\x90\xb3\xe6\x97\xad\xe6\x98\x8c',
 'Cheah Zi Bin',
 'Jingtao Liu',
 'Hao (Kenny) Lu',
 'Zhiqiang Zhang',
 'Cindy',
 'Kenny',
 'Chen Yijun',
 'Chunming',
 '-ambrose']

In [86]:
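signature_list = []
for name in name_list:
    # Match from a line that starts with the name to the end of the body
    # (DOTALL lets '.' span newlines).
    # NOTE: name is interpolated unescaped, so '(' and ')' act as regex groups.
    p_name = re.compile(r'^%s.+'%name, re.MULTILINE | re.DOTALL)
    for info in info_list:
        msg = mime.from_string(info)
        for part in msg.parts:
            # Only unicode bodies are searched; None and str bodies are skipped.
            if not isinstance(part.body, (type(None), str)):
                signature_list += p_name.findall(part.body.encode('utf-8'))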

In [95]:
cut_list = []
for item in signature_list:
    if len(item) < 300:
        cut_list.append(item)
for item in cut_list:
    print "Signature: %s" %item

Signature: Zi Bin Cheah
HTML5 Chinese IG chair



Signature: Hawkeyes Wind


Signature: Hawkeyes Wind</pre>
  </body>
</html>

Signature: Hawkeyes Wind


Signature: Hawkeyes Wind</pre>
  </body>
</html>

Signature: Hawkeyes Wind


Signature: Hawkeyes Wind


Signature: Hawkeyes Wind


Signature: Hawkeyes Wind</pre>
  </body>
</html>

Signature: Hawkeyes Wind


Signature: Hawkeyes Wind</pre>
  </body>
</html>

Signature: Hawkeyes Wind</pre>
  </font></span></div>

</blockquote></div><br></div>

Signature: Hawkeyes Wind</pre>
  </font></span></div>

</blockquote></div><br></div>
</div></div></blockquote></div><br></div></div></div>

Signature: Hawkeyes Wind


Signature: Hawkeyes Wind</pre>
  </body>
</html>

Signature: Hawkeyes Wind


Signature: Hawkeyes Wind</pre>
  </body>
</html>

Signature: Hawkeyes Wind</pre>
  </font></span></div>

</blockquote></div><br></div>

Signature: Hawkeyes Wind<br>
<br>
<br>
</font></span></blockquote></div><br></div>

Signature: Bobby Tung<br>
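
Many of the matches above differ only in trailing HTML tags and blank lines. As a possible follow-up step, the Python 3 sketch below strips tags and deduplicates the results. `clean_signature` and `unique_signatures` are hypothetical helpers, and the regex-based tag removal is a rough assumption that suits short fragments like these, not a full HTML parser.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")

def clean_signature(raw):
    """Drop HTML tags, decode entities, and remove blank lines."""
    text = unescape(TAG.sub("", raw))
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

def unique_signatures(raw_signatures):
    """Clean every match and keep the first occurrence of each result."""
    seen = set()
    result = []
    for raw in raw_signatures:
        cleaned = clean_signature(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result

raw_matches = [
    "Hawkeyes Wind\n\n",
    "Hawkeyes Wind</pre>\n  </body>\n</html>\n",
    "Bobby Tung<br>\n",
    "Zi Bin Cheah\nHTML5 Chinese IG chair\n\n\n",
]
signatures = unique_signatures(raw_matches)
print(signatures)
```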

In [None]:
for message_string in info_list:
    msg = mime.from_string(message_string)
    for name in name_list:
        p_name = re.compile(r'(^%s)|(.+)'%name)
    #for part in msg.parts:
    #    if not isinstance(part.body, (type(None), str)):
    #        print p_name.findall(part.body.encode('utf-8'))