# Assignment

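## Problem 1

- Combine a word-segmentation tool with regular expressions to extract information from email signatures.

- Below are several signatures taken from real emails. Extract as many of the following key fields as possible:
    - Name
    - Organization
    - Phone number
    - Email address
- The data: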

> 刘三 Liu, San  
+86 15912348765  
sfghsdfg@abc.org.cn    
\--------------------------   
> 李四  
北清大数据产业联合会   
电话：010-34355675  
邮箱：lisi@beiqingdata.com  
地址：北京市海淀区北清大学东楼201室    
\--------------------------  
> John Smith  
Data and Web Science Group  
University of Mannheim, Germany    
http://dws.informatik.uni-mannheim.de/~johnsmith  
Tel: +49 621 123 4567  
\--------------------------  
> 王五  
CSDN-全球最大中文IT技术社区（www.csdn.net）  
电话:010-51661202-257  
手机:13934567890  
E-mail:gdagsdfs@csdn.net  
QQ、微信：34534563  
地址：北京市朝阳区广顺北大街33号院一号楼福码大厦B座12层  
\--------------------------  
> 张三  
北京市张三律师事务所|Beijing Zhangsan Law Firm  
北京市海淀区中关村有条街1号，邮编：100080  
No. 1 Youtiao Street , ZhongGuanCun West, Haidian District, Beijing 100080  
Mobile: 15023345465|Email: dfgasedt@126.com  



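## Approach
- Distinguish Chinese signatures from English ones.
- Use a word-segmentation tool to extract names.
- Use regular expressions to extract phone numbers, email addresses, and other contact details.
- Difficulty: how can the organization be extracted correctly?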
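
One simple baseline for the organization problem is a keyword heuristic, sketched below for Python 3. The hint lists `ORG_HINTS_ZH` and `ORG_HINTS_EN` and the function `looks_like_org` are hypothetical names chosen for illustration, not part of the assignment. Lines that carry a `label: value` colon are skipped so that address lines do not count as organizations.

```python
import re

# Hypothetical hint lists (assumptions); extend them for real data.
ORG_HINTS_ZH = ('联合会', '事务所', '公司', '大学', '社区')
ORG_HINTS_EN = {'university', 'group', 'firm', 'inc', 'ltd'}
LABEL = re.compile(r'[:：]')  # ASCII or full-width colon


def looks_like_org(line):
    """Return True if an unlabelled line contains an organization hint."""
    line = line.strip()
    if not line or LABEL.search(line):
        return False  # blank, or a labelled line such as an address
    if any(hint in line for hint in ORG_HINTS_ZH):
        return True
    words = set(re.findall(r'[a-z]+', line.lower()))
    return bool(words & ORG_HINTS_EN)


for line in ['北清大数据产业联合会', 'University of Mannheim, Germany',
             '地址：北京市海淀区北清大学东楼201室', '刘三 Liu, San']:
    print(line, looks_like_org(line))
```

A fixed keyword list will miss organizations with unusual names, so it is probably best used as a fallback next to the part-of-speech or NER tags.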

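## Question
- Because the data set is small, special cases can be handled with ad hoc conditions. How can a more general extraction method be found, one that copes with much larger data sets?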
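
One possible direction, sketched here for Python 3 under stated assumptions, is to move the special cases out of the code and into a table. Many signature lines have the form `label: value`, so a mapping from label spellings to fields can grow with the data while the parser stays the same. The table `LABELS` and the function `parse_labelled_line` are hypothetical names introduced for this sketch.

```python
import re

# Hypothetical label table (an assumption): label spelling -> field name.
LABELS = {
    '电话': 'tel', '手机': 'tel', 'tel': 'tel', 'mobile': 'tel',
    '邮箱': 'email', 'e-mail': 'email', 'email': 'email',
    '地址': 'address',
}
# label, then an ASCII or full-width colon, then the value
LABELLED = re.compile(r'^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$')


def parse_labelled_line(line):
    """Return (field, value) for a known 'label: value' line, else None."""
    m = LABELLED.match(line)
    if not m:
        return None
    field = LABELS.get(m.group(1).lower())
    return (field, m.group(2)) if field else None


print(parse_labelled_line('电话：010-34355675'))
print(parse_labelled_line('E-mail:gdagsdfs@csdn.net'))
print(parse_labelled_line('李四'))
```

Unlabelled lines such as names and organizations still need the tagger, so this only covers part of the problem.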

In [1]:
# -*- coding: utf-8 -*-
import os
import re
import jieba
import jieba.posseg as pseg
import pynlpir
from nltk.tag import StanfordNERTagger

In [2]:
os.environ["CLASSPATH"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09"
os.environ["STANFORD_MODELS"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09/models"

In [3]:
jieba.load_userdict('user_dict.txt')

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8n/zjbv19sd0xg400w9dpmzvsd40000gn/T/jieba.cache
Loading model cost 0.436 seconds.
Prefix dict has been built succesfully.


In [4]:
with open('data.txt', 'r') as f:
    data = f.read()

In [5]:
p = re.compile(r'-{2,}')
signature_list = p.split(data)
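
The pattern `-{2,}` also splits on two consecutive hyphens inside a line. A slightly stricter variant, sketched below for Python 3, only treats a line made up entirely of hyphens as a separator and drops empty blocks. The function name `split_signatures` and the sample text are illustrative assumptions.

```python
import re

# A separator is a whole line consisting of two or more hyphens.
SEPARATOR = re.compile(r'^-{2,}[ \t]*$', flags=re.MULTILINE)


def split_signatures(text):
    """Split text on separator lines and drop empty blocks."""
    blocks = (block.strip() for block in SEPARATOR.split(text))
    return [block for block in blocks if block]


sample = (
    "刘三 Liu, San\n"
    "+86 15912348765\n"
    "--------------------------\n"
    "李四\n"
    "电话：010-34355675\n"
)
for block in split_signatures(sample):
    print(repr(block))
```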

In [6]:
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

In [7]:
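def judge_lang(words):
    # Treat a signature as Chinese if it contains any CJK ideograph;
    # otherwise assume English.
    if isinstance(words, bytes):  # Python 2 str as read from the file
        words = words.decode('utf-8')
    if re.search(u'[\u4e00-\u9fff]', words):
        return "Chinese"
    return "English"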
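
A signature can mix languages: the last sample has Chinese and English lines side by side. As a hedged sketch for Python 3, the language could be decided per line instead of per signature, so that each line can be routed to the matching tagger. The function `languages_by_line` is a hypothetical helper, and any line without a CJK ideograph, including a line of digits, is labelled "English" here.

```python
import re

CJK = re.compile(r'[\u4e00-\u9fff]')  # basic CJK Unified Ideographs block


def languages_by_line(signature):
    """Return (line, language) pairs for the non-empty lines."""
    result = []
    for line in re.split(r'\n|\|', signature):
        line = line.strip()
        if line:
            language = "Chinese" if CJK.search(line) else "English"
            result.append((line, language))
    return result


for pair in languages_by_line('张三\n北京市张三律师事务所|Beijing Zhangsan Law Firm\n'):
    print(pair)
```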

In [115]:
def extract_information(signature, language, st):
    # Split into lines; '|' also separates fields on a single line.
    words_list = re.split(r'\n|\|', signature)
    orgnization_list = []
    tel_list = []
    email_list = []
    name_list = []

    p_email = re.compile(r'[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}')
    # A number is recognized by a leading '+' or by a label
    # (电话 = telephone, 手机 = mobile phone).
    p_tel = re.compile(r'''\+[\d\s]+             # +86 132 2345 2345
                   | (?:电话|手机|Tel|Mobile).+[\d\s\-]+''', re.VERBOSE)

    for item in words_list:
        if not item.strip():
            continue  # skip blank lines
        tel_match = p_tel.search(item)
        email_match = p_email.search(item)
        if tel_match:
            tel_list.append(tel_match.group())
        elif email_match:
            email_list.append(email_match.group())
        elif language == "Chinese":
            # jieba POS tags: 'nt' = organization name, 'nr' = person name.
            flag_list = [flag for word, flag in pseg.cut(item)]
            if 'nt' in flag_list:
                orgnization_list.append(item)
            elif 'nr' in flag_list:
                name_list.append(item)
        elif language == "English":
            flag_list = [flag for word, flag in st.tag(item.split())]
            if 'ORGANIZATION' in flag_list:
                orgnization_list.append(item)
            elif 'PERSON' in flag_list:
                name_list.append(item)

    # Always return a dict so that callers can index it safely.
    information_dict = {"name": list(set(name_list)), "tel": tel_list,
                        "email": email_list, "orgnization": orgnization_list}
    return information_dict
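
The phone pattern above keeps the label in the match, which is why the output contains strings such as `电话：010-34355675`. A minimal sketch for Python 3 that returns only the number and the address, assuming the hypothetical names `EMAIL`, `PHONE`, and `find_contacts`:

```python
import re

EMAIL = re.compile(r'[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}')
# Optional '+', then at least seven digits, spaces, or hyphens,
# starting and ending with a digit.
PHONE = re.compile(r'\+?\d[\d \-]{5,}\d')


def find_contacts(line):
    """Return the email addresses and phone-like numbers found in one line."""
    return {'email': EMAIL.findall(line), 'tel': PHONE.findall(line)}


for line in ['电话:010-51661202-257', 'Tel: +49 621 123 4567',
             'E-mail:gdagsdfs@csdn.net']:
    print(find_contacts(line))
```

Because `PHONE` does not look at labels, it also matches other long digit runs, for example the QQ number in the fourth signature. A label check like the one in `extract_information` is still needed to tell them apart.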

In [133]:
for signature in signature_list:
    info_dict = extract_information(signature, judge_lang(signature), st)
    # Output labels: 姓名 = name, 单位 = organization.
    print('姓名: {name}'.format(name=info_dict['name'][0]))
    for item in info_dict['orgnization']:
        print("单位: %s" % item)
    for item in info_dict['tel']:
        print(item)
    print("Email: %s" % info_dict['email'])
    print('-' * 7)

姓名: 刘三 Liu, San
+86 15912348765
Email: ['sfghsdfg@abc.org.cn']
-------
姓名: 李四
单位: 北清大数据产业联合会
电话：010-34355675
Email: ['lisi@beiqingdata.com']
-------
姓名: John Smith
单位: Data and Web Science Group
单位: University of Mannheim, Germany
Tel: +49 621 123 4567
Email: []
-------
姓名: 王五
单位: CSDN-全球最大中文IT技术社区（www.csdn.net）
电话:010-51661202-257
手机:13934567890
Email: ['gdagsdfs@csdn.net']
-------
姓名: 张三
单位: 北京市张三律师事务所
Mobile: 15023345465
Email: ['dfgasedt@126.com']
-------
