# Assignment

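## Problem 1

- Combine a word-segmentation tool with regular expressions to extract information from email signatures

- Below are several signatures taken from real emails; extract as many of the following key fields as possible
    - Name
    - Organization
    - Phone number
    - Email address
- The data is as follows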

> 刘三 Liu, San  
+86 15912348765  
sfghsdfg@abc.org.cn    
\--------------------------   
> 李四  
北清大数据产业联合会   
电话：010-34355675  
邮箱：lisi@beiqingdata.com  
地址：北京市海淀区北清大学东楼201室    
\--------------------------  
> John Smith  
Data and Web Science Group  
University of Mannheim, Germany    
http://dws.informatik.uni-mannheim.de/~johnsmith  
Tel: +49 621 123 4567  
\--------------------------  
> 王五  
CSDN-全球最大中文IT技术社区（www.csdn.net）  
电话:010-51661202-257  
手机:13934567890  
E-mail:gdagsdfs@csdn.net  
QQ、微信：34534563  
地址：北京市朝阳区广顺北大街33号院一号楼福码大厦B座12层  
\--------------------------  
> 张三  
北京市张三律师事务所|Beijing Zhangsan Law Firm  
北京市海淀区中关村有条街1号，邮编：100080  
No. 1 Youtiao Street , ZhongGuanCun West, Haidian District, Beijing 100080  
Mobile: 15023345465|Email: dfgasedt@126.com  



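## Approach
- Tell Chinese signatures apart from English ones
- Use a word-segmentation tool to extract names
- Use regular expressions to extract phone numbers, email addresses, and other contact details
- Hard part: how to extract the organization correctly?
    - Chinese: use jieba with a custom dictionary
    - English: use NLTK (with the Stanford NER tagger)

## Open question
- Because the data set is small, special cases can be handled with ad hoc conditions. How can a more general extraction method be found that scales to a larger data set?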
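
One possible direction for the open question is to treat `label: value` lines generically instead of writing one regex branch per label. The sketch below is only an illustration, written for Python 3: the `LABELS` table and the `parse_labeled` helper are hypothetical names, not part of the notebook, and the table only covers label spellings that occur in the sample data. It assumes each line has already been split on `\n` and `|`.

```python
import re

# Hypothetical label table: maps label spellings seen in the sample data to a field name.
LABELS = {
    "电话": "tel", "手机": "tel", "tel": "tel", "mobile": "tel",
    "邮箱": "email", "e-mail": "email", "email": "email",
    "地址": "address",
}

# "label: value" with either an ASCII colon or a full-width colon.
LABELED = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")

def parse_labeled(line):
    """Return (field, value) for a recognized 'label: value' line, else None."""
    m = LABELED.match(line)
    if not m:
        return None
    field = LABELS.get(m.group(1).lower())
    return (field, m.group(2)) if field else None

for line in ["电话：010-34355675", "Tel: +49 621 123 4567", "John Smith"]:
    print(line, "->", parse_labeled(line))
```

Supporting a new label then means adding a table entry rather than a new regex alternative. Lines without a known label (names, organizations, bare phone numbers) still need the tagger or a pattern-based fallback.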

In [1]:
# -*- coding: utf-8 -*-
import os
import re
import jieba
import jieba.posseg as pseg
import pynlpir
from nltk.tag import StanfordNERTagger

## Preparation

In [2]:
# Add models of NLTK 
os.environ["CLASSPATH"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09"  # Here I use the absolute directory
os.environ["STANFORD_MODELS"] = "/Users/xpgeng/Library/stanford-ner-2015-12-09/models"

In [3]:
# Load user dict
jieba.load_userdict('user_dict.txt')

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8n/zjbv19sd0xg400w9dpmzvsd40000gn/T/jieba.cache
Loading model cost 0.439 seconds.
Prefix dict has been built succesfully.


In [4]:
# Read data
with open('data.txt', 'r') as f:
    data = f.read()

### Split the signatures on "--------"

In [5]:
# Divide data by "-----"
p = re.compile(r'-{2,}')
signature_list = p.split(data)
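
As a small illustration of the split step, the sketch below applies the same `-{2,}` pattern to a shortened sample and additionally drops empty chunks, which a leading or trailing separator would otherwise produce. The helper name `split_signatures` and the sample string are made up for this example, and it assumes no field value contains two consecutive hyphens.

```python
import re

def split_signatures(text):
    """Split on runs of two or more dashes and drop empty chunks."""
    chunks = re.split(r'-{2,}', text)
    return [c.strip() for c in chunks if c.strip()]

sample = (
    "刘三 Liu, San\n+86 15912348765\n"
    "--------------------------\n"
    "李四\n电话：010-34355675\n"
    "--------------------------\n"
)
for sig in split_signatures(sample):
    print(repr(sig))
```

Single hyphens inside phone numbers such as `010-34355675` are left alone because the pattern requires at least two dashes in a row.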

In [6]:
# Add Tagger
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

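### Chinese and English signatures are mixed together, and jieba alone cannot recognize English organizations, so the following function tells Chinese signatures from English ones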

In [7]:
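def judge_lang(words):
    '''
    Parameters: Someone's signature (str)
    Return: "Chinese" if the signature contains any CJK character, else "English"
    '''
    # Under Python 2 the file content is a byte string; decode it so the
    # CJK range below can be matched against unicode characters.
    if isinstance(words, bytes):
        words = words.decode('utf-8')
    if re.search(u'[\u4e00-\u9fff]', words):
        return "Chinese"
    return "English"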

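### Extract the relevant fields from each signature. Because the structure is fairly simple, the individual if branches were not factored out into separate functions.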

In [8]:
def extract_information(signature, language, st):
    '''
    Parameters: 
        signature: str
        language: "Chinese" or "English"
        st: Stanford Tagger
    Return:
        information_dict: dict
    '''
    words_list = re.split(r'\n|\|', signature)    # split signature by \n, |
    organization_list = []
    tel_list = []
    email_list = []
    name_list = []
    
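    # local part may contain dots, plus signs and hyphens
    p_email = re.compile(r'[\w.+-]+@(?:[\w-]+\.)+\w+')
    # inside a character class "|" is a literal, so it is not used as a separator here
    p_tel = re.compile(r'''(\+[\d\s]+)             # +86 132 2345 2345
                   | (电话.+[\d\s\-]+)
                   | (Tel.+[\d\s\-]+)
                   | (手机.+[\d\s\-]+)
                   | (Mobile.+[\d\s\-]+)''', re.VERBOSE)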
    
    for item in words_list:
        
        # extract tel list
        if p_tel.search(item): 
            tel_group = p_tel.search(item).group()
            tel_list.append(tel_group) 
        
        # extract email list
        elif p_email.search(item):
            m = p_email.search(item).group()
            email_list.append(m) 
            
        # extract name and organization from Chinese signature
        elif language == "Chinese":
            words = pseg.cut(item)
            flag_list = [flag for word, flag in words]
            if 'nt' in flag_list:
                organization_list.append(item)
            elif 'nr' in flag_list:
                name_list.append(item)
        
        # extract name and organization from English signature
        elif language == "English":
            flag_list = [flag for word, flag in st.tag(item.split())]
            if 'ORGANIZATION' in flag_list:
                organization_list.append(item)
            elif 'PERSON' in flag_list:
                name_list.append(item)
        else:
            return "Nothing to extract!"
    
    information_dict = {"name": list(set(name_list)), "tel": tel_list, 
                        "email": email_list, "organization": organization_list}
    return information_dict         
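
The contact patterns can be checked on their own, without loading jieba or the Stanford tagger. The sketch below is an assumption-laden simplification for Python 3, not the notebook's exact patterns: `EMAIL`, `PHONE`, and `find_contacts` are illustrative names, and `PHONE` matches the bare number (at least seven characters of digits, spaces, and hyphens) rather than the label plus number.

```python
import re

EMAIL = re.compile(r'[\w.+-]+@(?:[\w-]+\.)+[A-Za-z]{2,}')
# Optional "+", then digits with optional spaces/hyphens, starting and ending on a digit.
PHONE = re.compile(r'\+?\d[\d\s-]{5,}\d')

def find_contacts(line):
    """Return (email, phone) found in one signature line; each may be None."""
    email = EMAIL.search(line)
    phone = PHONE.search(line)
    return (email.group() if email else None,
            phone.group() if phone else None)

lines = [
    "+86 15912348765",
    "电话:010-51661202-257",
    "Tel: +49 621 123 4567",
    "Mobile: 15023345465",
    "E-mail:gdagsdfs@csdn.net",
    "邮箱：lisi@beiqingdata.com",
]
for line in lines:
    print(line, '->', find_contacts(line))
```

A purely numeric pattern like this cannot tell a phone number from other long digit runs; the `QQ、微信：34534563` line in the sample data would also match `PHONE`, so label context is still needed to separate the two.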

### Write the results to a txt file. Some time went into adjusting the format here.

In [21]:
# Write into result.txt
with open('result.txt', 'a') as f:    # the with block closes the file on exit
    for signature in signature_list:
        info_dict = extract_information(signature, judge_lang(signature), st)
        name = '姓名: {name}'.format(name=info_dict['name'][0])
        organization = '单位:'
        for item in info_dict['organization']:
            organization += "%s " % item
        tel = ''
        for item in info_dict['tel']:
            tel += "%s\t" % item
        email = "Email: %s" % info_dict['email']
        info = '%s\n%s\n%s\n%s\n-------\n' % (name, organization, tel, email)
        f.write(info)
        print(info)

姓名: 刘三 Liu, San
单位:
+86 15912348765	
Email: ['sfghsdfg@abc.org.cn']
-------

姓名: 李四
单位:北清大数据产业联合会 
电话：010-34355675	
Email: ['lisi@beiqingdata.com']
-------

姓名: John Smith
单位:Data and Web Science Group University of Mannheim, Germany 
Tel: +49 621 123 4567	
Email: []
-------

姓名: 王五
单位:CSDN-全球最大中文IT技术社区（www.csdn.net） 
电话:010-51661202-257	手机:13934567890	
Email: ['gdagsdfs@csdn.net']
-------

姓名: 张三
单位:北京市张三律师事务所 
Mobile: 15023345465	
Email: ['dfgasedt@126.com']
-------
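
The output shows two rough edges: the email field is printed as a Python list, and a record with no extracted name would raise an `IndexError` on `info_dict['name'][0]`. A formatting helper along the following lines could handle both. This is only a sketch: `format_record` is a hypothetical name, the English labels differ from the ones used in the cell above, and the sample record assumes bare phone numbers without a leading label.

```python
def format_record(info):
    """Render one extracted record; missing fields become empty strings."""
    name = info["name"][0] if info.get("name") else ""
    lines = [
        "Name: " + name,
        "Organization: " + " ".join(info.get("organization", [])),
        "Tel: " + "\t".join(info.get("tel", [])),
        "Email: " + ", ".join(info.get("email", [])),
    ]
    return "\n".join(lines) + "\n-------\n"

record = {
    "name": ["John Smith"],
    "organization": ["Data and Web Science Group", "University of Mannheim, Germany"],
    "tel": ["+49 621 123 4567"],
    "email": [],
}
print(format_record(record))
```

Joining the lists with `str.join` avoids the trailing separators that the `+=` loops leave behind.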



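## Summary
- This was my first attempt at information extraction, and it felt like constant experimentation, for example checking whether each regular expression actually extracted the information I needed
- Writing the code itself took comparatively little time
- Getting the output into a format I was happy with took a long time; extracting the information is one thing, but presenting it also takes time
- Because of the part-of-speech tagging, the code runs slowly and still has room for improvement
- Anyway, get the MVP out first!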