# 1. Data Crawling


The Scrapy project is in this [directory](https://github.com/iou2much/NLP-tasks/tree/master/banks_crawler/banks_crawler). This [crawler](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/banks_crawler/spiders/cncb.py) implements the data extraction.

In [1]:
import multiprocessing
from multiprocessing.queues import Queue
from scrapy import  signals 
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
import pandas as pd
import numpy as np
import logging
from pprint import pprint
import os.path
from stanza.nlp.corenlp import CoreNLPClient

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)



This part defines a worker process and a function that start the Scrapy project and collect the crawled items.

In [2]:
class CrawlerWorker(multiprocessing.Process):
    def __init__(self, result_queue, spider, settings=None):
        multiprocessing.Process.__init__(self)
        self.settings = settings or get_project_settings()
        self.result_queue = result_queue
        self.spider = spider
        self.items = []
        dispatcher.connect(self._item_scraped, signals.item_scraped)

    def _item_scraped(self, item):
        self.items.append(item)

    def run(self):
        self.crawler_process = CrawlerProcess(self.settings)
        crawler = self.crawler_process.create_crawler(self.spider)
        self.crawler_process.crawl(crawler)
        self.crawler_process.start()
        self.result_queue.put(self.items)
        self.crawler_process.stop()

def crawl():
    result_queue = Queue(ctx=multiprocessing.get_context())
    crawler = CrawlerWorker(result_queue, "cncb")
    crawler.start()
    return result_queue.get()

In [3]:
data_path = 'cncb-qna.csv'
if os.path.isfile(data_path):
    df = pd.read_csv(data_path)
else:
    result = crawl()
    df = pd.DataFrame(result,columns=['category', 'question', 'answer', 'language'])
    df.to_csv(data_path, encoding='utf-8', index=False)

# 2. Language Vector Space Model



### Introduction to the dictionaries
[dict.txt](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/dict.txt) is the default dictionary of jieba. I use [sc2tc.py](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/sc2tc.py) to generate a Traditional Chinese version, [dict-tc.txt](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/dict-tc.txt).

[sougou.dict](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/sougou.dict) is generated from some of [Sogou's](http://pinyin.sogou.com/dict/) dictionaries. The Sogou dictionaries are in scel format, so I use this [script](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/scel2txt.py) to convert them into text format.

In [4]:
import jieba
jieba.load_userdict("dics/dict-tc.txt")
jieba.load_userdict("dics/sougou.dict")

from gensim import corpora, models, similarities
from collections import defaultdict

Building prefix dict from the default dictionary ...
2017-05-15 18:58:01,286 : DEBUG : Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/3b/lpzhjvns3q501g94r7r8nkxw0000gn/T/jieba.cache
2017-05-15 18:58:01,289 : DEBUG : Loading model from cache /var/folders/3b/lpzhjvns3q501g94r7r8nkxw0000gn/T/jieba.cache
Loading model cost 0.865 seconds.
2017-05-15 18:58:02,153 : DEBUG : Loading model cost 0.865 seconds.
Prefix dict has been built succesfully.
2017-05-15 18:58:02,155 : DEBUG : Prefix dict has been built succesfully.


## *DocAnalyzer* implements these tasks:

### 2.1 Tokenize questions into words

    tokenize() method tokenizes some docs.

### 2.2 Build a TFIDF model using questions and answers
    build_tfidf_model() builds a TFIDF model.
### 3. Similarity Comparison
    similarity() returns the most similar sentence.
### 4. Named Entity Recognition
    get_entities() returns the list of entities in the sentence.

In [5]:
class DocAnalyzer:
    def __init__(self,docs):
        self.texts = []
        self.dic = None
        self.tfidf = None
        self.corpus = None
        self.dic_length = 0
        self.coreNLP_client = None
        self.tokenize(docs)
    
    def tokenize(self,docs):
        words=[]
        punct = set(u'''， %/:!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐､﹒
        ﹔﹕﹖﹗﹚﹜﹞！），．：；？｜｝︴︶︸︺︼︾﹀﹂﹄﹏､～￠
        々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖（［｛￡￥〝︵︷︹︻
        ︽︿﹁﹃﹙﹛﹝（｛“‘-—_…''')
        # for str/unicode
        filterpunt = lambda s: ''.join(filter(lambda x: x not in punct, s))
        # for list
        filterpuntl = lambda l: list(filter(lambda x: x not in punct, l))

        for doc in docs.map(lambda x:jieba.cut(str(x))):
            words.append(filterpuntl(list(doc)))

        frequency = defaultdict(int)
        for text in words:
            for token in text:
                frequency[token] += 1
        self.texts = [[token for token in text if frequency[token] > 1] for text in words]
    
    def build_tfidf_model(self):
        dic = corpora.Dictionary(self.texts)
        self.dic = dic
        self.dic_length = len(dic)
        dic.save('/tmp/cncb-qna.dict')
        self.corpus = [dic.doc2bow(text) for text in self.texts]
        corpora.MmCorpus.serialize('/tmp/cncb-qna.mm', self.corpus)
        self.tfidf = models.TfidfModel(self.corpus)
        
    def similarity(self,sentence):
        if self.tfidf is None:
            self.build_tfidf_model()
        # Index every question as a TF-IDF vector (the index is rebuilt on each call).
        index = similarities.SparseMatrixSimilarity(self.tfidf[self.corpus], num_features=self.dic_length)

        # Score the query against every indexed question and keep the best match.
        new_vec = self.dic.doc2bow(list(jieba.cut(sentence)))
        scores = index[self.tfidf[new_vec]]
        best, sim = max(enumerate(scores), key=lambda x: x[1])
        return sim, ''.join(self.texts[best])

    def get_coreNLP_client(self):
        if self.coreNLP_client is None:
            self.coreNLP_client = CoreNLPClient(server='http://localhost:9000', default_annotators=['ssplit', 'lemma', 'tokenize', 'pos', 'ner']) 
            
        return self.coreNLP_client
        
    def get_entities(self, sentence):
        client = self.get_coreNLP_client()
        
        annotated = client.annotate(sentence)
        result = []
        for s in annotated.sentences:
            for token in s:
                if token.ner != 'O':
                    result.append((token.word, token.ner))
        return result

analyzer = DocAnalyzer(df.question)
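
The last step of tokenize() drops every token that occurs only once in the whole corpus. Here is a minimal sketch of that filter on toy, already-tokenized documents. The sample tokens and the helper name `drop_singletons` are made up for illustration, and jieba is not needed here.

```python
from collections import Counter


def drop_singletons(tokenized_docs):
    """Keep only tokens whose corpus-wide count is greater than 1."""
    frequency = Counter(token for doc in tokenized_docs for token in doc)
    return [[t for t in doc if frequency[t] > 1] for doc in tokenized_docs]


docs = [["取消", "指令"], ["修改", "指令"], ["取消", "交易"]]
filtered = drop_singletons(docs)
print(filtered)
```

One consequence worth keeping in mind: a word that appears in only one question is removed from the vocabulary, so a query containing that word cannot match on it.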


#### Test *similarity()*

In [6]:
print('The maximum similarity is %s: %s'%analyzer.similarity('取消正在处理的指令'))

The maximum similarity is 0.55071: 如果我想取消正在处理中或已部份成交的指令我可以怎样做
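
To make the score above less of a black box, here is a rough pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is only an illustration, not gensim's code: I assume an idf of log2(N / document frequency) and unit-length vectors, which is my understanding of gensim's defaults, and the English toy tokens and function names are made up.

```python
import math
from collections import Counter


def to_unit_vector(tokens, idf):
    """Weight term counts by idf, then scale the vector to unit length."""
    weights = {t: c * idf[t] for t, c in Counter(tokens).items() if idf.get(t, 0) > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def tfidf_vectors(docs):
    """Return (idf, vectors) with one unit-length TF-IDF dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log2(n_docs / df) for t, df in doc_freq.items()}
    return idf, [to_unit_vector(doc, idf) for doc in docs]


def most_similar(query_tokens, docs):
    """Return (score, index) of the document with the highest cosine similarity."""
    idf, vectors = tfidf_vectors(docs)
    query = to_unit_vector(query_tokens, idf)
    # Both sides are unit vectors, so the dot product is the cosine similarity.
    scores = [sum(w * vec.get(t, 0.0) for t, w in query.items()) for vec in vectors]
    best = max(range(len(docs)), key=scores.__getitem__)
    return scores[best], best


docs = [["cancel", "pending", "order"],
        ["open", "new", "account"],
        ["change", "order", "amount"]]
score, index = most_similar(["cancel", "order"], docs)
print(score, index)
```

Rare words get a high weight and words shared by many questions get a low one, so the query is matched mainly on its distinctive words.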


#### Test *get_entities()*
Before you test the get_entities() method, the **StanfordCoreNLPServer** must already be running.

The installation guide for StanfordCoreNLPServer is on the [Stanford site](https://nlp.stanford.edu/software/CRF-NER.shtml).

The packages I use here are:
* **[CoreNLP](https://stanfordnlp.github.io/CoreNLP/)**
* **[Named Entity Recognizer version 3.7.0](https://nlp.stanford.edu/software/CRF-NER.shtml#Download)**
* **[3.7.0 Chinese models](https://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar)**
* **[3.7.0 English models](https://nlp.stanford.edu/software/stanford-english-corenlp-2016-10-31-models.jar)**

After deploying the packages into the Stanford directory, set a configuration that supports both English and Chinese in **[custom.properties](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/StanfordNLP/corenlp/custom.properties)**.

In [7]:
import subprocess,time
def start_stanfordNLP_server():
    command = "ps aux|grep StanfordCoreNLPServer"
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=None, shell=True)
    output = process.communicate()
    if len(output) > 1 and 'edu.stanford.nlp.pipeline.StanfordCoreNLPServer' in str(output[0]) :
        stop_stanfordNLP_server()
        time.sleep(3)

    command="""
    java -Xmx4g -cp \
    "StanfordNLP/corenlp/*:StanfordNLP/corenlp/:StanfordNLP/ner/:StanfordNLP/parser/:StanfordNLP/postagger/" \
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordNLP/corenlp/custom.properties \
    -port 9000 -timeout 15000 &
    """
    print('starting the stanfordNLP server')

    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    output = process.communicate()
def stop_stanfordNLP_server():
    # Plain shell pipeline: a leading "!" is IPython-only syntax and would break the command in a shell.
    command="""ps aux|grep StanfordCoreNLPServer|awk '{print $2}'|xargs kill -9 > /dev/null 2>&1"""
    print('killing the stanfordNLP server')
    subprocess.call(command, stdout=subprocess.PIPE, shell=True)

##### Start the server

In [8]:
start_stanfordNLP_server()

killing the stanfordNLP server
starting the stanfordNLP server
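
The start function does not check that the server is actually accepting connections. As a sketch of one way to do that, the helper below polls a TCP port until a connection succeeds. The name `wait_for_port` and its parameters are my own and not part of CoreNLP or the notebook. I only assume the server listens on port 9000, as in the command above.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port('localhost', 9000)` right after `start_stanfordNLP_server()` would then block until the port is open. An open port does not mean the models are loaded, so the first annotation request can still be slow.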


Test the get_entities() method. The first call takes a few seconds because the models have to be loaded. As the output shows, the default Chinese model of Stanford NER is not good enough: it recognizes 星期天 (Sunday) as NUMBER. Training on a customized corpus might give better results.

In [9]:
print(analyzer.get_entities("这个星期天北京会下雨"))
print('-'*100)
print(analyzer.get_entities('Rami Eid is studying at Stony Brook University in NY,America.I don\'t like Beckham.'))


[('星期天', 'NUMBER'), ('北京', 'GPE'), ('会', 'NUMBER')]
----------------------------------------------------------------------------------------------------
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('NY', 'LOCATION'), ('Beckham', 'PERSON')]
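
get_entities() returns one pair per token, so a name such as "Stony Brook University" comes back as three items. As a possible post-processing step, the sketch below joins runs of adjacent tokens that share a tag. `merge_entities` is a hypothetical helper, not part of the notebook, and the input is copied from the output above.

```python
from itertools import groupby


def merge_entities(tagged_tokens, sep=" "):
    """Join runs of adjacent (word, tag) pairs that share the same NER tag."""
    return [(sep.join(word for word, _ in run), tag)
            for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1])]


tokens = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('Stony', 'ORGANIZATION'),
          ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
          ('NY', 'LOCATION'), ('Beckham', 'PERSON')]
print(merge_entities(tokens))
```

For Chinese text `sep=""` is probably the better choice. Note the limitation: get_entities() drops the 'O' tokens, so two different entities of the same type that were separated only by ordinary words would be merged by mistake. Grouping before the 'O' tokens are filtered out would avoid that.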


## 5. Voice Recognition


For this task, most of my time was spent on data preprocessing.

### Dataset

I use this large "tagged" corpus because large Cantonese datasets are hard to find online.
* [Bible Audio in Cantonese](http://disc.fuyin.tv/soft/html/582.html)
* [Textbook](http://download.o-bible.com:8080/hgb.gz)

I say "tagged", but it actually is not. Several issues have to be fixed before I can use it.
* The audio is organized by chapter, i.e. one audio file per chapter. That is too long as input for the deep network, so each file has to be split into smaller files.
* The audio and the text have exactly the same content, but once the audio files are split, the text has to be split accordingly.

Also, for a real-world application, this dataset is too ideal.
* Because it is the Bible, the reading speed is too steady and slow.
* There is no background noise at all in these recordings, so a model trained on them will hardly learn to cope with background noise.



### Framework
For this task, I use Mozilla's [DeepSpeech project](https://github.com/mozilla/DeepSpeech). It is based on Baidu's Deep Speech research paper, implemented on TensorFlow, and supports [warp-CTC](https://github.com/baidu-research/warp-ctc).

However, it only supports alphabetic characters. So I encode the Chinese characters before feeding them into the DNN, and decode the output of the forward pass. The mapping is in this [file](https://github.com/iou2much/NLP-tasks/blob/master/asr_for_cantonese/dataset/char_map.txt).

For example:

    飘 djh
    尼 dji
    怪 djj
    赠 djk
    占 djl
    掐 djm
    齐 djn
    篷 djo
    ......
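
The consecutive codes in the sample (djh, dji, djj, ...) suggest that characters are simply numbered and the number is written with letters. The sketch below shows one such scheme, spreadsheet-column style numbering. This is an assumption for illustration only: the real char_map.txt is generated by format.py and may assign codes differently, and all function names here are hypothetical.

```python
import string

LETTERS = string.ascii_lowercase


def to_code(index):
    """Turn 0, 1, 2, ... into 'a', 'b', ..., 'z', 'aa', 'ab', ... (spreadsheet-column style)."""
    code = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        code = LETTERS[rem] + code
    return code


def build_char_map(text):
    """Assign a code to each distinct non-space character, in order of first appearance."""
    chars = dict.fromkeys(ch for ch in text if not ch.isspace())
    return {ch: to_code(i) for i, ch in enumerate(chars)}


def encode_text(text, char_map):
    """Encode known characters as space-separated letter codes."""
    return " ".join(char_map[ch] for ch in text if ch in char_map)


def decode_text(codes, char_map):
    """Decode space-separated codes, silently dropping codes that are not in the map."""
    reverse = {code: ch for ch, code in char_map.items()}
    return "".join(reverse.get(code, "") for code in codes.split())


char_map = build_char_map("耶和华的园子")
encoded = encode_text("耶和华", char_map)
print(encoded, decode_text(encoded, char_map))
```

Dropping unknown codes in `decode_text` is a design choice for this sketch, since a speech model can emit letter sequences that do not correspond to any character.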
    

This approach has its own issues: the model only learns acoustic features, not language features. Also, the model uses Levenshtein distance as its optimization target, so when it predicts a character from audio, it returns the result with the minimum edit distance.
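
For reference, here is a small sketch of the Levenshtein (edit) distance itself, written as the standard dynamic programming recurrence. It is not taken from DeepSpeech. It works on any sequence, so it can compare letter strings or lists of codes.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # delete x
                               current[j - 1] + 1,           # insert y
                               previous[j - 1] + (x != y)))  # substitute or match
        previous = current
    return previous[-1]


# Codes taken from the mapping sample: 飘 is djh and 尼 is dji.
print(levenshtein("djh", "dji"))
```

This also shows why the letter encoding is a weak target: on letters, djh and dji are only one edit apart, although they decode to the unrelated characters 飘 and 尼.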

Now let's take a look at the code for the tasks mentioned above.

In **[asr_for_cantonese/format.py](https://github.com/iou2much/NLP-tasks/blob/master/asr_for_cantonese/format.py)**:

* **mp3toWav**: converts mp3 to wav format
* **formatTxt**: splits the text into chapters, matching the audio files
* **char_map**: generates char_map.txt
* **encode_str & decode_str**: encode Chinese characters into alphabetic codes, or decode them back


In **[asr_for_cantonese/split.py](https://github.com/iou2much/NLP-tasks/blob/master/asr_for_cantonese/split.py)**:
* **split_wav**: splits a big wav file into pieces and splits the corresponding text.
* **split_txt**: splits the data into training and validation sets.
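
To give an idea of the audio side of the splitting step, here is a minimal sketch that cuts a wav file into fixed-length chunks with the standard `wave` module. It is not the code in split.py: the fixed chunk length, the function name `split_wav_fixed` and the `_chunk_NN.wav` naming are assumptions, and the harder part, splitting the text so that it lines up with each chunk, is not handled here.

```python
import wave


def split_wav_fixed(path, out_prefix, chunk_seconds=10.0):
    """Write consecutive chunks of at most chunk_seconds and return their paths."""
    paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = "%s_chunk_%02d.wav" % (out_prefix, index)
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate; the frame count is fixed up on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

A fixed-length cut can land in the middle of a word, so cutting at pauses would probably be needed to keep each chunk aligned with a piece of text.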

Finally, I get a set of wav files and a **[csv file](https://github.com/iou2much/NLP-tasks/blob/master/asr_for_cantonese/dataset/trans.full.csv)** to feed into the DeepSpeech model.

Then I run **[this script](https://github.com/iou2much/NLP-tasks/blob/master/asr_for_cantonese/bin/run-bibles.sh)** to start the training.

Before running the following code, please compile the TensorFlow library and the DeepSpeech native client by following this [guide](https://github.com/mozilla/DeepSpeech/tree/master/native_client).


In [11]:
import sys
import scipy.io.wavfile as wav
from deepspeech import DeepSpeech
from asr_for_cantonese.format import decode_str

# The model is over hundred megabytes, I
model_path = '/Users/chibs/playground/bigdata-stack/themes/speech-reg/DeepSpeech/bibles_export/output_graph.pb'
ds = DeepSpeech(model_path, 26, 9)

fs, audio = wav.read('test_data/12_chunk_05.wav')
output = ds.stt(audio, fs)
print('code: %s '%output)
print('Chinese: %s'%decode_str(output))
print('-'*100)

fs, audio = wav.read('test_data/18_chunk_32.wav')
output = ds.stt(audio, fs)
print('code: %s '%output)
print('Chinese: %s'%decode_str(output))
print('-'*100)

fs, audio = wav.read('test_data/18_chunk_42.wav')
output = ds.stt(audio, fs)
print('code: %s '%output)
print('Chinese: %s'%decode_str(output))
print('-'*100)



code: bvq wj abi dgs a cyi cbt bkk cqr ado rw c gm ml 
Chinese: 亚伯横就撇着耶和华脐山孽胳久
----------------------------------------------------------------------------------------------------
code: pj crd awp am aqk aaj aml 
Chinese: 不然你实在笑了
----------------------------------------------------------------------------------------------------
code: bqs ayq alk pu bqb re cdo cjg crd biv cju dcd csn dbl ays bso deg cdo xx dkl ckh dby dgx ata bud pj crd ays bnk dcl aku bch 
Chinese: 察看他们所行的果然尽像那达到我耳中的声音一样吗若是不然我也必知道
----------------------------------------------------------------------------------------------------


The last one is from the test set.

In [12]:
#the sentence from the original text
a='罗得举看见约旦河的全平原直到琐珥都是滋润的那地在耶和华未灭所多玛摩拉以先如同耶和华的园子也像埃及地'
fs, audio = wav.read('test_data/13_chunk_14.wav')
output = ds.stt(audio, fs)
b=decode_str(output)
print('code: %s '%output)
print('Chinese: %s'%b)


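# Count position-wise matches; zip() stops at the shorter string,
# so a prediction shorter than the reference cannot raise IndexError.
correct = sum(x == y for x, y in zip(a, b))
print(correct/len(a))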

code: h cak  bkb api cnf bfs cxo ays cdo byj kc bau w dek bqb avw aeh bud ei ahn cdo dcd awm bqk cbt bkk cqr apm cnc bqb bxi ak bpkk bbu brq vc axg al ks cbt bkk cqr cdo ael xi bnk aes jm vk a 
Chinese: 罗得诓卑见约笼我的惭烘原鸿络所育都是父追的那害羔耶和华官环所多玛摩拉圆毁埠堂耶和华的熔子也将埃及撇
0.5306122448979592
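
The score above compares characters position by position, so a single inserted or missing character shifts everything after it and counts as many errors. As a rough alternative, the sketch below aligns the two strings with `difflib` from the standard library before counting. The function names are mine, and the strings are toy examples, not the model output.

```python
from difflib import SequenceMatcher


def positional_accuracy(reference, hypothesis):
    """Share of reference characters matched at the same index (the metric used above)."""
    hits = sum(r == h for r, h in zip(reference, hypothesis))
    return hits / len(reference)


def aligned_accuracy(reference, hypothesis):
    """Share of reference characters covered by difflib's matching blocks."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


reference = "abcdef"
hypothesis = "abXcdef"   # one extra character inserted after "ab"
print(positional_accuracy(reference, hypothesis))
print(aligned_accuracy(reference, hypothesis))
```

difflib's matching is a heuristic and not the same as an edit-distance based character error rate, but it is enough to show how much a single shift costs the position-wise score.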


**I hope I have explained this well enough, although I only provide the results here. The whole workflow, including the data preprocessing and training parts, should be written as an automated script, but I haven't had time to do that yet.**


## Thank you.
That's all the work I've done. Thanks for your time.
