# 1. Data Crawling


The Scrapy project lives in this [directory](https://github.com/iou2much/NLP-tasks/tree/master/banks_crawler/banks_crawler). This [crawler](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/banks_crawler/spiders/cncb.py) implements the data extraction.

In [1]:
import multiprocessing
from multiprocessing.queues import Queue
from scrapy import  signals 
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
import pandas as pd
import numpy as np
import logging
from pprint import pprint
import os.path
from stanza.nlp.corenlp import CoreNLPClient

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)




This part defines a worker that runs the Scrapy crawl in a separate process and collects the scraped items; crawl() starts the worker and returns its results.

In [2]:
class CrawlerWorker(multiprocessing.Process):
    def __init__(self, result_queue, spider, settings=None):
        multiprocessing.Process.__init__(self)
        self.settings = settings or get_project_settings()
        self.result_queue = result_queue
        self.spider = spider
        self.items = []
        dispatcher.connect(self._item_scraped, signals.item_scraped)

    def _item_scraped(self, item):
        self.items.append(item)

    def run(self):
        self.crawler_process = CrawlerProcess(self.settings)
        crawler = self.crawler_process.create_crawler(self.spider)
        self.crawler_process.crawl(crawler)
        self.crawler_process.start()
        self.result_queue.put(self.items)
        self.crawler_process.stop()

def crawl():
    result_queue = Queue(ctx=multiprocessing.get_context())
    crawler = CrawlerWorker(result_queue, "cncb")
    crawler.start()
    return result_queue.get()

In [3]:
data_path = 'cncb-qna.csv'
if os.path.isfile(data_path):
    df = pd.read_csv(data_path)
else:
    result = crawl()
    df = pd.DataFrame(result,columns=['category', 'question', 'answer', 'language'])
    df.to_csv(data_path,encoding='utf-8',index=False)

2017-05-15 17:43:13,176 : INFO : Scrapy 1.3.3 started (bot: banks_crawler)
2017-05-15 17:43:13,180 : INFO : Overridden settings: {'BOT_NAME': 'banks_crawler', 'LOG_ENABLED': False, 'NEWSPIDER_MODULE': 'banks_crawler.spiders', 'SPIDER_MODULES': ['banks_crawler.spiders']}
2017-05-15 17:43:13,224 : INFO : Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-05-15 17:43:13,281 : INFO : Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloaderm

 'question': '我需要提供什麼資料作買賣指令？'}
2017-05-15 17:43:14,043 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '',
 'category': '證券交易服務一般資訊',
 'language': 'tc',
 'question': '當我買賣證券時，中信銀行(國際)何時會於我的賬戶內扣除我已購入之證券款項 ？'}
2017-05-15 17:43:14,047 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '',
 'category': '證券交易服務一般資訊',
 'language': 'tc',
 'question': '我可如何透過網上理財查詢證劵交易之購買力？'}
2017-05-15 17:43:14,051 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '',
 'category': '证券交易服务一般资讯',
 'language': 'sc',
 'question': '我会收到哪些种类的账户结单？ '}
2017-05-15 17:43:14,055 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '

 'question': '未成交的交易指示被取消時，交易狀況及結餘會有什麽變化？'}
2017-05-15 17:43:14,125 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '',
 'category': '香港证券交易服务',
 'language': 'sc',
 'question': '我可选择什么交易盘？ '}
2017-05-15 17:43:14,129 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '',
 'category': '香港证券交易服务',
 'language': 'sc',
 'question': '我可落碎股盘吗？ '}
2017-05-15 17:43:14,134 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '',
 'category': '香港证券交易服务',
 'language': 'sc',
 'question': '什么是「有效日期」特性？ '}
2017-05-15 17:43:14,139 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'An online order confirmation with a

             'should I do?'}
2017-05-15 17:43:14,180 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'All unmatched days orders will expire at the end of the Good Till '
           'Day. You will have to re-enter the order after the trading hours '
           'if you wish to keep the order for the next business day.',
 'category': 'General Information of Securities Trading Service',
 'language': 'en',
 'question': 'What will be happened if the orders are not executed by the end '
             'of the Good Till Day?'}
2017-05-15 17:43:14,184 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'The withheld fund of the cancelled order will be released to '
           'Usable Balance for Trading immediately.',
 'category': 'General Information of Securities Trading Service',
 'language': 'en',


{'answer': '', 'category': '香港證券交易服務', 'language': 'tc', 'question': '我可落碎股盤嗎？'}
2017-05-15 17:43:14,271 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '',
 'category': '香港證券交易服務',
 'language': 'tc',
 'question': '什麼是「有效日期」特性？'}
2017-05-15 17:43:14,276 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '',
 'category': '沪港通',
 'language': 'sc',
 'question': '我可以于什么渠道买卖沪股通证券？'}
2017-05-15 17:43:14,279 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '',
 'category': '沪港通',
 'language': 'sc',
 'question': '请问沪股通证券买卖接受什么指示盘？ '}
2017-05-15 17:43:14,283 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answe

 'question': '沪股通证券的每手单位 / 碎股、买卖规模及最低上落价位如何？'}
2017-05-15 17:43:14,358 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '',
 'category': '沪港通',
 'language': 'sc',
 'question': '请问我的买入指示有可能会以碎股成交？'}
2017-05-15 17:43:14,362 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'We only accept At-auction Limit Order within 199-price spread and '
           'Enhanced Limit Order within 23-price spread of the nominal price. '
           'Instructions that out of the spread will be rejected.',
 'category': 'Hong Kong Securities Trading Services',
 'language': 'en',
 'question': 'Why the order is rejected with an alert message "The input price '
             'is out of the spread range"?'}
2017-05-15 17:43:14,368 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securi

             'Nasdaq-Amex Pilot Program (PP)?'}
2017-05-15 17:43:14,446 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': "Shanghai-Hong Kong Stock Connect ('SH-HK Connect') is a securities "
           'trading and clearing links programme to be developed by Hong Kong '
           "Exchanges and Clearing Limited ('HKEX'), Shanghai Stock Exchange "
           "('SSE') and China Securities Depository and Clearing Corporation "
           "Limited ('ChinaClear'), aiming to achieve a breakthrough in mutual "
           'market access between the Mainland and Hong Kong.Under SH-HK '
           "Connect, The Stock Exchange of Hong Kong Limited ('SEHK'), a "
           'wholly-owned subsidiary of HKEX, and SSE will establish mutual '
           'order-routing connectivity and related technical infrastructure to '
           'enable investors of their respective market to trade designated '
    

             'Security to buy a new SSE Security before its settlement?'}
2017-05-15 17:43:14,516 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'Order amendment is not allowed for Northbound Trading. If you want '
           'to amend your trading order, you have to cancel the order first '
           'and place a new order after successfully cancelled.',
 'category': 'Shanghai-Hong Kong Stock Connect',
 'language': 'en',
 'question': 'Can I amend or cancel my SSE Securities trading order?'}
2017-05-15 17:43:14,520 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'You have to pay third-party charges such as handling fee, '
           'securities management fee, transfer fee, stamp duty as well as '
           'bank charges such as brokerage commission fee. For details, please '
        

             'conditions?'}
2017-05-15 17:43:14,590 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'Once you have linked up a CNY settlement account with your '
           'securities trading account, you will be able to trade SSE '
           'securities from next working day. To link up a settlement account '
           'for an existing securities trading account, please login to our '
           'i-banking platform or visit our branches.',
 'category': 'Shanghai-Hong Kong Stock Connect',
 'language': 'en',
 'question': 'I already have a securities trading account with the Bank, can I '
             'trade SSE Securities instantly?'}
2017-05-15 17:43:14,595 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '',
 'category': '美國證券交易服務',
 'language': 'tc',
 'question': '我可以於什麼渠道買賣美國證券？'}
2

 'question': '何時為美國證券市場的交易時間？'}
2017-05-15 17:43:14,640 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '',
 'category': '美國證券交易服務',
 'language': 'tc',
 'question': '我需要多久才可以沽出我的證券及何時可以收到已沽售證券之金額 ？'}
2017-05-15 17:43:14,644 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '',
 'category': '美國證券交易服務',
 'language': 'tc',
 'question': '我可否於香港的公眾假期時買賣美國之證券？'}
2017-05-15 17:43:14,648 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/sc/index.jsp>
{'answer': '黄表申请者\r\n'
           '\t\t\t\t\t\t\t退款将会于指定的退款日直接存入客户的证券户口内。白表申请者\r\n'
           '\t\t\t\t\t\t\t'
           '发行人之股份过户处将以支票形式把有关之认购金额退回客户。退款支票将于发行人公布之退款日期当天，以邮递方式寄回客户于网上认购新股服务申请表上所填报之地址。',
 'category': 'EIPO',
 'language': 'sc',
 'question': '怎样取回因认购申请未获成功分配而退回的认购金

 'question': '客戶需符合甚麼資格才可透過i-banking認購新股？'}
2017-05-15 17:43:14,718 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '第一步 \r\n'
           '\t\t\t\t\t\t\t於決定認購新股前，你應該細閱有關之招股書，才作出投資決定。你可於登入i-banking後，到 '
           '"證券交易" > "新股認購"中取得有關之招股資料第二步  \r\n'
           '\t\t\t\t\t\t\t'
           '按"網上認購"鍵及閣下欲申請之新股，細閱電子公開招股服務之條款及細則及一般條款，填寫申請表及發出有關之付款指示，以便我們於特定時間從你的指定賬戶扣除認購股款 '
           '第三步  \r\n'
           '\t\t\t\t\t\t\t確認你的申請詳情及於網上遞交申請表後，你即獲發參考編號以作記錄 第四步  \r\n'
           '\t\t\t\t\t\t\t請確保你的指定支賬賬戶有足夠款項以便我們於所定的時間過賬，否則有關申請將會自動作廢，亦不作另行通知',
 'category': 'EIPO',
 'language': 'tc',
 'question': '網上認購新股有什麼步驟？'}
2017-05-15 17:43:14,723 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'Yes, trading will be available if US securities markets are open.',
 'category': 'U.S. Securities Trading Services'

             'Share Registrars for allotment?'}
2017-05-15 17:43:14,789 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp>
{'answer': '認購高於投資者的需要\r\n'
           '\t\t\t\t\t\t\t'
           '投資人士可能因未必會完全獲取其認購數量而認購多於自己的實際需要,但假如是次招股行動並沒有超額認購，投資者將會獲得所有認購的數量及支付全數應繳款項。 '
           '市場風險\r\n'
           '\t\t\t\t\t\t\t當公司掛牌上市後，其股份有機會跌破招股價。 業務風險\r\n'
           '\t\t\t\t\t\t\t'
           '決定投資之前，你應瞭解有關公司之業務性質，徹底細讀其招股書、公司財務報告、甚至尋求專業人士的意見，方才作出投資決定。',
 'category': 'EIPO',
 'language': 'tc',
 'question': '認購新股公司股份涉及的風險'}
2017-05-15 17:43:14,796 : DEBUG : Scraped from <200 https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/en/index.jsp>
{'answer': 'For Yellow Form Applicants\r\n'
           '\t\t\t\t\t\t\tthe application money (or part thereof) will be '
           'credited directly to your Securities Trading account on the refund '
           'day specified in the

# 2. Language Vector Space Model



### Introduction to the dictionaries
[dict.txt](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/dict.txt) is jieba's default dictionary. I use [sc2tc.py](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/sc2tc.py) to generate a traditional Chinese version, [dict-tc.txt](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/dict-tc.txt).

[sougou.dict](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/sougou.dict) is generated from some of [Sogou's](http://pinyin.sogou.com/dict/) dictionaries. Those dictionaries are in scel format, so I use this [script](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/dics/scel2txt.py) to convert them into text format.
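As a rough illustration of the dictionary conversion described above, here is a minimal sketch. It does not reproduce what sc2tc.py actually does: the `SC_TO_TC` table, `convert_word`, and `convert_dict_lines` are hypothetical names, and the three-character mapping is only for demonstration. The sketch assumes jieba's dictionary line layout of a word followed by an optional frequency and an optional part-of-speech tag, separated by spaces.

```python
# Hypothetical, tiny simplified-to-traditional table for illustration only.
# A real conversion needs a complete mapping and is not always one-to-one.
SC_TO_TC = {"证": "證", "银": "銀", "务": "務"}


def convert_word(word, table=SC_TO_TC):
    """Map each character through the table; unknown characters stay unchanged."""
    return "".join(table.get(ch, ch) for ch in word)


def convert_dict_lines(lines, table=SC_TO_TC):
    """Convert only the word column of jieba-style lines: 'word [freq] [tag]'."""
    converted = []
    for line in lines:
        parts = line.strip().split(" ")
        if not parts[0]:
            continue  # skip blank lines
        parts[0] = convert_word(parts[0], table)
        converted.append(" ".join(parts))
    return converted


print(convert_dict_lines(["证券 100 n", "银行 200 n", "服务 50 vn"]))
```

Only the first column is rewritten, so frequencies and tags carry over unchanged to the converted dictionary.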

In [4]:
import jieba
jieba.load_userdict("dics/dict-tc.txt")
jieba.load_userdict("dics/sougou.dict")

from gensim import corpora, models, similarities
from collections import defaultdict

Building prefix dict from the default dictionary ...
2017-05-15 15:13:23,636 : DEBUG : Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/3b/lpzhjvns3q501g94r7r8nkxw0000gn/T/jieba.cache
2017-05-15 15:13:23,641 : DEBUG : Loading model from cache /var/folders/3b/lpzhjvns3q501g94r7r8nkxw0000gn/T/jieba.cache
Loading model cost 3.928 seconds.
2017-05-15 15:13:27,567 : DEBUG : Loading model cost 3.928 seconds.
Prefix dict has been built succesfully.
2017-05-15 15:13:27,574 : DEBUG : Prefix dict has been built succesfully.


## *DocAnalyzer* implements these tasks:

### 2.1 Tokenize questions into words

    tokenize() method tokenizes some docs.

### 2.2 Build a TFIDF model using questions and answers
    build_tfidf_model() builds a TFIDF model.
### 3. Similarity Comparison
    similarity() returns the most similar sentence.
### 4. Named Entity Recognition
    get_entities() returns the list of entities in the sentence.
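To make the TF-IDF and similarity steps listed above concrete, here is a minimal pure-Python sketch of the idea. It is not the gensim implementation used by DocAnalyzer: `build_tfidf`, `vectorize`, and `most_similar` are hypothetical helpers, the base-2 logarithm for idf is an assumption of this sketch, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Weight term counts by idf and scale to unit length; unknown tokens are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    """docs is a list of token lists. Returns (idf table, unit-length tf-idf vectors)."""
    n = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log2(n / d) for t, d in doc_freq.items()}
    return idf, [vectorize(doc, idf) for doc in docs]


def most_similar(query_tokens, docs):
    """Return (cosine similarity, document) for the document closest to the query."""
    idf, vectors = build_tfidf(docs)
    query = vectorize(query_tokens, idf)
    # Both sides are unit vectors, so the dot product is the cosine similarity.
    scores = [sum(w * v.get(t, 0.0) for t, w in query.items()) for v in vectors]
    best = max(range(len(docs)), key=scores.__getitem__)
    return scores[best], docs[best]


docs = [["cancel", "order"], ["open", "account"], ["cancel", "account", "order"]]
print(most_similar(["open"], docs))
```

Rare tokens get a higher idf weight, so a query that shares an uncommon word with a document scores higher than one that only shares common words.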

In [75]:
class DocAnalyzer:
    def __init__(self,docs):
        self.texts = []
        self.dic = None
        self.tfidf = None
        self.corpus = None
        self.dic_length = 0
        self.coreNLP_client = None
        self.tokenize(docs)
    
    def tokenize(self,docs):
        words=[]
        punct = set(u'''， %/:!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐､﹒
        ﹔﹕﹖﹗﹚﹜﹞！），．：；？｜｝︴︶︸︺︼︾﹀﹂﹄﹏､～￠
        々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖（［｛￡￥〝︵︷︹︻
        ︽︿﹁﹃﹙﹛﹝（｛“‘-—_…''')
        # for str/unicode
        filterpunt = lambda s: ''.join(filter(lambda x: x not in punct, s))
        # for list
        filterpuntl = lambda l: list(filter(lambda x: x not in punct, l))

        for doc in docs.map(lambda x:jieba.cut(str(x))):
            words.append(filterpuntl(list(doc)))

        frequency = defaultdict(int)
        for text in words:
            for token in text:
                frequency[token] += 1
        self.texts = [[token for token in text if frequency[token] > 1] for text in words]
    
    def build_tfidf_model(self):
        dic = corpora.Dictionary(self.texts)
        self.dic = dic
        self.dic_length = len(dic)
        dic.save('/tmp/cncb-qna.dict')
        self.corpus = [dic.doc2bow(text) for text in self.texts]
        corpora.MmCorpus.serialize('/tmp/cncb-qna.mm', self.corpus)
        self.tfidf = models.TfidfModel(self.corpus)
        
    def similarity(self,sentence):
        if self.tfidf is None:
            self.build_tfidf_model()
        # Index every document in TF-IDF space, then query it with the new sentence.
        sim_index = similarities.SparseMatrixSimilarity(self.tfidf[self.corpus], num_features=self.dic_length)

        new_vec = self.dic.doc2bow(jieba.cut(sentence))
        scores = tuple(enumerate(sim_index[self.tfidf[new_vec]]))
        # Pick the document with the highest cosine similarity.
        best,sim = max(scores, key=lambda x:x[1])
        return sim,''.join(self.texts[best])

    def get_coreNLP_client(self):
        if self.coreNLP_client is None:
            self.coreNLP_client = CoreNLPClient(server='http://localhost:9000', default_annotators=['ssplit', 'lemma', 'tokenize', 'pos', 'ner']) 
            
        return self.coreNLP_client
        
    def get_entities(self, sentence):
        client = self.get_coreNLP_client()
        
        annotated = client.annotate(sentence)
        result = []
        for s in annotated.sentences:
            for token in s:
                if token.ner != 'O':
                    result.append((token.word, token.ner))
        return result

analyzer = DocAnalyzer(df.question)


#### Test *similarity()*

In [76]:
print('The maximum similarity is %s: %s'%analyzer.similarity('取消正在处理的指令'))

The maximum similarity is 0.55071: 如果我想取消正在处理中或已部份成交的指令我可以怎样做


#### Test *get_entities()*
Before you test the get_entities() method, the **StanfordCoreNLPServer** must already be running.

The installation guide for StanfordCoreNLPServer is on the [Stanford site](https://nlp.stanford.edu/software/CRF-NER.shtml).

The packages I use here are:
* **[CoreNLP](https://stanfordnlp.github.io/CoreNLP/)**
* **[Named Entity Recognizer version 3.7.0](https://nlp.stanford.edu/software/CRF-NER.shtml#Download)**
* **[3.7.0 Chinese models](https://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar)**
* **[3.7.0 English models](https://nlp.stanford.edu/software/stanford-english-corenlp-2016-10-31-models.jar)**

After deploying the packages into the StanfordNLP directory, set up a configuration that supports both English and Chinese in **[custom.properties](https://github.com/iou2much/NLP-tasks/blob/master/banks_crawler/StanfordNLP/corenlp/custom.properties)**.

In [77]:
import subprocess,time
def start_stanfordNLP_server():
    command = "ps aux|grep StanfordCoreNLPServer"
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=None, shell=True)
    output = process.communicate()
    if len(output) > 1 and 'edu.stanford.nlp.pipeline.StanfordCoreNLPServer' in str(output[0]) :
        stop_stanfordNLP_server()
        time.sleep(3)

    command="""
    java -Xmx4g -cp \
    "StanfordNLP/corenlp/*:StanfordNLP/corenlp/:StanfordNLP/ner/:StanfordNLP/parser/:StanfordNLP/postagger/" \
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordNLP/corenlp/custom.properties \
    -port 9000 -timeout 15000 &
    """
    print('starting the stanfordNLP server')

    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    output = process.communicate()
def stop_stanfordNLP_server():
    # No leading "!": that is IPython shell syntax and is not valid in a subprocess command.
    command="""ps aux|grep StanfordCoreNLPServer|awk '{print $2}'|xargs kill -9 > /dev/null 2>&1"""
    print('killing the stanfordNLP server')
    subprocess.call(command, stdout=subprocess.PIPE, shell=True)

##### Start the server

In [70]:
start_stanfordNLP_server()

killing the stanfordNLP server
starting the stanfordNLP server


Test the get_entities() method. The first call takes a few seconds because it has to load the model. As the output shows, the default Chinese model of Stanford NER is not good enough: it recognizes 星期天 as NUMBER. Training on a customized corpus might give better results.

In [80]:
print(analyzer.get_entities("这个星期天北京会下雨"))
print('-'*100)
print(analyzer.get_entities('Rami Eid is studying at Stony Brook University in NY,America.I don\'t like Beckham.'))


[('星期天', 'NUMBER'), ('北京', 'GPE'), ('会', 'NUMBER')]
----------------------------------------------------------------------------------------------------
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('NY', 'LOCATION'), ('Beckham', 'PERSON')]
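The output above is per token, so a multi-word name such as Stony Brook University appears as three separate entries. One possible post-processing step, sketched below, joins runs of adjacent tokens that share the same tag. `merge_entities` is a hypothetical helper, not part of DocAnalyzer, and it assumes the full token sequence including the 'O' tags (which get_entities() drops), so that non-adjacent entities are not merged by accident. The 'O' tags in the sample input are assumed for illustration.

```python
from itertools import groupby


def merge_entities(tagged_tokens, sep=" "):
    """Join runs of adjacent tokens that share a non-'O' tag into one entity."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((sep.join(word for word, _ in run), tag))
    return entities


# (word, tag) pairs for part of the English test sentence.
tokens = [
    ("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"), ("studying", "O"),
    ("at", "O"), ("Stony", "ORGANIZATION"), ("Brook", "ORGANIZATION"),
    ("University", "ORGANIZATION"), ("in", "O"), ("NY", "LOCATION"),
]
print(merge_entities(tokens))
```

For Chinese text, passing `sep=""` joins the tokens without inserting spaces.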


## 5. Voice Recognition


For this task, most of my time was spent on data preprocessing.

### Dataset

I use the following large "tagged" corpus, since large Cantonese datasets are hard to find online.
* [Bible Audio in Cantonese](http://disc.fuyin.tv/soft/html/582.html)
* [Textbook](http://download.o-bible.com:8080/hgb.gz)

When I say tagged, it actually isn't. There are some issues to fix before I can use it.
* The audio files are organized by chapter, i.e., one audio file per chapter. That is too long as input for the deep network, so each file should be split into smaller files.
* The content of the audio and the text match exactly, but after the audio files are split, the text has to be split correspondingly.
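The audio-splitting step can be sketched with the standard-library `wave` module. This is only an illustration under stated assumptions: the recordings are assumed to have been converted to WAV first, `split_wav` and its `chunk_seconds` parameter are hypothetical, and cutting at fixed intervals is a simplification that can cut through a word. Splitting at pauses would make it easier to align each chunk with its text.

```python
import wave


def split_wav(src_path, out_prefix, chunk_seconds=10):
    """Write consecutive chunks of at most chunk_seconds each; return their paths."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source file
            path = "%s_%04d.wav" % (out_prefix, index)
            with wave.open(path, "wb") as dst:
                # Copy channels, sample width, and frame rate; the frame count
                # in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(path)
            index += 1
    return paths
```

A call such as `split_wav("chapter.wav", "chapter_part", chunk_seconds=10)` would produce `chapter_part_0000.wav`, `chapter_part_0001.wav`, and so on; both file names here are placeholders.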

Also, for a real-world application, this dataset is too ideal.
* Because it's the Bible, the reading speed is very steady and slow.
* In addition, there is no background noise at all in these recordings, so a model trained on them may not cope with background noise.



### Framework
For this task, I use Mozilla's [DeepSpeech project](https://github.com/mozilla/DeepSpeech). It's based on Baidu's Deep Speech research paper, is implemented on TensorFlow, and supports [warp-CTC](https://github.com/baidu-research/warp-ctc).

Besides, it only supports alphabetic characters, so I encode the Chinese characters before feeding them into the DNN and decode the output of the forward pass.
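The encoding scheme itself is not shown here, so the following is only one possible sketch of such a round trip, not the scheme actually used. It assumes each character is written as a fixed-width group of lowercase letters (its code point in base 26), with groups separated by spaces. `encode`, `decode`, `ALPHABET`, and `WIDTH` are hypothetical names.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 symbols
WIDTH = 4  # 26**4 = 456976 codes, enough for the Basic Multilingual Plane


def encode_char(ch):
    """Write the character's code point as WIDTH base-26 digits (letters)."""
    code = ord(ch)
    if code >= len(ALPHABET) ** WIDTH:
        raise ValueError("character out of range for this sketch: %r" % ch)
    letters = []
    for _ in range(WIDTH):
        code, rem = divmod(code, len(ALPHABET))
        letters.append(ALPHABET[rem])
    return "".join(reversed(letters))


def encode(text):
    """Encode every character as one letter group; groups are space-separated."""
    return " ".join(encode_char(ch) for ch in text)


def decode(encoded):
    """Invert encode(): turn each letter group back into its character."""
    chars = []
    for group in encoded.split():
        code = 0
        for letter in group:
            code = code * len(ALPHABET) + ALPHABET.index(letter)
        chars.append(chr(code))
    return "".join(chars)


print(encode("證券"))
print(decode(encode("證券")))
```

A fixed width keeps decoding unambiguous, at the cost of making the target sequence four letters per character, which lengthens what the network has to predict.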

In [1]:
import NLP_tasks#asr_for_cantonese

ModuleNotFoundError: No module named 'NLP_tasks'