## Intro and text classification

__Rule-based methods__
- Regular expressions
- Semantic slot filling: CFG
    - Context-free grammars
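
A rule-based slot filler can be sketched with nothing more than a regular expression. The snippet below is only an illustration: the utterance pattern, the slot names (`origin`, `destination`, `date`) and the helper `fill_slots` are assumptions made up for this note, not part of any library.

```python
import re

# Hypothetical pattern for requests like "flights from X to Y on DAY".
# Single capitalized words only; multi-word city names would need a richer rule.
FLIGHT_PATTERN = re.compile(
    r"flights? from (?P<origin>[A-Z][a-z]+) to (?P<destination>[A-Z][a-z]+)"
    r"(?: on (?P<date>[A-Z][a-z]+))?"
)


def fill_slots(utterance):
    """Return a dict of slot name -> value, or an empty dict if nothing matches."""
    match = FLIGHT_PATTERN.search(utterance)
    if match is None:
        return {}
    # Optional slots that did not match come back as None; drop them.
    return {name: value for name, value in match.groupdict().items() if value is not None}


slots = fill_slots("Show me flights from Boston to Denver on Tuesday")
print(slots)
```

Hand-written rules like this are precise but brittle, which is the usual argument for moving to the statistical methods in the next group.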
    
![](../../images/1.png)


![](../../images/2.png)

![](../../images/3.png)

    


__Probabilistic modeling and machine learning__
- Likelihood maximization
- Linear classifiers

Why these traditional methods still matter:

- They perform well enough in many tasks
    - e.g. sequence labeling
- They keep us from being blinded by the hype
    - e.g. word2vec / distributional semantics
- They help to further improve DL models
    - e.g. word alignment prior in machine translation


__Deep Learning__
- RNN

![](../../images/4.png)

- CNN

## Brief recap of NLP applications

![](../../images/5.png)

![](../../images/6.png)

![](../../images/7.png)

![](../../images/8.png)

![](../../images/9.png)

![](../../images/10.png)

![](../../images/11.png)

![](../../images/12.png)


- Libraries

![](../../images/13.png)

![](../../images/14.png)

![](../../images/15.png)

![](../../images/16.png)

![](../../images/17.png)



## Implementation: Text preprocessing

### Additional notes: __all the taggers in nltk__


#### pos_tag

```python
import nltk

# pos_tag applies NLTK's recommended pre-trained POS tagger (Penn Treebank tag set).
# Both word_tokenize and pos_tag need their NLTK data packages (see nltk.download).
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
```

```text
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
```

![](../../images/18.png)



```python
nltk.corpus.brown.tagged_words()
```

```text
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
```

#### Automatic Tagging

```python
# A tag depends on the word's context, so tagging is done per sentence rather than
# over a flat stream of words. With a flat stream, the last word of one sentence
# would influence the tag of the first word of the next, which makes no sense.
# Working sentence by sentence keeps context from crossing sentence boundaries.
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_tagged_sents
```

```text
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'),
  ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'),
  ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'),
  ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'),
  ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'),
  ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')],
 [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'),
  ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'),
  ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','),
  ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'),
  ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'),
  ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'),
  ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'),
  ('of', 'IN-TL'), ...
```
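
To make the sentence-boundary point concrete, here is a small pure-Python sketch of the contexts a bigram-style tagger would see. The function `bigram_contexts` and the two-sentence toy corpus are made up for illustration and are not NLTK APIs; the idea is only that the tag history is reset at the start of every sentence.

```python
def bigram_contexts(tagged_sents):
    """Yield ((previous_tag, word), tag) for every token.

    The history is reset at each sentence start, so no context
    crosses a sentence boundary.
    """
    for sent in tagged_sents:
        prev_tag = None  # sentence start: nothing carried over
        for word, tag in sent:
            yield (prev_tag, word), tag
            prev_tag = tag


# Toy corpus (assumption): two short tagged sentences.
toy_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")],
    [("The", "AT"), ("election", "NN"), (".", ".")],
]

per_sentence = list(bigram_contexts(toy_sents))

# Contrast: treat the corpus as one flat stream of words.
flat_stream = [token for sent in toy_sents for token in sent]
flat = list(bigram_contexts([flat_stream]))

print(per_sentence[4])  # first word of the second sentence, history reset
print(flat[4])          # same word, but preceded by the '.' tag of the previous sentence
```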

#### The Regular Expression Tagger

```python
import nltk
from nltk.corpus import brown

# Patterns are tried in order; the first one that matches decides the tag.
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN'),                 # possessive nouns (Brown's own tag would be NN$)
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers (decimal point escaped)
    (r'.*ly$', 'RB'),                  # adverbs
    (r'.*', 'NN'),                     # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(['And', 'now', 'for', 'something', 'completely', 'different'])
regexp_tagger.tag(brown.words(categories='news'))
```

```text
[('The', 'NN'),
 ('Fulton', 'NN'),
 ('County', 'NN'),
 ('Grand', 'NN'),
 ('Jury', 'NN'),
 ('said', 'NN'),
 ('Friday', 'NN'),
 ('an', 'NN'),
 ('investigation', 'NN'),
 ('of', 'NN'),
 ("Atlanta's", 'NN'),
 ('recent', 'NN'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', 'NN'),
 ('no', 'NN'),
 ('evidence', 'NN'),
 ("''", 'NN'),
 ('that', 'NN'),
 ('any', 'NN'),
 ('irregularities', 'VBZ'),
 ('took', 'NN'),
 ('place', 'NN'),
 ('.', 'NN'),
 ('The', 'NN'),
 ('jury', 'NN'),
 ('further', 'NN'),
 ('said', 'NN'),
 ('in', 'NN'),
 ('term-end', 'NN'),
 ('presentments', 'NNS'),
 ('that', 'NN'),
 ('the', 'NN'),
 ('City', 'NN'),
 ('Executive', 'NN'),
 ('Committee', 'NN'),
 (',', 'NN'),
 ('which', 'NN'),
 ('had', 'NN'),
 ('over-all', 'NN'),
 ('charge', 'NN'),
 ('of', 'NN'),
 ('the', 'NN'),
 ('election', 'NN'),
 (',', 'NN'),
 ('``', 'NN'),
 ('deserves', 'VBZ'),
 ('the', 'NN'),
 ('praise', 'NN'),
 ('and', 'NN'),
 ('thanks', 'NNS'),
 ('of', 'NN'),
 ('the', 'NN'),
 ('City', 'NN'),
 ...
```
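
The mechanism behind a regular-expression tagger is small enough to sketch in plain Python. The function `regexp_tag` below is a hypothetical stand-in written for this note, not NLTK's implementation; it assumes the same "first matching pattern wins" behaviour and uses a shortened pattern list.

```python
import re

# Shortened pattern list (assumption); order matters because the first match wins.
PATTERNS = [
    (re.compile(r".*ing$"), "VBG"),                # gerunds
    (re.compile(r".*ed$"), "VBD"),                 # simple past
    (re.compile(r".*es$"), "VBZ"),                 # 3rd singular present
    (re.compile(r".*s$"), "NNS"),                  # plural nouns
    (re.compile(r"^-?[0-9]+(\.[0-9]+)?$"), "CD"),  # cardinal numbers
    (re.compile(r".*ly$"), "RB"),                  # adverbs
    (re.compile(r".*"), "NN"),                     # default: noun
]


def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in patterns:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged


sample = regexp_tag(["something", "completely", "different", "3.14", "dogs"])
print(sample)
```

Note how `something` is caught by the `ing` rule even though it is not a gerund: suffix rules ignore context entirely, which is why the catch-all `NN` dominates the Brown output above.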

#### The Lookup Tagger

```python
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# fd.keys()[:100] does not work in Python 3 (dict views cannot be sliced);
# most_common returns the words already sorted by frequency.
most_freq_words = [word for word, _ in fd.most_common(100)]
fd
```

```text
FreqDist({'compassion': 1,
          'southpaw': 5,
          'Thakhek': 1,
          'expense': 7,
          'two-family': 1,
          'fine': 17,
          'creature': 2,
          'blonde': 1,
          'Lemon': 3,
          'Rob': 1,
          'KKK': 1,
          'Zone': 1,
          'decent': 2,
          'companies': 18,
          "O'Clock": 1,
          "Emperor's": 1,
          'utility': 5,
          'gruonded': 1,
          'Latin': 7,
          'lay-offs': 4,
          '1.5': 1,
          '$125': 1,
          '3-run': 1,
          'Pye': 1,
          'Mark': 3,
          'seven-hit': 1,
          'stag': 1,
          'construed': 1,
          'chocolate': 1,
          'kept': 16,
          'room': 17,
          'warbling': 1,
          'Tareytown': 1,
          'tour': 7,
          'intruders': 1,
          'Displayed': 1,
          'Comedian': 1,
          '$450': 2,
          'growing': 5,
          'topics': 2,
          'narcotic': 2,
          'multi-family': 1,
          ...
```

```python
cfd
```

```text
ConditionalFreqDist(nltk.probability.FreqDist,
                    {'compassion': FreqDist({'NN': 1}),
                     'southpaw': FreqDist({'NN': 5}),
                     'Thakhek': FreqDist({'NP': 1}),
                     'expense': FreqDist({'NN': 7}),
                     'two-family': FreqDist({'JJ': 1}),
                     'fine': FreqDist({'JJ': 12, 'NN': 4, 'RB': 1}),
                     'creature': FreqDist({'NN': 2}),
                     'blonde': FreqDist({'JJ': 1}),
                     'Lemon': FreqDist({'NP': 3}),
                     'Rob': FreqDist({'NP': 1}),
                     'KKK': FreqDist({'NN': 1}),
                     'Zone': FreqDist({'NN-TL': 1}),
                     'decent': FreqDist({'JJ': 2}),
                     'companies': FreqDist({'NNS': 18}),
                     "O'Clock": FreqDist({'RB-TL': 1}),
                     "Emperor's": FreqDist({'NN$-TL': 1}),
                     'utility': FreqDist({'NN': 5}),
                     ...
```
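
The two distributions above are the ingredients of a lookup tagger: take the most frequent words and store each one's most likely tag. The sketch below reproduces that idea in plain Python on a tiny made-up corpus; `build_lookup` and `lookup_tag` are hypothetical helpers, and `word_counts` / `tag_counts` merely play the roles of `fd` / `cfd`. In NLTK itself, a table like this can be passed to `nltk.UnigramTagger(model=likely_tags)`.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (assumption), standing in for brown.tagged_words().
tagged_words = [
    ("the", "AT"), ("jury", "NN"), ("said", "VBD"),
    ("the", "AT"), ("fine", "JJ"), ("was", "BEDZ"),
    ("fine", "JJ"), ("fine", "NN"), ("the", "AT"),
]

word_counts = Counter(word for word, _ in tagged_words)  # role of fd
tag_counts = defaultdict(Counter)                        # role of cfd
for word, tag in tagged_words:
    tag_counts[word][tag] += 1


def build_lookup(n):
    """Map each of the n most frequent words to its most frequent tag."""
    return {
        word: tag_counts[word].most_common(1)[0][0]
        for word, _ in word_counts.most_common(n)
    }


def lookup_tag(tokens, table, default=None):
    """Tag known words from the table; everything else gets the default."""
    return [(token, table.get(token, default)) for token in tokens]


likely_tags = build_lookup(2)
print(lookup_tag(["the", "fine", "jury"], likely_tags))
print(lookup_tag(["the", "fine", "jury"], likely_tags, default="NN"))
```

Words outside the table come back as `None` unless a default is supplied, which is the same gap that `backoff` fills later in these notes.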

#### Unigram Tagging (no context)

```python
import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)  # train on the tagged news sentences
unigram_tagger.tag(brown_sents[2007])  # note: this sentence is part of the training data
```

```text
[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]
```

#### N-gram tagger

```python
bigram_tagger = nltk.BigramTagger(brown_tagged_sents)
bigram_tagger.tag(brown_sents[2007])
```

```text
[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'CS'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]
```

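The problem with this approach: if a word in the sentence being tagged occurs in a context that never appeared in the training set, the bigram tagger cannot tag it, even when the word itself was seen in training. `backoff` is how we work around this.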


```python
t0 = nltk.DefaultTagger('NN')                            # last resort: tag everything NN
t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)  # falls back to t0 for unseen words
t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)   # falls back to t1 for unseen contexts
t2.tag(brown_sents[2007])
```

```text
[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'CS'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]
```
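
The Brown sentence above comes from the training data, so it never exercises the backoff chain. The plain-Python sketch below shows the fallback on a toy corpus instead. All names (`train_unigram`, `train_bigram`, `tag`, the `START` marker) are assumptions for illustration, not NLTK's implementation; the lookup order mirrors `t2 -> t1 -> t0`.

```python
from collections import Counter, defaultdict

START = "<s>"  # marker for "no previous tag" at the start of a sentence


def train_unigram(tagged_sents):
    """word -> most frequent tag for that word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def train_bigram(tagged_sents):
    """(previous tag, word) -> most frequent tag in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def tag(tokens, bigram, unigram=None, default=None):
    """Try the bigram table, then the unigram table, then the default tag."""
    result, prev_tag = [], START
    for word in tokens:
        guess = bigram.get((prev_tag, word))
        if guess is None and unigram is not None:
            guess = unigram.get(word)
        if guess is None:
            guess = default
        result.append((word, guess))
        prev_tag = guess
    return result


# Toy training data (assumption).
train_sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
bigram = train_bigram(train_sents)
unigram = train_unigram(train_sents)

# "dog" was seen in training, but never at the start of a sentence.
no_backoff = tag(["dog", "barks"], bigram)
with_backoff = tag(["dog", "barks", "zebra"], bigram, unigram, default="NN")
print(no_backoff)
print(with_backoff)
```

Without backoff, one unseen context leaves `dog` untagged, and the missing tag then makes the next context unseen as well. With backoff, the unigram table rescues `dog`, the bigram table resumes for `barks`, and the default tag covers the unknown word `zebra`.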

Problems with n-gram taggers:
- The model takes up a lot of space.
- When looking at context, it only considers the tags of the preceding words, not the words themselves.



#### Brill tagging

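Brill tagging stores a list of transformation rules instead of an n-gram model, which saves a lot of space. The rules are also not restricted to tags; they can look at the words themselves.

Examples: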

(1) replace NN with VB when the previous word is TO;

(2) replace TO with IN when the next tag is NNS.


![](../../images/19.png)


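Step 1: tag every word with a unigram tagger. Many of these initial tags may be wrong.

Step 2: apply the rules to correct the tags that step 1 guessed wrong, which yields a fairly accurate final tagging.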
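
The two steps can be sketched in plain Python using the two example rules listed above. This is a hand-rolled illustration rather than NLTK's Brill trainer: the initial tagging is a made-up stand-in for unigram tagger output, and `apply_rule` together with the two condition helpers are hypothetical names.

```python
def apply_rule(tagged, from_tag, to_tag, condition):
    """Rewrite from_tag to to_tag wherever condition holds.

    Conditions are checked against the tags as they were before this rule ran.
    """
    words = [word for word, _ in tagged]
    old_tags = [tag for _, tag in tagged]
    new_tags = list(old_tags)
    for i, tag in enumerate(old_tags):
        if tag == from_tag and condition(words, old_tags, i):
            new_tags[i] = to_tag
    return list(zip(words, new_tags))


def prev_word_is_to(words, tags, i):
    """Rule (1) condition: looks at the previous word, not its tag."""
    return i > 0 and words[i - 1].lower() == "to"


def next_tag_is_nns(words, tags, i):
    """Rule (2) condition: looks at the tag of the next token."""
    return i + 1 < len(tags) and tags[i + 1] == "NNS"


RULES = [
    ("NN", "VB", prev_word_is_to),  # (1) replace NN with VB when the previous word is TO
    ("TO", "IN", next_tag_is_nns),  # (2) replace TO with IN when the next tag is NNS
]

# Step 1 (assumed output of an initial tagger): "increase" and the second "to" are wrong.
initial = [("to", "TO"), ("increase", "NN"), ("grants", "NNS"), ("to", "TO"), ("states", "NNS")]

# Step 2: apply the rules one after another.
corrected = initial
for from_tag, to_tag, condition in RULES:
    corrected = apply_rule(corrected, from_tag, to_tag, condition)
print(corrected)
```

Only the ordered rule list has to be stored, which is where the space saving over an n-gram table comes from.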


