## Intro and text classification

__Rule-based methods__
- Regular expressions
- Semantic slot filling: CFG
    - Context-free grammars
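
As a rough illustration of the rule-based approach, the sketch below fills semantic slots with a single regular expression. The slot names (`origin`, `destination`), the pattern, and the example utterance are assumptions made up for this example; they are not taken from the slides.

```python
import re

# Hypothetical pattern for utterances such as "... from <origin> to <destination> ..."
FLIGHT_PATTERN = re.compile(
    r"from (?P<origin>[A-Z][a-z]+) to (?P<destination>[A-Z][a-z]+)"
)

def fill_slots(utterance):
    """Return a dict of slot values, or an empty dict if the rule does not match."""
    match = FLIGHT_PATTERN.search(utterance)
    return match.groupdict() if match else {}

slots = fill_slots("Show me flights from Boston to Denver on Monday")
print(slots)  # {'origin': 'Boston', 'destination': 'Denver'}
```

Rules like this are precise but brittle: any phrasing the pattern does not anticipate yields no slots at all.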
    
![](../../images/1.png)


![](../../images/2.png)

![](../../images/3.png)

    


__Probabilistic modeling and machine learning__
- Likelihood maximization
- Linear classifiers

Why these traditional methods still matter:
- They perform well enough in many tasks
    - e.g. sequence labeling
- They allow us not to be blinded by the hype
    - e.g. word2vec / distributional semantics
- They help to further improve DL models
    - e.g. word alignment prior in machine translation


__Deep Learning__
- RNN

![](../../images/4.png)

- CNN






















## A brief recap of NLP applications

![](../../images/5.png)

![](../../images/6.png)

![](../../images/7.png)

![](../../images/8.png)

![](../../images/9.png)

![](../../images/10.png)

![](../../images/11.png)

![](../../images/12.png)


- Libraries

![](../../images/13.png)

![](../../images/14.png)

![](../../images/15.png)

![](../../images/16.png)

![](../../images/17.png)



## Implementation: Text preprocessing

### Additional notes: __all the taggers in nltk__


#### pos_tag

In [4]:
import nltk
# nltk.pos_tag tags a list of tokens with NLTK's default POS tagger (Penn Treebank tagset)
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

![](../../images/18.png)



In [5]:
nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

#### Automatic Tagging

In [7]:
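# A tag depends on a word's context, so tagging works on sentences rather than on a flat stream of words.
# With a flat word stream, the last word of one sentence would influence the tag of the first word of the next, which is wrong.
# Working sentence by sentence keeps context from crossing sentence boundaries.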

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_tagged_sents

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

#### The Regular Expression Tagger

In [14]:
patterns = [
(r'.*ing$', 'VBG'), # gerunds
(r'.*ed$', 'VBD'), # simple past
(r'.*es$', 'VBZ'), # 3rd singular present
(r'.*ould$', 'MD'), # modals
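(r'.*\'s$', 'NN'), # possessive nouns (Brown's possessive tag is NN$; NN is kept here to match the output below)
(r'.*s$', 'NNS'), # plural nouns
(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers (dot escaped so it matches only a decimal point)
(r'.*ly$', 'RB'), # adverbs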
(r'.*', 'NN')] # nouns (default)
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(['And', 'now', 'for', 'something', 'completely', 'different'])
regexp_tagger.tag(brown.words(categories='news'))

[('The', 'NN'),
 ('Fulton', 'NN'),
 ('County', 'NN'),
 ('Grand', 'NN'),
 ('Jury', 'NN'),
 ('said', 'NN'),
 ('Friday', 'NN'),
 ('an', 'NN'),
 ('investigation', 'NN'),
 ('of', 'NN'),
 ("Atlanta's", 'NN'),
 ('recent', 'NN'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', 'NN'),
 ('no', 'NN'),
 ('evidence', 'NN'),
 ("''", 'NN'),
 ('that', 'NN'),
 ('any', 'NN'),
 ('irregularities', 'VBZ'),
 ('took', 'NN'),
 ('place', 'NN'),
 ('.', 'NN'),
 ('The', 'NN'),
 ('jury', 'NN'),
 ('further', 'NN'),
 ('said', 'NN'),
 ('in', 'NN'),
 ('term-end', 'NN'),
 ('presentments', 'NNS'),
 ('that', 'NN'),
 ('the', 'NN'),
 ('City', 'NN'),
 ('Executive', 'NN'),
 ('Committee', 'NN'),
 (',', 'NN'),
 ('which', 'NN'),
 ('had', 'NN'),
 ('over-all', 'NN'),
 ('charge', 'NN'),
 ('of', 'NN'),
 ('the', 'NN'),
 ('election', 'NN'),
 (',', 'NN'),
 ('``', 'NN'),
 ('deserves', 'VBZ'),
 ('the', 'NN'),
 ('praise', 'NN'),
 ('and', 'NN'),
 ('thanks', 'NNS'),
 ('of', 'NN'),
 ('the', 'NN'),
 ('City', 'NN'),
 ('
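
The regular expression tagger tries its patterns in order and applies the first one that matches, which is why the catch-all `.*` pattern must come last. The following is a minimal pure-Python sketch of that behaviour, not NLTK's implementation; it uses a subset of the patterns above, and the token list is made up for the example.

```python
import re

# Subset of the patterns above; order matters because the first match wins.
PATTERNS = [
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # simple past
    (r".*es$", "VBZ"),                  # 3rd singular present
    (r".*s$", "NNS"),                   # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r".*", "NN"),                      # default: everything else is a noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in COMPILED:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(["produced", "irregularities", "3.14", "3x14", "jury"]))
# [('produced', 'VBD'), ('irregularities', 'VBZ'), ('3.14', 'CD'), ('3x14', 'NN'), ('jury', 'NN')]
```

With an unescaped `.` in the number pattern, `3x14` would also be tagged `CD`, because `.` matches any character.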

#### The Lookup Tagger

In [23]:
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# most_freq_words = fd.most_common(100)  # fd.keys()[:100] does not work in Python 3, where keys() is not sliceable
fd

FreqDist({'compassion': 1,
          'southpaw': 5,
          'Thakhek': 1,
          'expense': 7,
          'two-family': 1,
          'fine': 17,
          'creature': 2,
          'blonde': 1,
          'Lemon': 3,
          'Rob': 1,
          'KKK': 1,
          'Zone': 1,
          'decent': 2,
          'companies': 18,
          "O'Clock": 1,
          "Emperor's": 1,
          'utility': 5,
          'gruonded': 1,
          'Latin': 7,
          'lay-offs': 4,
          '1.5': 1,
          '$125': 1,
          '3-run': 1,
          'Pye': 1,
          'Mark': 3,
          'seven-hit': 1,
          'stag': 1,
          'construed': 1,
          'chocolate': 1,
          'kept': 16,
          'room': 17,
          'warbling': 1,
          'Tareytown': 1,
          'tour': 7,
          'intruders': 1,
          'Displayed': 1,
          'Comedian': 1,
          '$450': 2,
          'growing': 5,
          'topics': 2,
          'narcotic': 2,
          'multi-family': 1,
      

In [17]:
cfd

ConditionalFreqDist(nltk.probability.FreqDist,
                    {'compassion': FreqDist({'NN': 1}),
                     'southpaw': FreqDist({'NN': 5}),
                     'Thakhek': FreqDist({'NP': 1}),
                     'expense': FreqDist({'NN': 7}),
                     'two-family': FreqDist({'JJ': 1}),
                     'fine': FreqDist({'JJ': 12, 'NN': 4, 'RB': 1}),
                     'creature': FreqDist({'NN': 2}),
                     'blonde': FreqDist({'JJ': 1}),
                     'Lemon': FreqDist({'NP': 3}),
                     'Rob': FreqDist({'NP': 1}),
                     'KKK': FreqDist({'NN': 1}),
                     'Zone': FreqDist({'NN-TL': 1}),
                     'decent': FreqDist({'JJ': 2}),
                     'companies': FreqDist({'NNS': 18}),
                     "O'Clock": FreqDist({'RB-TL': 1}),
                     "Emperor's": FreqDist({'NN$-TL': 1}),
                     'utility': FreqDist({'NN': 5}),
                     'gruon
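
The two distributions above are the ingredients of a lookup tagger: pick the most frequent words, then store each word's most likely tag. Below is a minimal pure-Python sketch of that idea on a made-up toy corpus (the words, tags, and the cutoff of 2 are assumptions for illustration). In NLTK itself, the resulting dictionary can be passed to `nltk.UnigramTagger(model=likely_tags)`.

```python
from collections import Counter, defaultdict

# Toy tagged corpus, invented for this example (Brown-style tags).
tagged_words = [
    ("the", "AT"), ("fine", "JJ"), ("jury", "NN"), ("said", "VBD"),
    ("the", "AT"), ("fine", "NN"), ("was", "BEDZ"), ("fine", "JJ"),
]

# Counterpart of FreqDist: how often each word occurs.
word_freq = Counter(word for word, _ in tagged_words)

# Counterpart of ConditionalFreqDist: tag counts per word.
tag_freq = defaultdict(Counter)
for word, tag in tagged_words:
    tag_freq[word][tag] += 1

# Keep only the most frequent words and remember each one's most likely tag.
most_freq_words = [word for word, _ in word_freq.most_common(2)]
likely_tags = {word: tag_freq[word].most_common(1)[0][0] for word in most_freq_words}

def lookup_tag(tokens, default=None):
    """Tag known words from the lookup table; unknown words get `default`."""
    return [(token, likely_tags.get(token, default)) for token in tokens]

print(lookup_tag(["the", "fine", "jury"]))
# [('the', 'AT'), ('fine', 'JJ'), ('jury', None)]
```

Words outside the table get `None` here, which is the gap a backoff tagger is meant to fill.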

#### Unigram Tagging (no context)

In [24]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) #Training 
unigram_tagger.tag(brown_sents[2007])

[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

#### N-gram tagger

In [26]:
bigram_tagger = nltk.BigramTagger(brown_tagged_sents)
bigram_tagger.tag(brown_sents[2007])

[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'CS'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

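This approach has a problem: if the context of a word in the sentence being tagged never occurred in the training set, the word cannot be tagged, even if the word itself did occur in training. `backoff` is the way to handle this.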


In [29]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)
t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)
t2.tag(brown_sents[2007])

[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'CS'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]
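
To make the backoff chain concrete, here is a minimal pure-Python sketch of a bigram tagger that falls back to a unigram tagger and then to a default tag. It is an illustration under simplifying assumptions (a made-up toy corpus, context limited to the previous tag), not a reimplementation of NLTK's taggers.

```python
from collections import Counter, defaultdict

# Toy training sentences, invented for this example.
train_sents = [
    [("they", "PPSS"), ("want", "VB"), ("to", "TO"), ("run", "VB")],
    [("the", "AT"), ("run", "NN"), ("ended", "VBD")],
    [("a", "AT"), ("run", "NN")],
]

def train(tagged_sents):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram, bigram = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # context is reset at every sentence boundary
        for word, tag in sent:
            unigram[word][tag] += 1
            bigram[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return unigram, bigram

def tag(tokens, unigram, bigram, default="NN"):
    """Try the bigram model first, back off to unigram, then to the default tag."""
    tagged, prev_tag = [], None
    for word in tokens:
        if (prev_tag, word) in bigram:
            best = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]
        else:
            best = default
        tagged.append((word, best))
        prev_tag = best
    return tagged

unigram, bigram = train(train_sents)
print(tag("we want to run".split(), unigram, bigram))
# [('we', 'NN'), ('want', 'VB'), ('to', 'TO'), ('run', 'VB')]
```

`we` is unseen and gets the default tag; `want` has no bigram entry after `NN` and falls back to its unigram tag; `run` is tagged `VB` from the bigram context `TO`, even though its most frequent tag overall is `NN`.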

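Problems with n-gram taggers:
- The model takes up a relatively large amount of space
- When considering context, it only looks at the tags of the preceding words, not at the words themselves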



#### Brill tagging

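Brill tagging stores rules instead of a model, which saves a lot of space. The rules are also not restricted to tags: they can take the words themselves into account.

Examples: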

(1) replace NN with VB when the previous word is TO;

(2) replace TO with IN when the next tag is NNS.


![](../../images/19.png)


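In the first step, a unigram tagger tags every word; many of these tags may be inaccurate.

Rules are then applied to correct the tags that were guessed wrong in the first step, which eventually yields a fairly accurate tagging.

> So how are these rules generated?

They are generated automatically during the training phase, from a template of the form "replace T1 with T2 in the context C":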

During its training phase, the tagger guesses values for T1, T2, and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number
of correct tags it incorrectly modifies.
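
The scoring described above can be sketched in a few lines of pure Python. This is a simplified illustration, assuming a single rule shape ("replace T1 with T2 when the previous tag is C") and made-up toy tag sequences; a real Brill tagger supports many more templates.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Replace from_tag with to_tag wherever the preceding tag is prev_tag."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def net_benefit(guessed, gold, rule):
    """Score = tags the rule corrects minus correct tags it breaks."""
    after = apply_rule(guessed, *rule)
    fixed = sum(g != c and a == c for g, a, c in zip(guessed, after, gold))
    broken = sum(g == c and a != c for g, a, c in zip(guessed, after, gold))
    return fixed - broken

# Toy sequences: initial guesses vs. the gold-standard tags.
guessed = ["TO", "NN", "TO", "NN", "TO", "NN"]
gold = ["TO", "VB", "TO", "VB", "TO", "NN"]

rule = ("NN", "VB", "TO")  # replace NN with VB when the previous tag is TO
print(apply_rule(guessed, *rule))        # ['TO', 'VB', 'TO', 'VB', 'TO', 'VB']
print(net_benefit(guessed, gold, rule))  # 1  (2 tags fixed, 1 correct tag broken)
```

During training, the candidate rule with the highest net benefit would be kept and applied before the next round of scoring.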

----

Examples of rules:

- NN -> VB if the tag of the preceding word is 'TO'
- NN -> VBD if the tag of the following word is 'DT'
- NN -> VBD if the tag of the preceding word is 'NNS'
- NN -> NNP if the tag of words i-2...i-1 is '-NONE-'
- NN -> NNP if the tag of the following word is 'NNP'
- NN -> NNP if the text of words i-2...i-1 is 'like'
- NN -> VBN if the text of the following word is '*-1'


----




### Token normalization

![](../../images/20.png)

![](../../images/21.png)

In [1]:
import nltk
text1 = 'feet, cats, wolves, talked'
text2 = 'feet cats wolves talked'
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens1 = tokenizer.tokenize(text1)
tokens2 = tokenizer.tokenize(text2)
tokens1, tokens2

(['feet', ',', 'cats', ',', 'wolves', ',', 'talked'],
 ['feet', 'cats', 'wolves', 'talked'])

In [3]:
stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens1)

'feet , cat , wolv , talk'

## Feature extraction from text

### BOW

![](../../images/22.png)

> how to preserve some order info?

![](../../images/23.png)


![](../../images/25.png)
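
One common way to keep some order information in a bag-of-words representation is to count n-grams in addition to single tokens, which is also what `ngram_range=(1,2)` does in the TF-IDF code later in these notes. The following is a minimal pure-Python sketch; the whitespace tokenizer and the example sentence are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, max_n=2):
    """Count every 1-gram up to max_n-gram in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

bow = bag_of_ngrams("good movie not a good movie")
print(bow["good"], bow["good movie"], bow["not a"])  # 2 2 1
```

The vocabulary grows quickly with `max_n`, which is one reason very frequent and very rare n-grams are usually filtered out.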

> Question 


![](../../images/24.png)





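### TF-IDF (term frequency-inverse document frequency)

TF-IDF is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a term is to one document within a collection or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but decreases with the term's frequency across the corpus.

The more often a term appears in a document, and the fewer documents it appears in overall, the better it represents that document.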



#### TF (Term Frequency)

![](../../images/29.png)

#### IDF (Inverse document frequency)

- $N = |D|$: total number of documents in the corpus
- $\lvert \lbrace d \in D : t \in d \rbrace \rvert$: number of documents in which the term $t$ appears
- $idf(t, D) = \log \frac{N}{\lvert \lbrace d \in D : t \in d \rbrace \rvert}$

#### TF-IDF

$$tfidf(t,d,D) = tf(t,d) \cdot idf(t,D)$$
- A high TF-IDF weight is reached by a __high term frequency (TF)__ in the given document and __a low document frequency of the term__ in the whole collection, which gives a high IDF
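
A small worked example may help to check the formulas. The sketch below assumes one particular TF variant, the raw count divided by the document length, and uses the unsmoothed IDF defined above; the three toy documents are made up. scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its numbers will differ.

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    "good movie".split(),
    "not a good movie".split(),
    "did not like".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term` (assumed variant)."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`).

    Raises ZeroDivisionError for a term that appears in no document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "like" occurs in one document only, "not" in two, so "like" scores higher in docs[2].
print(round(tfidf("like", docs[2], docs), 4))  # 0.3662
print(round(tfidf("not", docs[2], docs), 4))   # 0.1352
```

Both terms have the same TF in the third document; the difference in weight comes entirely from the IDF factor.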


![](../../images/28.png)



In [2]:
import numpy as np
import pandas as pd
import csv
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
DIR = "./all/"


def load_train_data(skip_content=False):
    categories = ['cooking', 'robotics', 'travel', 'crypto', 'diy', 'biology']
    train_data = []
    for cat in categories:
        if skip_content:
            data = pd.read_csv("{}{}.csv".format(DIR, cat), usecols=['id', 'title', 'tags'])
        else:
            data = pd.read_csv("{}{}.csv".format(DIR, cat))
        data['category'] = cat
        train_data.append(data)
    
    return pd.concat(train_data)
load_train_data()

| | id | title | content | tags | category |
|---|---|---|---|---|---|
| 0 | 1 | How can I get chewy chocolate chip cookies? | `<p>My chocolate chips cookies are always too c...` | baking cookies texture | cooking |
| 1 | 2 | How should I cook bacon in an oven? | `<p>I've heard of people cooking bacon in an ov...` | oven cooking-time bacon | cooking |
| 2 | 3 | What is the difference between white and brown... | `<p>I always use brown extra large eggs, but I ...` | eggs | cooking |
| 3 | 4 | What is the difference between baking soda and... | `<p>And can I use one in place of the other in ...` | substitutions please-remove-this-tag baking-so... | cooking |
| 4 | 5 | In a tomato sauce recipe, how can I cut the ac... | `<p>It seems that every time I make a tomato sa...` | sauce pasta tomatoes italian-cuisine | cooking |
| 5 | 6 | What ingredients (available in specific region... | `<p>I have a recipe that calls for fresh parsle...` | substitutions herbs parsley | cooking |
| 6 | 9 | What is the internal temperature a steak shoul... | `<p>I'd like to know when to take my steaks off...` | food-safety beef cooking-time | cooking |
| 7 | 11 | How should I poach an egg? | `<p>What's the best method to poach an egg with...` | eggs basics poaching | cooking |
| 8 | 12 | How can I make my Ice Cream "creamier" | `<p>My ice cream doesn't feel creamy enough. I...` | ice-cream | cooking |
| 9 | 17 | How long and at what temperature do the variou... | `<p>I'm interested in baking thighs, legs, brea...` | baking chicken cooking-time | cooking |


In [3]:
def load_test_data():
    test_data = pd.read_csv(DIR + 'test.csv')
    return test_data

test_data = load_test_data()
test_data

| | id | title | content |
|---|---|---|---|
| 0 | 1 | What is spin as it relates to subatomic partic... | `<p>I often hear about subatomic particles havi...` |
| 1 | 2 | What is your simplest explanation of the strin... | `<p>How would you explain string theory to non ...` |
| 2 | 3 | Lie theory, Representations and particle physics | `<p>This is a question that has been posted at ...` |
| 3 | 7 | Will Determinism be ever possible? | `<p>What are the main problems that we need to ...` |
| 4 | 9 | Hamilton's Principle | `<p>Hamilton's principle states that a dynamic ...` |
| 5 | 13 | What is sound and how is it produced? | `<p>I've been using the term "sound" all my lif...` |
| 6 | 15 | What experiment would disprove string theory? | `<p>I know that there's big controversy between...` |
| 7 | 17 | Why does the sky change color? Why the sky is ... | `<p>Why does the sky change color? Why the sky ...` |
| 8 | 19 | How's the energy of particle collisions calcul... | `<p>Physicists often refer to the energy of col...` |
| 9 | 21 | Monte Carlo use | `<p>Where is the Monte Carlo method used in phy...` |


In [12]:
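from bs4 import BeautifulSoup  # missing import: merge() relies on it

def merge(row):
    title = row['title']
    content = row['content']
    # strip the HTML tags from the post body
    clean_content = BeautifulSoup(content, "html.parser").get_text()
    # combine title and cleaned body into a single text field
    row['text'] = title + " " + clean_content
    return row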

In [15]:
nlp_test_data = test_data.apply(merge, axis=1)[['id', 'text']]

In [14]:
nlp_test_data

| | id | text |
|---|---|---|
| 0 | 1 | What is spin as it relates to subatomic partic... |
| 1 | 2 | What is your simplest explanation of the strin... |
| 2 | 3 | Lie theory, Representations and particle physi... |
| 3 | 7 | Will Determinism be ever possible? What are th... |
| 4 | 9 | Hamilton's Principle Hamilton's principle stat... |
| 5 | 13 | What is sound and how is it produced? I've bee... |
| 6 | 15 | What experiment would disprove string theory? ... |
| 7 | 17 | Why does the sky change color? Why the sky is ... |
| 8 | 19 | How's the energy of particle collisions calcul... |
| 9 | 21 | Monte Carlo use Where is the Monte Carlo metho... |


In [16]:
tfidf = TfidfVectorizer(analyzer = "word", 
                        max_features = 5000, 
                        stop_words="english", 
                        ngram_range=(1,2))
features = tfidf.fit_transform(nlp_test_data['text']).toarray()

In [19]:
pd.DataFrame(features) # very sparse matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0,0,0,0,0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
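tfidf_tags = []
top_n = -5  # negative slice index: keep the 5 highest-weighted features per row

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_array = np.array(tfidf.get_feature_names_out())
print(feature_array)
tfidf_sorting = np.argsort(features)  # per-row feature indices, ascending by TF-IDF weight
print(tfidf_sorting)

for i, e in enumerate(tfidf_sorting):
    tmp_tags = []
    indexes = e[top_n:]
    for idx in indexes:
        cur_tag = feature_array[idx]
        # keep only terms with enough weight and length; bigrams become hyphenated tags
        if features[i][idx] > 0.1 and len(cur_tag)>3 and '_' not in cur_tag:
            tmp_tags.append(cur_tag.replace(' ', '-'))
    tfidf_tags.append(" ".join(tmp_tags))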

![](../../images/30.png)

![](../../images/31.png)


![](../../images/32.png)

![](../../images/33.png)

![](../../images/34.png)





### Hashing Example

![](../../images/35.png)

![](../../images/36.png)
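
Assuming the slides above illustrate feature hashing (the "hashing trick"), the idea can be sketched as follows: instead of storing a vocabulary, each token is hashed and counted in a bucket whose index is the hash modulo a fixed vector size. The choice of MD5 and of 16 buckets is an assumption for this example only.

```python
import hashlib

def hashed_features(tokens, n_buckets=16):
    """Map tokens to a fixed-size count vector without keeping a vocabulary."""
    vec = [0] * n_buckets
    for token in tokens:
        # A stable hash is used because Python's built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1
    return vec

tokens = "good movie not a good movie".split()
vec = hashed_features(tokens)
print(len(vec), sum(vec))  # 16 6
```

Different tokens can collide in the same bucket; a larger `n_buckets` makes collisions rarer at the cost of a longer feature vector.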







