# Corpus and Lexicon

- __author__: Evgeny A. Stepanov
- __e-mail__: stepanov.evgeny.a@gmail.com

Dan Jurafsky and James H. Martin's __Speech and Language Processing__ ([3rd ed. draft](https://web.stanford.edu/~jurafsky/slp3/)) is the recommended reading.

- Section *Corpora and Counting* covers some concepts of *Chapter 2: "Regular Expressions, Text Normalization, Edit Distance"*.

__Requirements__

- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset

    - run `git clone https://github.com/esrel/NL2SparQL4NLU.git`
    
- [spaCy](https://spacy.io/)
    - run `pip install spacy`
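    - run `python -m spacy download en` to install the English language model (spaCy 3 and later removed the `en` shortcut: run `python -m spacy download en_core_web_sm` instead)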
    
- [NLTK](http://www.nltk.org/)
    - run `pip install nltk`

- [scikit-learn](https://scikit-learn.org/)
    - run `pip install scikit-learn`
    

__Alternative Corpora__

Use __only__ if you know how to work with JSON!

- https://github.com/howl-anderson/ATIS_dataset
- https://github.com/sebischair/NLU-Evaluation-Corpora
- https://github.com/sonos/nlu-benchmark
- https://github.com/clinc/oos-eval

## 1. Corpora and Counting

### 1.1. Corpus

A [corpus](https://en.wikipedia.org/wiki/Text_corpus) is a collection of written or spoken texts that is used for language research. Before doing anything with a corpus, we need to know its properties:

__Corpus Properties__:
- *Format* -- how to read/load it?
- *Language* -- which tools/models can I use?
- *Annotation* -- what is it intended for?
- *Split* for __Evaluation__: (terminology varies from source to source)

| Set         | Purpose                                       |
|:------------|:----------------------------------------------|
| Training    | training model, extracting rules, etc.        |
| Development | tuning, optimization, intermediate evaluation |
| Test        | final evaluation (remains unseen)             |


#### NL2SparQL4NLU

- __Format__:

    - Utterance (sentence) per line
    - Tokenized
    - Lowercased

- __Language__: English monolingual

- __Annotation__: None (for now)

- __Split__: training & test sets
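
Since the dataset provides only training and test sets, a development set has to be held out from the training data whenever one is needed. The following is a minimal sketch, assuming a corpus in list-of-lists format; the function name `split_train_dev`, the default ratio, and the fixed seed are illustrative choices and not part of the dataset.

```python
import random


def split_train_dev(data, dev_ratio=0.1, seed=42):
    """Hold out a fraction of the utterances as a development set."""
    indices = list(range(len(data)))
    # shuffle the indices (not the data) so the input list is left untouched
    random.Random(seed).shuffle(indices)
    n_dev = int(len(data) * dev_ratio)
    dev = [data[i] for i in indices[:n_dev]]
    train = [data[i] for i in indices[n_dev:]]
    return train, dev


# toy corpus: 10 utterances of one token each
toy = [[f'utt{i}'] for i in range(10)]
toy_train, toy_dev = split_train_dev(toy, dev_ratio=0.2)
print(len(toy_train), len(toy_dev))  # 8 2
```

Fixing the seed keeps the split reproducible across runs, which matters when results on the development set are compared over time.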

#### Exercise

- define a function to load a corpus into a list-of-lists

- load `NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt`
- print first `2` tokens of the first `10` utterances


In [93]:
trn_path='NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt'
tst_path='NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt'

In [94]:
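def load(path, is_swl=False):
    """Load a file as a list of token lists, or as a list of lines if is_swl."""
    with open(path, encoding='utf-8') as file:
        lines = [line.strip('\n') for line in file]
    # a word list has one word per line; a corpus has one utterance per line
    return lines if is_swl else [line.split(' ') for line in lines]

trn = load(trn_path)
tst = load(tst_path)

# first 2 tokens of the first 10 utterances
for u in trn[:10]:
    print('\t'.join(u[:2]))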

who	plays
show	credits
who	was
find	the
who	played
who	was
who	played
who	was
find	the
cast	and


### 1.2. Corpus Descriptive Statistics (Counting)

*Corpus* description in terms of:

- total number of tokens
- total number of utterances


#### Exercise

- define a function to compute corpus descriptive statistics -- the total number of utterances and the total number of tokens
- compute the statistics for the __training__ and __test__ sets of NL2SparQL4NLU dataset.
- compare the computed statistics with the reference values below.


| Metric           | Train  | Test   |
|------------------|-------:|-------:|
| Total Tokens     | 21,453 |  7,117 |
| Total Utterances |  3,338 |  1,084 |


In [95]:
def tot_tokens(data):
    # total number of tokens = sum of the utterance lengths
    return sum(len(u) for u in data)

print('Metric\t\tTrain\tTest')
print(f'Total Tokens:\t{tot_tokens(trn)}\t{tot_tokens(tst)}')
print(f'Total Utter.:\t{len(trn)}\t{len(tst)}')

Metric		Train	Test
Total Tokens:	21453	7117
Total Utter.:	3338	1084


#### Exercise

- define a function to compute the average number of tokens per utterance


In [96]:
def avg_tokens(data):
    return tot_tokens(data) / len(data)

print('Metric\t\tTrain\tTest')
print(f'Avg Tokens:\t{avg_tokens(trn):.2f}\t{avg_tokens(tst):.2f}')

Metric		Train	Test
Avg Tokens:	6.43	6.57


## 2. Lexicon

A [lexicon](https://en.wikipedia.org/wiki/Lexicon) is the *vocabulary* of a language. In linguistics, a lexicon is a language's inventory of lexemes.

Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalog of a language's words; and a grammar, a system of rules which allow for the combination of those words into meaningful sentences. 

*Lexicon (or Vocabulary) Size* is one of the statistics reported for corpora. While *Word Count* is the number of __tokens__, *Lexicon Size* is the number of __types__ (unique words).
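
The token/type distinction can be illustrated on a toy corpus. This is a minimal sketch; the two utterances are made up for the example and are not taken from the dataset.

```python
corpus = [
    ['who', 'plays', 'luke'],
    ['who', 'was', 'luke'],
]

# tokens: every running word counts
tokens = [word for utterance in corpus for word in utterance]
# types: each distinct word counts once
types = set(tokens)

print(len(tokens))  # 6
print(len(types))   # 4
```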


### 2.1. Lexicon Size

#### Exercise

- define a function to compute a lexicon from corpus in a list-of-lists format
    - sort the list alphabetically
    
- compute the lexicon of the training set of NL2SparQL4NLU dataset
- compare its size to the reference value below.

| Metric       | Value |
|--------------|------:|
| Lexicon Size | 1,729 |


In [111]:
import numpy as np

def lexicon(data):
    return np.unique(np.concatenate(data)).tolist()

print('Metric\t\tTrain\tTest')
print(f'Lexicon Size:\t{len(lexicon(trn))}\t{len(lexicon(tst))}')

Metric		Train	Test
Lexicon Size:	1729	1039


### 2.2. Frequency List

In Natural Language Processing (NLP), [a frequency list](https://en.wikipedia.org/wiki/Word_lists_by_frequency) is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

What is a "word"?

- case sensitive counts
- case insensitive counts (our corpus is lowercased)
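
The effect of case on the counts, and the way a rank follows from the position in the sorted list, can be sketched with `collections.Counter`. The function name `frequency_list` and the toy utterances are assumptions made for illustration only.

```python
from collections import Counter


def frequency_list(data, lowercase=False):
    """Return (rank, word, frequency) triples, most frequent first."""
    counts = Counter(
        word.lower() if lowercase else word
        for utterance in data
        for word in utterance
    )
    # most_common() sorts the word types by decreasing frequency
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


toy = [['The', 'movie'], ['the', 'cast', 'of', 'the', 'movie']]
# case sensitive: 'The' and 'the' are different word types
print(frequency_list(toy))
# case insensitive: both spellings are counted together
print(frequency_list(toy, lowercase=True))
```

In the case-insensitive list `the` is counted 3 times and takes rank 1, while the case-sensitive list splits those occurrences between `The` (1) and `the` (2).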

#### Exercise

- define a function to compute a frequency list for a corpus
- compute frequency list for the training set of NL2SparQL4NLU dataset
- report the `5` most frequent words (you can use the provided `nbest` function to get a dict of the top N items)
- compare the frequencies to the reference values below

| Word   | Frequency |
|--------|----------:|
| the    |     1,337 |
| movies |     1,126 |
| of     |       607 |
| in     |       582 |
| movie  |       564 |


In [112]:
def nbest(d, n=1):
    """
    get n max values from a dict
    :param d: input dict (values are numbers, keys are strings)
    :param n: number of values to get (int)
    :return: dict of top n key-value pairs
    """
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

In [113]:
def lexicon_freq(data):
    # np.unique returns the sorted word types and, with return_counts, their frequencies
    return dict(zip(*np.unique(np.concatenate(data), return_counts=True)))

print('Word\tFrequency')
for w, f in nbest(lexicon_freq(trn), n=5).items():
    print(f'{w}\t{f}')

Word	Frequency
the	1337
movies	1126
of	607
in	582
movie	564


### 2.3. Lexicon Operations

It is common to process the lexicon according to the task at hand (not every transformation makes sense for every task). The common operations are removing words by frequency (below a minimum or above a maximum, i.e. *Frequency Cut-Off*) and removing words that appear in a specific list (i.e. *Stop Word Removal*).

In computing, [stop words](https://en.wikipedia.org/wiki/Stop_words) are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose.

#### Exercises

##### Frequency Cut-Off

- define a function to compute a lexicon from a frequency list applying minimum and maximum frequency cut-offs

    - use default values for min and max
    
- using frequency list for the training set of NL2SparQL4NLU dataset
    
    - compute lexicon applying:
    
        - minimum cut-off 2 (remove words that appear fewer than 2 times, i.e. remove [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon))
        - maximum cut-off 100 (remove words that appear more than 100 times)
        - both minimum and maximum thresholds together
        
    - report size for each comparing to the reference values in the table

| Operation  | Min | Max | Size |
|------------|----:|----:|-----:|
| original   | N/A | N/A | 1729 |
| cut-off    |   2 | N/A |  950 |
| cut-off    | N/A | 100 | 1694 |
| cut-off    |   2 | 100 |  915 |


In [114]:
def freq_cutoff(data, minf=None, maxf=None):
    frequencies = lexicon_freq(data)
    return [w for w, f in frequencies.items() if (minf is None or f >= minf) and (maxf is None or f <= maxf)]

print('Operation\tMin\tMax\tSize')
print(f'original\tN/A\tN/A\t{len(freq_cutoff(trn))}')
print(f'cut-off\t\t2\tN/A\t{len(freq_cutoff(trn, minf=2))}')
print(f'cut-off\t\tN/A\t100\t{len(freq_cutoff(trn, maxf=100))}')
print(f'cut-off\t\t2\t100\t{len(freq_cutoff(trn, minf=2, maxf=100))}')


Operation	Min	Max	Size
original	N/A	N/A	1729
cut-off		2	N/A	950
cut-off		N/A	100	1694
cut-off		2	100	915


##### Stop Word Removal

- define a function to read/load a list of words in token-per-line format (i.e. lexicon)
- load stop word list from `NL2SparQL4NLU/extras/english.stop.txt`
- using Python's built-in `set` [methods](https://docs.python.org/2/library/stdtypes.html#set):
    
    - define a function to compute overlap of two lexicons
    - define a function to apply a stopword list to a lexicon

- compare the 100 most frequent words in frequency list of the training set to the list of stopwords (report count)
- apply stopword list to the lexicon of the training set
- report size of the resulting lexicon comparing to the reference values.

| Operation       | Size |
|-----------------|-----:|
| original        | 1729 |
| no stop words   | 1529 |
| top 100 overlap |   50 |

In [101]:
swl_path='NL2SparQL4NLU/extras/english.stop.txt'

In [121]:
def lexicon_overlap(lex1, lex2):
    # word types present in both lexicons
    return list(set(lex1) & set(lex2))

def apply_swl(lex, swl):
    # word types of the lexicon that are not in the stop word list
    return list(set(lex) - set(swl))

lex = lexicon(trn)
swl = load(swl_path, is_swl=True)

print('Operation\tSize')
print(f'original\t{len(lex)}')
print(f'no stop words\t{len(apply_swl(lex, swl))}')



Operation	Size
original	1729
no stop words	1529
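
The output above does not include the "top 100 overlap" row of the reference table. One possible way to compute it is sketched below on toy data; the function name `top_n_overlap` and the toy values are assumptions for illustration. With the names used in this notebook, the call would presumably be `top_n_overlap(lexicon_freq(trn), swl)`.

```python
def top_n_overlap(freqs, stop_words, n=100):
    """Count how many of the n most frequent words are stop words."""
    # sort the word types by decreasing frequency and keep the first n
    top = sorted(freqs, key=freqs.get, reverse=True)[:n]
    return len(set(top) & set(stop_words))


# toy frequency list and stop word list (illustrative values only)
toy_freqs = {'the': 5, 'movies': 4, 'of': 3, 'luke': 1}
toy_swl = ['the', 'of', 'a']
print(top_n_overlap(toy_freqs, toy_swl, n=3))  # 2
```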


##### Exercise: Alternative Stop Words

Compare the stop word list above to the stop word lists from the popular Python libraries in terms of overlaps.
(Use the `set` method `intersection`.)

- spaCy
- NLTK
- scikit-learn

    
For NLTK, you need to download the stop words first:

```python
import nltk
nltk.download('stopwords')
```

In [108]:
from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOP_WORDS
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as SKLEARN_STOP_WORDS
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

NLTK_STOP_WORDS = set(stopwords.words('english'))

print('spaCy: {}'.format(len(SPACY_STOP_WORDS)))
print('NLTK: {}'.format(len(NLTK_STOP_WORDS)))
print('sklearn: {}'.format(len(SKLEARN_STOP_WORDS)))


spaCy: 326
NLTK: 179
sklearn: 318
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mdestro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
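
The cell above reports only the sizes of the lists. The overlaps themselves can be computed for every pair of lists with `set` intersection, as in the following sketch. The helper name `pairwise_overlaps` and the toy lists are assumptions for illustration; in the notebook the dictionary would hold `swl`, `SPACY_STOP_WORDS`, `NLTK_STOP_WORDS`, and `SKLEARN_STOP_WORDS` instead.

```python
from itertools import combinations


def pairwise_overlaps(word_lists):
    """Map each pair of list names to the size of their intersection."""
    return {
        (a, b): len(set(word_lists[a]) & set(word_lists[b]))
        for a, b in combinations(word_lists, 2)
    }


# toy stand-ins for the real stop word lists (illustrative values only)
toy_lists = {
    'corpus': ['the', 'of', 'a', 'in'],
    'spaCy': ['the', 'of', 'whereas'],
    'NLTK': ['the', 'a', "don't"],
}
for pair, size in pairwise_overlaps(toy_lists).items():
    print(pair, size)
```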


## 3. Basic Text Pre-processing

Both frequency cut-off and stop word removal are frequently used text pre-processing steps. Depending on the application, several other pre-processing steps are commonly applied to transform text for Machine Learning tasks.

__Text Normalization Steps__

- removing extra white spaces

- tokenization
    - documents to sentences (sentence segmentation/tokenization)
    - sentences to tokens

- lowercasing/uppercasing


- removing punctuation

- removing accent marks and other diacritics 

- removing stop words (see above)

- removing sparse terms (frequency cut-off)

- number normalization
    - numbers to words (e.g. `10` to `ten`)
    - number words to numbers (e.g. `ten` to `10`)
    - removing numbers

- verbalization (specifically for speech applications)

    - numbers to words
    - expanding abbreviations (or spelling out)
    - reading out dates, etc.
    

- [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)
    - the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

- [stemming](https://en.wikipedia.org/wiki/Stemming)
    - the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
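
A few of the simpler steps from the list above (lowercasing, removing diacritics, removing punctuation, removing extra white spaces, whitespace tokenization) can be sketched with the standard library alone. The function name `normalize` and the order of the steps are assumptions made for this example, not a prescribed pipeline.

```python
import string
import unicodedata


def normalize(text):
    """Apply a few basic normalization steps and return a list of tokens."""
    # lowercase
    text = text.lower()
    # strip diacritics: decompose the characters, then drop the combining marks
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # remove ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # split() without arguments tokenizes on whitespace and discards the extra
    return text.split()


print(normalize('  Who plays   Amélie, in "Amélie"?  '))
# ['who', 'plays', 'amelie', 'in', 'amelie']
```

Note that removing all punctuation also removes apostrophes and hyphens, so `don't` becomes `dont`; whether that is acceptable depends on the task.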


### 3.1. Tokenization and Lemmatization with spaCy

The default spaCy NLP pipeline does several processing steps including __tokenization__, *part of speech tagging*, __lemmatization__, *dependency parsing* and *Named Entity Recognition* (ignore the ones in *italics* for today). 


spaCy produces a `Doc` object that contains `Token`s. The lemmatized form of a token is accessible through its `lemma_` attribute.

In [4]:
import spacy

# the 'en' shortcut was removed in spaCy 3; the full model name works in both versions
nlp = spacy.load('en_core_web_sm')

txt = 'who plays luke on star wars new hope'
doc = nlp(txt)

lemmas = [token.lemma_ for token in doc]

print(lemmas)

['who', 'play', 'luke', 'on', 'star', 'war', 'new', 'hope']


#### Exercise

- Lemmatize the dataset with spaCy
- compute the lexicon of the training set of NL2SparQL4NLU dataset (or the one you have chosen)
- compare its size to the "raw" counts

### 3.2. Stemming with NLTK

spaCy does not provide any stemming algorithms.
NLTK, on the other hand, provides several, including the [`Porter Stemmer`](https://tartarus.org/martin/PorterStemmer/) and the [`Snowball Stemmer`](https://snowballstem.org/algorithms/) (a.k.a. `Porter2`).

__Note__: Please read the original descriptions of the algorithms if you are interested.

Since stemming works at the token level, we need to provide tokens, which we can obtain either from `spacy`'s `Doc` or by simple *whitespace tokenization*:

```python

tokens = [token.text for token in doc]

```

or

```python
tokens = txt.split()
```

In [5]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

txt = 'who plays luke on star wars new hope'
tokens = txt.split()
print(tokens)

stems = [stemmer.stem(token) for token in tokens]
print(stems)

['who', 'plays', 'luke', 'on', 'star', 'wars', 'new', 'hope']
['who', 'play', 'luke', 'on', 'star', 'war', 'new', 'hope']


#### Exercise
- Stem the dataset with NLTK
- compute the lexicon of the training set of NL2SparQL4NLU dataset (or the one you have chosen)
- compare its size to the "raw" and lemmatized counts
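
Both exercises above follow the same pattern: map every token to a normalized form, then compute the lexicon of the result. The sketch below shows that pattern with a generic `normalize` callable. The names `normalized_lexicon` and `strip_plural_s` are hypothetical, and `strip_plural_s` is only a toy stand-in, not a real stemming or lemmatization algorithm.

```python
def normalized_lexicon(data, normalize):
    """Sorted list of word types after mapping every token through normalize."""
    return sorted({normalize(word) for utterance in data for word in utterance})


def strip_plural_s(word):
    # toy stand-in for a real stemmer or lemmatizer (NOT a real algorithm)
    return word[:-1] if word.endswith('s') and len(word) > 3 else word


toy = [['who', 'plays', 'luke'], ['star', 'wars', 'play', 'war']]
raw = normalized_lexicon(toy, lambda w: w)
reduced = normalized_lexicon(toy, strip_plural_s)
print(len(raw), len(reduced))  # 7 5
```

With NLTK, the `stem` method of a `PorterStemmer` instance can be passed as `normalize` directly. spaCy lemmas depend on the sentence context, so they are better computed per utterance than per isolated token.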