# Tokens and terms

Common complications in extracting text:
- A single document may contain multiple languages (e.g. a French email with a German PDF)
- What is the unit document? (a group of files? a single file?)

# Tokenisation

Tokenisation breaks a character sequence into tokens, in the simplest case by splitting on a few delimiters.

## Issues and types of delimiters
1. Apostrophes (e.g. `Finland's capital`): do we keep the apostrophe, drop it, or split on it?
2. Hyphens (e.g. `state-of-the-art`): do we break up the hyphenated sequence or keep it as one token?
3. Spaces

There is no one-size-fits-all solution; the right choice depends on the use case.
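
To make the trade-off concrete, here is a minimal sketch of three possible delimiter policies using only the standard `re` module. The regular expressions and variable names are illustrative assumptions, not a recommended tokeniser.

```python
import re

text = "Finland's capital has state-of-the-art trams."

# Policy A: split on whitespace only; apostrophes, hyphens and the
# trailing period all stay attached to their tokens.
whitespace_tokens = text.split()

# Policy B: treat every non-alphanumeric character as a delimiter.
aggressive_tokens = re.findall(r"[A-Za-z0-9]+", text)

# Policy C: keep apostrophes and hyphens that sit inside a word,
# but drop all other punctuation.
conservative_tokens = re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

print(whitespace_tokens)
print(aggressive_tokens)
print(conservative_tokens)
```

Policy B turns `Finland's` into `Finland` and `s`, while policy C keeps it whole; which of these is preferable depends on how queries will be tokenised.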


## Numbers, dates and other dangerous things

How do we process tokens like these, which mix digits, letters and punctuation?

```
3/20/13
55 B.C.
B-52
```
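
One possible approach, sketched below, is to try a few special-case patterns before falling back to a naive split. The patterns in `SPECIAL` and the function names are hypothetical and cover only the three examples above.

```python
import re

def naive_tokens(s):
    # Every run of letters or digits becomes its own token.
    return re.findall(r"[A-Za-z0-9]+", s)

# Hypothetical special cases, tried before the naive split.
SPECIAL = re.compile(
    r"\d{1,2}/\d{1,2}/\d{2,4}"      # slash-separated dates such as 3/20/13
    r"|\d+\s+(?:B\.C\.|A\.D\.)"     # years with an era marker such as 55 B.C.
    r"|[A-Z]-\d+"                   # designations such as B-52
)

def careful_tokens(s):
    tokens, pos = [], 0
    for m in SPECIAL.finditer(s):
        # Tokenise the text before the match naively, then keep the match whole.
        tokens.extend(naive_tokens(s[pos:m.start()]))
        tokens.append(m.group())
        pos = m.end()
    tokens.extend(naive_tokens(s[pos:]))
    return tokens

for sample in ["3/20/13", "55 B.C.", "B-52"]:
    print(naive_tokens(sample), careful_tokens(sample))
```

The naive split scatters `3/20/13` into three unrelated numbers, whereas the special-case pass keeps each example as a single token.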

## Language issue

- French has elided forms such as `L'ensemble`, which should match `un ensemble`.
- German noun compounds are not segmented, so a single compound word can be very long with no spaces.
- Japanese and Chinese have no spaces between words.
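
For the French case, a minimal sketch is to strip the elided clitic before indexing so that `L'ensemble` and `un ensemble` share the term `ensemble`. The clitic list in `ELISION` is an assumed, incomplete set, and the function names are hypothetical.

```python
import re

# Assumed, incomplete list of elided French clitics (l', d', j', qu', ...).
# Both the ASCII apostrophe and the typographic one are accepted.
ELISION = re.compile(r"^(?:l|d|j|m|t|s|n|c|qu)['’]", re.IGNORECASE)

def strip_elision(token):
    # Remove a leading elided clitic, if there is one.
    return ELISION.sub("", token)

def tokenize_fr(text):
    return [strip_elision(t) for t in text.split()]

print(tokenize_fr("L'ensemble"))
print(tokenize_fr("un ensemble"))
```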

```python
print('hello world'.split(' '))
```

```
['hello', 'world']
```


# Stop words

With a **stop list** we can exclude the most common words from the dictionary, e.g. `a`, `be`, ...

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The NLTK stop word and tokeniser data must be fetched once with nltk.download().
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Keep only tokens that are not in the stop list. The comparison is
# case-sensitive, so 'This' survives although 'this' is a stop word.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(filtered_sentence)
```

```
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
```


# Normalisation

We usually normalise words so that superficially different forms of the same term match each other, which also means we store fewer distinct terms.
- Deleting periods: U.S.A -> USA
- Deleting hyphens: anti-discriminatory -> antidiscriminatory
- Removal of accents and umlauts
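
The three normalisations above can be sketched with the standard library alone. The function names are hypothetical, and removing every period and hyphen unconditionally is a simplifying assumption.

```python
import unicodedata

def strip_accents(term):
    # Decompose characters (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalise(term):
    term = term.replace(".", "")   # U.S.A -> USA
    term = term.replace("-", "")   # anti-discriminatory -> antidiscriminatory
    return strip_accents(term)     # Tübingen -> Tubingen

print(normalise("U.S.A"))
print(normalise("anti-discriminatory"))
print(normalise("Tübingen"))
```

Dropping the umlaut outright is only one convention; German text is often normalised with `ü -> ue` instead, which this sketch does not attempt.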

# Case folding

Reduce all letters to lower case.  
This is often the better choice because users usually type their queries in lower case.
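
A minimal sketch of case folding with Python's built-in string methods; the example terms are made up for illustration.

```python
terms = ["Windows", "windows", "WINDOWS"]

# Lower-casing collapses all three spellings into one dictionary entry.
folded = {t.lower() for t in terms}
print(folded)

# str.casefold() is a more aggressive variant intended for caseless matching;
# for example it maps the German sharp s to "ss", which lower() does not.
print("Straße".lower())
print("Straße".casefold())
```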

# Lemmatisation

Reduce inflectional/variant forms to their base form.  
e.g.
- am, are, is -> be
- car, cars, car's, cars' -> car

Lemmatisation implies doing *proper* reduction to the dictionary form.

```python
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

wordnet_lemmatizer.lemmatize("dogs")
```

```
'dog'
```

# Stemming

Reduce terms to their "roots" before indexing.  
Stemming suggests crude affix chopping, e.g. `automate, automata, automation -> automat`. The exact stem produced depends on the algorithm used.

## Porter's algorithm

### Typical rules
- sses -> ss
- ies -> i
- ational -> ate
- tional -> tion
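
As a toy illustration of how such suffix rules are applied, the sketch below rewrites the longest matching suffix among `sses`, `ies`, `ational` and `tional`. It is not Porter's algorithm: the real stemmer groups its rules into ordered steps and checks conditions on the remaining stem, all of which are omitted here. The names `RULES` and `apply_suffix_rules` are hypothetical.

```python
# (suffix, replacement) pairs for the typical rules.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_suffix_rules(word):
    # Collect every rule whose suffix matches the end of the word.
    matches = [(s, r) for s, r in RULES if word.endswith(s)]
    if not matches:
        return word
    # When several rules match, the longest suffix wins
    # (e.g. "ational" beats "tional" for "relational").
    suffix, replacement = max(matches, key=lambda pair: len(pair[0]))
    return word[: len(word) - len(suffix)] + replacement

for w in ["caresses", "ponies", "relational", "conditional", "cat"]:
    print(w, "->", apply_suffix_rules(w))
```

Because the stem conditions are missing, this sketch over-stems short words: it would turn `rational` into `rate`.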

```python
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

porter_stemmer.stem("automation")
```

```
'autom'
```

# Evaluation

## Rank of techniques which will help reduce vocab size
1. Stop words - a large share of the tokens in a typical document are stop words, although the stop list itself contains few distinct terms
2. Lemmatisation/stemming - many inflected forms collapse to a single base form
3. Case folding - ranked last here, since it only merges forms that differ in capitalisation
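
One way to check such a ranking on your own collection is to count tokens and distinct terms after each technique. The sketch below does this on a made-up one-line corpus with a tiny hand-picked stop list and a crude stand-in for a stemmer, all of which are illustrative assumptions; numbers from a toy corpus say nothing about real collections.

```python
import re

# Toy corpus and stop list, both invented for illustration.
corpus = "The cat sat on the mat. The Cats are on the mats, and a cat is a cat."
STOP = {"the", "a", "on", "and", "is", "are"}

tokens = re.findall(r"[A-Za-z]+", corpus)

def crude_stem(term):
    # Stand-in for a real stemmer: strip one trailing "s" from longer words.
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def stats(terms):
    # Return (number of tokens, number of distinct terms).
    terms = list(terms)
    return len(terms), len(set(terms))

baseline = stats(tokens)
folded = stats(t.lower() for t in tokens)
no_stop = stats(t for t in tokens if t.lower() not in STOP)
stemmed = stats(crude_stem(t) for t in tokens)

for name, result in [("baseline", baseline), ("case folding", folded),
                     ("stop words", no_stop), ("stemming", stemmed)]:
    print(f"{name:13s} tokens={result[0]:2d} distinct={result[1]:2d}")
```

Reporting tokens and distinct terms separately matters because stop word removal mostly shrinks the token count, while case folding and stemming shrink the number of distinct terms.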


## Do stemming and other normalisations help?
- In English the results are mixed: it harms precision for some queries but is very helpful for others.
- Definitely useful for Spanish, German and Finnish.

# Ultimately
1. Language-specific
2. Application-specific