Segmenting, Tokenizing, & Stemming
====

By The End of This Workbook You Should Be Able To:
----

- Tokenize words 
- List the advantages and disadvantages of stemming
- Select between different stemming algorithms

<br>
<br> 
<br>

----
Tokenization
-----

Tokenization: Breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens

Not dressing up like _Lord of the Rings_
![](http://www.coolest-homemade-costumes.com/images/great-costume-idea-01.jpg)

That is Tolkenization 😉

The simplest way to tokenize is to split on white space

In [1]:
sentence1 = 'Sky is blue and trees are green'
sentence1.split(' ')

['Sky', 'is', 'blue', 'and', 'trees', 'are', 'green']
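Splitting on a single space assumes the text is cleanly spaced. The short sketch below (the `messy` string is a made-up example) contrasts `split(' ')` with `split()` called without arguments, which splits on any run of whitespace:

```python
# Made-up sentence with a double space, a tab, and a newline
messy = 'Sky  is blue\tand trees\nare green'

# Splitting on one space leaves an empty string and keeps the tab/newline
on_space = messy.split(' ')
print(on_space)

# With no argument, split() treats any run of whitespace as one separator
on_whitespace = messy.split()
print(on_whitespace)
```

For text copied from real documents, the argument-free form is usually the safer default.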

Sometimes you might also want to deal with abbreviations, hyphenation, punctuation, and other characters.

In those cases, you would want to use regex.

However, each substitution makes another pass over the text, which can be slow when the corpus is large

In [2]:
import re
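sentence2 = 'This state-of-the-art technology is cool, isn\'t it?'
# Split hyphenated words into separate words
sentence2 = re.sub('-', ' ', sentence2)
# Drop punctuation (inside [], characters need no | between them)
sentence2 = re.sub('[,.?]', '', sentence2)
# Expand the contraction so "isn't" becomes "is not"
sentence2 = re.sub('n\'t', ' not', sentence2)
# Raw string avoids an invalid-escape warning for \s
sentence2_tokens = re.split(r'\s+', sentence2)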
print(sentence2_tokens)

['This', 'state', 'of', 'the', 'art', 'technology', 'is', 'cool', 'is', 'not', 'it']
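The multi-pass approach above can often be collapsed into one pass with a single compiled pattern. This is a minimal sketch, not the workbook's method: `TOKEN_PATTERN` and `tokenize` are hypothetical names, and the pattern only handles ASCII letters. Note that it keeps `isn't` as one token instead of expanding it to `is not`, so the token count differs from the output above.

```python
import re

# Hypothetical pattern: a run of letters, optionally followed by
# an apostrophe and more letters (so "isn't" stays together)
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def tokenize(text):
    """Return all tokens found in one scan of the text."""
    return TOKEN_PATTERN.findall(text)

single_pass_tokens = tokenize("This state-of-the-art technology is cool, isn't it?")
print(single_pass_tokens)
```

Hyphens and punctuation are never matched by the pattern, so they act as separators without a dedicated substitution step.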


In this case, there are 11 tokens, but the size of the vocabulary is 10 because `is` appears twice

In [3]:
print('Number of tokens:', len(sentence2_tokens))
print('Number of vocabulary:', len(set(sentence2_tokens)))

Number of tokens: 11
Number of vocabulary: 10
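To see which tokens account for the gap between the two numbers, one option is `collections.Counter`. The sketch below reuses the token list printed earlier; the variable names are illustrative.

```python
from collections import Counter

tokens = ['This', 'state', 'of', 'the', 'art', 'technology',
          'is', 'cool', 'is', 'not', 'it']

# Counter maps each distinct token to how many times it occurs
counts = Counter(tokens)

total_tokens = sum(counts.values())  # every occurrence
vocab_size = len(counts)             # distinct tokens only
repeated = [tok for tok, n in counts.items() if n > 1]

print('Number of tokens:', total_tokens)
print('Vocabulary size:', vocab_size)
print('Repeated tokens:', repeated)
```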


### Exercises

TODO: Tokenize the following sentence

In [13]:
sentence3 = 'If only Bradley\'s arm was longer. Best photo ever. #oscars'

In [14]:
sentence3 = re.sub('[\'.#]+','',sentence3)
sentence3 = sentence3.split(" ")
sentence3

['If',
 'only',
 'Bradleys',
 'arm',
 'was',
 'longer',
 'Best',
 'photo',
 'ever',
 'oscars']

In [16]:
print(len(sentence3),'tokens')
print(len(set(sentence3)),'vocab')

10 tokens
10 vocab


TODO: Count the number of tokens and vocabulary

<details><summary>
Click here for a solution.
</summary>
`
sentence3 = re.sub('[#\'.]', '', sentence3).lower()
print(re.split(r'\s+', sentence3))

10 tokens and 10 vocabulary
`
</details>

<br>
<br> 
<br>

----
Stemming
----

Stemming removes affixes to reduce tokens to their base form (stems)

**For example:**

`automates`, `automating` and `automatic` could be stemmed to `automat`
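The idea can be illustrated with a toy suffix stripper. This is only a sketch of the concept, not any of the real stemming algorithms: `toy_stem`, its suffix list, and the minimum stem length are all assumptions chosen to reproduce the example above.

```python
def toy_stem(token, suffixes=('ing', 'es', 'ic', 's'), min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[:-len(suffix)]
    # No suffix matched, so the token is returned unchanged
    return token

toy_stems = [toy_stem(t) for t in ['automates', 'automating', 'automatic']]
print(toy_stems)
```

Real stemmers apply many more rules, in several ordered steps, which is why their output can differ from this simplified version.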

<br>



There are 3 commonly used stemmers, each with slightly different rules for systematically replacing affixes in tokens. In general, the Lancaster stemmer stems the most aggressively, i.e. it removes the most characters from the end of each token, followed by Snowball and then Porter

1. **Porter Stemmer:**

    - The most commonly used stemmer and the gentlest of the three
    - The most computationally intensive of the three algorithms (though not by a very significant margin)
    - The oldest of the three algorithms listed here (published in 1980)

2. **Snowball Stemmer:**

    - Widely regarded as an improvement over the Porter Stemmer
    - Slightly faster computation time than the Porter Stemmer

3. **Lancaster Stemmer:**
    
    - Very aggressive stemming algorithm
    - With Porter and Snowball Stemmers, the stemmed representations are usually fairly intuitive to a reader
    - With Lancaster Stemmer, shorter tokens that are stemmed will become totally obfuscated
    - The fastest of the three algorithms, and its aggressive stemming reduces the vocabulary the most
    - However, if one desires more distinction between tokens, Lancaster Stemmer is not recommended

### Examples
The following code demonstrates the difference between the stemmers

In [17]:
from nltk import stem

tokens =  ['player', 'playa', 'playas', 'pleyaz'] 

# Define Porter Stemmer
porter = stem.porter.PorterStemmer()
# Define Snowball Stemmer
snowball = stem.snowball.EnglishStemmer()
# Define Lancaster Stemmer
lancaster = stem.lancaster.LancasterStemmer()

print('Porter Stemmer:', [porter.stem(i) for i in tokens])
print('Snowball Stemmer:', [snowball.stem(i) for i in tokens])
print('Lancaster Stemmer:', [lancaster.stem(i) for i in tokens])

Porter Stemmer: ['player', 'playa', 'playa', 'pleyaz']
Snowball Stemmer: ['player', 'playa', 'playa', 'pleyaz']
Lancaster Stemmer: ['play', 'play', 'playa', 'pleyaz']
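A related way to compare the stemmers is to count how many distinct stems each one produces from the same words. The sketch below assumes NLTK is installed; the `words` list is a made-up illustration rather than a real corpus, so the exact counts are not shown here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Illustrative word list (an assumption, not taken from a real corpus)
words = ['play', 'plays', 'played', 'playing', 'player', 'players',
         'think', 'thinks', 'thinking']

stemmers = {
    'Porter': PorterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'Lancaster': LancasterStemmer(),
}

vocab_sizes = {}
for name, stemmer in stemmers.items():
    # A set keeps only the distinct stems
    stems = {stemmer.stem(word) for word in words}
    vocab_sizes[name] = len(stems)
    print(f'{name}: {len(stems)} distinct stems from {len(words)} words')
```

A smaller count means the stemmer merged more words together, which is one rough measure of how aggressive it is.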


### Exercises

Why would one use a stemmer during tokenization? Think of the size of the vocabulary.

You might want to use a stemmer so that you can merge different forms of a word that have the same meaning. For example, being and be have the same stem, so they do not count as separate vocabulary entries.

If you want a fast stemmer, which stemmer would you choose? What is the disadvantage of that stemmer?

Lancaster Stemmer is the fastest.  Disadvantages are that it obfuscates the stems so they are not readable.

Otherwise, if you want a stemmer that preserves the original word as much as possible and still has reasonable speed, which stemmer would you use?

Snowball


<details><summary>
Click here for a solution.
</summary>

```
- To reduce the size of the vocabulary. For example, we can consider "think" and "thinks" to be same token since they carry the same meaning almost all the time

- Lancaster is the fastest stemmer but it would obfuscate smaller words, making it impossible for humans to tell what word it was originally

- Snowball Stemmer is an improvement of the Porter Stemmer and it strikes a balance between quality and speed
```
</details>

## Summary

- Tokenization separates words in a sentence
- You would normalize or process the sentence during tokenization to obtain sensible tokens
- These normalizations include:
  - Replacing special characters such as `,.-=!` with spaces using regex
  - Lowercasing
  - Stemming to remove the suffix of tokens to make tokens more uniform
- There are three commonly used stemmers: Porter, Snowball and Lancaster
- Lancaster is the fastest and also the most aggressive stemmer, while Snowball is a good balance between speed and quality of stemming
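
The normalization steps listed above can be chained into one small function. This is a minimal sketch under stated assumptions: `normalize_and_tokenize` is a hypothetical helper, the character class covers only the special characters named in the summary, and stemming is passed in as an optional function (for example `snowball.stem` from the earlier example) so the sketch runs without NLTK.

```python
import re

def normalize_and_tokenize(text, stem=None):
    """Lowercase, replace special characters with spaces, split, then stem."""
    text = text.lower()
    # Replace the special characters , . - = ! with spaces
    text = re.sub(r'[,.\-=!]', ' ', text)
    # split() with no argument also collapses repeated whitespace
    tokens = text.split()
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return tokens

summary_tokens = normalize_and_tokenize('Sky is blue, and trees are green!')
print(summary_tokens)
```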

<br>
<br> 
<br>

----