## Overall Process of Simple Tokenization

Tokenization is the process of splitting text into tokens. Here is a very simple tokenizer using 'tr': -c complements the set 'A-Za-z' and -s squeezes repeats, so every run of non-letter characters is replaced by a single newline. We are also sorting (sort), removing duplicate lines (uniq), and then examining just the first few lines (head). Note that "-c" makes uniq prefix each remaining line with the number of times it was repeated.

Note: In an Ubuntu Linux shell, all-uppercase and mixed-case words show up as two separate lists:

```
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c | head
      1
   2405 A
     72 AARON
     16 ABBESS
      3 ABBOT
      8 ABERGAVENNY
     18 ABHORSON
      2 ABOUT
      6 ABRAM
     79 ACHILLES
     ....
      25 Aaron
      1 Abandon
      1 Abandoner
      6 Abate
      1 Abates
      5 Abbess
      6 Abbey
      5 Abbot
      2 Abel
      2 Aberga
      1 Abetting
```
But inside Colab there's a slight difference:

In [None]:
!tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c | head
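
For readers who want to follow the same steps in Python, here is a rough sketch of the pipeline. The sample string and the helper name `tr_tokenize` are made up for illustration, and `shakes.txt` is not read here. Python sorts strings by code point, so uppercase words come before lowercase ones, which may or may not match what `sort` does under your shell's locale.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for shakes.txt
sample = "A rose by any other name. A ROSE is a rose!"

def tr_tokenize(text):
    # Roughly mimics tr -sc 'A-Za-z': keep only runs of ASCII letters
    return re.findall(r"[A-Za-z]+", text)

tokens = tr_tokenize(sample)      # tokenize
counts = Counter(tokens)          # plays the role of sort | uniq -c

for word in sorted(counts)[:10]:  # plays the role of head
    print(f"{counts[word]:7d} {word}")
```

Unlike the shell version, this sketch never produces an empty token, so there is no count for the blank line.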

## Simple segregation of the sentence into tokens (without sorting or removing duplicates).

Here we separate into tokens, but without sorting or removing duplicates.

In [None]:
!tr -sc 'A-Za-z' '\n' < shakes.txt | head

## Segregation of the sentence into tokens with sorting

Here we are sorting, but without duplicate removal. 

In [None]:
!tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head


a
a
a
a
a
a
a
a
a


## Try the same command with 'sort -u' instead of 'uniq -c'

Is it the same? Almost! 'sort -u' removes duplicates just like 'sort | uniq', but it doesn't count the duplicated lines.

In [None]:
!tr -sc 'A-Za-z' '\n' < shakes.txt | sort -u | head


a
A
Aaron
AARON
abaissiez
abandon
Abandon
abandoned
Abandoner
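
The difference between plain sorting and sorting with duplicate removal can be sketched in Python as follows. The sample sentence is an assumption chosen only because it contains repeats. Note that Python orders by code point, so the ordering of upper and lower case may differ from the shell output above.

```python
import re

# Hypothetical sample text containing repeated words
sample = "to be or not to be, To be"

tokens = re.findall(r"[A-Za-z]+", sample)

sorted_all = sorted(tokens)          # comparable to: sort
sorted_unique = sorted(set(tokens))  # comparable to: sort -u

print(sorted_all)
print(sorted_unique)
```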


## More Counting: Merging upper and lower case

Example: Here 'a' and 'A' are considered the same.

In [None]:
!tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'a-z' '\n' | sort | uniq -c | head

      1 
  16384 a
     97 aaron
      1 abaissiez
     10 abandon
      2 abandoned
      1 abandoner
      2 abase
      1 abash
     15 abate


## More Counting: Do the above step, but then sort the words by their counts.

Here I'm taking a few more lines using 'head' so that we can see some of the more interesting cases a bit further down the list. (What are "d" and "s"?) Note: -n sorts numerically and -r reverses the order.

In [None]:
!tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r | head -n 20

  30280 the
  28468 and
  23971 i
  21268 to
  18834 of
  16384 a
  14695 you
  13199 my
  12415 in
  12257 that
   9920 is
   9086 not
   8917 d
   8544 with
   8431 s
   8305 for
   8287 me
   8250 it
   7588 his
   7408 be
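
One plausible source of the stray "d" and "s" tokens is the apostrophe: it is not a letter, so this tokenizer splits on it. The sketch below illustrates that on a made-up sentence (an assumption, not a quote taken from `shakes.txt`) and uses `Counter.most_common` in place of `sort -n -r | head`.

```python
import re
from collections import Counter

# Hypothetical sentence with apostrophes, in the style of the corpus
sample = "He's kiss'd the book; 'tis mark'd."

tokens = re.findall(r"[a-z]+", sample.lower())
counts = Counter(tokens)

print(tokens)                 # the apostrophes split off "s" and "d"
print(counts.most_common(3))  # highest counts first
```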


## 4 Tokenization in Python using "re"

4.1 Recall the very simple ELIZA example (no preprocessing)

In [None]:
import re

text = 'I AM SAD'  # named text to avoid shadowing the built-in input()
re.sub(r'I AM (DEPRESSED|SAD)', r'WHY DO YOU THINK YOU ARE \1', text)

'WHY DO YOU THINK YOU ARE SAD'
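
To see why preprocessing matters, consider what happens with lowercase input. The sketch below is only an illustration: the helper `eliza_reply` is hypothetical, and uppercasing is just one simple way to normalize the text before applying the rule.

```python
import re

RULE = r"I AM (DEPRESSED|SAD)"
REPLY = r"WHY DO YOU THINK YOU ARE \1"

def eliza_reply(utterance):
    # Normalize case first so 'i am sad' hits the same rule as 'I AM SAD'
    return re.sub(RULE, REPLY, utterance.upper())

raw = re.sub(RULE, REPLY, "i am sad")  # no preprocessing: the rule does not fire
normalized = eliza_reply("i am sad")   # with preprocessing: the rule fires

print(raw)
print(normalized)
```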

4.2 Now let's apply a more complex regular expression that removes punctuation and splits off special characters like +, /, and -

In [None]:
import re

pattern2 = r"""(?x)                   # set flag to allow verbose regexps
               (?:[A-Z]\.)+           # don't break up abbreviations, e.g. U.S.A.
               |\d+(?:\.\d+)?%?       # allow numbers, incl. decimals and percentages ($ is not matched)
               |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
               |(?:[+/\-@&*])         # special characters with meanings
             """
text1="My cat weighs 5 kg, +/- 2 grams."
text2="I live in the U.S.A."
text3="I'm 100% sure it costs $40.00."
text4="What's the be-all-and-end-all?"
print(re.findall(pattern2, text1))
print(re.findall(pattern2, text2))
print(re.findall(pattern2, text3))
print(re.findall(pattern2, text4))

['My', 'cat', 'weighs', '5', 'kg', '+', '/', '-', '2', 'grams']
['I', 'live', 'in', 'the', 'U.S.A.']
["I'm", '100%', 'sure', 'it', 'costs', '40.00']
["What's", 'the', 'be-all-and-end-all']
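
Notice that the dollar sign in "$40.00" is dropped, because no branch of the pattern matches it. If you wanted to keep the currency symbol attached to the amount, one possible variant is sketched below. The name `pattern3` is hypothetical and the only change is an optional `\$` at the start of the number branch.

```python
import re

# Hypothetical variant of pattern2: the number branch accepts a leading dollar sign
pattern3 = r"""(?x)                   # set flag to allow verbose regexps
               (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
               |\$?\d+(?:\.\d+)?%?    # numbers with optional dollar prefix and percent suffix
               |\w+(?:[-']\w+)*       # words with optional internal hyphens/apostrophes
               |(?:[+/\-@&*])         # special characters with meanings
             """

tokens = re.findall(pattern3, "I'm 100% sure it costs $40.00.")
print(tokens)
# ["I'm", '100%', 'sure', 'it', 'costs', '$40.00']
```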


## 5 Tokenization using spaCy

5.1 Install required dependencies and download the language model.

In [5]:
!python3 -m spacy download en_core_web_md

from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


5.2 Import dependencies and load the language model for tokenization

In [6]:
#Run this cell to import dependencies for tokenization
import spacy
from spacy.tokens.token import Token
from typing import List

nlp = spacy.load("en_core_web_md")
sample_text = "My cat weighs 5 kg, +/- 2 grams."


5.3 Run spaCy tokenizer

In [7]:
#Run tokenization
def tokenize(text: str) -> List[Token]:
  """
  :param text: text as a Python string object
  :return: a list of objects of type Token
  """
  doc = nlp(text) # spaCy processes the text into a Doc, which is a sequence of Token objects
  return list(doc)

print("Sample Text: ", sample_text)
print("Tokenizer output: ", tokenize(sample_text))

Sample Text:  My cat weighs 5 kg, +/- 2 grams.
Tokenizer output:  [My, cat, weighs, 5, kg, ,, +, /-, 2, grams, .]
