#### StopWords

In [2]:
tweet = """I’m amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public finetuned checkpoints, is good enough for the job.

Both impressed, and a little disappointed how rarely I get to actually train a model that matters :("""

In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pedro\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [5]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

stop_words[:5]

['i', 'me', 'my', 'myself', 'we']

In [6]:
stop_words = set(stop_words)

In [7]:
tweet = tweet.lower().split()

tweet

['i’m',
 'amazed',
 'how',
 'often',
 'in',
 'practice,',
 'not',
 'only',
 'does',
 'a',
 '@huggingface',
 'nlp',
 'model',
 'solve',
 'your',
 'problem,',
 'but',
 'one',
 'of',
 'their',
 'public',
 'finetuned',
 'checkpoints,',
 'is',
 'good',
 'enough',
 'for',
 'the',
 'job.',
 'both',
 'impressed,',
 'and',
 'a',
 'little',
 'disappointed',
 'how',
 'rarely',
 'i',
 'get',
 'to',
 'actually',
 'train',
 'a',
 'model',
 'that',
 'matters',
 ':(']

In [8]:
tweet_no_stopwords = [word for word in tweet if word not in stop_words]

print("With stopwords:", ' '.join(tweet))
print("Without:", ' '.join(tweet_no_stopwords))

With stopwords: i’m amazed how often in practice, not only does a @huggingface nlp model solve your problem, but one of their public finetuned checkpoints, is good enough for the job. both impressed, and a little disappointed how rarely i get to actually train a model that matters :(
Without: i’m amazed often practice, @huggingface nlp model solve problem, one public finetuned checkpoints, good enough job. impressed, little disappointed rarely get actually train model matters :(
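
Notice that tokens such as `practice,` and `job.` keep their punctuation after a plain `split()`, so a stop word followed by a comma would slip past the filter. The following is a minimal sketch of one way to handle this. It assumes a tiny hand-written stop-word set (`sample_stop_words`, which is not NLTK's list) and two hypothetical helper functions:

```python
import string

# Small illustrative stop-word set (an assumption; NLTK's English list is much longer).
sample_stop_words = {"i", "a", "the", "is", "for", "to", "how", "in"}


def strip_punctuation(token):
    """Remove leading and trailing ASCII punctuation from a single token."""
    return token.strip(string.punctuation)


def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, strip punctuation, then drop stop words."""
    tokens = [strip_punctuation(tok) for tok in text.lower().split()]
    # Tokens made only of punctuation become empty strings and are dropped too.
    return [tok for tok in tokens if tok and tok not in stop_words]


filtered = remove_stop_words("Is the model good enough for the job?", sample_stop_words)
print(filtered)
```

Stripping punctuation this aggressively would also turn `@huggingface` into `huggingface` and discard the `:(` emoticon entirely, so whether it is appropriate depends on the task.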


### Tokens Introduction

Typically in NLP, models consume *tokens*. A token can represent many different things, such as:

* A word
* Part of a word
* A single character
* A punctuation mark [,!-.]
* A special token like <URL> or <NAME>
* Model-specific special tokens, like [CLS] and [SEP] for BERT
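
To make the different token granularities concrete, here is a small sketch using only the Python standard library. The sample sentence and the regular expression are illustrative choices, not something a particular model uses:

```python
import re

text = "NLP models rock!"

# Word-level: split on whitespace (punctuation stays attached to the word).
word_tokens = text.split()

# Word + punctuation: words and punctuation marks become separate tokens.
word_punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces, is its own token.
char_tokens = list(text)

print(word_tokens)
print(word_punct_tokens)
print(char_tokens[:5])
```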

The BERT transformer model uses *five* special tokens:

| Token | Meaning |
| --- | --- |
| **[PAD]** | Padding token; lets us maintain same-length sequences (512 tokens for BERT) even when sentences of different lengths are fed in |
| **[UNK]** | Used when a word is unknown to BERT |
| **[CLS]** | Appears at the start of every sequence |
| **[SEP]** | Indicates a separator or end of sequence |
| **[MASK]** | Used when masking tokens, for example in training with masked language modelling (MLM) |

So if we take the *'NLP models'* tweet, processing it directly with our BERT-specific tokens might look like this:

```
['[CLS]', '[UNK]', 'thinks', 'that', 'the', 'nlp', 'models', 'that', '[UNK]', 'made', 'are', 'super', 'cool', '[SEP]', '[PAD]', '[PAD]', ..., '[PAD]']
```

Here, we have:

* Applied the **\[CLS\]** token to indicate the start of the sequence.
* Both username tokens, *@elonmusk* and *@joebloggs*, were not 'known' words to BERT, so they were replaced with the unknown token **\[UNK\]**. Alternatively, we could have replaced them with our own special **user** token.
* Added the **\[SEP\]** token to the end of our sequence.
* Padded the sequence up to the required length of 512 tokens *(required due to the fixed input sequence length of the BERT model)* using **\[PAD\]** tokens.
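
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `add_special_tokens` and `toy_vocab` are hypothetical names, the vocabulary is a handful of words rather than BERT's real one, the sequence length is 16 instead of 512, and the sketch ignores the subword splitting that a real BERT tokenizer performs:

```python
def add_special_tokens(tokens, vocab, max_len=16):
    """Wrap a token list with [CLS]/[SEP], map unknown words to [UNK], and pad."""
    # Replace every out-of-vocabulary token with [UNK].
    known = [tok if tok in vocab else "[UNK]" for tok in tokens]
    # Leave room for the [CLS] and [SEP] tokens.
    known = known[: max_len - 2]
    sequence = ["[CLS]"] + known + ["[SEP]"]
    # Pad on the right until the sequence has exactly max_len tokens.
    sequence += ["[PAD]"] * (max_len - len(sequence))
    return sequence


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"thinks", "that", "the", "nlp", "models", "made", "are", "super", "cool"}

tokens = "@elonmusk thinks that the nlp models that @joebloggs made are super cool".split()
encoded = add_special_tokens(tokens, toy_vocab, max_len=16)
print(encoded)
```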

#### Stemming

In [None]:
txt = "I am amazed by how amazingly amazing you are"

We use different forms of the word amaze a total of three times. Each of these different forms is called an 'inflection', which is the modification of a word to slightly adjust its meaning or context. When we tokenize this text, we produce a different token for each of the three inflections of amaze. That is okay, but in many applications this level of granularity in the semantic meaning of the word is not required and can damage model performance.

Later, when we use more sophisticated models (e.g. BERT), we will use different methods that maintain the inflection of each word. Still, it is important to understand stemming, as it was a core part of text preprocessing for a very long time and is still relevant to many applications.

To apply stemming we will use the NLTK package, which provides several different stemmers. We will test the PorterStemmer and LancasterStemmer.

In [1]:
words_to_stem = ['happy', 'happiest', 'happier', 'cactus', 'cactii', 'elephant', 'elephants', 'amazed', 'amazing', 'amazingly', 'cement', 'owed', 'maximum']

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

stemmed = [(porter.stem(word), lancaster.stem(word)) for word in words_to_stem]

print("Porter | Lancaster")
for stem in stemmed:
    print(f"{stem[0]} | {stem[1]}")

Porter | Lancaster
happi | happy
happiest | happiest
happier | happy
cactu | cact
cactii | cacti
eleph | eleph
eleph | eleph
amaz | amaz
amaz | amaz
amazingli | amaz
cement | cem
owe | ow
maximum | maxim
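
To see why stems such as `amaz` or `cactu` are not real words, it helps to remember that stemmers are rule-based suffix strippers. The sketch below is a deliberately naive illustration and is not the Porter or Lancaster algorithm; `SUFFIXES` and `toy_stem` are hypothetical names introduced here:

```python
# Suffixes are checked in order, longest first (an illustrative rule set).
SUFFIXES = ("ingly", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["amazed", "amazing", "amazingly", "elephants", "cactus"]]
print(stems)
```

Even this tiny rule set maps the three forms of amaze to one stem, and it also over-stems `cactus` to `cactu`, the same behaviour seen in the Porter column above.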


#### Lemmatization

Lemmatization is very similar to stemming in that it reduces a set of inflected words down to a common word. The difference is that lemmatization reduces inflections down to their real root word, which is called a lemma. If we take the words 'amaze', 'amazed', and 'amazing', the lemma of all of these (treated as verbs) is 'amaze', whereas stemming would usually return 'amaz'. Lemmatization is generally seen as more advanced than stemming.

In [5]:
words = ['amaze', 'amazed', 'amazing']

In [6]:
import nltk

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pedro\AppData\Roaming\nltk_data...


True

In [12]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [15]:
from nltk.corpus import wordnet

[lemmatizer.lemmatize(word, wordnet.VERB) for word in words]

['amaze', 'amaze', 'amaze']
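
Unlike a stemmer, a lemmatizer relies on a dictionary lookup that depends on the part of speech, which is why `wordnet.VERB` is passed above. The following toy sketch illustrates the idea with a hand-written table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical and stand in for WordNet only for illustration:

```python
# (word, part of speech) -> lemma. A tiny hand-written table, not WordNet data.
TOY_LEMMAS = {
    ("amazed", "v"): "amaze",
    ("amazing", "v"): "amaze",
    ("amazing", "a"): "amazing",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)


as_verbs = [toy_lemmatize(w, "v") for w in ["amaze", "amazed", "amazing"]]
as_adjective = toy_lemmatize("amazing", "a")
print(as_verbs, as_adjective)
```

The same surface form can therefore have different lemmas depending on how it is used in the sentence.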

# Normalization

Unicode normalization is used to *normalize* different but similar characters. For example, the following Unicode characters (and character combinations) are equivalent:

**Canonical Equivalence**

| | | Equivalence Reason |
| --- | --- | --- |
| Ç | C◌̧ | Combined character sequences |
| 가 | ᄀ ᅡ | Conjoined Korean characters |

**Compatibility equivalence**

| | | Equivalence Reason |
| --- | --- | --- |
| ℌ | H | Font variant |
| \[NBSP\] | \[SPACE\] | Line-breaking difference |
| ① | 1 | Circled variant |
| x² | x2 | Superscript |
| xⱼ | xj | Subscript |
| ½ | 1/2 | Fractions |
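
The compatibility mappings in this table can be checked with Python's built-in `unicodedata` module. This is a small sketch; the `compat_examples` dictionary is just an illustrative selection of rows from the table:

```python
import unicodedata

# Source string -> expected result after compatibility normalization (NFKC).
compat_examples = {
    "\u210C": "H",          # black-letter capital H (font variant)
    "\u00A0": " ",          # no-break space
    "\u2460": "1",          # circled digit one
    "x\u00B2": "x2",        # superscript two
    "\u00BD": "1\u20442",   # one half -> '1', fraction slash (U+2044), '2'
}

results = {src: unicodedata.normalize("NFKC", src) for src in compat_examples}
for src, normalized in results.items():
    print(repr(src), "->", repr(normalized))
```

Note that the fraction is normalized to `1⁄2` using the fraction slash character (U+2044) rather than the ASCII `/`.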

We have mentioned two different types of equivalence here: canonical and compatibility equivalence.

**Canonical equivalence** means both forms are fundamentally the same and are indistinguishable when rendered. For example, we can take the single code point `'Ç' \u00C7`, or the pair `'C' \u0043` and `'̧' \u0327`. When the latter two characters are rendered together, they look the same as the first character.

**Compatibility equivalence** is weaker: both forms represent the same abstract character, but they may look or behave differently, as with ℌ and H.

#### Normal Forms

It is in these cases that we use Unicode normalization to *normalize* our characters into matching forms. As there are different forms of equivalence, there are also different forms of normalization. Each is called a **N**ormal **F**orm, and there are four of them:

| Name | Abbreviation | Description | Example |
| --- | --- | --- | --- |
| Form D | NFD | *Canonical* decomposition | `Ç` → `C ̧` |
| Form C | NFC | *Canonical* decomposition followed by *canonical* composition | `Ç` → `C ̧` → `Ç` |
| Form KD | NFKD | *Compatibility* decomposition | `ℌ ̧` → `H ̧` |
| Form KC | NFKC | *Compatibility* decomposition followed by *canonical* composition | `ℌ ̧` → `H ̧` → `Ḩ` |
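
The last two rows of the table can be reproduced by printing the code points each form produces. This is a short sketch using the standard `unicodedata` module; `code_points` and `all_normal_forms` are helper names introduced here for illustration:

```python
import unicodedata


def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def all_normal_forms(text):
    """Apply each of the four normal forms and list the resulting code points."""
    return {
        form: code_points(unicodedata.normalize(form, text))
        for form in ("NFD", "NFC", "NFKD", "NFKC")
    }


# Script capital H followed by a combining cedilla.
forms = all_normal_forms("\u210B\u0327")
for form, points in forms.items():
    print(form, points)
```

Only the two compatibility forms change this input: NFKD yields H plus the combining cedilla, and NFKC composes them into the single code point U+1E28.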

Let's take a look at each of these forms in action. Our C with cedilla character Ç can be represented in two ways, as a single character called *Latin capital C with cedilla* (*\u00C7*), or as two characters called *Latin capital C* (*\u0043*) and *combining cedilla* (*\u0327*):

##### NFD and NFC

In [16]:
import unicodedata

c_with_cedilla = "\u00C7"  # Latin capital C with cedilla (single character)
c_with_cedilla

'Ç'

In [17]:
c_plus_cedilla = "\u0043\u0327"  # \u0043 = Latin capital C, \u0327 = 'combining cedilla' (two characters)
c_plus_cedilla

'Ç'

If we perform NFD on our C with cedilla character \u00C7, we decompose the character into its smaller components: the Latin capital C character and the combining cedilla character, \u0043 + \u0327. This means that if we compare an NFD-normalized C with cedilla to the two-character sequence of C plus combining cedilla, the comparison returns true:

In [18]:
unicodedata.normalize('NFD', c_with_cedilla) == c_plus_cedilla

True

However, if we perform NFC on our C with cedilla character \u00C7, the character is decomposed into the smaller components \u0043 + \u0327 and then composed back into \u00C7, so it will not match the two-character sequence:

In [19]:
unicodedata.normalize('NFC', c_with_cedilla) == c_plus_cedilla

False

But if we instead apply NFC to our two characters \u0043 + \u0327, they are first decomposed (which does nothing, as they are already decomposed) and then composed into the single \u00C7 character:

In [20]:
c_with_cedilla == unicodedata.normalize('NFC', c_plus_cedilla)

True

##### NFKD and NFKC

The NFK forms go further than canonical decomposition: they also replace compatibility characters with their plain equivalents. Both the black-letter ℌ (\u210C) and the script ℋ (\u210B) are such characters. Neither has a canonical decomposition, so NFD or NFC will do nothing to them. However, if we apply NFKD, we will find that our fancy ℌ becomes a plain, boring H (\u0048):

In [21]:
unicodedata.normalize('NFKD', 'ℌ')

'H'

In [23]:
"\u210B\u0327"

'ℋ̧'

Applying compatibility decomposition (NFKD) gives us a capital H character and a combining cedilla character as two separate code points:

In [24]:
unicodedata.normalize('NFKD', "\u210B\u0327").encode('utf-8')

b'H\xcc\xa7'

But if we apply NFKC, we first perform compatibility decomposition into the two separate characters, and then merge them during canonical composition:

In [25]:
unicodedata.normalize('NFKC', "\u210B\u0327").encode('utf-8')

b'\xe1\xb8\xa8'

Because the only difference between these two forms is the final canonical composition, we see no difference between the two results when they are rendered:

In [27]:
unicodedata.normalize('NFKC', "\u210B\u0327"), unicodedata.normalize('NFKD', "\u210B\u0327"), 

('Ḩ', 'Ḩ')
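
Although the two results render identically, they are different strings underneath. A brief sketch (variable names are illustrative) that checks their lengths and equality:

```python
import unicodedata

script_h_cedilla = "\u210B\u0327"
nfkc = unicodedata.normalize("NFKC", script_h_cedilla)
nfkd = unicodedata.normalize("NFKD", script_h_cedilla)

# NFKC is one composed code point; NFKD is a base letter plus a combining mark.
print(len(nfkc), len(nfkd))

# A plain comparison treats them as different strings.
print(nfkc == nfkd)

# Normalizing both sides to the same form makes them comparable.
print(unicodedata.normalize("NFKC", nfkd) == nfkc)
```

This is why it is usually safest to pick one normal form and apply it consistently before comparing or tokenizing text.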