Natural Language Processing (NLP) covers the translation of human language, in written or spoken form, into a format that machines can process.

### Tokenization

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "EEBDA is fun!",
    "EEBDA is a great course.",
    "house",
    "House",
    "Houses"]

tokenizer = Tokenizer(num_words = 11) # the maximum number of words to keep, based on word frequency
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index # the words and the assigned integers are stored
print(word_index)


{'eebda': 1, 'is': 2, 'house': 3, 'fun': 4, 'a': 5, 'great': 6, 'course': 7, 'houses': 8}


More frequent words such as “EEBDA” or “is” are assigned lower integers. Upper-case letters are converted to lower case, so “house” and “House” are mapped to a single token (`'house': 3`). This makes sense because both spellings carry the same information for subsequent model evaluations. Note, however, that “houses” receives its own token: plural forms are not automatically mapped to the singular form. The same holds for irregular plurals such as “mice”, as the next cell shows.
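To make the indexing rule concrete, here is a minimal pure-Python sketch of how such a word index could be built. The helper `build_word_index` is a hypothetical stand-in written for illustration, not the Keras implementation; it assumes that filtered characters are replaced by spaces, text is lower-cased, and words are ranked by descending frequency with ties broken by first appearance.

```python
from collections import Counter

# Default punctuation filter string of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, filters=FILTERS):
    """Simplified stand-in for Tokenizer.fit_on_texts (not the Keras code)."""
    # Replace every filtered character with a space so it acts as a separator.
    table = str.maketrans({ch: " " for ch in filters})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ["EEBDA is fun!", "EEBDA is a great course.", "house", "House", "Houses"]
word_index = build_word_index(sentences)
print(word_index)
```

Under these assumptions the sketch yields the same mapping as the cell above: “house” and “House” share index 3, while “houses” gets its own entry.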

In [26]:
sensitive_cases = [ 'mouse', 'Mouse', 'Mice']

tokenizer_sensitive_cases = Tokenizer(num_words = 10) 
tokenizer_sensitive_cases.fit_on_texts(sensitive_cases)

print(tokenizer_sensitive_cases.word_index)

{'mouse': 1, 'mice': 2}


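The `num_words` parameter sets the maximum number of words to keep, based on word frequency. Only the most frequent `num_words - 1` words are kept when texts are later converted to sequences; `word_index` itself still lists every word seen. In the example below, we set `num_words = 100`, far more than the five distinct words in the corpus.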

In [17]:
sentences_2 = [
    "I love my dog",
    "I love my cat"
]

# We create an instance of a tokenizer object
tokenizer_2 = Tokenizer(num_words = 100)

# Go through the text
tokenizer_2.fit_on_texts(sentences_2)

print(tokenizer_2.word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
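Since `word_index` is not truncated, the effect of `num_words` only becomes visible when text is converted into integer sequences. The following plain-Python sketch illustrates that cutoff. The function `to_sequence` is a hypothetical helper, not part of Keras; it assumes that words whose index is `num_words` or higher are dropped and that no out-of-vocabulary token is configured.

```python
def to_sequence(text, word_index, num_words=None):
    """Simplified stand-in for converting one text to a list of integers."""
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None:
            continue  # unknown word: dropped (no OOV token in this sketch)
        if num_words is not None and index >= num_words:
            continue  # too infrequent: cut off by num_words
        sequence.append(index)
    return sequence

word_index = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

seq_all = to_sequence("I love my dog", word_index, num_words=100)
seq_cut = to_sequence("I love my dog", word_index, num_words=4)
print(seq_all)  # [1, 2, 3, 4]
print(seq_cut)  # [1, 2, 3]
```

With `num_words=4`, only indices 1 to 3 survive, so “dog” (index 4) disappears from the sequence even though it is still present in `word_index`.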


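The `Tokenizer` also handles punctuation. For example, if we extend `sentences_2` with a third sentence ending in `"dog!"`, the exclamation mark is stripped and no separate token is created for `"dog!"`. Because `fit_on_texts` is called again on the same tokenizer, the word counts accumulate, which is why the order of the indices changes: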

In [18]:
sentences_2 = [
    "I love my dog",
    "I love my cat",
    "You love my dog!"
]

tokenizer_2.fit_on_texts(sentences_2)
print(tokenizer_2.word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


>  By default, all punctuation is removed, turning the texts into space-separated sequences of words.

In [19]:
help(Tokenizer)

Help on class Tokenizer in module keras.src.preprocessing.text:

class Tokenizer(builtins.object)
 |  Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, analyzer=None, **kwargs)
 |  
 |  Text tokenization utility class.
 |  
 |  Deprecated: `tf.keras.preprocessing.text.Tokenizer` does not operate on
 |  tensors and is not recommended for new code. Prefer
 |  `tf.keras.layers.TextVectorization` which provides equivalent functionality
 |  through a layer which accepts `tf.Tensor` input. See the
 |  [text loading tutorial](https://www.tensorflow.org/tutorials/load_data/text)
 |  for an overview of the layer and text handling in tensorflow.
 |  
 |  This class allows to vectorize a text corpus, by turning each
 |  text into either a sequence of integers (each integer being the index
 |  of a token in a dictionary) or into a vector where the coefficient
 |  for each token could be binary, based on word count, ba

We can also change the `filters` argument so that most punctuation marks are kept:

In [22]:
sentences_2 = [
    "I love my dog-",
    "I love my cat?",
    "You love my dog!"
]

tokenizer_2 = Tokenizer(num_words = 100, filters=".") # only the period is filtered; all other punctuation is kept
tokenizer_2.fit_on_texts(sentences_2)
print(tokenizer_2.word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog-': 4, 'cat?': 5, 'you': 6, 'dog!': 7}
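Note that `filters` is a plain string of characters to remove, not a regular expression, so `"."` matches only a literal period. A small sketch in plain Python shows the idea; `split_with_filters` is a hypothetical helper that mimics this behaviour under the assumption that each filtered character is replaced by a space before splitting.

```python
def split_with_filters(text, filters):
    """Lower-case, replace each character in `filters` by a space, then split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

kept = split_with_filters("You love my dog!", filters=".")
period = split_with_filters("You love my dog.", filters=".")
custom = split_with_filters("You love my dog!", filters="!?-")

print(kept)    # ['you', 'love', 'my', 'dog!']
print(period)  # ['you', 'love', 'my', 'dog']
print(custom)  # ['you', 'love', 'my', 'dog']
```

To strip only selected punctuation marks, list exactly those characters in `filters`.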


### Stemming and Lemmatization

Stemming reduces a word to its stem by rule-based removal of suffixes; for example, “Houses” becomes “House”. Lemmatization instead refers to a dictionary, which allows a correct mapping to the base form (lemma) even for irregular forms.

In [4]:
import spacy
import pandas as pd

# spacy.cli.download("en_core_web_md")

nlp = spacy.load("en_core_web_md") # load english model

document = nlp("EEBDA is SO much fun!")

pd.DataFrame({"Token": [word.text for word in document],
              "Base": [word.lemma_ for word in document]})

|   | Token | Base  |
|---|-------|-------|
| 0 | EEBDA | EEBDA |
| 1 | is    | be    |
| 2 | SO    | so    |
| 3 | much  | much  |
| 4 | fun   | fun   |
| 5 | !     | !     |
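The difference between the two approaches can be sketched with toy code. Both functions below are deliberately naive illustrations, not real algorithms: `naive_stem` only strips a trailing plural “s”, and `LEMMA_TABLE` is a hypothetical three-entry dictionary standing in for the much larger lookup tables and rules of a real lemmatizer such as the one in spaCy.

```python
def naive_stem(word):
    """Toy rule-based stemmer: strips a trailing plural 's' (not Porter etc.)."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

# Hypothetical mini-dictionary, for illustration only.
LEMMA_TABLE = {"houses": "house", "mice": "mouse", "is": "be"}

def naive_lemmatize(word):
    """Toy dictionary-based lemmatizer: falls back to the lower-cased word."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["Houses", "Mice", "is"]:
    print(w, "->", naive_stem(w), "|", naive_lemmatize(w))
```

The rule-based stemmer handles “Houses” but leaves “Mice” and “is” unchanged, whereas the dictionary lookup maps them to “mouse” and “be”.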


### Word Embeddings

Each word is now mapped to an n-dimensional vector (an embedding), which decomposes a single word into numerical components that can be compared across words.

In [5]:
embed = nlp("dog")
    
embed.vector[0:10] # show first 10 entries for embedding

array([  1.233  ,   4.2963 ,  -7.9738 , -10.121  ,   1.8207 ,   1.4098 ,
        -4.518  ,  -5.2261 ,  -0.29157,   0.95234], dtype=float32)

In [6]:
doc1 = nlp("dog")
doc2 = nlp("cat")

# Similarity of two words
doc1.similarity(doc2)

0.8220816752553904
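To our knowledge, spaCy's `similarity` method defaults to the cosine similarity of the underlying vectors. A minimal sketch of that measure in plain Python follows; the three-dimensional vectors are hypothetical toy values chosen for illustration, not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (real models use hundreds of dimensions).
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, cat))  # close to 1: similar direction
print(cosine_similarity(dog, car))  # noticeably lower
```

A value near 1 means the two vectors point in almost the same direction, which is how the high similarity between “dog” and “cat” above can be read.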