Natural Language Processing (NLP) describes the translation of human language, in written or spoken form, into a format that machines can process.

### Tokenization

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "EEBDA is fun!",
    "EEBDA is a great course.",
    "house",
    "House",
    "Houses"]

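tokenizer = Tokenizer(num_words=11)  # only the num_words - 1 most frequent words are kept when texts are converted to sequences
tokenizer.fit_on_texts(sentences)  # lowercases the text, filters punctuation, and counts the words
word_index = tokenizer.word_index  # word -> integer mapping (contains every word seen, regardless of num_words)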
print(word_index)

{'eebda': 1, 'is': 2, 'house': 3, 'fun': 4, 'a': 5, 'great': 6, 'course': 7, 'houses': 8}


More frequent words such as “EEBDA” or “is” are assigned lower integers. Upper-case letters are converted to lower case, and punctuation is removed. Consequently, “house” and “House” are mapped to a single token (`'house': 3`), which makes sense because both forms carry the same information for subsequent modeling. Note, however, that “houses” receives its own token: plural forms are not automatically mapped to the singular form.
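The indexing behaviour described above can be imitated in a few lines of plain Python. This is a minimal sketch, not the Keras implementation: the helper name `build_word_index` and the simplified punctuation handling (punctuation is simply deleted) are assumptions made for illustration.

```python
from collections import Counter
import string

def build_word_index(texts):
    """Toy frequency-ranked word index (hypothetical helper, not the Keras code)."""
    counts = Counter()
    for text in texts:
        # Lowercase and drop punctuation so "House" and "house" count as one word
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        counts.update(cleaned.split())
    # most_common() sorts by frequency; ties keep their first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ["EEBDA is fun!", "EEBDA is a great course.", "house", "House", "Houses"]
word_index = build_word_index(sentences)
print(word_index)
```

For the example sentences, this sketch yields the same mapping as the tokenizer output above: “house” and “House” share index 3, while “houses” is indexed separately.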

### Stemming and Lemmatization

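Stemming reduces a word to a stem by stripping suffixes with heuristic rules, so that “Houses” and “House” end up with the same stem. The resulting stem is not guaranteed to be a valid dictionary word. Lemmatization instead maps each word to its base form (lemma) by referring to a dictionary, which allows a correct transfer to the root word even for irregular forms.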
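The difference between the two approaches can be sketched with a toy example in plain Python. The suffix rules in `naive_stem` and the entries in `LEMMA_DICT` are hypothetical and chosen only for illustration; real stemmers and lemmatizers use far larger rule sets and vocabularies.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix (illustrative rules only)."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s", "e"):
        # Keep at least three characters so short words like "is" stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma
LEMMA_DICT = {"houses": "house", "is": "be", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the lowercased word."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)

for w in ["Houses", "House", "is", "mice"]:
    print(w, naive_stem(w), toy_lemmatize(w))
```

In this sketch the stemmer maps both “Houses” and “House” to the non-word “hous”, whereas the dictionary lookup returns “house” and also handles irregular forms such as “mice”.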

In [4]:
import spacy
import pandas as pd

# spacy.cli.download("en_core_web_md")

nlp = spacy.load("en_core_web_md") # load english model

document = nlp("EEBDA is SO much fun!")

pd.DataFrame({"Token": [word.text for word in document],
              "Base": [word.lemma_ for word in document]})

|   | Token | Base  |
|---|-------|-------|
| 0 | EEBDA | EEBDA |
| 1 | is    | be    |
| 2 | SO    | so    |
| 3 | much  | much  |
| 4 | fun   | fun   |
| 5 | !     | !     |


### Word Embeddings

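Word embeddings map each word to an n-dimensional vector of real numbers. The meaning of a single word is thereby decomposed into numeric components, and words that are used in similar contexts receive similar vectors.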

In [5]:
embed = nlp("dog")
    
embed.vector[0:10] # show first 10 entries for embedding

array([  1.233  ,   4.2963 ,  -7.9738 , -10.121  ,   1.8207 ,   1.4098 ,
        -4.518  ,  -5.2261 ,  -0.29157,   0.95234], dtype=float32)

In [6]:
doc1 = nlp("dog")
doc2 = nlp("cat")

# Similarity of two words
doc1.similarity(doc2)

0.8220816752553904
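A similarity score of this kind is typically a cosine similarity between the two embedding vectors; according to the spaCy documentation, this is also the default estimate behind `similarity`. The following sketch computes it by hand. The three-dimensional vectors are made up for illustration and are not taken from the spaCy model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", made up for illustration
dog = [1.0, 2.0, 0.5]
cat = [0.9, 1.8, 0.7]
car = [-1.0, 0.2, 2.0]

print(cosine_similarity(dog, cat))  # vectors point in a similar direction
print(cosine_similarity(dog, car))  # vectors point in different directions
```

Values close to 1 indicate vectors that point in nearly the same direction, while values near 0 indicate unrelated directions.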