# 1. Tokenization

**1. Kochmar mentions several steps required in a typical NLP pipeline, one of them being *Split into words*. Why is this step necessary? Why can we not just feed the text as it is into a model?**

The five steps are:
1. Define classes
2. Split into words
3. Extract features
4. Train classifier
5. Test & evaluate

Step 2, *Split into words*, is necessary to identify the individual words in the text. Processing the entire text as one unit is rarely useful, because it overlooks the specific meaning and function of each word. The text is therefore split into words, which become the units that the feature-extraction process in Step 3 operates on.

**2. Simply splitting on "words" (i.e. whitespace) is rarely enough. Consider the sentence below ("That U.S.A. poster-print costs $12.40...") and name some problems that arise from splitting on whitespace.**

Splitting the text on whitespace gives ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']. The main problem is that punctuation marks are not separated from the words they are attached to, which makes it hard to extract the actual words and values accurately. This is especially evident in '$12.40...': the currency symbol '$' carries significant information and should be its own token, the amount '12.40' should be a single token, and the trailing ellipsis '...' is punctuation rather than part of the price, so it should ideally be a separate token as well. Whitespace alone also cannot tell that the periods in 'U.S.A.' belong to the abbreviation while those in the ellipsis do not.

In [6]:
# If you wish, experiment with implementing different rules for tokenization. You will see that the "ruleset" quickly grows if you want to account for all types of edge cases...
sentence = "That U.S.A. poster-print costs $12.40..."

def your_rulebased_tokenizer(sentence):
    tokens = []
    return tokens

your_rulebased_tokenizer(sentence)

[]
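
As a minimal sketch of what such a rule set could look like, the regular expression below handles exactly the edge cases discussed above (dotted abbreviations, decimal amounts, hyphenated compounds and the ellipsis). The names `TOKEN_PATTERN` and `rule_based_tokenize` are made up for this illustration, and the rules are assumptions tailored to this one sentence rather than a general-purpose tokenizer.

```python
import re

# Hypothetical rule set, ordered from most specific to least specific.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}      # dotted abbreviations such as U.S.A.
  | \d+(?:\.\d+)?           # numbers with an optional decimal part
  | \w+(?:-\w+)*            # words, including hyphenated compounds
  | \.{3}                   # an ellipsis kept as one token
  | [^\w\s]                 # any other single symbol or punctuation mark
    """,
    re.VERBOSE,
)


def rule_based_tokenize(sentence):
    """Return all non-overlapping matches of the rule set, left to right."""
    return TOKEN_PATTERN.findall(sentence)


sentence = "That U.S.A. poster-print costs $12.40..."
print(rule_based_tokenize(sentence))
```

Because the alternatives are tried in order, moving the generic word rule above the abbreviation or number rule would change the result, which is one reason such rule sets grow hard to maintain.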

NLTK has several tokenizers implemented, such as a specific one for Twitter data. Below, indicated by the `TODO`-tag, you should find and import various tokenizers and add them to the list of tokenizers:

`tokenizers = [tokenizer1, tokenizer2, ..., tokenizerN]`

Tokenize the sentence with at least three different tokenizers supplied by NLTK and comment on your findings. You will find the documentation for NLTK's tokenizers [here](https://www.nltk.org/_modules/nltk/tokenize.html) useful.

In [112]:
from typing import List
from nltk.tokenize import word_tokenize, wordpunct_tokenize, regexp_tokenize

# this is the base class of tokenizers in nltk
from nltk.tokenize.api import TokenizerI


# this is just a simple example of how a tokenizer can be implemented
class MyWhitespaceTokenizer(TokenizerI):
    def __init__(self):
        super().__init__()

    def tokenize(self, text: str) -> List[str]:
        return text.split()


sentence = "That U.S.A. poster-print costs $12.40..."

# ************************************************************

# NLTK's default word tokenizer (Punkt sentence splitting followed by Treebank-style word tokenization)
class WordTokenizer(TokenizerI):
    def tokenize(self, text: str) -> List[str]:
        return word_tokenize(text)

# tokenizer which separates runs of word characters from runs of punctuation
class WordPunctTokenizer(TokenizerI):
    def tokenize(self, text: str) -> List[str]:
        return wordpunct_tokenize(text)

# tokenizer which keeps only runs of word characters (pattern \w+) and discards punctuation
class RegExpTokenizer(TokenizerI):
    def tokenize(self, text: str) -> List[str]:
        return regexp_tokenize(text, pattern=r'\w+')
    
# ************************************************************
tokenizers = [
    MyWhitespaceTokenizer(),
    WordTokenizer(),
    WordPunctTokenizer(),
    RegExpTokenizer()
]

# Leave this function as-is
def tokenize(tokenizers: List[TokenizerI], sentence: str) -> None:
    for tokenizer in tokenizers:
        assert isinstance(tokenizer, TokenizerI)
        tokenized = tokenizer.tokenize(sentence)
        print(f"{tokenizer.__class__.__name__} ({len(tokenized)} tokens)\n{tokenized}\n")


tokenize(tokenizers, sentence)

MyWhitespaceTokenizer (5 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']

WordTokenizer (7 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']

WordPunctTokenizer (16 tokens)
['That', 'U', '.', 'S', '.', 'A', '.', 'poster', '-', 'print', 'costs', '$', '12', '.', '40', '...']

RegExpTokenizer (9 tokens)
['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']



The findings show that each tokenizer handles the sentence quite differently. The MyWhitespaceTokenizer keeps "U.S.A." and "poster-print" intact, but also leaves "$12.40..." as a single token. In contrast, both the WordTokenizer and the WordPunctTokenizer separate the currency symbol and the ellipsis from the number. They differ in how far they split: the WordTokenizer keeps the amount "12.40" as a single token, while the WordPunctTokenizer breaks it into "12", "." and "40". Likewise, "poster-print" is a single token for the WordTokenizer but three tokens for the WordPunctTokenizer, and "U.S.A." is a single token for the WordTokenizer but six tokens ("U", ".", "S", ".", "A", ".") for the WordPunctTokenizer, which causes the abbreviation to lose its meaning. The RegExpTokenizer, with the pattern `\w+`, discards all punctuation and symbols and keeps only runs of letters and digits, so "$" is lost and both "U.S.A." and "12.40" are split into pieces.

In this example, the WordTokenizer seems to be the most appropriate tokenizer, as it keeps amounts, abbreviations and compound terms intact. In other applications, however, the WordPunctTokenizer might be more useful, for instance when punctuation needs to be separated from the words and analysed on its own.

# 2. Language modeling
We have now studied the bigger models like BERT and GPT-based language models. A simpler language model, however, can be implemented using n-grams.

**1. What is an n-gram?**

An n-gram is a contiguous sequence of n items, typically characters or words, taken from a text. The "n" is the length of the sequence: a unigram is a single character or word, a bigram is a sequence of two, a trigram a sequence of three, and so on. In NLP, n-grams are commonly used to predict text from the preceding context.
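
To make the definition concrete, here is a small sketch that builds n-grams in plain Python by sliding a window of length n over a sequence. The helper name `make_ngrams` is hypothetical and is not part of NLTK; it works on word lists and on strings (character n-grams) alike.

```python
def make_ngrams(items, n):
    """Return every contiguous window of n items as a tuple."""
    # zip stops at the shortest slice, so only complete windows are produced
    return list(zip(*(items[i:] for i in range(n))))


tokens = ["I", "would", "pay", "for", "it"]
print(make_ngrams(tokens, 2))   # word bigrams
print(make_ngrams(tokens, 3))   # word trigrams
print(make_ngrams("cost", 2))   # character bigrams
```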

**2. Use NLTK to print out bigrams and trigrams for the given sentence below. Your function should support any number of N.**

In [113]:
from nltk import ngrams, word_tokenize

sentence = "That U.S.A. poster-print costs $12.40... I'd pay $5.00 for it."

# tokenize text
def tokenize(text: str) -> List[str]:
    return word_tokenize(text)

# generate and print each n-gram
for n in [2, 3]:
    print(f"--- {n}-grams ---")
    for ngram in ngrams(tokenize(sentence), n):
        print(ngram)


--- 2-grams ---
('That', 'U.S.A.')
('U.S.A.', 'poster-print')
('poster-print', 'costs')
('costs', '$')
('$', '12.40')
('12.40', '...')
('...', 'I')
('I', "'d")
("'d", 'pay')
('pay', '$')
('$', '5.00')
('5.00', 'for')
('for', 'it')
('it', '.')
--- 3-grams ---
('That', 'U.S.A.', 'poster-print')
('U.S.A.', 'poster-print', 'costs')
('poster-print', 'costs', '$')
('costs', '$', '12.40')
('$', '12.40', '...')
('12.40', '...', 'I')
('...', 'I', "'d")
('I', "'d", 'pay')
("'d", 'pay', '$')
('pay', '$', '5.00')
('$', '5.00', 'for')
('5.00', 'for', 'it')
('for', 'it', '.')


**3. Based on your intuition for language modeling, how can n-grams be used for word predictions?**

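Based on my intuition, n-grams can be used for word prediction by counting how often words follow one another in a corpus. Given the previous n-1 words as context, the model predicts the word that most frequently followed that context, which amounts to estimating the probability of the next word given its context. The same idea works at the character level.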
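
The intuition can be sketched with a tiny bigram model: count which word follows which, then predict the most frequent follower. The corpus below is made up for illustration, and the function names `train_bigram_model` and `predict_next` are hypothetical helpers, not NLTK functions.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """For every word, count which words follow it and how often."""
    followers = defaultdict(Counter)
    for previous, current in zip(tokens, tokens[1:]):
        followers[previous][current] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of `word` and its relative frequency."""
    counts = model.get(word)
    if not counts:
        return None  # the word was never seen as context
    best, count = counts.most_common(1)[0]
    return best, count / sum(counts.values())


# toy corpus, invented for this example
tokens = "the dog barks . the dog sleeps . the cat sleeps . the dog barks .".split()
model = train_bigram_model(tokens)
print(predict_next(model, "the"))
print(predict_next(model, "dog"))
```

A larger n gives the model more context per prediction, but each specific context is then seen less often, so more text is needed to obtain reliable counts.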

**4. NLTK includes the `FreqDist` class, which produces the frequency distribution of words in a sentence. Use it to print out the two most common words in the text below.**

In [114]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

text = "That that is is that that is not. Is that it? It is. You sure? Surely it is!"

# count the frequency of each word 
fdist = FreqDist()

# convert each word to lowercase and increment its count in the frequency distribution
for word in word_tokenize(text):
    fdist[word.lower()] += 1

# print the two most frequent words
fdist.tabulate(2)

  is that 
   6    5 


**5. Use your n-gram function from question 2.2 to print out the most common trigram of the text in question 2.4**

In [115]:
from nltk import ngrams, word_tokenize
from nltk.probability import FreqDist

# the text from question 2.4
text = "That that is is that that is not. Is that it? It is. You sure? Surely it is!"

# tokenize text and lowercase each token so that "That" and "that" are counted together
def tokenize(text: str) -> List[str]:
    return [token.lower() for token in word_tokenize(text)]

# create trigrams from the tokenized text
trigrams = ngrams(tokenize(text), 3)

# count the frequency of each trigram
fdist = FreqDist(trigrams)

# print the most common trigram
fdist.tabulate(1)

('that', 'that', 'is') 
                     2 


**6. You may have discovered that you would need to implement some form of preprocessing to get the correct answer to the previous tasks. Preprocessing/cleaning/normalization is often necessary for the desired results. If you were to process the text of a news site or blog post, can you think of some preprocessing steps that would be useful?**

It would be useful to convert the text to lower case so that all words are treated equally regardless of their case. In addition, punctuation should be removed or normalized so that variants such as "U.S.A." and "USA" are treated as the same word. Spell-checking may also be worthwhile, so that misspelled words are counted together with their correct forms instead of being ignored. Contractions such as "I'd" should be expanded, for example to "I would". Finally, stemming or lemmatization would help to group related word forms, e.g. "costs" and "cost".
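
A few of these steps can be sketched with the standard library alone. The `CONTRACTIONS` table and the `preprocess` function are hypothetical and deliberately tiny; spell-checking, stemming and lemmatization are left out because they need additional resources.

```python
import re

# Hypothetical, deliberately small contraction table; a real one would be much longer.
CONTRACTIONS = {"i'd": "i would", "it's": "it is", "don't": "do not"}


def preprocess(text):
    """Lowercase, expand contractions, normalize abbreviations, drop punctuation."""
    text = text.lower()
    # naive substring replacement, good enough for a sketch
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # collapse dotted abbreviations so that "u.s.a." becomes "usa"
    text = re.sub(r"\b(?:[a-z]\.){2,}", lambda m: m.group().replace(".", ""), text)
    # keep numbers (with optional decimals) and words (with optional hyphens) only
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+(?:-[a-z]+)*", text)


print(preprocess("That U.S.A. poster-print costs $12.40... I'd pay $5.00 for it."))
```

Note that this sketch also drops the currency symbol, which may or may not be desirable depending on the task.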

# 3. Word Representations
For more information on word representations, consult the lab description file and course material.

**1. Describe the main differences between bag-of-words and one-hot encoding through examples.**

Both bag-of-words and one-hot encoding are techniques used in text preprocessing. 

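One-hot encoding, as used here, represents a document as a binary numeric vector with one position per unique word in the corpus vocabulary. The vector contains a one, the "hot" value, for every word present in the document and a zero for every word that is absent, so it only tracks the presence or absence of words. (Strictly speaking, a one-hot vector encodes a single word with exactly one position set to one; the document-level vector here combines the one-hot vectors of all words in the document.)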

Bag-of-words, on the other hand, represents a document as a "bag" of its words: the vector has the same layout, but each position holds the number of times the word occurs in the document. Word order is discarded. The counts give a rough indication of how prominent each word is in the document.

The example below demonstrates the differences between the two techniques. The one-hot encoding only consists of zeros and ones, whereas the bag-of-words consists of the frequency of a particular word in each sentence. 

In [130]:
from nltk.tokenize import regexp_tokenize

# code source: https://www.geeksforgeeks.org/one-hot-encoding-in-nlp/

# document collection
corpus = ["The quick brown fox jumps over the lazy dog.",
          "Lazy cats and quick dogs are often seen in the park.", 
          "The dog is not lazy but is rather quick and energetic."]

def tokenize(text: str) -> List[str]:
    return regexp_tokenize(text, pattern=r'\w+')

# unique set of words from the corpus
unique_words = set()
for sentence in corpus:
    for word in tokenize(sentence):
        unique_words.add(word.lower())

# map each word to an index in a dictionary 
word_to_index = {}
for i, word in enumerate(unique_words):
    word_to_index[word] = i

# initialize lists for one-hot vectors and bag-of-words vectors
one_hot = []
bag_of_words = []

# add each sentence as a vector to the lists
for sentence in corpus:
    one_hot_vector = [0] * len(unique_words)
    bag_of_words_vector = [0] * len(unique_words)

    for word in tokenize(sentence):
        index = word_to_index[word.lower()]
        one_hot_vector[index] = 1
        bag_of_words_vector[index] += 1

    one_hot.append(one_hot_vector)
    bag_of_words.append(bag_of_words_vector)

# print results
print("Unique words:", unique_words)
print("\nOne-hot encoded vectors:")
for i, vector in enumerate(one_hot):
    print(f"Sentence {i+1}: {vector}")

print("\nBag-of-words vectors:")
for i, vector in enumerate(bag_of_words):
    print(f"Sentence {i+1}: {vector}")

Unique words: {'is', 'fox', 'often', 'jumps', 'not', 'the', 'in', 'quick', 'rather', 'but', 'are', 'over', 'seen', 'lazy', 'dogs', 'park', 'and', 'cats', 'brown', 'dog', 'energetic'}

One-hot encoded vectors:
Sentence 1: [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]
Sentence 2: [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Sentence 3: [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1]

Bag-of-words vectors:
Sentence 1: [0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]
Sentence 2: [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Sentence 3: [2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1]


**2. What are the limitations of the above representations?**

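Both representations suffer from sparsity: the vector length equals the vocabulary size, so for a large corpus each document vector is very long and consists almost entirely of zeros. Both also ignore word order and context, and they carry no notion of similarity between words: "dog" and "dogs" are as unrelated as "dog" and "park". In addition, words that were not in the vocabulary when it was built cannot be represented.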
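
One limitation that is easy to demonstrate is the loss of word order. In the sketch below, two sentences with opposite meanings receive identical bag-of-words vectors. The vocabulary and the helper `bag_of_words` are made up for this illustration.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["bites", "dog", "man"]
a = bag_of_words("Dog bites man", vocabulary)
b = bag_of_words("Man bites dog", vocabulary)
print(a, b, a == b)
```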

**3. Word embedding techniques such as Word2Vec and GloVe are considered *dense* representations. How do dense word embeddings relate to the *distributional hypothesis*?**

The distributional hypothesis states that words that occur in similar contexts tend to have similar meanings. Dense word embeddings build directly on this hypothesis: techniques such as Word2Vec and GloVe learn the vector of a word from the contexts in which it occurs, so words that appear in similar contexts end up close to each other in the vector space, and closeness in that space reflects semantic similarity.
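
The hypothesis itself can be illustrated without any trained model. The sketch below describes each word by the counts of its neighbouring words and compares words with cosine similarity. This is an assumption-laden toy: the corpus is invented, the helpers `context_vector` and `cosine` are hypothetical, and the resulting vectors are sparse count vectors rather than dense embeddings. Word2Vec and GloVe can be thought of as learning compact, dense vectors from the same kind of context information.

```python
from collections import Counter
from math import sqrt

# toy corpus, invented for this example
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
]


def context_vector(target, sentences, window=2):
    """Count the words that occur within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


cat, dog, car = (context_vector(w, corpus) for w in ("cat", "dog", "car"))
print("cat vs dog:", cosine(cat, dog))
print("cat vs car:", cosine(cat, car))
```

In this toy corpus "cat" and "dog" share most of their contexts, so their similarity comes out higher than that of "cat" and "car".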