# 1. Tokenization

**1. Kochmar mentions several steps required in a typical NLP pipeline, one of them being *Split into words*. Why is this step necessary? Why can we not just feed the text as it is into a model?**

An ML model cannot take raw text as input, only numbers. Therefore, we need to split the text into units that can each be mapped to a number the model can work with. We could split the text into characters, but a single character carries much less meaning than a word, so this would be a less meaningful way to represent the data.
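
As a minimal sketch of what "only numbers" means in practice, the snippet below splits a sentence into words and maps each distinct word to an integer ID. The example sentence and the names `vocabulary` and `token_ids` are assumptions made for illustration, not part of the lab.

```python
# Hypothetical example sentence, split on whitespace
sentence = "the cat sat on the mat"
tokens = sentence.split()

# Assign an integer ID to each distinct token, in order of first appearance
vocabulary = {token: index for index, token in enumerate(dict.fromkeys(tokens))}

# Replace every token with its ID; repeated words share the same ID
token_ids = [vocabulary[token] for token in tokens]
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```

Both occurrences of "the" get ID 0, which is the kind of input a model can actually consume.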

**2. Simply splitting on "words" (i.e. whitespace) is rarely enough. Consider the sentence below ("That U.S.A. poster-print costs $12.40...") and name some problems that arise from splitting on whitespace.**

There can be words that consist of other words, like "poster-print". Here, "$12.40..." would be one token, when splitting it into "$", "12.40" and "..." could be a more sensible way to capture its meaning. Also, "costs" is very closely related to "cost", the only difference being the "-s" ending required by the singular subject. Not dealing with these word forms would mean that the model in practice has to learn the same word twice.

Another challenge is that characters such as "." and "-" are used differently in programming languages, where they carry a meaning of their own, so text that contains code needs different splitting rules than ordinary prose.

In [3]:
# If you wish, experiment with implementing different rules for tokenization. You will see that the "ruleset" quickly grows if you want to account for all types of edge cases...
sentence = "That U.S.A. poster-print costs $12.40..."

def your_rulebased_tokenizer(sentence):
    tokens = []
    current_token = ""
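    for char in sentence:
        if char == " ":
            # A space ends the current token; skip empty tokens from repeated spaces
            if current_token:
                tokens.append(current_token)
            current_token = ""
        else:
            current_token += char
    # Append the final token, which is not followed by a space
    if current_token:
        tokens.append(current_token)
    return tokens

your_rulebased_tokenizer(sentence)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']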
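
To illustrate how quickly the ruleset grows, here is a rough sketch of a regex-based tokenizer that handles the edge cases in this one sentence. The function name `regex_tokenizer` and the pattern itself are assumptions made for this example; the pattern is tuned to this sentence and will not generalize well.

```python
import re

# Alternatives are tried left to right, so the more specific rules come first
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}    # abbreviations such as U.S.A.
    | \d+(?:\.\d+)?       # numbers, optionally with decimals: 12.40
    | \w+(?:-\w+)*        # words, including hyphenated ones: poster-print
    | \.{3}               # an ellipsis kept as one token
    | [^\w\s]             # any other single symbol, such as $
    """,
    re.VERBOSE,
)


def regex_tokenizer(sentence):
    """Return all non-overlapping matches of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(sentence)


sentence = "That U.S.A. poster-print costs $12.40..."
print(regex_tokenizer(sentence))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']
```

Every new kind of token (contractions, URLs, emoticons) would need yet another alternative, which is why ready-made tokenizers are usually preferred.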

NLTK has several tokenizers implemented, such as a specific one for Twitter data. Below, indicated by the `TODO`-tag, you should find and import various tokenizers and add them to the list of tokenizers:

`tokenizers = [tokenizer1, tokenizer2, ..., tokenizerN]`

Tokenize the sentence with at least three different tokenizers supplied by NLTK and comment on your findings. You will find the documentation for NLTK's tokenizers [here](https://www.nltk.org/_modules/nltk/tokenize.html) useful.

In [3]:
from typing import List

# this is the base class of tokenizers in nltk
from nltk.tokenize.api import TokenizerI
from nltk.tokenize import WordPunctTokenizer, TreebankWordTokenizer

# this is just a simple example of how a tokenizer can be implemented
class MyWhitespaceTokenizer(TokenizerI):
    def __init__(self):
        super().__init__()

    def tokenize(self, text: str) -> List[str]:
        return text.split()


sentence = "That U.S.A. poster-print costs $12.40..."

# ************************************************************
# TODO: import and add the tokenizers you want to try out here
# ************************************************************
tokenizers = [
    MyWhitespaceTokenizer(),
    WordPunctTokenizer(),
    TreebankWordTokenizer()
]

# Leave this function as-is
def tokenize(tokenizers: List[TokenizerI], sentence: str) -> None:
    for tokenizer in tokenizers:
        assert isinstance(tokenizer, TokenizerI)
        tokenized = tokenizer.tokenize(sentence)
        print(f"{tokenizer.__class__.__name__} ({len(tokenized)} tokens)\n{tokenized}\n")


tokenize(tokenizers, sentence)

MyWhitespaceTokenizer (5 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']

WordPunctTokenizer (16 tokens)
['That', 'U', '.', 'S', '.', 'A', '.', 'poster', '-', 'print', 'costs', '$', '12', '.', '40', '...']

TreebankWordTokenizer (7 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']



Comment:
WordPunctTokenizer splits the sentence into 16 tokens, while TreebankWordTokenizer produces 7 and the whitespace tokenizer 5. I think WordPunctTokenizer splits too aggressively, since words like "U.S.A." and "poster-print" lose their meaning when broken apart. I prefer the third option (TreebankWordTokenizer) to the first (whitespace), since it makes sense to let "..." be its own token that can be interpreted in the context of the whole sentence, not just the price. Separating "$" from "12.40" also makes sense, since it isolates the number and "$" carries a meaning of its own.

# 2. Language modeling
We have now studied the bigger models like BERT and GPT-based language models. A simpler language model, however, can be implemented using n-grams.

**1. What is an n-gram?**

A sequence of n consecutive words. The 2-grams of that sentence would be [a, sequence], [sequence, of] and so on.

**2. Use NLTK to print out bigrams and trigrams for the given sentence below. Your function should support any number of N.**

In [27]:
sentence = "That U.S.A. poster-print costs $12.40... I'd pay $5.00 for it."

# ************************************
# TODO: your implementation of n-grams
# ************************************

from nltk.tokenize import TreebankWordTokenizer

def nGrams(n, tokenizer, sentence):
    # Slide a window of size n over the tokens; each window is one n-gram
    result = []
    tokenizedSentence = tokenizer.tokenize(sentence)
    for i in range(len(tokenizedSentence) - n + 1):
        result.append(tokenizedSentence[i:i+n])
    return result


nGrams(2, TreebankWordTokenizer(), sentence)

[['That', 'U.S.A.'],
 ['U.S.A.', 'poster-print'],
 ['poster-print', 'costs'],
 ['costs', '$'],
 ['$', '12.40'],
 ['12.40', '...'],
 ['...', 'I'],
 ['I', "'d"],
 ["'d", 'pay'],
 ['pay', '$'],
 ['$', '5.00'],
 ['5.00', 'for'],
 ['for', 'it'],
 ['it', '.']]

**3. Based on your intuition for language modeling, how can n-grams be used for word predictions?**

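An n-gram model predicts the next word from the preceding n-1 words: count how often each word follows a given context in a corpus, then pick the most frequent continuation. Word prediction is all about capturing the broader context. Predicting the next word based solely on one preceding word (bigrams) will result in very generic sentences that I assume won't make much sense, while a larger n takes more of the context into account.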
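
The counting idea can be sketched with a small bigram predictor. The toy corpus and the helper names `train_bigram_model` and `predict_next` are assumptions made for this example, and a real model would be trained on far more text.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for previous, following in zip(tokens, tokens[1:]):
        followers[previous][following] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]


# Hypothetical toy corpus, already tokenized on whitespace
corpus = "the cat sat on the mat . the cat ate the fish .".split()
model = train_bigram_model(corpus)

print(predict_next(model, "the"))  # cat
```

Here "the" is followed by "cat" twice and by "mat" and "fish" once each, so "cat" is predicted regardless of what came earlier in the sentence.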

**4. NLTK includes the `FreqDist` class, which produces the frequency distribution of words in a sentence. Use it to print out the two most common words in the text below.**

In [15]:
text = "That that is is that that is not. Is that it? It is. You sure? Surely it is!"

# TODO
from nltk import FreqDist

for word, frequency in FreqDist(text.split()).most_common():
    print(word, frequency)


that 4
is 3
That 1
not. 1
Is 1
it? 1
It 1
is. 1
You 1
sure? 1
Surely 1
it 1
is! 1


**5. Use your n-gram function from question 2.2 to print out the most common trigram of the text in question 2.4**

In [32]:
from collections import Counter

trigrams = nGrams(3, TreebankWordTokenizer(), text)

nGrams_tuples = [tuple(ngram) for ngram in trigrams]

# Count the frequency of each n-gram
nGrams_freq = Counter(nGrams_tuples)

# Print the frequency of each n-gram
for ngram, frequency in nGrams_freq.items():
    print(f"{ngram}: {frequency}")

('That', 'that', 'is'): 1
('that', 'is', 'is'): 1
('is', 'is', 'that'): 1
('is', 'that', 'that'): 1
('that', 'that', 'is'): 1
('that', 'is', 'not.'): 1
('is', 'not.', 'Is'): 1
('not.', 'Is', 'that'): 1
('Is', 'that', 'it'): 1
('that', 'it', '?'): 1
('it', '?', 'It'): 1
('?', 'It', 'is.'): 1
('It', 'is.', 'You'): 1
('is.', 'You', 'sure'): 1
('You', 'sure', '?'): 1
('sure', '?', 'Surely'): 1
('?', 'Surely', 'it'): 1
('Surely', 'it', 'is'): 1
('it', 'is', '!'): 1


**6. You may have discovered that you would need to implement some form of preprocessing to get the correct answer to the previous tasks. Preprocessing/cleaning/normalization is often necessary for the desired results. If you were to process the text of a news site or blog post, can you think of some preprocessing steps that would be useful?**

Separating symbols from words, such as periods, exclamation marks and dollar signs. Lowercasing the capital letters of the first words of sentences.
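
A minimal sketch of these two steps applied to the text from question 2.4 is shown below. The `preprocess` helper and its regular expression are assumptions made for this example, and `Counter` from the standard library stands in for `FreqDist` so the block runs without NLTK.

```python
import re
from collections import Counter

text = "That that is is that that is not. Is that it? It is. You sure? Surely it is!"


def preprocess(text):
    """Lowercase the text and keep only runs of letters, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


tokens = preprocess(text)

# Two most common words after normalization
word_counts = Counter(tokens)
print(word_counts.most_common(2))  # [('is', 6), ('that', 5)]

# Most common trigram after normalization
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
print(trigram_counts.most_common(1))  # [(('that', 'that', 'is'), 2)]
```

Without normalization, "is", "Is", "is." and "is!" are counted as four different words, which is why the earlier cells report "that" as the most common word and every trigram as occurring only once.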

# 3. Word Representations
For more information on word representations, consult the lab description file and course material.

**1. Describe the main differences between bag-of-words and one-hot encoding through examples.**

#### Vocabulary: {The, cat, is, cute, dog, happy}


#### Bag of Words (BoW):

Document 1: "The cat is cute." - BoW Vector: [1, 1, 1, 1, 0, 0]

Document 2: "The dog is happy." - BoW Vector: [1, 0, 1, 0, 1, 1]


#### One-hot encoding:

Document 1: "The cat is cute."

| Word  | Vector         |
|-------|----------------|
| The   | [1, 0, 0, 0, 0, 0] |
| cat   | [0, 1, 0, 0, 0, 0] |
| is    | [0, 0, 1, 0, 0, 0] |
| cute  | [0, 0, 0, 1, 0, 0] |

Document 2: "The dog is happy."

| Word  | Vector         |
|-------|----------------|
| The   | [1, 0, 0, 0, 0, 0] |
| dog   | [0, 0, 0, 0, 1, 0] |
| is    | [0, 0, 1, 0, 0, 0] |
| happy | [0, 0, 0, 0, 0, 1] |
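
The difference between the two encodings can also be sketched in code. The helper names `one_hot` and `bag_of_words` are assumptions made for this example, the tokens are lowercased with punctuation already removed, and the last document is a hypothetical one added to show counts above 1.

```python
from collections import Counter

vocabulary = ["the", "cat", "is", "cute", "dog", "happy"]


def one_hot(word):
    """One vector per word: a single 1 at the word's position in the vocabulary."""
    return [1 if word == term else 0 for term in vocabulary]


def bag_of_words(tokens):
    """One vector per document: how often each vocabulary word occurs in it."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


document = "the cat is cute".split()

print([one_hot(word) for word in document])  # four vectors, one per word
print(bag_of_words(document))                # [1, 1, 1, 1, 0, 0]

# Hypothetical longer document with repeated words
print(bag_of_words("the dog is happy the dog is cute".split()))  # [2, 0, 2, 1, 2, 1]
```

One-hot encoding keeps the word order as a sequence of vectors, while bag-of-words collapses the whole document into a single vector of counts.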



**2. What are the limitations of the above representations?**

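Bag-of-words discards word order, so "dog bites man" and "man bites dog" get the same vector. One-hot encoding is very demanding in size: every word needs a vector as long as the vocabulary, and a document needs one such vector per word. Neither representation captures meaning: all one-hot vectors are equally far apart, and polysemous words like "can" get a single entry even though they need to be understood in context.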

**3. Word embedding techniques such as Word2Vec and GloVe are considered *dense* representations. How do dense word embeddings relate to the *distributional hypothesis*?**

The Distributional Hypothesis is a fundamental concept in linguistics and natural language processing (NLP) that suggests words that occur in similar contexts tend to have similar meanings.

Embedding techniques capture these semantic relationships by placing words that are similar to each other close together in a high-dimensional vector space. This can capture that "Iceland" and "vikings" often appear together, as do "Iceland" and "countries". The words "vikings" and "countries" will be further apart, since these words co-occur less often.
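
Closeness in the vector space is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; they are not taken from Word2Vec, GloVe or any trained model, where vectors typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1 mean a similar direction."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example
embeddings = {
    "iceland": [0.8, 0.6, 0.1],
    "vikings": [0.9, 0.1, 0.2],
    "countries": [0.2, 0.9, 0.1],
}

iceland_vikings = cosine_similarity(embeddings["iceland"], embeddings["vikings"])
iceland_countries = cosine_similarity(embeddings["iceland"], embeddings["countries"])
vikings_countries = cosine_similarity(embeddings["vikings"], embeddings["countries"])

print(f"iceland-vikings:   {iceland_vikings:.2f}")
print(f"iceland-countries: {iceland_countries:.2f}")
print(f"vikings-countries: {vikings_countries:.2f}")
```

With these invented vectors, "vikings" and "countries" get the lowest similarity of the three pairs, mirroring the example in the text.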