## NLP Fundamentals 

Instructor: __Victor Geislinger__, a data scientist and machine learning engineer who works with natural language processing (NLP). His work has included building and deploying NLP classification models, as well as working on and with large language models (LLMs).

### What We'll Learn

![image.png](attachment:e0ff01af-e0ba-4119-94da-3f224b8db183.png)

__Intro to NLP:__ An introduction to different NLP tasks and applications, and to the challenges of working with language data

__Encoding Text Data:__ How to encode text data using tokenization and embeddings

__Text Generation:__ Techniques for the NLP task of text generation using recurrent neural networks

#### Natural Languages vs. Structured Languages
__Natural Language:__ A natural language is a language that evolved naturally through human communication, such as Spanish, Mandarin, or American Sign Language.

__Structured Language:__ A structured language is an invented or constructed language, such as a computer programming language.

__NLP (Natural Language Processing):__ NLP extracts structure and meaning from human language so that computers can work with it, and its importance continues to grow.

![image.png](attachment:78bc7de6-8c0c-4047-825b-cbf02b123977.png)

### Key NLP Applications 

| **Application/Task**               | **Definition**                                                   |
|------------------------------------|------------------------------------------------------------------|
| Speech Recognition                 | Converts spoken language to text                                 |
| Text Classification                | Categorizes pieces of text based on content                      |
| Machine Translation                | Translates text from one language to another                     |
| Text Summarization                 | Creates concise summaries while retaining meaning                |
| Question Answering                 | Answers natural language questions using documents               |
| Chatbots & Conversational Agents   | Converse back and forth in natural language                      |


### Extractive vs. Abstractive Summarization
__Extractive:__ Directly quotes main points from the source

__Abstractive:__ Summarizes with novel words and phrases
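
To make the extractive idea concrete, here is a minimal sketch that selects whole sentences from the source and returns them verbatim. The word-frequency scoring, the naive sentence splitter, and the function name `extractive_summary` are illustrative assumptions, not a method taken from the course material. Abstractive summarization has no comparably short sketch, since it requires a model that generates new text.

```python
import re
from collections import Counter
from typing import List


def extractive_summary(text: str, num_sentences: int = 1) -> List[str]:
    """Return the highest-scoring sentences, quoted verbatim from the text."""
    # Naive sentence split: break after ., !, or ? followed by whitespace (assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text, case-insensitive.
    freqs = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order.
    return [s for s in sentences if s in top]


example = "Cats sleep a lot. Cats and dogs are pets. Birds sing."
print(extractive_summary(example, 1))  # ['Cats sleep a lot.']
```

Every sentence in the result appears word for word in the input, which is the defining property of an extractive summary.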


### Challenges in NLP with Computers
Language is complex, nuanced, and ambiguous, making it challenging for computers to process language data.

NLP needs (some) clean and labeled text data. Real-world data can contain misspellings, unconventional words, and biases, and labeling it by hand can be difficult and expensive.

__NLP stands for "natural language processing" and forms the bridge between human communication and computer logic.__

Typical NLP tasks and applications include speech recognition, text classification, machine translation, text summarization, question answering, and chatbots.

__Some of the major challenges in NLP are the complexity, nuance, and ambiguity of natural language, as well as the difficulty of acquiring clean and labeled text data.__

## Normalization, Pretokenization, Tokenization, and Post-processing in NLP

#### Normalization

Normalization is typically the first step of text preprocessing in Natural Language Processing (NLP). It transforms text into a consistent format so that NLP models can process it efficiently and accurately; the primary goal is to reduce variability in the text data. Common normalization steps include lowercasing all characters to avoid case-sensitivity issues, removing punctuation and special characters to focus on the core content, replacing or removing accented characters, expanding contractions (e.g., "don't" to "do not") to standardize expressions, and applying stemming or lemmatization to reduce words to their root or base forms (e.g., stemming "running" to "run" or lemmatizing "better" to "good"). Not every pipeline applies every step: a cased model, for example, deliberately keeps capitalization. Together, these steps produce a more uniform dataset that is easier for NLP algorithms to analyze and learn from.
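
Several of these normalization steps can be sketched with the Python standard library alone. The `normalize` function and the tiny `CONTRACTIONS` table below are hypothetical names introduced for illustration. Stemming and lemmatization are left out because they need a dedicated library or dictionary.

```python
import re
import unicodedata

# Tiny illustrative contraction table (assumption); a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "aren't": "are not", "it's": "it is"}


def normalize(text: str) -> str:
    # Lowercase so that "Hello" and "hello" are treated the same.
    text = text.lower()
    # Decompose accented characters, then drop the combining marks ("é" -> "e").
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Expand contractions before the apostrophes are removed as punctuation.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace punctuation with spaces, then collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Don't go to the CAFÉ!"))  # do not go to the cafe
```

The order matters here: contractions are expanded before punctuation is stripped, because stripping the apostrophe first would turn "don't" into "don t".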

#### Pretokenization

Pretokenization is the step that precedes tokenization proper. It breaks the normalized text into preliminary units, often based on whitespace and punctuation, so that common text structures and edge cases are handled up front. For instance, it can separate punctuation from words (e.g., "Hello, world!" to ["Hello", ",", "world", "!"]) and split special tokens such as hashtags and mentions (e.g., "@user" to ["@", "user"]). Pretokenization does not usually split a word into subwords (e.g., "unhappiness" to ["un", "happiness"]); that is the job of the subword tokenization step that follows. Its output sets the boundaries within which the tokenizer works, which makes tokenization simpler and more consistent.
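
A whitespace-and-punctuation split like the one described above can be approximated with a single regular expression. This is a minimal sketch; the function name `pretokenize` is hypothetical, and real pretokenizers have more rules.

```python
import re


def pretokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation or symbol character at a time.
    return re.findall(r"\w+|[^\w\s]", text)


print(pretokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(pretokenize("@user"))          # ['@', 'user']
```

In this sketch a word is never split internally, so "unhappiness" comes back as a single unit.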

#### Tokenization
![image.png](attachment:a9abe263-5410-4f3d-ae76-10693cd7bd0e.png)

Tokenization is the process of converting the pretokenized text into discrete units called tokens, which can be words, subwords, or even characters, depending on the NLP model's requirements. Tokenization is essential because it transforms the continuous stream of text into structured inputs that models can process. For example, word tokenization would convert a sentence like "ChatGPT helps users" into ["ChatGPT", "helps", "users"]. Subword tokenization, used by models like BERT and GPT, breaks down words into smaller units, allowing the model to handle unknown or rare words effectively. This step is crucial for feature extraction, as it represents the text in a format that can be numerically encoded and fed into machine learning algorithms for further processing.
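
To illustrate how subword tokenization can split a rare word, here is a simplified greedy longest-match-first lookup in the style of WordPiece, the scheme BERT uses. The `wordpiece` function and the four-entry `toy_vocab` are assumptions made for illustration; a real vocabulary has tens of thousands of entries and is learned from data.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword tokens by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # "##" marks a continuation of the word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"mage", "##nta", "shoes", "dark"}  # hypothetical vocabulary
print(wordpiece("magenta", toy_vocab))  # ['mage', '##nta']
print(wordpiece("shoes", toy_vocab))    # ['shoes']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```

Because frequent words are stored whole and rare words are assembled from pieces, the vocabulary stays small while most text remains representable.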

#### Post-processing

Post-processing covers the steps applied after the main processing is done, and the term is used in two senses. Within a tokenizer pipeline, post-processing is the final stage that adds special tokens to the tokenized sequence, such as the tokens that mark the beginning and end of a sentence (e.g., `[CLS]` and `[SEP]` for BERT). More broadly, post-processing refers to the steps taken after a model has generated its output to refine and format the results for the end user. These can include detokenization (reassembling tokens into coherent text), correcting grammatical errors, and any other transformations needed to make the output human-readable. For instance, if a model generates the tokens ["Hello", ",", "world", "!"], post-processing would combine them into the sentence "Hello, world!". Post-processing might also filter out inappropriate or irrelevant content so that the final output meets the desired standards and requirements.
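
The detokenization step described above can be sketched as follows. The `detokenize` function is a hypothetical helper, and its spacing rules are deliberately simple: it handles `##` continuation pieces and a few punctuation marks, but not apostrophes or quotes.

```python
def detokenize(tokens, special=("[CLS]", "[SEP]")):
    """Reassemble a list of string tokens into readable text."""
    text = ""
    for tok in tokens:
        if tok in special:
            continue  # drop markers that are not part of the text
        if tok.startswith("##"):
            text += tok[2:]  # glue a continuation piece onto the previous one
        elif tok in {",", ".", "!", "?", ";", ":"} or not text:
            text += tok  # no space before punctuation or at the very start
        else:
            text += " " + tok
    return text


print(detokenize(["Hello", ",", "world", "!"]))                        # Hello, world!
print(detokenize(["[CLS]", "mage", "##nta", "shoes", "!", "[SEP]"]))  # magenta shoes!
```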

Each of these stages (normalization, pretokenization, tokenization, and post-processing) plays a critical role in the NLP pipeline, ensuring that text data is appropriately prepared, processed, and refined for accurate and effective analysis and generation by language models.




| **Step**                                                                              | **Stage**        |
|---------------------------------------------------------------------------------------|------------------|
| Convert to all lowercase letters                                                      | Normalization    |
| Adds special tags, such as the tokens that mark the beginning and end of a sentence   | Post-processing  |
| Breaks original text into smaller chunks                                              | Pretokenization  |
| Creates blocks of text that are used as tokens                                        | Tokenization     |
| Replace or remove accented characters                                                 | Normalization    |


### Encoding and Decoding Text with Hugging Face Tokenizers


In [1]:
from transformers import AutoTokenizer

### Loading Tokenizer

In this notebook, we'll explore Hugging Face's tokenizers by using a pretrained
model. Hugging Face has many tokenizers available that have already been trained
for specific models and tasks!

In [3]:
# Choose a pretrained tokenizer to use
my_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')


### Encoding: Text to Tokens
#### Tokens: String Representations


In [5]:
# Simple method getting tokens from text
raw_text = '''Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!'''
tokens = my_tokenizer.tokenize(raw_text)

print(tokens)

['Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!']


In [6]:
# This method also returns special tokens depending on the pretrained tokenizer
detailed_tokens = my_tokenizer(raw_text).tokens()

print(detailed_tokens)

['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']


#### Tokens: Integer ID Representations

In [7]:
# Way to get tokens as integer IDs
print(my_tokenizer.encode(raw_text))

[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]


In [8]:
print(detailed_tokens)

# Tokenizer method to get the IDs if we already have the tokens as strings
detailed_ids = my_tokenizer.convert_tokens_to_ids(detailed_tokens)
print(detailed_ids)

['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
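
Conceptually, converting between tokens and IDs is a lookup in the tokenizer's vocabulary. The sketch below mimics that with a plain dictionary. The six entries are copied from the output above, while the names `token_to_id` and `id_to_token` are illustrative and not part of the Hugging Face API.

```python
# Toy vocabulary; the IDs are copied from the bert-base-cased output above.
token_to_id = {
    "[CLS]": 101,
    "[SEP]": 102,
    "shoes": 5743,
    "are": 1132,
    "dark": 1843,
    "!": 106,
}
# Invert the mapping to go from IDs back to tokens.
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

toy_tokens = ["[CLS]", "shoes", "are", "dark", "!", "[SEP]"]
toy_ids = [token_to_id[tok] for tok in toy_tokens]
print(toy_ids)  # [101, 5743, 1132, 1843, 106, 102]

# Looking the IDs up again recovers the original tokens.
print([id_to_token[idx] for idx in toy_ids])
```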


Calling the tokenizer directly on the text is another option. The result looks a
little more complex, but it can be useful when working with tokenizers for certain tasks.

In [9]:
# Returns an object that has a few different keys available
my_tokenizer(raw_text)

{'input_ids': [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
# focus on `input_ids` which are the IDs associated with the tokens.
print(my_tokenizer(raw_text).input_ids)

[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
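
The other two keys matter once there is more than one sequence. In the output above, `attention_mask` is all ones because a single sequence needs no padding, and `token_type_ids` is all zeros because the input has only one segment. The sketch below shows how an attention mask arises when sequences of different lengths are padded to a common length. `pad_batch` is a hypothetical helper and `pad_id=0` is an assumed padding ID, not a call into the library.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad ID lists to the same length and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 5743, 102], [101, 5743, 1132, 1843, 102]])
print(batch["input_ids"][0])       # [101, 5743, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```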


### Decoding: Tokens to Text

We of course can use the tokenizer to go from token IDs to tokens and back to text!

In [13]:
# Integer IDs for tokens
ids = my_tokenizer.encode(raw_text)

# The inverse of the .encode() method: .decode()
my_tokenizer.decode(ids)

"[CLS] Rory's shoes are magenta and so are Corey's but they aren't nearly as dark! [SEP]"

In [14]:
# To ignore special tokens (depending on pretrained tokenizer)
my_tokenizer.decode(ids, skip_special_tokens=True)

"Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!"

In [15]:
# List of tokens as strings instead of one long string
my_tokenizer.convert_ids_to_tokens(ids)

['[CLS]',
 'Rory',
 "'",
 's',
 'shoes',
 'are',
 'mage',
 '##nta',
 'and',
 'so',
 'are',
 'Corey',
 "'",
 's',
 'but',
 'they',
 'aren',
 "'",
 't',
 'nearly',
 'as',
 'dark',
 '!',
 '[SEP]']

> One thing to consider is what happens when a string is outside of the tokenizer's
> vocabulary. Such a string is encoded as an "unknown" token.
> 
> Unknown tokens are typically represented with `[UNK]` or
> some other similar variant.


<!--
If the tokenizer encoded the text so each character was a token (which is
actually not as easy as it sounds), then it would be impossible to have an
"unknown" token. Word-based tokenization will always be in danger of having 
"unknown" tokens since it's virtually impossible to have every possible word (
and "non-word") in its vocabulary!

And so you might think that subword tokenization wouldn't have an issue with
"unknown" tokens. And although there are fewer than word-based tokenization, it
does happen!

--------------------------------------------------------------------------------

Tokenizers are specific so it's important to use a tokenizer that will recognize
most of the text you're working with! For example, a lot of tokenizers might not
consider emoji as tokens but could be really important if emoji are especially
numerous in your data (like a corpus of chat messages)!

If you're seeing a lot of "unknown" tokens with the text you're working with,
you might consider using a different tokenizer appropriate for the task. It's also
possible to fine-tune a pretrained model or train one from scratch!

-->

In [16]:
phrase = '🥱 the dog next door kept barking all night!!'
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))

🥱 the dog next door kept barking all night!!
['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept', 'barking', 'all', 'night', '!', '!', '[SEP]']
[CLS] [UNK] the dog next door kept barking all night!! [SEP]


In [17]:
phrase = '''wow my dad thought mcdonalds sold tacos \N{SKULL}'''
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))

wow my dad thought mcdonalds sold tacos 💀
['[CLS]', 'w', '##ow', 'my', 'dad', 'thought', 'm', '##c', '##don', '##ald', '##s', 'sold', 'ta', '##cos', '[UNK]', '[SEP]']
[CLS] wow my dad thought mcdonalds sold tacos [UNK] [SEP]
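
One rough way to judge how well a tokenizer fits a corpus is to measure how often it falls back to the unknown token. The helper below is a hypothetical sketch that works on plain lists of string tokens, such as the ones printed above. The default `[UNK]`, `[CLS]`, and `[SEP]` strings match the tokenizer used here and would need adjusting for others.

```python
def unknown_rate(tokens, unk="[UNK]", special=("[CLS]", "[SEP]")):
    """Fraction of non-special tokens that are the unknown token."""
    content = [tok for tok in tokens if tok not in special]
    if not content:
        return 0.0
    return sum(tok == unk for tok in content) / len(content)


# Token list copied from the first emoji example above.
example_tokens = ['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept',
                  'barking', 'all', 'night', '!', '!', '[SEP]']
print(round(unknown_rate(example_tokens), 3))  # 0.091
```

Here one of the eleven content tokens is unknown. A much higher rate across a corpus would suggest that the tokenizer's vocabulary is a poor match for the data.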


### Properties of Hugging Face Tokenizers

