# Lab: Tokenization in Natural Language Processing (NLP)


### Objective: By the end of this lab, you will understand the concept of tokenization in NLP and its main types, and you will apply hands-on techniques to tokenize text using Python tools such as regex, nltk, and Hugging Face's transformers.

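## 1. Introduction to Tokenization

Tokenization is the process of splitting raw text into smaller units called tokens, such as words, subwords, or characters.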

### Why Tokenization Matters:

- Enables machine learning models to interpret text.

- Necessary for downstream tasks like text classification, sentiment analysis, and machine translation.


## 2. Types of Tokenization

1. Word Tokenization: Splitting text into words.
    - Example: "I love NLP" -> ['I', 'love', 'NLP']

2. Subword Tokenization: Splitting text into subword units, often used in modern NLP models.
    - Example: "playing" -> ['play', '##ing']

3. Character Tokenization: Splitting text into individual characters.
    - Example: "NLP" -> ['N', 'L', 'P']
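
The word and character examples above can be reproduced with plain Python. This is only a minimal sketch: splitting on whitespace is the simplest possible word tokenizer and ignores punctuation, and subword tokenization is left out here because it needs a learned vocabulary.

```python
# Word- and character-level tokenization with plain Python.
text = "I love NLP"

# Word tokenization: split on runs of whitespace.
word_tokens = text.split()

# Character tokenization: every character becomes its own token.
char_tokens = list("NLP")

print(word_tokens)  # ['I', 'love', 'NLP']
print(char_tokens)  # ['N', 'L', 'P']
```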

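## 3. Tokenization with Regex

Python's built-in re module lets you describe tokens as regular expression patterns. The pattern elements used in this lab are: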

- \w: Matches any word character: a letter, a digit, or an underscore (equivalent to [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches Unicode letters and digits by default)

- \W: Matches any non-word character

- \s: Matches any whitespace character

- \S: Matches any non-whitespace character

- \d: Matches any digit character

- \D: Matches any non-digit character

- \w{N}: Matches exactly N consecutive word characters

- +: One or more of the preceding expression

- *: Zero or more of the preceding expression

- |: Or operator; A|B matches either A or B

- []: Character class; matches any one character from the set, so [A-Z] matches any single uppercase letter

- (): Grouping and capturing; (A-Z) matches only the literal string A-Z, whereas ([A-Z]) captures a single uppercase letter
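
A few of the pattern elements listed above can be tried on a short string. The sample text and variable names below are made up for illustration; only re.findall from the standard library is used.

```python
import re

sample = "Room 42, Floor 7"

digits = re.findall(r'\d+', sample)           # runs of one or more digits
uppercase = re.findall(r'[A-Z]', sample)      # single uppercase letters
either = re.findall(r'Room|Floor', sample)    # alternation with |
four_chars = re.findall(r'\w{4}', sample)     # exactly four word characters per match
pairs = re.findall(r'(\w+) (\d+)', sample)    # with two groups, findall returns tuples
literal = re.findall(r'(A-Z)', "A-Z, not B")  # parentheses group the literal text A-Z

print(digits)      # ['42', '7']
print(uppercase)   # ['R', 'F']
print(either)      # ['Room', 'Floor']
print(four_chars)  # ['Room', 'Floo']
print(pairs)       # [('Room', '42'), ('Floor', '7')]
print(literal)     # ['A-Z']
```

Note how \w{4} cuts "Floor" to "Floo": the quantifier counts characters and does not care about word boundaries.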



In [1]:
import re

# Simple text for tokenization
text = "Tokenization is crucial for NLP! Let's tokenize this text."

# Tokenizing using regex: match runs of word characters, or any single non-whitespace character (punctuation)
tokens = re.findall(r'\w+|\S', text)
print("Regex Tokens:", tokens)


Regex Tokens: ['Tokenization', 'is', 'crucial', 'for', 'NLP', '!', 'Let', "'", 's', 'tokenize', 'this', 'text', '.']


In [15]:
import re

# Sample tweet
tweet = "Loving the new features of #AI in 2024! Thanks @OpenAI for the great work. #Innovation"

# Tokenize hashtags and mentions
hashtags = re.findall(r'#\w+', tweet)
mentions = re.findall(r'@\w+', tweet)

print("Hashtags:", hashtags)
print("Mentions:", mentions)

Hashtags: ['#AI', '#Innovation']
Mentions: ['@OpenAI']


In [17]:
import re

# Sample CamelCase text
camel_case_text = "NaturalLanguageProcessingIsFun"

# Regex to split CamelCase words
words = re.findall(r'[A-Z][a-z]*', camel_case_text)

print(words)

['Natural', 'Language', 'Processing', 'Is', 'Fun']
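
The pattern above works for clean CamelCase, but it drops a leading lowercase word and digits, and it splits acronyms letter by letter. The sketch below shows one possible extension; the input string and the extended pattern are illustrative assumptions, not part of the original lab.

```python
import re

identifier = "parseHTTPResponse2Json"  # hypothetical identifier for illustration

# Original pattern: an uppercase letter followed by lowercase letters.
simple = re.findall(r'[A-Z][a-z]*', identifier)

# Extended pattern, tried left to right at each position:
#   [A-Z]+(?=[A-Z][a-z])  an acronym that is followed by a capitalized word
#   [A-Z]?[a-z]+          a word, with or without a leading capital
#   [A-Z]+                a trailing acronym
#   \d+                   a run of digits
extended = re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', identifier)

print(simple)    # ['H', 'T', 'T', 'P', 'Response', 'Json']
print(extended)  # ['parse', 'HTTP', 'Response', '2', 'Json']
```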


### Exercise 1: Tokenize Sentences

### Write a regex pattern to tokenize a paragraph into individual sentences.

### Task: Use regex to split the following text into sentences.

In [16]:
text="Hello world! How are you doing today? It's a sunny day. Let's enjoy it."


### Exercise 2: Tokenize Words Including Apostrophes
### Write a regex pattern to extract words from a sentence while retaining contractions and words with apostrophes.

In [18]:
sentence = "It's a beautiful day, don't you think?"


### Exercise 3: Special Case Tokenization

### Write a function to detect and separate URLs and emails during tokenization.

In [4]:
sentence = "Visit us at https://example.com or email info@example.com."


## 4. Tokenization with NLTK

- ### 4.1 Word Tokenization with NLTK

In [2]:
import nltk


# Download the Punkt tokenizer data (the punkt_tab resource)
nltk.download('punkt_tab')



from nltk.tokenize import word_tokenize

sentence = "Tokenization is essential for text processing."

tokens = word_tokenize(sentence)
print("Word Tokens:", tokens)


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/mohamedhassanien/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Word Tokens: ['Tokenization', 'is', 'essential', 'for', 'text', 'processing', '.']


- ### 4.2 Sentence Tokenization with NLTK

In [3]:
from nltk.tokenize import sent_tokenize


paragraph = "Tokenization is crucial in NLP. It helps machines understand language."


sentences = sent_tokenize(paragraph)
print("Sentences:", sentences)


Sentences: ['Tokenization is crucial in NLP.', 'It helps machines understand language.']


### Exercise 4: Sentence Tokenization
### Tokenize the paragraph into individual sentences, observing how NLTK deals with abbreviations.

In [5]:
paragraph = "Dr. Smith is a renowned scientist. He earned his Ph.D. in AI in 2020. Can you believe that?"


### Exercise 5: Word Tokenization

### Tokenize this sentence into individual words using the word_tokenize function from NLTK.

In [6]:
sentence = "NLTK is a powerful Python library for natural language processing!"


### Exercise 6: Tokenize and Filter Words

### Tokenize a sentence into words, remove punctuation, and filter out stop words.

### Hint: use from nltk.corpus import stopwords (download the list first with nltk.download('stopwords'))

In [7]:
sentence = "The quick brown fox, jumps over a lazy dog!"


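## 5. Subword Tokenization with Byte Pair Encoding (BPE)

Note: the bert-base-uncased tokenizer used below relies on WordPiece, a subword algorithm closely related to BPE; models such as GPT-2 use BPE itself. In WordPiece output, the ## prefix marks a piece that continues the preceding one.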
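
To make the idea behind BPE concrete, here is a minimal sketch of how merge rules can be learned: start from single characters and repeatedly fuse the most frequent adjacent pair. The function name learn_bpe_merges and the toy corpus are assumptions made for illustration; real implementations add end-of-word markers, byte-level handling, and much larger corpora.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Select the most frequent pair and record it as a merge rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the selected pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: after two merges the characters l, o, w are fused into "low".
corpus = ["low", "low", "lower", "lowest", "new", "newer"]
merges, vocab = learn_bpe_merges(corpus, num_merges=2)
print(merges)
print(vocab)
```

Frequent character sequences become single tokens, while rare words remain split into smaller pieces.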

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Tokenization is a critical step for language models."
tokens = tokenizer.tokenize(text)
print("Subword Tokens:", tokens)



Subword Tokens: ['token', '##ization', 'is', 'a', 'critical', 'step', 'for', 'language', 'models', '.']
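
The split of "tokenization" into 'token' and '##ization' can be imitated with a greedy longest-match-first search over a vocabulary, which is the general strategy WordPiece-style tokenizers apply to each word. The sketch below is a simplified stand-in, not the transformers implementation; the function name wordpiece_tokenize and the toy vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; a real model ships tens of thousands of pieces.
toy_vocab = {"token", "##ization", "play", "##ing", "##s"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```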


### Exercise 7: Subword Tokenization

### Try tokenizing this sentence: "Deep learning models require tokenization."

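## 6. Tokenization Challenges

- Ambiguity: Contractions like "don't" may stay whole or be split into ["do", "n't"], depending on the tokenizer. Subword tokenization also hides individual characters from a model: try asking ChatGPT (GPT-4o) how many times the letter "r" appears in "strawberry".

- Languages: Tokenization rules differ by language (e.g., Arabic tokenization is more complex because conjunctions, prepositions, and pronouns often attach directly to words).

- Subword vs. Word: The choice of tokenization can significantly affect the performance of models. Word-level vocabularies cannot represent unseen words, while subword vocabularies can, at the cost of longer token sequences.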
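
Two of these challenges can be seen with plain Python. In the sketch below, the subword split of "strawberry" is hypothetical and chosen only for illustration, and the Chinese sentence is an added example of a language that does not separate words with spaces.

```python
# Character-level information is easy to read off a raw string...
word = "strawberry"
r_count = word.count("r")
print(r_count)  # 3

# ...but a subword model receives whole pieces, not letters.
# This particular split is hypothetical, not the output of a real tokenizer.
hypothetical_pieces = ["straw", "##berry"]
print(len(hypothetical_pieces))  # 2

# Whitespace splitting assumes that spaces separate words.
chinese = "我爱自然语言处理"  # "I love natural language processing"
whitespace_tokens = chinese.split()
print(len(whitespace_tokens))  # 1, the whole sentence is a single token
```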