# Tokenization Practicals

Tokenization in Natural Language Processing (NLP) is the process of breaking text into smaller pieces called "tokens." For example, tokenizing the sentence "I love pizza." at the word level gives ["I", "love", "pizza"]; most tokenizers also return the final period as a token of its own.

Working with tokens instead of one long string makes text easier for a program to analyze. Depending on the task, the tokens can be words, phrases, or even individual characters. Tokenization is the first step in many NLP tasks, because later steps such as stemming, lemmatization, and building word embeddings all operate on tokens.
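
As a rough illustration of the idea, the sketch below produces word-level and character-level tokens using only built-in Python string operations, not NLTK. The variable names are illustrative, and splitting on whitespace is a deliberate simplification.

```python
sentence = "I love pizza."

# Word-level tokens: split on whitespace.
# Note that the period stays attached to the last word.
whitespace_tokens = sentence.split()

# Character-level tokens: every character, including spaces, is a token.
char_tokens = list(sentence)

print(whitespace_tokens)
print(len(char_tokens))
```

Plain whitespace splitting leaves punctuation glued to words, which is one reason to use the dedicated tokenizers shown in the rest of this notebook.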

In [1]:
!pip install nltk

# Website -> https://www.nltk.org/



In [10]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [7]:
corpus = """Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language. Through NLP, machines can understand, interpret, and generate human languages, making it possible for applications like chatbots, voice assistants, and
language translation tools to function. Techniques like tokenization, stemming, and lemmatization help break down text into manageable units for processing. Word embeddings, on the other hand, provide meaningful numeric representations of words, enabling models to capture relationships between words based on their usage. As NLP continues to evolve, its impact on technology and everyday life becomes increasingly significant.
"""

print(corpus)

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language. Through NLP, machines can understand, interpret, and generate human languages, making it possible for applications like chatbots, voice assistants, and
language translation tools to function. Techniques like tokenization, stemming, and lemmatization help break down text into manageable units for processing. Word embeddings, on the other hand, provide meaningful numeric representations of words, enabling models to capture relationships between words based on their usage. As NLP continues to evolve, its impact on technology and everyday life becomes increasingly significant.



In [9]:
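# Split the corpus into sentences

# sent_tokenize relies on the Punkt sentence models.
# Recent NLTK releases may look for the 'punkt_tab' resource instead;
# if sent_tokenize raises a LookupError, download that resource as well.
nltk.download('punkt')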
documents = sent_tokenize(corpus)

for doc in documents:
    print(f'Document: {doc}')

Document: Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language.
Document: Through NLP, machines can understand, interpret, and generate human languages, making it possible for applications like chatbots, voice assistants, and
language translation tools to function.
Document: Techniques like tokenization, stemming, and lemmatization help break down text into manageable units for processing.
Document: Word embeddings, on the other hand, provide meaningful numeric representations of words, enabling models to capture relationships between words based on their usage.
Document: As NLP continues to evolve, its impact on technology and everyday life becomes increasingly significant.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
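
To see why a trained sentence tokenizer is useful, compare it with a naive rule. The sketch below is a standard-library stand-in, not how `sent_tokenize` works: `naive_sent_split` is a hypothetical helper that splits wherever `.`, `!` or `?` is followed by whitespace.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Smith loves NLP. He teaches it."
naive_sentences = naive_sent_split(sample)

# The abbreviation "Dr." is wrongly treated as the end of a sentence.
print(naive_sentences)
```

The naive rule breaks the text after "Dr.". Punkt is designed to learn which periods mark abbreviations, so it is usually the better choice for real text.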


In [17]:
# Tokenize the whole paragraph into one flat list of words
words = word_tokenize(corpus)
print(words)

# Tokenize each sentence separately -> one list of words per sentence
words_list = [word_tokenize(doc) for doc in documents]

print(words_list)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', '.', 'Through', 'NLP', ',', 'machines', 'can', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'languages', ',', 'making', 'it', 'possible', 'for', 'applications', 'like', 'chatbots', ',', 'voice', 'assistants', ',', 'and', 'language', 'translation', 'tools', 'to', 'function', '.', 'Techniques', 'like', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', 'help', 'break', 'down', 'text', 'into', 'manageable', 'units', 'for', 'processing', '.', 'Word', 'embeddings', ',', 'on', 'the', 'other', 'hand', ',', 'provide', 'meaningful', 'numeric', 'representations', 'of', 'words', ',', 'enabling', 'models', 'to', 'capture', 'relationships', 'between', 'words', 'based', 'on', 'their', 'usage', '.', 'As', 'NLP', 'continues', 'to', 'evolve', ',', 'its', 'impact', '

In [20]:
# wordpunct_tokenize splits on every run of punctuation,
# so a contraction such as "don't" becomes three tokens
from nltk.tokenize import wordpunct_tokenize
word_punct = wordpunct_tokenize(corpus)

print(word_punct)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', '.', 'Through', 'NLP', ',', 'machines', 'can', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'languages', ',', 'making', 'it', 'possible', 'for', 'applications', 'like', 'chatbots', ',', 'voice', 'assistants', ',', 'and', 'language', 'translation', 'tools', 'to', 'function', '.', 'Techniques', 'like', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', 'help', 'break', 'down', 'text', 'into', 'manageable', 'units', 'for', 'processing', '.', 'Word', 'embeddings', ',', 'on', 'the', 'other', 'hand', ',', 'provide', 'meaningful', 'numeric', 'representations', 'of', 'words', ',', 'enabling', 'models', 'to', 'capture', 'relationships', 'between', 'words', 'based', 'on', 'their', 'usage', '.', 'As', 'NLP', 'continues', 'to', 'evolve', ',', 'its', 'impact', '
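
On this corpus the visible part of the output matches that of `word_tokenize`, because the text has no contractions or decimal numbers. The sketch below shows where punctuation-based splitting behaves differently. It uses the standard library only: the pattern `\w+|[^\w\s]+` is, to my knowledge, the one behind `wordpunct_tokenize`, but treat that as an assumption; `wordpunct_like` is a hypothetical helper.

```python
import re

# Runs of word characters, or runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_like(text):
    """Approximate punctuation-based tokenization with a single regex."""
    return re.findall(WORDPUNCT_PATTERN, text)

tokens = wordpunct_like("Don't stop, it's $3.50!")
print(tokens)
```

The contractions and the price are each broken into several tokens. If "Don't" or "3.50" should survive as meaningful units, `word_tokenize` is usually the safer choice.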

In [21]:
# TreebankWordTokenizer expects one sentence at a time.
# Only the period at the very end of the input becomes its own token;
# earlier periods stay attached to the preceding word (e.g. 'language.').

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
words = tokenizer.tokenize(corpus)

print(words)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language.', 'Through', 'NLP', ',', 'machines', 'can', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'languages', ',', 'making', 'it', 'possible', 'for', 'applications', 'like', 'chatbots', ',', 'voice', 'assistants', ',', 'and', 'language', 'translation', 'tools', 'to', 'function.', 'Techniques', 'like', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', 'help', 'break', 'down', 'text', 'into', 'manageable', 'units', 'for', 'processing.', 'Word', 'embeddings', ',', 'on', 'the', 'other', 'hand', ',', 'provide', 'meaningful', 'numeric', 'representations', 'of', 'words', ',', 'enabling', 'models', 'to', 'capture', 'relationships', 'between', 'words', 'based', 'on', 'their', 'usage.', 'As', 'NLP', 'continues', 'to', 'evolve', ',', 'its', 'impact', 'on', 'technology
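
The attached periods disappear when each sentence is tokenized separately. The sketch below imitates just this one behaviour with the standard library; `split_final_period` is a hypothetical helper and a heavy simplification, not NLTK's implementation, and the sentences are split by hand.

```python
import re

def split_final_period(text):
    """Whitespace-split, detaching only a period at the very end of the input."""
    text = re.sub(r"([^.])\.\s*$", r"\1 .", text)
    return text.split()

paragraph = "I love pizza. It is great."
sentences = ["I love pizza.", "It is great."]  # split by hand for this example

# Whole paragraph at once: the first period stays attached to "pizza."
whole = split_final_period(paragraph)

# One sentence at a time: every sentence-final period is its own token
per_sentence = [split_final_period(s) for s in sentences]

print(whole)
print(per_sentence)
```

In the notebook, the same effect should follow from running `tokenizer.tokenize(doc)` on each item of `documents` instead of on the whole `corpus`.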