# Tokenization

* Tokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens. 
* These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy.
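
The granularities mentioned above can be illustrated without any library. This is a minimal sketch, not how NLTK tokenizes: the regular expression, the `subword_tokenize` helper, and the tiny `SUBWORD_VOCAB` are all assumptions made up for this example.

```python
import re

text = "Tokens can be words, subwords, or characters."

# Word-level: runs of word characters, with punctuation kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character (including spaces) is a token.
char_tokens = list(text)

# Hypothetical subword vocabulary, for illustration only.
SUBWORD_VOCAB = {"token", "ization", "word", "sub"}

def subword_tokenize(word, vocab):
    """Greedy longest-match split of a word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing matched: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

print(word_tokens)
print(subword_tokenize("tokenization", SUBWORD_VOCAB))  # ['token', 'ization']
print(subword_tokenize("subwords", SUBWORD_VOCAB))      # ['sub', 'word', 's']
```

Real subword tokenizers learn their vocabulary from data; the greedy matching here only shows the general idea of splitting unknown words into known pieces.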

In [2]:
# !pip install nltk

In [None]:
import nltk
nltk.download('punkt_tab')

In [4]:
paragraph = """
Tokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens. These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy. Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.
"""

sentence = nltk.sent_tokenize(paragraph)
print("sentence tokenizer: \n", sentence)

word = nltk.word_tokenize(paragraph)
print("word tokenizer: \n", word)

sentence tokenizer: 
 ['\nTokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens.', 'These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy.', 'Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.']
word tokenizer: 
 ['Tokenization', 'is', 'a', 'crucial', 'process', 'in', 'natural', 'language', 'processing', '(', 'NLP', ')', 'where', 'text', 'is', 'broken', 'down', 'into', 'smaller', 'units', 'called', 'tokens', '.', 'These', 'tokens', 'can', 'be', 'words', ',', 'subwords', ',', 'characters', ',', 'or', 'other', 'meaningful', 'elements', 'depending', 'on', 'the', 'tokenization', 'strategy', '.', 'Tokenization', 'is', 'a', 'fundamental', 'step', 'in', 'preparing', 'text', 'data', 'for', 'machine', 'learning', 'model

# Stemming

* Stemming is a text normalization technique in natural language processing (NLP) that reduces words to their root or base form, called the stem, by removing affixes (suffixes or prefixes). 
* The stem may not necessarily be a linguistically valid word.
* It is faster than lemmatization because it applies simple suffix-stripping rules instead of dictionary lookups.
* Applications: search engines, sentiment analysis, and text mining.
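
To make "simple rules" concrete, here is a toy suffix-stripping stemmer. It is a deliberately crude sketch and not the Porter algorithm: the `SUFFIXES` list, the `toy_stem` name, and the `min_stem` threshold are assumptions chosen for illustration.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ization", "ation", "ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["Tokenization", "processing", "called", "strategies", "is"]:
    print(w, "->", toy_stem(w))
```

Note that "strategies" becomes "strategi", which is not a dictionary word. Rule-based stemmers trade linguistic validity for speed in exactly this way.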

In [None]:
import nltk
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [12]:
# stopwords.words('english')

**Porter Stemmer**

In [11]:
paragraph = """
Tokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens. These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy. Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.
"""

sentence = nltk.sent_tokenize(paragraph)
print("sentence tokenizer: \n", sentence)
stemmer = PorterStemmer()

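# Build the stop-word set once instead of reloading the list for every word.
# The check is case-sensitive, so capitalized words such as "These" are not filtered.
stop_words = set(stopwords.words('english'))
for i in range(len(sentence)):
    words = nltk.word_tokenize(sentence[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentence[i] = " ".join(words)
print("After stemming: \n", sentence)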

sentence tokenizer: 
 ['\nTokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens.', 'These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy.', 'Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.']
After stemming: 
 ['token crucial process natur languag process ( nlp ) text broken smaller unit call token .', 'these token word , subword , charact , meaning element depend token strategi .', 'token fundament step prepar text data machin learn model , particularli task like text classif , languag model , translat , sentiment analysi .']


**Snowball Stemmer**

In [13]:
from nltk.stem.snowball import SnowballStemmer

In [15]:
paragraph = """
Tokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens. These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy. Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.
"""

sentence = nltk.sent_tokenize(paragraph)
print("sentence tokenizer: \n", sentence)
stemmer = SnowballStemmer('english')

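# Build the stop-word set once instead of reloading the list for every word.
# The check is case-sensitive, so capitalized words such as "These" are not filtered.
stop_words = set(stopwords.words('english'))
for i in range(len(sentence)):
    words = nltk.word_tokenize(sentence[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentence[i] = " ".join(words)
print("After stemming: \n", sentence)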

sentence tokenizer: 
 ['\nTokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens.', 'These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy.', 'Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.']
After stemming: 
 ['token crucial process natur languag process ( nlp ) text broken smaller unit call token .', 'these token word , subword , charact , meaning element depend token strategi .', 'token fundament step prepar text data machin learn model , particular task like text classif , languag model , translat , sentiment analysi .']


# Lemmatization

* Lemmatization is a text normalization technique in natural language processing (NLP) that reduces a word to its lemma, which is its dictionary form or base meaning. 
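* Unlike stemming, lemmatization considers the part of speech (POS) of a word, and ideally its context, to produce linguistically valid outputs.
* NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed, which is why verb forms such as "called" and "depending" are left unchanged in the output below.
* It is slower than stemming because it relies on dictionary lookups and morphological analysis.
* It depends on a dictionary and morphological rules to output a valid word.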
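
The dictionary-plus-rules idea, and the role of the part of speech, can be sketched in a few lines. This is a toy model only: the `LEXICON`, `EXCEPTIONS`, and `RULES` tables and the `lemmatize` function are hypothetical and far smaller than a real resource such as WordNet.

```python
# Hypothetical mini-lexicon: valid lemmas per part of speech ("n" noun, "v" verb).
LEXICON = {
    "n": {"meeting", "leaf", "token", "unit"},
    "v": {"meet", "leave", "run", "be"},
}
# Irregular forms are looked up directly.
EXCEPTIONS = {
    "n": {"leaves": "leaf"},
    "v": {"ran": "run", "is": "be", "left": "leave"},
}
# (suffix, replacement) rules, tried in order.
RULES = {
    "n": [("s", "")],
    "v": [("ing", ""), ("ing", "e"), ("es", "e"), ("s", ""), ("ed", "")],
}

def lemmatize(word, pos="n"):
    """Return a lemma that exists in the lexicon for `pos`, else the word itself."""
    word = word.lower()
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    if word in LEXICON[pos]:
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:  # only accept valid dictionary words
                return candidate
    return word

print(lemmatize("meeting", pos="n"))  # meeting
print(lemmatize("meeting", pos="v"))  # meet
print(lemmatize("leaves", pos="n"))   # leaf
print(lemmatize("leaves", pos="v"))   # leave
```

The same surface form maps to different lemmas depending on the POS, and every result is a word in the lexicon, which is the key difference from a stemmer.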

In [19]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [20]:
paragraph = """
Tokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens. These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy. Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.
"""

sentence = nltk.sent_tokenize(paragraph)
print("sentence tokenizer: \n", sentence)
lemmatizer = WordNetLemmatizer()

# Build the stop-word set once instead of reloading the list for every word.
# The check is case-sensitive, so capitalized words such as "These" are not filtered.
stop_words = set(stopwords.words('english'))
for i in range(len(sentence)):
    words = nltk.word_tokenize(sentence[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentence[i] = " ".join(words)
print("After lemmatization: \n", sentence)

sentence tokenizer: 
 ['\nTokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens.', 'These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy.', 'Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.']
After lemmatization: 
 ['Tokenization crucial process natural language processing ( NLP ) text broken smaller unit called token .', 'These token word , subwords , character , meaningful element depending tokenization strategy .', 'Tokenization fundamental step preparing text data machine learning model , particularly task like text classification , language modeling , translation , sentiment analysis .']


# Bag of Words

* The Bag of Words (BoW) model transforms text into a numerical format that machine learning models can process by counting word occurrences, ignoring grammar, word order, and context.
* BoW is the special case of the n-gram model with n=1 (unigrams).
* A unigram is an n-gram with n=1: it considers only individual words, without looking at the surrounding words.
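
Before using a vectorizer, it may help to see the counting done by hand. The following is a minimal sketch with two made-up documents; the `ngrams` helper and the whitespace tokenization are simplifying assumptions and do not reproduce every preprocessing step of scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = ["text is broken into tokens", "tokens can be words"]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct unigram, sorted so that column order is stable.
vocab = sorted({t for toks in tokenized for t in toks})

# Bag of Words matrix: one row per document, one column per vocabulary term.
matrix = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

print(vocab)      # ['be', 'broken', 'can', 'into', 'is', 'text', 'tokens', 'words']
print(matrix[0])  # [0, 1, 0, 1, 1, 1, 1, 0]
print(matrix[1])  # [1, 0, 1, 0, 0, 0, 1, 1]
print(ngrams(tokenized[1], 2))  # ['tokens can', 'can be', 'be words']
```

With n=1 the helper returns the tokens themselves, which is why BoW is described as the unigram case. Word order is lost in the matrix; bigrams recover a little of it at the cost of a larger vocabulary.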

In [21]:
import re
import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

In [28]:
paragraph = """
Tokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens. These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy. Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.
"""
# text cleaning
sentence = nltk.sent_tokenize(paragraph)
print("sentence tokenizer: \n", sentence)
lemmatizer = WordNetLemmatizer()
corpus = []
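# Build the stop-word set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))
for i in range(len(sentence)):
    # Note: '^[a-zA-Z]' matches only a single letter at the very start of the string,
    # so it strips the first letter of sentences 2 and 3 ('hese', 'okenization' below).
    # To drop punctuation and digits instead, use re.sub('[^a-zA-Z]', ' ', sentence[i]).
    review = re.sub('^[a-zA-Z]', "", sentence[i])
    review = review.lower().split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    sentence[i] = " ".join(review)
    corpus.append(sentence[i])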
print("corpus: \n", corpus)

# vectorization
cv = CountVectorizer()
output = cv.fit_transform(corpus).toarray()
print("Features: ", cv.get_feature_names_out())
print("*"*100)
print("Vocabulary:", cv.vocabulary_)
print("*"*100)
print("Bag of Words: \n", output)

sentence tokenizer: 
 ['\nTokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens.', 'These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy.', 'Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.']
corpus: 
 ['tokenization crucial process natural language processing (nlp) text broken smaller unit called tokens.', 'hese token words, subwords, characters, meaningful element depending tokenization strategy.', 'okenization fundamental step preparing text data machine learning models, particularly task like text classification, language modeling, translation, sentiment analysis.']
Features:  ['analysis' 'broken' 'called' 'characters' 'classification' 'crucial'
 'data' 'depending' 'element' 'fundamental' 'hese' 'languag
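
The corpus output above contains 'hese' and 'okenization' because the pattern `'^[a-zA-Z]'` removes only a single leading letter. A sketch of a cleaning step that instead replaces every non-letter character with a space, assuming that was the intent, could look like this. The `clean` function and its small stop-word set are hypothetical and stand in for the NLTK stop-word list.

```python
import re

def clean(sentence, stop_words=frozenset()):
    """Keep letters only, lowercase, and drop stop words."""
    # '[^a-zA-Z]' matches any character that is NOT a letter.
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    return " ".join(w for w in letters_only.lower().split() if w not in stop_words)

# Small illustrative stop-word set (an assumption, not the NLTK list).
demo_stop_words = {"these", "can", "be", "or", "other"}

print(clean("These tokens can be words, subwords, or other elements.", demo_stop_words))
# tokens words subwords elements
```

Lowercasing before the stop-word check also means capitalized stop words such as "These" are filtered.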

In [29]:
# bigram n=2
cv = CountVectorizer(ngram_range=(2,2))
output = cv.fit_transform(corpus).toarray()
print("Features: ", cv.get_feature_names_out())
print("*"*100)
print("Vocabulary:", cv.vocabulary_)
print("*"*100)
print("Bag of Words: \n", output)

Features:  ['broken smaller' 'called tokens' 'characters meaningful'
 'classification language' 'crucial process' 'data machine'
 'depending tokenization' 'element depending' 'fundamental step'
 'hese token' 'language modeling' 'language processing' 'learning models'
 'like text' 'machine learning' 'meaningful element'
 'modeling translation' 'models particularly' 'natural language'
 'nlp text' 'okenization fundamental' 'particularly task' 'preparing text'
 'process natural' 'processing nlp' 'sentiment analysis' 'smaller unit'
 'step preparing' 'subwords characters' 'task like' 'text broken'
 'text classification' 'text data' 'token words' 'tokenization crucial'
 'tokenization strategy' 'translation sentiment' 'unit called'
 'words subwords']
****************************************************************************************************
Vocabulary: {'tokenization crucial': 34, 'crucial process': 4, 'process natural': 23, 'natural language': 18, 'language processing': 11, 'processi

# TF-IDF

* TF-IDF is a refinement of BoW that weights each term by how important it is to a document relative to the entire corpus.

Formula:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

where:

* TF(t, d): term frequency (how often term t appears in document d).
* IDF(t): inverse document frequency (how rare term t is across all documents).
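
The formula can be worked through by hand. The sketch below assumes the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, `idf = ln((1 + N) / (1 + df)) + 1`, with raw counts as TF and L2 normalization of each row; other definitions of IDF exist. The `tfidf` function and the two toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (vocab, idf, rows) using smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid division by zero
        rows.append([v / norm for v in row])
    return vocab, idf, rows

docs = [["token", "text"], ["token", "word"]]
vocab, idf, rows = tfidf(docs)
print(vocab)                                     # ['text', 'token', 'word']
print({t: round(w, 3) for t, w in idf.items()})  # 'token' gets the lowest IDF
print([[round(v, 3) for v in row] for row in rows])
```

"token" appears in both documents, so its IDF is 1.0, the minimum under this formula, while "text" and "word" each appear in one document and are weighted higher.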



In [30]:
import re
import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [35]:
paragraph = """
Tokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens. These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy. Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.
"""
# text cleaning
sentence = nltk.sent_tokenize(paragraph)
print("sentence tokenizer: \n", sentence)
lemmatizer = WordNetLemmatizer()
corpus = []
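# Build the stop-word set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))
for i in range(len(sentence)):
    # Note: '^[a-zA-Z]' matches only a single letter at the very start of the string,
    # so it strips the first letter of sentences 2 and 3 ('hese', 'okenization' below).
    # To drop punctuation and digits instead, use re.sub('[^a-zA-Z]', ' ', sentence[i]).
    review = re.sub('^[a-zA-Z]', "", sentence[i])
    review = review.lower().split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    sentence[i] = " ".join(review)
    corpus.append(sentence[i])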
print("corpus: \n", corpus)

# vectorization
cv = TfidfVectorizer()
output = cv.fit_transform(corpus).toarray()
print("Features: ", cv.get_feature_names_out(), len(cv.get_feature_names_out()))
print("*"*100)
print("Vocabulary:", cv.vocabulary_)
print("*"*100)
print("TFIDF: \n", output, output.shape)
print("*"*100)
print("IDF: \n", cv.idf_)

sentence tokenizer: 
 ['\nTokenization is a crucial process in natural language processing (NLP) where text is broken down into smaller units called tokens.', 'These tokens can be words, subwords, characters, or other meaningful elements depending on the tokenization strategy.', 'Tokenization is a fundamental step in preparing text data for machine learning models, particularly for tasks like text classification, language modeling, translation, and sentiment analysis.']
corpus: 
 ['tokenization crucial process natural language processing (nlp) text broken smaller unit called tokens.', 'hese token words, subwords, characters, meaningful element depending tokenization strategy.', 'okenization fundamental step preparing text data machine learning models, particularly task like text classification, language modeling, translation, sentiment analysis.']
Features:  ['analysis' 'broken' 'called' 'characters' 'classification' 'crucial'
 'data' 'depending' 'element' 'fundamental' 'hese' 'languag