<a href="https://colab.research.google.com/github/glitcher007/NLP/blob/main/NLP_preprocessing2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tokenization is the process of breaking down a text into individual units, such as words or subwords, known as tokens. Tokenization is a crucial step in natural language processing (NLP) as it serves as the foundation for many downstream tasks.
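
To make the idea concrete, here is a minimal sketch of word-level tokenization using only Python's standard re module. The function name simple_tokenize and the regular expression are illustrative assumptions. Library tokenizers such as those in NLTK or spaCy handle many more cases, for example contractions and abbreviations.

```python
import re

def simple_tokenize(text):
    # Match either a run of word characters or a single character that is
    # neither a word character nor whitespace (punctuation becomes a token).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokenization is an essential step in NLP.")
print(tokens)
# ['Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.']
```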

In Python, the Natural Language Toolkit (NLTK) and the spaCy library are popular choices for tokenization. I'll provide examples for both.

In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # Download the punkt tokenizer (if not already downloaded)

def tokenize_sentence(text):
    sentences = sent_tokenize(text)
    return sentences

def tokenize_word(text):
    words = word_tokenize(text)
    return words

# Example usage:
text = "Tokenization is an essential step in NLP. It breaks text into individual units, such as words or subwords."
sentence_tokens = tokenize_sentence(text)
word_tokens = tokenize_word(text)

print("Sentence Tokens:", sentence_tokens)
print("\nWord Tokens:", word_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Sentence Tokens: ['Tokenization is an essential step in NLP.', 'It breaks text into individual units, such as words or subwords.']

Word Tokens: ['Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.', 'It', 'breaks', 'text', 'into', 'individual', 'units', ',', 'such', 'as', 'words', 'or', 'subwords', '.']


# spaCy Tokenization

In [2]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

def spacy_tokenize(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    return tokens

# Example usage:
text = "Tokenization is an essential step in NLP. It breaks text into individual units, such as words or subwords."
spacy_tokens = spacy_tokenize(text)

print("spaCy Tokens:", spacy_tokens)


spaCy Tokens: ['Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.', 'It', 'breaks', 'text', 'into', 'individual', 'units', ',', 'such', 'as', 'words', 'or', 'subwords', '.']


# Stemming

Stemming is a text normalization technique in natural language processing (NLP) that involves reducing words to their root or base form, known as the "stem." The idea is to remove suffixes from words so that similar words map to the same stem.
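
As a rough illustration of how suffix removal maps related words to one stem, the sketch below strips a handful of English suffixes. The suffix list and the name toy_stem are hypothetical and chosen only for this example. Real stemmers such as the Porter Stemmer apply ordered rule sets with additional conditions.

```python
# Hypothetical suffix list for illustration only.
SUFFIXES = ("ation", "ing", "ion", "ed", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping a stem of at least 3 letters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connects"]:
    print(w, "->", toy_stem(w))
# Each of the four words is reduced to "connect"
```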

In Python, the Natural Language Toolkit (NLTK) provides a popular stemming module. One common stemming algorithm is the Porter Stemmer. Here's an example using NLTK:

In [3]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the punkt tokenizer (if not already downloaded)

def stem_text(text):
    porter_stemmer = PorterStemmer()
    words = word_tokenize(text)
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text

# Example usage:
text = "Stemming is an essential technique for natural language processing. It helps to reduce words to their base form."
stemmed_text = stem_text(text)
print("Original Text:", text)
print("\nStemmed Text:", stemmed_text)


Original Text: Stemming is an essential technique for natural language processing. It helps to reduce words to their base form.

Stemmed Text: stem is an essenti techniqu for natur languag process . it help to reduc word to their base form .


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In this example, the PorterStemmer from NLTK is used to perform stemming on the input text. Note that stemming may produce a stem that is not a valid word in the language. In the output above, "essential" becomes "essenti" and "technique" becomes "techniqu". In other cases the stem happens to be a real word: "running" is stemmed to "run".

Stemming is useful for some tasks but is not suitable for every application. In some cases, lemmatization (reducing words to their base or dictionary form) may be preferred, as it tends to produce valid words.

Adjust the text normalization techniques based on your specific NLP task and the requirements of your text processing pipeline.

In [6]:
!pip install indic-nlp-library




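For Hindi, you can build on the Indic NLP Library for Python, which is designed for processing Indian languages, including Hindi. Its tools include text normalization, tokenization, and transliteration.

The cell below is a minimal sketch rather than a full stemmer: it tokenizes with the library's indic_tokenize.trivial_tokenize and then strips a few common Hindi suffixes. The suffix list and the helper hindi_stem_word are illustrative assumptions and are not part of the Indic NLP Library.

In [8]:
from indicnlp.tokenize import indic_tokenize

# Illustrative suffix list (an assumption, not provided by the library);
# sorted longest first so the longest matching suffix is removed.
HINDI_SUFFIXES = sorted(["ियों", "ाओं", "ों", "ें", "ीं", "ी", "े", "ा"], key=len, reverse=True)

def hindi_stem_word(word):
    for suffix in HINDI_SUFFIXES:
        # Keep at least two characters of the word as the stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

def hindi_stem_text(text):
    # Split the sentence into tokens, then stem each token
    tokens = indic_tokenize.trivial_tokenize(text, lang='hi')
    return ' '.join(hindi_stem_word(token) for token in tokens)

# Example usage:
text_hindi = "हिन्दी भाषा में स्टेमिंग का उदाहरण।"
stemmed_text_hindi = hindi_stem_text(text_hindi)
print("Original Text (Hindi):", text_hindi)
print("\nStemmed Text (Hindi):", stemmed_text_hindi)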

Lemmatization is the process of reducing words to their base or dictionary form, known as the "lemma." Unlike stemming, which strips suffixes by rule, lemmatization uses vocabulary and morphological analysis to transform each word into a valid word that exists in the language.
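
The contrast with suffix stripping can be sketched with a tiny lookup table. The table and the name toy_lemmatize are assumptions made for illustration. Real lemmatizers consult a full lexicon such as WordNet and usually take part-of-speech information into account.

```python
# Tiny hand-written lookup table (hypothetical, for illustration only).
LEMMA_TABLE = {
    "is": "be", "was": "be", "were": "be",
    "better": "good", "mice": "mouse", "ran": "run",
}

def toy_lemmatize(word):
    # Fall back to the lowercased word when it is not in the table
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

print([toy_lemmatize(w) for w in ["Mice", "were", "better", "runners"]])
# ['mouse', 'be', 'good', 'runners']
```

Irregular forms such as "mice" or "better" cannot be recovered by removing a suffix, which is why a lemmatizer needs dictionary knowledge.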

In Python, the Natural Language Toolkit (NLTK) and spaCy are popular libraries that provide lemmatization functionality.

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the punkt tokenizer (if not already downloaded)
nltk.download('wordnet')  # Download the WordNet database (if not already downloaded)

def lemmatize_text_nltk(text):
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    lemmatized_text = ' '.join(lemmatized_words)
    return lemmatized_text

# Example usage:
text = "Lemmatization is important for natural language processing. It helps to reduce words to their base form."
lemmatized_text_nltk = lemmatize_text_nltk(text)
print("Original Text:", text)
print("\nLemmatized Text (NLTK):", lemmatized_text_nltk)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Original Text: Lemmatization is important for natural language processing. It helps to reduce words to their base form.

Lemmatized Text (NLTK): Lemmatization is important for natural language processing . It help to reduce word to their base form .


In [10]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

def lemmatize_text_spacy(text):
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text

# Example usage:
text = "Lemmatization is important for natural language processing. It helps to reduce words to their base form."
lemmatized_text_spacy = lemmatize_text_spacy(text)
print("Original Text:", text)
print("\nLemmatized Text (spaCy):", lemmatized_text_spacy)


Original Text: Lemmatization is important for natural language processing. It helps to reduce words to their base form.

Lemmatized Text (spaCy): lemmatization be important for natural language processing . it help to reduce word to their base form .
