<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/NLP_basics/01_NLP_text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Preprocessing


![](https://devopedia.org/images/article/293/1027.1608556695.png)



In [None]:
import nltk
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
def clean_text(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, and lemmatize."""
    # 1. Lowercasing
    text = text.lower()

    # 2. Remove special characters and punctuation
    #    (this also fuses "can't" into "cant" and "10:00" into "1000")
    text = re.sub(r'[^\w\s]', '', text)

    # 3. Tokenization
    tokens = nltk.word_tokenize(text)

    # 4. Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]

    # 5. Stemming or lemmatization (choose one)
    # Stemming (disabled here):
    # stemmer = PorterStemmer()
    # tokens = [stemmer.stem(w) for w in tokens]

    # Lemmatization (without a POS tag, every token is treated as a noun)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]

    # 6. Remove numbers (if needed)
    # tokens = [w for w in tokens if not w.isdigit()]

    # 7. Join tokens back into a string
    cleaned_text = " ".join(tokens)

    return cleaned_text

email_text = """
Hi Team,

Just a quick reminder about our project update meeting tomorrow at 10:00 AM in the conference room. Please come prepared with your progress reports and any questions you might have. If you can't attend, let me know in advance.

Looking forward to seeing everyone there!

Best,
John
"""

print("Original Message")
print(email_text)
print("="*20)

cleaned_text = clean_text(email_text)
print("Cleaned Message")
print("="*20)
print(cleaned_text)
print("-"*20)


Original Message

Hi Team,

Just a quick reminder about our project update meeting tomorrow at 10:00 AM in the conference room. Please come prepared with your progress reports and any questions you might have. If you can't attend, let me know in advance.

Looking forward to seeing everyone there!

Best,
John

====================
Cleaned Message
====================
hi team quick reminder project update meeting tomorrow 1000 conference room please come prepared progress report question might cant attend let know advance looking forward seeing everyone best john
--------------------
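
The cleaned message above contains `cant` and `1000` because punctuation is removed before tokenization: the apostrophe in "can't" and the colon in "10:00" are deleted, and the remaining characters fuse into a single token that no longer matches any stop word. The sketch below isolates that step using the same regular expression as `clean_text`; the helper name `strip_punctuation` and the sample strings are illustrative and not part of the notebook.

```python
import re


def strip_punctuation(text: str) -> str:
    """Drop every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)


samples = ["can't attend", "10:00 AM", "well-known"]
stripped = [strip_punctuation(s.lower()) for s in samples]

for before, after in zip(samples, stripped):
    print(f"{before!r} -> {after!r}")
```

Whether this matters depends on the task. If contractions, times, or hyphenated terms carry meaning, one option is to tokenize first and filter out punctuation tokens afterwards.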


### Stemming and Lemmatization

Stemming and lemmatization are text normalization techniques used in Natural Language Processing (NLP) to reduce words to a base or root form.

**Stemming**

Stemming strips suffixes from a word using heuristic rules, without consulting a dictionary. For example, "running" becomes "run" and "studies" becomes "studi".

Pros:

- Speed: Stemming is generally faster because it applies simple string rules.
- Simplicity: Easy to implement and understand.

Cons:

- Accuracy: Can produce stems that are not real words (e.g., "studies" becomes "studi").
- Context ignorance: Does not consider context or part of speech, so unrelated words can collapse to the same stem and related words can be missed.
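
To make the trade-off concrete, here is a toy suffix-stripping stemmer. It is a deliberately simplified sketch and not the Porter algorithm; the rule list and the name `toy_stem` are assumptions made for illustration. It shows why rule-based stemming is fast and simple, and also why it can return non-words and cannot relate irregular forms.

```python
# Toy suffix rules, tried in order: (suffix to remove, replacement).
# Illustrative only; real stemmers such as Porter use many more rules.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_stem(word: str) -> str:
    """Apply the first matching rule, keeping at least 3 leading characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


words = ["studies", "jumped", "jumping", "cats", "better"]
stems = [toy_stem(w) for w in words]
print(stems)
```

Here "studies" turns into the non-word "studi", and "better" is left unchanged because no suffix rule can connect it to "good".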

**Lemmatization**

Lemmatization reduces words to their base or dictionary form (lemma) using a vocabulary lookup and the word's part of speech. For example, "running" (verb) becomes "run" and "better" (adjective) becomes "good".

Pros:

- Accuracy: More accurate, since the result is a valid dictionary word.
- Context awareness: Uses the part of speech (when it is supplied) to pick the right lemma.

Cons:

- Speed: Slower than stemming because it involves looking up words in a dictionary.
- Complexity: More complex to implement and requires more computational resources, including a lexical database such as WordNet.

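
The role of the part of speech can be sketched with a toy lookup table keyed by `(word, pos)`. The table entries and the function name `toy_lemmatize` are hypothetical and only stand in for a real lexical database. NLTK's `WordNetLemmatizer.lemmatize` works along similar lines in that it accepts a `pos` argument, which defaults to noun (`"n"`).

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("meeting", "v"): "meet",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))   # adjective reading
print(toy_lemmatize("better"))            # default noun reading: no entry
print(toy_lemmatize("meeting", pos="v"))  # verb reading
print(toy_lemmatize("meeting", pos="n"))  # noun reading: already a lemma
```

The same surface form can map to different lemmas depending on the tag, which is why a lemmatizer called without part-of-speech information may leave words such as "better" untouched.
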

**Why Are They Used?**

Both stemming and lemmatization are used to:

- Reduce vocabulary size: Converting words to a common root form shrinks the vocabulary, making the text easier to analyze and model.
- Improve model performance: Standardizing word forms can help NLP models generalize better.
- Enhance text analysis: Grouping inflected forms of the same word together makes counts and comparisons more accurate.
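
The effect on vocabulary size can be measured by counting distinct tokens before and after normalization. In this sketch, `vocabulary_size` and `strip_plural_s` are hypothetical helpers; `strip_plural_s` is only a stand-in, and any stemmer or lemmatizer could be passed as the `normalize` callable instead.

```python
from typing import Callable, Iterable


def vocabulary_size(tokens: Iterable[str], normalize: Callable[[str], str] = str) -> int:
    """Count distinct tokens after applying `normalize` to each one."""
    return len({normalize(t) for t in tokens})


def strip_plural_s(token: str) -> str:
    """Stand-in for a stemmer: lowercase and drop one trailing 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


tokens = ["Report", "reports", "question", "Questions", "meeting", "meetings"]
raw = vocabulary_size(tokens)
normalized = vocabulary_size(tokens, strip_plural_s)
print(f"raw vocabulary: {raw}, normalized vocabulary: {normalized}")
```

Six surface forms collapse to three entries here; on a real corpus the reduction depends on the language and on the normalizer used.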


**Stemming**

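NLTK offers several stemming algorithms, including the Porter stemmer (`PorterStemmer`) and the Snowball stemmer (`SnowballStemmer`). The example below uses the Porter stemmer.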

In [None]:
sentence = "This function implements the Bag of Words (BoW) model."

tokens = nltk.word_tokenize(sentence)
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]
print(stemmed_words)

['thi', 'function', 'implement', 'the', 'bag', 'of', 'word', '(', 'bow', ')', 'model', '.']
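
For comparison, the Snowball stemmer can be run side by side with the Porter stemmer. This is a minimal sketch assuming NLTK is installed; the word list is an arbitrary choice for illustration. Neither stemmer needs a corpus download when used this way.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball requires a language name

words = ["running", "studies", "fairly", "generously"]

# Map each word to its (Porter stem, Snowball stem) pair.
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in words}

for word, (p, s) in comparison.items():
    print(f"{word:12} porter={p:10} snowball={s}")
```

The two stemmers agree on many words but may differ on some suffixes, so it is worth checking their output on a sample of your own text before committing to one.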


**Lemmatization**

In [None]:
sentence = "This function implements the Bag of Words (BoW) model."

tokens = nltk.word_tokenize(sentence)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in tokens]
print(lemmatized_tokens)

['This', 'function', 'implement', 'the', 'Bag', 'of', 'Words', '(', 'BoW', ')', 'model', '.']
