<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/NLP_basics/01_NLP_text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Preprocessing


![](https://devopedia.org/images/article/293/1027.1608556695.png)



In [58]:
import nltk
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
def clean_text(text):
    # 1. Lowercasing
    text = text.lower()

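    # 2. Remove special characters and punctuation
    # Note: apostrophes and colons are stripped too, so "can't" becomes "cant"
    # and "10:00" becomes "1000" before tokenization
    text = re.sub(r'[^\w\s]', '', text)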

    # 3. Tokenization
    tokens = nltk.word_tokenize(text)

    # 4. Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]

    # 5. Stemming or Lemmatization (choose one)
    # Stemming (disabled; uncomment both lines to use it instead of lemmatization)
    # stemmer = PorterStemmer()
    # tokens = [stemmer.stem(w) for w in tokens]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]

    # 6. Remove numbers (if needed)
    # tokens = [w for w in tokens if not w.isdigit()]

    # 7. Join tokens back into a string
    cleaned_text = " ".join(tokens)

    return cleaned_text

email_text = """
Hi Team,

Just a quick reminder about our project update meeting tomorrow at 10:00 AM in the conference room. Please come prepared with your progress reports and any questions you might have. If you can't attend, let me know in advance.

Looking forward to seeing everyone there!

Best,
John
"""

print("Original Message")
print(email_text)
print("="*20)

cleaned_text = clean_text(email_text)
print("Cleaned Message")
print("="*20)
print(cleaned_text)
print("-"*20)


Original Message

Hi Team,

Just a quick reminder about our project update meeting tomorrow at 10:00 AM in the conference room. Please come prepared with your progress reports and any questions you might have. If you can't attend, let me know in advance.

Looking forward to seeing everyone there!

Best,
John

Cleaned Message
hi team quick reminder project update meeting tomorrow 1000 conference room please come prepared progress report question might cant attend let know advance looking forward seeing everyone best john
--------------------


### Stemming and Lemmatization

Stemming and lemmatization are text normalization techniques used in Natural Language Processing (NLP) to reduce words to their base or root form.

**Stemming**

Stemming reduces a word to a stem by stripping suffixes with simple rules, without checking that the result is a dictionary word.

Pros:

- Speed: Stemming is generally faster because it uses simple rules.
- Simplicity: Easy to implement and understand.

Cons:

- Accuracy: Can be less accurate as it may produce non-existent words (e.g., "studies" becomes "studi").
- Context Ignorance: Does not consider the context or part of speech, leading to potential errors.

**Lemmatization**

Lemmatization reduces words to their base or dictionary form (lemma) by considering the context and part of speech. For example, "running" becomes "run" when treated as a verb, and "better" becomes "good" when treated as an adjective. NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed through its `pos` argument.

Pros:

- Accuracy: More accurate as it produces valid words.
- Context Awareness: Considers the context and part of speech, leading to better results.

Cons:

- Speed: Slower compared to stemming because it involves looking up words in a dictionary.
- Complexity: More complex to implement and requires more computational resources.

**Why are they used?**

Both stemming and lemmatization are used to:

- Reduce Vocabulary Size: By converting words to their root form, the vocabulary size is reduced, making it easier to analyze and model the text.
- Improve Model Performance: Standardizing words can help NLP models generalize better.
- Enhance Text Analysis: Grouping similar words together makes text analysis more accurate.
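
To make the contrast concrete, here is a deliberately tiny sketch. The suffix list and the lookup table are hypothetical stand-ins invented for this illustration; they are not the Porter algorithm or WordNet, and real stemmers apply many more rules.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ies", "ing", "es", "s")  # hypothetical rule list


def toy_stem(word):
    """Strip the first matching suffix; no check that the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary: (word, part of speech) -> lemma
LEMMAS = {
    ("studies", "n"): "study",
    ("running", "v"): "run",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("studies", "n"), ("running", "v"), ("better", "a")]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word, pos)}")
```

The rule-based function can return fragments such as "runn", while the lookup returns real words but only when it is given the right part of speech.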


**Stemming**

NLTK provides several stemmers, including the Porter stemmer and the Snowball stemmer. The example below uses the Porter stemmer.

In [24]:
sentence = "This function implements the Bag of Words (BoW) model."

tokens = nltk.word_tokenize(sentence)
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]
print(stemmed_words)

['thi', 'function', 'implement', 'the', 'bag', 'of', 'word', '(', 'bow', ')', 'model', '.']


**Lemmatization**

In [23]:
sentence = "This function implements the Bag of Words (BoW) model."

tokens = nltk.word_tokenize(sentence)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in tokens]
print(lemmatized_tokens)

['This', 'function', 'implement', 'the', 'Bag', 'of', 'Words', '(', 'BoW', ')', 'model', '.']


# Feature Extraction


Feature extraction in Natural Language Processing (NLP) involves converting text data into numerical representations that can be used by machine learning models. Here are some well-known techniques:

1. **Bag of Words (BoW)**: This technique involves tokenizing the text and creating a vocabulary of all unique words. Each document is then represented as a vector indicating the presence or absence (or frequency) of these words.

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**: This method adjusts the frequency of words by how often they appear in all documents, giving more importance to words that are unique to a document.

3. **N-grams**: This technique involves creating combinations of 'n' consecutive words or characters from the text, capturing the context and order of words.

4. **Word Embeddings**: Techniques like Word2Vec, GloVe, and FastText create dense vector representations of words, capturing semantic relationships between them.

5. **Doc2Vec**: An extension of Word2Vec, this technique generates vector representations for entire documents, preserving the context of words within the document.

6. **Named Entity Recognition (NER)**: This method identifies and classifies key information (entities) in the text, such as names of people, organizations, locations, etc.

7. **Part-of-Speech (POS) Tagging**: This involves labeling words with their respective parts of speech (nouns, verbs, adjectives, etc.), which can be useful for syntactic analysis.

8. **Latent Dirichlet Allocation (LDA)**: A topic modeling technique that discovers abstract topics within a collection of documents.

9. **Dependency Parsing**: This technique analyzes the grammatical structure of a sentence, identifying relationships between "head" words and words which modify those heads.

10. **BERT** (Bidirectional Encoder Representations from Transformers): A state-of-the-art technique that uses transformers to generate context-aware embeddings for words in a sentence.

These techniques are fundamental in various NLP tasks such as text classification, sentiment analysis, and information retrieval.
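
As a small illustration of item 3, word-level n-grams can be produced with a sliding window over the token list. This is a minimal sketch; the helper name `ngrams` is chosen here for illustration, and the input is assumed to be already tokenized.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this is the first document".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence with fewer than `n` tokens yields an empty list, since no full window fits.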

## Bag of Words

In [1]:
import re
from collections import Counter

def bag_of_words(text):

  # Preprocessing: Convert to lowercase and remove punctuation
  text = text.lower()
  text = re.sub(r'[^\w\s]', '', text)

  # Tokenization: Split the text into words
  words = text.split()

  # Count word frequencies
  word_counts = Counter(words)

  return dict(word_counts)

# Example Usage
text = """
# ## Feature extraction
Feature extraction in Natural Language Processing (NLP) involves converting text data into numerical representations that can be used by machine learning models.
"""

bow = bag_of_words(text)
print(bow)


# Real-life Application: Spam Detection

# Assume we have a list of emails, some spam and some not spam
emails = [
  "Free money! Click here now!", # Spam
  "Meeting reminder: Project X discussion", # Not Spam
  "Congratulations! You won a prize!", # Spam
  "Check out our latest product updates" # Not Spam
]

# Create BoW representations for each email
email_bows = [bag_of_words(email) for email in emails]


# Spam detection logic: We can use the BoW to detect spam by finding words frequently occurring in spam emails.
spam_keywords = ["free", "money", "prize", "click", "won"]
for i, bow in enumerate(email_bows):
    spam_score = 0
    for word in spam_keywords:
        spam_score += bow.get(word, 0) # Check if the word exists in BoW, if not default to 0
    print(f"Email {i+1}: Spam score = {spam_score}")

# Based on the spam score, we can classify the email as spam or not spam using some threshold

{'feature': 2, 'extraction': 2, 'in': 1, 'natural': 1, 'language': 1, 'processing': 1, 'nlp': 1, 'involves': 1, 'converting': 1, 'text': 1, 'data': 1, 'into': 1, 'numerical': 1, 'representations': 1, 'that': 1, 'can': 1, 'be': 1, 'used': 1, 'by': 1, 'machine': 1, 'learning': 1, 'models': 1}
Email 1: Spam score = 3
Email 2: Spam score = 0
Email 3: Spam score = 2
Email 4: Spam score = 0
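
The thresholding step mentioned in the last comment could look like the following. The threshold of 2 and the helper names `spam_score` and `is_spam` are assumptions made for illustration; a real filter would learn word weights from labeled data rather than use a fixed keyword list.

```python
import re
from collections import Counter

SPAM_KEYWORDS = {"free", "money", "prize", "click", "won"}
THRESHOLD = 2  # hypothetical cut-off, chosen only for this example


def spam_score(text):
    """Count occurrences of spam keywords in a lowercased, punctuation-free text."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    counts = Counter(words)
    return sum(counts[word] for word in SPAM_KEYWORDS)


def is_spam(text, threshold=THRESHOLD):
    """Flag the text as spam when its keyword score reaches the threshold."""
    return spam_score(text) >= threshold


for email in ["Free money! Click here now!", "Meeting reminder: Project X discussion"]:
    print(email, "->", "spam" if is_spam(email) else "not spam")
```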


In [4]:
bow

{'check': 1, 'out': 1, 'our': 1, 'latest': 1, 'product': 1, 'updates': 1}

In [11]:
text = """
# ## Feature extraction
Feature extraction in Natural Language Processing (NLP) involves converting text data into numerical representations that can be used by machine learning models.
"""

## Clean the text: replace every non-alphanumeric character with a space
non_alphanumeric = r"[^a-zA-Z0-9]"
# Use a new name so the clean_text() function defined earlier is not overwritten
cleaned = re.sub(non_alphanumeric, " ", text).strip()

words = cleaned.split()
bag = {}
for word in words:
    word = word.lower()
    bag[word] = bag.get(word, 0) + 1
print(bag)

{'feature': 2, 'extraction': 2, 'in': 1, 'natural': 1, 'language': 1, 'processing': 1, 'nlp': 1, 'involves': 1, 'converting': 1, 'text': 1, 'data': 1, 'into': 1, 'numerical': 1, 'representations': 1, 'that': 1, 'can': 1, 'be': 1, 'used': 1, 'by': 1, 'machine': 1, 'learning': 1, 'models': 1}
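
The dictionaries above hold counts for one text at a time. To feed several documents to a model, each one is usually mapped onto a shared vocabulary so that all vectors have the same length. A minimal sketch, assuming whitespace-tokenized input; `bow_vectors` and the two toy documents are made up for this example.

```python
from collections import Counter


def bow_vectors(docs):
    """Return (sorted vocabulary, one count vector per document)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


docs = ["free money free prize", "project meeting reminder"]  # toy documents
vocab, vectors = bow_vectors(docs)
print(vocab)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to the word at the same position in `vocab`, so documents can be compared column by column.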


### TF-IDF

- Representation: TF-IDF adjusts the word counts by considering the importance of each word in the context of the entire corpus.
- Term Frequency (TF): Measures how frequently a word appears in a document.
- Inverse Document Frequency (IDF): Measures how important a word is by considering how common or rare it is across all documents.

<img src="https://www.researchgate.net/publication/376247075/figure/fig2/AS:11431281209841725@1701888441866/TF-IDFTerm-Frequency-Inverse-Document-Frequency_W640.jpg" height=200/>
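
For reference, the textbook form of these quantities can be written as below. The symbols are notation introduced here: $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. Implementations often use smoothed or otherwise modified versions of the IDF term, and the manual steps in this notebook do so as well.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```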

### Step 1. Load the documents

In [25]:
# Sample documents
documents = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document"
]

### Step 2. Split the documents

Collect every unique word across all the documents into a vocabulary, assuming the text has already been preprocessed.

In [35]:
vocabulary = set()
for doc in documents:
    vocabulary.update(doc.split())

vocabulary

{'and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'}

### Step 3. Calculate the term frequency

Calculate how often each word occurs in each document separately.

In [55]:
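def word_count(sentence):
    """Count how many times each word occurs in a sentence."""
    word_freq = {}
    for word in sentence.split():
        word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq


def calculate_tf(word_freq):
    """Term frequency: each word's count divided by the total number of words."""
    total_words = sum(word_freq.values())
    return {word: count / total_words for word, count in word_freq.items()}

s = "calculate the frequency of words occurring for each doc separately"
word_freq = word_count(s)
print(word_freq)

tf = calculate_tf(word_freq)
print(tf)

{'calculate': 1, 'the': 1, 'frequency': 1, 'of': 1, 'words': 1, 'occurring': 1, 'for': 1, 'each': 1, 'doc': 1, 'separately': 1}
{'calculate': 0.1, 'the': 0.1, 'frequency': 0.1, 'of': 0.1, 'words': 0.1, 'occurring': 0.1, 'for': 0.1, 'each': 0.1, 'doc': 0.1, 'separately': 0.1}


In [57]:
# Term frequencies per document, keyed by document index and then by word
tf = {}
for i, doc in enumerate(documents):
    word_freq = word_count(doc)
    tf[i] = calculate_tf(word_freq)

print(tf)

{0: {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}, 1: {'this': 0.16666666666666666, 'document': 0.3333333333333333, 'is': 0.16666666666666666, 'the': 0.16666666666666666, 'second': 0.16666666666666666}, 2: {'and': 0.16666666666666666, 'this': 0.16666666666666666, 'is': 0.16666666666666666, 'the': 0.16666666666666666, 'third': 0.16666666666666666, 'one': 0.16666666666666666}, 3: {'is': 0.2, 'this': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}}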


### Step 4. Calculate Inverse Document Frequency (IDF)

In [61]:
doc_word_freq = {}
for i,doc in enumerate(documents):
    word_freq = word_count(doc)
    doc_word_freq[i] = word_freq

doc_word_freq

{0: {'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1},
 1: {'this': 1, 'document': 2, 'is': 1, 'the': 1, 'second': 1},
 2: {'and': 1, 'this': 1, 'is': 1, 'the': 1, 'third': 1, 'one': 1},
 3: {'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1}}

In [64]:
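idf = {}
# Smoothing term added to the document count. Note that it makes the IDF
# negative for words that appear in every document.
smoothing = 1
for word in vocabulary:
    doc_count = 0
    for doc in documents:
        # Match whole words, not substrings ("is" is a substring of "this")
        if word in doc.split():
            doc_count += 1

    idf_val = np.log(len(documents) / (doc_count + smoothing))
    idf[word] = idf_val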

idf

{'this': -0.2231435513142097,
 'third': 0.6931471805599453,
 'is': -0.2231435513142097,
 'the': -0.2231435513142097,
 'one': 0.6931471805599453,
 'second': 0.6931471805599453,
 'and': 0.6931471805599453,
 'first': 0.28768207245178085,
 'document': 0.0}

In [31]:
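# Alternative IDF without the logarithm: 1 + N / df.
# This overwrites the previous `idf`; the scores in Step 5 are based on these values.
idf = {}
for word in vocabulary:
    doc_count = 0
    for i in range(len(documents)):
        if word in documents[i].split():
            doc_count += 1
    # doc_count is never zero here: every vocabulary word comes from some document
    idf[word] = 1 + (len(documents) / doc_count)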

idf

{'this': 2.0,
 'third': 5.0,
 'is': 2.0,
 'the': 2.0,
 'one': 5.0,
 'second': 5.0,
 'and': 5.0,
 'first': 3.0,
 'document': 2.333333333333333}

### Step 5. Calculate TF-IDF



In [32]:
# Calculate TF-IDF
tfidf = {}
for i in range(len(documents)):
  tfidf[i] = {}
  for word in vocabulary:
    tfidf[i][word] = (tf[i].get(word, 0) / sum(tf[i].values())) * idf[word]

tfidf

{0: {'this': 0.4,
  'third': 0.0,
  'is': 0.4,
  'the': 0.4,
  'one': 0.0,
  'second': 0.0,
  'and': 0.0,
  'first': 0.6000000000000001,
  'document': 0.4666666666666666},
 1: {'this': 0.3333333333333333,
  'third': 0.0,
  'is': 0.3333333333333333,
  'the': 0.3333333333333333,
  'one': 0.0,
  'second': 0.8333333333333333,
  'and': 0.0,
  'first': 0.0,
  'document': 0.7777777777777777},
 2: {'this': 0.3333333333333333,
  'third': 0.8333333333333333,
  'is': 0.3333333333333333,
  'the': 0.3333333333333333,
  'one': 0.8333333333333333,
  'second': 0.0,
  'and': 0.8333333333333333,
  'first': 0.0,
  'document': 0.0},
 3: {'this': 0.4,
  'third': 0.0,
  'is': 0.4,
  'the': 0.4,
  'one': 0.0,
  'second': 0.0,
  'and': 0.0,
  'first': 0.6000000000000001,
  'document': 0.4666666666666666}}

In [33]:
# Print the TF-IDF matrix
for doc_id, tfidf_scores in tfidf.items():
    print(f"Document {doc_id}:")
    for word, score in tfidf_scores.items():
        print(f"  {word}: {score:.4f}")
    print()

Document 0:
  this: 0.4000
  third: 0.0000
  is: 0.4000
  the: 0.4000
  one: 0.0000
  second: 0.0000
  and: 0.0000
  first: 0.6000
  document: 0.4667

Document 1:
  this: 0.3333
  third: 0.0000
  is: 0.3333
  the: 0.3333
  one: 0.0000
  second: 0.8333
  and: 0.0000
  first: 0.0000
  document: 0.7778

Document 2:
  this: 0.3333
  third: 0.8333
  is: 0.3333
  the: 0.3333
  one: 0.8333
  second: 0.0000
  and: 0.8333
  first: 0.0000
  document: 0.0000

Document 3:
  this: 0.4000
  third: 0.0000
  is: 0.4000
  the: 0.4000
  one: 0.0000
  second: 0.0000
  and: 0.0000
  first: 0.6000
  document: 0.4667



### Implementing with the scikit-learn library

- `TfidfVectorizer`

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the sentences
X = vectorizer.fit_transform(documents)

# Get the vocabulary
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)

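## Vectorize the training data
## (X from fit_transform already holds this matrix; transform() reuses the fitted vocabulary and IDF weights)
train_X = vectorizer.transform(documents)
print("TF-IDF of all the documents is:\n", train_X.toarray())

# Vectorize a single sentence (the first document is reused here as the example)

new_X = vectorizer.transform([documents[0]])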

# Convert to array and print
print("TF-IDF Vectorized form:", new_X.toarray())

Vocabulary: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
TF-IDF of all the documents is:
 [[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
TF-IDF Vectorized form: [[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
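
These numbers differ from the manual scores in Step 5 because `TfidfVectorizer` uses different defaults: raw counts as term frequency, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row. The sketch below re-derives the first row with only the standard library, under those documented defaults; the helper name `tfidf_vector` is made up for this example.

```python
import math

documents = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

vocab = sorted({word for doc in documents for word in doc.split()})
n_docs = len(documents)

# Document frequency: number of documents containing each word
df = {word: sum(word in doc.split() for doc in documents) for word in vocab}

# Smoothed IDF, as with smooth_idf=True
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_vector(doc):
    """Raw count times IDF for each vocabulary word, then L2-normalize the row."""
    tokens = doc.split()
    raw = [tokens.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


row0 = tfidf_vector(documents[0])
print(vocab)
print([round(value, 4) for value in row0])
```

After rounding, the printed row should agree with the first row of the scikit-learn matrix above.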


### Step 6: Analysis/application

**Understanding the vectors**

Each document is represented as a vector where each element corresponds to the TF-IDF score of a word in the vocabulary. For example, using the manual scores from Step 5 (word order: this, third, is, the, one, second, and, first, document):

- Document 0 vector: `[0.4000, 0.0000, 0.4000, 0.4000, 0.0000, 0.0000, 0.0000, 0.6000, 0.4667]`
- Document 1 vector: `[0.3333, 0.0000, 0.3333, 0.3333, 0.0000, 0.8333, 0.0000, 0.0000, 0.7778]`

**Next steps**

1. Model Training: Use these vectors as input features to train machine learning models for tasks like text classification, clustering, or sentiment analysis.

2. Similarity Measurement: Calculate the similarity between documents using metrics like cosine similarity, which can be useful for information retrieval or document clustering.

3. Visualization: Visualize the document vectors using techniques like PCA (Principal Component Analysis) or t-SNE to understand the distribution of documents in the feature space.
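
As an illustration of the similarity measurement step, cosine similarity can be computed directly from two document vectors. This is a minimal sketch using the Document 0 and Document 1 vectors listed above; the function name `cosine_similarity` is defined here and is not imported from a library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc0 = [0.4000, 0.0000, 0.4000, 0.4000, 0.0000, 0.0000, 0.0000, 0.6000, 0.4667]
doc1 = [0.3333, 0.0000, 0.3333, 0.3333, 0.0000, 0.8333, 0.0000, 0.0000, 0.7778]

print(f"doc0 vs doc1: {cosine_similarity(doc0, doc1):.4f}")
print(f"doc0 vs doc0: {cosine_similarity(doc0, doc0):.4f}")
```

A value close to 1 means the two documents weight the same words similarly, and a value of 0 means they share no weighted words.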