# Text Preprocessing in NLP: A Concise Guide 

## Why Preprocess Text?

Imagine you're a chef preparing a complex dish. You wouldn't use every ingredient in your kitchen, right? Similarly, in Natural Language Processing (NLP), we don't use raw text as-is. We carefully select and prepare our "ingredients" (words and features) to make our "dish" (model) as delicious (accurate) as possible.

Let's explore this concept using spam classification as an example.

### Spam Classification Example

To classify emails as spam or not:
1. Analyze both spam and non-spam emails
2. Identify distinctive features (e.g., average length, common words)
3. Use these features to train a model

But here's the catch: emails can be messy! They might contain:
- Emojis 😊
- Special characters @#$%
- Numbers 123
- HTML tags <br>
- Varying letter cases

This "noise" can make pattern recognition challenging. That's where text preprocessing comes in!
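To make the problem concrete, here is a minimal sketch that counts whitespace-separated tokens in a messy message. The sample string is invented for illustration; the point is only that one word can show up as several unrelated tokens.

```python
from collections import Counter

# Hypothetical messy email body, made up for this illustration.
raw_email = "FREE offer!!! Click <b>free</b> now, it's Free 😊"

# Naive approach: split on whitespace and count the tokens as-is.
raw_counts = Counter(raw_email.split())

# "FREE", "<b>free</b>" and "Free" are counted as three different tokens,
# and the plain token "free" never appears at all.
print(raw_counts)
```

A model trained on these counts would have to learn each variant separately, which is the pattern-recognition difficulty described above.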

## What is Text Preprocessing?

Text preprocessing is like sorting through your groceries before cooking:

> 🛒 **Text Preprocessing** = Picking the most useful "ingredients" from your raw text data

It's a crucial step in NLP where we clean and transform raw text to make it more suitable for analysis.

### Benefits of Text Preprocessing

- Reduces complexity
- Lowers computational cost
- Eliminates redundancy
- Improves model generalization

## The Text Preprocessing Pipeline

Here's a typical sequence of text preprocessing steps:

1. **Lowercase Conversion** 
   - Why? "Hello" and "hello" are the same word

2. **Tokenization** 
   - What? Breaking text into individual words or subwords

3. **Punctuation Removal** 
   - Why? "hello" and "hello!" are essentially the same

4. **Stopword Removal** 
   - What? Removing common words like "the", "is", "at"

5. **Vectorization** 
   - Why? Converts text to numerical format for machine learning models

Remember, this pipeline can be customized based on your specific NLP task!
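The five steps above can be sketched end to end with the standard library alone. This is a simplified illustration, not a production pipeline: the `preprocess` and `vectorize` helpers, the tiny stopword set, and the sample documents are all assumptions made for this example, and the vectorization shown is a plain bag-of-words count.

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "at", "a", "to", "and"}
PUNCTUATION = set(string.punctuation)

def preprocess(text):
    """Lowercase, tokenize, then drop punctuation and stopwords."""
    text = text.lower()                                    # 1. lowercase
    tokens = re.findall(r"\w+|[^\w\s]", text)              # 2. tokenize
    tokens = [t for t in tokens if t not in PUNCTUATION]   # 3. punctuation removal
    return [t for t in tokens if t not in STOPWORDS]       # 4. stopword removal

def vectorize(tokens, vocabulary):
    """5. Bag-of-words: how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["The offer is FREE!", "Meet me at the office."]
processed = [preprocess(doc) for doc in docs]
vocabulary = sorted({token for doc in processed for token in doc})
vectors = [vectorize(doc, vocabulary) for doc in processed]

print(vocabulary)  # ['free', 'me', 'meet', 'offer', 'office']
print(vectors)     # [[1, 0, 0, 1, 0], [0, 1, 1, 0, 1]]
```

Each document ends up as a fixed-length list of numbers, which is the form most machine learning models expect.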

## Key Takeaway

Text preprocessing is about cleaning and transforming raw text to extract the most relevant features for your NLP task. It's an essential step that can significantly impact your model's performance.

# NLP Terminology

## Corpus
- Definition: A collection of documents in a dataset
- Purpose: Serves as the primary data source for NLP tasks
- Note: Plural form is "corpora"

## Documents
- Definition: Individual units within a corpus
- Composition: Typically sentences, paragraphs, or whole texts such as emails
- Importance: Basic units of analysis in many NLP tasks

## Vocabulary
- Definition: Set of all unique words in a corpus
- Also known as: Lexicon or dictionary
- Use: Fundamental for text analysis and model training

## Tokens
- Definition: Individual words or subword units
- Process: Result of tokenization
- Importance: Basic units for most NLP operations
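The four terms fit together as in this small sketch. The corpus contents are invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
# A toy corpus: each string is one document (contents made up for this example).
corpus = [
    "free offer click now",
    "meeting at noon",
    "click to confirm the meeting",
]

# Tokens: whitespace tokenization is enough for this already-clean text.
tokens_per_document = [document.split() for document in corpus]

# Vocabulary: the set of unique tokens across the whole corpus.
vocabulary = sorted({token for tokens in tokens_per_document for token in tokens})

total_tokens = sum(len(tokens) for tokens in tokens_per_document)
print(len(corpus), "documents")      # 3 documents
print(total_tokens, "tokens")        # 12 tokens
print(len(vocabulary), "unique words in the vocabulary")  # 10
```

The token count (12) is larger than the vocabulary size (10) because "click" and "meeting" each occur twice.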


# Important Libraries

In [1]:
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the tokenizer models, WordNet data, and stopword lists used below
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/verykul/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/verykul/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/verykul/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Upper to Lower case conversion 

In [2]:
text = """Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. 
It is primarily concerned with providing computers the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. 
Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches of machine learning and deep learning.
Major tasks in natural language processing are speech recognition, text classification, natural-language understanding, and natural-language generation. 
"""

In [3]:
text = text.lower()
print(text)

natural language processing (nlp) is an interdisciplinary subfield of computer science and artificial intelligence. 
it is primarily concerned with providing computers the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. 
typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches of machine learning and deep learning.
major tasks in natural language processing are speech recognition, text classification, natural-language understanding, and natural-language generation. 
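For English text `lower()` is usually all you need. As a side note, Python also offers `str.casefold()`, a more aggressive variant intended for caseless matching. The short sketch below shows the difference on a German word chosen for illustration.

```python
# lower() and casefold() agree on plain English text...
print("Hello".lower() == "Hello".casefold())   # True

# ...but casefold() also folds characters such as the German sharp s.
word = "Straße"
print(word.lower())      # straße
print(word.casefold())   # strasse
```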



# Regular Expression 

> Regular expressions can be used to remove unwanted patterns from text. In natural language tasks they are most commonly used to strip HTML tags, special characters, and other undesired patterns.

In [24]:
text = "Hello, World! Welcome to Python programming: the best & most powerful language."

pattern = r'[^a-zA-Z0-9\s]'  # matches any character that is not a letter, digit, or whitespace

cleaned_text = re.sub(pattern, '', text)
print('original_text:', text)
print('cleaned_text:',cleaned_text)

original_text: Hello, World! Welcome to Python programming: the best & most powerful language.
cleaned_text: Hello World Welcome to Python programming the best  most powerful language
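Notice the double space in the cleaned text where `&` used to be. A common follow-up, sketched here as one possible approach, is to collapse runs of whitespace after removing characters. The variable names `no_special` and `normalized` are just illustrative.

```python
import re

text = "Hello, World! Welcome to Python programming: the best & most powerful language."

# Step 1: drop everything that is not a letter, digit, or whitespace.
no_special = re.sub(r"[^a-zA-Z0-9\s]", "", text)

# Step 2: collapse runs of whitespace into a single space and trim the ends.
normalized = re.sub(r"\s+", " ", no_special).strip()

print(normalized)
# Hello World Welcome to Python programming the best most powerful language
```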


> We can also remove HTML tags from text using a regular expression.

In [25]:
text = "<html><body><h1>Hello, World!</h1><p>This is a paragraph.</p></body></html>"

pattern = r'<[^>]+>'
cleaned_text = re.sub(pattern, '', text)
print('original text:', text)
print('cleaned text:', cleaned_text)

original text: <html><body><h1>Hello, World!</h1><p>This is a paragraph.</p></body></html>
cleaned text: Hello, World!This is a paragraph.
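In the output above, "World!" and "This" are glued together because each tag was replaced with an empty string. One way to avoid that, sketched below with a slightly different sample string, is to replace tags with a space, decode HTML entities with the standard `html` module, and then tidy the whitespace. Keep in mind that a regex is a rough tool for HTML; a real HTML parser is more robust on malformed markup.

```python
import html
import re

text = "<html><body><h1>Hello, World!</h1><p>Tom &amp; Jerry.</p></body></html>"

# Replace each tag with a space so neighbouring text does not get glued together.
no_tags = re.sub(r"<[^>]+>", " ", text)

# Decode entities such as &amp; and collapse the leftover whitespace.
cleaned_text = re.sub(r"\s+", " ", html.unescape(no_tags)).strip()

print(cleaned_text)  # Hello, World! Tom & Jerry.
```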


# Punctuation Removal 


- As the name suggests, this step removes punctuation from the documents. Why remove it? It is not mandatory, and some NLP applications do keep punctuation.

- **For prediction and pattern-finding tasks such as spam or fake-news classification, punctuation usually contributes so little to the overall decision that it is effectively redundant, and it adds noise to the model.**

- It is therefore usually better to remove it for tasks like these. For tasks such as natural language generation or machine translation, punctuation is kept so the model can learn grammar and produce grammatically correct text.

For now, this is how you can remove punctuation from documents.

In [4]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [5]:
text = "This is ! $$punctuation removal$$, using the string library."

In [6]:
# Iterating over a string yields characters; keep each one that is not punctuation
cleaned_text = [char for char in text if char not in string.punctuation]
print(''.join(cleaned_text))

This is  punctuation removal using the string library
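An alternative to the list comprehension is `str.translate` with a table built by `str.maketrans`, which deletes all listed characters in a single pass. This is a sketch of the same idea, not a different result.

```python
import string

text = "This is ! $$punctuation removal$$, using the string library."

# Build a translation table that maps every punctuation character to None.
table = str.maketrans("", "", string.punctuation)
cleaned_text = text.translate(table)

print(cleaned_text)  # This is  punctuation removal using the string library
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes (’) are left in place by both approaches.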


# Tokenization

- Tokenization is the process of breaking text into smaller units called tokens. These tokens are typically words, numbers, or punctuation marks.

- **Types of tokenization**:
  1. Word tokenization: Splits text into individual words
  2. Sentence tokenization: Divides text into sentences
  3. Subword tokenization (e.g., BPE, WordPiece)
  4. Language-specific tokenizers for non-English texts

- **Implementation methods**:
  - NLTK library: Offers both word and sentence tokenization
  - Python's built-in `split()` function: A simple way to tokenize based on whitespace

- **Importance**:
  - Fundamental step in text preprocessing
  - Enables further analysis and feature extraction
  - Crucial for tasks like part-of-speech tagging and named entity recognition

- **Challenges**:
  - Handling contractions (e.g., "don't")
  - Dealing with hyphenated words
  - Managing special characters and punctuation

Remember: The choice of tokenization method can significantly impact downstream NLP tasks.
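To illustrate the first two challenges, here is a rough regex-based tokenizer that keeps contractions and hyphenated words together. The pattern and the `simple_tokenize` helper are assumptions made for this sketch; this is not how NLTK tokenizes, and the pattern only handles ASCII letters.

```python
import re

# A word is a run of letters, optionally continued by apostrophe- or
# hyphen-joined letter runs. Digits form their own tokens, and each
# remaining punctuation mark becomes a single-character token.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+|[^\w\s]")

def simple_tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("Don't split state-of-the-art words!"))
# ["Don't", 'split', 'state-of-the-art', 'words', '!']
```

Whether "don't" should stay one token or be split is a design decision that depends on the downstream task.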

`Example - 1`

In [7]:
text = "This is how tokenization is done to word. This is tokenization using NLTK"

# word and sentence tokenization
print(word_tokenize(text))   # output is a list of word tokens
print(sent_tokenize(text))   # output is a list of sentences

['This', 'is', 'how', 'tokenization', 'is', 'done', 'to', 'word', '.', 'This', 'is', 'tokenization', 'using', 'NLTK']
['This is how tokenization is done to word.', 'This is tokenization using NLTK']


`Example - 2`

In [8]:
text = "This email is not valid: example@gmail.com"

print(word_tokenize(text))   

['This', 'email', 'is', 'not', 'valid', ':', 'example', '@', 'gmail.com']


In [9]:
print(text.split())

['This', 'email', 'is', 'not', 'valid:', 'example@gmail.com']


- In the second example, the NLTK word tokenizer splits the email address into three tokens, 'example', '@' and 'gmail.com', while `split()` returns the whole address as a single token (and leaves the colon attached to 'valid:').
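If you want the email address kept whole and the colon still separated, one option is a custom regex that tries an email-like pattern before falling back to words and punctuation. This is a sketch under the assumption that a loose pattern is good enough; it is not a full address validator.

```python
import re

# Alternatives are tried left to right: email-like strings first,
# then plain words, then single punctuation characters.
TOKEN_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+|[^\w\s]")

text = "This email is not valid: example@gmail.com"
tokens = TOKEN_PATTERN.findall(text)

print(tokens)
# ['This', 'email', 'is', 'not', 'valid', ':', 'example@gmail.com']
```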

# Stopwords

- Stopwords, also called function words, are words used mostly to make sentences grammatically correct. They contribute little to the meaning of a sentence, and removing them usually does not change its overall context.
- NLTK ships a stopwords corpus that returns a list of stopwords for a number of languages. Here is an example that loads the English stopwords and removes them from a text.

In [10]:
text = "NLTK, or the Natural Language Toolkit, is a treasure trove of a library for text preprocessing. It’s one of my favorite Python libraries. NLTK has a list of stopwords stored in 16 different languages"
text_tokens = word_tokenize(text)

In [11]:
STOPWORDS = stopwords.words('english')
# print(STOPWORDS)        

In [12]:
filtered_sentence = [word for word in text_tokens if word not in STOPWORDS]
print('The text after removal of stopwords: \n')
print(' '.join(filtered_sentence))

The text after removal of stopwords: 

NLTK , Natural Language Toolkit , treasure trove library text preprocessing . It ’ one favorite Python libraries . NLTK list stopwords stored 16 different languages


- The output differs from the input, but the meaning of the sentence is largely preserved and we can still understand it.
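In the output above, "It" survived and the punctuation tokens are still present: the NLTK stopword list is lowercase, so the case-sensitive comparison misses capitalized words. A minimal sketch of a stricter filter follows. To keep it runnable without downloads it uses a small hand-picked stopword set, which is an assumption; in practice you could use `set(stopwords.words('english'))` instead.

```python
import string

# Small illustrative stopword set (assumption; stands in for the NLTK list).
STOPWORDS = {"it", "is", "a", "of", "my", "the", "in"}
PUNCTUATION = set(string.punctuation)

tokens = ["It", "is", "one", "of", "my", "favorite", "Python", "libraries", "."]

# Compare the lowercased token against the stopwords, and drop punctuation tokens.
filtered = [
    token for token in tokens
    if token.lower() not in STOPWORDS and token not in PUNCTUATION
]

print(filtered)  # ['one', 'favorite', 'Python', 'libraries']
```

Using a `set` rather than a list also makes each membership check faster on large corpora.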

# Stemming 

- Stemming reduces a word to its base or root form by stripping affixes. It is easiest to understand by looking at an example first:

In [13]:
stemmer = PorterStemmer()

In [14]:
text = "She was a great Dancer. Her performance was historical"
tokens = word_tokenize(text)

In [15]:
stemmed_words = [stemmer.stem(word) for word in tokens]
print(stemmed_words)

['she', 'wa', 'a', 'great', 'dancer', '.', 'her', 'perform', 'wa', 'histor']


- The resulting words are reduced to a base form, but some of them, such as 'wa' and 'histor', are not real English words. This is a limitation of stemming: it does not necessarily produce a meaningful word, and this happens quite often.
- Then why is it used?
    - Stemming is faster than other approaches such as lemmatization. If the corpus is large and we want faster processing, stemming is worth considering.
    - It also fits cases where we don't care about the exact word and an approximation of it will suffice.

- Applications of stemming:
    1. Spam classification
    2. Sentiment analysis
    3. Duplicate matching
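To see why rule-based truncation is fast but produces non-words, here is a deliberately naive suffix stripper. It is a toy written for this illustration and is not the Porter algorithm, which applies a much larger, ordered set of rules; the suffix list and the `naive_stem` helper are assumptions.

```python
# Toy suffix list, checked in order (NOT the Porter rule set).
SUFFIXES = ("ance", "ical", "ing", "ed", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

stems = [naive_stem(w) for w in ["performance", "historical", "dancing", "was"]]
print(stems)  # ['perform', 'histor', 'danc', 'was']
```

No dictionary is consulted at any point, which is why the result can be a fragment like 'histor' or 'danc'.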

# Lemmatization 

- Lemmatization has the same goal as stemming, but it returns a meaningful dictionary word (the lemma) rather than a truncated stem.
- Because it has to look words up in a dictionary instead of just stripping suffixes, it tends to be slower than stemming.
- It is therefore preferred when we want to preserve the actual word.

In [16]:
lemmatizer = WordNetLemmatizer()

In [17]:
text = "He was going to the castle. To steal the magical computer"
tokens = word_tokenize(text)

In [18]:
lemmatized_word = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmatized_word)

['He', 'wa', 'going', 'to', 'the', 'castle', '.', 'To', 'steal', 'the', 'magical', 'computer']


- But wait, why is 'was' converted to 'wa'? That is not meaningful at all.
- The other words are unchanged too; verbs such as 'going' were not converted to their root form.
- To understand why this happens, we have to know how the NLTK lemmatizer works.
  Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. It works as follows:

> Lemmatization removes inflectional endings from words. Inflections are changes in word forms that indicate grammatical functions, such as
  plurals for nouns or tense for verbs.

1. A lemmatizer uses the word's part of speech (and, in some tools, its context) to determine the proper base form.
2. It then checks whether the resulting word exists in a dictionary. If found, it returns this base form (lemma).
3. Unlike stemming, which simply truncates words based on rules, lemmatization takes the word's meaning into account.
4. The goal is to return a valid dictionary word that preserves the word's core meaning.

- The NLTK lemmatizer gives the expected result only when it is told the right part of speech through the `pos` parameter. By default `pos='n'`, which means **NOUN**. So in our example it treated every token as a **NOUN**, and the noun base forms are the same as the original words.
- Similarly, it assumed that 'was' is a plural noun and removed the final 's', producing 'wa'.
- When we pass `pos='v'` and call the lemmatizer again, we get the base forms of the verbs.

In [19]:
lemmatized_word = [lemmatizer.lemmatize(word, pos='v') for word in tokens]
print(lemmatized_word)

['He', 'be', 'go', 'to', 'the', 'castle', '.', 'To', 'steal', 'the', 'magical', 'computer']
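Passing `pos='v'` for every token only helps with verbs. A common approach is to tag each token with its part of speech and translate that tag into the single-letter codes WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`). The `penn_to_wordnet` helper below is a hypothetical name and a rough mapping written for this sketch, and the tagged pairs are hand-written stand-ins for a POS tagger's output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default

# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("He", "PRP"), ("was", "VBD"), ("going", "VBG"),
          ("castle", "NN"), ("magical", "JJ")]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
# [('He', 'n'), ('was', 'v'), ('going', 'v'), ('castle', 'n'), ('magical', 'a')]
```

With NLTK, the pairs could come from `nltk.pos_tag(tokens)` (which needs its own tagger data downloaded), and each word could then be lemmatized with `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.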
