====================================================================================================

<style>
blue {
  color: skyblue;
}
</style>

## 1) **Lowercasing**
Lowercasing converts all characters in a text to <blue>**lowercase**</blue>. It ensures uniformity by treating words like <blue>**"Dog"**</blue> and <blue>**"dog"**</blue> as the same entity. This is important for many NLP tasks since capitalization usually doesn't change the meaning of words.

Example:\
Input: "Natural Language Processing"\
Output: "natural language processing" 

In [2]:
text = "Hello WorlD! Welcome to our NLP - Deep Learning Bootcamp"
lowercased_text = text.lower()

print(lowercased_text)

hello world! welcome to our nlp - deep learning bootcamp
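
For text that goes beyond plain English, Python also provides `str.casefold()`, a more aggressive variant of `lower()` meant for caseless matching. The sketch below compares the two; the German sample word is an illustrative assumption and not part of the bootcamp text.

```python
# Compare lower() and casefold() on a word containing the German sharp s
word = "Straße"

lowered = word.lower()      # 'ß' is already lowercase, so it is kept as-is
folded = word.casefold()    # casefold() maps 'ß' to 'ss'

print(lowered)
print(folded)
print(folded == "STRASSE".casefold())
```

For English-only text the two methods behave the same, so `lower()` is usually enough.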


## 2) **Removing Punctuation & Special Characters**

Punctuation marks (like <blue>**commas**</blue>, <blue>**periods**</blue>, <blue>**dashes**</blue>, etc.) and special characters (like <blue>**@**</blue>, <blue>**#**</blue>, <blue>**$**</blue>, etc.) are often not meaningful in many NLP tasks. Removing them helps clean the text for better analysis.

Example:\
Input: "Hello! How are you doing @today?"\
Output: "Hello How are you doing today"


In [3]:
import re
text = "Hello, world ✋✋! Welcome to?* our&/|~^+%'\" NLP - Deep Learning🧠 Bootcamp🤩."
punctuation_pattern = r'[^\w\s]'
text_cleaned = re.sub(punctuation_pattern, '', text)
print(text_cleaned)

Hello world  Welcome to our NLP  Deep Learning Bootcamp
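
Removing punctuation with `[^\w\s]` leaves double spaces wherever a symbol stood between two spaces, as the output above shows. A minimal sketch of one way to tidy that up follows; the helper name `remove_punctuation` and the sample sentence are assumptions made for illustration.

```python
import re

def remove_punctuation(text):
    # Drop every character that is neither a word character nor whitespace
    no_punct = re.sub(r'[^\w\s]', '', text)
    # Collapse the runs of whitespace left behind and trim both ends
    return re.sub(r'\s+', ' ', no_punct).strip()

print(remove_punctuation("Hello, world! NLP - Deep Learning."))
```

Note that `\w` also matches digits and the underscore, so tokens such as `snake_case` or `42` pass through unchanged.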


## 3) **Stop-Word Removal**

Stop-words are common words like <blue>**"the"**</blue>, <blue>**"is"**</blue>, <blue>**"in"**</blue>, <blue>**"and"**</blue> that don't contribute significant meaning to the text. Removing them helps reduce the size of the dataset, usually <blue>**without losing important context**</blue>.

Example:\
Input: "This is a sample sentence"\
Output: "sample sentence"

In [9]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

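# Remove stop-words using NLTK's list for the given language
def remove_stopwords(text, language):
    stop_words = set(stopwords.words(language))
    word_tokens = text.split()
    # Matching is case-sensitive and NLTK's English list is all lowercase,
    # so capitalized tokens such as "This" are kept unless the text is lowercased first
    filtered_text = [word for word in word_tokens if word not in stop_words]
    print(f"Language: {language}")
    print("Filtered Text:", filtered_text)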

# English Example
en_text = "Hello World! This is an NLP - Deep Learning Bootcamp. Hope this is fun!"
remove_stopwords(en_text, "english")

# Hindi + English - Example
hi_text = "Yeh ek bahut accha din hai and I am feeling awesome"
remove_stopwords(hi_text, "hinglish")

Language: english
Filtered Text: ['Hello', 'World!', 'This', 'NLP', '-', 'Deep', 'Learning', 'Bootcamp.', 'Hope', 'fun!']
Language: hinglish
Filtered Text: ['Yeh', 'din', 'I', 'feeling', 'awesome']
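
In the output above, `'This'` survives while `'this'` is removed, and tokens such as `'fun!'` keep their punctuation. A minimal sketch of normalizing the text before filtering is shown here. To stay self-contained it uses a tiny hand-written stop-word set, which is an assumption for illustration; in practice the set would come from `stopwords.words(...)`. The function name is likewise hypothetical.

```python
import re

# Tiny hand-written stop-word set, for illustration only (an assumption);
# a real pipeline would use set(stopwords.words("english")) instead
STOP_WORDS = {"this", "is", "a", "an", "the", "in", "and"}

def remove_stopwords_normalized(text, stop_words=STOP_WORDS):
    # Lowercase and strip punctuation so "This" and "sentence!" match cleanly
    cleaned = re.sub(r'[^\w\s]', '', text.lower())
    return [word for word in cleaned.split() if word not in stop_words]

print(remove_stopwords_normalized("This is a sample sentence!"))
```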


## 4) **Removal of URLs**

URLs are often <blue>**irrelevant**</blue> in NLP tasks and can add noise to the data. Removing them ensures cleaner text without <blue>**web links**</blue> that don’t contribute to the context.

Example:\
Input: "Check out this link: https://example.com"\
Output: "Check out this link: "

In [10]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

text = "I hope this bootcamp is useful for you. You can share it with your friends at https://example.com"
remove_urls(text)

'I hope this bootcamp is useful for you. You can share it with your friends at '
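
The result above still ends with a trailing space where the URL used to be. A small sketch that applies the same pattern and then tidies the whitespace is given below; the helper name `remove_urls_and_tidy` and the sample strings are assumptions for illustration.

```python
import re

# Same pattern as above: http(s) links or anything starting with "www."
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls_and_tidy(text):
    without_urls = URL_PATTERN.sub('', text)
    # split() with no arguments drops leading, trailing and repeated whitespace
    return ' '.join(without_urls.split())

print(remove_urls_and_tidy("Visit https://example.com or www.example.org today"))
```

Because `\S+` consumes everything up to the next whitespace, punctuation attached directly to a URL (for example a closing period) is removed together with it.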

## 5) **Removal of HTML Tags**

HTML tags are used in web data but are <blue>**unnecessary**</blue> in NLP tasks. <blue>**Stripping**</blue> out HTML tags cleans the text extracted from <blue>**web pages**</blue>.

Example:\
Input: "&lt;p>This is a paragraph.&lt;/p>"\
Output: "This is a paragraph."

In [12]:
import re

text = """<html><div>
<h1>NLP - Deep Learning</h1>
<p>Removal of HTML tags</p>
<a href="https://example.com"></a>
</div></html>"""

html_tags_pattern = r'<.*?>'
text_without_html_tags = re.sub(html_tags_pattern, '', text)
print(text_without_html_tags)


NLP - Deep Learning
Removal of HTML tags
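
The regex approach works for simple markup but leaves character entities such as `&amp;` untouched. As an alternative sketch, Python's standard-library `html.parser` can collect only the text nodes and decode entities along the way. The class and function names below are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join the pieces and normalize whitespace
    return ' '.join(' '.join(parser.parts).split())

print(strip_html("<div><h1>NLP - Deep Learning</h1><p>Tom &amp; Jerry</p></div>"))
```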




## 6) **Stemming**

Stemming reduces a word to its <blue>**base**</blue> or <blue>**root**</blue> form, which might not always be a valid word. The idea is to <blue>**strip**</blue> off <blue>**prefixes**</blue> or <blue>**suffixes**</blue>. It is a quick and computationally cheap way of normalizing words. Stemming is preferred when the exact <blue>**meaning**</blue> of the word is <blue>**not important**</blue> for the analysis, for example in <blue>**spam detection**</blue>.

Example:\
Input: "Playing", "Played", "Plays"\
Output: "Play"

The <blue>**Porter stemming**</blue> algorithm is one of the most common stemming algorithms. It is designed to <blue>**remove**</blue> and <blue>**replace**</blue> well-known <blue>**suffixes**</blue> of English words. Although it was developed for English text, it can be adapted to other languages. However, it is usually more effective to use tools designed specifically for the target language; for example, the <blue>**iNLTK**</blue> library offers such tools for <blue>**Indic languages**</blue>: <blue>**https://github.com/goru001/inltk**</blue>

<div style="font-style: italic; text-align: center;" markdown="1">
<img width="30%" src="https://cdn.botpenguin.com/assets/website/Stemming_53678d43bc.png">
</div>

In [14]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
 
def stem_words(text):
    word_tokens = text.split()
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems
 
text = 'text preprocessing section in course nlp - deep learning'
stem_words(text)

['text', 'preprocess', 'section', 'in', 'cours', 'nlp', '-', 'deep', 'learn']
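
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies a much larger set of ordered rules; the suffix list, the minimum stem length and the function name are all assumptions for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    # Toy suffix stripper: NOT the Porter algorithm, just the basic idea
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["Playing", "Played", "Plays"]])
```

Such a rule set breaks quickly on real text (it cannot map "ran" to "run", for instance), which is why established stemmers like Porter's are used in practice.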

## 7) **Lemmatization**

Lemmatization is a more advanced technique compared to stemming. It <blue>**reduces**</blue> a word to its <blue>**base form (called a lemma)**</blue> while ensuring the <blue>**output**</blue> is a <blue>**valid word**</blue>. To pick the right lemma it relies on a vocabulary and the word's part of speech (for example, noun or verb), which is how it handles singular, plural, and tense forms.

Example:\
Input: "Running", "Ran"\
Output: "Run"

In our lemmatization example, we will be using a popular lemmatizer called the <blue>**WordNet**</blue> lemmatizer. WordNet is a lexical database of English and a useful resource for English lemmatization. A popular lemmatizer for Hindi, developed by <blue>**John Snow Labs**</blue>, can be found here: <blue>**https://sparknlp.org/2020/07/29/lemma_hi.html**</blue>

<div style="font-style: italic; text-align: center;" markdown="1">
<img width="30%" src="https://cdn.botpenguin.com/assets/website/Lemmatization_5338fc7c3e.png">
</div>

In [18]:
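import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer needs the WordNet data (one-time download)

lemmatizer = WordNetLemmatizer()

def lemmatize_word(text):
    word_tokens = text.split()
    # pos='v' treats every token as a verb, so 'learning' becomes 'learn'
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas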
 
text = 'text preprocessing section in course nlp - deep learning'
print(lemmatize_word(text))

['text', 'preprocessing', 'section', 'in', 'course', 'nlp', '-', 'deep', 'learn']
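
The part-of-speech argument matters because the same surface form can have different lemmas. The toy lookup below sketches that idea with a hand-written table; the table entries and the function name are assumptions for illustration and are not how WordNet is implemented internally.

```python
# Hypothetical mini dictionary keyed by (word, part of speech)
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}

def lookup_lemma(word, pos):
    # Fall back to the lowercased word when the pair is not in the table
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lookup_lemma("Ran", "v"))
print(lookup_lemma("leaves", "n"), lookup_lemma("leaves", "v"))
```

Unlike suffix stripping, a dictionary-based approach can handle irregular forms such as "ran", at the cost of needing a vocabulary.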


## 8) **Tokenization**

Tokenization is the process of <blue>**splitting**</blue> a text into <blue>**individual units**</blue> like words, phrases, or sentences, called <blue>**tokens**</blue>. These tokens form the building blocks for further processing and analysis in NLP tasks.

Example:\
Input: "Congratulations you are almost at the end of this file."\
Output: ["Congratulations", "you", "are", "almost", "at", "the", "end", "of", "this", "file", "."]

There are different methods and libraries available to perform tokenization; <blue>**NLTK**</blue>, <blue>**spaCy**</blue> and <blue>**Gensim**</blue> are some of the libraries that can be used for the task.
Tokenization can separate words or sentences. Splitting a text into <blue>**words**</blue> is called <blue>**word tokenization**</blue>, and splitting it into <blue>**sentences**</blue> is called <blue>**sentence tokenization**</blue>.

In [19]:
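import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data: nltk.download('punkt')
# (recent NLTK releases ask for 'punkt_tab' instead)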

text = "Congratulations you are almost at the end of this file."

tokens = word_tokenize(text)
print(tokens)

['Congratulations', 'you', 'are', 'almost', 'at', 'the', 'end', 'of', 'this', 'file', '.']
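
To show the difference between word and sentence tokenization without any extra downloads, here is a rough regex-based sketch. It is not NLTK's algorithm; the function names and the splitting rules are simplifying assumptions.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(simple_word_tokenize("Congratulations you are almost at the end of this file."))
print(simple_sent_tokenize("Hello there. How are you?"))
```

Rules this simple mis-split abbreviations such as "Dr. Smith", which is one reason trained tokenizers are preferred for real text.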
