## Text preprocessing

Text preprocessing cleans text data and prepares it to be fed to a model. Text data contains noise in various forms, such as emoticons, punctuation, and inconsistent letter case.


Let’s look at the steps commonly followed when preprocessing text, some of which also help reduce dimensionality.
Text preprocessing steps:


<b>1] Noise Removal</b>

<b>2] Tokenization</b>

<b>3] Text Normalization</b>

<b>4] Stemming</b>

<b>5] Lemmatization</b>

<b>6] Stopword Removal</b>

<b>7] Part-of-Speech Tagging</b>


![image-3.png](attachment:image-3.png)





# Noise Removal
In natural language processing, noise removal is a text preprocessing task devoted to stripping text of formatting.
It removes characters, digits, and pieces of text that can interfere with text analysis, and it is one of the most essential text preprocessing steps.

In [3]:
import re

text = "Does she live in Paris?"

# Remove common punctuation marks; inside a character class
# these characters do not need to be escaped
result = re.sub(r'[.?!,:;"]', '', text)

print(result)

Does she live in Paris
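
Noise is often more than punctuation. The following is a minimal sketch, assuming the noise consists of HTML-like tags, digits, punctuation, and extra whitespace. The function name `remove_noise` and the sample string are hypothetical and only serve as an illustration.

```python
import re

def remove_noise(text):
    """Strip HTML-like tags, digits, and punctuation, then tidy whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML-like tags
    text = re.sub(r"\d+", " ", text)          # drop runs of digits
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

sample = "<p>Does she live in Paris? Call 12345!</p>"
cleaned = remove_noise(sample)
print(cleaned)
```

Which patterns count as noise depends on the task, so the list of substitutions would normally be adapted to the data at hand.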


# Tokenization
In natural language processing, tokenization is the text preprocessing task of breaking up text into smaller components (known as tokens).
In word tokenization, for example, sentences are split into individual words.
![image-2.png](attachment:image-2.png)


In [8]:
from nltk.tokenize import word_tokenize
 
text = "I live in Pune"
tokenized = word_tokenize(text)

print(tokenized)

['I', 'live', 'in', 'Pune']


In [10]:
token = word_tokenize("Text preprocessing in NLP")
token

['Text', 'preprocessing', 'in', 'NLP']
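
To make the idea of tokens concrete without relying on NLTK's models, here is a rough regex-based sketch. It is not how `word_tokenize` works internally; the function name `simple_tokenize` is hypothetical, and the rule (words and single punctuation marks become tokens) is a simplifying assumption.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Does she live in Paris?")
print(tokens)
```

A rule this simple splits contractions such as "don't" into three tokens, which is one reason a dedicated tokenizer is usually preferred.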

# Text Normalization
In natural language processing, normalization encompasses many text preprocessing tasks, including stemming, lemmatization, upper- or lowercasing, and stopword removal.
![image-2.png](attachment:image-2.png)


# Lowercasing

Lowercasing converts the tokenized words to lower case (NLU -> nlu). Words with the same meaning, such as nlp and NLP, are treated as different words in the vector space model unless they are converted to lowercase.

In [15]:
t = "The tokenized words into lower case format"
# Split into words first; iterating over the string directly yields single characters
lowercase = [word.lower() for word in t.split()]
lowercase

['the', 'tokenized', 'words', 'into', 'lower', 'case', 'format']
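
The effect of case on the vocabulary can be seen by counting distinct tokens before and after lowercasing. This is a small sketch with a hand-written token list chosen for illustration.

```python
from collections import Counter

tokens = ["NLP", "is", "fun", "and", "nlp", "is", "useful"]

# Without lowercasing, "NLP" and "nlp" are separate vocabulary entries
raw_counts = Counter(tokens)

# After lowercasing, both spellings collapse into a single entry
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts), len(lower_counts))
print(lower_counts["nlp"])
```

The vocabulary shrinks by one entry, and the two spellings now contribute to the same count.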

# Stopword Removal
In natural language processing, stopword removal is the process of removing words from a string that don’t provide any information about the tone of a statement.
Stopwords are the most frequently used words (a, an, the, etc.), which carry little significance when distinguishing between two documents, so they are removed. For example, from the sentence “Introduction to Natural Language Processing” the word “to” is removed.

In [20]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

word1 = "stopword removal is the process of removing words from a string. "
stop_words = set(stopwords.words('english'))

# Tokenize first: iterating over the raw string would compare single characters
statement_no_stop = [word for word in word_tokenize(word1) if word not in stop_words]
statement_no_stop

['stopword', 'removal', 'process', 'removing', 'words', 'string', '.']
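
The sentence mentioned above can be filtered with a few lines of plain Python. This is a minimal sketch that assumes a tiny hand-picked stopword set; `STOP_WORDS` and `remove_stopwords` are hypothetical names, and NLTK's English list is much longer.

```python
# A tiny hand-picked stopword set, for illustration only
STOP_WORDS = {"a", "an", "the", "to", "of", "in"}

def remove_stopwords(sentence):
    """Return the words of `sentence` that are not in STOP_WORDS."""
    # Compare in lowercase so that "To" and "to" are treated alike
    return [w for w in sentence.split() if w.lower() not in STOP_WORDS]

kept = remove_stopwords("Introduction to Natural Language Processing")
print(kept)
```

Keeping the stopwords in a `set` makes each membership check a constant-time lookup, which matters once the text grows large.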

# Stemming

Stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes), so that words are reduced to a base form. The code below converts the words of a sentence to their base forms.



In [24]:
from nltk.stem import PorterStemmer
 
# Tokens should not carry stray leading spaces
tokenized = ["words", "are", "converted", "to", "its", "base", "form"]
 
ps = PorterStemmer()
stem = [ps.stem(token) for token in tokenized]
 
print(stem)

['word', 'are', 'convert', 'to', 'it', 'base', 'form']


In [25]:
print(ps.stem('jumping'))
print(ps.stem('lately'))
print(ps.stem('assess'))
print(ps.stem('ran'))

jump
late
assess
ran
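
To show what "bluntly removing affixes" means, here is a toy suffix stripper. It is not the Porter algorithm, only a sketch under the assumption that a short fixed list of suffixes is enough; `SUFFIXES` and `naive_stem` are hypothetical names.

```python
SUFFIXES = ("ing", "ed", "ly", "s")  # checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "converted", "words", "lately", "assess", "ran"]:
    print(w, "->", naive_stem(w))
```

The toy version turns "assess" into "asses", while the Porter stemmer above leaves it unchanged. Real stemmers add extra rules to limit this kind of over-stemming, but neither approach maps an irregular form such as "ran" to "run".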


# Lemmatization

Lemmatization uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.


In [29]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\deshm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [31]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# The second argument is the part of speech: 'v' for verb, 'a' for adjective
print(lemmatizer.lemmatize('ran', 'v'))
print(lemmatizer.lemmatize('better', 'a'))

run
good


# Difference between stemming and lemmatization

Stemming simply removes the last few characters of a word, which often leads to incorrect meanings and spellings. Lemmatization considers the context, such as the part of speech, and converts the word to its meaningful base form, which is called the lemma. Sometimes the same word can have several different lemmas.

![image.png](attachment:image.png)

# Part-of-Speech Tagging
In natural language processing, part-of-speech tagging is the process of assigning a part of speech to every word in a string. Using the part of speech can improve the results of lemmatization.
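
One way to connect tagging to lemmatization is to translate tagger output into the single-letter part-of-speech codes that a WordNet-style lemmatizer accepts. The sketch below assumes Penn Treebank style tags (for example `VBD`, `NN`, `JJ`); the helper name `penn_to_wordnet` is hypothetical, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# Hand-written (word, tag) pairs standing in for a tagger's output
tagged = [("She", "PRP"), ("ran", "VBD"), ("faster", "RBR"), ("today", "NN")]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The resulting letter can be passed as the second argument to `lemmatizer.lemmatize`, in the same way the lemmatization cell passes `'v'` and `'a'`.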