### Text Normalization/Preprocessing

Text preprocessing is the process of cleaning text before it is passed to a model. Several operations can be used to clean the data, such as:

1. Stemming
2. Lemmatization
3. StopWords

Several libraries can do this. Here we will use `nltk`.

In [30]:
import nltk

In [35]:
text = "Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Major tasks in natural language processing are speech recognition, text classification, natural language understanding, and natural language generation."
textList = text.split()

#### Stemming
Stemming is the process of reducing words to their root or base form. It uses a rule-based approach to do the conversion, so the result is not guaranteed to be a valid word.
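
To make the "rule-based" idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is only an illustration: `SUFFIX_RULES` and `toy_stem` are made-up names, and the handful of rules is far simpler than the real Porter algorithm.

```python
# Toy suffix-stripping stemmer: illustrative only, NOT the Porter algorithm.
# Each rule is (suffix to remove, text to put in its place).
SUFFIX_RULES = [("ing", ""), ("ily", "ili"), ("ies", "i"), ("s", "")]


def toy_stem(word):
    """Apply the first matching suffix rule and return the result."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three remaining letters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


print(toy_stem("growing"))  # grow
print(toy_stem("happily"))  # happili
print(toy_stem("tasks"))    # task
```

Because the rules only look at the spelling, nothing checks whether the output is a dictionary word, which is how a stem like `happili` can come out.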

In [31]:
from nltk import PorterStemmer

In [32]:
ps = PorterStemmer()

In [33]:
# sample stemming
ps.stem("growing")

'grow'

In [None]:
ps.stem("happily")  # the stem "happili" is not a real word

'happili'

In [36]:
# let's do it for our text 
stemText = [ps.stem(item) for item in textList]

In [37]:
# let's see the difference: print every stem that differs from its original token
for original, stemmed in zip(textList, stemText):
    if original != stemmed:
        print(stemmed)

natur
languag
process
(nlp)
comput
scienc
especi
artifici
it
primarili
concern
provid
comput
abil
encod
natur
languag
thu
close
relat
inform
knowledg
represent
comput
major
task
natur
languag
process
natur
languag
natur
languag


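Here we saw that many words were stemmed by the Porter algorithm, and some of the results (for example `natur` and `languag`) are not even valid words. The stemmer also lowercases its input, which is why `it` and `(nlp)` appear in the list. So we have to use stemming wisely.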

#### Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. It looks words up in a data source (here WordNet) and returns a valid dictionary word.
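
As a rough picture of the difference from stemming, the sketch below does a plain dictionary lookup. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names with a tiny hand-written table; a real lemmatizer consults a full lexical database and also applies morphological rules.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    "lives": "life",
    "computers": "computer",
    "tasks": "task",
    "mice": "mouse",
}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)


print([toy_lemmatize(w) for w in ["lives", "mice", "linguistics,"]])
# ['life', 'mouse', 'linguistics,']
```

A word that is not in the table, such as a token with punctuation still attached, passes through unchanged instead of being cut down to an invalid form.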

In [None]:
from nltk import WordNetLemmatizer

Before using the lemmatizer, we must download its resources first.

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Bumblebee\AppData\Roaming\nltk_data...


True

In [None]:
# initializing the object for lemmatizer class
wnl = WordNetLemmatizer()

In [None]:
# sample lemmatize
wnl.lemmatize("lives") # should give life


'life'

In [15]:
# now let's do this for the given text and see the result
lemmatizeText = [wnl.lemmatize(item) for item in text.split()]

In [None]:
# print every lemma that differs from its original token
for original, lemma in zip(textList, lemmatizeText):
    if original != lemma:
        print(lemma)

computer
task


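Here only two words changed: `computers` became `computer` and `tasks` became `task`. By default `lemmatize` treats every token as a noun, and tokens with punctuation still attached (such as `linguistics,`) are not found in WordNet, so they are returned unchanged.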

#### StopWords
Stop words are common words in a sentence that do not add significant meaning.

In [41]:
from nltk.corpus import stopwords

Before using it, we must download its source data.

In [43]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Bumblebee\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [45]:
# let's see a few stop words in English
stopwords.words("english")[0:10]

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

Stop words are also available for other languages, such as German, Indonesian, Portuguese, and Spanish.

In [46]:
# now let's apply the stop words to our text
englishStopwords = set(stopwords.words("english"))  # build the set once for fast lookups
stopwordsText = [item for item in textList if item not in englishStopwords]

In [47]:
stopwordsText

['Natural',
 'language',
 'processing',
 '(NLP)',
 'subfield',
 'computer',
 'science',
 'especially',
 'artificial',
 'intelligence.',
 'It',
 'primarily',
 'concerned',
 'providing',
 'computers',
 'ability',
 'process',
 'data',
 'encoded',
 'natural',
 'language',
 'thus',
 'closely',
 'related',
 'information',
 'retrieval,',
 'knowledge',
 'representation',
 'computational',
 'linguistics,',
 'subfield',
 'linguistics.',
 'Major',
 'tasks',
 'natural',
 'language',
 'processing',
 'speech',
 'recognition,',
 'text',
 'classification,',
 'natural',
 'language',
 'understanding,',
 'natural',
 'language',
 'generation.']

In [48]:
#  let's see the removed words

set(textList) - set(stopwordsText)

{'a', 'and', 'are', 'in', 'is', 'of', 'the', 'to', 'with'}
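
Note that `It` stayed in the filtered list above: the NLTK stop words are lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows; `STOP_WORDS` is a small hand-written set used only for illustration, and `remove_stop_words` is a hypothetical helper name.

```python
# Small hand-written stop-word set, used only for illustration.
STOP_WORDS = {"a", "and", "are", "in", "is", "it", "of", "the", "to", "with"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


sample = "It is a subfield of computer science".split()
print(remove_stop_words(sample))  # ['subfield', 'computer', 'science']
```

The same idea should work with the NLTK list by passing `set(stopwords.words("english"))` as `stop_words`.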

#### Tokenization
Tokenization is the process of breaking a long text into units called tokens. A token can be a word, a character, or even a sentence.

1. Word tokenization: Splits text into individual words. 
2. Sentence tokenization: Splits text into individual sentences. 
3. Character tokenization: Splits text into individual characters. 
4. Subword tokenization: Splits text into meaningful sub-word units, like byte-pair encoding (BPE). 
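
The first three kinds can be sketched with the standard library alone. The regular expressions below are naive assumptions (for instance, a sentence is taken to end at `.`, `!` or `?` followed by whitespace), and subword tokenization is left out because it needs a trained vocabulary.

```python
import re

sample = "NLP is fun. It has many tasks!"

# Sentence tokens: split on whitespace that follows ., ! or ? (a naive rule).
sentences = re.split(r"(?<=[.!?])\s+", sample)

# Word tokens: runs of letters, digits, or underscores; punctuation is dropped.
words = re.findall(r"\w+", sample)

# Character tokens: every character, including spaces.
chars = list(sample)

print(sentences)  # ['NLP is fun.', 'It has many tasks!']
print(words)      # ['NLP', 'is', 'fun', 'It', 'has', 'many', 'tasks']
print(chars[:3])  # ['N', 'L', 'P']
```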

In [50]:
wordTokens = text.split()
wordTokens

['Natural',
 'language',
 'processing',
 '(NLP)',
 'is',
 'a',
 'subfield',
 'of',
 'computer',
 'science',
 'and',
 'especially',
 'artificial',
 'intelligence.',
 'It',
 'is',
 'primarily',
 'concerned',
 'with',
 'providing',
 'computers',
 'with',
 'the',
 'ability',
 'to',
 'process',
 'data',
 'encoded',
 'in',
 'natural',
 'language',
 'and',
 'is',
 'thus',
 'closely',
 'related',
 'to',
 'information',
 'retrieval,',
 'knowledge',
 'representation',
 'and',
 'computational',
 'linguistics,',
 'a',
 'subfield',
 'of',
 'linguistics.',
 'Major',
 'tasks',
 'in',
 'natural',
 'language',
 'processing',
 'are',
 'speech',
 'recognition,',
 'text',
 'classification,',
 'natural',
 'language',
 'understanding,',
 'and',
 'natural',
 'language',
 'generation.']

#### Other techniques

There are various other ways to preprocess or normalize text. Which one to choose depends on the use case. Other techniques include:

1. Replacing text with a regex pattern
2. Removal of unwanted text
3. Removal of punctuation
4. Changing the case of the text
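
A minimal sketch of a few of these steps chained together, using only the standard library. The sample string and the patterns are made up for illustration; real data will usually need patterns tuned to it.

```python
import re

raw = "Visit https://example.com NOW!!   Call 555-0100 for   details."

# Remove unwanted text, here anything that looks like a URL.
no_urls = re.sub(r"https?://\S+", "", raw)
# Replace text with a regex pattern: mask every digit with '#'.
masked = re.sub(r"\d", "#", no_urls)
# Change the case of the text.
lowered = masked.lower()
# Collapse the repeated whitespace left behind by the earlier steps.
cleaned = re.sub(r"\s+", " ", lowered).strip()

print(cleaned)  # visit now!! call ###-#### for details.
```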

##### Removal of Punctuations

In [51]:
import string

In [52]:
# strip punctuation characters from inside every token
# (checking `item not in string.punctuation` only drops tokens that are
#  themselves punctuation, so "(NLP)" or "intelligence." would be kept as-is)
punctuationTable = str.maketrans("", "", string.punctuation)
punctuationFreeText = [item.translate(punctuationTable) for item in textList]

In [53]:
" ".join(punctuationFreeText)

'Natural language processing NLP is a subfield of computer science and especially artificial intelligence It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval knowledge representation and computational linguistics a subfield of linguistics Major tasks in natural language processing are speech recognition text classification natural language understanding and natural language generation'