# Basic Text Preprocessing

## Removing Extra White Spaces

Text data often contains extra spaces between words, or before or after a sentence. As a first step, we remove these extra spaces from each sentence, either with regular expressions or with basic string operations.

In [None]:
example_text = "NLP  is an interesting     field.  "

In [None]:
print('\n')
print('Text before removing white spaces : "{}"'.format(example_text))
split_text = example_text.split(' ')
print('\n')
print('Let us split the text: "{}"'.format(split_text))
print('\n')
print('we see empty spaces in this list let us try to get rid of them!')
cleaned_text_tokens = []
for i in split_text:
  if i=='':
    pass
  else:
    cleaned_text_tokens.append(i)
print('\n')
print('List after getting rid of the empty spaces "{}"'.format(cleaned_text_tokens))
print('\n')
print('This looks good, Now let us join them back using a space!')
cleaned_text = ' '.join(cleaned_text_tokens)
print('\n')
print('Text after removing white spaces  : "{}"'.format(cleaned_text))



Text before removing white spaces : "NLP  is an interesting     field.  "


Let us split the text: "['NLP', '', 'is', 'an', 'interesting', '', '', '', '', 'field.', '', '']"


we see empty spaces in this list let us try to get rid of them!


List after getting rid of the empty spaces "['NLP', 'is', 'an', 'interesting', 'field.']"


This looks good, Now let us join them back using a space!


Text after removing white spaces  : "NLP is an interesting field."
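
The loop above can be written more compactly. As a minimal sketch (the function names here are illustrative and not part of the original notebook), the snippet below shows two common idioms: `str.split()` with no argument, which splits on any run of whitespace and drops empty strings, and a regular expression that collapses whitespace runs.

```python
import re

def normalize_spaces_split(text):
    """Collapse whitespace by splitting on whitespace runs and rejoining."""
    # split() with no argument ignores leading/trailing whitespace
    # and never produces empty tokens.
    return ' '.join(text.split())

def normalize_spaces_regex(text):
    """Collapse whitespace with a regular expression."""
    # \s+ matches one or more whitespace characters (spaces, tabs, newlines).
    return re.sub(r'\s+', ' ', text).strip()

example_text = "NLP  is an interesting     field.  "
print('"{}"'.format(normalize_spaces_split(example_text)))
print('"{}"'.format(normalize_spaces_regex(example_text)))
```

Both calls should give the same cleaned sentence as the step-by-step version. Unlike `split(' ')`, they also treat tabs and newlines as whitespace.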


- That should be clear enough, so let's move on to the next step.

## Removing Punctuation

Getting rid of punctuation is important because, after tokenization, each punctuation mark ends up as a token and is given the same importance as a normal word. This kind of preprocessing is often used when building TF-IDF models, where we do not want a punctuation mark to become a token and carry weight in the model.

In [None]:
# Importing string to print all punctuation marks.
import string

In [None]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


- There we go: these are all the punctuation marks.
- Now let us take an example and remove the punctuation from it.

In [None]:
example_text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"

In [None]:
print('\n')
print('Text before cleaning : "{}"'.format(example_text))
new_text = ''
for i in example_text:  # Iterate over each character of the string to drop the punctuation marks.
  if i not in string.punctuation:
    new_text = new_text + i
print('\n')
print('Text after cleaning : "{}"'.format(new_text))



Text before cleaning : "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"


Text after cleaning : "Hello How are you Im very excited that youre going for a trip to Europe Yayy"
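
For longer texts, building the result one character at a time can be slow. A minimal alternative sketch, assuming the same goal of deleting every character in `string.punctuation`, uses `str.maketrans` and `str.translate` (the helper name `remove_punctuation` is made up for this example):

```python
import string

def remove_punctuation(text):
    """Delete every character that appears in string.punctuation."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

example_text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"
print(remove_punctuation(example_text))
```

Note that apostrophes are removed too, so "I'm" becomes "Im" and "you're" becomes "youre", exactly as in the loop version.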


- With punctuation removal done, let's go further.



## Lowercasing Words

- Lowercasing all words is important; otherwise words like 'Hii' and 'hii' end up as different tokens even though they mean the same thing.
- We start from the text left after punctuation removal and split it into words at the spaces.

In [None]:
new_text

'Hello How are you Im very excited that youre going for a trip to Europe Yayy'

In [None]:
new_text = new_text.split(' ')
new_text = [i.lower() for i in new_text]
new_text = ' '.join(new_text)
print(new_text)

hello how are you im very excited that youre going for a trip to europe yayy
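
Splitting is not strictly needed for this step. As a small sketch, `str.lower()` can be applied to the whole string at once, which also keeps the original spacing intact (the helper name `to_lower` is illustrative only):

```python
def to_lower(text):
    """Lowercase the whole string in one call."""
    return text.lower()

new_text = 'Hello How are you Im very excited that youre going for a trip to Europe Yayy'
print(to_lower(new_text))

# 'Hii' and 'hii' now map to the same token.
print(to_lower('Hii') == to_lower('hii'))
```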


- Let us proceed a bit further by removing all those stopwords.

## Removing Stopwords


### Do stopwords explain nothing?
- They do explain a lot, but classical ML approaches rarely use context-based models. Results often get better when we drop these words and keep our *vocabulary (Bag of Words) as small as possible.*

In [None]:
import nltk

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

In [None]:
# First 10 words from the list
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [None]:
# Let us store them into a list and try to get rid of them.
eng_stopwords = stopwords.words('english')

- Let us remove the stopwords from the cleaned text.

In [None]:
new_text

'hello how are you im very excited that youre going for a trip to europe yayy'

In [None]:
# Algorithm for removing stopwords.
no_stopwords_text_list = []
for i in new_text.split(' '):
  if i in eng_stopwords:
    pass
  else:
    no_stopwords_text_list.append(i)
no_stopwords_text = ' '.join(no_stopwords_text_list)
print(no_stopwords_text)

hello im excited youre going trip europe yayy
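
The same filtering can be sketched with a set and a generator expression. Membership tests on a set are faster than on a list, which matters for large corpora. To keep the snippet self-contained, it uses a tiny hand-picked stopword set (`SAMPLE_STOPWORDS` is an assumption for illustration; with NLTK you would build the set from `stopwords.words('english')` instead).

```python
# A tiny illustrative stopword set, NOT the full NLTK list.
SAMPLE_STOPWORDS = {'how', 'are', 'you', 'very', 'that', 'for', 'a', 'to'}

def remove_stopwords(text, stopword_set):
    """Keep only the tokens that are not in stopword_set."""
    # split() without arguments also skips repeated whitespace.
    return ' '.join(tok for tok in text.split() if tok not in stopword_set)

new_text = 'hello how are you im very excited that youre going for a trip to europe yayy'
print(remove_stopwords(new_text, SAMPLE_STOPWORDS))
```

One thing to watch: 'im' and 'youre' survive because the apostrophes were stripped earlier, while the NLTK list stores such entries with the apostrophe (for example "you're"). If these words should be dropped, stopword removal may need to run before punctuation removal.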


- We can see the difference in the text: a lot of noise is gone, while the important words are kept.
- Next, let us do some stemming and lemmatizing.

## Stemming and Lemmatizing

- **Stemming:** A technique that reduces a word to its root form by stripping endings with rules; the result need not be a real word.
- **Lemmatizing:** Also a technique that reduces a word to its root form, in this case its dictionary form (lemma).
- The main practical difference is that lemmatization is most effective when combined with Part-of-Speech (POS) tagging.

### Types of Stemming in NLTK

NLTK provides several stemmers. We look at two basic ones here: *PorterStemmer* and *LancasterStemmer*.

In [17]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

Let's see how differently they work.

In [24]:
# Create one object of each stemmer class
porter = PorterStemmer()
lancaster = LancasterStemmer()
# Provide words to be stemmed
print("Stemming using Porter Stemmer:")
print('cats-->',porter.stem("cats"))
print('trouble-->',porter.stem("trouble"))
print('troubling-->',porter.stem("troubling"))
print('troubled-->',porter.stem("troubled"))
print('friendship-->',porter.stem("friendship"))
print('destabilize-->',porter.stem("destabilize"))
print("\n")
print("Stemming using Lancaster Stemmer")
print('cats-->',lancaster.stem("cats"))
print('trouble-->',lancaster.stem("trouble"))
print('troubling-->',lancaster.stem("troubling"))
print('troubled-->',lancaster.stem("troubled"))
print('friendship-->',lancaster.stem("friendship"))
print('destabilize-->',lancaster.stem("destabilize"))

Stemming using Porter Stemmer:
cats--> cat
trouble--> troubl
troubling--> troubl
troubled--> troubl
friendship--> friendship
destabilize--> destabil


Stemming using Lancaster Stemmer
cats--> cat
trouble--> troubl
troubling--> troubl
troubled--> troubl
friendship--> friend
destabilize--> dest


- The two stemmers agree on most of these words, but LancasterStemmer (introduced in 1990) is much more aggressive than PorterStemmer (developed in 1979): it reduces 'friendship' to 'friend' and 'destabilize' to 'dest', while Porter leaves 'friendship' untouched. Lancaster applies its stripping rules iteratively, removing one ending after another, which is why it tends to over-stem.
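
To make the idea of rule-based suffix stripping concrete, here is a deliberately tiny toy stemmer. It is only a sketch: the suffix list and the `min_stem_len` parameter are hypothetical choices for illustration, and it is not the Porter or Lancaster algorithm, both of which use far larger and carefully ordered rule sets.

```python
# Hypothetical suffix rules for illustration only.
SUFFIXES = ('ing', 'ed', 'es', 's')

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word  # no rule applied

for w in ('cats', 'troubling', 'troubled', 'trouble', 'friendship'):
    print(w, '-->', toy_stem(w))
```

With these four rules, 'troubling' and 'troubled' collapse to the same stem, but 'trouble' is left as it is because no rule removes the final 'e'. Gaps like this are why real stemmers need many more rules.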

### Lemmatization

- Lemmatization is often the more convincing choice because it maps each word to a dictionary form (lemma) and can take the word's part of speech into account, instead of only chopping off endings.

In [30]:
import nltk

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [31]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [32]:
# Always lowercase your text and remove punctuation before lemmatizing.

sentence = "he was running and eating at same time he has bad habit of swimming after playing long hours in the sun"

In [40]:
print('Word before and after Lemmatization :')
for words in sentence.split(' '):
  print('{} --> {}'.format(words,wordnet_lemmatizer.lemmatize(words)))

Word before and after Lemmatization :
he --> he
was --> wa
running --> running
and --> and
eating --> eating
at --> at
same --> same
time --> time
he --> he
has --> ha
bad --> bad
habit --> habit
of --> of
swimming --> swimming
after --> after
playing --> playing
long --> long
hours --> hour
in --> in
the --> the
sun --> sun


- You might be surprised by these results. By default, .lemmatize() treats every word as a noun, so verbs such as 'running' are left unchanged, and 'was' and 'has' lose their final 's' as if they were plurals. Supplying the part of speech is what makes it work properly.
- The .lemmatize() method has a parameter called *'pos'* which accepts a single letter naming the part of speech (for example 'n' for noun or 'v' for verb). In our case we set *pos='v'* because verb forms are the ones that mostly carry the endings we want to get rid of.

In [41]:
print('Word before and after Lemmatization using pos=v :')
for words in sentence.split(' '):
  print('{} --> {}'.format(words,wordnet_lemmatizer.lemmatize(words,pos='v')))

Word before and after Lemmatization using pos=v :
he --> he
was --> be
running --> run
and --> and
eating --> eat
at --> at
same --> same
time --> time
he --> he
has --> have
bad --> bad
habit --> habit
of --> of
swimming --> swim
after --> after
playing --> play
long --> long
hours --> hours
in --> in
the --> the
sun --> sun


- Now we see the difference. Note that 'hours' is no longer reduced to 'hour', because it is not a verb. So the rule is to POS tag the sentence first and then lemmatize each word with its own tag, so that every word is reduced properly based on its role in the sentence.
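
One way to follow this rule is sketched below. It assumes the tags are Penn Treebank tags (the kind `nltk.pos_tag` returns) and maps them to the single letters that *'pos'* accepts. The helper names `penn_to_wordnet` and `lemmatize_tagged` are made up for this example, and the tags are written by hand so that the snippet runs without downloading a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter pos used by WordNet."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # fall back to noun, the default of .lemmatize()

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs using a callable that takes (word, pos)."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Hand-written tags standing in for the output of a POS tagger.
tagged = [('he', 'PRP'), ('was', 'VBD'), ('running', 'VBG'),
          ('long', 'JJ'), ('hours', 'NNS')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK installed and its tagger data downloaded, `tagged` could come from `nltk.pos_tag(sentence.split(' '))`, and `wordnet_lemmatizer.lemmatize` could be passed as the `lemmatize` argument.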

- That covers stemming, lemmatization, and the other basic preprocessing techniques.