Text Preprocessing Using the NLTK Python Library

In [None]:
# word tokenization

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

# Punkt Sentence Tokenizer

# This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

# The NLTK data package includes a pre-trained Punkt tokenizer for English.

str1="i AM hasmukh mer, a machine learning engineer"


print(word_tokenize(str1))

['i', 'AM', 'hasmukh', 'mer', ',', 'a', 'machine', 'learning', 'engineer']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# sentence tokenization

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

str2="i AM hasmukh mer, a machine learning engineer. i am from gujrat"


print(sent_tokenize(str2))

['i AM hasmukh mer, a machine learning engineer.', 'i am from gujrat']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
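To get a feel for what the two tokenizers do, here is a minimal sketch that uses only Python's re module. It is not how NLTK works internally: the two regular expressions and the function names naive_word_tokenize and naive_sent_tokenize are assumptions made for illustration. The naive sentence splitter breaks after every period, which is the abbreviation problem the Punkt model is trained to handle.

```python
import re

def naive_word_tokenize(text):
    # A run of word characters, or any single character that is
    # neither a word character nor whitespace (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

def naive_sent_tokenize(text):
    # Split on whitespace that directly follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_word_tokenize("i AM hasmukh mer, a machine learning engineer"))
# ['i', 'AM', 'hasmukh', 'mer', ',', 'a', 'machine', 'learning', 'engineer']

print(naive_sent_tokenize("Dr. Smith is here. He is an engineer."))
# ['Dr.', 'Smith is here.', 'He is an engineer.']
```

The second result splits "Dr." off as its own sentence, which is why a trained sentence tokenizer is preferable to a plain rule.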


### Stemming

Stemming is the process of reducing a word to its root form. The root, or stem, is the part to which inflectional affixes (such as -ed or -ing) are added. A stemmer creates the stem by removing the prefix or suffix of a word with fixed rules, so stemming a word may not result in an actual word.

For example:

Mangoes ---> Mango

Boys ---> Boy

going ---> go

Stemming works on word tokens, so if our sentences are not yet tokenized, we need to convert them into tokens first. Once the text is converted into tokens, we can reduce each word token to its root form. NLTK provides several stemmers, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. We usually use the Porter stemmer among them.
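As a rough sketch of how the three stemmers can be compared, the snippet below runs each of them over the same words. The word list and the names stemmers and results are chosen for illustration only. The stemmers themselves need no nltk.download call, and no expected output is shown because the stemmers do not always agree with each other.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["machine", "learning", "engineer", "mangoes", "going"]

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
    "lancaster": LancasterStemmer(),
}

# Stem every word with every stemmer so the results can be compared side by side
results = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, stems in results.items():
    print(name, stems)
```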

In [None]:
# Stemming


# import these modules
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt') # required for word_tokenize

ps = PorterStemmer()

str3="i AM hasmukh mer, a machine learning engineer. i am from gujrat"

str3=word_tokenize(str3)

stem_words=[ps.stem(i) for i in str3] # list comprehension
# for i in str3:
#   print(ps.stem(i))

print(stem_words)
  


['i', 'am', 'hasmukh', 'mer', ',', 'a', 'machin', 'learn', 'engin', '.', 'i', 'am', 'from', 'gujrat']


### Lemmatization

Lemmatization, like stemming, reduces a word to its root form. The difference is that lemmatization ensures that the root word (the lemma) belongs to the language, so we always get a valid word. In NLTK (Natural Language Toolkit), we use WordNetLemmatizer to get the lemmas of words. The lemmatizer also needs context in the form of a part of speech, which is passed through the pos parameter of lemmatize(). When pos is omitted, every word is treated as a noun, which is why a word such as 'learning' comes back unchanged in the cell below.

In [None]:
# lemmatization

# import these modules
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt') # required for word_tokenize
nltk.download('wordnet') # required for lemmatize
nltk.download('omw-1.4') # required for lemmatize

lemmatizer = WordNetLemmatizer()

str4="i AM hasmukh mer, a machine learning engineer. i am from gujrat"

str4=word_tokenize(str4)

lemma_words=[lemmatizer.lemmatize(i) for i in str4] # list comprehension
lemma_words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


['i',
 'AM',
 'hasmukh',
 'mer',
 ',',
 'a',
 'machine',
 'learning',
 'engineer',
 '.',
 'i',
 'am',
 'from',
 'gujrat']

### Remove default stopwords:

Stopwords are very common words, such as "a", "am", and "from", that carry little meaning on their own. For many tasks they can be removed without changing what a sentence is about. The NLTK (Natural Language Toolkit) library ships a list of English stopwords, and we can use it to remove stopwords from our text and return a list of the remaining word tokens.

In [None]:
#remove stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords') # required for stopwords
nltk.download('punkt') # required for word_tokenize
str5="i AM hasmukh mer, a machine learning engineer. i am from gujrat"
stop_words=stopwords.words('english') # separate name, so the stopwords module is not overwritten
str5=word_tokenize(str5)

filtered_words=[i for i in str5 if i.lower() not in stop_words] # list comprehension
filtered_words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['hasmukh', 'mer', ',', 'machine', 'learning', 'engineer', '.', 'gujrat']

In [None]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
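The list printed above contains negations such as 'no', 'nor', and 'not', and dropping them can flip the meaning of a sentence. One possible way to handle this is to subtract the words you want to keep from the stopword set before filtering. The sketch below uses a small hand-written subset of the list so that it runs without any download. The names keep, custom_stop_words, and remove_stopwords are hypothetical.

```python
# Hand-written subset for illustration; every word here appears in the list above
stop_words = {"i", "am", "a", "the", "this", "is", "not", "no", "nor", "from"}

# Words we want to keep even though they are on the stopword list
keep = {"not", "no", "nor"}
custom_stop_words = stop_words - keep

def remove_stopwords(tokens, stop_set):
    # Compare in lowercase; a set makes each membership test fast
    return [t for t in tokens if t.lower() not in stop_set]

tokens = ["This", "is", "not", "a", "good", "engineer"]

print(remove_stopwords(tokens, stop_words))         # ['good', 'engineer']
print(remove_stopwords(tokens, custom_stop_words))  # ['not', 'good', 'engineer']
```

With the full NLTK list, the same idea would be set(stopwords.words('english')) - keep.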


### POS tagging

A part-of-speech (POS) tag describes how a word is used in a sentence. The same word can have different roles and meanings depending on its context. Basic natural language processing (NLP) models such as bag-of-words (BoW) ignore word order and so fail to identify these relationships between words.
For that reason we use POS tagging to mark each word with a tag based on its context in the data. POS tags are also used to extract relationships between words.


In [4]:

# importing tokenize library
import nltk
from nltk.tokenize import word_tokenize 
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger') # required for pos tag
nltk.download("punkt")# required for word_tokenize
  
# convert text into word_tokens with their tags 
def pos_tagg(text): 
    word_tokens = word_tokenize(text) 
    return pos_tag(word_tokens) 
  
pos_tagg('are you hasmukh? a machine learning engineer') 

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('are', 'VBP'),
 ('you', 'PRP'),
 ('hasmukh', 'VB'),
 ('?', '.'),
 ('a', 'DT'),
 ('machine', 'NN'),
 ('learning', 'NN'),
 ('engineer', 'NN')]

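In the above example, VBP stands for verb (non-3rd person singular present), PRP for personal pronoun, DT for determiner, and NN for common noun (singular or mass). The name 'hasmukh' is tagged VB (verb, base form) here, which is a tagging error; a name would normally be tagged NNP (proper noun). We can get the details of all POS tags from the Penn Treebank tagset.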

In [5]:
# downloading the tagset  
nltk.download('tagsets') 
  
# extract information about the tag 
nltk.help.upenn_tagset('PRP')


PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


In [6]:
# extract information about the tag 
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
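Penn Treebank tags can also supply the pos argument that the lemmatization section mentions. A common approach, sketched here with a hypothetical helper named penn_to_wordnet, is to map the first letters of each tag to one of the single-letter codes that WordNetLemmatizer accepts ('n', 'v', 'a', 'r'). Falling back to noun for every other tag is an assumption of this sketch, chosen to match the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"   # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"   # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"   # adverb (RB, RBR, RBS)
    return "n"       # nouns and all other tags fall back to noun

tagged = [('are', 'VBP'), ('you', 'PRP'), ('a', 'DT'), ('machine', 'NN')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('are', 'v'), ('you', 'n'), ('a', 'n'), ('machine', 'n')]
```

Each pair can then be passed on as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).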
