# Implementing Text Pre-processing using NLTK

## Tokenization
Tokenization is the process of splitting text into smaller units called **tokens**. Tokens can be words, numbers, symbols, n-grams, etc.

In [1]:
!pip install nltk



In [2]:
!pip install -U nltk

Requirement already up-to-date: nltk in c:\users\hp\anaconda3\lib\site-packages (3.5)


In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# importing tokenizing library
from nltk.tokenize import sent_tokenize, word_tokenize

In [5]:
# Tokenization
text = "Hi everyone, How are you doing? I am a member of DSN. This is a community project"

sent_tokenize(text)  # returns a list of sentences

['Hi everyone, How are you doing?',
 'I am a member of DSN.',
 'This is a community project']

In [6]:
word_tokenize(text)  # returns a list of words and punctuation tokens

['Hi',
 'everyone',
 ',',
 'How',
 'are',
 'you',
 'doing',
 '?',
 'I',
 'am',
 'a',
 'member',
 'of',
 'DSN',
 '.',
 'This',
 'is',
 'a',
 'community',
 'project']
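Note that punctuation marks come back as tokens of their own. If only the words are needed, they can be filtered out afterwards. The following is a minimal sketch, assuming the token list printed above is pasted in directly so that no NLTK download is needed; `is_punctuation` is a hypothetical helper name, not part of NLTK.

```python
import string

# Token list copied from the word_tokenize output above
tokens = ['Hi', 'everyone', ',', 'How', 'are', 'you', 'doing', '?',
          'I', 'am', 'a', 'member', 'of', 'DSN', '.',
          'This', 'is', 'a', 'community', 'project']

def is_punctuation(token):
    """Return True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Keep only tokens that are not pure punctuation
words = [tok for tok in tokens if not is_punctuation(tok)]
print(words)
```

This check only looks at ASCII punctuation, so other symbols would need their own rule.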

## Stemming

Stemming removes inflectional endings (such as "-ing", "-s" and "-ed") from a token, reducing it to a root form.

In [7]:
# Importing the stemming library
from nltk.stem import PorterStemmer

In [8]:
# Stemming

stemmer = PorterStemmer()

print(stemmer.stem("laughing"))
print(stemmer.stem("laughs"))
print(stemmer.stem("laughed"))

laugh
laugh
laugh


Stemming has a limitation: it sometimes generates terms that are not real words. For example:

In [9]:
print(stemmer.stem("decreased"))

decreas


## Lemmatization

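Lemmatization is the systematic process of reducing a token to its lemma, that is, its dictionary form.

It makes use of vocabulary, word structure and part-of-speech tags.

Because it returns actual dictionary words, it generally gives more meaningful results than stemming.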

In [10]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
# Importing the lemmatization library
from nltk.stem import WordNetLemmatizer


In [12]:
# Lemmatizing

lemma = WordNetLemmatizer()

print(lemma.lemmatize("increases"))
print(lemma.lemmatize("running"))  # default pos is noun, so "running" is returned unchanged
print(lemma.lemmatize("running", pos="v"))  # pos = part of speech, v = verb

increase
running
run


In [13]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [14]:
from nltk import pos_tag

tokens = word_tokenize(text)
pos_tag(tokens)

[('Hi', 'NNP'),
 ('everyone', 'NN'),
 (',', ','),
 ('How', 'WRB'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('doing', 'VBG'),
 ('?', '.'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('a', 'DT'),
 ('member', 'NN'),
 ('of', 'IN'),
 ('DSN', 'NNP'),
 ('.', '.'),
 ('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('community', 'NN'),
 ('project', 'NN')]
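The tags returned by `pos_tag` use the Penn Treebank convention, while the `pos` argument of `lemmatize` expects single-letter WordNet codes such as `"v"`. A small mapping is therefore needed to connect the two. The following is a minimal sketch of one common way to do it; `penn_to_wordnet` is a hypothetical helper name, and the sample pairs are copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet pos code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default

# A few (word, tag) pairs taken from the pos_tag output above
tagged = [("are", "VBP"), ("doing", "VBG"), ("member", "NN"), ("This", "DT")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With this helper, each tagged token could be lemmatized as `lemma.lemmatize(word, pos=penn_to_wordnet(tag))`.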

In [15]:
# Getting synonym sets (synsets)

from nltk.corpus import wordnet

wordnet.synsets("good")

[Synset('good.n.01'),
 Synset('good.n.02'),
 Synset('good.n.03'),
 Synset('commodity.n.01'),
 Synset('good.a.01'),
 Synset('full.s.06'),
 Synset('good.a.03'),
 Synset('estimable.s.02'),
 Synset('beneficial.s.01'),
 Synset('good.s.06'),
 Synset('good.s.07'),
 Synset('adept.s.01'),
 Synset('good.s.09'),
 Synset('dear.s.02'),
 Synset('dependable.s.04'),
 Synset('good.s.12'),
 Synset('good.s.13'),
 Synset('effective.s.04'),
 Synset('good.s.15'),
 Synset('good.s.16'),
 Synset('good.s.17'),
 Synset('good.s.18'),
 Synset('good.s.19'),
 Synset('good.s.20'),
 Synset('good.s.21'),
 Synset('well.r.01'),
 Synset('thoroughly.r.02')]
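Each `Synset` is a group of words sharing one sense, not a synonym string. To get the words themselves, the lemma names of every synset can be collected. The sketch below assumes each item exposes `lemma_names()`, as NLTK `Synset` objects do; `synonyms_from_synsets` is a hypothetical helper name.

```python
def synonyms_from_synsets(synsets):
    """Collect the distinct lemma names from an iterable of synsets."""
    names = set()
    for syn in synsets:
        for name in syn.lemma_names():
            # WordNet joins multi-word lemmas with underscores
            names.add(name.replace("_", " "))
    return sorted(names)
```

It could be called as `synonyms_from_synsets(wordnet.synsets("good"))`. The result mixes all senses of the word, so filtering by part of speech first may be worthwhile.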

In [16]:
# Obtaining Ngrams

from nltk import ngrams

sentence = "What do you have for me"

n = 3
for gram in ngrams(word_tokenize(sentence), n):
    print(gram)

('What', 'do', 'you')
('do', 'you', 'have')
('you', 'have', 'for')
('have', 'for', 'me')


From the above:

If n = 1, it is a unigram.

If n = 2, it is a bigram.

If n = 3, it is a trigram, as in the example above. Each result is a tuple of 3 consecutive words, obtained by sliding a window of size 3 across the sentence one word at a time.
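To make the windowing concrete, here is a minimal pure-Python sketch that produces the same kind of tuples without NLTK. It illustrates the idea and is not how `nltk.ngrams` is implemented; `sliding_ngrams` is a hypothetical name, and plain whitespace splitting stands in for `word_tokenize`.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # zip over n copies of the list, each shifted one position further
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "What do you have for me".split()

for n in (1, 2, 3):
    print(n, sliding_ngrams(tokens, n))
```

A sentence of 6 tokens gives 6 unigrams, 5 bigrams and 4 trigrams, since each window must fit entirely inside the sentence.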