#  Introduction to text processing using NLTK
The Natural Language Toolkit (NLTK) is a powerful Python package that
provides a diverse set of natural language processing algorithms. It is free, open source,
easy to use, well documented, and supported by a large community. NLTK covers the
most common tasks, such as tokenizing, part-of-speech tagging, stemming,
sentiment analysis, topic segmentation, and named entity recognition.
In this tutorial, you are going to cover the following topics:
- Text Analysis Operations using NLTK
- Tokenization
- Stopwords
- Lexicon Normalization such as Stemming and Lemmatization
- POS Tagging

These are common practices in Natural Language Processing (NLP).
___
#### NLTK installation
The first step is to install NLTK. Open the Anaconda Prompt and type `conda install nltk`.

In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

A new window should open, showing the `NLTK Downloader`. You don’t have
to download all packages. You only need the `Popular packages` collection, which you can
select and download from the Collections tab.
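
If the graphical downloader is unavailable (for example, on a remote server), the resources can also be fetched from code. The sketch below assumes the resource identifiers `punkt`, `stopwords`, `wordnet`, and `averaged_perceptron_tagger`, which should cover the steps in this tutorial; the exact identifiers depend on the installed NLTK version (newer releases may ask for `punkt_tab` and `averaged_perceptron_tagger_eng` instead). The helper name `download_resources` is hypothetical.

```python
# Resource identifiers assumed for this tutorial; adjust them to your NLTK version.
RESOURCES = ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]


def download_resources(resources=RESOURCES):
    """Download each resource and report whether the download succeeded."""
    import nltk  # imported here so the helper can be defined before NLTK is installed

    results = {}
    for name in resources:
        # nltk.download returns True on success and False otherwise
        results[name] = nltk.download(name, quiet=True)
    return results
```

Calling `download_resources()` once per environment should be enough, since the data is cached in the NLTK data directory.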

### Tokenization
Tokenization is the process of breaking down a
text paragraph into smaller chunks, such as words or sentences.

#### Sentence Tokenization
A sentence tokenizer breaks a text paragraph into sentences. The following code
uses NLTK for sentence tokenization.

In [2]:
text = """"Dogs ,Canis lupus familiaris, are domesticated mammals, not natural wild animals. They were originally bred from
wolves. They have been bred by humans for a long time, and
were the first animals ever to be domesticated.
Today, some dogs are used as pets, others are used to help
humans do their work. They are a popular pet because they
are usually playful, friendly, loyal and listen to humans.
Thirty million dogs in the United States are registered as
pets. Dogs eat both meat and vegetables, often mixed
together and sold in stores as dog food. Dogs often have
jobs, including as police dogs, army dogs, assistance dogs,
fire dogs, messenger dogs, hunting dogs, herding dogs, or
rescue dogs.
They are sometimes called "canines" from the Latin word for dog
- canis. Sometimes people also use "dog" to describe other
canids, such as wolves. A baby dog is called a pup or puppy.
A dog is called a puppy until it is about one year old.
Dogs are sometimes referred to as "man’s best friend" because
they are kept as domestic pets and are usually loyal and
like being around humans.
I am ICT student """

In [5]:
from nltk.tokenize import sent_tokenize
tokenized_text = sent_tokenize(text)    # sentence tokenization
print(tokenized_text)

['"Dogs ,Canis lupus familiaris, are domesticated mammals, not natural wild animals.', 'They were originally bred from\nwolves.', 'They have been bred by humans for a long time, and\nwere the first animals ever to be domesticated.', 'Today, some dogs are used as pets, others are used to help\nhumans do their work.', 'They are a popular pet because they\nare usually playful, friendly, loyal and listen to humans.', 'Thirty million dogs in the United States are registered as\npets.', 'Dogs eat both meat and vegetables, often mixed\ntogether and sold in stores as dog food.', 'Dogs often have\njobs, including as police dogs, army dogs, assistance dogs,\nfire dogs, messenger dogs, hunting dogs, herding dogs, or\nrescue dogs.', 'They are sometimes called "canines" from the Latin word for dog\n- canis.', 'Sometimes people also use "dog" to describe other\ncanids, such as wolves.', 'A baby dog is called a pup or puppy.', 'A dog is called a puppy until it is about one year old.', 'Dogs are somet
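
To make the idea concrete, here is a deliberately naive sentence splitter built only on the standard library. It is not how `sent_tokenize` works internally (NLTK uses a trained Punkt model that copes with abbreviations); the function name `naive_sent_split` is made up for this illustration.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "They were originally bred from wolves. A baby dog is called a pup or puppy. I am an ICT student"
for sentence in naive_sent_split(sample):
    print(sentence)
```

A rule like this breaks on text such as "Dr. Smith arrived.", which it splits after "Dr."; that is the main reason to prefer `sent_tokenize` in practice.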

#### Word Tokenization
A word tokenizer breaks a text paragraph into words. The following code uses
NLTK for word tokenization.

In [13]:
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize(text)    # word tokenization
print(tokenized_word)

['``', 'Dogs', ',', 'Canis', 'lupus', 'familiaris', ',', 'are', 'domesticated', 'mammals', ',', 'not', 'natural', 'wild', 'animals', '.', 'They', 'were', 'originally', 'bred', 'from', 'wolves', '.', 'They', 'have', 'been', 'bred', 'by', 'humans', 'for', 'a', 'long', 'time', ',', 'and', 'were', 'the', 'first', 'animals', 'ever', 'to', 'be', 'domesticated', '.', 'Today', ',', 'some', 'dogs', 'are', 'used', 'as', 'pets', ',', 'others', 'are', 'used', 'to', 'help', 'humans', 'do', 'their', 'work', '.', 'They', 'are', 'a', 'popular', 'pet', 'because', 'they', 'are', 'usually', 'playful', ',', 'friendly', ',', 'loyal', 'and', 'listen', 'to', 'humans', '.', 'Thirty', 'million', 'dogs', 'in', 'the', 'United', 'States', 'are', 'registered', 'as', 'pets', '.', 'Dogs', 'eat', 'both', 'meat', 'and', 'vegetables', ',', 'often', 'mixed', 'together', 'and', 'sold', 'in', 'stores', 'as', 'dog', 'food', '.', 'Dogs', 'often', 'have', 'jobs', ',', 'including', 'as', 'police', 'dogs', ',', 'army', 'dogs',
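
For comparison, a rough word tokenizer can be written with a single regular expression. This is a simplified sketch, not the algorithm behind `word_tokenize`; the name `simple_word_tokenize` is an assumption for the example.

```python
import re


def simple_word_tokenize(text):
    """Return runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_word_tokenize("Dogs ,Canis lupus familiaris, are domesticated mammals."))
```

Unlike `word_tokenize`, this sketch leaves double quotes as `"` instead of converting them to the paired quote tokens seen in the output above, and it splits a contraction such as "don't" at the apostrophe.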

#### Frequency Distribution
Let’s calculate the frequency distribution of those words using NLTK.
NLTK provides a class called `FreqDist` that does the job.

In [14]:
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 124 samples and 234 outcomes>


In [15]:
fdist.most_common(10)   # top 10 most common words

[(',', 18),
 ('.', 13),
 ('are', 10),
 ('dogs', 10),
 ('as', 7),
 ('and', 6),
 ('to', 5),
 ('dog', 5),
 ('``', 4),
 ('Dogs', 4)]
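
The top of the list above is dominated by punctuation, and `dogs` and `Dogs` are counted separately. One way to get a more informative count is to lowercase the tokens and keep only alphabetic ones before counting. The sketch below uses `collections.Counter` from the standard library (NLTK's `FreqDist` is a subclass of it, so `most_common` behaves the same way); the short token list is a made-up sample, not the full text.

```python
from collections import Counter

# A small hand-written sample standing in for tokenized_word
tokens = ["Dogs", "are", "pets", ".", "dogs", "are", "loyal", ",", "Dogs", "eat", "meat", "."]

# Lowercase and drop tokens that are not purely alphabetic (punctuation, quotes)
words = [t.lower() for t in tokens if t.isalpha()]

counts = Counter(words)
print(counts.most_common(2))
```

The same preprocessing can be applied to `tokenized_word` before passing it to `FreqDist`.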

### Stopwords
Any piece of text that is not relevant to the context of the data and the
end output can be considered noise. Examples include language stopwords
(commonly used words of a language, such as is, am, the, of, in), URLs or links,
social media entities (mentions, hashtags), punctuation, and industry-specific
words.

This step removes all types of noisy entities present in the
text. To remove stopwords in NLTK, you create a list of stopwords and
filter those words out of your list of tokens.

In [16]:
my_stopwords = ['is', 'a', 'this', ',', 'ICT']   # create a list of the stopwords that you want to remove

In [17]:
from nltk.corpus import stopwords
# NLTK also provides a set of common stopwords

In [18]:
nltk_stop_words = set(stopwords.words("english"))
print(nltk_stop_words)

{'himself', 'do', 'as', 'same', 'further', 'my', 'me', 'aren', 'up', 'weren', 't', 'couldn', 'have', 'you', 'and', 'about', 'all', 'against', "won't", 'now', 'should', 'his', 'hasn', "mustn't", 'to', 'each', 'who', 'ain', 'haven', 'while', "shan't", 'at', "don't", 'off', 've', "couldn't", "it's", 'your', 'won', "you'd", 'him', 'very', 'before', 'any', 'doesn', 'itself', 'between', 'why', "haven't", 'both', 'when', 'we', 'here', 'own', "isn't", 'because', 'with', 'of', 'being', 'most', "hadn't", 'below', 'our', "that'll", 'll', 'from', "you've", 'down', 'in', 'some', "she's", 'only', 'hers', 'having', 'if', 'does', 'don', 'what', 'again', 'ours', 'or', 'just', 'ma', 'too', 'but', 'her', 'yourself', 'these', 'nor', 'for', 'those', 'myself', 'so', 'which', 'into', 'it', 'shouldn', 'this', 'more', "you're", 'can', 'after', 'had', 'during', 'is', 's', 'y', 'whom', 'how', 'was', 'am', 'yours', 'under', 'wouldn', 'above', 'theirs', "didn't", "should've", 'by', 'their', 'a', 'isn', 'no', 'migh

#### Removing Stopwords
Next, you are going to use the stopwords that you prepared to remove noise
from the text.

In [24]:
filtered_words = []
for w in tokenized_word:
    if w not in my_stopwords:
        filtered_words.append(w)
        
# print("Tokenized Sentence:", tokenized_word)
# print()
print("Filtered Sentence:",filtered_words)

Filtered Sentence: ['``', 'Dogs', 'Canis', 'lupus', 'familiaris', 'are', 'domesticated', 'mammals', 'not', 'natural', 'wild', 'animals', '.', 'They', 'were', 'originally', 'bred', 'from', 'wolves', '.', 'They', 'have', 'been', 'bred', 'by', 'humans', 'for', 'long', 'time', 'and', 'were', 'the', 'first', 'animals', 'ever', 'to', 'be', 'domesticated', '.', 'Today', 'some', 'dogs', 'are', 'used', 'as', 'pets', 'others', 'are', 'used', 'to', 'help', 'humans', 'do', 'their', 'work', '.', 'They', 'are', 'popular', 'pet', 'because', 'they', 'are', 'usually', 'playful', 'friendly', 'loyal', 'and', 'listen', 'to', 'humans', '.', 'Thirty', 'million', 'dogs', 'in', 'the', 'United', 'States', 'are', 'registered', 'as', 'pets', '.', 'Dogs', 'eat', 'both', 'meat', 'and', 'vegetables', 'often', 'mixed', 'together', 'and', 'sold', 'in', 'stores', 'as', 'dog', 'food', '.', 'Dogs', 'often', 'have', 'jobs', 'including', 'as', 'police', 'dogs', 'army', 'dogs', 'assistance', 'dogs', 'fire', 'dogs', 'messen

In [20]:
NLTK_filtered_words = []
for w in tokenized_word:
    if w not in nltk_stop_words:
        NLTK_filtered_words.append(w)

print("Filtered sentence:", NLTK_filtered_words)

Filtered sentence: ['``', 'Dogs', ',', 'Canis', 'lupus', 'familiaris', ',', 'domesticated', 'mammals', ',', 'natural', 'wild', 'animals', '.', 'They', 'originally', 'bred', 'wolves', '.', 'They', 'bred', 'humans', 'long', 'time', ',', 'first', 'animals', 'ever', 'domesticated', '.', 'Today', ',', 'dogs', 'used', 'pets', ',', 'others', 'used', 'help', 'humans', 'work', '.', 'They', 'popular', 'pet', 'usually', 'playful', ',', 'friendly', ',', 'loyal', 'listen', 'humans', '.', 'Thirty', 'million', 'dogs', 'United', 'States', 'registered', 'pets', '.', 'Dogs', 'eat', 'meat', 'vegetables', ',', 'often', 'mixed', 'together', 'sold', 'stores', 'dog', 'food', '.', 'Dogs', 'often', 'jobs', ',', 'including', 'police', 'dogs', ',', 'army', 'dogs', ',', 'assistance', 'dogs', ',', 'fire', 'dogs', ',', 'messenger', 'dogs', ',', 'hunting', 'dogs', ',', 'herding', 'dogs', ',', 'rescue', 'dogs', '.', 'They', 'sometimes', 'called', '``', 'canines', "''", 'Latin', 'word', 'dog', '-', 'canis', '.', 'Some
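
Note that `They` survives in the output above: the NLTK stopword list is lowercase, and the membership test is case-sensitive. A common variation, sketched here with a list comprehension, lowercases each token before the lookup and drops punctuation at the same time. The tiny `stop_words` set is a hand-picked stand-in for `nltk_stop_words` so that the snippet runs without downloaded data.

```python
# Hand-picked stand-in for nltk_stop_words (assumption: lowercase entries only)
stop_words = {"they", "are", "a", "because", "and", "to"}

tokens = ["They", "are", "a", "popular", "pet", "because", "they", "are", "loyal", "."]

# Compare in lowercase so "They" matches "they"; isalpha() removes punctuation
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)
```

Whether to lowercase before filtering is a design choice: it removes more noise, but it also discards capitalization that later steps such as POS tagging may rely on.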

### Lexicon Normalization
Lexicon normalization addresses another type of noise in the text: multiple surface
forms of the same word. For example, connection, connected, and connecting all
reduce to the common word “connect”.
Normalization reduces derivationally related forms of a word to a common root word.
The most common lexicon normalization practices are:
- Stemming
- Lemmatization

#### Stemming
Stemming is a process of linguistic normalization that reduces words to their
root by chopping off derivational affixes (“ing”, “ly”, “es”, “s”, etc.).
For example, connection, connected, and connecting all reduce to the common word
“connect”. The resulting stem is not always a dictionary word, as the “commun”
output below shows.

In [21]:
from nltk.stem import PorterStemmer
p = PorterStemmer()
word="communication"
p.stem(word)

'commun'

In [23]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

stem = PorterStemmer()

stemmed_words=[]
for w in filtered_words:
    stemmed_words.append(stem.stem(w))
    
# print("Filtered Sentence:", filtered_words)
# print()
print("Stemmed Sentence:", stemmed_words)

Stemmed Sentence: ['``', 'dog', 'cani', 'lupu', 'familiari', 'are', 'domest', 'mammal', 'not', 'natur', 'wild', 'anim', '.', 'they', 'were', 'origin', 'bred', 'from', 'wolv', '.', 'they', 'have', 'been', 'bred', 'by', 'human', 'for', 'long', 'time', 'and', 'were', 'the', 'first', 'anim', 'ever', 'to', 'be', 'domest', '.', 'today', 'some', 'dog', 'are', 'use', 'as', 'pet', 'other', 'are', 'use', 'to', 'help', 'human', 'do', 'their', 'work', '.', 'they', 'are', 'popular', 'pet', 'becaus', 'they', 'are', 'usual', 'play', 'friendli', 'loyal', 'and', 'listen', 'to', 'human', '.', 'thirti', 'million', 'dog', 'in', 'the', 'unit', 'state', 'are', 'regist', 'as', 'pet', '.', 'dog', 'eat', 'both', 'meat', 'and', 'veget', 'often', 'mix', 'togeth', 'and', 'sold', 'in', 'store', 'as', 'dog', 'food', '.', 'dog', 'often', 'have', 'job', 'includ', 'as', 'polic', 'dog', 'armi', 'dog', 'assist', 'dog', 'fire', 'dog', 'messeng', 'dog', 'hunt', 'dog', 'herd', 'dog', 'or', 'rescu', 'dog', '.', 'they', 'are
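
The suffix-chopping idea can be illustrated with a toy stemmer. This is a simplified sketch, not the Porter algorithm (which applies several ordered rule phases with conditions on the remaining stem); the suffix list and the name `toy_stem` are assumptions chosen for the example.

```python
# Suffixes are tried in order; the first one that matches is removed.
SUFFIXES = ("ing", "ion", "ed", "es", "ly", "s")


def toy_stem(word, min_stem_len=3):
    """Strip one known suffix, provided at least min_stem_len characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["connection", "connected", "connecting", "dogs", "friendly"]:
    print(w, "->", toy_stem(w))
```

Rules this crude over-strip words such as "spring" (which becomes "spr"), which is why real stemmers add conditions on what remains after a suffix is removed.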

#### Lemmatization
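Lemmatization, on the other hand, is an organized, step-by-step procedure
for obtaining the root form of a word (its lemma). It makes use of a vocabulary (a dictionary
of known words) and morphological analysis (word structure and grammar
relations). The `pos` argument in the examples below tells the lemmatizer which part of
speech to assume, which is why “flying” becomes “fly” as a verb but stays “flying” as a noun.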

In [66]:
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
lem.lemmatize("walking", pos='v')

'walk'

In [68]:
word = "flying"
print(lem.lemmatize(word, pos='v'))
print(lem.lemmatize(word, pos='n'))

fly
flying
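
The role of the vocabulary and of the `pos` argument can be seen in a toy lookup-based lemmatizer. This is only a sketch: the three-entry table `LEMMAS` and the function `toy_lemmatize` are invented for illustration, whereas `WordNetLemmatizer` consults the WordNet database.

```python
# Hypothetical mini-vocabulary: (word, part of speech) -> lemma
LEMMAS = {
    ("flying", "v"): "fly",
    ("wolves", "n"): "wolf",
    ("were", "v"): "be",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma if (word, pos) is known, otherwise the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("flying", pos="v"))
print(toy_lemmatize("flying", pos="n"))
print(toy_lemmatize("wolves"))
```

Unlike the stemmer, which produced `wolv` for “wolves” earlier, a dictionary-based approach returns a real word. The default of `pos="n"` mirrors the noun default of `WordNetLemmatizer.lemmatize`.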


### POS Tagging
The primary goal of part-of-speech (POS) tagging is to identify the grammatical
group of a given word (noun, pronoun, adjective,
verb, adverb, etc.) based on its context. POS tagging looks for relationships
within the sentence and assigns a corresponding tag to each word.

In [69]:
sent = "Rin likes to eat chocolate"

tokens = nltk.word_tokenize(sent)
print(tokens)

nltk.pos_tag(tokens)

['Rin', 'likes', 'to', 'eat', 'chocolate']


[('Rin', 'NNP'),
 ('likes', 'VBZ'),
 ('to', 'TO'),
 ('eat', 'VB'),
 ('chocolate', 'NN')]
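
The tags returned by `nltk.pos_tag` follow the Penn Treebank tag set, while the lemmatizer's `pos` argument expects WordNet codes (`'n'`, `'v'`, `'a'`, `'r'`). A small mapping, sketched below, lets the two steps work together. The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for unmapped tags is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag


# Tagged tokens copied from the output above
tagged = [("Rin", "NNP"), ("likes", "VBZ"), ("to", "TO"), ("eat", "VB"), ("chocolate", "NN")]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With NLTK available, each pair could then be passed on as `lem.lemmatize(word, pos=penn_to_wordnet(tag))`, so that verbs such as "likes" are lemmatized as verbs rather than as nouns.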