<a href="https://colab.research.google.com/github/kishlay-kk/nltk_chatbot/blob/master/Tokenising_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Tokenisation and Tagging of Text**
To classify and analyze a body of text at a finer granularity, we first need to break it into individual sentences and words, or "tokens". Broadly, there are two tasks:

***Sentence Tokenisation***  
***Word Tokenisation***

To go beyond counting the frequency or occurrence of actual words, we need to classify words into general categories that signify their role in the structure of the sentence, for instance noun, verb or adjective. This is generally known as Part of Speech (POS) tagging.

***Sentence Tokenisation***


In [9]:
import nltk
nltk.download('punkt')

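# Note: a trailing backslash joins the next line without adding a space,
# so "Hanlon's" and "milkman" run together in the output below.
ulysses = "Mrkgnao! the cat said loudly. She blinked up out of her avid shameclosing eyes, mewing \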
plaintively and long, showing him her milkwhite teeth. He watched the dark eyeslits narrowing \
with greed till her eyes were green stones. Then he went to the dresser, took the jug Hanlon's\
milkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.\
— Gurrhr! she cried, running to lap."

doc = nltk.sent_tokenize(ulysses)     # split the text into a list of sentences
for s in doc:
    print(">", s)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
> Mrkgnao!
> the cat said loudly.
> She blinked up out of her avid shameclosing eyes, mewing plaintively and long, showing him her milkwhite teeth.
> He watched the dark eyeslits narrowing with greed till her eyes were green stones.
> Then he went to the dresser, took the jug Hanlon'smilkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.— Gurrhr!
> she cried, running to lap.

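To see why a trained sentence tokenizer is useful, it helps to compare it with a naive rule. The sketch below is not how `sent_tokenize` works: `naive_sent_split` is a hypothetical helper that simply splits after `.`, `!` or `?` when followed by whitespace. It handles the simple sample but breaks on an abbreviation such as "Dr.", which is the kind of case a trained model like Punkt is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Simple case: every terminator really does end a sentence.
sample = "Mrkgnao! the cat said loudly. Then he went to the dresser."
print(naive_sent_split(sample))

# Abbreviation case: the rule wrongly splits after "Dr."
abbrev = "Dr. Smith arrived late. He sat down."
print(naive_sent_split(abbrev))
```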

***Word Tokenisation***

There are many methods for tokenising text into words. The default tokeniser, used by `word_tokenize`, follows the conventions of the Penn Treebank corpus. A few examples of tokenisers that give different results are listed below:

TreebankWordTokenizer  
WordPunctTokenizer  
WhitespaceTokenizer  

We can see a simple illustration of the impact of choosing a different tokenisation method by comparing the results we get for a simple sentence:

In [10]:
from nltk import word_tokenize
sentence = "Mary had a little lamb it's fleece was white as snow."

# Default tokeniser (Treebank-style)
tree_tokens = word_tokenize(sentence)
print("DEFAULT: ", tree_tokens)

# Other tokenisers

# Tokenising on punctuation: WordPunctTokenizer is a class, tokenize() is its method
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)
print("PUNCTUATION : ", punct_tokens)

# Tokenising on spaces
space_tokenizer = nltk.tokenize.SpaceTokenizer()
space_tokens = space_tokenizer.tokenize(sentence)
print("SPACES : ", space_tokens)

DEFAULT:  ['Mary', 'had', 'a', 'little', 'lamb', 'it', "'s", 'fleece', 'was', 'white', 'as', 'snow', '.']
PUNCTUATION :  ['Mary', 'had', 'a', 'little', 'lamb', 'it', "'", 's', 'fleece', 'was', 'white', 'as', 'snow', '.']
SPACES :  ['Mary', 'had', 'a', 'little', 'lamb', "it's", 'fleece', 'was', 'white', 'as', 'snow.']
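The last two results can be approximated without NLTK, which helps show what each strategy does. This is a sketch using only the standard library: splitting on single spaces stands in for `SpaceTokenizer`, and the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation) stands in for `WordPunctTokenizer`. The variable names `space_like` and `punct_like` are illustrative.

```python
import re

sentence = "Mary had a little lamb it's fleece was white as snow."

# Split on single spaces: punctuation stays attached to the neighbouring word.
space_like = sentence.split(" ")

# Runs of word characters or runs of punctuation: the apostrophe and the
# full stop become tokens of their own.
punct_like = re.findall(r"\w+|[^\w\s]+", sentence)

print("SPACES      :", space_like)
print("PUNCTUATION :", punct_like)
```

Neither approach keeps `'s` together as a single token the way the default tokeniser does.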


***Part of Speech Tagging***

For each word token, the NLTK `pos_tag` function assigns a Part of Speech (POS) label, automating the classification of words into their parts of speech.

The outcome depends on how the sentence has been split into individual tokens, and on which tokenizer and corpus the POS tagger was trained against:

In [12]:

import nltk
nltk.download('averaged_perceptron_tagger')   # model data required by pos_tag
pos = nltk.pos_tag(tree_tokens)
print(pos)
pos_space = nltk.pos_tag(space_tokens)
print(pos_space)
pos_punct = nltk.pos_tag(punct_tokens)
print(pos_punct)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]
[('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'JJ'), ("it's", 'NN'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow.', 'NN')]
[('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'NN'), ('it', 'PRP'), ("'", "''"), ('s', 'JJ'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]
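The effect of tokenisation on tagging is easier to see by comparing the outputs directly. The sketch below copies the first two tagged lists from the output above as literals, so it runs without NLTK, and reports words that appear in both lists but received different tags. `tags_that_differ` is a hypothetical helper name, and it assumes each word occurs only once per sentence.

```python
# Tagged output copied from the default and space-based runs above.
pos_tree = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
            ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'),
            ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'),
            ('.', '.')]
pos_space = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
             ('lamb', 'JJ'), ("it's", 'NN'), ('fleece', 'NN'), ('was', 'VBD'),
             ('white', 'JJ'), ('as', 'IN'), ('snow.', 'NN')]

def tags_that_differ(tagged_a, tagged_b):
    """Return {word: (tag_a, tag_b)} for words in both lists with different tags."""
    tags_b = dict(tagged_b)  # assumes each word appears once
    return {word: (tag, tags_b[word])
            for word, tag in tagged_a
            if word in tags_b and tags_b[word] != tag}

differences = tags_that_differ(pos_tree, pos_space)
print(differences)
```

Here only `lamb` differs (NN in one run, JJ in the other), even though the token itself is identical in both tokenisations.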


These are the tags that the POS tagger can assign to words:

#### PoS Tag Descriptions ###
CC | Coordinating conjunction  
CD | Cardinal number  
DT | Determiner  
EX | Existential there  
FW | Foreign word  
IN | Preposition or subordinating conjunction  
JJ | Adjective  
JJR | Adjective, comparative  
JJS | Adjective, superlative  
LS | List item marker  
MD | Modal  
NN | Noun, singular or mass  
NNS | Noun, plural  
NNP | Proper noun, singular  
NNPS | Proper noun, plural  
PDT | Predeterminer  
POS | Possessive ending  
PRP | Personal pronoun  
PRP\$ | Possessive pronoun  
RB | Adverb  
RBR | Adverb, comparative  
RBS | Adverb, superlative  
RP | Particle  
SYM | Symbol  
TO | to  
UH | Interjection  
VB | Verb, base form  
VBD | Verb, past tense  
VBG | Verb, gerund or present participle  
VBN | Verb, past participle  
VBP | Verb, non-3rd person singular present  
VBZ | Verb, 3rd person singular present  
WDT | Wh-determiner  
WP | Wh-pronoun  
WP\$ | Possessive wh-pronoun  
WRB | Wh-adverb   


This categorization can be used to extract more specific data from a text, such as only the nouns or only the pronouns.

In [14]:
import re
regex = re.compile("^N.*")   # matches any tag starting with N (NN, NNS, NNP, NNPS)
nouns = []
for word, tag in pos:        # pos holds the tagged tokens from the default tokeniser
    if regex.match(tag):
        nouns.append(word)
print("Nouns:", nouns)

Nouns: ['Mary', 'lamb', 'fleece', 'snow']
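The same prefix idea extends to other categories. As a sketch, the hypothetical helper `words_with_tag_prefix` below filters a tagged list by the leading letters of the tag. The tagged output from earlier is copied in as a literal so that the block runs on its own.

```python
# Tagged tokens copied from the default-tokeniser output shown earlier.
pos = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
       ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'),
       ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'),
       ('.', '.')]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(pos, "N")       # NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(pos, "V")       # VB, VBD, VBG, VBN, VBP, VBZ
adjectives = words_with_tag_prefix(pos, "J")  # JJ, JJR, JJS

print("Nouns:     ", nouns)
print("Verbs:     ", verbs)
print("Adjectives:", adjectives)
```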


***Stemming and Lemmatizing***

Stripping the suffixes from words is known as stemming.  
Mapping a word to its dictionary base form (lemma) is known as lemmatization.
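The difference can be illustrated with a deliberately crude sketch. Neither function below is an NLTK algorithm: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hand-written dictionary. Both names and both word lists are made up for illustration; real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Illustrative suffix list, checked in this order (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Illustrative lemma dictionary (hand-written, not WordNet).
LEMMAS = {"ponies": "pony", "running": "run", "went": "go"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["ponies", "running", "went", "walked"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stems `pon` and `runn` are not dictionary words, whereas the lemmas `pony`, `run` and `go` are. The lookup only works for words the dictionary happens to contain.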

There are multiple stemming methods available, and the NLTK book references a few in particular:

The Porter Stemmer - see https://tartarus.org/martin/PorterStemmer/  
The Lancaster Stemmer - developed by Chris Paice, University of Lancaster  
The Snowball Stemmer - "Porter 2", developed by Martin Porter and often regarded as an improvement on the original Porter Stemmer  

A list of other stemming methods can be found here: http://www.nltk.org/api/nltk.stem.html. Stemming and lemmatizing are currently inexact processes.




These techniques do not have much significant practical application.