# NLTK - basics 

In [1]:
import nltk

In [9]:
#nltk.download()

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

#### What is Tokenization?
A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph. Tokenization is the process of splitting a string into a list of tokens.

In [3]:
mystring = "My favorite color is blue"

mystring.split()

['My', 'favorite', 'color', 'is', 'blue']

In [4]:
mystring = "My favorite colors are blue, red, and green."

In [6]:
mystring.split()

['My', 'favorite', 'colors', 'are', 'blue,', 'red,', 'and', 'green.']

Notice that the punctuation marks are grouped with their adjacent words (e.g. "blue,"). This is problematic for NLP applications, because the goal of tokenization is generally to divide a set (corpus) of documents into a common set of building blocks that can then be used as a basis for comparison. It is no good if “blue” in "My favorite color is blue" fails to match “blue” in "My favorite colors are blue, red, and green." because the latter is tokenized as "blue," (with a trailing comma) rather than "blue".
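
One way around this mismatch is to emit every punctuation mark as its own token. The sketch below uses only the standard library; the `simple_tokenize` helper and its pattern are illustrative assumptions, not NLTK's implementation.

```python
import re

def simple_tokenize(text):
    # A run of word characters becomes one token; every other
    # non-space character (punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

plain = simple_tokenize("My favorite color is blue")
listed = simple_tokenize("My favorite colors are blue, red, and green.")

print(listed)
# "blue" is now the same token in both sentences.
print("blue" in plain and "blue" in listed)
```

With the comma split off, the two sentences share the token "blue" and can be compared directly.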

In [7]:
compare_list = ['https://t.co/9z2J3P33Uc',
               'laugh/cry',
               '😬😭😓🤢🙄😱',
               "world's problems",
               "@datageneral",
                "It's interesting",
               "don't spell my name right",
               'all-nighter',
                "My favorite color is blue",
                "My favorite colors are blue, red, and green."]

In [8]:
from nltk.tokenize import word_tokenize

#### 1. NLTK word_tokenize - separates words using spaces and punctuation

In [9]:
word_tokens = []

for sent in compare_list:

    word_tokens.append(word_tokenize(sent))

word_tokens

[['https', ':', '//t.co/9z2J3P33Uc'],
 ['laugh/cry'],
 ['😬😭😓🤢🙄😱'],
 ['world', "'s", 'problems'],
 ['@', 'datageneral'],
 ['It', "'s", 'interesting'],
 ['do', "n't", 'spell', 'my', 'name', 'right'],
 ['all-nighter'],
 ['My', 'favorite', 'color', 'is', 'blue'],
 ['My',
  'favorite',
  'colors',
  'are',
  'blue',
  ',',
  'red',
  ',',
  'and',
  'green',
  '.']]

When dealing with well-formed, formal text, this standard word tokenizer makes a lot of sense and is likely to be sufficient. However, the same cannot be said for cases when our text data comes from more casual, slang-ridden sources like Twitter.

In [10]:
word_tokenize("@john lol that was #awesome :)")

['@', 'john', 'lol', 'that', 'was', '#', 'awesome', ':', ')']

Most likely, we would prefer:

- @ and john to be tokenized together as @john,
- \# and awesome to be tokenized together as #awesome.

This is because we would expect word usage in hashtags or at-mentions to differ from usage in plain text.

We would also prefer : and ) to be tokenized together as :), since :) is certainly more informative (e.g. for sentiment analysis) than the sum of its parts.
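
The preferences above can be sketched with a hand-rolled regular expression. The pattern and the `social_tokenize` name are assumptions made for illustration; this is not how any NLTK tokenizer is implemented, and the emoticon list is deliberately tiny.

```python
import re

# Illustrative pattern (an assumption, not the one NLTK uses).
# Alternatives are tried left to right, so mentions, hashtags and
# emoticons are matched before plain words and single punctuation.
SOCIAL_TOKEN = re.compile(
    r"[@#]\w+"             # @mentions and #hashtags
    r"|[:;=][-']?[)(DPp]"  # a few common emoticons such as :) ;-( =D
    r"|\w+(?:'\w+)?"       # words, optionally with an apostrophe part
    r"|[^\w\s]"            # any other single punctuation mark
)

def social_tokenize(text):
    return SOCIAL_TOKEN.findall(text)

print(social_tokenize("@john lol that was #awesome :)"))
```

Because the more specific alternatives come first, "@john", "#awesome" and ":)" each survive as a single token.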

#### 2. WordPunctTokenizer
WordPunctTokenizer splits all punctuation into separate tokens.

In [15]:
compare_list = ['https://t.co/9z2J3P33Uc',
               'laugh/cry',
               '😬😭😓🤢🙄😱',
               "world's problems",
               "@datageneral",
                "It's interesting",
               "don't spell my name right",
               'all-nighter',
                "My favorite color is blue",
                "My favorite colors are blue, red, and green."]

In [16]:
from nltk.tokenize import WordPunctTokenizer

In [17]:
punct_tokenizer = WordPunctTokenizer()

punct_tokens = []

for sent in compare_list:
    
    punct_tokens.append(punct_tokenizer.tokenize(sent))
punct_tokens

[['https', '://', 't', '.', 'co', '/', '9z2J3P33Uc'],
 ['laugh', '/', 'cry'],
 ['😬😭😓🤢🙄😱'],
 ['world', "'", 's', 'problems'],
 ['@', 'datageneral'],
 ['It', "'", 's', 'interesting'],
 ['don', "'", 't', 'spell', 'my', 'name', 'right'],
 ['all', '-', 'nighter'],
 ['My', 'favorite', 'color', 'is', 'blue'],
 ['My',
  'favorite',
  'colors',
  'are',
  'blue',
  ',',
  'red',
  ',',
  'and',
  'green',
  '.']]

This tokenizer successfully splits laugh/cry into two words. But the drawbacks are:
- The link 'https://t.co/9z2J3P33Uc' is split into 7 tokens
- world's is split into 3 tokens (world, ' and s)
- @datageneral is split into @ and datageneral
- don't is split into 3 tokens (don, ' and t)

Since each of these should be treated as a single token, this tokenizer is not what we want either. Is there a tokenizer that keeps such strings intact?
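
For reference, the NLTK documentation describes `WordPunctTokenizer` as tokenizing with the regular expression `\w+|[^\w\s]+`, that is, alternating runs of word characters and runs of punctuation. That explains the splits above. A minimal sketch with the standard `re` module, where `word_punct` is a hypothetical helper name:

```python
import re

def word_punct(text):
    # Either a run of word characters or a run of non-space punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct("https://t.co/9z2J3P33Uc"))
print(word_punct("don't spell my name right"))
```

Any apostrophe, slash or dot ends a run of word characters, which is why URLs and contractions fall apart.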

#### 3. TweetTokenizer

In [11]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()

In [12]:
tweet_tokens = []

for sent in compare_list:
    
    print(tweet_tokenizer.tokenize(sent))
    
    tweet_tokens.append(tweet_tokenizer.tokenize(sent))

['https://t.co/9z2J3P33Uc']
['laugh', '/', 'cry']
['😬', '😭', '😓', '🤢', '🙄', '😱']
["world's", 'problems']
['@datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all-nighter']
['My', 'favorite', 'color', 'is', 'blue']
['My', 'favorite', 'colors', 'are', 'blue', ',', 'red', ',', 'and', 'green', '.']


In [21]:
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? \
               The weather is great, and Python is awesome !\
               The sky is pinkish-blue. \
               You shouldn\'t eat cardboard."

In [22]:
sentence = """At eight o'clock on Thursday morning Arthur felt very good. But he didn't go to play"""

### sentence tokenizing

In [23]:
sent_tokenize(EXAMPLE_TEXT)

['Hello Mr. Smith, how are you doing today?',
 'The weather is great, and Python is awesome !',
 'The sky is pinkish-blue.',
 "You shouldn't eat cardboard."]

In [18]:
for sent in sent_tokenize(EXAMPLE_TEXT):
    print(sent)

Hello Mr. Smith, how are you doing today?
The weather is great, and Python is awesome !
The sky is pinkish-blue.
You shouldn't eat cardboard.


So there, we have created tokens, which are sentences. 
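
Note that `sent_tokenize` did not break the text after "Mr.". It relies on a pretrained Punkt model that has learned common abbreviations. A naive rule such as "split after ., ! or ?" does not have that knowledge, as this small standard-library sketch shows (the sample text is shortened for illustration):

```python
import re

text = "Hello Mr. Smith, how are you doing today? The weather is great."

# Naive rule: split on whitespace that follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(s)
```

The naive splitter cuts the first sentence in two at "Mr.", which is exactly the mistake a trained sentence tokenizer is meant to avoid.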

### word tokenizing

In [19]:
nltk.word_tokenize(sentence)

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 'felt',
 'very',
 'good',
 '.',
 'But',
 'he',
 'did',
 "n't",
 'go',
 'to',
 'play']

In [20]:
word_tokenize(EXAMPLE_TEXT)

['Hello',
 'Mr.',
 'Smith',
 ',',
 'how',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'The',
 'weather',
 'is',
 'great',
 ',',
 'and',
 'Python',
 'is',
 'awesome',
 '!',
 'The',
 'sky',
 'is',
 'pinkish-blue',
 '.',
 'You',
 'should',
 "n't",
 'eat',
 'cardboard',
 '.']

#### Observations
- First, notice that punctuation is treated as a separate token.
- Also, notice that "shouldn't" is split into "should" and "n't".
- Finally, notice that "pinkish-blue" is kept as the single word it was meant to be.
- Some of the words seem trivial (e.g. "is", "are", "the"); these are a form of "stop words".

In [21]:
text = "this is Ram's text, is'nt it?"

In [22]:
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)

['this', 'is', "Ram's", 'text,', "is'nt", 'it?']

In [23]:
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

['this', 'is', 'Ram', "'s", 'text', ',', "is'nt", 'it', '?']

In [24]:
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)

['this', 'is', 'Ram', "'", 's', 'text', ',', 'is', "'", 'nt', 'it', '?']

### stop words

In [27]:
from nltk.corpus import stopwords

In [28]:
print(set(stopwords.words('english')))

{'off', 'again', 'ma', 'that', 'into', 'its', 'having', 'myself', 'with', 'our', 'now', 'such', 'an', 'll', 'wasn', 'where', 'before', 'you', 'she', 'of', 'your', 'mightn', 'too', "haven't", 'don', 'him', 'her', 'y', 'o', 'up', "hasn't", 'in', 'am', 'haven', 'below', "you'd", "you're", 'above', 'from', "it's", 'my', 'which', 'being', 'own', 'm', 'we', 'do', 'by', 'more', 'it', 'shouldn', 'yours', "shouldn't", 'so', 'wouldn', 'are', "she's", 'whom', 'doing', 'over', "won't", 'very', "mightn't", 'or', 'couldn', 'is', 'few', 'what', 'because', "couldn't", 'ourselves', 'will', 'themselves', 'against', 'while', 'them', 'no', "should've", 'on', 'hers', "weren't", 'all', 'herself', 'they', 'was', 'won', 'has', 'i', 'should', "wasn't", 'does', 'needn', 'some', 'these', 'there', 'once', 'just', 'as', 'were', 'isn', 'until', 're', 'at', 'and', "needn't", 'not', 's', 'yourselves', 'other', 'for', 'have', "you've", 'his', 'had', 'nor', 'any', "don't", 'most', 'aren', "aren't", 'each', 'yourself', 

In [29]:
example_sent = "This is a sample sentence, showing off the stop words filtration."

In [30]:
stop_words = set(stopwords.words('english'))

In [31]:
word_tokens = word_tokenize(example_sent)

In [32]:
# option 1: list comprehension
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# option 2: equivalent explicit loop
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
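
Notice that 'This' survives the filter: NLTK's English stop word list is all lowercase and the membership test is case-sensitive. A common remedy is to compare the lowercased token. The sketch below uses a tiny hand-picked stop list (an assumption, standing in for `set(stopwords.words('english'))`) so that it runs without any downloads.

```python
# A tiny hand-picked stop list for illustration only; in the notebook
# this would be set(stopwords.words('english')).
stop_words = {"this", "is", "a", "off", "the"}

word_tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
               'off', 'the', 'stop', 'words', 'filtration', '.']

# Compare the lowercased token so that 'This' matches 'this'.
filtered = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered)
```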


### Stemming words

Stemming is a form of normalization. Many variations of a word carry the same meaning, apart from differences such as tense.

We stem to shorten the lookup and to normalize sentences.

Consider:

I was taking a ride in the car.
I was riding in the car.

One of the most popular stemming algorithms is the __Porter stemmer__, which has been around since 1979.

In [33]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer

In [34]:
porter   = PorterStemmer()
lancaster= LancasterStemmer()
sno      = nltk.stem.SnowballStemmer('english')

In [36]:
word_list = ["cave", "caver", "caved"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
cave                 cave                 cav                  cave                
caver                caver                cav                  caver               
caved                cave                 cav                  cave                


In [37]:
word_list = ["run", "ran", "runner", "running"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
run                  run                  run                  run                 
ran                  ran                  ran                  ran                 
runner               runner               run                  runner              
running              run                  run                  run                 


In [38]:
word_list = ["cats", "trouble", "troubling", "troubled", "troublesome"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word))) 

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
cats                 cat                  cat                  cat                 
trouble              troubl               troubl               troubl              
troubling            troubl               troubl               troubl              
troubled             troubl               troubl               troubl              
troublesome          troublesom           troublesom           troublesom          


Notice how the PorterStemmer
- gives the root (stem) of the word "cats" by simply removing the 's' after cat. This is a suffix added to cat to make it plural.
- stems 'trouble', 'troubling' and 'troubled' to 'troubl', which is not an English word, because the **Porter algorithm does not follow linguistics; instead it applies sets of suffix-stripping rules in 5 sequential phases (step by step) to generate stems**.
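
To give a feel for what such a rule phase looks like, here is a sketch of step 1a from Porter's original paper, which handles plural endings with four ordered rules (SSES to SS, IES to I, SS to SS, S to nothing). This is only the first group of rules; the function name is hypothetical, and NLTK's `PorterStemmer` applies further steps and some extensions on top.

```python
def porter_step_1a(word):
    # Rules are tried in order; the first matching suffix wins.
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The rules only look at character patterns, which is why a stem such as "poni" need not be a real word.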


In [33]:
word_list = ["argue", "argued", "argues", "arguing", "argus"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
argue                argu                 argu                 argu                
argued               argu                 argu                 argu                
argues               argu                 argu                 argu                
arguing              argu                 argu                 argu                
argus                argu                 arg                  argus               


In [37]:
#A list of words to be stemmed
word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))
for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
friend               friend               friend               friend              
friendship           friendship           friend               friendship          
friends              friend               friend               friend              
friendships          friendship           friend               friendship          
stabil               stabil               stabl                stabil              
destabilize          destabil             dest                 destabil            
misunderstanding     misunderstand        misunderstand        misunderstand       
railroad             railroad             railroad             railroad            
moonlight            moonlight            moonlight            moonlight           
football             footbal              footbal              footbal             


In [38]:
sentence="Pythoners are very intelligent and work very pythonly and now they are pythoning their way to success."
porter.stem(sentence)

'pythoners are very intelligent and work very pythonly and now they are pythoning their way to success.'

The stemmer sees the entire sentence as a single word, so it returns it unchanged apart from lowercasing.

In [39]:
text = "My system keeps crashing his crashed yesterday, ours crashes daily"

print(' '.join([porter.stem(word) for word in text.split()]))
print(' '.join([lancaster.stem(word) for word in text.split()]))

My system keep crash hi crash yesterday, our crash daili
my system keep crash his crash yesterday, our crash dai


### lemmatization

Lemmatization is the process of converting a word to its base form. 

The difference between stemming and lemmatization is:

> lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

For example, lemmatization would correctly reduce ‘caring’ to its base form ‘care’, whereas a crude stemmer that simply cuts off the ‘ing’ suffix would reduce it to ‘car’.

    ‘Caring’ -> Lemmatization -> ‘Care’
    ‘Caring’ -> Stemming -> ‘Car’
    
Ways to lemmatize:

- WordNet Lemmatizer
- spaCy Lemmatizer
- TextBlob
- CLiPS Pattern
- Stanford CoreNLP
- Gensim Lemmatizer
- TreeTagger

In [42]:
from nltk.stem import WordNetLemmatizer

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

In [43]:
word_list = ["friend", "friendship", "friends", "friendships","stabilize","destabilize","misunderstanding","railroad","moonlight","football"]

print("{0:20} {1:20}".format("Word","WordNetLemmatizer"))

for word in word_list:
    print("{0:20} {1:20} ".format(word, lemmatizer.lemmatize(word)))

Word                 WordNetLemmatizer   
friend               friend               
friendship           friendship           
friends              friend               
friendships          friendship           
stabilize            stabilize            
destabilize          destabilize          
misunderstanding     misunderstanding     
railroad             railroad             
moonlight            moonlight            
football             football             


In [42]:
# Lemmatize Single Word
print(lemmatizer.lemmatize("bats"))

print(lemmatizer.lemmatize("are"))

print(lemmatizer.lemmatize("feet"))

bat
are
foot


In [43]:
# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)

['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']


In [44]:
for w in word_list:
    print(w, '-->', lemmatizer.lemmatize(w) )

The --> The
striped --> striped
bats --> bat
are --> are
hanging --> hanging
on --> on
their --> their
feet --> foot
for --> for
best --> best


Notice that it did not do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’ as expected. By default, lemmatize() treats every word as a noun.

This can be corrected if we provide the correct ‘part-of-speech’ tag (POS tag) as the second argument to lemmatize().

In [45]:
print(lemmatizer.lemmatize("stripes", 'v')) 
print(lemmatizer.lemmatize("stripes", 'n'))  

strip
stripe
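
In practice the POS tag usually comes from a tagger such as `nltk.pos_tag`, which returns Penn Treebank tags (VBP, NNS, JJ, ...), while `lemmatize()` expects WordNet's single-letter codes ('n', 'v', 'a', 'r'). A common approach is a small mapping function. The sketch below is one such mapping; the `penn_to_wordnet` name and the choice to fall back to noun are assumptions.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS code by its prefix.
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, lemmatize()'s own default

print([penn_to_wordnet(t) for t in ["VBP", "VBG", "NNS", "JJ", "RB"]])
```

Each (word, tag) pair from the tagger could then be passed as `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`, so that verbs such as 'are' and 'hanging' are looked up as verbs rather than nouns.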


## Generate the N-grams for the given sentence

N-grams are an essential concept in text mining: an n-gram is a contiguous sequence of n items taken from a larger text or sentence. The items can be words, letters, or syllables. A 1-gram, also called a unigram, is a single word from the sentence. A bigram (2-gram) is a sequence of 2 adjacent words, a trigram (3-gram) is a sequence of 3 adjacent words, and so on.
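
The idea needs no library at all. As a minimal sketch (the `make_ngrams` helper is a hypothetical name, and whitespace splitting stands in for a real tokenizer), n-grams can be built by zipping n staggered views of the token list:

```python
def make_ngrams(tokens, n):
    # Zip n staggered views of the token list; zip stops at the
    # shortest view, so only complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "data science is fun".split()

print(make_ngrams(tokens, 1))
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

A sequence of k tokens yields k - n + 1 n-grams, so four tokens give three bigrams and two trigrams.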

In [44]:
import nltk
from nltk.util import ngrams

In [56]:
text = 'Data science is a wonderful program, \
Data science is a land of opportunities,data science is about machine learning '

In [58]:
grams = 3
n_grams = ngrams(nltk.word_tokenize(text), grams)

In [59]:
n_grams

<zip at 0x15d5da36000>

In [60]:
list(nltk.trigrams(word_tokenize(text)))

[('Data', 'science', 'is'),
 ('science', 'is', 'a'),
 ('is', 'a', 'wonderful'),
 ('a', 'wonderful', 'program'),
 ('wonderful', 'program', ','),
 ('program', ',', 'Data'),
 (',', 'Data', 'science'),
 ('Data', 'science', 'is'),
 ('science', 'is', 'a'),
 ('is', 'a', 'land'),
 ('a', 'land', 'of'),
 ('land', 'of', 'opportunities'),
 ('of', 'opportunities', ','),
 ('opportunities', ',', 'data'),
 (',', 'data', 'science'),
 ('data', 'science', 'is'),
 ('science', 'is', 'about'),
 ('is', 'about', 'machine'),
 ('about', 'machine', 'learning')]

In [52]:
# n_grams is a lazy iterator, so this comprehension can only be run once per ngrams() call
[' '.join(gram) for gram in n_grams]

['Data science is',
 'science is a',
 'is a wonderful',
 'a wonderful program',
 'wonderful program ,',
 'program , Data',
 ', Data science',
 'Data science is',
 'science is a',
 'is a land',
 'a land of',
 'land of opportunities',
 'of opportunities ,',
 'opportunities , data',
 ', data science',
 'data science is',
 'science is about',
 'is about machine',
 'about machine learning']
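
One caveat: the `<zip at ...>` display earlier shows that `ngrams` handed back a lazy iterator rather than a list, so the comprehension above uses it up, and running the same cell a second time would give an empty list. The behaviour can be reproduced with a plain `zip`, used here as a stand-in:

```python
tokens = ["data", "science", "is", "fun"]

lazy_bigrams = zip(tokens, tokens[1:])  # lazy iterator, like the ngrams result

first_pass = list(lazy_bigrams)   # consumes the iterator
second_pass = list(lazy_bigrams)  # nothing left to yield

print(first_pass)
print(second_pass)
```

If the n-grams are needed more than once, it is simplest to wrap the call in `list(...)` straight away.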

In [52]:
text = 'Data science is a wonderful program, \
Data science is a land of opportunities,data science is about machine learning '

In [53]:
bigrams = list(nltk.bigrams(word_tokenize(text)))

In [54]:
from collections import Counter

In [55]:
Counter(bigrams)

Counter({('Data', 'science'): 2,
         ('science', 'is'): 3,
         ('is', 'a'): 2,
         ('a', 'wonderful'): 1,
         ('wonderful', 'program'): 1,
         ('program', ','): 1,
         (',', 'Data'): 1,
         ('a', 'land'): 1,
         ('land', 'of'): 1,
         ('of', 'opportunities'): 1,
         ('opportunities', ','): 1,
         (',', 'data'): 1,
         ('data', 'science'): 1,
         ('is', 'about'): 1,
         ('about', 'machine'): 1,
         ('machine', 'learning'): 1})
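
Notice that ('Data', 'science') and ('data', 'science') are counted separately because the tokens differ in case. Lowercasing before counting merges them. The sketch below assumes a simple regular expression as a stand-in for `word_tokenize`, so it runs with the standard library alone.

```python
import re
from collections import Counter

text = ('Data science is a wonderful program, '
        'Data science is a land of opportunities,data science is about machine learning ')

# Lowercase first, then split into words and single punctuation marks
# (a simple regex stand-in for word_tokenize, used here as an assumption).
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

bigram_counts = Counter(zip(tokens, tokens[1:]))

print(bigram_counts[("data", "science")])
print(bigram_counts.most_common(3))
```

`most_common(k)` returns the k most frequent bigrams, which is often the first thing to inspect in a corpus.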

Or, wrapped in a reusable function that returns each n-gram as a string:

In [56]:
# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [ ' '.join(grams) for grams in n_grams]
 
data = 'A class is a blueprint for the object.'
 
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object', '.']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object', 'object .']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object', 'the object .']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object', 'for the object .']


## parts of speech tags

In [63]:
## Dummy text 
txt = '''Sukanya, Bhupen and Bhavik are my good friends. Sukanya is getting married next year. Marriage is a big step in one’s life.\  
       It is both exciting and frightening. But friendship is a sacred bond between people. \  
       It is a special kind of love between us. \
       Many of you must have tried searching for a friend \
       but never found the right one.'''

In [64]:
tokenized = sent_tokenize(txt) 

In [66]:
for sent in tokenized:

    # word_tokenize splits the sentence into words and punctuation
    words_list = nltk.word_tokenize(sent)

    # optionally remove stop words from words_list
    # words_list = [w for w in words_list if w not in stop_words]

    # pos_tag assigns a part-of-speech (POS) tag to every token
    tagged = nltk.pos_tag(words_list)

    print(tagged)

[('Sukanya', 'NNP'), (',', ','), ('Bhupen', 'NNP'), ('and', 'CC'), ('Bhavik', 'NNP'), ('are', 'VBP'), ('my', 'PRP$'), ('good', 'JJ'), ('friends', 'NNS'), ('.', '.')]
[('Sukanya', 'NNP'), ('is', 'VBZ'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN'), ('.', '.')]
[('Marriage', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('big', 'JJ'), ('step', 'NN'), ('in', 'IN'), ('one', 'CD'), ('’', 'NN'), ('s', 'NN'), ('life.\\', 'NN'), ('It', 'PRP'), ('is', 'VBZ'), ('both', 'DT'), ('exciting', 'VBG'), ('and', 'CC'), ('frightening', 'NN'), ('.', '.')]
[('But', 'CC'), ('friendship', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('sacred', 'JJ'), ('bond', 'NN'), ('between', 'IN'), ('people', 'NNS'), ('.', '.')]
[('\\', 'VB'), ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('special', 'JJ'), ('kind', 'NN'), ('of', 'IN'), ('love', 'NN'), ('between', 'IN'), ('us', 'PRP'), ('.', '.')]
[('Many', 'JJ'), ('of', 'IN'), ('you', 'PRP'), ('must', 'MD'), ('have', 'VB'), ('tried', 'VBN'), ('searching', 'VBG'), ('for

Note the stray tokens 'life.\\' and '\\' in the output above. In txt, two of the backslashes are followed by trailing spaces, so Python keeps them as literal characters instead of treating them as line continuations; the first one also prevents the sentence tokenizer from splitting after "life.". The curly apostrophe in one’s is likewise split off as a separate token instead of being recognized as a possessive marker.

One more example:

In [61]:
from nltk import tag

sent = 'Today I will be learning about POS tags'

tagged_sent = tag.pos_tag(word_tokenize(sent))

tagged_sent


[('Today', 'NN'),
 ('I', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('learning', 'VBG'),
 ('about', 'IN'),
 ('POS', 'NNP'),
 ('tags', 'NNS')]
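
The tags come from the Penn Treebank tag set. As a reading aid, the sketch below pairs each tag from the output above with its usual description; the `PENN_TAGS` dictionary is a hand-written subset assembled for this example, not something NLTK exports under that name.

```python
# Descriptions for the Penn Treebank tags that appear in the output above.
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "MD": "modal",
    "VB": "verb, base form",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
}

# Tagged tokens copied from the output above.
tagged_sent = [('Today', 'NN'), ('I', 'PRP'), ('will', 'MD'), ('be', 'VB'),
               ('learning', 'VBG'), ('about', 'IN'), ('POS', 'NNP'), ('tags', 'NNS')]

for word, tag in tagged_sent:
    print(f"{word:10} {tag:5} {PENN_TAGS.get(tag, 'unknown')}")
```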

--------------------------------
#### Ready-made NE (Named Entity) chunker
---------------------

|NE Type |Examples |
|--------|---------|
|ORGANIZATION | Georgia-Pacific Corp., WHO |
|PERSON | Eddy Bonte, President Obama|
|LOCATION | Murray River, Mount Everest|
|DATE | June, 2008-06-29 |
|TIME | two fifty a m, 1:30 p.m.|
|MONEY | 175 million Canadian Dollars, GBP 10.40|
|PERCENT | twenty pct, 18.75 %|
|FACILITY | Washington Monument, Stonehenge |
|GPE | South East Asia, Midlothian |

In [69]:
from nltk import chunk, tag

sent = 'Infosys is IT service provider, based out of banaglore. Was established on 1st Jan 1990\
it is making a profit of 56%. Annual turnover is 500 million US$ made a profit of 77%'

tagged_sent = tag.pos_tag(word_tokenize(sent))

tree = chunk.ne_chunk(tagged_sent)

print(tree)

(S
  (GPE Infosys/NNP)
  is/VBZ
  IT/NNP
  service/NN
  provider/NN
  ,/,
  based/VBN
  out/IN
  of/IN
  banaglore/NN
  ./.
  Was/NNP
  established/VBN
  on/IN
  1st/CD
  Jan/NNP
  1990it/CD
  is/VBZ
  making/VBG
  a/DT
  profit/NN
  of/IN
  56/CD
  %/NN
  ./.
  (PERSON Annual/JJ)
  turnover/NN
  is/VBZ
  500/CD
  million/CD
  US/NNP
  $/$
  made/VBD
  a/DT
  profit/NN
  of/IN
  77/CD
  %/NN)

Two things stand out in this output. The token 1990it appears because the backslash at the end of the first line of sent joins "1990" and "it" with no space between them. The ready-made chunker is also far from perfect here: Infosys is labelled GPE rather than ORGANIZATION, Annual is labelled PERSON, and the misspelled, lowercase banaglore is not recognized as a location at all.


In [70]:
ne_labels = ["ORGANIZATION", "PERSON", "LOCATION", "DATE", "TIME",
             "MONEY", "PERCENT", "FACILITY", "GPE"]
ne_subtrees1 = tree.subtrees(filter=lambda t: t.label() in ne_labels)

In [71]:
ne_subtrees1

<generator object Tree.subtrees at 0x0000015D5D9E05E0>

In [68]:
ne_subtrees2 = tree.subtrees(filter=lambda t: t.label() in ["NNP"])

In [69]:
ne_subtrees_list1 = [tree for tree in ne_subtrees1]
print(ne_subtrees_list1)

[Tree('GPE', [('Infosys', 'NNP')]), Tree('PERSON', [('Annual', 'JJ')])]


In [70]:
ne_subtrees_list2 = [tree for tree in ne_subtrees2]
print(ne_subtrees_list2)

[]
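
The second list is empty because NNP is a part-of-speech tag attached to the (word, tag) leaves, whereas `subtrees()` only visits tree nodes, whose labels here are S, GPE and PERSON. To collect the proper nouns, filter the tagged tokens instead. The sketch below hard-codes the first few pairs from the chunker output so that it runs on its own:

```python
# (word, tag) pairs copied from the start of the chunker output above.
tagged_sent = [('Infosys', 'NNP'), ('is', 'VBZ'), ('IT', 'NNP'),
               ('service', 'NN'), ('provider', 'NN'), (',', ','),
               ('based', 'VBN'), ('out', 'IN'), ('of', 'IN'),
               ('banaglore', 'NN'), ('.', '.')]

# Keep the words whose part-of-speech tag is NNP (proper noun, singular).
proper_nouns = [word for word, tag in tagged_sent if tag == 'NNP']
print(proper_nouns)
```

In the notebook itself, the same comprehension can be applied to the full `tagged_sent` list or to `tree.leaves()`.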
