In [1]:
corpus = """The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities."""

In [2]:
print(corpus)

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.


In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
## Tokenization
## paragraph --> sentences
from nltk.tokenize import sent_tokenize           # split the corpus into sentences

In [5]:
documents = sent_tokenize(corpus)

In [6]:
type(documents)

list

In [7]:
for sentence in documents:
  print(sentence)

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.
It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.
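To see why a trained sentence tokenizer is used rather than a plain split on punctuation, here is a minimal standard-library sketch. The helper name `naive_sent_split` and the sample text are made up for illustration; this is not how `sent_tokenize` works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Smith arrived. He sat down."
# The abbreviation "Dr." is wrongly treated as the end of a sentence.
print(naive_sent_split(sample))
```

The naive rule yields three pieces instead of two. The pretrained Punkt model behind `sent_tokenize` is designed to recognise common abbreviations and avoid this kind of split.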


In [8]:
## Tokenization
## paragraph --> words
## sentence --> words
from nltk.tokenize import word_tokenize          # split the corpus into words

In [9]:
word_tokenize(corpus)

['The',
 'Natural',
 'Language',
 'Toolkit',
 ',',
 'or',
 'more',
 'commonly',
 'NLTK',
 ',',
 'is',
 'a',
 'suite',
 'of',
 'libraries',
 'and',
 'programs',
 'for',
 'symbolic',
 'and',
 'statistical',
 'natural',
 'language',
 'processing',
 'for',
 'English',
 'written',
 'in',
 'the',
 'Python',
 'programming',
 'language',
 '.',
 'It',
 'supports',
 'classification',
 ',',
 'tokenization',
 ',',
 'stemming',
 ',',
 'tagging',
 ',',
 'parsing',
 ',',
 'and',
 'semantic',
 'reasoning',
 'functionalities',
 '.']

In [10]:
for sentence in documents:
  print(word_tokenize(sentence))

['The', 'Natural', 'Language', 'Toolkit', ',', 'or', 'more', 'commonly', 'NLTK', ',', 'is', 'a', 'suite', 'of', 'libraries', 'and', 'programs', 'for', 'symbolic', 'and', 'statistical', 'natural', 'language', 'processing', 'for', 'English', 'written', 'in', 'the', 'Python', 'programming', 'language', '.']
['It', 'supports', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'and', 'semantic', 'reasoning', 'functionalities', '.']


In [11]:
from nltk.tokenize import wordpunct_tokenize  # regex-based: splits off every run of punctuation, including apostrophes and hyphens

In [12]:
for sentence in documents:
  print(wordpunct_tokenize(sentence))


['The', 'Natural', 'Language', 'Toolkit', ',', 'or', 'more', 'commonly', 'NLTK', ',', 'is', 'a', 'suite', 'of', 'libraries', 'and', 'programs', 'for', 'symbolic', 'and', 'statistical', 'natural', 'language', 'processing', 'for', 'English', 'written', 'in', 'the', 'Python', 'programming', 'language', '.']
['It', 'supports', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'and', 'semantic', 'reasoning', 'functionalities', '.']
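The two outputs above are identical because the corpus contains no contractions or hyphenated words. The sketch below uses a text where the difference shows up. It assumes the pattern `\w+|[^\w\s]+`, which is the regular expression NLTK's `WordPunctTokenizer` is documented to use; the helper name `wordpunct_sketch` is hypothetical.

```python
import re

# Pattern assumed to match NLTK's WordPunctTokenizer: runs of word
# characters, or runs of characters that are neither word nor whitespace.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_sketch(text):
    """Tokenize by splitting alphanumeric runs from punctuation runs."""
    return re.findall(WORDPUNCT_PATTERN, text)

print(wordpunct_sketch("Isn't this self-confidence?"))
```

Here "Isn't" is broken into three tokens and "self-confidence" is split at the hyphen, whereas `word_tokenize` splits the contraction as `Is` + `n't` and keeps the hyphenated word whole.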


<h1>Stemming</h1>

Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The stem is not necessarily a valid word, unlike a lemma, which is the dictionary form of a word. Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).

In [13]:
word = ["playing", "dancing", "gone", "loves", "games", "playing","controlable","buses"]

In [14]:
from nltk.stem import PorterStemmer

In [15]:
stemmer  = PorterStemmer()

In [16]:
for i in word:
    print(i + "   ---->   " + stemmer.stem(i))

playing   ---->   play
dancing   ---->   danc
gone   ---->   gone
loves   ---->   love
games   ---->   game
playing   ---->   play
controlable   ---->   control
buses   ---->   buse


<h1>RegexpStemmer class</h1>
NLTK's RegexpStemmer class implements a stemmer driven by a regular expression. It takes a single regular expression and removes any prefix or suffix that matches it. The min argument sets the word length below which a word is returned unchanged. Let us see an example.

In [17]:
from nltk.stem import RegexpStemmer

In [18]:
regex = RegexpStemmer('ing$|s$|es$|able$|ne$', min=4)

In [19]:
for i in word:
    print(i + "   ---->   " + regex.stem(i))


playing   ---->   play
dancing   ---->   danc
gone   ---->   go
loves   ---->   lov
games   ---->   gam
playing   ---->   play
controlable   ---->   control
buses   ---->   bus
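The result `loves ----> lov` can look surprising, since `s$` also matches. A rough standard-library equivalent makes the behaviour easier to follow. This is a sketch under the assumption that the stemmer simply deletes the regex match from words of at least `min` characters; `regexp_stem_sketch` and `min_length` are illustrative names, not NLTK's API.

```python
import re

def regexp_stem_sketch(word, pattern, min_length=0):
    """Delete whatever the pattern matches, unless the word is too short."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

pattern = "ing$|s$|es$|able$|ne$"
for w in ["playing", "loves", "games", "gone", "bus"]:
    print(w + "   ---->   " + regexp_stem_sketch(w, pattern, min_length=4))
```

A regex search returns the leftmost match, so in "loves" the alternative `es$` (starting one character earlier) wins over `s$`, leaving "lov". The three-letter word "bus" is below the minimum length and is returned unchanged.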


<h1>Snowball Stemmer</h1>
The Snowball stemmer, also known as the Porter2 stemmer, is a revised version of the Porter stemmer that fixes some of the original algorithm's issues.

In [20]:
from nltk.stem import SnowballStemmer

In [21]:
snowball = SnowballStemmer('english')

In [22]:
for i in word:
    print(i + "   ---->   " + snowball.stem(i))


playing   ---->   play
dancing   ---->   danc
gone   ---->   gone
loves   ---->   love
games   ---->   game
playing   ---->   play
controlable   ---->   control
buses   ---->   buse
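For the word list above the Snowball and Porter outputs are identical. The following sketch runs both stemmers on a few adverbs, where they are more likely to differ. The word choice is an assumption for illustration, and no NLTK data download is needed for these two classes.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Adverbs ending in -ly are a common case where the two algorithms disagree.
for w in ["fairly", "sportingly", "playing"]:
    print(w + "   ---->   porter: " + porter.stem(w) + "   snowball: " + snowball.stem(w))
```

Comparing the two columns on your own vocabulary is a quick way to decide which stemmer suits a given corpus.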


<h1>Lemmatizer</h1>

<h1>WordNet Lemmatizer</h1>
Lemmatization is similar to stemming, but its output, the lemma, is a valid dictionary word rather than a truncated stem. The lemma carries the same meaning as the original word.

NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. It uses the morphy() function of the WordNet CorpusReader class to find a lemma. Let us understand it with an example.

In [23]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [24]:
from nltk.stem import WordNetLemmatizer

In [25]:
# POS codes accepted by lemmatize():
#   n - noun (the default)
#   v - verb
#   a - adjective
#   r - adverb
lemmatizer = WordNetLemmatizer()

In [26]:
for i in word:
    print(i + "   ---->   " + lemmatizer.lemmatize(i, pos= 'v'))


playing   ---->   play
dancing   ---->   dance
gone   ---->   go
loves   ---->   love
games   ---->   game
playing   ---->   play
controlable   ---->   controlable
buses   ---->   bus
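The cell above passes `pos='v'` for every word, so every word is looked up as a verb. A common refinement is to choose the POS code per word from a part-of-speech tag. The sketch below shows one possible mapping from Penn Treebank tags (the tag set `nltk.pos_tag` returns) to the four codes listed earlier; `penn_to_wordnet` is a hypothetical helper and the tagged sample is hard-coded so the block runs without any NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (n, v, a or r)."""
    if tag.startswith("J"):
        return "a"  # JJ, JJR, JJS: adjectives
    if tag.startswith("V"):
        return "v"  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
    if tag.startswith("RB"):
        return "r"  # RB, RBR, RBS: adverbs
    return "n"      # everything else falls back to noun, the lemmatizer's default

# Hard-coded (word, tag) pairs standing in for the output of nltk.pos_tag.
tagged = [("loves", "VBZ"), ("games", "NNS"), ("quickly", "RB"), ("happy", "JJ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Each resulting code could then be passed as the `pos` argument of `lemmatizer.lemmatize(word, pos=...)`.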


<h1>Stopwords</h1>

In [27]:
apj = """I have three visions for India. In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. From Alexander onwards, The Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours. Yet we have not done this to any other nation. We have not conquered anyone. We have not grabbed their land, their culture, their history and Tried to enforce our way of life on them. Why? Because we respect the freedom of others.

That is why my first vision is that of FREEDOM. I believe that India got its first vision of this in 1857, when we started the war of Independence. It is this freedom that we must protect and nurture and build on. If we are not free, no one will respect us.

My second vision for India's DEVELOPMENT, For fifty years we have been A developing nation. It is time we see ourselves as a developed nation. We are among top 5 nations of the world in terms of GDP. We have 10 percent growth rate in most areas. Our poverty levels are falling. Our achievements are being globally recognized today. Yet we lack the self-confidence to see ourselves as a developed nation, self-reliant and self-assured. Isn't this incorrect?

I have a THIRD vision. India must stand up to the world. Because I believe that, unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic power. Both must go hand-in-hand. My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of space, Professor Satish Dhawan, who succeeded him and Dr.Brahm Prakash, father of nuclear material. I was lucky to have worked with all three of them closely and consider this the great opportunity of my life."""

In [28]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [29]:
stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [31]:
apj = sent_tokenize(apj)  # apj is now a list of sentences


In [32]:
type(apj)

list

In [34]:
apj

['I have three visions for India.',
 'In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, The Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, their history and Tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.',
 'That is why my first vision is that of FREEDOM.',
 'I believe that India got its first vision of this in 1857, when we started the war of Independence.',
 'It is this freedom that we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 "My second vision for India's DEVELOPMENT, For fifty years we have been A developing nation.",
 'It is time we see ourselves as a deve

In [40]:
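## Remove stop words, then apply stemming

stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

for i in range(len(apj)):
    words = nltk.word_tokenize(apj[i])
    # The stop-word check is case-sensitive, so capitalised words such as 'I' or 'We' are kept
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    apj[i] = ' '.join(words)  # join the stemmed words back into a sentence (overwrites apj[i])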

In [41]:
apj

['i three vision india .',
 'in 3000 year histori , peopl world come invad u , captur land , conquer mind .',
 'from alexand onward , the greek , turk , mogul , portugues , british , french , dutch , come loot u , take .',
 'yet nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom other .',
 'that first vision freedom .',
 'i believ india get first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect u .',
 "my second vision india 's develop , for fifti year a develop nation .",
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recogn today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 "is n't incorrect ?",
 'i third vision .',
 'india must stand world .',
 'becaus i believ , unless india stand world , one respect u .',
 'onl
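In the output above, words such as "i", "we" and "the" survive even though they are stop words, because the membership test is case-sensitive and the NLTK list is all lower-case. A minimal sketch of a case-insensitive filter follows. `SAMPLE_STOP_WORDS` is a small hand-picked stand-in for `stopwords.words('english')` so the block runs without NLTK data, and `remove_stop_words` is a hypothetical helper name.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOP_WORDS = {"i", "we", "have", "not", "the", "of", "for", "to"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lower-cased form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["We", "have", "not", "conquered", "anyone", "."]

# Case-sensitive check, as in the loop above: "We" is kept.
print([t for t in tokens if t not in SAMPLE_STOP_WORDS])

# Case-insensitive check: "We" is removed as well.
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS))
```

Whether to lower-case before filtering depends on the task; for named entity recognition, for instance, capitalisation is usually worth preserving in the kept tokens.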

In [38]:
## Remove stop words, then apply the lemmatizer

stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

for i in range(len(apj)):
    words = nltk.word_tokenize(apj[i])
    # pos='v' makes the lemmatizer treat every token as a verb
    words = [lemmatizer.lemmatize(word, pos='v') for word in words if word not in stop_words]
    apj[i] = ' '.join(words)  # join the lemmas back into a sentence (overwrites apj[i])

In [39]:
apj

['I three vision India .',
 'In 3000 year history , people world come invade u , capture land , conquer mind .',
 'From Alexander onwards , The Greeks , Turks , Moguls , Portuguese , British , French , Dutch , come loot u , take .',
 'Yet do nation .',
 'We conquer anyone .',
 'We grab land , culture , history Tried enforce way life .',
 'Why ?',
 'Because respect freedom others .',
 'That first vision FREEDOM .',
 'I believe India get first vision 1857 , start war Independence .',
 'It freedom must protect nurture build .',
 'If free , one respect u .',
 "My second vision India 's DEVELOPMENT , For fifty year A develop nation .",
 'It time see develop nation .',
 'We among top 5 nation world term GDP .',
 'We 10 percent growth rate area .',
 'Our poverty level fall .',
 'Our achievement globally recognize today .',
 'Yet lack self-confidence see develop nation , self-reliant self-assured .',
 "Is n't incorrect ?",
 'I THIRD vision .',
 'India must stand world .',
 'Because I believe ,

<h1>POS TAGGING</h1>

In [42]:
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [45]:
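stop_words = set(stopwords.words('english'))  # build the set once

# apj holds the stemmed, lower-cased sentences at this point, which is why
# some tags below look wrong (e.g. 'dutch' tagged as VB)
for i in range(len(apj)):
    words = nltk.word_tokenize(apj[i])
    words = [word for word in words if word not in stop_words]
    pos_tag = nltk.pos_tag(words)
    print(pos_tag)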

[('three', 'CD'), ('vision', 'NN'), ('india', 'NN'), ('.', '.')]
[('3000', 'CD'), ('year', 'NN'), ('histori', 'NN'), (',', ','), ('peopl', 'JJ'), ('world', 'NN'), ('come', 'VBP'), ('invad', 'NN'), ('u', 'NN'), (',', ','), ('captur', 'NN'), ('land', 'NN'), (',', ','), ('conquer', 'NN'), ('mind', 'NN'), ('.', '.')]
[('alexand', 'RB'), ('onward', 'RB'), (',', ','), ('greek', 'JJ'), (',', ','), ('turk', 'NN'), (',', ','), ('mogul', 'NN'), (',', ','), ('portugues', 'NNS'), (',', ','), ('british', 'JJ'), (',', ','), ('french', 'JJ'), (',', ','), ('dutch', 'VB'), (',', ','), ('come', 'VB'), ('loot', 'NN'), ('u', 'JJ'), (',', ','), ('take', 'VB'), ('.', '.')]
[('yet', 'RB'), ('nation', 'NN'), ('.', '.')]
[('conquer', 'NN'), ('anyon', 'NN'), ('.', '.')]
[('grab', 'NN'), ('land', 'NN'), (',', ','), ('cultur', 'NN'), (',', ','), ('histori', 'JJ'), ('tri', 'NN'), ('enforc', 'NN'), ('way', 'NN'), ('life', 'NN'), ('.', '.')]
[('whi', 'NN'), ('?', '.')]
[('becaus', 'NN'), ('respect', 'NN'), ('freedom

In [54]:
sent = "i love to play basketball"
words = nltk.word_tokenize(sent)
words

['i', 'love', 'to', 'play', 'basketball']

In [57]:
# Get the POS tags; pos_tag_sents expects a list of tokenized sentences, hence [words]
pos_tags = nltk.pos_tag_sents([words])

# Print the POS tags (one list of (word, tag) pairs per sentence)
print(pos_tags)

[[('i', 'NN'), ('love', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('basketball', 'NN')]]
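The tags in these outputs come from the Penn Treebank tag set. As a reading aid, the sketch below pairs each tag that appears above with a short description. The dictionary covers only those tags, not the full tag set, and `describe_tags` is a hypothetical helper.

```python
# Descriptions for the Penn Treebank tags that appear in the outputs above.
PENN_TAG_DESCRIPTIONS = {
    "CD": "cardinal number",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "RB": "adverb",
    "TO": "the word 'to'",
    "VB": "verb, base form",
    "VBP": "verb, non-3rd person singular present",
}

def describe_tags(tagged_sentence):
    """Attach a human-readable description to each (word, tag) pair."""
    return [
        (word, tag, PENN_TAG_DESCRIPTIONS.get(tag, "not listed in this sketch"))
        for word, tag in tagged_sentence
    ]

# The tagged sentence is copied from the output above.
sample = [("i", "NN"), ("love", "VBP"), ("to", "TO"), ("play", "VB"), ("basketball", "NN")]
for row in describe_tags(sample):
    print(row)
```

Note that the lower-case "i" is tagged as a noun here; the tagger relies partly on capitalisation, so lower-casing text before tagging can change the result.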


<h1>Named Entity Recognition</h1>

In [58]:
nltk.download('maxent_ne_chunker')


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [59]:
nltk.download('words')


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [60]:
sentence = """ The oldest classical British and Latin writings had little or no space between words and could be written in boustrophedon (alternating directions). Over time, text direction (left to right) became standardized. Word dividers and terminal punctuation became common. The first way to divide sentences into groups was the original paragraphos, similar to an underscore at the beginning of the new group.[2] The Greek parágraphos evolved into the pilcrow (¶), which in English manuscripts in the Middle Ages can be seen inserted inline between sentences."""

In [62]:
words=nltk.word_tokenize(sentence)

In [63]:
tag_elements=nltk.pos_tag(words)


In [64]:
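# draw() opens a Tk window, so it needs a display; on a headless machine
# (no $DISPLAY) it raises the TclError shown below.
# print(nltk.ne_chunk(tag_elements)) shows the same tree as text instead.
nltk.ne_chunk(tag_elements).draw()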

TclError: no display name and no $DISPLAY environment variable
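When no display is available, the chunk tree can still be inspected in code. `ne_chunk` returns an `nltk.tree.Tree` whose named entities are nested subtrees, so they can be collected by walking the top-level children. The sketch below uses a small hand-built tree as a stand-in: it is not real `ne_chunk` output, and its labels and words are assumptions chosen for illustration. `extract_entities` is a hypothetical helper.

```python
from nltk.tree import Tree

# Hand-built stand-in for the kind of tree ne_chunk returns (not real output).
sample_tree = Tree("S", [
    Tree("PERSON", [("Vikram", "NNP"), ("Sarabhai", "NNP")]),
    ("worked", "VBD"),
    ("in", "IN"),
    Tree("GPE", [("India", "NNP")]),
    (".", "."),
])

def extract_entities(tree):
    """Return (label, text) pairs for every entity subtree directly under the root."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # plain (word, tag) tuples are not entities
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities

print(extract_entities(sample_tree))
print(sample_tree)  # text rendering of the tree, no display needed
```

The same function could be applied to `nltk.ne_chunk(tag_elements)` from the cell above.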