## **Using NLTK**

In [1]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

In [2]:
text = '''Saturn is the sixth planet from the Sun and the second largest in the Solar System, after Jupiter. It is a gas giant, with an average radius of about nine times that of Earth.[27][28] It has an eighth the average density of Earth, but is over 95 times more massive.[29][30][31] Even though Saturn is almost as big as Jupiter, Saturn has less than a third the mass of Jupiter. Saturn orbits the Sun at a distance of 9.59 AU (1,434 million km), with an orbital period of 29.45 years.

Saturn's interior is thought to be composed of a rocky core, surrounded by a deep layer of metallic hydrogen, an intermediate layer of liquid hydrogen and liquid helium, and an outer layer of gas. Saturn has a pale yellow hue, due to ammonia crystals in its upper atmosphere. An electrical current in the metallic hydrogen layer is thought to give rise to Saturn's planetary magnetic field, which is weaker than Earth's, but has a magnetic moment 580 times that of Earth because of Saturn's greater size. Saturn's magnetic field strength is about a twentieth that of Jupiter.[32] The outer atmosphere is generally bland and lacking in contrast, although long-lived features can appear. Wind speeds on Saturn can reach 1,800 kilometres per hour (1,100 miles per hour).

The planet has a bright and extensive system of rings, composed mainly of ice particles, with a smaller amount of rocky debris and dust. At least 146 moons[33] orbit the planet, of which 63 are officially named; these do not include the hundreds of moonlets in the rings. Titan, Saturn's largest moon and the second largest in the Solar System, is larger (and less massive) than the planet Mercury and is the only moon in the Solar System that has a substantial atmosphere.'''

In [3]:
text

"Saturn is the sixth planet from the Sun and the second largest in the Solar System, after Jupiter. It is a gas giant, with an average radius of about nine times that of Earth.[27][28] It has an eighth the average density of Earth, but is over 95 times more massive.[29][30][31] Even though Saturn is almost as big as Jupiter, Saturn has less than a third the mass of Jupiter. Saturn orbits the Sun at a distance of 9.59 AU (1,434 million km), with an orbital period of 29.45 years.\n\nSaturn's interior is thought to be composed of a rocky core, surrounded by a deep layer of metallic hydrogen, an intermediate layer of liquid hydrogen and liquid helium, and an outer layer of gas. Saturn has a pale yellow hue, due to ammonia crystals in its upper atmosphere. An electrical current in the metallic hydrogen layer is thought to give rise to Saturn's planetary magnetic field, which is weaker than Earth's, but has a magnetic moment 580 times that of Earth because of Saturn's greater size. Saturn's 

In [6]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

# averaged_perceptron_tagger_eng: the English model used by pos_tag.
# punkt_tab: the tokenizer data used by word_tokenize.

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [7]:
pos = pos_tag(word_tokenize(text))

In [8]:
print(pos)

[('Saturn', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('sixth', 'JJ'), ('planet', 'NN'), ('from', 'IN'), ('the', 'DT'), ('Sun', 'NNP'), ('and', 'CC'), ('the', 'DT'), ('second', 'JJ'), ('largest', 'JJS'), ('in', 'IN'), ('the', 'DT'), ('Solar', 'NNP'), ('System', 'NNP'), (',', ','), ('after', 'IN'), ('Jupiter', 'NNP'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('gas', 'NN'), ('giant', 'NN'), (',', ','), ('with', 'IN'), ('an', 'DT'), ('average', 'JJ'), ('radius', 'NN'), ('of', 'IN'), ('about', 'RB'), ('nine', 'CD'), ('times', 'NNS'), ('that', 'IN'), ('of', 'IN'), ('Earth', 'NNP'), ('.', '.'), ('[', 'CC'), ('27', 'CD'), (']', 'JJ'), ('[', '$'), ('28', 'CD'), (']', 'NN'), ('It', 'PRP'), ('has', 'VBZ'), ('an', 'DT'), ('eighth', 'JJ'), ('the', 'DT'), ('average', 'JJ'), ('density', 'NN'), ('of', 'IN'), ('Earth', 'NNP'), (',', ','), ('but', 'CC'), ('is', 'VBZ'), ('over', 'RB'), ('95', 'CD'), ('times', 'NNS'), ('more', 'RBR'), ('massive', 'JJ'), ('.', '.'), ('[', '$'), ('29', 'CD'), (']

### ```1. Count how many verbs there are```

In [9]:
# Collect the unique alphabetic tokens whose tag starts with 'V' (VB, VBD, VBZ, ...).
verb_lst = set()
for word, tag in pos:
    if tag.startswith('V') and word.isalpha():
        verb_lst.add(word)

print(len(verb_lst))

16


In [10]:
print(verb_lst)

{'lacking', 'include', 'be', 'has', 'reach', 'is', 'give', 'do', 'thought', 'composed', 'ammonia', 'appear', 'are', 'orbits', 'named', 'surrounded'}
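Because the cell above uses a set, it counts distinct verb forms rather than verb occurrences. The sketch below shows one way to get both numbers with `collections.Counter`. The `sample_pos` list is an abridged selection of pairs from the tagger output above, and the helper name `verb_counts` is made up for illustration; in the notebook you would pass the full `pos` list instead.

```python
from collections import Counter

# Abridged selection of (word, tag) pairs from the tagger output above.
sample_pos = [
    ('Saturn', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('sixth', 'JJ'),
    ('planet', 'NN'), ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'),
    ('gas', 'NN'), ('giant', 'NN'), ('It', 'PRP'), ('has', 'VBZ'),
    ('an', 'DT'), ('eighth', 'JJ'),
]

def verb_counts(tagged):
    """Count verb tokens, keyed by lowercased word form (hypothetical helper)."""
    return Counter(
        word.lower()
        for word, tag in tagged
        if tag.startswith('VB') and word.isalpha()
    )

counts = verb_counts(sample_pos)
total_tokens = sum(counts.values())  # every verb occurrence
unique_forms = len(counts)           # distinct verb forms, like len(verb_lst)
print(total_tokens, unique_forms, counts.most_common(1))
```

Lowercasing before counting is a design choice: it merges sentence-initial and mid-sentence spellings of the same verb.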


### ```2. Print all the nouns in the text```

In [11]:
noun_lst = set()
for word, tag in pos:
    if tag.startswith('N') and word.isalpha() :
        noun_lst.add(word)
print(noun_lst)

print(f'\nCount the number of nouns: {len(noun_lst)}')

{'rise', 'moonlets', 'distance', 'field', 'debris', 'mass', 'System', 'moment', 'period', 'crystals', 'size', 'speeds', 'hour', 'hundreds', 'dust', 'amount', 'atmosphere', 'hydrogen', 'AU', 'Wind', 'contrast', 'twentieth', 'features', 'Titan', 'moon', 'kilometres', 'ice', 'rings', 'Mercury', 'density', 'orbit', 'radius', 'layer', 'liquid', 'Jupiter', 'Sun', 'Saturn', 'gas', 'miles', 'Solar', 'Earth', 'times', 'km', 'years', 'strength', 'system', 'moons', 'hue', 'particles', 'giant', 'outer', 'planet', 'helium', 'core', 'interior'}

Count the number of nouns: 55


### ```3. Print all the adjective-noun pairs```

In [12]:
# Pair each alphabetic adjective with the noun that immediately follows it.
ad_noun = []
for i in range(len(pos) - 1):
    word, tag = pos[i]
    next_word, next_tag = pos[i + 1]
    if tag.startswith('JJ') and word.isalpha() and next_tag.startswith('NN'):
        ad_noun.append([word, next_word])

In [13]:
ad_noun

[['sixth', 'planet'],
 ['average', 'radius'],
 ['average', 'density'],
 ['orbital', 'period'],
 ['rocky', 'core'],
 ['deep', 'layer'],
 ['metallic', 'hydrogen'],
 ['intermediate', 'layer'],
 ['liquid', 'helium'],
 ['outer', 'layer'],
 ['yellow', 'hue'],
 ['upper', 'atmosphere'],
 ['metallic', 'hydrogen'],
 ['magnetic', 'field'],
 ['magnetic', 'moment'],
 ['greater', 'size'],
 ['magnetic', 'field'],
 ['extensive', 'system'],
 ['smaller', 'amount'],
 ['rocky', 'debris'],
 ['largest', 'moon'],
 ['planet', 'Mercury'],
 ['only', 'moon'],
 ['substantial', 'atmosphere']]
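The list above keeps every occurrence, so some pairs appear more than once. As a rough sketch, a `Counter` over the pairs can show which ones repeat. The `ad_noun` list here is an abridged copy of the output above, and the variable names `pair_counts` and `repeated` are illustrative only.

```python
from collections import Counter

# Abridged copy of the adjective-noun pairs printed above.
ad_noun = [
    ['metallic', 'hydrogen'], ['intermediate', 'layer'], ['liquid', 'helium'],
    ['outer', 'layer'], ['metallic', 'hydrogen'], ['magnetic', 'field'],
    ['magnetic', 'moment'], ['magnetic', 'field'],
]

# Lists are not hashable, so convert each pair to a tuple before counting.
pair_counts = Counter(tuple(pair) for pair in ad_noun)

# Keep only the pairs that occur more than once.
repeated = {pair: n for pair, n in pair_counts.items() if n > 1}
print(repeated)
```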

## Indian Language POS tagging

In [14]:
nltk.download('indian')

[nltk_data] Downloading package indian to /root/nltk_data...
[nltk_data]   Unzipping corpora/indian.zip.


True

In [15]:
from nltk.corpus import indian
from nltk import TnT

In [16]:
for name in indian.fileids():
    print(name)
    print(len(indian.words(name)))

bangla.pos
10281
hindi.pos
9408
marathi.pos
19066
telugu.pos
9999


In [17]:
indian.sents('marathi.pos')

[["''", 'सनातनवाद्यांनी', 'व', 'प्रतिगाम्यांनी', 'समाज', 'रसातळाला', 'नेला', 'असताना', 'या', 'अंधारात', 'बाळशास्त्री', 'जांभेकर', 'यांनी', "'दर्पण'च्या", 'माध्यमातून', 'पहिली', 'ज्ञानज्योत', 'तेववली', ',', "''", 'असे', 'प्रतिपादन', 'नटसम्राट', 'प्रभाकर', 'पणशीकर', 'यांनी', 'केले', '.'], ['दर्पणकार', 'बाळशास्त्री', 'जांभेकर', 'यांच्या', '१९५व्या', 'जयंतीनिमित्त', 'महाराष्ट्र', 'संपादक', 'परिषद', 'व', 'सिंधुदुर्ग', 'जिल्हा', 'मराठी', 'पत्रकार', 'संघाच्या', 'वतीने', 'तसेच', 'महाराष्ट्र', 'जर्नलिस्ट', 'फाउंडेशन', 'व', 'महाराष्ट्र', 'ग्रामीण', 'पत्रकार', 'संघाच्या', 'सहभागाने', 'अभिवादन', 'कार्यक्रम', 'आयोजित', 'केला', 'होता', '.'], ...]

In [18]:
m_pos = indian.tagged_sents('marathi.pos')

In [19]:
m_pos

[[("''", 'SYM'), ('सनातनवाद्यांनी', 'NN'), ('व', 'CC'), ('प्रतिगाम्यांनी', 'NN'), ('समाज', 'NN'), ('रसातळाला', 'NN'), ('नेला', 'VM'), ('असताना', 'VAUX'), ('या', 'DEM'), ('अंधारात', 'NN'), ('बाळशास्त्री', 'NNPC'), ('जांभेकर', 'NNP'), ('यांनी', 'PRP'), ("'दर्पण'च्या", 'NNP'), ('माध्यमातून', 'NN'), ('पहिली', 'QO'), ('ज्ञानज्योत', 'NN'), ('तेववली', 'VM'), (',', 'SYM'), ("''", 'SYM'), ('असे', 'DEM'), ('प्रतिपादन', 'NN'), ('नटसम्राट', 'NNPC'), ('प्रभाकर', 'NNPC'), ('पणशीकर', 'NNP'), ('यांनी', 'PRP'), ('केले', 'VM'), ('.', 'SYM')], [('दर्पणकार', 'JJ'), ('बाळशास्त्री', 'NNPC'), ('जांभेकर', 'NNP'), ('यांच्या', 'PRP'), ('१९५व्या', 'QC'), ('जयंतीनिमित्त', 'NN'), ('महाराष्ट्र', 'NNPC'), ('संपादक', 'NNPC'), ('परिषद', 'NNP'), ('व', 'CC'), ('सिंधुदुर्ग', 'NNPC'), ('जिल्हा', 'NNPC'), ('मराठी', 'NNPC'), ('पत्रकार', 'NNPC'), ('संघाच्या', 'NNP'), ('वतीने', 'NN'), ('तसेच', 'PRP'), ('महाराष्ट्र', 'NNPC'), ('जर्नलिस्ट', 'NNPC'), ('फाउंडेशन', 'NNP'), ('व', 'CC'), ('महाराष्ट्र', 'NNPC'), ('ग्रामीण', 'NNPC'), 

### 1.1 Create the tagger object

In [20]:
tagger = TnT()

### 1.2 Train the tagger on the tagged sentences

In [21]:
tagger.train(m_pos)

In [22]:
sent = 'आंतरराष्ट्रीय क्रिकेट समिती ही क्रिकेट ह्या खेळाची आंतरराष्ट्रीय प्रशासकीय संघटना आहे.'

In [23]:
tagger.tag(word_tokenize(sent))

[('आंतरराष्ट्रीय', 'NNC'),
 ('क्रिकेट', 'NNC'),
 ('समिती', 'NN'),
 ('ही', 'DEM'),
 ('क्रिकेट', 'NN'),
 ('ह्या', 'Unk'),
 ('खेळाची', 'Unk'),
 ('आंतरराष्ट्रीय', 'JJ'),
 ('प्रशासकीय', 'Unk'),
 ('संघटना', 'NNP'),
 ('आहे', 'VM'),
 ('.', 'SYM')]
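Three tokens come back tagged `Unk`, which is what the tagger returns for words it never saw in training. A small sketch for measuring how often that happens is shown below. The `tagged` list is copied from the output above, and the names `unknown` and `unknown_rate` are illustrative.

```python
# Tagger output copied from the cell above.
tagged = [
    ('आंतरराष्ट्रीय', 'NNC'), ('क्रिकेट', 'NNC'), ('समिती', 'NN'),
    ('ही', 'DEM'), ('क्रिकेट', 'NN'), ('ह्या', 'Unk'),
    ('खेळाची', 'Unk'), ('आंतरराष्ट्रीय', 'JJ'), ('प्रशासकीय', 'Unk'),
    ('संघटना', 'NNP'), ('आहे', 'VM'), ('.', 'SYM'),
]

# Words the tagger could not label because they were absent from the training data.
unknown = [word for word, tag in tagged if tag == 'Unk']
unknown_rate = len(unknown) / len(tagged)
print(unknown, unknown_rate)
```

A high unknown rate suggests the training corpus is too small for the input. One option worth exploring is the `unk` argument of the `TnT` constructor, which accepts a backoff tagger for unseen words.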

## **Using spaCy**

In [24]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

1. token.pos_ (coarse-grained POS tag):
   **Represents the general category of a word's part of speech.**
    - These categories are universal: they follow the Universal POS tag set and apply broadly across languages.
    - Examples:
        - NOUN: nouns (e.g., "dog", "car").
        - VERB: verbs (e.g., "run", "eat").
        - ADJ: adjectives (e.g., "happy", "blue").
        - ADV: adverbs (e.g., "quickly", "very").
        - PRON: pronouns (e.g., "he", "they").


2. token.tag_ (fine-grained POS tag):
    **Provides a language-specific, more detailed classification of the part of speech.**
    - Encodes additional grammatical information such as tense, number, and degree of comparison.
    - Examples (specific to English):
        - NN: singular noun (e.g., "dog").
        - NNS: plural noun (e.g., "dogs").
        - VBD: past-tense verb (e.g., "ran").
        - VBG: present participle or gerund (e.g., "running").
        - JJ: adjective (e.g., "beautiful").
        - RB: adverb (e.g., "quickly").
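
To make the relationship between the two tag levels concrete, here is a minimal sketch of a lookup from fine-grained to coarse-grained tags. The dictionary `FINE_TO_COARSE` and the helper `coarse_tag` are hypothetical and cover only the tags listed above; this is not how spaCy computes `token.pos_`.

```python
# Hypothetical lookup covering only the fine-grained tags listed above.
FINE_TO_COARSE = {
    'NN': 'NOUN',
    'NNS': 'NOUN',
    'VBD': 'VERB',
    'VBG': 'VERB',
    'JJ': 'ADJ',
    'RB': 'ADV',
}

def coarse_tag(fine_tag):
    """Return the coarse category for a fine-grained tag, or 'X' if unlisted."""
    return FINE_TO_COARSE.get(fine_tag, 'X')

for fine in ['NNS', 'VBG', 'JJ', 'ZZZ']:
    print(fine, '->', coarse_tag(fine))
```

A static table like this is only an approximation. In this notebook's spaCy output, "is" carries the fine tag VBZ but the coarse tag AUX, so the coarse tag cannot always be derived from the fine tag alone.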


In [30]:
import spacy

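# Load the small English pipeline (en_core_web_sm) provided by spaCy.
# It is sufficient for basic NLP tasks such as tokenization, POS tagging,
# dependency parsing, and named entity recognition.
nlp = spacy.load("en_core_web_sm")

# Process the input text with the loaded pipeline. The result is a Doc object,
# spaCy's representation of the parsed text.
doc = nlp(text)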

for token in doc:
    print(f'{token.text} =====> {token.pos_},{token.tag_},{(spacy.explain(token.tag_))}')

Saturn =====> NOUN,NN,noun, singular or mass
is =====> AUX,VBZ,verb, 3rd person singular present
the =====> DET,DT,determiner
sixth =====> ADJ,JJ,adjective (English), other noun-modifier (Chinese)
planet =====> NOUN,NN,noun, singular or mass
from =====> ADP,IN,conjunction, subordinating or preposition
the =====> DET,DT,determiner
Sun =====> PROPN,NNP,noun, proper singular
and =====> CCONJ,CC,conjunction, coordinating
the =====> DET,DT,determiner
second =====> ADJ,JJ,adjective (English), other noun-modifier (Chinese)
largest =====> ADJ,JJS,adjective, superlative
in =====> ADP,IN,conjunction, subordinating or preposition
the =====> DET,DT,determiner
Solar =====> PROPN,NNP,noun, proper singular
System =====> PROPN,NNP,noun, proper singular
, =====> PUNCT,,,punctuation mark, comma
after =====> ADP,IN,conjunction, subordinating or preposition
Jupiter =====> PROPN,NNP,noun, proper singular
. =====> PUNCT,.,punctuation mark, sentence closer
It =====> PRON,PRP,pronoun, personal
is =====> AUX,V