
## Parts of Speech (POS) Tagging Documentation

Part-of-speech (POS) tagging is the process of labeling each word in a text with its part of speech, based on both the word's definition and its context. It's a fundamental step in many Natural Language Processing (NLP) tasks.
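To make the role of context concrete, here is a toy sketch. It is not how NLTK's tagger works: the lexicon is hand-written, `toy_tag` is a hypothetical helper, and a single rule resolves the ambiguous word "run" from the tag of the word before it.

```python
# Hand-written toy lexicon (an assumption for illustration only).
# Each word lists the Penn Treebank tags it may take, most likely first.
LEXICON = {
    "i": ["PRP"],
    "they": ["PRP"],
    "the": ["DT"],
    "a": ["DT"],
    "run": ["VBP", "NN"],  # ambiguous: verb or noun
}


def toy_tag(tokens):
    """Tag tokens with a one-rule, context-sensitive lookup."""
    tagged = []
    previous_tag = None
    for token in tokens:
        # Words missing from the lexicon default to a singular noun.
        candidates = LEXICON.get(token.lower(), ["NN"])
        if previous_tag == "DT" and "NN" in candidates:
            # After a determiner, prefer the noun reading.
            tag = "NN"
        else:
            tag = candidates[0]
        tagged.append((token, tag))
        previous_tag = tag
    return tagged


print(toy_tag(["I", "run"]))    # [('I', 'PRP'), ('run', 'VBP')]
print(toy_tag(["the", "run"]))  # [('the', 'DT'), ('run', 'NN')]
```

The same surface form "run" receives two different tags, which is the behavior a real statistical tagger learns from data rather than from a hand-written rule.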

Here are some common parts of speech and their typical tags (using the Penn Treebank tagset, commonly used with NLTK):

*   **Noun:**
    *   NN: singular or mass noun (e.g., "dog", "water")
    *   NNS: plural noun (e.g., "dogs")
    *   NNP: proper singular noun (e.g., "John")
    *   NNPS: proper plural noun (e.g., "Americans")

*   **Verb:**
    *   VB: verb base form (e.g., "run")
    *   VBD: verb past tense (e.g., "ran")
    *   VBG: verb present participle or gerund (e.g., "running")
    *   VBN: verb past participle (e.g., "run" in "she has run")
    *   VBP: verb non-3rd person singular present (e.g., "run" in "they run")
    *   VBZ: verb 3rd person singular present (e.g., "runs")

*   **Adjective:**
    *   JJ: adjective or ordinal numeral (e.g., "big", "first")
    *   JJR: adjective comparative (e.g., "bigger")
    *   JJS: adjective superlative (e.g., "biggest")

*   **Adverb:**
    *   RB: adverb (e.g., "quickly")
    *   RBR: adverb comparative (e.g., "faster", "harder")
    *   RBS: adverb superlative (e.g., "fastest", "most")

*   **Pronoun:**
    *   PRP: personal pronoun (e.g., "I", "he", "she")
    *   PRP$: possessive pronoun (e.g., "my", "his", "her")

*   **Preposition:**
    *   IN: preposition or subordinating conjunction (e.g., "on", "in", "because")

*   **Conjunction:**
    *   CC: coordinating conjunction (e.g., "and", "but", "or")

*   **Interjection:**
    *   UH: interjection (e.g., "oh", "uh")

*   **Determiner:**
    *   DT: determiner (e.g., "the", "a", "this")
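
The tag list above can be folded into a small lookup table when only the coarse category matters. The sketch below is one possible way to do that; `TAG_TO_CATEGORY`, `coarse_category`, and `count_categories` are hypothetical names, and the sample pairs are hand-written rather than real tagger output.

```python
from collections import Counter

# Coarse category for each Penn Treebank tag listed above.
TAG_TO_CATEGORY = {
    "NN": "Noun", "NNS": "Noun", "NNP": "Noun", "NNPS": "Noun",
    "VB": "Verb", "VBD": "Verb", "VBG": "Verb",
    "VBN": "Verb", "VBP": "Verb", "VBZ": "Verb",
    "JJ": "Adjective", "JJR": "Adjective", "JJS": "Adjective",
    "RB": "Adverb", "RBR": "Adverb", "RBS": "Adverb",
    "PRP": "Pronoun", "PRP$": "Pronoun",
    "IN": "Preposition",
    "CC": "Conjunction",
    "UH": "Interjection",
    "DT": "Determiner",
}


def coarse_category(tag):
    """Return the coarse category for a tag, or 'Other' if it is not listed."""
    return TAG_TO_CATEGORY.get(tag, "Other")


def count_categories(tagged):
    """Count coarse categories in a list of (word, tag) pairs."""
    return Counter(coarse_category(tag) for _, tag in tagged)


# Hand-written (word, tag) pairs, in the shape that nltk.pos_tag returns.
sample = [("dogs", "NNS"), ("ran", "VBD"), ("quickly", "RB"), (".", ".")]
print(coarse_category("VBZ"))  # Verb
print(count_categories(sample))
```

Punctuation tags such as `.` and `,` are not in the table, so they fall into the "Other" bucket.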

In [14]:
text = """
Natural language processing (NLP) is a fascinating field at the intersection of computer science, artificial intelligence, and linguistics.
It focuses on enabling computers to understand, interpret, and generate human language.
This involves a wide range of tasks, including text classification, sentiment analysis, machine translation, question answering, and text summarization.
NLP has become increasingly important in today's data-driven world, with applications in various industries such as healthcare, finance, and customer service.
One of the fundamental steps in many NLP tasks is text preprocessing, which involves cleaning and preparing the text data for analysis.
This often includes tasks like tokenization (breaking down text into individual words or sub-word units), stemming or lemmatization (reducing words to their root form), and removing stop words.
Stop words are common words like "the", "a", "is", and "in" that often don't carry significant meaning and can be removed to reduce noise and improve the performance of NLP models.
Another important aspect of NLP is feature extraction, which involves converting text data into numerical representations that can be used by machine learning algorithms.
Common techniques include bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings.
Bag-of-words represents text as a collection of word counts, while TF-IDF assigns weights to words based on their frequency in a document and across a corpus.
Word embeddings, such as Word2Vec and GloVe, represent words as dense vectors in a continuous vector space, capturing semantic relationships between words.
NLP models can be broadly categorized into traditional machine learning models and deep learning models.
Traditional models like Naive Bayes and Support Vector Machines have been used for tasks like text classification, while deep learning models,
such as Recurrent Neural Networks (RNNs) and Transformers, have achieved state-of-the-art results in various NLP tasks, particularly in areas like machine translation and text generation.
The field of NLP is constantly evolving, with new techniques and models being developed.
With the increasing availability of large datasets and computational resources, NLP is expected to play an even more significant role in the future,
enabling more natural and intuitive interactions between humans and computers.
"""

In [15]:
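import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Fetch the resources used below: the stop-word list, the sentence/word
# tokenizer, WordNet (for lemmatization), and the English POS tagger model.
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')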


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [16]:
lemmatizer = WordNetLemmatizer()
sentences = nltk.sent_tokenize(text)
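
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so a verb such as "involves" can pass through unchanged. A common workaround is to tag first and map each Penn Treebank tag to a WordNet POS letter. The helper below is a sketch of that mapping; `penn_to_wordnet` is a hypothetical name, and the letters `'a'`, `'v'`, `'r'`, `'n'` are the values of NLTK's `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN` constants.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter.

    Falls back to 'n' (noun), which is also the lemmatizer's default.
    """
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else


for tag in ["VBZ", "JJR", "RB", "NNS", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With NLTK loaded, each `(word, tag)` pair from `nltk.pos_tag` could then be lemmatized with `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.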

In [17]:
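# Build the stop-word set once instead of once per word.
stop_words = set(stopwords.words('english'))

for sentence in sentences:
  words = nltk.word_tokenize(sentence)
  # The stop-word check is case-sensitive, so capitalized words such as "It" are kept.
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  pos_tag = nltk.pos_tag(words)
  print(pos_tag)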

[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('fascinating', 'VBG'), ('field', 'NN'), ('intersection', 'NN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('linguistics', 'NNS'), ('.', '.')]
[('It', 'PRP'), ('focus', 'VBZ'), ('enabling', 'VBG'), ('computer', 'NN'), ('understand', 'NN'), (',', ','), ('interpret', 'NN'), (',', ','), ('generate', 'VBP'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')]
[('This', 'DT'), ('involves', 'VBZ'), ('wide', 'JJ'), ('range', 'NN'), ('task', 'NN'), (',', ','), ('including', 'VBG'), ('text', 'JJ'), ('classification', 'NN'), (',', ','), ('sentiment', 'NN'), ('analysis', 'NN'), (',', ','), ('machine', 'NN'), ('translation', 'NN'), (',', ','), ('question', 'NN'), ('answering', 'NN'), (',', ','), ('text', 'JJ'), ('summarization', 'NN'), ('.', '.')]
[('NLP', 'NNP'), ('become', 'VBP'), ('increasingly', 'RB'), ('important', 'JJ'), ('today