# Part-of-speech (POS) tagging

Part-of-speech (POS) tagging is a process in natural language processing (NLP) that labels each word in a text with its part of speech, such as noun, verb, or adjective. NLTK (the Natural Language Toolkit) is a popular Python library for NLP tasks, including POS tagging.

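NLTK provides several taggers, including the default tagger, regular expression tagger, unigram tagger, and bigram tagger. They range from rule-based approaches (a default tagger assigns one fixed tag to every token; a regular expression tagger assigns tags by pattern matching) to statistical models (unigram and bigram taggers learn the most likely tag for a word, or for a word given the preceding tag, from a tagged training corpus). The `nltk.pos_tag` function used in this notebook relies on a pretrained machine-learning model, the averaged perceptron tagger, and outputs Penn Treebank tags.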
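
As a rough illustration of how these taggers can be combined, the sketch below chains them so that each tagger falls back to a simpler one when it has no answer. The two training sentences and the regular expression patterns are hypothetical toy data chosen for this example; a real tagger would be trained on a tagged corpus.

```python
from nltk.tag import BigramTagger, DefaultTagger, RegexpTagger, UnigramTagger

# Hypothetical toy training data: a list of sentences, each a list of (word, tag) pairs.
train_sents = [
    [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")],
    [("a", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")],
]

# Last resort: tag every token as a noun.
default_tagger = DefaultTagger("NN")

# Rule-based step: illustrative suffix patterns, tried in order.
regexp_tagger = RegexpTagger(
    [(r".*ing$", "VBG"), (r".*ly$", "RB")],
    backoff=default_tagger,
)

# Statistical steps: most likely tag per word, then per (previous tag, word).
unigram_tagger = UnigramTagger(train_sents, backoff=regexp_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

tagged = bigram_tagger.tag(["the", "lazy", "fox", "running", "quickly"])
print(tagged)
```

Here "the" and "lazy" are resolved from the training data, "running" and "quickly" by the suffix patterns, and "fox" by a fallback that agrees with its training tag. No NLTK data download is needed for this sketch because the training data is supplied inline.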

The accuracy of POS tagging depends on the quality of the training data and the effectiveness of the tagging algorithm used. POS tagging is a crucial step in many NLP applications, such as text classification, sentiment analysis, and information extraction, as it helps to identify the syntactic structure and meaning of a sentence.
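
One common way to quantify tagging accuracy is the fraction of tokens whose predicted tag matches a hand-annotated (gold) tag. The sketch below shows that calculation in plain Python; the function name `tagging_accuracy` and both tagged sentences are hypothetical and only serve to illustrate the idea.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(
        p_tag == g_tag for (_, p_tag), (_, g_tag) in zip(predicted, gold)
    )
    return correct / len(gold)


# Hypothetical gold annotation and hypothetical tagger output for one sentence.
gold = [("The", "DT"), ("old", "JJ"), ("man", "NN"), ("sails", "VBZ"), ("boats", "NNS")]
predicted = [("The", "DT"), ("old", "JJ"), ("man", "NN"), ("sails", "NNS"), ("boats", "NNS")]

accuracy = tagging_accuracy(predicted, gold)
print(f"Accuracy: {accuracy:.2f}")
```

In this made-up case the tagger mislabels "sails" as a plural noun, so four of the five tags match.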

In [1]:
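import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('sentiwordnet')
# Note: nltk.pos_tag also needs the 'averaged_perceptron_tagger' resource;
# run nltk.download('averaged_perceptron_tagger') if it is not installed yet.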

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import sentiwordnet as swn
from nltk.tag import pos_tag

from nltk.sentiment.vader import SentimentIntensityAnalyzer

import stanza
stanza.download('en')  # Download the English model

import statistics
import numpy as np
import pandas as pd

import re

import requests
from bs4 import BeautifulSoup

import newspaper

[nltk_data] Downloading package punkt to /home/pierluigi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


2023-04-27 16:41:49 INFO: Downloading default packages for language: en (English) ...
2023-04-27 16:41:50 INFO: File exists: /home/pierluigi/stanza_resources/en/default.zip
2023-04-27 16:41:55 INFO: Finished downloading models and saved to /home/pierluigi/stanza_resources.


In [2]:
url = "https://www.foxnews.com/politics/republicans-respond-after-irs-whistleblower-says-hunter-biden-investigation-being-mishandled"

In [3]:
def get_article_info(url):
    # Create a newspaper Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract the title, subtitle, description, and main text
    title = article.title.strip()
    subtitle = article.meta_data.get("description", "").strip()
    description = article.meta_description.strip()
    text = article.text.strip()

    # Set the subtitle to the description if it is empty
    if not subtitle:
        subtitle = description.strip()

    # Concatenate the extracted strings
    article_text = f"{title}\n\n{subtitle}\n\n{text}"

    # Return the concatenated string
    return article_text

In [4]:
article = get_article_info(url)

# Tagging sentences and identifying adjectives

In [5]:
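# Split the article into sentences, then count the adjective tags in each one
sentences = nltk.sent_tokenize(article)
num_adjectives_list = []
for sent in sentences:
    tagged_words = nltk.pos_tag(nltk.word_tokenize(sent))
    # Penn Treebank adjective tags all start with 'JJ' (JJ, JJR, JJS)
    num_adjectives = sum(1 for _, tag in tagged_words if tag.startswith('JJ'))
    num_adjectives_list.append(num_adjectives)

# Report the adjective count for each sentence
for i, num_adjectives in enumerate(num_adjectives_list, start=1):
    print(f"Sentence {i}: {num_adjectives}")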


Sentence 1: 2
Sentence 2: 2
Sentence 3: 1
Sentence 4: 1
Sentence 5: 1
Sentence 6: 3
Sentence 7: 3
Sentence 8: 6
Sentence 9: 0
Sentence 10: 0
Sentence 11: 2
Sentence 12: 1
Sentence 13: 0
Sentence 14: 1
Sentence 15: 3
Sentence 16: 0
Sentence 17: 0
Sentence 18: 2
Sentence 19: 2
Sentence 20: 2
Sentence 21: 2
Sentence 22: 2
Sentence 23: 5
Sentence 24: 3
Sentence 25: 1
Sentence 26: 2
Sentence 27: 0
Sentence 28: 2
Sentence 29: 7
Sentence 30: 5
Sentence 31: 0
Sentence 32: 0
Sentence 33: 0
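
The `startswith('JJ')` test lumps together the three Penn Treebank adjective tags: `JJ` (adjective), `JJR` (comparative), and `JJS` (superlative). If the distinction matters, the counts can be kept separate. The following is a minimal sketch in plain Python; the helper name `adjective_breakdown` and the hand-tagged sample are hypothetical, and in the notebook the input would be the output of `nltk.pos_tag`.

```python
from collections import Counter


def adjective_breakdown(tagged_words):
    """Count tokens per adjective tag (JJ, JJR, JJS) in (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_words if tag.startswith("JJ"))


# Hypothetical hand-tagged tokens, in the same (word, tag) format as nltk.pos_tag.
sample = [
    ("the", "DT"), ("bigger", "JJR"), ("dog", "NN"), ("is", "VBZ"),
    ("the", "DT"), ("best", "JJS"), ("and", "CC"), ("happy", "JJ"),
]

breakdown = adjective_breakdown(sample)
print(dict(breakdown))
print(f"All adjectives: {sum(breakdown.values())}")
```

Summing the values of the counter gives the same total as the single `startswith('JJ')` count used above.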


In [6]:
sentences = sent_tokenize(article)

total_adjectives = 0
total_words = 0

for sent in sentences:
    words = word_tokenize(sent)
    tagged_words = pos_tag(words)
    num_adjectives = len([word for word, tag in tagged_words if tag.startswith('JJ')])
    total_adjectives += num_adjectives
    total_words += len(words)

# Share of tokens tagged as adjectives (word_tokenize also counts punctuation as tokens)
adjective_ratio = total_adjectives / total_words

print(f"Total words: {total_words}")
print(f"Total adjectives: {total_adjectives}")
print(f"Proportion of adjectives in the article: {adjective_ratio:.2f}")

Total words: 830
Total adjectives: 61
Proportion of adjectives in the article: 0.07
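
The figure printed above divides the adjective count by the token count, so it is a proportion rather than an average per sentence. A per-sentence average can be sketched from the counts printed by cell `In [5]`. The list below is copied by hand from that output; the variable names are introduced here for illustration only.

```python
import statistics

# Per-sentence adjective counts, copied from the output of cell In [5] (33 sentences).
adjectives_per_sentence = [
    2, 2, 1, 1, 1, 3, 3, 6, 0, 0, 2,
    1, 0, 1, 3, 0, 0, 2, 2, 2, 2, 2,
    5, 3, 1, 2, 0, 2, 7, 5, 0, 0, 0,
]

mean_per_sentence = statistics.mean(adjectives_per_sentence)
median_per_sentence = statistics.median(adjectives_per_sentence)

print(f"Sentences: {len(adjectives_per_sentence)}")
print(f"Mean adjectives per sentence: {mean_per_sentence:.2f}")
print(f"Median adjectives per sentence: {median_per_sentence}")
```

The counts sum to 61, matching the total reported above, which gives a mean of about 1.85 adjectives per sentence and a median of 2.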
