# Text Extraction Approach

Before presenting the analyses, we outline the overall process we followed. It starts with retrieving the links (URLs) of the news articles in the dataset and continues with extracting the textual content of the pages behind those links.

Central to our methodology is the exploration of different approaches to text preprocessing. Preprocessing is a pivotal step that comes before any direct examination of the textual content: it systematically "cleans" the article, retaining only the components that contribute to a correct analysis of the text, primarily the keywords and phrases that carry substantive meaning.

In [30]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('sentiwordnet')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import sentiwordnet as swn
from nltk.stem import WordNetLemmatizer

from nltk.sentiment.vader import SentimentIntensityAnalyzer

import stanza
stanza.download('en')  # Download the English model

import statistics
import numpy as np
import pandas as pd

import re

import requests
from bs4 import BeautifulSoup

import newspaper



## Converting a link into a string

We use the following script to convert an article into a single string, which we then use for sentence and word tokenization.
This avoids saving the article to a .txt file every time.

In [31]:
url = "https://www.foxnews.com/politics/republicans-respond-after-irs-whistleblower-says-hunter-biden-investigation-being-mishandled"

In [32]:
def get_article_info(url):
    # Create a newspaper Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract the title, subtitle, description, and main text
    title = article.title.strip()
    subtitle = article.meta_data.get("description", "").strip()
    description = article.meta_description.strip()
    text = article.text.strip()

    # Set the subtitle to the description if it is empty
    if not subtitle:
        subtitle = description.strip()

    # Concatenate the extracted strings
    article_text = f"{title}\n\n{subtitle}\n\n{text}"

    # Return the concatenated string
    return article_text
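One detail of the function above: when an article has neither a subtitle nor a description, the f-string still emits the empty slot, which leaves an extra pair of blank lines between the title and the body. A minimal sketch of a joiner that skips empty parts follows. `join_article_parts` is a hypothetical helper, not part of `newspaper`, and it works on plain strings so it can be tried without downloading anything.

```python
def join_article_parts(title, subtitle, description, text):
    """Join the non-empty parts of an article with blank lines.

    Hypothetical helper: it mirrors the subtitle-to-description fallback
    of get_article_info, but drops any part that is empty.
    """
    title = (title or "").strip()
    # Fall back to the description when the subtitle is empty
    subtitle = (subtitle or "").strip() or (description or "").strip()
    text = (text or "").strip()

    parts = [part for part in (title, subtitle, text) if part]
    return "\n\n".join(parts)


print(repr(join_article_parts("Headline", "", "", "Body text.")))
```

Inside `get_article_info`, the last step could then call this helper with `article.title`, the metadata description, `article.meta_description`, and `article.text`.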

In [33]:
article = get_article_info(url)
print(article)

Republicans respond after IRS whistleblower says Hunter Biden investigation is being mishandled

Members of Congress are calling for more transparency from the Biden administration after an IRS whistleblower said an investigation into Hunter Biden is being mishandled.

Lawmakers on Capitol Hill are calling for the Biden administration to be held accountable for "blocking" Congress and the public from learning more about Biden family members’ business deals with China.

The congressional outcries come as a whistleblower within the Internal Revenue Service alleges an investigation into Hunter Biden is being mishandled by the Biden administration. The whistleblower also alleges "clear conflicts of interest" in the investigation.

"It’s deeply concerning that the Biden Administration may be obstructing justice by blocking efforts to charge Hunter Biden for tax violations," House Committee on Oversight and Accountability Chairman James Comer told Fox News on Wednesday.

Comer, R-Ky., also s

## Tokenization of Sentences

Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP) that breaks a text into smaller units, typically words, phrases, or symbols. Sentence tokenization, in particular, segments a text document into individual sentences based on certain rules or patterns.

During tokenization, a text document is scanned, and boundaries between sentences are identified based on punctuation marks like periods, exclamation marks, or question marks. Each identified sentence is then treated as a separate unit or token, enabling subsequent analysis to be performed on a sentence-by-sentence basis.
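To make the punctuation-based rule concrete, here is a deliberately naive sketch that splits after a period, exclamation mark, or question mark followed by whitespace. `naive_sentence_split` is a hypothetical name and this is not how `sent_tokenize` works internally; the example only illustrates the rule and one of its weaknesses.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


sample = "The inquiry continues. Mr. Comer spoke on Wednesday! Will it end?"
for sentence in naive_sentence_split(sample):
    print(sentence)
```

The abbreviation "Mr." is wrongly treated as a sentence end, which is the kind of case a trained tokenizer such as NLTK's Punkt model is designed to handle better.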

In [34]:
# Tokenize the text into sentences
sentences = sent_tokenize(article)
print(sentences)

['Republicans respond after IRS whistleblower says Hunter Biden investigation is being mishandled\n\nMembers of Congress are calling for more transparency from the Biden administration after an IRS whistleblower said an investigation into Hunter Biden is being mishandled.', 'Lawmakers on Capitol Hill are calling for the Biden administration to be held accountable for "blocking" Congress and the public from learning more about Biden family members’ business deals with China.', 'The congressional outcries come as a whistleblower within the Internal Revenue Service alleges an investigation into Hunter Biden is being mishandled by the Biden administration.', 'The whistleblower also alleges "clear conflicts of interest" in the investigation.', '"It’s deeply concerning that the Biden Administration may be obstructing justice by blocking efforts to charge Hunter Biden for tax violations," House Committee on Oversight and Accountability Chairman James Comer told Fox News on Wednesday.', 'Comer

Another way to print sentences:

In [35]:
# Print out each sentence
for sentence in sentences:
    print(sentence)

Republicans respond after IRS whistleblower says Hunter Biden investigation is being mishandled

Members of Congress are calling for more transparency from the Biden administration after an IRS whistleblower said an investigation into Hunter Biden is being mishandled.
Lawmakers on Capitol Hill are calling for the Biden administration to be held accountable for "blocking" Congress and the public from learning more about Biden family members’ business deals with China.
The congressional outcries come as a whistleblower within the Internal Revenue Service alleges an investigation into Hunter Biden is being mishandled by the Biden administration.
The whistleblower also alleges "clear conflicts of interest" in the investigation.
"It’s deeply concerning that the Biden Administration may be obstructing justice by blocking efforts to charge Hunter Biden for tax violations," House Committee on Oversight and Accountability Chairman James Comer told Fox News on Wednesday.
Comer, R-Ky., also said 
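In the output above, the headline and the first sentence of the body end up in the same item, because the headline carries no terminal punctuation. One possible workaround, assuming that blank lines separate the title, subtitle, and body, is to split on blank lines first and tokenize each block separately. The helper name `split_blocks_then_sentences` and the stand-in splitter are hypothetical; in this notebook one would pass `sent_tokenize` as the second argument.

```python
import re


def split_blocks_then_sentences(text, sentence_splitter):
    """Split text on blank lines, then split each block into sentences."""
    sentences = []
    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        if block:
            sentences.extend(sentence_splitter(block))
    return sentences


# Stand-in splitter for the demo only (assumption: simple punctuation rule)
def demo_splitter(block):
    return re.split(r"(?<=[.!?])\s+", block)


demo = "A headline without a full stop\n\nFirst sentence. Second sentence."
print(split_blocks_then_sentences(demo, demo_splitter))
```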

## Tokenization of Words and Stop Words Removal

Once the text is split into sentences, and each sentence into tokens corresponding to individual words and punctuation marks, we apply stop word removal. This crucial step eliminates common words, known as stop words, from a given text. Stop words, such as "the", "and", and "of", occur frequently in a language but generally contribute little meaning to the context of a document.

By discarding stop words, the analysis focuses on the more meaningful and content-rich words, allowing for a more accurate understanding of the text’s underlying themes, sentiment, and key topics.
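At its core, the filter is a membership test against a set of lowercase words. The sketch below uses a tiny hand-written stop list (an assumption for illustration; NLTK's English list is much longer) and a hypothetical `filter_tokens` helper. It can also drop punctuation-only tokens, which are not stop words and would otherwise pass through the filter.

```python
# Tiny illustrative stop list (assumption; not NLTK's actual list)
DEMO_STOP_WORDS = {"the", "and", "of", "is", "an", "into", "being", "after"}


def filter_tokens(tokens, stop_words=DEMO_STOP_WORDS, drop_punctuation=True):
    """Keep tokens that are neither stop words nor pure punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        # A token with no letter or digit is treated as punctuation
        if drop_punctuation and not any(ch.isalnum() for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ["The", "investigation", "into", "Hunter", "Biden",
          "is", "being", "mishandled", "."]
print(filter_tokens(tokens))
# ['investigation', 'Hunter', 'Biden', 'mishandled']
```

Comparing against `token.lower()` matters: without it, a capitalised "The" at the start of a sentence would not match the lowercase entry in the stop list.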

In [36]:
for sentence in sentences:

    # Tokenize the sentence into words
    words = word_tokenize(sentence)

    print(words)

['Republicans', 'respond', 'after', 'IRS', 'whistleblower', 'says', 'Hunter', 'Biden', 'investigation', 'is', 'being', 'mishandled', 'Members', 'of', 'Congress', 'are', 'calling', 'for', 'more', 'transparency', 'from', 'the', 'Biden', 'administration', 'after', 'an', 'IRS', 'whistleblower', 'said', 'an', 'investigation', 'into', 'Hunter', 'Biden', 'is', 'being', 'mishandled', '.']
['Lawmakers', 'on', 'Capitol', 'Hill', 'are', 'calling', 'for', 'the', 'Biden', 'administration', 'to', 'be', 'held', 'accountable', 'for', '``', 'blocking', "''", 'Congress', 'and', 'the', 'public', 'from', 'learning', 'more', 'about', 'Biden', 'family', 'members', '’', 'business', 'deals', 'with', 'China', '.']
['The', 'congressional', 'outcries', 'come', 'as', 'a', 'whistleblower', 'within', 'the', 'Internal', 'Revenue', 'Service', 'alleges', 'an', 'investigation', 'into', 'Hunter', 'Biden', 'is', 'being', 'mishandled', 'by', 'the', 'Biden', 'administration', '.']
['The', 'whistleblower', 'also', 'alleges'

## Stop Words Removal

In [37]:
total_words = 0

# Build the stop word set once, outside the loop
stop_words = set(stopwords.words('english'))

for i, sentence in enumerate(sentences):
    # Tokenize the sentence into words
    words = word_tokenize(sentence)
    
    # Separate the stop words from the remaining tokens
    filtered_words = [word for word in words if word.lower() not in stop_words]
    stop_words_found = [word for word in words if word.lower() in stop_words]
    
    # Count all the words in the sentence
    all_words = len(words)
    total_words += all_words  # add the count of all_words to the total_words variable

    # Count all the stop words in the sentence
    all_stop_words = len(stop_words_found)

    # Print out the results for each sentence
    print("Sentence ", i+1)
    print("Total words:", all_words)
    print("Filtered words:", filtered_words)
    print("Number of filtered words:", len(filtered_words))
    print("Stop words identified:", stop_words_found)
    print("Number of stop words identified:", all_stop_words)
    print()

print("Total number of words:", total_words)  # print the total sum of all words

Sentence  1
Total words: 38
Filtered words: ['Republicans', 'respond', 'IRS', 'whistleblower', 'says', 'Hunter', 'Biden', 'investigation', 'mishandled', 'Members', 'Congress', 'calling', 'transparency', 'Biden', 'administration', 'IRS', 'whistleblower', 'said', 'investigation', 'Hunter', 'Biden', 'mishandled', '.']
Number of filtered words: 23
Stop words identified: ['after', 'is', 'being', 'of', 'are', 'for', 'more', 'from', 'the', 'after', 'an', 'an', 'into', 'is', 'being']
Number of stop words identified: 15

Sentence  2
Total words: 35
Filtered words: ['Lawmakers', 'Capitol', 'Hill', 'calling', 'Biden', 'administration', 'held', 'accountable', '``', 'blocking', "''", 'Congress', 'public', 'learning', 'Biden', 'family', 'members', '’', 'business', 'deals', 'China', '.']
Number of filtered words: 22
Stop words identified: ['on', 'are', 'for', 'the', 'to', 'be', 'for', 'and', 'the', 'from', 'more', 'about', 'with']
Number of stop words identified: 13

Sentence  3
Total words: 26
Filte

## Text Pre-processing Process

In [38]:
def remove_stop_words(text):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    
    # Identify the stop words for each sentence
    num_stop_words_per_sentence = []
    stop_words_per_sentence = []
    filtered_sentences = []
    num_words_per_sentence = []
    avg_stop_words_per_sentence = []
    total_words = 0
    
    for sentence in sentences:
        # Tokenize the sentence into words
        words = word_tokenize(sentence)
        num_words = len(words)
        total_words += num_words
        
        # Split the tokens into stop words and remaining (filtered) words
        stop_words = set(stopwords.words('english'))
        filtered_words = [w for w in words if w.lower() not in stop_words]
        stop_words_found = [w for w in words if w.lower() in stop_words]
        
        # Record the counts, the stop words found, and the filtered sentence
        num_stop_words = len(stop_words_found)
        num_stop_words_per_sentence.append(num_stop_words)
        stop_words_per_sentence.append(stop_words_found)
        filtered_sentences.append(" ".join(filtered_words))
        num_words_per_sentence.append(num_words)
        
        # Fraction of this sentence's tokens that are stop words
        avg_stop_words_per_sentence.append(num_stop_words / num_words)
    
    # Calculate summary statistics
    num_stop_words = sum(num_stop_words_per_sentence)
    num_sentences = len(sentences)
    avg_stop_words_per_sentence_all = num_stop_words / num_sentences
    max_stop_words_per_sentence = max(num_stop_words_per_sentence)
    min_stop_words_per_sentence = min(num_stop_words_per_sentence)
    avg_stop_words_per_word = num_stop_words / sum(num_words_per_sentence)
    
    # Calculate the mean of the per-sentence stop word fractions
    avg_stop_words_per_sentence_avg = sum(avg_stop_words_per_sentence) / len(avg_stop_words_per_sentence)
    
    # Return the output
    return {
        'num_stop_words': num_stop_words,
        "total_words": total_words,
        'avg_stop_words_per_sentence_all': avg_stop_words_per_sentence_all,
        'max_stop_words_per_sentence': max_stop_words_per_sentence,
        'min_stop_words_per_sentence': min_stop_words_per_sentence,
        'avg_stop_words_per_word': avg_stop_words_per_word,
        'avg_stop_words_per_sentence': avg_stop_words_per_sentence,
        'avg_stop_words_per_sentence_avg': avg_stop_words_per_sentence_avg,
        'filtered_sentences': filtered_sentences,
        'stop_words_per_sentence': stop_words_per_sentence,
        'num_words_per_sentence': num_words_per_sentence,
    }


In [39]:
results = remove_stop_words(article)

print("Filtered sentences:")
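# Pair each filtered sentence with its ratio; list.index() would return the
# first match and give the wrong ratio if two filtered sentences were identical
for sentence, ratio in zip(results["filtered_sentences"], results["avg_stop_words_per_sentence"]):
    print(sentence)
    print("Average number of stop words per sentence:", round(ratio, 2))
    print()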

print("Statistics on stop words:")
print("Total number of words:", results["total_words"])
print("Number of stop words:", results["num_stop_words"])
print("Maximum number of stop words per sentence:", results["max_stop_words_per_sentence"])
print("Minimum number of stop words per sentence:", results["min_stop_words_per_sentence"])
print("Share of stop words among all tokens in the article:", round(results["avg_stop_words_per_word"], 2))


Filtered sentences:
Republicans respond IRS whistleblower says Hunter Biden investigation mishandled Members Congress calling transparency Biden administration IRS whistleblower said investigation Hunter Biden mishandled .
Average number of stop words per sentence: 0.39

Lawmakers Capitol Hill calling Biden administration held accountable `` blocking '' Congress public learning Biden family members ’ business deals China .
Average number of stop words per sentence: 0.37

congressional outcries come whistleblower within Internal Revenue Service alleges investigation Hunter Biden mishandled Biden administration .
Average number of stop words per sentence: 0.38

whistleblower also alleges `` clear conflicts interest '' investigation .
Average number of stop words per sentence: 0.29

`` ’ deeply concerning Biden Administration may obstructing justice blocking efforts charge Hunter Biden tax violations , '' House Committee Oversight Accountability Chairman James Comer told Fox News Wednesda
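The dictionary returned by `remove_stop_words` mixes several kinds of "average", and they are easy to confuse. The short worked example below uses only the token and stop word counts reported earlier for sentences 1 and 2 (38 tokens with 15 stop words, 35 tokens with 13 stop words). The variable names are illustrative; the comments map each one to the corresponding dictionary key.

```python
# (total tokens, stop words) for sentences 1 and 2, from the earlier output
counts = [(38, 15), (35, 13)]

# 'avg_stop_words_per_sentence': one stop word fraction per sentence
per_sentence_ratio = [stop / total for total, stop in counts]

# 'avg_stop_words_per_sentence_avg': mean of those fractions
mean_of_ratios = sum(per_sentence_ratio) / len(per_sentence_ratio)

# 'avg_stop_words_per_word': all stop words over all tokens (pooled)
pooled_ratio = sum(stop for _, stop in counts) / sum(total for total, _ in counts)

# 'avg_stop_words_per_sentence_all': mean number of stop words per sentence
mean_stop_words_per_sentence = sum(stop for _, stop in counts) / len(counts)

print([round(r, 2) for r in per_sentence_ratio])  # [0.39, 0.37]
print(round(mean_of_ratios, 3))                   # 0.383
print(round(pooled_ratio, 3))                     # 0.384
print(mean_stop_words_per_sentence)               # 14.0
```

The pooled ratio weights long sentences more heavily than the mean of per-sentence ratios, so the two generally differ slightly, as they do here.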

## Stemming Analysis

Stemming standardizes words by reducing them to a common root, which makes text data easier to analyse and process.

Stemming strips affixes from a word to obtain its stem, even if the resulting stem is not a valid word; the Porter stemmer used here removes suffixes only. For instance, "runs" and "running" both reduce to "run", while "investigation" becomes "investig". Irregular forms such as "ran" are not reduced to "run", because no suffix rule applies to them.
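As a rough illustration of rule-based suffix stripping, the sketch below removes the first matching suffix from a short list. This is a toy, not the Porter algorithm: `naive_stem` and its suffix list are assumptions made for the example, and a real stemmer applies many more rules and conditions.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["calling", "mishandled", "members", "ran"]:
    print(word, "->", naive_stem(word))
# calling -> call
# mishandled -> mishandl
# members -> member
# ran -> ran
```

The stem "mishandl" is not a valid word, and the irregular form "ran" is left untouched by this toy rule because it has no suffix to strip.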

In [40]:
# Create a Porter stemmer object
stemmer = PorterStemmer()

words = word_tokenize(article)

# Perform stemming on each word using the Porter stemmer
stemmed_words = [stemmer.stem(word) for word in words]

# Combine the stemmed words back into a single string
output_text = ' '.join(stemmed_words)

# Write the output text to a new file
# with open('output.txt', 'w') as f:
#    f.write(output_text)

print(output_text)


republican respond after ir whistleblow say hunter biden investig is be mishandl member of congress are call for more transpar from the biden administr after an ir whistleblow said an investig into hunter biden is be mishandl . lawmak on capitol hill are call for the biden administr to be held account for `` block '' congress and the public from learn more about biden famili member ’ busi deal with china . the congression outcri come as a whistleblow within the intern revenu servic alleg an investig into hunter biden is be mishandl by the biden administr . the whistleblow also alleg `` clear conflict of interest '' in the investig . `` it ’ s deepli concern that the biden administr may be obstruct justic by block effort to charg hunter biden for tax violat , '' hous committe on oversight and account chairman jame comer told fox new on wednesday . comer , r-ky. , also said `` decept , shadi busi scheme '' have allow the biden to make `` million from foreign adversari like china . '' hun

## Lemmatization

Lemmatization reduces words to their base or dictionary form, known as the lemma.

Unlike stemming, which strips suffixes to obtain a root form (sometimes resulting in non-words), lemmatization takes the word's part of speech into account and aims to return a valid word as the lemma.
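Conceptually, a lemmatizer behaves like a dictionary lookup keyed by the word and its part of speech. The sketch below uses a hypothetical, hand-written table (`DEMO_LEMMAS`) with a handful of entries; it is not WordNet and only illustrates why the part of speech matters and how irregular forms can be resolved.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech)
DEMO_LEMMAS = {
    ("said", "v"): "say",
    ("ran", "v"): "run",
    ("members", "n"): "member",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def lookup_lemma(word, pos="n", table=DEMO_LEMMAS):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return table.get((word.lower(), pos), word)


print(lookup_lemma("ran", pos="v"))      # run
print(lookup_lemma("meeting", pos="v"))  # meet
print(lookup_lemma("meeting", pos="n"))  # meeting
print(lookup_lemma("whistleblow"))       # whistleblow
```

The last call shows a limitation that also applies to dictionary-based lemmatizers in general: a token that is not a known word, such as a stem produced by a stemmer, has no entry and comes back unchanged.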

In [44]:
# Tokenize the input string
tokens = nltk.word_tokenize(output_text)

# Define the stop words to be removed
stop_words = set(stopwords.words('english'))

# Remove stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)


['republican', 'respond', 'ir', 'whistleblow', 'say', 'hunter', 'biden', 'investig', 'mishandl', 'member', 'congress', 'call', 'transpar', 'biden', 'administr', 'ir', 'whistleblow', 'said', 'investig', 'hunter', 'biden', 'mishandl', '.', 'lawmak', 'capitol', 'hill', 'call', 'biden', 'administr', 'held', 'account', '``', 'block', '``', 'congress', 'public', 'learn', 'biden', 'famili', 'member', '’', 'busi', 'deal', 'china', '.', 'congression', 'outcri', 'come', 'whistleblow', 'within', 'intern', 'revenu', 'servic', 'alleg', 'investig', 'hunter', 'biden', 'mishandl', 'biden', 'administr', '.', 'whistleblow', 'also', 'alleg', '``', 'clear', 'conflict', 'interest', '``', 'investig', '.', '``', '’', 'deepli', 'concern', 'biden', 'administr', 'may', 'obstruct', 'justic', 'block', 'effort', 'charg', 'hunter', 'biden', 'tax', 'violat', ',', '``', 'hous', 'committe', 'oversight', 'account', 'chairman', 'jame', 'comer', 'told', 'fox', 'new', 'wednesday', '.', 'comer', ',', 'r-ky.', ',', 'also', 

In [45]:
# Tokenize the text into words
words = word_tokenize(article)

# Initialize Porter Stemmer and WordNet Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Perform stemming and lemmatization
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # 'v' for verb lemmatization

# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in lemmatized_words if word.lower() not in stop_words]

# Join the words back into sentences
stemmed_text = ' '.join(stemmed_words)
lemmatized_text = ' '.join(lemmatized_words)
filtered_text = ' '.join(filtered_words)

print("Stemmed Text:")
print(stemmed_text)

print("\nLemmatized Text:")
print(lemmatized_text)

Stemmed Text:
republican respond after ir whistleblow say hunter biden investig is be mishandl member of congress are call for more transpar from the biden administr after an ir whistleblow said an investig into hunter biden is be mishandl . lawmak on capitol hill are call for the biden administr to be held account for `` block '' congress and the public from learn more about biden famili member ’ busi deal with china . the congression outcri come as a whistleblow within the intern revenu servic alleg an investig into hunter biden is be mishandl by the biden administr . the whistleblow also alleg `` clear conflict of interest '' in the investig . `` it ’ s deepli concern that the biden administr may be obstruct justic by block effort to charg hunter biden for tax violat , '' hous committe on oversight and account chairman jame comer told fox new on wednesday . comer , r-ky. , also said `` decept , shadi busi scheme '' have allow the biden to make `` million from foreign adversari like 

In [46]:
bigrams = list(nltk.bigrams(filtered_tokens))
trigrams = list(nltk.trigrams(filtered_tokens))

# Print the results
print("Bigrams:")
print(bigrams)
print("Trigrams:")
print(trigrams)

Bigrams:
[('republican', 'respond'), ('respond', 'ir'), ('ir', 'whistleblow'), ('whistleblow', 'say'), ('say', 'hunter'), ('hunter', 'biden'), ('biden', 'investig'), ('investig', 'mishandl'), ('mishandl', 'member'), ('member', 'congress'), ('congress', 'call'), ('call', 'transpar'), ('transpar', 'biden'), ('biden', 'administr'), ('administr', 'ir'), ('ir', 'whistleblow'), ('whistleblow', 'said'), ('said', 'investig'), ('investig', 'hunter'), ('hunter', 'biden'), ('biden', 'mishandl'), ('mishandl', '.'), ('.', 'lawmak'), ('lawmak', 'capitol'), ('capitol', 'hill'), ('hill', 'call'), ('call', 'biden'), ('biden', 'administr'), ('administr', 'held'), ('held', 'account'), ('account', '``'), ('``', 'block'), ('block', '``'), ('``', 'congress'), ('congress', 'public'), ('public', 'learn'), ('learn', 'biden'), ('biden', 'famili'), ('famili', 'member'), ('member', '’'), ('’', 'busi'), ('busi', 'deal'), ('deal', 'china'), ('china', '.'), ('.', 'congression'), ('congression', 'outcri'), ('outcri',

In [47]:
def lemmatize_bigrams(bigrams):
    lemmatizer = WordNetLemmatizer()
    
    lemmatized_bigrams = []
    for bigram in bigrams:
        lemma1 = lemmatizer.lemmatize(bigram[0])
        lemma2 = lemmatizer.lemmatize(bigram[1])
        lemmatized_bigrams.append([lemma1, lemma2])  # Append as a list instead of a tuple, for the dictionary (list of lists)
    return lemmatized_bigrams


In [48]:
lemmatized_bigrams = lemmatize_bigrams(bigrams)

print(lemmatized_bigrams)

[['republican', 'respond'], ['respond', 'ir'], ['ir', 'whistleblow'], ['whistleblow', 'say'], ['say', 'hunter'], ['hunter', 'biden'], ['biden', 'investig'], ['investig', 'mishandl'], ['mishandl', 'member'], ['member', 'congress'], ['congress', 'call'], ['call', 'transpar'], ['transpar', 'biden'], ['biden', 'administr'], ['administr', 'ir'], ['ir', 'whistleblow'], ['whistleblow', 'said'], ['said', 'investig'], ['investig', 'hunter'], ['hunter', 'biden'], ['biden', 'mishandl'], ['mishandl', '.'], ['.', 'lawmak'], ['lawmak', 'capitol'], ['capitol', 'hill'], ['hill', 'call'], ['call', 'biden'], ['biden', 'administr'], ['administr', 'held'], ['held', 'account'], ['account', '``'], ['``', 'block'], ['block', '``'], ['``', 'congress'], ['congress', 'public'], ['public', 'learn'], ['learn', 'biden'], ['biden', 'famili'], ['famili', 'member'], ['member', '’'], ['’', 'busi'], ['busi', 'deal'], ['deal', 'china'], ['china', '.'], ['.', 'congression'], ['congression', 'outcri'], ['outcri', 'come'],
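If the goal of the n-grams is to count how often each word pair occurs, keeping them as tuples can be convenient, because tuples are hashable and can serve as dictionary or `Counter` keys, whereas lists cannot. The sketch below builds n-grams with plain `zip` and counts them; `make_ngrams` is a hypothetical helper written for illustration, and the token list is a shortened sample of the stemmed tokens shown earlier.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["hunter", "biden", "investig", "hunter", "biden", "mishandl"]

bigram_counts = Counter(make_ngrams(tokens, 2))
print(bigram_counts.most_common(1))
# [(('hunter', 'biden'), 2)]
```

On the full article one would pass `filtered_tokens` instead of the sample list. Removing punctuation tokens beforehand would avoid pairs such as `('mishandl', '.')` that span a sentence boundary.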