# **Sentiment Analysis of a Political Blog with the Use of VADER**




**1. Overview of Sentiment Analysis**

**2. Research Question and Hypothesis**

**3. Methodology**

**4. The Code**

**5. Research Findings**

**6. Conclusions**

**7. References**


## **1. Overview of Sentiment Analysis**

Sentiment analysis, also known as opinion mining, is the process of determining and categorizing the sentiment expressed in a piece of text, such as a sentence, paragraph, or document. It involves using computational methods and natural language processing (NLP) techniques to analyze and understand the subjective information conveyed in the text.

The main goal of sentiment analysis is to identify the underlying sentiment or emotional tone of the text, which can be positive, negative, or neutral. By automatically classifying text, sentiment analysis helps to extract insights from large volumes of unstructured text and provides a quantitative measure of public opinion towards specific topics, products, services, or events (Pang and Lee 2008).

Sentiment analysis finds applications in various domains, including social media monitoring, customer feedback analysis, brand reputation management, market research, and political analysis. It helps businesses and organizations understand customer opinions, make data-driven decisions, identify trends, detect sentiment shifts, and tailor their strategies accordingly (Hutto and Gilbert 2014).

It is important to note that sentiment analysis is a challenging task because of the complexities of language: sarcasm, irony, context, and cultural nuances. The accuracy of sentiment analysis models depends heavily on the quality of the training data, the feature representation, and the specific domain or context in which the analysis is performed (Liu 2012).


## **2. Research Question and Hypothesis**


The aim of this mini-project is to examine the reliability and accuracy of VADER in the field of politics by analyzing a political blog post on the possible effect of the debate surrounding Pope John Paul II on the Polish election. The null hypothesis posits that sentiment analysis is reliable and accurate in this context, meaning that it can effectively capture and interpret the sentiments expressed in political texts and so provide valuable insights into public opinion. The alternative hypothesis holds that sentiment analysis is unreliable and inaccurate when applied to politics, meaning that it struggles to identify the nuanced emotions, opinions, and attitudes prevalent in political discourse. It therefore questions the validity of sentiment analysis as a tool for gauging public sentiment in the complex and multifaceted realm of politics.

## **3. Methodology**

The methodology employed in this study involves several key steps. Firstly, data collection was conducted by utilizing Beautiful Soup, a Python library used for web scraping. Beautiful Soup facilitates the extraction of relevant information from websites, in this case, collecting blog posts related to politics. It enables the retrieval of textual data necessary for sentiment analysis.

Once the data was collected, preprocessing was performed using the Natural Language Toolkit (NLTK). NLTK is a popular Python library widely used for natural language processing tasks. It offers various tools and functionalities for text preprocessing, such as tokenization, stemming, and stop-word removal. These preprocessing steps help in preparing the text data for further analysis.

Next, sentiment analysis was conducted using the VADER (Valence Aware Dictionary and sEntiment Reasoner) tool available in NLTK. VADER is a lexicon and rule-based sentiment analysis tool specifically designed for social media text. It assigns sentiment scores to individual words and calculates an overall sentiment score for a given text. The scores indicate the positivity, negativity, and neutrality of the expressed sentiment.
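
To make the lexicon-based idea concrete, the following is a minimal sketch, not VADER itself. The word valences in `TOY_LEXICON` are invented for illustration and are not entries from VADER's lexicon, and `toy_compound` is a hypothetical helper. The `normalize` function follows the form used in the reference VADER implementation to squash a summed valence into the range between -1 and 1, with `alpha=15` as its default constant; VADER additionally applies rules for negation, intensifiers, punctuation, and capitalization, which are omitted here.

```python
import math

# Hypothetical valences for illustration only; NOT VADER's lexicon entries.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.5, "abuse": -3.0}


def normalize(raw_score, alpha=15.0):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


def toy_compound(text):
    """Sum the valence of each known word, then normalize the total."""
    tokens = text.lower().split()
    raw = sum(TOY_LEXICON.get(token, 0.0) for token in tokens)
    return normalize(raw)


print(toy_compound("a great result"))    # positive value below 1
print(toy_compound("bad abuse claims"))  # negative value above -1
print(toy_compound("the report"))        # 0.0, no known words
```

Words missing from the lexicon contribute nothing, which is one reason long, factual sentences can receive scores driven by only a handful of emotionally loaded words.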


## **4. The Code**

```python
import requests
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from bs4 import BeautifulSoup

# Fetch the POS tagger model (the recorded cell output shows this download)
nltk.download('averaged_perceptron_tagger')
```

```text
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.

True
```

The ***requests*** module is imported, which allows sending HTTP requests to websites to retrieve data.

The ***nltk*** library (Natural Language Toolkit) is imported; it provides the text-processing tools used in the cells that follow.

The ***re*** module is imported, which provides support for regular expressions, enabling pattern matching and manipulation of text strings.

The ***stopwords*** corpus reader provides a collection of commonly used words (such as "the," "is," "and") that are often removed during text preprocessing.

The ***WordNetLemmatizer*** class is used for lemmatizing words, which is the process of reducing words to their base or dictionary form.

The ***sent_tokenize*** and ***word_tokenize*** functions are used for tokenizing text into sentences and words, respectively.

The ***SentimentIntensityAnalyzer*** class is NLTK's implementation of VADER, a lexicon and rule-based sentiment analysis tool that calculates sentiment scores.

The ***BeautifulSoup*** class from the bs4 library is used for parsing and navigating HTML or XML documents.
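
These imports only work if the underlying NLTK data packages are installed. The sketch below lists the resources this notebook appears to rely on and downloads them on request. The list is an assumption inferred from the functions used (tokenizers, stopwords, WordNet, POS tagger, VADER lexicon), and `download_nltk_resources` is a hypothetical helper name. Newer NLTK releases may expect differently named packages, such as `punkt_tab`, so the names may need adjusting.

```python
# Assumed resource names, inferred from the NLTK functions used in this notebook.
REQUIRED_NLTK_RESOURCES = [
    "punkt",                       # sent_tokenize / word_tokenize
    "stopwords",                   # stopwords.words('english')
    "wordnet",                     # WordNetLemmatizer
    "averaged_perceptron_tagger",  # nltk.pos_tag
    "vader_lexicon",               # SentimentIntensityAnalyzer
]


def download_nltk_resources(resources=REQUIRED_NLTK_RESOURCES):
    """Download each resource and report whether the download succeeded."""
    import nltk  # imported here so the list can be inspected without NLTK

    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_nltk_resources()` once at the top of the notebook returns a dictionary mapping each resource name to `True` or `False`, which makes a missing package easy to spot.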

```python
url = "https://polishpoliticsblog.wordpress.com/2023/04/13/how-will-the-debate-surrounding-pope-john-paul-iis-legacy-affect-the-polish-election/"
response = requests.get(url)
response.raise_for_status()  # Stop early if the page could not be retrieved
soup = BeautifulSoup(response.content, "html.parser")

# Collect every <p> element and split its text into sentences
text = soup.find_all('p')
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph.text))

for sentence in sentences:
    print(sentence)
```

```text
Defending the former Pope’s legacy may only swing a relatively small number of voters, but in a closely fought election this could be decisive in helping the right-wing ruling party to secure another outright parliamentary majority.
Abuse cover-up claims and counterclaims
Last month, a documentary aired by the US-owned TVN24 news channel claimed to show evidence that Pope John Paul II was negligent in handling cases of alleged child sexual abuse by three priests under his authority while he was Archbishop Karol Wojtyła of the Kraków diocese in the 1960s and 1970s.
The report suggested that the former pontiff allowed the culprits to continue working as priests and tried to conceal their actions by transferring them to other parishes.
Similar allegations in the case of the two of the priests had already been made last December in a book by Dutch journalist Ekke Overbeek, published in Poland during the same week that the documentary was aired.
The report’s defenders argued that, although 
```

***text = soup.find_all('p')*** This line uses the find_all method from BeautifulSoup to locate all `<p>` (paragraph) elements within the parsed HTML document.

***sentences.extend(sent_tokenize(paragraph.text))*** For each paragraph element, the sent_tokenize function from NLTK is applied to the text property of the paragraph. This function splits the text into individual sentences and returns them as a list, which is appended to the sentences list.

***for sentence in sentences*** This line starts a second loop that iterates over the sentences list and prints each sentence.
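
To show what "all paragraph elements" yields without fetching the live page, here is a dependency-free sketch that uses the standard library's `html.parser` in place of Beautiful Soup. This is a swapped-in technique for illustration only, not the notebook's method. `ParagraphCollector`, `extract_paragraphs`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <em> still belongs to the paragraph
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


sample_html = "<h2>Title</h2><p>First <em>paragraph</em>.</p><div>menu</div><p>Second one.</p>"
print(extract_paragraphs(sample_html))  # ['First paragraph.', 'Second one.']
```

Every `<p>` on the page is collected, whether or not it belongs to the article body, so the scraped sentences may include non-article text and are worth checking before scoring.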

```python
# Build the reusable objects once, outside the loop
stop_words = set(stopwords.words('english'))  # Assuming English stopwords
lemmatizer = WordNetLemmatizer()
sid = SentimentIntensityAnalyzer()

for sentence in sentences:
    # Text cleaning
    cleaned_sentence = re.sub(r'[^\w\s]', '', sentence)  # Remove punctuation
    cleaned_sentence = re.sub(r'\d+', '', cleaned_sentence)  # Remove numbers
    cleaned_sentence = cleaned_sentence.lower()  # Convert to lowercase

    # Tokenization and parts-of-speech tagging
    tokens = word_tokenize(cleaned_sentence)  # Tokenization
    tagged_tokens = nltk.pos_tag(tokens)  # POS tagging

    # Stopword removal
    filtered_tokens = [word for word, tag in tagged_tokens if word not in stop_words]

    # Lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # Sentiment scoring on the preprocessed text
    preprocessed_text = ' '.join(lemmatized_tokens)
    scores = sid.polarity_scores(preprocessed_text)

    # Map the compound score to a label
    if scores['compound'] > 0.2:
        sentiment = 'positive'
    elif scores['compound'] < -0.1:
        sentiment = 'negative'
    else:
        sentiment = 'neutral'

    print(f"Text: {sentence}\nSentiment: {sentiment}\nScores: {scores}\n")
```

```text
Text: Defending the former Pope’s legacy may only swing a relatively small number of voters, but in a closely fought election this could be decisive in helping the right-wing ruling party to secure another outright parliamentary majority.
Sentiment: positive
Scores: {'neg': 0.075, 'neu': 0.584, 'pos': 0.341, 'compound': 0.7351}

Text: Abuse cover-up claims and counterclaims
Sentiment: negative
Scores: {'neg': 0.583, 'neu': 0.417, 'pos': 0.0, 'compound': -0.6369}

Text: Last month, a documentary aired by the US-owned TVN24 news channel claimed to show evidence that Pope John Paul II was negligent in handling cases of alleged child sexual abuse by three priests under his authority while he was Archbishop Karol Wojtyła of the Kraków diocese in the 1960s and 1970s.
Sentiment: negative
Scores: {'neg': 0.125, 'neu': 0.836, 'pos': 0.039, 'compound': -0.5994}

Text: The report suggested that the former pontiff allowed the culprits to continue working as priests and tried to conceal their actio
```

***re.sub(r'[^\w\s]', '', sentence)*** This line uses a regular expression to remove every character that is neither a word character (letter, digit, or underscore) nor whitespace, which strips punctuation marks from the sentence. The result is stored in the cleaned_sentence variable.
***re.sub(r'\d+', '', cleaned_sentence)*** This line removes any digits from the cleaned_sentence.
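
The effect of these cleaning steps is easiest to see on a concrete string. The sketch below wraps the same three operations (punctuation removal, digit removal, lowercasing) in a hypothetical helper, `clean_sentence`, and applies it to a fragment adapted from the scraped text.

```python
import re


def clean_sentence(sentence):
    """Apply the notebook's three cleaning steps in the same order."""
    no_punct = re.sub(r'[^\w\s]', '', sentence)  # drop punctuation
    no_digits = re.sub(r'\d+', '', no_punct)     # drop runs of digits
    return no_digits.lower()                     # lowercase


sample = "aired by the US-owned TVN24 news channel in the 1960s and 1970s."
print(clean_sentence(sample))
# aired by the usowned tvn news channel in the s and s
```

Hyphenated words are fused ("usowned") and decades collapse to a stray "s". VADER also treats punctuation and capitalization as intensity cues, so it may be worth comparing scores on raw and cleaned sentences before committing to this preprocessing.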

***tokens = word_tokenize(cleaned_sentence)*** This line tokenizes the cleaned_sentence into individual words using the word_tokenize function from NLTK. The resulting tokens are stored in the tokens variable.
***tagged_tokens = nltk.pos_tag(tokens)*** This line applies part-of-speech (POS) tagging to the tokens using the pos_tag function from NLTK. POS tagging assigns a grammatical label to each word, such as noun, verb, or adjective. The resulting (word, tag) pairs are stored in the tagged_tokens variable.

***stop_words = set(stopwords.words('english'))*** This line retrieves the English stopword list from NLTK's corpus and converts it to a set. These stopwords include common words like "the," "is," and "and."
***filtered_tokens = [word for word, tag in tagged_tokens if word not in stop_words]*** This line filters stopwords out of tagged_tokens. Only words that are not present in the stop_words set are kept in the filtered_tokens list; the POS tags are discarded at this point.
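
One side effect of this step is worth illustrating. NLTK's English stopword list includes negation words such as "not" and "no", so filtering them out before scoring can remove cues that a rule-based analyzer like VADER uses to flip polarity. The sketch below uses a small hand-picked subset of that list, an assumption made to keep the example self-contained, and a hypothetical helper name.

```python
# Illustrative subset only; the full list comes from stopwords.words('english').
STOPWORD_SUBSET = {"the", "was", "is", "not", "no", "and"}


def remove_stopwords(tokens, stop_words=STOPWORD_SUBSET):
    """Drop every token that appears in stop_words (hypothetical helper)."""
    return [token for token in tokens if token not in stop_words]


tokens = "the response was not convincing".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['response', 'convincing']
```

After filtering, "not convincing" has become "convincing", which could plausibly push a score in the positive direction. Scoring the unfiltered sentence alongside the filtered one is a simple way to check whether this matters for a given text.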

***lemmatizer = WordNetLemmatizer()*** This line initializes a lemmatizer object from NLTK's WordNetLemmatizer class, which is used for lemmatizing words.
***lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]*** This line applies lemmatization to each word in the filtered_tokens list. Lemmatization reduces words to their base or dictionary form (lemmas).

***sid = SentimentIntensityAnalyzer()*** This line initializes an instance of the SentimentIntensityAnalyzer, NLTK's implementation of the lexicon and rule-based VADER tool.

***preprocessed_text = ' '.join(lemmatized_tokens)*** This line joins the lemmatized tokens, which have been processed and filtered in the previous steps, into a single string called preprocessed_text.

***scores = sid.polarity_scores(preprocessed_text)*** This line applies the SentimentIntensityAnalyzer to the preprocessed_text to obtain sentiment scores. The polarity_scores() method returns a dictionary containing four scores: 'neg', 'neu', and 'pos' for the proportions of the text rated negative, neutral, and positive, and 'compound' for an overall sentiment score between -1 and 1.
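
A quick check on the recorded output illustrates how the four values relate. The dictionaries below are copied from the first three results printed by the notebook; `proportions_sum` is a hypothetical helper.

```python
# Scores copied from the notebook output for the first three sentences.
recorded_scores = [
    {'neg': 0.075, 'neu': 0.584, 'pos': 0.341, 'compound': 0.7351},
    {'neg': 0.583, 'neu': 0.417, 'pos': 0.0, 'compound': -0.6369},
    {'neg': 0.125, 'neu': 0.836, 'pos': 0.039, 'compound': -0.5994},
]


def proportions_sum(scores):
    """Add up the three proportion scores; expected to be close to 1."""
    return scores['neg'] + scores['neu'] + scores['pos']


for scores in recorded_scores:
    total = proportions_sum(scores)
    in_range = -1.0 <= scores['compound'] <= 1.0
    print(f"sum of proportions: {total:.3f}, compound in [-1, 1]: {in_range}")
```

The third entry is a useful reminder that the compound score is not derived from the proportions: the text is rated 83.6% neutral, yet its compound score is strongly negative.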

***if scores['compound'] > 0.2*** This line checks if the compound score, which represents the overall sentiment, is greater than 0.2.
***sentiment = 'positive'*** If the compound score meets the condition, the sentiment is classified as positive.

***elif scores['compound'] < -0.1*** This line checks if the compound score is less than -0.1.
***sentiment = 'negative'*** If the compound score meets the condition, the sentiment is classified as negative.

***else*** If neither of the above conditions is satisfied, the sentiment is classified as ***neutral***.
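
The thresholding rule can be isolated in a small function, which makes the cutoffs explicit and easy to vary. This is a sketch: `classify` and its parameter names are hypothetical, and the defaults simply reproduce the notebook's 0.2 and -0.1 cutoffs.

```python
def classify(compound, pos_threshold=0.2, neg_threshold=-0.1):
    """Map a compound score to a label using strict inequalities."""
    if compound > pos_threshold:
        return "positive"
    if compound < neg_threshold:
        return "negative"
    return "neutral"


print(classify(0.7351))   # positive
print(classify(-0.6369))  # negative
print(classify(0.15))     # neutral under the notebook's cutoffs
print(classify(0.15, pos_threshold=0.05, neg_threshold=-0.05))  # positive
```

The notebook's cutoffs are asymmetric and differ from the ±0.05 values commonly quoted in VADER's documentation, so the label counts depend on this choice. Rerunning the classification with other thresholds is a cheap sensitivity check.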


## **5. Research Findings**


Of the 59 entries assessed by the VADER analyzer, the majority, 33, were classified as positive. These entries likely conveyed happiness, satisfaction, or other positive emotions. A further 17 entries were classified as negative; these likely expressed dissatisfaction or criticism, or conveyed negative emotions. The remaining 9 entries were classified as neutral, conveying information or opinions without a strong positive or negative inclination. The automated analysis is therefore skewed towards positive sentiment.

Among the 59 entries that were manually assessed, 19 exhibited a positive tone, 20 were categorized as neutral, suggesting a lack of strong positive or negative sentiment, and 20 were identified as having a negative sentiment. The manual assessment thus shows a relatively balanced distribution between the positive, negative, and neutral categories.
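
The two label distributions can be compared numerically. The sketch below uses only the counts reported above; the helper names are hypothetical. Because per-sentence labels are not listed here, actual agreement cannot be computed, but the counts alone give an upper bound on how many entries could have received the same label from both methods.

```python
vader_counts = {"positive": 33, "negative": 17, "neutral": 9}
manual_counts = {"positive": 19, "negative": 20, "neutral": 20}


def shares(counts):
    """Convert label counts to percentages of the total, rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def max_possible_agreement(a, b):
    """Upper bound on matching labels, given only the two sets of counts."""
    return sum(min(a[label], b[label]) for label in a)


print(shares(vader_counts))
print(shares(manual_counts))
bound = max_possible_agreement(vader_counts, manual_counts)
print(f"at most {bound} of {sum(vader_counts.values())} entries can agree")
```

This is only a ceiling: the true agreement may be considerably lower, and a per-sentence comparison (for example a confusion matrix) would be needed to measure it.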

## **6. Conclusions**

While the VADER sentiment analyzer generally performed well in identifying the sentiments of the sentences, there are a few important considerations to keep in mind. Firstly, the VADER analyzer tends to bias the results towards positive outcomes (33 positive classifications against 19 in the manual assessment), which can distort the factual tone of a sentence. Additionally, it appears to encounter significant issues when dealing with the long and complex sentences commonly found in political texts. Therefore, it is advisable to exercise caution and not rely solely on either the VADER analyzer or manual assessment, as both methods may require at least minor supervision or additional validation.

## **7. References**

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135.

Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1-167.

Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.