# Using Sentiment Analysis to Produce Our Own Data Frame
---
This Colab notebook is part of our Digital Humanities Mini Project No. 3, in which we are learning how to visualize data using Python. In this project, we apply sentiment analysis to a corpus of news articles about the war in Gaza. The articles are stored in the "articles" folder inside the "data" directory of our "FASDH25-portfolio3" workspace. All articles were published by Al Jazeera English, and the full dataset contains 4,341 articles. Although the dataset's title suggests that it covers the Gaza war since November 2023, it also contains earlier articles.

For our analysis, we focus on two key months: October 2023, when the conflict began, and January 2024, when a ceasefire was signed. We chose these months to compare the sentiment of news coverage during active conflict and during the ceasefire period.

After filtering the dataset, we extract the title and body of each article. Since titles are often brief and not ideal for sentiment analysis, we focus only on the article bodies. We then apply sentence-level sentiment analysis to each article body and calculate the average sentiment score for each article.
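
The averaging step described above can be illustrated with a minimal sketch. It assumes each sentence has already been given an integer score (Stanza's sentiment processor uses 0 for negative, 1 for neutral, and 2 for positive). The helper name `average_sentiment` and the example scores are hypothetical and are not part of the notebook's own code.

```python
from statistics import mean


def average_sentiment(sentence_scores):
    """Return the mean of per-sentence scores rounded to 2 places, or None if the list is empty."""
    if not sentence_scores:
        # An article with no detected sentences has no meaningful average
        return None
    return round(mean(sentence_scores), 2)


# Hypothetical scores for a five-sentence article body (not real results)
example_scores = [0, 1, 1, 0, 2]
print(average_sentiment(example_scores))  # 0.8
```

An average below 1 suggests the article leans negative and one above 1 suggests it leans positive, although averaging ordinal labels is only a rough summary.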

Finally, we generate a CSV file containing four columns: filename, year_month (taken from the publication date at the start of the filename), title, and avg_sentiment. This CSV will later be used to visualize changes in sentiment over time during the selected months of the conflict and ceasefire.
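
As a rough illustration of how such a CSV might be summarized before plotting, the sketch below groups rows by month and averages the scores. It assumes the header names written by the notebook's final cell (`filename`, `year_month`, `title`, `avg_sentiment`). The function name `monthly_average` and the sample rows, including their file names, are hypothetical and are not real results.

```python
import csv
import io
from collections import defaultdict


def monthly_average(csv_file):
    """Group rows by year_month and return {year_month: mean of avg_sentiment}."""
    scores_by_month = defaultdict(list)
    for row in csv.DictReader(csv_file):
        scores_by_month[row["year_month"]].append(float(row["avg_sentiment"]))
    return {
        month: round(sum(scores) / len(scores), 2)
        for month, scores in scores_by_month.items()
    }


# Hypothetical rows in the CSV layout; the values are made up for illustration
sample = io.StringIO(
    "filename,year_month,title,avg_sentiment\n"
    "2023-10-08_1.txt,2023-10,Example title A,0.50\n"
    "2023-10-09_2.txt,2023-10,Example title B,0.70\n"
    "2024-01-15_3.txt,2024-01,Example title C,0.90\n"
)
print(monthly_average(sample))
```

To run this against the real output, the `io.StringIO` object could be replaced by `open('avg_sentiment_results.csv', newline='', encoding='utf-8')`.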

In [2]:
# Installing Stanza, which we'll use to perform sentiment analysis on our chosen set of articles
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

In [3]:
# Importing Stanza into our Colab notebook so we can use it for sentiment analysis
import stanza

In [4]:
# Downloading the English language model since our news articles are written in English
stanza.download("en")

# Setting up the Stanza pipeline with English, using 'tokenize' and 'sentiment' processors for sentiment analysis
# Help was taken from this web page while building the Stanza NLP pipeline for sentiment analysis: https://stanfordnlp.github.io/stanza/sentiment.html
analyzer = stanza.Pipeline(lang='en', processors='tokenize,sentiment')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package        |
------------------------------
| tokenize  | combined       |
| mwt       | combined       |
| sentiment | sstplus_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: sentiment
INFO:stanza:Done loading processors!


In [5]:
# Cloning the FASDH25-portfolio3 repository to access the "articles" folder, which contains the corpus of our news articles
!git clone https://github.com/kulsoom-za/FASDH25-portfolio3.git

Cloning into 'FASDH25-portfolio3'...
remote: Enumerating objects: 4417, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4417 (delta 0), reused 0 (delta 0), pack-reused 4415 (from 2)
Receiving objects: 100% (4417/4417), 83.04 MiB | 8.92 MiB/s, done.
Resolving deltas: 100% (16/16), done.
Updating files: 100% (4369/4369), done.


In [6]:
# importing the os module so we can use its functions to interact with our file system
import os

# Path to the folder with articles in our repository
folder_path = '/content/FASDH25-portfolio3/data/articles'

# creating an empty list to store the names of articles published in October 2023 and January 2024
filtered_articles = []

# Looping through the articles folder and filtering articles published in October 2023 and January 2024
# Help was taken from ChatGPT while writing this code; see ChatGPT Solution No.1 in "AI_Documentation_yasir_rauf" document inside the AI Documentation folder
for file_name in os.listdir(folder_path):
    if file_name.startswith('2023-10') or file_name.startswith('2024-01'):
        filtered_articles.append(file_name)

# Printing the total number of articles we filtered, based on the two selected months.
print("Number of filtered articles:", len(filtered_articles))

Number of filtered articles: 962
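
The prefix filter used in the cell above can also be written more compactly, because `str.startswith` accepts a tuple of prefixes. The sketch below is one possible variant, not the notebook's own code. The names `TARGET_MONTHS` and `filter_by_month` and the sample file names are hypothetical. Sorting is added because `os.listdir` returns entries in arbitrary order.

```python
# Months to keep, written as the YYYY-MM prefix that starts each file name
TARGET_MONTHS = ("2023-10", "2024-01")


def filter_by_month(file_names, prefixes=TARGET_MONTHS):
    """Keep names that start with any of the given prefixes, in sorted order."""
    return sorted(name for name in file_names if name.startswith(prefixes))


# Hypothetical file names for illustration
names = ["2024-01-15_b.txt", "2023-11-02_c.txt", "2023-10-07_a.txt"]
print(filter_by_month(names))  # ['2023-10-07_a.txt', '2024-01-15_b.txt']
```

In the notebook, `filter_by_month(os.listdir(folder_path))` would produce the same set of files as the loop, in a reproducible order.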


In [None]:
# importing csv to be able to write our results into a .csv file
import csv

# Creating an empty list named "results" to save our processed data after sentiment analysis
results = []

# Looping through each of our filtered articles so that we can split the title and body of each article
# Help was taken from ChatGPT while writing this code, see ChatGPT Solution No.2 in "AI_Documentation_yasir_rauf" document inside the AI Documentation folder
for article_file in filtered_articles:
    with open(os.path.join(folder_path, article_file), 'r', encoding='utf-8') as f:
        text = f.read()

    # In our articles, the title and body are separated by '-----', so we split each article on this separator, since the title is not useful for our sentiment analysis
    # text.split('-----', 1) splits the article at the first occurrence of '-----' only, which prevents additional splits if '-----' appears again later in the article.
    # Also asked ChatGPT about it, see ChatGPT Solution No.2 in "AI_Documentation_yasir_rauf" document inside the AI Documentation folder
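    parts = text.split('-----', 1)
    if len(parts) < 2:
        # Skip files that lack the '-----' separator instead of raising an IndexError
        continue
    title = parts[0].strip()
    body = parts[1].strip()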

    # Passing the article's body to the NLP analyzer.
    # Help was taken from ChatGPT while writing the code up to the sentiment analysis of individual sentences, see ChatGPT Solution No.3 in "AI_Documentation_yasir_rauf" document inside the AI Documentation folder
    doc = analyzer(body)

    # Create a list to collect the sentiment score of each sentence (0 = negative, 1 = neutral, 2 = positive)
    sentiments = []

    # Loop through all sentences in the article body to analyze sentiment individually
    for sentence in doc.sentences:
        sentiments.append(sentence.sentiment)  # Add each sentence's sentiment score (0 = negative, 1 = neutral, 2 = positive) to the list

    # Compute average sentiment to represent overall article sentiment
    # Help for writing code to calculate average sentiment was taken from this page: https://www.simplilearn.com/tutorials/python-tutorial/find-average-of-list-in-python
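    if not sentiments:
        # Skip articles in which Stanza found no sentences, to avoid dividing by zero
        continue
    avg_sentiment = sum(sentiments) / len(sentiments)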
    # Rounding off average sentiment to two decimal places for better readability and concise representation
    # Help for writing code to round off to 2 decimal places was taken from this page: https://www.datacamp.com/tutorial/python-round-to-two-decimal-places
    avg_sentiment = round(avg_sentiment, 2)

    # Extract year and month from filename for time-based grouping and analysis
    # Help was taken from ChatGPT while writing the code below, ChatGPT Solution No.4 and No. 5 in "AI_Documentation_yasir_rauf" document inside the AI Documentation folder
    year_month = article_file[:7]

    # Save article metadata and average sentiment for output
    # Help was taken from ChatGPT while writing the code below, ChatGPT Solution No.4 and No. 5 in "AI_Documentation_yasir_rauf" document inside the AI Documentation folder
    results.append([article_file, year_month, title, avg_sentiment])  # title was already stripped above

# Creating a CSV file to write the sentiment analysis results
# Help was taken from ChatGPT while writing the code below, ChatGPT Solution No.4 in "AI_Documentation_yasir_rauf" document inside the AI Documentation folder
with open('avg_sentiment_results.csv', 'w', newline='', encoding='utf-8') as csvfile:  # utf-8 so non-ASCII characters in titles are written consistently
    writer = csv.writer(csvfile)  # Create CSV writer object to handle writing rows to file
    writer.writerow(['filename', 'year_month', 'title', 'avg_sentiment'])  # Write header row for clarity in CSV columns
    writer.writerows(results)  # Write all collected results to CSV for further use or visualization
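
The title and body split used in the loop above assumes that every file contains the '-----' separator. As a hedged alternative, `str.partition` also splits at the first occurrence only and returns an empty body instead of raising an error when the separator is missing. The helper name `split_article` and the sample text are hypothetical.

```python
SEPARATOR = "-----"


def split_article(text, separator=SEPARATOR):
    """Return (title, body); body is an empty string when the separator is missing."""
    # partition() splits at the first occurrence only, like split(separator, 1)
    title, _, body = text.partition(separator)
    return title.strip(), body.strip()


# Hypothetical article text with the separator appearing twice
sample = "Example headline\n-----\nFirst paragraph.\n-----\nSecond paragraph."
title, body = split_article(sample)
print(title)
print(body)
```

Only the first separator divides title from body, so the second '-----' stays inside the body text. An empty body can then be checked explicitly before it is passed to the analyzer.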