# COGS18 -- Final Project:

# Automated Text Analysis of Newspaper Headlines:

Initial goals:
1. Webscrape the biggest Norwegian newspapers, preferably at a set time each day, over a limited time period.
2. Look at the frequency of word use in headlines and visualize it.
3. Perform a sentiment analysis to see whether there are noticeable differences in the language used in the different newspapers.


I was able to scrape four Norwegian newspapers (ones that allow web scraping), usually at 5 pm Norwegian time, between 1 and 9 December 2024. My automated scraping ran into problems when the internet connection dropped, so I added a function that checks connectivity and retries at a later time.
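The connectivity check and retry described above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the scraper's actual code: `run_with_retries`, its parameters, and the injected `sleep` function are hypothetical names chosen for illustration.

```python
import time


def run_with_retries(task, is_connected, max_retries=3, delay_seconds=3600, sleep=time.sleep):
    """Run task() as soon as a connection is available; otherwise wait and retry.

    Returns the task's result, or None if no attempt found a connection.
    All names here are hypothetical and only illustrate the idea.
    """
    for attempt in range(max_retries + 1):
        if is_connected():
            return task()
        if attempt < max_retries:
            print(f"No connection. Retry {attempt + 1}/{max_retries} in {delay_seconds} s...")
            sleep(delay_seconds)
    print("Max retries reached. Scraping aborted.")
    return None
```

Passing `sleep` as a parameter makes it possible to try the loop without actually waiting an hour, for example `run_with_retries(scrape, check, delay_seconds=0, sleep=lambda s: None)`, where `scrape` and `check` stand in for the real functions.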

Later, I had trouble with the Norwegian characters "Æ Ø Å", which I also had to be conscious of during data collection. I also changed my script to save the data in a separate CSV for each day and merge them later, using the code in the "part 2" script. I did this because in one instance data had already been collected in a CSV during testing, but a lost internet connection caused my function to overwrite the CSV, essentially deleting the existing data. This led me to add a retry function. If I were to develop this further, I would make sure that testing and the actual data collection store their data in different CSVs.
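One way the separation of test data and real data could look is sketched below. It is an illustration only: the helper `build_output_path`, the folder names `test` and `production`, and the numeric suffix are assumptions, not part of the submitted scripts.

```python
from datetime import date
from pathlib import Path


def build_output_path(base_dir, run_date, test_run=False):
    """Return a CSV path that never points at an existing file.

    Test runs and real runs go to different subfolders (hypothetical names),
    and a numeric suffix is added if a file for that day already exists.
    """
    folder = Path(base_dir) / ("test" if test_run else "production")
    folder.mkdir(parents=True, exist_ok=True)

    stem = f"news_headlines_{run_date.isoformat()}"
    path = folder / f"{stem}.csv"
    counter = 1
    while path.exists():
        path = folder / f"{stem}_{counter}.csv"
        counter += 1
    return path
```

With this kind of helper, a second run on the same day writes to a new file instead of replacing the first one, and the merge step can be pointed at the `production` folder only.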

I spent a lot of time trying to train and optimize BERT on the Norwegian dataset NoReC_fine in order to perform a sentiment analysis, since there is a lack of sentiment tools developed for the Norwegian language. (NB-BERT-base is a general BERT-base model built on the large digital collection at the National Library of Norway; NoReC_fine is based largely on the original data described in the paper A Fine-Grained Sentiment Dataset for Norwegian.) Unfortunately, this became too difficult and time-consuming for the assignment. I then decided to explore ways of measuring correlation between the newspapers' word usage instead. However, this also became too complicated, so I abandoned goal 3 and immersed myself in goals 1 and 2.
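For reference, one simple way to compare word usage between two newspapers is the cosine similarity of their word-count vectors. The sketch below is not the approach I attempted in the project; it is a minimal standard-library illustration, and the function name `cosine_similarity` and the example inputs are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count mappings (0.0 if either is empty)."""
    shared_words = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example: word counts built from two sets of cleaned headlines
paper_a = Counter("krig fred krig".split())
paper_b = Counter("krig sport".split())
similarity = cosine_similarity(paper_a, paper_b)
```

A value near 1 means the two sources use the same words in similar proportions, and 0 means they share no words at all.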

I was, however, able to calculate and plot the frequency of the words used in the newspapers using both bar plots and word clouds. The visualization depended on a number of careful preprocessing steps, such as adding Norwegian stop words so that the plots made sense in terms of content and sentiment. I also moved the stop words into a separate script after a short period of hardcoding them in each script.
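The effect of stop-word removal on the frequency counts can be shown with a small example. This is a sketch rather than the project code: the function `top_words`, the sample headlines, and the short stop-word list are made up for illustration, and unlike the part 3 script it lowercases words before counting.

```python
from collections import Counter


def top_words(headlines, stopwords, top_n=3):
    """Count words across headlines, skipping stop words (case-insensitive)."""
    stop = {word.lower() for word in stopwords}
    counts = Counter(
        word.lower()
        for headline in headlines
        for word in headline.split()
        if word.lower() not in stop
    )
    return counts.most_common(top_n)


# Hypothetical sample data
sample_headlines = [
    "Regjeringen vil ha ny skole",
    "Ny skole i Oslo",
    "Skole og regjeringen",
]
sample_stopwords = ["vil", "ha", "ny", "i", "og"]
most_common = top_words(sample_headlines, sample_stopwords, top_n=2)
```

Without the stop-word filter, short function words such as "i" and "og" would compete with content words in the top list, which is why the stop-word list matters for the plots.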

Overall, this was a very rich project that gave me experience in collecting data from the web, preprocessing it, and then visualizing it in order to make sense of the data from a human perspective. Because the project is built step by step as three individual main scripts (plus the stop words script), the approach to collecting, processing, and making sense of data fetched from the web is scalable and customizable.

# Workflow and Scripts

The project consists of three main parts and one smaller script (plus a test_functions script):

# Part 1: Data Collection:

    Script: part1_news_scraper.py
    
    Description:
    - Scraper function to collect headlines from multiple sources, and store headlines with metadata (date, time, source) in individual CSV files.

# Part 2: Data Cleaning and Merging:

    Script: part2_clean_and_merge.py
    
    Description:
    - Clean and preprocess the scraped data to remove irrelevant characters, normalize text, and remove stopwords, and merge all CSVs.
    

# Part 3: Word Frequency Analysis and Visualization:

    Script: part3_analyze_and_visualize.py
    
    Description:
    - Generates word clouds and frequency bar charts to visualize the frequency of words for all newspapers combined and each newspaper individually. (Top 20 words overall, and top 10 words for each newspaper).

# Additional Script: Stop Words and Remove Stop Words-Function:

    Script: stopwords.py
    
    Description:
    - Defines the list of Norwegian stop words and a function that makes it easier to remove stop words during data cleaning.

# Scripts:

Location: cd ~/Desktop/Github/final_project_COGS18/assignment_folder

1. python3 part1_news_scraper.py
2. python3 part2_clean_and_merge.py
3. python3 part3_analyze_and_visualize.py

4. python3 stopwords.py

5. python3 test_functions.py

   
# Python Libraries:

- pandas: For data handling and cleaning.
- matplotlib, seaborn, WordCloud: For visualization.
- collections.Counter: For word frequency analysis.
- scipy.cluster.hierarchy, sklearn.metrics.pairwise: For similarity metrics and clustering.
- requests, bs4/BeautifulSoup: For web scraping.
- schedule: For running the scraper at a set time each day.

# Part 1:

My newspaper web scraper script:

In [None]:
# part1_news_scraper.py

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import schedule
from datetime import datetime, timedelta
import re
import socket

# First, define websites and the CSS selectors
websites = {
    "NRK": {"url": "https://www.nrk.no/", "selector": "h2.kur-room__title"},
    "VG": {"url": "https://www.vg.no/", "selector": ".headline"},
    "Aftenposten": {"url": "https://www.aftenposten.no/", "selector": "h2"},
    "Se og Hør": {"url": "https://www.seher.no/", "selector": "h2.headline"}
}

# function that checks internet connectivity
def is_connected():
    try:
        socket.create_connection(("8.8.8.8", 53), timeout=5)
        return True
    except OSError:
        return False

# function to clean up text by normalizing space
def clean_text(text):
    """Remove excess whitespace, line breaks, and normalize spaces."""
    return re.sub(r'\s+', ' ', text).strip()

# function to scrape a single website
def scrape_website(url, headline_selector):
    try:
        # timeout added so a stalled request cannot hang the scraper indefinitely
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        response.encoding = "utf-8"  # to make sure it understands the Norwegian characters
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        headlines = soup.select(headline_selector)
        
        # fetch and clean up headlines
        cleaned_headlines = []
        for headline in headlines:
            # join all text parts with single spaces (handles nested tags)
            full_text = " ".join(headline.stripped_strings)
            cleaned_headlines.append(clean_text(full_text))
        
        return cleaned_headlines
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return []

# function to scrape and save data
def scrape_daily(retry_count=0):
    max_retries = 3

    # checks internet connectivity
    if not is_connected():
        if retry_count < max_retries:
            print("No internet connection. Retrying in 1 hour...")

            def retry_once():
                # run a single retry, then cancel the job so it does not repeat hourly
                scrape_daily(retry_count + 1)
                return schedule.CancelJob

            schedule.every(1).hour.do(retry_once)
        else:
            print("Max retries reached. Scraping aborted.")
        return

    try:
        print("Scraping started...")
        all_headlines = []

        # scrapes each of the Norwegian newspaper websites
        for site, info in websites.items():
            print(f"Scraping {site}...")
            headlines = scrape_website(info["url"], info["selector"])
            print(f"Collected {len(headlines)} headlines for {site}")
            for headline in headlines:
                all_headlines.append({
                    "Source": site,
                    "Headline": headline,
                    "Date": datetime.now().strftime("%Y-%m-%d"),
                    "Time": datetime.now().strftime("%H:%M:%S")
                })
            time.sleep(1)  # added a delay between requests

        # saves results to a CSV file for each day
        if all_headlines:
            date_today = datetime.now().strftime("%Y-%m-%d")
            file_name = f"news_headlines_{date_today}.csv"
            df = pd.DataFrame(all_headlines)
            df.to_csv(file_name, index=False, encoding="utf-8")
            print(f"Scraping complete! Data saved to '{file_name}'.")
        else:
            print("No headlines collected. No file saved.")
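    except requests.exceptions.RequestException as e:
        # safety net: scrape_website already catches its own request errors
        print(f"Scraping failed: {e}")
        if retry_count < max_retries:
            print(f"Retrying in 1 hour... Attempt {retry_count + 1}/{max_retries}")

            def retry_once():
                # run a single retry, then cancel the job so it does not repeat hourly
                scrape_daily(retry_count + 1)
                return schedule.CancelJob

            schedule.every(1).hour.do(retry_once)
        else:
            print("Max retries reached. Scraping aborted.")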
            
if __name__ == "__main__":
    # schedules the scraper to run daily at 8 AM San Diego time (== 5 PM Norway time)
    print("Scheduling scraper to run at 8 AM San Diego time (5 PM Norway time).")
    schedule.every().day.at("08:00").do(scrape_daily)

    # test scheduler in one minute for debugging
    test_time = (datetime.now() + timedelta(minutes=1)).strftime("%H:%M")
    print(f"Test run scheduled at: {test_time}")
    schedule.every().day.at(test_time).do(scrape_daily)

    # run the scheduler
    print("Scheduler is running...")
    while True:
        schedule.run_pending()
        time.sleep(1)  # Waits one second before checking again


Scheduling scraper to run at 8 AM San Diego time (5 PM Norway time).
Test run scheduled at: 20:22
Scheduler is running...


# Part 2.1:

My stop words script, including the remove_stopwords function:

In [None]:
norwegian_stopwords = [
    "jeg", "du", "han", "hun", "det", "de", "er", "en", "og", "på", "som", "med", "til", "fra",
    "for", "av", "var", "vi", "kan", "ha", "nå", "har", "om", "et", "seg", "mot", "ut", "får",
    "ble", "ikke", "bare", "alle", "må", "den", "så", "sin", "man", "og", "i", "kunne", "hva",
    "hvordan", "der", "når", "alt", "år", "vil", "igjen", "skal", "noen", "deg", "meg", "dette",
    "andre", "bli", "sa", "ved", "etter", "hvor", "selv", "noe", "disse", "opp", "men", "oss",
    "over", "nå", "vårt", "nye", "helt", "få", "gjør", "blir", "hver", "ut", "inn", "da", "før",
    "veldig", "min", "ny", "litt", "vår", "si", "kommer", "rundt", "hvem", "hvorfor", "være",
    "aldri", "fikk", "gå", "gjøre", "dere", "flere", "mest", "bare", "først", "eller", "gikk",
    "ned", "dette", "slik", "jo", "skulle", "vil", "nok", "mens", "egentlig", "sånn", "enn",
    "alle", "andre", "at", "av", "bare", "begge", "ble", "blei", "bli", "blir",
    "blitt", "bort", "bra", "bruke", "både", "båe", "da", "de", "deg", "dei", "deim", "deira",
    "deires", "dem", "den", "denne", "der", "dere", "deres", "det", "dette", "di", "din", "disse",
    "ditt", "du", "dykk", "dykkar", "då", "eg", "ein", "eit", "eitt", "eller", "elles", "en",
    "ene", "eneste", "enhver", "enn", "er", "et", "ett", "etter", "folk", "for", "fordi",
    "forsøke", "fra", "få", "før", "før", "først", "gjorde", "gjøre", "god", "gå", "ha", "hadde",
    "han", "hans", "har", "hennar", "henne", "hennes", "her", "hjå", "ho", "hoe", "honom", "hoss",
    "hossen", "hun", "hva", "hvem", "hver", "hvilke", "hvilken", "hvis", "hvor", "hvordan",
    "hvorfor", "i", "ikke", "ikkje", "ingen", "ingi", "inkje", "inn", "innen", "inni", "ja",
    "jeg", "kan", "kom", "korleis", "korso", "kun", "kunne", "kva", "kvar", "kvarhelst", "kven",
    "kvi", "kvifor", "lage", "lang", "lik", "like", "makt", "man", "mange", "me", "med", "medan",
    "meg", "meget", "mellom", "men", "mens", "mer", "mest", "mi", "min", "mine", "mitt", "mot",
    "mye", "mykje", "må", "måte", "navn", "ned", "nei", "no", "noe", "noen", "noka", "noko",
    "nokon", "nokor", "nokre", "ny", "nå", "når", "og", "også", "om", "opp", "oss", "over",
    "part", "punkt", "på", "rett", "riktig", "samme", "sant", "seg", "selv", "si", "sia",
    "sidan", "siden", "sin", "sine", "sist", "sitt", "sjøl", "skal", "skulle", "slik", "slutt",
    "so", "som", "somme", "somt", "start", "stille", "så", "sånn", "tid", "til", "tilbake",
    "tilstand", "um", "under", "upp", "ut", "uten", "var", "vart", "varte", "ved", "verdi",
    "vere", "verte", "vi", "vil", "ville", "vite", "vore", "vors", "vort", "vår", "være",
    "vært", "vöre", "vört", "å", "gang", "første", "fortsatt", "se", "stor", "går", "dagens",
    "les", "n", "frå", "nær", "ser", "en", "to", "tre", "fire", "fem", "seks", "syv", "åtte",
    "ni", "elleve", "tolv", "egen", "nrk", "blant", "sett", "hatt", "ham", "han", "ifølge",
    "følge", "hele", "fått", "radio", "radioprogrammer", "podkast", "podcast", "podkaster",
    "podcaster", "våre", "Hallo", "hallo"
]

# Added a function to remove stop words in data cleaning

def remove_stopwords(text, stopwords):
    """
    Remove stopwords from the text.
    
    Parameters:
        text (str): input text for processing.
        stopwords (list): list of stopwords to remove.
        
    Returns:
        str: Text without stopwords.
    """
    words = text.split()
    return " ".join(word for word in words if word.lower() not in stopwords)


# Part 2.2:

My script for cleaning and merging the data in the collected CSV files.

In [None]:
# part2_clean_and_merge.py

import pandas as pd
import os
import re
from stopwords import norwegian_stopwords, remove_stopwords  # Import the stopword list for Norwegian words and the remove_stopwords function

def clean_text(text):
    """
    Cleans text strings by removing non-alphanumeric characters, 
    excessive whitespace, and normalizing the text.
    """
    if not isinstance(text, str):
        return ""
    # remove special characters, numbers, and excessive whitespace
    text = re.sub(r"[^a-zA-ZæøåÆØÅ\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# remove_stopwords is imported from stopwords.py above, so it is not redefined here

def clean_and_merge_csvs(input_folder, output_file):
    """
    Merge and clean multiple CSV files into one consolidated file.
    """
    all_data = []  # list to store data from all CSVs
    
    # Iterates over all CSV files in the folder
    for filename in os.listdir(input_folder):
        if filename.endswith(".csv"):
            file_path = os.path.join(input_folder, filename)
            print(f"Processing file: {file_path}")
            
            try:
                # Reads CSV file
                df = pd.read_csv(file_path, encoding="utf-8")
                
                # Drops rows that have missing values in 'Headline' or 'Source'
                df.dropna(subset=["Headline", "Source"], inplace=True)
                
                # Removes duplicate headlines
                df.drop_duplicates(subset=["Headline"], inplace=True)
                
                # Cleans the 'Headline' column
                df["Headline"] = df["Headline"].apply(clean_text)
                
                # Removes stopwords from the 'Headline' column
                df["Headline"] = df["Headline"].apply(lambda x: remove_stopwords(x, norwegian_stopwords))
                
                # Removes rows with empty headlines after cleaning
                df = df[df["Headline"].str.strip().astype(bool)]
                
                all_data.append(df)  # Appends the cleaned data to the list
                
            except Exception as e:
                print(f"Error processing file {file_path}: {e}")
    
    # Combines all cleaned data into a single DataFrame
    if all_data:
        combined_df = pd.concat(all_data, ignore_index=True)
        print(f"Total rows after merging: {len(combined_df)}")
        
        # Save the DataFrame to a new CSV file
        combined_df.to_csv(output_file, index=False, encoding="utf-8")
        print(f"Cleaned and merged data saved to '{output_file}'.")
    else:
        print("No valid data found to merge.")

if __name__ == "__main__":
    input_folder = "headlines_csv"
    output_file = "merged_cleaned_headlines.csv"
    clean_and_merge_csvs(input_folder, output_file)

# Part 3:

My script for counting the most frequently used words in the collected data and visualizing them.

In [None]:
# part3_analyze_and_visualize.py

import pandas as pd
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
from stopwords import norwegian_stopwords, remove_stopwords  # Import the stopword list for Norwegian words and the remove_stopwords function

# Useful Functions

def generate_wordcloud_text(df, column, stopwords):
    """
    Generates cleaned text for a word cloud.
    """
    return " ".join(remove_stopwords(text, stopwords) for text in df[column])

def calculate_word_frequencies(df, column, stopwords, top_n=20):
    """
    Calculates word frequencies for a specific column.
    """
    all_text = generate_wordcloud_text(df, column, stopwords)
    word_counts = Counter(all_text.split())
    return word_counts.most_common(top_n)

def plot_combined_wordclouds(df, stopwords):
    """
    Combined word cloud visualization for all the newspapers.
    """
    fig, axs = plt.subplots(1, 4, figsize=(20, 6))

    for i, (source, group) in enumerate(df.groupby("Source")):
        # Generates the word cloud text
        source_text = generate_wordcloud_text(group, "Headline", stopwords)
        wordcloud = WordCloud(width=400, height=400, background_color="white").generate(source_text)
        
        # Plot the word cloud
        axs[i].imshow(wordcloud, interpolation="bilinear")
        axs[i].axis("off")
        axs[i].set_title(f"Word Cloud - {source}", fontsize=14)

    plt.tight_layout()
    plt.show()

def plot_combined_frequencies(df, stopwords):
    """
    Combined frequency bar chart visualization for all the newspapers.
    """
    fig, axs = plt.subplots(1, 4, figsize=(20, 6))

    for i, (source, group) in enumerate(df.groupby("Source")):
        # Calculate the word frequencies
        source_word_counts = calculate_word_frequencies(group, "Headline", stopwords, top_n=10)
        words, counts = zip(*source_word_counts)

        # Plot the frequencies
        axs[i].barh(words, counts, color="skyblue")
        axs[i].set_title(f"Word Frequencies - {source}", fontsize=14)
        axs[i].set_xlabel("Frequency")
        axs[i].invert_yaxis()

    plt.tight_layout()
    plt.show()

# Main Script
if __name__ == "__main__":
    # Load the cleaned and merged CSV
    input_file = "merged_cleaned_headlines.csv"
    df = pd.read_csv(input_file, encoding="utf-8")
    print(f"Loaded {len(df)} headlines from '{input_file}'.")

    # Combined Word Clouds
    print("\n Combined Word Clouds for All Newspapers ")
    plot_combined_wordclouds(df, norwegian_stopwords)

    # Combined Frequencies
    print("\n Combined Frequencies for All Newspapers ")
    plot_combined_frequencies(df, norwegian_stopwords)

    # Overall Analysis
    print("\n Overall Analysis ")
    overall_text = generate_wordcloud_text(df, "Headline", norwegian_stopwords)
    overall_word_counts = calculate_word_frequencies(df, "Headline", norwegian_stopwords, top_n=20)

    # Plot Word Cloud (all)
    plt.figure(figsize=(10, 5))
    wordcloud = WordCloud(width=800, height=400, background_color="white").generate(overall_text)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title("Word Cloud - All Newspapers Combined", fontsize=16)
    plt.show()

    # Plot Frequencies (all)
    words, counts = zip(*overall_word_counts)
    plt.figure(figsize=(8, 6))
    plt.barh(words, counts, color="skyblue")
    plt.xlabel("Frequency")
    plt.title("Word Frequencies - All Newspapers Combined", fontsize=16)
    plt.gca().invert_yaxis()
    plt.show()

# Script to test functions:

In [None]:
# test_functions.py

import os
import pandas as pd 

from part1_news_scraper import scrape_website
from part2_clean_and_merge import clean_text
from part3_analyze_and_visualize import generate_wordcloud_text
from stopwords import norwegian_stopwords

def test_scrape_function():
    print("Testing scrape_website...")
    websites = {
        "NRK": {"url": "https://www.nrk.no/", "selector": "h2.kur-room__title"}
    }
    try:
        # Run the scraper
        result = scrape_website(websites["NRK"]["url"], websites["NRK"]["selector"])
        assert isinstance(result, list), "scrape_website did not return a list"
        print(f"Scrape function works as intended! Collected {len(result)} headlines.")
    except Exception as e:
        print(f"Test failed: {e}")

def test_clean_text_function():
    print("Testing clean_text...")
    try:
        dirty_text = "Hallo, Verden! 123"
        cleaned = clean_text(dirty_text)
        assert cleaned == "Hallo Verden", f"clean_text failed: {cleaned}"
        print("Clean text function works as intended!")
    except Exception as e:
        print(f"Test failed: {e}")

def test_generate_wordcloud_text_function():
    print("Testing generate_wordcloud_text...")
    try:
        data = pd.DataFrame([{"Headline": "Hallo Verden"}, {"Headline": "Norge er et vakkert land"}])
        wordcloud_text = generate_wordcloud_text(data, "Headline", norwegian_stopwords)
        assert isinstance(wordcloud_text, str), "generate_wordcloud_text did not return a string"
        assert "Hallo" not in wordcloud_text, "Stopwords were not removed correctly"
        assert "Verden" in wordcloud_text, "Wordcloud text output is incorrect"
        print("Generate wordcloud text function works as intended!")
    except Exception as e:
        print(f"Test failed: {e}")

if __name__ == "__main__":
    test_scrape_function()
    test_clean_text_function()
    test_generate_wordcloud_text_function()
    print("All tests are completed!")
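Because the test functions above catch their own assertion errors and only print a message, a failed check does not stop the run. A possible alternative, sketched here with the standard `unittest` module, lets failures be reported as failures. The class name `CleanTextTests`, the helper `run_tests`, and the local copy of `clean_text` (included only so the sketch runs on its own) are assumptions for illustration.

```python
import re
import unittest


def clean_text(text):
    """Local copy of the part 2 cleaner so this sketch is self-contained."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"[^a-zA-ZæøåÆØÅ\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


class CleanTextTests(unittest.TestCase):
    def test_removes_punctuation_and_digits(self):
        self.assertEqual(clean_text("Hallo, Verden! 123"), "Hallo Verden")

    def test_keeps_norwegian_letters(self):
        self.assertEqual(clean_text("Ærlig øl på Å!"), "Ærlig øl på Å")

    def test_non_string_returns_empty_string(self):
        self.assertEqual(clean_text(None), "")


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanTextTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests().wasSuccessful()` returns a boolean that can be checked after the run. This sketch also avoids network access, so it works without an internet connection.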


# Extra Credit (up to 4%)

For extra credit, if you go above and beyond on the minimal project requirements and challenge yourself to approach a project that is more complex than the basic requirements, requires you to learn something beyond what was taught in the course, or uses code concepts not taught in class, explain this at the end of your Jupyter notebook. Here, you should explain why your approach was particularly difficult/challenging for you and how your work goes beyond the minimal project requirements.

- I believe this project was very meaningful for my learning in data collection, processing, and visualization, and that its scope challenged me beyond what would have been necessary had my only goal been to pass the course. I spent a large amount of time working on different approaches (like sentiment analysis and correlation matrices), even though they did not directly yield results in the ways I intended. I also created four different scripts and tackled challenges like Norwegian letters, among others.

  
# GitHub Extra Credit (1%)

For extra credit, if you put your project in a public repo on GitHub, you will earn 1% extra credit on the project. (Note: This should be the unzipped version of your project, so that others on GitHub can see your code.)

- I'll publish it on: https://github.com/fishtangerine