# News-Project
#### Paul Wecker

## 1. Getting data
- collected articles from freitag.de, the website of a left-leaning German newspaper
- for gathering data, I initially used [`newspaper3k`](https://newspaper.readthedocs.io/en/latest/)
    - Problem: article texts were not fetched properly
    - [`newspaper4k`](https://www.reddit.com/r/Python/comments/1bmtdy0/i_forked_newspaper3k_fixed_bugs_and_improved_its/?tl=de) did not fix this either
    - Workaround: take the article URLs (and the downloaded HTML) from `newspaper4k`, then use `BeautifulSoup` to read the `div` that holds the text
- found a way to access paywalled content (see the screenshot below)
- collected data on 9 days

![title](paywall_content.png)

## 2. Dataset
The collected article data comprises:
- URL (`newspaper4k`)
- Title (`newspaper4k`)
- Author(s) (`newspaper4k`)
- Text (`BeautifulSoup`; free part and paywalled part joined)
- Date (`BeautifulSoup` or inferred)
- Paywall (boolean) (`BeautifulSoup`)
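
For articles that carry an issue label, the date is derived from a string of the form `Ausgabe <week>/<year>` (see the helper in section 3). The sketch below shows one possible conversion. It assumes that the issue number corresponds to the ISO calendar week and that the Thursday of that week is the date of interest; the function name `issue_label_to_thursday` and the sample labels are made up for illustration.

```python
import datetime


def issue_label_to_thursday(label: str) -> datetime.date:
    """Map 'Ausgabe <week>/<year>' to the Thursday of that ISO week."""
    # Drop the optional prefix, then split into week and year.
    week_str, year_str = label.removeprefix("Ausgabe ").split("/")
    # ISO weekday 4 is Thursday; fromisocalendar needs Python 3.8+.
    return datetime.date.fromisocalendar(int(year_str), int(week_str), 4)


print(issue_label_to_thursday("Ausgabe 35/2024").strftime("%d %m %Y"))  # 29 08 2024
print(issue_label_to_thursday("Ausgabe 1/2025").strftime("%d %m %Y"))   # 02 01 2025
```

The second call illustrates why ISO weeks matter around New Year: ISO week 1 of 2025 already starts on 30 December 2024.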

- 311 articles
- 185 distinct author combinations (an article's full author list counts as one combination)
- mean text length: about 6.8k characters
- 171 paywalled articles vs. 140 free articles
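
Summary figures of this kind can be computed directly from the DataFrame. The following is a minimal sketch on an invented four-row frame that only mimics the column names used in this project (`authors`, `paywall`, `text`); the rows are not real data, and the author lists are stored as strings because that is how they come back from the CSV files.

```python
import pandas as pd

# Toy stand-in for the scraped data; all rows are invented for illustration.
toy = pd.DataFrame({
    "authors": ["['A']", "['A', 'B']", "['A']", "['C']"],
    "paywall": [True, False, True, False],
    "text": ["abcd", "ab", "abcdef", ""],
})

stats = {
    "articles": len(toy),
    # Each distinct author list counts once, so "['A', 'B']" differs from "['A']".
    "author_combinations": int(toy["authors"].nunique()),
    "mean_chars": float(toy["text"].str.len().mean()),
    "paywall_articles": int(toy["paywall"].sum()),
    "free_articles": int((~toy["paywall"]).sum()),
}
print(stats)
```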

## 3. Helper functions for getting data

In [1]:
from bs4 import BeautifulSoup
from newspaper import Article
import datetime

def get_current_date():
    today = datetime.date.today()
    # Format the current date as DD_MM_YYYY (also used in the output file names)
    return today.strftime('%d_%m_%Y')

def convert_ausgabe_string_to_date(week_year_str:str):
    # Remove the 'Ausgabe ' prefix and split the input string into week and year
    prefix = 'Ausgabe '
    if week_year_str.startswith(prefix):
        week_year_str = week_year_str[len(prefix):]

    week_str, year_str = week_year_str.split('/')
    week = int(week_str)
    year = int(year_str)
    
    # Assumption: the issue number is the ISO calendar week, and the date of
    # interest is the Thursday of that week (ISO weekday 4).
    # date.fromisocalendar requires Python 3.8+.
    thursday_of_week = datetime.date.fromisocalendar(year, week, 4)
    
    # Return the date formatted as DD MM YYYY
    return thursday_of_week.strftime('%d %m %Y')

def retrieve_date_paywall_text(article_html):
    soup = BeautifulSoup(article_html, 'html.parser')

    # TEXT AND PAYWALL
    # only for non-paywall articles
    text_class = "column s-article-text js-dynamic-advertorial js-external-links"
    text = (soup
            .find("div",
                  {"class": text_class}))
    # get paywall-article introduction paragraph
    intro_class = "column s-article-text c-paywall-hidden-text js-dynamic-advertorial js-external-links"
    paywall_intro = (soup
                     .find("div",
                           {"class": intro_class}))
    
    # get content behind paywall
    paywall_class= "o-paywall"
    paywall_text = (soup
                    .find('div',
                          {'class': paywall_class}))
    paywall = False
    
    # check what has been collected
    if text:
        text = text.get_text()
    elif paywall_intro and paywall_text:
        paywall = True
        paywall_intro = paywall_intro.get_text()
        paywall_text = paywall_text.get_text()[12:] # first chars are "\n         "

        def combine_strings(str1, str2):
            # Find the longest suffix of str1 that matches the prefix of str2
            overlap_len = 0
            for i in range(1, len(str1) + 1):
                if str2.startswith(str1[-i:]):
                    overlap_len = i
            
            # Combine the strings by removing the overlapping part from str2
            combined_string = str1 + str2[overlap_len:]
            return combined_string
        text = combine_strings(paywall_intro, paywall_text)

    # Fetch Date
    date = soup.find("span", class_="js-article-issue-name")
    if date:
        date = date.get_text()
        date = convert_ausgabe_string_to_date(date)
    else:
        date = get_current_date()

    return date, paywall, text

def fetch_article(article_url):
    article = Article(article_url)
    article.download()
    article.parse()
    authors = article.authors
    title: str = article.title
    date, paywall, text = retrieve_date_paywall_text(article.html)
    return [article_url, title, authors, date, paywall, text]
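
The teaser shown in front of the paywall and the text behind it can share a stretch of text, which is why `retrieve_date_paywall_text` joins them on their overlap. Below is a small standalone sketch of that idea: it drops the longest prefix of the second string that repeats the end of the first. The function name `merge_on_overlap` and both sample strings are invented for illustration.

```python
def merge_on_overlap(head: str, tail: str) -> str:
    """Join head and tail, dropping the longest prefix of tail that repeats the end of head."""
    max_len = min(len(head), len(tail))
    # Try the longest candidate overlap first and stop at the first match.
    for size in range(max_len, 0, -1):
        if head.endswith(tail[:size]):
            return head + tail[size:]
    # No overlap at all: plain concatenation.
    return head + tail


teaser = "Der Bundestag hat am Freitag"
body = "am Freitag ein neues Gesetz beschlossen."
print(merge_on_overlap(teaser, body))
# Der Bundestag hat am Freitag ein neues Gesetz beschlossen.
```

Searching from the longest candidate downwards lets the loop exit early, whereas scanning upwards has to visit every length before the longest match is known.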



## 4. Scripts for scraping and putting things together

In [4]:
from newspaper import build
import pandas as pd

from freitag import fetch_article
from freitag import get_current_date


freitag = build("https://www.freitag.de", language="de", memorize_articles=False)
article_urls = freitag.article_urls()

df = (pd.DataFrame(data={"url":[], "title":[], "authors":[], "date":[],
                        "paywall":[], "text":[]})
    .set_index("url"))

# Demo run: only the first three URLs; drop the [:3] slice to scrape everything.
for i, article_url in enumerate(article_urls[:3]):
    # Skip known URLs before downloading, so no request is wasted on duplicates
    if article_url in df.index:
        print(f"---- Article {i+1} already stored ----")
        continue
    print(f"---- Collecting Article #{i+1} ----")
    row = pd.Series(fetch_article(article_url),
                    index=["url", "title", "authors", "date", "paywall", "text"])
    df.loc[article_url] = row

# df.to_csv(f"./data/freitag_{get_current_date()}.csv")


---- Collecting Article #1 ----
---- Collecting Article #2 ----
---- Collecting Article #3 ----


In [6]:
import os

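# Only read the daily scrape files (named freitag_<date>.csv above), so that
# data/combined_data.csv is not picked up when the notebook is re-run
file_list = sorted(f for f in os.listdir("data") if f.startswith("freitag_"))

all_dfs = [pd.read_csv(os.path.join("data", file)) for file in file_list]
for df in all_dfs:
    df.index = df["url"]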

df = pd.concat(all_dfs, axis=0)
df = df[~df.index.duplicated(keep='first')]
df = df[["url", "title", "authors", "date", "paywall", "text"]]
print(len(df))

353
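
Because the site was scraped on several days, the same URL can show up in more than one daily file. The toy example below illustrates the de-duplication step used above on two invented daily frames; the URLs and titles are placeholders, not real data.

```python
import pandas as pd

# Two invented "daily" frames that share the URL "u2".
day1 = pd.DataFrame({"url": ["u1", "u2"], "title": ["a", "b"]}).set_index("url", drop=False)
day2 = pd.DataFrame({"url": ["u2", "u3"], "title": ["b (later copy)", "c"]}).set_index("url", drop=False)

merged = pd.concat([day1, day2], axis=0)
# index.duplicated(keep="first") flags every repeat of a URL except its first occurrence.
unique = merged[~merged.index.duplicated(keep="first")]
print(unique["title"].tolist())  # ['a', 'b', 'c']
```

With `keep="first"`, the version of an article that survives is the one from the file that comes first in the concatenation order.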


## 5. Preprocessing
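- lowercased the `text` and `title` columns
- removed rows whose author field is empty or names a non-article source (event listings, podcast)
- removal of special characters -> new columns `cleaned_title`, `cleaned_text`
- tokenization, stop word removal and lemmatization with spaCy -> new columns `processed_title`, `processed_text`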
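
To make the special-character step concrete, here is a small standalone sketch that applies the same character class as the cell below to an invented sample sentence. The names `NOT_ALLOWED` and `strip_special` are placeholders chosen for this example.

```python
import re

# Everything except ASCII letters, digits, whitespace and German umlauts/ß is removed.
NOT_ALLOWED = re.compile(r"[^a-zA-Z0-9\säöüßÄÖÜ]")


def strip_special(text: str) -> str:
    return NOT_ALLOWED.sub("", text)


sample = "„preise“ steigen um 3,5 prozent: was nun?"
print(strip_special(sample))  # preise steigen um 35 prozent was nun
```

One side effect worth knowing: decimal separators are removed as well, so `3,5` turns into `35`.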

In [8]:
import spacy
import re

df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['text'] = df['text'].fillna('')
df["text"] = df["text"].str.lower()
df["title"] = df["title"].str.lower()
df['text_length'] = df['text'].apply(len)
authors_to_remove = ["['Freitag-Veranstaltungen']", "[]", "['der Freitag Podcast']"]
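# .copy() makes df an independent frame, so the column assignments below do not
# trigger pandas' SettingWithCopyWarning
df = df[~df["authors"].isin(authors_to_remove)].copy()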

# Function to remove special characters
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\säöüßÄÖÜ]', '', text)

df['cleaned_title'] = df['title'].apply(remove_special_characters)
df['cleaned_text'] = df['text'].apply(remove_special_characters)


# Load the German spaCy model
nlp = spacy.load('de_core_news_sm')

# Function to process text using spaCy for tokenization, stop word removal, and lemmatization
def spacy_process_text(text):
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return ' '.join(lemmatized_tokens)

df['processed_title'] = df['cleaned_title'].apply(spacy_process_text)
df['processed_text'] = df['cleaned_text'].apply(spacy_process_text)

df.to_csv("data/combined_data.csv")


## 6. TF-IDF/K-Means
- create TF-IDF features for later clustering

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the 'processed_text' column to create TF-IDF features
tfidf_features = tfidf_vectorizer.fit_transform(df['processed_text'])
tfidf_features.shape

(311, 39783)
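
The resulting matrix has one row per article and one column per vocabulary term. As a rough illustration of what each entry means, the sketch below recomputes TF-IDF by hand on three invented mini-documents, using the weighting that scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. It tokenizes with a plain whitespace split, which is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Invented mini-corpus; real input would be the 'processed_text' column.
docs = [
    "afd partei thüringen",
    "china vw subvention",
    "afd partei prozent",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}


def tfidf_row(tokens):
    """Raw term count times idf, then scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


matrix = [tfidf_row(toks) for toks in tokenized]
print(len(matrix), len(vocab))  # 3 7
```

Terms that occur in fewer documents (here `china`) receive a higher idf than terms shared across documents (here `afd`), which is what lets topic-specific words dominate the clustering.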

- use K-Means (K=6) to cluster the articles

In [11]:
from sklearn.cluster import KMeans

k = 6
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(tfidf_features)

cluster_labels = kmeans.labels_

# Print the first few labels to get an idea of the cluster assignment
print(cluster_labels[:10])
df["cluster"] = cluster_labels

[3 3 3 3 3 3 2 0 4 3]



In [13]:

# Extract feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Calculate mean TF-IDF per cluster
df_tfidf = pd.DataFrame(tfidf_features.toarray(), columns=feature_names)
df_tfidf['cluster'] = cluster_labels

top_words_per_cluster = {}

for cluster in range(k):
    cluster_data = df_tfidf[df_tfidf['cluster'] == cluster]
    
    mean_scores = cluster_data.drop('cluster', axis=1).mean(axis=0)
    
    # top 10 terms by mean TF-IDF score within this cluster
    top_words = mean_scores.sort_values(ascending=False).head(10).index.tolist()
    top_words_per_cluster[cluster] = top_words

for cluster, words in top_words_per_cluster.items():
    print(f'Cluster {cluster}: {words}')

Cluster 0: ['strom', 'prozent', 'inflation', 'zins', 'mitarbeitende', 'fed', 'führen', 'trump', 'chakma', 'milei']
Cluster 1: ['quiz', 'loadingweiterrätselnwenn', 'beweis', 'fallen', 'thema', 'wissen', 'stellen', 'promotour', 'prominente', 'prominenter']
Cluster 2: ['china', 'chinesisch', 'vw', 'unternehmen', 'subvention', 'deutsch', 'deutschland', 'milliarde', 'wolfsburg', 'prozent']
Cluster 3: ['afd', 'partei', 'prozent', 'mensch', 'bsw', 'politisch', 'linker', 'deutschland', 'thüringen', 'land']
Cluster 4: ['roman', 'film', 'geisel', 'frau', 'leben', 'hamas', 'bild', 'mensch', 'israelisch', 'welt']
Cluster 5: ['mensch', 'sport', 'emily', 'leben', 'loading', 'sprechen', 'paris', 'sprache', 'einfach', 'deutschland']


In [14]:
df["cluster"].value_counts()

cluster
3    152
5     61
4     46
2     25
0     18
1      9
Name: count, dtype: int64