# TruthLens - Data Collection

TruthLens is a project developed for the BSc. Computer Science (Data Science) Final Project (CM3070) at the University of London. TruthLens is based on the Fake News Detection template. 

## Project Objectives
The primary objective of this project is to build a two-stage pipeline for misinformation classification:

1. Binary classification (Stage 1): Distinguish between real news and misinformation using the ISOT dataset. This ensures robust detection at the first stage, leveraging an established dataset.
2. Multi-class classification (Stage 2): Further classify content identified as misinformation into one of four categories, based on an adaptation of Molina et al.’s taxonomy. A custom dataset will support this nuanced classification.
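
The two stages can be chained so that only content flagged as misinformation reaches the second classifier. The sketch below is a minimal illustration of that control flow, assuming each stage is available as a plain Python callable. The names `classify_article`, `toy_stage1`, `toy_stage2` and the label strings are hypothetical and are not part of the project code.

```python
from typing import Callable, Optional, Tuple

# Hypothetical label names, for illustration only
REAL = "real"
MISINFORMATION = "misinformation"


def classify_article(
    text: str,
    stage1: Callable[[str], str],
    stage2: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Run the two-stage pipeline on one piece of text.

    stage1 returns REAL or MISINFORMATION (binary classification).
    stage2 returns a misinformation category and is only called
    when stage1 flags the text as misinformation.
    """
    binary_label = stage1(text)
    if binary_label == REAL:
        return binary_label, None
    return binary_label, stage2(text)


# Toy stand-ins for trained models, used only to show the control flow
def toy_stage1(text: str) -> str:
    return MISINFORMATION if "shocking" in text.lower() else REAL


def toy_stage2(text: str) -> str:
    return "satire" if "onion" in text.lower() else "fabricated"


print(classify_article("Council approves annual budget", toy_stage1, toy_stage2))
print(classify_article("Shocking secret revealed", toy_stage1, toy_stage2))
```

Keeping the stages as separate callables means either model could be swapped out without changing the calling code.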

The scope of the project is limited to text-based, English-language content, explicitly excluding images and videos. A user interface will also be developed, enabling users to input articles or URLs and receive classification results.

A secondary objective is to enhance the explainability of classification results, aiming to provide users with interpretable insights into why content was classified in a particular way.

The project aims for high accuracy and reliability, with measurable performance goals. Ethical considerations, including bias mitigation and responsible dataset usage, will guide the design and implementation of the pipeline.

## Custom Dataset Generation
As outlined in the previous section, the second stage of the pipeline relies on a custom dataset labelled with categories from the Molina et al. Misinformation Taxonomy. The taxonomy's classes are summarised in the table below; the rows in italics are the classes not used in this project. The aim of this stage is to create a balanced dataset with 400 pieces of content for each of the four chosen categories: fabricated content, polarised content, satire, and commentary.

| Misinformation Type | Characteristics | Example |
|:--------------|:---------------|:-------|
| Fabricated content | Completely false content created with the intent to deceive.| Fake reports of events that never occurred; entirely false claims about public figures |
|Polarised content |True events or facts presented selectively to promote a biased narrative, often omitting critical context. |Partisan news articles highlighting one side of a political argument while ignoring counterpoints.|
|Satire |Content intended to entertain or provoke thought through humour, exaggeration, or irony. Often misunderstood. |Satirical articles from outlets like “The Onion” being shared as if they are factual news.|
|*Misreporting* | *Incorrect information shared unintentionally, often due to errors or lack of verification.* | *A news outlet incorrectly reporting election results due to early or inaccurate data.*|
|Commentary |Opinion-based content reflecting the writer’s interpretation or viewpoint, often lacking factual grounding. |Editorials or blogs expressing subjective opinions without substantial evidence.|
|*Persuasive information* |*Content designed to persuade or influence the audience, often including marketing and propaganda.* |*Politically motivated propaganda campaigns, advertisements disguised as objective news articles.*|
|*Citizen journalism* | *User-generated content that may lack professional journalistic standards, leading to error or bias.* |*Social media posts about breaking news that spread unverified or incorrect details.*|

Data will be scraped from relevant websites or sources for each category, then manually reviewed to ensure that it fits the category. Features and labelling guidelines for each category are given below.
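
Because the target is a balanced dataset, it can help to track how far each class is from 400 items while collecting. The following is a minimal sketch using only the standard library; `remaining_per_class` is a hypothetical helper, the label strings other than `fabricated` are assumed, and the example counts are made up.

```python
from collections import Counter

TARGET_PER_CLASS = 400  # target stated in the text above
# Assumed label strings for the four chosen categories
CLASSES = ["fabricated", "polarised", "satire", "commentary"]


def remaining_per_class(labels, target=TARGET_PER_CLASS, classes=CLASSES):
    """Return how many more items each class needs to reach the target."""
    counts = Counter(labels)
    return {name: max(target - counts[name], 0) for name in classes}


# Made-up labels, purely to show the output shape
example_labels = ["fabricated"] * 400 + ["satire"] * 390 + ["commentary"] * 120
print(remaining_per_class(example_labels))
```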

In [3]:
#Imports and helper functions
import requests
import json
from bs4 import BeautifulSoup
import csv
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
import re
import string
import nltk
import ftfy
from nltk.corpus import stopwords
from datetime import datetime
import unicodedata
import unidecode
import time
import random

def scrape_multiple_articles(urls, scrape_function):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.
    scrape_function : callable
        A function that takes a single URL and returns one record
        (for example a dict of fields) for that article.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_function(url)
        articles.append(article)
    return pd.DataFrame(articles)

def clean_text(text):
    """
    Fix encoding issues, normalise Unicode characters to plain ASCII,
    and collapse newlines and extra whitespace into single spaces.
    Non-string input is returned unchanged.
    """
    #print("In cleaning text")
    # Make sure input is a string
    if not isinstance(text, str):
        print("Not text")
        return text
    
    # Fix text encoding issues
    text = ftfy.fix_text(text)
    
    # Normalise to NFKC, which maps compatibility characters such as the
    # Unicode mathematical letters and digits to their plain equivalents
    text = unicodedata.normalize("NFKC", text)
    
    # Drop any mathematical alphanumeric symbols (U+1D400 to U+1D7FF) that remain
    text = "".join(c for c in text if not (0x1D400 <= ord(c) <= 0x1D7FF))
    
    # Convert fancy symbols to plain ASCII
    text = unidecode.unidecode(text)
    
    # Replace newline characters and non-breaking spaces with a space
    text = text.replace("\n", " ").replace("\xa0", " ")
    
    # Remove any extra whitespace
    text = " ".join(text.split())
        
    return text

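def get_urls_from_txt(filename):
    """Read one URL per line from a text file, skipping blank lines and duplicates."""
    with open(filename, "r") as file:
        urls = [line.strip() for line in file if line.strip()]
    # Drop duplicates while keeping the file order (a set would not preserve it)
    return list(dict.fromkeys(urls))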

def scrape_articles(urls_file, custom_function, export_file):
    """Scrape every URL listed in urls_file with custom_function and save the result to export_file as CSV."""
    # Load the list of URLs to scrape
    urls = get_urls_from_txt(urls_file)
    # Scrape articles and create a DataFrame
    df = scrape_multiple_articles(urls, custom_function)
    # Store to CSV
    df.to_csv(export_file, index=False)
    # Record the name of the exported file
    all_scraped_content.append(export_file)
    # Print the number of scraped articles
    print("Length:", len(df))
    # Print the first row
    print(df.head(1))

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
}

# A list to track all the CSV files created with scraped content
all_scraped_content = []
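
`scrape_multiple_articles` calls the scrape function once per URL with no pause and no error handling, so a single failed request stops the whole run. The sketch below shows one possible way to make a run more tolerant, under the assumption that skipping a failing URL is acceptable. `scrape_politely`, `min_delay` and `max_delay` are hypothetical names and are not part of the helpers above.

```python
import random
import time


def scrape_politely(urls, scrape_function, min_delay=1.0, max_delay=3.0):
    """Call scrape_function on each URL, pausing between calls.

    Returns (articles, failed_urls). A URL whose scrape raises an
    exception is recorded in failed_urls instead of stopping the run.
    """
    articles, failed_urls = [], []
    for url in urls:
        try:
            articles.append(scrape_function(url))
        except Exception as error:  # broad on purpose: any failure skips the URL
            print(f"Skipping {url}: {error}")
            failed_urls.append(url)
        # Random pause so requests are not sent in a rapid burst
        time.sleep(random.uniform(min_delay, max_delay))
    return articles, failed_urls
```

The returned `articles` list can be passed to `pd.DataFrame(...)` in the same way as in `scrape_multiple_articles`, and `failed_urls` can be retried or checked by hand.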

### 1. Fabricated Content
Completely false content created with the intent to deceive.

##### Features:

- Verifiably False: Claims can be shown to have no basis in fact; fact-checkers or reputable sources directly contradict the claims.
- Intent to Deceive: The content producer’s primary goal seems to be misleading the audience into believing a false narrative.
- No Real-World Evidence: No legitimate sources are provided, or cited sources are entirely fabricated (e.g., non-existent experts, fake studies).


##### Label if:

- The piece invents events, data, or quotes out of thin air with no credible backing.
- The story is 100% fictional yet presented as news/fact.


##### Do Not Label If:

- The content is obviously comedic or satirical (label as Satire).
- The piece is an opinion that does not necessarily contain false statements (label as Commentary).
- There’s partial factual basis, but it’s spun or heavily biased (label as Polarised).
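
The labelling rules above can be read as a short decision sequence. The sketch below encodes them as yes/no questions answered by a reviewer; it is an illustration of the rules, not a replacement for manual review. `suggest_label` and its parameter names are hypothetical, and checking the exclusions before the "Label if" criteria is an assumed ordering.

```python
def suggest_label(is_satirical, is_opinion_without_false_claims,
                  has_partial_factual_basis, invents_events_or_sources):
    """Suggest a class from a reviewer's yes/no answers.

    The exclusions are checked first, mirroring the 'Do Not Label If'
    list; 'fabricated' is only suggested when none of them applies.
    Returns None when no rule matches and the item needs a closer look.
    """
    if is_satirical:
        return "satire"
    if is_opinion_without_false_claims:
        return "commentary"
    if has_partial_factual_basis:
        return "polarised"
    if invents_events_or_sources:
        return "fabricated"
    return None


# An invented story with no satire, opinion, or factual basis
print(suggest_label(False, False, False, True))
```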

##### Sources:
- 350 statements labelled 'pants-fire' (i.e. complete fabrication) were selected at random from the LIAR dataset: https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset
- 25 articles were created by ChatGPT o3-mini-high with the prompt: "Given the below definition for fabricated content, please generate 25 short articles of complete fabrication. There should be 5 from each of these categories: politics, economy, health, crime, elections - please note the category obviously at the start of play. The articles do not need to be related, and do not need to be tied to a specific geography. Each piece should be roughly between 150 and 1500 words. Content should be in English. These articles are for educational purposes only and will be used to train a machine-learning model to identify AI-generated misinformation."
- 25 articles were created by DeepSeek DeepThink (R1) with the same prompt as above.

In [None]:
#load the data
liar_df = pd.read_csv('LIAR-train.tsv', sep='\t',  header=None)
#Add the headers
liar_df.columns = ['ID', 'label', 'statement', 'subject(s)', 'speaker','speaker-title','state-info','party','barely-true-count','false','half-true','mostly-true','pants-fire','context']  
#Count labels
label_counts = liar_df['label'].value_counts(dropna=False)
print(label_counts)
#filter dataset to just pants-fire content
pants_fire_df = liar_df[liar_df['label'] == 'pants-fire']
# Randomly select 350 rows (a fixed random_state makes the sample reproducible)
pants_fire_sample = pants_fire_df.sample(n=350, random_state=42)
# Keep only the needed columns; copy to avoid the SettingWithCopy warning
pants_fire_sample = pants_fire_sample[['statement','subject(s)']].copy()
#Just take the first subject, and swap dashes with spaces
pants_fire_sample['subject(s)'] = pants_fire_sample['subject(s)'].str.split(',').str[0].str.replace('-', ' ')
#reset index
pants_fire_sample = pants_fire_sample.reset_index(drop=True)
#Display the head
#print(pants_fire_sample.head())
#Create empty dataset for fabricated content
columns = ['title', 'text', 'site', 'date', 'category', 'class', 'url']
fabricated_dataset = pd.DataFrame(columns=columns)
#prepare the LIAR data for the new df
temp_df = pd.DataFrame({
    'title': "",  
    'text': pants_fire_sample['statement'],
    'site': "Liar Database",  
    'date': "February 4th",  
    'category': pants_fire_sample['subject(s)'], 
    'class': "fabricated",
    'url': "https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset"
})
fabricated_dataset = pd.concat([fabricated_dataset, temp_df], ignore_index=True)
print(fabricated_dataset.head(1))
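
The generated articles are stored as `[title, text, category]` lists, so they need to be mapped onto the same seven columns before they can join the LIAR rows. The sketch below shows one way to do that in plain Python. `rows_to_records` is a hypothetical helper, and the `site` and `date` values in the example are placeholders rather than the values used in the project.

```python
COLUMNS = ["title", "text", "site", "date", "category", "class", "url"]


def rows_to_records(rows, site, date, url=""):
    """Convert [title, text, category] rows into dicts with the dataset columns."""
    records = []
    for title, text, category in rows:
        records.append({
            "title": title,
            "text": text,
            "site": site,
            "date": date,
            "category": category,
            "class": "fabricated",
            "url": url,
        })
    return records


# Placeholder row and metadata, for illustration only
sample_rows = [["Example title", "Example body text.", "politics"]]
records = rows_to_records(sample_rows, site="ChatGPT", date="unknown")
print(records[0]["category"])
```

A list of such records can be turned into a DataFrame with `pd.DataFrame(records, columns=COLUMNS)` and concatenated with `fabricated_dataset`.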

In [None]:
# ChatGPT output
chatgpt_output = [
    ['Shadow Council Manipulates Global Policies','In a stunning revelation that has rocked the global political landscape, insiders have claimed that a secretive group known as the Shadow Council has been orchestrating international policy decisions behind the scenes for over two decades. According to anonymous sources within high-ranking government agencies, this clandestine network meets in undisclosed locations to determine the fate of nations—manipulating economic strategies, military deployments, and diplomatic relations with ruthless precision. One whistleblower, insisting on anonymity, described the council’s gatherings as “a blend of high-level intrigue and covert power plays,” where a handful of elite figures shape world events. Despite a complete lack of verifiable evidence and rebuttals from reputable fact-checkers, rumors persist, stirring suspicion among citizens and igniting fierce debates over the true nature of global governance. Critics demand full transparency, while supporters dismiss the claims as a political witch hunt.','politics'],
    ['The Rise of the Phantom Leader','Reports from undisclosed insiders claim that a mysterious figure—referred to only as the Phantom Leader—has quietly assumed control over several national governments simultaneously. Allegedly emerging from the shadows of political instability, this enigmatic leader is said to have orchestrated a series of covert meetings with influential policymakers in dimly lit back rooms. Documents leaked to a dubious online forum (purportedly authored by “deep-state informants”) suggest that the Phantom Leader’s network manipulates legislative agendas and even directs covert military operations without public knowledge. Despite lacking any credible sources, conspiracy theorists assert that this figure’s influence is so pervasive that major policy shifts and election outcomes across multiple continents can be traced back to secret communications from this single mastermind. Authorities have repeatedly denied any such existence, dismissing the reports as politically motivated fabrications. Nonetheless, the legend of the Phantom Leader continues to fuel debates on the hidden forces controlling modern politics.','politics'],
    ['Fabricated Faction’s Covert Conspiracy Exposed', 'A series of anonymous memos circulating on obscure internet forums have allegedly uncovered a covert conspiracy orchestrated by a fabricated political faction known as the “Crimson Syndicate.” According to these unverified documents, the Crimson Syndicate comprises influential lawmakers and shadowy advisors who purportedly manipulate policy decisions for their own benefit. The memos detail clandestine meetings held in remote, undisclosed locations where members allegedly decide on major legislative actions and orchestrate political scandals to discredit rivals. One particularly detailed memo claims that the Syndicate once arranged the downfall of an entire government cabinet simply to advance its own secret agenda. While no reputable news outlet or independent fact-checker has confirmed any part of this narrative, the circulating documents have nevertheless sparked heated discussions on social media and among fringe political groups. Detractors dismiss the allegations as wild fabrications, yet the growing fascination with the Crimson Syndicate continues to captivate those eager to believe in hidden, all-powerful networks in the realm of politics.','politics'],
    ["Hidden Alliances in the Corridors of Power", "In a narrative that sounds more like a spy thriller than reality, leaked “insider” communications now allege the existence of hidden alliances among top government officials across multiple nations. According to these fabricated sources, secret meetings held in luxurious, undisclosed locations have resulted in a series of backdoor pacts designed to bypass democratic processes. The documents—a mixture of blurry photographs, cryptic emails, and questionable “eyewitness” accounts—claim that leaders from different countries conspire to ensure their mutual benefit, often at the expense of public welfare. One source, identified only by the pseudonym “Nightwatcher,” asserts that these covert gatherings have influenced major global events, including trade wars and military escalations, with no oversight or accountability. Critics argue that the evidence is entirely manufactured, yet the tale of clandestine pacts behind closed doors continues to circulate widely, feeding the narrative that true power resides not in publicly elected officials but in secret alliances hidden in the corridors of power.",'politics'],
    ['Government Secrets Unveiled by Whistleblowers','A series of explosive revelations by alleged whistleblowers has ignited controversy in political circles, with claims that top government officials have been concealing vast amounts of classified information from the public. According to the fabricated reports, these officials have engaged in a deliberate cover-up involving the manipulation of policy outcomes, the redirection of public funds, and the orchestration of international incidents to distract from domestic mismanagement. Leaked documents—purportedly obtained through highly secretive channels—purport to show that covert committees operate independently of elected representatives, making decisions that affect millions without any form of public scrutiny. One anonymous source claimed that a secret “Transparency Committee” exists solely to fabricate narratives that support the government’s agenda. Although no hard evidence has emerged and fact-checkers have thoroughly debunked the claims, the idea of hidden governmental secrets continues to resonate with a segment of the population that remains deeply distrustful of official narratives.','politics'],
    ["The Secret Currency That Could Change the World","In a story that has captured the imaginations of economic conspiracy theorists everywhere, unverified sources have alleged the existence of a hidden global currency engineered by an elite cabal. Dubbed the “Phantom Coin,” this secret form of money is said to circulate only among the world’s most powerful financial institutions, outside the purview of national regulators and international oversight. According to the fabricated narrative, the Phantom Coin was created as a tool to destabilize traditional monetary systems and establish a new world order based on clandestine financial control. Anonymous insiders claim that this digital currency is already in circulation, used to facilitate secret transactions and influence economic policies in various countries. Although mainstream economists and banking authorities have dismissed these assertions as complete fabrications, the idea of a hidden monetary system has fueled heated debates on online forums and in underground economic circles. Critics argue that the concept of a global secret currency is nothing more than a cleverly constructed myth, designed to incite distrust in established financial institutions.","economy"],
    ["Hidden Financial Collapse Engineered by Elites", "A series of unsubstantiated leaks has sent shockwaves through online financial communities, with claims that a shadowy group of financial elites has orchestrated a deliberate plan to trigger a global economic collapse. According to the fabricated documents circulating on encrypted messaging apps, these elites have been manipulating stock markets, interest rates, and international trade agreements for decades. The conspiracy theory posits that by engineering an economic meltdown, this cabal intends to seize control of national economies and install a new financial system under their complete dominion. One anonymous source, signing off as “The Insider,” detailed how secret meetings held in undisclosed locations allegedly laid out a blueprint for the collapse, complete with timelines and specific economic indicators. Despite the lack of any credible evidence or confirmation from reputable institutions, the narrative has taken on a life of its own among conspiracy theorists. Mainstream experts have categorically rejected the theory, but the allure of a hidden hand guiding global economics continues to fascinate and alarm many.","economy"],
    ["The Phantom of Market Manipulation", "Recent reports from mysterious online channels claim that a covert group known as “The Phantom” has been secretly manipulating global stock markets to create artificial booms and busts. According to the entirely fabricated story, this group uses advanced algorithms and insider access to orchestrate dramatic swings in market values, ensuring that only a select few reap enormous profits while ordinary investors suffer severe losses. Leaked “evidence” in the form of blurry screenshots and unverified emails purport to show that major market indices were deliberately skewed during key financial events over the past decade. Conspiracy theorists argue that The Phantom’s actions are responsible for several notorious market crashes, though no reputable financial analyst or regulator has ever confirmed any such scheme. Instead, critics dismiss the allegations as modern folklore—a narrative designed to explain the often unpredictable nature of global finance. Nonetheless, the legend of The Phantom continues to spread across online communities, feeding the belief that the markets are secretly rigged by unseen forces.","economy"],
    ["Underground Trade Networks Revealed", "Whispers of an extensive underground trade network have recently surfaced in a series of online posts that claim to expose an elaborate system of secret deals and backdoor negotiations among multinational corporations and government insiders. According to these unverified accounts, this network—codenamed “Black Route”—is responsible for smuggling vital commodities, manipulating supply chains, and controlling prices on a global scale. Fabricated documents allegedly leaked from an anonymous source suggest that Black Route operates with near-impunity, using encrypted communication channels and hidden financial conduits to bypass international regulations. The posts detail intricate schemes involving fake invoices, shadow accounts, and secret meetings in remote warehouses. Despite the dramatic narrative, established trade experts and economic analysts have refuted the existence of any such network, attributing the claims to baseless rumors and intentional disinformation. Yet the allure of a hidden economic underworld continues to captivate the imaginations of those distrustful of global financial systems, even as authorities dismiss the reports as entirely fictitious.","economy"],
    ["Fake Economic Forecasts Uncovered by Investigative Reporters", "A recently circulated dossier—allegedly compiled by a group of rogue investigative reporters—claims that some of the world’s most prominent economic forecasts are nothing but elaborate fabrications designed to mislead the public and manipulate market sentiment. According to this entirely fabricated report, influential think tanks and financial institutions have conspired to publish optimistic projections despite mounting evidence of economic instability. The dossier asserts that behind the scenes, a secretive committee of experts is altering data and suppressing negative information to maintain investor confidence and secure lucrative financial deals. Interviews quoted in the dossier (all of which are untraceable) describe how internal memos instruct analysts to “spin the narrative” during times of economic downturn. While mainstream economists and reputable media outlets have thoroughly debunked these claims, the narrative has found traction on social media and alternative news platforms. Critics argue that the story is a carefully constructed piece of misinformation aimed at sowing distrust in established economic institutions and their published forecasts.","economy"],
    ["Miracle Cure or Conspiracy? The Hidden Truth", "A bombshell report circulating in underground online communities alleges that a revolutionary “miracle cure” for multiple chronic illnesses has been discovered in a secret laboratory—but that the cure is being deliberately suppressed by powerful pharmaceutical interests. According to the fabricated narrative, researchers at a clandestine facility in an undisclosed location developed a treatment that can reverse conditions ranging from diabetes to autoimmune disorders. Whistleblowers (whose identities remain unverified) claim that multinational drug companies, fearing a catastrophic loss of profits, have conspired to bury the research and discredit its findings. Detailed, though entirely fictional, documents describe covert meetings between executives and government regulators where plans were hatched to discredit the miracle cure through a series of “controlled clinical failures.” Despite the dramatic claims, no reputable medical journal or regulatory agency has ever confirmed the existence of such a treatment. Nonetheless, the story has ignited fervent discussion among alternative health advocates and conspiracy theorists, with many calling for independent investigations into the alleged cover-up.","health"],
    ["Government-Secret Vaccines and the Hidden Agenda", "In a narrative that has rapidly spread through social media channels, unverified sources now claim that several governments have developed secret vaccines—not to combat diseases, but to implant mind-control nanobots in unsuspecting citizens. According to the entirely fabricated account, these covert vaccines were engineered in hidden research facilities and are being distributed covertly alongside routine immunizations. Insiders allege that top government officials have conspired with shadowy biotech firms to implement the program as part of a larger scheme to control public behavior and suppress dissent. Detailed but unverifiable “leaks” include diagrams of nanobot technology and supposed internal memos outlining the project’s phases. Public health authorities and independent scientists have dismissed the claims as absurd and lacking any empirical basis, yet the narrative continues to fuel heated debates online. The story has become a rallying cry for those suspicious of government overreach, even as experts warn that the entire account is a complete fabrication designed to stoke fear and mistrust in established health institutions.","health"],
    ["The Fabricated Epidemic That Never Was","A recent series of posts on fringe health forums has claimed that an epidemic sweeping the globe is nothing more than a carefully orchestrated fabrication by international health agencies. According to these unfounded accounts, the so-called outbreak of a novel virus was deliberately invented to enforce draconian public health measures and expand governmental control over citizens’ lives. Fabricated “data” presented in the posts—including manipulated graphs and fake expert testimonies—purports to show that infection rates were grossly exaggerated and that the virus was engineered in a laboratory as part of a secret experiment. Despite overwhelming evidence to the contrary provided by reputable global health organizations, the narrative has gained traction among communities predisposed to distrust official sources. Detractors of the mainstream narrative argue that the epidemic is a hoax designed to justify unprecedented restrictions on personal freedom. While scientists and public health experts have thoroughly debunked the claims, the rumor of a fabricated epidemic persists as one of the most controversial and persistent conspiracy theories in the health sphere.", "health"],
    ["Shadow Health Organization Controlling Treatments", "A startling claim emerging from anonymous online sources alleges that a clandestine organization, known only as the “Global Health Directorate,” is secretly controlling all aspects of medical research and treatment protocols worldwide. According to this entirely fabricated narrative, the Directorate operates behind the scenes to determine which diseases receive funding for research and which innovative treatments are suppressed to protect the interests of certain pharmaceutical giants. Leaked “internal documents” (all of which have been debunked by experts) supposedly reveal that this shadow group manipulates clinical trial outcomes and deliberately withholds breakthrough therapies from the public. One supposed insider explained that the Directorate’s ultimate goal is to monopolize global healthcare, ensuring that all new treatments funnel profits exclusively to a handful of powerful corporations. While mainstream scientists and healthcare professionals have dismissed these claims as pure fantasy, the idea of a hidden health organization continues to resonate with individuals suspicious of modern medicine and its regulatory framework.","health"],
    ["The Pseudoscientific Breakthrough that Shocked Experts", "A recent online buzz has centered on reports of a pseudoscientific breakthrough—allegedly discovered by a renegade group of researchers—that claims to reverse aging and cure terminal illnesses in a single treatment. According to the fabricated account, the breakthrough involves a complex combination of gene therapy and nanotechnology, developed in a secret laboratory hidden beneath an abandoned industrial complex. The story goes on to assert that leading medical experts worldwide have been silenced or coerced into keeping the discovery under wraps, with influential institutions allegedly colluding to protect lucrative existing treatments. Detailed but entirely spurious “research notes” and blurry laboratory images have been circulated to support the claim. Despite the sensational nature of the announcement, no peer-reviewed studies or independent verifications exist to corroborate the story. Health authorities and renowned scientists have categorically refuted the claims, calling the report a dangerous piece of misinformation designed to exploit public hopes for miraculous cures.","health"],
    ["The Mastermind Behind the Global Heist","In an astonishing tale that sounds straight out of a blockbuster movie, unverified sources have alleged the existence of a criminal mastermind orchestrating a series of sophisticated heists across multiple continents. Dubbed “The Phantom Thief” by underground circles, this enigmatic figure is said to have masterminded daring robberies targeting high-security financial institutions and luxury art galleries alike. According to the fabricated narrative, The Phantom Thief utilizes an intricate network of accomplices and cutting-edge technology to bypass state-of-the-art security systems. Leaked “confidential reports” (entirely unverifiable) claim that the mastermind’s operations are so meticulously planned that law enforcement agencies remain one step behind at every turn. One anonymous tipster described a dramatic scene in which the criminal escaped using an elaborate series of decoys and underground tunnels. Despite widespread media interest and online chatter, no credible evidence supports these claims, and authorities have repeatedly dismissed the story as an elaborate fabrication. Nonetheless, the legend of The Phantom Thief continues to capture the imagination of both criminals and crime enthusiasts.", "crime" ],
    ["The Cyber Syndicate and the Digital Black Market", "A series of posts on dark web forums has recently brought attention to an alleged cyber syndicate that is said to run an expansive digital black market, controlling vast networks of hackers and cybercriminals. According to the entirely fabricated narrative, this syndicate—known only as “Digital Dominion”—is responsible for orchestrating large-scale data breaches, identity thefts, and even orchestrated cyberattacks on critical infrastructure. The story details how Digital Dominion supposedly recruits skilled hackers from around the globe, providing them with state-of-the-art tools and secretive training in return for a share of their illicit profits. Leaked “evidence” in the form of anonymized chat logs and cryptic online transactions has fueled speculation about the syndicate’s influence over modern cybercrime. Despite the dramatic claims, no law enforcement agency has confirmed the existence of such an organization, and cybersecurity experts have dismissed the narrative as a myth designed to instill fear. Nevertheless, the notion of a centralized cybercriminal empire continues to spread rapidly among online communities, adding fuel to debates about digital security.","crime"],
    ["Fake Evidence Links Celebrity to Crime Ring", "A scandalous claim has emerged from questionable online sources alleging that a world-renowned celebrity is secretly involved in an international crime ring. According to the fabricated report, the star—whose identity remains deliberately vague—has been linked through a series of doctored documents, manipulated photographs, and untraceable phone recordings to an underground network involved in money laundering and arms trafficking. The narrative suggests that the celebrity’s public persona is merely a facade, carefully crafted to conceal a far more sinister involvement in organized crime. Despite the sensational nature of the claim, independent investigations by reputable outlets have found no supporting evidence, and multiple fact-checking organizations have debunked the story as a fabrication. Nonetheless, the tale has ignited fervent debate on social media, with supporters insisting that the “evidence” is being suppressed by powerful interests intent on protecting high-profile figures. Critics argue that the entire narrative is a calculated piece of misinformation designed to smear reputations and distract from real criminal investigations.","crime"],
    ["The Underworld’s Hidden Code of Silence", "Whispers from the criminal underworld have given rise to a fabricated narrative detailing an alleged secret code of silence that binds organized crime groups across continents. According to the entirely unverified account, this so-called “Code of Shadows” mandates that members of illicit organizations adhere to strict rules of non-disclosure about their operations, with severe—and entirely invented—consequences for any breaches. Leaked “testimonies” from anonymous ex-criminals (whose identities cannot be confirmed) claim that this code is enforced through a network of vigilante enforcers operating outside the law. The report further asserts that this clandestine system has allowed crime syndicates to thrive, coordinating complex operations such as international drug trafficking, cybercrimes, and high-stakes robberies without fear of exposure. While law enforcement officials have long acknowledged the existence of informal codes among criminals, no evidence has ever substantiated the detailed version of the Code of Shadows described in these posts. Nevertheless, the story has captured the public’s imagination, fueling both fear and fascination with the hidden rules of the underworld.","crime"],
    ["Alleged Supernatural Connection in Organized Crime", "In a bizarre twist that has stirred both intrigue and skepticism, unverified online sources claim that an otherworldly element is at work within organized crime circles. According to this fabricated narrative, certain notorious crime families are rumored to have forged secret pacts with mysterious, supernatural entities in exchange for uncanny success in their illicit endeavors. The story describes eerie rituals performed in abandoned warehouses under moonlit skies, where members of these crime families allegedly invoke ancient forces to secure their power and evade capture by authorities. Detailed but entirely fictional accounts include descriptions of cryptic symbols, mysterious chants, and inexplicable phenomena witnessed during criminal operations. While no credible evidence or expert testimony supports any supernatural involvement in crime, the tale has rapidly spread through niche internet forums and alternative news sites. Skeptics dismiss the narrative as pure fantasy, yet its persistence highlights the human tendency to weave extraordinary explanations around the most enigmatic and frightening aspects of criminal life.","crime"],
    ["AI-Driven Vote Rigging Uncovered", "A startling claim emerging from shadowy online sources alleges that recent elections in multiple countries were manipulated using advanced artificial intelligence systems designed specifically for vote rigging. According to the entirely fabricated report, an underground network of tech experts and political operatives developed a sophisticated AI program that could alter digital ballots and even sway public opinion through targeted disinformation campaigns. Leaked “internal communications” (all of which lack any credible origin) detail how this system was deployed during key electoral cycles to produce results favorable to a select group of political elites. The report asserts that the AI not only manipulated vote counts but also fabricated evidence of voter fraud to justify its interference. While election officials and independent watchdog organizations have vehemently denied any involvement of AI in vote manipulation, the narrative has ignited fierce debates online. Critics dismiss the allegations as modern myth-making, yet the idea of a clandestine, algorithm-driven election interference continues to find an audience among those distrusting traditional democratic processes.","elections"],
    ["Hidden Ballots and Phantom Voters", "In a narrative that has rapidly spread through fringe political blogs, unverified sources now claim that a secretive scheme involving hidden ballots and phantom voters was implemented during recent national elections. According to the fabricated account, shadow operatives allegedly inserted fake ballots into the voting system, and entirely fictitious voter identities were created to sway the outcome in key districts. Detailed but entirely false “evidence”—including manipulated voter records and doctored official documents—purports to show that thousands of non-existent citizens were added to the rolls, tipping the scales in favor of a prearranged result. The story asserts that these phantom voters were registered using advanced data manipulation techniques, and that the entire operation was coordinated from undisclosed headquarters by a covert group of political insiders. While election authorities have consistently maintained that voter registration and ballot counting were conducted transparently and accurately, the rumor of hidden ballots and ghost voters continues to spark controversy. Skeptics warn that such narratives are dangerous fabrications intended to undermine public confidence in democratic institutions.","elections"],
    ["The Secret Software Behind Election Fraud", "A fabricated exposé circulating on alternative news platforms alleges that the integrity of recent elections was compromised by secret software embedded in voting machines. According to the entirely unverified report, a rogue group of software engineers collaborated with political operatives to install a hidden program capable of altering vote totals in real time. Detailed descriptions in the report claim that the software was designed to target specific precincts and switch votes from opposition candidates to those favored by the conspirators. Anonymous “insiders” (whose identities remain unverifiable) provided screenshots and technical schematics to support the claim, though none have been authenticated by independent experts. Election officials have categorically denied any tampering with voting equipment, yet the narrative persists among groups that already harbor deep suspicions of electoral fraud. While mainstream media and cybersecurity professionals dismiss the allegations as a digital-age urban legend, the story has fueled ongoing debates about the security and transparency of modern voting systems.","elections"],
    ["International Conspiracy Alters Poll Results", "A sensational claim has emerged from obscure online communities alleging that an international conspiracy was behind the manipulation of poll results in recent elections. According to this fabricated narrative, a coalition of foreign intelligence agencies and political operatives conspired to alter vote tallies through covert operations, including hacking voting systems and deploying disinformation campaigns across borders. The report—supported by entirely unsubstantiated “leaked” documents and cryptic video footage—purports to show that the conspiracy was orchestrated from hidden command centers located in various parts of the world. Proponents of the story argue that the altered results were part of a larger plan to undermine national sovereignty and install puppet governments. Despite repeated denials from official election commissions and independent international observers, the narrative continues to gain traction among segments of the public already inclined to distrust electoral processes. Experts, however, maintain that there is no credible evidence of any such international interference, calling the story a complete fabrication designed to stoke geopolitical paranoia.","elections"],
    ["The Unseen Hand Steering Democracy", "In a final explosive installment of fabricated election conspiracies, unverified online sources claim that an unseen hand has been subtly steering democratic outcomes for decades. According to the entirely fictitious report, a secret cabal of influential figures—including undisclosed political advisors, wealthy oligarchs, and covert intelligence operatives—has been manipulating voter sentiment and election results from behind the scenes. Detailed accounts in the report describe how this cabal allegedly funds political campaigns, engineers media narratives, and even tampers with ballot-counting machines to ensure desired outcomes. The narrative is supported by a series of dubious “eyewitness” testimonies and manipulated documents that purport to reveal a long-standing pattern of covert intervention in democratic processes. While election experts and historians have long refuted such sweeping claims, the story of an unseen hand controlling the destiny of nations continues to resonate with those disillusioned by modern politics. Critics argue that the tale is a carefully constructed piece of misinformation intended to erode public trust in the very foundations of democracy.", "elections" ]
]
#DeepSeek output
deepseek_output = [
    ["World Leader Secretly Funds Alien Technology Research, Leaked Docs Claim", "A classified dossier allegedly reveals that the leader of a major European nation diverted €800 million in public defense funds to a clandestine extraterrestrial tech program. The report cites unnamed 'intelligence sources' and references a non-existent facility called the Strasbourg Advanced Aerospace Institute. Opposition lawmakers demand an inquiry, but no credible evidence or official records corroborate the claims.","politics"],
    ["Pacific Island Nation Declares War on Canada Over Fishing Rights","Fabricated diplomatic cables suggest the tiny nation of Maritana threatened military action against Canada after accusing it of illegal deep-sea trawling. The story cites a fake Global Oceanic Rights Council report and a fictional Maritanian official, 'Minister Koa Tala.' No such dispute exists, and Maritana is not a recognized country.","politics" ],
    ["UN Secretary-General Arrested for Espionage, Anonymous Sources Allege","An unsigned blog post claims UN Secretary-General António Guterres was detained in a joint CIA-Russian operation for “selling state secrets.” The article quotes a non-existent Interpol warrant and a phantom “Geneva Security Summit” attendee. The UN has debunked the story as baseless.", "politics" ],
    ["Secret Pact Reveals Plans to Merge US and Mexico into ‘North American Union’", "A fringe website alleges that President Biden and Mexican President López Obrador signed a treaty to dissolve borders by 2028, backed by a forged document bearing fake seals. The hoax cites the Institute for Continental Integration, a think-tank that does not exist.", "politics"],
    ["Australia’s PM Found to Have Dual Citizenship of Nonexistent Country", "A viral post asserts Australian Prime Minister Anthony Albanese holds citizenship in Veridia, a fictional island nation. The claim relies on a Photoshopped passport and a fabricated International Citizenship Database. Australia’s government confirmed no such country is recognized.", "politics"],
    ["Gold to Be Outlawed as Global Currency Shift Begins", "A conspiracy outlet warns that the World Financial Authority (WFA) will ban private gold ownership in 2024 to pave the way for a digital currency. The WFA is fictitious, and no such policy proposals exist from real entities like the IMF or World Bank.","economy"],
    ["China’s Economy Collapses After ‘Black Monday’ Stock Market Crash", "A fake news site reports a 40% plunge in Shanghai stocks, attributing it to a nonexistent “debt contagion.” The article quotes “economist Dr. Li Wen” and the Asian Fiscal Stability Board, both fabricated. Actual Chinese markets showed no unusual activity.","economy"],
    ["New Global Tax Will Charge 5% on All Online Purchases, UN Announces", "A fraudulent press release claims the UN approved a universal e-commerce tax to fund climate initiatives. The document references a non-existent resolution (UN-2023/TCX) and a fake UN department. The UN confirmed no such tax exists.","economy"],
    ["Bitcoin Banned Worldwide After Secret G7 Summit","A clickbait article alleges G7 leaders agreed to criminalize cryptocurrency transactions under a clandestine “Operation Blockchain Shield.” The story cites anonymous “G7 insiders” and a phantom regulatory body, the Global Digital Asset Bureau.", "economy" ],
    ["Major Bank Announces Negative Interest Rates for Savings Accounts", "A spoofed JPMorgan Chase memo circulating online claims the bank will charge customers 2% annually to hold savings. The fake notice includes a forged signature from CEO Jamie Dimon. JPMorgan denied the policy, calling it “pure fiction.”","economy"],
    ["Vaccine Causes Infertility in 70% of Recipients, Fake Study Claims", "A debunked paper from the fabricated European Medical Review falsely links COVID-19 vaccines to infertility. The study, authored by “Dr. Erik Voss” of the nonexistent Berlin Institute of Virology, cites anonymous patient surveys. No peer-reviewed research supports this.","health"],
    ["Deadly ‘Zombie Virus’ Spreads in South America, WHO Warns", "A hoax article describes a fictional outbreak of Cortazar Virus, causing “aggressive behavior and organ failure.” It quotes a fake WHO spokesperson, “Dr. Amara Singh,” and a non-existent health alert. The WHO confirmed no such virus exists.", "health"],
    ["Common Food Additive Linked to Brain Damage, Researchers Find", "A pseudoscientific blog claims titanium dioxide (E171) causes dementia, citing a fake Global Food Safety Alliance study. The article invents a “Dr. Lisa Tanaka” and misrepresents actual E171 research, which finds no such link.","health"],
    ["Cancer Cure Discovered in Mushroom Species, But Big Pharma Suppresses It", "A conspiracy theory alleges the Amazonian Luminescent Shroom eliminates tumors but is withheld by drug companies. The story references a nonexistent Journal of Oncology Advances paper and a fictional researcher, “Dr. Carlos Mendez.”","health"],
    ["Airborne HIV Variant Detected in Europe, Health Officials Panic", "A fabricated alert from the European Center for Disease Prevention warns of a mutated HIV strain spreading via coughs. The report cites fake case numbers in Spain and France. Actual HIV cannot transmit through airborne particles.","health"],
    ["AI-Powered Robots Commit $1 Billion Bank Heist in Singapore","A tabloid claims hackers deployed autonomous robots to loot the United Pacific Bank. The story quotes a nonexistent CyberCrime Task Force investigator, “Agent Maya Lee,” and provides no police reports or bank confirmations.","crime"],
    ["Serial Killer Targets Only Left-Handed Victims, Police Say","A false crime bulletin describes a fictional murderer dubbed “The Southpaw Slayer” operating in Argentina. The article cites a phantom Buenos Aires police captain, “Inspector Raul Gomez,” and fabricated victim profiles. No such cases exist.","crime"],
    ["Prison Break in Norway: 200 Inmates Escape Using Underground Tunnels","A sensationalized piece alleges inmates at Oslo’s Fjord Maximum Security Prison dug a mile-long tunnel. The story references a fake warden, “Henrik Dahl,” and includes AI-generated images of the escape. Norwegian authorities confirmed all prisons are secure.","crime"],
    ["Mafia Develops Invisible Drug Smuggling Drones, Interpol Warns","A conspiracy site reports organized crime groups using “cloaked drones” to traffic narcotics. The article cites an unnamed Interpol official and a nonexistent tech firm, StealthCargo Inc. Interpol denied issuing any such alert.","crime"],
    ["Celebrity Chef Kidnapped by Vegan Extremist Group","A fake news outlet claims Gordon Ramsay was abducted by the Vegan Justice Army demanding he stop serving meat. The hoax includes a forged ransom note and a fabricated spokesperson, “Ava Green.” Ramsay’s team confirmed his safety.","crime"],
    ["Voter Fraud Uncovered: 1 Million Fake Ballots Found in Warehouse","A far-right blog alleges a warehouse in Texas stored counterfeit ballots for the 2024 election. The story cites an anonymous “election integrity group” and a fake address. State officials confirmed no ballots were found.","elections"],
    ["Candidate Drops Out After Secret Love Child Scandal","A smear article accuses a fictional Canadian MP, “Sarah Clarke,” of concealing a child with a staffer. The piece uses a doctored photo and quotes a nonexistent tabloid, Ottawa Exposé. Clarke is not a real politician.","elections"],
    ["Foreign Agents Infiltrate Voting Systems in 12 States, FBI Claims","A disinformation campaign alleges Russian hackers compromised U.S. voting machines. The article references a fake FBI memo and a phantom cybersecurity firm, ShieldWall Analytics. The FBI stated no breaches occurred.","elections"],
    ["AI-Generated Candidate Wins Local Election in New Zealand", "A satirical claim repurposed as news states an AI persona named “Polly” won a mayoral race in Christchurch. The story cites a fake election commission report and a non-existent AI company, VoteBot Inc. No such election took place.","elections"],
    ["Election Postponed Indefinitely Due to ‘National Security Threat’","A fabricated emergency decree alleges India delayed its 2024 elections over a bogus “terror plot.” The article quotes a fictional home ministry official, “Rajeev Kapoor,” and provides no credible sources. Indian officials denied the claim.","elections"]
]
#add static values to track where each came from
chatgpt_output = [item + ["ChatGPT","chatgpt.com"] for item in chatgpt_output]
deepseek_output = [item + ["DeepSeek","chat.deepseek.com"] for item in deepseek_output]

#combine the outputs from the different LLMs
llm_output = chatgpt_output + deepseek_output

#create a DataFrame from the list
llm_df = pd.DataFrame(llm_output, columns=['title', 'text', 'category','site','url'])

#add the constants
llm_df['date'] = "February 4th"
llm_df['class'] = "fabricated"

#reorder the columns
llm_df = llm_df[['title', 'text', 'site', 'date', 'category', 'class', 'url']]

#concatenate all fabricated data
fabricated_dataset = pd.concat([fabricated_dataset, llm_df], ignore_index=True)
print("Length:", len(fabricated_dataset))

#store to CSV
fabricated_dataset.to_csv("fabricated_articles.csv", index=False)
all_scraped_content.append("fabricated_articles.csv")

### 2. Polarised content
Polarised content presents true events or facts selectively to promote a biased narrative, often omitting critical context.

##### Features:
- Partial Truth: The piece is based on a real event, statistic, or quote.
- Omission / Distortion: The content emphasizes certain facts while ignoring or minimizing others, creating a skewed impression.
- Strong Bias: The language or framing clearly supports one political, ideological, or partisan stance, rather than offering balanced coverage.

##### Label if:
- The article references real events but uses them to push a strong, one-sided narrative.
- The content focuses on data or testimonies that bolster a specific stance while disregarding contradictory evidence.
- The tone or style is heavily partisan and attempts to sway opinion by selective fact usage rather than outright fabrication.

##### Do Not Label if:
- The core facts are outright false (label as Fabricated).
- It is primarily personal opinion or commentary without strong factual references (label as Commentary).
- It is purely an attempt at persuasion or advertising without misrepresenting an event (label as Persuasive).

##### Sources:
- The Conservative Woman (UK, Media bias: far right) https://www.conservativewoman.co.uk/ (100 articles)
- The Canary (UK, Media bias: left) https://www.thecanary.co/uk/ (100 articles)
- Breitbart (USA, Media bias: far right) https://www.breitbart.com/ (100 articles)
- Daily Kos (USA, Media bias: far left) https://www.dailykos.com/ (100 articles)
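
The per-source quotas above add up to the 400 articles targeted for this class. The following is a minimal sketch of how that balance could be checked once scraping is finished. `EXPECTED_PER_SOURCE`, `check_source_balance`, and the row layout (a list of dicts with a `site` key) are assumptions made for illustration and are not part of the notebook's pipeline. The site names are also assumed; the `og:site_name` values returned by each site may differ.

```python
from collections import Counter

# Hypothetical quota table mirroring the source list above
EXPECTED_PER_SOURCE = {
    "The Conservative Woman": 100,
    "The Canary": 100,
    "Breitbart": 100,
    "Daily Kos": 100,
}

def check_source_balance(rows, expected=EXPECTED_PER_SOURCE):
    """Return {site: (actual, expected)} for every site that misses its quota."""
    counts = Counter(row["site"] for row in rows)
    return {
        site: (counts.get(site, 0), quota)
        for site, quota in expected.items()
        if counts.get(site, 0) != quota
    }

# Toy example: only two rows scraped so far
rows = [{"site": "The Canary"}, {"site": "Breitbart"}]
shortfalls = check_source_balance(rows)
print(shortfalls)
```

An empty result would indicate that every source contributed exactly its quota.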

**The Conservative Woman**

Articles were scraped from the weekly "Our Top Ten Articles of the Week" series, starting from the January 11, 2025 edition (https://www.conservativewoman.co.uk/tcw-our-top-ten-articles-of-the-week-9/), ending on the February 22 edition.

A large number of articles were skipped. "Features" and "Family and Faith" articles were skipped as they are not news. Many of the other articles did not meet the labelling criteria and fell under Commentary instead (for example, https://www.conservativewoman.co.uk/wind-turbines-and-a-voice-in-the-wilderness/). These were recognised primarily by a focus on "I" and "me" in the text.
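
The screening described above appears to have been done by reading each article. As a rough illustration only, the "I" and "me" signal could be approximated with a pronoun ratio. The function names, the pronoun set, and the 2% threshold below are all assumptions, not values used in this project.

```python
import re

# Assumed pronoun set; contractions such as "I'm" are deliberately not counted
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def first_person_ratio(text):
    """Share of word tokens that are first-person singular pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FIRST_PERSON)
    return hits / len(tokens)

def looks_like_commentary(text, threshold=0.02):
    """Flag text for manual review when the ratio reaches the (arbitrary) threshold."""
    return first_person_ratio(text) >= threshold

opinion = "I think the policy failed me and my family."
report = "The government announced a new policy on Tuesday."
print(looks_like_commentary(opinion), looks_like_commentary(report))
```

A check like this would only shortlist candidates; the final label would still need a human decision against the criteria listed earlier.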

In [None]:
def scrape_tcw_article(url):
    """
    Scrapes an article from a given URL on conservativewoman.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised", 
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        # Remove the trailing site name (slice so that only the suffix is dropped)
        suffix = " - The Conservative Woman"
        if article_data["title"].endswith(suffix):
            article_data["title"] = article_data["title"][:-len(suffix)]
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url 
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        yoast_script = soup.find("script", class_="yoast-schema-graph", type="application/ld+json")
        if yoast_script:
            try:
                data_json = json.loads(yoast_script.string)
                for node in data_json.get("@graph", []):
                    if node.get("@type") == "Article":
                        art_sec = node.get("articleSection", None)
                        if art_sec:
                            if isinstance(art_sec, list):
                                article_data["category"] = art_sec[0]
                            else:
                                article_data["category"] = art_sec
                        break
            except json.JSONDecodeError:
                print("Could not parse the JSON-LD correctly.")
        
        # Article copy
        content_div = soup.find("div", class_=lambda c: c and "td-post-content" in c)
        if content_div:
            # Collect paragraphs
            paragraphs = content_div.find_all("p")
            text_list = []
            for p in paragraphs:
                text = p.get_text(strip=True)
                # End before the donation paragraph
                if text.startswith("If you appreciated this article, perhaps you might consider making a donation"):
                    break  
                text_list.append(text)
            #join all paragraphs together
            full_text = " ".join(text_list).strip()
            # Remove web addresses using a regex
            full_text = re.sub(r'https?://\S+', '', full_text)    
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    return article_data

scrape_articles("conservativewoman.txt", scrape_tcw_article, "polarised_scraped_articles_tcw.csv")

**The Canary**

Articles were scraped from the UK section of The Canary (https://www.thecanary.co/uk/) from newest to oldest. The articles date from January 7th to January 29th, 2025. Five articles were excluded for not meeting the labelling criteria (petition drives and advertorials).
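
The stated date range can be sanity-checked against the `article:published_time` values collected by the scraper. This is a minimal sketch, assuming those values are ISO 8601 timestamps with a numeric UTC offset; `in_date_window` is a hypothetical helper, and placeholder strings such as "Published date not found" would raise a `ValueError` and need filtering out first.

```python
from datetime import date, datetime

def in_date_window(published, start, end):
    """True if an ISO 8601 timestamp falls on a day within [start, end]."""
    return start <= datetime.fromisoformat(published).date() <= end

# Window taken from the description above
start, end = date(2025, 1, 7), date(2025, 1, 29)

# Illustrative timestamps, not taken from the scraped data
print(in_date_window("2025-01-15T09:30:00+00:00", start, end))
print(in_date_window("2025-02-01T00:00:00+00:00", start, end))
```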

In [None]:
def scrape_can_article(url):
    """
    Scrapes an article from a given URL on https://www.thecanary.co/uk/ and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Remove tweet embeds
            for twitter_blockquote in soup.find_all('blockquote', class_='twitter-tweet'):
                twitter_blockquote.decompose()
            # Remove ad elements
            for ads_div in soup.find_all('div', class_='ads_google_ads'):
                ads_div.decompose()

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
            
            # Category
            category_found = None
            yoast_script = soup.find('script', class_='yoast-schema-graph', type='application/ld+json')
            if yoast_script:
                try:
                    yoast_data = json.loads(yoast_script.string)
                    for item in yoast_data.get('@graph', []):
                        if item.get('@type') == 'NewsArticle':
                            section = item.get('articleSection')
                            if section:
                                if isinstance(section, list) and len(section) > 0:
                                    category_found = section[0].strip()
                                elif isinstance(section, str):
                                    category_found = section.strip()
                                break
                except json.JSONDecodeError:
                    pass
            # If we never found a category, use a default
            if category_found:
                article_data["category"] = category_found
            else:
                article_data["category"] = "Category not found"
            
            # Article copy
            article_body = soup.find('div', class_='jeg_inner_content')
            featured_image_patterns = [
                re.compile(r'^Featured image via .*$', re.IGNORECASE),
                re.compile(r'^Featured image supplied', re.IGNORECASE),
                re.compile(r'^Featured image and additional images via .*$', re.IGNORECASE),
                re.compile(r'^Featured image and additional images supplied$', re.IGNORECASE)
            ]
            if article_body:
                paragraphs = article_body.find_all('p')
                text_content = []
                
                for p in paragraphs:
                    p_text = p.get_text().strip()
                    # Skip featured-image credit lines instead of reading a decomposed tag
                    if any(pattern.match(p_text) for pattern in featured_image_patterns):
                        continue
                    if p_text:
                        text_content.append(p_text)
                
                full_article = " ".join(text_content) if text_content else "Article content not found"
                article_data["text"] = clean_text(full_article)
            else:
                article_data["text"] = "Article content not found"
        
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")
    return article_data


scrape_articles("canary.txt", scrape_can_article, "polarised_scraped_articles_can.csv")

**Breitbart**

Articles were scraped from the News section in reverse chronological order: https://www.breitbart.com/news/source/breitbart-news/ Articles with a category of "clips" or "radio" were excluded as they are media content. The articles date from January 17th to January 20th, 2025.
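
The category exclusion described above can be expressed as a simple filter. This is a sketch under assumptions: `EXCLUDED_CATEGORIES` and `keep_article` are hypothetical names, the sample rows are invented, and in the notebook the exclusion may well have been applied when compiling the URL list rather than after scraping.

```python
# Categories treated as media content rather than articles
EXCLUDED_CATEGORIES = {"clips", "radio"}

def keep_article(article, excluded=EXCLUDED_CATEGORIES):
    """Keep an article unless its category is one of the media-only ones."""
    category = article.get("category", "").strip().lower()
    return category not in excluded

# Invented sample rows using the same keys as the scraper's article_data dict
articles = [
    {"title": "a", "category": "Politics"},
    {"title": "b", "category": "Clips"},
    {"title": "c", "category": "Radio"},
]
kept = [article for article in articles if keep_article(article)]
print([article["title"] for article in kept])
```

Lower-casing the category before the comparison keeps the filter robust to capitalisation differences in the site's metadata.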

In [None]:
def scrape_bb_article(url):
    """
    Scrapes an article from a given URL on https://www.breitbart.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }
    
    max_retries = 5
    retries = 0
    base_sleep = 6  # base sleep time in seconds
    
    
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
    ]
    headers = {'User-Agent': random.choice(user_agents)}
    while retries < max_retries:

        try:
            print(url)
            time.sleep(random.uniform(6, 30))
            response = requests.get(url, headers = headers)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')

                # Title
                title_meta = soup.find('meta', property='og:title')
                article_data["title"] = title_meta['content'] if title_meta else "Title not found"

                # URL
                url_meta = soup.find('meta', property='og:url')
                article_data["url"] = url_meta['content'] if url_meta else url

                # Site name
                site_name_meta = soup.find('meta', property='og:site_name')
                article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"

                # Published date
                published_date_meta = soup.find('meta', property='article:published_time')
                article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"

                # Category
                cat_meta = soup.find('meta', property='article:categories')
                if cat_meta and cat_meta.get('content'):
                    article_data["category"] = cat_meta['content'].split(',')[0]
                else:
                    article_data["category"] = "No category found"

                # Article copy
                main_content = soup.find('div', class_='entry-content')
                if main_content:
                    # Remove tweets
                    tweet_iframes = main_content.find_all('iframe', class_='bnn-if-tweet')
                    for tw in tweet_iframes:
                        tw.decompose()
                    # Remove images and captions
                    image_captions = main_content.find_all("div", class_="wp-caption aligncenter")
                    for div in image_captions:
                        div.decompose()
                    # Remove reporter promo paragraph
                    follow_pattern = re.compile(
                        r'\bfollow\b.*?(facebook|twitter|instagram|truth\s*social|\bx\b|@[a-z0-9_.-]+|email)',
                        re.IGNORECASE
                    )
                    all_paras = main_content.find_all("p")
                    for p in all_paras:
                        para_text = p.get_text(strip=True)
                        if follow_pattern.search(para_text):
                            p.decompose()
                        elif "reporter for Breitbart News" in para_text:
                            p.decompose()
                        elif "Breitbart News Daily airs on SiriusXM" in para_text:
                            p.decompose()
                        elif "Order your copy today" in para_text:
                            p.decompose()

                    # Extract text
                    raw_text = main_content.get_text(separator=" ", strip=True)

                    article_data["text"] = clean_text(raw_text)
                else:
                    article_data["text"] = "Article body not found"
                # Successfully fetched and processed the article
                break
            elif response.status_code in (429, 503):
                retry_after = response.headers.get("Retry-After")
                # Retry-After may be an HTTP date rather than a number of seconds
                if retry_after and retry_after.isdigit():
                    sleep_time = int(retry_after)
                else:
                    sleep_time = base_sleep * (2 ** retries)
                print(f"Received {response.status_code} error. Retrying in {sleep_time} seconds...")
                time.sleep(sleep_time)
                retries += 1
            else:
                print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
                break

        except Exception as e:
            print(f"An error occurred: {e}")
            break
    
    if retries == max_retries:
        print(f"Max retries reached for {url}. Skipping article.") 

    return article_data

scrape_articles("breitbart.txt", scrape_bb_article, "polarised_scraped_articles_bb.csv")
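
The retry branch in `scrape_bb_article` waits for the server's `Retry-After` value when one is sent and otherwise doubles the delay on each attempt. As a minimal sketch, that calculation can be isolated in a small helper. `backoff_delay` and its default `base_sleep=5` are hypothetical (they do not appear elsewhere in this notebook), and falling back to the exponential delay when `Retry-After` is not a whole number of seconds is an assumption.

```python
def backoff_delay(retries, base_sleep=5, retry_after=None):
    """Return the number of seconds to wait before the next attempt.

    Hypothetical helper: the name and the default base_sleep are assumptions.
    """
    if retry_after is not None:
        try:
            # Retry-After is usually a whole number of seconds
            return max(0, int(retry_after))
        except ValueError:
            # It can also be an HTTP date; fall through to exponential backoff
            pass
    # Double the delay on every retry: base, 2 * base, 4 * base, ...
    return base_sleep * (2 ** retries)


delays = [backoff_delay(attempt) for attempt in range(4)]
print(delays)
```

Keeping the calculation separate from `time.sleep` makes it possible to check the delays without waiting for them.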

In [None]:
def scrape_kos_article(url):
    """
    Scrapes an article from a given URL on https://www.dailykos.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            if not published_date_meta:
                timestamp_span = soup.select_one(".story__timestamp span.timestamp")
                if timestamp_span and 'data-epoch-time' in timestamp_span.attrs:
                    # Convert timestamp to human-readable date
                    epoch_time = int(timestamp_span['data-epoch-time']) / 1000  
                    human_readable_date = datetime.utcfromtimestamp(epoch_time).strftime('%Y-%m-%d %H:%M:%S')
                    article_data["date"] = human_readable_date
                else:
                    article_data["date"] = "Published date not found"
            else:
                article_data["date"] = published_date_meta['content']
                
            # Category
            category_meta = soup.find('meta', property='article:section')
            article_data["category"] = category_meta['content'] if category_meta else "Category not found"

            # Article text
            story_content_divs = [
                div for div in soup.find_all('div', class_='story__text')
                if 'placeholder' not in div.get('class', [])
            ]
            
            if story_content_divs:
                paragraphs = []
                exclusion_phrases = [
                    "Donate now to support",
                    "Join us on Bluesky", "Bluesky Starter Pack", "staff accounts on Bluesky", "Daily Kos is on Bluesky",
                    "Your reader support means everything", "please donate just $3", 
                    "value having free and reliable access", "Daily Kos is supported by readers like you.", "Can you chip in today?"
                ]
                
                for div in story_content_divs:
                    for p in div.find_all('p', recursive=False):
                        text = p.get_text(strip=True)
                        if not any(phrase in text for phrase in exclusion_phrases):
                            paragraphs.append(text)
                
                raw_article = ' '.join(paragraphs)
                article_data["text"] = clean_text(raw_article)
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data

scrape_articles("kos.txt", scrape_kos_article, "polarised_scraped_articles_kos.csv")
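
When the `article:published_time` tag is missing, the Daily Kos scraper falls back to the `data-epoch-time` attribute, which holds milliseconds since the Unix epoch. The sketch below shows the same conversion with a timezone-aware call, since `datetime.utcfromtimestamp` is deprecated from Python 3.12 onwards. The helper name `epoch_ms_to_utc_string` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc_string(epoch_ms):
    """Convert a millisecond epoch value (str or int) to 'YYYY-MM-DD HH:MM:SS' in UTC."""
    # The attribute is in milliseconds, so scale it down to seconds first
    seconds = int(epoch_ms) / 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')


print(epoch_ms_to_utc_string("1700000000000"))
```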

### 3. Satire
Satirical content is intended to entertain or provoke thought through humour, exaggeration, or irony. Satire is often mistaken for factual reporting.

##### Features:

- Humorous or Exaggerated Tone: Content is typically marked by wit, parody, or absurdity.
- Intentional Ridiculousness: The story is meant to be funny, not factual; outlandish claims serve comedic purposes.

##### Label If:

- The piece’s goal is clearly comedic or parodic, rather than deceptive.
- The tone, language, or disclaimers indicate it’s intentionally satirical.

##### Do Not Label If:

- The piece uses humour but is still intended to mislead (label as Fabricated Content).
- The piece is comedic but still pushing a heavily skewed narrative as if it’s true (label as Polarised Content).

##### Sources:
- The Onion (USA - 55 articles)
- Babylon Bee (USA - 50 articles)
- The Daily Squib (UK - 45 articles)
- Waterford Whispers (IE - 50 articles)


**The Onion**

The articles scraped are those featured in the 2024 "Annual Year" post found here: https://theonion.com/our-annual-year-2024/. The top 5 from each month were chosen (image posts are excluded as per the project scope), giving a total of 55 articles because December is excluded. The remaining 45 articles come from the standard ratings hierarchy found here: https://theonion.com/latest/

In [None]:
def scrape_onion_article(url):
    """
    Scrapes an article from a given URL on theonion.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", 
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url 
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        category_element = soup.find('div', class_='taxonomy-category')
        category_link = category_element.find('a') if category_element else None
        article_data["category"] = category_link.text.strip() if category_link else "Category not found"
        
        # Article copy
        content_div = soup.find(
            "div",
            {"class": lambda x: x and "entry-content" in x and "single-post-content" in x}
        )
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("onion.txt", scrape_onion_article, "satire_scraped_articles_onion.csv")

**Babylon Bee**

Articles from the Greatest Hits page (https://babylonbee.com/news?sort=greatest-hits) have been scraped. The categories "Christian Living" and "Scripture" were excluded for being too niche. The articles range from 2017 to 2022. The final 15 came from the trending news section (https://babylonbee.com/news?sort=buzzing), all from January to February 2025.


In [None]:
def scrape_bee_article(url):
    """
    Scrapes an article from a given URL on babylonbee.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print (soup)
        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url 
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date       
        published_date_meta = soup.find('meta', {"name": "published_at"})
        if published_date_meta and published_date_meta.get("content"):
            article_data["date"] = published_date_meta["content"].split()[0]
        else:
            article_data["date"] = "Published date not found"
        
        # Category
        category_link = soup.find("a", href=lambda href: href and "/news/categories/" in href)
        if category_link:
            article_data["category"] = category_link.get_text(strip=True)
        else:
            article_data["category"] = "Category not found"
            
        # Article copy
        content_div = soup.select_one("div.article-content")
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("bee.txt", scrape_bee_article, "satire_scraped_articles_bee.csv")
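
The satire scrapers trim timestamps to a date in two ways: the Babylon Bee value is split on whitespace, while the Daily Squib value is split on the ISO 8601 `T` separator. A minimal sketch of one helper that covers both shapes is shown below. `date_only` is a hypothetical name, and the approach assumes the value starts with the date portion.

```python
def date_only(value, default="Published date not found"):
    """Return the leading date token of a timestamp string, or a default if empty."""
    if not value or not value.strip():
        return default
    # Treat the ISO 8601 "T" separator like a space, then keep the first token
    return value.strip().replace("T", " ").split()[0]


print(date_only("2021-05-04T10:15:00+00:00"))
print(date_only("2021-05-04 10:15:00"))
```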

**The Daily Squib**

100 articles were taken from the "Most Popular" page: https://www.dailysquib.co.uk/category/most-popular

In [None]:
def scrape_squib_article(url):
    """
    Scrapes an article from a given URL on dailysquib.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date        
        published_meta = soup.find("meta", property="article:published_time")
        if published_meta and published_meta.get("content"):
            article_data["date"] = published_meta["content"].split("T")[0]
        else:
            article_data["date"] = "Published date not found"
        
        # Category
        category_div = soup.find("div", class_="tdb-category td-fix-index")
        if category_div:
            cat_links = category_div.find_all("a", class_="tdb-entry-category")
            if cat_links:
                categories = [
                    # ignore "most popular"
                    a.get_text(strip=True) for a in cat_links if a.get_text(strip=True).lower() != "most popular"
                ]
                # if multiple categories, return the first; fall back if only "most popular" was present
                article_data["category"] = categories[0] if categories else "Category not found"
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Extract the article text
        content_div = soup.find("div", class_="td-post-content")
        
        if content_div:
            # remove blockquotes (e.g. embedded tweets)
            for bq in content_div.find_all("blockquote"):
                bq.decompose()
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = clean_text(full_text.strip())
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("squib.txt", scrape_squib_article, "satire_scraped_articles_squib.csv")
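
Both the Daily Squib and Waterford Whispers scrapers take the first category that is not a display-only tag such as "Most Popular". The sketch below expresses that selection as a pure function with a fallback for the case where every category is excluded. `first_valid_category` is a hypothetical helper and the sample category names are made up for illustration.

```python
def first_valid_category(names, excluded, default="Category not found"):
    """Return the first category name not in `excluded` (case-insensitive)."""
    excluded_lower = {name.lower() for name in excluded}
    for name in names:
        cleaned = name.strip()
        if cleaned and cleaned.lower() not in excluded_lower:
            return cleaned
    # Nothing usable was found, so report that explicitly
    return default


print(first_valid_category(["Most Popular", "Politics"], {"most popular"}))
print(first_valid_category(["Most Popular"], {"most popular"}))
```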

**Waterford Whispers**

100 articles were taken from the homepage (https://waterfordwhispersnews.com/), sorted from most recent to least recent.

In [None]:
def scrape_whispers_article(url):
    """
    Scrapes an article from a given URL on waterfordwhispersnews.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date 
        date_div = soup.find("div", class_="post-date", itemprop="datePublished")
        if date_div:
            article_data["date"] = date_div.get_text(strip=True)
        else:
            article_data["date"] = "Date not found"
 
        # Category (excluding the ones used just for web display)
        excluded_categories = {"breaking news", "featured-one", "featured-two", "featured-three","homepage"}
        category_div = soup.find("div", class_="post-category")
        if category_div:
            all_cats = [a.get_text(strip=True) for a in category_div.find_all("a")]
            valid_cats = [cat for cat in all_cats
                          if cat.lower() not in excluded_categories]
            if valid_cats:
                article_data["category"] = valid_cats[0]
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Article copy
        content_div = soup.find("div", class_="article-content", itemprop="articleBody")
        if content_div:
            for p_tag in content_div.find_all("p"):
                p_text = p_tag.get_text(strip=True).lower()
                # remove marketing snippets
                if "check out our shop." in p_text or "www.waterfordwhispers.shop" in p_text or "buy some of our merch here" in p_text or "help us to keep pissing off all the right people" in p_text:
                    p_tag.decompose()

            # remove blockquotes
            for bq in content_div.find_all("blockquote"):
                bq.decompose()

            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = clean_text(full_text.strip())
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("whispers.txt", scrape_whispers_article, "satire_scraped_articles_whispers.csv")
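
The Waterford Whispers scraper drops any paragraph containing a known marketing snippet before joining the article text. As a rough sketch, the same filter can be applied to plain strings, which makes the rule easy to check in isolation. `drop_marketing` is a hypothetical helper, only a subset of the snippets is listed, and the sample paragraphs are invented.

```python
MARKETING_SNIPPETS = (
    "check out our shop.",
    "www.waterfordwhispers.shop",
    "buy some of our merch here",
)


def drop_marketing(paragraphs, snippets=MARKETING_SNIPPETS):
    """Keep only paragraphs that contain none of the marketing snippets."""
    kept = []
    for text in paragraphs:
        # Snippets are stored in lower case, so compare against lower-cased text
        lowered = text.lower()
        if any(snippet in lowered for snippet in snippets):
            continue
        kept.append(text)
    return kept


sample = [
    "First paragraph of the story.",
    "Check out our shop. New mugs in stock!",
    "Second paragraph.",
]
print(drop_marketing(sample))
```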

### 4. Commentary
Opinion-based content reflecting the writer’s viewpoint, often presenting personal interpretation with limited factual grounding.

##### Features:

- Personal Interpretation: The writer’s subjective opinions or experiences form the core of the content.
- Limited Fact-Checking: Minimal reliance on verified data; opinions may be framed as personal reflections or “takes.”
- Editorial or Opinion Section: Typically appears in editorial pages, op-eds, blogs, or similar formats clearly labelled as opinion.

##### Label If:

- The text is primarily an opinion piece discussing how the author feels about an event, topic, or policy.
- The author uses subjective language (e.g., “I believe…,” “In my view…”) rather than objective reporting.

##### Do Not Label If:

- The commentary deliberately misrepresents facts to persuade or manipulates partial truths (label as Polarised).
- The commentary is disguised marketing or propaganda with a clear persuasive goal (label as Persuasive).

##### Sources:
- www.washingtonexaminer.com (100) (Right wing leaning)
- https://www.nature.com/opinion (50) (Science focused, arguably left wing leaning)
- https://www.rollingstone.com/politics/political-commentary (100) (Left wing leaning, political focus)
- https://www.theguardian.com/uk/commentisfree (50) (Left wing leaning)
- https://europeanconservative.com/commentary/ (100) (Right wing leaning)
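
The counts above can be tallied to check them against the target of 400 items per class and to see how the sources split by political leaning. This is only a quick sanity check using the figures from the list. Nature is counted as left leaning here, although the list describes that only as arguable.

```python
# (site, number of articles, leaning) as listed in the sources above
commentary_sources = [
    ("washingtonexaminer.com", 100, "right"),
    ("nature.com", 50, "left"),  # described above as only arguably left leaning
    ("rollingstone.com", 100, "left"),
    ("theguardian.com", 50, "left"),
    ("europeanconservative.com", 100, "right"),
]

TARGET_PER_CLASS = 400  # balanced-dataset target stated earlier in the notebook

total = sum(count for _, count, _ in commentary_sources)

# Sum the article counts per leaning
by_leaning = {}
for _, count, leaning in commentary_sources:
    by_leaning[leaning] = by_leaning.get(leaning, 0) + count

print(total, total == TARGET_PER_CLASS)
print(by_leaning)
```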

#### Washington Examiner
Articles scraped from https://www.washingtonexaminer.com/section/opinion/

In [None]:
def scrape_washexam_article(url):
    """
    Scrapes an article from a given URL on washingtonexaminer.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print(soup)
        
        # Title
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else "Title not found"
        if " - Washington Examiner" in title:
            title = title.replace(" - Washington Examiner", "").strip()
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        site = site_name_meta['content'] if site_name_meta else "Site name not found"
        site = site.split(" - ")[0].strip()  
        article_data["site"] = site
        
        # Published date (from the meta tag)
        pub_date = soup.find("meta", property="article:published_time")
        if pub_date:
            article_data["date"] = pub_date.get("content", "").strip()
        else:
            article_data["date"] = "Date not found"
        
        article_body = soup.find("div", class_="td-post-content")
        if article_body:
            # Remove all <figure>
            for figure in article_body.find_all("figure"):
                figure.decompose()
            # Remove any <a> tag with "read more from"
            for a in article_body.find_all("a"):
                a_text = a.get_text(strip=True)
                if re.match(r"^click\s+here\s+to\s+read\s+more\s+from", a_text, flags=re.IGNORECASE):
                    a.decompose()
            # Get the text
            raw_text = article_body.get_text(separator=" ", strip=True)
            # Clean the text
            cleaned_text = clean_text(raw_text)
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("wexam.txt", scrape_washexam_article, "commentary_scraped_articles_washexam.csv")
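
The Washington Examiner scraper tidies two things: the site name appended to the `og:title` value and the "click here to read more from" promo links. The sketch below reproduces both checks on plain strings. The helper names are hypothetical, and `strip_site_suffix` only removes the suffix at the end of the title, which is a slightly stricter variation on the `replace` call used in the scraper.

```python
import re

# Same pattern as in the scraper: matches promo link text at the start of the string
READ_MORE = re.compile(r"^click\s+here\s+to\s+read\s+more\s+from", re.IGNORECASE)


def strip_site_suffix(title, suffix=" - Washington Examiner"):
    """Remove a trailing site-name suffix from a title, if present."""
    if title.endswith(suffix):
        return title[:-len(suffix)].strip()
    return title.strip()


def is_read_more_link(text):
    """Return True if the link text is a 'click here to read more from' promo."""
    return READ_MORE.match(text.strip()) is not None


print(strip_site_suffix("Opinion headline - Washington Examiner"))
print(is_read_more_link("CLICK HERE TO READ MORE FROM THE WASHINGTON EXAMINER"))
```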

#### Nature
Articles scraped from https://www.nature.com/nature/articles?type=editorial

In [None]:
def scrape_nat_article(url):
    """
    Scrapes an article from a given URL on nature.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print (soup)
        
        # Title
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else (soup.title.string if soup.title else "Title not found")
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        if site_name_meta:
            site = site_name_meta['content']
        else:
            twitter_site = soup.find('meta', attrs={'name': 'twitter:site'})
            if twitter_site:
                site = twitter_site['content']
                if site.startswith("@"):
                    site = site[1:]
            else:
                site = "Site name not found"
        article_data["site"] = site
        
        # Get published date
        pub_date_meta = soup.find("meta", property="article:published_time")
        if pub_date_meta:
            article_data["date"] = pub_date_meta.get("content", "").strip()
        else:
            ld_script = soup.find("script", type="application/ld+json")
            if ld_script:
                try:
                    ld_json = json.loads(ld_script.string)
                    if isinstance(ld_json, list):
                        ld_json = ld_json[0]
                    if "mainEntity" in ld_json and "datePublished" in ld_json["mainEntity"]:
                        article_data["date"] = ld_json["mainEntity"]["datePublished"]
                    elif "datePublished" in ld_json:
                        article_data["date"] = ld_json["datePublished"]
                    else:
                        article_data["date"] = "Date not found"
                except Exception:
                    article_data["date"] = "Date not found"
            else:
                article_data["date"] = "Date not found"
        
        # get category from li tag
        cat_li = soup.find('li', attrs={'data-test': 'article-category'})
        if cat_li:
            cat_span = cat_li.find('span', class_='c-article-identifiers__type')
            if cat_span:
                article_data["category"] = cat_span.get_text(strip=True)
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"
        
        # Article body
        article_body = soup.find("div", class_=lambda c: c and "c-article-body" in c)
        if article_body:
            # Remove all <figure>
            for figure in article_body.find_all("figure"):
                figure.decompose()
            
            # Remove header title and teaser text if there
            header_title_elem = article_body.find("h1", class_="c-article-magazine-title")
            if header_title_elem:
                header_title_elem.decompose()
            teaser_elem = article_body.find("div", class_="c-article-teaser-text")
            if teaser_elem:
                teaser_elem.decompose()
            
            # Grab all paragraphs
            paragraphs = article_body.find_all("p")
            header_title = ""
            teaser_text = ""
            ext_header = soup.find("h1", class_="c-article-magazine-title")
            if ext_header:
                header_title = ext_header.get_text(strip=True).lower()
            ext_teaser = soup.find("div", class_="c-article-teaser-text")
            if ext_teaser:
                teaser_text = ext_teaser.get_text(strip=True).lower()
            
            article_paragraphs = []
            for p in paragraphs:
                p_text = p.get_text(separator=" ", strip=True)
                lower_text = p_text.lower()
                # Skip paragraphs with header title or teaser text
                if header_title and header_title in lower_text:
                    continue
                if teaser_text and teaser_text in lower_text:
                    continue
                article_paragraphs.append(p_text)
            
            # Join paragraphs
            article_text = " ".join(article_paragraphs)
            cleaned_text = clean_text(article_text)
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("nature.txt", scrape_nat_article, "commentary_scraped_articles_nat.csv")
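
When Nature pages lack the `article:published_time` tag, the scraper reads `datePublished` from the JSON-LD block, looking first inside `mainEntity` and then at the top level. A minimal sketch of that lookup on a raw JSON string follows. `extract_date_published` is a hypothetical helper and the sample payloads are invented to show the two shapes.

```python
import json


def extract_date_published(raw, default="Date not found"):
    """Read datePublished from a JSON-LD string, preferring mainEntity."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # Missing script content or malformed JSON
        return default
    if isinstance(data, list):
        # Use the first entry, mirroring the scraper
        data = data[0] if data else {}
    if not isinstance(data, dict):
        return default
    main_entity = data.get("mainEntity")
    if isinstance(main_entity, dict) and "datePublished" in main_entity:
        return main_entity["datePublished"]
    return data.get("datePublished", default)


print(extract_date_published('{"mainEntity": {"datePublished": "2025-01-15"}}'))
print(extract_date_published('[{"datePublished": "2024-12-01"}]'))
```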

#### Rolling Stone
Articles scraped from https://www.rollingstone.com/politics/political-commentary/

In [None]:
def scrape_stone_article(url):
    """
    Scrapes an article from a given URL on rollingstone.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the page"}
    
    soup = BeautifulSoup(response.text, "html.parser")
    #print (soup)
    
    #get title
    title_tag = soup.find("meta", property="og:title")
    title = title_tag["content"] if title_tag and title_tag.has_attr("content") else "Title not found"
    article_data["title"] = title
    
    #get date
    published_tag = soup.find("meta", property="article:published_time")
    published_date = published_tag["content"] if published_tag and published_tag.has_attr("content") else "Date not found"
    article_data["date"] = published_date
    
    #get site
    site_tag = soup.find("meta", property="og:site_name")
    article_data["site"] = site_tag["content"] if site_tag and site_tag.has_attr("content") else "Site not found"
    
    #get category
    category_found = None
    ld_json_scripts = soup.find_all("script", type="application/ld+json")
    for script in ld_json_scripts:
        try:
            data = json.loads(script.string)
            if isinstance(data, dict):
                if "articleSection" in data:
                    category_found = data["articleSection"]
                    break
            elif isinstance(data, list):
                for item in data:
                    if isinstance(item, dict) and "articleSection" in item:
                        category_found = item["articleSection"]
                        break
                if category_found:
                    break
        except Exception:
            continue
    article_data["category"] = category_found if category_found else "Category not found"
    
    #get article
    # Remove ad blocks
    for ad in soup.find_all("div", class_="admz"):
        ad.decompose()

    # Find the container that holds the article body.
    article_container = soup.find("div", class_="pmc-paywall")
    if not article_container:
        return {"error": "Article container not found"}
    
    # Remove the editors pick widget
    for section in article_container.find_all("section", class_=lambda x: x and "editors-pick-module" in x):
        section.decompose()
    
    # Remove the related content widget
    for section in article_container.find_all("section", class_=lambda x: x and "recirculation-modules" in x):
        section.decompose()
    
    # Find and join the article paragraphs
    paragraphs = article_container.find_all("p", class_=lambda x: x and "paragraph" in x)
    article_text = " ".join(p.get_text(separator=" ", strip=True) for p in paragraphs)
    final_text = clean_text(article_text)
    article_data["text"] = final_text
    #print("Final text")
    #print(final_text)
    
    return article_data

scrape_articles("stone.txt", scrape_stone_article, "commentary_scraped_articles_stone.csv")
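
The JSON-LD category lookup used in `scrape_stone_article` can be isolated into a small helper, which makes it easier to check without fetching a page. This is a minimal sketch, not the code used to build the dataset. `find_article_section` is a hypothetical name, and it assumes the raw text of each `application/ld+json` script (for example `script.string`) has already been collected into a list.

```python
import json


def find_article_section(raw_blocks):
    """Return the first "articleSection" value found in JSON-LD blocks, or None.

    raw_blocks is an iterable of raw JSON strings (an assumed input shape);
    entries may be None when a script tag has no text.
    """
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except (TypeError, ValueError):
            # None or malformed JSON: skip this block
            continue
        # JSON-LD may be a single object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "articleSection" in item:
                return item["articleSection"]
    return None


# Made-up blocks for illustration only
blocks = [
    None,
    "{not valid json",
    '[{"@type": "WebSite"}, {"@type": "NewsArticle", "articleSection": "Politics"}]',
]
print(find_article_section(blocks))
```

Returning `None` rather than a placeholder string leaves the choice of fallback text (such as "Category not found") to the caller.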

#### The Guardian
Articles scraped from https://www.theguardian.com/uk/commentisfree

In [None]:
def scrape_guard_article(url):
    """
    Scrapes an article from a given URL on theguardian.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the page"}
    
    soup = BeautifulSoup(response.text, "html.parser")
    #print (soup)
    
    #get title
    title_tag = soup.find("meta", property="og:title")
    title = title_tag["content"] if title_tag and title_tag.has_attr("content") else "Title not found"
    #remove author name from title if exists
    if "|" in title:
        title = title.split("|")[0].strip()
    article_data["title"] = title
    
    #get author
    meta_author = soup.find("meta", property="article:author")
    author_name = ""
    if meta_author and meta_author.has_attr("content"):
        author_slug = meta_author["content"].split("/")[-1]
        author_name = " ".join(author_slug.split("-")).title()
    
    #get date
    published_tag = soup.find("meta", property="article:published_time")
    published_date = published_tag["content"] if published_tag and published_tag.has_attr("content") else "Date not found"
    article_data["date"] = published_date
    
    #get site
    site_tag = soup.find("meta", property="og:site_name")
    article_data["site"] = site_tag["content"] if site_tag and site_tag.has_attr("content") else "Site not found"
    
    #get category
    category_tag = soup.find("meta", property="article:section")
    article_data["category"] = category_tag["content"] if category_tag and category_tag.has_attr("content") else "Opinion"
    
    #get rid of newsletter signup box
    for aside in soup.find_all("aside", attrs={"aria-label": "newsletter promotion"}):
        aside.decompose()
    
    #get article
    article_body = soup.find("div", class_=lambda c: c and "article-body" in c)
    if article_body:
        # Get rid of footer bylines before collecting paragraphs,
        # so paragraphs inside a footer are not picked up
        for footer in article_body.find_all("footer"):
            footer.decompose()
        article_paragraphs = article_body.find_all("p")
        # Remove other unwanted pieces such as author biography, requests for opinions etc. 
        num_paragraphs = len(article_paragraphs)
        if num_paragraphs > 0:
            # Determine the indices for the last five elements
            start_index = max(0, num_paragraphs - 5)
            # Iterate in reverse order over these indices
            for i in range(num_paragraphs - 1, start_index - 1, -1):
                text = article_paragraphs[i].get_text(strip=True)
                if author_name and text.startswith(author_name):
                    del article_paragraphs[i]
                elif text.startswith("Do you have an opinion on the issues raised in this article?"):
                    del article_paragraphs[i]
                elif text.startswith("As told to"):
                    del article_paragraphs[i]
                elif text.endswith("is an Observer columnist"):
                    del article_paragraphs[i]
                elif text.endswith("is a Guardian columnist"):
                    del article_paragraphs[i]
        
        joined_text = " ".join(elem.get_text(separator=" ", strip=True) for elem in article_paragraphs)
        article_data["text"] = clean_text(joined_text)
    else:
        article_data["text"] = "Article text not found"

    return article_data

scrape_articles("guardian.txt", scrape_guard_article, "commentary_scraped_articles_guard.csv")
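
The trailing-paragraph clean-up in `scrape_guard_article` only inspects the last five paragraphs, so that similar phrases earlier in an article are left alone. The same rule can be sketched on plain strings, which makes it easy to try out on sample text. This is an illustrative sketch only: `strip_trailing_boilerplate`, `TRAILING_PREFIXES` and `TRAILING_SUFFIXES` are hypothetical names, and the sample paragraphs are made up.

```python
# Phrases taken from the checks in scrape_guard_article
TRAILING_PREFIXES = (
    "Do you have an opinion on the issues raised in this article?",
    "As told to",
)
TRAILING_SUFFIXES = (
    "is an Observer columnist",
    "is a Guardian columnist",
)


def strip_trailing_boilerplate(paragraphs, author_name="", window=5):
    """Drop boilerplate paragraphs found among the last `window` entries."""
    cutoff = max(0, len(paragraphs) - window)
    kept = []
    for index, text in enumerate(paragraphs):
        in_tail = index >= cutoff
        is_boilerplate = in_tail and (
            (author_name and text.startswith(author_name))
            or text.startswith(TRAILING_PREFIXES)
            or text.endswith(TRAILING_SUFFIXES)
        )
        if not is_boilerplate:
            kept.append(text)
    return kept


# Made-up paragraphs for illustration only
sample = [
    "First point of the argument.",
    "Second point of the argument.",
    "Jane Doe is a Guardian columnist",
    "Do you have an opinion on the issues raised in this article?",
]
print(strip_trailing_boilerplate(sample, author_name="Jane Doe"))
```

Building a new list avoids deleting from a list while iterating over it, which is why the original loop has to walk the indices in reverse.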

#### The European Conservative
All articles were scraped from https://europeanconservative.com/commentary/. Articles range from December 2024 to March 2025.

In [None]:
def scrape_econ_article(url):
    """
    Scrapes an article from a given URL on europeanconservative.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the page"}
    
    soup = BeautifulSoup(response.text, "html.parser")
    #print (soup)
    
    #get title
    title_tag = soup.find("meta", property="og:title")
    if title_tag and title_tag.has_attr("content"):
        title = title_tag["content"]
    else:
        title = "Title not found"
    article_data["title"] = title
    
    #get date
    published_tag = soup.find("meta", property="article:published_time")
    if published_tag and published_tag.has_attr("content"):
        article_data["date"] = published_tag["content"]
    else:
        article_data["date"] = "Date not found"
    
    #get site
    if soup.title and soup.title.string and " ━ " in soup.title.string:
        article_data["site"] = soup.title.string.split(" ━ ")[-1].strip()
    else:
        site_tag = soup.find("meta", property="og:site_name")
        if site_tag and site_tag.has_attr("content"):
            article_data["site"] = site_tag["content"]
        else:
            article_data["site"] = "Site not found"
    
    #get category
    category_anchor = soup.find("a", class_="elementor-post-info__terms-list-item")
    if category_anchor:
        article_data["category"] = category_anchor.get_text(strip=True)
    else:
        article_data["category"] = "Category not found"
        
    #get article
    content_div = soup.find("div", class_=lambda c: c and "theme-post-content" in c)
    if content_div:
        # Get all paragraphs
        paragraphs = [p.get_text(separator=" ", strip=True) for p in content_div.find_all("p")]
        joined_text = " ".join(paragraphs)
        article_data["text"] = clean_text(joined_text)
    else:
        article_data["text"] = "Article text not found"

    return article_data

scrape_articles("econ.txt", scrape_econ_article, "commentary_scraped_articles_econ.csv")
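
The site lookup in `scrape_econ_article` takes the text after the last " ━ " separator in the page title and only falls back to another source when that separator is missing. A minimal sketch of that rule on plain strings follows. `site_from_title` is a hypothetical helper, and the sample title is made up to match the pattern the scraper expects.

```python
def site_from_title(title_text, fallback="Site not found", separator=" ━ "):
    """Return the text after the last separator in a page title, else the fallback.

    title_text may be None, which is what BeautifulSoup reports for a <title>
    tag without a single text child.
    """
    if title_text and separator in title_text:
        return title_text.split(separator)[-1].strip()
    return fallback


# Made-up title for illustration only
print(site_from_title("Eyes Wide Shut ━ The European Conservative"))
print(site_from_title(None))
```

In the scraper, the `fallback` argument would be the `og:site_name` value when that meta tag is present.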

### Combine dataframes
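
The per-source CSV files combined in this section are meant to add up to a balanced dataset of 400 items per class. A minimal sketch of a balance check that could be run on the `class` column once the master file has been loaded is given here. `check_class_balance` is a hypothetical helper and the toy labels are made up. It works on any iterable of labels, so a pandas column can be passed directly.

```python
from collections import Counter


def check_class_balance(labels, expected_per_class=400):
    """Return {class_label: count} for classes whose count differs from the target."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n != expected_per_class}


# Toy labels with a target of 2 per class; "Satire" is one short
toy_labels = ["Commentary", "Commentary", "Satire", "Fabricated", "Fabricated"]
print(check_class_balance(toy_labels, expected_per_class=2))
```

An empty dictionary means every class present matches the target. Note that a class missing from the data entirely is not reported by this sketch.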

In [4]:
def combine_csvs_to_master(file_list, master_csv="master.csv"):
    """
    Combines multiple CSV files into one master CSV file.

    Parameters:
    -----------
    file_list : list of str
        List of paths to CSV files.
    master_csv : str, optional
        The filename for the master CSV file (default is 'master.csv').

    Returns:
    --------
    None
    """
    # Read each CSV file into a DataFrame
    dfs = [pd.read_csv(file) for file in file_list]
    
    # Concatenate DataFrames; ignore_index=True already produces a fresh 0..n-1 index
    combined_df = pd.concat(dfs, ignore_index=True)
    
    # Save the combined DataFrame to a CSV file
    combined_df.to_csv(master_csv, index=False)
    
#combine_csvs_to_master(all_scraped_content)
combine_csvs_to_master(["commentary_scraped_articles_econ.csv","commentary_scraped_articles_guard.csv","commentary_scraped_articles_stone.csv","commentary_scraped_articles_nat.csv","commentary_scraped_articles_washexam.csv",
                       "fabricated_articles.csv","polarised_scraped_articles_bb.csv","polarised_scraped_articles_can.csv","polarised_scraped_articles_kos.csv","polarised_scraped_articles_tcw.csv",
                       "satire_scraped_articles_bee.csv","satire_scraped_articles_onion.csv","satire_scraped_articles_whispers.csv", "satire_scraped_articles_squib.csv"])

In [5]:
master_df = pd.read_csv("master.csv")
print(len(master_df))

#Check if any empty articles
empty = master_df[master_df["text"]=="Article text not found"]
print (empty)

# Basic checks
print(master_df.info())
print(master_df.head(3))

# Print out the categories
print(master_df["category"].value_counts())

# Print out the classes
print(master_df["class"].value_counts())

# Print unique websites
print("Number of unique sites:", master_df["site"].nunique())

1600
Empty DataFrame
Columns: [title, text, site, date, category, class, url]
Index: []
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600 entries, 0 to 1599
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     1250 non-null   object
 1   text      1600 non-null   object
 2   site      1600 non-null   object
 3   date      1600 non-null   object
 4   category  1500 non-null   object
 5   class     1600 non-null   object
 6   url       1600 non-null   object
dtypes: object(7)
memory usage: 87.6+ KB
None
                                                             title  \
0                                                   Eyes Wide Shut   
1           Europe Is Giving in to the Censorious Demands of Islam   
2  Hillbilly Meets Europe: A New Transatlantic Vision for the West   

                                                                                                                                         

In [8]:
sample_df = master_df.drop_duplicates(subset=["site"])
sample_df

Unnamed: 0,title,text,site,date,category,class,url
0,Eyes Wide Shut,"Three in four people in Britain polled support a national inquiry into the prolific and harrowing rape of the nation's children by insatiate "" grooming gangs ."" Yet, contrary to public will, the UK Labour Government last week voted against commissioning an investigation into this enduring horror. Public consciousness of child sexual exploitation in the United Kingdom reached an inflection point this winter after victims shared account after gut-churning account of sexual savagery and careless murder being perpetrated against underage white girls by Pakistani-Muslim men up and down the country. These brave survivors recount how police and social workers were complicit in their abuse, losing evidence, asserting that children could consent , and failing to investigate rapes for fear of being called racist. In one instance, a girl had a morning-after pill forced into her mouth by a police officer . Significant constituencies of the public believe that attempts are being made to cover up the historic failures of both local and national authorities, including 65% of Labour voters. These accusations of corruption, galvanised by high-profile political figures like Elon Musk, have been branded ' far-right' by Keir Starmer, which has only caused the public's attention to shift toward the prime minister and his record as Director of Public Prosecutions . The prime minister fueled speculations of a cover-up when he issued a three-line party whip--the most stringent form of party discipline--against Labour MPs, forcing them to vote against the investigation or risk expulsion from the party. When the vote came, therefore, MPs from the top 50 towns known to be sheltering grooming gangs either abstained or voted against the investigation. Only 3 MPs living in these grooming gang hotspots favoured an investigation. In each instance, the member came from the Conservative Party. 
No Labour MP voted in favour of an investigation, with around one-fifth abstaining or absent. The result produced a seismic backlash, including from angry victims , in whose names Labour claimed a national investigation would be counterproductive to change. The party line is, ""The investigation has already been had"" and ""It is time to get to work."" This has proven to be one of these sleight-of-hand truths intended to cover all manner of sins. As Reform MP Nigel Farage highlighted for the House, this preexisting inquiry--all 459 pages of it--fails to mention ""grooming gangs"" and of the over 50 towns in England known to be harbouring this scourge, only the infamous Rotherham is mentioned. It is thought that the number of girls who have been victims of child sexual exploitation, sexual assault, and rape by these gangs could be a quarter of a million--at the most conservative estimate. In twelve months alone, a Grooming Gangs Taskforce helped UK police identify and protect ""over 4,000 victims."" The towering ambiguity of just how many girls have been affected should, alone, give cause for a comprehensive, rigorous, and far-reaching investigation. The need is unambiguous and the public mandate is indisputable. Some within the Labour Party are coming to this realisation. MPs have begun to break ranks to side with the opposition in calling for an investigation, including Dan Carden , Paul Waugh , and Sarah Champion --MPs for the infamous Rochdale and Rotherham. But the cynic in me believes this limp 'I am Spartacus'-moment struggles to amount to more than career protection. I do not believe there is the desire or the passion to put an end to the ritual rape of these girls by gangs of foreign men. For decades, MPs have had their eyes wide shut to this scourge. The cold, unfeeling truth is that these girls have been dismissed because their abuse is a political inconvenience. 
The facts of this atrocity contradict the tenets of 'inclusivity' and 'tolerance' that characterise our modern age. From infancy until death, the public is schooled that diversity is--and can only ever be--a strength. If an immigrant living in a Western nation engages in criminal or grotesque behaviour, he does not reflect the customs and attitudes of his home nation. Nor is he to be included in the ranks of the 'DiverseTM'. He is to be viewed as an innocent--a child--lashing out at a cruel and misunderstanding people. Whether he emerges from a war zone or a paradise, he is to be understood as suffering an impenetrable trauma that gifts rationality to his barbarity. All of which is to say, it is impossible for one to identify a pattern of perversity in any one community, religion, country, or culture. It is in this vein that the British public is charged to ignore the demographic shift in their constituencies--ignore how this might influence a politician's vote. They are forbidden from noticing that the fast-food economy which sustains the middle-class lifestyle also sustains the perpetrators in these rape gangs. They must blind themselves to the incontrovertible. Labour is no longer the party of the white working class because the white, working-class vote is no longer useful to them.It is not the rape of the nation's children the UK Government is trying to cover up, nor is it the endemic failures in local councils and child protection services. What they wish to deny is an attitude: that the life of a white, working-class girl--much like her consent--is surplus to requirements.",The European Conservative,2025-01-14T14:20:20,commentary,Commentary,https://europeanconservative.com/articles/commentary/eyes-wide-shut/
100,Labour’s aid cuts are morally wrong. Here’s why they make no economic sense either,"G et right down to it and there are two reasons for thinking that cuts to Britain's aid budget to pay for defence are a seriously bad idea. The first is that people will die as a result. There will be less money to respond to humanitarian crises and less money for vaccination programmes and hospitals. Realpolitik is being blamed for the decision, but realpolitik doesn't make it right. But there are also economic arguments for rich countries providing financial support to less well-off nations, which were summed up succinctly in last year's Labour party manifesto . This document could not have been clearer. International assistance, it said, helps make ""the world a safer, more prosperous place"". That remains as true as it was when Labour came to power last summer, and indeed it was still the party's stated belief a month ago. When, as one of his first decisions, Donald Trump gutted the US aid budget, the foreign secretary, David Lammy, said it could be a "" big strategic mistake "". Now that the UK has followed suit and reduced aid spending from 0.5% to 0.3% of national output, Lammy says it was a difficult but pragmatic decision. He was right before and is wrong now. At its crudest, the economic case for overseas aid is that it is good for business. As countries become richer, they provide export opportunities for donor countries. The US has always understood this, with postwar Marshall aid for European reconstruction in part driven by fear of the spread of communism and in part as a means to provide markets for US goods. Under previous administrations, US humanitarian aid programmes have channelled agricultural surpluses into overseas food programmes. In today's world, it is no longer possible to think of aid spending and defence spending as discrete pots of money. 
Extreme poverty is increasingly concentrated in those parts of the world most seriously damaged by wars and the climate crisis. Five years ago, the global economy was about to be affected by the Covid-19 pandemic, a shock from which the UK has yet to recover. Ministers need to ask themselves a simple question: does cutting the aid budget make another worldwide health emergency more or less likely? Poor countries need help to boost economic development more than ever. They havebeen hard hit by the double whammy of Covid and the higher food prices triggered by Russia's invasion of Ukraine three years ago. A new debt crisis is looming, and both the International Monetary Fund (IMF) and the World Bank have been warning that money that could and should be spent on schools, hospitals and building protection against the effects of the climate emergency is instead being spent paying back creditors. The Labour governments headed by Tony Blair and Gordon Brown spearheaded previous debt relief efforts and were able to do so because Britain showed a strong commitment to overseas assistance. A new Department for International Development was set up, a goal was set of meeting the UN aid target of 0.7% of national income, and there was a clear gameplan. Spending more on aid was good for poor countries, but it was also good for rich countries such as Britain. It was a classic example of the exercise of soft power. Britain punched well above its weight when development issues were discussed at the IMF and World Bank. The world is a lot more fragile and divided than it was in the early years of this century, when growth was strong and the era of financial crises and global pandemics was still in the future. With the US and China locked in a battle for economic supremacy, the battle is on to capture hearts and minds. Seen in this context, the UK's decision to follow the US lead on aid spending is shortsighted. 
It will merely make poor countries more susceptible to offers of assistance from Beijing. None of which is to say that every penny of aid is well spent. Yes, there is waste, as there is with defence spending. But in making the choice it has, the government has effectively bought into the rightwing argument that aid does more harm than good and traps poor people into a dependency culture. Labour needs to be careful. The right says the same about the welfare state, which will be next on its list of targets. There is a case for higher defence spending. It is a more dangerous world and Britain can no longer rely on the US to provide guaranteed military support. But let's be clear. The reason the aid budget is being cut to pay for the armed forces is not because it is the best way to raise money but because it is the easiest. The government is calculating that it will get far less grief from voters - especially Labour voters flirting with Reform UK - this way. There are alternatives. Rachel Reeves could increase taxes on the wealthy. If the need is really as urgent as the government says, then the chancellor could justify borrowing more. A truly progressive government would be reviving the idea of a Robin Hood tax on speculative financial transactions to meet its manifesto pledge of raising aid spending back to 0.7% of national income. Instead, Starmer has done the reverse of a Robin Hood tax. Shamefully, he is balancing the books courtesy of the poorest people in the world. This will not make the world a safer and more prosperous place. The exact opposite, in fact.",the Guardian,2025-02-28T09:00:54.000Z,opinion,Commentary,https://www.theguardian.com/commentisfree/2025/feb/28/us-cut-aid-budget-labour-big-mistake
150,Price Gouging in the L.A. Housing Market Is Now Rampant. Can We Stop It?,"The wildfires in Los Angeles are devastating. Forty thousand acres of natural and urban land have been scorched, more than 12,300 structures decimated, and tens of thousands of families lost their homes, pets, baby books, and most cherished possessions. The displaced are now fanning out, seeking temporary shelter with friends and family, in hotels, or in rental units. Many will have trouble finding a place to stay in a city that was already as many as 450,000 affordable units short before the fires. When there are more heads than beds, it's a seller's market. When there are more heads than beds in a crisis, it's a gouger's market. Cue the greedy landlords. In the past week, the price of rental units has skyrocketed. Tenants are inundating government and non-government agencies with complaints of price gouging, according to the Housing Rights Center . A review of Zillow listings by The New York Times found that rent prices in West Los Angeles have spiked from 15 percent to an ""eye-popping 64 percent."" And residents have begun cataloging an ever-growing list of inexplicably large price spikes in a jaw-dropping Google Sheet , a veritable rogue's gallery of tenant exploitation, broken down by street address. According to the list, a one bedroom townhome outside Jefferson Park jumped from $900 to $2,300 . Another in downtown Los Angeles (one of the few listings that does allow pets) spiked from $1,095 to $3,200 . A five-bed, five-bath near Brentwood Heights went from $12,000 to $15,000. When asked about the shocking rent hikes, an L.A.-area listing agent offered the most-commonly invoked defense of price gouging: it was just "" supply and demand "" at work. Unfortunately, this type of price gouging after natural disasters is all too common. Early in the pandemic, price gouging on masks, hand sanitizer, respirators, and clorox wipes was rampant. 
After Hurricane Harvey, the Texas attorney general reported an instance of gougers charging a whopping $99 for a case of water . This is why the majority of states -- including California -- have price gouging laws on the books. These laws are designed to protect consumers when the markets may be impacted by natural disasters, pandemics, or other disruptions, like supply chain shocks -- but they are only as good as their enforcers. Area residents are reporting price hikes that far exceed California's 10-percent threshold for price gouging. Lawmakers must work quickly to crack down on these predators, and make an example of some of the worst offenders. It is good to see that California Governor Gavin Newsom has extended price - gouging protections for rental housing through March, and that California Attorney General Rob Bonta has announced his office will be ramping up resources to investigate and prosecute offenders. Even if they act fast, it won't be enough. Because the problem is bigger than the greedy landlords. The private equity vultures have also descended on the Hollywood Hills, and begun sifting through the rubble, looking to see what they might be able to acquire in a fire sale. Real estate agents are calling for the city to suspend its new "" mansion tax ,"" which applies to deals over $5 million, and last year raised $375 million for affordable housing -- a duck call for investors and corporate landlords looking to expand their footprint in the rental market. In a letter , realtors argued, ""Exempting developers from the transfer tax for five years will encourage them to purchase land from homeowners at reasonable prices and quickly rebuild these devastated communities."" Suspending the mansion tax will starve the city of the resources it needs to rebuild the affordable housing units displaced families require. It is the opposite of what policymakers should do to meet this moment. 
Now is the perfect time to show Angelenos why they passed this legislation in the first place, using the proceeds to deliver affordable housing on an expedited timeline that matches the urgency of this crisis. Lawmakers can look to the recent successes of Executive Directive 1, which streamlined some permitting for affordable units, as a roadmap. And while residents wait for additional housing to come online, policymakers should extend price gouging protections to renters through at least the end of 2025, and prohibit application fees, credit check fees, and other junk fees that drive up the total cost of rent during this period as well. Crises like natural disasters expose and widen the existing fault lines in our economy and our public policy. Rent gouging after a wildfire is galling, but the hard truth is that every day, across this country, tenants are exploited by a housing system that is failing them. Even before the wildfires, our country was facing a severe housing affordability crisis, driven in part by corporate landlords working to extract as much as they can from us and our neighbors. The lack of affordable housing across the country, and in major metropolitan areas like Los Angeles in particular, leaves us vulnerable to the whims of landlords. Private equity's deepening penetration into the residential real estate market only exacerbates this power imbalance. This dynamic shows no sign of abating as President Trump's new nominee to run the Department of Housing and Urban Development extolled the virtues of private equity at his confirmation hearing this week. If we want to stop the vultures from circling, we must build a housing system that can not only withstand dangerous weather, but also the dangers of an economy that makes a fair price for rent increasingly elusive. 
Lindsay Owens is Executive Director of Groundwork Collaborative and author for the forthcoming book, Gouged (Viking Penguin ).",Rolling Stone,2025-01-19T16:11:52+00:00,political commentary,Commentary,https://www.rollingstone.com/politics/political-commentary/la-fires-rent-gouging-1235240773/
250,Who is legally responsible for climate harms? The world’s top court will now decide,"It bears repeating over and over: the science is not in question. High concentrations of greenhouse gases in the atmosphere are warming the planet. International law is also clear: under the legally binding Paris climate agreement, nations pledged to keep average temperatures within 1.5 degC of pre-industrial levels. And yet, as emissions continue to increase, global temperature rises will almost certainly exceed this limit . The research community is frustrated that its warnings are not being heeded. What is the point of a legally binding agreement if countries can effectively ignore it? Some scientists are arguing that climate researchers need to become climate activists, too 1 , 2 . But others, and more than a few governments, are not giving up on the legal route. Because the Paris agreement lacks an enforcement mechanism, they want courts to ensure that all those with climate responsibility -- nationally and internationally -- can be held to their promises. And they have been busy going to court. By the end of last year, 2,666 climate-litigation cases had been filed worldwide, according to a report 3 by the Grantham Research Institute on Climate Change and the Environment, published in June (see 'Climate in court'). Most claimants are individuals , young and old, as well as non-governmental organizations (NGOs). All are looking to hold governments and companies accountable for their climate pledges. In 2022, the Intergovernmental Panel on Climate Change 4 acknowledged that , if successful, climate litigation ""can lead to an increase in a country's overall ambition to tackle climate change"". Note the phrase, ""if successful"". There have been a handful of landmark judgments. For example, in May, courts in Germany and the United Kingdom separately found that their government's policies would fail to meet emissions-reduction targets that are set out in law. 
But most claimants struggle to get a positive result, as Joana Setzer and Catherine Higham, researchers at the Grantham Institute in London, show in their report 3 . Much climate litigation is mired in a maze of process and procedure. In some instances, respondents -- mostly corporations -- are embarking on counter-litigation, essentially challenging climate laws that they do not like. This is where the entry of the world's highest court could be a game changer. In the next few months, the International Court of Justice (ICJ), the United Nations' principal judicial organ in The Hague, the Netherlands, will begin hearing evidence on two broad questions: first, what are countries' obligations in international law to protect the climate system from anthropogenic greenhouse-gas emissions, and second, what should the legal consequences be for states when their actions -- or failure to act -- cause harm? The time to act is now: the world's highest court must weigh in strongly on climate and nature The time to act is now: the world's highest court must weigh in strongly on climate and nature This could be one of the most consequential developments in climate policy since the Paris agreement itself. Adil Najam, president of the global conservation NGO WWF, writes in a World View that the ICJ's opinion ""will amplify the voices of millions of scientists and citizens who are demanding strong ambition and action on climate and nature protection"". These voices include people arguing against greenwashing, or for the protection from climate change as a human right , as well as public authorities seeking compensation from corporations for climate-related harms, under the 'polluter pays' principle. 
Last September, California launched legal action against five of the world's largest oil companies -- BP, Chevron, ConocoPhillips, Exxon and Shell -- and their subsidiaries, demanding that they pay ""for the costs of their impacts to the environment, human health and Californians' livelihoods, and to help protect the state against the harms that climate change will cause in years to come"". Brazil's public prosecutor's office and the Brazilian Institute of the Environment and Renewable Natural Resources are seeking compensation for harms specifically from greenhouse-gas emissions caused by illegal deforestation 5 . The ICJ's opinion, although non-binding, will be especially important for low- and middle-income countries, which have comparatively less access to expertise in climate science, policy and law than do high-income countries. One criticism of climate litigation states that courts should not be getting involved in what are essentially political processes. The argument is that, if climate laws lack an enforcement mechanism, then governments need to legislate for one. According to this idea, it shouldn't be up to the courts to do something that is the job of governments; that would be judicial overreach. Courts are well aware of these concerns, and the ICJ will be, too. Legal redress is but one tool in a larger toolbox of actions. Ultimately, climate action at scale and pace will happen only when the international community is persuaded that humanity has no alternative but to decarbonize in a just way; not because of the threat of prosecutions, but because our collective survival depends on it. But the law has a key role. And the ICJ's opinion, backed by the highest standards of evidence, will be necessary in clarifying states' responsibility for climate harms and their obligation to protect the environment from emissions.",nature,2024-08-13T00:00:00Z,editorial,Commentary,https://www.nature.com/articles/d41586-024-02600-5
300,Why were hopes of the 1990s dashed?,"As one who shared the hope, after the fall of the Berlin Wall in November 1989, that representative government , guaranteed liberties, and global capitalism laced with some measure of welfare state protections would spread across the globe, I naturally look back over the intervening long generation and ask what went wrong. In the 1990s, it seemed to many that the vision of Francis Fukuyama's The End of History and the Last Man would prevail. Not that bad things would never happen again. Fukuyama's more subtle thesis was that after the debacle of communism, there was no intellectually viable alternative to some combination of political democracy and market capitalism as the means to a decent society. But the past three decades have seen the vitality of politically viable alternatives -- China's dictatorial and Russia's authoritarian state-directed capitalism, the oppressive clerical regimes of Shiite Iran, and various Sunni Muslim states. By Freedom House's sophisticated measures , 2004 saw a high point in global freedom, which has been in decline ever since. How to explain this trend, the opposite of what I hoped for and predicted? As I have reflected on this question, I've fallen back on an article I wrote in 1993 for Irving Kristol's Public Interest , in which I identified four types of political parties. Two were based on European conflicts over religion: Religious parties favored established churches, and liberal parties favored the separation of church and state. Two others, socialist and nationalist, had their beginnings in attempts to rally the masses in the failed European revolutions of 1848, appealing to their working-class interests or their folk national yearnings. 
Structural features -- the Electoral College, the single-member House and Senate seats -- push American politics into a two-party system in which both are incentivized to amass 50% majorities in what has always been a culturally and economically diverse nation. So, America's political parties, operating in a unique republican framework and under democratic rules that predated Europe's, have partaken of each of these four impulses in varying degrees. In my Public Interest article, I argued that religious parties tend to fade out in nations with no majority faith, liberal parties tend to collapse as their characteristic skepticism leaves them yielding to violent opposition parties, and socialist parties tend to peter out because, at some point, socialism fails to work. Parties that endure, I argued, were, in some major respect, nationalist. American politics over the past 30 years provides some confirmation. The market-respecting liberalism of former President Bill Clinton's Democratic Party yielded to the woke socialism of former Presidents Barack Obama's and Joe Biden's. The religious emphasis and market economics of the Reagan-Bush Republican Party yielded to the demotic nationalism of President Donald Trump's. And Trump won, despite lawfare persecution, a significant and possibly enduring victory over the Democrats' woke socialism last November. My conclusion in 1993 and, tentatively, now is that nationalism is the glue that holds parties and nations together. The republican nationalism of George Washington and Alexander Hamilton, the democratic nationalism of Andrew Jackson and Abraham Lincoln, and the nationalism of the two Roosevelt presidents, who remain vivid figures 116 and 80 years after leaving office. The problem we have encountered over the last 30 years is that other countries' nationalisms are not like America's. 
It turns out that neither the leaders nor the masses in Muslim nations have much interest in electoral democracy, market capitalism, or the rule of law. It turns out that the leaders of Western Europe, traumatized by the horrifying wars of the first half of the 20th century, seek a transnational harmony that overrules nations' democratic electorates and smothers market capitalism with regulations. In reaction, Britain voted to leave the European Union in 2016, as the formerly (under Tony Blair) liberal Labour Party split into socialist and Scottish National parties and the long-dominant Conservatives into high-education Conservatives and the Trumpish Reform UK party. In the 1990s, there was reason to hope that Russia was moving toward democracy and that China, despite the Tiananmen Square massacre, would move away from repression and toward convergence with rules-based market economies. Instead, Russian President Vladimir Putin grabbed power from the flailing Boris Yeltsin, and Chinese President Xi Jinping jailed one rival and abolished his predecessors' term limits. Putin has been in power for 25 years, almost as long as Joseph Stalin's 29, while Xi has been in power for 13 years, about half as long as Mao Zedong's 27. Putin has been following a nationalist policy that dates back not only to Stalin but also to the czars, expanding Russia's power outward from Muscovy in every direction -- though not as far in Ukraine as he hoped and expected. Xi evidently sees China as its emperors did for 2,000 years, as the greatest nation in the world, unfortunately recovering from a hundred years of humiliation by Western powers and Japan. Something similar has been happening in Mexico. Economic integration with Mexico and replacement of its one-party authoritarian rule by democratic rotation in office and the rule of law was the goal of the North American Free Trade Agreement, pushed in the 1990s by Presidents George H.W. 
Bush and Clinton and Treasury Secretary Lloyd Bentsen, who grew up on a ranch facing the Lower Rio Grande. NAFTA was ratified, the economies converged, and, as I witnessed, the opposition party ended 71 years of PRI party rule in July 2000. But former Mexican President Andres Manuel Lopez Obrador, elected in 2018, reinstalled one-party rule and government control of the economy, and his handpicked successor, Claudia Sheinbaum, was elected with 61% of the vote. AMLO, as Obrador is universally known, managed to reach accommodations with Trump, and Sheinbaum has as well. But Mexico remains culturally distant, with uncertain property rights and opaque governance despite its geographic proximity. One lesson seems to be that national character matters and that it is more a product of deep-seated history than of recent American policy initiatives. It pops up even when you don't expect it and can't be transformed by paper guarantees. Another lesson is that America, with its unique Constitution, fashioned in 1787 and revised in 1865-70 and, arguably, again in 1937-41, is indeed exceptional -- and that American exceptionalism is a wine that does not travel. A third lesson is that the hopes of the 1990s were not totally dashed. Eighty-five years ago, in 1940, a time when some current leaders, such as Rep. Nancy Pelosi (D-CA) and Sen. Chuck Grassley (R-IA), were living, Adolf Hitler and Stalin were allies in command of or with their allies holding most of the landmass of Eurasia, opposed actively only by Great Britain, whose air force and navy were stretched to the limit. 
Representative government, guaranteed liberties, and global capitalism laced with some measure of welfare state protections are much better off today than they were then, thanks in large part to the leadership at the time of the British nationalist Winston Churchill, the American nationalist Franklin Roosevelt, and the French nationalist Charles de Gaulle -- something to keep in mind as we bewail our current discontents.",Washington Examiner,2025-02-12T19:43:41+00:00,,Commentary,https://www.washingtonexaminer.com/opinion/columnists/3318333/why-were-hopes-1990s-dashed/
400,,Ed Perlmutter voted for Viagra for rapists paid for with tax dollars.,Liar Database,February 4th,crime,fabricated,https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset
750,Shadow Council Manipulates Global Policies,"In a stunning revelation that has rocked the global political landscape, insiders have claimed that a secretive group known as the Shadow Council has been orchestrating international policy decisions behind the scenes for over two decades. According to anonymous sources within high-ranking government agencies, this clandestine network meets in undisclosed locations to determine the fate of nations—manipulating economic strategies, military deployments, and diplomatic relations with ruthless precision. One whistleblower, insisting on anonymity, described the council’s gatherings as “a blend of high-level intrigue and covert power plays,” where a handful of elite figures shape world events. Despite a complete lack of verifiable evidence and rebuttals from reputable fact-checkers, rumors persist, stirring suspicion among citizens and igniting fierce debates over the true nature of global governance. Critics demand full transparency, while supporters dismiss the claims as a political witch hunt.",ChatGPT,February 4th,politics,fabricated,chatgpt.com
775,"World Leader Secretly Funds Alien Technology Research, Leaked Docs Claim","A classified dossier allegedly reveals that the leader of a major European nation diverted €800 million in public defense funds to a clandestine extraterrestrial tech program. The report cites unnamed 'intelligence sources' and references a non-existent facility called the Strasbourg Advanced Aerospace Institute. Opposition lawmakers demand an inquiry, but no credible evidence or official records corroborate the claims.",DeepSeek,February 4th,politics,fabricated,chat.deepseek.com
800,ICC Prosecutor Leading Charge Against Israel Meets Syria's Jihadi Overlords,"International Criminal Court (ICC) prosecutor Karim Khan visited Damascus, Syria, this weekend to meet with jihadi warlord Ahmed al-Sharaa, the de facto leader of the country after the fall of the Assad family regime. In a message the ICC published on social media, the world court said British lawyer Khan expressed gratitude to ""Syrian authorities"" for ""open & constructive discussions"" regarding holding war criminals and others accountable following the resolution of the Syrian Civil War. Syria endured over a decade of civil war under deposed dictator Bashar Assad that evolved into a melee featuring both fighting between the Assad regime and several opposition militias and a host of terrorist, separatist, and state actors fighting each other in Syria for a variety of reasons. The context of the Syrian civil war allowed the Islamic State to carve out land for a ""caliphate"" in the northern region of Raqqa that was ultimately eradicated through collaboration between the United States and the Syrian Democratic Forces (SDF), a coalition of Kurdish-led militias that largely avoided fighting for or against Assad. The war ended in early December when Assad fled the country for Russia. Ahmed al-Sharaa, formerly known by his jihadist name Abu Mohammed al-Jolani, became the de facto leader of the country as the head of the al-Qaeda offshoot militia Hayat Tahrir al-Sham (HTS). HTS launched a surprise assault of Assad forces in late November in Aleppo, Syria's second-largest city, that sent Assad forces fleeing. The striking success of HTS in Aleppo led to successive captures of territory from Idlib to Damascus; the militia's arrival to the capital prompted Assad to flee. Human rights groups and the United Nations have documented widespread evidence that Assad and several other actors in the civil war committed war crimes, crimes against humanity, and other atrocities. 
The ICC is an international court with jurisdiction to prosecute individuals for three types of crimes: genocide, war crimes, and crimes against humanity. Khan's visit to Syria was reportedly intended to begin the process of formal investigations potentially leading to ICC convictions. Reuters reported that Sharaa's nascent regime invited Khan to discuss war crimes. Khan proclaimed himself pleased with conversations with Sharaa on the possibility of international justice for Syrian civil war crimes. ""Some of the remarks coming out of Syria by the transitional government seem to have indicated an openness to justice and accountability for crimes that may have taken place,"" Reuters quoted Khan as saying. ""I think we're happy to take part in the conversation to tell them the options that they have."" The visit was reportedly a "" surprise "" stop for Khan and the ICC did not offer any specific steps forward for its participation in Syrian justice. Syria is not a signatory to the Rome Statute, which established the ICC, so it does not have to accept ICC jurisdiction. The ICC statements and quotes from Khan did not indicate that he discussed in any depth with Sharaa the crimes that HTS terrorists may have committed themselves during the decade-plus of its existence, or what the new Syrian regime would do to defend the human rights of its beleaguered civilians. HTS is a U.S.-designated terrorist organization that sprang out of al-Qaeda. American authorities were offering a $10 million bounty for Sharaa himself, as the leader of the jihadists, until former President Joe Biden rescinded the reward in December. Sharaa, now wearing Western-style suits instead of military fatigues, has offered vague public statements asserting that he would lead an ""inclusive"" government and respect the existence of religious and ethnic minorities in the country, but also affirmed that the government replacing Assad would be Islamist. 
""We take pride in our culture, our religion and our Islam. Being part of the Islamic environment does not mean the exclusion of other sects. On the contrary, it is our duty to protect them,"" Sharaa said in an interview in December. Prior to the HTS takeover of the country, Sharaa told CNN that ""people who fear Islamic governance either have seen incorrect implementations of it or do not understand it properly."" Religious minorities, particularly Christians and Alawite Muslims, have expressed alarm at HTS becoming the de facto government of their country. Religious persecution experts have warned that the jihadists have a history of persecuting non-Sunni Muslims and Christians are not safe under HTS. ""HTS, with its al-Qaeda/ISIS roots, has historically been very violent towards Christian minorities, which should mean increased persecution,"" Jeff King, the president of International Christian Concern (ICC), told Breitbart News this month. ""The fall of Aleppo to these groups [Christians] will signify the beginning of the end for one of the last significant Christian strongholds in the region if unchecked."" Critics noted Khan's apparent lack of interest in minority persecutions in contrast to his energetic attempts to prosecute the government of Israel for defending itself following the terrorist atrocities by the jihadists of Hamas on October 7, 2023. Khan requested arrest warrants for Israeli Prime Minister Benjamin Netanyahu and his defense minister at the time, Yoav Gallant, claiming they were engaging in crimes against humanity in the Hamas-controlled Gaza region. The ICC issued the warrants in November. Israeli Foreign Minister Gideon Saar condemned Khan for meeting with the HTS leadership following his visit. ""He [Khan] already ran to Damascus to meet with al-Julani, head of HTS (designated as a terrorist organization by the UN Security Council), and former al-Qaeda operative,"" Saar wrote in a social media message. 
""So much for 'international legal institutions'. Show me who your friends are and I'll tell you who you are."" ""Karim Kahn didn't find the time to come to Israel, a democratic country governed by the rule of law and with an independent judiciary, before issuing arrest warrants against its democratically elected leaders,"" Saar observed.",Breitbart,2025-01-20T20:16:03+00:00,politics,Polarised,https://www.breitbart.com/middle-east/2025/01/20/icc-prosecutor-leading-charge-against-israel-meets-syrias-jihadi-overlords/
900,The Trump baby blimp could make a comeback - as protesters gear up in the UK,"The UK's Stop Trump Coalition has released a statement on the day of Donald Trump's second inauguration - 20 January - pledging to ""mobilise in our thousands and our millions"". It has been signed by more than a thousand grassroots campaigners, trade unionists, climate activists and others. Unfortunately, the original Trump baby blimp - whose images were shared around the world - may not feature. It currently resides in the Museum of London. The Coalition organised some of the biggest protests in British history in response to the president's state visits in 2017 and 2018. Back in July 2018, more than 250,000 people turned out in London for a Stop Trump protest: As BBC News reported at the time: Rather than a red carpet, there was a sea of people, as two large marches took place - one led by Women's March London and another by the Stop Trump Coalition. The crowds had strong messages for the president - from their problems with his policies to hair styling tips. They were determined to make their voices heard, or at least create a lot of noise to make their point - that they did not want President Trump in the country. Now, the Stop Trump coalition is re-grouping - and protests are expected in London and across the world today as Trump is sworn in. Zoe Gardner, a spokesperson for the Stop Trump Coalition, said: In the coming weeks, we are likely to witness appalling attacks on migrants and minorities in America - just as we saw with the racist 'Muslim ban' in the opening days of the first Trump administration in 2017. It is essential that there is a broad, democratic coalition which can bring together the opposition to Trumpism - and to the new far right here in the UK. 
That means mobilising in big numbers, but it also means working to network and strengthen movements on climate, anti-racism, migrants' rights, feminism, LGBT rights and other touchstone issues, alongside the trade union movement and the left. We will look to respond to the Trump administration's first policies and to bring together the resistance to the politics of bigotry and division in the US and around the world. The statement, which has been signed by more than a thousand people, reads: ""The second inauguration is a dark moment. The far right is on the march, with a common agenda of right-wing nationalism, racism, sexism, LGBT-phobia, climate denialism, union-busting, authoritarianism, and elite impunity. They represent the interests of a wealthy elite who use bigotry and dishonesty to divide us against each other. No matter Trump's claims, illegal occupations and crimes against humanity continue - whether perpetrated by Israel in Palestine or by Russia in Ukraine. ""We are not shocked by this situation. Trump and Musk - and Farage and Badenoch - are symptoms of the failure of our political and economic system. Free market economics and austerity laid the ground. By failing to challenge the far right on immigration and other key issues, and instead mirroring their rhetoric and narratives, Starmer - like Macron, Harris and Scholz - is handing victory to the far right. ""During Trump's first presidency, the Stop Trump Coalition helped organise some of the biggest demonstrations in British history against his state visits. There are millions of people in the UK who want to fight back against the far right, stop runaway climate change, and stand for just peace across the world. There will be mass opposition to political cooperation with the Trump administration, and to any trade deal that threatens our NHS or food standards. 
""Fighting back means mobilising in our thousands and in our millions - but it must also mean a more fundamental effort to unite and strengthen movements dedicated to social and environmental justice, working class organisation, and universal human and civil rights. We pledge ourselves to that work, and to building a resistance to Trump and the politics he represents.""",Canary,2025-01-20T13:01:21+00:00,news,Polarised,https://www.thecanary.co/uk/news/2025/01/20/stop-trump-uk/
