# TruthLens - Data Collection

TruthLens is a project developed for the BSc. Computer Science (Data Science) Final Project (CM3070) at the University of London. TruthLens is based on the Fake News Detection template. 

## Project Objectives
The primary objective of this project is to build a two-stage pipeline for misinformation classification:

1. Binary classification (Stage 1): Distinguish between real news and misinformation using the ISOT dataset. This ensures robust detection at the first stage, leveraging an established dataset.
2. Multi-class classification (Stage 2): Further classify content identified as misinformation into one of four categories, based on an adaptation of Molina et al.’s taxonomy. A custom dataset will support this nuanced classification.
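
The two stages can be read as a simple routing rule: Stage 2 only runs on content that Stage 1 flags as misinformation. The sketch below illustrates that control flow only. The function `classify_two_stage`, the label strings and the two stand-in classifiers are hypothetical placeholders, not the project's trained models.

```python
from typing import Callable, Optional, Tuple

# Hypothetical label names, chosen for illustration only.
REAL = "real"
MISINFORMATION = "misinformation"


def classify_two_stage(
    text: str,
    stage1: Callable[[str], str],
    stage2: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Run Stage 1 on the text, and Stage 2 only if Stage 1 flags it."""
    binary_label = stage1(text)
    if binary_label != MISINFORMATION:
        # Real news never reaches the multi-class stage.
        return binary_label, None
    return binary_label, stage2(text)


# Stand-in classifiers (assumptions): a real pipeline would call trained models.
def demo_stage1(text: str) -> str:
    return MISINFORMATION if "shadow council" in text.lower() else REAL


def demo_stage2(text: str) -> str:
    return "fabricated"


print(classify_two_stage("Shadow Council manipulates global policies", demo_stage1, demo_stage2))
print(classify_two_stage("Parliament passed the budget on Tuesday", demo_stage1, demo_stage2))
```

Passing the classifiers in as callables keeps the routing logic independent of whichever models are eventually chosen for each stage.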

The scope of the project is limited to text-based, English-language content; images and videos are explicitly excluded. A user interface will also be developed, enabling users to input articles or URLs and receive classification results.

A secondary objective is to enhance the explainability of classification results, aiming to provide users with interpretable insights into why content was classified in a particular way.

The project aims for high accuracy and reliability, with measurable performance goals. Ethical considerations, including bias mitigation and responsible dataset usage, will guide the design and implementation of the pipeline.

## Custom dataset generation
As outlined in the previous section, the second stage of the pipeline relies on a custom dataset labelled with categories from the Molina et al. misinformation taxonomy. The taxonomy's classes are summarised in the table below; the rows shown in italics are not used in this project. The aim of this stage is to create a balanced dataset with 400 pieces of content for each of the four chosen categories: fabricated content, polarised content, satire, and commentary.

| Misinformation Type | Characteristics | Example |
|:--------------|:---------------|:-------|
| Fabricated content | Completely false content created with the intent to deceive.| Fake reports of events that never occurred; entirely false claims about public figures |
|Polarised content |True events or facts presented selectively to promote a biased narrative, often omitting critical context. |Partisan news articles highlighting one side of a political argument while ignoring counterpoints.|
|Satire |Content intended to entertain or provoke thought through humour, exaggeration, or irony. Often misunderstood. |Satirical articles from outlets like “The Onion” being shared as if they are factual news.|
|*Misreporting* | *Incorrect information shared unintentionally, often due to errors or lack of verification.* | *A news outlet incorrectly reporting election results due to early or inaccurate data.*|
|Commentary |Opinion-based content reflecting the writer’s interpretation or viewpoint, often lacking factual grounding. |Editorials or blogs expressing subjective opinions without substantial evidence.|
|*Persuasive information* |*Content designed to persuade or influence the audience, often including marketing and propaganda.* |*Politically motivated propaganda campaigns, advertisements disguised as objective news articles.*|
|*Citizen journalism* | *User-generated content that may lack professional journalistic standards, leading to error or bias.* |*Social media posts about breaking news that spread unverified or incorrect details.*|

Data will be scraped from relevant websites or sources for each category, then manually reviewed to ensure that it fits the category. Relevant features and labelling guidelines can be found for each category below. Media bias scores are taken from Ad Fontes Media's Media Bias Chart: https://app.adfontesmedia.com/chart/interactive
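
As a rough way to track progress towards the balanced target of 400 items per category, a small helper can report how many items each class still needs. This is a minimal sketch: the label string `fabricated` matches the `class` value used later in this notebook, while the other three label strings and the helper name `class_shortfall` are assumptions made for illustration.

```python
from collections import Counter

TARGET_PER_CLASS = 400  # target stated in the text above
# "fabricated" appears later in the notebook; the other label strings are assumed.
CLASSES = ["fabricated", "polarised", "satire", "commentary"]


def class_shortfall(labels, classes=CLASSES, target=TARGET_PER_CLASS):
    """Return how many more items each class needs to reach the target."""
    counts = Counter(labels)
    return {c: max(target - counts.get(c, 0), 0) for c in classes}


# Toy label list: one class complete, one partly collected, two empty.
labels = ["fabricated"] * 400 + ["satire"] * 120
print(class_shortfall(labels))
```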

In [1]:
#Imports and helper functions
import requests
import json
from bs4 import BeautifulSoup
import csv
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
import re
import string
import nltk
import ftfy
from nltk.corpus import stopwords
from datetime import datetime
import unicodedata
import unidecode

def preprocess_text(text):
    """
        Preprocesses a given text string by applying the following steps:
        1. Converts the text to lowercase.
        2. Removes punctuation marks.
        3. Tokenizes the text into individual words.
        4. Removes stopwords (common words that add little value to classification tasks).

        Parameters:
        ----------
        text : str
            The input text string to preprocess.

        Returns:
        -------
        str
            The cleaned and preprocessed text, with tokens joined back into a single string.
    """
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

def scrape_multiple_articles(urls, scrape_function):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.
    scrape_function : callable
        The function used to scrape a single URL; it is called as scrape_function(url).

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_function(url)
        articles.append(article)
    return pd.DataFrame(articles)

def clean_text(text):
    """
    Fix encoding issues, normalize unicode characters to plain ASCII,
    and collapse newlines, non-breaking spaces and extra whitespace.
    Non-string input is returned unchanged.
    """
    #print("In cleaning text")
    # Make sure input is a string
    if not isinstance(text, str):
        print("Not text")
        return text
    
    # Fix text encoding issues
    text = ftfy.fix_text(text)
    
    # Normalize to NFKC (to convert the weird Unicode math symbols)
    text = unicodedata.normalize("NFKC", text)
    
    # Remove mathematical alphanumeric symbols
    text = "".join(c for c in text if not (0x1D400 <= ord(c) <= 0x1D7FF))
    
    # Convert fancy symbols to plain ASCII
    text = unidecode.unidecode(text)
    
    # Replace newline characters and non-breaking spaces with a space
    text = text.replace("\n", " ").replace("\xa0", " ")
    
    # Remove any extra whitespace
    text = " ".join(text.split())
        
    return text

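def get_urls_from_txt(filename):
    """Read one URL per line from a text file, skipping blank lines and duplicates."""
    with open(filename, "r") as file:
        urls = [line.strip() for line in file if line.strip()]
    # Drop duplicates while keeping file order (a set would not preserve the order)
    urls = list(dict.fromkeys(urls))
    return urls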

def scrape_articles(urls_file, custom_function, export_file):
    # List of URLs to scrape
    urls = get_urls_from_txt(urls_file)
    # Scrape articles and create a DataFrame
    df = scrape_multiple_articles(urls, custom_function)
    # Store to CSV
    df.to_csv(export_file, index=False)
    # Record the exported file name in the global all_scraped_content list
    all_scraped_content.append(export_file)
    # Print the number of scraped rows
    print("Length: ", len(df))
    # Print the first row
    print(df.head(1))

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
}

#a list to track all the CSV files created with scraped content
all_scraped_content = []
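
To make the normalisation step in `clean_text` more concrete, the sketch below reproduces only its standard-library parts (NFKC normalisation and whitespace collapsing) on a styled string. The helper name `normalise_basic` is hypothetical, and the `ftfy` and `unidecode` steps are deliberately left out so the example has no third-party dependencies.

```python
import unicodedata


def normalise_basic(text: str) -> str:
    """Stdlib-only subset of clean_text: NFKC normalisation plus whitespace collapse."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


# "Hello" written in Mathematical Bold letters (from the U+1D400 block),
# followed by a newline, a non-breaking space and a trailing space.
styled = "\U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428\n\xa0 world "
print(normalise_basic(styled))  # → Hello world
```

NFKC already maps the styled mathematical letters to their plain equivalents, so the later range filter in `clean_text` should only affect characters in that block that NFKC leaves unchanged.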

### 1. Fabricated Content
Completely false content created with the intent to deceive.

##### Features:

- Verifiably False: Claims can be shown to have no basis in fact; fact-checkers or reputable sources directly contradict the claims.
- Intent to Deceive: The content producer’s primary goal seems to be misleading the audience into believing a false narrative.
- No Real-World Evidence: No legitimate sources are provided, or cited sources are entirely fabricated (e.g., non-existent experts, fake studies).


##### Label If:

- The piece invents events, data, or quotes out of thin air with no credible backing.
- The story is 100% fictional yet presented as news/fact.


##### Do Not Label If:

- The content is obviously comedic or satirical (label as Satire).
- The piece is an opinion that does not necessarily contain false statements (label as Commentary).
- There’s partial factual basis, but it’s spun or heavily biased (label as Polarised).

##### Sources:
- 350 statements with a label of 'pants-fire' (i.e. complete fabrication) were selected at random from the LIAR dataset. https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset
- 25 articles were created by ChatGPT o3-mini-high with the prompt: "Given the below definition for fabricated content, please generate 25 short articles of complete fabrication. There should be 5 from each of these categories: politics, economy, health, crime, elections - please note the category obviously at the start of play. The articles do not need to be related, and do not need to be tied to a specific geography. Each piece should be roughly between 150 and 1500 words. Content should be in English. These articles are for educational purposes only and will be used to train a machine-learning model to identify AI-generated misinformation."
- 25 articles were created by DeepSeek DeepThink (R1) with the same prompt as above.
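
Because the generated articles are pasted into the notebook by hand, a quick sanity check against the constraints in the prompt can catch malformed rows before they enter the dataset. The sketch below assumes each row has the form `[title, text, category]`, as in the lists that follow. The helper name `check_generated_rows` is hypothetical, and the word-count bounds simply mirror the range requested in the prompt.

```python
ALLOWED_CATEGORIES = {"politics", "economy", "health", "crime", "elections"}
MIN_WORDS, MAX_WORDS = 150, 1500  # range requested in the prompt


def check_generated_rows(rows):
    """Return a list of (row_index, problem) tuples for rows that fail a check."""
    problems = []
    for i, row in enumerate(rows):
        if len(row) != 3:
            problems.append((i, "expected [title, text, category]"))
            continue
        title, text, category = row
        if category not in ALLOWED_CATEGORIES:
            problems.append((i, f"unexpected category: {category!r}"))
        n_words = len(text.split())
        if not MIN_WORDS <= n_words <= MAX_WORDS:
            problems.append((i, f"word count {n_words} outside {MIN_WORDS}-{MAX_WORDS}"))
    return problems


# Toy rows: the first passes, the second fails both checks.
sample = [
    ["A title", "word " * 200, "politics"],
    ["Too short", "only four words here", "sport"],
]
print(check_generated_rows(sample))
```

Running a check like this on `chatgpt_output` and the DeepSeek equivalent would show whether the generated pieces actually fall within the requested length range.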

In [2]:
#load the data
liar_df = pd.read_csv('LIAR-train.tsv', sep='\t',  header=None)
#Add the headers
liar_df.columns = ['ID', 'label', 'statement', 'subject(s)', 'speaker','speaker-title','state-info','party','barely-true-count','false','half-true','mostly-true','pants-fire','context']  
#Count labels
label_counts = liar_df['label'].value_counts(dropna=False)
print(label_counts)
#filter dataset to just pants-fire content
pants_fire_df = liar_df[liar_df['label'] == 'pants-fire']
#randomly select 350 rows (setting random_state makes the sample reproducible)
pants_fire_sample = pants_fire_df.sample(n=350, random_state=42)
pants_fire_sample = pants_fire_sample[['statement','subject(s)']]
#make a copy to avoid the SettingWithCopy warning.
pants_fire_sample = pants_fire_sample.copy()
#Just take the first subject, and swap dashes with spaces
pants_fire_sample['subject(s)'] = pants_fire_sample['subject(s)'].str.split(',').str[0].str.replace('-', ' ')
#reset index
pants_fire_sample = pants_fire_sample.reset_index(drop=True)
#Display the head
#print(pants_fire_sample.head())
#Create empty dataset for fabricated content
columns = ['title', 'text', 'site', 'date', 'category', 'class', 'url']
fabricated_dataset = pd.DataFrame(columns=columns)
#prepare the LIAR data for the new df
temp_df = pd.DataFrame({
    'title': "",  
    'text': pants_fire_sample['statement'],
    'site': "Liar Database",  
    'date': "February 4th",  
    'category': pants_fire_sample['subject(s)'], 
    'class': "fabricated",
    'url': "https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset"
})
fabricated_dataset = pd.concat([fabricated_dataset, temp_df], ignore_index=True)
print(fabricated_dataset.head(1))

half-true      2114
false          1995
mostly-true    1962
true           1676
barely-true    1654
pants-fire      839
Name: label, dtype: int64
  title  \
0         

                                                                    text  \
0  Ed Perlmutter voted for Viagra for rapists paid for with tax dollars.   

            site          date category       class  \
0  Liar Database  February 4th    crime  fabricated   

                                                                  url  
0  https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset  
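
The `category` value above comes from the chained `.str.split(',').str[0].str.replace('-', ' ')` call on the `subject(s)` column. For a single value, the same transformation can be written in plain Python as follows; the helper name `first_subject` and the sample inputs are illustrative only.

```python
def first_subject(subjects: str) -> str:
    """Keep the first comma-separated subject and replace dashes with spaces."""
    return subjects.split(",")[0].replace("-", " ")


print(first_subject("health-care,taxes"))  # → health care
print(first_subject("crime"))              # → crime
```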


In [3]:
#ChatGPT output
chatgpt_output = [
    ['Shadow Council Manipulates Global Policies','In a stunning revelation that has rocked the global political landscape, insiders have claimed that a secretive group known as the Shadow Council has been orchestrating international policy decisions behind the scenes for over two decades. According to anonymous sources within high-ranking government agencies, this clandestine network meets in undisclosed locations to determine the fate of nations—manipulating economic strategies, military deployments, and diplomatic relations with ruthless precision. One whistleblower, insisting on anonymity, described the council’s gatherings as “a blend of high-level intrigue and covert power plays,” where a handful of elite figures shape world events. Despite a complete lack of verifiable evidence and rebuttals from reputable fact-checkers, rumors persist, stirring suspicion among citizens and igniting fierce debates over the true nature of global governance. Critics demand full transparency, while supporters dismiss the claims as a political witch hunt.','politics'],
    ['The Rise of the Phantom Leader','Reports from undisclosed insiders claim that a mysterious figure—referred to only as the Phantom Leader—has quietly assumed control over several national governments simultaneously. Allegedly emerging from the shadows of political instability, this enigmatic leader is said to have orchestrated a series of covert meetings with influential policymakers in dimly lit back rooms. Documents leaked to a dubious online forum (purportedly authored by “deep-state informants”) suggest that the Phantom Leader’s network manipulates legislative agendas and even directs covert military operations without public knowledge. Despite lacking any credible sources, conspiracy theorists assert that this figure’s influence is so pervasive that major policy shifts and election outcomes across multiple continents can be traced back to secret communications from this single mastermind. Authorities have repeatedly denied any such existence, dismissing the reports as politically motivated fabrications. Nonetheless, the legend of the Phantom Leader continues to fuel debates on the hidden forces controlling modern politics.','politics'],
    ['Fabricated Faction’s Covert Conspiracy Exposed', 'A series of anonymous memos circulating on obscure internet forums have allegedly uncovered a covert conspiracy orchestrated by a fabricated political faction known as the “Crimson Syndicate.” According to these unverified documents, the Crimson Syndicate comprises influential lawmakers and shadowy advisors who purportedly manipulate policy decisions for their own benefit. The memos detail clandestine meetings held in remote, undisclosed locations where members allegedly decide on major legislative actions and orchestrate political scandals to discredit rivals. One particularly detailed memo claims that the Syndicate once arranged the downfall of an entire government cabinet simply to advance its own secret agenda. While no reputable news outlet or independent fact-checker has confirmed any part of this narrative, the circulating documents have nevertheless sparked heated discussions on social media and among fringe political groups. Detractors dismiss the allegations as wild fabrications, yet the growing fascination with the Crimson Syndicate continues to captivate those eager to believe in hidden, all-powerful networks in the realm of politics.','politics'],
    ["Hidden Alliances in the Corridors of Power", "In a narrative that sounds more like a spy thriller than reality, leaked “insider” communications now allege the existence of hidden alliances among top government officials across multiple nations. According to these fabricated sources, secret meetings held in luxurious, undisclosed locations have resulted in a series of backdoor pacts designed to bypass democratic processes. The documents—a mixture of blurry photographs, cryptic emails, and questionable “eyewitness” accounts—claim that leaders from different countries conspire to ensure their mutual benefit, often at the expense of public welfare. One source, identified only by the pseudonym “Nightwatcher,” asserts that these covert gatherings have influenced major global events, including trade wars and military escalations, with no oversight or accountability. Critics argue that the evidence is entirely manufactured, yet the tale of clandestine pacts behind closed doors continues to circulate widely, feeding the narrative that true power resides not in publicly elected officials but in secret alliances hidden in the corridors of power.",'politics'],
    ['Government Secrets Unveiled by Whistleblowers','A series of explosive revelations by alleged whistleblowers has ignited controversy in political circles, with claims that top government officials have been concealing vast amounts of classified information from the public. According to the fabricated reports, these officials have engaged in a deliberate cover-up involving the manipulation of policy outcomes, the redirection of public funds, and the orchestration of international incidents to distract from domestic mismanagement. Leaked documents—purportedly obtained through highly secretive channels—purport to show that covert committees operate independently of elected representatives, making decisions that affect millions without any form of public scrutiny. One anonymous source claimed that a secret “Transparency Committee” exists solely to fabricate narratives that support the government’s agenda. Although no hard evidence has emerged and fact-checkers have thoroughly debunked the claims, the idea of hidden governmental secrets continues to resonate with a segment of the population that remains deeply distrustful of official narratives.','politics'],
    ["The Secret Currency That Could Change the World","In a story that has captured the imaginations of economic conspiracy theorists everywhere, unverified sources have alleged the existence of a hidden global currency engineered by an elite cabal. Dubbed the “Phantom Coin,” this secret form of money is said to circulate only among the world’s most powerful financial institutions, outside the purview of national regulators and international oversight. According to the fabricated narrative, the Phantom Coin was created as a tool to destabilize traditional monetary systems and establish a new world order based on clandestine financial control. Anonymous insiders claim that this digital currency is already in circulation, used to facilitate secret transactions and influence economic policies in various countries. Although mainstream economists and banking authorities have dismissed these assertions as complete fabrications, the idea of a hidden monetary system has fueled heated debates on online forums and in underground economic circles. Critics argue that the concept of a global secret currency is nothing more than a cleverly constructed myth, designed to incite distrust in established financial institutions.","economy"],
    ["Hidden Financial Collapse Engineered by Elites", "A series of unsubstantiated leaks has sent shockwaves through online financial communities, with claims that a shadowy group of financial elites has orchestrated a deliberate plan to trigger a global economic collapse. According to the fabricated documents circulating on encrypted messaging apps, these elites have been manipulating stock markets, interest rates, and international trade agreements for decades. The conspiracy theory posits that by engineering an economic meltdown, this cabal intends to seize control of national economies and install a new financial system under their complete dominion. One anonymous source, signing off as “The Insider,” detailed how secret meetings held in undisclosed locations allegedly laid out a blueprint for the collapse, complete with timelines and specific economic indicators. Despite the lack of any credible evidence or confirmation from reputable institutions, the narrative has taken on a life of its own among conspiracy theorists. Mainstream experts have categorically rejected the theory, but the allure of a hidden hand guiding global economics continues to fascinate and alarm many.","economy"],
    ["The Phantom of Market Manipulation", "Recent reports from mysterious online channels claim that a covert group known as “The Phantom” has been secretly manipulating global stock markets to create artificial booms and busts. According to the entirely fabricated story, this group uses advanced algorithms and insider access to orchestrate dramatic swings in market values, ensuring that only a select few reap enormous profits while ordinary investors suffer severe losses. Leaked “evidence” in the form of blurry screenshots and unverified emails purport to show that major market indices were deliberately skewed during key financial events over the past decade. Conspiracy theorists argue that The Phantom’s actions are responsible for several notorious market crashes, though no reputable financial analyst or regulator has ever confirmed any such scheme. Instead, critics dismiss the allegations as modern folklore—a narrative designed to explain the often unpredictable nature of global finance. Nonetheless, the legend of The Phantom continues to spread across online communities, feeding the belief that the markets are secretly rigged by unseen forces.","economy"],
    ["Underground Trade Networks Revealed", "Whispers of an extensive underground trade network have recently surfaced in a series of online posts that claim to expose an elaborate system of secret deals and backdoor negotiations among multinational corporations and government insiders. According to these unverified accounts, this network—codenamed “Black Route”—is responsible for smuggling vital commodities, manipulating supply chains, and controlling prices on a global scale. Fabricated documents allegedly leaked from an anonymous source suggest that Black Route operates with near-impunity, using encrypted communication channels and hidden financial conduits to bypass international regulations. The posts detail intricate schemes involving fake invoices, shadow accounts, and secret meetings in remote warehouses. Despite the dramatic narrative, established trade experts and economic analysts have refuted the existence of any such network, attributing the claims to baseless rumors and intentional disinformation. Yet the allure of a hidden economic underworld continues to captivate the imaginations of those distrustful of global financial systems, even as authorities dismiss the reports as entirely fictitious.","economy"],
    ["Fake Economic Forecasts Uncovered by Investigative Reporters", "A recently circulated dossier—allegedly compiled by a group of rogue investigative reporters—claims that some of the world’s most prominent economic forecasts are nothing but elaborate fabrications designed to mislead the public and manipulate market sentiment. According to this entirely fabricated report, influential think tanks and financial institutions have conspired to publish optimistic projections despite mounting evidence of economic instability. The dossier asserts that behind the scenes, a secretive committee of experts is altering data and suppressing negative information to maintain investor confidence and secure lucrative financial deals. Interviews quoted in the dossier (all of which are untraceable) describe how internal memos instruct analysts to “spin the narrative” during times of economic downturn. While mainstream economists and reputable media outlets have thoroughly debunked these claims, the narrative has found traction on social media and alternative news platforms. Critics argue that the story is a carefully constructed piece of misinformation aimed at sowing distrust in established economic institutions and their published forecasts.","economy"],
    ["Miracle Cure or Conspiracy? The Hidden Truth", "A bombshell report circulating in underground online communities alleges that a revolutionary “miracle cure” for multiple chronic illnesses has been discovered in a secret laboratory—but that the cure is being deliberately suppressed by powerful pharmaceutical interests. According to the fabricated narrative, researchers at a clandestine facility in an undisclosed location developed a treatment that can reverse conditions ranging from diabetes to autoimmune disorders. Whistleblowers (whose identities remain unverified) claim that multinational drug companies, fearing a catastrophic loss of profits, have conspired to bury the research and discredit its findings. Detailed, though entirely fictional, documents describe covert meetings between executives and government regulators where plans were hatched to discredit the miracle cure through a series of “controlled clinical failures.” Despite the dramatic claims, no reputable medical journal or regulatory agency has ever confirmed the existence of such a treatment. Nonetheless, the story has ignited fervent discussion among alternative health advocates and conspiracy theorists, with many calling for independent investigations into the alleged cover-up.","health"],
    ["Government-Secret Vaccines and the Hidden Agenda", "In a narrative that has rapidly spread through social media channels, unverified sources now claim that several governments have developed secret vaccines—not to combat diseases, but to implant mind-control nanobots in unsuspecting citizens. According to the entirely fabricated account, these covert vaccines were engineered in hidden research facilities and are being distributed covertly alongside routine immunizations. Insiders allege that top government officials have conspired with shadowy biotech firms to implement the program as part of a larger scheme to control public behavior and suppress dissent. Detailed but unverifiable “leaks” include diagrams of nanobot technology and supposed internal memos outlining the project’s phases. Public health authorities and independent scientists have dismissed the claims as absurd and lacking any empirical basis, yet the narrative continues to fuel heated debates online. The story has become a rallying cry for those suspicious of government overreach, even as experts warn that the entire account is a complete fabrication designed to stoke fear and mistrust in established health institutions.","health"],
    ["The Fabricated Epidemic That Never Was","A recent series of posts on fringe health forums has claimed that an epidemic sweeping the globe is nothing more than a carefully orchestrated fabrication by international health agencies. According to these unfounded accounts, the so-called outbreak of a novel virus was deliberately invented to enforce draconian public health measures and expand governmental control over citizens’ lives. Fabricated “data” presented in the posts—including manipulated graphs and fake expert testimonies—purports to show that infection rates were grossly exaggerated and that the virus was engineered in a laboratory as part of a secret experiment. Despite overwhelming evidence to the contrary provided by reputable global health organizations, the narrative has gained traction among communities predisposed to distrust official sources. Detractors of the mainstream narrative argue that the epidemic is a hoax designed to justify unprecedented restrictions on personal freedom. While scientists and public health experts have thoroughly debunked the claims, the rumor of a fabricated epidemic persists as one of the most controversial and persistent conspiracy theories in the health sphere.", "health"],
    ["Shadow Health Organization Controlling Treatments", "A startling claim emerging from anonymous online sources alleges that a clandestine organization, known only as the “Global Health Directorate,” is secretly controlling all aspects of medical research and treatment protocols worldwide. According to this entirely fabricated narrative, the Directorate operates behind the scenes to determine which diseases receive funding for research and which innovative treatments are suppressed to protect the interests of certain pharmaceutical giants. Leaked “internal documents” (all of which have been debunked by experts) supposedly reveal that this shadow group manipulates clinical trial outcomes and deliberately withholds breakthrough therapies from the public. One supposed insider explained that the Directorate’s ultimate goal is to monopolize global healthcare, ensuring that all new treatments funnel profits exclusively to a handful of powerful corporations. While mainstream scientists and healthcare professionals have dismissed these claims as pure fantasy, the idea of a hidden health organization continues to resonate with individuals suspicious of modern medicine and its regulatory framework.","health"],
    ["The Pseudoscientific Breakthrough that Shocked Experts", "A recent online buzz has centered on reports of a pseudoscientific breakthrough—allegedly discovered by a renegade group of researchers—that claims to reverse aging and cure terminal illnesses in a single treatment. According to the fabricated account, the breakthrough involves a complex combination of gene therapy and nanotechnology, developed in a secret laboratory hidden beneath an abandoned industrial complex. The story goes on to assert that leading medical experts worldwide have been silenced or coerced into keeping the discovery under wraps, with influential institutions allegedly colluding to protect lucrative existing treatments. Detailed but entirely spurious “research notes” and blurry laboratory images have been circulated to support the claim. Despite the sensational nature of the announcement, no peer-reviewed studies or independent verifications exist to corroborate the story. Health authorities and renowned scientists have categorically refuted the claims, calling the report a dangerous piece of misinformation designed to exploit public hopes for miraculous cures.","health"],
    ["The Mastermind Behind the Global Heist","In an astonishing tale that sounds straight out of a blockbuster movie, unverified sources have alleged the existence of a criminal mastermind orchestrating a series of sophisticated heists across multiple continents. Dubbed “The Phantom Thief” by underground circles, this enigmatic figure is said to have masterminded daring robberies targeting high-security financial institutions and luxury art galleries alike. According to the fabricated narrative, The Phantom Thief utilizes an intricate network of accomplices and cutting-edge technology to bypass state-of-the-art security systems. Leaked “confidential reports” (entirely unverifiable) claim that the mastermind’s operations are so meticulously planned that law enforcement agencies remain one step behind at every turn. One anonymous tipster described a dramatic scene in which the criminal escaped using an elaborate series of decoys and underground tunnels. Despite widespread media interest and online chatter, no credible evidence supports these claims, and authorities have repeatedly dismissed the story as an elaborate fabrication. Nonetheless, the legend of The Phantom Thief continues to capture the imagination of both criminals and crime enthusiasts.", "crime" ],
    ["The Cyber Syndicate and the Digital Black Market", "A series of posts on dark web forums has recently brought attention to an alleged cyber syndicate that is said to run an expansive digital black market, controlling vast networks of hackers and cybercriminals. According to the entirely fabricated narrative, this syndicate—known only as “Digital Dominion”—is responsible for orchestrating large-scale data breaches, identity thefts, and even orchestrated cyberattacks on critical infrastructure. The story details how Digital Dominion supposedly recruits skilled hackers from around the globe, providing them with state-of-the-art tools and secretive training in return for a share of their illicit profits. Leaked “evidence” in the form of anonymized chat logs and cryptic online transactions has fueled speculation about the syndicate’s influence over modern cybercrime. Despite the dramatic claims, no law enforcement agency has confirmed the existence of such an organization, and cybersecurity experts have dismissed the narrative as a myth designed to instill fear. Nevertheless, the notion of a centralized cybercriminal empire continues to spread rapidly among online communities, adding fuel to debates about digital security.","crime"],
    ["Fake Evidence Links Celebrity to Crime Ring", "A scandalous claim has emerged from questionable online sources alleging that a world-renowned celebrity is secretly involved in an international crime ring. According to the fabricated report, the star—whose identity remains deliberately vague—has been linked through a series of doctored documents, manipulated photographs, and untraceable phone recordings to an underground network involved in money laundering and arms trafficking. The narrative suggests that the celebrity’s public persona is merely a facade, carefully crafted to conceal a far more sinister involvement in organized crime. Despite the sensational nature of the claim, independent investigations by reputable outlets have found no supporting evidence, and multiple fact-checking organizations have debunked the story as a fabrication. Nonetheless, the tale has ignited fervent debate on social media, with supporters insisting that the “evidence” is being suppressed by powerful interests intent on protecting high-profile figures. Critics argue that the entire narrative is a calculated piece of misinformation designed to smear reputations and distract from real criminal investigations.","crime"],
    ["The Underworld’s Hidden Code of Silence", "Whispers from the criminal underworld have given rise to a fabricated narrative detailing an alleged secret code of silence that binds organized crime groups across continents. According to the entirely unverified account, this so-called “Code of Shadows” mandates that members of illicit organizations adhere to strict rules of non-disclosure about their operations, with severe—and entirely invented—consequences for any breaches. Leaked “testimonies” from anonymous ex-criminals (whose identities cannot be confirmed) claim that this code is enforced through a network of vigilante enforcers operating outside the law. The report further asserts that this clandestine system has allowed crime syndicates to thrive, coordinating complex operations such as international drug trafficking, cybercrimes, and high-stakes robberies without fear of exposure. While law enforcement officials have long acknowledged the existence of informal codes among criminals, no evidence has ever substantiated the detailed version of the Code of Shadows described in these posts. Nevertheless, the story has captured the public’s imagination, fueling both fear and fascination with the hidden rules of the underworld.","crime"],
    ["Alleged Supernatural Connection in Organized Crime", "In a bizarre twist that has stirred both intrigue and skepticism, unverified online sources claim that an otherworldly element is at work within organized crime circles. According to this fabricated narrative, certain notorious crime families are rumored to have forged secret pacts with mysterious, supernatural entities in exchange for uncanny success in their illicit endeavors. The story describes eerie rituals performed in abandoned warehouses under moonlit skies, where members of these crime families allegedly invoke ancient forces to secure their power and evade capture by authorities. Detailed but entirely fictional accounts include descriptions of cryptic symbols, mysterious chants, and inexplicable phenomena witnessed during criminal operations. While no credible evidence or expert testimony supports any supernatural involvement in crime, the tale has rapidly spread through niche internet forums and alternative news sites. Skeptics dismiss the narrative as pure fantasy, yet its persistence highlights the human tendency to weave extraordinary explanations around the most enigmatic and frightening aspects of criminal life.","crime"],
    ["AI-Driven Vote Rigging Uncovered", "A startling claim emerging from shadowy online sources alleges that recent elections in multiple countries were manipulated using advanced artificial intelligence systems designed specifically for vote rigging. According to the entirely fabricated report, an underground network of tech experts and political operatives developed a sophisticated AI program that could alter digital ballots and even sway public opinion through targeted disinformation campaigns. Leaked “internal communications” (all of which lack any credible origin) detail how this system was deployed during key electoral cycles to produce results favorable to a select group of political elites. The report asserts that the AI not only manipulated vote counts but also fabricated evidence of voter fraud to justify its interference. While election officials and independent watchdog organizations have vehemently denied any involvement of AI in vote manipulation, the narrative has ignited fierce debates online. Critics dismiss the allegations as modern myth-making, yet the idea of a clandestine, algorithm-driven election interference continues to find an audience among those distrusting traditional democratic processes.","elections"],
    ["Hidden Ballots and Phantom Voters", "In a narrative that has rapidly spread through fringe political blogs, unverified sources now claim that a secretive scheme involving hidden ballots and phantom voters was implemented during recent national elections. According to the fabricated account, shadow operatives allegedly inserted fake ballots into the voting system, and entirely fictitious voter identities were created to sway the outcome in key districts. Detailed but entirely false “evidence”—including manipulated voter records and doctored official documents—purports to show that thousands of non-existent citizens were added to the rolls, tipping the scales in favor of a prearranged result. The story asserts that these phantom voters were registered using advanced data manipulation techniques, and that the entire operation was coordinated from undisclosed headquarters by a covert group of political insiders. While election authorities have consistently maintained that voter registration and ballot counting were conducted transparently and accurately, the rumor of hidden ballots and ghost voters continues to spark controversy. Skeptics warn that such narratives are dangerous fabrications intended to undermine public confidence in democratic institutions.","elections"],
    ["The Secret Software Behind Election Fraud", "A fabricated exposé circulating on alternative news platforms alleges that the integrity of recent elections was compromised by secret software embedded in voting machines. According to the entirely unverified report, a rogue group of software engineers collaborated with political operatives to install a hidden program capable of altering vote totals in real time. Detailed descriptions in the report claim that the software was designed to target specific precincts and switch votes from opposition candidates to those favored by the conspirators. Anonymous “insiders” (whose identities remain unverifiable) provided screenshots and technical schematics to support the claim, though none have been authenticated by independent experts. Election officials have categorically denied any tampering with voting equipment, yet the narrative persists among groups that already harbor deep suspicions of electoral fraud. While mainstream media and cybersecurity professionals dismiss the allegations as a digital-age urban legend, the story has fueled ongoing debates about the security and transparency of modern voting systems.","elections"],
    ["International Conspiracy Alters Poll Results", "A sensational claim has emerged from obscure online communities alleging that an international conspiracy was behind the manipulation of poll results in recent elections. According to this fabricated narrative, a coalition of foreign intelligence agencies and political operatives conspired to alter vote tallies through covert operations, including hacking voting systems and deploying disinformation campaigns across borders. The report—supported by entirely unsubstantiated “leaked” documents and cryptic video footage—purports to show that the conspiracy was orchestrated from hidden command centers located in various parts of the world. Proponents of the story argue that the altered results were part of a larger plan to undermine national sovereignty and install puppet governments. Despite repeated denials from official election commissions and independent international observers, the narrative continues to gain traction among segments of the public already inclined to distrust electoral processes. Experts, however, maintain that there is no credible evidence of any such international interference, calling the story a complete fabrication designed to stoke geopolitical paranoia.","elections"],
    ["The Unseen Hand Steering Democracy", "In a final explosive installment of fabricated election conspiracies, unverified online sources claim that an unseen hand has been subtly steering democratic outcomes for decades. According to the entirely fictitious report, a secret cabal of influential figures—including undisclosed political advisors, wealthy oligarchs, and covert intelligence operatives—has been manipulating voter sentiment and election results from behind the scenes. Detailed accounts in the report describe how this cabal allegedly funds political campaigns, engineers media narratives, and even tampers with ballot-counting machines to ensure desired outcomes. The narrative is supported by a series of dubious “eyewitness” testimonies and manipulated documents that purport to reveal a long-standing pattern of covert intervention in democratic processes. While election experts and historians have long refuted such sweeping claims, the story of an unseen hand controlling the destiny of nations continues to resonate with those disillusioned by modern politics. Critics argue that the tale is a carefully constructed piece of misinformation intended to erode public trust in the very foundations of democracy.", "elections" ]
]
#DeepSeek output
deepseek_output = [
    ["World Leader Secretly Funds Alien Technology Research, Leaked Docs Claim", "A classified dossier allegedly reveals that the leader of a major European nation diverted €800 million in public defense funds to a clandestine extraterrestrial tech program. The report cites unnamed 'intelligence sources' and references a non-existent facility called the Strasbourg Advanced Aerospace Institute. Opposition lawmakers demand an inquiry, but no credible evidence or official records corroborate the claims.","politics"],
    ["Pacific Island Nation Declares War on Canada Over Fishing Rights","Fabricated diplomatic cables suggest the tiny nation of Maritana threatened military action against Canada after accusing it of illegal deep-sea trawling. The story cites a fake Global Oceanic Rights Council report and a fictional Maritanian official, 'Minister Koa Tala.' No such dispute exists, and Maritana is not a recognized country.","politics" ],
    ["UN Secretary-General Arrested for Espionage, Anonymous Sources Allege","An unsigned blog post claims UN Secretary-General António Guterres was detained in a joint CIA-Russian operation for “selling state secrets.” The article quotes a non-existent Interpol warrant and a phantom “Geneva Security Summit” attendee. The UN has debunked the story as baseless.", "politics" ],
    ["Secret Pact Reveals Plans to Merge US and Mexico into ‘North American Union’", "A fringe website alleges that President Biden and Mexican President López Obrador signed a treaty to dissolve borders by 2028, backed by a forged document bearing fake seals. The hoax cites the Institute for Continental Integration, a think-tank that does not exist.", "politics"],
    ["Australia’s PM Found to Have Dual Citizenship of Nonexistent Country", "A viral post asserts Australian Prime Minister Anthony Albanese holds citizenship in Veridia, a fictional island nation. The claim relies on a Photoshopped passport and a fabricated International Citizenship Database. Australia’s government confirmed no such country is recognized.", "politics"],
    ["Gold to Be Outlawed as Global Currency Shift Begins", "A conspiracy outlet warns that the World Financial Authority (WFA) will ban private gold ownership in 2024 to pave the way for a digital currency. The WFA is fictitious, and no such policy proposals exist from real entities like the IMF or World Bank.","economy"],
    ["China’s Economy Collapses After ‘Black Monday’ Stock Market Crash", "A fake news site reports a 40% plunge in Shanghai stocks, attributing it to a nonexistent “debt contagion.” The article quotes “economist Dr. Li Wen” and the Asian Fiscal Stability Board, both fabricated. Actual Chinese markets showed no unusual activity.","economy"],
    ["New Global Tax Will Charge 5% on All Online Purchases, UN Announces", "A fraudulent press release claims the UN approved a universal e-commerce tax to fund climate initiatives. The document references a non-existent resolution (UN-2023/TCX) and a fake UN department. The UN confirmed no such tax exists.","economy"],
    ["Bitcoin Banned Worldwide After Secret G7 Summit","A clickbait article alleges G7 leaders agreed to criminalize cryptocurrency transactions under a clandestine “Operation Blockchain Shield.” The story cites anonymous “G7 insiders” and a phantom regulatory body, the Global Digital Asset Bureau.", "economy" ],
    ["Major Bank Announces Negative Interest Rates for Savings Accounts", "A spoofed JPMorgan Chase memo circulating online claims the bank will charge customers 2% annually to hold savings. The fake notice includes a forged signature from CEO Jamie Dimon. JPMorgan denied the policy, calling it “pure fiction.”","economy"],
    ["Vaccine Causes Infertility in 70% of Recipients, Fake Study Claims", "A debunked paper from the fabricated European Medical Review falsely links COVID-19 vaccines to infertility. The study, authored by “Dr. Erik Voss” of the nonexistent Berlin Institute of Virology, cites anonymous patient surveys. No peer-reviewed research supports this.","health"],
    ["Deadly ‘Zombie Virus’ Spreads in South America, WHO Warns", "A hoax article describes a fictional outbreak of Cortazar Virus, causing “aggressive behavior and organ failure.” It quotes a fake WHO spokesperson, “Dr. Amara Singh,” and a non-existent health alert. The WHO confirmed no such virus exists.", "health"],
    ["Common Food Additive Linked to Brain Damage, Researchers Find", "A pseudoscientific blog claims titanium dioxide (E171) causes dementia, citing a fake Global Food Safety Alliance study. The article invents a “Dr. Lisa Tanaka” and misrepresents actual E171 research, which finds no such link.","health"],
    ["Cancer Cure Discovered in Mushroom Species, But Big Pharma Suppresses It", "A conspiracy theory alleges the Amazonian Luminescent Shroom eliminates tumors but is withheld by drug companies. The story references a nonexistent Journal of Oncology Advances paper and a fictional researcher, “Dr. Carlos Mendez.”","health"],
    ["Airborne HIV Variant Detected in Europe, Health Officials Panic", "A fabricated alert from the European Center for Disease Prevention warns of a mutated HIV strain spreading via coughs. The report cites fake case numbers in Spain and France. Actual HIV cannot transmit through airborne particles.","health"],
    ["AI-Powered Robots Commit $1 Billion Bank Heist in Singapore","A tabloid claims hackers deployed autonomous robots to loot the United Pacific Bank. The story quotes a nonexistent CyberCrime Task Force investigator, “Agent Maya Lee,” and provides no police reports or bank confirmations.","crime"],
    ["Serial Killer Targets Only Left-Handed Victims, Police Say","A false crime bulletin describes a fictional murderer dubbed “The Southpaw Slayer” operating in Argentina. The article cites a phantom Buenos Aires police captain, “Inspector Raul Gomez,” and fabricated victim profiles. No such cases exist.","crime"],
    ["Prison Break in Norway: 200 Inmates Escape Using Underground Tunnels","A sensationalized piece alleges inmates at Oslo’s Fjord Maximum Security Prison dug a mile-long tunnel. The story references a fake warden, “Henrik Dahl,” and includes AI-generated images of the escape. Norwegian authorities confirmed all prisons are secure.","crime"],
    ["Mafia Develops Invisible Drug Smuggling Drones, Interpol Warns","A conspiracy site reports organized crime groups using “cloaked drones” to traffic narcotics. The article cites an unnamed Interpol official and a nonexistent tech firm, StealthCargo Inc. Interpol denied issuing any such alert.","crime"],
    ["Celebrity Chef Kidnapped by Vegan Extremist Group","A fake news outlet claims Gordon Ramsay was abducted by the Vegan Justice Army demanding he stop serving meat. The hoax includes a forged ransom note and a fabricated spokesperson, “Ava Green.” Ramsay’s team confirmed his safety.","crime"],
    ["Voter Fraud Uncovered: 1 Million Fake Ballots Found in Warehouse","A far-right blog alleges a warehouse in Texas stored counterfeit ballots for the 2024 election. The story cites an anonymous “election integrity group” and a fake address. State officials confirmed no ballots were found.","elections"],
    ["Candidate Drops Out After Secret Love Child Scandal","A smear article accuses a fictional Canadian MP, “Sarah Clarke,” of concealing a child with a staffer. The piece uses a doctored photo and quotes a nonexistent tabloid, Ottawa Exposé. Clarke is not a real politician.","elections"],
    ["Foreign Agents Infiltrate Voting Systems in 12 States, FBI Claims","A disinformation campaign alleges Russian hackers compromised U.S. voting machines. The article references a fake FBI memo and a phantom cybersecurity firm, ShieldWall Analytics. The FBI stated no breaches occurred.","elections"],
    ["AI-Generated Candidate Wins Local Election in New Zealand", "A satirical claim repurposed as news states an AI persona named “Polly” won a mayoral race in Christchurch. The story cites a fake election commission report and a non-existent AI company, VoteBot Inc. No such election took place.","elections"],
    ["Election Postponed Indefinitely Due to ‘National Security Threat’","A fabricated emergency decree alleges India delayed its 2024 elections over a bogus “terror plot.” The article quotes a fictional home ministry official, “Rajeev Kapoor,” and provides no credible sources. Indian officials denied the claim.","elections"]
]
#add static values to track where each came from
chatgpt_output = [item + ["ChatGPT","chatgpt.com"] for item in chatgpt_output]
deepseek_output = [item + ["DeepSeek","chat.deepseek.com"] for item in deepseek_output]

#combine the outputs from the different LLMs
llm_output = chatgpt_output + deepseek_output

#create a DataFrame from the list
llm_df = pd.DataFrame(llm_output, columns=['title', 'text', 'category','site','url'])

#add the constants
llm_df['date'] = "February 4th"
llm_df['class'] = "fabricated"

#reorder the columns
llm_df = llm_df[['title', 'text', 'site', 'date', 'category', 'class', 'url']]

#concatenate all fabricated data
fabricated_dataset = pd.concat([fabricated_dataset, llm_df], ignore_index=True)
print("Length: ", len(fabricated_dataset))

#store to CSV
fabricated_dataset.to_csv("fabricated_articles.csv", index=False)
all_scraped_content.append("fabricated_articles.csv")

Length:  400


### 2. Polarised content
Polarised content presents true events or facts selectively to promote a biased narrative, often omitting critical context.

##### Features:
- Partial Truth: The piece is based on a real event, statistic, or quote.
- Omission / Distortion: The content emphasizes certain facts while ignoring or minimizing others, creating a skewed impression.
- Strong Bias: The language or framing clearly supports one political, ideological, or partisan stance, rather than offering balanced coverage.

##### Label if:
- The article references real events but uses them to push a strong, one-sided narrative.
- The content focuses on data or testimonies that bolster a specific stance while disregarding contradictory evidence.
- The tone or style is heavily partisan and attempts to sway opinion by selective fact usage rather than outright fabrication.

##### Do Not Label if:
- The core facts are outright false (label as Fabricated).
- It is primarily personal opinion or commentary without strong factual references (label as Commentary).
- It is purely an attempt at persuasion or advertising without misrepresenting an event (label as Persuasive).

##### Sources:
- The Conservative Woman (UK, Media bias: far right) https://www.conservativewoman.co.uk/ (100 articles)
- The Canary (UK, Media bias: left) https://www.thecanary.co/uk/ (100 articles)
- Breitbart (USA, Media bias: far right) https://www.breitbart.com/ (100 articles)
- Daily Kos (USA, Media bias: far left) https://www.dailykos.com/ (100 articles)
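The 100-articles-per-source target above can be checked once the per-site files are combined. The following is a minimal sketch, not part of the original notebook: the site labels in `EXPECTED_PER_SITE` are assumptions (the real values depend on each page's `og:site_name` tag), and `quota_mismatches` is a hypothetical helper name.

```python
from collections import Counter

# Assumed site labels; the scraped values come from og:site_name and may differ.
EXPECTED_PER_SITE = {
    "The Conservative Woman": 100,
    "The Canary": 100,
    "Breitbart": 100,
    "Daily Kos": 100,
}

def quota_mismatches(site_labels, expected=EXPECTED_PER_SITE):
    """Return {site: (actual, target)} for every site whose count is off target."""
    counts = Counter(site_labels)  # missing sites count as 0
    return {
        site: (counts[site], target)
        for site, target in expected.items()
        if counts[site] != target
    }
```

Any iterable of labels works, so a DataFrame column such as `df["site"]` could be passed directly; an empty dictionary means every source met its target.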

**The Conservative Woman**

Articles were scraped from the weekly "Our Top Ten Articles of the Week" series, starting from the January 11, 2025 edition (https://www.conservativewoman.co.uk/tcw-our-top-ten-articles-of-the-week-9/), ending on the February 22 edition.

A large number of articles were skipped. "Features" and "Family and Faith" articles were skipped as they are not news. Many of the other articles did not meet the criteria for labelling, instead falling under Commentary, for example: https://www.conservativewoman.co.uk/wind-turbines-and-a-voice-in-the-wilderness/ These were primarily recognised by a focus on "I" and "me" in the text.
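The first-person cue described above could be approximated in code. This is an illustrative sketch only: the pronoun list, the 2% threshold, and the function names are assumptions, and the text above does not say whether the original screening was automated.

```python
import re

# Hypothetical pronoun list and threshold; neither comes from the labelling guide.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def first_person_ratio(text: str) -> float:
    """Return the share of word tokens that are first-person singular pronouns."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    # Treat contractions such as "I'm" as their pronoun stem ("i").
    hits = sum(1 for t in tokens if t.split("'")[0] in FIRST_PERSON)
    return hits / len(tokens)

def looks_like_commentary(text: str, threshold: float = 0.02) -> bool:
    """Flag text for manual review when first-person usage reaches the threshold."""
    return first_person_ratio(text) >= threshold
```

A flag from `looks_like_commentary` would only suggest a candidate for the Commentary class; the final label would still need a manual read against the criteria above.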

In [4]:
def scrape_tcw_article(url):
    """
    Scrapes an article from a given URL on conservativewoman.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised", 
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        # Remove the trailing site name
        suffix = " - The Conservative Woman"
        if article_data["title"].endswith(suffix):
            # Slice off only the trailing suffix; str.replace would also remove earlier occurrences
            article_data["title"] = article_data["title"][:-len(suffix)]
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        yoast_script = soup.find("script", class_="yoast-schema-graph", type="application/ld+json")
        if yoast_script:
            try:
                data_json = json.loads(yoast_script.string)
                for node in data_json.get("@graph", []):
                    if node.get("@type") == "Article":
                        art_sec = node.get("articleSection", None)
                        if art_sec:
                            if isinstance(art_sec, list):
                                article_data["category"] = art_sec[0]
                            else:
                                article_data["category"] = art_sec
                        break
            except json.JSONDecodeError:
                print("Could not parse the JSON-LD correctly.")
        
        # Article copy
        content_div = soup.find("div", class_=lambda c: c and "td-post-content" in c)
        if content_div:
            # Collect paragraphs
            paragraphs = content_div.find_all("p")
            text_list = []
            for p in paragraphs:
                text = p.get_text(strip=True)
                # End before the donation paragraph
                if text.startswith("If you appreciated this article, perhaps you might consider making a donation"):
                    break  
                text_list.append(text)
            #join all paragraphs together
            full_text = " ".join(text_list).strip()
            # Remove web addresses using a regex
            full_text = re.sub(r'https?://\S+', '', full_text)    
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    return article_data

scrape_articles("conservativewoman.txt", scrape_tcw_article, "polarised_scraped_articles_tcw.csv")

Length:  100
                               title  \
0  The real reason for Apocalypse LA   

**The Canary**

Articles were scraped from the UK section of The Canary (https://www.thecanary.co/uk/) from newest to oldest. The article date range is January 7th to January 29th 2025. Five articles were excluded for not meeting the labelling criteria (advertorials and articles focused on getting users to sign a petition).

In [5]:
def scrape_can_article(url):
    """
    Scrapes an article from a given URL on https://www.thecanary.co/uk/ and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Remove tweet embeds
            for twitter_blockquote in soup.find_all('blockquote', class_='twitter-tweet'):
                twitter_blockquote.decompose()
            # Remove ad elements
            for ads_div in soup.find_all('div', class_='ads_google_ads'):
                ads_div.decompose()

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
            
            # Category
            category_found = None
            yoast_script = soup.find('script', class_='yoast-schema-graph', type='application/ld+json')
            if yoast_script:
                try:
                    yoast_data = json.loads(yoast_script.string)
                    for item in yoast_data.get('@graph', []):
                        if item.get('@type') == 'NewsArticle':
                            section = item.get('articleSection')
                            if section:
                                if isinstance(section, list) and len(section) > 0:
                                    category_found = section[0].strip()
                                elif isinstance(section, str):
                                    category_found = section.strip()
                                break
                except json.JSONDecodeError:
                    pass
            # If we never found a category, use a default
            if category_found:
                article_data["category"] = category_found
            else:
                article_data["category"] = "Category not found"
            
            # Article copy
            article_body = soup.find('div', class_='jeg_inner_content')
            featured_image_patterns = [
                re.compile(r'^Featured image via .*$', re.IGNORECASE),
                re.compile(r'^Featured image supplied', re.IGNORECASE),
                re.compile(r'^Featured image and additional images via .*$', re.IGNORECASE),
                re.compile(r'^Featured image and additional images supplied$', re.IGNORECASE)
            ]
            if article_body:
                paragraphs = article_body.find_all('p')
                text_content = []
                
                for p in paragraphs:
                    p_text = p.get_text().strip()
                    # Skip image-credit paragraphs rather than reading from a decomposed tag
                    if any(pattern.match(p_text) for pattern in featured_image_patterns):
                        continue
                    if p_text:
                        text_content.append(p_text)
                
                full_article = " ".join(text_content) if text_content else "Article content not found"
                article_data["text"] = clean_text(full_article)
            else:
                article_data["text"] = "Article content not found"
        
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")
    return article_data


scrape_articles("canary.txt", scrape_can_article, "polarised_scraped_articles_can.csv")

Length:  100
                                                                      title  \
0  Corbyn pushes government on RAF base's role in Israel's genocide in Gaza   

**Breitbart**

Articles were scraped from the News section in reverse chronological order: https://www.breitbart.com/news/source/breitbart-news/ Articles with a category of "clips" or "radio" were excluded as they are media content. The article date range is January 17th to 20th 2025.
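The category exclusion could be applied to the URL list before scraping. The sketch below assumes that the first path segment of a Breitbart article URL names its category, which the text above does not confirm; `EXCLUDED_CATEGORIES`, `url_category`, and `filter_media_urls` are hypothetical names.

```python
from urllib.parse import urlparse

# Assumption: the first path segment of the URL is the article category.
EXCLUDED_CATEGORIES = {"clips", "radio"}

def url_category(url: str) -> str:
    """Return the first path segment of a URL in lower case, or '' if there is none."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0].lower() if segments else ""

def filter_media_urls(urls):
    """Keep only URLs whose category is not in EXCLUDED_CATEGORIES."""
    return [u for u in urls if url_category(u) not in EXCLUDED_CATEGORIES]
```

If the URL structure does not hold, the same filter could instead be applied after scraping, on the `category` field that the scraper fills in.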

In [6]:
def scrape_bb_article(url):
    """
    Scrapes an article from a given URL on https://www.breitbart.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
            
            # Category
            cat_meta = soup.find('meta', property='article:categories')
            if cat_meta and cat_meta.get('content'):
                article_data["category"] = cat_meta['content'].split(',')[0]
            else:
                article_data["category"] = "No category found"

            # Article copy
            main_content = soup.find('div', class_='entry-content')
            if main_content:
                # Remove tweets
                tweet_iframes = main_content.find_all('iframe', class_='bnn-if-tweet')
                for tw in tweet_iframes:
                    tw.decompose()
                # Remove images and captions
                image_captions = main_content.find_all("div", class_="wp-caption aligncenter")
                for div in image_captions:
                    div.decompose()
                # Remove reporter promo paragraph
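                # A bare "x" alternative would match any later letter x (e.g. in "next"),
                # so it is restricted to the standalone word "x".
                follow_pattern = re.compile(
                    r'\bfollow\b.*?(facebook|twitter|instagram|truth\s*social|\bx\b|@[a-z0-9_.-]+|email)',
                    re.IGNORECASE
                )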
                all_paras = main_content.find_all("p")
                for p in all_paras:
                    para_text = p.get_text(strip=True)
                    if follow_pattern.search(para_text):
                        p.decompose()
                    elif "reporter for Breitbart News" in para_text:
                        p.decompose()
                    elif "Breitbart News Daily airs on SiriusXM" in para_text:
                        p.decompose()
                    elif "Order your copy today" in para_text:
                        p.decompose()

                # Extract text
                raw_text = main_content.get_text(separator=" ", strip=True)

                article_data["text"] = clean_text(raw_text)
            else:
                article_data["text"] = "Article body not found"
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data

scrape_articles("breitbart.txt", scrape_bb_article, "polarised_scraped_articles_bb.csv")

Failed to fetch the webpage: https://www.breitbart.com/politics/2025/01/20/hes-back-time-magazine-salutes-donald-trumps-return/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/tech/2025/01/19/google-defies-eus-fact-checking-requirements-for-search-and-youtube/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/politics/2025/01/17/elizabeth-warren-accuses-longtime-democrat-donor-sam-altman-of-seeking-favors-from-trump-after-1-million-inaugural-donation/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/politics/2025/01/19/poll-most-americans-one-word-summary-president-joe-bidens-legacy-nothing/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/immigration/2025/01/20/donald-trump-to-end-birthright-citizenship-executive-order/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/border/2025/01/20/cartel-connected-human-smuggler-sent-to-prison-for-kidnapping-texas-teen/. Status 

Failed to fetch the webpage: https://www.breitbart.com/europe/2025/01/19/german-ambassador-warns-trump-administration-will-seek-redefinition-of-constitutional-order/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/middle-east/2025/01/18/israel-will-release-mass-murderers-terrorists-in-hostage-deal/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/politics/2025/01/20/chip-roy-biden-preemptive-pardons-fauci-january-6-committee-members-lets-call-them-all-before-congress-demand-truth/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/europe/2025/01/18/outrage-over-short-sentences-for-grooming-gang-child-rapists/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/tech/2025/01/20/leftist-echo-chamber-bluesky-is-plagued-by-bots-as-trump-takes-office/. Status code: 429
Failed to fetch the webpage: https://www.breitbart.com/entertainment/2025/01/20/disneys-star-wars-star-mark-hamill-begins-week-long-sy

Failed to fetch the webpage: https://www.breitbart.com/europe/2025/01/20/senior-british-royals-plan-tour-of-america-under-trump-for-special-relationship-soft-power-push/. Status code: 429
Length:  100
                                                                                    title  \
0  ESPN to Air 'National Anthem and MLK-Themed Content' Before National Championship Game   
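
The `429` status codes in the output above mean the server was rate limiting the requests. One common way to handle this is to wait and retry with an increasing delay. The following is a minimal sketch, not part of the original notebook: `fetch_with_backoff` and its parameters are hypothetical names, and `fetch` is assumed to be any callable (such as `requests.get`) that returns an object with `status_code` and `headers` attributes.

```python
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff while the status is 429."""
    response = fetch(url)
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Honour a numeric Retry-After header if present; otherwise back off
        # exponentially (base_delay, 2 * base_delay, 4 * base_delay, ...).
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
        sleep(delay)
        response = fetch(url)
    return response
```

Inside a scraper, `requests.get(url)` could then be swapped for `fetch_with_backoff(requests.get, url)`. The sketch only handles a `Retry-After` value given in seconds, not the HTTP-date form.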

**Daily Kos**

In [7]:
def scrape_kos_article(url):
    """
    Scrapes an article from a given URL on https://www.dailykos.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            if not published_date_meta:
                # Fallback to noscript timestamp
                timestamp_span = soup.select_one(".story__timestamp span.timestamp")
                if timestamp_span and 'data-epoch-time' in timestamp_span.attrs:
                    # Convert timestamp to human-readable date
                    epoch_time = int(timestamp_span['data-epoch-time']) / 1000  # Convert milliseconds to seconds
                    human_readable_date = datetime.utcfromtimestamp(epoch_time).strftime('%Y-%m-%d %H:%M:%S')
                    article_data["date"] = human_readable_date
                else:
                    article_data["date"] = "Published date not found"
            else:
                article_data["date"] = published_date_meta['content']
                
            # Category
            category_meta = soup.find('meta', property='article:section')
            article_data["category"] = category_meta['content'] if category_meta else "Category not found"

            # Article text
            story_content_divs = [
                div for div in soup.find_all('div', class_='story__text')
                if 'placeholder' not in div.get('class', [])
            ]
            
            if story_content_divs:
                paragraphs = []
                exclusion_phrases = [
                    "Donate now to support",
                    "Join us on Bluesky", "Bluesky Starter Pack", "staff accounts on Bluesky", "Daily Kos is on Bluesky",
                    "Your reader support means everything", "please donate just $3", 
                    "value having free and reliable access", "Daily Kos is supported by readers like you.", "Can you chip in today?"
                ]
                
                for div in story_content_divs:
                    for p in div.find_all('p', recursive=False):  # Direct <p> children only
                        text = p.get_text(strip=True)
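                        # "Donate now to support" is already in exclusion_phrases,
                        # so a separate startswith check is not needed
                        if not any(phrase in text for phrase in exclusion_phrases):
                            paragraphs.append(text)
                
                raw_article = ' '.join(paragraphs)
                article_data["text"] = clean_text(raw_article)
            else:
                article_data["text"] = "Article text not found"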
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data

scrape_articles("kos.txt", scrape_kos_article, "polarised_scraped_articles_kos.csv")

Length:  99
                                                                  title  \
0  Trump's speech raises the old question: Is he evil or merely stupid?   
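
The Daily Kos scraper falls back to a `data-epoch-time` attribute, which it treats as milliseconds, and converts it with `datetime.utcfromtimestamp`. That function is deprecated as of Python 3.12. The following is a small sketch of a timezone-aware alternative; the helper name `epoch_ms_to_utc_string` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone

def epoch_ms_to_utc_string(epoch_ms):
    """Convert a millisecond epoch timestamp to a 'YYYY-MM-DD HH:MM:SS' UTC string."""
    seconds = int(epoch_ms) / 1000  # milliseconds to seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# The attribute value arrives as a string, so a string is used here too
print(epoch_ms_to_utc_string("1737331200000"))  # 2025-01-20 00:00:00
```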

### 3. Satire
Satirical content is intended to entertain or provoke thought through humour, exaggeration, or irony. It is often mistaken for factual reporting.

##### Features:

- Humorous or Exaggerated Tone: Content is typically marked by wit, parody, or absurdity.
- Intentional Ridiculousness: The story is meant to be funny, not factual; outlandish claims serve comedic purposes.

##### Label If:

- The piece’s goal is clearly comedic or parodic, rather than deceptive.
- The tone, language, or disclaimers indicate it’s intentionally satirical.

##### Do Not Label If:

- The piece uses humour but is still intended to mislead (label as Fabricated Content).
- The piece is comedic but still pushing a heavily skewed narrative as if it’s true (label as Polarised Content).

##### Sources:
- The Onion (USA - 55 articles)
- Babylon Bee (USA - 50 articles)
- The Daily Squib (UK - 45 articles)
- Waterford Whispers (IE - 50 articles)


**The Onion**

The articles scraped are those featured in the 2024 "Annual Year" post (https://theonion.com/our-annual-year-2024/). The top 5 from each month were chosen, with image posts excluded as per the project scope. December is excluded, giving 11 × 5 = 55 articles. The remaining 45 articles are taken from the standard ratings hierarchy found here: https://theonion.com/latest/

In [8]:
def scrape_onion_article(url):
    """
    Scrapes an article from a given URL on theonion.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", 
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        category_element = soup.find('div', class_='taxonomy-category')
        category_link = category_element.find('a') if category_element else None
        article_data["category"] = category_link.text.strip() if category_link else "Category not found"
        
        # Article copy
        content_div = soup.find(
            "div",
            {"class": lambda x: x and "entry-content" in x and "single-post-content" in x}
        )
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("onion.txt", scrape_onion_article, "satire_scraped_articles_onion.csv")

Length:  100
                                                                           title  \
0  Duracell Removes Frosting, Sprinkles To Discourage Kids From Eating Batteries

**Babylon Bee**

Articles from the Greatest Hits page (https://babylonbee.com/news?sort=greatest-hits) have been scraped. The categories "Christian Living" and "Scripture" were excluded for being too niche. The articles range from 2017 to 2022. The final 15 came from the trending news section (https://babylonbee.com/news?sort=buzzing), all from January to February 2025.
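
The category exclusion described above could also be checked after scraping, using the `category` field that each scraper returns. The following is a minimal sketch under that assumption; `keep_article` and the example records are hypothetical, and the original selection may well have been done by hand.

```python
# Categories excluded for being too niche (compared in lower case)
EXCLUDED_CATEGORIES = {"christian living", "scripture"}

def keep_article(article):
    """Return True if the article's category is not in the excluded set."""
    return article.get("category", "").strip().lower() not in EXCLUDED_CATEGORIES

# Hypothetical records in the shape returned by the scraper functions
articles = [
    {"title": "Example A", "category": "Politics"},
    {"title": "Example B", "category": "Christian Living"},
    {"title": "Example C", "category": "Scripture"},
]
filtered = [a for a in articles if keep_article(a)]
print([a["title"] for a in filtered])  # ['Example A']
```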


In [9]:
def scrape_bee_article(url):
    """
    Scrapes an article from a given URL on babylonbee.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print (soup)
        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date       
        published_date_meta = soup.find('meta', {"name": "published_at"})
        if published_date_meta and published_date_meta.get("content"):
            article_data["date"] = published_date_meta["content"].split()[0]
        else:
            article_data["date"] = "Published date not found"
        
        # Category
        category_link = soup.find("a", href=lambda href: href and "/news/categories/" in href)
        if category_link:
            article_data["category"] = category_link.get_text(strip=True)
        else:
            article_data["category"] = "Category not found"
            
        # Article copy
        content_div = soup.select_one("div.article-content")
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("bee.txt", scrape_bee_article, "satire_scraped_articles_bee.csv")

Length:  100
                                                                                         title  \
0  Bill Clinton: 'Allegations Of Sexual Misconduct Should Disqualify A Man From Public Office'

**The Daily Squib**

100 articles were taken from the "Most Popular" page: https://www.dailysquib.co.uk/category/most-popular

In [10]:
def scrape_squib_article(url):
    """
    Scrapes an article from a given URL on dailysquib.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date        
        published_meta = soup.find("meta", property="article:published_time")
        if published_meta and published_meta.get("content"):
            article_data["date"] = published_meta["content"].split("T")[0]
        else:
            article_data["date"] = "Published date not found"
        
        # Category
        category_div = soup.find("div", class_="tdb-category td-fix-index")
        if category_div:
            cat_links = category_div.find_all("a", class_="tdb-entry-category")
            if cat_links:
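                categories = [
                    # ignore "most popular"
                    a.get_text(strip=True) for a in cat_links if a.get_text(strip=True).lower() != "most popular"
                ]
                # if multiple categories remain, use the first; the list is empty
                # when "most popular" was the only category
                article_data["category"] = categories[0] if categories else "Category not found"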
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Extract the article text
        content_div = soup.find("div", class_="td-post-content")
        
        if content_div:
            # remove blockquotes (e.g. embedded tweets)
            for bq in content_div.find_all("blockquote"):
                bq.decompose()
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = clean_text(full_text.strip())
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("squib.txt", scrape_squib_article, "satire_scraped_articles_squib.csv")

Length:  100
                                               title  \
0  "We're all rooting for Meghan's Netflix success!"   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           text  \
0  It's

**Waterford Whispers**

100 articles were taken from the homepage (https://waterfordwhispersnews.com/), sorted from most recent to least recent.

In [11]:
def scrape_whispers_article(url):
    """
    Scrapes an article from a given URL on waterfordwhispersnews.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date 
        date_div = soup.find("div", class_="post-date", itemprop="datePublished")
        if date_div:
            article_data["date"] = date_div.get_text(strip=True)
        else:
            article_data["date"] = "Date not found"
 
        # Category (excluding the ones used just for web display)
        excluded_categories = {"breaking news", "featured-one", "featured-two", "featured-three","homepage"}
        category_div = soup.find("div", class_="post-category")
        if category_div:
            all_cats = [a.get_text(strip=True) for a in category_div.find_all("a")]
            valid_cats = [cat for cat in all_cats
                          if cat.lower() not in excluded_categories]
            if valid_cats:
                article_data["category"] = valid_cats[0]
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Article copy
        content_div = soup.find("div", class_="article-content", itemprop="articleBody")
        if content_div:
            # marketing snippets to strip from the article body (matched in lower case)
            marketing_snippets = (
                "check out our shop.",
                "www.waterfordwhispers.shop",
                "buy some of our merch here",
                "help us to keep pissing off all the right people",
            )
            for p_tag in content_div.find_all("p"):
                p_text = p_tag.get_text(strip=True).lower()
                if any(snippet in p_text for snippet in marketing_snippets):
                    p_tag.decompose()

            # remove blockquotes
            for bq in content_div.find_all("blockquote"):
                bq.decompose()

            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = clean_text(full_text.strip())
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("whispers.txt", scrape_whispers_article, "satire_scraped_articles_whispers.csv")

Length:  102
                                                                                    title  \

### 4. Commentary
Opinion-based content reflecting the writer’s interpretation or viewpoint, often lacking factual grounding or presenting mainly personal interpretation.

##### Features:

- Personal Interpretation: The writer’s subjective opinions or experiences form the core of the content.
- Limited Fact-Checking: Minimal reliance on verified data; opinions may be framed as personal reflections or “takes.”
- Editorial or Opinion Section: Typically appears in editorial pages, op-eds, blogs, or similar formats clearly labelled as opinion.

##### Label If:

- The text is primarily an opinion piece discussing how the author feels about an event, topic, or policy.
- The author uses subjective language (e.g., “I believe…,” “In my view…”) rather than objective reporting.

##### Do Not Label If:

- The commentary deliberately misrepresents facts to persuade or manipulates partial truths (label as Polarised).
- The commentary is disguised marketing or propaganda with a clear persuasive goal (label as Persuasive).

##### Sources:
- www.washingtonexaminer.com (100) (right-wing leaning)
- https://www.nature.com/opinion (50) (science focused, arguably left-wing leaning)
- https://www.rollingstone.com/politics/political-commentary (100) (left-wing leaning, political focus)
- https://www.theguardian.com/uk/commentisfree 
- https://www.wsws.org/en/topics/site_area/perspectives
- https://www.huffpost.com/section/opinion
- https://www.nytimes.com/international/section/opinion
- https://www.washingtonpost.com/opinions/

In [12]:
def scrape_washexam_article(url):
    """
    Scrapes an article from a given URL on washingtonexaminer.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print(soup)
        
        # Title
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else "Title not found"
        if " - Washington Examiner" in title:
            title = title.replace(" - Washington Examiner", "").strip()
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        site = site_name_meta['content'] if site_name_meta else "Site name not found"
        site = site.split(" - ")[0].strip()  # keep only the first part
        article_data["site"] = site
        
        # Published date (from the meta tag)
        pub_date = soup.find("meta", property="article:published_time")
        if pub_date:
            article_data["date"] = pub_date.get("content", "").strip()
        else:
            article_data["date"] = "Date not found"
        
        article_body = soup.find("div", class_="td-post-content")
        if article_body:
            # Remove all <figure> elements.
            for figure in article_body.find_all("figure"):
                figure.decompose()
            # Remove any <a> tag whose text starts with the unwanted phrase.
            for a in article_body.find_all("a"):
                a_text = a.get_text(strip=True)
                if re.match(r"^click\s+here\s+to\s+read\s+more\s+from", a_text, flags=re.IGNORECASE):
                    a.decompose()
            # Extract the text, using a space as separator.
            raw_text = article_body.get_text(separator=" ", strip=True)
            # Replace multiple whitespace/newlines with a single space.
            cleaned_text = clean_text(raw_text)
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("wexam.txt", scrape_washexam_article, "commentary_scraped_articles_washexam.csv")

Length:  100
                                                                                                                title  \
0  IRS contractor Charles Littlejohn leaked Trump’s tax returns, but we’re now learning how much damage he really did   
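
Each scraper in this section falls back to placeholder strings such as "Article text not found", leaves fields empty when a request fails, or (in the Rolling Stone scraper) returns a dictionary holding only an `error` key. A row count on its own therefore does not show that every row is usable. The following is a minimal sketch of a check that could be run on the scraped records before they are merged into the dataset. The helper name `find_unusable_records`, the `min_words` threshold and the sample records are assumptions made for illustration and are not part of the notebook.

```python
# Hypothetical helper, not part of the original notebook.
# Values the scrapers write when no article body could be extracted.
PLACEHOLDERS = {"", "Article text not found"}


def find_unusable_records(records, min_words=50):
    """Return (index, reason) pairs for records whose text looks unusable."""
    problems = []
    for i, record in enumerate(records):
        # Error dictionaries have no "text" key, so treat them as empty text.
        text = (record.get("text") or "").strip()
        if text in PLACEHOLDERS:
            problems.append((i, "missing text"))
        elif len(text.split()) < min_words:
            # min_words is an arbitrary illustrative threshold.
            problems.append((i, "text too short"))
    return problems


# Made-up records covering each failure mode.
sample = [
    {"title": "A", "text": "word " * 60},
    {"title": "B", "text": "Article text not found"},
    {"title": "C", "text": ""},
    {"error": "Failed to retrieve the page"},
    {"title": "D", "text": "too short"},
]
print(find_unusable_records(sample))
```

Flagged rows could then be inspected by hand or dropped, depending on how many remain for the class.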


In [13]:
def scrape_nat_article(url):
    """
    Scrapes an article from a given URL on nature.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print (soup)
        
        # Title
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else (soup.title.string if soup.title else "Title not found")
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        if site_name_meta:
            site = site_name_meta['content']
        else:
            twitter_site = soup.find('meta', attrs={'name': 'twitter:site'})
            if twitter_site:
                site = twitter_site['content']
                if site.startswith("@"):
                    site = site[1:]
            else:
                site = "Site name not found"
        article_data["site"] = site
        
        # Get published date
        pub_date_meta = soup.find("meta", property="article:published_time")
        if pub_date_meta:
            article_data["date"] = pub_date_meta.get("content", "").strip()
        else:
            ld_script = soup.find("script", type="application/ld+json")
            if ld_script:
                try:
                    ld_json = json.loads(ld_script.string)
                    if isinstance(ld_json, list):
                        ld_json = ld_json[0]
                    if "mainEntity" in ld_json and "datePublished" in ld_json["mainEntity"]:
                        article_data["date"] = ld_json["mainEntity"]["datePublished"]
                    elif "datePublished" in ld_json:
                        article_data["date"] = ld_json["datePublished"]
                    else:
                        article_data["date"] = "Date not found"
                except Exception:
                    article_data["date"] = "Date not found"
            else:
                article_data["date"] = "Date not found"
        
        # get category from li tag
        cat_li = soup.find('li', attrs={'data-test': 'article-category'})
        if cat_li:
            cat_span = cat_li.find('span', class_='c-article-identifiers__type')
            if cat_span:
                article_data["category"] = cat_span.get_text(strip=True)
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"
        
        # Article body
        article_body = soup.find("div", class_=lambda c: c and "c-article-body" in c)
        if article_body:
            # Remove all <figure> elements.
            for figure in article_body.find_all("figure"):
                figure.decompose()
            
            # Remove the header title and teaser text if they are inside the body
            header_title_elem = article_body.find("h1", class_="c-article-magazine-title")
            if header_title_elem:
                header_title_elem.decompose()
            teaser_elem = article_body.find("div", class_="c-article-teaser-text")
            if teaser_elem:
                teaser_elem.decompose()
            
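            # Grab all paragraphs
            paragraphs = article_body.find_all("p")
            # The header title or teaser may also sit outside the body container;
            # record their text so paragraphs that repeat them can be skipped.
            header_title = ""
            teaser_text = ""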
            ext_header = soup.find("h1", class_="c-article-magazine-title")
            if ext_header:
                header_title = ext_header.get_text(strip=True).lower()
            ext_teaser = soup.find("div", class_="c-article-teaser-text")
            if ext_teaser:
                teaser_text = ext_teaser.get_text(strip=True).lower()
            
            article_paragraphs = []
            for p in paragraphs:
                p_text = p.get_text(separator=" ", strip=True)
                lower_text = p_text.lower()
                # Skip paragraphs with header title or teaser text
                if header_title and header_title in lower_text:
                    continue
                if teaser_text and teaser_text in lower_text:
                    continue
                article_paragraphs.append(p_text)
            
            # Join paragraphs
            article_text = " ".join(article_paragraphs)
            cleaned_text = clean_text(article_text)
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

scrape_articles("nature.txt", scrape_nat_article, "commentary_scraped_articles_nat.csv")

Length:  50
                                                           title  \
0  Act now to stop millions of research papers from disappearing   
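
The date lookup in `scrape_nat_article` falls back to JSON-LD when the `article:published_time` meta tag is missing. That fallback can be sketched in isolation as below, using only the standard library so it can be tried without fetching a page. The function name `date_from_json_ld` and the sample payloads are made up for illustration, and real pages may structure their JSON-LD differently.

```python
import json


def date_from_json_ld(raw):
    """Return datePublished from a JSON-LD string, or None if it is absent."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # raw was None (empty <script> tag) or not valid JSON.
        return None
    # Some pages wrap the metadata in a list; use the first entry, as the scraper does.
    if isinstance(data, list):
        data = data[0] if data else {}
    if not isinstance(data, dict):
        return None
    # Prefer the date nested under mainEntity, then the top-level field.
    main_entity = data.get("mainEntity")
    if isinstance(main_entity, dict) and "datePublished" in main_entity:
        return main_entity["datePublished"]
    return data.get("datePublished")


# Made-up payloads, not taken from any real page.
nested = '{"mainEntity": {"datePublished": "2024-03-04"}}'
flat = '[{"datePublished": "2024-01-02"}]'
print(date_from_json_ld(nested))
print(date_from_json_ld(flat))
print(date_from_json_ld("not json"))
```

Returning `None` instead of a "Date not found" string is a design choice of this sketch: it leaves the caller free to decide which placeholder, if any, is written to the dataset.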


In [14]:
def scrape_stone_article(url):
    """
    Scrapes an article from a given URL on rollingstone.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

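    Returns:
    -------
    dict
        A dictionary containing the extracted article data, or a dictionary
        with a single "error" key if the page cannot be retrieved or the
        article container is not found.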
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the page"}
    
    soup = BeautifulSoup(response.text, "html.parser")
    #print (soup)
    
    #get title
    title_tag = soup.find("meta", property="og:title")
    title = title_tag["content"] if title_tag and title_tag.has_attr("content") else "Title not found"
    article_data["title"] = title
    
    #get date
    published_tag = soup.find("meta", property="article:published_time")
    published_date = published_tag["content"] if published_tag and published_tag.has_attr("content") else "Date not found"
    article_data["date"] = published_date
    
    #get site
    site_tag = soup.find("meta", property="og:site_name")
    article_data["site"] = site_tag["content"] if site_tag and site_tag.has_attr("content") else "Site not found"
    
    #get category
    category_found = None
    ld_json_scripts = soup.find_all("script", type="application/ld+json")
    for script in ld_json_scripts:
        try:
            data = json.loads(script.string)
            if isinstance(data, dict):
                if "articleSection" in data:
                    category_found = data["articleSection"]
                    break
            elif isinstance(data, list):
                for item in data:
                    if isinstance(item, dict) and "articleSection" in item:
                        category_found = item["articleSection"]
                        break
                if category_found:
                    break
        except Exception:
            # Skip JSON-LD blocks that are empty or cannot be parsed.
            continue
    article_data["category"] = category_found if category_found else "Category not found"
    
    #get article
    # Remove ad blocks
    for ad in soup.find_all("div", class_="admz"):
        ad.decompose()

    # Find the container that holds the article body.
    article_container = soup.find("div", class_="pmc-paywall")
    if not article_container:
        return {"error": "Article container not found"}
    
    # Remove the editors pick widget
    for section in article_container.find_all("section", class_=lambda x: x and "editors-pick-module" in x):
        section.decompose()
    
    # Remove the related content widget
    for section in article_container.find_all("section", class_=lambda x: x and "recirculation-modules" in x):
        section.decompose()
    
    # Find and join the article paragraphs
    paragraphs = article_container.find_all("p", class_=lambda x: x and "paragraph" in x)
    article_text = " ".join(p.get_text(separator=" ", strip=True) for p in paragraphs)
    final_text = clean_text(article_text)
    article_data["text"] = final_text
    #print("Final text")
    #print(final_text)
    
    return article_data

scrape_articles("stone.txt", scrape_stone_article, "commentary_scraped_articles_stone.csv")

Final text
Much has been made of President-elect Donald Trump 's campaign promise to purge the federal government of officials unwilling to be loyal to his agenda. He's now positioned to make good on that promise during the next 10 weeks of his presidential transition. Though he's denied it, this promise reflects the goals of one of Trump's main supporters, the powerful, far-right Heritage Foundation . At the center of the Heritage's Project 2025 playbook is a far-reaching, but meekly named, plan dubbed Schedule F. Loyalty to anyone other than the president, the organization claims, is a fireable offense. If this all sounds familiar, that's because Trump tried this all before without much success. In October 2020, he signed an executive order to create a new personnel classification for high-ranking bureaucrats with policymaking responsibilities. Anyone to get the new Schedule F label would be stripped of employment protections and be subject to easy dismissal. Despite the machinations

Final text
In the Democratic National Convention's reimagined fairy tale, "Goldilocks and the Beltway," President Joe Biden played the part of the porridge. "I was too young to be in the Senate," he quipped , and now he's "too old to stay as president." Biden's candid admission was met with a standing ovation from those eager to usher him offstage -- but this presidential historian can't help but feel alarmed. I don't think the decision to shuffle out of office at 81 should have been Biden's to make, nor should it be left to the Republican rival, Donald Trump -- a convicted felon who, if successful, will be 82 at the end of his second term ... which he may not consider his last. If lawmakers don't codify an age limit for the presidency -- or remove the minimum requirement of 35 -- Biden's farewell might be remembered as the last coherent thing said by a sitting president. At 78, Trump's age lends an unnerving weight to his vice presidential running mate. J.D. Vance, a mere 40 years old

Final text


Final text
President Donald J. Trump was sworn into office just one month ago, but for many Americans, these few weeks have felt like years. Executive orders targeting diversity programs and marginalized groups, purges of the federal government , high-level resignations, extremist cabinet nominees , Elon Musk 's forcible takeover of the administrative state, mass deportations, bizarre threats to annex Greenland and Canada , the renaming of the Gulf of Mexico , a string of deadly airline disasters, clashes between the White House and the judiciary, feuds with the press, and rising tensions over Russia 's war in Ukraine -- it's hard to keep up with Trump's daily blitz of authoritarian action and propaganda, let alone get a sense of the bigger picture. One word comes up again and again as the media, legal experts, and longtime government officials try to describe what Trump has done in retaking power and launching a campaign of revenge against his enemies: "unprecedented." It seems there 

Final text
Mark Hays is the Associate Director, Cryptocurrency and Financial Technology at Demand Progress Education Fund and Americans for Financial Reform Education Fund. Days before taking the oath of office, Donald Trump and his family launched a new cryptocurrency token -- known informally as a meme coin -- called $TRUMP, as the crypto industry held an inaugural ball. Despite scant details about the coin's value, use, or risks, trading took off, sending the coin's price skyrocketing. On paper at least (or on the blockchain, as it were), the Trump family and their businesses are now several billion dollars richer. Not to be outdone, first lady Melania Trump has also launched her own meme coin . Ivanka Trump, meanwhile, is distancing herself from an unauthorized $Ivanka coin . While campaigning, Trump promised to make the United States the " crypto capital of the planet " after crypto bros poured over $130 million to elect a crypto-captive administration and Congress. Trump has tapp

Final text
A gun with three bullets; one man dead on the pavement, one man in custody. From a distance, the death of Brian Thompson looks like any other in a uniquely violent America. But the circumstances surrounding his murder were unimaginable just two weeks ago: A reclusive gunman partially built a gun on the computer and assassinated the chairman of America's most powerful health insurance company. As far as we can tell, the alleged shooter, Luigi Mangione , was radicalized neither by QAnon conspiracy theories nor by undercover federal agents , but by months of self-isolation following a spinal surgery that left him in chronic pain, against a health care system so fundamentally brutal that an unexpectedly large, atypically nonpartisan cohort of the population took to the internet to mock the victim . It was an impolite reaction, with unusual savagery -- not even a moment for "thoughts and prayers" -- which has been examined to death over the past few weeks. In a December 17 YouGov

Final text
When President-elect Donald Trump chose the television commentator Pete Hegseth as his nominee for secretary of defense, a panoply of national security stalwarts -- retired generals and admirals, former appointees, and elected officials -- evinced surprise. Some of us found the defense establishment getting caught off guard par for the course. Really? On a scale of the Islamic State capturing Fallujah in 2014 to the Afghan military collapsing before the Taliban in 2021 , how surprising was it? The choice of Hegseth -- whose highest position of military authority was as a mid-level National Guard officer -- is certainly unconventional. There's no guarantee he will be confirmed, and he could well be forced to drop out amid sexual assault allegations. But the nomination of an extreme, anti-establishment crusader like Hegseth was predictable. Trump's re-election means the upheavals have just begun. While many will celebrate a cultural shakeup at the Pentagon , they may not like 

Final text
Democrats' convention in Chicago undoubtedly provided a contrast of sorts with the nominating convention I had covered in Milwaukee the month prior , where Donald Trump and his champagne-vomit-belching cultists were promising -- in their own pornographically violent words -- a version of Homer Simpson's line about how his "campaign is a disaster... I hate the public so much. If only they'd elect me -- I'd make them pay!" Now, with the two party conventions in the rearview mirror, the general election between Vice President Kamala Harris and Trump has kicked off in earnest. There are about two months left until Election Day in November and the stakes are -- as everyone is likely dead-tired of repeating -- insurmountably high. The fact that this has become a cliche to say doesn't make it any less true. But roaming the halls of the Democratic National Convention last week, there was another feeling I found myself unable to escape. It has continued gnawing at me. Much of what I 

Final text


Final text
Picture this. You risk everything for your country. Leave friends and family behind. Go overseas. Watch your friends die or have their lives changed forever. You come home. Adjusting isn't easy. The nightmares keep you up. That old military injury never really heals. You work full-time, go to school, scrape by just to make it. Then one day, you get the email. "Congratulations, we are offering you a tentative job at the U.S. Department of Veterans Affairs," or at the Department of Defense, or the USAID, or another agency. Finally, after years of sacrifice, you have a fresh start. You buy a home, get married, have kids, maybe even travel. Ten years go by. You've built a life. You have a steady career, glowing performance evaluations, and now a family and a mortgage. You did it. Your service paid off. You're finally living the American Dream. Then a promotion opportunity comes along. Just the raise you need to book that dream vacation with your family. The only catch? A short p

Final text
Every time I hear "voting is your civic duty," I cringe. As a climate activist and Get-Out-The-Vote organizer, I can't help but feel our election system is disconnected from the reality of actual Americans. The platitudes traditional politics offer feel so oblivious to the constant political buffoonery people are expected to endure. Just look at the experience of being a young person in this country right now. Across the United States, we see books being banned, as if this fascist-hallmark was a hobby. Children are told to hide in corners during shooting drills as they prepare for yet another massacre. Our youth are being mined for data that is owned by corporations who are interested in exploiting privacy for financial gain. New college grads are saddled with debt as they head straight into a housing crisis, and women who are able to vote for the first time have fewer rights over their body than when they were born. Meanwhile, it doesn't take a climate expert like myself to

Final text
Sandwiched in between an assassination attempt and a VP pick, District Judge Aileen Cannon issued one of the worst, most nakedly political judicial decisions in decades, dismissing the classified documents case against Donald Trump in a flat rejection of decades of precedent. Cannon's decision is a sneak preview of what's at stake in November: dozens of Trump-appointed federal judges who place nationalist politics above law, precedent, or even honor. The classified documents case should have been a slam dunk. We all saw the photos of boxes of top secret documents in Donald Trump's bathroom. Those documents, as everyone knows, are the property of the National Archives - and Trump surely knew it too, which is why he had them stashed away. There is no conceivable rationale for his making off with them, which is why Trump was charged with 31 counts under the Espionage Act. If this case had been before any honest judge in America, it would have been decided by now. But, by chance

Final text
On Thursday, in their ruling halting the Biden administration's plan to limit ozone pollution from drifting into other states, Supreme Court justices repeatedly, accidentally referenced "nitrous oxide" -- a.k.a. laughing gas -- rather than the chemical compounds actually at issue in the case. The opinion, written by Justice Neil Gorsuch , was published online for several hours before the errors were corrected . The next day, the Supreme Court overturned a bedrock administrative law principle, "Chevron deference," that has long empowered federal agencies to interpret and implement statutes -- with the understanding that federal courts would defer to those agencies' reasonable interpretations of ambiguous laws. Now, judges will get to fill in any policy gaps left by Congress: They are the real experts, the court has ruled. The decisions, taken together, offer a perfect representation of the current Supreme Court: Our country is being led by an all-powerful, undemocratic instit

Final text
In one of the opening scenes of the award-winning movie There Will Be Blood , Daniel Day-Lewis' protagonist, Daniel Plainview, stands with his adopted son, little H.W., atop his shoulders, giving the movie's famous "I'm an oil man" speech. Plainview uses H.W. as a prop to soften his image, describing himself as a family man, running a family business, as he lobbies to win new contracts for lucrative oil leases out West at the dawn of the gusher age. The movie is loosely based on Upton Sinclair's Oil , a novel depicting the corruption and exploitation that characterized the American oil industry at that time. Sinclair drew his inspiration from the Teapot Dome scandal of the Warren G. Harding administration, a scandal some historians still deem the largest political corruption scandal in American history. Harding's Secretary of the Interior, Albert Bacon Fall, sold several Naval petroleum reserves, including those at Teapot Dome, Wyoming, to private oil companies without compe

Final text
Polls are underestimating the importance of climate change to the average American. Every election cycle, pollsters hone in on core issues top of mind for the American public. These topics can make or break the American family as parents are getting kids ready for school in the morning and trying to figure out how to pay their bills at night. Traditionally, we call these "kitchen table" issues, and they encompass everything from the economy, to education, to housing, to health care. But this election season, our climate is driving decisions on where to live, how to consume, whether to rebuild, or even evacuate, for voters right now. What pollsters miss from their ranking of top issues is the fact that if we do not address climate change, every other major issue on the American mind will get worse. Economic and scientific models paint a challenging future. In the era of the climate crisis, costs skyrocket as harvests fail from droughts and extreme heat. Insurance premiums exp

Final text
At this late hour, with so much at stake in this election, let's get one thing straight: Donald Trump is a clear and present danger to the American economy, to women's reproductive freedom, and to democracy itself. But he is not a threat to the planet. Whatever happens on Nov. 5, Earth will be fine. During her 4.5 billion years spinning around the sun, our beautiful planet has been bombarded by meteors, cooked by volcanoes, and iced into a giant snowball. Earth doesn't care about Trump. To her, he is a flea that shits on a gold-gilt toilet. What's at stake in this election is something more fragile: the stable climate that is the basis for civilized life as we know it. Because your job, your freedom, and your future are all dependent on the kindness and generosity of the Earth's atmosphere. If we fuck that up, we're all in big trouble -- and so is every living thing around us. There is no democracy in a world ravaged by climate-driven war, disease, displacement, and economic

Final text
Donald Trump and Elon Musk 's so-called Department of Government Efficiency ( DOGE ) was created under the guise of eliminating fraud, waste, and abuse in federal spending. But, like many of Musk's grand pitches -- Hyperloops, Mars colonization, Tesla robots, robotaxis, self-driving cars -- DOGE is just another empty promise from one of the world's most practiced conmen. In its supposed "savings" report, DOGE claimed to have saved $55 billion -- a small portion of of which came from capping payments for research grants . The rest? There aren't any real savings. All DOGE did was cancel government contracts. That's like deciding to stop paying your bills and mortgage to cover your credit card debt, only to realize your so-called "savings" won't even cover a single interest payment. A common theme in the DOGE data is the cancellation of subscription-based services that federal congressional and public affairs offices rely on, such as newspapers with paywalls, Beltway tipsheets,

Final text
Sunshine Man galloped for 35 minutes before the gunshot. The iconic palomino stallion died on the same open lands he had roamed for years, but they looked unfamiliar to him in his last desperate moments. The dust from the helicopter kicking up behind him, the roar of the blades ceaselessly bearing down -- it was enough to make him flee as fast as he could despite the leg he had snapped in half while trying to regain his freedom. His pursuers eventually tired of the chase, and a wrangler felled him with a rifle shot. Sunshine Man was one of 21 wild horses killed at the behest of the Bureau of Land Management during a 2023 roundup in Nevada. And 2024 is looking to be even bloodier as the agency seeks to capture 20,000 horses by September. At least 11 horses died in a single northern Nevada roundup as of June 29. Few people know that wild horses are being driven to near extinction by inhumane roundups perpetrated by the federal government and funded by taxpayers. Unless we do s

Final text
This article is published in partnership with The Lever, an investigative newsroom. If you like this story, sign up for The Lever's free newsletter . In 2008, I published a book with a straightforward premise: the upcoming era of American politics would be defined by a competition between the left and right to harness the working class's intensifying rage in a society being pillaged by corporate interests. It was the twilight of the Bush era, and the country was beginning its nose-dive into recession and turmoil, but hope and change seemed just over the horizon. I predicted that with elements of both political parties in a warrior stance, simmering conflicts over deindustrialization, financialization, and neoliberalism would soon explode and realign politics, birthing some American version of either social democracy or authoritarianism. The 16 years since The Uprising was released have delivered much of the tumult I imagined. It has been a period of unrest, chaos, and flip-f

Final text
The recent identification of 400 Iraq War veterans potentially exposed to chemical weapons has brought renewed attention to the critical role of the Veterans Exposure Team-Health Outcomes Military Exposures (VET-HOME) program. This initiative, launched under the PACT Act of 2022, offers specialized care for veterans suffering from military environmental exposures, providing essential telehealth evaluations and personalized care plans. That progress is now under threat. If the Department of Veterans Affairs (VA) were privatized under policies proposed by former President Donald Trump , programs like VET-HOME could be dismantled, leaving veterans without the specialized care they need. During Trump's presidency, his administration's push to expand private health care through the VA Mission Act sparked significant controversy. Trump framed the legislation as increasing "choice" for veterans, but many major veterans' organizations saw it as a step toward privatization. They fear

Final text


Final text
A recent executive order has sparked widespread concern about the availability of medications treating attention deficit hyperactivity disorder (ADHD), a condition that affects more than 22 million Americans . Titled " Establishing the President's Make America Healthy Again Commission ," the executive order has Secretary of Health and Human Services Robert F. Kennedy Jr.'s fingerprints all over it, focusing on several of the pillars of his MAHA campaign, from increasing Americans' life expectancy to fighting chronic illness. But the recent wave of anxiety about ADHD medications largely stems from a line in the executive order calling on the soon-to-be-formed MAHA Commission to produce a report on children's health "assess[ing] the prevalence of and threat posed by the prescription of selective serotonin reuptake inhibitors, antipsychotics, mood stabilizers, stimulants, and weight-loss drugs." Stimulants like Adderall and Ritalin are commonly used to treat ADHD in adults and 

Final text
President-elect Donald Trump has always used his public press conferences and social media posts to seize attention and dictate the terms of national discourse. But his recent trolling about taking over Canada , acquiring Greenland , and reclaiming the Panama Canal reveals something more complex than rhetoric. These statements, far from harmless social media clickbait, show Trump's vision for America: a return to the outdated ambitions of domination from the Bush era. For a man who once styled himself as the " pro-peace " candidate, these nation expansion fantasies mark a disturbing pivot toward policies that could destabilize the global order. Now that Trump is president-elect, he has quickly shifted away from the isolationist " America First " narrative he promoted during his 2016, 2020, and 2024 campaigns. He painted himself as the voice of the forgotten America who would prioritize domestic concerns over costly foreign entanglements. Trump promised to bring jobs back to 

Final text
Final text


Final text
Faress Arafat, 22, worked as a volunteer nurse in Gaza -- first in the emergency department at al-Shifa Hospital and then in displacement camps in Rafah. It has now been one year since Israel 's war on Gaza began. More than 41,000 people have been killed, and 100,000 have been injured. More than 16,000 children are among the dead. We are haunted by painful memories that are impossible to forget. How could I forget that we left our home and city, seeking refuge in a tent that offered no protection from the heat or cold? How can I forget the scenes of dismembered bodies we saw daily in our work as medical teams, being the first to respond to the injuries? How can I forget the cries of the injured children, their screams echoing inside my mind? How can I forget the sound of women weeping as they said goodbye to their loved ones? There is no way to forget or move on. After leaving Gaza, I realized that everything I experienced while working as a volunteer emergency nurse at al-S

Final text
Six years ago a gunman entered my high school in Parkland, Florida, and shot and killed 17 people and wounded 17 more. At the time, Donald Trump was in the middle of his presidency. Days after he expressed potential support for gun safety laws, the NRA called him up and he reversed himself. No one offered us more than thoughts and prayers. It was up to us, young people, to March For Our Lives . Since then, we've done more than march and accomplished what most thought was impossible. We passed the first federal gun safety law in three decades, created the first-ever White House Office of Gun Violence Prevention, and just this summer, the U.S. Surgeon General finally declared gun violence a public health crisis. This is real progress that has translated into real results and lives saved. While gun-related deaths increased by 34 percent under Trump and he bragged about doing "nothing", we're now seeing a historic decline in gun violence, with gun deaths down by nearly 20 percen

Final text
President Joe Biden should commute the sentence of Charles Littlejohn, the former IRS contractor who was sentenced in January 2024 to five years in prison. By disclosing the federal tax records of Donald Trump , Elon Musk, Jeff Bezos, and other billionaires to news organizations, Littlejohn enabled vital reporting about how the wealthiest people in the United States end up paying less in taxes than public school teachers and firefighters. We're urging people to urge the president to act while he still has the power to do so. Our website, freecharleslittlejohn.com , facilitates the process of sending an email or a letter to the president asking him to commute Littlejohn's sentence. Despite performing a valuable public service, Littlejohn received the harshest possible punishment. While sentencing guidelines recommended four to 10 months in prison, District Judge Ana Reyes gave Littlejohn the statutory maximum sentence of five years. Remarkably, this punishment is far more sev

Final text
Donald Trump 's politics of chaos and division are sucking America into a dark vortex, just seven weeks before an Election Day that could propel him back to the profound, and perilous, powers of the presidency. Trump is at his most dangerous when he's in a bind. Months after president Joe Biden's decision to withdraw from the 2024 race, the MAGA authoritarian has still not found a political answer to the candidacy of Kamala Harris. And polls show him taking on water in the wake of a calamitous second debate performance. The moderation Trump that affected for his nominating convention in Milwaukee has been revealed as just an act, as he has reverted to more familiar, ugly political instincts. With Harris energizing and re-uniting her party's base, and in particular voters of color, Trump no longer sees political profit in trying to peel off disaffected Democrats. He is, instead, seeking to energize politically disengaged bigots, a shift that finds Trump fishing for support in

Final text
The Department of Veterans Affairs (VA) is more than just a health care provider for those who have served -- it stands as a symbol of what a national, government-run health care system could achieve if scaled to the broader American population. This is precisely why Republican leaders and conservative think tanks have spent years trying to dismantle it, using privatization as their Trojan horse. The VA is not perfect -- no system is -- but its shortcomings should be addressed through reform and investment, not by dismantling the system entirely. Pete Hegseth , the former Fox News commentator whom Donald Trump wants to lead the Pentagon , has demonstrated that he fundamentally misunderstands the VA and the care it provides to our veterans. Discussing veteran disability ratings in 2019, Hegseth claimed : "I could be rated for 50 percent right now if I wanted to be." Hegseth, who has no combat injuries, would be highly unlikely to qualify for a 50 percent disability rating. Hi

Final text
Mandy Moen's children didn't get Christmas presents this year, but their family tried to make the best of the situation. They fastened a long string of beads to the wall in the shape of a Douglas fir, and thumbtacked ornaments to the living room drywall, including a miniature plastic TV set, a plush cardinal, and an iridescent green pickle. They even created a fake fireplace out of paper bags. "We're pretty creative," says Moen, whose name has been changed to protect her family's safety. Even if her kids had gotten presents, Moen says they would have just been forced to get rid of them anyway. Their family began preparing to leave the United States following Donald Trump 's win in the presidential election, which Moen says made her fearful for her children's futures. Since he took office, Trump has issued a flurry of executive orders following through on his campaign promises to target trans Americans . Among them, he has signed orders targeting gender-affirming healthcare f

Final text
Dear women and girls of America, These last two years have been incredibly hard. Practically overnight, women lost a right we had counted on for nearly 50 years -- and an entire generation lost the freedom to make their own decisions about their lives and futures. If hearing the news that Roe v. Wade had been overturned left you heartbroken, scared, or furious -- maybe even all three -- you were in good company. There were so many protests in the months after that disastrous Supreme Court decision that some people called it "the Summer of Rage." Women my age were gutted; no mother in the world wants her daughter to have fewer rights than she did. Young people were outraged that politicians they'd had no say in electing had unilaterally made a decision that would alter the course of their lives. Now, after two years of waiting, our moment is here. Together, we have the power to change the direction of this country. And we are going to do it by electing Kamala Harris and Tim W

Final text
President-elect Donald Trump hasn't even taken office, and congressional Republicans are already taking cues from Elon Musk and Vivek Ramaswamy and the pair's new playground, the Department of Government Efficiency, or DOGE. GOP lawmakers immediately lined up against the original short-term government funding bill released this week after Musk voiced his opposition . On top of that, a group of extremist Republicans, led by Sen. Rick Scott and Rep. Andy Harris, is fighting to ensure that the fresh round of tax cuts Republicans have planned for Musk when Trump's 2017 tax law expires are paid for with $2.5 trillion in spending cuts on programs Americans rely on chosen by -- you guessed it -- Musk. Republicans have made their plan for the new year crystal clear: Ram through massive tax giveaways for the ultra-wealthy and corporations, and pay for them by shaking down programs and agencies that working families rely on. And they're putting unelected and unaccountable oligarchs --

Final text
Brat. Weird. Megan Thee Stallion. Democrats haven't just recast the lead at the top of their electoral ticket; they're performing an entirely new show. Right wing operatives, desperate to make silly selfies seem like evidence of -- gasp! -- socialism , are trying to claim this new politics of joy (and mockery) is proof of nefarious intent. This is either par for the "every accusation is an admission" MAGA Republican playbook or merely ironic. Creating good vibes isn't just generally effective, it's critical especially in what may seem the most counterintuitive cases: confronting existing or would-be authoritarian regimes, in other words, MAGA Republicans. Sami Gharbia once said, "humor is the first step to break taboos and fears. Making people laugh about dangerous stuff like dictatorship, repression, censorship is a first weapon against those fears... without beating fear you can not make any change." The lure of the authoritarian is that he is the Strongman, as Ruth Ben Gh

Final text
While Elon Musk was striking it rich as a guest in America, the rest of our worlds were crashing down. I was serving in my first year of Active Duty Military Service. While home on leave -- a little longer than intended -- I remember watching the first tower fall from my uncle's home in Bayonne, N.J. The smoke rising from Lower Manhattan was visible across the water, an unthinkable sight that would change the course of history and my life forever. While I watched in horror, John Feal was already in motion. A demolition supervisor, he did what so many first responders and volunteers did that day -- he ran toward the devastation, determined to help. For days, along with his comrades, many of whom are dead from 9/11 -related illnesses, he worked tirelessly on "the pile," the smoldering wreckage of the Twin Towers, searching for survivors, clearing debris, and breathing in air thick with toxic dust that he was told was "safe." In 2008, fate brought John Feal and me together. I w

Final text
As I have predicted in previous columns for Rolling Stone , the Trump administration will shatter the lives and livelihoods of tens of thousands of veterans . It's already happening, and we're only in the first week of his second term. Veterans have long relied on the federal government as a pathway to stable employment after military service. The Department of Veterans Affairs (VA), the Department of Defense (DoD), and other federal agencies provide not just jobs, but purpose -- continuing the mission of serving fellow service members and ensuring that those who sacrificed for their country are cared for. But now, thanks to Trump's reckless 90-day federal hiring freeze , that lifeline is gone. The very veterans he once claimed to champion are being left in the cold, their futures uncertain. I'm part of a Facebook group called Veterans 2 Federal Jobs , a community where veterans help one another navigate the complex federal hiring process. Over the years, I've seen this grou

Final text
Elon Musk and the Trump administration are attempting to dismantle the federal workforce and our intelligence agencies, but there's just one problem: federal employees and veterans aren't taking the bait. Despite repeated emails, misleading promises, and thinly veiled threats, the so-called "deferred resignation" program is flopping. Instead of persuading career civil servants to abandon their posts, Musk's latest scheme is exposing the administration's desperation and legal recklessness. Less than 1 percent of federal workers have taken the offer. The recent legal filing by the American Federation of Government Employees and two federal employees seeking a restraining order against Musk's DOGE further hampers Musk's plans. In the end, this debacle will likely cost the federal government more in legal fees, settlements, and billable legal hours than ever. Musk, now leading the newly created Department of Government Efficiency (DOGE) -- which is neither a federal agency nor a

Final text
In his debate against Vice President Kamala Harris , Donald Trump debuted the turducken of dog whistles: the claim that Harris would funnel taxpayer money to surgeries for transgender immigrants in prison. Since then, Republicans have put $60 million over two weeks into airing anti- trans ads heaped atop their public speaking and social media posting efforts to center this issue. There's an effective rejoinder Democrats could be deploying: Call out the bullshit, explain the motivation behind it, and sandwich this between an affirmation of what nearly all of us value and must vote to protect -- freedoms. The Democratic response? Largely silence. Or worse , tacitly reaffirming the opposition's position. Right now, the Harris campaign and aligned Democratic groups are leaving the issue unmentioned in ads, choosing to broadcast their economic bonafides plus draw the contrast with Republicans on abortion. This response is inadequate -- not just morally, but electorally. Ignoring 

Final text
By now, every pundit in America has their own 2024 election take, mostly confirming their prior opinions. Every Republican has a take, too, which is that Americans voted resoundingly for -- well, for whatever policy that Republican cares about, from opposition to transgender rights to support for prayer in schools. And of course, progressives, especially younger ones, have every right to feel afraid, angry, or alienated. But the data tells a specific story, not a choose-your-own-adventure. And that is that swing voters voted mostly out of economic insecurity and discontent . They actually liked Kamala Harris more than Donald Trump (Harris' favorability was 48 percent, compared to 44 percent for Trump). But Harris was the incumbent, and incumbents don't win elections when people think the economy is bad. This is not just an American phenomenon. As the Financial Times reported, in every developed country in the world , the incumbents lost this year. This is unprecedented. If, 

Final text
Thanks to the conservative Supreme Court, thousands of Veterans Affairs Department rules and regulations are now vulnerable to court challenges, a scenario that could be ignited by a new Trump administration. As a young political staffer, one of the first hard lessons I learned was that in politics, there are no coincidences. During Donald Trump 's term as president, his administration's push to expand private health care for veterans through the VA Mission Act sparked significant controversy. Although Trump framed the law as fulfilling his campaign promise to offer veterans more "choice" in their health care, many major veterans service organizations saw it as a step toward privatizing the VA. They feared that the expansion of private care, which was estimated to cost billions, would drain resources from the VA's facilities and degrade the quality of care veterans relied on. This clash between Republicans and the veteran services organizations, centered around ensuring the 

Final text
Steve Bannon is a troll. Never forget that. Because trolling is exactly what he was doing last weekend when he suggested Donald Trump could run for president in 2028 to serve, if he won, a third term in office. Trump can't do this, and don't let anyone convince you otherwise. Now, if you're someone who believes that there are no rules, or at least that there are no rules that apply to Donald Trump, and that any norm-breaking democracy-threatening thing he, his advisers, or even his opponents can imagine is possible -- well, then nothing I say here will convince you. Just go ahead and give up now. But for everyone else, the Constitution is eminently clear. The 22nd Amendment says: "No person shall be elected to the office of the President more than twice." It was added to the Constitution after Franklin Delano Roosevelt was elected for a fourth term, to prevent anyone else from serving as long as he did. No other president has served more than two terms. And neither will Dona

Final text
Anat Shenker-Osorio is a political strategist and communications researcher for progressive campaigns. Since Donald Trump and Republicans seized control of the federal government, we've seen a flood of illegal executive orders plus a takeover of the Treasury Department and other government agencies by a billionaire who paid his way into shadow presidency. Democrats, in response, have issued sternly worded social media posts , threats to deny a budget deal five weeks away, and promises of future legislation to render illegal the current crime of breaking and entering. To be sure, #NotAllDemocrats applies. Lawmakers like Alexandria Ocasio Cortez, Jasmine Crockett, Maxwell Frost, and Jamie Raskin have made potent denunciations and cast votes to indicate they would like to stop the authoritarian takeover of our government. Sen. Chris Van Hollen and Rep. Don Beyer led a delegation to the U.S. Agency for International Development and filed an injunction to block Elon Musk 's effor

Final text
When Elon Musk appeared in the Oval Office on Feb. 11 alongside President Donald Trump to discuss his latest slashing of federal funding, he was joined by his four-year-old son, X AE A-Xii, better known to the world as X. "Great guy," Trump said of the toddler. "He's a high IQ individual." X, for his part, had his own agenda. "I need to go pee," X could be heard whispering to Trump as Musk spoke. Musk, an unelected government official, went on to call bureaucracy a "fourth unelected branch of government" as his son rubbed his eyes. Reactions to the press conference focused on how cute X is, with online commentators fawning over the toddler. It's not a coincidence that Musk, a 'special government employee' who is the head of the Department of Government Efficiency (DOGE), brought his son into the Oval Office. Musk is the father of 12 known children and is dedicated to the cause of pronatalism , a movement which espouses the belief that having many children is not only a socia

Final text
Dr. Ayana Elizabeth Johnson is co-founder of the non-profit think tank Urban Ocean Lab , distinguished scholar at Bowdoin College, and author of New York Times bestseller What If We Get it Right?: Visions of Climate Futures . By the end of Donald Trump 's second presidential term, sea levels will be higher, weather will be more extreme, and the urgency for implementing climate solutions will be even greater. Yet on day one of his administration, Trump signed nine executive orders aiming to sabotage climate solutions -- from obstructing renewable energy development; to re-opening protected areas to drilling, mining, and logging; to terminating all federal environmental justice programs and positions; to declaring an "energy emergency" (despite U.S. oil and gas production currently setting an all-time global record ) that would allow fossil fuel projects to evade environmental protections; to bailing on the Paris Agreement (again); and even redefining "energy resources" to exc

Final text
Donald Trump led a virulently anti-immigrant campaign for president, fixated on the idea of "migrant crime" and the notion that foreigners are "taking your jobs." The president-elect and his incoming administration remain committed to deporting undocumented immigrants and asylum-seekers -- but Trump has already made clear he won't stop big employers from replacing Americans with temporary immigrant workers. The president-elect recently sided with ultra-wealthy supporters Elon Musk and Vivek Ramaswamy over MAGA fans online regarding the question of whether Big Tech firms need foreign workers. Musk and Ramaswamy, like Trump, have frequently demonized undocumented immigrants; Musk ran pro-Trump ads decrying a "HISTORIC BORDER INVASION" and "illegal immigrants getting handouts." Now, Musk says the H-1B visa program for high-skilled workers is how companies make America "strong," and has demanded that "racists" be purged from the Republican Party. Trump, the left, and the right h

Final text


Final text
Margaret Klein Salamon, Ph.D., is the Executive Director of Climate Emergency Fund , which raises funds for and makes grants to non violent climate activists. She is the Author of Facing the Climate Emergency: How to Transform Yourself with Climate Truth . Who is to blame for the Los Angeles fires ? For the destruction of Asheville? The devastation of Acapulco? If you listen to the mainstream media, you would get the impression that no one is truly responsible. These are framed as tragic but random events- acts of nature without clear cause or accountability. Even if articles do mention climate change, which they tend to bury toward the end, they won't tell you the people and companies who caused these disasters in order to enrich themselves. And if you listen to right-wingers, you will get the mistaken impression that DEI , arson, and the California Democratic Party caused these disasters. But here's the truth: These mega-disasters are caused by the climate emergency, which

Final text
Kamala Harris and Tim Walz, her new VP pick, have gotten a lot of mileage out of labeling Donald Trump and his running mate J.D. Vance "weird." It's a brilliant construction, as it's taken the word hurled at those considered outsiders in America -- by race, gender, or affinity -- and attached it firmly to white men of the right-wing persuasion. It's also a primo move of someone who is a member of my generation, Generation X -- taking back the power of the outsider and exposing those who claim normalcy as the freaks they really are. If elected, Kamala Harris would be the first arguably Generation X president and fittingly laced with both ambivalence and cool. Technically she is the very latest Boomer -- born at the tail end of 1964, a month away from being officially Gen X. But Harris is Gen X in her personal style and in the ways in which she is read -- and misread -- by her naysayers. Embracing her Gen X self is her best bet for connecting with a wide range of voters, espec

Final text
My generation has grown up during one of the most complex and unprecedented times in American history. We've witnessed war, genocide, historic climate disasters, a raging gun violence epidemic, an insurrection, an economy that doesn't work for us, and the lowest levels of trust in government -- paired with an ever-deepening hyper-partisanship. And the list goes on. We are the mass shooting generation, the generation that will bear the brunt of climate change, and the generation inheriting crises that threaten our future. We grew up with active shooter drills in our schools, watched our planet's ecosystems change before our eyes, and now face mounting student debt, an unaffordable housing market, and an uncertain economic landscape that feels rigged against us. It's no secret that there's a lot being said right now about the recent election. People are pointing fingers, making generalizations, and casting blame, some of it unfairly directed at young people. Pundits are doing 

Final text
People often ask me why I do what I do. Why go undercover to expose extremist Republicans like Senator Ron Johnson , Trump coup attorney John Eastman , or Supreme Court Justice Samuel Alito ? Can't you just ask them for an interview? To give you a fuller context for my motivation requires first going back to the morning of Sept. 11, 2001 in lower Manhattan. Running late for class, I bounded up the stairs at the City Hall subway station, and was greeted by a construction boom that was not normal. As I made my way west on Chambers Street, I soon discovered its source. A plane had crashed into the North Tower of the World Trade Center. "Oh my God! What a terrible accident!" "Pilots can't fly anywhere close enough for that to be an accident." The scene was surreal, to be sure, but I kept on task. I made it to the street entrance of campus before turning to stare back at the towers in astonishment. The second plane suddenly came into view before exploding into the South Tower, an

Final text
In 2000, I was stationed in Florida and away from my home state of New Jersey. I sent in my mail-in ballot and voted for Al Gore, believing I was participating in a cornerstone of democracy. However, as the recount chaos unfolded -- with hanging chads, legal battles, and a controversial Supreme Court decision -- I was left wondering if my vote had even been counted. I deployed soon after, following 9/11, pushing those concerns aside. But the lessons from that election never left me. Voting rights , especially for minority blocs, are fragile and must be fiercely protected. Now, 24 years later, we are witnessing another attempt to undermine mail-in voting. Pennsylvania Republicans recently tried to impose stricter ID requirements on military and overseas voters, a move that threatened to throw thousands of ballots into jeopardy just days before the general election. Their lawsuit sought to change rules that exempt military voters from state voter ID laws -- an exemption in pla

Final text
I used to hate Jimmy Carter more than anything else in this world. Let me explain. It's November 1979 in Oak Harbor, Washington, and I'm counting days. 24, 23, 22. I'm in the eighth grade. After school, I fold and then deliver the Seattle Times on Whidbey Island, the second-largest island in the continental United States as our teachers like to remind us. I am counting days until I fly from Seattle to Honolulu to meet my dad, Commander Peter Rodrick, and ride back to San Diego with him on the USS Kitty Hawk. The voyage will take six days, longer than I've ever spent alone with him. Finally, there will be time. I can come clean about faking sick so I could watch the Red Sox-Yankees one-game playoff last October. The Sox are Dad's team. He'll understand. Finally, I can learn what my father does. I know he flies jets off carriers, but how? Finally, I can ask him why things seem so hard all the time. He is counting days, too. His letters to mom always mark the days left of his s

Final text
This article was produced by Capital & Main. It is co-published by Rolling Stone with permission. Minnesota Gov. Tim Walz , who has just been tapped as Kamala Harris' running mate, has quietly become one of the country's most aggressive advocates for taking action on climate change. Under his leadership, Minnesota has adopted some of the most ambitious climate policies in the nation -- including a law he signed in 2023 that requires the state to generate all of its electricity from renewable sources by 2040. During that legislative session, Walz and state lawmakers pushed through at least 40 other climate-related initiatives . "The idea [is] that we can create a clean energy future where we can protect our water, protect our land, and do that in a manner that grows the economy in Minnesota," Walz said in announcing those initiatives. Under the clean-electricity law, Minnesota is on track to transition to clean-energy sources even faster than California, which is often seen a

Final text
It's Sunday afternoon near the Washington Monument and the clouds overhead are suitably dark and impenetrable. All your friends are here. The accidentally Russia-funded Tim Pool slips into a VIP tent while Russell Brand shows off his chest hair to his partner Jordan Peterson , the victor in many imagined battles against cultural Marxism. He is dressed as a cartoon villain in a suit that is half red and half blue. Nearby, Bashar al-Assad and Putin cheerleader Tulsi Gabbard prepares to call Kamala Harris a warmonger from the stage. Welcome to the Rescue the Republic rally, featuring many stars of the fabled Vaccines Are Bullshit rally of 2022. The "just because I am paranoid it doesn't mean that the Feds are not following me" crowd is here for Robert F. Kennedy Jr., but first there are many warmup acts. Comedy edgelord Rob Schneider is the master of ceremonies, and has already trotted out his Saturday Night Live Mexican-stereotype character that was created a third of a centur

Final text
In 2013, I met a 12-year-old Syrian girl who had been shot in the back by a government sniper near Aleppo. Her name was Maysaa, and she was paralyzed from the waist down. "Am I a terrorist? Are all of the children they kill terrorists?" she asked, recuperating in an improvised medical facility on her way to a Turkish hospital. Despite her pain, she was overcome with anger, and she cursed the man responsible. "Children are being torn to pieces. May God tear Bashar al-Assad and his children to pieces." Curses like Maysaa's are seeds that took root in Syria 's blood-soaked soil and have stubbornly grown. Now, more than a decade later, they are bearing fruit. The murderous tyrant who presided over the collapse of Syria, amid a brutal civil war, has finally fallen. Assad's regime, responsible for more than 617,000 deaths, has evaporated in the face of an onslaught that began with a ferocious offensive by rebels in the northwest, and which was soon joined by anti-government fighte

Final text
The wildfires in Los Angeles are devastating. Forty thousand acres of natural and urban land have been scorched, more than 12,300 structures decimated, and tens of thousands of families lost their homes, pets, baby books, and most cherished possessions. The displaced are now fanning out, seeking temporary shelter with friends and family, in hotels, or in rental units. Many will have trouble finding a place to stay in a city that was already as many as 450,000 affordable units short before the fires. When there are more heads than beds, it's a seller's market. When there are more heads than beds in a crisis, it's a gouger's market. Cue the greedy landlords. In the past week, the price of rental units has skyrocketed. Tenants are inundating government and non-government agencies with complaints of price gouging, according to the Housing Rights Center . A review of Zillow listings by The New York Times found that rent prices in West Los Angeles have spiked from 15 percent to an

Final text
"This was my dream job and it's just being taken away by an administration who doesn't care about science," cried a scientist we'll call Alexis, in a video viewed over 1.2 million times, her voice breaking and tears running down her face. "They don't care about our species that we're trying to conserve. We should all be so mad right now. We should all be crying like this." The latest wave of mass firings in the federal government has washed over thousands of workers , Including those in the National Institute of Health, the Department of Veterans Affairs, and the U.S. Forest Service. Elon Musk -- an unelected official, the richest man in the world, and head of the Department of Government Efficiency (DOGE) -- slashed the jobs of 200,000 federal workers as part of his "cost-cutting initiative." The workers targeted were those in a probationary period , who are generally newer hires who have been in their current positions for less than a year. Fired federal workers have taken

Final text
The mass purge of federal workers under Donald Trump and Elon Musk 's so-called Department of Government Efficiency ( DOGE ) plan is more than just an attack on the government bureaucracy -- it's a direct assault on the veterans who make up nearly 30 percent of the federal workforce. In a decision that will be remembered as a betrayal of the veteran community and the broader federal workforce, the U.S. Office of Personnel Management (OPM) issued an email Tuesday similar in wording to the one Musk used to gut Twitter . The message, which arrived under the subject line, "Fork in the Road," presented federal employees with a false choice: agree to resign now with temporary pay or risk termination later. Musk made a similar offer to Twitter employees, along with promises about severance that were apparently never honored -- a precedent that raises serious concerns about the integrity of this so-called "deferred resignation" plan. Musk's fingerprints on this scheme introduce majo

Final text
The murder of the CEO of UnitedHealthcare , America's largest health insurer, has generated a lot of discourse over the past week and renewed conversation about the state of the U.S. health care system -- something that was almost entirely missing from the 2024 campaign. Social media users have actively celebrated the shooting and thirsted over the suspect . In the other direction, New York Times columnist Bret Stephens argued the insurance exec was a "working-class hero," while blogger Noah Smith asserted that health insurers are simply "middlemen" and not "the main villain of the U.S. health system." The CEO of UnitedHealth Group (UHC's parent company) wrote in the Times that the health conglomerate's purpose is "to build a health care system that works better for everyone." Sure , man . Yet, CNN offered perhaps the strangest take in one segment this week: Americans are responsible for our health care system staying the way it's designed today -- i.e. based around private 

Final text
This article was produced by Capital & Main. It is co-published by Rolling Stone with permission. Objectively, it is an amazing progression of Black history in a short window of time. In 2008, Barack Obama, a Democrat, was elected as the nation's first Black president. Sixteen years later, Kamala Harris is poised to become the first Black woman chosen by the party to be its candidate for president. But much has changed in 16 years. The historic significance of Obama's candidacy was hope, the symbolism of a Black man in the White House inspiring enough to move millions and urge us to think differently about ourselves. Kamala Harris' ascension is no such symbol. She is becoming the candidate by dint of being vice president, handed the baton by an aging white president who was more or less forced to step aside as the presumptive nominee because there was too much concern about his own ability to get the job done. Harris' color is not nearly as important as her ability to do tha

#### The Guardian
Articles scraped from https://www.theguardian.com/uk/commentisfree

In [15]:
def scrape_guard_article(url):
    """
    Scrapes an article from a given URL on theguardian.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the page"}
    
    soup = BeautifulSoup(response.text, "html.parser")
    #print (soup)
    
    #get title
    title_tag = soup.find("meta", property="og:title")
    title = title_tag["content"] if title_tag and title_tag.has_attr("content") else "Title not found"
    #remove author name from title if exists
    if "|" in title:
        title = title.split("|")[0].strip()
    article_data["title"] = title
    
    #get author
    meta_author = soup.find("meta", property="article:author")
    author_name = ""
    if meta_author and meta_author.has_attr("content"):
        author_slug = meta_author["content"].split("/")[-1]
        author_name = " ".join(author_slug.split("-")).title()
    
    #get date
    published_tag = soup.find("meta", property="article:published_time")
    published_date = published_tag["content"] if published_tag and published_tag.has_attr("content") else "Date not found"
    article_data["date"] = published_date
    
    #get site
    site_tag = soup.find("meta", property="og:site_name")
    article_data["site"] = site_tag["content"] if site_tag and site_tag.has_attr("content") else "Site not found"
    
    #get category
    category_tag = soup.find("meta", property="article:section")
    article_data["category"] = category_tag["content"] if category_tag and category_tag.has_attr("content") else "Opinion"
    
    #get rid of newsletter signup box
    for aside in soup.find_all("aside", attrs={"aria-label": "newsletter promotion"}):
        aside.decompose()
    
    #get article
    article_body = soup.find("div", class_=lambda c: c and "article-body" in c)
    if article_body:
        # Get rid of footer bylines before collecting paragraphs,
        # so that paragraphs inside a footer never enter the list
        for footer in article_body.find_all("footer"):
            footer.decompose()
        article_paragraphs = article_body.find_all("p")
        # Remove other unwanted pieces such as author biography, requests for opinions etc. 
        num_paragraphs = len(article_paragraphs)
        if num_paragraphs > 0:
            # Determine the indices for the last five elements
            start_index = max(0, num_paragraphs - 5)
            # Iterate in reverse order over these indices
            for i in range(num_paragraphs - 1, start_index - 1, -1):
                text = article_paragraphs[i].get_text(strip=True)
                if author_name and text.startswith(author_name):
                    del article_paragraphs[i]
                elif text.startswith("Do you have an opinion on the issues raised in this article?"):
                    del article_paragraphs[i]
                elif text.startswith("As told to"):
                    del article_paragraphs[i]
                elif text.endswith("is an Observer columnist"):
                    del article_paragraphs[i]
                elif text.endswith("is a Guardian columnist"):
                    del article_paragraphs[i]
        
        joined_text = " ".join(elem.get_text(separator=" ", strip=True) for elem in article_paragraphs)
        article_data["text"] = clean_text(joined_text)
    else:
        article_data["text"] = "Article text not found"

    return article_data

scrape_articles("guardian.txt", scrape_guard_article, "commentary_scraped_articles_guard.csv")

Length:  50
                                                                                                title  \
0  A humiliation at the White House and what does it tell us? Trump would make a colony of my country   
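
The trailing-paragraph filter inside `scrape_guard_article` can be checked without any network access by isolating it as a function over plain strings. The sketch below is one possible way to do that; the name `strip_trailing_boilerplate` and the `window` parameter are hypothetical and not part of the notebook, and the matching rules are copied from the function above.

```python
def strip_trailing_boilerplate(paragraphs, author_name="", window=5):
    """Drop byline, biography and call-for-opinion lines from the last `window` paragraphs."""
    # Same prefixes and suffixes that scrape_guard_article tests for
    prefixes = (
        "Do you have an opinion on the issues raised in this article?",
        "As told to",
    )
    suffixes = ("is an Observer columnist", "is a Guardian columnist")

    # Paragraphs before the final window are always kept
    start = max(0, len(paragraphs) - window)
    kept = list(paragraphs[:start])

    for text in paragraphs[start:]:
        stripped = text.strip()
        is_author_line = bool(author_name) and stripped.startswith(author_name)
        if is_author_line or stripped.startswith(prefixes) or stripped.endswith(suffixes):
            continue  # boilerplate, skip it
        kept.append(text)
    return kept


sample = [
    "Intro.",
    "Body one.",
    "Body two.",
    "Jane Doe is a Guardian columnist",
    "Do you have an opinion on the issues raised in this article? Email us.",
]
cleaned = strip_trailing_boilerplate(sample, author_name="Jane Doe")
```

Building a new list avoids deleting from a list while iterating over it, and `str.startswith` / `str.endswith` accept a tuple, which keeps the rule set in one place.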

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

#### Wall Street Journal
All articles scraped from https://www.wsj.com/opinion

In [23]:
def scrape_wsj_article(url, cookies=None):
    """
    Scrapes an article from a given URL on wsj.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.
    cookies : dict, optional
        Cookies to pass with the request (WSJ subscription cookies to avoid paywall)

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    # Pass the optional subscription cookies along with the request
    response = requests.get(url, headers=headers, cookies=cookies)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the page"}
    
    soup = BeautifulSoup(response.text, "html.parser")
    print (soup)
    
    #get title
    title_tag = soup.find("meta", property="og:title")
    title = title_tag["content"] if title_tag and title_tag.has_attr("content") else "Title not found"
    article_data["title"] = title
    
    #get date

    
    #get site
    site_tag = soup.find("meta", property="og:site_name")
    article_data["site"] = site_tag["content"] if site_tag and site_tag.has_attr("content") else "Site not found"
    
    #get category
    
    #get article


    return article_data

urls = ["https://www.wsj.com/opinion/america-still-needs-a-covid-reckoning-institutional-failure-values-breakdown-ee6a660d"]
wsj_df = scrape_multiple_articles(urls, scrape_wsj_article)
wsj_df

<!DOCTYPE html>
<html lang="en-US"><head><meta charset="utf-8"/><link href="https://www.wsj.com/opinion/america-still-needs-a-covid-reckoning-institutional-failure-values-breakdown-ee6a660d" rel="canonical"/><title>America Still Needs a Covid Reckoning - WSJ</title><meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=2.0" name="viewport"/><meta content="Article" name="page.section"/><meta content="Article" name="page.content.type"/><meta content="responsive" name="page.content.format"/><meta content="nocache, noarchive" name="msnbot"/><meta content="WSJ Online" name="page.content.source"/><meta content="61387583289.apps.googleusercontent.com" name="google-signin-client_id"/><meta content="servo:prod:virginia:wsj-nextjs-app:article" name="servo-context"/><meta content="na,us" name="cXenseParse:wsj-pageregion"/><meta content="app" name="branch:deeplink:target"/><meta content="Opinion | America Still Needs a Covid Reckoning" name="twitter:title"/><meta co

Unnamed: 0,title,text,site,date,category,class,url
0,Opinion | America Still Needs a Covid Reckoning,,WSJ,,,Commentary,https://www.wsj.com/opinion/america-still-needs-a-covid-reckoning-institutional-failure-values-breakdown-ee6a660d
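
The `cookies` argument of `scrape_wsj_article` expects a dictionary, whereas a browser typically exposes cookies as a single `Cookie` header string. A minimal sketch of the conversion follows, assuming a well-formed `name=value; name=value` header; the helper name `cookie_header_to_dict` is hypothetical and the cookie names in the example are made up.

```python
def cookie_header_to_dict(header):
    """Convert a 'name=value; name2=value2' Cookie header string into a dict."""
    cookies = {}
    for part in header.split(";"):
        # partition splits on the first '=' only, so values may contain '='
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Example with invented cookie names
example = cookie_header_to_dict("session=abc123; token=x=y; malformed")
```

The resulting dictionary could then be passed as `scrape_wsj_article(url, cookies=example)`.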


### Combine dataframes and clean

In [17]:
def combine_csvs_to_master(file_list, master_csv="master.csv"):
    """
    Combines multiple CSV files into one master CSV file.

    Parameters:
    -----------
    file_list : list of str
        List of paths to CSV files.
    master_csv : str, optional
        The filename for the master CSV file (default is 'master.csv').

    Returns:
    --------
    None
    """
    # Read each CSV file into a DataFrame
    dfs = [pd.read_csv(file) for file in file_list]
    
    # Concatenate DataFrames; ignore_index=True already yields a fresh 0..n-1 index
    combined_df = pd.concat(dfs, ignore_index=True)
    
    # Save the combined DataFrame to a CSV file
    combined_df.to_csv(master_csv, index=False)
    
combine_csvs_to_master(all_scraped_content)

In [18]:
master_df = pd.read_csv("master.csv")
print(len(master_df))

#Check if any empty articles
empty = master_df[master_df["text"]=="Article text not found"]
print (empty)

# Basic checks
print(master_df.info())
print(master_df.head(3))

# Print out the categories
print(master_df["category"].value_counts())

# Print out the classes
print(master_df["class"].value_counts())

# Print unique websites
print("Number of unique sites:", master_df["site"].nunique())

1501
Empty DataFrame
Columns: [title, text, site, date, category, class, url]
Index: []
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1501 entries, 0 to 1500
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     1053 non-null   object
 1   text      1403 non-null   object
 2   site      1403 non-null   object
 3   date      1403 non-null   object
 4   category  1303 non-null   object
 5   class     1501 non-null   object
 6   url       1501 non-null   object
dtypes: object(7)
memory usage: 82.2+ KB
None
  title  \
0   NaN   
1   NaN   
2   NaN   

                                                                                                                                        text  \
0                                                                      Ed Perlmutter voted for Viagra for rapists paid for with tax dollars.   
1                                                        More than 3,000 homicides we
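
The `info()` output above reports 1403 non-null `text` values out of 1501 rows, yet the sentinel check prints an empty DataFrame, because comparing against `"Article text not found"` does not match NaN or blank strings. The sketch below shows one way to treat all three cases as missing; `is_missing_text` is a hypothetical helper, not part of the notebook.

```python
import math

SENTINEL = "Article text not found"


def is_missing_text(value, sentinel=SENTINEL):
    """Return True for None, NaN, blank strings, or the scraper's sentinel string."""
    if value is None:
        return True
    # pandas represents missing values in object columns as float NaN
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        stripped = value.strip()
        return stripped == "" or stripped == sentinel
    return False


checks = [is_missing_text(v) for v in (float("nan"), None, "   ", SENTINEL, "Some article text")]
```

With the notebook's DataFrame, the affected rows could be selected with `master_df[master_df["text"].apply(is_missing_text)]`.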

In [20]:
def clean_category(cat):
    """
    Convert categories to lowercase, unify synonyms, and return a single standardised category.
    Missing values (NaN) are returned unchanged.
    """
    # Missing categories are read in as float NaN, which has no .lower();
    # leave anything that is not a string untouched
    if not isinstance(cat, str):
        return cat

    # Convert to lowercase
    c = cat.lower()
    
    # Standardise categories
    replacements = {
        'politics': 'politics',
        'local news': 'local',
        'world news': 'world',
        'world': 'world',
        'worldviews':'world',
        'entertainment': 'entertainment',
        'business': 'business',
        'health': 'health',
        'lifestyle': 'lifestyle',
        'life':'lifestyle',
        'sports': 'sports',
        'sport': 'sports',
        'football': 'sports',
        'gaa': 'sports',
        'sci/tech': 'technology',
        'celebs':'entertainment',
        'tech': 'technology',
        'u.s.':'united states',
        'uplifting viral content': 'entertainment', 
    }

    # Do the changes
    if c in replacements:
        c = replacements[c]
    return c

# Apply the cleaning
master_df['category'] = master_df['category'].apply(clean_category)

# Now check the new distribution
print(master_df['category'].value_counts())